The Oxford Handbook of Networked Communication 9780190460518, 0190460512

Communication technologies, including the internet, social media, and countless online applications, create the infrastructure…


English, 676 pages, 2020


Table of contents:
Cover
The Oxford Handbook of NETWORKED COMMUNICATION
Copyright
Preface
Acknowledgements
Contents
About the contributors
INTRODUCTION
Chapter 1: Communication in the networked age
1. Introduction
2. Bridging academic silos
3. The theoretical relevance of new methods
4. Genealogy of the handbook and roadmap
References
Part I: NETWORKS AND INFORMATION FLOW
Chapter 2: Networks and information flow: The Second Golden Age
1. Small worlds
2. Big data
3. Complicated contagion
4. Big data risks
5. Conclusion
References
Chapter 3: Rebooting mass communication: Using Computational and Network Tools to Rebuild Media Theory
1. The end of mass communication?
2. In defense of mass communication
3. New approaches to media studies
4. Network science and the media system
5. A network model of agenda setting
5.1 Actors in the Network
5.2 Relationship Typology
5.2.1 Issue Adoption Ties
5.2.2 Media Use Ties
5.2.3 Interorganizational Ties
5.2.4 Social Ties
5.2.5 Concept Association Ties
5.3 Link Direction and Agency
5.4 Individual and Dyadic Attributes
5.5 Network Mechanisms
5.6 Reducing Complexity in the Network Model
6. Challenges and the road ahead
References
Chapter 4: Propagation phenomena in social media
1. Introduction
2. User influence
2.1 The Million Follower Fallacy
2.2 User Types
2.3 Trendsetters
3. Propagation patterns
3.1 Word of Mouth
3.2 Social Conventions
4. Propagation applications
4.1 Topical Experts
4.2 Information Diet
5. Conclusion
References
Chapter 5: Dynamical processes in time-varying networks
1. Introduction
2. Time-varying networks
2.1 Representations
2.2 Properties
2.2.1 Time-Respecting Path
2.2.2 Connectivity and Latency
2.2.3 Burstiness
2.2.4 Memory
2.3 Activity-Driven Networks
3. Dynamical processes on activity-driven networks
3.1 Random Walks
3.2 Epidemic Spreading
3.3 Rumor Spreading
4. Discussion
References
Chapter 6: Partition-specific network analysis of digital trace data: Research Questions and Tools
1. Introduction
2. Social network theory and social network analysis
3. Homophily, communication, and SNA
4. Beyond SNA as usual: Three research questions
5. Tools, data, and results
5.1 Partitioning the Network
5.2 How Do Different Subgraphs in a Partitioned Network Relate to One Another?
5.3 Which Nodes Are Heavily Connected to Distinct Subgraphs?
5.4 How Do Subgraphs Change over Time?
6. Conclusion
Notes
References
Part II: COMMUNICATION AND ORGANIZATIONAL DYNAMICS
Chapter 7: How can computational social science motivate the development of theories, data, and methods to advance our understanding of communication and organizational dynamics?
1. Can computational social science motivate the development of theories of communication and organizational dynamics?
2. Can computational social science motivate the development of new data collection instruments to study communication and organizational dynamics?
3. Can computational social science motivate the development of new methods to study communication and organizational dynamics?
Acknowledgements
References
Chapter 8: The new dynamics of organizational change
1. Organizational dynamics across disciplines
2. Case study: news media and new dynamics
2.1 Data on Digital Media
2.2 A Spark of New Theory
3. The "old" dynamics of organizational change
3.1 The Broad Shape of Organizational Change
3.2 Theoretical Perspectives on the Cycle of Change
3.3 The “New” Dynamics of Organizational Change
3.4 Emergence and New Disruptions
3.4.1 Existing Traditions
3.4.2 New Dynamics
3.5 Legitimacy and Stability
3.5.1 Existing Traditions
3.5.2 New Dynamics
3.6 Resiliency and Stability
3.6.1 Existing Traditions
3.6.2 New Dynamics
3.7 Failure and Decline
3.7.1 Existing Traditions
3.7.2 New Dynamics
4. A cookbook for studying new dynamics of organizational change
5. Caution in breaking new ground
6. Future research
References
Chapter 9: Online communication by emergency responders during crisis events
1. Introduction
2. Social media use during crisis events
3. Crisis communication in a networked age
4. Emergency responders in the spotlight
5. Methods for studying online communication of emergency responders
6. Emergency responders online: routine and reactions
7. Interorganizational communication on Twitter
8. Discussion and conclusion
9. Opportunities for future work
Acknowledgements
Note
References
Chapter 10: Studying populations of online communities
1. Introduction and background
2. Benefit 1: Generalizability
3. Benefit 2: studying community-level variables
4. Benefit 3: studying diffusion between communities
5. Benefit 4: studying ecological dynamics
6. Benefit 5: Studying multilevel processes
7. Limitations
8. Discussion
Acknowledgements
Notes
References
Chapter 11: Gender and networks in virtual worlds
1. Introduction
2. Studying gender in virtual worlds
2.1 Virtual Worlds Research
2.2 Mapping Gender in Virtual Worlds
2.3 Massively Multiplayer Online Games (MMOs)
3. Gender gaps in virtual worlds
3.1 Are There Gender Gaps in Virtual Worlds?
3.1.1 Underrepresentation
3.1.2 Differences in Play Styles, Motivations and Performance
3.2 Why Do Gender Gaps Occur in Virtual Worlds?
3.2.1 Gender Role Theory
3.2.2 Depersonalization
3.3 Gender Swapping and the Gender Gap
4. Current approaches to studying gender in virtual worlds
4.1 Traditional Methods
4.2 Digital Trace Data
5. Research example: gender and networks in EverQuest II
5.1 Data Quality and Management
5.2 Theory-Driven versus Data-Driven Analyses
5.3 Social Network Analysis
6. Future directions
References
Part III: INTERACTIONS AND SOCIAL CAPITAL
Chapter 12: Understanding social dynamics online: Social Networks, Social Capital, and Social Interactions
1. New (data) opportunities, old (theoretical) guidance
2. Conclusion
References
Chapter 13: The analysis of social capital in digital environments: A Social Investment Approach
1. Introduction
2. Social capital in the internet: an outcome-oriented approach
3. A missing piece: social investment patterns online
4. Social capitalization framework
5. Taxonomy of online social investment patterns
5.1 Cost of Uncertainty
5.2 Cost of Persistence
5.3 Cost of Mutuality (Reciprocity)
6. Diversity of social investment types: online networking examples
7. An empirical application: social investment in a Facebook personal network
8. Discussion and future research
References
Chapter 14: Multiplying the medium: Tie Strength, Social Role, and Mobile Media Multiplexity
1. Introduction
2. Media multiplexity and tie strength
3. Measuring tie strength
4. Social roles and institutions
5. Data and methods
5.1 Dependent Variable
5.2 Independent Variables
5.3 Respondent Demographics and Control Variables
6. Analysis and results
7. Discussion and conclusion
Notes
References
Chapter 15: Revolutionizing mental health with social media
1. Introduction
2. Social media and well-being: potential
2.1 Early Detection
2.1.1 Quantifying Individual-Centric Risk
2.1.2 Population-Scale Measurement
2.2 Psychosocial Support
2.3 Health-Related Self-Disclosure
3. Social media and well-being: challenges
3.1 Risk to Vulnerable Populations
3.2 Privacy, Ethics, and Policy
3.2.1 Privacy-Preserving, Ethical-Intervention Design
3.2.2 Securing Disclosure of Personal Information
4. Conclusion
Notes
References
Chapter 16: The neuroscience of information sharing
1. Neural base of sharing decisions: value-based virality
2. Valuation
3. Self-related processing
4. Social cognition
5. Empirical support for value-based virality
6. Outcomes of information sharing: reach and impact
6.1 Sharers
6.2 Audiences
6.3 Sharer-Audience Interactions
7. Sharing processing in individuals and across populations
8. Sharing contexts as moderators of sharing processes
8.1 Audience Characteristics
8.2 Sharer Characteristics
8.3 Content Characteristics
8.4 Communication Channel Characteristics
8.5 Culture
9. Strengths and limitations of neuroscience for the study of viral information
9.1 Measurement and Prediction
9.2 Theory Development
9.3 Limitations
10. Conclusion
Notes
References
Part IV: POLITICAL COMMUNICATION AND BEHAVIOR
Chapter 17: Political communication research in a networked world
1. Requisites and attributes of democratic engagement
2. The role of communication in democratic engagement
3. What do we know?
4. The changing nature of the information and communication environment
Note
References
Chapter 18: Modeling and measuring deliberation online
1. Introduction
2. Deliberation
3. Online deliberation
4. Deliberation in social media
4.1 Early Work: Flames and Bubbles
4.2 More Complex Measures of Environment: Network Analysis
4.3 More Complex Measures of Content: Argument Mining
4.4 Proliferating Criteria and the Core
5. Measuring argument quality
6. Measuring conceptual connections
7. Conclusion
Acknowledgments
References
Chapter 19: Moving beyond sentiment analysis: Social Media and Emotions in Political Communication
1. Introduction
2. Political communication on social media: methods and findings
2.1 Textual-Analysis-Based Approaches
2.1.1 Recognizing Political Communication
2.1.2 Sentiment Analysis of Political Text
2.1.3 Characterizing Political Text of the Masses
2.2 Nontextual Approaches to the Study of Online Political Communication
2.2.1 Survey-Based Approaches
2.2.2 Experimental Approaches
2.2.3 Network-Based Approaches
3. Moving forward: theory-driven inquiry
3.1 Theoretical Foundations of the Data Generation Process
3.1.1 The Fusion of Political Behaviors on Social Media
3.1.2 The Strategic and Instrumental Uses of Emotion
3.1.3 Emotional Interdependencies: The Spread of Emotion through Social Networks
3.2 Integrating Theory and Textual Analysis to Expand Mass Political Communication Research
3.2.1 Opinion Leaders: Using Emotion Strategically
3.2.2 The Psychology of Emotional Response
3.2.3 The Missing Piece: Emotion in Interpersonal Political Communication
4. Conclusion
Acknowledgments
Notes
References
Chapter 20: Dynamics of attention and public opinion in social media
1. Dynamics of attention in social media
1.1 Case Study: Attention Dynamics During Crisis Response
2. Opinion formation
2.1 Case Study: Attention Dynamics during Social Mobilization
3. Discussion
Acknowledgments
Notes
References
Chapter 21: A satisficing search model of text production
1. Introduction
2. Satisficing search in message design
2.1 What Is Satisficing Search?
2.2 Satisficing Search and Public Discourse
3. Theory and measurement of key concepts
3.1 Semantic Sourcing
3.1.1 Measuring Semantic Sourcing
3.2 Semantic Aspiration
4. Advancing public opinion research with satisficing semantic search
5. Conclusion
References
Chapter 22: Studying networked communication in the Middle East: Social Disrupter and Social Observatory
1. Introduction
2. Social media as social disrupter
2.1 How Have Digital Networks Changed the Middle East?
3. Social media as social observatory
4. Case 1: networks of political polarization in Egypt
4.1 Content Polarization
4.2 Structural Polarization
5. Case 2: network of social fragmentation in Qatar
5.1 Interlingual Social Networks
5.2 Political Engagement
6. Conclusion
Notes
References
Part V: MOBILITY AND SPACE
Chapter 23: Mobile space and agility as the subversive partner
References
Chapter 24: One foot on the street, one foot on the web: Analyzing the Ecosystem of Protest Movements in an Era of Pervasive Digital Communication
1. Introduction
2. Exploring the social ensemble of digital movements
3. The toolkit of digital protest researchers
References
Chapter 25: Our stage, our streets: Brooklyn Drag and the Queer Imaginary
1. Introduction
2. Context: drag in Brooklyn
3. Context: Brooklyn's queer media imaginary
4. Analysis: the love/shade relationship between Brooklyn and Manhattan drag
5. Analysis: urban exceptionalism and Brooklyn drag
6. Discussion and conclusions
6.1 Brooklyn Drag and Queer Terroir
6.2 Politics of Visibility
Notes
References
Chapter 26: Digital mapping of urban mobility patterns
1. Introduction
2. Role of geographic information system in mobility and health
2.1 Modern Geographic Information Systems
2.2 Combining Information about Space and Place
3. Location, place, and health research: examples and challenges
3.1 Children’s Exposure to Outdoor Tobacco Advertising
3.2 Measuring the Available Dose versus the Dose That Individuals Actually Experience
3.3 Children’s Exposure to Outdoor Alcohol Advertising and Alcohol Use
3.4 One Boy’s Day: A Classic among Mobility Pattern Studies
4. Examples of modern mobility pattern studies in the context of epidemiology
4.1 Prospective versus Retrospective Study Designs
4.2 Retrospective Data Collection of Human Mobility and Risk of Being Assaulted
4.3 Prospective Data Collection of Human Mobility and the Context of Teens’ Activities
5. Future directions
References
Chapter 27: Research on mobile phone data in the global south: Opportunities and Challenges
1. Introduction
2. Stable trends of mobile usage in the global south
2.1 Prepaid as the Dominant Model for Paying for Mobile Usage
2.2 Lower Costs for Calls and Devices
2.3 Mobile Payments
2.4 Increased Access to Mobile Data
2.5 Web-Page-Driven Mobile Internet
2.6 Other Forms of Subsidized Connectivity
3. Beyond connectivity: researching mobile data in the global south
3.1 Social Ties and Public Connections
3.2 Economic Transfers and Well-Being
3.3 Mobility and Location
3.4 Digital Innovation and Technology Entrepreneurship
4. Challenges to using mobile data from the global south for academic research
5. Conclusion
Notes
References
Part VI: ETHICS OF DIGITAL RESEARCH
Chapter 28: The ethics of digital research
1. What I found out from Facebook users' emails
2. Challenges that surfaced from conversations about digital research
2.1 Evolving Research Ecosystem
2.2 Heterogeneous Populations and Cultures
2.3 Informed Consent Best Practices
2.4 Understanding Risks
2.5 Privacy Practices Continue to Evolve
2.6 Technical and Legal Issues
2.7 Ongoing Challenges
References
Chapter 29: Digital trace data and social research: A Proactive Research Ethics
1. Introduction
2. Research design
3. Data collection
4. Analysis and reporting
5. Discussion
Notes
References
Chapter 30: A practitioner's guide to ethical web data collection
1. Technical aspects of data collection
1.1 Universal Best Practices
1.1.1 Anonymizing Data
1.1.2 Secure Storage
1.1.3 Backups
1.2 Obtaining Data Directly from Companies
1.3 Using Application Programming Interfaces
1.3.1 Access Model
1.3.2 Rate Limits
1.3.3 Terms of Service
1.3.4 Implementations
1.4 Scraping Web-Based Data
1.4.1 Implementation
1.4.2 Accounts
1.4.3 Rate Limits
1.4.4 Parsing
1.5 Crowdsourcing Data Collection and Analysis
1.6 Data Donations from Users
2. Legal issues
2.1 Terms of Service
2.1.1 APIs versus Websites
2.1.2 Computer Fraud and Abuse Act
2.2 Human Subjects and Institutional Review
2.2.1 Human Subject Rules Outside the United States
2.2.2 Toward Uniform Ethical Rules
2.3 Nondisclosure Agreements
2.4 The Robots Exclusion Standard
2.5 Click and Impression Fraud
3. Ethical aspects
3.1 Harm to Users and Consent
3.1.1 Harm to Users
3.1.2 Consent on the Web
3.1.3 “Public” Data and Context
3.2 Harm to Services
3.2.1 Responsible Disclosure
3.3 Reproducibility
4. Sharing data
4.1 Community Norms and Sharing Infrastructure
4.2 Limits to Sharing
5. Summary
Notes
References
Chapter 31: Responsible research on social networks: Dilemmas and Solutions
1. We are the data!
2. The Good bits
3. What could possibly go wrong?
4. Technical limitations: can we overcome these?
5. Best practice from a technical perspective
6. Best practice from a legal and societal perspective
7. What challenges and opportunities remain?
Notes
References
Chapter 32: Unintended consequences of using digital methods in difficult research environments
1. Introduction
2. Risk
3. Difficult research environments
4. Digital methods
5. Digital methods and risk
5.1 Risk by Association and Rapport
5.2 Data Security When Introducing Digital Methods: Get Informed
5.3 What Is to Be Done: In the Field
5.4 Data Security When Entering New Spaces: Identification
5.5 What Is to Be Done: With Digital Methods
6. Conclusion
References
Chapter 33: Ethical issues in internet research: The Case of China
1. Introduction
2. The internet with "Chinese characteristics"
3. Methodological approaches and ethical considerations
4. Online experiments and surveys
5. Content analysis
6. Digital ethnography
7. Discussion
Notes
References
CONCLUSION
Chapter 34: The past and future of communication research
1. Introduction
2. Revisiting the principles of networked communication
3. Theoretical developments
4. The virtues of interdisciplinary work
5. Current and future challenges
References
Index

   

The Oxford Handbook of
NETWORKED COMMUNICATION

Edited by
BROOKE FOUCAULT WELLES and SANDRA GONZÁLEZ-BAILÓN

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and certain other countries.

Published in the United States of America by Oxford University Press, Madison Avenue, New York, NY, United States of America.

© Oxford University Press

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by license, or under terms agreed with the appropriate reproduction rights organization. Inquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above. You must not circulate this work in any other form and you must impose this same condition on any acquirer.

Library of Congress Cataloging-in-Publication Data
Names: Brooke Foucault Welles and Sandra González-Bailón, editors.
Title: The Oxford handbook of networked communication / edited by Brooke Foucault Welles and Sandra González-Bailón.
Description: New York, NY : Oxford University Press, [] | Includes bibliographical references and index.
Identifiers: LCCN  | ISBN  (hardcover)
LC record available at https://lccn.loc.gov/

Printed by Sheridan Books, Inc., United States of America

P ..........................

H are networks changing how we communicate and the effects of that communication? How are digital technologies transforming a research agenda that predates the Internet? This Handbook addresses these questions through the lens of empirical research that spans multiple disciplines and methodological approaches but converges in a shared attempt to uncover the logic and consequences of communication in the digital age. We invited the contributors to this Handbook to be part of the project because their work represents the cutting edge in digital research. Communication is the home discipline for many of the authors (including the editors), but the chapters that follow also represent the work of scholars with diverse disciplinary backgrounds. The research represented here illuminates key aspects of how networks shape or enable communication, as well as some of the consequences of those networked dynamics, both for theory and practice. Together, the contributors to this Handbook form a global network that spans eleven countries, twenty-nine cities, and more than thirty institutions (see Figure ).

Figure: Global distribution of contributors to the Handbook (51 authors, 29 cities, 11 countries). Institutional locations of the authors contributing to this Handbook.




In spite of the diversity of theoretical and disciplinary backgrounds represented here, the Handbook still forms a coherent body of work, organized around thematic sections that highlight key areas of applied research. We wanted to ensure that the work discussed in the pages that follow would arise from common ground and facilitate cumulative research. This common ground takes the form of shared references to prior work from which the authors draw to take their research in different thematic directions (Figure ). The chapters and sections in this Handbook offer a shared understanding of the state of the art and, perhaps most important, the nature of pending challenges.

Figure: Cross-references across chapters. Theoretical connections across the 34 chapters and six sections of the Handbook.

In chapter  we offer more details on the motivation and genealogy behind the Handbook, and we provide an overview of the sections and chapters that organize the contents. In chapter  we revisit the main themes articulating the book through the lens of the research discussed. Each part is further introduced by experts in the corresponding domains who have decades of experience and are able to connect past and future research to contextualize the work discussed in the longer trajectory of communication research. We are confident that, taken individually or as a whole, the chapters that follow will offer an exciting entry point to the current frontiers of networked communication research.

Acknowledgements

F, we would like to thank the authors included in this Handbook for agreeing to embark in this project as early as  and for working with us patiently during the long process of bringing the Handbook to fruition. We could not have found a more enthusiastic, interesting, and informed group of colleagues to engage with the questions we deemed important for this project. We have learned much from working with each of them, and we hope we can continue the conversation in years to come. This Handbook stemmed from a preconference workshop sponsored by the National Science Foundation Digital Societies and Technologies Research Coordination Network Award. We are grateful for their support in the formative stages of this project. We would also like to thank our editors at Oxford University Press for giving us the support we needed to make this Handbook happen. In particular, we are grateful to Angela Chnapko and Alexcee Bechthold for guiding us on the long and winding road of publishing an edited volume. Thanks also to Nicole Samay for helping finalize the figures in the preface. Last, we would also like to thank the Annenberg School for Communication at the University of Pennsylvania and the College of Arts, Media, and Design and the Network Science Institute at Northeastern University for their support and for providing the intellectual environment that allows projects like this to happen.

Contents

About the contributors

INTRODUCTION . Communication in the Networked Age B F W  S Gˊ -Bˊ 



PART I. NE TWO R K S AND INFORMATION FLOW . Networks and Information Flow: The Second Golden Age D L . Rebooting Mass Communication: Using Computational and Network Tools to Rebuild Media Theory K O





. Propagation Phenomena in Social Media M C, Fˊ  B, S G,  K G



. Dynamical Processes in Time-Varying Networks B G̧  N P



. Partition-Specific Network Analysis of Digital Trace Data: Research Questions and Tools D F



x



PART II. C O M M U N I C A TI O N A N D ORGANIZATIONAL DYNAMICS . How Can Computational Social Science Motivate the Development of Theories, Data, and Methods to Advance Our Understanding of Communication and Organizational Dynamics? N C . The New Dynamics of Organizational Change M S. W . Online Communication by Emergency Responders During Crisis Events E S. S

 



. Studying Populations of Online Communities B M H  A S



. Gender and Networks in Virtual Worlds G B  C S



PART III. IN T E R AC T I ON S AN D SOCIAL CAPITAL . Understanding Social Dynamics Online: Social Networks, Social Capital, and Social Interactions N E



. The Analysis of Social Capital in Digital Environments: A Social Investment Approach K. H K



. Multiplying the Medium: Tie Strength, Social Role, and Mobile Media Multiplexity J J, J B,  T K



. Revolutionizing Mental Health with Social Media M  C



. The Neuroscience of Information Sharing C S  E B. F





xi

PART IV. P O L I T I C A L C O M M U N I C A T I O N A N D BE H A V I O R . Political Communication Research in a Networked World M X. D C



. Modeling and Measuring Deliberation Online N B



. Moving Beyond Sentiment Analysis: Social Media and Emotions in Political Communication J E. S



. Dynamics of Attention and Public Opinion in Social Media E F



. A Satisficing Search Model of Text Production D B. M



. Studying Networked Communication in the Middle East: Social Disrupter and Social Observatory J B-H, M M. H,  I W



PART V. M O BI LI TY A ND S PAC E . Mobile Space and Agility as the Subversive Partner C M . One Foot on the Streets, One Foot on the Web: Analyzing the Ecosystem of Protest Movements in an Era of Pervasive Digital Communication P G . Our Stage, Our Streets: Brooklyn Drag and the Queer Imaginary J L . Digital Mapping of Urban Mobility Patterns C N. M  D J. W





 

xii



. Research on Mobile Phone Data in the Global South: Opportunities and Challenges S A, E Q,  D H



PART VI. E T H I C S O F DI G I T A L RESEARCH . The Ethics of Digital Research J T. H . Digital Trace Data and Social Research: A Proactive Research Ethics E M-T . A Practitioner’s Guide to Ethical Web Data Collection A M  C W . Responsible Research on Social Networks: Dilemmas and Solutions J C, H H,  T H . Unintended Consequences of Using Digital Methods in Difficult Research Environments K E. P . Ethical Issues in Internet Research: The Case of China B M  M R



 



 

CONCLUSION . The Past and Future of Communication Research S Gˊ -Bˊ   B F W



Index



About the contributors

Seyram Avle is an Assistant Professor at the Department of Communication, University of Massachusetts, Amherst, MA.

Nick Beauchamp is an Assistant Professor of Political Science, a core faculty member of the NULab for Texts, Maps, and Networks, and a core faculty member of the Network Science Institute at Northeastern University, Boston, MA.

Grace Benefield is a PhD candidate in the Department of Communication at the University of California, Davis, CA.

Fabrício Benevenuto is an Associate Professor of Computer Science at the Universidade Federal de Minas Gerais in Brazil.

Jeffrey Boase is an Associate Professor in the Institute of Communication, Culture, Information and Technology and the Faculty of Information at the University of Toronto, Canada.

Javier Borge-Holthoefer is a Senior Researcher at the Internet Interdisciplinary Institute at the Universitat Oberta de Catalunya, Spain.

Meeyoung Cha is an Associate Professor at the Graduate School of Culture Technology at the Korea Advanced Institute of Science and Technology, South Korea.

Munmun de Choudhury is an Assistant Professor in the School of Interactive Computing at Georgia Tech, Atlanta, GA.

Noshir Contractor is the Jane S. & William J. White Professor of Behavioral Sciences in the McCormick School of Engineering & Applied Science, the School of Communication, and the Kellogg School of Management at Northwestern University, Evanston, IL.

Jon Crowcroft is the Marconi Professor of Communications Systems in the Computer Laboratory of the University of Cambridge and the Chair of the Program Committee at the Alan Turing Institute, London, UK.

Michael X. Delli Carpini is a Professor of Communication at the Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA. He served as Walter H. Annenberg Dean of the school from  until the end of .

Nicole Ellison is the Karl E. Weick Collegiate Professor of Information at the School of Information, University of Michigan, Ann Arbor, MI.

Emily B. Falk is an Associate Professor of Communication at the Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA. Emilio Ferrara is an Assistant Research Professor and Associate Director of Applied Data Science at the USC Department of Computer Science. Brooke Foucault Welles is an Associate Professor in the department of Communication Studies and core faculty of the Network Science Institute at Northeastern University, Boston, MA. Deen Freelon is an Associate Professor in the School of Media and Journalism at the University of North Carolina at Chapel Hill, NC. Paolo Gerbaudo is a Senior Lecturer in Digital Culture and Society and the director of the Centre for Digital Culture at King’s College London, UK. Saptarshi Ghosh Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India. Bruno Gonçalves is a Vice President in Data Science and Finance at JPMorgan Chase. Previously, he was a Data Science fellow at New York University’s Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université, France. Sandra González-Bailón is an Associate Professor at the Annenberg School for Communication, and affiliated faculty at the Warren Center for Network and Data Sciences at the University of Pennsylvania, Philadelphia, PA. Krishna Gummadi is Head of the Networked Systems Research Group at the Max Planck Institute for Software Systems (MPI-SWS), Germany. Hamed Haddadi a Senior Lecturer and the Deputy Director of Research in the Dyson School of Design Engineering at Imperial College London, UK. Jeffrey T. Hancock is founding director of the Stanford Social Media Lab and a Professor in the Department of Communication at Stanford University, Stanford, CA. Tristan Henderson is a Senior Lecturer in Computer Science at the University of St. Andrews, Scotland, UK. 
Benjamin Mako Hill is an Assistant Professor at the Department of Communication in the University of Washington, Seattle, WA. Muzammil M. Hussain is an Assistant Professor of Communication Studies, and Faculty Associate in the Institute for Social Research at the University of Michigan, Ann Arbor, MI. David Hutchful is a graduate student at the School of Information, University of Michigan, Ann Arbor, MI.

  


Jack Jamieson is a PhD candidate at the Faculty of Information, University of Toronto, Canada. Tetsuro Kobayashi is an Associate Professor at the Department of Media & Communication, City University of Hong Kong. K. Hazel Kwon is an Associate Professor at the Walter Cronkite School of Journalism and Mass Communication at Arizona State University, Tempe, AZ. David Lazer is a Distinguished Professor of Political Science and Computer and Information Science, as well as the co-director of the NULab for Texts, Maps, and Networks at Northeastern University, Boston, MA. Jessa Lingel is an Assistant Professor at the Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA. Bo Mai is a PhD candidate at the Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA. Drew B. Margolin is an Assistant Professor at the Department of Communication at Cornell University, Ithaca, NY. Carolyn Marvin is the Frances Yates Professor of Communication at the Annenberg School for Communication, University of Pennsylvania, Philadelphia, PA. Ericka Menchen-Trevino is an Assistant Professor at the School of Communication, American University, Washington, DC. Alan Mislove is an Associate Professor at the Khoury College of Computer Sciences at Northeastern University, Boston, MA. Christopher N. Morrison is an Assistant Professor of Epidemiology at the Mailman School of Public Health at Columbia University, NYC, NY. Katya Ognyanova is an Assistant Professor at the School of Communication and Information at Rutgers University, New Brunswick, NJ. Katy E. Pearce is an Associate Professor of Communication at the University of Washington, Seattle, WA. Nicola Perra is Senior Lecturer in Network Science at the Business School of the University of Greenwich, London, UK. Emmanuel Quartey is a Marketing and Communications Fellow at the Meltwater Entrepreneurial School of Technology Incubator (MINC) in Accra, Ghana.
Maria Repnikova is an Assistant Professor of Communication and Director of the Center for Global Information Studies at Georgia State University, Atlanta, GA. Christin Scholz is an Assistant Professor at the Amsterdam School for Communication Research, University of Amsterdam, The Netherlands.


  

Jaime E. Settle is an Associate Professor at the Government Department at the College of William & Mary, Williamsburg, VA. Aaron Shaw is an Assistant Professor in the Department of Communication Studies at Northwestern University, Evanston, IL. Cuihua Shen is an Associate Professor of Communication at the University of California, Davis, CA. Emma S. Spiro is an Assistant Professor at the Information School, University of Washington, Seattle, WA. Ingmar Weber is the research director of the Social Computing Group at the Qatar Computing Research Institute, Qatar. Matthew S. Weber is an Associate Professor in the Hubbard School of Journalism and Mass Communication at the University of Minnesota, Minneapolis, MN. Douglas J. Wiebe is a Professor of Biostatistics and Epidemiology at the Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA. Christo Wilson is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University, Boston, MA.

.............................................................................................................

INTRODUCTION .............................................................................................................

  ......................................................................................................................

     ......................................................................................................................

     ˊ -ˊ 

1. Introduction

.................................................................................................................................. Communication technologies, including the internet, social media, and countless online applications create the infrastructure and interface through which many of our social interactions now take place. Several decades in the making, this shift toward transmedia communication has raised new questions about how we make meaning, socialize, cultivate individual and collective identities, and delimit public and private domains in an increasingly mediated world. The widespread adoption of communication technologies has also produced, as a collateral benefit, new ways of observing the world. Many of our interactions now leave digital trails that can help us unravel the rhythms of social life and the complexities of the social worlds we inhabit. In this Handbook, we use the term ‘networked communication’ to refer to those transformations but also, and mostly, to the new possibilities in the study of communication that new technologies have enabled. Networked communication represents a new direction in a research agenda that centers on the complexity, interconnectedness, and dynamism of communication practices. It is an approach that is fundamentally concerned with the ways that data—in particular, digital trace data—can illuminate new and old questions about the nature, processes, and outcomes of human communication. It is also an approach that advocates for interdisciplinary research and that draws from the theoretical principles and analytical tools of network science and computational social science. At its heart, networked communication is founded on the belief that we cannot advance our theories of communication, its mechanisms, and effects, without embracing the insights produced by adjacent disciplines that are also immersed in the analysis of similar processes and data.



     ˊ-ˊ

. B A S

.................................................................................................................................. Networked communication builds on a body of literature calling for scholars to rethink the traditional siloes of communication scholarship—in particular mass communication and interpersonal communication. As far back as the 1980s, Reardon and Rogers (1988) noted the “false dichotomy” between mass and interpersonal communication. They argued that the distinction is largely one of institutional convenience and convention, and that the dichotomy hampers a comprehensive theorizing about the complex and rich processes of human communication. Updating this thinking for the internet age, O’Sullivan and Carr (2018) introduced masspersonal communication as a useful framework for theorizing about how technology often simultaneously facilitates mass and interpersonal communication processes. Here we argue that networked communication usefully extends this idea to enable theorizing about interpersonal communication, mass communication, and the communication processes that emerge in between: from the dyadic and the group levels, to the level of collective dynamics and aggregated trends. The data and measurement instruments that digital platforms have made available facilitate the integration of those levels of analysis. This shift toward data-driven communication research builds on nearly a decade of scholarship developed under the banner of ‘computational social science’. In their 2009 call to action, Lazer et al. urged social scientists to embrace large-scale trace data in their research practices. Although their intended audience was broad, many of the examples the authors used were germane to the discipline of communication.
Among other research opportunities, the call emphasized the transformational potential of studying organizational email networks, interpersonal cellular phone call networks, discourse spreading through social media and blogs, and interaction within virtual worlds, to name just a few. In the intervening years, a large body of work dedicated to data-driven communication research has emerged, including dozens of pieces by this volume’s editors and contributors. Yet as a subdiscipline, the scope and focus of computational communication research is still unsettled. This may be because existing work largely focuses on methods, showcasing techniques and/or training researchers to handle large data sets and computational analytics. For instance, in 2014 Malcolm Parks edited an influential issue of the Journal of Communication that focused on big data (Parks, 2014); Communication Methods and Measures has issued several awards for papers describing, managing, and analyzing network data (e.g., Foucault Welles, Vashevko, Bennett, & Contractor, 2014; Shumate & Palazzolo, 2010); and the first professional organization dedicated to computational communication research focuses, mostly, on methods (Computational Methods Interest Group, International Communication Association).





. T T R  N M

.................................................................................................................................. In many ways, this focus on methods makes sense: researchers need instruction on technical skills in order to do computational work capably and correctly. Yet focusing too heavily on methods risks underselling the value of computational research. Within the discipline of communication, computational social scientists are all too easily cast only as methodologists—the technical whiz kids pushing the boundaries of quantitative research with little regard for whether or how those methods map onto theories of human communication. This, of course, obscures the transformative potential of data-driven research and misses the essence of Lazer et al.’s (2009) argument. The central value of data-driven research is not data qua data, but rather data used to advance theoretical understandings. As Tufekci (2014) compellingly argues, networked communication technologies have changed the scope, scale, and speed by which researchers can collect data, overhauling not only how we answer questions, but also how we decide which questions to ask. Networked communication research is centrally concerned with issues of interconnectivity, complexity, and dynamic processes that simultaneously feel fresh and speak to core theoretical issues such as persuasion, cultivation, gatekeeping, and mobilization. To that end, readers will note that this volume, although it includes discussions of data and method, does not focus on those things. There are excellent readers and handbooks available for those who are looking for detailed explanations of how to conduct computational communication research (in addition to those mentioned above, we recommend Salganik’s [2018] Bit by Bit: Social Research in the Digital Age; Hargittai and Sandvig’s [2015] Digital Research Confidential; and Ackland’s [2013] Web Social Science).
This Handbook has a different focus: it is organized thematically, highlighting how new data sources help us tackle the core theoretical questions of our discipline.

4. Genealogy of the Handbook and Roadmap

.................................................................................................................................. This Handbook is the culmination of a project that started in 2014 during the International Communication Association conference held in Seattle, USA. During that meeting, we organized a panel titled “Emerging Research Agendas at the Intersection of Communication and Computational Social Science.” The discussion among panelists was the beginning of a conversation that we continued over the years. This conversation helped us identify a set of core questions in the collective research agenda



     ˊ-ˊ

as well as the scholars who are trying to solve those questions. Many of the participants of that first panel and the panels that followed in subsequent years are contributors to this volume—they are also active contributors to the ongoing discussion on how to best harness technological and computational developments to advance communication theory. We also invited colleagues from fields other than communication to share the insights gained around similar research questions, but considered from the vantage point of different disciplinary traditions. As you read on, you will discover a breadth of disciplinary backgrounds and approaches. Such an emphasis on interdisciplinary research in an edited volume like this creates the risk that chapters will be internally coherent but fail to speak to one another. To mitigate this risk, we have divided the Handbook into six parts. Each of these parts aims to identify the boundaries of a well-defined research domain and provide guidance on how research is being developed in that domain using digital data and tools. The selection of which domains to include or exclude was not straightforward: networked communication pervades nearly all of communication’s subdisciplines, and certainly more than could reasonably be covered in a single Handbook. We gave preference to domains and themes that embraced computational and network science methods and that feature prominently in the adjacent fields that produce the more technical outputs in those areas, such as computer science, information science, and physics. This allowed us to include relevant work from those disciplines while still speaking to core issues of interest to communication scholars. Part I of this Handbook focuses on information networks. This is an obvious topic to include in a discussion of networked communication since it tackles the essential question of how information moves between people.
As Lazer notes in his introduction to this section, the question dates back at least half a century to Katz and Lazarsfeld’s (1955) work on personal influence. This work experienced a renaissance following Watts and Strogatz’s (1998) and Barabási and Albert’s (1999) twin discoveries of the network structures that enable information to flow more efficiently in large networks. The chapters included in this part showcase how data can best be organized, theorized, and analyzed to reveal the intricacies of information diffusion, including innovations in how we understand the temporal and heterogeneous nature of networks and the dynamics they facilitate at the dyadic, group, and system levels. Part II focuses on organizational dynamics and it draws on a long history of research on networks in organizations, as Contractor’s introduction to the section highlights. The chapters in this part explore how digital trace data can be used to enhance the study of groups, teams, and communities. Perhaps more so than any other subdiscipline in the field of communication, network methods and digital trace data have made tractable organizational research that, until as recently as a decade ago, was nearly impossible to conduct. As the chapters in this part illustrate, it is now possible to study organizations across individual and collective levels of analysis, over time, and on a massive scale. Innovations introduced in these chapters speak to the power of networked communication to plug existing holes in organizational communication theory and exploit the new research possibilities.





Part III, on interactions and social capital, revisits theoretical concepts that felt relatively “settled” in the pre-internet age but have been rapidly and radically transformed by online and mobile communication. As Ellison notes in her introduction, most theories of interpersonal communication and social support were developed under dramatically different—and for the most part simpler—media conditions. The chapters that form this section explore how online data complicate and change how we understand interpersonal communication and how these changes affect us on deeply personal levels. Unlike other approaches to mediated interpersonal communication, these chapters remind us that the online and offline implications of networked communication cannot be easily separated. Shifting focus from individuals to societies, part IV examines how digital trace data enhance the study of political communication and dynamics of opinion formation. This part is perhaps the most thematically focused in the Handbook: as Delli Carpini’s introduction suggests, political communication research has a long tradition of bringing methodological developments, including improved measurement instruments, to the advancement of theories about deliberation and democratic participation. The chapters in this part focus on how people develop, refine, and disseminate political opinions. Although all speak to long-standing theoretical questions in political communication (deliberation, emotion in discourse, opinion formation, etc.), they also grapple with how the unique affordances of new media communication amplify, change, and complement more traditional modes of political engagement. At first glance, the topic of part V, mobility and space, may seem an odd choice for a volume focused on research that uses digital trace data. 
However, the chapters in this part speak to one of the critical misunderstandings about networked communication research: as communication moves online, physical space still matters in defining many of the data points we have available for analysis; likewise, our representations of space can be vastly improved by using digital tools to add layers of social information to cartographies that would otherwise be less informative. Indeed, as several chapters in this section point out, sometimes space is rendered more visible by the technologies people use to communicate. In her introduction to this section, Marvin explains how online communication increases the distance over which we can easily transmit information while, simultaneously, rendering that communication more visible, traceable, and permanent. The chapters in this part explore the opportunities and challenges inherent in that duality, including how digital data signal things, intentionally or unintentionally, about people, where they go, and what they do. The ramifications of this topic lead nicely into the last section of the Handbook, part VI, which is devoted to ethics and is therefore more applied than the other sections. We would be remiss not to include a section that explicitly addresses the ethical implications of using digital trace data in communication research. As Hancock highlights in his introduction, issues of ethics in networked research go well beyond how best to adhere to the principles of the Belmont Report (a question that itself is by no means resolved); they also include the questions of how to best engage with data, corporations, journalists, and the broader base of (nonparticipant) users of the platforms we study.



     ˊ-ˊ

The chapters in this part offer guidance on how to protect the privacy of subjects and abide by ethical considerations against an ever-shifting boundary that researchers must constantly redraw to keep up with technological advancements. Beyond the diversity of theoretical perspectives covered in these six parts, the most outstanding feature of this volume is that it offers a truly interdisciplinary integration of knowledge, spanning several disciplines that share the common goal of uncovering the hidden logic of communication phenomena in digital environments. The chapters in this volume are united in that all feature innovative research harnessing the new data resources made available by online technologies and also offer a comprehensive view of current theoretical discussions, with suggestions on how to take the next steps in a quickly evolving research environment. They do not, however, reflect disciplinary coherence or common methodological or theoretical approaches. This is by design. Using digital data to tackle interesting theoretical questions requires partnerships across disciplinary boundaries that—although on the rise—are still uncommon. In this volume we set the foundations for a common language that transcends the boundaries separating fields of research, and we emphasize the value of cross-pollination and multidisciplinary collaboration. Social scientists, computer scientists, physicists, and other members of the fast-growing field of computational social science have never been closer to their goal of trying to understand communication dynamics, but there are not many venues in which they can engage in an open exchange of methods and theoretical insights. This volume deliberately creates a platform that integrates the knowledge produced in different academic silos so that we can address the big puzzles that beat at the heart of communication as a field.
In addition to this diversity in disciplinary backgrounds, the authors featured in this volume are diverse along a number of other dimensions, including gender, race, ethnicity, and country of origin. As we settle the foundations for a novel (often highly technical) subdiscipline in communication, we are mindful of the disparities in representation that restrict the growth of many science, technology, engineering, and math (STEM) fields. We deliberately tried not to replicate that lack of diversity here. It should go without saying that research on networked communication benefits from being inclusive of many voices, and we trust that readers of this volume will quickly be disabused of any preconceived notions about who can or should perform this type of research. The only entry condition is the willingness to learn the tools that are necessary to analyze data. As we enter a new era of networked communication research, we hope this volume inspires readers to think critically about how digital data and computational methods will transform their own research practice, as well as the discipline of communication more broadly. Collectively, the chapters that follow define the boundaries of an emergent research domain that is already attracting much attention from researchers, students, and funding bodies. Moving beyond methods instruction, our contributors provide a canon of how research can and should be conducted in the digital era, offering a comprehensive view of current theoretical discussions and suggestions on how to take the next steps in a research environment that is quickly evolving—and that is also, for that reason, increasingly fascinating.





R Ackland, R. (). Web social science: Concepts, data and tools for social scientists in the digital age. Sage, London. Barabási, A. L., & Albert, R. (). Emergence of scaling in random networks. Science, (), –. Foucault Welles, B., Vashevko, A., Bennett, N., & Contractor, N. (). Dynamic models of communication in an online friendship network. Communication Methods and Measures, (), –. Hargittai, E., & Sandvig, C. (Eds.). (). Digital research confidential: The secrets of studying behavior online. MIT Press, Cambridge, MA. Katz, E., & Lazarsfeld, P. F. (). Personal influence, the part played by people in the flow of mass communications. Transaction Publishers, London. Lazer, D., Pentland, A. S., Adamic, L., Aral, S., Barabási, A. L., Brewer, D., . . . Jebara, T. (). Life in the network: The coming age of computational social science. Science, (), . O’Sullivan, P. B., & Carr, C. T. (). Masspersonal communication: A model bridging the mass-interpersonal divide. New Media & Society, . Parks, M. R. (). Big data in communication research: Its contents and discontents. Journal of Communication, (), –. Reardon, K. K., & Rogers, E. M. (). Interpersonal versus mass media communication: A false dichotomy. Human Communication Research, (), –. Salganik, M. J. (). Bit by Bit: Social Research in the Digital Age. Princeton, NJ: Princeton University Press. Shumate, M., & Palazzolo, E. T. (). Exponential random graph (p*) models as a method for social network analysis in communication research. Communication Methods and Measures, (), –. Tufekci, Z. (). Engineering the public: Big data, surveillance and computational politics. First Monday, (). http://firstmonday.org/ojs/index.php/fm/article/view// Watts, D. J., & Strogatz, S. H. (). Collective dynamics of “small-world” networks. Nature, (), –.

  .............................................................................................................

NETWORKS AND INFORMATION FLOW .............................................................................................................

  ......................................................................................................................

NETWORKS AND INFORMATION FLOW: The Second Golden Age ......................................................................................................................

DAVID LAZER

T generation after World War II witnessed the first golden age of understanding networks and information flow, exemplified by the “Columbia” studies (Lazarsfeld, Berelson, and Gaudet ; Katz ). This golden age was driven in part by an increased concern about propaganda and mass media and was enabled by scientific revolutions based on a combination of conceptual breakthroughs and measurement and statistical techniques in evidence throughout the social sciences. Out of this period came a set of understandings of information flow—for example, around the interplay of mass media and social ties (Katz )—that dominated our understanding of the phenomenon for half a century. Today we are in the midst of a second golden age. We are blessed with absurd amounts of data with detailed markers of time and content. We have computational power and tools at our disposal that would appear to our antecedent selves from just a few decades ago to be lifted from a science fiction novel. Yet we are also cursed with an informational ecosystem that seems to be vastly more complicated than that of just a few decades ago. The powerful but simple models of information flow that emerged in the post–World War II years—broadcast, viral, two-step models of spread—which seemed to capture the essence of so much, now seem merely quaint. How can we even conceive of “broadcast” in a world where a celebrity can become president in part by posting content on Twitter, which then drives media coverage, which in turn propagates widely on algorithmically curated social media? What value do static models of network diffusion offer to understanding data when ties are episodic and attention fleeting? The chapters in this section are examples of the major scientific advances that are occurring.




. S W

.................................................................................................................................. The starting point for the first golden age was studies by Lazarsfeld and colleagues (Katz and Lazarsfeld 1955; Lazarsfeld, Berelson, and Gaudet 1944) of information flow about politics. The focus of these studies was on where people found out about events of the day, demonstrating the key role of opinion leaders who attended to mass media and then disseminated information more broadly to the public. The starting point for the second golden age is the Watts and Strogatz (1998) paper on small worlds. The small world problem was originally developed (or at least popularized) by Milgram and collaborators (e.g., Milgram 1967). A number of individuals (located in the US Midwest) were asked to get a message to a target (located in the Boston area). Famously, for completed chains, on average it took only about six hops to get from the source to the target. This was a surprising, even shocking, finding. Given that the vast majority of people’s ties were local, how could a message make its way to a target so far away, so fast? Watts and Strogatz offered an elegant answer to this puzzle: a few long-range ties between nodes that otherwise only had local ties could radically reduce the distance between nodes even in very large networks. This finding was clearly important in its own right, but the Watts and Strogatz paper was also a marker in other ways. It marked the beginning of an explosion of scholarship on large-scale, whole network data that has occurred across the academy, from ecological networks (Dunne, Williams, and Martinez 2002) to neural networks (Bassett and Bullmore 2006) and, most relevant to this book, to social networks.
This boom pulled in talent from across the academy into an emerging field of network science and, relevant here, brought diverse disciplines—especially physics and computer science—into the study of human network data. The work by Watts and Strogatz, Barabási and Albert (1999), Newman (2003), and others has emerged at the very core of social network research (Lazer, Mergel, and Friedman 2009). The chapter by Gonçalves and Perra (two physicists) is exemplary in this regard for its examination of how temporal resolution in social network data can provide a dramatically different picture of spreading processes than a static depiction does. The metaphor for social networks shifts from water pipes, with their rigid structure that directs flow, to billiards, where precise timing is quite consequential at both individual and collective levels. Concomitant with this approach was the beginning of the use of big data in studying human behavior.
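The Watts and Strogatz mechanism is easy to verify numerically. The sketch below is a minimal plain-Python illustration, not code from any chapter in this volume; the network size and number of shortcuts are arbitrary choices made for the example. It builds a ring lattice in which every tie is local, then adds a handful of random long-range edges, and measures the average number of hops between nodes before and after.

```python
import random
from collections import deque

def ring_lattice(n, k):
    """Ring of n nodes, each tied to its k nearest neighbors (k/2 per side)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    return adj

def add_shortcuts(adj, m, seed=1):
    """Add m random long-range edges -- the 'shortcuts' of the model."""
    rng = random.Random(seed)
    nodes = list(adj)
    added = 0
    while added < m:
        a, b = rng.sample(nodes, 2)
        if b not in adj[a]:
            adj[a].add(b)
            adj[b].add(a)
            added += 1
    return adj

def mean_distance(adj, n_sources=50, seed=1):
    """Average BFS distance from a sample of source nodes to all others."""
    rng = random.Random(seed)
    total, pairs = 0, 0
    for s in rng.sample(list(adj), n_sources):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

n, k = 1000, 10
lattice_len = mean_distance(ring_lattice(n, k))
rewired_len = mean_distance(add_shortcuts(ring_lattice(n, k), m=50))
print(f"ring lattice: {lattice_len:.1f} hops; with 50 shortcuts: {rewired_len:.1f} hops")
```

Adding fifty shortcuts to a lattice of five thousand edges rewires only one percent of ties, yet the average distance between nodes drops severalfold, which is the heart of the small-world result.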

. B D

.................................................................................................................................. If network science is the conceptual spark for the second golden age of the study of networks and information flow, “big data” is the kindling. By big data, I mean data that are large and complex (Lazer and Radford 2017; Lazer et al. 2009). Historically, the




complexity of whole network data meant that they had to be kept to very small numbers of individuals to be studied. Larger scale information likely existed (e.g., within Ma Bell), but compilation and analysis of those data would have been prohibitively costly. Thus, to the extent that large networks were studied before Watts and Strogatz, they were usually studied using a representative sample of individuals who were queried about their relationships (“egocentric” methods) (e.g., Huckfeldt and Sprague 1995). The last generation has witnessed the disappearance of these computational constraints (although scalability is certainly still an issue with certain statistical approaches). As a point of comparison, the bible of social network research, written a few years before Watts and Strogatz’s work by Wasserman and Faust (1994), did not explore any data sets in which the number of nodes was greater than one hundred or with more than one point in time captured; it would be a rare paper today that focused on a snapshot of a network that small in scale. Big data offer the opportunity to develop a society-scaled social science, through examination of a census of certain behaviors. There are significant opportunities in terms of using interventions and natural experiments to understand information spread with robust inferences of causation (Aral and Nicolaides 2017; Bond et al. 2012). There are also opportunities to use large-scale data to identify a small number of really interesting cases (Welles 2014). Big data have allowed a re-examination of foundational ideas in network research, especially those based on the notion of large-scale networks. Granovetter’s strength of weak ties hypothesis is a case in point. Granovetter (1973) is the most cited paper within the network canon (in part because of its great relevance to large-scale networks).
Granovetter’s core argument is that people are more likely to find out about jobs from weak ties than strong, because an individual’s strong ties are likely to know each other and thus are unlikely to provide novel information. While the concept evoked information spread in a global network, the original paper examined egocentric data collection from a few hundred people in the Boston area. Examination of the same idea in large, whole network data has supplied, at best, mixed support for the hypothesis. For example, Onnela et al. (2007) simulate information flow in society-scale network data (from cell phone call logs), finding an inverted-U shape in the importance of tie strength in information diffusion. Consistent with Granovetter, strong ties are too embedded to be an important conduit of novel information; however, very weak ties are too weak to carry much information. Even more on point is the study by Gee et al. (2017), which examines Facebook data from fifty-five countries to determine the role of tie strength in employment migration (moving to a job in an organization where a friend is employed). In every single country, a strong tie is more important than a weak tie in an employment shift; however, weak ties are more important in aggregate because there are many more weak ties than strong ones. This is a result that suggests a major reinterpretation of the original study: weak ties are not strong, they are plentiful. These papers highlight the “new wine in old bottles” potential of big data.
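The arithmetic behind that reinterpretation is worth making explicit. In the toy calculation below, the per-tie probabilities and tie counts are invented for illustration (they are not figures from Gee et al. or any other study): each strong tie is individually more useful, yet weak ties dominate in aggregate simply because there are far more of them.

```python
# Hypothetical numbers, chosen only to illustrate the logic.
p_strong, p_weak = 0.10, 0.02   # chance that any one tie yields a job lead
n_strong, n_weak = 10, 200      # how many ties of each kind a person has

expected_strong = p_strong * n_strong   # expected leads via strong ties
expected_weak = p_weak * n_weak         # expected leads via weak ties

print(f"per tie, a strong tie is {p_strong / p_weak:.0f}x more useful")
print(f"in aggregate: {expected_strong:.1f} strong-tie leads vs {expected_weak:.1f} weak-tie leads")
```

With these invented numbers a strong tie is five times more effective than a weak one, yet weak ties deliver four times as many expected leads overall: not strong, but plentiful.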

Perhaps even more interesting, however, is the potential big data offer for new wine in new bottles. It is difficult to build theory around phenomena that cannot be observed. The potential of observing the heretofore unobservable necessarily enables the development of new theories. The examination of dynamics in networks has been a locus of particular innovation. The dominant approach to the study of social networks a generation ago was structural (Wellman and Berkowitz ), where the structural metaphor was that a social network changes very slowly (like pipes), is largely exogenous to individual choice, and in any given moment creates opportunities and imposes constraints. Even the standard visual image of a network conveys a static notion of the network in this regard: someone who connects others who are otherwise disconnected might be considered highly central. That structural notion has gradually been challenged, first by the notion that network structures change over time (e.g., Doreian and Stokman ; Stokman and Doreian ), and later by the idea that many networks should be reconceived as intrinsically a concatenation of dyadic events (like billiard balls colliding). That is, an evolutionary notion of a network is that edges have start and end times (A and B became romantic partners on date  and ceased on date ). A time-varying view fits a network in which the edges routinely turn off and on (A and B were proximate for time unit starting at time stamp). Thus, for example, Moody () illustrates how seemingly central actors might actually be quite late in terms of when they receive information, depending on the timing of ties within a network. Similarly, Morris, Goodreau, and Moody () demonstrate how concurrency in sexual relations has dramatic implications for the spread of sexually transmitted diseases in what otherwise would appear to be identical (static) network structures.
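Moody's point about timing can be made concrete with a minimal sketch (the contact events below are made up). Whether information can flow from one node to another depends on time-respecting paths, where each contact can only pass along information a node already holds, and not on the static structure alone.

```python
# Minimal sketch of time-respecting reachability in a time-varying network.
# Edges are (u, v, t) contact events; a single pass in time order suffices
# because a contact can only relay information received at or before time t.

def reachable(events, source, t0=0):
    """Nodes reachable from `source` via time-respecting paths."""
    arrival = {source: t0}  # earliest time each node holds the information
    for u, v, t in sorted(events, key=lambda e: e[2]):
        for a, b in ((u, v), (v, u)):  # contacts are undirected
            if a in arrival and arrival[a] <= t and t < arrival.get(b, float("inf")):
                arrival[b] = t
    return set(arrival)

# Identical static structure (A-B, B-C), different timings:
early = [("A", "B", 1), ("B", "C", 2)]  # news can travel A -> B -> C
late  = [("B", "C", 1), ("A", "B", 2)]  # B-C contact fires before B has the news
assert reachable(early, "A") == {"A", "B", "C"}
assert reachable(late, "A") == {"A", "B"}
```

In the second event list, C is adjacent to B in the static graph yet never receives the information, which is exactly how a seemingly central actor can end up "late."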
The Gonçalves and Perra chapter offers a thorough examination of spreading processes within time-varying networks. The current consensus in the literature is that some networks are structural in the old sense—a slowly changing matrix of relationships that offer and obstruct opportunities for individuals—and some networks are structural in a new sense, in which that structure encompasses time and place. A second area of conceptual opportunity that has been explored is tracking the spread of actual content. The idea that information spreads through networks has a very long history and is reflected in a huge body of research. However, very little of that research actually tracks the spread of information in whole networks at the node and edge level. This again reflects the practical challenge in collecting data along these lines, which would need to be not only dynamic (an edge occurred at this time) but also "colored" with content: this content was conveyed from A to B at a particular moment in time (Krafft et al. ). This too has changed, because of the potential availability of data that contain relational, content, and temporal information (email, social media), and because of the emergence of computational tools to automatically classify content (especially text). The chapter by Cha et al. is exemplary in this regard, examining the spread of particular content within social media.

. C C

..................................................................................................................................
Ironically, while big data have dramatically extended the reach of scholarship in the study of networks and information flow, the reality of information diffusion may still be outstripping our capacity to study it (Goel et al. ). The line between viral and broadcast media is increasingly being blurred. For example, a few individuals can command vast attention mediated only by sociotechnical systems such as Twitter. A peer-to-peer network with some very high-degree nodes may intertwine elements of broadcast and viral dynamics. Consider, for example, the author's experience with what seemed to be a mild earthquake in Boston. Going on Twitter, he saw another individual posting the question "Was that an earthquake?" and replied, "Yes, I think it was"; turning on the television, he saw two newspeople evaluating the scope of the earthquake by reading off tweets from their computer. It is difficult to fit this into any current model of information diffusion in networks. Furthermore, even the units of analysis are dissolving. What had been monolithic media entities are being disaggregated in multiple ways. Content from a given organization is delivered through multiple channels and media. CNN delivers text and video via its website, through Twitter and Facebook, and with instant alerts through dedicated apps on mobile phones. Its reporters maintain independent brands, spread content through social media, and publicly banter with reporters from other organizations. Algorithms anticipate what will spread, highlighting content predicted to hold the target's attention (Bakshy, Messing, and Adamic ; Lazer ). Simple models such as broadcast, peer to peer, and two-step spreading seem woefully inadequate to the realities of today's media/social media environment.
The Ognyanova chapter offers guidance on how to navigate and build an understanding of this complexity.
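One way the broadcast/viral distinction drawn above has been quantified is Goel et al.'s structural virality: the mean pairwise distance between nodes in a diffusion tree. A pure broadcast (a star) scores near the minimum, while a long peer-to-peer relay chain scores high. The sketch below is a small illustrative implementation on toy trees, not real diffusion data.

```python
# Structural virality of a diffusion tree: the mean shortest-path distance
# over all pairs of nodes in the (undirected) tree. Low values indicate
# broadcast-like diffusion; high values indicate viral, multi-hop diffusion.
from itertools import combinations

def structural_virality(tree):
    """`tree` maps each parent to its children; returns mean pairwise distance."""
    nodes = set(tree) | {c for kids in tree.values() for c in kids}
    adj = {n: set() for n in nodes}
    for parent, kids in tree.items():
        for child in kids:
            adj[parent].add(child)
            adj[child].add(parent)

    def bfs(src):  # distances from src in the undirected tree
        dist, frontier = {src: 0}, [src]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        nxt.append(v)
            frontier = nxt
        return dist

    pairs = list(combinations(nodes, 2))
    return sum(bfs(u)[v] for u, v in pairs) / len(pairs)

broadcast = {"root": ["a", "b", "c", "d"]}                   # one hub reaches all
chain = {"root": ["a"], "a": ["b"], "b": ["c"], "c": ["d"]}  # hop-by-hop relay
assert structural_virality(chain) > structural_virality(broadcast)
```

With five nodes each, the star averages 1.6 hops between pairs while the chain averages 2.0, and the gap widens rapidly as trees grow.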

. B D R

..................................................................................................................................
These complexities, in turn, point to the challenges of designing big data–based research to study the media environment, as the Freelon chapter highlights in its examination of community detection as applied to a sample of Twitter data. Many big data are simply the digital refuse of modern sociotechnical systems. The mapping between behavior and relevant social science constructs may therefore be problematic (Margolin et al. ; Lazer and Radford ). If the researcher is focused on "emotional support from this person" as a relational construct, can "Facebook friendship" or "regularly texts on cell phone" be used as a proxy? The answer is that there is certainly a connection (e.g., Eagle, Pentland, and Lazer ), but it is likely that connection is noisy, varying substantially with subculture and over time. It is also likely that behavior-based proxies will at times be vastly better than any survey-based constructs,
because they avoid bias in responses and can be much more context sensitive. Another issue is that the relevant behaviors might not respect the existing big data silos. For example, existing research on cell phone data is typically from a single carrier within a country; furthermore, certain behaviors (like "talking") will occur face to face; across different cell and landline carriers; within Google, Skype, or Facebook; and so forth. Each of these sources may offer enormous data, yet provide a highly distorted sample of the construct of scientific interest. More generally, big data are strewn with artifacts, in part because the platforms from which they are harvested are constantly evolving—both in the technology and in the norms around use. For example, Google Flu Trends was an effort to track flu prevalence based on flu-related queries; it started to dramatically overstate flu prevalence after a few years, likely in part because Google had improved the functionality of health-related queries. (Lazer et al.  described this as "blue team" dynamics: changes driven by the operators of the platform itself.) There are thus generalizability issues from platforms: across platforms, from platforms to behavior more generally, and within platforms over time (Tufekci ). Further, many platforms are highly vulnerable to manipulation—for example, with armies of bots for hire readied to push a message ("red team dynamics" in Lazer et al. ; i.e., driven by those who seek to manipulate the system).

5. Conclusion

..................................................................................................................................
Our understanding of networks and information flow is undergoing a second revolution. The first revolution emerged from the fertile period of behavioral research after World War II. The second revolution started about twenty years ago and is the result of the intertwined rise of network science and large-scale data about society. Out of this generation of research has emerged a dramatic shift in our understanding of information spread in networks, for example, around network dynamics and content of spread. The chapters in this volume provide a high-definition picture of the state of the second revolution. The research on network dynamics, as represented by Gonçalves and Perra in their chapter, is likely the most mature of these domains and informs many of the other emerging applications, which involve dynamic data. There is still much to be done even in this territory, in terms of statistical inferential methods to capture interpretable temporal features in observational data, or in terms of higher-order dependencies, such as sequence (A talks to C after talking to B; A only talks to B in the presence of C, D, and E). Ognyanova and Freelon offer a down payment on how computational tools might transform the field of communication; the challenge for the field now is that the phenomenon itself is changing faster than the tools can keep up with. The Cha chapter, in turn, both illuminates the opportunity to utilize novel social media data to study one of the core concepts in networked communication—that of diffusion—and the challenge of building a robust science on
such a dynamic foundation. The first revolution of networked communication lasted about fifty years; arguably, this second revolution has equal or greater potential, having not yet neared a paradigmatic plateau after twenty years.

R Aral, Sinan, and Christos Nicolaides. . “Exercise Contagion in a Global Social Network.” Nature Communications  (April): . Bakshy, Eytan, Solomon Messing, and Lada A. Adamic. . “Exposure to Ideologically Diverse News and Opinion on Facebook.” Science  (): –. Barabási, A. L., and R. Albert. . “Emergence of Scaling in Random Networks.” Science  (): –. Bassett, Danielle Smith, and Ed Bullmore. . “Small-World Brain Networks.” The Neuroscientist: A Review Journal Bringing Neurobiology, Neurology and Psychiatry  (): –. Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler. . “A -Million-Person Experiment in Social Influence and Political Mobilization.” Nature  (): –. Doreian, Patrick, and Frans N. Stokman. . Evolution of Social Networks. London: Routledge. Dunne, Jennifer A., Richard J. Williams, and Neo D. Martinez. . “Network Structure and Biodiversity Loss in Food Webs: Robustness Increases with Connectance.” Ecology Letters  (): –. Eagle, Nathan, Alex Sandy Pentland, and David Lazer. . “Inferring Friendship Network Structure by Using Mobile Phone Data.” Proceedings of the National Academy of Sciences of the United States of America  (): –. Gee, Laura K., Jason J. Jones, Christopher J. Fariss, Moira Burke, and James H. Fowler. . “The Paradox of Weak Ties in  Countries.” Journal of Economic Behavior & Organization : –. Goel, Sharad, Ashton Anderson, Jake Hofman, and Duncan J. Watts. . “The Structural Virality of Online Diffusion.” Management Science  (): –. Granovetter, Mark S. . “The Strength of Weak Ties.” The American Journal of Sociology  (): –. Huckfeldt, Robert, and John Sprague. . Citizens, Politics and Social Communication: Information and Influence in an Election Campaign. New York: Cambridge University Press. 
Katz, Elihu. . “The Two-Step Flow of Communication: An Up-to-Date Report on an Hypothesis.” Public Opinion Quarterly  (): –. Katz, Elihu, and Paul Felix Lazarsfeld. . Personal Influence: The Part Played by People in the Flow of Mass Communications. Transaction Publishers. Krafft, Peter, Juston Moore, Bruce Desmarais, and Hanna M. Wallach. . “TopicPartitioned Multinetwork Embeddings.” In Advances in Neural Information Processing Systems , edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, –. Curran Associates, Inc. Lazarsfeld, Paul Felix, Bernard Berelson, and Hazel Gaudet. . The People’s Choice: How the Voter Makes up His Mind in a Presidential Campaign. New York: Columbia University Press.
Lazer, David. . “The Rise of the Social Algorithm.” Science  (): –. Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. . “The Parable of Google Flu: Traps in Big Data Analysis.” Science  (): –. Lazer, David, Ines Mergel, and Allan Friedman. . “Co-Citation of Prominent Social Network Articles in Sociology Journals: The Evolving Canon.” Connections  (): –. Lazer, David, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabási, Devon Brewer, Nicholas Christakis, et al. . “Social Science: Computational Social Science.” Science  (): –. Lazer, David, and Jason Radford. . Lazer, David, and Jason Radford. “Data ex machina: Introduction to big data.”Annual Review of Sociology  (): –. Margolin, Drew, Yu-Ru Lin, Devon Brewer, and David Lazer. . “Matching Data and Interpretation: Towards a Rosetta Stone Joining Behavioral and Survey Data.” In Seventh International AAAI Conference on Weblogs and Social Media. https://pdfs.semanticscholar.org/cf/cccdbdbead.pdf. Milgram, Stanley. . “The Small World Problem.” Psychology Today : –. Moody, J. . “The Importance of Relationship Timing for Diffusion.” Social Forces: A Scientific Medium of Social Study and Interpretation  (): –. http://sf.oxfordjournals. org/content///.short. Morris, Martina, Steven Goodreau, and James Moody. . “Sexual Networks, Concurrency, and STD/HIV.” Sexually Transmitted Diseases  (): –. Newman, M. . “The Structure and Function of Complex Networks.” SIAM Review  (): –. Onnela, J.-P., J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási. . “Structure and Tie Strengths in Mobile Communication Networks.” Proceedings of the National Academy of Sciences of the United States of America  (): –. Stokman, Frans N., and Patrick Doreian. . 
“Evolution of Social Networks Part II.” Special issue of Journal of Mathematical Sociology  (): –. Tufekci, Zeynep. . “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls.” arXiv [cs.SI]. http://arxiv.org/abs/.. Wasserman, Stanley, and Katherine Faust. . Social Network Analysis: Methods and Applications. New York: Cambridge University Press. Watts, D. J., and S. H. Strogatz. . “Collective Dynamics of ‘Small-World’ Networks.” Nature : –. http://www.nature.com/nature/journal/v/n/abs/a.html. Welles, Brooke Foucault. . “On Minorities and Outliers: The Case for Making Big Data Small.” Big Data & Society  (): . Wellman, Barry, and S. D. Berkowitz. . Social Structures: A Network Approach. New York: Cambridge University Press.

  ......................................................................................................................

   Using Computational and Network Tools to Rebuild Media Theory ......................................................................................................................

 

I the last couple of decades, the field of mass communication has sought to redefine itself in a new information environment. This effort has gone beyond re-examining classic theories and evaluating their relevance in the context of new media. The very concept of mass communication as both a phenomenon and a field of study has been brought into question. A major work defining a potential shift toward the “demassifying” of mass communication was written by Chaffee and Metzger at the start of the twenty-first century. In it the authors point to contemporary trends toward increasing content personalization; availability and diversity of information channels; and enhanced individual capacity for production, dissemination, and selective exposure to content (Chaffee & Metzger, ). This chapter discusses briefly the challenges facing mass communication. It argues that despite the profound social and technological shifts of the last few decades, the research tradition remains highly relevant. Both mass media and mass audiences are still around, and our ability to study them has improved considerably. Computational methods, large-scale digital data collection, and new modeling techniques (combined with qualitative domain knowledge) offer novel ways to gain a nuanced understanding of media influence and public opinion formation. The growing use of computational techniques in the social sciences also highlights the need for better transparency, replicability, and ethical research standards. Network science provides one key way of investigating social and technological interactions, as well as the diffusion of information and patterns of influence among individuals, groups, and organizations. The chapter presents a network-based model of media message flows and agenda-setting processes to illustrate how the exploration of complex systems can produce important insights into mass communication theory.

. T E  M C?

..................................................................................................................................
Throughout most of the twentieth century, when dominant mass communication theories emerged, researchers operated under the conditions of a relatively uniform, centralized media system. Three major broadcast networks and a limited number of influential newspapers and magazines were credited with shaping public opinion in the United States (Bennett & Iyengar, ). Major technological and social transformations have since changed the practices of media production, dissemination, and consumption. Deregulation in the media sector and the advent of digital technologies facilitated the emergence of countless news outlets. Information overload, proliferation of distribution channels, and a perceived shift of power from corporations to users are said to characterize the media landscape of the twenty-first century (Castells, ). As a result of these shifts, current approaches to media studies have come under scrutiny (Chaffee & Metzger, ). Influential theoretical works suggest that mass communication frameworks have to be re-evaluated in view of the changing information environment (Bennett & Iyengar, ; Bennett & Manheim, ). One basic assumption of earlier research was that the public obtained news from a limited number of outlets with similar journalistic culture, content priorities, and gatekeeping routines. Virtually unrestricted access to diverse sources of content could violate that assumption, compromising the media's consensus-building function (Takeshita, ). The improved capacity of consumers to select their preferred messages could decrease social cohesion and lead to a segmentation of audiences (Blumler & Kavanagh, ).
As news consumption and information access grew increasingly personalized (Tewksbury, ), scholars predicted a coming era of cyberbalkanization (Sunstein, , ) and filter bubbles (Pariser, ), in which selective exposure dissolves mass audiences into small and isolated like-minded groups. A related trend that Bennett and Iyengar () refer to as the demise of the inadvertent audience was the unbundling of media content (Kaye & Quinn, ). Media companies would traditionally offer different types of news and entertainment materials packaged together. In the past, audiences had no control over the construction of those packages, nor did they have many alternatives to choose from. People were more likely to watch the evening newscast while waiting for the entertainment part of the network program or to browse through the pages of a newspaper after reading the sports section. Today, it is easier to construct a media diet consisting entirely of sports or entertainment news and to find many outlets that cover exclusively one’s areas of interest. Social media platforms provide one new pathway for inadvertent exposure to news. According to recent Pew reports, four in ten Americans get news on Facebook, and one in ten gets news on Twitter (Barthel et al., ). Exposure to media content through social networking sites, however, has a somewhat limited potential to increase the diversity of individual news consumption. This is due in part to homophily: our
preference to form online and offline social ties with people who are similar to us (McPherson, Smith-Lovin, & Cook, ). Our friends tend to share our socioeconomic background, demographic characteristics, political preferences, and personal interests. As a result, the content that our social network contacts post is likely to match our thematic and framing preferences. Individuals also tend to engage more with congruous content, a fact that has important consequences for digital platforms (Bakshy, Messing, & Adamic, ). Online systems often track user behavior and use it to highlight content that is most likely to elicit engagement and discount posts that seem unlikely to spark the person’s interest. In reality, the theoretically unlimited universe of news we encounter on social media is restricted by both technological and social factors. Our inadvertent exposure to diverse views is limited by our preference for like-minded friends, our tendency to select items and sources we agree with, and social media algorithms that prioritize the kind of content we have been known to favor.

. I D  M C

..................................................................................................................................
While the industry is changing in response to technological shifts, economic pressures, and new regulation, evidence suggests that both mass communication theories and mass media companies remain relevant and influential (Holbert, Garrett, & Gleason, ; Perloff, ). Even as digital platforms are ubiquitously used, newspapers, broadcast networks, and mainstream sources of online news retain an important role in the formation of public opinion (Shehata & Stromback, ). At this time, the World Wide Web's potential to cause drastic audience fragmentation appears not to be fully realized. The trend toward information proliferation is countered by attention scarcity (Goldhaber, ). Individuals have a limited amount of time to spend on media products, and that provides an advantage to the bigger, easier to find, better-known news outlets (Nagler, ). Research examining audience fragmentation across traditional and online news sources (Webster, ) finds high levels of duplication across media outlets and no evidence of isolation in like-minded consumption groups. Attention concentration patterns are also evident online, where search engine ranking mechanisms often determine which sites will receive most of the traffic (Epstein & Robertson, ). While the Internet offers an enormous wealth of information sources, people still tend to cluster around a select few. The popular news sites are owned predominantly by large media companies (Miel & Faris, ). Traditional news organizations, particularly newspapers and cable TV stations, dominate the online information space (Pew Research Center, , ). Popular web sources of local coverage are likewise limited in number and mostly affiliated with traditional media (Hindman, ). Similar trends have been recorded for online platforms and services.
Pew and Nielsen (Pew Research Center, ) examined millions of blogs and found that almost all of the news stories they linked to came from traditional media. A limited number of elite actors command
the attention of social media users as well (Wu et al., ). The most prominent Twitter accounts belong to traditional media and high-profile public personalities (Kwak et al., ). One of the most often discussed dividing lines in media consumption practices is grounded in political ideology. While some studies find evidence of ideologically motivated selective exposure (Stroud, ), others suggest that partisan preferences do not lead to selective avoidance (Holbert, Garrett, & Gleason, ). Controlling for ideology, Holbert, Hmielowski, and Weeks () found a strong positive association between the use of politically divergent cable networks like FOX News and MSNBC. Gentzkow and Shapiro () further reported that segregation in online news, while higher than that in most offline media use, was low in absolute terms. In a large-scale analysis of user behavior on Facebook, Bakshy et al. () found non-negligible levels of crosscutting exposure. An estimated % of the hard news users encountered, and % of the hard news they clicked on, diverged from their own political preferences. Scholars have argued that the array of new information sources has weakened considerably the power of traditional media and eliminated their gatekeeping role (Williams & Delli Carpini, , , ). Yet there is an interpretation under which those diverse sources are simply adding to the long list of exogenous factors known to have an impact on news coverage. There are still, as there have always been, many other influences, including elite news outlets, public figures, corporate lobbies, and experts. As long as the majority of Americans rely on large news organizations for information (Pew Research Center, ), discarding the filtering function of journalism seems premature. Notably, the increasing number of information sources did not bring about a proportional increase in content diversity (Ognyanova, ). 
The accelerated news cycle and the demands of fast-paced online journalism create strong pressures toward homogenization of content, as little time is left for research and reporting (Mitchelstein & Boczkowski, ; Boczkowski & De Santos, ). The financial problems of journalism further contribute to content cohesion by reducing the available resources for original content and increasing everyone’s reliance on a few large wire agencies (McChesney & Nichols, ). Another factor promoting cohesion in news content is the ongoing trend toward media concentration on a global scale. A relatively small and interconnected group of multinational media conglomerates owns a large number of high-profile news production sites (Arsenault & Castells, ). As companies seek economies of scope, organizational knowledge, resources, and staff are shared among the venues they own, across media formats.

. N A  M S

..................................................................................................................................
As discussed in the previous section, large media companies still play an important social role. That role, however, is more difficult to assess now than ever before.
Contemporary mass communication research needs to account for digital media formats, user-generated content, proliferation of distribution channels, complex influence patterns, and pathways of message diffusion. While these are serious challenges, they are by no means insurmountable. In the last few decades there have been significant advances in the theoretical frameworks and methodological approaches of the social sciences. Large-scale digital trace data provide some of the raw material needed to examine old and new theoretical constructs. Computational social science offers the tools to test complex hypotheses encompassing multiple parts of the media system. Fast-growing fields like network science give us new ways to structure existing theories, as well as the corresponding methodological apparatus. Automated text, image, audio, and video analysis allows us to examine high volumes of multiformat media content. Last but not least, scientific research practices, especially with regard to open data and replication policies, are slowly but surely improving (Crosas et al., ). Establishing better standards in that area is without a doubt one of the key tasks communication studies will face next. In the context of media research, computational approaches and large-scale digital data sets allow us to track the spread of interpersonal and media messages, evaluate patterns of influence, identify shifts in public opinion in near-real time, and gain a nuanced view of political and news agendas. While social media data have been widely used in this context (Freelon, ), other equally important data sets include large-scale media content (Neuman et al., ), web content and hyperlink structures (Weber, ), blogs (Almquist & Butts, ), discussion forums (González-Bailón, Banchs, & Kaltenbrunner, ), online user behavior captured through the server logs of news websites, and more.
One particularly promising research approach lies at the intersection of mass communication and computational linguistics (González-Bailón & Paltoglou, ), an area that has already produced important and interesting results (Kleinnijenhuis, Schultz, & Oegema, ; Soroka, Stecula, & Wlezien, ).

. N S   M S

..................................................................................................................................
Network science provides a set of methods and theoretical constructs uniquely suited to advance our understanding of mass communication (Ognyanova & Monge, ; Fu, ). As the discipline examines increasingly complex processes, placing interpersonal and media messages in the context of larger social structures, it faces an important shift in theoretical focus. The main emphasis moves from the attributes of organizations, news stories, and consumers to social relations and interactions, influence patterns, flows of information, and resources (Borgatti et al., ). This theoretical orientation reflects the current state of the media system as it moves to networked forms of content production, delivery, and consumption. Persistent industry-wide trends increase the levels of consolidation, interorganizational
collaborations, and local and global partnerships (Arsenault & Castells, ). Online and mobile formats connect newsrooms and audience members (Cardoso, ), making content diffusion both faster and easier to track through digital traces (Anderson, ; Lazer et al., ). Professional and personal social ties affect individual news consumption and distribution habits (Boczkowski, ). News stories are placed within networks of semantic relations (Diesner & Carley, ) and hyperlink connections (Turow & Tsui, ). Network science allows for a multilevel analysis, capturing the structural determinants of social and political processes, public perceptions, media agendas, and individual behavior. Its ability to deal with complexity has also made it one of several key areas expected to advance policy research and guide media regulation (Friedland et al., ). A number of major mass communication theories are grounded in distinctly network concepts, even if not always explicitly defined and tested as such. One classic example, the two-step flow of communication (Katz & Lazarsfeld, ), deals with the diffusion of news stories through social networks. Its main premise is that media messages are channeled through a particularly active audience segment known as the opinion leaders. These individuals receive, interpret, and disseminate the news among the larger public. Other theories, such as agenda setting (Ognyanova, ; Guo & McCombs, ) and gatekeeping (Barzilai-Nahon, ), have recently been reinterpreted in network terms. Many early media effects theories, as well as recent works lamenting the end of mass communication, exhibit a certain lack of nuance in conceptualizing influence patterns. It seems fairly clear that media messages can influence mass audiences without reaching each individual directly, simultaneously, and through a single channel or format. 
The concept of a “mass audience” (or, for that matter, a single “public” that has an opinion) was always just a useful simplification. It may retain its usefulness even if some members of the audience read stories in a physical newspaper, others see those stories as Facebook posts, and still others find them on the web through links from Twitter. The contribution of network thinking is that it gives us the instruments to track the complex patterns of message diffusion and social contagion through multiple channels and to assess their impact on individuals and larger social groups (Aral, Muchnik, & Sundararajan, ).

Early network research in the field of media effects tends to conceptualize social structure as a conduit for the spread of ideas and information. The focus in that context is on individuals and the connections among them. Media outlets are not seen as part of this network, though they do produce the content that propagates through it. The work of Menzel and Katz (), building on the two-step framework, provides one canonical example of this type. Their research mapping the social ties of health professionals uncovers a multistep influence of medical journals and interpersonal relations on drug adoption. A more flexible way to think about this system would view individuals and media outlets as embedded in a multidimensional network (see Figure .). This type of model still examines interpersonal ties, but it also incorporates individual connections to (and potentially among) specific news sources. Friemel (), for instance, uses a similar

Figure. Early and contemporary network models of media influence. Panel 1: classic two-step flow model (e.g., Katz & Lazarsfeld, 1955); panel 2: multistep flow network model (e.g., Menzel & Katz, 1955); panel 3: media-integrated multistep flow network model (e.g., Friemel, 2015). Node types: media, opinion leaders, public.

approach to examine the social networks of high school students along with their connections to various TV programs. Works within this framework often allow for the possibility that individuals as well as media outlets can generate, selectively filter, and disseminate messages. This line of research has produced a number of studies exploring online influence patterns among news organizations and audiences, including research focusing on social media platforms (Xu, Sang, Blasiola, & Park, ).

To further demonstrate the versatility of network approaches to mass communication theories, as well as the ability of those frameworks to address key concerns facing the discipline, the next section of this chapter presents a network model of the media system, with applications to agenda-setting research. Agenda setting is selected as one of the key theoretical frameworks in the field with premises challenged by the digital transformation of the media system (Bennett & Iyengar, ; Chaffee & Metzger, ). The model allows for empirical examination of various aspects of the theory, as well as its performance in the context of digital platforms and potentially fragmented audiences.

. A N M  A S

One of the dominant media effects theories, agenda setting suggests that media can influence the way we see the world (McCombs, ). Both the content and the format of news stories are said to provide cues about the social relevance of objects and events. As a result, at any given time a limited number of issues occupies the attention of journalists, citizens, and politicians. The focus of public and political attention on a
narrow range of topics facilitates a shared perception of community priorities, allowing social mobilization and collective action to take place.

Agenda-setting theory in its present form was first articulated by Maxwell McCombs and Donald Shaw (), who studied the impact of media on the issue priorities of undecided voters. Academics have since extended the scope of the framework to also study the formation of media agendas (agenda-building research), investigating factors that influence the salience of items in the news. Work in that area involves exploring key external news sources (extramedia level), the influence of media on each other (intermedia level), and the internal newsroom dynamics affecting editorial decisions (Dearing & Rogers, ).

The current pressures to rethink the model of influence proposed by McCombs and Shaw emerge from parallel developments in theory and society (Williams & Delli Carpini, ). In order to address added layers of complexity, a new conceptualization of the agenda-setting process needs to incorporate a variety of relevant features and relationships characterizing news outlets, audience members, and social issues. A framework of this kind would benefit from the instruments provided by network theory, a field that specializes in the examination of complex dynamics involving attributes and relations, as well as higher-order structures. Traditional agenda-setting research relies largely on correlation and regression tests—methods that cannot easily be used to study the influence flows in the media system. Network analysis provides a way of addressing that problem.

The model presented here (see Figure .) is structured as a dynamic, multidimensional network of issues, individuals, and information sources. The following paragraphs provide a brief description of the actors and relationship types incorporated in the framework, as well as the rationales for their inclusion.
This section also sketches relevant actor and object characteristics. Finally, network mechanisms are mapped onto agenda-setting processes.

Figure. Network model of agenda setting. Node types: information sources, audience members, issues (objects, attributes); link types: issue adoption, media consumption, interorganizational ties, social connections, concept associations.

5.1 Actors in the Network

Classic agenda-setting research examines the patterns of influence between news outlets and consumers. Similarly, two key types of nodes in the network model presented here are information sources and audience members. The third element in this system is the topics or issues that media opt to cover and individuals choose to pay attention to. Following McCombs’s (, ) conceptualizations, issues are broadly defined to include any object that may draw attention or about which one may hold an opinion. That allows research to explore general topics, specific stories or events, public figures, organizations, countries, and other objects in the news. Furthermore, in the network model described here, nodes denoted as “issues” may also be prominent object aspects or interpretations. In this way, the framework accommodates studies of second-level agenda setting and framing (McCombs, , ).
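
One minimal way to represent this multidimensional structure in code is to keep the three node sets and the typed links in ordinary data structures. The sketch below is purely illustrative: the outlet, audience member, and issue names, the ties among them, and the adoption-degree helper are all hypothetical.

```python
# Purely illustrative sketch of a multidimensional agenda-setting network.
# All node names and ties below are hypothetical.
from collections import defaultdict

# Three node sets (modes): information sources, audience members, issues.
sources = {"OutletA", "OutletB"}
audience = {"Ann", "Ben", "Cara"}
issues = {"economy", "climate"}

# Links stored per relationship type, as undirected node pairs.
links = defaultdict(set)
links["issue_adoption"] |= {("OutletA", "economy"), ("Ann", "economy"),
                            ("Ben", "climate")}
links["media_use"] |= {("Ann", "OutletA"), ("Ben", "OutletA"),
                       ("Cara", "OutletB")}
links["social"] |= {("Ann", "Ben")}

def degree(node, link_type):
    """Count ties of one type incident to a node."""
    return sum(node in edge for edge in links[link_type])

# A simple prominence measure for an issue: its adoption-degree.
print(degree("economy", "issue_adoption"))  # -> 2
```

Analyses of real multimodal networks would typically rely on a dedicated library such as networkx, but the underlying bookkeeping is the same: distinct node sets plus edge sets keyed by relationship type.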

5.2 Relationship Typology

The multidimensional network model includes five types of links, defined in a way that allows for a wide range of operationalizations.

5.2.1 Issue Adoption Ties

Key to the agenda-setting process, this type of tie indicates that an issue has become salient for media outlets or consumers. The connection between an issue and a news source is formed when the item is covered by the outlet. The link may be conceptualized in binary terms (presence/absence of connection) or weighted based on traditional salience dimensions, such as the placement prominence of a story or the time/space dedicated to it (Kiousis, ). Similarly, an issue adoption link connecting an issue to an audience member is recorded when it becomes clear that an issue has captured the person’s attention. This can be assessed through a typical agenda-setting survey instrument (McCombs, ). Alternatively, the link can be observed through digital traces (e.g., an individual posts a link to a story about the issue on a social networking platform or mentions it in a blog post).

5.2.2 Media Use Ties

Researchers have examined a number of pertinent relationships between individuals and media outlets. Higher news consumption, as well as reliance on media, is expected to enhance agenda-setting effects (McCombs & Reynolds, ; Wanta & Hu, ). In particular, exposure to a news source covering an issue is likely to increase the perceived importance of that issue (Stroud, ). This is therefore another key type of link in the model. As the literature has tested a number of related constructs (e.g., use, exposure, reliance, dependence), any of those can be substituted here. This allows for
conceptualizations ranging from a binary use/no use tie to a valued link weighted by exposure time or dependence strength.

5.2.3 Interorganizational Ties

A wide range of formal and informal relationships could constitute network ties between two media organizations. The list includes well-studied connections like partnership, ownership, and cross-investment (Arsenault & Castells, ). Baker and Faulkner () suggest a number of additional link types: market exchanges, strategic alliances, joint participation in syndicates, joint political action, interlocking directorates, family ties, and even joint illegal activities such as collusion. Interorganizational relationships, both of cooperation and competition, are pertinent to the media agenda-setting process, as they influence news selection (Dimmick, ). This may occur as a result of content sharing between outlets or a transfer of organizational routines and news values.

5.2.4 Social Ties

Social ties include friendship, kinship, and other communication connections between audience members. That definition also fits friend/follow links in online social media platforms, though it is important to note that the meaning and function of those online connections should be given serious consideration before exploring their patterns. These relationships are crucial, as they provide a social infrastructure allowing for the spread of media preferences and the diffusion of news content. Furthermore, interpersonal discussion is a major intervening variable in investigations of salience transfer between the media and public agendas (Dearing & Rogers, ). When conversations deal with issues covered by news media, communication can enhance agenda-setting effects (Wanta & Wu, ). This also means that direct exposure to specific news content may not always be a prerequisite for the effects to occur (Wanta & Ghanem, ).

Combining agenda-setting research with two-step and diffusion models (Brosius & Weimann, ) has allowed researchers to study the interaction between interpersonal and media effects. The importance of examining social and media connections in parallel is recognized in a number of theoretical traditions. One example comes from the communication infrastructure theory (Kim & Ball-Rokeach, ), a framework incorporating interpersonal and mediated effects in a community context.

5.2.5 Concept Association Ties

Links between issues may connect items that have some association in meaning, a conceptual or semantic relationship. This is another broad definition allowing for multiple operationalizations, permitting the use of relationship ontologies such as those adopted in semantic web projects. As one example, attitude objects such as “presidential elections,” “Hillary Clinton,” and “Donald Trump” could be considered conceptually associated. Such a conceptual tie between two issues may make them
more likely to appear on the agenda together. In addition, research has suggested that some issues may have a competitive relationship, reducing the likelihood that they will be prominent at the same time (Djerf-Pierre, ).

5.3 Link Direction and Agency

The proposed model does not contain inherent assumptions about agency. Those could, however, be built in based on the theoretical grounding and research design of a particular study. While all relationships in the system are presented as symmetric (see Figure .), it is possible to adopt an interpretation assuming a certain direction of influence. A directed link between individuals and media sources, for instance, would be grounded in an understanding of audiences as either active participants or passive consumers. The issue adoption links can also have a direction reflecting top-down processes or the view that agency rests with individuals. An interesting alternative could build upon meme literature stemming from the work of Dawkins (), which implies that issues are the agents that propagate across hosts.

5.4 Individual and Dyadic Attributes

In addition to capturing the relationships between actors and objects, a network representation of agenda setting allows for the inclusion of relevant node-level attributes. Information sources, for instance, may be characterized by revenue, geographic area, or format (e.g., radio, TV, print, online). Audience members have a range of demographic characteristics that can potentially influence the agenda-setting process (Wanta, ; Wanta & Ghanem, ). Issues can also be evaluated or classified in a number of ways, such as by domain (politics, science, entertainment, etc.) or scope (local, regional, national, international).

Some important agenda-setting constructs are dyadic in nature and need to be operationalized not as individual properties, but as link-level attributes. One such example is obtrusiveness, or the extent to which a particular issue is part of someone’s personal everyday experience (Coleman et al., ). Items like “unemployment” or “crime” may be obtrusive for some individuals and not others, making obtrusiveness a characteristic of the relationship between person and issue.

5.5 Network Mechanisms

As discussed previously, the network framework proposed here adopts some basic definitions of the agenda-setting perspective. The conceptualizations of issue, object, and attribute, as well as measures of salience, also apply here. Other concepts and processes, however, require a network interpretation. The prominence of an issue on
the agenda, for instance, is traditionally assessed based on a rank-ordered list of priorities (Valenzuela & McCombs, ). A direct network equivalent of that measure would be the issue’s degree centrality: the number of individuals and/or media sources directly connected to an issue, potentially weighting for the strength of those relationships (Freeman, ). More advanced measures could take into account the extent to which an item is embedded in the overall network, or the average number of steps to be traversed in order for the issue to reach every person/outlet included in the study (Borgatti & Everett, ).

The basic agenda-setting process is typically defined as a transfer of salience from media to the public agenda, with effect strength evaluated through correlation analysis (McCombs, ). Audience members will perceive as salient an issue that features prominently in the news they consume. In network terms, this process should result in a propensity for triadic closure within a particular source-individual-issue configuration (see Figure ., panel ). When an information source is connected to both an audience member and an issue, there should be an increased probability for tie formation between the issue and the individual. Though it has a different theoretical grounding, this mechanism operates somewhat similarly to the balance principle known to predict transitivity in social relations (Granovetter, ).

The capacity of individuals to place an issue on the media agenda could similarly be operationalized in network terms. Like its counterpart, bottom-up agenda setting can be expressed as a propensity toward the closure of triads in which an individual is linked to both an issue and a news source. However, while a single media outlet can influence a news consumer, the reverse effect is more likely to be a game of numbers.
If a sufficiently large number of people have a shared concern, it may end up high on the news agenda, regardless of the media use patterns of those involved. Thus bottom-up

Figure. Network mechanisms underlying agenda-setting processes. Panel 1: traditional media-to-public agenda-setting mechanism; panel 2: bottom-up public-to-media agenda-setting mechanism. Node types: information sources, audience members, issues (objects, attributes); link types: issue adoption, media consumption.
agenda-setting effects may be produced by a preferential attachment mechanism (Easley & Kleinberg, ) similar to the one presented in figure ., panel . In a network context, this mechanism (also known as “cumulative advantage” or “the rich get richer”) describes a propensity to form links with nodes that are already well connected. Preferential attachment to popular issues is more generally one plausible generative mechanism for an agenda network of the type described here. Both news sources and individuals are likely to form connections to issues already considered important by the media and the public. All of the processes described so far could potentially operate in conjunction to shape agenda-setting patterns. Combining those mechanisms in a single model provides a useful way to evaluate how well each one explains the observed structure. This is one advantage of taking a network approach, as it allows for the simultaneous testing of multiple complementary and competing hypotheses operating at different levels of analysis (Monge & Contractor, ; Contractor, Monge, & Leonardi, ).

Another network-centric analytical strategy aimed at predicting the adoption of issues comes from contagion and diffusion frameworks. Initially developed to track the spread of disease or technological innovations, those models have been used to study the propagation of topics through social networking platforms (Oh, Susarla, & Tan, ) and blogs (Leskovec et al., ; Leskovec, Backstrom, & Kleinberg, ). Two types of models—threshold (Valente, ) and cascade (Cointet & Roth, )—can be used to explore the diffusion of issues across outlets and individuals. In threshold models, adoption is based on the proportion of connections that have already adopted the issue. In a cascade model, each time an actor is “infected” with a new issue, there is a certain probability that the infection will spread to neighboring nodes.
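
The two adoption rules just described can be sketched in a few lines of code. This is a toy illustration, not an implementation from the literature: the four-node friendship network, the threshold value, and the infection probability are all invented.

```python
# Toy threshold and cascade adoption on a four-node friendship network.
# The graph, threshold, and infection probability are invented examples.
import random

neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}

def threshold_adoption(seeds, theta=0.5, rounds=3):
    """Threshold model: adopt once the fraction of adopting neighbors reaches theta."""
    adopted = set(seeds)
    for _ in range(rounds):
        for node, nbrs in neighbors.items():
            if sum(n in adopted for n in nbrs) / len(nbrs) >= theta:
                adopted.add(node)
    return adopted

def cascade_adoption(seeds, p=0.4, rng=random.Random(1)):
    """Cascade model: each new adopter gets one chance to infect each neighbor."""
    adopted, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for n in neighbors[node]:
            if n not in adopted and rng.random() < p:
                adopted.add(n)
                frontier.append(n)
    return adopted

print(sorted(threshold_adoption({"A", "B"})))  # -> ['A', 'B', 'C', 'D']
print(sorted(cascade_adoption({"A"})))         # -> ['A', 'B', 'D'] with this seed
```

Note the difference in character: the threshold outcome is deterministic given the seed set, while the cascade outcome depends on the random draws, so seeding the generator is what makes runs reproducible.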

5.6 Reducing Complexity in the Network Model

The network model proposed here incorporates media effects, as well as intermedia and interpersonal influences. It provides a useful organizing framework encompassing different aspects and levels of agenda setting. This comes at a cost, as data collection and analysis need to account for complex structures with multiple types of nodes and relationships. While some research questions require that level of complexity, others may not. At present, most studies in the field have focused on a single dimension of the agenda-setting process and do not incorporate the full range of elements included here.

A simple way to reduce complexity while preserving the basic ideas behind the model is to focus on a limited subset of its elements. Studies could—and many do—only investigate issue adoption and media consumption links, discarding interorganizational, social, and conceptual associations (see Figure ., panel ). Another way to simplify the analysis is to decrease the variety of node types present in the model. Reducing the number of modes (i.e., distinct sets of entities in a network) is a standard technique used with multimodal structures (Wasserman & Faust, ). The excluded elements are typically those of less relevance or interest to the researcher

Figure. Reducing the complexity of networked agenda-setting models. Panel 1: decreasing the number of included link types; panel 2: decreasing the number of included node types; panels 3a-3d: public agenda setting, media agenda setting, diffusion/two-step flow, and third-level agenda setting. Node types: information sources, audience members, issues (objects, attributes); link types: agenda convergence/transfer, interorganizational ties, social connections, concept associations, media use/dependence.

(Borgatti, ). As works within the agenda-setting perspective are largely concerned with the impact of news on public opinion, the nature of particular issues is often less important than the degree of correspondence between media and audience priorities. This being the case, issues are one element type that can be removed from the model (see Figure ., panel ). In their place, a new relationship—an agenda convergence tie—is defined. It represents the convergence of agendas between the remaining nodes (audience members and/or media outlets), or the transfer of salience across them. The link can, for instance, be evaluated based on one of many available measures of similarity or distance, the simplest of which is the number of shared issues (Borgatti & Halgin, ). Reducing complexity further, a study can focus on smaller subsets of nodes and links (Figure ., panel ). In the spirit of early agenda-setting research (Dearing & Rogers, ), researchers may opt to examine the media use and issue overlap (agenda convergence) relationships between individuals and information sources (Figure ., a). Among other things, this line of research allows us to investigate the extent to which different modes of media consumption affect patterns of media influence. Intermedia scholarship may similarly adopt models including interorganizational and shared issue relations between news outlets (Figure ., b). One study of that kind (Ognyanova, ) examines the levels of media fragmentation in a network of US news outlets from five industry sectors: newspapers, online sources, radio, cable TV, and network TV stations. The research is based on data describing seventy thousand news stories from sixty-four media outlets, collected by the Pew Project for Excellence in Journalism over a one-year period. Additional data sources provide information about
media ownership patterns and audience demographics. The study constructs a dynamic agenda convergence network reflecting the similarity in covered topics among media sources over time. To test its robustness, the analysis is repeated using different time windows for the construction of network snapshots (a week, two weeks, and a month), as well as different ways to measure topical similarity (simple matching, correlations, Jaccard index, cosine similarity, etc.). Indices of network cohesion provide a convenient way of evaluating media fragmentation, a construct that is otherwise difficult to operationalize. The results of that study suggest an increase in media content homogeneity over time. Additional tests use stochastic actor-oriented models (Snijders et al., ) to evaluate a range of factors contributing to the increasing similarity in news coverage across outlets. The dynamics of agenda convergence are found to be shaped by the story selections of popular outlets and driven by similarities in format, audience demographics, and political ideology. The analysis also shows that ownership relations lead to lower agenda convergence among outlets in the sample (Ognyanova, ).

Agenda convergence networks based on topical similarity provide one way to conduct research exploring the media sector as an interconnected system. Another possibility is examining similarities in media organizations in terms of their audience rather than their news agenda. Webster & Ksiazek (), for instance, studied a network of media outlets and examined their patterns of audience sharing to determine the levels of audience fragmentation in the United States. The network of overlap (or spread) of issues across individuals is shown in Figure ., c, although such research may fall outside the scope of traditional agenda-setting scholarship.

Comparisons of issue associations across different agendas present another possibility (Figure ., d).
This type of model was used in a study by Guo and McCombs () that examined media and public agendas during Texas gubernatorial and US senatorial elections. The analysis compares two issue networks. One of the networks represents conceptual associations between political figures and their attributes, extracted from media content. The other is based on similar associations reported by local residents. As the two concept maps exhibit high levels of similarity, Guo and McCombs conclude that media may be able to influence relations between objects and attributes perceived by audience members. The process is referred to as third-level or network agenda setting (NAS). Subsequent works in this line of research have examined issue networks extracted from political content on Twitter (Guo & Vargo, ; Vargo et al., ).
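
The simplest convergence measures mentioned earlier (a shared-issue count, or the Jaccard index over issue sets) can be illustrated with a toy example; the two agendas and actor labels below are invented.

```python
# Toy agenda convergence between two actors, over invented issue agendas.

def jaccard(a, b):
    """Jaccard index: shared issues divided by all issues on either agenda."""
    return len(a & b) / len(a | b) if a | b else 0.0

outlet_agenda = {"economy", "climate", "elections", "crime"}
reader_agenda = {"economy", "elections", "sports"}

print(len(outlet_agenda & reader_agenda))               # shared issues -> 2
print(round(jaccard(outlet_agenda, reader_agenda), 3))  # -> 0.4
```

Weighted variants (e.g., cosine similarity over issue salience scores) follow the same pattern, with vectors in place of sets.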

. C   R A

The agenda-setting framework described here provides a relevant example of a network approach to mass communication theories and effects. That model, or various equivalents of its reduced forms, has already been used to explore key issues facing mass
communication, including media homogenization (Ognyanova, ) and audience fragmentation (Webster, ). The stronger emphasis on computational social science and network thinking (both quantitative and qualitative) is one important trend that holds potential for advancing the theoretical and methodological sophistication of media studies. Another major driver of that advancement discussed here is the increasing availability of large-scale digital trace data recording individual, group, and organizational behavior (Pentland, ). These data include not only the now-dominant social networking platform data sets, but also raw media content, mobile device data, online activity captured through server logs, web archival data, various relevant text corpora, and more.

A key point to reiterate (one that has been made often enough in the literature but warrants frequent repetition) is that “big data” research has its own big problems. Large sample size does not ensure representativeness (Hargittai, ) or guarantee that the available measures adequately reflect the theoretical constructs of interest (Shah, Cappella, & Neuman, ). Moreover, even the most detailed and comprehensive data sets require both domain expertise and formal theory to produce meaningful insights (Parks, ). Media studies is one field in which combining qualitative research and computational techniques can produce especially useful results.

Sampling is particularly challenging in the context of network research, which often suffers from boundary specification problems. Identifying the set of interconnected actors to be examined can be a difficult task in fluid social networks that have no clearly defined natural boundaries. Decisions that affect actor inclusion are highly consequential, as different specification choices can result in dramatically different network-level statistics (Kossinets, ).
The ethics of data collection and use, as well as considerations regarding privacy and informed consent, add another layer of complexity. These matters are especially difficult to navigate in the context of incongruent academic, corporate, and government standards (Lazer, ). Even within academia, the ethical norms for large-scale collection of personal data are by no means consistent or uniform.

As longitudinal digital records reflect and preserve an ever-increasing proportion of our daily activities, the number and scope of relevant data sets is steadily growing. More important, new standards and tools that allow us to combine information from multiple complementary sources are becoming more common and well-established (Lazer et al., ; Driscoll & Thorson, ). This is particularly relevant for mass communication research, which frequently benefits from juxtaposing several data sets: traditionally public opinion polls and media content analysis, but more recently also records from social media and other digital platforms (Jungherr, ; Conway, Kenski, & Wang, ).

At the same time, many disciplines are also facing greater challenges related to transparency and replicability. Large data sets are difficult to distribute, difficult to anonymize, and difficult to clean of sensitive information. They are often owned by companies that may be unwilling to open them to researchers. Extensive data cleaning procedures and complex statistical and computational methods are next to impossible
to fully describe in the space of an academic paper—at least not in a way that would allow for an exact replication. Acknowledging these issues, a number of disciplines are proposing standards and solutions to address them. Those efforts include setting up mechanisms for sharing of data, code, and detailed analysis descriptions alongside an article, as well as encouraging the publication of relevant and methodologically solid replication studies and papers with null results (Nosek et al., ). Such initiatives are by no means limited to the natural sciences; major journals in fields like political science have set standards requiring authors to publish their data and code or detailed analytical procedures (DART Group, ).

Media research, as well as the field of communication studies in general, is lagging behind as far as transparency, replication, and open data standards are concerned. This is not a new problem, but it is one that needs to be resolved in order to make the theoretical advances in the field less erratic and more verifiable. It is also a prerequisite to create optimal conditions for scholars to question, re-evaluate, and build upon earlier works in the field. Communication and technology, along with mass communication, are two fields particularly well positioned to lead the charge in that respect. Journal policies and institutional practices encouraging data and code sharing, detailed analysis descriptions, and publication of replication studies can do a lot to improve the ability of scholars to validate and build on existing research.

R Almquist, Z. W., and C. T. Butts. . “Dynamic Network Logistic Regression: A Logistic Choice Analysis of Inter-and Intra-Group Blog Citation Dynamics in the  US Presidential Election.” Political Analysis  ():–. doi:./pan/mpt. Anderson, C. W. . “Journalistic Networks and the Diffusion of Local News: The Brief, Happy News Life of the ‘Francisville Four’.” Political Communication  ():–. doi:./... Aral, S., L. Muchnik, and A. Sundararajan. . “Distinguishing Influence-Based Contagion from Homophily-Driven Diffusion in Dynamic Networks.” Proceedings of the National Academy of Sciences  (): –. doi: ./pnas.. Arsenault, A., and M. Castells. . “The Structure and Dynamics of Global Multi-Media Business Networks.” International Journal of Communication :–. Baker, W. E., and R. R. Faulkner. . “Interorganizational Networks.” In The Blackwell Companion to Organizations, edited by Joel A.C. Baum, –. Oxford: Blackwell Publishers Ltd. Bakshy, E., S. Messing, and L. Adamic. . “Exposure to Ideologically Diverse News and Opinion on Facebook.” Science  ():–. doi:./science.aaa. Barthel, M., E. Shearer, J. Gottfried, and A. Mitchell. . The Evolving Role of News on Twitter and Facebook. Washington, DC: Pew Research Center. Barzilai-Nahon, K. . “Toward a Theory of Network Gatekeeping: A Framework for Exploring Information Control.” Journal of the American Society for Information Science and Technology  ():–. doi:./asi..



 

Bennett, W. L., and S. Iyengar. . "A New Era of Minimal Effects? The Changing Foundations of Political Communication." Journal of Communication  ():–. doi:./j.-...x.

Bennett, W. L., and J. B. Manheim. . "The One-Step Flow of Communication." The ANNALS of the American Academy of Political and Social Science  ():. doi:./.

Blumler, J. G., and D. Kavanagh. . "The Third Age of Political Communication: Influences and Features." Political Communication  ():–. doi:./.

Boczkowski, P. J. . News at Work: Imitation in an Age of Information Abundance. Chicago: University of Chicago Press.

Boczkowski, P. J., and M. De Santos. . "When More Media Equals Less News: Patterns of Content Homogenization in Argentina's Leading Print and Online Newspapers." Political Communication  ():–. doi:./.

Borgatti, S. P. . "Two-Mode Concepts in Social Network Analysis." In Encyclopedia of Complexity and System Science, edited by R. A. Meyers. New York, NY: Springer.

Borgatti, S. P., and M. G. Everett. . "A Graph-Theoretic Perspective on Centrality." Social Networks  ():–. doi:./j.socnet....

Borgatti, S. P., and D. S. Halgin. . Analyzing Affiliation Networks. Lexington: LINKS Center for Social Network Analysis, University of Kentucky.

Borgatti, S. P., A. Mehra, D. J. Brass, and G. Labianca. . "Network Analysis in the Social Sciences." Science  ():. doi:./science..

Brosius, H. B., and G. Weimann. . "Who Sets the Agenda: Agenda-Setting as a Two-Step Flow." Communication Research  ():–. doi:./.

Cardoso, G. . The Media in the Network Society: Browsing, News, Filters and Citizenship. Lisboa, Portugal: CIES-ISCTE.

Castells, M. . "Informationalism, Networks, and the Network Society: A Theoretical Blueprint." In The Network Society: A Cross-Cultural Perspective, edited by M. Castells, –. London: Edward Elgar Publishing.

Chaffee, S. H., and M. J. Metzger. . "The End of Mass Communication?" Mass Communication & Society  ():–. doi:./SMCS_.

Cointet, J. P., and C. Roth. . "Socio-Semantic Dynamics in a Blog Network." Paper presented at the IEEE SocialCom International Conference on Social Computing, Vancouver, Canada. doi:./CSE..

Coleman, R., M. McCombs, D. Shaw, and D. Weaver. . "Agenda Setting." In Handbook of Journalism Studies, edited by K. Wahl-Jorgensen and T. Hanitzsch. New York: Routledge.

Contractor, N., P. Monge, and P. Leonardi. . "Multidimensional Networks and the Dynamics of Sociomateriality: Bringing Technology Inside the Network." International Journal of Communication :–.

Conway, B. A., K. Kenski, and D. Wang. . "The Rise of Twitter in the Political Campaign: Searching for Intermedia Agenda-Setting Effects in the Presidential Primary." Journal of Computer-Mediated Communication :–. doi:./jcc..

Crosas, M., G. King, J. Honaker, and L. Sweeney. . "Automating Open Science for Big Data." The ANNALS of the American Academy of Political and Social Science  ():–. doi:./.

DART Group. . "Data Access and Research Transparency: A Joint Statement by Political Science Journal Editors." Comparative Political Studies  ():–. doi:./.

  



Dawkins, R. . The Selfish Gene. 3rd ed. New York: Oxford University Press.

Dearing, J. W., and E. M. Rogers. . Agenda-Setting. Communication Concepts. Thousand Oaks, CA: Sage Publications.

Diesner, J., and K. M. Carley. . "Revealing Social Structure from Texts: Meta-Matrix Text Analysis as a Novel Method for Network Text Analysis." In Causal Mapping for Research in Information Technology, edited by V. K. Narayanan and D. J. Armstrong, –. Hershey, PA: IGI Global.

Dimmick, J. W. . Media Competition and Coexistence: The Theory of the Niche. Mahwah, NJ: Lawrence Erlbaum Associates.

Djerf-Pierre, M. . "The Crowding-Out Effect: Issue Dynamics and Attention to Environmental Issues in Television News Reporting over  Years." Journalism Studies  ():–. doi:./X...

Driscoll, K., and K. Thorson. . "Searching and Clustering Methodologies: Connecting Political Communication Content across Platforms." The ANNALS of the American Academy of Political and Social Science  ():–. doi:./.

Easley, D., and J. Kleinberg. . Networks, Crowds, and Markets: Reasoning about a Highly Connected World. New York: Cambridge University Press.

Epstein, R., and R. E. Robertson. . "The Search Engine Manipulation Effect (SEME) and Its Possible Impact on the Outcomes of Elections." Proceedings of the National Academy of Sciences  ():–. doi:./pnas..

Freelon, D. . "On the Interpretation of Digital Trace Data in Communication and Social Computing Research." Journal of Broadcasting & Electronic Media  ():–. doi:./...

Freeman, L. C. . "Centrality in Social Networks: Conceptual Clarification." Social Networks  ():–.

Friedland, L. A., P. M. Napoli, K. Ognyanova, C. Weil, and E. J. Wilson. . Review of the Literature Regarding Critical Information Needs of the American Public. Washington, DC: Federal Communications Commission.

Friemel, T. N. . "Influence Versus Selection: A Network Perspective on Opinion Leadership." International Journal of Communication :–.

Fu, J. S. . "Leveraging Social Network Analysis for Research on Journalism in the Information Age." Journal of Communication  ():–. doi:./jcom..

Gentzkow, M., and J. M. Shapiro. . "Ideological Segregation Online and Offline." SSRN Electronic Journal, –. doi:./ssrn..

Goldhaber, M. H. . "The Attention Economy and the Net." First Monday  (–). doi:./fm.vi..

González-Bailón, S., R. E. Banchs, and A. Kaltenbrunner. . "Emotions, Public Opinion, and US Presidential Approval Rates: A -Year Analysis of Online Political Discussions." Human Communication Research  ():–. doi:./j.-...x.

González-Bailón, S., and G. Paltoglou. . "Signals of Public Opinion in Online Communication: A Comparison of Methods and Data Sources." The ANNALS of the American Academy of Political and Social Science  ():–. doi:./.

Granovetter, M. . "The Strength of Weak Ties." American Journal of Sociology  ():–.

Guo, L., and M. McCombs. . "Network Agenda Setting: A Third Level of Media Effects." Paper presented at the annual meeting of the International Communication Association (ICA), Boston, MA.



 

Guo, L., and C. Vargo. . "The Power of Message Networks: A Big-Data Analysis of the Network Agenda Setting Model and Issue Ownership." Mass Communication and Society  ():–. doi:./...

Hargittai, E. . "Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites." The ANNALS of the American Academy of Political and Social Science  ():–. doi:./.

Hindman, M. . Less of the Same: The Lack of Local News on the Internet. Washington, DC: Federal Communications Commission.

Holbert, R. L., R. K. Garrett, and L. S. Gleason. . "A New Era of Minimal Effects? A Response to Bennett and Iyengar." Journal of Communication  ():–. doi:./j.-...x.

Holbert, R. L., J. D. Hmielowski, and B. E. Weeks. . "Clarifying Relationships Between Ideology and Ideologically Oriented Cable TV News Use: A Case of Suppression." Communication Research  ():–. doi:./.

Jungherr, A. . "The Logic of Political Coverage on Twitter: Temporal Dynamics and Content." Journal of Communication  ():–. doi:./jcom..

Katz, E., and P. F. Lazarsfeld. . Personal Influence: The Part Played by People in the Flow of Mass Communication. Glencoe, IL: The Free Press.

Kaye, J., and S. Quinn. . Funding Journalism in the Digital Age: Business Models, Strategies, Issues and Trends. New York: Peter Lang Publishing.

Kim, Y.-C., and S. J. Ball-Rokeach. . "Community Storytelling Network, Neighborhood Context, and Civic Engagement: A Multilevel Approach." Human Communication Research  ():–. doi:./j.-...x.

Kiousis, S. . "Explicating Media Salience: A Factor Analysis of New York Times Issue Coverage during the  US Presidential Election." Journal of Communication  ():–. doi:./joc/...

Kleinnijenhuis, J., F. Schultz, and D. Oegema. . "Frame Complexity and the Financial Crisis: A Comparison of the United States, the United Kingdom, and Germany in the Period –." Journal of Communication  ():–. doi:./jcom..

Kossinets, G. . "Effects of Missing Data in Social Networks." Social Networks  ():–. doi:./j.socnet....

Kwak, H., C. Lee, H. Park, and S. Moon. . "What Is Twitter, a Social Network or a News Media?" Paper presented at the th International World Wide Web (WWW) Conference. doi:./.

Lazer, D. . "The Rise of the Social Algorithm." Science  ():–. doi:./science.aab.

Lazer, D., A. Pentland, L. Adamic, S. Aral, A. L. Barabasi, D. Brewer, N. Christakis, N. S. Contractor, J. H. Fowler, and M. Gutmann. . "Life in the Network: The Coming Age of Computational Social Science." Science  ():.

Leskovec, J., L. Backstrom, and J. Kleinberg. . "Meme-Tracking and the Dynamics of the News Cycle." Paper presented at the th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France. doi:./.

Leskovec, J., M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. . "Cascading Behavior in Large Blog Graphs: Patterns and a Model." Paper presented at the Society for Industrial and Applied Mathematics Conference on Data Mining, Minneapolis, MN. doi:./..

  



McChesney, R. W., and J. Nichols. . The Death and Life of American Journalism: The Media Revolution that Will Begin the World Again. Philadelphia, PA: Nation Books.

McCombs, M. . Setting the Agenda: The Mass Media and Public Opinion. Cambridge, UK: Polity Press.

McCombs, M. . "A Look at Agenda-Setting: Past, Present and Future." Journalism Studies  ():–. doi:./.

McCombs, M. . "Extending Our Theoretical Maps: Psychology of Agenda-Setting." Central European Journal of Communication  ():–.

McCombs, M., and A. Reynolds. . "How the News Shapes Our Civic Agenda." In Media Effects: Advances in Theory and Research, edited by J. Bryant and M. B. Oliver, –. New York: Routledge.

McCombs, M., and D. Shaw. . "The Agenda-Setting Function of Mass Media." Public Opinion Quarterly  ():. doi:./.

McPherson, M., L. Smith-Lovin, and J. M. Cook. . "Birds of a Feather: Homophily in Social Networks." Annual Review of Sociology :–. doi:./annurev.soc....

Menzel, H., and E. Katz. . "Social Relations and Innovation in the Medical Profession: The Epidemiology of a New Drug." Public Opinion Quarterly  ():–.

Miel, P., and R. Faris. . "News and Information as Digital Media Come of Age." In Media Re:public, –. Boston, MA: Berkman Center for Internet & Society.

Mitchelstein, E., and P. J. Boczkowski. . "Between Tradition and Change: A Review of Recent Research on Online News Production." Journalism  ():.

Monge, P., and N. Contractor. . Theories of Communication Networks. New York: Oxford University Press.

Nagler, M. G. . "Understanding the Internet's Relevance to Media Ownership Policy: A Model of Too Many Choices." The B.E. Journal of Economic Analysis & Policy  ():–.

Neuman, W. R., L. Guggenheim, S. M. Jang, and S. Y. Bae. . "The Dynamics of Public Attention: Agenda-Setting Theory Meets Big Data." Journal of Communication  ():–. doi:./jcom..

Nosek, B. A., G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, and G. Christensen. . "Promoting an Open Research Culture." Science  ():–. doi:./science.aab.

Ognyanova, K. . "Intermedia Agenda Setting in an Era of Fragmentation: Applications of Network Science in the Study of Mass Communication." PhD diss., Annenberg School for Communication and Journalism, University of Southern California.

Ognyanova, K., and P. Monge. . "A Multilevel, Multidimensional Network Model of the Media System: Production, Content, and Audiences." Paper presented at the International Network for Social Network Analysis (INSNA) Sunbelt Conference, Redondo Beach, CA.

Oh, J., A. Susarla, and Y. Tan. . "Examining the Diffusion of User-Generated Content in Online Social Networks." SSRN Electronic Journal. doi:./ssrn..

Pariser, E. . The Filter Bubble: How the New Personalized Web Is Changing What We Read and How We Think. New York: Penguin Press.

Parks, M. R. . "Big Data in Communication Research: Its Contents and Discontents." Journal of Communication  ():–. doi:./jcom..

Pentland, A. . Social Physics: How Good Ideas Spread—The Lessons from a New Science. New York: Penguin Group.



 

Perloff, R. M. . "Mass Communication Research at the Crossroads: Definitional Issues and Theoretical Directions for Mass and Political Communication Scholarship in an Age of Online Media." Mass Communication and Society  ():–. doi:./...

Pew Research Center. . The State of the News Media: An Annual Report on American Journalism. Washington, DC: Pew Research Center.

Pew Research Center. . The State of the News Media: An Annual Report on American Journalism. Washington, DC: Pew Research Center.

Pew Research Center. . The State of the News Media: An Annual Report on American Journalism. Washington, DC: Pew Research Center.

Shah, D. V., J. N. Cappella, and W. R. Neuman. . "Big Data, Digital Media, and Computational Social Science: Possibilities and Perils." The ANNALS of the American Academy of Political and Social Science  ():–. doi:./.

Shehata, A., and J. Stromback. . "Not (Yet) a New Era of Minimal Effects: A Study of Agenda Setting at the Aggregate and Individual Levels." The International Journal of Press/Politics  ():–. doi:./.

Snijders, T. A. B., G. G. van de Bunt, and C. E. G. Steglich. . "Introduction to Stochastic Actor-Based Models for Network Dynamics." Social Networks  ():–. doi:./j.socnet....

Soroka, S. N., D. A. Stecula, and C. Wlezien. . "It's (Change in) the (Future) Economy, Stupid: Economic Indicators, the Media, and Public Opinion." American Journal of Political Science  ():–. doi:./ajps..

Stroud, N. J. . Niche News: The Politics of News Choice. New York: Oxford University Press.

Sunstein, C. R. . Republic.com 2.0. Princeton, NJ: Princeton University Press.

Sunstein, C. R. . Going to Extremes: How Like Minds Unite and Divide. New York: Oxford University Press.

Takeshita, T. . "Current Critical Problems in Agenda-Setting Research." International Journal of Public Opinion Research  ():. doi:./ijpor/edh.

Tewksbury, D. . "The Seeds of Audience Fragmentation: Specialization in the Use of Online News Sites." Journal of Broadcasting and Electronic Media :–. doi:./sjobem_.

Turow, J., and L. Tsui, eds. . The Hyperlinked Society: Questioning Connections in the Digital Age. Ann Arbor: University of Michigan Press.

Valente, T. W. . "Social Network Thresholds in the Diffusion of Innovations." Social Networks  ():–. doi:./-()-.

Vargo, C. J., L. Guo, M. McCombs, and D. Shaw. . "Network Issue Agendas on Twitter During the  U.S. Presidential Election." Journal of Communication  ():–. doi:./jcom..

Wanta, W. . The Public and the National Agenda: How People Learn about Important Issues. Mahwah, NJ: Lawrence Erlbaum Associates.

Wanta, W., and S. Ghanem. . "Effects of Agenda-Setting." In Mass Media Effects Research: Advances through Meta-analysis, edited by R. W. Preiss, B. M. Gayle, N. Burrell, M. Allen, and J. Bryant, –. New York: Lawrence Erlbaum Associates.

Wanta, W., and Y. W. Hu. . "The Effects of Credibility, Reliance, and Exposure on Media Agenda-Setting: A Path Analysis Model." Journalism & Mass Communication Quarterly  ():–. doi:./.

  



Wanta, W., and Y. C. Wu. . "Interpersonal Communication and the Agenda-Setting Process." Journalism & Mass Communication Quarterly  ():–. doi:./.

Wasserman, S., and K. Faust. . Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences. New York: Cambridge University Press.

Weber, M. . "Newspapers and the Long-Term Implications of Hyperlinking." Journal of Computer-Mediated Communication  ():–. doi:./j.-...x.

Webster, J. G. . The Marketplace of Attention: How Audiences Take Shape in a Digital Age. Cambridge, MA: MIT Press.

Webster, J. G., and T. B. Ksiazek. . "The Dynamics of Audience Fragmentation: Public Attention in an Age of Digital Media." Journal of Communication  ():–. doi:./j.-...x.

Williams, B. A., and M. X. Delli Carpini. . "Unchained Reaction: The Collapse of Media Gatekeeping and the Clinton–Lewinsky Scandal." Journalism  ():. doi:./.

Williams, B. A., and M. X. Delli Carpini. . "Monica and Bill All the Time and Everywhere: The Collapse of Gatekeeping and Agenda Setting in the New Media Environment." American Behavioral Scientist  ():–. doi:./.

Williams, B. A., and M. X. Delli Carpini. . After Broadcast News: Media Regimes, Democracy, and the New Information Environment. New York: Cambridge University Press.

Wu, S., J. M. Hofman, D. J. Watts, and W. A. Mason. . "Who Says What to Whom on Twitter." Paper presented at the International World Wide Web Conference, Hyderabad, India. doi:./.

Xu, W. W., Y. Sang, S. Blasiola, and H. W. Park. . "Predicting Opinion Leaders in Twitter Activism Networks: The Case of the Wisconsin Recall Election." American Behavioral Scientist  ():–. doi:./.

  ......................................................................................................................

     ......................................................................................................................

 , ˊ  ,  ,   

1. Introduction

Social media and blogging services have become extremely popular. Every day, hundreds of millions of users share random thoughts and emotional expressions, as well as notable political news and views on social issues. In fact, these services have become important vehicles for news and channels of influence. Social media services have also made personal contacts and relationships more visible and quantifiable than ever before. Users interact by following each other's updates and passing along interesting pieces of information to their friends. This kind of word-of-mouth propagation occurs whenever a user forwards a piece of information to her friends, making users a key element in this process (Cha, Benevenuto, Ahn, and Gummadi, ). The potential to detect in advance what kinds of content "go viral" has fascinated researchers across a wide range of disciplines, including marketing, political science, and journalism. The availability of digitally logged propagation events in social media helps researchers better understand how user influence, tie strength, repeated exposures, conventions, and various other factors come into play in the way people generate and consume information on social media. For this reason, a number of recent efforts have sought to understand and characterize the roles users play in propagating information and to describe word-of-mouth patterns, which are the main focus of this chapter. In order to demonstrate these effects, we need to start from real data. While a wide variety of social media data is becoming available to researchers, we consistently refer to a set of research conducted on Twitter, for two reasons. First, Twitter provides an application programming interface (API) to gather publicly

    



available data that have been studied widely within the research community. The wide availability of data means that the findings of most studies that rely on Twitter data are reproducible. Second, as a matter of convenience, Twitter has been the main medium that the authors of this chapter have studied, either jointly or independently. (For convenience, the pronoun "we" is used to refer to the authors of research to which any of us contributed.) In particular, the research discussed is based on a near-complete snapshot of the Twitter network, which contains  billion follow links among  million users and their . billion tweets. We refer readers to the description of the data in Cha, Haddadi, Benevenuto, and Gummadi () for details. The network information was gathered in , and an anonymized form of the data is available at http://twitter.mpi-sws.org/. The first part of this chapter describes the relative roles different users play in information propagation. The topics covered include identifying influential users as well as other kinds of users who play special roles in information flow, such as trendsetters and micro-celebrities. Understanding why certain trends are adopted more widely than others is critical not only for designing better search systems that facilitate the spread of up-and-coming topics while curtailing the storm of spam (Benevenuto, Magno, Rodrigues, and Almeida, ), but also as a necessary step for viral marketing strategies that can affect stock markets and political campaigns. Such a study, however, has been difficult because it does not lend itself to readily available quantification; essential components such as human connections and information flow cannot be reproduced on a large scale within the confines of the lab. We begin by analyzing a few kinds of Twitter users and the roles they play in the flow of information.
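A snapshot of this kind is typically distributed as a follower edge list. As a minimal sketch (the one-pair-per-line "follower followed" text format here is an assumption for illustration, not the published dataset's actual schema), in- and outdegree can be tallied in a single streaming pass:

```python
from collections import Counter

def degree_counts(edges):
    """Tally in- and outdegree from "follower followed" edge records.

    Each record "a b" means user a follows user b, so b gains one
    follower (indegree) and a gains one account followed (outdegree).
    """
    indegree, outdegree = Counter(), Counter()
    for record in edges:
        follower, followed = record.split()
        outdegree[follower] += 1
        indegree[followed] += 1
    return indegree, outdegree
```

Because the function accepts any iterable of records, a multi-gigabyte edge file can be processed line by line (e.g., `degree_counts(open("edges.txt"))`) without loading the graph into memory.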
In particular, we investigate the relative roles of three types of information spreaders: mass media sources such as the BBC, grassroots or ordinary users, and evangelists or opinion leaders (Cha, Benevenuto, Haddadi, and Gummadi, ). Another measure of user influence is based on network and activity metrics: indegree (i.e., the number of people who follow a user), retweets (i.e., the number of times others "forward" a user's tweet), and mentions (i.e., the number of times others mention a user's name). Based on these three definitions, we make key observations about influence, for example that popular users with high indegree are not necessarily influential in terms of spawning retweets or mentions (Cha, Haddadi, Benevenuto, and Gummadi, ). We also look into an approach to finding trendsetters, users who adopt and spread new ideas, influencing other people before these ideas become popular (Saez-Trumper, Comarela, Almeida, Baeza-Yates, and Benevenuto, ). A trendsetter is not necessarily popular, yet that person's ideas spread over the network successfully. We describe a novel ranking strategy to identify such trendsetters, which combines temporal attributes of the nodes and edges of a network with a PageRank-based algorithm. We show that nodes with a high number of followers tend to arrive late to new trends, while users at the top of our ranking tend to be early adopters who also influence their social contacts to adopt the new trend. The second part of this chapter characterizes word-of-mouth propagation patterns in two different scenarios. The first case is the spread of short web links (URLs), clean



. , . , . ,  . 

pieces of information that can be tracked (Rodrigues, Benevenuto, Cha, Gummadi, and Almeida, ). We discuss a methodology for identifying the most frequently shared web links on Twitter and describe their propagation shapes in terms of the depth and height of propagation trees. The second case is the adoption of social conventions, which we examine through the use of retweet signals, a convention indicating that one is forwarding a message written by some other person (Kooti, Yang, Cha, Gummadi, and Mason, ). The way Twitter users indicated retweets evolved organically; numerous variations coexisted during Twitter's initial phase until a large majority of users settled on "RT" as a signal, which was later adopted by Twitter itself as a button. Through this example, we discuss the adoption process of social conventions. Social media systems are more than merely a platform to propagate information and help individuals stay in touch with their peers. The fact that most communications occur in a public space presents an opportunity for mining those communications, engineering user behaviors, and providing novel systems and applications. We dedicate the final part of this chapter to describing several applications that can be built on top of social media. In doing so, we describe efforts to identify topical experts—users who are authoritative sources of information on particular topics—and to extract topical news stories from the content posted by these experts (Ghosh, Sharma, Benevenuto, Ganguly, and Gummadi, ). In addition, this chapter discusses the ways users consume news through social media; in particular, we discuss the balance of topics in the news users receive from Twitter (Kulshrestha, Zafar, Noboa, Gummadi, and Ghosh, ). The series of studies discussed in this chapter provides readers with a holistic view of large-scale propagation phenomena in social networks.
While the studies we have chosen all rely on Twitter as the data source, the techniques and findings from the three main topics of this chapter (i.e., user influence, propagation patterns, and propagation applications) apply broadly to other, similar social media.
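The cascade-shape measurements mentioned above (the depth of a propagation tree, i.e., how many hops an item travels from its first poster) can be sketched from parent pointers; the input format here, where every share records the user it was received from and roots map to None, is an assumption for illustration:

```python
def tree_depths(parent):
    """Depth of every node in a propagation forest.

    `parent` maps each node to the node it received the item from;
    roots (original posters) map to None. A root has depth 0, a
    direct reshare depth 1, and so on. Every node must appear as a key.
    """
    depths = {}

    def depth(node):
        if node not in depths:
            p = parent[node]
            depths[node] = 0 if p is None else depth(p) + 1
        return depths[node]

    for node in parent:
        depth(node)
    return depths
```

The maximum value over all nodes of one tree gives that cascade's depth; tallying those maxima over many URLs yields the depth distribution the chapter discusses.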

. U I

Understanding how critical pieces of information, such as whom to vote for and what to buy, propagate within a society, and through which users, has been a key focus of communication researchers. Users hold vastly different degrees of influence in a network. These influence relationships determine how information propagates and why certain trends are adopted more widely than others. Identifying these differences is useful not only for designing better search systems that facilitate the spread of up-and-coming topics while curtailing the storm of spam, but also as a necessary step for various viral marketing strategies (e.g., the launch of a movie or a political campaign). The degree of influence of users participating in a conversation has also been used to predict whether a certain piece of information is a rumor (Kwon, Cha, Jung, Chen, and Wang, ).

    



Seminal research in this area was begun by Paul Lazarsfeld in the s (Lazarsfeld, Berelson, and Gaudet, ); his "two-step flow of communication" framework posits that a small group of primary influential users plays an essential role in determining which information is delivered and promoted in a network. A number of qualitative and quantitative studies followed, seeking to identify this small set of influentials and to test whether the theory still applies in modern information networks. For instance, researchers at Cornell and Yahoo! Research tracked the flow of information on Twitter, focusing on "who says what to whom" (Wu, Hofman, Mason, and Watts, ). The research we describe here belongs to this category. We introduce three relevant pieces of research. The first paper compares different definitions of user influence. The second paper explores one of the popular influence definitions and presents a deep dive into user studies. The third paper describes an algorithmic way to find a specific type of influential user, the trendsetters, who are of interest to marketers.
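The trendsetter-ranking idea, a PageRank variant that rewards early adopters whose contacts adopt soon after them, can be illustrated with a simplified sketch. The exponential time-decay weighting, the parameter names, and the credit-flow direction below are assumptions chosen for clarity, not the published algorithm's exact formulation:

```python
import math

def trendsetter_rank(adoption_time, followers, decay=1.0,
                     damping=0.85, iters=50):
    """Toy time-weighted PageRank over a topic's adopters.

    adoption_time: node -> time at which the node adopted the topic
    followers: node -> list of that node's followers
    Credit flows from each follower to the followees they may have
    copied, weighted by exp(-decay * lag) where lag is how long after
    the followee the follower adopted. Earlier adopters with quickly
    imitating followers accumulate more rank.
    """
    nodes = list(adoption_time)
    # weight[f][u]: credit follower f sends to followee u
    weight = {n: {} for n in nodes}
    for u in nodes:
        for f in followers.get(u, []):
            lag = adoption_time[f] - adoption_time[u]
            if lag >= 0:  # follower adopted after (or with) the followee
                weight[f][u] = math.exp(-decay * lag)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for f in nodes:
            total = sum(weight[f].values())
            if total == 0:
                continue  # dangling node: keeps only the base share
            for u, w in weight[f].items():
                new[u] += damping * rank[f] * w / total
        rank = new
    return rank
```

On a chain where a adopts first, a's follower b adopts next, and b's follower c adopts last, the ranking places a above b above c, even though all three have at most one follower: earliness, not audience size, drives the score.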

2.1 The Million Follower Fallacy

Figure. Degree distribution of Twitter users (indegree and outdegree curves; number of users with degree ≥ x plotted against user degree). Source: Adapted from Cha, Benevenuto, Haddadi, and Gummadi ().

On social media, social links are distributed disproportionately toward a small set of users. On Twitter, the vast majority of users have fewer than one hundred followers (Dunbar, ), whereas a small fraction of users have more than a million followers. The number of followers, or the "degree" of a user in a network, is the simplest measure of user influence one may consider. Degree can in fact represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence can potentially indicate a user's influence on others, suggesting that the more followers one has, the more influence one exerts. Figure . shows the fraction of users in the network with given in- and outdegrees, where a node's outdegree refers to the number of users whose tweets the node follows



. , . , . ,  . 

and a node's indegree refers to the number of users following the node (Cha, Benevenuto, Haddadi, and Gummadi, ). The two distributions are similar, except for two anomalous drops in the outdegree distribution around  and ,. The first glitch is due to the "suggested users" feature on Twitter, whereby all users are presented with a list of twenty popular users to follow upon registration. Unless a user specifies not to follow them, those suggested users are automatically added to the user's outdegree list. The second glitch occurs because Twitter previously limited the total number of individuals a user could follow. The distributions for both in- and outdegree are heavy tailed. A majority of users have small degree, but a few users have a large number of neighbors; % of users have no more than one hundred neighbors. Such a skewed degree distribution indicates that the network contains nodes that connect to a large number of other nodes. A question then naturally arises: Do users who have a million or more followers exert a high level of influence in the network? Are these most well-connected users equivalent to the small set of opinion leaders in Lazarsfeld's two-step flow theory? On Twitter, we can define and quantitatively measure three types of influence (Cha, Haddadi, Benevenuto, and Gummadi, ):

• Indegree influence: the number of followers of a user; directly indicates the size of the audience for that user
• Retweet influence: the number of retweets containing one's name; indicates the ability of that user to generate content with pass-along value
• Mention influence: the number of mentions containing one's name; indicates the ability of that user to engage others in a conversation

The most followed users (i.e., users with high indegree) span a wide variety of public figures and news sources.
They include news sources (CNN, The New York Times); politicians (Barack Obama); athletes (Shaquille O'Neal); and celebrities such as actors, writers, musicians, and models (Ashton Kutcher, Britney Spears). As this list suggests, the indegree measure is useful when we want to identify users who get lots of attention from their audience through one-on-one interactions; that is, the audience is directly connected to the influentials.

The most retweeted users were content aggregation services (Mashable, TwitterTips, TweetMeme), businessmen (Guy Kawasaki), and news sites (The New York Times, The Onion). They are trackers of trending topics and knowledgeable people in different fields, whom other users decide to retweet. Unlike indegree, retweets represent the influence of a user beyond her one-on-one interaction domain; popular tweets can propagate multiple hops from the source before being retweeted throughout the network. Furthermore, because of the tight connections between users suggested by triadic closure (Granovetter, ), retweeting in a social network can serve as a powerful tool for reinforcing a message; for instance, the probability of adopting an innovation increases when not one user but a group of users repeats the same message (Watts and Dodds, ).




Figure . Venn diagram of the top  influentials across the three measures (indegree, retweets, and mentions). Source: Adapted from Cha, Haddadi, Benevenuto, and Gummadi ().

The most mentioned users were mostly celebrities. Ordinary users showed a great passion for celebrities, regularly posting messages to them or mentioning them, without necessarily retweeting their posts. This indicates that celebrities are often at the center of public attention and that celebrity gossip is a popular activity among Twitter users.

An interesting trend emerges when we look at the overlap across the three measures, shown in Figure ., which depicts the relationship among the three measures for the top  lists. The top influentials in all three cases were generally recognizable public figures and websites. Interestingly, the three top lists showed only marginal overlap: the top  lists had just two users in common, Ashton Kutcher and Puff Daddy, and the top  lists likewise overlapped only marginally, indicating that the three measures capture different types of influence. If retweets represent a citation of another user's content, mentions represent a public response to another user's tweet.

Having analyzed the influence of Twitter users with three measures that capture different perspectives (indegree, retweets, and mentions), we conclude that indegree represents a user's popularity but is not related to other important notions of influence, namely engaging an audience through retweets and mentions. Retweets are driven by the content value of a tweet, while mentions are driven by the name value of the user. These subtle differences lead to dissimilar groups of top Twitter users: users with high indegree do not necessarily spawn many retweets or mentions. In other words, indegree alone reveals very little about a user's influence, a finding that provides new insights for viral marketing.

This concept has been coined the "million follower fallacy" by Avnit (), who pointed to anecdotal evidence that some users follow others simply out of etiquette (i.e., it is polite to follow someone who is following you) and that many do not read all the tweets broadcast to them. We have empirically demonstrated that having a million followers does not in itself mean much in the Twitter world; it is more influential to have an active audience that retweets or mentions the user.
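The marginal overlap between top influentials under the different measures is easy to check computationally. The sketch below uses toy scores for hypothetical users (all names and values are invented for illustration): it extracts the top-k users under each measure and intersects the resulting sets, the same comparison that underlies the Venn diagram analysis.

```python
def top_k(scores, k):
    """Return the set of user IDs with the k highest scores."""
    return {u for u, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]}

# Toy influence scores for six hypothetical users; high indegree does not
# imply high retweet or mention counts.
indegree = {"a": 900, "b": 800, "c": 700, "d": 50, "e": 40, "f": 30}
retweets = {"a": 10, "b": 5, "c": 2, "d": 500, "e": 400, "f": 1}
mentions = {"a": 8, "b": 300, "c": 1, "d": 2, "e": 350, "f": 4}

k = 3
tops = {name: top_k(s, k) for name, s in
        [("indegree", indegree), ("retweets", retweets), ("mentions", mentions)]}

# Users appearing in all three top-k lists (the centre of the Venn diagram).
common = tops["indegree"] & tops["retweets"] & tops["mentions"]
print(common)  # only user "a" ranks highly under every measure
```

With real Twitter data the same intersection is small relative to k, which is exactly the marginal overlap reported above.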



. , . , . ,  . 

2.2 User Types

To quantitatively measure the role of users in spreading information, one may examine how effective users are as information spreaders and measure the size of the audience each user could reach in the network. The audience here is the distinct number of users who either posted or received one or more tweets about a specific event. Based on this assumption, we developed a computational framework that checks, for any given topic, how necessary and sufficient each user group is in reaching a wide audience (Cha, Benevenuto, Haddadi, and Gummadi, ).

By analyzing the structure of the network and the distribution of links, we found a broad division that yields three distinct user groups based on indegree: the extremely well-connected users with more than , followers, the least connected masses with no more than  followers, and the remaining well-connected small group of users. Our division of users is based on the definition of different user roles in the theory of information flow (Katz and Lazarsfeld, ): mass media, who can reach a large audience but do not follow others actively; grassroots, who are not followed by a large number of users but have a huge presence in the network; and evangelists, who are socially connected and actively take part in information flow, like opinion leaders. Evangelists are also called influentials, opinion leaders, hubs, or connectors; some researchers even subdivide this group into media elite, cultural elite, and experts.

We picked several major events that occurred in  and spread widely on Twitter, spanning political, health, and social topics. Example events include the Iranian election and the death of singer Michael Jackson, which are described in detail by Cha, Benevenuto, Haddadi, and Gummadi ().
To extract tweets relevant to major events, we first identified a set of keywords describing each topic by consulting news websites and informed individuals, and then searched for those keywords in the tweet data set. We focused on a period of sixty days starting one day prior to a key date, which corresponds either to the date when the event occurred or to the date when it was widely reported in the traditional mass media (TV and newspapers). We limited the duration because spammers typically hijacked popular keywords after a certain point.

Figure . displays the necessary and sufficient conditions for reaching the audience for the Iranian election event. The line marked "Sufficient" indicates the cumulative size of the audience that the top spreaders alone can reach (i.e., sufficiency). The line marked "Necessary" indicates the size of the audience that can still be reached after the top-k spreaders are removed, where k is varied from  to the total number of spreaders (i.e., necessity). In case two or more users had the same indegree, we broke the tie based on the numeric user IDs so that the rank of every user is different. The x-axis represents the rank of the spreader based on indegree, from the most followed (on the left-hand side) to the least followed (on the right-hand side).
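The sufficiency and necessity tests can be sketched in a few lines. Assuming each spreader's direct audience is known as a set of user IDs (the toy data below is invented), the sufficiency curve accumulates the union of the audiences of the top-k spreaders, while the necessity curve measures what remains reachable after those spreaders are removed.

```python
def audience_curves(ranked_spreaders, audience):
    """For spreaders ranked by indegree (most followed first), return two
    lists: sufficiency[k] = audience reachable using only the top-k
    spreaders; necessity[k] = audience still reachable after removing them."""
    n = len(ranked_spreaders)
    total = set().union(*(audience[s] for s in ranked_spreaders))
    sufficiency, necessity = [], []
    reached = set()
    for k in range(n + 1):
        sufficiency.append(len(reached))
        necessity.append(len(total - reached))
        if k < n:
            reached |= audience[ranked_spreaders[k]]
    return sufficiency, necessity

# Toy event: one "mass media" account (m) reaches most users directly;
# an evangelist (e) and a grassroots user (g) add smaller audiences.
audience = {"m": {1, 2, 3, 4, 5, 6}, "e": {5, 6, 7, 8}, "g": {8, 9}}
suf, nec = audience_curves(["m", "e", "g"], audience)
print(suf)  # cumulative reach of the top-k spreaders
print(nec)  # reach remaining after removing the top-k spreaders
```

Even in this toy example the pattern of the figure appears: removing the single mass media account (the first step of the necessity curve) loses most of the audience.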

Figure . Test of sufficiency and necessity conditions in reaching an audience of a given size for international headlines for the Iran election  event. The x-axis gives the indegree rank of spreaders; the y-axis gives audience size; vertical lines separate the mass media (M), evangelist (E), and grassroots (G) regions. Source: Adapted from Cha, Benevenuto, Haddadi, and Gummadi ().

The two vertical lines in the figure mark the boundaries between the mass media (M), evangelist (E), and grassroots (G) regions. In all cases, we observe that the mass media presence is sufficient to reach a significant majority, though not the entire audience. Apart from the Moldova case, in which the mass media did not play a key role, more than % of users can usually be reached by the mass media. Focusing on the necessity line, we see that the size of the reachable audience decreases rapidly as top spreaders are removed: without the mass media, we lose the majority of the audience in all cases. Thus, the mass media are not only sufficient to reach a significant fraction of the audience but also necessary to reach an audience of that size. Due to their high indegree, the mass media are able to directly cover a large fraction of the audience, even while posting the fewest tweets.

Yet evangelists extend the reach of the mass media considerably. The test of the necessary condition shows that when evangelists are removed from the network, only a small fraction of the audience can be reached by using grassroots alone. Grassroots, on the other hand, reach only an insignificant fraction of the audience across the different news events, despite accounting for nearly all Twitter users (.%) and a significant fraction of all tweets.

Overall, we found that Twitter provides a common playing field for all three voices: the mass media, evangelists, and grassroots. Mass media play a dominant role in the network. They excel at all aspects of news spreading: they have many followers, their links are well reciprocated, and they have topological advantages for collecting the diverse opinions of other users. Their tweets also reach a large portion of the audience directly,



. , . , . ,  . 

without the involvement of other influential users. On the other hand, the mass media on Twitter, unlike traditional media networks, are not necessarily the first to report events. In some cases it is in fact the small, less-connected grassroots or the evangelists that trigger the spreading of news or gossip, even without the mass media's coverage of the topic. Evangelists played a leading role in the spread of news, both in the number of messages they contributed and in bridging grassroots users who are otherwise not connected.

2.3 Trendsetters

As discussed previously, influential people play an important role in the process of information diffusion, and the number of followers alone might not capture this notion of influence properly. More important, there are several ways to be influential, for example, being the most popular or being the first to adopt a new idea. Among the influentials are trendsetters, who adopt and spread new ideas before they become popular (Saez-Trumper, Comarela, Almeida, Baeza-Yates, and Benevenuto, ). Trendsetters are not necessarily well-known news outlets, celebrities, or politicians, but are the ones whose ideas spread widely and successfully through word of mouth. To be an innovator, a person needs to be one of the first to pick up a new or nascent trend, which may then be adopted by other members of a social or information network. On the other hand, not all early adopters are trendsetters, because only a few of them have the ability to propagate their ideas to their social contacts through word of mouth.

In identifying trendsetters, two important aspects need to be considered. The first is the area or topic of the innovator, as people have different levels of expertise on various subjects. For example, marketing services actively search for potentially influential people in a specific domain, such as "cool" teenagers, local leaders, and popular public figures, to promote certain products or services. Thus, it is important to specify the topics and themes that define the context in which trendsetters will be identified. Second, it is important to consider the time information associated with the posting of innovative ideas. Traditional ranking algorithms on social networks, such as the standard PageRank algorithm, do not consider the time at which ideas become popular; they consider only aggregate usage statistics and a static network topology.

As an example, in Figure ., tX = n denotes that node X adopted trend h at time n. Thus, node G was the first to adopt h, while node E was the last to adopt it. Note that although node G is an innovator, its information was passed to H but not to the rest of the network; node G therefore cannot be considered a trendsetter. On the other hand, if we compute the standard PageRank algorithm on this graph and ignore the time when trend h was adopted, node C would be ranked at the top, although it has incoming links from nodes A and E and simply spread the trend to a larger audience. However, if we pay attention to time, we


Figure . Illustrative example of the importance of timing (node adoption times: tA = 3, tB = 5, tC = 4, tD = 6, tE = 8, tF = 7, tG = 1, tH = 2). Note: Without considering time information, nodes  and  are symmetrical, regardless of whether node  adopted the trend first. The edges represent social connections between nodes, and the arrows point opposite to the information flow. Source: Adapted from Saez-Trumper, Comarela, Almeida, Baeza-Yates, and Benevenuto ().

will see that C adopted the trend before E, and therefore C cannot have received the information from E. We can also observe that nodes A and E have the same rank according to PageRank, even though A adopted the trend before E. In this example, the top trendsetter is node A, because it was the first to adopt the trend and was followed, directly or indirectly, by many other participants in the network, such as nodes C, D, B, and F.

We proposed a novel approach to identify trendsetters, introducing time information into the social graph to identify the people who spark the process of disseminating ideas that become popular in the network. The main idea consists of defining a topic-sensitive, weighted innovation graph that captures who adopted a certain topic in a way that triggered the attention of others in the network, and then applying a modified PageRank algorithm to this graph to find trendsetters.

Figure . depicts one of the main results of this work. It shows the percentage of users with high indegree (i.e., number of followers), high PageRank, and high trendsetter rank who adopted a trend before its peak of adoption, that is, before the time slot in which the trend gained its largest number of new adopters. A large fraction of trendsetters adopt a trend before its peak is reached, compared to the top-ranked users based on indegree and PageRank. In categories such as music, celebrity, and idioms, most of the top users by PageRank and indegree start talking about these topics only after the peak. In contrast, in six of the nine categories, more than % of the highest-ranked users according to our approach adopted the trend before the peak. This work was motivated by the observation that users with high indegree often are not the first to adopt new ideas, but follow what is



. , . , . ,  .  100 90

% of Top 100 Users before the peak

80 70 60 50 40 30 20 10 0 CELEBRITY MUSIC

IDIOMS

GAMES POLITICAL NONE Category

InDegree

PageRank

MOVIES TECHNO. SPORTS Trendsetters

 . Percent of the top  users of each ranking that adopted the trend before the peak. Source: Adapted from Saez-Trumper, Comarela, Almeida, Baeza-Yates, and Benevenuto ().

already popular. The performance of the indegree rank in this experiment tends to confirm this observation. The proposed algorithm for identifying trendsetters can measure both direct and indirect influence in a network, and it uses early adoption as a key feature of being influential. This characteristic is useful for differentiating trendsetters from nodes that, despite having a large indegree, adopt trends only after they become popular.
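The core idea of a time-aware ranking can be illustrated with a minimal sketch: keep only time-respecting edges (each user credits the contacts it follows that adopted the trend strictly earlier) and run PageRank on the result, so that rank flows toward early adopters whose adoption was picked up by others. This is a simplification of the algorithm of Saez-Trumper et al., which additionally weights edges; all names and times below are invented.

```python
def trendsetter_rank(follows, t, d=0.85, iters=50):
    """Toy time-aware PageRank. `follows` maps user -> users they follow;
    `t` maps user -> adoption time of the trend. Only time-respecting edges
    (toward strictly earlier adopters) carry rank."""
    edges = {u: [v for v in follows.get(u, []) if t[v] < t[u]] for u in t}
    n = len(t)
    rank = {u: 1.0 / n for u in t}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in t}
        for u, outs in edges.items():
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # dangling node: spread its mass uniformly
                for v in t:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

# Toy network: C and D follow A; E follows C; A adopted the trend first.
follows = {"C": ["A"], "D": ["A"], "E": ["C"], "A": []}
t = {"A": 1, "C": 2, "D": 3, "E": 4}
rank = trendsetter_rank(follows, t)
best = max(rank, key=rank.get)
print(best)  # A, the early adopter whose adoption propagated onward
```

Because edges that point from earlier to later adopters are discarded, a user like node C in the chapter's example cannot collect rank from followers who adopted before it did.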

. P P

.................................................................................................................................. Traditionally, users have discovered information on the Web by browsing or searching. Nowadays, tens of millions of web links are shared and propagated every day in a word-of-mouth process, through which people get to know about new content from friends and conversations with other users. While such word-of-mouth-based content discovery have existed for a long time in the form of emails and web forums, online




social networks have made this phenomenon extremely popular and global in reach. Such word-of-mouth content discovery has become a major driver of traffic to many websites today. Most efforts to characterize patterns of propagation rely on propagation trees that describe how a piece of information is repeated by a set of users. The root of the tree represents the original poster or content creator; the remaining nodes correspond to users who spread the information by forwarding it to friends. Characterizing the shape of such propagation trees is important, as it allows an educated guess about how large a tree will grow as well as in which direction it will grow in the future.

3.1 Word of Mouth

To better understand this popular phenomenon, we present a detailed investigation of the word-of-mouth exchange of URLs among Twitter users (Rodrigues, Benevenuto, Cha, Gummadi, and Almeida, ). Our methodology consisted of building an information propagation tree for every web link (URL) shared on Twitter during a random week in . Retweets were not part of the Twitter API at that time; hence, we considered both explicit information flows (e.g., retweets) and implicit ones (e.g., when a user shares a URL that has already been posted by one of the contacts she follows, without citing the original tweet).

Formally, we built information propagation paths based on Krackhardt's () hierarchical tree model. A hierarchical tree is a directed graph in which all nodes are connected and all but one node, namely the root, have an indegree of one. This means that every node in the graph except the root has a single parent. Hence, an edge from node A to node B is added to the tree only when B is not already part of the tree, and such an edge means that a piece of information was passed from A to B. While each hierarchical tree has a single root, there may be multiple users who independently share the same URL; in this case, the propagation pattern of a single URL will contain multiple trees and form a forest.

We call the users at the roots of hierarchical trees initiators; these users independently shared URLs. We call all other nodes that participated in URL propagation spreaders. Initiators and spreaders make up the hierarchical tree. We call users who simply received a URL but did not forward it to others receivers. Figure . depicts this relationship.

Figure . Terminology of cascades: A (an initiator) shares URL-A; B (a spreader) retweets URL-A; C (a receiver) takes no further action. Source: Adapted from Rodrigues, Benevenuto, Cha, Gummadi, and Almeida ().

When we later refer to the hierarchical tree structure, we do



. , . , . ,  . 

not include these users. For convenience, we collectively call all three types of users who potentially read the URL its "audience."

Word of mouth on Twitter yields propagation trees of a particular shape. Popular URLs spread through not one but multiple disjoint propagation trees, each involving a large number of nodes; there need not be a single root of propagation in social media. Multiple disconnected propagation trees are especially common for less popular content, because the domains whose URLs spread widely by word of mouth tend to differ from the domains that are popular on the general Web, where content is found primarily through browsing or searching.

One way to characterize the shape of a tree is by width and height, where the height of a tree is the maximum hop count from the root to any leaf node and the width is the maximum number of nodes located at any given height level. For instance, a two-node cascade graph has a height of . The propagation trees that we examined were, interestingly, wider than they were deep. Figure . shows their distributions. In fact, the maximum observed width of any propagation tree was ,, while the maximum observed height was , a difference of two orders of magnitude. This finding is in sharp contrast to the narrow and deep trees found in Internet chain letters (Liben-Nowell and Kleinberg, ), which showed a narrow shape (width = ) that went several hundred levels deep (height = ) for a large cascade involving , spreaders. Examining these propagation shapes, we described how URLs could reach a wide audience of several tens of millions of users through word of mouth.
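The tree construction and the width/height measurement can be sketched as follows, under the simplifying assumption that a sharer's parent is the first previously seen followee who shared the URL (the study handles explicit retweets and ties more carefully; the toy data is invented).

```python
from collections import defaultdict

def build_forest(shares, follows):
    """Build Krackhardt-style propagation trees for one URL.
    `shares` is a list of (time, user) events; `follows` maps user -> set of
    users they follow. A user's parent is a followee who already shared the
    URL; users with no such followee start a new tree (initiators)."""
    parent, shared = {}, []
    for _, u in sorted(shares):
        src = next((v for v in shared if v in follows.get(u, set())), None)
        parent[u] = src          # None marks an initiator (tree root)
        shared.append(u)
    return parent

def tree_shape(parent):
    """Return (height, width): maximum depth and maximum nodes at any depth."""
    depth = {}
    def d(u):
        if u not in depth:
            depth[u] = 0 if parent[u] is None else d(parent[u]) + 1
        return depth[u]
    levels = defaultdict(int)
    for u in parent:
        levels[d(u)] += 1
    return max(depth.values()), max(levels.values())

# Toy cascade: A posts first; B, C, and D follow A and repost; E follows B.
shares = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E")]
follows = {"B": {"A"}, "C": {"A"}, "D": {"A"}, "E": {"B"}}
parent = build_forest(shares, follows)
height, width = tree_shape(parent)
print(height, width)  # a wide, shallow tree: height 2, width 3
```

Applied to real share logs, this procedure yields one tree per initiator, and aggregating (height, width) pairs over all URLs reproduces the kind of shape distribution discussed here.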
Most word-of-mouth events, however, involved a single user who shared URLs and reached

Figure . Height and width distributions of the largest cascade subtrees (number of URLs whose largest subtree exceeds a given height or width, on log-log axes). Source: Adapted from Rodrigues, Benevenuto, Cha, Gummadi, and Almeida ().




only a small audience. These findings confirm the observation that global cascades are rare (Wu, Hofman, Mason, and Watts, ) but by definition extremely large when they do occur. The size distribution of word-of-mouth cascades was best fit by the power-law distribution y = cx^a. The exponents a of .–. observed for cascade sizes are smaller than the exponents of .–. seen in the indegree and outdegree distributions of the network topology. This difference may be due to the collaborative act of sharing on Twitter. If each user were to share a URL without the help of any other user, the size distribution of cascades would match the indegree distribution. When users collaboratively spread the same URL, however, the gap between the most popular and the least popular user (in terms of indegree) becomes less important to the success of cascades, yielding a smaller exponent.

A possible explanation for the discrepancy between the typical Internet chain-letter shape and that of social media (i.e., much wider than deep) might be the difference in the way the two systems work. Twitter does not allow its users to restrict the recipients of tweets; tweets are broadcast to all of a user's followers. Emails, on the other hand, can be forwarded to a selective set of users, restricting propagation to only a fraction of one's friends, which creates narrower and deeper cascades.
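Power-law exponents such as these are usually estimated by maximum likelihood rather than by fitting a line on log-log axes. The sketch below uses the continuous-approximation estimator popularized by Clauset, Shalizi, and Newman, with synthetic data standing in for real cascade sizes (the true exponent and sample size are invented for the demonstration).

```python
import math
import random

def powerlaw_alpha(samples, xmin=1.0):
    """Maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= xmin
    (continuous approximation; the discrete case and the choice of xmin
    require more care)."""
    xs = [x for x in samples if x >= xmin]
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)

# Generate synthetic cascade sizes from a power law with alpha = 2.5
# via inverse-transform sampling, then recover the exponent.
random.seed(0)
alpha_true = 2.5
sizes = [(1 - random.random()) ** (-1 / (alpha_true - 1)) for _ in range(50000)]
alpha_hat = powerlaw_alpha(sizes)
print(round(alpha_hat, 2))
```

With fifty thousand samples the estimate lands very close to the true exponent; on empirical cascade data the same estimator distinguishes the smaller cascade-size exponents from the larger degree-distribution exponents discussed above.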

3.2 Social Conventions

The way in which social conventions emerge in communities has been of interest to social scientists for decades. Here we report on the emergence of a particular social convention on Twitter: the way to indicate that a tweet is being reposted and to attribute the content to its source (Kooti, Yang, Cha, Gummadi, and Mason, ). Initially, different variations were invented and spread through the Twitter network; later, Twitter incorporated one of these conventions into its API, which became crucial for a number of efforts studying information propagation on Twitter.

Every variation that arises from the interaction of individuals is at some point invented, and it is possible for the same variation to be invented multiple times independently, a phenomenon that could be called "convergent evolution." The first variation ever used to indicate that a tweet came from another user was via, followed by the original poster ("@kosso"), as shown in Figure .. This variation is sensible, as it is immediately understandable to most English speakers. The very first use was in March , only twelve months after the launch of Twitter and only four months after the first "@username" reference appeared on Twitter. This use, and the many subsequent uses of this and other variations, establishes that there is a need on Twitter to indicate that a message is passed on from another source and to attribute the message to that source. The inventors and early adopters were well-connected, active, core members of the Twitter community. The diffusion networks of these conventions were dense and highly clustered, so no single user was critical to the adoption of the conventions.



. , . , . ,  .  HT Oct’07

Via Mar’07

Retweeting Jan’08

Retweet Nov’07

RT Jan’08

R/T Jun’08

Sep’08

 . Timeline of the various Twitter conventions. Source: Adapted from Kooti, Yang, Cha, Gummadi, and Mason ().

Figure . Profile descriptions of random users versus early adopters of Twitter conventions. Source: Adapted from Kooti, Yang, Cha, Gummadi, and Mason ().

Despite being invented at different times and having different adoption rates, only two variations (RT and via) came to be widely adopted. Focusing on a larger set of early adopters of each variation, defined as its first one thousand adopters, we can investigate their characteristics and connectivity to understand the early stages of the emergence of variations. Early adopters differed from typical Twitter users in the content of their biographical information: users in a random sample, which represents the general Twitter population, describe themselves using words such as love, life, live, and music, whereas early adopters introduce themselves with words such as media, developer, geek, web, and entrepreneur, as shown in Figure .. Another important observation from this effort is that it confirms existing information theories holding that people are more prone to adopt a trend or propagate information if they receive it from multiple sources.

Figure . shows the time series of week-to-week user gains over the first . years of Twitter's existence. The different variations clearly experienced very different patterns of growth. By mid-, only two variations, RT and via, had achieved widespread usage. The recycle icon, HT, and R/T continued to add new users, but their popularity had nearly stabilized. Retweet and retweeting began losing popularity as the rate of new adopters declined, potentially because of their length, which is costly given the -character limit.
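Measuring adoption curves like these reduces to finding each user's first use of each variation and counting first uses per week. A minimal sketch over an invented event log:

```python
from collections import defaultdict

def weekly_new_adopters(events):
    """`events` is a list of (week, user, variation) tuples recording uses of
    a retweet convention (e.g., "RT", "via"). Returns, per variation, a dict
    week -> number of users whose first use fell in that week."""
    first_use = {}  # (variation, user) -> earliest week seen
    for week, user, var in events:
        key = (var, user)
        if key not in first_use or week < first_use[key]:
            first_use[key] = week
    counts = defaultdict(lambda: defaultdict(int))
    for (var, _), week in first_use.items():
        counts[var][week] += 1
    return counts

# Toy log: u1 uses "via" twice (counted once); "RT" gains two adopters in week 3.
events = [(1, "u1", "via"), (2, "u1", "via"), (2, "u2", "via"),
          (3, "u2", "RT"), (3, "u3", "RT")]
counts = weekly_new_adopters(events)
print(dict(counts["via"]), dict(counts["RT"]))
```

Only a user's first use counts, so repeated uses of a convention do not inflate the adoption curve.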

Figure . Number of new adopters of each variation per week over time (log scale), for RT, HT, via, the recycle icon, Retweet, R/T, and Retweeting. Source: Adapted from Kooti, Mason, Gummadi, and Cha ().

Interestingly, the final reach of the variations does not seem to be strongly related to either the amount of time a variation had to grow or the rate at which it grew. As can be seen in Figure ., via started the earliest and grew slowly relative to the other variations, yet ended with the second-highest number of adopters; Retweet and retweeting grew as fast as or faster than RT, and started earlier, but never approached its reach. It is an interesting open question whether there are features of the variations, of their inventors, or of the early dynamics (or some combination thereof) that can predict which variations will come to dominate.

. P A

.................................................................................................................................. Prior studies have shown that a very large majority of the content that becomes popular on social media is posted by only a small fraction of users, which includes celebrities, topical experts, and so on ( Zafar, Bhattacharya, Ganguly, Ghosh, and Gummadi, ). Hence, identifying these popular users and studying the content posted by them are important steps in understanding information consumption and propagation on social media. We describe some recent efforts to identify topical experts on Twitter and to



. , . , . ,  . 

characterize the information produced and consumed by users on social media. We also discuss some novel applications that these studies have developed and deployed on Twitter, such as systems for identifying topical experts and for topical content search.

4.1 Topical Experts

Social media such as Twitter have millions of users with varying backgrounds and levels of expertise posting every day about topics that interest them. The democratization of content authoring has contributed tremendously to the success of these systems, but it also poses a big challenge: How can users tell who is who on a platform such as Twitter? Knowing the credentials of a Twitter user can crucially help others determine how much trust or importance to place in the content that user posts. The task of inferring topical attributes of Twitter users is not trivial, because most users do not describe their topics of interest or expertise in their profiles. Prior studies have attempted to infer users' topics of interest from the text of the tweets they post or receive; however, such methods have had limited success because most tweets are informally written and contain conversational content.

To approach this problem, we designed and evaluated a novel "who-is-who" service for inferring topical attributes that characterize individual Twitter users (Sharma, Ghosh, Benevenuto, Ganguly, and Gummadi, ). ("Who's Who" is the title of a number of reference publications that contain concise biographical information on a particular group of people.) The proposed methodology exploits the Twitter Lists feature, which allows a user to create a named group of other users who tend to tweet on a topic of interest to her and to follow their collective tweets. For instance, a user can create a list named "Singers and Musicians" and add popular singers and musicians such as Lady Gaga, Britney Spears, and Rihanna to it. The key insight is that the list metadata (i.e., names and descriptions) provide valuable semantic cues about the attributes of the users included in the lists, including their topics of expertise and how the public perceives them.
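A stripped-down version of this list-metadata idea can be sketched as follows: collect the names (and descriptions) of the Lists that include a user, tokenize them, and rank the most frequent non-trivial words as the user's topical attributes. The stop-word list and sample List names below are invented; the deployed system is considerably more careful about synonyms and noise.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "my", "list"}

def infer_attributes(list_metadata, top_n=5):
    """Infer topical attributes for a user from the names and descriptions of
    the crowdsourced Lists that include them, by ranking word frequencies."""
    words = Counter()
    for text in list_metadata:
        for w in re.findall(r"[a-z][a-z-]+", text.lower()):
            if w not in STOPWORDS:
                words[w] += 1
    return [w for w, _ in words.most_common(top_n)]

# Hypothetical List names that might include a news account.
lists = ["World News", "news and politics", "Journalists I read",
         "breaking-news", "Politics"]
attrs = infer_attributes(lists)
print(attrs)
```

Because the words come from many independent list creators, frequent terms reflect how the crowd perceives the user rather than how the user describes herself.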
This methodology infers a user's expertise by analyzing the metadata of the crowdsourced lists that contain the user, and it can accurately and comprehensively infer topical attributes for millions of popular Twitter users, including a vast majority of Twitter's most influential users. The method is also elegant in that it relies on the wisdom of the crowd. Table . shows the top topical attributes of some well-known Twitter users, as inferred by the list-based methodology. The who-is-who system is available at http://twitter-app.mpi-sws.org/who-is-who/.

In a similar vein, one may apply the knowledge from who-is-who to design a search and recommendation system for topical experts on Twitter. Such a system is useful because the quality of information posted on Twitter is highly variable, and finding users who are recognized sources of relevant and

    



Table 4.1 Top Topical Attributes for Some Well-Known Twitter Users User

Top topical attributes inferred by List-based methodology

Barack Obama Ashton Kutcher Lada Adamic Chuck Grassley BBC News Linux Foundation Yoga Journal

politics, celebs, government, famous, president, news, leaders, current events celebs, actors, famous, movies, stars, comedy, music, hollywood, pop culture academics, network-analysis, social-media, umsi, tech, icwsm, hci, thinkers politics, senator, congress, government, republicans, iowa, gop, conservative, health media, news, journalists, politics, english, newspapers, current, london linus, tech, open, software, libre, gnu, computer, developer, ubuntu, unix yoga, health, fitness, wellness, magazines, media, mind, meditation, body, inspiration

Source: Adapted from Sharma, Ghosh, Benevenuto, Ganguly, and Gummadi (2012).

Table 4.2 Examples of Topical Experts Identified by List-Based Methodology Topic

Experts

Music Politics Physics Neurology Medicine Environment

Katy Perry, Lady Gaga, Justin Timberlake, Pink, Justin Bieber, coldplay, Marshall Mathers Barack Obama, Al Gore, Bill Maher, NPR Politics, Sarah Palin, John McCain Institute of Physics, Physics World, Fermilab Today, CERN, astroparticle Neurology Today, AAN Public, Neurology Journal, Oliver Sacks, ArchNeurology NEJM, Harvard Health, Nature Medicine, Americal Medical News, FDA Drug Info GreenPeace USA, NYTimes Environment, TreeHugger.com, National Wildlife

Source: Adapted from Ghosh, Sharma, Benevenuto, Ganguly, and Gummadi (2012).

trustworthy information on specific topics (i.e., topical experts) is a key challenge. Given that we are able to associate who-is-who on Twitter, we can further use this information for discovering topical experts on Twitter and recommend which users to follow (Ghosh, Sharma, Benevenuto, Ganguly, and Gummadi, ). Our methodology relies on the intuition that users who have been included in several lists on a given topic are likely to be experts on the topic. Table . shows some sample experts identified by this methodology for a few specific topics. Note that this methodology can identify experts not only for popular topics of general interest such as music and politics, but also for more niche and specialized topics such as physics and neurology. We mined data from millions of Twitter Lists to build and deploy Cognos, a system for finding topical experts in Twitter. The system is available at http://twitter-app.mpisws.org/whom-to-follow/. Experimental evaluation showed that Cognos infers a user’s expertise more accurately and comprehensively than state-of-the-art systems that rely on the user’s profile or tweet content. In fact, despite relying on only a single feature (crowdsourced lists), Cognos was judged by human volunteers to yield results as good as, if not better than, those given by the official Twitter experts search engine for a wide range of queries.
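The expert-search intuition, that a user included in many topic-matching lists is likely an expert on that topic, reduces to a simple scoring rule. The following sketch is hypothetical (data format and function name are illustrative; the real Cognos system involves additional ranking signals):

```python
from collections import defaultdict

def rank_experts(lists, topic, top_k=3):
    """Score each user by the number of topic-matching Lists that include
    them; a toy version of the crowdsourced-list intuition behind Cognos."""
    score = defaultdict(int)
    for lst in lists:
        text = (lst["name"] + " " + lst.get("description", "")).lower()
        if topic.lower() in text:
            for member in lst["members"]:
                score[member] += 1
    # users appearing in the most matching lists come first
    return sorted(score, key=score.get, reverse=True)[:top_k]

lists = [
    {"name": "physics", "members": {"CERN", "PhysicsWorld"}},
    {"name": "particle physics news", "members": {"CERN", "Fermilab"}},
    {"name": "cooking", "members": {"jamieoliver"}},
]
print(rank_experts(lists, "physics"))  # CERN first: it appears in two lists
```

At scale, the same counting scheme over millions of Lists separates genuinely recognized experts from merely prolific posters.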



. , . , . ,  . 

Note that identifying topical experts is only the first step toward finding relevant and interesting information on a particular topic, because a substantial fraction of the content posted by a Twitter user (even a topical expert) may be day-to-day conversation rather than topical information. Hence, extracting relevant information on a specified topic remains a challenge even after the experts on that topic have been identified. We addressed this challenge and developed a methodology for extracting news stories on a given topic from the content posted by experts on that topic (Zafar, Bhattacharya, Ganguly, Ghosh, and Gummadi). Our methodology relies on the intuition that if multiple experts on a certain topic are discussing a news story on a particular day, then that story is very likely to be relevant and important to that topic. We developed a real-time topical news service, named What-is-happening, available at http://twitter-app.mpi-sws.org/what-is-happening/. The service periodically collects the tweets posted by the topical experts. For a given topic, it relies on the tweets posted by experts on that topic: the tweets are clustered into news stories based on the hashtags they contain, and the stories are ranked by the number of topical experts discussing them on the given day. The top news stories returned by this service have been judged to be as good as, if not better than, those returned by the official Twitter search service for a large number of topics.

There are several important advantages to relying on topical experts for topical information (Zafar, Bhattacharya, Ganguly, Ghosh, and Gummadi). Topical experts often post valuable information on their topics of expertise. In particular, information on niche topics such as physics, neurology, and forensics is generally difficult to locate on Twitter, and the content posted by experts on these topics is a valuable source of such specialized information. Further, content posted by topical experts, especially news stories posted by multiple experts, is mostly authoritative and popular (e.g., highly retweeted), as well as free from spam, misinformation, and so on. Additionally, our studies show that the content posted by the relatively few topical experts covers a very large fraction of all topical information posted on Twitter. For all these reasons, the content posted by topical experts is not only a valuable source of information for applications such as search/recommendation systems, but also important for studying the propagation of popular information on social networks.
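The clustering-and-ranking step described above (group experts’ tweets by hashtag, rank stories by the number of distinct experts discussing them) can be sketched compactly. This is an illustrative toy, not the What-is-happening implementation; the tweet format and function name are assumptions:

```python
from collections import defaultdict

def top_stories(expert_tweets, top_k=2):
    """Group experts' tweets into stories by hashtag, then rank stories by
    the number of distinct experts discussing each one."""
    experts_per_tag = defaultdict(set)
    for expert, text in expert_tweets:
        for token in text.split():
            if token.startswith("#"):          # hashtags act as story labels
                experts_per_tag[token.lower()].add(expert)
    ranked = sorted(experts_per_tag,
                    key=lambda tag: len(experts_per_tag[tag]), reverse=True)
    return ranked[:top_k]

tweets = [
    ("cern", "New results from the LHC #higgs"),
    ("fermilab", "Exciting #higgs update today"),
    ("physworld", "Feature on #quantum computing"),
]
print(top_stories(tweets))  # ['#higgs', '#quantum']
```

Counting *distinct* experts (a set, not a tally of tweets) is what makes the ranking robust: one very vocal account cannot push its own story to the top.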

4.2 Information Diet

With the widespread adoption of social media sites such as Twitter and Facebook, there has been a shift in the way information is produced and consumed in society. Previously, the only producers of information were traditional news organizations, which broadcast the same carefully edited information to all consumers over mass media channels. Now, with online social media, any user can be a producer of information, and every user selects which other users she connects to, thereby choosing




Figure . Produced information diets of some popular Twitter users (andersoncooper, carter_roberts, and jamieoliver): the topical distribution (% of information diet) of the tweets posted by each user across eighteen broad topics. Source: Adapted from Kulshrestha, Zafar, Noboa, Gummadi, and Ghosh.

the information she consumes. Moreover, the personalized recommendations that most social media sites provide also contribute to the information consumed by individual users. Thus, it has become important to characterize the information being produced and consumed by individual users on online social media.

To study the composition of information that is produced, consumed, or propagated on social media, we proposed the concept of the information diet, which quantifies the composition, or distribution, of a set of information items (e.g., a set of tweets). In this study, the information diet was defined as the topical composition of information over a set of eighteen broad topics, as shown in Figure . (Kulshrestha, Zafar, Noboa, Gummadi, and Ghosh). However, the concept can also be used to characterize the composition of information along other dimensions, such as political ideology.

We characterized the information diets produced and consumed by various types of users on Twitter. It was observed that popular users, including news organizations and topical experts, produce information diets that are very specific to one or two topics. For instance, Figure . shows the information diets posted by the Twitter accounts of three popular users: (1) andersoncooper, a noted journalist; (2) carter_roberts, the president and CEO of the World Wildlife Fund; and (3) jamieoliver, a celebrity chef. It is evident that these users post most of their content on their specific topics of expertise (politics, environment, and food, respectively). Since the information diets produced by popular users are very specialized, it is up to the users who consume this information to choose which sources to consume from. It was found that the consumption diets of most Twitter users are also focused on only one or two topics. In fact, more than % of the users were found to consume more than % of their information diet on a single topic.
In addition to the word-of-mouth consumption of information, there is another channel of information consumption on social media: personalized recommendations. Specifically, a Twitter user gets recommended content that is popular in her social network neighborhood (known as social recommendation). It was observed that while the word-of-mouth diets of most users (chosen by the users themselves by selecting whom to follow) are very focused, recommendations add some topical diversity, contributing information on topics beyond the users’ primary topics of interest.

We developed a service in which Twitter users can check their own consumption diets, as well as the production diets of any other Twitter user. This service, available at http://twitter-app.mpi-sws.org/whats-my-info-diet/, is intended to make people aware of the diets they consume and produce on online social media. Note that this service is, in a way, complementary to the topical search services described previously: if a user feels that her information diet needs to be supplemented with more information on certain topics, she can use services such as Cognos and What-is-happening to discover experts and authoritative information on those topics.
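An information diet is, at bottom, a normalized topic histogram. A minimal sketch, assuming each tweet has already been assigned one broad topic (the topic set shown is an illustrative subset of the eighteen topics, and the function name is ours):

```python
from collections import Counter

# Illustrative subset of the eighteen broad topics used in the study
TOPICS = {"politics", "environment", "food", "science"}

def information_diet(topic_labels):
    """Information diet: the topical composition (in %) of a set of tweets,
    given one (hypothetical) topic label per tweet."""
    counts = Counter(t for t in topic_labels if t in TOPICS)
    total = sum(counts.values())
    return {topic: 100.0 * n / total for topic, n in counts.items()}

diet = information_diet(["politics", "politics", "environment", "politics"])
print(diet)  # {'politics': 75.0, 'environment': 25.0}
```

Comparing a user’s production diet with her consumption diet in this form makes topical imbalance immediately visible: a diet with one topic above, say, 90% is exactly the focused pattern described in the text.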

. C

.................................................................................................................................. The impressive growth of social networking services has made personal contacts and relationships more visible and quantifiable than ever before (Cha, Benevenuto, Haddadi, and Gummadi). Social media services have become a key platform for sharing news and other valuable information in our society; indeed, it is now difficult to find public spheres where social media have no impact. Studying influence patterns has long been difficult, as influence does not lend itself to ready quantification, and essential components such as human choices and the ways our societies function are hard to reproduce within the confines of the lab. The wide availability of data is the very reason that social media have become a key topic for many communication researchers: the data allow researchers to empirically validate important theories as well as to design new applications.

With information propagating on a massive scale through word of mouth, understanding how users influence one another has become a key challenge. By discussing million-scale propagation instances in social media, this chapter has explored ways to identify different types of influence on Twitter networks and has described the relative roles of three types of information spreaders. Among the main findings are that audience size (i.e., the number of followers) does not by itself imply influence on Twitter and that a user’s influence varies across topics. We also observed that one may design an algorithm capable of identifying people who adopt and spread new ideas that influence others before those ideas become popular. Identifying such trendsetters is important, because users with many followers do not necessarily initiate information propagation.
In terms of propagation patterns, we demonstrated that information spreads more widely than deeply on social media. One of the largest widths of a propagation instance involved , users, indicating that cascades can be of massive scale. The largest depth seen was , indicating that shares of messages went down the chain of propagation for  steps. The sheer size of a large-scale cascade is a fascinating feature of social media,

    


as the majority of propagations occur in an organic fashion (i.e., by word of mouth). Furthermore, the users of social media, as an online community, create various communication conventions. Longitudinal data can be used to study how a particular communication convention (e.g., the retweet) emerges and settles over time, and these instances provide ground truth for researchers who study social norms and their adoption.

Finally, the large amount of user-generated content on social media is a gold mine for developers of systems and applications. By mining social media content and applying information retrieval techniques, one may build systems that identify topical experts, discover interesting content on a given topic, and so forth. One may also measure how balanced the set of topics a user receives from social media is, which is the concept of the information diet. The ability to quantify information diets can make users aware of the kinds of information they receive, as well as help build topical recommender systems.

This is a new era of social media. As more conversations take place on social media, managing the excessive amount of information that reaches individuals on a daily basis will become ever more challenging. The studies discussed in this chapter on large-scale propagation phenomena in social media are essential to understanding the degree of user influence, the shape of propagation patterns, and prospective propagation applications, which give insights into detecting content that will go viral.

R Avnit, A. (). The Million Followers Fallacy. http://blog.pravdam.com/the-millionfollowers-fallacy-guest-post-by-adi-avnit/. Benevenuto, F., G. Magno, T. Rodrigues, and V. Almeida. (). Detecting Spammers on Twitter. In Proceedings of the Annual Conference on Email and Anti-Spam (CEAS). http://citeseerx.ist.psu.edu/viewdoc/summary?doi=.... Cha, M., F. Benevenuto, Y.-Y. Ahn, and K. Gummadi. (). Delayed Information Cascades in Flickr: Measurement, Analysis, and Modeling. Elsevier Computer Networks, (), –. Cha, M., F. Benevenuto, H. Haddadi, and K. Gummadi. (). The World of Connections and Information Flow in Twitter. IEEE Transactions on Systems, Man and Cybernetics— Part A Systems and Humans, , –. Cha, M., H. Haddadi, F. Benevenuto, and K. P. Gummadi. (). Measuring User Influence in Twitter: The Million Follower Fallacy. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM). https://www.aaai.org/ocs/index.php/ ICWSM/ICWSM/paper/viewPaper/ Dunbar, R. (). Coevolution of Neocortical Size, Group Size and Language in Humans. Behavioral Brain Science, (), –. Ghosh, S., N. K. Sharma, F. Benevenuto, N. Ganguly, and K. P. Gummadi. (). Cognos: Crowdsourcing Search for Topic Experts in Microblogs. In Proceedings of the Annual SIGIR Conference (SIGIR). https://dl.acm.org/citation.cfm?id= Granovetter, M. (). The Strength of Weak Ties. Elsevier Social Networks, (), –. https://www.sciencedirect.com/science/article/pii/B



. , . , . ,  . 

Katz, E., and P. Lazarsfeld. (). Personal Influence: The Part Played by People in the Flow of Mass Communications. New York: Free Press. Kooti, F., W. A. Mason, K. P. Gummadi, and M. Cha. (). Predicting Emerging Social Conventions in Online Social Networks. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), –.https://dl.acm.org/citation. cfm?id= Kooti, F., H. Yang, M. Cha, K. P. Gummadi, and W. A. Mason. (). The Emergence of Conventions in Online Social Networks. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM). https://www.aaai.org/ocs/index.php/ICWSM/ ICWSM/paper/viewPaper/ Krackhardt, D. . “Graph Theoretical Dimensions of Informal Organizations.” In Computational Organization Theory, –. Psychology Press. Kulshrestha, J., M. B. Zafar, L. E. Noboa, K. P. Gummadi, and Saptarshi Ghosh. (). Characterizing Information Diets of Social Media Users. In Proceedings of the International AAAI Conference on Weblogs and Social Media (ICWSM). https://www.aaai.org/ocs/ index.php/ICWSM/ICWSM/paper/viewPaper/ Kwon, S., M. Cha, K. Jung, W. Chen, and Y. Wang (). Prominent Features of Rumor Propagation in Online Social Media. In Proceedings of the IEEE International Conference on Data Mining Series (ICDM). https://ieeexplore.ieee.org/abstract/document/ Lazarsfeld, P., B. Berelson, and H. Gaudet. (). The People’s Choice: How the Voter Makes Up His Mind in a Presidential Campaign. New York: Duell, Sloan, and Pearce. Liben-Nowell, D., and J. Kleinberg. (). Tracing Information Flow on a Global Scale Using Internet Chain-Letter Data. Proceedings of the National Academy of Science, (), –. Rodrigues, T., F. Benevenuto, M. Cha, K.P. Gummadi, and V. Almeida. (). On Word-ofMouth Based Discovery of the Web. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement Conference (IMC), –. https://dl.acm.org/citation.cfm? 
id= Saez-Trumper, D., G. Comarela, V. Almeida, R. Baeza-Yates, and F. Benevenuto. (). Finding Trendsetters in Information Networks. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), –. https://dl.acm. org/citation.cfm?id= Sharma, N., S. Ghosh, F. Benevenuto, N. Ganguly, and K. P. Gummadi. (). Inferring Who-Is-Who in the Twitter Social Network. ACM SIGCOMM Computer Communication Review, (), –. https://dl.acm.org/citation.cfm?id= Watts, D., and P. Dodds. (). Influentials, Networks, and Public Opinion Formation. Journal of Consumer Research, (), –. Wu, S., J. M. Hofman, W. A. Mason, and D. J. Watts. (). Who Says What to Whom on Twitter. In Proceedings of World Wide Web Conference (WWW), –. https://dl.acm. org/citation.cfm?id= Zafar, M. B., P. Bhattacharya, N. Ganguly, S. Ghosh, and K. P. Gummadi. (). On the Wisdom of Experts vs. Crowds: Discovering Trustworthy Topical News in Microblogs. In Proceedings of the ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), –. https://dl.acm.org/citation.cfm?id=

Chapter 5
......................................................................................................................

Dynamical Processes in Time-Varying Networks
......................................................................................................................

. I

.................................................................................................................................. The modern world lives and breathes connections: an email to a distant contact, a flight to a faraway land, a railroad to a neighboring city, a road to a different neighborhood, a phone call to a loved one. The properties of such connections, and how they come together to form the cohesive networks that knit together our societies, technologies, and thoughts, have been the subject of much scrutiny in recent decades (Barrat et al.; Newman; Cohen and Havlin; Barabási) by researchers in both the physical and social sciences. Such networks are interesting not only in their own right, but perhaps even more fundamentally because they form the backbone through which many other processes take place. Ideas (Weng et al.) and memes (Goel et al.) diffuse through emails and phone calls (Onnela et al.), people and the viruses they carry (Pastor-Satorras and Vespignani) move through transportation networks, economic activity relies on the networks connecting financial institutions, and so on. The intense research in this area has shown how the structural features observed in real networks affect the dynamical processes taking place on their fabric. In particular, heterogeneity in the number of contacts (Albert et al.; Pastor-Satorras and Vespignani; Vespignani), the presence of hierarchies (Kitsak et al.; Fortunato), and different types of correlations between nodes and links (Onnela et al.; Newman) have been shown to have nontrivial effects on spreading phenomena.

Traditional approaches have focused, in turn, on each of the two aspects mentioned: modeling the network itself, and taking a given network and modeling a dynamical




process occurring on top of it. These two approaches can be thought of as corresponding to opposite limits in which the typical timescale of network evolution, τG, is much faster or much slower than the typical timescale of the dynamical process, τP. In either case the two timescales are separated, resulting in an annealed (τG ≪ τP) or a static (τG ≫ τP) network. While such approximations have proven their usefulness and validity in many circumstances, such as cascading failures in power grids or the diffusion of data over the Internet, they have come up short (Butts) in the study of many important processes in which the two timescales are comparable (τG ∼ τP) (Stehlé et al.), as is the case for sexually transmitted diseases (Moody; Morris; Rocha et al.), short-lived memes (Miritello et al.; Panisson et al.), and face-to-face interactions (Isella et al.; Stehlé, Voirin, Barrat, Cattuto, Colizza, et al.; Stehlé, Voirin, Barrat, Cattuto, Isella, et al.). This intermediate regime has become known as temporal networks (Holme; Holme and Saramäki) and has been the focus of much recent research (Perra, Baronchelli, et al.; Perra, Gonçalves, et al.; Ribeiro et al.; Karsai et al.; Liu et al.; Tomasello et al.; Scholtes et al.; Lambiotte et al.; Sun et al.).

This chapter provides an overview of recent developments in the study of dynamical processes on time-varying networks. We explicitly consider the case in which spreading occurs on a timescale comparable to that of the changes of the network on which it takes place. Common examples are the transmission of sexually transmitted diseases, information spreading through phone calls, and any other process in which edges are active only at specific times. From this vantage point we analytically and numerically study the effects of time-varying topologies on different types of diffusion processes, such as random walks, epidemic spreading, and rumor spreading.

. T-V N

.................................................................................................................................. Network modeling stems from sociology, psychology, and graph theory (Albert and Barabási; Boccaletti et al.; Bollobás; Newman). Erdős-Rényi graphs, logit and p* models, and exponential random graphs (Erdős and Rényi; Molloy and Reed; Holland and Leinhardt; Frank and Strauss; Wasserman and Pattison) are examples of the frameworks proposed and adopted over the years. More recently, research in statistical physics and computer science has led to the development of a new class of models, exemplified by the preferential attachment model (Barabási et al.; Barabási and Albert; Dorogovtsev et al.; Dorogovtsev and Mendes; Fortunato et al.; Boguñá and Pastor-Satorras). Although very different in nature, these frameworks can all be classified as “connectivity driven”: the goal is to reproduce the final aggregated topology




of networks obtained through their evolution and growth over time. In other words, such approaches do not explicitly consider the temporal nature of networks and are concerned only with their static properties. As mentioned, however, virtually any network evolves in time, and the importance of developing modeling frameworks able to account for and reproduce this fundamental feature is clear. Over the last decade various approaches have been put forward (Holme; Holme and Saramäki), and how the problem has been tackled depends in part on the field. Social scientists, for example, are traditionally inclined to adopt exponential random graphs (Wasserman and Pattison), models that infer from data the importance of the structural elements deemed relevant; recently, this framework has been generalized to consider the dynamical evolution of connections (Hanneke and Xing; Kolar et al.). In the physics and computer science communities, several other models have been introduced, based on different intuitions and approaches. Some, which consider face-to-face interactions explicitly, model group dynamics starting from the observation that the longer a node remains engaged in a group, the less likely it is to leave it (Zhao et al.; Stehlé et al.). Others focus on the timing of contacts, considering multiple timescales and extending mechanisms typically used in classic static models, such as triadic closure and tie-strength reinforcement (Jo et al.; Laurent, Saramäki, and Karsai). Still others, motivated by modeling the spreading of infectious diseases, consider a turnover of peers in time or generate contact patterns within a specific time window given a certain mixing function (Volz and Meyers; Kretzschmar and Morris).

2.1 Representations

The structure of a static network can be fully specified by just a few pieces of information: node labels and the list of connections between nodes; in the case of weighted or directed networks, a weight or direction can also be assigned to each edge (see Figure .).

Figure . A simple weighted network of five nodes.




One mathematically convenient way to represent an unweighted network is as a matrix $A$, known as the adjacency matrix, in which each element $a_{ij}$ is zero if there is no edge between nodes $i$ and $j$ and one otherwise. From this representation we can obtain several network properties by using matrix algebra. For example, a vector containing the in- (out-)degree of each node can be obtained by simply taking the left (right) product of the matrix with a vector of 1s. It is also easy to see that, since $a_{ij}$ is nonzero only when there is an edge between $i$ and $j$, the product

$$a_{ij} a_{jk} \qquad (1.1)$$

is nonzero only when there is an edge from $i$ to $j$ and one from $j$ to $k$. Therefore, by summing over $j$ we obtain the total number of two-step paths between $i$ and $k$; in matrix form, this is the $(i,k)$ element of $A^2$, the square of the adjacency matrix. The same argument extends to higher powers of $A$, so that the element $a^{(n)}_{ij}$ of $A^n$ gives the number of paths connecting nodes $i$ and $j$ in $n$ steps. Also, the number of connected components is given by the number of zero eigenvalues of the Laplacian matrix, $L = K - A$, where $K$ is the diagonal matrix of node degrees.

Weighted networks can be represented by simply replacing each value of 1 by the corresponding weight:

$$A_{ij} = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 2 & 1 & 1 \\ 0 & 2 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 3 \\ 0 & 1 & 0 & 3 & 0 \end{pmatrix}$$

Unfortunately, in the ever more common case of extremely large networks, this representation is not numerically convenient because of the large number of zero elements. In such cases we represent the network as an edge list, with one edge per line and its corresponding weight:

i j w_ij
1 2 1
2 3 2
2 4 1
2 5 1
3 4 1
4 5 3

The out- (in-)degrees can now be obtained by simply counting how many times each node is listed in the first (second) column, and it is easy to see how we may recover the matrix representation above from this more space-efficient version.




In order to handle temporal networks, we must extend these simple representations. The simplest way to achieve this is to include one extra temporal dimension (see Figure .), so that the adjacency matrix becomes an adjacency tensor $A^{(t)}_{ij}$, with a full adjacency matrix for each time step $t$:

$$A^{(1)} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} \quad A^{(2)} = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix} \quad A^{(3)} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

with all the numerical issues that this representation implies. Alternatively, the edge list can be extended with a column indicating the time step at which each edge is active:

i j w_ij t
2 3 1 1
4 5 1 1
1 2 1 2
2 4 1 2
3 4 1 2
2 5 1 2
4 5 1 2
2 3 1 3
4 5 1 3

If we keep the edges in chronological order, we can simply replay the entire history of the system by processing the list one time step at a time. Thus far we have considered only the common case in which each connection lasts a single time step or can be represented as a multiple of individual time steps; this is the case for SMS, email, online chat, phone calls, and so on. Simple changes can be made to accommodate varying contact durations.

Figure . Temporal network representations.

2.2 Properties

The representation changes introduced previously, while apparently trivial, have some deep and nonobvious consequences that require further modifications to the formalism and the introduction of new metrics and quantities.

2.2.1 Time-Respecting Path

It is clear that the properties of the adjacency matrix at each time step depend strongly on the size of the time step. For example, if the time step is sufficiently small, it is possible that the matrix $A^t$ is fully disconnected, with just a few isolated edges. In turn, this implies that equation (1.1) now becomes

$$a^{t}_{ij}\, a^{t+1}_{jk} \qquad (2.1)$$

since we require that the edge between $i$ and $j$ be active at time $t$ and that the connection between $j$ and $k$ be active at $t + 1$. The adjacency matrix differs from time step to time step, making it clear that any process taking place on the network now depends strongly on the temporal sequence of steps taken: if the $a_{ij}$ edge is present only at time $t$ and the $a_{jk}$ edge is present only at time $t + 1$, then we are able to reach node $k$ from node $i$ (passing through node $j$) only if we start our journey at time $t$. As a consequence, there is no static representation of the network that is exactly equivalent to the full temporal sequence. We define a time-respecting path as a path that connects two nodes $i$ and $k$ through a sequence of link activations that follow each other in time.
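Equation (2.1) can be checked directly on the toy temporal network used in the edge-list example: multiplying consecutive snapshots counts time-respecting two-step paths. A small sketch (array names are illustrative):

```python
import numpy as np

# Adjacency tensor for the toy temporal network (t = 1, 2, 3)
n = 5
A = np.zeros((3, n, n), dtype=int)
for i, j, t in [(2, 3, 1), (4, 5, 1),
                (1, 2, 2), (2, 4, 2), (3, 4, 2), (2, 5, 2), (4, 5, 2),
                (2, 3, 3), (4, 5, 3)]:
    A[t - 1, i - 1, j - 1] = A[t - 1, j - 1, i - 1] = 1

# Eq. (2.1): entry (i, k) of A(1) @ A(2) counts two-step paths that use an
# edge active at t = 1 followed by an edge active at t = 2
P = A[0] @ A[1]
print(P[2, 0])  # one path: 3 -> 2 at t = 1, then 2 -> 1 at t = 2
print(P[2, 1])  # zero: no time-respecting two-step path from 3 back to 2
```

Note that, unlike the static case, the product of snapshots is order dependent: $A^{(1)}A^{(2)}$ and $A^{(2)}A^{(1)}$ count different (and generally different numbers of) paths, which is exactly why no static matrix can summarize the full temporal sequence.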




2.2.2 Connectivity and Latency

In general, time-respecting paths are directed, regardless of the directionality of the network at each time step. As in the previous example, it might be possible to go from node $i$ to node $k$ at time $t$ while it remains impossible to go from node $k$ to node $i$ at the same time. We can then define a temporal network as strongly or weakly connected if there are, respectively, directed or undirected time-respecting paths between each pair of nodes, and the diameter of the network can be defined in a similar fashion. The distance between two nodes will also generally change in time; we can define the average distance between two nodes as the number of steps necessary to travel between them, averaged over all possible starting times. Since temporal paths have an intrinsic duration, we can similarly define the average time necessary to travel between any two nodes. This is necessarily different from the number of steps, because we might have to “wait” several time steps for the next necessary edge to appear. The shortest connection time between two nodes is called the latency.
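Latency (the earliest arrival time) can be computed with a single chronological pass over the edge list if we assume, as in Eq. (2.1), at most one hop per time step and allow waiting at nodes. A sketch under those assumptions (the function name and the one-hop-per-step convention are our modeling choices):

```python
import math
from collections import defaultdict

def latency(temporal_edges, source, target):
    """Earliest time `target` can be reached from `source` along a
    time-respecting path, at most one hop per time step (cf. Eq. 2.1).
    Waiting at a node is allowed; returns math.inf if unreachable."""
    arrival = defaultdict(lambda: math.inf)
    arrival[source] = 0
    for i, j, t in sorted(temporal_edges, key=lambda e: e[2]):  # chronological
        # an undirected contact at time t can be traversed from an endpoint
        # that was reached strictly before t
        if arrival[i] < t:
            arrival[j] = min(arrival[j], t)
        if arrival[j] < t:
            arrival[i] = min(arrival[i], t)
    return arrival[target]

edges = [(2, 3, 1), (4, 5, 1), (1, 2, 2), (2, 4, 2), (3, 4, 2),
         (2, 5, 2), (4, 5, 2), (2, 3, 3), (4, 5, 3)]
print(latency(edges, 1, 2))  # 2: the (1, 2) contact is active at t = 2
print(latency(edges, 1, 3))  # 3: 1 -> 2 at t = 2, then 2 -> 3 at t = 3
```

A single pass suffices because, with the strict inequality, an arrival time can only be produced by edges at strictly earlier times, which have already been processed; edges sharing the same time step can never enable one another.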

2.2.3 Burstiness

The concept of latency becomes particularly important when we consider temporal networks derived from human activity. Humans are notorious for following complex dynamical patterns that introduce highly nontrivial temporal correlations in edge activation sequences. Consider, for example, the case of an individual who reserves a specific time each day to respond to emails and ignores email the rest of the time. Over the short period in which he is active, many edges will be activated, one after the other, but this burst of activity will be followed by a long period of inactivity (Jo et al., ). This type of behavior results in strong temporal correlations that can have an important impact on the observed properties of the network and on any process unfolding on it. Such correlations can be easily detected by measuring the distribution of the time interval between two consecutive activations of the same edge or node, the so-called inter-event time distribution. Indeed, the observation (Barabási, ) of broad-tailed inter-event time distributions in human communications such as emails and letters spurred a flurry of research in models of human dynamics that might account for this observation at the individual level. Individuals also vary widely in their activity levels. The vast majority of email (Vázquez et al., ), Twitter (Kwak et al., ), and cell phone (González et al., ) users have relatively low activity levels, while the few most active individuals account for the overwhelming proportion of the total activity. As a result, the individual activity distribution is broad tailed and can typically be well approximated by a power law (Perra, Gonçalves et al., ) of the form P(a) ∝ a^(−β). Such activity heterogeneity is a fundamental aspect of human activity and underlies many different kinds of behavior.
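The inter-event time distribution mentioned above is straightforward to extract from a timestamped contact log. A minimal sketch (names and the toy event list are illustrative, not data from the chapter):

```python
from collections import defaultdict

def inter_event_times(events):
    """Given (timestamp, edge) event tuples, return the waiting times
    between consecutive activations of the same edge."""
    by_edge = defaultdict(list)
    for t, edge in sorted(events):
        by_edge[edge].append(t)
    waits = []
    for times in by_edge.values():
        waits.extend(t2 - t1 for t1, t2 in zip(times, times[1:]))
    return waits

# A bursty sequence: three rapid emails, a long silence, then two more.
events = [(0, "a-b"), (1, "a-b"), (2, "a-b"), (50, "a-b"), (51, "a-b")]
print(inter_event_times(events))  # [1, 1, 48, 1]
```

A broad-tailed histogram of these waiting times — many short gaps punctuated by rare very long ones — is the signature of burstiness described in the text.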

2.2.4 Memory

This brief discussion of some of the fundamental properties of temporal networks is lacking just one final aspect: a user’s preference for some edges over others. When an individual becomes active and decides whom to contact, there is a much higher probability of her contacting, say, a boyfriend or a child than, say, a mechanic or a public information line. This phenomenon is usually accounted for by introducing the concept of memory (Gonçalves et al., ). Users “remember” whom they contacted in the past (and how often) and then choose accordingly, resulting in a higher chance of connecting to someone contacted recently (as you might call someone back after a few minutes to confirm an appointment time) as well as to someone contacted often (such as a significant other or a family member). This results in strong and weak ties, in the language of Granovetter (), and explains the seminal observations of Moreno () about how individuals choose to spend their social capital differently from others.

2.3 Activity-Driven Networks

Having introduced some of the fundamental concepts and ideas relating to temporal networks, we now turn to an analytically tractable approach, the so-called activity-driven modeling framework (Perra, Baronchelli, et al., ; Perra, Gonçalves, et al., ; Karsai et al., ; Liu, Baronchelli, & Perra, ; Liu, Perra, Karsai, & Vespignani, ; Tomasello et al., ). This approach has the advantage of keeping the number of assumptions to a minimum and has proven able to explain several properties observed in empirical networks. The activity-driven approach starts from the observation that the propensity of individuals to engage in social acts is highly heterogeneous, and then gradually adds new details to obtain an increasingly realistic model that remains tractable while allowing us to study analytically the effects of time on spreading phenomena. In activity-driven networks, N nodes are characterized by an activity rate a_i, defined as the probability per unit time of creating new contacts or interactions with other individuals. The activity of each node is assigned from a probability distribution F(a) that can be defined a priori or determined from empirical observations. From a given activity distribution the network can be generated, in its simplest form, as follows:

• At each time step t the network G_t starts with N disconnected nodes.
• With probability a_iΔt each vertex i becomes active and connects to m randomly selected vertices. Nonactive nodes can still receive connections from other active vertices.
• At the next time step t + Δt, all the edges in the network G_t are deleted.

It should be noted that in its simplest incarnation, this model generates a Markovian process with no burstiness or memory. Indeed, nodes do not have memory of the previous time steps and do not recollect with whom they interacted. The full dynamics of the network are encoded in the activity distribution F(a).
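The generative rules above can be sketched in a few lines (not code from the chapter; names and the sampled activity distribution are illustrative):

```python
import random

def ml_step(activities, m, dt=1.0, rng=random):
    """One snapshot G_t of the memoryless (ML) activity-driven model:
    each node i fires with probability a_i * dt and links to m random others."""
    n = len(activities)
    edges = set()
    for i, a in enumerate(activities):
        if rng.random() < a * dt:
            targets = rng.sample([j for j in range(n) if j != i], m)
            edges.update((min(i, j), max(i, j)) for j in targets)
    return edges  # deleted at the next step; only the union over t accumulates

random.seed(1)
# Activities drawn from a (truncated) power law, F(a) ~ a^-2, for this sketch.
activities = [min(1.0, 0.01 * random.paretovariate(1)) for _ in range(500)]
aggregate = set()
for _ in range(100):  # integrate T = 100 snapshots into one static view
    aggregate |= ml_step(activities, m=2)
print(len(aggregate))  # edges of the time-aggregated network
```

Each snapshot is a sparse random graph, while the hubs of the aggregated view are simply the most active nodes — the distinction the following paragraphs develop.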
In the common case where F(a) ∝ a^(−γ), the network generated at each iteration is a simple random graph with low average connectivity. However, by integrating the links generated over a large time window of size T, the model generates a broad-tailed degree distribution, P_T(k), due to the wide variation of activity rates in the system. It is crucial to note how the formation of hubs here differs from that in growing network models such as preferential attachment. While in activity-driven networks the creation of hubs is driven by the presence of highly active nodes that repeatedly engage in more interactions than typical nodes, in preferential attachment models hubs are the result of different forms of positional advantage (older nodes are more likely to have large numbers of connections) and a passive attraction of connections (the rich get richer). At each time step, the average number of active nodes is ⟨a⟩N. Since each active node creates m links, the average degree at each time t can be written as:

⟨k⟩_t = 2E_t/N = 2m⟨a⟩    (2.2)

Interestingly, in the limit of small k/N and k/T, the degree distribution of the integrated network, defined as the union of the links generated at each time step, can be written as P_T(k) ∝ F[k/(Tm)], meaning that the degree distribution of the integrated network follows the same functional form as the activity distribution (Starnini and Pastor-Satorras, ). Remarkably, this theoretical prediction is approximately observed in empirical data (Perra, Gonçalves et al., ), giving us the first indication that despite their simplicity, activity-driven networks are able to accurately reproduce important features observed in real networks.

The simplicity of the model, however, comes at a cost. As described so far, these models are unable to capture the lifetime distribution of links and the memory of nodes described previously. To account for them, we must introduce a simple reinforcement mechanism in how each of the m connections at each time step t is chosen. We define the probability, p(k), that the next communication event of a node currently having k social ties will result in the establishment of a new (k + 1)th link (Karsai et al., ; Ubaldi et al., ), and introduce a crucial difference to the model previously described: a node with k previously established social ties will connect to a randomly chosen new node with probability p(k). Otherwise, with probability 1 − p(k), she will interact with a node already contacted, thus reinforcing earlier established ties. In this case, the selection is done randomly among the k neighbors. Empirical studies have shown that p(k) is well approximated by

p(k) = (1 + k/c)^(−1) = c/(k + c)    (2.3)

where c is a constant that does not depend on the degree of the node considered and can be set to c = 1 without loss of generality. The behavior of p(k) with k suggests that the larger the number of people with whom a node has interacted, the smaller the probability that a new tie will be activated. In other words, the activity of a node is most likely directed toward a small number of strong ties. We refer here to the original model as ML, for memoryless, and to this modified version as WM. A comparison between time-aggregated networks generated by the two models allows us to better understand the effects of memory (Karsai et al., ). The ML dynamics induce an aggregated network with a degree distribution following the same functional form as the activity distribution and a weight distribution decaying exponentially (Perra, Gonçalves et al., ; Starnini and Pastor-Satorras, ). In the case of WM dynamics, memory induces a considerably different structure. In particular, the degree distribution is more skewed in the WM model than in the ML model. Furthermore, the WM model generates a heterogeneous weight distribution, capturing observations in real data (Karsai et al., ).
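A minimal sketch of the reinforcement rule p(k) = c/(k + c) (illustrative names, not the chapter's code) shows how a single node's contacts concentrate on a few strong ties:

```python
import random

def choose_contact(neighbors, n, c=1.0, rng=random):
    """WM reinforcement rule: with probability p(k) = c/(k + c) contact a
    node never contacted before; otherwise return to one of the k old ties."""
    k = len(neighbors)
    if rng.random() < c / (k + c):
        # explore: establish the (k + 1)th tie with a fresh node
        new = rng.choice([j for j in range(n) if j not in neighbors])
        return new, True
    # reinforce: revisit a previously contacted node, chosen uniformly
    return rng.choice(sorted(neighbors)), False

random.seed(3)
neighbors = set()
for _ in range(200):  # 200 communication events by one node
    j, is_new = choose_contact(neighbors, n=1000)
    if is_new:
        neighbors.add(j)
print(len(neighbors))  # far fewer than 200 distinct ties: old ties get reinforced
```

Because p(k) shrinks as 1/k, the number of distinct ties grows much more slowly than the number of events, reproducing the skewed degree and weight distributions of the WM model in miniature.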

3. Dynamical Processes in Activity-Driven Networks

The following discussion analytically and numerically explores different spreading phenomena unfolding on ML and WM activity-driven networks. In particular, we will explore diffusive processes (i.e., random walks) as well as simple contagion (i.e., epidemic spreading) and social contagion phenomena.

3.1 Random Walks

Random walks are fundamental diffusion processes (Newman, ; Noh and Rieger, ; Barrat et al., ; Baronchelli and Pastor-Satorras, ) with applications in a wide variety of disciplines, ranging from economics and genetics to physics and the arts. They are also part of the secret behind the success of modern search engines such as Google (Brin and Page, ; Page et al., ). The basic idea behind random walks is a simple one: at each time step, a set of particles or individuals randomly choose their next move based on their current location and a probability distribution over future states. In the context of networks, the future states correspond to the nearest neighbors of the current node, and the result is a random path through the graph. Clearly, the behavior of random walkers is determined by the topological features of the underlying network, and as such random walks can provide fundamental clues about the structure of often unknown media through the way they diffuse. In network science, the large majority of research has taken place within one of the two limits previously discussed, considering either quenched or annealed graphs (Newman, ; Noh and Rieger, ; Barrat et al., ; Baronchelli and Pastor-Satorras, ). More recently, these studies have been extended to consider random walks diffusing at the same timescale as the one ruling network evolution (Perra, Baronchelli et al., ; Ribeiro et al., ; Lambiotte et al., ; Masuda, Porter, and Lambiotte, ). Indeed, these phenomena are the perfect test bed for the effects of time-varying topologies on diffusion processes.

Before diving into the most recent results, let us formulate the problem with some mathematical rigor and briefly revisit some classic results involving annealed networks. This will provide us with a theoretical framework we can later expand to include activity-driven networks. Let us consider a generic graph, G, with degree distribution P(k). To be as general as possible, all we need is the conditional probability P(k′|k) that a node of degree k′ is found at the end of an edge emanating from a node of degree k. With this definition, all nodes of the same degree class k are statistically equivalent. This is known as the configurational or Molloy-Reed model (see Molloy and Reed, ). Let us consider a number W of walkers diffusing uniformly (one edge at a time) on the network. The average number of walkers in a given degree class is then W_k = (1/N_k) Σ_{i|k_i = k} W_i, with N_k being the number of nodes with degree k. The variation of W_k in time is given by

dW_k/dt = −W_k + k Σ_k′ P(k′|k) W_k′/k′    (3.1)

The first term on the right-hand side (r.h.s.) of this expression takes into account the number of walkers moving out of each node of degree k, while the second term describes the number of walkers reaching nodes of degree k from nodes of any other degree class. For uncorrelated networks, the probability that a node of degree k is connected to a node of degree k′ is a function of k′ alone, so that the stationary state of the above equation is given by

W_k = kW/(⟨k⟩N)    (3.2)
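The degree-proportional scaling of equation (3.2) is easy to verify with a short simulation on a static graph (a sketch with an illustrative toy graph, not code from the chapter):

```python
import random

def stationary_visits(adj, steps=200000, seed=7):
    """Count node visits of a long random walk on a static undirected graph."""
    rng = random.Random(seed)
    visits = [0] * len(adj)
    node = 0
    for _ in range(steps):
        node = rng.choice(adj[node])  # hop to a uniformly chosen neighbor
        visits[node] += 1
    return visits

# A small graph with unequal degrees: node 0 has degree 4, nodes 3 and 4 degree 1.
adj = [[1, 2, 3, 4], [0, 2], [0, 1], [0], [0]]
degrees = [len(nbrs) for nbrs in adj]
visits = stationary_visits(adj)
total = sum(visits)
for i in range(len(adj)):
    # empirical visit fraction vs. the prediction k_i / (2E)
    print(i, round(visits[i] / total, 3), degrees[i] / sum(degrees))
```

After the initial transient, the empirical visit fractions track k_i/2E closely: the highest-degree node collects walkers in proportion to its degree, as the text goes on to note.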

Interestingly, after an initial transient the number of walkers in each degree class reaches a dynamical equilibrium that scales linearly with the degree. In other words, nodes of higher degree tend to collect a larger number of walkers. It is possible to prove that this holds for any undirected network (Noh and Rieger, ). Furthermore, p_k = W_k/W is the probability that a random walker visits a given node of degree k. This quantity preserves the linear scaling with the degree, that is, p_k = k/(⟨k⟩N). As a consequence, nodes with larger degree are characterized by a larger probability of discovery (Newman, ; Noh and Rieger, ; Barrat et al., ).

Let us now consider the same process unfolding on an ML activity-driven network. In this case, the spreading phenomenon proceeds as follows: at each time step t an activity-driven network G_t is generated, and walkers diffuse on it for a time Δt. After diffusion, at time t + Δt, a new network G_{t+Δt} is generated. It is important to note that the dynamics of the random walkers and of the network take place at the same timescale, introducing a unique feature not found in static or annealed networks: walkers can get trapped in temporarily isolated nodes for an extended period of time, as described in our previous discussion of burstiness. As we will see, it is now convenient to consider activity classes instead of degree classes, by assuming that all nodes within activity class a are statistically equivalent. It can be shown (Perra, Baronchelli et al., ) that the number of walkers, W_a, in a given node of activity class a is given by

dW_a/dt = −aW_a + amw − m⟨a⟩W_a + ∫ a′ W_a′ F(a′) da′    (3.3)

where w ≡ W/N is the average number of walkers per node, and we have taken the continuous-a limit. The first two terms on the right-hand side are due to the activity of the nodes in class a, which release all the walkers they hold and receive walkers originating from the nodes they connect to. The last two terms describe the activity of the nodes in all the other activity classes that connect to nodes of activity class a. The stationary state of the process is then

W_a = (amw + φ)/(a + m⟨a⟩)    (3.4)

where φ = ∫ aF(a)W_a da is the average number of walkers moving out of active nodes. Interestingly, in the stationary state this quantity is constant and can be evaluated self-consistently, as shown in Perra, Baronchelli et al. (). It should be noted that the behavior of time-varying networks is strikingly different from that of static and annealed networks (see Figure .). Indeed, in time-varying networks the number of walkers is not a linear function of the activity, but rather saturates at sufficiently large values of a. The origin of this difference is deeply rooted in the properties of the instantaneous network. Nodes with high activity have on average only k ≃ m connections at each time step, resulting in a limited capacity for collecting new walkers, a feature that is not present in time-aggregated views of dynamical networks (Perra, Baronchelli et al., ). These results highlight the importance of appropriately accounting for the time-varying features of networks in the study of exploration and spreading processes in dynamical complex networks.

[Figure . Diffusion dynamics in networks. The main panel shows W_a as a function of the activity a for random walkers diffusing on activity-driven networks with activity distribution F(a) ∝ a^(−γ), for two values of γ; solid lines describe the analytical prediction of equation (3.4). The inset shows W_a for a random walk on an activity-driven network integrated over T time steps, where the solid line W_a ∝ a fits the simulation points for large values of a.]

3.2 Epidemic Spreading

From the diffusion of individuals or walkers between nodes we now turn to the way in which information or viruses spread and infect the nodes of a network. Not surprisingly, given its practical importance, the modeling of the spreading of infectious diseases has a long tradition that dates back to the work of Bernoulli (Bernoulli, ). When considering illnesses that spread from human to human, it is clearly of crucial importance to consider the way we interact. For this reason, one of the most relevant applications of network science is devoted to understanding our contact patterns and how these affect the spreading of infectious diseases (Newman, ; Barrat et al., ; Keeling and Rohani, ). Although contact networks are highly dynamical in nature (Morris and Kretzschmar, ; Morris, ; Isella et al., ), the large majority of research has been done on quenched or annealed networks. While these approximations are well suited to model influenza-like illnesses, they fail to capture more complex diseases such as sexually transmitted diseases, where the concurrency, frequency, duration, and order of contacts are crucial ingredients (Morris and Kretzschmar, ; Morris, ; Rocha et al., ; Isella et al., ; Perra, Goncalves et al., ).

As before, let us start by reviewing some classic results obtained on annealed networks before proceeding to the activity-driven case. In the susceptible-infected-susceptible (SIS) epidemic compartmental model (Barrat et al., ; Kermack and McKendrick, ; Keeling and Rohani, ; Pastor-Satorras, Castellano, Van Mieghem, and Vespignani, ), the population is divided into two classes of individuals: susceptible individuals, who are healthy, and infected individuals, who have the disease and are able to spread it.
The disease propagates from infected to susceptible neighbors with probability λ per contact, while infectious individuals spontaneously recover with rate μ, rejoining the ranks of susceptible individuals. The infection and recovery process can be described by the following transitions:

S + I → 2I,    I → S    (3.5)



 ̧   

In a well-mixed population the behavior of the epidemic is controlled by the reproductive number R = β/μ, where β = λ⟨k⟩ is the per capita spreading rate, which takes into account the average number of contacts ⟨k⟩ of individuals. The reproductive number gives the average number of secondary infections generated by a primary case in a fully susceptible population (Keeling and Rohani, ). As such, it is easy to see that an epidemic can occur only when R > 1, which implicitly defines the epidemic threshold. Above this value epidemics can reach an endemic state and be sustained by the population. Indeed, it is easy to show that the SIS dynamics are then characterized by a dynamical equilibrium with a finite fraction of individuals in the infected state, that is, an endemic state (Keeling and Rohani, ). Over the last fifteen years the well-mixed population approximation has been gradually relaxed through the inclusion of more realistic and data-driven connectivity networks and mobility schemes. This has produced new and interesting results, showing clearly the importance of accounting for complex topologies when modeling spreading phenomena (Lloyd and May, ; Balcan et al., ; Wang et al., ; Chakrabarti et al., ; Castellano and Pastor-Satorras, ; Wang et al., ). In particular, the epidemic threshold has been found to depend on the topological properties of the network (Castellano and Pastor-Satorras, ; Pastor-Satorras et al., ). In the case of static networks, the threshold is given by the principal eigenvalue of the adjacency matrix (Wang et al., ; Chakrabarti et al., ). For annealed networks, the epidemic threshold is a function of the first and second moments of the degree distribution (Barrat et al., ; Vespignani, ):

β/μ > ⟨k⟩²/⟨k²⟩    (3.6)

It is important to note that the heterogeneities observed in real networks induce second moments significantly larger than the first moment. In other words, the heterogeneity in the number of contacts pushes the epidemic threshold toward small values, facilitating the spread of the disease. While the scenario emerging from this observation is rather scary, it also suggests extremely efficient methods of protection by means of targeted vaccination (Barrat et al., ; Vespignani, ).

How do the network dynamics affect a disease spreading at the same timescale? It is very easy to understand the importance of time in this case. A disease with a short infectious period μ⁻¹ may appear able to spread when studied on the time-aggregated network, yet fail to spread on the dynamic instantaneous networks whose union defines the time-aggregated one (Moody, ; Morris and Kretzschmar, ; Morris, ; Isella et al., ). In fact, if the disease spreads on the aggregated network, all edges are readily available to carry the contagion process, disregarding the fact that edges may be active or not according to a specific time sequence defined by the agents’ activity. This intuitive observation can be precisely quantified by calculating the epidemic threshold analytically. Let us consider an SIS process unfolding on ML activity-driven networks. The epidemic dynamics can be characterized by studying the number of infected individuals in the class of activity rate a (Perra, Goncalves et al., ). The variation of this quantity is described by the following equation:

dI_a/dt = −μI_a + λm(N_a − I_a) a ∫ (I_a′/N) da′ + λm(N_a − I_a) ∫ (a′ I_a′/N) da′    (3.7)

where N_a represents the total number of individuals in activity class a. In equation (3.7) the first term on the r.h.s. describes the recovery process, while the second takes into account the probability that a susceptible individual of class a is active and acquires the infection through a connection with any other infected individual. Finally, the last term considers the probability that a susceptible individual is contacted by an infected active individual. The equation can be solved, yielding the epidemic threshold (Perra, Goncalves et al., ; Rizzo, Frasca, & Porfiri, ; Starnini & Pastor-Satorras, ):

β/μ > 2⟨a⟩/(⟨a⟩ + √⟨a²⟩)    (3.8)

[Figure . Epidemic dynamics in networks. Density of infected nodes in the stationary state, obtained from numerical simulations of the SIS model on a network generated according to the activity-driven model and on two networks resulting from an integration of the model over T = 20 and T = 40 time steps. The triangle marks the epidemic threshold predicted by equation (3.8).]

The threshold is thus a function of the first and second moments of the activity distribution. It accounts for the activity rate of each node while taking into account the actual dynamics of the interactions. We note that this formula does not depend on the time-aggregated network representation. The spreading process is characterized by the interplay between the timescale of the network and that of the spreading process. As shown in Figure ., when the two dynamics unfold at comparable timescales, the epidemic threshold predicted by equation (3.8) is significantly
larger than the threshold of the same process taking place on a time-aggregated representation of the same network (Perra, Goncalves et al., ). This implies that neglecting the evolution of connectivity patterns can lead to an underestimation of the spreading potential of an infectious disease. Highly active nodes engage repeatedly in social interactions and can drive the spreading process. Indeed, it has been shown that immunizing a small fraction of such nodes results in complete protection for the entire population (Liu et al., ).
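Both thresholds, equations (3.6) and (3.8), are simple moment computations and can be evaluated directly from sampled data. A sketch (function names and the sampled activity distribution are illustrative, not the chapter's simulation code):

```python
import random
from math import sqrt

def annealed_threshold(degrees):
    """Annealed-network SIS threshold, equation (3.6): beta/mu > <k>^2 / <k^2>."""
    k1 = sum(degrees) / len(degrees)
    k2 = sum(k * k for k in degrees) / len(degrees)
    return k1 * k1 / k2

def activity_threshold(activities):
    """ML activity-driven SIS threshold, equation (3.8):
    beta/mu > 2<a> / (<a> + sqrt(<a^2>))."""
    a1 = sum(activities) / len(activities)
    a2 = sum(a * a for a in activities) / len(activities)
    return 2 * a1 / (a1 + sqrt(a2))

random.seed(5)
# Heterogeneous activities: a broad tail inflates <a^2> and lowers the threshold.
activities = [min(1.0, 0.01 * random.paretovariate(1.1)) for _ in range(10000)]
print(activity_threshold(activities))
# A homogeneous population (everyone equally active) gives the largest value:
print(activity_threshold([0.5] * 100))  # 1.0
```

In both formulas, heterogeneity (a second moment much larger than the squared first moment) drives the threshold down, which is the qualitative message of this section.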

3.3 Rumor Spreading

Epidemic spreading can be considered a simple version of the way in which information spreads. The field of rumor spreading takes the lessons learned from the way viruses diffuse among nodes and modifies and extends them to the more general case of the diffusion of information. Rumor spreading is responsible for the adoption of innovations and the spreading of ideas, information, or rumors in a population (Barrat et al., ), and while it is sometimes referred to as “viral,” the manner in which rumors spread through a social network differs significantly from epidemic spreading. Indeed, the spreading of an idea is induced by a sequence of intentional and deliberate acts, while the spreading of a disease is passive and can occur just by being in proximity to others, despite our intentions. Also, multiple exposures to the same rumor might be required before “infection” occurs, and different individuals will naturally have a different propensity to adopt and/or diffuse an idea or a piece of information (Barrat et al., ). All these properties make the modeling and description of social contagion extremely challenging. However, it is intuitively clear that the features of the connectivity patterns describing the interactions of individuals play a crucial role in the unfolding of social contagion. As for the other dynamical processes, the large majority of past research considered the timescale separation limit. We first formulate the problem and revisit some classic results in this limit, then discuss these dynamics in time-varying networks.

In the most fundamental rumor spreading process (Daley and Kendall, ), each node can be in three possible states according to its status with respect to the rumor: ignorant (I) nodes are unaware of the rumor, spreaders (S) are aware and actively spread it, and stiflers (R) are aware of the rumor but have decided to no longer spread it. The dynamics of the model can be described by the following transitions:

I + S → 2S (with rate λ)
S + R → 2R (with rate α)
S + S → 2R (with rate α)

   - 



Here λ and α describe the transition rates into the spreader and stifler states, respectively. While the contagion transition is mathematically equivalent to the one we considered for the epidemic process, spreaders recover to become stiflers when they come into contact with other spreaders or stiflers. In other words, the recovery process is not spontaneous but is instead mediated by interactions, a property that has critical effects on the spreading patterns. We can better understand this point by considering a rumor diffusing on a simple Watts and Strogatz (WS) network (Watts and Strogatz, ). In the WS model, nodes are arranged in a circle, and the links are static. Each node has k links connecting it to the closest nodes on its left and k links connecting it to the closest nodes on its right. Each edge is then randomly rewired with probability p. As p varies between 0 and 1, the network changes character from a regular graph to an Erdős-Rényi network (Erdős and Rényi, ). At intermediate values of p the resulting networks are simultaneously characterized by a small diameter, due to the random shortcuts, and by high clustering, due to the initial ordered local arrangement (Watts and Strogatz, ). This topology is extremely useful for understanding the dynamics of rumor spreading models. At small values of p the high level of clustering results in a localization of the spreaders, which in time transition to stiflers, stopping the overall spreading of the rumor (Zanette, ). As p increases, the number of shortcuts grows, reducing the localization of connections and the annihilation of spreaders, with a phase transition occurring at a point that depends on the value of k (Zanette, ). These results clearly show the difference between rumor and epidemic spreading: here the repetition of contacts induces an early termination of the spreading.
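The localization effect can be made concrete with a self-contained simulation. The sketch below is illustrative only: the rumor rule is a Maki-Thompson-style variant in which only the initiating spreader becomes a stifler, and the WS rewiring is simplified relative to the original construction:

```python
import random

def ws_graph(n, k, p, rng):
    """Watts-Strogatz-style ring: each node linked to its k nearest neighbors
    on each side, then each edge rewired with probability p (simplified)."""
    edges = set()
    for i in range(n):
        for d in range(1, k + 1):
            edges.add(tuple(sorted((i, (i + d) % n))))
    final = set()
    for i, j in edges:
        if rng.random() < p:
            new_j = rng.randrange(n)
            while new_j == i or tuple(sorted((i, new_j))) in final:
                new_j = rng.randrange(n)
            final.add(tuple(sorted((i, new_j))))
        else:
            final.add((i, j))
    adj = [[] for _ in range(n)]
    for i, j in final:
        adj[i].append(j)
        adj[j].append(i)
    return adj

def rumor(adj, lam, alpha, rng, steps=10000):
    """I/S/R rumor dynamics: I+S -> 2S (rate lam); a spreader contacting a
    spreader or stifler becomes a stifler itself (rate alpha)."""
    state = ["I"] * len(adj)
    state[0] = "S"  # single initial seed
    for _ in range(steps):
        spreaders = [i for i, s in enumerate(state) if s == "S"]
        if not spreaders:
            break  # rumor dynamics have terminated
        i = rng.choice(spreaders)
        if not adj[i]:
            continue
        j = rng.choice(adj[i])
        if state[j] == "I" and rng.random() < lam:
            state[j] = "S"
        elif state[j] in ("S", "R") and rng.random() < alpha:
            state[i] = "R"
    return sum(s != "I" for s in state) / len(state)

rng = random.Random(11)
for p in (0.0, 0.1, 1.0):
    adj = ws_graph(200, 3, p, rng)
    # final fraction of nodes aware of the rumor; typically grows with p
    print(p, rumor(adj, lam=1.0, alpha=1.0, rng=rng))
```

At small p, clustered neighborhoods turn spreaders into stiflers before the rumor escapes; as shortcuts appear, the rumor typically reaches a much larger fraction of the network, mirroring Zanette's transition.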
Armed with this understanding of the classical rumor spreading model, we can now better understand the effect that the temporal dynamics of ML and WM activity-driven networks (Karsai et al., ; Ubaldi et al., ) have on this process. Numerical simulations (Karsai et al., ) show a clear difference between the two cases. The repetition of contacts that is characteristic of WM dynamics results in a strong reduction of the final fraction of nodes aware of the rumor with respect to the ML case (see Figure .). In order to understand the biases induced in the dynamical properties of rumor spreading processes by the time-aggregated representation of the networks, Karsai et al. () considered the topologies generated by a time-aggregated view of the ML and WM models and compared the results with their time-varying counterparts. The results showed striking differences in the velocity of spreading. The time for a rumor to reach a sizable fraction of nodes can vary by four orders of magnitude between the two cases, with very slow spreading dynamics in time-varying networks. These results further illustrate the clear difference between the dynamical properties of processes taking place on time-aggregated and time-resolved networks, and confirm once again that when the timescale of the process is comparable with that of the evolution of the network, static representations of the system can introduce strong errors into the characterization of the phenomenon.



 ̧   

[Figure . Rumor spreading processes. Panel (a) visualizes spreading in ML activity-driven networks; panel (b) visualizes spreading in WM activity-driven networks. Node colors mark the ignorant (blue), spreader (red), and stifler (yellow) states; node sizes and edge widths represent the corresponding degrees and weights. The parameters of the simulations are the same for the two processes, and each process was initiated from a single seed with maximum strength.]

. D

.................................................................................................................................. The era of big data is revolutionizing our technology and how we communicate and interact. As our social interactions become more reliant on technology through email, cell phones, online social networks, and so on, they also become more amenable to large-scale analysis, creating unprecedented opportunities for the social sciences and also unique theoretical and practical challenges. Our social interactions are naturally represented as temporal connections taking place over an underlying social network. Despite a wealth of recent progress, the field of temporal networks is just now coming of age, and much still remains to explore. The quenched and annealed limits of network dynamics are well understood, but the intermediate regime where τG  τp is mostly unexplored. In this short chapter we have reviewed some recent theoretical results using the activity-driven framework for three classical dynamical processes: random walks, epidemic spreading, and rumor spreading. These processes are the most prototypical examples of dynamical processes and cover three important aspects of human behavior: mobility, public health, and information spreading. They also have a long and venerable history of theoretical, analytical, and simulation results that can guide us to better understand how our changing social landscape can impact our lives. The WP and ML network dynamical models we have explored are just the first steps toward the modeling of real-world social dynamics. For instance, they still do not take

   - 



into account burstiness or different classes of nodes (say, male and female). They are also not able to directly explain the latency, connectivity, and clustering properties of real networks, both social and technological. Much remains to be done in this emerging field, and we hope that this chapter can help raise the interest of other researchers in this promising area of research. While simple, the results introduced here demonstrate the fundamental importance of understanding the way the different timescales interact in the real world, resulting in more complex and realistic models. Any process that relies on the explicit activation of an edge (as a phone conversation relies on the recipient picking up the phone) must be analyzed in light of a temporally explicit framework such as this if we are to properly understand and model real-world phenomena. In many cases, the timescale of contact is given by the process itself (like the duration of a phone call), while in others it must be explicitly chosen by the researcher. Methods to determine the optimal timescale are a subject of ongoing research.

The rise of passive data collection that we are currently witnessing in the form of the World Wide Web and the “Internet of Things” will undoubtedly result in an unprecedented increase in the number of processes for which detailed temporal information is available. The development of methodologies that are able to correctly account for the inherent temporal characteristics and correlations is an important step in furthering our understanding of the modern world and of our place within it.
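As a concrete illustration of the framework reviewed above, the sketch below simulates Maki-Thompson-style rumor dynamics (ignorant/spreader/stifler) on a memoryless (ML) activity-driven network: at each step, every node activates with probability proportional to its heterogeneously distributed activity and fires a few links to random partners. All parameter values and names here are illustrative choices, not those used to produce the chapter's figures.

```python
import random

def sample_activity(rng, gamma=2.1, eps=1e-3):
    """Draw an activity a ~ x^-gamma on [eps, 1] via inverse-CDF sampling."""
    u = rng.random()
    return (u + (1.0 - u) * eps ** (1.0 - gamma)) ** (1.0 / (1.0 - gamma))

def rumor_on_activity_driven(N=200, T=50, m=3, lam=0.5, alpha=0.3, seed=7):
    """Rumor dynamics (ignorant I / spreader S / stifler R) on a memoryless
    activity-driven network: each step, node i activates with probability a_i
    and fires m links to uniformly random nodes."""
    rng = random.Random(seed)
    acts = [sample_activity(rng) for _ in range(N)]
    state = ["I"] * N
    state[max(range(N), key=lambda i: acts[i])] = "S"  # seed: most active node
    nodes = list(range(N))
    for _ in range(T):
        edges = []
        for i in nodes:
            if rng.random() < acts[i]:
                targets = [j for j in rng.sample(nodes, m + 1) if j != i][:m]
                edges.extend((i, j) for j in targets)
        # Sequential update within a step: a simplification of the exact model.
        for i, j in edges:
            for a, b in ((i, j), (j, i)):
                if state[a] == "S":
                    if state[b] == "I" and rng.random() < lam:
                        state[b] = "S"      # spreading contact
                    elif state[b] != "I" and rng.random() < alpha:
                        state[a] = "R"      # spreader loses interest
    return {s: state.count(s) for s in "ISR"}
```

Replacing the uniform choice of partners with a memory kernel that favors previously contacted nodes would turn this ML sketch into a with-memory (WM) variant of the kind discussed in the chapter.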

R Albert, R., and A. L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., :, . Albert, R., H. Jeong, and A. L. Barabási. Error and attack tolerance of complex networks. Nature, :, . Alessandretti, L., K. Sun, A. Baronchelli, and N. Perra. Random walks on activity-driven networks with attractiveness. Phys. Rev. E, ():, . Balcan, D., V. Colizza, B. Gonçalves, H. Hu, J. J. Ramasco, and A. Vespignani. Multiscale mobility networks and the spatial spreading of infectious diseases. Proc. Natl. Acad. Sci. U.S.A., :, . Barabási, A.-L. The origin of bursts and heavy tails in human dynamics. Nature, :, . Barabási, A.-L., and R. Albert. Emergence of scaling in random networks. Science, :, . Barabási, A.-L. Network science. Cambridge University Press, . Barabási, A.-L., R. Albert, and H. Jeong. Mean-field theory for scale-free random networks. Physica A, :, . Baronchelli, A., and R. Pastor-Satorras. Mean-field diffusive dynamics on weighted networks. Phys. Rev. E, ():, . Barrat, A., M. Barth´elemy, and A. Vespignani. Dynamical Processes on Complex Networks. Cambridge University Press, . Bernulli, D. Essai dune nouvelle analyse de la mortalité causée par la petite vérole et des advantages de l’inocoulation pur la prévenir. Mem. Math. Phys. Acad. Roy. Sci., :–, .



 ̧   

Boccaletti, S., V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics Reports, :, .
Boguñá, M., and R. Pastor-Satorras. Class of correlated random networks with hidden variables. Phys. Rev. E, :, .
Bollobás, B. Modern Graph Theory. Springer-Verlag, .
Brin, S., and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, (–):–, .
Butts, C. T. A relational event framework for social action. Sociological Methodology, :–, .
Butts, C. T. Revisiting the foundations of network analysis. Science, :–, .
Castellano, C., and R. Pastor-Satorras. Thresholds for epidemic spreading in networks. Phys. Rev. Lett., ():, .
Chakrabarti, D., Y. Wang, C. Wang, J. Leskovec, and C. Faloutsos. Epidemic thresholds in real networks. ACM Transactions on Information and System Security (TISSEC), (), .
Cohen, R., and S. Havlin. Complex Networks: Structure, Robustness and Function. Cambridge University Press, .
Daley, D. J., and D. G. Kendall. Epidemics and rumours. Nature, ():, .
Dorogovtsev, S. N., and J. F. F. Mendes. Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, .
Dorogovtsev, S. N., J. F. F. Mendes, and A. N. Samukhin. Structure of growing networks with preferential linking. Phys. Rev. Lett., :, .
Erdős, P., and A. Rényi. On random graphs. Publicationes Mathematicae, :, .
Fortunato, S. Community detection in graphs. Physics Reports, :–, .
Fortunato, S., A. Flammini, and F. Menczer. Scale-free network growth by ranking. Phys. Rev. Lett., ():, .
Frank, O., and D. Strauss. Markov graphs. J. Am. Stat. Assoc., :–, .
Goel, S., A. Anderson, J. Hofman, and D. J. Watts. The structural virality of online diffusion. Manage. Sci., ():–, .
Gonçalves, B., N. Perra, and A. Vespignani. Modeling users’ activity on Twitter: Validation of Dunbar’s number. PLoS ONE, :e, .
González, M. C., C. A. Hidalgo, and A.-L. Barabási. Understanding individual human mobility patterns. Nature, :, .
Granovetter, M. S. The strength of weak ties. Am. J. Soc., :, .
Hanneke, S., and E. P. Xing. Discrete temporal models of social networks. In Proceedings of the rd International Conference on Machine Learning ICML-SNA, .
Holland, P. W., and S. Leinhardt. An exponential family of probability distributions for directed graphs. J. Am. Stat. Assoc., :–, .
Holme, P. Modern temporal network theory: A colloquium. Eur. Phys. J. B, ():, .
Holme, P., and J. Saramäki. Temporal networks. Phys. Rep., :, .
Isella, L., J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, and W. Van den Broeck. What’s in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol., :–, .
Jo, H.-H., M. Karsai, J. Kertész, and K. Kaski. Circadian pattern and burstiness in human communication activity. New J. Phys., :, .
Jo, H.-H., R. K. Pan, and K. Kaski. Emergence of bursts and communities in evolving weighted networks. PLoS ONE, :e, .
Karsai, M., N. Perra, and A. Vespignani. Time varying networks and the weakness of strong ties. Scientific Reports, , .

   - 



Keeling, M. J., and P. Rohani. Modeling Infectious Diseases in Humans and Animals. Princeton University Press, .
Kermack, W. O., and A. G. McKendrick. A contribution to the mathematical theory of epidemics. Proc. R. Soc. A, :, .
Kitsak, M., L. K. Gallos, S. Havlin, and H. A. Makse. Identification of influential spreaders in complex networks. Nature Physics, :, .
Kolar, M., L. Song, A. Ahmed, and E. P. Xing. Estimating time-varying networks. Ann. Appl. Stat., :–, .
Kretzschmar, M., and M. Morris. Measures of concurrency in networks and the spread of infectious disease. Math. Biosci., :–, .
Kwak, H., C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW ’ Proceedings of the th International Conference on World Wide Web. ACM, .
Lambiotte, R., V. Salnikov, and M. Rosvall. Effect of memory on the dynamics of random walks on networks. J. Complex Netw., ():–, .
Laurent, G., J. Saramäki, and M. Karsai. From calls to communities: A model for time-varying social networks. Eur. Phys. J. B, ():, .
Liu, Q.-H., X. Xiong, Q. Zhang, and N. Perra. Epidemic spreading on time-varying multiplex networks. Phys. Rev. E, ():, .
Liu, S., A. Baronchelli, and N. Perra. Contagion dynamics in time-varying metapopulation networks. Phys. Rev. E, :, .
Liu, S., N. Perra, M. Karsai, and A. Vespignani. Controlling contagion processes in activity driven networks. Phys. Rev. Lett., :, .
Lloyd, A. L., and R. M. May. How viruses spread among computers and people. Science, :, .
Masuda, N., M. A. Porter, and R. Lambiotte. Random walks and diffusion on networks. Physics Reports, :–, .
Miritello, G., E. Moro, and R. Lara. Dynamical strength of social ties in information spreading. Phys. Rev. E, :, .
Molloy, M., and B. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, :, .
Moody, J. The importance of relationship timing for diffusion: Indirect connectivity and STD infection risk. Soc. Forces, :, .
Moreno, J. L. Who Shall Survive? nd ed. Beacon House, .
Morris, M. Telling tails explain the discrepancy in sexual partner reports. Nature, :, .
Morris, M. Sexually Transmitted Diseases. Edited by K. K. Holmes et al. McGraw-Hill, .
Morris, M., and M. Kretzschmar. Concurrent partnerships and the spread of HIV. AIDS, :, .
Nadini, M., K. Sun, E. Ubaldi, M. Starnini, A. Rizzo, and N. Perra. Epidemic spreading in modular time-varying networks. Scientific Reports, ():, .
Newman, M. E. J. Spread of epidemic disease on networks. Phys. Rev. E, :, .
Newman, M. E. J. Networks: An Introduction. Oxford University Press, .
Noh, J. D., and H. Rieger. Random walks on complex networks. Phys. Rev. Lett., :, .
Onnela, J.-P., J. Saramäki, J. Hyvönen, G. Szabó, D. Lazer, K. Kaski, J. Kertész, and A.-L. Barabási. Structure and tie strengths in mobile communication networks. Proc. Natl. Acad. Sci. U.S.A., :, .



 ̧   

Page, L., S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, .
Panisson, A., A. Barrat, C. Cattuto, W. Van den Broeck, G. Ruffo, and R. Schifanella. On the dynamics of human proximity for data diffusion in ad-hoc networks. Ad Hoc Networks, :, .
Pastor-Satorras, R., C. Castellano, P. Van Mieghem, and A. Vespignani. Epidemic processes in complex networks. Rev. Mod. Phys., ():, .
Pastor-Satorras, R., and A. Vespignani. Epidemic spreading in scale-free networks. Phys. Rev. Lett., :, .
Perra, N., A. Baronchelli, D. Mocanu, B. Gonçalves, R. Pastor-Satorras, and A. Vespignani. Random walks and search in time varying networks. Phys. Rev. Lett., :, .
Perra, N., B. Gonçalves, R. Pastor-Satorras, and A. Vespignani. Activity driven modeling of dynamic networks. Scientific Reports, :, .
Ribeiro, B., N. Perra, and A. Baronchelli. Quantifying the effect of temporal resolution on time-varying networks. Scientific Reports, :, .
Rizzo, A., M. Frasca, and M. Porfiri. Effect of individual behavior on epidemic spreading in activity-driven networks. Phys. Rev. E, ():, .
Rocha, L. E. C., F. Liljeros, and P. Holme. Simulated epidemics in an empirical spatiotemporal network of , sexual contacts. PLoS Comput. Biol., ():e, .
Scholtes, I., N. Wider, R. Pfitzner, A. Garas, C. J. Tessone, and F. Schweitzer. Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks. Nat. Commun., :, .
Starnini, M., and R. Pastor-Satorras. Topological properties of a time-integrated activity-driven network. Phys. Rev. E, ():, .
Starnini, M., and R. Pastor-Satorras. Temporal percolation in activity-driven networks. Phys. Rev. E, ():, .
Stehlé, J., A. Barrat, and G. Bianconi. Dynamical and bursty interactions in social networks. Phys. Rev. E, :, .
Stehlé, J., N. Voirin, A. Barrat, C. Cattuto, V. Colizza, L. Isella, C. Régis, J.-F. Pinton, N. Khanafer, W. Van den Broeck, and P. Vanhems. Simulation of an SEIR infectious disease model on the dynamic contact network of conference attendees. BMC Medicine, (), .
Stehlé, J., N. Voirin, A. Barrat, C. Cattuto, L. Isella, J.-F. Pinton, M. Quaggiotto, W. Van den Broeck, C. Régis, B. Lina, and P. Vanhems. High-resolution measurements of face-to-face contact patterns in a primary school. PLoS ONE, ():e, .
Sun, K., A. Baronchelli, and N. Perra. Contrasting effects of strong ties on SIR and SIS processes in temporal networks. Eur. Phys. J. B, ():, .
Tizzani, M., S. Lenti, E. Ubaldi, A. Vezzani, C. Castellano, and R. Burioni. Epidemic spreading and aging in temporal networks with memory. Phys. Rev. E, ():, .
Tomasello, M. V., N. Perra, C. J. Tessone, M. Karsai, and F. Schweitzer. The role of endogenous and exogenous mechanisms in the formation of R&D networks. Scientific Reports, :, .
Ubaldi, E., N. Perra, M. Karsai, A. Vezzani, R. Burioni, and A. Vespignani. Asymptotic theory of time-varying social networks with heterogeneous activity and tie allocation. Scientific Reports, :, .
Ubaldi, E., A. Vezzani, M. Karsai, N. Perra, and R. Burioni. Burstiness and tie activation strategies in time-varying social networks. Scientific Reports, :, .

   - 



Vázquez, A., J. G. Oliveira, Z. Dezső, K.-I. Goh, I. Kondor, and A.-L. Barabási. Modeling bursts and heavy tails in human dynamics. Phys. Rev. E, :, .
Vespignani, A. Modeling dynamical processes in complex socio-technical systems. Nature Physics, :–, .
Volz, E., and L. A. Meyers. Susceptible-infected-recovered epidemics in dynamic contact networks. Proc. R. Soc. B, :, .
Wang, Y., D. Chakrabarti, C. Wang, and C. Faloutsos. Epidemic spreading in real networks: An eigenvalue viewpoint. Presented at the nd Symposium on Reliable Distributed Systems (SRDS), .
Wang, Z., C. T. Bauch, S. Bhattacharyya, A. d’Onofrio, P. Manfredi, M. Perc, N. Perra, M. Salathé, and D. Zhao. Statistical physics of vaccination. Physics Reports, :–, .
Wasserman, S., and P. Pattison. Logit models and logistic regression for social networks. Psychometrika, :–, .
Watts, D. J., and S. H. Strogatz. Collective dynamics of “small-world” networks. Nature, :, .
Weng, L., J. Ratkiewicz, N. Perra, B. Gonçalves, C. Castillo, F. Bonchi, R. Schifanella, F. Menczer, and A. Flammini. The role of information diffusion in the evolution of social networks. In Proceedings of the th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, –. ACM, .
Zanette, D. H. Critical behavior of propagation on small-world networks. Phys. Rev. E, :, .
Zhao, K., J. Stehlé, G. Bianconi, and A. Barrat. Social network dynamics of face-to-face interactions. Phys. Rev. E, :, .

  ......................................................................................................................

-       Research Questions and Tools ......................................................................................................................

 

1. Introduction

.................................................................................................................................. Network analysis is one of the methodological cornerstones of data science. It predates the digital revolution by many decades but has found new relevance and applications in the age of social media and user-generated content. Digital traces of voluntary connections such as hyperlinks, Twitter follows, and Facebook friendships have supported hundreds of network-analysis-based studies. Their findings have helped us understand the dynamics of online social relationships, politics, health, business, rumor diffusion, and other topics of relevance to communication research. Apart from its methods, social network theory (SNT) is a discipline unto itself (Borgatti, Mehra, Brass, & Labianca, ), with its own membership organization (the International Network for Social Network Analysis, or INSNA), conferences (Sunbelt, the International Conference on Advances in Social Networks Analysis and Mining, Political Networks, etc.), and journals (Network Science, Social Networks, Social Network Analysis and Mining, Journal of Social Structure, etc.). These disciplinary features have had a number of consequences, among them relatively low levels of adoption of network analysis methods outside the discipline. Another is that much network analysis research on communication-relevant topics has been conducted by researchers with little or no knowledge of communication theory or concepts (see Freelon, ). Such studies often emphasize how effectively their data exhibit distinct network structural properties and/or how effectively network properties predict tie formation or information diffusion. In other words, this research is focused on using empirical cases to develop context-independent SNT at the expense of more

-      



context-sensitive communication theories. Indeed, the ability to demonstrate network properties or dynamics that persist across radically disparate cases is considered a major contribution worthy of publication in Science or Nature (e.g., Borgatti et al., ; Milo et al., ; Palla, Derényi, Farkas, & Vicsek, ). In this chapter I take the general view that subject-specific expertise and knowledge can add much to network analyses of communication processes. Without denigrating the contributions of SNT in any way, I suggest that communication scholars should use social network analysis primarily to contribute to communication theory. This has of course been occurring for many years (see, e.g., Ahuja & Carley, ; Barnett & Danowski, ; Hartman & Johnson, ; Kim & Barnett, ; Monge & Contractor, ; Reese, Grant, & Danielian, ), but the standard network analysis toolkit ignores a great deal of information that is quite valuable for communication scholars. New research questions, metrics, and tools are needed to fully exploit the richness of the digital data sets to which we now have access. This chapter introduces three seldom-used network analysis techniques of particular value to a major subtopic of communication research—the study of homophily and fragmentation—and applies them to a real-world data set. I have designed an open-source Python module called TSM (Twitter Subgraph Manipulator) that implements these techniques, as none of the major network analysis packages does so. The source code and data used in this chapter are available at http://dfreelon.org/TSM/tsm_demo_files.zip to help interested researchers learn to use the software.

. S N T  S N A

.................................................................................................................................. Network analysis methods have been applied to a diverse set of research contexts, from urban neighborhoods to neural pathways to the structure of the Web. The study of social networks, as its name implies, is principally concerned with relations between humans (as opposed to neurons or web servers). The term “social network analysis” (SNA) is somewhat ambiguous, as it encompasses both a set of methodological tools and a research discipline in its own right. If we were to imagine the two in a Venn diagram, the discipline would be almost completely subsumed within the methodological bubble, which would extend outward to overlap with the other disciplines that use SNA methods. For the sake of clarity, I use “SNA” to refer to the methods and the term “social network theory” (SNT) in Borgatti et al.’s sense (), to denote the discipline. A complete recapitulation of SNA and SNT lies beyond the scope of this chapter, but I briefly review a few of their key characteristics that are particularly relevant to my discussion. First, while traditional variable-based quantitative perspectives focus on individuals without regard to their relationships, SNT “explicitly assumes that actors participate in social systems connecting them to other actors” (Knoke & Yang, , p. ). It takes social structure seriously, rejecting the unstated assumption of many quantitative



 

studies that only individual-level characteristics matter. Its fundamental elements are nodes (modular, connectable units) and ties (the connections between them); a wide range of communication phenomena can be represented as an arrangement of these two types of elements.1 The details of their structure, their cross-sectional and longitudinal dynamics, and how they predict outcomes of interest comprise the primary concerns of SNT. What distinguishes SNT from other fields that use SNA methods is that the former is more concerned with context-independent regularities of social networks. Specific theories that fall within SNT’s remit include preferential attachment (Barabási & Albert, ), the strength of weak ties (Granovetter, ), and structural holes theory (Burt, ). These theories and others like them emphasize the general at the expense of the particular, and thus implications that are interesting from other perspectives often end up unexplored. A brief comparison of SNT and communication approaches to a single, well-covered topic—protest movements’ uses of online media—will substantiate this point. Research on this topic conducted from an SNT perspective emphasizes two major aspects: () structural network characteristics and () models that predict such characteristics (Bastos, Puschmann, & Travitzki, ; Bruns, Highfield, & Burgess, ; Fabrega & Sajuria, ; Morales, Borondo, Losada, & Benito, ; Morales, Losada, & Benito, ; Overbey, Greco, Paribello, & Jackson, ). Such papers rarely, if ever, relate their findings to theories of protest and social movements. Their main goals are to describe the data in terms familiar to SNT theorists and to construct models that fit the data reasonably well and generalize to other contexts. Scholars in communication and other fields have also applied SNA tools to online protest contexts, but this literature differs sharply in its theoretical focus. 
For example, Overbey et al.’s () SNT-based study of protest-related tweets from multiple protest events discusses how the data exhibit the properties of preferential attachment in addition to developing a broadly applicable predictive model of user prominence. In contrast, Meraz and Papacharissi () combine SNA with qualitative and quantitative text analysis, in a study of the #egypt hashtag during early  that is anchored in the related concepts of networked framing and networked gatekeeping. Unlike Overbey et al., they contextualize their findings within broader scholarly conversations about digitally enabled journalism and protest. Among other things, this allows them to conclude that the meaning-making processes at work in their data represented a novel development rather than simply reproducing twentieth-century media power dynamics. A purely SNT approach would not have been able to reach such a conclusion. Other communication studies of online protests that use SNA similarly emphasize communication theories in their literature reviews and conclusions (Garrido & Halavais, ; Theocharis, ; Tremayne, ). This is not to assert the inherent superiority or inferiority of SNT with respect to other conceptual approaches. My goal in this section thus far has merely been to distinguish SNT and SNA, since this distinction matters a great deal for the discussions to follow. While valuable in its own right, SNT is not always an appropriate framework for communication research. But communication researchers who use SNA find themselves at somewhat of a disadvantage compared to SNT adherents. This is because

-      



in addition to its other virtues, SNT plays a major role in the development of SNA tools, techniques, and software. SNA software programs such as UCINet, NodeXL, Multinet, and SNAP were all developed by SNT researchers. These, along with code libraries like igraph for R, C, and Python and NetworkX for Python, are some of the primary tools communication researchers use for SNA research. But these tools were developed with the research priorities of SNT in mind, which do not always align with those of other disciplines. As a result, analytical opportunities of potential importance to non-SNT fields have been neglected by SNA software developers. The bulk of this chapter is devoted to explaining and justifying three network analysis techniques of specific value to communication researchers interested in the concepts of homophily and political fragmentation. These techniques are appropriate for networks that have been partitioned into subgraphs, or subsets of densely connected nodes.2 The techniques are not especially well-known, so I also offer a software module of my own design that implements them.

. H, C,  SNA

.................................................................................................................................. The digital age has greatly elevated the importance of SNA methods for non-SNT fields in the social sciences. To whatever extent we live in a “network society” (Castells, ; Rainie & Wellman, )—that is, a society in which the overarching organizing principle is the network—we need analytical tools designed to account for that reality. Further, if the network is indeed the paradigmatic social structure of the twenty-first century, it stands to reason that different disciplines will focus on different aspects of it. The goal of pushing the boundaries of knowledge in the domain of human communication demands a rethinking of the analytical opportunities enabled by the fundamental elements of SNA. In other words, we should ask ourselves: When we decide to conceptualize a given empirical case as a collection of nodes and ties, what exactly is most interesting from a communication perspective about the patterns they form? This question has many answers, one of which pertains to people’s propensity to communicate primarily with those similar to themselves. That similar individuals tend to cluster together when given the opportunity is a truism widely known across the social sciences. Relevant types of similarity include shared personal interests, careers, ethnicities, genders, sexual orientations, nationalities, geographic locations, health conditions, and many more. Communication research on this topic has been conducted under multiple overlapping headings, including “homophily,” “fragmentation,” “polarization,” and “selective exposure,” but all share a common interest in how people organize their communications along various lines of similarity. 
They may, for example, consume or rebroadcast the same media content, communicate exclusively or mostly with in-group members (however the in-group is defined), or insert distinctive shibboleths into their messages. Such



 

behaviors have been empirically observed across a wide variety of communication contexts, both online (Conover et al., ; Freelon, Lynch, & Aday, ; Lawrence, Sides, & Farrell, ; Smith, Rainie, Shneiderman, & Himelboim, ) and offline (Baum & Groeling, ; Iyengar & Hahn, ; Stroud, ). Network analysis is ideally suited to probe the contours of online homophily. The digital traces automatically recorded by users as they go about their daily online business allow researchers to represent and analyze their communication patterns as networks. In many cases, people, institutions, or websites are abstracted as nodes, while connections of various sorts between them become the ties. Such connections can be undirected (e.g., Facebook friendships and co-membership on a Twitter list) or directed (e.g., Twitter follows and hyperlinks). In network analysis, homophily presents as a tendency for nodes to form clusters in which members are densely connected to one another but only sparsely connected to nonmembers. The shared characteristic(s) that bind cluster members together can often be detected via qualitative inspection (see Etling, Kelly, Faris, & Palfrey, ; Freelon et al., ; Himelboim, Smith, & Shneiderman, ). The established SNA toolkit includes several techniques for exploring network clusters. Perhaps the most important of these is the means by which the clusters are identified in the first place. Clusters can sometimes be assigned using preexisting membership categories, for example separating members of the US Congress into party-based clusters. But network analysts often use clustering algorithms to organize networks based on patterns of ties between nodes—in other words, to detect community structure (Blondel, Guillaume, Lambiotte, & Lefebvre, ; Clauset, Newman, & Moore, ; Newman, ; Newman & Girvan, ; Wakita & Tsurumi, ). 
The most popular of these algorithms attempt to organize network nodes into clusters, within which tie density is as high as possible and between which tie density is as low as possible. This process is known as modularity maximization. Commonly used network modularity-maximization algorithms include Clauset-Newman-Moore, Wakita-Tsurumi, Newman-Girvan, and the Louvain method. One factor driving the popularity of these particular algorithms is their inclusion in major network analysis software packages. Once a network has been partitioned, analyses typically focus on relationships within and between clusters. One simple but important analytical step is to measure each cluster’s size, which can reveal which topics and/or individuals are generating more and less attention within the network. Another is to identify each cluster’s hubs, or most prominent nodes. Prominence can be measured using multiple metrics such as in-degree, betweenness, and eigenvector centrality. Because most online communication networks are long-tailed in addition to being homophilous, a cluster’s hubs act as its leaders, defining its general tone and agenda (Freelon et al., ; Smith et al., ). Hubs can be compared to one another in terms of similarity, however that is defined, using qualitative or quantitative means. The general principle of homophily predicts that hubs within a single cluster will be more similar to one another than to those in other clusters. Where applicable, researchers may also examine the

-      



messages hubs send and receive using automated text analysis methods or traditional content analysis.
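The modularity score that the clustering algorithms discussed above attempt to maximize is straightforward to compute for any given partition. A minimal pure-Python sketch of the standard Newman-Girvan quantity follows; the edge-list and membership-dictionary formats are my own illustrative choices, not those of any package named in this chapter.

```python
def modularity(edges, membership):
    """Newman-Girvan modularity Q = sum_c [L_c/m - (d_c/2m)^2] of an undirected
    graph, where L_c counts edges inside cluster c and d_c its total degree."""
    m = len(edges)
    intra, degree = {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if membership[u] == membership[v]:
            intra[membership[u]] = intra.get(membership[u], 0) + 1
    deg_sum = {}
    for node, d in degree.items():
        deg_sum[membership[node]] = deg_sum.get(membership[node], 0) + d
    return sum(intra.get(c, 0) / m - (dc / (2 * m)) ** 2
               for c, dc in deg_sum.items())

# Two triangles joined by a single bridge: the natural two-cluster partition.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
toy_partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
```

Algorithms such as Louvain and Clauset-Newman-Moore search over partitions to drive this score upward; the function above only evaluates a partition it is handed.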

. B SNA  U: T R Q

.................................................................................................................................. Scholarly disciplines are both enabled and constrained by the tools they use. On the one hand, popular SNA tools have empowered communication researchers to draw many interesting and valuable conclusions. On the other, the limitations of these tools have set formidable boundaries on researchers’ empirical capabilities and imaginations. In this section I discuss three homophily-relevant, subgraph-based research questions that can be answered within SNA’s methodological paradigm, but which are prohibitively difficult to answer with existing tools. This chapter’s main contributions are to (1) explain the theoretical relevance of these questions and (2) introduce and demonstrate techniques for answering them. Each of these techniques would require extensive programming effort to be implemented through the network analysis programming libraries discussed above.

One key research question when analyzing a partitioned network is: How do different subgraphs in a partitioned network relate to one another? Because subgraphs are defined on the basis of shared ties, it follows that some clusters may be “closer” to each other than others, where proximity is defined as the proportion or absolute number of cluster-spanning ties. Depending on what behaviors the ties signify, proximity may be interpreted as a greater propensity to speak to, receive information from, or rebroadcast content from certain clusters rather than others. Existing research has measured proximity in the special case of two-cluster networks, where the two clusters represent liberal and conservative social media enclaves (Adamic & Glance, ; Conover et al., ; Hargittai, Gallo, & Kane, ; Smith et al., ). It is a simple matter to calculate the proportion of ties that extend between two clusters out of all those that remain within one cluster or the other.
However, with networks of more than two clusters, it becomes problematic to measure proximity using a single coefficient; the relevant quantities are the proximities between each cluster and every other cluster. Studies that examine networks containing more than two clusters should account for this, but most software packages do not permit them to do so easily.

A second general research question is: Which nodes are heavily connected to distinct subgraphs? Before I answer this question, let us first consider two related concepts from SNT that do not directly address it. Network brokers are defined as nodes that lie on paths between pairs of other nodes and thus can potentially facilitate connections between such pairs (Burt). In a network representing a large office, a broker might be an employee who is in a position to pass information back and forth between certain members of different departments who cannot communicate directly.



 

Similarly, betweenness centrality measures "the extent that the actor falls on the geodesic [shortest] paths between other pairs of actors in the network" (Hanneman & Riddle). A node with high betweenness centrality may broker communications or relationships among many other nodes that are not otherwise connected. However, brokerage and betweenness centrality are not the only network properties a researcher might wish to know about nodes that communicate across different subgraphs. In a partitioned network, what matters is often less the absolute number of nodes a given node connects and more the specific subgraphs it connects. This is because different subgraphs often represent different communities or interests. A node positioned between two subgraphs could bind together the two communities represented by those subgraphs: for example, a social media account that is widely read by members of one primarily liberal subgraph and a second mainly conservative subgraph. Because media outlets in the "Daily Me" era typically address fragmented audiences, those few that command attention from disparate communities may hold the potential to reduce information disparities between them; in other words, to play the role of a "local interest intermediary," to borrow from Sunstein. All that is needed is to operationalize this concept of local interest intermediation, which I attempt to do in the following section.

The final research question addressed in this chapter is: How do subgraphs change over time? Social media data, like all digital trace data, are inherently longitudinal (Howison, Wiggins, & Crowston). Yet while there have been some longitudinal network studies of social media traces, few have specifically examined subgraphs. Most of those that do focus exclusively on subgraphs whose memberships remain fixed over time (Bode, Hanna, Yang, & Shah; Hargittai et al.).
Such research designs preclude a number of important empirical possibilities, including (1) that nodes may enter and exit the network, (2) that they may travel from one subgraph to another, and (3) that subgraphs may grow or shrink in size. Research questions that require these possibilities to remain open demand methods that leave them open. The subgraph-tracking procedure explained below is just such a method.

. T, D,  R

Before I proceed to the methods and results, a brief note about the analytical software I use is in order. The software module, TSM, is written in the Python programming language and requires Python  or higher. I have provided a two-part tutorial in the form of two Jupyter notebooks, along with the full data set, for anyone interested in performing the following analyses themselves. This tutorial contains all the Python code I used to analyze the data in this chapter. Specific code blocks in the tutorial's second notebook are referred to in this section by the capital letter "B" followed by a

-      



number. For example, the first code block referred to below is B0, followed by B1 (which corresponds to the first research question). (The first number is zero to maintain continuity with the numbered research questions.) The tutorial, which includes instructions for obtaining the sample data, is available at http://dfreelon.org/TSM/tsm_demo_files.zip. The tweets in the current data set were collected using the now-defunct tweet archiving service TwapperKeeper. The sole keyword was the hashtag #wiunion, which was used in early 2011 by protesters against Wisconsin governor Scott Walker's plans to eliminate the collective-bargaining rights of most of the state's public sector unions. The data set includes only retweets because they provide a convenient and meaningful basis for constructing Twitter networks (e.g., Conover et al.; Lin, Keegan, Margolin, & Lazer). It consists of retweets containing this hashtag covering a period of exactly five weeks (thirty-five days) in February and March 2011. Roughly .% of all the tweets TwapperKeeper collected for this keyword between the start and end dates are retweets. An unknown number of tweets is missing from the full data set due to restrictions on the data collection capacities of Twitter's public application programming interfaces (APIs). The primary goal of the remainder of this chapter is to answer the three research questions stated above for the #wiunion data set. I also briefly discuss the theoretical relevance of each finding.

.. Partitioning the Network

To answer our research questions, the network must first be partitioned into subgraphs. To do so, I have partitioned the retweet network using the Louvain method, retaining only the ten largest communities. I retain only the top ten because for massive networks like the current one, Louvain creates a small number of very large clusters and a large number of very small clusters. Only the largest clusters will concern most researchers; on Twitter, the small ones typically represent situations in which one or two users retweet something. The top ten clusters in this data set accounted for % of all nodes in the network and % of all ties. As long as the Louvain algorithm starts running from the same node, it will create the same partition on each execution. However, if it begins running from a different node, it will yield a different partition. The practical upshot of this is that attempts to replicate research using Louvain will not produce identical results. However, the results should be substantively similar if community structure exists in the data. Upon replicating the present analysis multiple times, I found that the resulting clusters were very similar, suggesting that the data set does indeed possess a robust community structure. The reader can verify this independently by following the tutorial.
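To make the winnowing step concrete, here is a minimal stdlib-only Python sketch (a hypothetical helper, not part of TSM's actual API) that keeps only the nodes assigned to the k largest communities, given the node-to-cluster mapping a Louvain implementation typically returns:

```python
from collections import Counter

def top_clusters(partition, k=10):
    """Keep only nodes belonging to the k largest communities.

    partition: dict mapping node -> community label, e.g., the output of a
    Louvain implementation. Returns a filtered copy of the mapping."""
    sizes = Counter(partition.values())                  # community sizes
    keep = {label for label, _ in sizes.most_common(k)}  # k largest labels
    return {n: c for n, c in partition.items() if c in keep}

# Toy example: two two-node communities and one singleton.
part = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
top_clusters(part, k=2)  # drops "e", the lone member of community 2
```

On a real retweet network the same filtering would follow the Louvain run itself, which this sketch deliberately leaves to an external library.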



Table 6.1 Ten Highest In-Degree Users in Three Clusters from Cross-Sectional #wiunion Partition

Conservatives (0)           Wisconsin progressives (1)    Labor unions (14)
michellemalkin      1623    defendwisconsin      1865     weac                3089
conservativeind      969    legaleagle           1559     aflcio              2662
brooksbayne          467    bluecheddar1         1480     seiu                1076
dloesch              416    all_a_twitt_r        1132     evale72              750
foxnews              413    millbot              1047     teacherreality       615
velvethammer         362    wortnews              968     shankerblog          341
theflacracker        356    dysolution            961     workingamerica       313
scrowder             329    mspicuzzawsj          893     smwia                272
ondrock              325    chrisjlarson          817     cateachersassoc      269
crispix49            314    aclumadison           815     tpmmedia             243

Clusters initially emerge from the Louvain algorithm with numerical labels. Qualitative inspection of the screen names and tweets within each cluster allows researchers to create a subjective label conveying its general character. Table 6.1 displays the users with the ten highest in-degree values for three of the ten largest clusters (chosen for illustrative purposes), along with the subjective labels I gave them.3 The table shows a labor-union-centered cluster (14) featuring @aflcio, @seiu, and @cateachersassoc; a Wisconsin-based progressive cluster (1) featuring @defendwisconsin, @bluecheddar1, and @aclumadison; and a conservative-leaning cluster (0) featuring @michellemalkin, @foxnews, and @conservativeind.

-      



.. How Do Different Subgraphs in a Partitioned Network Relate to One Another?

The first research question concerns how each subgraph relates to every other subgraph. To address this I have created a proximity matrix (similar to adjacency matrices for individual nodes), which displays the proximity of each cluster to every other cluster. "Proximity" between clusters A and B is defined as the proportion of ties shared between A and B out of all the ties involving at least one member of A. Table 6.2 displays the proximity matrix for our partitioned network. Each cell represents the number of ties involving the numbered row subgraph that are shared with the numbered column subgraph, divided by all ties involving the row subgraph. Diagonal cells contain internal ties, that is, ties for which both nodes are members of the same subgraph. The diagonal cell in any given row will frequently, though not always, be the largest proportion in that row. While reciprocal off-diagonal cells (e.g., 0–1 and 1–0) always contain the same numerators, they almost never contain the same proportions because the denominators (the sums of each subgraph's involved ties) are nearly always different.

Table 6.2 helps us answer the first research question. Every cluster except 0 (which hosts mostly conservatives) represents a subset of individuals who are broadly aligned with American left-wing politics. Accordingly, most of these share substantial interconnections with at least one other cluster. The labor union cluster (14) is strongly connected to the local progressive cluster (1) and the national progressive cluster (5). The local and national progressive clusters are also strongly connected to one another. But the conservative cluster (0) exhibits strong homophily; over 80% of its ties remain internal, and no other subgraph accounts for as much as a tenth of the conservative cluster's total.
And this is exactly what we would expect: #wiunion was created by progressives, and even when conservatives use it, very little retweeting occurs across the ideological divide. Theoretically, these results help us understand this conversation in terms of the strong

Table 6.2 Proximity Matrix for Cross-Sectional #wiunion Partition

        0      1      3      4      5      6      10     14     16     20
0     0.819  0.09   0.002  0.019  0.048  0.006  0.001  0.011  0.003  0.002
1     0.012  0.49   0.011  0.089  0.27   0.024  0.004  0.074  0.017  0.007
3     0.006  0.3    0.287  0.063  0.242  0.025  0.009  0.04   0.019  0.01
4     0.009  0.325  0.008  0.26   0.271  0.024  0.004  0.061  0.024  0.013
5     0.009  0.375  0.012  0.103  0.386  0.023  0.004  0.059  0.017  0.012
6     0.011  0.331  0.013  0.093  0.234  0.216  0.005  0.068  0.018  0.012
10    0.005  0.258  0.021  0.067  0.183  0.021  0.378  0.035  0.023  0.009
14    0.008  0.407  0.008  0.092  0.232  0.027  0.003  0.192  0.018  0.013
16    0.006  0.289  0.012  0.109  0.2    0.021  0.006  0.053  0.288  0.015
20    0.007  0.214  0.01   0.106  0.243  0.025  0.004  0.067  0.027  0.297



 

connections between multiple factions on one dominant side (the left) against a much smaller oppositional contingent (the right). More broadly, these results offer evidence that retweeting operates as a mechanism of homophily for both sides when used in the context of an ideological hashtag (see also Conover et al.). Another interesting feature of this table emerges in the local and national progressives' columns (1 and 5, respectively). It is clear that these two columns account for sizable proportions of most of the other clusters' ties, the conservative cluster excepted. Not coincidentally, these are also the two largest clusters by membership. This example illustrates the general point that large clusters often exert a sort of "gravitational pull" on the ties of smaller ones, in some cases so much so that they account for a greater proportion of a smaller cluster's ties than that cluster's own internal ties. In an ideologically aligned constellation of subgraphs such as this one, this method can help convey which elements within a larger political ideology are driving attention and engagement. More broadly, it helps us think of homophily not only as a singular property of a given network or subgraph, but also as a series of relationships between each subgraph and its closer and more distant neighbors.
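The proximity calculation described above can be sketched in a few lines of Python. This is an illustrative stdlib-only implementation rather than TSM's actual code; `proximity_matrix` is a hypothetical helper that takes a list of ties and a node-to-cluster mapping:

```python
from collections import defaultdict

def proximity_matrix(edges, partition):
    """Proximity of each cluster to every other: for clusters A and B, the
    ties shared between A and B divided by all ties involving a member of A.

    edges: iterable of (source, target) ties, e.g., retweeter -> retweeted.
    partition: dict mapping node -> cluster label."""
    involved = defaultdict(int)  # ties touching each cluster
    shared = defaultdict(int)    # ties between each ordered cluster pair
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        involved[cu] += 1
        shared[(cu, cv)] += 1
        if cu != cv:             # cross-cluster ties count for both sides
            involved[cv] += 1
            shared[(cv, cu)] += 1
    clusters = sorted(involved)
    return {a: {b: shared[(a, b)] / involved[a] for b in clusters}
            for a in clusters}
```

By construction each row of the result sums to 1, the diagonal holds each cluster's internal-tie share, and reciprocal off-diagonal cells share numerators but not denominators, mirroring the properties of Table 6.2.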

.. Which Nodes Are Heavily Connected to Distinct Subgraphs?

The second research question concerns nodes that lie on the boundaries between subgraphs: what I call local interest intermediaries. In directed networks such as those based on hyperlinks or retweets, we might be especially interested in nodes that receive large numbers of ties from multiple clusters. In the case of retweets, such nodes might help bring ideologically opposed clusters together by promoting shared experiences and offering common ground for cross-cutting debate. But as discussed previously, the standard SNA toolkit does not offer methods to conduct this sort of analysis. The first order of business in defining such a method is specifying criteria that identify local interest intermediary nodes. Here I suggest two such criteria, both of which must be present: first, such nodes should rank highly on at least one measure of network prominence, such as in-degree or eigenvector centrality; and second, substantial proportions of their incoming ties must be distributed fairly evenly across two or more subgraphs. Nodes for which these two criteria hold true can be said to lie near the "borders" of different subgraphs, serving as a common point of connection. The second criterion requires some elaboration. I suggest requiring that the ratio of ties from a given node's second most-connected cluster to those from its most-connected cluster meet or exceed a specific threshold proportion. For this I use a somewhat arbitrary threshold ratio of 0.5, meaning that the second most-connected cluster must supply at least half as many ties to the given node as its most-connected cluster does. Increasing this threshold would require intermediary candidates to link to at least two

-      



clusters more evenly, while decreasing it would permit more inequality between clusters. To satisfy the prominence criterion, I allow only the top .% most prominent nodes to be considered as intermediary candidates. I use in-degree as my chosen prominence metric. The five most prominent local interest intermediaries detected by this analysis can be seen in Table 6.3. The first two columns of the table contain the screen names and in-degrees of the top five users by in-degree who satisfy the second research question's two criteria, in descending order of in-degree. The third column lists all ten clusters to which each user is connected, in descending order by tie count, and the fourth column lists the tie counts themselves. The fifth column shows the ratio of the tie count of the user's second most-connected cluster to that of her most-connected cluster; in accordance with the second criterion, this value always meets or exceeds 0.5. The intermediary nodes identified by this method are a diverse bunch: @thenewdeal is a well-followed, pseudonymous, self-proclaimed socialist devotee of Franklin Roosevelt, while @motherjones is the official account of the long-running left-wing magazine. @aflcio represents the labor union federation; @jennnnie is a little-followed private citizen of Madison, WI, who posted a series of highly retweeted tweets during the protests; and @mmflint is Michael Moore, the progressive documentary filmmaker. Two very quick generalizations about this group of intermediaries are that (1) they are nearly all progressive, and (2) the only Wisconsin local appears to be @jennnnie. Thus we may tentatively conclude that although the protests were a local event, national progressives played a larger role than locals in spreading the word across distinct progressive Twitter clusters. Also, no intermediaries spanned the boundaries between the sole conservative cluster and any of the progressive clusters.
It is of theoretical interest that all these intermediaries connect different factions of a broader progressive coalition, as opposed to connecting progressives and conservatives. The two ideological sides would seem to share few if any major information sources, at least on Twitter. This in turn supports the notion that progressives and conservatives are using Twitter largely to share information from ideologically friendly sources (cf. Adamic & Glance; Sunstein).
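The two criteria can be operationalized in a short stdlib-only sketch. This is a hypothetical helper, not TSM's actual API; because the chapter's exact prominence cutoff percentage is elided in this copy, the top share of nodes is left as a parameter (the default here is an assumption), while the 0.5 ratio threshold matches the text:

```python
def local_intermediaries(in_ties_by_cluster, top_share=0.01, ratio=0.5):
    """Flag local interest intermediaries using the two criteria above:
    (1) rank among the top `top_share` of nodes by in-degree, and
    (2) receive at least `ratio` times as many incoming ties from the
    second most-connected cluster as from the most-connected cluster.

    in_ties_by_cluster: dict node -> {cluster label: incoming tie count}."""
    indegree = {n: sum(c.values()) for n, c in in_ties_by_cluster.items()}
    ranked = sorted(indegree, key=indegree.get, reverse=True)
    prominent = ranked[:max(1, int(len(ranked) * top_share))]
    hits = []
    for node in prominent:
        counts = sorted(in_ties_by_cluster[node].values(), reverse=True)
        if len(counts) >= 2 and counts[1] >= ratio * counts[0]:
            hits.append(node)
    return hits
```

Applied to @aflcio's row in Table 6.3, for instance, the second criterion compares 2429 ties from cluster 1 against half of the 2570 ties from cluster 5, which passes easily.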

.. How Do Subgraphs Change over Time?

The task of measuring how network partitions change over time is not straightforward, especially in social media networks, where participants enter and exit constantly. The solution presented here divides the network into multiple cross-sectional time slices of equal duration and identifies matching subgraphs in adjacent slices.4 Two subgraphs in adjacent slices match one another if the Jaccard coefficient of their shared nodes (1) exceeds a given threshold and (2) exceeds the values for all other subgraph pairs. The Jaccard coefficient is a well-known measure of the overlap between two sets and can be calculated by dividing the size of their intersection by the size of their union. The current method uses a version of the Jaccard that is weighted by a measure of node prominence

Table 6.3 Five Most Prominent (Highest In-Degree) Local Interest Intermediaries

thenewdeal (in-degree 20,380; ratio of 2nd- to 1st-cluster ties 0.564)
  Ties by cluster: 5: 10,688; 1: 6023; 4: 1529; 14: 819; 6: 390; 16: 308; 20: 205; 0: 176; 3: 165; 10: 77

motherjones (in-degree 13,444; ratio 0.834)
  Ties by cluster: 4: 4618; 5: 3852; 1: 3282; 14: 665; 16: 328; 6: 315; 20: 183; 0: 108; 10: 51; 3: 42

aflcio (in-degree 7775; ratio 0.945)
  Ties by cluster: 5: 2570; 1: 2429; 14: 1595; 4: 795; 16: 152; 6: 97; 20: 68; 0: 41; 10: 18; 3: 10

jennnnie (in-degree 7566; ratio 0.882)
  Ties by cluster: 3: 2499; 1: 2203; 5: 1611; 4: 444; 14: 292; 6: 181; 16: 157; 10: 71; 20: 66; 0: 42

mmflint (in-degree 5212; ratio 0.505)
  Ties by cluster: 16: 2089; 1: 1055; 5: 967; 4: 595; 14: 199; 6: 135; 20: 87; 0: 37; 10: 28; 3: 20

-      



(in-degree in this case): instead of counting each node once, its in-degree is added to the numerator and the denominator. This is because in extremely skewed subgraphs like those commonly found in social media networks, prominent nodes contribute disproportionately to subgraph integrity. On Twitter, entire subgraphs can be held together by a single highly retweeted node; if such a node were removed from its subgraph, the latter would completely dissipate. Weighting the Jaccard coefficient by prominence accounts for the outsized importance of such hubs. This cluster-matching process can be repeated across as many network cross sections as the data contain. Once it is complete, a number of properties of persistent subgraphs may be measured. A certain degree of subgraph turnover is inevitable over time as users of varying levels of commitment engage and depart. Subgraphs can grow or shrink in size as media attention waxes and wanes, or when external events occur that increase or decrease the topic's public relevance. They can move in terms of their proximity to each other, growing more or less insular in response to external circumstances. They can split into multiple child subgraphs, each containing a substantial subset of the parent population, or multiple subgraphs can converge into a larger whole. And they can dissolve completely, as the Jaccard fails to meet or exceed whatever minimum threshold the researcher sets. The logistics of tracking subgraphs requires taking a few steps back from the previous two analyses. Specifically, rather than using a single cross-sectional partition encompassing the entire data set as I have until this point, I create separate time slices and partition each one independently. Since the data set covers five weeks, I create weekly slices, but for longer data sets I could just as easily use fortnights or months.
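The weighted Jaccard and the matching step can be sketched as follows. This stdlib-only Python is illustrative rather than TSM's actual implementation; the threshold default is arbitrary because the chapter's value is elided in this copy, and the restriction to each subgraph's most prominent nodes is omitted for brevity:

```python
def weighted_jaccard(nodes_a, nodes_b, prominence):
    """Jaccard overlap of two node sets, weighted by node prominence
    (in-degree here): each node contributes its weight rather than 1."""
    inter = sum(prominence[n] for n in nodes_a & nodes_b)
    union = sum(prominence[n] for n in nodes_a | nodes_b)
    return inter / union if union else 0.0

def best_matches(slice1, slice2, prominence, threshold=0.3):
    """Best-matching subgraph in the next slice for each subgraph in the
    current slice; a pair counts only if its weighted Jaccard meets the
    threshold. slice1/slice2: dict cluster id -> set of member nodes."""
    matches = {}
    for id1, members in slice1.items():
        score, id2 = max((weighted_jaccard(members, b, prominence), idb)
                         for idb, b in slice2.items())
        if score >= threshold:
            matches[id1] = (id2, score)
    return matches
```

Running `best_matches` over each adjacent pair of weekly partitions yields tables of the kind shown in Table 6.4.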
(Attempts to partition shorter time periods, such as single days, tend to produce clusters dominated by single individuals, which can be difficult to classify [Freelon & Karpf].) Next I run the cluster-matching procedure on each adjacent pair of partitioned networks. Five weeks' worth of data yields four adjacent slice pairs for the procedure to compare, and I can examine how the partitions change over these five weeks. The algorithm begins by calculating a weighted Jaccard coefficient between each pair of subgraphs in the two time slices and identifying the best-matching subgraph in the second slice for each subgraph in the first slice. Here, the function compares only the top % of nodes by prominence metric (again, in-degree in this case) within each subgraph. This is because there tends to be a great deal of noisy churn in the long tails of large social media subgraphs caused by casual engagement; that said, the % cutoff may need to be increased for smaller networks. All nodes held in common between best-matching subgraphs are retained for later analysis. Table 6.4 displays the cluster-matching results for weeks 1 and 2. In each row, the first and second cells contain the IDs for two matching clusters. The presence of 10 and 5 in the second row means that cluster 5 from week 2 was the best match for cluster 10 from week 1. The decimal number in the third column is the weighted Jaccard coefficient between those two best-matching clusters. The weighted Jaccard threshold is set somewhat arbitrarily at . (see Freelon et al.), which is why all the



 

Table 6.4 Week 1/Week 2 Cluster Matches and Weighted Jaccards Week 1 Cluster Match

Week 2 Cluster Match

Weighted Jaccard

Users Present in Both Clusters

3

17

0.831

jennnnie

10

5

0.5814

theuptake, motherjones, andrewkroll, ddayen, thinkprogress, aterkel, adamweinstein, mmfa

8

6

0.519

exposeliberals, kurtschlichter, wilkowmajority, heritage, adamsbaldwin, redostoneage, resisttyranny, dloesch, velvethammer, michellemalkin, gregwhoward, conservativeind, ondrock, brooksbayne

1

9

0.5035

Citizenradio, allisonkilkenny, naomiaklein, theyoungturks, majorityfm

11

0

0.4702

evale72, aflcio, nerdette, aftunion, seiu

6

3

0.4367

legaleagle, dane101, defendwisconsin, jimwitkins, news3david, djpain1, weac, wisco, patsimmswsj, bluecheddar1, scottwalkerssuv, millbot, mspicuzzawsj, melissaryan, dysolution, benedictatlarge, wortnews, isthmustdp, dissidentmind

7

2

0.434

raybeckerman, 1whoknu, marnus3, sharon1943, brneyesuss, bmangh, gottalaff, all_a_twitt_r, auriandra, tom__paine, rubberstamprosk, buzzflash, mattison, shoq, craftyme25, novenator, devbost, libertybelle4, thedailyedge, otoolefan, sickjew, angelsavant, arrghpaine, deberupts, steveweinstein

2

1

0.4228

Anonnewsnet, rawstory

coefficients exceed that value. The screen names in the fourth column are those that appear in both clusters. Each match is referred to as a persistent cluster or subgraph; those in Table 6.4 persist over two weeks. Readers may notice that not all subgraphs from weeks 1 and 2 are represented in Table 6.4. This is because two of the ten subgraphs in week 1 failed to meet the minimum Jaccard threshold in conjunction with any subgraph in week 2. This can occur when a cluster's nodes disengage from the topic at hand or are subsumed by other clusters in the latter time slice. To complete this analysis, one would examine the results of the cluster-matching method for all remaining time slice pairs. Interested readers can complete the tutorial to view such results for the full data set. Among other applications, this procedure can be used to visualize changes in persistent subgraph size over time, as in Figure .. I created this figure by combining the results of the cluster-matching analysis with the

-      



6

node count (000’s)

5 4

Labor unions WI progressives National progressives Conservatives @jennnnie Michael Moore

3 2 1 0

1

2

3 Week

4

5

 . Persistent subgraph size over time.

cluster node counts generated by the Louvain method. The six clusters depicted in the figure persisted (i.e., sustained best matches meeting or exceeding the . Jaccard threshold) over at least three weeks. This chart supports several substantive conclusions about how the #wiunion hashtag was used. First, it appears that conservative interest in the protests peaked early and then declined precipitously, with only a slight rebound in week . That week, the Wisconsin legislature passed into law the bill the protesters opposed, which is likely what led to the dissolution of four of the six persistent clusters and the sharp decline of the remaining two by week . Second, it is interesting that the cluster led by labor unions, the targets of the legislation, maintained a fairly low profile over the entire time period. Followers of the protests were more likely to retweet national and local progressives, who probably had larger Twitter followings at the time. Finally, the two individual users who managed to unify clusters on their own could scarcely be more different outside of their shared political ideology. @Jennnnie is a nonelite resident of Madison, Wisconsin, with  followers as of July , ; Michael Moore is a world-famous documentary filmmaker and activist with over . million followers at that time. Yet both were able to find relatively large, comparably sized audiences for their messages during the protests. @Jennnnie's prominence here demonstrates the power of nonelites to attract substantial levels of attention during mediated moments of high public attention (Freelon & Karpf). Results like these also promise to help researchers explore how the relationship between a subgraph's size and its level of homophily changes over time. As with size, this method permits homophily to be measured directly over time (see Freelon et al.).
Having already established that larger clusters seem to attract more connections overall, we could observe how drastic changes in size affect homophily.



 

. Conclusion

This chapter had two major goals, one narrow and the other broad. The narrow goal was to broaden the social network analyst's toolkit by introducing subgraph-specific, theoretically grounded research questions and explaining them through an empirical example. The broader goal was to consider some of the differences between an SNT-based approach to SNA and an approach motivated by research questions drawn from communication scholarship. As useful as we might find research questions and tools developed in other disciplines, we should not allow them to limit our empirical or theoretical horizons. In-depth explorations of the possibilities inherent in digital traces of communication activity push our discipline forward, especially when combined with the ability to realize those possibilities through software. The proposed metrics and empirical solutions demonstrated in this chapter offer clear value for researchers interested in online homophily and fragmentation. The measurement of proximity between individual subgraphs helps us better understand how allied and competing interests clash and collaborate. The identification of local interest intermediaries sheds light on individuals and institutions that bring isolated clusters together. And the ability to track subgraphs over time allows a number of homophily-relevant metrics to be measured longitudinally, which is critical given how quickly online communication dynamics can change. The tutorial and sample data are intended to help researchers apply these functions to their own data sets quickly and easily. Researchers are likely to find them valuable for other communication theories and contexts not discussed here. Far from a finished product, TSM is an open-source work in progress, to which interested and capable parties may contribute at their convenience.
One of its major limitations at present is a lack of compelling visualization options. Not only am I not a visualization expert, but many of TSM's output types do not fit easily into the kinds of two-dimensional, spreadsheet-style data containers most social scientists are used to. But enterprising experts may be able to add visualization capabilities to the module that represent its output in intuitive, aesthetically pleasing ways. Likewise, contributors could add other capabilities drawn from SNA that are useful for studying various communication theories. TSM's development has been heavily influenced by Twitter data, so applying it to data from other platforms might reveal new analytical possibilities. Finally, like any software product, TSM almost certainly has a number of bugs and inefficiencies lurking in its source code. Thorough testing in a range of empirical settings will help bring these to light, and I hope that code-literate users will help to correct them. Ultimately, the most important measures of any empirical metric are (1) the extent of its adoption by research communities and (2) the substantive findings it enables. Those I have proposed here offer new possibilities for researchers interested in homophily and fragmentation in online networks. TSM is not the most user-friendly tool, but I

-      



encourage those interested in probing a bit deeper into their social media data than most other tools allow to consider testing it. I submit this chapter in the hope that other researchers interested in communication research continue to develop, publish, and document their own techniques and tools, and that this essential labor is duly recognized as the rigorous scholarship it is.

N . Nodes are also known as “vertices” or “points,” while ties are also called “edges” or “links.” . Subgraphs are also known as “clusters,” “communities,” and “partitions.” I use these terms interchangeably throughout this chapter. . In-degree is the default prominence metric TSM uses to rank nodes in Louvain-generated subgraphs, but any other prominence metric supported by NetworkX can be substituted (e.g., eigenvector centrality, betweenness centrality, closeness centrality). . This method is similar in several respects to the one introduced by Greene, Doyle, and Cunningham (), though I developed it independently.

R Adamic, L. A., & Glance, N. (). The political blogosphere and the  US election: Divided they blog. In Proceedings of the rd International Workshop on Link Discovery (pp. –). New York: ACM. Ahuja, M. K., & Carley, K. M. (). Network structure in virtual organizations. Journal of Computer-Mediated Communication, (), –. http://doi.org/./j.-.. tb.x Barabási, A.-L., & Albert, R. (). Emergence of scaling in random networks. Science, (), –. http://doi.org/./science... Barnett, G. A., & Danowski, J. A. (). The structure of communication. Human Communication Research, (), –. http://doi.org/./j.-..tb.x Bastos, M. T., Puschmann, C., & Travitzki, R. (). Tweeting across hashtags: Overlapping users and the importance of language, topics, and politics. In Proceedings of the th ACM Conference on Hypertext and Social Media (pp. –). New York: ACM. http://doi.org/ ./. Baum, M. A., & Groeling, T. (). New media and the polarization of American political discourse. Political Communication, (), –. http://doi.org/./  Blondel, V. D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, (), P. Bode, L., Hanna, A., Yang, J., & Shah, D. V. (). Candidate networks, citizen clusters, and political expression: strategic hashtag use in the  midterms. The ANNALS of the American Academy of Political and Social Science, (), –. http://doi.org/./ 



 

Borgatti, S. P., Mehra, A., Brass, D. J., & Labianca, G. (). Network analysis in the social sciences. Science, (), –. http://doi.org/./science. Bruns, A., Highfield, T., & Burgess, J. (). The Arab Spring and social media audiences: English and Arabic Twitter users and their networks. American Behavioral Scientist, (), –. http://doi.org/./ Burt, R. S. (). Structural holes: The social structure of competition. Cambridge, MA: Harvard University Press. Castells, M. (). The rise of the network society. Malden, MA: Wiley-Blackwell. Retrieved from https://books-google-com.proxyau.wrlc.org/books/about/The_Rise_of_the_Network_ Society.html?id=FihjywtjTdUC Clauset, A., Newman, M. E. J., & Moore, C. (). Finding community structure in very large networks. Physical Review E, (), . http://doi.org/./PhysRevE.. Conover, M. D., Ratkiewicz, J., Francisco, M., Goncalves, B., Flammini, A., & Menczer, F. (). Political polarization on Twitter. In Proceedings of the th International Conference on Weblogs and Social Media (pp. –). Barcelona, Spain: AAAI. Etling, B., Kelly, J., Faris, R., & Palfrey, J. (). Mapping the Arabic blogosphere: Politics and dissent online. New Media & Society, (), –. http://doi.org/./ Fabrega, J., & Sajuria, J. (). The emergence of political discourse on digital networks: The case of the Occupy movement. arXiv:. [physics]. Retrieved from http://arxiv.org/ abs/. Freelon, D. (). On the cutting edge of Big Data: Digital politics research in the social computing literature. In S. Coleman and D. Freelon (Eds.), Handbook of Digital Politics (pp. –). Northampton, MA: Edward Elgar. Freelon, D., & Karpf, D. (). Of big birds and bayonets: Hybrid Twitter interactivity in the  presidential debates. Information, Communication & Society, (), –. 
http://doi.org/./X.. Freelon, D., Lynch, M., & Aday, S. (). Online fragmentation in wartime: A longitudinal analysis of tweets about Syria, –. The ANNALS of the American Academy of Political and Social Science, (), –. http://doi.org/./ Garrido, M., & Halavais, A. (). Mapping networks of support for the Zapatista movement. In M. McCaughey & M. D. Ayers (Eds.), Cyberactivism: Online activism in theory and practice (pp. –). New York: Routledge. Granovetter, M. S. (). The strength of weak ties. American Journal of Sociology, (), –. Greene, D., Doyle, D., & Cunningham, P. (). Tracking the evolution of Communities in Dynamic Social Networks. In Proceedings of the  International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (pp. –). http://doi.org/./ ASONAM.. Hanneman, R. A., & Riddle, M. (). Introduction to social network methods. Riverside: University of California, Riverside. Retrieved from http://faculty.ucr.edu/~hanneman/ nettext/ Hargittai, E., Gallo, J., & Kane, M. (). Cross-ideological discussions among conservative and liberal bloggers. Public Choice, (), –. Hartman, R. L., & Johnson, J. D. (). Social contagion and multiplexity: Communication networks as predictors of commitment and role ambiguity. Human Communication Research, (), –. http://doi.org/./j.-..tb.x
Himelboim, I., Smith, M., & Shneiderman, B. (). Tweeting apart: Applying network analysis to detect selective exposure clusters in Twitter. Communication Methods and Measures, (–), –.
Howison, J., Wiggins, A., & Crowston, K. (). Validity issues in the use of social network analysis with digital trace data. Journal of the Association for Information Systems, (), –.
Iyengar, S., & Hahn, K. S. (). Red media, blue media: Evidence of ideological selectivity in media use. Journal of Communication, (), –.
Kim, K., & Barnett, G. A. (). The determinants of international news flow: A network analysis. Communication Research, (), –.
Knoke, D., & Yang, S. (). Social network analysis. Thousand Oaks, CA: Sage.
Lawrence, E., Sides, J., & Farrell, H. (). Self-segregation or deliberation? Blog readership, participation, and polarization in American politics. Perspectives on Politics, (), –.
Lin, Y.-R., Keegan, B., Margolin, D., & Lazer, D. (). Rising tides or rising stars? Dynamics of shared attention on Twitter during media events. PLoS ONE, (), e.
Meraz, S., & Papacharissi, Z. (). Networked gatekeeping and networked framing on #Egypt. The International Journal of Press/Politics.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., & Alon, U. (). Network motifs: Simple building blocks of complex networks. Science, (), –.
Monge, P. R., & Contractor, N. S. (). Theories of communication networks. New York: Oxford University Press.
Morales, A. J., Borondo, J., Losada, J. C., & Benito, R. M. (). Efficiency of human activity on information spreading on Twitter. Social Networks, , –.
Morales, A. J., Losada, J. C., & Benito, R. M. (). Users’ structure and behavior on an online social network during a political protest. Physica A: Statistical Mechanics and Its Applications, (), –.
Newman, M. E. J. (). Fast algorithm for detecting community structure in networks. Physical Review E, ().
Newman, M. E. J., & Girvan, M. (). Finding and evaluating community structure in networks. Physical Review E, (), .
Overbey, L. A., Greco, B., Paribello, C., & Jackson, T. (). Structure and prominence in Twitter networks centered on contentious politics. Social Network Analysis and Mining, (), –.
Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (). Uncovering the overlapping community structure of complex networks in nature and society. Nature, (), –.
Rainie, L., & Wellman, B. (). Networked: The new social operating system. Cambridge, MA: MIT Press.
Reese, S. D., Grant, A., & Danielian, L. H. (). The structure of news sources on television: A network analysis of “CBS News,” “Nightline,” “MacNeil/Lehrer,” and “This Week with David Brinkley.” Journal of Communication, (), –.
Smith, M. A., Rainie, L., Shneiderman, B., & Himelboim, I. (). Mapping Twitter topic networks: From polarized crowds to community clusters. Pew Internet & American Life Project. Retrieved from http://www.pewinternet.org////mapping-twitter-topicnetworks-from-polarized-crowds-to-community-clusters/
Stroud, N. J. (). Polarization and partisan selective exposure. Journal of Communication, (), –.
Sunstein, C. (). Republic.com .. Princeton, NJ: Princeton University Press.
Theocharis, Y. (). The wealth of (occupation) networks? Communication patterns and information distribution in a Twitter protest network. Journal of Information Technology & Politics, (), –.
Tremayne, M. (). Anatomy of protest in the digital era: A network analysis of Twitter and Occupy Wall Street. Social Movement Studies, (), –.
Wakita, K., & Tsurumi, T. (). Finding community structure in mega-scale social networks [Extended abstract]. In Proceedings of the th International Conference on World Wide Web (pp. –). New York: ACM.

  ........................................................................................................................

COMMUNICATION AND ORGANIZATIONAL DYNAMICS ........................................................................................................................

  ......................................................................................................................

         , ,            ? ......................................................................................................................

 

. C C S S M  D  T  C  O D?

..................................................................................................................................
The advancement of organizational dynamics and communication theory is the most important benchmark by which the long-term impact of a new intellectual approach is evaluated here. In this section I outline four ways that computational social science is motivating developments that (1) test existing theories at scale; (2) extend existing theories to offer more nuanced insights; (3) generate new theories about existing
phenomena by the inclusion and juxtaposition of concepts for which data were either unavailable or impractical to collect at scale; and (4) develop new theories about (relatively) new phenomena, such as the changing nature of organizing enabled by digital advances. In its early stages, researchers were able to showcase the potential of computational social science to help test existing theories “at scale.” For instance, one of the best-known claims in network theory is that diverse social network ties provide individuals (and aggregates of individuals) greater access to social and economic opportunities (Burt, ). However, until the past decade these theories could only be empirically tested on relatively small networks, often made up of individuals within a single organization. These studies generated compelling evidence that organizational members who spanned “structural holes,” by connecting with others who were not directly connected, performed better than those who did not. Spanning structural holes gave those individuals access to social and economic opportunities for advancement. However, these ideas remained largely untested at the population level until a study conducted by Eagle and his colleagues (). Analyzing the call graph (network of who called whom on the phone) for the United Kingdom, they were able to demonstrate that individuals who had phone conversations with others who were not directly calling each other (i.e., those spanning structural holes) were more likely to reside in regions of higher socioeconomic status. While the study left open the causal direction of this association—whether spanning structural holes leads to higher socioeconomic status or vice versa—it provided an early example of how computational social science could be used to test existing theories at scale. 
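The brokerage logic behind structural holes can be illustrated on a toy "who-called-whom" graph. The sketch below uses NetworkX's implementation of Burt's constraint measure, where lower constraint indicates more brokerage; the network and node names are invented for the example and are not the UK call graph.

```python
import networkx as nx

# Toy call graph: "broker" spans the structural hole between two
# otherwise disconnected cliques {a1, a2} and {b1, b2}.
G = nx.Graph([("broker", "a1"), ("broker", "a2"), ("a1", "a2"),
              ("broker", "b1"), ("broker", "b2"), ("b1", "b2")])

# Burt's constraint: lower values = fewer redundant contacts = more brokerage
constraint = nx.constraint(G)
```

Nodes embedded in a single clique (e.g., `a1`) are more constrained than the broker, which is the network signature the study above tested at population scale.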
The chapter by Benefield and Shen in this handbook utilizes the massively multiplayer online game (MMOG) EverQuest II to test several existing theories about gender roles and stereotypes. The chapter by Spiro in this handbook tests theories of social convergence that describe the coalescing of attention and people in the event of a crisis. These theories were previously used primarily to study offline behavior, but Spiro shows that support for them is even more accentuated online. One of the theories that Hill and Shaw invoke in their chapter in this handbook is the well-established theory of the diffusion of innovation. They discuss how it is being used to study the diffusion of collaborative practices across peer production websites. In addition, they draw upon organizational population ecology theory, which posits that the fate of organizations is in large part determined not by what occurs inside them but by their position within the environment, including, for example, the carrying capacity of the niche they occupy. In his chapter in this handbook, Weber also builds on an ecology perspective but focuses on the community level, which posits that the fate of a population of organizations (in his case the traditional newspaper industry) is in large part determined by the community of industries in which they are embedded. More recently, Aral and Nicolaides () show how computational social science can be used not just to test existing theories but also to advance them by adding more nuance. Using data collected from over a million individuals over the course of five years, they showed that individuals’ exercise patterns were indeed influenced by those
of others in their social networks, as predicted by theories of social contagion. More important, they were able to extend our understanding of the mechanisms by which social contagion operates. Prior research had argued that social contagion occurs as a result of the person potentially being influenced engaging in social comparison processes (Festinger, ) with the potential influencer. Aral and Nicolaides () showed that individuals’ social comparison processes led them to be more likely to engage in exercise activity to stay ahead of those slightly less active than they were, as compared to those who were slightly more active. The chapter by Benefield and Shen in this handbook explores how mentoring is impacted by mentors who are gender swappers. Gender swapping is by no means a new phenomenon; consider its deployment in no less than five of Shakespeare’s plays. But Benefield and Shen showcase how digital trace data can offer new nuanced insights—especially about phenomena that are hard to observe (literally). In addition to testing at scale and advancing existing theories, computational social science also has demonstrated the potential to unleash new theories that draw on explanatory variables and concepts that require leveraging and juxtaposing diverse data sources that were heretofore unavailable or impractical. One novel source of data, until recently unavailable, that shows considerable promise is functional magnetic resonance imaging (fMRI), which measures an individual’s brain activity by detecting changes associated with blood flow. The approach is premised on the fact that cerebral blood flow reflects neuronal activation. For example, a recent study found that individuals whose friends were friends with each other were less likely to experience social exclusion (as measured by their fMRI) than individuals whose friends were less likely to be friends with each other (Schmälzle et al., ). 
Preliminary results from studies such as these hint at the prospect of building physiologically based, or at least physiologically informed, theories to explain the antecedents and outcomes of social networks. The power of gaining new insights by juxtaposing diverse disparate data led noted computer scientist Jim Hendler to herald the move away from big data to “broad” data (Hendler, ). The chapter by Spiro in this handbook describes the opportunities in government emergency response plans to link social media data posts by organizational entities with organizational-level features. In his chapter, Weber describes how he juxtaposed data from the Internet Archives with data from the Editor and Publisher Yearbook, as well as interviews, to turbocharge the explanatory power of theories of community ecology to explain the dynamics of change in the newspaper industry. While access to fMRI data was until recently unavailable, other sources of data were available in principle but were impractical to encode at scale. One such example is the coding of group interaction data in a way that includes details about which individual is directing remarks to which other individual(s) in the group. Encoding these interactions requires painstakingly careful attention to various nonverbal cues such as eye gaze, body posture, and conversational distance. Today, thanks to advances in the capture of high-resolution video data and machine learning algorithms, we are able to automate the detection and use of nonverbal cues as a way of accurately determining which member(s) in the group were the senders and the intended recipients of
interactions within the group (Mathur, Poole, Peña-Mora, Hasegawa-Johnson, & Contractor, ). The availability of these “big data from little teams” (Carter, Asencio, Wax, DeChurch, & Contractor, ) is leading to the development of new theories about how organizing in groups can be characterized by sequential structural signatures and to what extent the prevalence of distinct sequential structural signatures of interactions is systematically associated with how groups perform (Foucault Welles et al., ; Leenders, Contractor, & DeChurch, ; Schecter, Pilny, Leung, Poole, & Contractor, ). In addition to testing, advancing, and developing new theories about extant phenomena, computational social science also has the promise of advancing our theoretical understanding of new phenomena. Many of the same digital forces that have propelled the emergence of computational social science as a promising mode of intellectual inquiry have inspired not only the “effervescence of collective behavior” (Gonzalez-Bailon, ) but also the emergence of disruptive novel forms of organizing such as peer production (Benkler, Shaw, & Hill, ; Hendler, Hall, & Contractor, ) and flash organizations (Valentine et al., ). These novel, agile, and often ephemeral forms of organizing are in turn inviting the development of a new generation of technologies to assemble (Asencio et al., ) and enable (Zhou, Valentine, & Bernstein, ) teams as well as understand and theorize about what explains their effectiveness (Contractor, ; Wax, DeChurch, & Contractor, ; Mukherjee et al., ). The chapter by Benefield and Shen in this handbook examines pick-up groups (PUGs), which are ad hoc teams that come together temporarily (typically for a few hours) to accomplish a specific task in the MMOG EverQuest II. 
Likewise, the chapter by Spiro in this handbook reports on research that can inform the design of tools for disaster management that facilitate automated discovery of potential collaborators in the midst of an emergency. The discussion of peer production in the chapter by Hill and Shaw in this handbook challenges conventional notions of what constitutes a “team.” Are two individuals who contributed independently and asynchronously (say, a year apart) to a joint Wikimedia page on the same team? Will they be considered as being on a team if one commented on and/or edited another person’s contribution? Irrespective of whether or not we label them as a team, there is no argument that this is a new form of organized, coordinated activity that invites new theorizing. One of the collateral opportunities afforded by these new technologies is that they allow us to employ computational social science methods to study at very high resolution the actions and interactions of individuals during the stage at which they search, court, invite, or decline requests to form into teams—a process that has historically been invisible until after the team is formed and only if it forms. Of particular note is Hill and Shaw’s call for moving from single-platform to multiplatform studies of online platforms. These are crucial in enabling us to generate new theories about how variations in technological affordances of platforms might shape the processes and outcomes of organizing on those platforms. To summarize, computational social science over the last decade has demonstrated its potential to help us test existing theories at scale, extend these theories to offer more
nuanced insights, develop new theories made possible by the juxtaposition of data that were either unavailable or impractical to collect at scale, and develop theories about new phenomena that are gaining salience in the wake of many of the same digital advances that are fueling computational social science.

. C C S S M  D  N D C I  S C  O D?

.................................................................................................................................. The growth of computational social science would have been impossible without the windfall of digital trace data. A growing proportion of the data currently being deployed in the study of communication and organizational dynamics is drawn from the Web. And in almost all cases, the data being analyzed were not collected for research purposes. In many instances these were either server-side logs made available to researchers, often via APIs and sometimes under nondisclosure agreements (NDAs), or were scraped off the Web using scripts. All of these opportunistic data collection efforts rely on what Salganik () terms ready-made data, sometimes dismissively referred to as the inhalation of digital exhaust. Remarkable insights have been gleaned by analyzing these opportunistic data sources. Conducting network and text analytics on situation reports published daily on the Web during natural disasters made it possible to automate the generation and evaluation of the interorganizational networks engaged in disaster response—in close to real-time and without having to impose on the already busy responders (Varda, Forgette, Banks, & Contractor, ). The chapter by Spiro in this handbook demonstrates the theoretical and analytical strides that continue to be made by leveraging opportunistic data to study online communication from  official emergency management–related Twitter accounts dealing with disaster declarations over the span of fifteen months. An early example of this effort was our ability to understand how individuals organized into guilds and went on quests in MMOGs such as Sony Online Entertainment’s EverQuest II (Williams, Contractor, Poole, Srivastava, & Cai, ) and in virtual worlds such as Second Life (Foucault Welles & Contractor, ). 
These platforms also served as ideal crucibles to understand how the next generation of leaders, often as teens, were honing their teaming and leadership skills in these virtual environments (Reeves, Malone, & O’Driscoll, ). The chapter by Benefield and Shen in this book offers a compelling demonstration of the utility of such data to explore the impact of gender on networks. Weber () was among the first communication scholars to see the research value of not just studying the Web as it is at a
certain point but using the Internet Archives as the ultimate longitudinal opportunistic data source to study the dynamics of organizational—and indeed industry—changes. And the chapter by Hill and Shaw reports on their enormous success at curating one of the most definitive data sets from a population of peer-production sites based on the Wikimedia technology. Indeed, all four chapters in this section rely creatively and heavily on repurposing opportunistic online data. Notwithstanding the unprecedented opportunities they offer, these data also surfaced some important limitations that discourage our sole reliance on ready-made data. Recent changes in the policies of social media sites such as Facebook in closing down API access first to personal pages and more recently to group pages are a harbinger of what Freelon () has heralded as the “post-API age” for computational research. Aside from this potential shutout from digital trace data, most server-side logs were maintained by programmers for the primary purpose of debugging their software code. Organizations, increasingly recognizing the business potential of analyzing these data, are instrumenting the logs with those objectives in mind. Developers of the aforementioned MMOGs were among the first to recognize the potential of “re-instrumenting” server logs to include logging data that could provide insights for marketing, customer retention, and game design. More recently, developers of enterprise social media platforms such as Slack, Microsoft Teams, and Jive are also seeing the potential of conducting relational analytics using carefully instrumented logs to offer insights based on their clients’ use of these platforms (Leonardi & Contractor, ). There is clearly an opportunity for researchers to engage closely with such platform developers in developing mutually beneficial collaborations. 
These collaborations will entail transferring current insights from research into, for instance, the implementation of algorithms on these platforms, but also providing the research community with the ability to purposively instrument these platforms to log digital traces that are geared to addressing research questions rather than only to help debug software or drive business goals. These partnerships, while potentially promising, are not without risk. A collaboration can be abruptly terminated due to changes in key personnel, leadership, or ownership. In addition, the partnership will need to navigate significant intellectual property issues for the organization, and privacy issues for the users, that do not undermine the ability of the research to be published. King and Persily () propose an innovative model that includes creating an entity, Social Science One (SS), to explore a partnership between Facebook and universities brokered by the Social Science Research Council’s Social Data Initiative to conduct research on the effects of social media on democracy and elections. Alongside these approaches to engaging with organizations, it is also critical for the research community to innovate on the direct collection of data from participants unfettered by commercial constraints. Consider this the next generation extension of researchers designing carefully controlled experiments that relied on recruiting participants who came from primarily Western, educated, industrialized, rich, and democratic (WEIRD) populations (Henrich, ). Computational social scientists are
increasingly relying on online platforms such as Prolific.ac and Mechanical Turk (Mason & Suri, ) to recruit a more egalitarian participant pool, carefully curated to minimize unfair labor practices (Semuels, ) and the increasing threat of bot-assisted participants or participant-assisted bots (Dreyfuss, ). In addition, efforts such as the development and deployment of experiments on the web-based Volunteer Science platform (Radford et al., ) have demonstrated the potential to not only scale up the participant pool but also engage in a concerted effort to build a community of researchers coordinating on broader questions (such as a fairer, safer, more understanding Internet, in the case of CivilServant.io) from a number of studies that can be conducted, collated, and compared across a common participant pool. Beyond the Web, researchers are also recognizing the value of instrumenting humans directly in order to gain further insights about communication and organizational dynamics. The sociometric badge developed by Sandy Pentland and his team at the MIT Media Lab has been used to generate a new science of how to build teams (). Cattuto and colleagues () have demonstrated, as part of the SocioPatterns project, the use of RFID technology to track collaboration networks, for instance at interdisciplinary scientific conferences. In summary, while computational social science was catalyzed by the ability to opportunistically analyze large tracts of digital trace (or exhaust) data, the next generation of computational social science must consider more purposive instrumentation of online environments as well as personal wearable devices and apps offering what Salganik () refers to as “custom-made” data.

. C C S S M  D  N M  S C  O D?

.................................................................................................................................. It is a well-established adage that the methods we are acquainted with shape the questions we ask (Monge, ). The high-resolution temporal data on actions, interactions, and transactions have challenged not only our theoretical explanatory frameworks, but also the limits of our methodological tools. For instance, until the past decade, our understanding of communication and organizational network dynamics was premised on the assumption that we had panels of longitudinal network data at discrete time intervals. The methods of choice to analyze these data were, for instance, stochastic-actor-oriented models (Snijders, ). However, the advent of timestamped data chronicling every single relational event between a sender and a receiver propelled the development of a new approach to modeling network dynamics: relational event models (Brandes, Lerner, & Snijders, ; Butts, ). These models
explain the timing as well as the sender and receiver of every relational event as a conditional function of all previous relational events in the organizational context (Leenders et al., ; Pilny, Schecter, Poole, & Contractor, ). Many extant theories of organizing posit macro-emergent states as being shaped, leveraged, and aligned with microprocess mechanisms (Kozlowski & Ilgen, ). However, as Kozlowski () notes, we have stopped short of precisely articulating, let alone testing, the temporal and sequential unfolding of these microprocess mechanisms. Relational event models provide a framework to posit these microprocess mechanisms as precise sequential structural signatures and test if the prevalence of these signatures is associated with certain emergent states. Consider the well-established body of research going back twenty-five years, relating boundary spanning in organizational teams to performance (Ancona & Caldwell, ). This research has generated mixed results on the impact of boundary spanning on performance (Marrone, ). Relational event models have the potential of taking collapsed data on boundary spanning and parsing it as a sequence of directed interactions: for instance, who spoke when with whom outside the team, and was it preceded or followed by an interaction within the team? Sequential structural signatures, such as these, have the potential to theoretically enrich our understanding of boundary spanning and potentially disambiguate the mixed results found in prior research. In her chapter, Spiro notes that one of the more striking results was the fact that the underlying social network among the emergency management organizations did not change during a crisis, even one that was severe.
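To make the idea of relational event data concrete, the sketch below counts one simple sequential structural signature, reciprocation, in a stream of timestamped sender–receiver events. The event list and function are illustrative inventions, a stand-in for the statistics a full relational event model would condition on, not code from any study cited here.

```python
# Timestamped relational events: (time, sender, receiver)
events = [(1, "A", "B"), (2, "B", "A"), (3, "A", "C"),
          (4, "C", "A"), (5, "B", "C"), (6, "A", "B")]

def reciprocation_count(events):
    """Count events that 'close' a reciprocal signature: the current
    sender directs an event back to someone who previously addressed
    them. One of many sequential signatures a relational event model
    could score for each candidate (sender, receiver, time) triple."""
    seen = set()        # directed pairs observed so far
    reciprocated = 0
    for _, sender, receiver in sorted(events):
        if (receiver, sender) in seen:   # replying to a prior event
            reciprocated += 1
        seen.add((sender, receiver))
    return reciprocated

n_reciprocated = reciprocation_count(events)
```

In the same spirit, a boundary-spanning signature would ask whether an out-of-team event is preceded or followed by a within-team event, exactly as described above.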
Interestingly, she has the data that will enable us to consider the possibility that while the “snapshot” structure of the network might not have changed, the sequential structure of how the network links unfolded—inferred using relational event models—might look very different in a crisis. These sequential structural signatures are being augmented and enriched by methodological advances in text analytics. While content analysis (Krippendorff, ) has been a mainstay of social science research for decades, the increasing availability of text data in digital form and novel computational techniques are changing the scale and scope of our ability to utilize them as useful “telescopes” to probe human attitudes and behavior (Gonzalez-Bailon & Paltoglou, ). The turn of the century witnessed the development of several topic-modeling techniques such as latent semantic analysis (Dumais, ) and latent Dirichlet allocation (Blei, Ng, & Jordan, ). These were developed by computer scientists who were “much better at building powerful telescopes than at knowing where to point them” (Golder & Macy, , p. ). Meanwhile, social scientists such as Pennebaker and his colleagues () developed tools such as Linguistic Inquiry and Word Count (LIWC) that were less computationally sophisticated but easier to use and interpret (they were word counts) by the social science community. For instance, these analyses revealed that leadership is closely related to the use of collective pronouns such as “we” and “us” rather than “I” or “me.” More recently there has been a move from “frequency counts” to mapping meaning. These employ vector space models (VSMs) of semantics (Turney & Pantel, ) that leverage a large corpus of text from locations such as Google (Le & Mikolov, ;
Mikolov, Sutskever, Chen, Corrado, & Dean, ). The chapter by Spiro discusses how topic modeling can illustrate differences in the content of communication among emergency management organizations between emergency and nonemergency event days. It can also glean the differences in topics communicated between organizations representing different functional roles, thereby adding content to what was previously an interorganizational contact network. The prevalence of large volumes of high-resolution data has also accelerated methodological developments in computational modeling of the dynamics of communication and organizational systems. When there was a dearth of dynamic empirical data, computational (and more specifically agent-based) models focused, by necessity, on developing simple, stylized models of social phenomena to explore how changes in inputs or mechanisms might impact emergent outcomes. For instance, simple computational models were able to demonstrate the plausibility of preferential attachment as a theoretical mechanism to explain the widespread prevalence of scale-free social networks (Wilensky, ). These were often referred to as intellective computational models (Pew & Mavor, ). The parameters in these computational models were often arbitrarily chosen and defended on theoretical grounds and/or resulted in emergent outcomes that were robust to modest changes in the values of these parameters. Today, with the availability of temporal data, we are witnessing a surge in the development of emulative computational models (Carley & Hirshman, ). These much larger models seek to emulate in substantial detail the dynamic features and characteristics of a specific team or organization (Carley, ). They have a much larger number of parameters; in the past, the modeler would have to specify values of the parameters informed by theories or the context being modeled. 
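The preferential-attachment mechanism mentioned above is easy to sketch as a stylized, intellective model. This is a minimal illustration under assumed settings (each new node adds a single link; the seed network is one edge), not a reimplementation of the NetLogo model cited:

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node at a time: each newcomer links to an existing
    node chosen with probability proportional to its current degree
    (implemented by sampling uniformly from a running list of edge endpoints)."""
    rng = random.Random(seed)
    endpoints = [0, 1]                    # seed network: a single edge 0--1
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)    # degree-proportional choice
        endpoints.extend([new, target])
    return Counter(endpoints)             # node -> degree

degrees = preferential_attachment(10_000)
# A handful of early nodes accumulate far more links than the typical node,
# yielding the heavy-tailed degree distribution characteristic of scale-free networks.
print("max degree:", max(degrees.values()))
```

The endpoint-list trick is the standard shortcut for degree-proportional sampling: a node with degree k appears k times in the list, so a uniform draw is automatically biased toward hubs.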
However, spurred by the availability of large amounts of dynamic empirical data, recent advances obviate the need for modelers to specify parameters. Instead, we are able to leverage novel genetic algorithms and optimization techniques to empirically estimate these parameters (Stonedahl & Wilensky, ; Sullivan, Lungeanu, DeChurch, & Contractor, ; Thiele, Kurth, & Grimm, ). Using empirical data to estimate parameters in a computational model makes it analogous to a statistical (e.g., regression) model. This semblance has the potential to assuage the skepticism of traditional social science researchers, who have been understandably wary of deriving insights from computational models in which the specification of the parameters was (arguably) arbitrary. The proliferation of data available on the formation and performance of millions of overlapping teams on online platforms, such as Wikipedia, GitHub, and Kaggle, has also motivated a renewed interest in the development of methodologies leveraging hypergraph methods that represent teams as hyperedges rather than as collections of edges that fail to preserve the team’s entitativity (Lickel, Hamilton, & Sherman, ). Put simply, a team of three individuals can be represented in a network by three nodes connected by three edges. However, this representation loses information about whether this is one team of three individuals or three teams of pairs of individuals. If our goal is to study why individuals assemble into teams, there is a fundamental difference between explaining why A, B, and C (a hyperedge) assembled into one
team, versus why A and B, A and C, and B and C paired (as edges) into separate teams. Likewise, if our goal is to understand the impact of communication on an organizational outcome such as performance, it is fundamentally different to assess the impact of a (face-to-face or email) private interaction between A and B and another between A and C (both edges) versus a joint interaction involving A, B, and C (a hyperedge). Clearly, edges and graph theory are not the most appropriate way to analyze how collectives form and perform from a network perspective. In response, we have seen advances leveraging the study of hypergraphs, in which a hyperedge, unlike an edge, is not confined to connecting only two nodes (Berge, ; Ghasemian, Zamanifar, & Ghasem-Aghaee, ; Taramasco, Cointet, & Roth, ). Hypergraphs also enable us to measure the overlap between two teams, defined as a team interlock. Team interlocks have been shown to be important predictors of the success of scientific teams (Lungeanu, Carter, DeChurch, & Contractor, ). While not explicitly invoking hypergraph methods, the chapter by Hill and Shaw in this handbook invokes hypergraph thinking by explaining the success (and failure) of peer production sites, as well as the spread of ideas and practices across sites, based on overlapping membership across communities. More recently, there has also been an effort to extend relational event models, discussed previously, to model not just an edge (a dyadic relationship between two nodes) but a hyperedge event that represents, for instance, an email sent by one person to two or more others (Kim, Schein, Desmarais, & Wallach, ). 
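The A-B-C example can be made concrete in a few lines. A minimal sketch (the rosters are hypothetical, and frozensets stand in for hyperedges):

```python
from itertools import combinations

# Hypothetical rosters: one triad versus three dyads over the same people.
one_triad   = [frozenset({"A", "B", "C"})]
three_dyads = [frozenset({"A", "B"}), frozenset({"A", "C"}), frozenset({"B", "C"})]

def to_edges(teams):
    """Collapse hyperedges into pairwise edges -- the lossy graph representation."""
    return {frozenset(pair) for team in teams for pair in combinations(team, 2)}

def interlock(team_a, team_b):
    """Team interlock: the number of members two teams share."""
    return len(team_a & team_b)

print(to_edges(one_triad) == to_edges(three_dyads))  # True: the edge view cannot tell them apart
print(interlock(frozenset({"A", "B", "C"}), frozenset({"B", "C", "D"})))  # 2
```

Both rosters collapse to the identical edge set {A-B, A-C, B-C}, which is precisely the loss of entitativity the text describes; keeping the hyperedges intact preserves the distinction and makes overlap measures like team interlocks trivial to compute.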
In addition to propelling new methodological advances in areas such as relational event models, text analytics, computational agent-based models, and hypergraphs, computational social science has also invited careful re-examination of the classic approaches to research design and causal inference (Cook, Campbell, & Shadish, ). The excitement associated with the influx of computational social science scholarship from multiple (including non-social-science) disciplines rushing into this uncharted territory is reminiscent of what then Federal Reserve chairman Alan Greenspan referred to as the “irrational exuberance” associated with the dot-com bubble in the s. This excitement has to be tempered with careful reflection on how these new modes of asking and answering questions require us to “modernize—but not replace” (Salganik, , p. ) the classic approaches. For instance, we know that in large samples p-values rapidly go toward zero, undermining our traditional norms of using them to conduct tests of statistical significance. This has led, for instance, to calls for reporting effect sizes in addition to p-values, to safeguard against making intellectual claims that are statistically significant but have no “practical significance” (Lin, Lucas, & Shmueli, , p. ), or, more radically, for changing the default p-value threshold from . to . (Benjamin et al., ). More generally, computational social science is prompting us to reconsider the classical debates between theory-driven research (TDR) and data-driven research (DDR). Over the past half-century, there has been a strong preference for TDR over DDR to advance our understanding of communication and organizational dynamics. Often referred to pejoratively as “dust
bowl empiricism,” DDR has long been viewed with skepticism; in hindsight, part of that skepticism might have been grounded in the paucity of, or the inability to procure, large numbers of independent data sets in which exploratory insights from one data set could be confirmed on another. Pure DDR has been fairly criticized for asking questions driven by the availability of data. This has led to comparisons with the drunken man who looked for his keys under the lamppost, not because that was where he lost them, but because that was where the light was. Indeed, given the attention focused on data collected from Twitter, the New York Times (Zimmer, ) asked with tongue in cheek if “Twitterology” was a new science. Unfortunately, the same can also be said for much of pure TDR. We have a propensity to ask questions that lend themselves to being addressed by leveraging existing theories: looking for answers under the proverbial theoretical lamppost. This has led to calls for “taking off the theoretical straightjacket” (Schwarz & Stensaker, ). Benefield and Shen’s chapter in this volume explicitly pursues a dual TDR and DDR approach. And the chapters by Hill and Shaw, Weber, and Spiro all report research using multiple methods to address the TDR-DDR cycle. The debate between TDR and DDR has been joined by those who champion the value of phenomenon-driven research (PDR) (Schwarz & Stensaker, ). The importance of attempting to understand and help improve organizational phenomena was well captured in Lewin’s classic adage that “nothing is quite so practical as a good theory” (Lewin, , p. ). PDR is motivated by a desire to solve problems associated with real-world phenomena (Watts, ). Advances in digital technologies have triggered a slew of new, or at least dramatically scaled-up, organizational phenomena, leading to novel communication and organizational dynamics that need to be understood and enabled. 
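The large-sample p-value issue raised earlier is easy to demonstrate numerically. A minimal simulation (all values are illustrative; `delta` is an arbitrary tiny true mean difference, and the test is a simple two-sample z-test):

```python
import math
import random

def p_and_effect(n, delta=0.02, seed=0):
    """Simulate two groups of size n whose true means differ by a tiny `delta`;
    return (two-sided p-value, Cohen's d) from a two-sample z-test."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(delta, 1.0) for _ in range(n)]
    ma, mb = sum(a) / n, sum(b) / n
    pooled_var = (sum((x - ma) ** 2 for x in a)
                  + sum((x - mb) ** 2 for x in b)) / (2 * n - 2)
    d = (mb - ma) / math.sqrt(pooled_var)   # effect size: stable as n grows
    z = d * math.sqrt(n / 2)                # test statistic: grows with sqrt(n)
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value
    return p, d

for n in (1_000, 100_000, 1_000_000):
    p, d = p_and_effect(n)
    # The p-value collapses toward zero as n grows, while d stays near 0.02.
    print(f"n={n:>9,}  p={p:.3g}  d={d:.3f}")
```

The effect size hovers around the tiny true difference regardless of sample size, while the p-value is driven arbitrarily close to zero: statistical significance without practical significance.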
Understanding these phenomena should not rely solely on existing TDR, nor should it seek to generate insights de novo solely from DDR. Instead, PDR invites a delicate iterative waltz between TDR and DDR, leveraging existing theory, modifying it to better fit the data, and sometimes necessitating the development of a new, more parsimonious theory, which must then be tested with new data (Mathieu, ). This iterative dance is distinct from deductive or inductive inference and is referred to as abductive inference (Haig, ; Halas, ; Meyer & Lunnay, ; Ren et al., ). The data-driven segment of this iterative loop lends itself well to the use of data mining and machine learning techniques to classify and predict certain outcomes. These approaches will benefit immensely from recent interest in the interpretability of machine learning algorithms, which seeks to uncover the logic of the algorithms making the classifications and predictions (Wang, Rudin, Doshi-Velez, Liu, Klampfl, & MacNeille, ; Vellido, Martin-Guerrero, & Lisboa, ). Interpretable algorithms have the potential to inform the TDR segment of this iterative loop. They will also help address an enduring debate in the social sciences about the relative merits of prediction versus explanation (Hofman, Sharma, & Watts, ). A decade after the essay on computational social science in the journal Science (Lazer et al., ), and in part propelled by its growth, we are seeing a concerted and organized effort by journals, institutions, and funding agencies to help evolve the norms and
incentives associated with social science inquiry. Journals such as Nature Human Behaviour solicit, as one form of submission, a “registered report,” in which methods and proposed analyses are preregistered and reviewed prior to data collection. If the review is favorable and the research is conducted as proposed, the results are guaranteed to be published irrespective of the findings, thereby counteracting the bias whereby, for instance, statistically significant results are three times more likely to be published than papers with null results (Dickersin, Chan, Chalmers, Sacks, & Smith, ). Institutions such as the Center for Open Science are creating platforms like the Open Science Framework (https://osf.io/) to serve as “a scholarly commons to connect the entire research cycle,” with the goal of promoting transparency, openness, and reproducibility (Nosek et al., ). By way of incentives, they award “badges” to articles for preregistering a research plan to ward off accusations of p-hacking (Simonsohn, Simmons, & Nelson, ) and HARKing (Kerr, ), as well as for making available on the platform the full data sets and code used. Some have compared this form of accreditation to the LEED certification for environmentally designed buildings. It is therefore not surprising that many of these ideas are at the core of two major funding initiatives, Next Generation Social Science and Ground Truth, at the US Defense Advanced Research Projects Agency (DARPA) (Rogers, ), and The Future of Work at the Human-Technology Frontier, one of the  “Big Ideas” initiatives at the US National Science Foundation (NSF).

A Preparation of this chapter was supported by the following research grants: US Army Research Laboratory (WNF---), National Science Foundation (IIS-), DARPA (HRC), and NASA (NNXAMG). My thanks to Yessica Herrera for providing me a digital recording of my presentation on this topic at the Conference on Complex Systems  in Cancun, Mexico.

R Ancona, D. G., & Caldwell, D. F. (). Bridging the boundary: External activity and performance in organizational teams. Administrative Science Quarterly, (), . http://doi.org/./ Aral, S., & Nicolaides, C. (). Exercise contagion in a global social network. Nature Communications, , . http://doi.org/./ncomms Asencio, R., Huang, Y., DeChurch, L. A., Contractor, N. S., Sawant, A., & Murase, T. (). The MyDreamTeam builder: A recommender system for assembling & enabling effective teams. Paper presented at the Interdisciplinary Network for Group Research, Pittsburgh, PA. Benefield, G. A., & Cuihua Shen. (). Gender and networks in virtual worlds. In B. Foucault Welles & S. Gonzales-Bailon (Eds.), Handbook of communication in the networked age. Oxford: Oxford University Press.
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., et al. (). Redefine statistical significance. Nature Human Behaviour, (), –. http://doi.org/./s---z Benkler, Y., Shaw, A., & Hill, B. M. (). Peer production: A form of collective intelligence. In Malone, T. W. & Bernstein, M. S. (Eds.) Handbook of Collective Intelligence (pp. –). Cambridge, MA: MIT Press. Berge, C. (). Graphs and hypergraphs. Amsterdam: North Holland Publishing Company. Blei, D. M., Ng, A. Y., & Jordan, M. I. (). Latent Dirichlet allocation. Journal of Machine Learning Research, (–), –. http://doi.org/./jmlr...-. Brandes, U., Lerner, J., & Snijders, T. A. B. (). Networks evolving step by step: Statistical analysis of dyadic event data (pp. –). Paper presented at the International Conference on Advances in Social Network Analysis and Mining (ASONAM). Burt, R. S. (). Structural holes: The social structure of competition. Cambridge, MA: Harvard University Press. Butts, C. T. (). A relational event framework for social action. Sociological Methodology, (), –. http://doi.org/./j.-...x Carley, K. M. (). Computational modeling for reasoning about the social behavior of humans. Computational & Mathematical Organization Theory, (), –. Carley, K. M., & Hirshman, B. (). Agent based model. In G. A. Barnett (Ed.), Encyclopedia of social networks (Vol. I, pp. –). Thousand Oaks, CA: SAGE Publications. Carter, D. R., Asencio, R., Wax, A., DeChurch, L. A., & Contractor, N. S. (). Little teams, big data: Big data provides new opportunities for teams theory. Industrial and Organizational Psychology, (), –. http://doi.org/./iop.. Cattuto, C., Van den Broeck, W., Barrat, A., Colizza, V., Pinton, J.-F., & Vespignani, A. (). Dynamics of person-to-person interactions from distributed RFID sensor networks. 
PLoS One, (), e–. http://doi.org/./journal.pone. Contractor, N. (). Some assembly required: Leveraging Web science to understand and enable team assembly. Philosophical Transactions of the Royal Society a: Mathematical, Physical and Engineering Sciences, (), –. http://doi.org/./ rsta..&domain=pdf&date_stamp Cook, T. D., Campbell, D. T., & Shadish, W. (). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin. Dickersin, K., Chan, S. S., Chalmersx, T. C., Sacks, H. S., & Smith, H. Jr. (). Publication bias and clinical trials. Controlled Clinical Trials, (), –. Dreyfuss, E. (, September ). A bot panic hits Amazon’s Mechanical Turk. Retrieved from https://www.wired.com/story/amazon-mechanical-turk-bot-panic/ Dumais, S. T. (). Latent semantic analysis. Annual Review of Information Science and Technology., (), –. Eagle, N., Macy, M., & Claxton, R. (). Network diversity and economic development. Science, (), –. http://doi.org/./science. Festinger, L. (). A theory of social comparison processes. Human Relations, (), –. http://doi.org/./ Foucault Welles, B., & Contractor, N. S. (). Individual motivations and network effects: A multi-level analysis of the structure of online social relationship. Annals of the American Academy of Political and Social Science, (), –. http://doi.org/./ Foucault Welles, B., Welles, B. F., Vashevko, A., Vashevko, A., Bennett, N., Bennett, N., et al. (). Dynamic models of communication in an online friendship network. Communication Methods and Measures, (), –. http://doi.org/./..
Freelon, D. (). Computational research in the post-API age. Political Communication. doi: ./.. Ghasemian, F., Zamanifar, K., & Ghasem-Aghaee, N. (). Composing scientific collaborations based on scholars’ rank in hypergraph. Information Systems Frontiers, (), . http://doi.org/./s---z Golder, S. A., & Macy, M. W. (). Digital footprints: Opportunities and challenges for online social research. Annual Review of Sociology, (), –. http://doi.org/./annurev-soc-- Gonzalez-Bailon, S. (). Decoding the social world. Cambridge, MA: MIT Press. Gonzalez-Bailon, S., & Paltoglou, G. (). Signals of public opinion in online communication: A comparison of methods and data sources. The ANNALS of the American Academy of Political and Social Science, (), –. http://doi.org/./ Haig, B. D. (). An abductive theory of scientific method. Psychological Methods, (), –. http://doi.org/./-X... Halas, M. (). Abductive reasoning as the logic of agent-based modelling. In T. Burczynski, J. Kolodziej, A. Byrski, & M. Carvalho (Eds.), Proceedings of the th European Conference on Modelling and Simulation (pp. –). Digital Library of the European Council for Modelling and Simulation. http://doi.org/./-- Hendler, J. (). Broad data: Exploring the emerging web of data. Big Data, (), –. http://doi.org/./big.. Hendler, J. A., Hall, W., & Contractor, N. (). Web science—now more than ever. IEEE Computer, (), –. http://doi.org/./MC.. Henrich, J., Heine, S. J., & Norenzayan, A. (). The weirdest people in the world? Behavioral and Brain Sciences, (–), –. http://doi.org/./SXX Hill, B. M., & Shaw, A. (). Studying populations of online communities. In B. Foucault Welles & S. 
Gonzalez-Bailon (Eds.), Handbook of communication in the networked age. Oxford: Oxford University Press. Hofman, J. M., Sharma, A., & Watts, D. J. (). Prediction and explanation in social systems. Science, , –. Kerr, N. L. (). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, (), –. http://doi.org/./spspr_ Kim, B., Schein, A., Desmarais, B. A., & Wallach, H. (). The Hyperedge Event Model. arXiv preprint arXiv:., –. King, G., & Persily, N. (, October ). A new model for industry-academic partnerships. Working paper. Cambridge, MA: Harvard University. Kozlowski, S. W. J. (). Advancing research on team process dynamics. Organizational Psychology Review, (), –. http://doi.org/./ Kozlowski, S. W. J., & Ilgen, D. R. (). Enhancing the effectiveness of work groups and teams. Psychological Science in the Public Interest, (), –. Krippendorff, K. (). Content analysis (Vol. , pp. –). Beverly Hills, CA: Sage Publications. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabasi, A. L., Brewer, D., et al. (). Life in the network: The coming age of computational social science. Science, (), –. http://doi.org/./science. Le, Q., & Mikolov, T. (). Distributed representations of sentences and documents (pp. –). Paper presented at the ICML ’: Proceedings of the st International Conference on Machine Learning.
Leenders, R. T. A. J., Contractor, N. S., & DeChurch, L. A. (). Once upon a time: Understanding team processes as relational event networks. Organizational Psychology Review, (), –. http://doi.org/./ Leonardi, P., & Contractor, N. S. (). Better People Analytics: Measure who they know, not just who they are. Harvard Business Review, (), –. Lickel, B., Hamilton, D. L., & Sherman, S. J. (). Elements of a lay theory of groups: Types of groups, relational styles, and the perception of group entitativity. Personality and Social Psychology Review, (), –. http://doi.org/./SPSPR_ Lin, M., Lucas, H. C. Jr., & Shmueli, G. (). Research commentary—too big to fail: Large samples and the p-value problem. Information Systems Research, (), –. Lungeanu, A., Carter, D. R., DeChurch, L. A., & Contractor, N. S. (). How team interlock ecosystems shape the assembly of scientific teams: A hypergraph approach. Communication Methods and Measures, (–), –. http://doi.org/./.. Marrone, J. A. (). Team boundary spanning: A multilevel review of past research and proposals for the future. Journal of Management, (), –. http://doi.org/./  Mason, W., & Suri, S. (). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods, (), –. http://doi.org/./s--- Mathieu, J. E. (). The problem with [in] management theory. Journal of Organizational Behavior, (), –. Mathur, S., Poole, M. S., Peña-Mora, F., Hasegawa-Johnson, M., & Contractor, N. (). Detecting interaction links in a collaborating group using manually annotated data. Social Networks, (), –. http://doi.org/./j.socnet... Meyer, S. B., & Lunnay, B. (). The application of abductive and retroductive inference for the design and analysis of theory-driven sociological research. 
Sociological Research Online, (), –. http://doi.org/./sro. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. –). MIT Press. Mukherjee, S., Huang, Y., Neidhardt, J., Uzzi, B., & Contractor, N. (). Prior shared success predicts victory in team competitions. Nature Human Behaviour, , . http://doi.org/ ./s---y Monge, P. R. (). Systems theory and research in the study of organizational communication: The correspondence problem. Human Communication Research, (), –. Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., et al. (). Scientific standards: Promoting an open research culture. Science, (), –. Pennebaker, J. W., Booth, R. J., Boyd, R. L., & Francis, M. E. (). Linguistic Inquiry and Word Count: LIWC. Austin, TX: Pennebaker Conglomerates (www.LIWC.net). Pentland, A. S. (, April). The new science of building great teams. Harvard Business Review, –. Pew, R. W., & Mavor, A. S. (). Modeling human and organizational behavior. Washington, DC: National Academies Press. http://doi.org/./ Pilny, A., Schecter, A., Poole, M. S., & Contractor, N. (). An illustration of the relational event model to analyze group interaction processes. Group Dynamics: Theory, Research, and Practice, (), –. http://doi.org/./gdn Radford, J., Pilny, A., Reichelmann, A., Keegan, B., Welles, B. F., Hoye, J., et al. (). Volunteer science. Social Psychology Quarterly, (), –. http://doi.org/./ 
Reeves, B., Malone, T. W., & O’Driscoll, T. (). Leadership’s online labs. Harvard Business Review, (), –. Ren, Y., Cedeno-Mieles, V., Hu, Z., Deng, X., Adiga, A., Barrett, C. L., et al. (). Generative modeling of human behavior and social interactions using abductive analysis (pp. –). Paper presented at the International Conference on Advances in Social Network Analysis and Mining (ASONAM). Rogers, A. (, July ). Darpa wants to build a BS detector for science. Retrieved from https://www.wired.com/story/darpa-bs-detector-science/ Salganik, M. J. (). Bit by bit: Social research in the digital age. Princeton, NJ: Princeton University Press. Schecter, A., Pilny, A., Leung, A., Poole, M. S., & Contractor, N. (). Step by step: Capturing the dynamics of work team process through relational event sequences. Journal of Organizational Behavior, (), –. http://doi.org/./job. Schmälzle, R., Brook O’Donnell, M., Garcia, J. O., Cascio, C. N., Bayer, J., Bassett, D. S., et al. (). Brain connectivity dynamics during social interaction reflect social network structure. Proceedings of the National Academy of Sciences of the United States of America, (), –. http://doi.org/./pnas. Schwarz, G., & Stensaker, I. (). Time to take off the theoretical straightjacket and (re-)introduce phenomenon-driven research. The Journal of Applied Behavioral Science, (), –. http://doi.org/./ Semuels, A. (, September ). The Internet is enabling a new kind of poorly paid hell. Retrieved from https://www.theatlantic.com/business/archive///amazon-mechanical-turk// Simonsohn, U., Simmons, J. P., & Nelson, L. D. (). Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a reply to Ulrich and Miller. Journal of Experimental Psychology. General, (), –. http://doi.org/ ./xge Snijders, T. A. B. (). 
Models for longitudinal network data. Models and Methods in Social Network Analysis, , –. Stonedahl, F., & Wilensky, U. (). Finding forms of flocking: Evolutionary search in ABM parameter-spaces (pp. –). Paper presented at the International Workshop on MultiAgent Systems and Agent-Based Simulation, Springer. Sullivan, S. D., Lungeanu, A., DeChurch, L. A., & Contractor, N. S. (). Space, time, and the development of shared leadership networks in multiteam systems. Network Science, (), –. http://doi.org/./nws.. Taramasco, C., Cointet, J.-P., & Roth, C. (). Academic team formation as evolving hypergraphs. Scientometrics, (), –. http://doi.org/./s--- Thiele, J. C., Kurth, W., & Grimm, V. (). Facilitating parameter estimation and sensitivity analysis of agent-based models—a cookbook using NetLogo and “R”. Journal of Artificial Societies and Social Simulation, (),  http://doi.org/./jasss. Turney, P. D., & Pantel, P. (). From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research, , –. Valentine, M. A., Retelny, D., To, A., Rahmati, N., Doshi, T., & Bernstein, M. S. (). Flash organizations (pp. –). Paper presented at the  CHI Conference, New York, New York. http://doi.org/./. Varda, D. M., Forgette, R., Banks, D., & Contractor, N. (). Social network methodology in the study of disasters: Issues and insights prompted by post-Katrina research. Population Research and Policy Review, (), –. http://doi.org/./s---
Vellido, A., Martin-Guerrero, J. D., & Lisboa, P. J. (). Making machine learning models interpretable (Vol. , pp. –). Paper presented at the ESANN, Citeseer. Wang, T., Rudin, C., Doshi-Velez, F., Liu, Y., Klampfl, E., & MacNeille, P. (). A Bayesian framework for learning rule sets for interpretable classification. The Journal of Machine Learning Research, (), –. Watts, D. J. (). Should social science be more solution-oriented? Nature Human Behaviour, (), –. http://doi.org/./s-- Wax, A., DeChurch, L. A., & Contractor, N. S. (). Self-organizing into winning teams: Understanding the mechanisms that drive successful collaborations. Small Group Research, (), –. http://doi.org/./ Weber, M. S. (). The new dynamics of organizational change. In B. Foucault Welles & S. Gonzalez-Bailon (Eds.), Handbook of communication in the networked age. Oxford: Oxford University Press. Wilensky, U. (). NetLogo preferential attachment model. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, Illinois. Retrieved from http://ccl.northwestern.edu/netlogo/models/PreferentialAttachment Williams, D., Contractor, N., Poole, M. S., Srivastava, J., & Cai, D. (). The virtual worlds exploratorium: Using large-scale data and computational techniques for communication research. Communication Methods and Measures, (), –. http://doi.org/./.. Zhou, S., Valentine, M., & Bernstein, M. S. (). In search of the dream team (pp. –). Paper presented at the  CHI Conference, New York, New York. http://doi.org/./. Zimmer, B. (, October ). Twitterology: A new science? New York Times, p. SR. Retrieved from https://www.nytimes.com////opinion/sunday/twitterology-a-new-science.html

  ......................................................................................................................

      ......................................................................................................................

 . 

O change is an important and evolving arena of academic research. Given the breadth of research on organizational change, it is perhaps difficult to conceive that there are understudied aspects, or that previously understudied dynamics of change may exist. And yet scholars continue to point to new dynamics by which individuals organize, and in turn, by which organizations emerge and evolve. Consider the Arab Spring revolutions, which began in Tunisia in December . The movement highlighted the potential of social media as a new tool for enabling distributed forms of organizing social movements. As noted in a recent study of the uses of Twitter during the Arab Spring revolt (Lotan et al., : p. ): “Participants began labeling messages discussing the uprisings with #sidibouzid, effectively indexing the Tunisian Revolution through a hashtag.” In the example of the Arab Spring, the hashtag became an organizing principle for those involved in various aspects of the revolutions, providing a basis for organizing and coordinating. Traditional dynamics of organizing were still at play, but the organization of information via a social platform created new challenges for coordination and control of information flow. For example, although central actors initiated the use of particular hashtags, the lack of a centralized control structure allowed hashtags to be appropriate for alternative uses. Coordination occurred through a variety of channels, including mobile telephony. In order to remain resilient and functional, the organizational system continually evolved. Dark networks are another instance of new organizational dynamics. For example, advances in social network analysis paired with improved access to digital databases have allowed researchers to map and examine criminal networks such as drug trafficking syndicates (Bright et al., ). 
Moreover, research on dark networks has provided new insights into the mechanisms of resiliency and control that help to create robust organizational structures (Milward & Raab), for instance, by creating structures of redundancy that exist outside of traditional hierarchies.

Organizing on social media and in dark networks could be viewed as extremes of new organizational patterns, but shifts in traditional notions of organizing exist within traditional industries as well. For instance, new forms of information and communication technology (ICT) have had a strong impact on the news media industry in the United States, and the pace with which news organizations have evolved is not easily explained by existing theories. From a global perspective, strategic decisions have been shown to have a stronger effect than previously theorized (Weber & Monge). Elsewhere, while scholars have long theorized about the potential for networked organizational forms (see Sproull & Kiesler), the rise of new ICT, such as enterprise social media platforms, has helped some traditional hierarchical organizations become more adaptive and fluid by providing new ways for organizational members to share information and coordinate work with one another (Gibbs et al.). New ICT may allow individuals within such organizations to communicate outside the boundaries of existing hierarchies or existing communication patterns, and in turn, such communication can decrease the reliance on existing hierarchical patterns. New forms of organizing are often linked to advances in ICT (Mohrman et al.), but they also extend from traditional organizations' need for increased adaptability in their structures (Hendriks). In the same vein, recent work on organizational change leverages recent advances in the type of data available to scholars, as well as in the types of research methods implemented. Through an interdisciplinary approach to organizational change, new scholarship points to understudied mechanisms and an agenda for future research.
In order to develop new organizational theories in this domain, scholars have focused on the intersection of task-, social-, and technology-based views of organizations and advocated for work across disciplines. To this end, the growth of computational social science (Lazer et al.), focusing on theory through a lens of data analytics, and web science (Berners-Lee et al.), focusing on theory through the study of digital trace data, further embodies the interdisciplinary nature of this growing arena of research. This chapter explores the new dynamics of organizational change based on recent research leveraging new methods and new types of data. Throughout the chapter, a case study of organizational change in the news media industry is used to frame the conversation.

. O D  D

The study of organizational change is inherently multidisciplinary and interdisciplinary. Change processes can be considered from a variety of perspectives, with each vantage point leading to a different set of focal questions. Research on change
implemented through organizational design (e.g., Amburgey et al.) frames change processes as a management issue. Alternatively, organizational change can be viewed as an issue of communication, focusing on the way in which change is communicated throughout an organization (e.g., Lewis) or on the way in which an organization represents its changing identity through public communication (e.g., Rooney et al.). Sociologists, on the other hand, look at broader issues of change in organizational demographics and at processes of organizational legitimacy as dynamic elements of organizational change. Further afield, related work can be found in computer science (McGowan et al.), psychology (Weick & Quinn), education (Fusarelli), and information science (Avgerou). Management research has produced many of the foundational texts on the topic. For instance, Schumpeter was one of the first to theorize about the role of disruptions in driving organizational change. Others have theorized more recently about the manner by which organizations replicate and emerge as institutions (Meyer & Rowan) or evolve as a result of pressure from competitors and other macro-level forces (Hannan & Freeman). Elsewhere, computer scientists have produced a body of work that speaks to many of the core challenges associated with organizational change. Scholars have sought to meld analyses of organizational change processes with artificial intelligence systems (Nobre et al.). Considering the role of technological disruptions, scholars have also examined the impact of systems design on organizations (Khajeh-Hosseini et al.), the impact of various architectures on successful adoption (Bieberstein et al.), and the design of better metrics for measuring system impact (Fan et al.).
The communication perspective is relatively nascent, with roots in the mid-twentieth century (Rogers), yet scholars from this tradition have focused on the challenges of organizational change since the field was in its infancy (Redding). Interdisciplinary by nature, communication offers a starting point from which researchers can begin to address the new dynamics of organizational change. Modern communication research on organizational change has focused on a variety of phenomena, including the impact of new technology as it diffuses through industries (Zorn et al.), the role of leaders in driving organizational change (Leonard & Grobler), and individual reactions to new disruptions and processes (Leonardi).

. C S: N M  N D

To better explicate new dynamics of organizational change, I draw upon my work examining large-scale traces of digital news media. I have worked for nearly a decade to examine the dynamics by which existing organizations respond and adapt to disruptions from new types of organizations. Driven by a desire to understand organizational
change from an ecological perspective, my work here is based on a long-term study of the digital transformation of the US newspaper industry. An ecological approach is utilized in order to focus on organizational change as an iterative process, in which change is driven by interaction between changes in the industry and changes in the organization. Organizational ecology focuses on the interaction between individual organizations within the context of a given industry and the competitive environment. Through my ongoing research on organizational evolution, I found that much of the extant research on organizational change could not explain why an entire industry (e.g., traditional newspaper publishers) had failed to adequately respond to the growth of digital news. This fundamental question drove my theoretical investigation. I was drawn to news media because my prior work experience had been in that domain, but more important, there was a clear gap in the ability of existing theoretical frameworks to explain current organizational phenomena.

3.1 Data on Digital Media

It was clear from the start that in order to examine the ecosystem of interaction between traditional newspapers and digital news sources, I would need a broad data set capturing a wide range of organizational interactions. My broad focus led me to start by examining organization-to-organization processes; happenstance led me to stumble upon a relevant body of digital trace data (arguably, I could have addressed similar questions through other data sources, as is often the case). Thus, I started to extract data from the Internet Archive (archive.org), the largest digital repository of archived web pages in the world; in recent years scholars have been working to leverage it as a research tool. In my case, I have used data from the Internet Archive to recreate the adoption of the Web by media companies, looking, for instance, at the amount of online content and hyperlinking activity on the part of the New York Times in order to understand the Times’s adoption of Internet technology as a web publishing platform. My primary data set contains records for  million specific uniform resource locators (URLs) with a total of . billion captures of those web pages. The aggregate data cover a time period from  to ; secondary data sets cover the period from  to  and from  to , although these data sets are not as complete. Often, however, I work with specific subsets. When I first started this research in , much of the data I desired were not available in an easy-to-access form. Although I knew that the Internet Archive had the appropriate data, I had to overcome significant technological barriers to first extract the data and then transform them from a standard archive format into a workable database structure. Today I am continuing to work with approximately forty terabytes of Internet Archive data, but more often than not analyses deal with narrow subsets of the larger set.
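
Capture counts of the kind described above can now be retrieved through the Internet Archive's public CDX API, without the bulk-extraction work the text describes. The sketch below is a minimal illustration under that assumption, not the pipeline used in the study; the endpoint is the Archive's documented CDX service, while the function names and the per-year aggregation are my own.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def fetch_capture_timestamps(url, year_from, year_to):
    """Return Wayback Machine capture timestamps (YYYYMMDDhhmmss) for a URL."""
    params = urllib.parse.urlencode({
        "url": url,
        "from": str(year_from),
        "to": str(year_to),
        "output": "json",   # first row of the JSON response is a header row
        "fl": "timestamp",  # request only the timestamp field
    })
    with urllib.request.urlopen(f"{CDX_ENDPOINT}?{params}") as resp:
        rows = json.load(resp)
    return [row[0] for row in rows[1:]]  # skip the header row

def captures_per_year(timestamps):
    """Aggregate capture timestamps into a per-year count."""
    return Counter(ts[:4] for ts in timestamps)
```

For example, `captures_per_year(fetch_capture_timestamps("nytimes.com", 1996, 2005))` would give a rough yearly trace of archiving activity for the Times's homepage.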

I relied on a large number of secondary sources to (a) provide variables for analysis and (b) help provide context for understanding the decisions made by organizations. Because of the public nature of news media organizations in the United States, there is a large body of data tracking characteristics of organizations in this ecosystem. For instance, the Editor & Publisher Yearbook includes records of the number of employees, print newspaper circulation, management including editors and publishers, and estimates of revenue. Data are included for both public and private companies, providing researchers with a rich source for examining online and offline organizational behavior and for pairing trace data with outcomes such as circulation and revenue. In turn, these data allowed me to create a number of variables, such as industry-level variables accounting for the overall number of employees in the sector and organization-level variables such as revenue and changes in management. Moreover, I paired trace data with secondary data sources and interviews in order to provide context for understanding changing patterns within the data. I have interviewed more than sixty journalists, editors, and entrepreneurs and spent time in a number of media companies in order to better understand the way in which these organizations are evolving. As the preceding discussion makes clear, the necessary data are often difficult to collect. The breadth of data described here is not necessary in every case; I have published a number of papers from the data set, and I have not used all elements in all of the manuscripts. See Shumate and Weber for a more complete discussion of some of the methodological challenges associated with this type of data collection.
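
Pairing trace data with yearbook-style secondary variables amounts to a join on organization-year keys. The sketch below illustrates the idea with hypothetical records; the field names (`captures`, `circulation`, and so on) and values are invented for illustration, and the study's actual variables and storage format are not specified here.

```python
# Hypothetical records: trace data (web captures per organization-year)
# and secondary data (Editor & Publisher-style yearbook variables).
trace = {
    ("nytimes.com", 2000): {"captures": 1200, "outlinks": 340},
    ("nytimes.com", 2001): {"captures": 1500, "outlinks": 410},
}
yearbook = {
    ("nytimes.com", 2000): {"circulation": 1100000, "employees": 1200},
    ("nytimes.com", 2001): {"circulation": 1080000, "employees": 1180},
}

def pair_sources(trace, yearbook):
    """Join trace and secondary data on shared (organization, year) keys.

    Only keys present in both sources are kept, mirroring the common
    practice of analyzing the intersection of available records.
    """
    paired = {}
    for key in trace.keys() & yearbook.keys():
        row = {}
        row.update(trace[key])
        row.update(yearbook[key])
        paired[key] = row
    return paired
```

Each resulting row then carries both online behavior (captures, outlinks) and offline outcomes (circulation, employees) for the same organization-year.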

3.2 A Spark of New Theory

I used theory to guide my data collection and aimed to examine the interplay between organizational emergence and subsequent change among existing competitors. Much of my work focuses on social network analysis, but I use other methods as well, including regression and event history analysis. Although I wished to explore specific hypotheses, I also utilized visualizations to help me better understand my data. The data set covers connections among digital news sites from  to the present day. Looking at the initial network visualizations that I was able to create, I noticed that many in later years exhibited a high degree of cohesiveness, but in the period from  through the early s, clear layers and clusters emerged in the network. This pattern seemed to resonate with interviews, in which news editors would discuss the barriers that were established to prevent sharing of information between traditional and new media. Moreover, this view was reinforced by quantitative measures of the number and size of clusters within the network data. As I delved deeper, it was clear that two different types of networks existed: one represented relationships among traditional news media, and the other represented ties among new media organizations, such as blogs. The barriers between the two clusters did not appear to erode until after . From a theoretical perspective, the barrier between the
two organization types was a phenomenon that I could not easily reconcile with existing theory. In an attempt to better understand this shift that I had noticed in the data, I started with existing organizational theories. The notion that two different types of organizations competing for common resources would remain entirely distinct sent me down a new theoretical path. I have since worked to explain a process of speciation, whereby disruptions occur only after new organization types have developed strong and robust networks (Weber). Subsequently, my methodological choices were based on a desire to analyze the nature of the aforementioned change. On the one hand, I have used traditional approaches such as event history analysis for many of the reasons previously discussed, namely validity and goodness-of-fit measures. On the other hand, social network analysis has been useful for analyzing connectivity. The work, however, has been driven by the intersection of traditional theories and new computational approaches.
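
The two-cluster separation described above can be quantified with a simple boundary-crossing measure. The sketch below uses a hypothetical hyperlink network (node names and ties are invented, not drawn from the study's data); a value near zero indicates that traditional and new media form nearly separate clusters.

```python
# Hypothetical hyperlink network between news organizations; in the study,
# ties were derived from archived hyperlinks between news sites.
traditional = {"nytimes", "washpost", "latimes"}
new_media = {"blogA", "blogB", "blogC"}
edges = [
    ("nytimes", "washpost"), ("washpost", "latimes"), ("nytimes", "latimes"),
    ("blogA", "blogB"), ("blogB", "blogC"), ("blogA", "blogC"),
    ("nytimes", "blogA"),  # a rare tie crossing the two clusters
]

def external_tie_share(edges, group):
    """Share of ties crossing the boundary between `group` and everyone else.

    A value near 0 means the two populations are nearly disconnected;
    larger values indicate increasing mixing between them.
    """
    crossing = sum(1 for u, v in edges if (u in group) != (v in group))
    return crossing / len(edges)
```

Here `external_tie_share(edges, traditional)` is 1/7: only one of seven ties bridges the traditional and new media clusters, the kind of near-separation visible in the early-period visualizations described in the text.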

. T “O” D  O C

As the illustration of change in the news media industry shows, a singular focus on new dynamics obfuscates the current trajectory of organizational research. Indeed, many of the foundational processes by which organizations adapt and evolve have persisted for decades and provide building blocks upon which new research stands. For instance, my initial examinations of media organizations were grounded in organizational ecology, a theoretical approach with roots in sociological research from the 1970s (Hannan & Freeman). In recent years, the growing prevalence of large-scale data paired with related methodological advances has created opportunities to examine new aspects of existing organizational dynamics and to focus on new forms of organizing that have emerged. Earlier work, however, provides a clear lens through which to frame new questions regarding this process of change. The larger focus on organizational change stems from a recognition that organizations face constant pressure to adapt to the ebb and flow of market conditions, shifts in social trends (especially in the case of nontraditional organizations), and changes in ICT.

4.1 The Broad Shape of Organizational Change

Although studies of organizational change vary widely across domains, there is commonality across this body of work. Examining the adoption of new ICT by news media organizations, it was clear that there were stages in the adaptation process, ranging from early exploration with websites to full integration of social media platforms into newsroom publishing (Weber). This perspective is more broadly echoed in
organizational research, which has shown that key stages exist in the life cycle of an organization and in the cycle of organizational change. These stages are generally uniform across organizations, although the pace of change varies and the exact nature of each stage can differ. Although many studies focus on a specific aspect or level of organizational change, such studies are inherently situated within a larger ecosystem of change. Monge and Bryant developed a staged model of industry evolution, illustrating how organizational processes begin in a stage of emergence, followed by maintenance, self-sufficiency, and transformation. Their analysis was based on an examination of communication activity that occurs between organizations in order to understand the progression of change, but it echoes research others have conducted on the life cycle of organizations. For example, Carroll and Delacroix present an examination of one hundred years of development in newspaper publishing and the number of newspaper companies in existence, showing that a similar life cycle exists across companies. Similarly, the case study of the US news media industry revealed a clear juncture point at which barriers between new and old organizations quickly eroded. Such key transition points reinforce the notion of change occurring as a cyclical process. Focusing on a single organization, such as a single newspaper company or news media website, the life cycle begins with the emergence of an organization. Critical questions include what structure an organization chooses to adopt and why, as well as how the emergent organization then seeks to gain legitimacy as it grows over time. Once established, an organization seeks to survive and struggles to be resilient in the face of changing competitive pressures. Failure or obsolescence, although less studied, often represents the end of an organization; otherwise, an organization adapts to a new form.
These stages, and relevant studies, are outlined in Table 8.1, providing a high-level view of the life cycle of organizational change; the following sections provide further explication. Illustrations of this process are often seen in everyday organizations. For instance, the recent rise and fall of the online news outlet Gawker.com reveals the emergence and faltering of a new organization and a new form of organizing. Gawker was an online news outlet that drew readers to the site with sensational headlines and attention-grabbing stories; the organization ceased operations amid a major legal challenge. In the case of Gawker, although the single organization failed to thrive, many of the tenets of its business structure—sensationalism, an emphasis on speed in reporting, integration with social media—had a strong impact on other news outlets. Thus, as the process of change advances, the cycle repeats itself across levels. For instance, in considering the diffusion of a single technological artifact (or even a new business practice) into and throughout an organization, one first begins with the introduction of the technology into the organization (emergence). Subsequently, organizational members will decide whether to accept or reject the technology. At the individual level, Rogers’ theory of the diffusion of innovations proposes a staged model of adoption and predicts an S-shaped curve of cumulative adoption over time. Innovators and early adopters will lead the charge, while the majority of




Table 8.1 New Dynamics and the Cycle of Organizational Change

Emergence
  Traditional themes: Available resources help shape a new organization (Audia et al., 2006). Organizations seek to mimic successful competitors (DiMaggio and Powell, 1983).
  New themes: Individuals leverage social media to act collectively and to establish new organizational structures (Bimber et al., 2005).

Legitimacy and Stability
  Traditional themes: Legitimacy is garnered as others accept an organization as taken for granted (Ruef and Scott, 1998). Organizations emerge as stable when routines are entrenched and a stable industry exists (Ruef, 1997).
  New themes: Social movements may leverage social media to develop entrenched organizational structures without a physical presence (Agarwal, 2014). Not all technologies are accepted by users; there may be unintended consequences (Majchrzak et al., 2013) and differences in use (English, 2014).

Resiliency
  Traditional themes: Organizations establish the capacity to withstand change by gaining access to excess resources (Cheng and Kesner, 1997). Individuals repeat routines, creating structure (Pentland et al., 2011); older organizations tend to have more entrenched routines (Carroll and Hannan, 1989).
  New themes: Research focuses on resilience and networked patterns of organizations, including response to disasters (Chewning et al., 2012) and the structure and robustness of terror networks (Keegan et al., 2010).

Failure or Transformation
  Traditional themes: Organizations with significant inertia may fail to respond quickly to change or may be unable to adapt (Kelly and Amburgey, 1991). Organizations transform as a result of significant disruptions or shifts in ICT (Amburgey et al., 1993). Adaptation occurs as individuals and teams innovate (Crossan and Apaydin, 2010).
  New themes: Many large organizations are transforming into virtual organizations in order to retain competitiveness (Gibson and Gibbs, 2006). Early experimentation with new technology helps organizations transform (Weber, 2012). Changes in individuals' routines may be leveraged to facilitate transformation (Feldman, 2000).

users will follow suit if a technology is shown to be successful or useful. This is seen in the choice of individual reporters to adopt social media, using tools such as Twitter to engage with audiences and share information. Within the organization, it is also possible that a technology will be adopted but that users will find unintended applications for it (DeSanctis & Poole). This echoes the notion of legitimacy: following a successful process of adoption, a technology will diffuse and either be accepted or rejected across an organization. The cycle is iterative and repeats across
levels as organizations face challenges updating or replacing existing technology (e.g., failure/decline) (Starbuck).
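
Rogers' predicted S-shaped adoption curve is commonly modeled with a logistic function. The sketch below is a generic illustration with arbitrary parameter values, not a model fitted to any data discussed in this chapter.

```python
import math

def logistic_adoption(t, rate=1.0, midpoint=5.0):
    """Cumulative share of eventual adopters that have adopted by time t.

    `rate` controls how quickly diffusion proceeds; `midpoint` is the
    inflection point at which half of the eventual adopters have adopted.
    Both values here are illustrative, not empirical estimates.
    """
    return 1.0 / (1.0 + math.exp(-rate * (t - midpoint)))

# Slow uptake among innovators, rapid adoption by the majority, saturation.
curve = [round(logistic_adoption(t), 3) for t in range(11)]
```

Plotting `curve` gives the familiar S shape: a handful of early adopters, a steep middle phase as the majority follows suit, and a long tail of laggards.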

4.2 Theoretical Perspectives on the Cycle of Change

Choosing a level from which to frame an instance of organizational change further refines the aims of a given study and provides a starting point for examining new processes. The process of change can be viewed from an industry, organization, team, or individual perspective. It is also possible to examine interactions that occur between levels, considering, for instance, the impact of individual adoption of technology on knowledge sharing within a given team. Institutional and evolutionary theories, among others, provide a basis for understanding macroscopic and multilevel processes of change. The institutional perspective on organizational change focuses on processes of isomorphism and legitimacy, examining the means by which groups of organizations bond together to form coherent forms and structures (Scott; DiMaggio & Powell; DiMaggio). The history of an organization may determine its path over time, but key junctures in the life cycle of an organization will drive change. For instance, the Washington Post has experimented with digital technology for decades, but when Jeff Bezos, CEO of Amazon.com, bought the newspaper in 2013, it represented a clear juncture in the organization's history and a shift in its overall digital strategy. Institutionalism provides a balanced perspective for examining both the history of an organization and the impact of such key juncture points. Similarly, organizational ecology focuses on the ecosystem of organizations. In contrast to institutionalism, organizational ecology frames change as a response to shifts in resources and the introduction of disruptions at the community level (Hannan & Freeman).
Organizations are established based on the availability of resources; new entrants will establish themselves as generalists or specialists depending on resource availability and will evolve through a process of variation and selection in a struggle to survive over time. From this point of view, change is viewed as a top-down process. At the team level, theories focus specifically on team-based processes related to change. For instance, the theory of transactive memory seeks to explain the processes by which team members store information in repositories or exchange information with one another, and by doing so create institutional memories for the preservation of knowledge over time (Hollingshead et al.). More recently, this concept has been used to explain the utilization of social media systems in organizations to preserve information (Fulk & Yuan). Team-level theories bring the focus into the organization, helping the researcher understand behaviors such as the development of best practices regarding the use of Twitter as a tool for researching a news story. At the individual level, researchers consider factors such as the interactions that occur between individual users and new artifacts of technology, seeking to understand how particular uses of a technology are shaped. Recent studies have looked at how
managers engage with employees in order to lead change processes and to effectively manage the implementation of new systems (Scott & Bruce), or at how employees interact with the design and constraints of a technological artifact to integrate a technology into work routines (Ellison et al.). These individual-level theories help frame an individual's decision to accept or reject a technology and further help to develop a better understanding of how individual uses of a technology develop over time.

4.3 The “New” Dynamics of Organizational Change

A host of additional theories are relevant to this discussion; some fall beyond the scope of this chapter, but the following discussion highlights a number of useful approaches building on the preceding discussion. Large-scale data provide an opportunity to examine organizational change across levels, on a scale that has not previously been possible, and to examine different types of organizations as they evolve. Modern organizations generate trillions of records on any given day (Abawajy), producing detailed data sets for tracing organizational behavior. Research on new dynamics leverages existing work but also seeks to advance theory in new directions.

4.4 Emergence and New Disruptions

4.4.1 Existing Traditions

Focusing on emergence, ecologists and institutionalists emphasize the determinants of organizational structure and the role of founding conditions. Ecologists consider the conditions at founding, the impact of available resources, and other environmental conditions as key in determining the resulting structure of an emergent organization (Audia et al.; Swaminathan); for instance, the presence of strong competitors in a particular domain may lead a new entrant to focus on providing niche services in order to secure resources. On the other hand, institutionalists look within an industry, focusing on the extent to which organizations mimic one another and replicate organizational structures based on founding conditions (DiMaggio & Powell). For example, the Huffington Post emerged as an online blog and news platform, mimicking the structure of traditional newspapers by creating a hierarchy of editors and reporters. These decisions helped the Huffington Post gain acceptance and eventually contribute to the ongoing disruption of the news media industry. Within a given organization, the emergence of new disruptions centers on processes of adoption, looking at individual decisions to adopt or reject technology (Strang & Soule). The introduction of new resources such as enterprise social media affords users the ability to perform new types of actions; however, the impact of such technology over time depends on users' willingness to adopt a given ICT.

4.4.2 New Dynamics

Recent work has focused on exploring the dynamics that bring individuals together, creating the conditions necessary for organizations to emerge and for organizing to occur. For instance, Bimber, Flanagin, and Stohl reconceptualized collective action in order to provide a new framing for understanding how new ICT enables dispersed groups to act collectively toward a common cause. They point to collective action as key to understanding the mechanisms by which social movements emerge and evolve during their foundational stages. In a similar vein, Bennett and Segerberg coined the term “connective action” to explain the common bonds of action that are established among activists engaged in action via social media. From their point of view, social technology such as Twitter has emerged as the place of organizing: no physical structure is needed; rather, the organization exists solely as communicative patterns denoted by markers such as hashtags (phrases within tweets used to identify common ideas or topics, e.g., #arabspring). In this theoretical framing, the hashtag thus represents a new dynamic by which organizing may occur. In the practice of journalism, Twitter has become institutionalized as an accepted means of reporting (Lasorsa et al.), but more important, Twitter has created a new process for reporters and journalists to engage with one another (Lewis et al.). Moreover, in some communities, such as among Aboriginal Australians, Twitter is used as a primary communication channel for reporting about critical events (Hess & Waller). In this way, new processes of organizing are seen as occurring around common practices on Twitter. In another vein, I have utilized extensive historical web data, paired with secondary data sources, to examine the evolutionary mechanisms that drive change in the US news media industry.
Early strategic decisions to accept or reject particular technological artifacts were found to have a statistically significant impact on the growth trajectory of an organization over time (Weber; Weber & Monge); this work points to new evolutionary processes at the intersection of industry and organization. At the user level, much attention has been focused on understanding how users and technology interact. For example, affordances research marries an understanding of the features of a given technology with the intended action of the user, to understand how the technology and its use align with one another (Treem & Leonardi). Although such work has generally not utilized digital trace data, the affordances perspective is useful for thinking about large-scale enterprise social media systems and the impact they have on work at the individual level of action (Ellison et al.).

4.5 Legitimacy and Stability

4.5.1 Existing Traditions

Legitimacy is the process by which new organizations and new industries gain acceptance from others. Organizations and industries seek legitimacy in order to
transition into a period of stability, whereby an organization has access to a relatively stable stream of resources and is sheltered from disruptive pressures (Ruef & Scott). Legitimacy from outside an organization's walls is often realized through recognition from others, including media coverage or the formation of professional associations (Aldrich & Fiol). As common types of organizations become accepted as legitimate, institutions are established that reinforce a particular common type. Within teams, and even at the individual level, similar processes occur. Teams engage with a new technology and develop routines for using it in day-to-day work; as those routines are repeated by others, they become entrenched in the organizational processes of the company (Miner). Legitimacy and the development of routines are key drivers of stability; more broadly, organizations as a whole seek stability in terms of resources (e.g., capital, talent, raw materials).

... New Dynamics Recent studies of organizational change emphasize the acceptance of new forms of organizations as emblematic of change; in turn, scholars have sought to address how these new structures and organizational processes come to be accepted. For example, Agarwal et al.’s () work examining the Occupy Wall Street movement leveraged a data set consisting of more than sixty million tweets to demonstrate the emergence of stable and robust organizational structure based on social media interactions. This work built on approaches developed in other fields, such as computer science (Starbird et al., ), that provided the scholars with the filtering techniques needed to ascertain which Twitter accounts were most relevant to the organizing of the social movement. Agarwal et al.’s work shows that the scale of action on Twitter, and echoes of that structure in other media such as newspapers, represents a clear pattern of organizing. Moreover, the recognition in media coverage suggests a degree of legitimacy for a form that was previously not accepted as such. Similarly, research examining social movements has leveraged large-scale digital trace data to recreate global networks of interactions (Marchetti & Pianta, ). This allows scholars to move from examining single instances of a movement to better understanding how movements and patterns of organization occur as global actions. At the individual level, new ICT is often adopted in phases; thus, organizations must also contend with tensions that exist between early adopters and laggards. Such tensions reinforce the need for organizations to strategize about the appropriate way to deploy new technology (English, ). The implementation of social media within organizations has often yielded unintended consequences; for instance, employees manipulate enterprise social media to increase their perceived contributions to the organization.
As a result, new work focuses on the ripple effect that organizations face as they strive for more effective knowledge sharing and subsequently must address unforeseen outcomes (Majchrzak et al., ).
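Account-filtering steps like the ones the Occupy studies borrowed from computer science can be sketched in a few lines. The following is a minimal, illustrative example only (not the cited authors' actual method): it ranks accounts by how often others retweet them within an event-specific stream, using hypothetical field names rather than any real platform schema.

```python
from collections import Counter

def rank_accounts(tweets, top_n=3):
    """Rank accounts by how often others retweet them in an event stream.

    `tweets` is a list of dicts with hypothetical keys 'user' and
    'retweet_of' (None when the tweet is original). Real pipelines,
    such as the collaborative filtering in Starbird et al., combine
    richer signals (follower overlap, location cues, posting history).
    """
    retweet_counts = Counter(
        t["retweet_of"] for t in tweets if t["retweet_of"] is not None
    )
    return [user for user, _ in retweet_counts.most_common(top_n)]

# Invented four-message stream: account "a" is retweeted twice, "e" once.
stream = [
    {"user": "a", "retweet_of": None},
    {"user": "b", "retweet_of": "a"},
    {"user": "c", "retweet_of": "a"},
    {"user": "d", "retweet_of": "e"},
]
print(rank_accounts(stream))  # ['a', 'e']
```

Even this naive count surfaces the accounts most central to an event stream; the research versions then validate such candidates against ground truth before treating them as "on-the-ground" organizers.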



 . 

.. Resiliency and Stability ... Existing Traditions As organizations seek stability, they strive to develop the capacity to adapt to uncertain conditions and to respond to unexpected disruptions. Beyond excess capacity, organizations cultivate resiliency, such that they can handle unforeseen changes. In part, traditional work in this domain emphasizes processes by which organizations secure excess resources in order to withstand change (Cheng & Kesner, ). Individuals also develop robust routines that further contribute to organizational stability; the repetition of routines over time creates an entrenched and stable structure that is often difficult to modify (Pentland et al., ). This notion of routinization is reinforced by structural inertia: as organizations age, they become more entrenched in existing processes (Carroll & Hannan, ).

... New Dynamics In studying resiliency, organizational scholars have recently started to focus on new types of resiliency, as well as resiliency in new organizational forms. For instance, scholars have looked at nonprofit networks in order to better understand the way that communities utilize networks to develop resiliency in the wake of disasters (Chewning et al., ). Looking at new organizational forms, others have considered the emergence of networked organizational structures. Networked forms of organization are loosely structured organizational entities that are posited to have the capacity to adapt rapidly to changing market conditions (Podolny & Page, ). Scholars have shown that emergent network organizations (e.g., disaster response teams or community-driven action) can be leveraged to respond rapidly in the wake of large disasters and to mobilize resources for recovery (Doerfel et al., ). Utilizing a similar networks perspective, scholars have shown how new organizational structures afford organizations a stronger degree of resiliency than is seen in traditional organizational structures. For instance, terrorist networks have proven particularly resilient by compartmentalizing command and control structures and distributing control (Lindelauf et al., ; Stohl & Stohl, ). Research on organizational resiliency and stability has also emerged in unexpected domains. Using a computational approach to examine data from a massively multiplayer online game, recent research has demonstrated that illicit team networks utilize specific network strategies to mask their underlying activities; the findings are compared to data on terror networks, illustrating the similarity in dynamics (Keegan et al., ). This work provides a key example of the application of computational techniques, paired with theoretical mechanisms, to better understand dynamics of organizing over time.
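The intuition behind these resilience findings can be made concrete with a toy simulation. The sketch below is illustrative only (the topologies and node labels are invented, not drawn from any cited study): it measures how the largest connected component of two small networks shrinks when the most-connected node is removed, using only the Python standard library.

```python
def largest_component(adj, removed=frozenset()):
    """Size of the largest connected component, ignoring removed nodes."""
    seen, best = set(), 0
    for start in adj:
        if start in seen or start in removed:
            continue
        stack, comp = [start], 0
        seen.add(start)
        while stack:  # iterative depth-first search
            node = stack.pop()
            comp += 1
            for nbr in adj[node]:
                if nbr not in seen and nbr not in removed:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, comp)
    return best

# A centralized "star" network versus a decentralized ring of cells.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

# Removing the single most-connected node shatters the star but barely
# dents the ring: the intuition behind compartmentalized, distributed
# structures proving more resilient to targeted disruption.
print(largest_component(star, removed={0}))  # 1
print(largest_component(ring, removed={0}))  # 4
```

Studies of covert and disaster-response networks apply the same logic at scale, repeatedly removing high-centrality nodes and tracking how quickly the network fragments.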

    



.. Failure and Decline Failure brings the cycle full circle. When organizations reach this point, they must either remake themselves or accept continued decline and demise. In either case, there is a shift within the industry that frees existing resources and creates new opportunities for emergence. At that point, the cycle may restart, and new organizations may emerge.

... Existing Traditions Failure is not an instantaneous process; rather, it tends to occur when organizations fail to adapt and transform. It is generally accepted that organizations with significant inertia may fail to respond quickly to change or may be unable to adapt (Kelly & Amburgey, ). On the other hand, organizations often transform in response to significant disruptions or shifts in ICT (Amburgey et al., ); attempts to change or adapt, however, may also leave an organization susceptible to failure. Within organizations, adaptation occurs as individuals and teams innovate (Crossan & Apaydin, ); innovation is a key driver of an organization’s ability to adapt to new conditions and to transform in the face of decline.

... New Dynamics Failure and decline are understudied; emergent research has focused largely on routines, but there is a clear need for additional work in this domain, in part because clear data on organizational decline or failure are difficult to gather. Much recent work has focused on the processes by which organizations transform in an attempt to stave off decline. For instance, many large organizations are transforming into virtual organizations in order to retain competitiveness (Gibson & Gibbs, ); in turn, a large body of work seeks to examine the impact of shifting resources to a virtual environment. Alternatively, leveraging large-scale digital trace data, recent work has shown that early adopters who experiment with new technology are likely to transform and survive disruptions, even if the particular instance of technology adoption fails (Weber, ). To this end, changes in individual routines may be leveraged to facilitate transformation (Feldman, ).

. A C  S N D  O C

.................................................................................................................................. Key themes in present research include studies examining ecosystems of organizational change, new forms of organizing and their impacts for organizational change, and the iterative process of user-technology interaction. Revisiting table ., the table outlines key aspects of the organizational life cycle and highlights focal research streams. The preceding discussion has highlighted many of the key challenges for researchers of organizational change; in turn, the following underscores key questions to address in the practical art of developing research within this domain. Many of the core studies of organizational change focus on traditional organization types such as corporations, small businesses, and large multinational organizations. More recently, however, digital trace data have enabled research examining a breadth of organization types, including terrorist organizations, drug trafficking rings, social movements, and volunteer and nonprofit entities. The comprehensive nature of large-scale data facilitates examination of the full life cycle of an organization or the examination of a wider range of interactions than has previously been considered. The wealth of available data can be overwhelming; in turn, it is important to address key questions pertaining to (a) the scope of the study, (b) the collection of data, and (c) the subsequent analyses.

. C  B N G

.................................................................................................................................. Much has been made of the promise of big data, but recent commentary has simultaneously recognized the pitfalls that come with large-scale data analysis (boyd & Crawford, ). As other chapters in this volume illustrate, many ethical considerations must be taken to heart when working with large-scale data. This is particularly salient in organizations, where digital trace data have the potential to identify participants who would otherwise choose to remain anonymous. In addition, digital trace data are often problematic for research; these data are frequently small subsets of larger data pools, meaning that the exact sampling frame is not always known. Moreover, researchers are often unaware of how data were collected, meaning that the sampling procedure may never be known (Weber & Nguyen, ). In some cases, this is not problematic. For instance, researchers have used Twitter data to understand the public presentation of an organization by its members, selecting narrow frames of data and acknowledging a clear bias in the sample. Such work has been used to illustrate how a corporation uses Twitter to put forth a public relations narrative (Xifra & Grau, ) or as part of a repertoire of communication channels to shape a public image (Zerfass & Schramm, ). In other instances, however, issues abound. Numerous issues have been identified regarding recent research on Google Flu Trends; in particular, issues with data availability, as well as changes in the algorithms associated with the data, impacted the validity of study results (Lazer et al., ).
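The sampling problem described above is easy to demonstrate with a toy simulation. All numbers below are invented: a skewed activity distribution stands in for real trace data, and a "convenience" sample that over-represents highly active users (a crude model of, say, collecting tweets rather than users) is compared against a uniform random sample.

```python
import random

random.seed(42)

# Hypothetical population of 10,000 users with skewed activity levels.
population = [random.expovariate(1.0) for _ in range(10_000)]
true_mean = sum(population) / len(population)

# Convenience trace sample: the 1,000 most active users dominate the
# data stream. Uniform sample: 1,000 users drawn at random.
biased = sorted(population, reverse=True)[:1000]
uniform = random.sample(population, 1000)

biased_mean = sum(biased) / len(biased)
uniform_mean = sum(uniform) / len(uniform)

# The biased estimate lands far above the truth; the uniform one does not.
print(f"true {true_mean:.2f}, uniform {uniform_mean:.2f}, biased {biased_mean:.2f}")
```

The point is not the particular numbers but the mechanism: when the sampling frame of a trace data set is unknown, there is no way to tell which of these two situations a researcher is in.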

. F R

.................................................................................................................................. Recently, after I gave a presentation on change in media organizations traced through digital data, one audience member raised her hand and observed, “This is interesting, but can these data help to build new theory?” In a nutshell, the audience member was noting that much recent research has pointed to interesting trends in organizational research, but few studies have advanced new thinking pertaining to organizational dynamics. Theory building is a complicated process, and a significant body of research already addresses theories of organizational change. And yet computational social scientists are unlocking new insights into organizational change, and as this chapter has demonstrated, new data sets and methods for handling these data are providing insight into new dynamics.

R Abawajy J. (). Comprehensive analysis of big data variety landscape. International Journal of Parallel, Emergent and Distributed Systems (): –. Agarwal, S. D., Bennett, W. L., Johnson, C. N., & Walker, S. (). A model of crowd enabled organization: Theory and methods for understanding the role of twitter in the occupy protests. International Journal of Communication (): –. Aldrich, H., and Fiol, C. M. (). Fools rush in? The institutional context of industry creation. In A. Cuervo, D. Ribeiro, and S. Roig (Eds.), Entrepreneurship (pp. –). New York: Springer Berlin Heidelberg. Amburgey, T. L., Kelly, D., and Barnett, W. P. (). Resetting the clock: The dynamics of organizational change and failure. Academy of Management Proceedings (): –. Amburgey, T. L., Kelly, D., and Barnett, W. (). Resetting the clock: The dynamics of organizational change and failure. Administrative Science Quarterly (): –. Audia, P. G., Freeman, J. H., and Reynolds, P. D. (). Organizational foundings in community context: Instruments manufacturers and their interrelationship with other organizations. Administrative Science Quarterly (): –. Avgerou C. (). The significance of context in information systems and organizational change. Information Systems Journal (): –. Bennett, W. L., and Segerberg, A. (). The logic of connective action: Digital media and the personalization of contentious politics. Information, Communication & Society (): –. Berners-Lee T., Hall, W., Hendler, J., et al. (). Creating a science of the Web. Science : –. Bieberstein, N., Bose, S., Walker, L., et al. (). Impact of service-oriented architecture on enterprise systems, organizational structures, and individuals. IBM Systems Journal (): –. Bimber, B., Flanagin, A. J., and Stohl, C. (). Reconceptualizing collective action in the contemporary media environment. 
Communication Theory : –. boyd, d., and Crawford, K. (). Critical questions for big data. Information, Communication & Society (): –. Bright, D. A., Hughes, C. E., and Chalmers, J. (). Illuminating dark networks: A social network analysis of an Australian drug trafficking syndicate. Crime, Law and Social Change (): –. Brinton, Milward H., and Raab, J. (). Dark networks as organizational problems: Elements of a theory. International Public Management Journal (): –.



 . 

Bryant, J. A., and Monge, P. (). The evolution of the children’s television community: –. International Journal of Communication : –.
Carroll, G., and Delacroix, J. (). Organizational foundings: An ecological study of the newspaper industries of Argentina and Ireland. Administrative Science Quarterly (): –.
Carroll, G., and Hannan, M. (). Density dependence in the evolution of populations of newspaper organizations. American Sociological Review (): .
Cheng, J. L. C., and Kesner, I. F. (). Organizational slack and response to environmental shifts: The impact of resource allocation patterns. Journal of Management (): –.
Chewning, L. V., Lai, C.-H., and Doerfel, M. L. (). Organizational resilience and using information and communication technologies to rebuild communication structures. Management Communication Quarterly (): –.
Crossan, M. M., and Apaydin, M. (). A multi-dimensional framework of organizational innovation: A systematic review of the literature. Journal of Management Studies (): –.
DeSanctis, G., and Poole, M. S. (). Capturing the complexity in advanced technology use: Adaptive structuration theory. Organization Science (): .
DiMaggio, P. T. (). The challenge of community evolution. In J. A. C. Baum and J. V. Singh (Eds.), Evolutionary dynamics of organizations (pp. –). New York: Oxford University Press.
DiMaggio, P. T., and Powell, W. W. (). The Iron Cage revisited: Institutional isomorphism and collective rationality in organizational fields. American Sociological Review (): .
Doerfel, M. L., Lai, C. H., and Chewning, L. V. (). The evolutionary role of interorganizational communication: Modeling social capital in disaster contexts. Human Communication Research (): –.
Ellison, N. B., Gibbs, J. L., and Weber, M. S. (). The use of enterprise social network sites for knowledge sharing in distributed organizations: The role of organizational affordances. American Behavioral Scientist (): –.
English, P. (). Twitter’s diffusion in sports journalism: Role models, laggards and followers of the social media innovation. New Media & Society (): –.
Fan, M., Stallaert, J., and Whinston, A. B. (). The adoption and design methodologies of component-based enterprise systems. European Journal of Information Systems (): –.
Feldman, M. (). Organizational routines as a source of continuous change. Organization Science (): –.
Fulk, J., and Yuan, Y. C. (). Location, motivation, and social capitalization via enterprise social networking. Journal of Computer-Mediated Communication (): –.
Fusarelli, L. D. (). Tightly coupled policy in loosely coupled systems: Institutional capacity and organizational change. Journal of Educational Administration (): –.
Gibbs, J. L., Eisenberg, J., Rozaidi, N. A., et al. (). The “megapozitiv” role of enterprise social media in enabling cross-boundary communication in a distributed Russian organization. American Behavioral Scientist (): –.
Gibson, C. B., and Gibbs, J. L. (). Unpacking the concept of virtuality: The effects of geographic dispersion, electronic dependence, dynamic structure, and national diversity on team innovation. Administrative Science Quarterly : –.
Hannan, M., and Freeman, J. H. (). The population ecology of organizations. American Journal of Sociology : –.

    



Hannan, M., and Freeman, J. H. (). Organizational ecology. Cambridge, MA: Harvard University Press.
Hendriks, P. (). Why share knowledge? The influence of ICT on the motivation for knowledge sharing. Knowledge and Process Management (): .
Hess, K., and Waller, L. (). Community journalism in Australia: A media power perspective. Community Journalism (): –.
Hollingshead, A., Fulk, J., and Monge, P. (). Fostering intranet knowledge sharing: An integration of transactive memory and public goods approaches. In P. Hinds and S. Kiesler (Eds.), Distributed work: New research on working across distance using technology (pp. –). Cambridge, MA: MIT Press.
Keegan, B., Ahmed, M. A., Williams, D., et al. (). Dark gold: Statistical properties of clandestine networks in massively multiplayer online games. In  IEEE Second International Conference on Social Computing (SocialCom) (pp. –). Minneapolis, MN: IEEE.
Kelly, D., and Amburgey, T. L. (). Organizational inertia and momentum: A dynamic model of strategic change. Academy of Management Journal (): –.
Khajeh-Hosseini, A., Sommerville, I., and Sriram, I. (). Research challenges for enterprise cloud computing. arXiv preprint arXiv:..
Lasorsa, D. L., Lewis, S. C., and Holton, A. E. (). Normalizing Twitter. Journalism Studies (): –.
Lazer, D., Kennedy, R., King, G., et al. (). The parable of Google Flu: Traps in big data analysis. Science (): –.
Lazer, D., Pentland, A., Adamic, L. A., et al. (). Computational social science. Science : –.
Leonard, A., and Grobler, A. F. (). Exploring challenges to transformational leadership communication about employment equity: Managing organizational change in South Africa. Journal of Communication Management (): –.
Leonardi, P. M. (). Why do people reject new technologies and stymie organizational changes of which they are in favor? Exploring misalignments between social interactions and materiality. Human Communication Research (): –.
Lewis, L. K. (). An organizational stakeholder model of change implementation communication. Communication Theory (): –.
Lewis, S. C., Holton, A. E., and Coddington, M. (). Reciprocal journalism. Journalism Practice (): –.
Lindelauf, R., Borm, P., and Hamers, H. (). Understanding terrorist network topologies and their resilience against disruption. In U. Kock Wiil (Ed.), Counterterrorism and open source intelligence (pp. –). Vienna: Springer.
Lotan, G., Graeff, E., Ananny, M., et al. (). The Arab Spring: The revolutions were tweeted: Information flows during the  Tunisian and Egyptian revolutions. International Journal of Communication : .
Majchrzak, A., Faraj, S., Kane, G. C., et al. (). The contradictory influence of social media affordances on online communal knowledge sharing. Journal of Computer-Mediated Communication (): –.
Marchetti, R., and Pianta, M. (). Global social movement networks and the politics of change. In P. Utting, M. Pianta, and A. Ellersiek (Eds.), Global justice activism and policy reform in Europe: Understanding when change happens (pp. –). New York: Routledge.



 . 

McGowan, A.-M. R., Daly, S., Baker, W., et al. (). A socio-technical perspective on interdisciplinary interactions during the development of complex engineered systems. Procedia Computer Science : –.
Meyer, J. W., and Rowan, B. (). Institutionalized organizations: Formal structures as myth and ceremony. American Journal of Sociology : –.
Miner, A. S. (). Seeking adaptive advantage: Evolutionary theory and managerial action. In J. A. C. Baum and J. V. Singh (Eds.), Evolutionary dynamics of organizations (pp. –). New York: Oxford University Press.
Mohrman, S. A., Cohen, S. G., and Mohrman, A. M., Jr. (). Designing team-based organizations: New forms for knowledge work. San Francisco, CA: Jossey-Bass.
Nobre, F., Tobias, A., and Walker, D. (). The impact of cognitive machines on complex decisions and organizational change. AI & SOCIETY (): –.
Pentland, B. T., Hærem, T., and Hillison, D. (). The (n)ever-changing world: Stability and change in organizational routines. Organization Science (): –.
Podolny, J. M., and Page, K. L. (). Network forms of organization. Annual Review of Sociology : .
Redding, W. C. (). Communication within the organization: An interpretive review of theory and research. New York: Industrial Communication Council.
Rogers, E. M. (). History of communication study. New York: Free Press.
Rooney, D., Paulsen, N., Callan, V. J., et al. (). A new role for place identity in managing organizational change. Management Communication Quarterly (): –.
Ruef, M. (). Assessing organizational fitness on a dynamic landscape: An empirical test of the relative inertia thesis. Strategic Management Journal (): –.
Ruef, M., and Scott, W. R. (). A multidimensional model of organizational legitimacy: Hospital survival in changing institutional environments. Administrative Science Quarterly (): –.
Schumpeter, J. (). The theory of economic development. Boston: Harvard Publishing.
Scott, S. G., and Bruce, R. A. (). Determinants of innovative behavior: A path model of individual innovation in the workplace. Academy of Management Journal (): –.
Scott, W. R. (). Institutions and organizations. London: Sage Publications.
Shumate, M., and Weber, M. S. (). The art of Web crawling for social science research. In E. Hargittai and C. Sandvig (Eds.), Digital research confidential: The secrets of studying behavior online (pp. –). Boston: The MIT Press.
Sproull, L., and Kiesler, S. (). Connections: New ways of working in the networked organization. Cambridge, MA: MIT Press.
Starbird, K., Muzny, G., and Palen, L. (). Learning from the crowd: Collaborative filtering techniques for identifying on-the-ground Twitterers during mass disruptions. In Proceedings of the th Annual ISCRAM Conference, –. Vancouver, Canada: Simon Fraser University.
Starbuck, W. H. (). Unlearning ineffective or obsolete technologies. International Journal of Technology Management (–): –.
Stohl, C., and Stohl, M. (). Networks of terror: Theoretical assumptions and pragmatic consequences. Communication Theory (): –.
Strang, D., and Soule, S. A. (). Diffusion in organizations and social movements: From hybrid corn to poison pills. Annual Review of Sociology (): –.
Swaminathan, A. (). Environmental conditions at founding and organizational mortality: A trial-by-fire model. Academy of Management Journal (): –.

    



Treem, J. W., and Leonardi, P. (). Social media use in organizations: Exploring the affordances of visibility, editability, persistence, and association. Communication Yearbook : –.
Weber, M. S. (). Newspapers and the long-term implications of hyperlinking. Journal of Computer-Mediated Communication (): –.
Weber, M. S. (). Observing the Web by understanding the past: Archival Internet research. In WWW’ Companion Proceedings, –. Seoul, Korea: ACM.
Weber, M. S., and Monge, P. (). Industries in turmoil: Driving transformation during periods of disruption. Communication Research: –.
Weber, M. S., and Nguyen, H. (). Big data? Big issues: Degradation in longitudinal data and implications for social sciences. In WebSci  Proceedings. Oxford: ACM.
Weick, K. E., and Quinn, R. E. (). Organizational change and development. Annual Review of Psychology (): –.
Xifra, J., and Grau, F. (). Nanoblogging PR: The discourse on public relations in Twitter. Public Relations Review (): –.
Zerfass, A., and Schramm, D. M. (). Social media newsrooms in public relations: A conceptual framework and corporate practices in three countries. Public Relations Review (): –.
Zorn, T. E., Flanagin, A. J., and Shoham, M. D. (). Institutional and noninstitutional influences on information and communication technology adoption and use among nonprofit organizations. Human Communication Research (): –.

  ......................................................................................................................

        ......................................................................................................................

 . 

. Introduction

.................................................................................................................................. Social media and other terse communication technologies have become established and critical components of all phases of crisis management, including preparedness, response, and recovery activities. Looking at global events over the past five years clearly demonstrates how social media platforms are appropriated during extreme occurrences such as natural hazards, civil unrest, and terrorist attacks (Bruns et al., ; Hughes et al., ; Sutton et al., ; Arif et al., ). Crisis events often lead to environments, both physical and social, characterized by high levels of uncertainty, leading people to flock to information and communication technologies, such as social media, to engage with event-related content. Individuals use online platforms to search for, disseminate, curate, exchange, and challenge crisis information in an attempt to make sense of what is going on around them (Danzig et al., ; Barton, ; Scanlon, ; Spiro et al., ; Andrews et al., ). The online public engages in diverse forms of communication, including but not limited to soliciting opinions, sharing emotions and reactions, expressing condolences, and providing social support (Bruns et al., ). While the potential utility of social media platforms during a crisis is salient, these tools are also used on a day-to-day basis to share information and guidelines for emergency preparedness (Reeder et al., ). However, despite continued and widespread use of social media for the purpose of crisis management, research on information and communication behaviors in this context is still in its early stages, and many open questions remain.

    



Today, social media platforms are often faster to report on crisis events than more traditional forms of media (Vis, ). In fact, it is often individuals who experience such events directly who produce eyewitness reports; one can be on the ground in the midst of a severe disaster yet have the ability to share updates and media with a global audience in a matter of seconds. Indeed, even though information may be limited and sparse (and of questionable credibility), social platforms are often the first place people look for news, particularly news related to breaking events (Anderson & Caumont, ). This phenomenon mirrors traditional theories of informal communication during crises, suggesting that individuals turn to familiar, everyday channels in nonroutine circumstances; they utilize established communication pathways and social networks to obtain information (Drabek, ). Social media are increasingly serving this function, providing important access to core social contacts, including family and friends, as well as a large, diverse information ecosystem curated by a global audience of interested users. Social media platforms afford valuable “soft” infrastructure in the crisis context. New technologies, coupled with increasing mobile adoption, allow individuals to reach a larger number of social contacts across much greater distances than was previously possible. The capacity to share, access, and exchange crisis information in near real time has expanded the role of citizens in response activities. Novel forms of citizen reporting, for example, present numerous opportunities for increasing situational awareness, as information can be shared by those experiencing the direct impact of the crisis as it unfolds—very important in rapidly changing situations (Cameron et al., ). Social media platforms have also been used to great effect by distributed volunteers aiming to help coordinate response and aid activities (Starbird & Palen, ).
While traditional media sources still play a significant role in information dissemination after events occur, new media are increasingly becoming a “go-to” source for breaking news (Anderson & Caumont, ). Social media have clearly transformed informal communication channels, yet effectively utilizing social media for crisis communication and management is not without challenges. Rumors and misinformation, which often arise in uncertain and dynamic environments, are also prolific online (Allport & Postman, ; Starbird et al., ; Arif et al., ). In some cases, the prevalence of misinformation has been specifically cited as one of the primary reasons emergency responders question the value of information on social media (Hiltz & Kushma, ; Liu et al., ). Other barriers include managing, searching, and making sense of vast quantities of information, along with the complementary problem of identifying relevant information amid the larger stream of messages—a task that can often feel like searching for a needle in a haystack. While these challenges persist, both the public and emergency responders recognize the potential of social media platforms and actively use these technologies to share information and connect with others during extreme events. Importantly, as social behavior during crises plays out on social media platforms, researchers have rich opportunities to study such phenomena. This chapter reviews prior and ongoing work that contributes to our understanding of current patterns of



 . 

online communication during times of crisis. It focuses specifically on work aiming to understand the information and communication practices of emergency responders and official crisis management organizations, a largely understudied aspect of this social phenomenon. The research discussed here consists of empirical studies that make use of large-scale data sets of behavioral traces captured from social media platforms. Together, this body of work demonstrates how computational techniques combined with rich, curated data sets can be used to explore information and communication behaviors in online networks.
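The “needle in a haystack” filtering problem noted above can be illustrated with a deliberately naive baseline. The messages and keywords below are invented for illustration; real crisis-informatics pipelines layer trained classifiers, geotags, and network signals on top of simple matching like this.

```python
def relevant(messages, keywords):
    """Return messages mentioning any event keyword (case-insensitive).

    A crude substring match: "flood" also matches "flooding". It is a
    baseline only; it misses paraphrases and catches false positives.
    """
    kw = {k.lower() for k in keywords}
    return [m for m in messages if any(k in m.lower() for k in kw)]

# Invented three-message stream: two relevant posts, one noise post.
stream = [
    "Flooding reported on Main St, avoid the area",
    "Just had a great lunch!",
    "Shelter open at Lincoln High for flood evacuees",
]
hits = relevant(stream, ["flood", "shelter", "evacuation"])
print(len(hits))  # 2
```

Even this baseline hints at the core difficulty: relevance judgments that are trivial for a human must be made automatically across millions of messages in near real time.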

Social Media Use in Crisis Events

Information and communication technologies have had a notable impact on disaster response, as well as on disaster studies. Research on the use of social media during crisis events has addressed a variety of questions about informal communication, collective behavior, and the social impact of technology within disrupted settings. Building on the established tradition of disaster studies within the social sciences, this work seeks to understand how social media serve as a communication and interaction platform for those directly affected by crisis events, as well as those indirectly impacted, curious onlookers, and those seeking to aid communities in need. Existing work extends traditional theories and methods, expanding scientific inquiry into the growing, interdisciplinary field of crisis informatics. As research in this area continues, it both affirms long-documented phenomena and identifies new roles and functions as individuals and groups (including those within the immediate proximity of the event as well as interested parties around the world) go online to engage with each other and the dynamic information space that emerges. Briefly reviewing what is known about the public’s use of social media during crises provides a context in which to situate the communication and information behaviors of emergency responders, and as such it is important to discuss before proceeding to the focus of this chapter: the role of emergency responders.

Social scientists have long been interested in how social systems respond to collective stress situations, exemplified by disaster and crisis events (Caplow, ; Barton, ; Blaikie et al., ). While there are many facets to disaster preparedness and response, the focus here is on communication and information-related behaviors.
When crises occur, members of the public attempt to make sense of the disrupted and uncertain physical and social environment around them. This involves a variety of communication and interaction activities as individuals and groups take stock and begin to search for, and make sense of, relevant information. This social process, known as collective sense-making, is often accompanied by high levels of general anxiety and uncertainty (Anthony, ). In some cases, obtaining information necessary to make important
decisions about protective action or response behaviors may mean the difference between life and death. Individuals turn to family, friends, neighbors, and strangers in their search for important crisis-related information. One natural outcome of collective sense-making is the proliferation of rumors. Rumors, as the concept is used in the social sciences, are stories of unknown validity at the time of communication. This definition lacks explicit reference to the veracity of these stories, illustrating that the concept of rumoring is more general than the colloquial usage, which often equates rumor with misinformation. Indeed, studies of rumoring behavior during crisis events have a long tradition in classical studies of disaster (Shibutani, ); more recently these theories have been applied and extended to online contexts as well (Bordia & DiFonzo, ; Spiro et al., ; Andrews et al., ). Studies of informal communication and rumoring reveal a complex, dynamic process of information exchange and transmission. Indeed, mapping propagation of rumors in populations has shown that levels of expressed uncertainty within this content can be an early indication of misinformation (Starbird et al., , ). Other work has demonstrated that communication processes and online attention can vary based on features of the crisis itself, along with sociodemographic characteristics of the affected population (Spiro, ). Complementary studies have provided insight into the qualitative features of rumor content, exploring differences in the classes of rumor stories that spread (Starbird et al., ). Together, these studies indicate that rumoring behavior is important to collective sense-making and that these social processes exhibit both important similarities and differences across events. More work is needed to further tease out the complex dynamics of these communication processes in social media. 
Another line of research on public use of social media during crises considers their value for citizen reporting and digital volunteerism (Gao et al., ). This type of social convergence behavior, long known to occur in offline settings—in which volunteers flock to the physical location of a crisis—is now mirrored online. Social convergence online reflects increased communication and attention on specific parts of the information space. Emergent organizations of Twitter users, for example, have demonstrated considerable effort and success at “leveraging social connections to move information—and in some cases supplies—between affected people on the ground, response agencies . . . and other volunteer crisis workers all over the world” (Starbird & Palen, ). Scholars have also considered the role of information and communication technology in aiding recovery in disaster-affected communities, again leveraging the networked nature of communication on social media platforms to “address problems that arise from information dearth and geographical dispersion” (Shklovski et al., ). In the past decade many perspectives on the use of social media during crises have emerged; however, the large majority of prior work approaches informal communication processes from the public’s perspective. Very few studies look specifically at the behavior of official emergency responders as they negotiate opportunities and challenges in the incorporation of social media for emergency management, though new research has started to fill this gap (Lindsay, ). This chapter aims to highlight opportunities and open
problems in this space, arguing that network-based perspectives on the online information and communication behavior of officials during crises can enhance crisis management.

Crisis Communication in the Networked Age

Crisis communication in the age of social media represents a fundamentally different social and technical process than that involved in prior contexts; social media have altered the pathways available for crisis communication, allowing the public and emergency responders to make explicit use of the network of social relationships present in society. Although social networks have always been valuable conduits for crisis-related information (Drabek & Haas, ), the online networks articulated in many social media platforms have transformed the geographic and temporal scale on which these phenomena operate. Crisis communication via traditional media (e.g., radio or television) embraces a broadcast perspective, aiming to reach affected populations with hazard-relevant information through broad dissemination channels, which typically have large audiences. Further exposure may result from information exchange in face-to-face settings. In contrast, new media, such as social media, often rely on network-based diffusion of information, making explicit use of the underlying social network to “move” information through a population. Efficient retransmission of content is often a designed component of social media platforms (e.g., reshares on Facebook and retweets on Twitter). These processes are depicted in Figure .

Figure . Idealized crisis communication paradigms demonstrating the importance of informal social networks for communication via new media. Warnings about imminent hazards must be distributed to target populations, depicted via highlighted nodes. Exposure to crisis communication could be through broadcast (common in traditional media) or via information diffusion along social networks (common in new media).




Although the communication process shown in Figure . is an idealized case (the actual process being some combination of broadcast and network-based exposure), it helps to illustrate the importance of social ties for crisis communication in the digital age. Understanding how social networks structure information diffusion, collective sense-making, rumoring, and interaction online is vital for effective utilization of these platforms. As shown in the subsequent sections, research in this area is growing, but there is much to be done. Social media may provide unprecedented access, allowing emergency responders to disseminate event-related information, interact with members of the public, and monitor public opinion like never before. Indeed, while many emergency responders recognize the “need” to be on social media and maintain an active presence on these platforms, many open questions remain about the effectiveness of social media platforms in reaching and engaging with members of the public during times of crisis. Also important to highlight in this discussion is the theoretical shift from one-way (solely dissemination) to two-way communication pathways during crises. The potential to use social media to solicit and build situational awareness, for example, is a promising direction in the field (Vieweg et al., ). Prior work on organizational use of social media for emergency response suggests that the utility of these new tools typically falls into two broad categories: first, social media can be used to disseminate information, and second, these platforms could be used as a management tool in itself, for example to receive victim requests for assistance or increase situational awareness via citizen reports (Lindsay, ). Research also suggests that current usage patterns follow a strategy that is a mix of these choices, but dissemination is clearly the default mode of usage.
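The broadcast versus network-diffusion contrast can be made concrete with a toy simulation. This is an illustrative sketch: the follower graph, the account names, and the assumption that every exposed user retransmits to all of their followers are hypothetical, not drawn from the chapter's data.

```python
from collections import deque

def diffusion_reach(follower_graph, seed):
    """Users exposed when every recipient retransmits to their followers.

    follower_graph maps a user to the set of users who follow them
    (i.e., the users who see that user's posts).  Traversal is a
    breadth-first walk along follower ties.
    """
    exposed = {seed}
    frontier = deque([seed])
    while frontier:
        user = frontier.popleft()
        for follower in follower_graph.get(user, ()):
            if follower not in exposed:
                exposed.add(follower)
                frontier.append(follower)
    return exposed

# Hypothetical network: an official account "ema" with two direct
# followers, each of whom is followed by further users.
graph = {
    "ema": {"a", "b"},
    "a": {"c", "d"},
    "b": {"d", "e"},
    "e": {"f"},
}

broadcast = graph["ema"]                           # one-hop, broadcast-like exposure
network = diffusion_reach(graph, "ema") - {"ema"}  # multi-hop network diffusion

print(len(broadcast))  # 2
print(len(network))    # 6
```

Under the (strong) assumption of universal retransmission, network diffusion reaches three times the direct audience in this toy graph; in practice retransmission is probabilistic, which is where the propagation models discussed elsewhere in this handbook come in.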
For example, in work exploring response behavior during Hurricane Sandy, researchers found that protocols for Twitter use were dynamic, reflecting improvised action: emergency personnel responded to requests for aid on Twitter, while at the same time trying to reinforce the use of official channels for aid requests (Hughes et al., ). Other work has found that responding agencies use social media solely for dissemination purposes and rarely interact with other platform users (Larsson & Agerfalk, ; Carter et al., ). This variability is not surprising. Government command-and-control protocols rarely integrate seamlessly with social media, which can lead to a mismatch between public expectations and responder practice (Kavanaugh et al., ). Moreover, legal barriers, insufficient resources, and lack of training can all pose significant impediments to effectively engaging with constituents via social media (Hughes et al., ). Responding to many of these challenges, the US Department of Homeland Security convened a Virtual Social Media Working Group to offer best practices on the “safe and sustainable use of social media technologies before, during, and after emergencies” (Virtual Social Media Working Group, ). After describing advances in understanding of social media practices for public safety, the report outlines gaps in the research literature, many of which point to the need for standards, training, and guidance. It is clear that use of social media for crisis communication merits further investigation, particularly regarding effective utilization of social media by
government emergency management–related entities. We know little about how these official entities currently utilize social media and even less about how to advise them to use social media effectively during both routine and crisis events to alleviate both immediate and long-term effects of disasters. The subsequent sections of this chapter highlight some work that attempts to fill these gaps, focusing narrowly on participation by official organizations on social media platforms in order to describe current practices and examine how these practices may have changed over time. This work is empirically driven, utilizing large-scale data sets of behavioral traces captured from social media platforms, namely terse-messaging platforms such as the microblogging service Twitter. It also often takes a network perspective, considering how social ties between organizations may serve as conduits for information, knowledge, and organizational learning.

Emergency Responders in the Spotlight

We begin our discussion of the role of emergency responders in crisis communication and collective sense-making evident on social media platforms by examining the dynamics of public attention during crisis events. This work serves to explicitly demonstrate common experiences of emergency responders online and to highlight how rapidly changing circumstances can present both opportunities and challenges for emergency management officials. Moreover, this phenomenon motivates future work that focuses on the behavior of emergency responders, as attention is often directed at official responders and local emergency management–related organizations, as demonstrated in the examples discussed in this section. Crisis events are often coupled with drastic changes in social behavior (Palen et al., ; Fritz & Marks, ). One very salient example of this type of behavior is mass convergence on the crisis location; local citizens, emergency responders, and aid organizations flock to the physical location of the event (Dynes & Quarantelli, ; Mileti, ). Today, in the age of social media, we also see global onlookers turn to communication and information exchange platforms to seek and disseminate event-related content (Murthy, ; Hui et al., ). Social convergence behavior, long known to occur in offline settings in the wake of crisis events, is now mirrored—perhaps enhanced—in online settings. Mass convergence can occur across many different types of social behavior. Research has started to examine convergence behavior in online settings, specifically during crisis events. Work by Starbird & Palen () examines the convergence of digital volunteers on social media platforms and online social networking sites and their ability to remotely coordinate response efforts (see also Starbird & Palen, ; Monroy-Hernandez et al., ).
Prior work also explores mass convergence of attention during crisis events (Sutton, Spiro, Butts, et al., ), demonstrating that mass convergence of attention online is especially prevalent for actors who are geographically close to the area of impact and who

Figure . Idealized differences in social embeddedness after social convergence of attention during the immediate aftermath of crisis events (everyday versus during-event).

experience tremendous change in the size of their audience (other users who are paying attention to them) on these platforms, as illustrated in Figure . Despite the growing recognition that online social convergence behavior is an important aspect of crisis response, systematic investigation of this phenomenon online is limited. Mass convergence of attention, in particular, is distinct from other forms of mass convergence because it results in extreme—both in magnitude and frequency—changes in the features of one’s personal (i.e., egocentric) social network. Viewed through the framework of social network theory, mass convergence of attention on individual actors can be conceptualized in terms of degree dynamics; degree is a measure of the number of social ties an individual maintains. While the study of tie dynamics is a core topic in social network analysis, prior work in this area has tended to focus on strong interpersonal relationships (e.g., friendship) and the tie formation process, rather than on the decay of these relationships over time (Burt, , ; Leskovec et al., ). Social ties on social media, on the other hand, can be both created and destroyed with the click of a button, existing for a matter of minutes or hours rather than years. Critical to the discussion here is the fact that convergence of attention has important practical implications for emergency responders in the crisis context. A local police department, for example, can go from having an audience of ten individuals for the content it produces to one of ten thousand individuals in the hours after an incident occurs. This produces a unique social environment for the actors involved, akin to obtaining celebrity status.
One immediate implication of this increase in audience size for emergency responders is an increased ability to reach members of the public with time-sensitive, perhaps life-changing, event-related information during crisis events (Sutton, Spiro, Butts, et al., ; Sutton, Spiro, Johnson, Fitzhugh, et al., ). Not only would emergency responders have direct access to more people, but they could also expand their reach through the social connections of that audience, as seen in Figure . Figure . shows a time series of audience size for the official New York City Fire Department Twitter account, @FDNY. While the account seems to be steadily gaining followers over time, there are multiple points at which the account gains large numbers

Figure . Change in audience size (number of followers) for the official New York City Fire Department Twitter account, @FDNY, from late August 2015 through February 2016.

of followers in a very short time. Interestingly, one of these spikes corresponds with the anniversary of the September 11, 2001, World Trade Center terrorist attacks. Likewise, Sutton, Spiro, Johnson, et al. () found that the Boston Police and Boston Police public information officer Cheryl Fiandaca saw increases in followers of more than % and ,%, respectively, during the 2013 Boston Marathon bombing crisis. These are just illustrative examples, but preliminary studies indicate that audience-size increases are “sticky”; that is, once gained, the followers stick around—decay of attention is very slow, if it occurs at all (Spiro & Butts, ). Mass convergence of attention could be a great opportunity for emergency management–related organizations, but it can rapidly devolve into a formidable obstacle if organizations are unprepared to handle the spotlight. In the context of emergency management, the implications are both positive and negative; while increased audience size translates into increased potential for information dissemination, it is important to recognize the potential for increased consequences if missteps are taken. Moreover, tie decay is of particular importance in crisis settings because emergency management organizations have a potentially limited window of opportunity in which they have direct access to a large population of individuals following instances of mass convergence of attention. That is, once in the spotlight, what should these organizational actors do to take advantage of their increased access to constituents? As such, it becomes increasingly important to understand this phenomenon, quantifying the incidence and magnitude of mass convergence of attention on emergency responders in crisis contexts.
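Quantifying the incidence and magnitude of such convergence events could start from something as simple as flagging outsized day-over-day gains in a follower time series. The sketch below is illustrative only: the threshold rule (a multiple of the median daily gain) and the mock series are assumptions, not the measurement approach used in the studies cited above.

```python
def convergence_events(followers, factor=10.0):
    """Flag day indices where the daily follower gain is at least
    `factor` times the median positive daily gain, a crude marker of
    mass convergence of attention.  `followers` is a daily time
    series of audience size."""
    gains = [b - a for a, b in zip(followers, followers[1:])]
    positive = sorted(g for g in gains if g > 0) or [1]
    median = positive[len(positive) // 2]
    return [i + 1 for i, g in enumerate(gains) if g >= factor * median]

# Hypothetical daily audience counts for an official account: slow
# organic growth, then a jump after an incident on day 5.
series = [100, 102, 104, 105, 107, 1100, 1102, 1103]
print(convergence_events(series))  # [5]
```

A production analysis would need to handle missing observations, follower losses, and the "sticky" plateau after a spike, but the core idea of comparing each day's gain to a baseline rate carries over.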




Mass convergence of attention is just one behavior observed on social media during crisis events, but it is a behavior that results from the combination of technological affordances, communication dynamics, and exogenous shocks to the social system. As such, it illustrates the importance of understanding more about how emergency responders utilize social media during crisis events. Characterizing social convergence behavior is vital for understanding the dynamics of information sharing and transmission in human societies, as well as the opportunities emergency management organizations have for reaching members of the public with important, time-sensitive, crisis-related information. In particular, recognizing that everyday contexts may not be representative of crisis situations is important for organizational learning and policy; it implies that different types of organizations need different types of information strategies.

Methods for Studying Online Communication of Emergency Responders

Understanding the dynamics of informal communication and collective sense-making evident in behavioral trace data captured from social media platforms is an important step in facilitating effective crisis preparedness, response, and recovery. Researchers have utilized social media data to address questions within the fields of crisis informatics, sociology of disasters, and crisis communication. The large majority of existing work in this area, as discussed previously, has focused on the communication behavior of the public during post-event response periods, utilizing keyword-based samples of social media posts produced by the public after events occur. These studies are retrospective and case-study based, focused on documenting and understanding how the general public behaves during nonroutine situations. The general public perspective, however, represents only half the story. Surprisingly, very little work has examined the communication practices of official, governmental emergency responders, revealing a significant gap in our current understanding of online communication during crisis events. To help address this gap, this chapter presents a user-centric approach to studies of social dynamics during times of crisis (Spiro, ). User-centric approaches—which focus on a sample of users and their activity—offer many practical and theoretical advantages, allowing researchers to make use of rich data sets of online behavioral trace data in new ways, while at the same time avoiding common criticisms of big data studies in the social sciences. In contrast to keyword-based approaches to sampling data, within a user-centric approach, the researcher begins with a well-defined population of interest: a set of users that will be the focus of study (Spiro, ).
Importantly, in many cases this study population can be defined outside of the social media platform or online interaction
environment itself. For example, one might be interested in how police precincts manage public opinion after police-involved shooting incidents. For a particular research context, such as Los Angeles, it is relatively straightforward to enumerate all police precincts within city boundaries (Lee et al., ). Once the population is defined in this way, one can match each organization with its online account (e.g., Twitter handle, Facebook page). Although many cases are covered by this simple matching process, more complex scenarios are also possible. Many emergency responders, for example, have multiple social media accounts, each designated for specific content, so researchers must be clear about where study boundaries are drawn and the unit of analysis. Despite these intricacies, having a well-defined study population is, for many research problems, preferable to sampling content without the possibility of defining the sampling frame and inclusion probability (as is the case for many keyword-based approaches). User-centric approaches also lend themselves to alternative entry points for data collection. Twitter, for example, offers multiple application programming interface (API) endpoints—both its Streaming API and REST API—each of which has different constraints and available data. In our experience, user-centric approaches to gathering Twitter data allow for prospective and exhaustive data collection for specified observation periods. Research can therefore take a longitudinal approach and be confident in the completeness of data within that period. To illustrate the details of data collection strategies designed under a user-centric paradigm, we discuss one such approach that utilizes the Twitter API to collect data on emergency responders involved in a local hazard response event. Collecting behavioral trace data can be a challenging task. 
Although many online systems offer access to vast records, sampling, retrieving, and archiving data from these systems requires sophisticated tools and technical know-how. Today, researchers can utilize free (and paid) tools that facilitate access; however, in many cases these systems obfuscate details of how data are collected, potentially leading to data with poor statistical properties or that violate assumptions of subsequent analysis methods. Instead, many researchers choose to design custom data collection tools to ensure systematic data collection and data sets that are appropriate to their research questions. We begin with the premise that the researcher is able to define a study population of interest and is faced with the task of collecting data about the online activity and interactions of that population. We take Twitter as the research setting, as it is one of the most heavily utilized platforms during crises. Twitter users are often identified by their account name, called a username or handle. Usernames, however, are not static.1 As such, it is advisable to use a unique, static identifier as the entry point for continued data collection. On Twitter this means querying the API to obtain the user ID associated with each account. These user IDs are static and represent a means of uniquely identifying each user in the study population in the event that some users may choose to change their usernames. User IDs provide an entry point for querying other relevant information. The Twitter API allows developers to access account data

Figure . Example user-centric data collection schema for retrieving digital traces of online communication and social networks of emergency responders on Twitter. Offline characteristics of each emergency responder (actor covariates, e.g., response role, scale of operations) are linked via the account user_name to a static user_ID, which in turn links to data about online activity: account data, tweets, and social ties.

(e.g., account creation date, number of followers/followees), messages posted and/or retweeted (i.e., tweets), and social network data (user IDs of the user’s followers/followees). Available data and the linkages between them are depicted in the lower part of Figure . User-centric approaches to research on social dynamics during crises also allow for behavioral trace data to be augmented with organizational records and other open data, providing numerous possibilities for other actor-level covariates to be incorporated. Enhancing social media data with features such as emergency responder function, location, or budget alleviates issues that result from the tendency of many large-scale social media data sets to be case rich but variable poor, and therefore not sufficient to address many questions in the social sciences. Combining data from multiple sources can be difficult, particularly when data are collected at different levels and for different purposes. By framing analysis in terms of studying a population of users, linking multiple distinct data sets becomes manageable in many common cases. Government emergency response plans, for example, offer researchers many interesting organizational-level features to combine with social media posts by those same organizational entities. In the remainder of this chapter we highlight work that falls within a user-centric approach to study crisis communication and emergency responders on social media. These studies utilize large-scale behavioral trace data and commonly incorporate a network paradigm to motivate research questions and methods. Finally, areas for future work are noted, providing avenues to apply the strategies presented here.
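The username-to-ID resolution and covariate linkage described in this section can be sketched as follows. All handles, IDs, and field names below are mock values standing in for real Twitter API lookups; the point is the linkage logic, not the API calls.

```python
def build_study_population(roster, username_to_id):
    """Key an offline-enumerated roster of organizations by static
    user IDs, so the study population survives username changes.

    `roster` is the offline enumeration (e.g., all precincts in a
    city); `username_to_id` stands in for an API lookup resolving
    each handle to its static user ID."""
    population = {}
    for org in roster:
        user_id = username_to_id.get(org["handle"])
        if user_id is None:
            continue  # no matched account; set aside for manual review
        population[user_id] = {
            "handle": org["handle"],       # mutable; kept for display only
            "response_role": org["role"],  # offline actor covariate
        }
    return population

# Hypothetical roster with offline covariates, and a mock ID lookup.
roster = [
    {"handle": "cityfire", "role": "fire"},
    {"handle": "citypolice", "role": "police"},
]
lookup = {"cityfire": 11, "citypolice": 42}

pop = build_study_population(roster, lookup)
print(sorted(pop))  # [11, 42]
```

Because records are keyed by the static user ID, later tweet and follower collections can be joined back to the organizational covariates even if an account renames itself mid-study.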



 . 

Emergency Responders Online: Routine and Response

Social media are an important tool for risk communication. However, despite the fact that many emergency responders recognize the potential of social media platforms and actively use these technologies to share information and connect with constituents during crisis events, relatively little is known about the online communication practices of these organizations and officials. New research directly addresses this gap, aiming to identify current usage practices in terms of communication behavior by emergency responders. The authors utilize the user-centric data collection approach to examine communication behavior longitudinally, in both routine and crisis contexts. Following the data collection approach outlined previously, Butts et al. () identified a set of government entities that are key actors in the response and recovery process for all types of hazards and threats. In a number of subsequent studies utilizing behavioral trace data from Twitter collected about this study population (Reeder et al., ; Sutton et al., ), these researchers characterized the communication practices of organizational actors within the emergency preparedness and response ecosystem in the United States. The resulting longitudinal data set captures online communication from  official emergency management–related Twitter accounts over the period from June  through September . Also obtained were data about the timing, location, and severity of notable disasters using the Federal Emergency Management Agency’s (FEMA) database of disaster declarations. These public records serve as a temporal index of events that were severe enough to require a federal institutional response. There were  disaster declarations during our observation period.
Researchers began by quantifying the information space spanned by messages produced by official entities on Twitter. Next, they examined the temporal dynamics of these topics in relation to exogenous shocks: disaster incidents. In this study, Reeder et al. () use an unsupervised learning model applied to a large-scale text corpus of tweets to estimate the prevalence of different topics on days that contain major disaster events, compared with nonevent periods. Results reveal two important findings: (1) everyday, routine information behavior differs from behavior during crisis periods, and (2) when extreme events occur, entities markedly adjust their messaging strategies, orienting toward a common event-focused communication strategy that explicitly addresses response and recovery activities. Identifying several topical shifts between nonevent and event days, as illustrated in Figure ., researchers found a clear increase in response- and recovery-related topics; preparedness topics, on the other hand, are shown to decrease in occurrence. The differences between nonevent messaging and event messaging magnify when looking only at nonsafety accounts. These patterns reveal evidence of convergence toward event-oriented content: when extreme events occur, emergency responders of all types—those both directly and indirectly affected—are more likely to tweet similar information.
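The event-day versus nonevent-day comparison can be illustrated with a minimal sketch. The topic names and probability values below are mock data, and the plain percent-difference of means stands in for the authors' full estimation and significance-testing procedure.

```python
from statistics import mean

def pct_difference(event_probs, nonevent_probs):
    """Percent difference in mean estimated topic probability,
    event days relative to nonevent days."""
    e, n = mean(event_probs), mean(nonevent_probs)
    return 100.0 * (e - n) / n

# Mock daily topic-probability estimates (e.g., output of a topic
# model); values are illustrative, not the chapter's results.
topics = {
    "response_recovery": ([0.30, 0.34, 0.32], [0.20, 0.22, 0.18]),
    "preparedness":      ([0.08, 0.07, 0.09], [0.12, 0.10, 0.14]),
}

for name, (event_days, nonevent_days) in topics.items():
    print(name, round(pct_difference(event_days, nonevent_days), 1))
# response_recovery 60.0
# preparedness -33.3
```

A positive value indicates a topic that gains prominence on event days (as response and recovery topics do in the study), while a negative value mirrors the observed decline in preparedness messaging; assessing significance would additionally require a test such as the pairwise t-tests the authors report.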
Figure . Percent difference in mean estimated topic probability for event days minus nonevent days. Darker bars represent statistically significant (α = 0.05) differences using pairwise t-testing. (Topics shown include Advisories, Severe Weather, Hurricane, Disaster Damage, Response & Recovery, Wildfires, Storm, Travel, Safety Tips, Public Health, and State Press.)

This behavioral pattern becomes clear when one visualizes distances between accounts by the context of information production (event days versus nonevent days). Figure . illustrates this topical difference and the convergence toward common "event-oriented" content produced by organizations in an emergency. To further quantify these communication dynamics, researchers compared the average topical distance between accounts by state on nonevent days versus event days and found that, on average, accounts from the same state are % closer on event days than on nonevent days, a statistically significant difference. Reeder et al. () demonstrate that exogenous shocks are associated with specific changes in information-sharing practices, showing that organizations adapt to changing information environments and needs. Though this is not entirely surprising, one might have expected no change in communication strategies if, for example, extreme circumstances, coupled with limited resources, had forced emergency responders to divert content production to more traditional communication channels. Instead, emergency management–related organizations are actively managing social media channels. This work addresses the fundamental question of what emergency responders are doing on social media platforms like Twitter, quantifying communication behavior during periods of time that do and do not contain major crisis events. From this work, we gain insight into the ways in which emergency responders currently utilize social media for crisis communication. Missing, however, is additional exploration of how emergency responders make use of the network-based communication available on these platforms. While investigations of communication treating each organizational entity in isolation are a first step, we also want to explore how emergency responders take advantage of the underlying social network these platforms expose.

Figure . Account similarity by topics tweeted on event and nonevent days. (Panels (a) and (b) show topical content similarity of accounts by functional role (coast guard, emergency, government, governor, information tech, national guard, police, public health), scaled to two dimensions via multidimensional scaling, on nonevent and event days, respectively.)
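A hedged sketch of this kind of similarity analysis (not the authors' code; the account vectors below are invented for illustration) represents each account by its estimated topic distribution, computes pairwise topical distances, and projects the accounts into two dimensions with multidimensional scaling:

```python
# Illustrative sketch: accounts as topic-probability vectors, pairwise
# topical distances, and a 2D multidimensional scaling (MDS) projection
# of the kind used for the account-similarity figures. Toy values only.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Rows: accounts; columns: estimated topic probabilities (hypothetical)
account_topics = np.array([
    [0.70, 0.20, 0.10],   # e.g., a police account
    [0.65, 0.25, 0.10],   # another police account
    [0.10, 0.30, 0.60],   # a public health account
    [0.15, 0.25, 0.60],   # another public health account
])

dist = squareform(pdist(account_topics, metric="euclidean"))
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords)  # 2D positions; nearby points tweet about similar topics

# Average topical distance within a group of accounts (e.g., same state),
# which could be compared between event and nonevent days
same_group = dist[np.ix_([0, 1], [0, 1])]
print(same_group[np.triu_indices(2, k=1)].mean())
```

Comparing such within-group average distances on event versus nonevent days is one way to quantify the convergence described in the text.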

. I C  T

..................................................................................................................................
The previous section offers an initial look at the general communication and information strategies employed by emergency management organizations on Twitter. Evidence suggests that emergency responders converge significantly toward a common information space during crisis events, likely reflecting the use of social media platforms as an additional crisis communication channel. On the other hand, simply looking at the messages posted by emergency responders on social media platforms does not consider how social media might also serve as an information source during crisis events. As discussed previously, a number of studies have considered the potential of social media as an information source, increasing situational awareness by making use of the public's citizen reporting of personal experiences during crisis events. This chapter takes a different perspective, considering the use of social media for interorganizational communication and coordination. Social media may provide an important "backchannel" for emergency responders to increase awareness of what partners and peers are doing or saying. In particular, the asynchronous but near real-time information exchange that social media platforms afford could provide new opportunities for rapid information sharing (a critical component of effective response). To be clear, interorganizational communication captures behavior among emergency responders, rather than between emergency responders and the general public. Other work has considered the reactions of the public to content posted by emergency management officials (e.g., see Sutton et al., ). As mentioned previously, dissemination of content is not the only value of social media platforms.
Many social media platforms allow for the explicit articulation of social relationships, such as Twitter following relations and Facebook friendship. Indeed, much of the value of these platforms is derived from network-based communication and information exchange. Twitter's following relationships, for example, constitute subscription ties along which information in the form of tweets is automatically delivered to an actor's subscribers, or "followers." Do emergency responders, then, make use of these relationships to exchange information or communicate with other emergency responders? To answer this question, the researcher must collect information about the social networks of the study population. Collecting social network data from social media can sometimes be prohibitively difficult (e.g., rate limits make it extremely difficult to obtain the entire network—thousands or millions of following ties—on Twitter). However, certain research questions, such as the one just asked, lend themselves to the user-centric data collection approach discussed previously. In this case, we are interested in the induced network of social relationships among emergency responders on Twitter; in other words, we wish to observe the social ties between accounts in the study population. By scoping the question, we can alleviate many of the challenges in data collection that result from actors maintaining large numbers of social ties at relatively low cost. On Twitter, for example, emergency management accounts tend to have far fewer outgoing relationships than incoming ties; therefore, by sampling only the outgoing ties of each actor in the study population, we can recreate the induced network among actors of interest without too much trouble. Consider a set of emergency responders participating in a local wildfire incident. Here we consider data from a wildfire that occurred outside of Colorado Springs, Colorado, in the summer of . The Waldo Canyon fire was one of the most destructive fires in Colorado state history, destroying numerous homes and buildings in its path. Butts et al. () collected data about sixteen responding organizational entities during this period, again following the user-centered approach discussed in section . Figure . is a network visualization of the follower network among these organizations at two different points in time during the response period. Node size reflects the number of followers each account has at the time. Nodes are highlighted when they are posting messages on Twitter, and ties are highlighted to indicate information transfer from author to subscriber.
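A minimal sketch of this sampling strategy, assuming each actor's outgoing "following" list has already been retrieved: keep only the ties whose targets are also in the study population, yielding the induced network. The follow lists below are hypothetical (a few account names are borrowed from the case for flavor).

```python
# Build the induced following network among a study population from
# per-account outgoing-tie samples. Follow lists are invented examples.
import networkx as nx

population = {"fema", "COEmergency", "EPCSheriff", "springsgov"}

# Outgoing follows sampled per account; targets may fall outside the
# population (e.g., "whitehouse") and are discarded
outgoing = {
    "fema": ["COEmergency", "whitehouse"],
    "COEmergency": ["fema", "EPCSheriff", "some_celebrity"],
    "EPCSheriff": ["COEmergency", "springsgov"],
    "springsgov": ["EPCSheriff"],
}

G = nx.DiGraph()
G.add_nodes_from(population)
for source, targets in outgoing.items():
    for target in targets:
        if target in population:       # restrict to the induced network
            G.add_edge(source, target)

print(G.number_of_nodes(), G.number_of_edges())  # 4 6
```

Because only outgoing ties of the (small) study population are sampled, the data collection stays within platform rate limits while still recovering every tie among the actors of interest.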
During the forty-eight hours between these two points, response activities intensified—the fire did significant damage to residential areas in its path, resulting in loss of human life. This simple case illustrates some interesting features of networks among emergency responders over time during a crisis. First, while emergency responders in Colorado Springs, and the state of Colorado more generally, follow each other on Twitter, notable "clusters" exist within the induced network. Federal and state-level organizations are evident in the top left, emergency management–related organizations are tightly bound in a dense group on the right, and local entities are found in the lower left. This separation of organizations seems to be based on scale of operations and authority. Importantly, it could leave some groups isolated in terms of information exposure: FEMA, for example, may not see information posted by the Colorado Springs Police Department Public Affairs Section (CSPDPIO). Even more striking is the fact that the underlying social network among these actors does not change during an eventful response period of the crisis, suggesting that network ties may reach a relatively stable state; even a severe crisis does not shock the social structure. This differs drastically from the clear change in outside attention directed at these local emergency responders, evident in the increased node size from figure a to b and discussed in section . While members of the public may actively reconfigure and create new social ties to obtain crisis-related information, emergency responders do not create new social relationships with other organizations involved in the response and recovery effort. These results surprised the researchers, given that response activities over this time likely required increased coordination and collaboration among response personnel and emergency management organizations in the region. The results, though specific to this case, are illustrative of more general findings and reinforce prior observations that emergency responders use social media as another channel for disseminating crisis-related information to the general public rather than as a tool for managing response activities. As work continues to explore interorganizational communication dynamics in routine and crisis contexts, we will learn more about the opportunities and challenges of utilizing social media in this way. Importantly, given the lack of clear guidelines and policies to help emergency responders negotiate communication practices on social media, this type of research offers direction for organizational learning. Incorporating what was learned in section  suggests that emergency response organizations might consider both functionally similar and geographically proximate others as models for learning social and information behavioral norms on social media platforms.

Figure . Interorganizational communication networks on Twitter during the Waldo Canyon wildfire, Colorado Springs, CO. Node size is scaled by audience size; blue edges represent information exchange. The following network remains static over this period; organizations do not reconfigure social relationships. (Accounts shown include fema, femaregion8, PSICC_NF, CSPDPIO, PPRedCross, springsgov, READYColorado, AF_Academy, COEmergency, EPCSheriff, CSP_News, CDPHE, mayorstevebach, CSFDPIO, CSP_CSprings, and Colorado.)

. Discussion and Conclusion

..................................................................................................................................
Social media have quickly become one of the more prominent features within the landscape of communication and information technologies utilized by the public during crisis events. However, public officials are still learning how best to use these new channels. Despite improvements and new policies, there seems to be a disconnect between public officials' current use of such platforms and the general public's expectations for their use. Ongoing research provides the empirical support necessary to inform social media strategies among emergency management organizations, aiming to bring insight to the potential for incorporating social media platforms into all phases of disaster preparedness, response, and recovery. Much of this work utilizes large-scale data sets of digital traces and network-based perspectives on social interaction. This chapter summarizes prior and ongoing research in this area. In contrast to much work within the fields of crisis informatics, sociology of disaster, and crisis communication, the research presented here focuses specifically on the communication and information behaviors of official emergency responders, rather than the general public. Emergency responders represent a key population that actively uses social media during crisis events despite limited guidelines on how best to take advantage of the affordances of these tools. As work in this area continues to grow, we hope emergency responders will expand the functions of social media during crises, building on existing dissemination capabilities to also consider pathways for organizational learning, increasing preparedness and situational awareness, and engaging with the public.
The research highlighted here will aid in pre-event planning to take advantage of increased audience size, as well as help to set expectations for what (and what not) to expect in terms of social network dynamics during crisis events. Characterizing social convergence behavior is vital for understanding the dynamics of information sharing and transmission in human societies, as well as the opportunities emergency management organizations have for reaching members of the public with important, time-sensitive, crisis-related information. In particular, recognizing that everyday contexts may not be representative of crisis situations is important for organizational learning and policy; it implies that different types of organizations need different types of information strategies. Moreover, this research helps to describe the current social media usage practices of emergency responders, allowing emergency personnel to reflect on pre- and post-crisis communication. Understanding how these behaviors change during extreme events, and over time as experience with social media grows, will allow emergency responders to better plan for future events. Exploring social interaction among emergency responders could promote organizational learning and collaboration. There are many possibilities and many avenues for future work. Advances in knowledge in this area have direct practical consequences, improving emergency management and potentially reducing negative consequences for the people affected by crises around the world.

. Opportunities for Future Work

..................................................................................................................................
The research discussed herein offers insight into the communication patterns employed by emergency management organizations on Twitter during the early post-adoption period, but open questions remain. As such, this work indicates a number of directions for future research. New research on social convergence behavior, particularly mass convergence of attention, will aid in pre-event planning to take advantage of increased audience size, as well as help to set expectations for what (and what not) to expect in terms of social network dynamics during crisis events. Characterizing information behavior and social convergence is vital for understanding the dynamics of information sharing and transmission in human societies, as well as the opportunities emergency management organizations have for reaching members of the public with important, time-sensitive, crisis-related information. One might consider how information and communication behavior diffuses through the social network among emergency personnel. In particular, are entities that occupy similar positions in the information space closer in the network of social ties made explicit on many social media platforms? Can one map organizational learning by looking at communication practices? Though outside the scope of this chapter, we see promise in comparing the practices outlined here with more recent or current behavior to explore if and how communication norms have been established. Another important implication of these studies relates to effective dissemination of both preparedness- and response-related information. In particular, prior work demonstrates that emergency management–related organizations tend to produce similar information during disaster events, but in routine circumstances their content is less similar to that of local actors; rather, it aligns with functional roles. This could be problematic for effective preparedness among local organizations, because constituents must pay attention to many types of organizations in their area to get a complete information picture of preparedness activities.



 . 

Tools for disaster management that facilitate automated discovery of similar others or potential collaborators should account for the dynamic nature of similarity and the need to adapt to multiple situations. For example, many social media platforms suggest other users of interest with whom the focal actor might consider forming a social tie; "friend" suggestions might look very different in routine and crisis contexts. Moreover, identifying potentially similar actors before crises occur could help facilitate coordination efforts during response and recovery periods.

A This material is based on work supported by, or in part by, the US Army Research Laboratory and the US Army Research Office under grant number WNF---. The author would also like to thank Carter Butts, Jeannette Sutton, and Kate Starbird, along with students in the NCASD Lab at UCI and the DataLab and emComp Lab at UW, for their frequent feedback on this research topic.

N . Twitter users can change their username while maintaining their account.

R Allport, G., & Postman, L. (). The Psychology of Rumor. New York, NY: Henry Holt. Anderson, M., & Caumont, A. (). How Social Media Is Reshaping News. Technical Report. Washington, DC: Pew Research Center. Andrews, C., Fichet, E., Ding, Y., Spiro, E. S., & Starbird, K. (). Keeping Up with the Tweet-dashians: The Impact of Official’ Accounts on Online Rumoring. In Proceedings of the International Conference on Computer-Supported Cooperative Work and Social Computing (pp. –). New York, NY: ACM. Anthony, S. (). Anxiety and rumor. The Journal of Social Psychology, , –. Arif, A., Shanahan, K., Chou, F.j., Dosouto, Y., Starbird, K., & Spiro, E. (). How Information Snowballs: Exploring the Role of Exposure in Online Rumor Propagation. In Proceedings of the International Conference on Computer-Supported Cooperative Work and Social Computing (pp. –). New York, NY: ACM. Barton, A. (). Communities in Disaster: A Sociological Analysis of Collective Stress Situations. Garden City, NY: Doubleday and Company. Blaikie, P., Cannon, T., Davis, I., & Wisner, B. (). At Risk: Natural Hazards, People’s Vulnerability and Disasters (nd ed.). New York: Routledge. Bordia, P., & DiFonzo, N. (). Problem solving in social interactions on the Internet: Rumor as social cognition. Social Psychology Quarterly, , –. Bruns, A., Burgess, J., Crawford, K., & Shaw, F. (). #qldfloods and @QPSMedia: Crisis Communication on Twitter in the  South East Queensland Floods. Technical Report

    



CCI ARC Centre of Excellence for Creative Industries & Innovation. https://eprints.qut. edu.au///floodsreport.pdf Burt, R. S. (). Decay Functions. Social Networks, , –. Burt, R. S. (). Bridge Decay. Social Networks, , –. Butts, C. T., Sutton, J., & Spiro, E. S. (). Hazards, Emergency Response, and Online Informal Communication Project Data. University of California, Irvine. Butts, C. T., Sutton, J., Spiro, E. S., Johnson, B., & Fitzhugh, S. M. (). Hazards, Emergency Response and Online Informal Communication Project: Waldo Canyon Wildfire Dataset. University of California, Irvine. Cameron, M. A., Power, R., Robinson, B., & Yin, J. (). Emergency Situation Awareness from Twitter for Crisis Management. In WWW Companion (p. ). New York, NY: ACM. Caplow, T. (). Rumors in War. Social Forces, , –. Carter, L., Thatcher, J. B., & Wright, R. (). Social Media and Emergency Management: Exploring State and Local Tweets. In Proceedings of the th Hawaii International Conference on System Sciences (pp. –). New York, NY: IEEE Computer Society. Danzig, E. R., Thayer, P. W., & Glanter, L. R. (). The Effects of a Threatening Rumor on a Disaster-Stricken Community. Technical Report. Washington, DC: National Academies Press. Drabek, T. E. (). Social Processes in Disaster: Family Evacuation. Social Problems, , –. Drabek, T. E., & Haas, J. E. (). Laboratory Simulation of Organizational Stress. American Sociological Review, , –. Dynes, R. R., & Quarantelli, E. L. (). Helping Behavior in Large Scale Disasters. Newark, DE: Disaster Research Center. Fritz, C. E., & Marks, E. S. (). The NORC Studies of Human Behavior in Disaster. Journal of Social Issues, , –. Gao, H., Barbier, G., & Goolsby, R. (). Harnessing the Crowdsourcing Power of Social Media for Disaster Relief. IEEE Intelligent Systems, , –. Hiltz, S. R., & Kushma, J. (). Use of Social Media by U.S. 
Public Sector Emergency Managers: Barriers and Wish Lists. In Proceedings of the th International Conference on Information Systems for Crisis Response and Management (ISCRAM) (pp. –). Hughes, A. L., St. Denis, L. A., Palen, L., & Anderson, K. M. (). Online Public Communications by Police & Fire Services during the  Hurricane Sandy. In Proceedings of CHI Conference on Human Factors in Computing Systems. Toronto, Ontario, Canada. New York, NY: ACM. Hui, C., Tyschuck, Y., Wallace, W. A., Magdonismail, M., & Goldberg, M. (). Information Cascades in Social Media in Response to a Crisis: a Preliminary Model and a Case Study. In WWW Companion (pp. –). New York, NY: ACM. Kavanaugh, A., Fox, E. A., Sheetz, S., Yang, S., Li, L. T., Whalen, T., Shoemaker, D., Natsev, P., & Xie, L. (). Social Media Use by Government: From the Routine to the Critical. In Proceedings of the th Annual International Conference on Digital Government Research (pp. –). College Park, MD: ACM. Larsson, A. O., & Agerfalk, P. J. (). Snowing, Freezing . . . Tweeting? Organizational Twitter Use During Crisis. First Monday, (). Lee, H., McCormick, T., Spiro, E. S., & Cesare, N. (). Exploring Relationship Dynamics Between Citizens and the Police Via Twitter. In Proceedings of the nd Annual International Conference on Computational Social Science. Evanston, IL.



 . 

Leskovec, J., Kleinberg, J., & Faloutsos, C. (). Graph evolution: Densification and Shrinking Diameters. ACM Transactions on Knowledge Discovery from Data. Vol. , Issue , Article no. . New York, NY: ACM. Lindsay, B. R. (). Social Media and Disasters: Current Uses, Future Options, and Policy Considerations. Technical Report. Congressional Research Service, Report for Congress. https://www.nisconsortium.org/portal/resources/bin/Social_Media_and_Dis_.pdf Liu, B. F., Fraustino, J. D., & Jin, Y. (). How Disaster Information Form, Source, Type, and Prior Disaster Exposure Affect Public Outcomes: Jumping on the Social Media Bandwagon? Journal of Applied Communication Research, , –. Mileti, D. (). Disasters by Design: A Reassessment of Natural Hazards in the United States. Washington, DC: National Academies Press. Monroy-Hernandez, A., Boyd, D., Kiciman, E., & Counts, S. (). Narcotweets: Social Media in Wartime. In Proceedings of the Sixth International AAAI Conference on Weblogs and Social Media (pp. –). Palo Alto, CA: AAAI. Murthy, D. (). Twitter: Microphone for the Masses? Media, Culture & Society, , –. Palen, L., Vieweg, S., & Sutton, J. (). Crisis Informatics: Studying Crisis in a Networked World. In Third International Conference on e-Social Science. Ann Arbor, Michigan. Reeder, H., McCormick, T. H., & Spiro, E. (). Online Information Behaviors During Disaster Events: Roles, Routines, and Reactions. Center for Statistics and the Social Sciences Working Paper Series. Scanlon, T. (). Post-Disaster Rumor Chains: A Case Study. Mass Emergencies, , –. Shibutani, T. (). Improvised News: A Sociological Study of Rumor. Indianapolis and New York: The Bobbs-Merrill Company, Inc. Shklovski, I., Palen, L., & Sutton, J. (). Finding Community Through Information and Communication Technology During Disaster Events. 
In Proceedings of the Conference on Computer-Supported Cooperative Work and Social Computing (pp. –). San Diego, CA: ACM. Spiro, E. S. (). Communication and Affective Dynamics During Crisis Events: Measuring the Public’s Online Response. PhD diss., University of California, Irvine. Spiro, E. S. (). Research Opportunities at the Intersection of Social Media and Survey Data. Current Opinion in Psychology, , –. Spiro, E. S., & Butts, C. T. (). Degree Dynamics: Spikes and Decay in Attention Relationships. In nd Sunbelt Network Conference (INSNA). Hamburg, Germany. Spiro, E. S., Sutton, J., Greczek, M., Fitzhugh, S., Pierski, N., & Butts, C. T. (). Rumoring During Extreme Events: A Case Study of Deepwater Horizon . In Proceedings of the rd Annual ACM Web Science Conference (pp. –). New York, NY: ACM. Starbird, K., Maddock, J., Orand, M., Achterman, P., & Mason, R. M. (). Rumors, False Flags, and Digital Vigilanes: Misinformation on Twitter after the  Boston Marathon Bombing. In iConference  Proceedings (pp. –). Starbird, K., & Palen, L. (). Pass It On? Retweeting in Mass Emergency. In Proceedings of the th International Conference on Information Systems for Crisis Response and Management (ISCRAM) (pp. –). Seattle, WA. Starbird, K., & Palen, L. (). “Voluntweeters”: Self-Organizing by Digital Volunteers in Times of Crisis. In Proceedings of CHI Conference on Human Factors in Computing Systems (pp. –). New York, NY: ACM. Starbird, K., & Palen, L. (). Working & Sustaining the Virtual Disaster Desk. In Proceedings of Conference on Computer-Supported Cooperative Work and Social Computing. New York, NY: ACM.

    



Starbird, K., Spiro, E., Edwards, I., Zhou, K., Maddock, J., & Narasimhan, S. (). Could This Be True? I Think So! Expressed Uncertainty in Online Rumoring. In Proceedings of the  CHI Conference on Human Factors in Computing Systems (pp. –). New York, NY: ACM. Starbird, K., Spiro, E. S., Arif, A., Chou, F.j., Narasimhan, S., Maddock, J. I. M., Shanahan, K., & Robinson, J. (). Expressed Uncertainty and Denials as Signals of Online Rumoring. In Proceedings of the Collective Intelligence Conference (pp. –). New York, NY: ACM SIGCHI. Sutton, J., Gibson, C. B., Phillips, N. E., Spiro, E. S., League, C., Johnson, B., Fitzhugh, S. M., & Butts, C. T. (). A Cross-Hazard Analysis of Terse Message Retransmission on Twitter. Proceedings of the National Academy of Sciences, , –. Sutton, J., Spiro, E. S., Butts, C., Fitzhugh, S., Johnson, B., & Greczek, M. (). Tweeting the Spill: Online Informal Communications, Social Networks, and Conversational Microstructures during the Deepwater Horizon Oilspill. International Journal of Information Systems for Crisis Response and Management, , –. Sutton, J., Spiro, E. S., Johnson, B., & Butts, C.T. (). Following the Bombing. Online Research Highlight. Available at http://heroicproject.org/. Sutton, J., Spiro, E. S., Johnson, B., Fitzhugh, S., Gibson, B., & Butts, C. T. (). Warning Tweets: Serial Transmission of Messages during the Warning Phase of a Disaster Event. Information, Communication & Society, , –. Vieweg, S., Hughes, A. L., Starbird, K., & Palen, L. (). Microblogging during Two Natural Hazard Events: What Twitter May Contribute to Situational Awareness. In Proceedings of the th CHI Conference on Human Factors in Computing Systems (pp. –). New York, NY: ACM. Virtual Social Media Working Group. (, June). Lessons Learned: Social Media and Hurricane Sandy. Technical Report, US Department of Homeland Security, Science and Technology Directorate. Vis, F. 
(). Twitter as a Reporting Tool for Breaking News. Digital Journalism, , –.

  ......................................................................................................................

     ......................................................................................................................

     

. Introduction and Background

..................................................................................................................................
The discussion sections of empirical research on digital communities are littered with unsatisfying phrases. Networked communication researchers have read, and in many cases written, phrases such as the following: our study is only of one web forum, and we cannot know how our work generalizes to others; because we have gathered data only from Wikipedia, we can only propose that our findings might be antecedents of that community's success; messages in our data set were collected from one Facebook group, and we cannot speak to conversation that may have happened elsewhere. These phrases are acknowledgments of real problems. We think we can do better. The limitations identified in the preceding phrases stem in large part from the fact that most empirical online community research looks within individual communities. Although thousands of papers have been published about individual peer production projects such as Wikipedia and Linux, there has been little research that compares peer production projects to each other. Likewise, studies of communication in social networking sites and discussion groups have nearly always focused on interactions within a single network, such as Facebook, or within an individual discussion forum. Such research does not always generalize well. Furthermore, we foreclose many types of research questions by selecting research sites for their size, longevity, or the engagement level of their participants. In this chapter we argue that by studying groups of communities, we can enhance the quality of online communities research in multiple ways. First, we can mitigate many common threats to the validity of online community research designs. Moreover, by studying groups of online communities, we open the door to answering types of questions that are unanswerable when our data sets begin and end at communities' borders.




A few key points about population-level analysis and online communities will help contextualize the rest of our argument. First, we are not advocating anything especially radical by arguing that research on networked communication in online communities ought to do more to adopt population-level approaches. Empirical communication research into online communities has existed for several decades,1 and we cite many examples of population-level approaches. Moreover, population-level analysis exemplifies well-established traditions in organization science, which has sought to turn away from models of organizations as self-contained entities toward models that treat them as actors in complex social environments, a shift that has inspired a proliferation of comparative, population-level approaches to organizational analysis (Scott and Davis ). We draw inspiration from these approaches and seek to describe how they might be applied in the context of research on online communities. We also believe that the empirical contexts of online communities present an extraordinary opportunity to extend these concepts and frameworks. A second point concerns what exactly we mean by populations of online communities. In ecology, the term "population" describes all interbreeding organisms that share a geographical area. In demography, it refers to groups of humans. In organizational sociology, it refers to similar types of organizations that might compete for resources such as customers or suppliers. Although others have defined the term more narrowly (e.g., Weber, Fulk, and Monge ), we use the term "population" broadly and inclusively to refer to any group of online communities whose membership is defined through similarity, competition, or interaction.
As is the case in ecology, demography, and organizational sociology, there are many ways to define populations of communities, and a single online community might belong to a large number and variety of distinct populations. Most of our discussion focuses on populations of online communities defined through their use of a common technical platform. Examples of populations of this type include discussion forums that use the same bulletin board software, wikis that use the same hosting service, and software development projects using the same tools to host their code and coordinate their work. For example, SourceForge and GitHub each provide a common technological platform that hosts millions of different online communities dedicated to creating software projects, LinkedIn and Facebook each act as platforms hosting online communities associated with many different offline organizations and interest groups, and Reddit hosts hundreds of thousands of “subreddit” communities around different discussion topics. Because communities within this type of population are mediated by a common technology, data are often more readily available and consistent, as a single host or platform provider may provide access via a single application programming interface (API) or database. The resulting data sets support the advantages of computational social science (Lazer et al.) and can often facilitate direct comparative analysis. Of course, there are other ways to define populations. For example, a population of online communities might include all communities hosting conversations about a set of topics or beliefs or serving a single geographic area. Examples might be as narrow as
the group of all Star Wars fan culture communities or as broad as all music-sharing networks. The analogy to an industry of firms or a denomination of churches maps more or less directly in cases such as these. A population might also be described in terms of communities whose membership overlaps or through which messages diffuse. An exciting body of research, including studies of diffusion, which we discuss later, has constructed large platform-spanning populations of communities. In the rest of this chapter we use examples from empirical research to highlight five benefits of studying populations of online communities. First, we argue that studies of populations can lead to increased generalizability. Beyond this, community-spanning research designs not only improve the types of research already done within online communities, but also bring entirely different questions into the realm of answerability. In particular, we highlight the ability to study community-level variables such as group size, intercommunity behavior such as knowledge diffusion, and the ways that communities affect each other through dynamics such as competition. We also argue that data from populations of online communities make it possible to combine many of these benefits with the benefits of intracommunity analyses. Finally, we discuss a series of limitations before concluding.

. B : G

The first benefit of studying populations of communities is simple and straightforward: studies of a single community—no matter how exhaustive, granular, and expertly designed—may produce findings that hinge on the idiosyncrasies of the community being studied. While this is intuitive and widely acknowledged, analyses of networked behavior and group processes frequently focus on a single platform or online community (e.g., Facebook, Twitter, Wikipedia) and result in findings that do not apply more broadly. Analyzing data drawn from multiple sites, communities, or platforms can lead to greater generalizability. Early scholars of online communities gathered and compared evidence from case studies of the discussion-based system Usenet, text-based role-playing games such as “multi-user dungeons” (MUDs), the pioneering community “the WELL,” and others (e.g., Kollock and Smith). Over time the forces that brought about the rise of computational social science (Lazer et al.), including widespread availability of digital trace data and reduced costs of computing and storage, have made more direct comparisons across communities possible. However, despite these opportunities for intercommunity empirical work, the vast majority of studies in online community research still focus on a single site. Generalizability suffers because communities are heterogeneous in ways that are frequently poorly understood. This point is illustrated by Hargittai, who uses data from offline surveys to describe the user bases of large online communities such as Facebook and Twitter in traditional demographic terms. Although Hargittai’s point is that participants in these
communities are far from being a cross-section of society, she also suggests that the demographics of online communities’ user bases are distinct from each other. Of course, decades of social science have shown that demographic characteristics such as age, race, gender, and skills are correlated with many of the attitudes and behaviors that social scientists study. Because online communities are different from the general population in demographic terms, Hargittai suggests that generalization to society is often unjustified. An implication of her findings is that to the extent that communities also differ systematically from each other, generalization between communities will be fraught as well. Studies across populations of online communities can help increase confidence in generalizability. For example, two studies published by Michael Restivo and Arnout van de Rijt look at the effects of social awards in Wikipedia called “barnstars” (Restivo and van de Rijt). In the first of these projects, the authors select very active Wikipedians and award barnstars to random subsets. Award recipients go on to edit the encyclopedia more than users not given an award. They also go on to receive more awards in the future. The authors suggest that such peer-to-peer forms of public recognition may produce a “success-breeds-success” dynamic. In a follow-up paper, van de Rijt and colleagues compare the evidence of this effect in Wikipedia to very similar field experiments in three different online communities and show that the dynamic they identified in Wikipedia is also present in donations to Kickstarter, positive ratings of products on Epinions.com, and signatures on the e-petition site Change.org. Many of the most studied online communities—even those that appear extremely similar—are unusual in ways that impact findings and limit generalizability.
For example, in an extensive comparative analysis of the ten largest language editions of Wikipedia, Ortega details many instances in which the English edition is very different from Wikipedias in other languages. One example is editor retention. English Wikipedia’s low editor retention has been an extremely widely studied problem (see Halfaker, Kittur, and Riedl; Halfaker et al.), but most analyses of it can only speculate about whether editor retention represents an issue endemic to all large wikis. Ortega shows that the English-language community has extremely low retention among the most committed editors compared to other-language Wikipedias. His results suggest that something particular about the English-language community likely explains at least part of its retention issues. Generalization unsupported by direct comparison is a risky business. This concern is akin to the challenge of scientists using “model organisms” to understand basic biological processes (Fields and Johnston). For example, much of what scientists know about genetics stems from research conducted with Drosophila melanogaster. However, when we should, and should not, generalize from Drosophila to other species is often unclear. Moreover, model organisms—like the most widely studied online communities—are selected precisely because of their extreme or idiosyncratic characteristics. Geneticists study Drosophila because it reproduces more quickly than other species. Online community researchers study Wikipedia precisely because it has attracted so many readers, contributions, and contributors and because it
has generated articles of exceptional quality. We do not always understand the relationship between these unusual features and the basic social and communicative processes we seek to study. In such situations, careful theorizing combined with healthy doses of analytic modesty and skepticism becomes a researcher’s most important asset.

. B : S C-L V


Although Wikipedia’s enormous size and high-quality articles support its starring role in online community research, understanding why Wikipedia became so big and high quality is remarkably difficult. The question “Why did Wikipedia succeed?” implies treating Wikipedia itself as the unit of analysis and measuring its success (however one defines it) at the level of the community. As introductory research methods textbooks explain (e.g., King, Keohane, and Verba), answering this type of question requires variation across multiple projects, some of which succeed and some of which fail. Ideally, to study why Wikipedia succeeded, we would compare it to failed Wikipedias, or at least to projects that were trying to be something very similar to what Wikipedia became (Hill). This sort of comparison and inference represents an especially compelling opportunity created by studying populations of online communities. Organization-level variables have a long tradition in organizational research, in which scholars have used data on variations in firm performance to understand which kinds of firms thrive. For example, a body of work has established and tested a theory of the “liability of newness,” which suggests that newer firms are more likely to fail (Schoonhoven). Other research has examined how the structure of communication affects the performance of teams and work groups engaged in collaboration of various kinds (e.g., Crowston and Howison; Cummings and Cross). Because structure exists within groups, studying multiple groups provides variation that makes it possible to understand how structure can affect group-level outcomes. As a result, a large body of research in organizational communication has focused on group-, team-, or even firm-level variables. Both the scholarly and the popular literature about the power of online communities make claims about project-level performance and outcomes.
As a result, there have been many prior calls to pursue studies across projects and communities (e.g., Crowston et al.; Kraut and Resnick; Benkler, Shaw, and Hill). Schweik and English’s book Internet Success provides a compelling example of the power of this approach. Deeply influenced by Ostrom’s work on common pool resource management, Schweik and English frame their analysis of free/libre open source software (FLOSS) projects in terms of questions about the successful provision of information commons. They create a stage model of when FLOSS projects will be
abandoned and when they will develop into effective communities. They seek to explain these outcomes using dozens of project-level variables such as leadership style and the copyright license of the project. They find that measures of clarity in leadership communication are among the best predictors of successful FLOSS commons and that other potential antecedents of FLOSS success cited in earlier literature, such as license choice, do little to explain project outcomes. Schweik and English’s variables are nearly all at the level of the FLOSS project or community. Although previous case studies of leadership and governance reveal quite a bit about how FLOSS communities function, there is a sense in which they are limited to describing how these communities have operated. Because the sample of FLOSS projects analyzed by Schweik and English in Internet Success varies in terms of success (as the authors define it) and its theoretical antecedents, their analysis can begin to answer the question of why communities succeed. In this way, population-level research can engage with substantively and theoretically important questions in a manner that smaller scale analyses or comparisons struggle to do. Perhaps the most important organization-level variables in networked communication research are related to technology. However, individual online communities tend to run on a single piece of software that consistently mediates the experience of every participant, and the vast majority of research on the effects of technology on online communities uses data in which the technology itself is held constant. In the best cases, researchers are able to take advantage of a technological change or shock within a community to understand the impact of technology (e.g., Geiger and Halfaker). By looking across numerous communities, researchers can also build samples in which technology varies as a way of testing these claims systematically (e.g., Shaw and Benkler).
Such platform-level comparisons exemplify the sorts of analyses that become possible with the inclusion of community-level variables.

. B : S D B C

The ability to answer questions about the way that information and practices flow between communities is another benefit of population-level research. For example, we know that a huge proportion of messages on social media systems such as Twitter include links to sources outside Twitter and to other online communities (Bruns and Stieglitz). As a result, studies that attempt to characterize processes of information transmission without looking beyond the boundaries of an individual community or platform are necessarily incomplete. Although it can be difficult, research across communities can paint more accurate pictures of diffusion. Studies of diffusion have played a central role in communication scholarship since Rogers’s seminal work on the diffusion of innovations. Subsequent studies by Valente and others have situated these diffusion processes within networks.

Most of this research has sought to understand how information flows between individuals. For example, Bennett and Segerberg used data on message diffusion on Twitter to understand processes related to power, politics, and collective action. Of course, Twitter makes up only a part of Twitter users’ media diets, and many tweets point to further engagement in other media (Bruns and Stieglitz). This speaks to the importance of research designs that can encompass a larger set of communities and thereby offer a more comprehensive explanation of diffusion processes. The power of looking across communities is illustrated by a study by Graeff, Stempeck, and Zuckerman that traces the diffusion and development of conversations about the shooting of Trayvon Martin in February 2012. In addition to traditional media sources, including television transcripts and newspapers, the analysis uses data from Media Cloud, an enormous data set of many thousands of online media sources (Benkler et al.); search volume data from Google; messages on Twitter; signature data on the e-petitioning site Change.org; and a unique data set from bit.ly that provides measures of how often people have viewed social media material in a variety of communities, including Twitter and Facebook. The paper traces how, after initially being reported only in local news, Martin’s killing faded entirely from the media for weeks. A publicist hired by Martin’s family and their lawyer then thrust the event back into the local media, bringing national media coverage that led to an online petition on the website Change.org. Then a staff member at Change.org took notice of the petition and successfully engaged celebrities such as Talib Kweli and Spike Lee to tweet messages about it, drawing increased engagement and attention. This increased attention led to further media coverage.
Graeff and colleagues’ account provides a compelling answer to the question of how Trayvon Martin’s killing became one of the most important news stories of 2012, precisely because they followed the story as it diffused through different media sources and across different online communities, networks, and platforms. By looking across communities, the authors discovered that activity in social media communities such as Twitter and Facebook was driven by activity in Tumblr and Blogger, and that the activity in these communities was driven by earlier activity in Twitter, which in turn was driven by activity around an e-petition. Table 10.1 reproduces some of Graeff, Stempeck, and Zuckerman’s data summarizing traffic to the Change.org petition as well as traffic referred through social media sites. The multiple data sources combine to illustrate how the controversy developed across many types of media. Although the data set on the Trayvon Martin controversy used by Graeff and colleagues remains far from comprehensive, it offers the opportunity to go much further than an analysis of any single source could. The study also builds on previous work that had looked at network gatekeeping within Twitter (Meraz and Papacharissi) and is able to extend this theory beyond a single communication network. As is always the case with new methodologies, this approach also introduces new limitations. In particular, there are important challenges in terms of source selection, new

Table 10.1 Total Traffic to Trayvon Martin Change.org Petition Page / Traffic Referred by Social Media(a)

Date        Change.org Petition Traffic    Traffic Referred by Social Media
March 12    11,141                         7,486
March 13    34,345                         20,712
March 14    305,672                        45,952
March 15    190,354                        72,165
March 16    80,268                         50,798

(a) On several days when the petition was active.
Source: Graeff, Stempeck, and Zuckerman (2014).

opportunities for sample bias, and significant technical difficulties caused by the need to collect and integrate heterogeneous data sets. Other studies consider other types of information transmission processes across community boundaries. A highly cited study by Leskovec, Backstrom, and Kleinberg used an enormous data set of mainstream media sites and blogs to theorize a set of temporal patterns through which short, distinctive phrases diffuse from news sites to blogs. A literature on code reuse in FLOSS has shown how code written for one project can be used in another (e.g., Mockus), and research in computer science has looked at how images are copied and moved between websites (Seneviratne et al.). These lines of inquiry illustrate how studies of information flow across communities are increasingly possible and can lead to theoretically salient insights. By considering a larger part of the communicative and collaborative landscape, these approaches enable the analysis of diffusion processes that may not exist at all within communities. Diffusion processes can also involve the transmission of organizational practices. In organization science, sociologists have studied dynamics of organizational “isomorphism,” whereby organizations adopt routines and structures used in other organizations (e.g., Meyer and Rowan; DiMaggio and Powell). Although studies of this type of diffusion remain much less common in online communities, they are possible. In one study, Zhu, Kraut, and Kittur consider the diffusion of a model for coordinating through topic-focused task forces within Wikipedia called “Collaboration of the Week.” Because they look at both a group-level process and group-level outcomes of productivity across many groups, Zhu and her colleagues can evaluate the effectiveness of the practice at a more general level.
Their work shows how a population-level research design also makes it possible to understand the ways that the task force model of collaboration spreads and, together with the Graeff et al. study, illustrates how such research can alter and enhance our understanding of multiple kinds of diffusion processes.
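As a rough illustration of what this kind of cross-community diffusion tracing involves, the sketch below orders data sources by the first time each crossed an attention threshold for a story. The source names, counts, and threshold are invented for illustration; this is not Graeff and colleagues' actual data or method, only a minimal version of the "which community moved first?" question their design makes answerable.

```python
from datetime import datetime

def diffusion_order(series, threshold=100):
    """Given {source: [(timestamp, mention_count), ...]}, return the sources
    ordered by the first time their attention crossed `threshold` -- a crude
    way to ask which community 'moved first' on a story."""
    first_crossing = {}
    for source, points in series.items():
        for ts, count in sorted(points):
            if count >= threshold:
                first_crossing[source] = ts
                break  # record only the earliest crossing per source
    return sorted(first_crossing, key=first_crossing.get)

# Hypothetical daily mention counts for one story across three sources.
d = lambda s: datetime.strptime(s, "%Y-%m-%d")
series = {
    "local_news": [(d("2012-03-08"), 150), (d("2012-03-09"), 90)],
    "twitter":    [(d("2012-03-08"), 10), (d("2012-03-12"), 4000)],
    "petition":   [(d("2012-03-10"), 500), (d("2012-03-13"), 30000)],
}
print(diffusion_order(series))  # ['local_news', 'petition', 'twitter']
```

Real analyses must also confront the source-selection and data-integration challenges discussed above, which a toy threshold rule sidesteps entirely.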

. B : S E D

Research on diffusion and information transmission underscores that online communities do not exist in a vacuum. By studying individual communities, we tend to study them in isolation from each other and from the external forces that shape or threaten them. This point leads us to another line of inquiry that population-level data sets of online communities can address: the way that online communities interact with their environments and the surrounding ecology of similar communities. We use the terms “environment” and “ecology” here in the same way that scholars in organizational studies have applied them to describe a collection of factors and pressures outside organizations’ boundaries.2 In studies of firms, an environment consists of competitors, suppliers, and customers. For example, one could describe the studies of diffusion previously discussed as a type of research on environments. In this section we highlight the opportunity to learn about online communities by studying how they experience ecological forces through their environments. Questions of this type remain almost entirely unexplored in online community research but present promising opportunities. Ecological studies of organizations are well established within organizational research. Seminal work in organizational science borrowed the metaphor of ecologies from biology to assert a set of theories around resource competition among firms (Hannan and Freeman; Carroll and Hannan). A central insight drives this body of work: firms succeed or fail because of conditions in their environment, such as the number of competing firms. It turns out that much of the likelihood of a firm’s failure can be predicted based on what other firms in the same “niche” are doing.
To pick just one example, firms in either very empty or very competitive niches (i.e., brand new sectors versus markets that are crowded with other firms) fail more frequently than firms in moderately competitive markets (Carroll and Hannan). Although originally focused only on for-profit firms, this work has been extended to nonprofit and social movement organizations (McPherson; Soule and King). Several pieces of research on online communities adopt this type of ecological approach. One important example is Wang, Butler, and Ren, who use many months of longitudinal data from a stratified random sample of Usenet discussion groups to understand how competition for participants’ attention affects communities’ growth. For each discussion group, the authors create a measure of membership overlap by identifying participants and then measuring the number of other Usenet discussion groups that each user participates in during every month. Wang and colleagues show that groups whose members participate in more groups grow more slowly. They use group-level variables such as community size to unpack this finding and conclude that larger groups experience more difficulty in growth and are more vulnerable to the deleterious effects of competition.
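A membership-overlap measure of this kind could be operationalized in several ways; the following sketch is one minimal interpretation, using invented group names and a simple mean-based aggregation rather than Wang, Butler, and Ren's exact specification.

```python
from collections import defaultdict

def membership_overlap(participation):
    """Given a mapping of group -> set of user ids observed in one month,
    return, for each group, the mean number of *other* groups its members
    also participate in (a simple overlap measure)."""
    # Count how many groups each user appears in across the population.
    groups_per_user = defaultdict(int)
    for members in participation.values():
        for user in members:
            groups_per_user[user] += 1

    overlap = {}
    for group, members in participation.items():
        if not members:
            overlap[group] = 0.0
            continue
        # For each member, the number of other groups they belong to.
        other_counts = [groups_per_user[u] - 1 for u in members]
        overlap[group] = sum(other_counts) / len(other_counts)
    return overlap

# Toy example: three discussion groups with overlapping membership.
logs = {
    "sci.space": {"ann", "bob", "cat"},
    "rec.music": {"bob", "cat", "dan"},
    "alt.tv":    {"cat"},
}
print(membership_overlap(logs))  # alt.tv's sole member is active in 2 other groups
```

The measure requires population-level data by construction: a single group's logs cannot reveal how many other groups its members frequent.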

Population-level data sets make it possible to imagine research like that in the Wang et al. study. Additionally, taking an ecological perspective allows these researchers to identify substantively important relationships. Their models of community growth rate improve substantially when they take membership overlap into account, explaining considerably more of the variance overall and returning significantly better fit than alternative specifications without the overlap measure. Although their strongest controls are group-level factors such as total membership, Wang et al.’s paper shows that we can make important predictions about how much an online community will grow by considering activities outside communities’ boundaries. The ecological approach to studying online communities remains rare, but other examples exist. Gu et al. consider competition between online investment communities. Two other pieces have extended the findings and approach of Wang, Butler, and Ren: Zhu et al., studying a large group of IBM discussion groups, and Zhu, Kraut, and Kittur, studying a large population of wikis. Remarkably, the latter find that membership overlap improves the survival rate for new wikis. Future work may confirm or diverge from these findings, as well as the earlier findings of organizational research that focused exclusively on firms.

. B : S M P

A final benefit of population-level analyses of online communities is the ability to analyze complex, multilevel social processes. This benefit derives partly from the extraordinary granularity and breadth of online trace data, with which there is often no need to sacrifice micro-level detail in the analysis of meso- and macro-level dynamics. However, while multilevel analysis may have an affinity with exhaustive digital trace data, it need not proceed through data-intensive or computational approaches. Insightful qualitative inquiry can also account for multilevel social processes. We believe that research on populations of online communities presents an exceptional opportunity in this regard. Scholars of organizational communication and group processes have implemented multilevel approaches in a variety of domains. Prior work by Faraj and Johnson advocates the use of multilevel network analysis to examine complex social processes at the individual level. Indeed, the rise of massive-scale online data sets has coincided with the creation of novel multilevel methods, including the use of multilevel network analysis to study (among other things) multi-team systems (Contractor; Zaccaro, Marks, and DeChurch). Howard’s concept of “network ethnography” provides a mixed-methods example of this approach that utilizes exhaustive knowledge of macro- and meso-level network structures both to identify appropriate cases for in-depth analysis and to guide the process of interpreting and generalizing findings. To date, relatively little
research capitalizes on the opportunities to apply multilevel analysis across many online communities. Several studies from organizational research use multilevel approaches to study individual and group-level relational processes across multiple online communities (Faraj and Johnson), but such work remains exceptional. We think there should be more. A study of social capital and relationships in World of Warcraft (WoW) guilds by Williams et al. provides an excellent example of multilevel work that combines the advantages of large-scale trace data with in-depth qualitative evidence. WoW guilds are arbitrarily sized clans of players that vary in size, scope, formality, and strategic focus. A player might, for instance, want to join up with a “raid-oriented family friendly guild” of marauding orcs who meet on weekends. The authors used an exhaustive observational data set collected with automated programs connecting to different types of WoW game servers over several months to build a representative sample of individual players and guilds.3 The authors created a stratified sample of these guilds and a representative sample of their members. They then conducted forty-eight interviews with individuals from the sample over the game’s chat platform to understand how members participate in guilds. Williams and colleagues elaborate their account of both group-level and individual-level patterns of activity using quotations contextualized through their knowledge of player and guild characteristics. For example, they show how players articulate group-level identities and find genuine forms of social support through participation in guilds. One interviewee explained the experience of playing in a large guild with everything but a citation to Putnam’s influential argument about the putative decline of civic associations in America: It’s kind of like a bowling team or a softball league . . . .
[I]t’s just [a] social event in here, probably more often then [sic] bowling since I talk to these people several nights a week. (Williams et al.)

The quote illustrates the existence of large-scale solidarity and associationism in the supposedly alienated world of video games. The authors can generalize from this kind of evidence because they use guild-level data to ensure that their findings are representative of experiences in guilds of varying sizes and characteristics. The findings integrate the richness and depth of insightful interviewing with the analytic leverage of a research design that can generalize across multiple communities. An important takeaway is that individuals’ experiences online often occur within the context of subcommunities such as guilds. As a result, understanding whether and how WoW players build sustained social relationships and social capital entails accounting for a complex interplay of factors at the game, guild, and individual levels. In this way, Williams and colleagues’ study illustrates the value of multilevel modeling strategies as a tool for incorporating measures nested across levels of analysis and social interaction in online communities that are themselves part of larger communities. Although their study is qualitative in nature, Williams has collaborated with others on quantitative studies using the same data set in similar ways (e.g., Shen, Monge, and Williams).

Multilevel analyses can also account for ecological dynamics in communities’ environments and look across non-nested communities. For instance, Faraj and Johnson test claims about the prevalence of patterns of interpersonal exchange using exhaustive micro-level data collected from a random sample of discussion groups. They draw conclusions at the individual and group levels while accounting for community-level variation as well. Studying interpersonal communication processes in multiple online communities by simultaneously analyzing multiple levels of social organization and behavior offers a particularly promising path for future work.
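The core multilevel intuition can be made concrete with a plug-in estimate of the one-way intraclass correlation: the share of variance in an individual-level outcome that lies between communities rather than within them. The sketch below uses invented guild data and a simple maximum-likelihood-style variance decomposition, not any particular study's model.

```python
from statistics import mean

def icc(groups):
    """One-way intraclass correlation: fraction of total variance in an
    individual-level outcome that lies *between* groups rather than within
    them. `groups` maps community -> list of individual scores."""
    grand = mean(x for xs in groups.values() for x in xs)
    n_total = sum(len(xs) for xs in groups.values())
    # Between-group variance: spread of group means around the grand mean,
    # weighted by group size.
    between = sum(len(xs) * (mean(xs) - grand) ** 2
                  for xs in groups.values()) / n_total
    # Within-group variance: spread of individuals around their group mean.
    within = sum(sum((x - mean(xs)) ** 2 for x in xs)
                 for xs in groups.values()) / n_total
    return between / (between + within)

# Hypothetical weekly play hours for members of two very different guilds.
guilds = {
    "raiders": [40, 42, 44],
    "casuals": [5, 6, 7],
}
print(round(icc(guilds), 3))  # close to 1: guild membership explains most variance
```

A high intraclass correlation signals that ignoring the community level (here, the guild) would badly misrepresent individual-level behavior, which is exactly why multilevel designs matter for nested online communities.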

7. Limitations

Alongside these benefits, population-level research on online communities introduces a number of limitations. Some of these are inherent to research on populations, while others derive from particular characteristics of populations of online communities. First, multiple populations of communities can interact in ways that suggest an even higher level of analysis that may bring additional insights. Second, identifying distinct communities poses difficulties in many situations. Third, research in this area exists in the interstices of multiple fields, leading to challenges related to integrating disparate literatures and concepts. Finally, it is important to acknowledge that studying populations of communities can require more effort and can be more skill- and resource-intensive than studying a community in isolation. While studies of populations of communities may provide insights and generalizability that exceed smaller scale comparisons or case studies, it is important to note that a population of communities (as we previously described it) may be distinct from still larger sets of populations, known as metapopulations in ecology research (Hanski). Along these lines, organizational sociologists have suggested that studying organizations within industries ignores important high-level effects across industries, and that it is also important to study populations of industries—literally, populations of populations (e.g., Ruef). In the context of online communities, examples of populations and corresponding metapopulations might look like all of the communities on a particular platform or hosting service (e.g., Wikia, GitHub, or EdX) and the full set of potentially comparable communities (e.g., all wikis, FLOSS projects, or massive open online courses).
There may also be other dimensions along which to define metapopulations, such as the sets of communities encompassed by particular language groups, legal regimes, or infrastructural features. The distinction between populations and metapopulations implies new types of limitations of generalizability, analogous to those confronted by single case studies. While a study may include all of the software development communities hosted by GitHub or SourceForge, or even both, there may be underlying biases that make these two groups systematically different from other, otherwise similar communities. There may also be ecological forces that reflect competition between the different populations. As usual, unknown bias has no empirical remedy other than an effort to encompass more cases—in this case, more populations. Likewise, the identification of ecological forces at a metapopulation level remains inaccessible to studies within a single population. These possibilities suggest future opportunities for research.

A second issue relates to the ways in which the nature and character of online communities resist easy classification or definition. A single community, such as the English-language Wikipedia, may contain hundreds or even thousands of distinct subcommunities. In Wikipedia, these include WikiProjects—groups of individuals who work together on topics of shared interest (Morgan et al. ). Like guilds in World of Warcraft (WoW), WikiProjects are online communities nested in English Wikipedia (for example), which is nested within the broader Wikipedia project, which is nested within the even broader Wikimedia movement. Such a fractal structure is not unique to Wikipedia and may represent a broader characteristic of large-scale, open organizations on the Internet. Given these examples, language and theories that were developed to analyze more clearly bounded formal organizations such as firms, nonprofits, and even more open volunteer projects or social movements may prove inadequate. Even the term "community" is contested and is used in a wide variety of ways by researchers studying online interaction (Bruckman ). This disagreement about what constitutes a community makes the definition of a population (or metapopulation) of online communities analytically slippery as well. Research on organizations and collective action has historically elided this concern by adopting a very broad set of inclusion criteria (e.g., Marwell and Oliver ). Ultimately, though, if every sort of remotely collective endeavor can also be called a "community," the term ceases to hold meaning. The murky empirical and conceptual boundaries around online communities make defining populations difficult.
We advocate a pragmatic but ultimately somewhat unsatisfying approach to this issue: researchers should define populations by a combination of objectively observable criteria along with folk and scientific wisdom. Sometimes, as in studies examining all of the communities hosted through a common software platform or company, this may be straightforward, and the results may correspond to both scientific intuition and the understanding of participants within these populations.

Under other circumstances, it may be necessary to choose between different possible analytic boundaries. For instance, in analyses of emergent collaboration networks such as those engaging in breaking news coverage on microblogging platforms, Wikipedia, or Facebook, groups of participants may act in ways that are consistent with researchers' understandings of what it means to be a community even though they do not perceive themselves as such. In these situations, it becomes critical to acknowledge such divergent perspectives and proceed with caution. It may not make sense to address certain theoretically interesting questions with data drawn from unsuitable cases. Just because we can model something as a community does not mean we should. Relatedly, just because a platform describes something as a group (a Facebook group, a guild within a game, a WikiProject) does not mean it is the correct unit of analysis for a particular question.

A third challenge is that existing research conducting population-level analyses of online communities spans disciplinary boundaries. Our own work has been influenced by organizational studies, social psychology, political sociology, interpersonal communication, and human-computer interaction. Such diverse intellectual orientations may create opportunities for academic brokerage, but these opportunities come at a cost. As in finance (Zuckerman ), academic endeavors that fail to fit into existing categorical schema may pay an illegitimacy premium. There are also opportunity costs that come with learning multiple research fields. The institutionalization of novel schema takes time and proceeds unevenly. We believe the best response to this problem is to draw strategically from the most compelling domains of research. For example, the papers we have highlighted in this essay have drawn on the literatures on organizational behavior, work groups, computer science, and teams. We have also turned to the sociological literature focused on comparative analysis of movements. Others have emphasized links to studies of emergent networks and collective intelligence. All of these approaches have borne fruit. Of course, this type of pluralistic approach introduces challenges as well. While reading broadly, it is important to ground one's work in established traditions of scholarship. A broad set of influences is not a license to cherry-pick theories to fit data. An open challenge lies in the synthesis and integration of findings across disparate approaches. For the time being, many findings remain mutually contradictory or unintelligible because they speak of similar phenomena in different ways.

A final limitation is purely practical but still important to note. Simply put, studying multiple communities can require more time, more effort, more skills, or more resources than a study of a single community.
One reason is that population-level data sets can be large and unwieldy, and building and analyzing them may be outside the abilities of many communication researchers. The data set used by Graeff, Stempeck, and Zuckerman () was created by professional engineering staff. Many population-level analyses, such as Leskovec, Backstrom, and Kleinberg (), were carried out by computer scientists. When using these large and varied data sets, there are often trade-offs between scope and scale. In the short term, collaborations with computer scientists and engineers are a popular strategy. In the longer term, population-level studies will become more widespread as the technology necessary to complete them becomes more established, easier to use, and more reliable. For example, the deployment of standardized APIs and large data releases by large community-hosting platforms has already made some types of population-level studies much more accessible. We expect this accessibility will only increase with time.

Discussion

In summary, we have argued that online communities research can benefit enormously from studying populations of communities. We have tried to show how population-level research designs that span communities can offer increased generalizability and can open the door to new kinds of questions, including those that focus on theoretically important community-level variables, processes of diffusion across communities, and the ways that communities interact with their environments through ecological competition. Finally, we have argued that digital trace data sets and analyses that cross multiple levels can allow research to realize many of these benefits without tossing aside the benefits of intracommunity studies. By narrating this argument with a series of in-depth examples of exemplary work from communication and beyond, we have also pointed to concrete examples of how this can be done. Although these studies still reflect a small proportion of research on online communities, similar approaches are increasingly common.

Additional benefits of population-level research on online communities may emerge as researchers experiment with and explore this type of work. One benefit we find particularly exciting concerns the degree to which analysis across many online communities may provide stronger empirical support for policy-making and design decisions. These decisions can benefit from insights into the effects of specific interventions. This sort of evidentiary basis holds particular relevance for computer-mediated communication systems, in which code may shape de facto institutional arrangements, norms, and behavioral patterns (Lessig ). For example, in our own work we used a population of wikis to explore the effects of a requirement to create an account on subsequent community-level activity. Estimates of the average effects of a widely debated policy decision such as this, across many projects, can give designers and community leaders greater confidence in the generalizability of a finding in a way that can inform subsequent policy decisions.
Exploration of heterogeneous effects can provide an understanding of how a design might succeed or fail. In ways that we discussed previously, many studies of online communities and networked communication seek to understand the mediating effects of specific technologies. Since the design of technical platforms and tools, in most cases, operates at the level of communities, a shift to empirical analysis and theorizing at the level of communities can make this work more directly usable. There are already significant overlaps between the scholarly communities studying networked communication and the designers creating social computing systems. Crafting research capable of speaking across these boundaries will enhance the impact of communication research and provide communication researchers with greater opportunities to test, extend, and refine their theories in conversation with designers and computer scientists.

Of course, while we see many opportunities in population-level research, the approach is far from a silver bullet. The limitations we have sketched out are real, significant, and only a subset of the challenges that confront population-level analysis. Just as we hope that continued growth of population-level studies will help establish the benefits of the approach, we also hope it will paint a better picture of the limitations of population-based approaches and support the development of better ones. We believe that increased attention to populations marks one step toward better online community research and toward a deeper understanding of networked communication.

A We thank the Community Data Science Collective for helping us develop these ideas. We thank the editors, Brooke Foucault Welles and Sandra González-Bailón, and the reviewers of this chapter for their helpful feedback. Portions of the project were completed at the Helen R. Whitely Center at the University of Washington’s Friday Harbor Laboratories. Financial support for this work came from the National Science Foundation (grants IIS- and IIS-).

N . We do not attempt to provide a comprehensive accounting of this prior work. Interested readers might seek out Kraut and Resnick (), which is an excellent book-length synthesis of much empirical online community research. . See Scott and Davis () for both an excellent general overview of organizational theory and a detailed account of organization scientists’ definitions of organizational environments. . Note that additional details of the data collection are published in a companion article by Ducheneaut et al. ().

R Benkler, Yochai, Hal Roberts, Robert Faris, Alicia Solow-Niederman, and Bruce Etling. . Social Mobilization and the Networked Public Sphere: Mapping the SOPA-PIPA Debate. SSRN SCHOLARLY PAPER ID . Rochester, NY: Social Science Research Network. Accessed September , . http://papers.ssrn.com/abstract=. Benkler, Yochai, Aaron Shaw, and Benjamin Mako Hill. . Peer Production: A Form of Collective Intelligence. In The Handbook of Collective Intelligence, edited by Michael Bernstein and Thomas Malone, –. Cambridge, MA: MIT Press. Bennett, W. Lance, and Alexandra Segerberg. . The Logic of Connective Action: Digital Media and the Personalization of Contentious Politics. New York: Cambridge University Press. Bruckman, Amy. . A New Perspective on “Community” and Its Implications for Computer-Mediated Communication Systems. In CHI ’ extended abstracts on Human factors in computing systems, –. Montréal: ACM. doi:./.. Bruns, Axel, and Stefan Stieglitz. . Quantitative Approaches to Comparing Communication Patterns on Twitter. Journal of Technology in Human Services  (/): –. Accessed February , . doi:./... http://offcampus.lib.washington.edu/ login?url=http://search.ebscohost.com/login.aspx?direct=true&db=lls&AN=&site= ehost-live. Carroll, Glenn R., and Michael T. Hannan. . The Demography of Corporations and Industries. Princeton, NJ: Princeton University Press. Contractor, Noshir. . Some Assembly Required: Leveraging Web Science to Understand and Enable Team Assembly. Philosophical Transactions of the Royal Society of London



     

A: Mathematical, Physical and Engineering Sciences  (): . Accessed January , . doi:./rsta.., pmid: . http://rsta.royalsocietypublishing.org/ content///. Crowston, Kevin, and James Howison. . Hierarchy and Centralization in Free and Open Source Software Team Communications. Knowledge, Technology & Policy  (): –. Accessed September , . doi:./s---. http://link.springer.com/ article/./s---. Crowston, Kevin, Kangning Wei, James Howison, and Andrea Wiggins. . Free/Libre Open-Source Software Development: What We Know and What We Do Not Know. ACM Computing Surveys  (): :–:. Accessed February , . doi:./.. http://doi.acm.org/./.. Cummings, Jonathon N., and Rob Cross. . Structural Properties of Work Groups and Their Consequences for Performance. Social Networks  (): –. Accessed June , . http://www.sciencedirect.com/science/article/pii/S. DiMaggio, Paul J., and Walter W. Powell. . The Iron Cage Revisited: Institutional Isomorphism and Collective Rationality in Organizational Fields. American Sociological Review  (): –. Ducheneaut, Nicolas, Nick Yee, Eric Nickell, and Robert J. Moore. . Building an MMO with Mass Appeal A Look at Gameplay in World of Warcraft. Games and Culture  (): –. Accessed January , . doi:./. http://gac.sagepub.com/ content///. Faraj, Samer, and Steven L. Johnson. . Network Exchange Patterns in Online Communities. Organization Science  (): –. Accessed June , . doi:./ orsc... http://pubsonline.informs.org/doi/abs/./orsc... Fields, Stanley, and Mark Johnston. . Whither Model Organism Research? Science  (): –. Accessed January , . doi:./science., pmid: . http://science.sciencemag.org/content///. 
Geiger, R. Stuart, and Aaron Halfaker. . When the Levee Breaks: Without Bots, What Happens to Wikipedia’s Quality Control Processes? In Proceedings of the th International Symposium on Open Collaboration, :–:. WikiSym ’. . New York: ACM. Accessed November , . doi:./.. http://doi.acm.org/./.. Graeff, Erhardt, Matt Stempeck, and Ethan Zuckerman. . The Battle for “Trayvon Martin”: Mapping a Media Controversy Online and Off-line. First Monday  (). Accessed January , . http://firstmonday.org/ojs/index.php/fm/article/view/. Gu, Bin, Prabhudev Konana, Balaji Rajagopalan, and Hsuan-Wei Michelle Chen. . Competition among Virtual Communities and User Valuation: The Case of InvestingRelated Communities. Information Systems Research  (): –. Halfaker, Aaron, Aniket Kittur, and John Riedl. . Don’t Bite the Newbies: How Reverts Affect the Quantity and Quality of Wikipedia Work. In Proceedings of the th International Symposium on Wikis and Open Collaboration, –. WikiSym ’. New York: ACM. Accessed March , . doi:./.. http://doi.acm.org/./.. Halfaker, Aaron, R. Stuart Geiger, Jonathan T. Morgan, and John Riedl. . The Rise and Decline of an Open Collaboration System: How Wikipedia’s Reaction to Popularity Is Causing Its Decline. American Behavioral Scientist  (): –. Accessed May , . doi:./. http://abs.sagepub.com.ezp-prod.hul.harvard.edu/content/ //.

    



Hannan, Michael T., and John Freeman. . Organizational Ecology. Cambridge, MA: Harvard University Press. Hanski, Ilkka. . Metapopulation Ecology. Oxford: Oxford University Press. Hargittai, Eszter. . Is Bigger Always Better? Potential Biases of Big Data Derived from Social Network Sites. The ANNALS of the American Academy of Political and Social Science  (): –. Accessed October , . doi:./. http://ann.sagepub.com/content///. Hill, Benjamin Mako. . Essays on Volunteer Mobilization in Peer Production. Ph.D. dissertation. Cambridge, MA: Massachusetts Institute of Technology. Howard, Philip N. . Network Ethnography and the Hypermedia Organization: New Media, New Organizations, New Methods. New Media & Society  (): –. Accessed April , . doi:./. http://nms.sagepub.com/content///. King, Gary, Robert O. Keohane, and Sidney Verba. . Designing Social Inquiry: Scientific Inference in Qualitative Research. Princeton, NJ: Princeton University Press. Kollock, Peter, and Marc Smith. . Communities in Cyberspace. London: Routledge. Kraut, Robert E., and Paul Resnick. . Building Successful Online Communities: EvidenceBased Social Design. In collaboration with Sara Kiesler, Moira Burke, Yan Chen, Niki Kittur, Joseph Konstan, Yuqing Ren, and John Riedl. Cambridge, MA: The MIT Press. Lazer, David, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, et al. . Computational Social Science. Science  (): –. Accessed March , . doi:./science.. http://www.sciencemag.org. Leskovec, Jure, Lars Backstrom, and Jon Kleinberg. . Meme-tracking and the Dynamics of the News Cycle. In Proceedings of the th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, –. KDD ’. New York: ACM. Accessed November , . doi:./.. 
http://doi.acm.org/./.. Lessig, Lawrence. . Code and Other Laws of Cyberspace. New York: Basic Books. Marwell, Gerald, and Pamela Oliver. . The Critical Mass in Collective Action: A Micro-social Theory. Cambridge, MA: Cambridge University Press. McPherson, Miller. . An Ecology of Affiliation. American Sociological Review  (): –. doi:./, JSTOR: . Meraz, Sharon, and Zizi Papacharissi. . Networked Gatekeeping and Networked Framing on #Egypt. The International Journal of Press/Politics  (): –. Accessed January , . doi:./. http://hij.sagepub.com/content/early//// . Meyer, John W., and Brian Rowan. . Institutionalized Organizations: Formal Structure as Myth and Ceremony. The American Journal of Sociology  (): –. Mockus, A. . Large-Scale Code Reuse in Open Source Software. In First International Workshop on Emerging Trends in FLOSS Research and Development, . FLOSS ’, . doi:./FLOSS... Morgan, Jonathan T., Michael Gilbert, David W. McDonald, and Mark Zachry. . Project Talk: Coordination Work and Group Membership in WikiProjects. In Proceedings of the th International Symposium on Open Collaboration, :–:. WikiSym ’. New York: ACM. Accessed January , . doi:./.. http://doi.acm.org/./ .. Ortega, Felipe. . Wikipedia: A Quantitative Analysis. Ph.D. dissertation, Universidad Rey Juan Carlos. Accessed June , . http://libresoft.es/Members/jfelipe/phd-thesis.



     

Ostrom, Elinor. . Governing the Commons: The Evolution of Institutions for Collective Action. New York: Cambridge University Press. Putnam, Robert D. . Bowling Alone: The Collapse and Revival of American Community. New York: Simon and Schuster. Restivo, Michael, and Arnout van de Rijt. . Experimental Study of Informal Rewards in Peer Production. PLoS ONE  (): e. Accessed April , . doi:./journal. pone.. http://dx.doi.org/./journal.pone.. Restivo, Michael, and Arnout van de Rijt. . No Praise without Effort: Experimental Evidence on How Rewards Affect Wikipedia’s Contributor Community. Information, Communication & Society  (): –. Accessed March , . doi:./ X... http://www.tandfonline.com/doi/abs/./X... Rogers, Everett M. . Diffusion of Innovations. New York: The Free Press of Glencoe. Ruef, Martin. . The Emergence of Organizational Forms: A Community Ecology Approach. American Journal of Sociology  (): –. Accessed April , . doi:./. http://dx.doi.org/./. Schoonhoven, Claudia B. . Liability of Newness. In Wiley Encyclopedia of Management. New York: John Wiley & Sons, Ltd. Accessed January , . http://onlinelibrary.wiley. com/doi/./.weom/abstract. Schweik, Charles M., and Robert C. English. . Internet Success: A Study of Open-Source Software Commons. Cambridge, MA: MIT Press. Scott, W. Richard, and Gerald F. Davis. . Organizations and Organizing: Rational, Natural and Open Systems Perspectives. Upper Saddle River, NJ: Pearson Prentice Hall. Seneviratne, Oshani, L. Kagal, D. Weitzner, Hal Abelson, Tim Berners-Lee, and N. Shadbolt. . Detecting Creative Commons License Violations on Images on the World Wide Web. In Proceedings of the th International World Wide Web Conference. New York: ACM. Accessed January , . 
http://dig.csail.mit.edu//Papers/ WWW/paper.pdf. Shaw, Aaron, and Yochai Benkler. . A Tale of Two Blogospheres: Discursive Practices on the Left and Right. American Behavioral Scientist  (): –. doi:./. Shen, Cuihua, Peter Monge, and Dmitri Williams. . The Evolution of Social Ties Online: A Longitudinal Study in a Massively Multiplayer Online Game. Journal of the Association for Information Science and Technology  (): –. Accessed November , . doi:./asi.. http://onlinelibrary.wiley.com/doi/./asi./abstract. Soule, Sarah A., and Brayden G. King. . Competition and Resource Partitioning in Three Social Movement Industries. The American Journal of Sociology  (): –. doi:./, JSTOR: . Valente, Thomas W. . Network Models of the Diffusion of Innovations. Cresskill, NJ: Hampton Press. Van de Rijt, Arnout, Soong Moon Kang, Michael Restivo, and Akshay Patil. . Field Experiments of Success-Breeds-Success Dynamics. Proceedings of the National Academy of Sciences  (): –. Accessed January , . doi:./pnas.. http://www.pnas.org/content///. Wang, Xiaoqing, Brian S. Butler, and Yuqing Ren. . The Impact of Membership Overlap on Growth: An Ecological Competition View of Online Groups. Organization Science  (): –. Accessed January , . doi:./orsc... http://pubsonline. informs.org/doi/abs/./orsc...

    



Weber, Matthew S., Janet Fulk, and Peter Monge. . The Emergence and Evolution of Social Networking Sites as an Organizational Form. Management Communication Quarterly  (): . Accessed November , . doi:./. http://mcq.sagepub.com/content/early////. Williams, Dmitri, Nicolas Ducheneaut, Li Xiong, Yuanyuan Zhang, Nick Yee, and Eric Nickell. . From Tree House to Barracks: The Social Life of Guilds in World of Warcraft. Games and Culture  (): –. Accessed October , . doi:./ . http://gac.sagepub.com/content///. Zaccaro, Stephen J., Michelle A. Marks, and Leslie DeChurch, eds. . Multiteam Systems: An Organization Form for Dynamic and Complex Environments. Routledge. ISBN: ---. Zhu, Haiyi, Jilin Chen, Tara Matthews, Aditya Pal, Hernan Badenes, and Robert E. Kraut. . Selecting an Effective Niche: An Ecological View of the Success of Online Communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, –. CHI ’. New York: ACM. Accessed December , . doi:./ .. http://doi.acm.org/./.. Zhu, Haiyi, Robert E. Kraut, and Aniket Kittur. . The Impact of Membership Overlap on the Survival of Online Communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, –. CHI ’. New York: ACM. Accessed December , . doi:./.. http://doi.acm.org/./.. Zhu, Haiyi, Robert E. Kraut, and Aniket Kittur. . A Contingency View of Transferring and Adapting Best Practices within Online Communities. In Proceedings of the Conference on Computer Supported Cooperative Work and social Computing. San Francisco, CA: ACM Press. Accessed January , . http://haiyizhu.com/wp-content/uploads///BestPracticeTransfer.pdf. Zuckerman, Ezra W. . 
The Categorical Imperative: Securities Analysts and the Illegitimacy Discount. The American Journal of Sociology  (): –.

  ......................................................................................................................

      ......................................................................................................................

    

Introduction

Online social interactions have been increasingly occurring through computers and other digital devices, and virtual worlds have become popular sites of online communication and interaction. Virtual or "synthetic" worlds are "crafted places inside computers that are designed to accommodate large numbers of people" (Castronova, ). They are two- or three-dimensional spaces with multiple modalities in which users can engage in mediated interactions with the world and other users. In virtual worlds, users are able to create online self-representations, or "avatars," that may bear little resemblance to their offline identities. As in the "real world," virtual world users can construct and maintain social relationships with preexisting friends and/or strangers. Users can interact with other users through different forms of communication, often using synchronous or asynchronous voice or text chat. They can form socio-emotional ties (e.g., friendship) as well as other collaborative and competitive ties, such as sharing housing or joining teams.

A defining feature of virtual worlds is the freedom for users to craft and assume identities that are unburdened by offline biases based on gender, race, and other sociodemographic characteristics. Virtual worlds therefore have the potential to serve as "social levelers" that facilitate equal exchanges across these groups (Steinkuehler and Williams, ). However, research suggests that virtual worlds are not fulfilling this potential (Collier and Bear, ). Studies of various virtual worlds have produced descriptive statistics about how men and women differ in their sociodemographic profiles and play styles (Williams et al., b; Lehdonvirta et al., ), yet systematic examinations of how men and women engage in social interactions in these worlds remain scant.
In this chapter we first offer an overview of gender disparities and social relationships in virtual worlds, followed by a review of the challenges and benefits of the current approaches to studying gender and virtual worlds. Then we provide an empirical research example, situated in the massively multiplayer online game (MMO) EverQuest II (EQII), focusing on gender and networks of collaboration. We conclude by identifying a few limitations of current approaches and suggesting future directions with digital trace data.

. S G  V W

..................................................................................................................................

Virtual Worlds Research

Millions of people maintain and further develop offline relationships using virtual worlds as a supplemental avenue for interaction. At the same time, many use virtual worlds to form both casual and meaningful relationships with individuals they have never met face-to-face. There is inherent scholarly value in studies of virtual worlds, as more people create and maintain relationships online (Williams et al., ). More important, virtual worlds allow theory testing and the exploration of computer-mediated human behavior in meaningful ways, especially with regard to interactions and social networks, as full-scale and longitudinal observations of social dynamics can be difficult in the offline world. In addition, it is much less costly to implement and test the impact of changes within virtual worlds than in real-world social systems. Social structures, rules, and norms in virtual worlds can be altered and experimentally tested for before-and-after differences, providing important insights into human behavioral patterns and into how social structures respond to and influence specific systems.

Most, if not all, user activities in virtual worlds can be captured in real time. Such digital trace data provide social scientists with a unique opportunity to study a large number of user interactions uninfluenced by researchers or artificial laboratory settings. The scale, comprehensiveness, and precision of these unobtrusive behavioral data are unprecedented, and virtual worlds offer a rich and intertwined social structure with multiplex interactions among users. With these ties, social scientists can build, analyze, and explore various online social networks to ask a variety of theoretical questions. Virtual worlds can be used as a conduit to model real-world social dynamics at individual, group, and societal levels.
There may be similarities in the relationship between social network ties and group performance in both real organizational work groups and virtual world guilds (Benefield et al., ). The spread of an infectious disease in a virtual world has also been compared to a real world epidemic. In the online game World of Warcraft (WoW), game designers introduced a nonplayer character (NPC) with the ability to infect characters with a communicable disease through a special spell (“corrupted blood”) in a newly created dungeon. Although the spell was originally designed to be confined within the dungeon, the highly contagious and lethal virtual plague quickly spread beyond control, infecting thousands of players
throughout the entire virtual world. Scholars noted the many similarities between the virtual and the offline worlds and suggested using WoW as “a platform for studying the dissemination of infectious diseases, and as a testing ground for novel interventions to control emerging communicable diseases” (Balicer, ).

.. Mapping Gender in Virtual Worlds
However, a common challenge with studies of virtual worlds lies in their elusive generalizability. As every virtual world platform has unique purposes, social architecture, rules, and mechanics that may shape user interactions, it is difficult to ascertain whether patterns observed in a particular virtual world apply only to that social system or hold for other virtual worlds and offline social contexts as well. For example, do gender differences observed in a specific virtual world reveal fundamental behavioral differences between men and women, or are they simply a result of specific design features of the world? As a way to deal with these challenges, Williams () developed a “mapping” principle and framework for research. The framework suggests several dimensions along which we can evaluate whether it is appropriate to draw parallels between online and offline settings, as well as between multiple online worlds. The first dimension concerns group size, as researchers can observe and study behaviors in different group contexts, anywhere from dyads to large communities. The second dimension ensures that scholars use appropriate controls for extraneous variables, such as demographics or communication channels. The third dimension considers contextual differences between virtual worlds, including the size and type of the virtual world and the number of its users. The mapping framework has important implications for examining gender and social interactions in virtual worlds. Group size could affect the intentions and behaviors of men and women as we compare small, tightly connected groups to large, dispersed groups. Further, gender is often a confounder, rather than the cause, of observed differences in various outcome variables, such as performance, play style, and communication. Observed differences between men and women along these dimensions might be spurious if extraneous variables are not properly accounted for. 
For example, a study of EQII and Chevaliers’ Romance III found that after controlling for play time and guild membership status, women progress in the game at least as fast as men do, dissolving the gender performance gap observed in previous studies (Shen et al., ). Finally, the contextual differences between virtual worlds may also influence gender dynamics within them. Currently, two broad types of virtual worlds exist, the “sandbox” type and the “scripted” type. In sandbox virtual worlds, such as Second Life, there are typically no predefined goals, leveling, or storyline advancement. Users are free to explore the world at their own pace and create unique in-world experiences. By contrast, MMOs such as WoW and EQII are more scripted and usually have a linear storyline and clear goals for progression for the players. Extensive research has shown
that compared to men, women tend to prefer to explore sandbox virtual worlds and to play less competitive MMOs (Wohn, ). To date, the bulk of research on virtual worlds has focused on MMOs, and this chapter accordingly discusses gender and social networks mainly in the MMO context.

.. Massively Multiplayer Online Games (MMOs)
MMOs are persistent and immersive virtual environments in which users may engage in various gaming activities, interacting with other players as well as with the environment (i.e., NPCs). Research shows that many players are drawn to MMOs because of the myriad opportunities to form social and collaborative ties with others. Take as an example the MMO EQII. Players can communicate with each other instantly via the chat system, trade in-game items such as weapons and potions, and collaborate with each other in both the short and the long term. To encourage social interaction, the game is designed with twenty-four character classes, which have distinct but complementary skills and abilities. As each character can specialize in only one class, players must actively seek collaboration with characters of other classes in order to build well-rounded teams that can survive difficult quests and battles. The more advanced a character is, the more crucial it is to collaborate with others (Shen, ). Some collaborative ties are temporary, such as pick-up groups (PUGs), which are ad hoc teams that come together for a specific task. In EQII, a PUG may accommodate up to twenty-four members. As PUGs are highly task-oriented and exist only temporarily (typically for a few hours), players are as likely to collaborate with strangers as with existing ties. By contrast, guilds represent long-term social structures with stable membership. Guilds serve a multitude of purposes: they provide access to a pool of resources usually too costly for individual players to obtain on their own; they serve as a virtual community for players to connect with like-minded others and socialize; and they provide the infrastructure and stable social backdrop for players to self-organize and tackle quests collaboratively, avoiding the perils of working with complete strangers.

. G G  V W

..................................................................................................................................

.. Are There Gender Gaps in Virtual Worlds?
Virtual worlds allow users to create virtual identities, called avatars, to interact with other avatars and NPCs. Users may choose from a number of character features that resemble “real life” appearances and actions, while also engaging in a fantastical reality. With the ability to create and manipulate outward appearances, there is the possibility for “social leveling” to occur in online worlds, where people can interact without biases
based on offline attributes such as gender, age, and physical attractiveness (Steinkuehler and Williams, ). Indeed, some virtual world users gender swap as a way to avoid gendered harassment (Hussain and Griffiths, ). Because users’ offline identities are often invisible online, it seems plausible that a virtual world could be free of racial and gender inequalities. However, many of the same offline social inequalities, such as “gender gaps,” are still rife in virtual settings (Collier and Bear, ).

... Underrepresentation
Although the number of female participants in virtual worlds has grown considerably (Ipsos MediaCT, ), they are still underrepresented. The  Pew report on teens and technology use shows that nearly three-quarters (%) of teens play video games online or on their phones, yet there is a marked gender difference: % of boys play such games, compared to % of girls (Lenhart, ). Among MMO participants, female players are a definite numeric minority, as demonstrated in studies of numerous MMOs. For example, female players constitute only about % of all players in EQII (Williams et al., ). An MMO with primarily Chinese players, Chevaliers’ Romance III, demonstrates a similarly uneven gender distribution (Shen and Chen, ). Women are also underrepresented as characters in virtual worlds. Female characters tend to be more numerous in casual games but fewer in MMOs (Wohn, ). While playing MMOs, players encounter % female characters and % male characters (Williams et al., b). When female avatars do appear in video games, they are depicted as more helpless and provocative than male avatars (Ogletree and Drake, ).

... Differences in Play Styles, Motivations, and Performance
Studies of gender and MMOs have found persistent gender differences in play styles, motivations, and performance. Compared to men, women players are found to prefer MMOs with more social interaction opportunities and less violent content (Hartmann and Klimmt, ) and are more likely to take supportive roles (Shen, ). Women are motivated to play MMOs for a variety of reasons. First, there is the appeal of joining a community and socializing with other players through multiple types of social and task-oriented interactions. Second, they have the opportunity to role-play using multiple characters and alternate identities. Finally, they also enjoy exploring virtual worlds and participating in teams (Taylor, ). In addition, while women tend to be more motivated by social (rather than competitive) game elements (Yee, ), game advancement is one of the appeals for them (Taylor, ). In contrast, men may socialize in games for achievement-oriented purposes (Yee, ), and they tend to prefer more aggressive character classes (Shen, ). There is also a pervasive perception that women perform worse than men in video games. For example, controlling for level and experience, women have less confidence in their ability to perform well in League of Legends than men do (Ratan et al., ). Empirically, however, the findings about gender and performance in
video games are often contradictory and inconclusive, as many observed performance gaps may be a result of differences in methods, game genres, and extraneous variables. For example, an experiment supported the assertion that men generally perform better than women in video games (Brown et al., ), while a survey found that men tend to have more experience and spend more time playing video games (Ogletree and Drake, ). Differences in performance might thus be due to gender or to spurious factors such as prior experience. In addition, scholars suggest that in some games women are more committed and play more hours per week than men (Williams et al., a), although others have suggested that men tend to play more per week (Hussain and Griffiths, ). Although these findings seem contradictory, the discrepancy may result from methodological or game-based differences: in the former study, scholars surveyed MMO players and were able to connect survey data with behavioral data from game server logs (Williams et al., a), while the latter study used a self-selected sample of MMO players recruited from gaming blogs (Hussain and Griffiths, ). A recent study of two MMOs in the United States and China found that after controlling for play time, guild membership, and other extraneous variables, the gender performance gap dissipated: women and men advanced at the same pace (Shen et al., ).

In various ways, virtual gender gaps are similar to those in male-dominated traditional workplace settings, suggesting there may be some mapping occurring. Many MMOs, for example, have pervasive gender inequalities, resulting in different player social networks. Game players in many MMOs can create an online social network through their online behaviors, by teaming, chatting, trading, and mentoring with other players. 
Players who cultivate their social networks can have access to advantageous resources, such as information, skills, status, social support, or leadership opportunities. It should be noted that despite the many empirical studies exploring gender differences in virtual worlds, few have examined whether gender is a causal or confounding factor for such differences. To ensure that they are not testing for differences in qualities such as experience, commitment, or goals and motivations, appropriate controls are necessary in any survey or experimental study.

.. Why Do Gender Gaps Occur in Virtual Worlds?

... Gender Role Theory
There are a few explanations for why gender gaps occur in virtual worlds, despite the fact that virtual identities can be completely detached from offline identities thanks to the ability to customize and anonymize identity. First, some scholars suggest that there are innate differences between the sexes that also tend to alter behavior online. These “innate differences,” however, may be the result of gender roles, which are socially constructed differences in the expected norms for each sex (Eagly and Karau, ). Gendered stereotypes are one reason these gender roles arise. Whether men or women adhere to these stereotypes affects their perceptions of how the “ideal” man and woman should
behave and creates social benefits for those who conform (Eagly and Karau, , p. ). Some common gender stereotypes include those suggesting that men tend to be more achievement oriented and competitive, while women tend to be more socially oriented and cooperative. Research on gender differences in other settings has found clear examples of these behavioral patterns. For example, men have a greater affinity for and are more aroused by competitive behavior, while women are more likely to engage in cooperative behavior (Eckel et al., ). In addition, compared to men, women tend to contribute more to groups and are more likely to make socially oriented choices (Eckel and Grossman, ). Research has also found support for these behavioral patterns in virtual worlds. Women are more motivated by relationship or social factors, and men are more motivated by achievement and manipulation (Williams et al., a). Others have also suggested that games with high competition and violence are less attractive to women than to men (Hartmann and Klimmt, ). In Second Life, women prefer shopping and interacting with people, while men prefer building and owning or developing property (Guadagno et al., ). Offline, there is a stereotype that “feminine” careers, such as teaching or secretarial work, are more appropriate for women (Britton, ). Many of these jobs have lower pay and less prestige than masculine careers. Similarly, in virtual worlds women tend to pick avatars that support or assist their teammates, as opposed to avatars that engage in offensive play (Shen, ). As a result of game design, these assistive roles tend to occupy devalued positions. In addition, there may be greater game rewards for risky and competitive behavior than for cooperative behavior, resulting in gender gaps. 
When gendered stereotypes about female participation and performance combine with a male-dominated gaming industry, a “vicious cycle” is perpetuated, with male developers making games that appeal to other men (Fullerton et al., ).

... Depersonalization
Depersonalization occurs when individuals are viewed in terms of their group membership, such as assuming a man is aggressive because he is male. Depersonalization may further explain why gender gaps in virtual worlds are prevalent. Notably, in virtual worlds all social interactions occur through a computer-mediated channel, a context with fewer available social cues. In settings where there is limited information for making social categorizations, depersonalization may be more likely to occur, especially when other virtual members may know the gender of a user. Many stereotypical gender differences are more apparent when group members are depersonalized (Spears et al., ). Indeed, there are many gendered stereotypes about men and women in virtual worlds. As previously discussed, the ability of women to perform well in gaming is often questioned. Because of these social stereotypes about women’s performance in virtual worlds, both male and female users may depersonalize unknown female users. Women may be less likely to identify as “gamers,” partly due to a perceived lack of experience and commitment (Shaw, ). Women who participate in virtual worlds
are also often categorized as “girl gamers,” which depersonalizes the identity of this diverse group of users. The average female player in EQII, for example, is . years old (Williams et al., b). Because of this stereotype, many asymmetrical gender differences may affect game dynamics, such as social liking and the formation of virtual teams.

.. Gender Swapping and the Gender Gap
Virtual world users can customize their identities, which creates an opportunity for men and women to choose an avatar of a different gender, a practice known as “gender swapping” or “gender bending.” Gender usually refers to the psychological and socially constructed states associated with the sexes, which are biological and demographic categories (Deaux, ). In a survey on various MMOs, % of male players and % of female players had played with characters of a different gender (Hussain and Griffiths, ). For players’ primary characters in EQII, however, around % of players gender swap, and male players gender swap twice as much as female players (Huh and Williams, ). Similarly, in WoW, .% of characters chosen by male players and .% of characters chosen by women are of the opposite gender (Yee et al., ). There are a few reasons for the mixed findings regarding the actual number of gender swappers in virtual world environments. First, in many virtual worlds, users have multiple characters but will predominantly play using one “main” character. Researchers should consider whether they are interested in the totality of characters, the sum of a player’s online gendered identities, or the identity that a player spends the most time with, which may be the character the player associates with most and uses most to interact with other players. Second, differences in virtual world environments will change the prevalence of gender swapping. Some games, for example, may allow players to pick a gender for any avatar, while in others some avatars are assigned a gender. Certain gendered avatars may have unique abilities, and players may gender swap as a way to take advantage of these skills (Hussain and Griffiths, ). Questions of identity may arise; in some instances, virtual world users behave in accordance with either their offline gender or their virtual gender. 
Therefore, while studying virtual world gender, scholars should consider questions of gender identity in terms of both (a) the user’s actual gender and (b) the avatar’s gender. The reasons that users gender swap may reveal something about gender identity and desired interactions in online social networks. For example, many users gender swap (1) as a way to experience differences in treatment from other characters or (2) to explore another part of their “real life” or online identity (Hussain and Griffiths, ). First, there is a perception that female users often receive preferential treatment, such as gifts, from male users; male users may want to take advantage of this kind of treatment. At the same time, female participants often experience harassment in online games (Kuznekoff and Rose, ), so they may “gender bend” as a way to avoid unequal treatment. Second, users may role-play as another gender as a way to
experience other aspects of themselves. Homosexual game players, for example, are more likely to gender swap their avatars than heterosexual players (Huh and Williams, ). Regardless of sexuality, however, both men and women may want to role-play aspects of both their feminine and masculine identities. There may be other factors, such as the Proteus effect, affecting gendered interactions online. The Proteus effect suggests that avatars, a form of online self-representation, encourage people to behave in certain ways based on the characteristics of the avatar. For example, users with taller avatars behave differently than users with shorter avatars in virtual social interactions (Yee and Bailenson, ). Others have also found support for the Proteus effect in gendered behaviors (Yee et al., ): in WoW, female avatars tended to heal more than male avatars, and male avatars tended to fight more than female avatars. Others have suggested that female gender swappers tend to demonstrate more masculinity online, while male gender swappers do not behave more femininely in EQII (Huh and Williams, ).

. C A  S G  V W

..................................................................................................................................

.. Traditional Methods
Traditionally, studies of gender in virtual worlds have mainly used four methods: qualitative/observational studies, content analysis, experiments, and user surveys. In some instances, qualitative gender analyses are concerned with gendered perceptions and differences in motivation to participate in games. Scholars may ask participants, “Why did you pick an avatar of a different gender?” or “How are women treated by other players?” This type of analysis helps us understand user motivations and perspectives and gather information from the virtual world users themselves. Ethnographies also offer rich and detailed examinations of virtual worlds and interactions within them, such as the effects of gender in a male-dominated, computer-mediated virtual world (Kendall, ). Qualitative and ethnographic studies are excellent for both exploratory analysis and the examination of individual cases. This type of data collection and analysis requires extensive and thorough observational notes. The time period for exploration and data collection also tends to be lengthy, requiring months or even years to fully understand the virtual world. Although useful, qualitative and ethnographic observations are typically not generalizable. The intended purpose of these methods is not to be representative or quantifiable, but instead to fully understand the environment and the “whys” of human behavior.

Another method for analyzing gender in virtual worlds is content analysis, in which scholars select and code textual materials for specific purposes. A content analysis of gender usually examines features of the virtual world that may affect the user’s
experience. This might include the ratio of male to female characters (Williams et al., b) or the sexualization of characters. Content analysis requires training to standardize the coding of messages and ensure inter-coder reliability. This method examines the current state of media content; it does not, however, test for virtual world users’ perceptions or for any direct effects on the users themselves.

Experiments use a highly controlled setting to test individual factors for causality. Within virtual worlds, experiments often take a social psychological approach to testing gendered effects, such as the Proteus effect (Yee and Bailenson, ). Laboratory experiments, however, can seem artificial to participants and typically last only a short period of time. They may also use convenience samples, with participants who are unfamiliar with virtual worlds in general or with the chosen virtual world context.

Finally, user surveys typically ask participants to self-report on their actions, perceptions, and relations within and outside the virtual world, testing for gendered differences. Many survey studies of virtual worlds, however, use self-selected samples, soliciting participants’ voluntary responses on websites (Yee, ). Survey responses do not necessarily reflect actual behavior. For example, respondents’ perceptions of the amount of time spent playing online games may deviate from the actual time spent playing. Some respondents may also lie on surveys due to social stigma.

Each of the traditional methods informs social scientists about various aspects of human behavior online. At the same time, these methods are usually conducted on a relatively small scale due to cost, and most are also somewhat obtrusive.

.. Digital Trace Data
More recently, as the collection and storage of digital trace data have become more accessible to social scientists, an emerging form of data-intensive research, computational social science, has grown (Lazer et al., ). Digital traces are the recorded remnants of people’s actions on a digital medium. Such data are usually unobtrusive, as participants are often unaware, or unconcerned, that their behavioral actions are being recorded for later analysis. The availability of digital trace data creates opportunities for social scientists to use secondary analysis of massive behavioral data sets to answer complex research questions. For virtual worlds research, these data are often provided by online game companies and used to explore individual, group, organizational, or societal processes. Forming relationships with game developers to gain access to proprietary data sets is also challenging. In addition, when only a limited number of researchers have access to proprietary data, research based on those data cannot be verified or replicated (Lazer et al., ). One strategy to address this limitation has been to form multidisciplinary research groups that can negotiate access with game companies, create infrastructure to collect and store large data sets, and conduct various studies from them.
One such research group, the Virtual Worlds Observatory (VWO; http:// vwobservatory.org/), includes social and computer scientists from various universities who test different research questions on the same data. This creates opportunities to learn more about virtual world dynamics, as well as to encourage accountability from other members of the research group.

. R E: G  N  EQ II

..................................................................................................................................

We offer an empirical example to illustrate some of the strengths and challenges of using large behavioral log data to analyze gender and networks. Behavioral data from Sony’s EQII, a popular fantasy-based MMO similar to WoW, are used to study the propensity to connect with other players of the same or a different gender across different types of social networks. The typical forms of log data, such as those in an EQII data set, include both cross-sectional and longitudinal behavioral data (Williams et al., ). In cross-sectional logs, data represent a snapshot of attributes at a single point in time, such as a character’s level, health, or amount of currency. Longitudinal behavioral logs, by contrast, record series of events as they happen over time. For example, when one character trades an item with another, the log data might include the IDs of the characters sending and receiving the item, the time stamp, the name of the item, and the value of the item. The EQII data used in this research example are a longitudinal sample of users and character actions from September  to , , on the server Guk, one of the nineteen North American servers. This time period was selected because it is the only week during which communication and mentorship network data as well as character attributes were all available. As in most MMOs, servers in EQII are parallel versions of a persistent world, and characters are only allowed to interact with other characters on the same server. The server Guk was selected because it represented the most common server type in EQII: player versus environment.
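The shape of such a longitudinal trade-log record can be sketched as a simple typed structure. This is a minimal Python illustration only: the real EQII schema is proprietary, and the field names, tab-separated layout, and example values here are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

# One event from a hypothetical longitudinal trade log, mirroring the
# fields described in the text (sender, receiver, time stamp, item, value).
@dataclass
class TradeEvent:
    sender_id: int
    receiver_id: int
    timestamp: datetime
    item_name: str
    item_value: int

def parse_trade_line(line: str) -> TradeEvent:
    """Parse one tab-separated log line (assumed export format)."""
    sender, receiver, ts, item, value = line.rstrip("\n").split("\t")
    return TradeEvent(int(sender), int(receiver),
                      datetime.fromisoformat(ts), item, int(value))

# Invented example record, not taken from the actual data set.
event = parse_trade_line("101\t202\t2006-09-01T12:30:00\tIron Sword\t75")
```

Because every event carries the IDs of both parties and a time stamp, a full longitudinal edge list for the server can be rebuilt directly from such records.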

.. Data Quality and Management
Depending on the quality of the log data sets and the research questions, very large samples often need to be aggregated or filtered down to smaller sizes. In EQII, there are a total of , characters in the full sample. We are only interested in active players with demographic information, which cuts the sample down to , active characters (defined as those who logged into the game at least once during the observation period). Another interesting characteristic of virtual world data is the distinction between users and characters. Virtual world users often create multiple characters. In the log data, it is
important to distinguish between avatar characteristics, such as avatar gender and guild membership, and user characteristics, such as user gender and age. Because user characteristics are self-reported, people can easily provide false information about themselves. Some virtual worlds, for example, require a minimum age to enter; lying about their age allows youths to enter and may give scholars a skewed age distribution. The EQII data set includes multiple characters per account. Because men are more likely than women to create multiple characters per account, filtering down to one character per account allows for more precise demographic and network analysis (Leavitt et al., ). After filtering out characters that were inactive during the collection period, we took each user’s most played character, ranking all of a user’s characters by how many seconds they had been played since character creation. This method was selected because we assumed that virtual world users would identify most with the “main” character they use the most. When the sample is subset in this way, it is cut down to , users (from , characters). The sample has a high gender disparity: , (%) users are men, while  (%) are women. We are interested in each user’s propensity to connect with people of either the same or a different gender in various networks, while accounting for the disproportionate numbers of female and male players available for such connections.
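The filtering step just described, keeping only each account's most played "main" character, can be sketched in plain Python. The records and field layout below are invented for illustration and are not the actual EQII data.

```python
# Hypothetical character records: (account_id, character_id, seconds_played).
characters = [
    ("acct1", "char_a", 120_000),
    ("acct1", "char_b", 45_000),
    ("acct2", "char_c", 9_000),
    ("acct2", "char_d", 9_500),
]

def main_characters(records):
    """Keep only each account's most played ('main') character."""
    best = {}  # account_id -> (character_id, seconds_played)
    for account, character, seconds in records:
        if account not in best or seconds > best[account][1]:
            best[account] = (character, seconds)
    return {acct: char for acct, (char, _) in best.items()}

mains = main_characters(characters)
# mains == {"acct1": "char_a", "acct2": "char_d"}
```

The same one-pass maximum works at the scale of hundreds of thousands of characters, since it only keeps one record per account in memory.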

.. Theory-Driven versus Data-Driven Analyses
When faced with large log data sets, there is a risk that theoretical decisions will be made post hoc, based on purely data-driven methods. Theory-driven approaches test hypotheses, while data-driven approaches search for patterns within a data set regardless of theoretical implications (Borbora et al., ). These approaches can inform each other in a cyclical relationship, as some theoretical considerations should be made based on data, and complex data-driven methods are better informed by theory. Theory-driven analyses of MMOs can also be more easily interpreted, and the results may be more meaningful (Borbora et al., ). As a start to conducting theory-driven and data-informed research on gender networks in a virtual world, it is necessary to draw on literature from other social contexts, especially when there is a lack of available research in the current medium. For example, to examine communication and mentorship networks in virtual worlds, we can test hypotheses based on prior literature on other male-dominated organizations, such as many traditionally male-dominated corporate workplaces, although there may be different incentives for virtual world users to connect with other men and women. In traditional white- and male-dominated organizations, lower-status individuals are more likely to develop relations with higher-status others to improve their social standing (Ibarra, ). Men, as the numeric majority in many virtual worlds, may also be more likely to categorize others with salient demographic differences as dissimilar. Women may also reduce their interactions with other women in a process of status enhancement. For example, in a professional offline organization, women tend to
develop more professional ties with men than with women. This is because social groups that are less central to a system, with less access to information and resources through their social contacts, are more likely to create relations with people of a different social group in order to broaden their access to information and resources (Ibarra, ). Often, lower-status individuals aim to improve their network centrality by developing relations with perceived higher-status others (Chatman and O’Reilly, ). As a lower-status numeric minority in many virtual worlds, women may be more likely to connect to higher-status majorities (men). Gender may also be less salient for women in video games because there are fewer women, further increasing the likelihood of different-sex ties.

Social Network Analysis

When using traditional research methods to conduct social network analysis, the sample tends to be limited in scope. Egocentric network data can be collected from a small group of individual participants, but a complete social network on a large scale would be difficult to gather. One advantage of using digital trace data for social network analysis is that complex social systems, complete with all interconnected relationships between characters and/or players, can be fully reconstructed and analyzed. Various types of ties can be examined within the EQII data. For the purposes of an empirical example, we explore communication and mentorship networks. The communication network is created through a universal chat feature, which allows players to send instant messages to other players (Huang et al., ). While chat messaging is often used for social purposes, it is also useful for soliciting members for, and collaborating in, pick-up groups (PUGs). Connecting with men may also be perceived as more advantageous in competitive gameplay. The mentorship network consists of players who mentored or were mentored by others (Shen et al., ). When players “mentor” others, they temporarily lower their level to equal their mentees’ and relinquish their higher-level skills in order to collaborate with them in teams. A player might mentor another player for social reasons (to play with friends who are at a lower level, because only players at similar levels can play together in the game), for altruistic reasons (to help train another player and give advice on how to progress), or for in-game advantages, as some games reward mentoring with additional experience points. Controlling for the disproportionate numbers of men and women, we can examine users’ propensity to connect with men and women in different types of in-game networks.
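To make the network construction concrete, directed communication and mentorship networks of this kind can be rebuilt from event-level trace records. The sketch below is a minimal illustration using networkx; the player IDs, event tuples, and gender labels are invented for the example and do not come from the EQII logs.

```python
import networkx as nx

# Hypothetical event-level traces: (sender, receiver) chat events and
# (mentor, mentee) mentorship events. All IDs and values are invented.
chat_events = [("p1", "p2"), ("p1", "p3"), ("p2", "p1"), ("p3", "p2")]
mentor_events = [("p1", "p3"), ("p2", "p3")]
gender = {"p1": "M", "p2": "M", "p3": "F"}  # demographic gender (illustrative)

def build_directed_network(events):
    """Aggregate event-level traces into a weighted directed graph."""
    g = nx.DiGraph()
    for src, dst in events:
        if g.has_edge(src, dst):
            g[src][dst]["weight"] += 1
        else:
            g.add_edge(src, dst, weight=1)
    return g

def gender_mix(g):
    """Count directed ties by (sender gender, receiver gender) pair."""
    counts = {}
    for src, dst in g.edges():
        pair = (gender[src], gender[dst])
        counts[pair] = counts.get(pair, 0) + 1
    return counts

chat_net = build_directed_network(chat_events)
mentor_net = build_directed_network(mentor_events)
print(gender_mix(chat_net))    # {('M', 'M'): 2, ('M', 'F'): 1, ('F', 'M'): 1}
print(gender_mix(mentor_net))  # {('M', 'F'): 2}
```

Tie counts aggregated this way are the observed frequencies that enter the chi-square analyses of gendered connection patterns.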
We conducted chi-square analyses because we are comparing the observed frequencies of interactions with their expected frequencies. In Table 11.1, for example, we can see that both men and women are less likely than expected to send chat messages to women. Men are also more likely to mentor women and less likely to mentor men. Gender homophily does not occur for women in any of the networks, as women did not send/receive or mentor/mentee other

     



Table 11.1 Propensity to Send/Receive Messages and Mentor/Mentee with Different Players (Based on Their Gender)

Communication network
Sender  Receiver  O−E (sender)  χ²     p      O−E (receiver)  χ²     p
M       M         139.15        0.95   0.331  346.745         5.74   0.017
M       F         139.15        4.33   0.037  346.745         26.27  0.000
F       M         169.96        6.08   0.014  62.207          0.91   0.341
F       F         169.96        27.83  0.000  64.207          4.15   0.042

Mentorship network
Mentor  Mentee  O−E (mentor)  χ²     p      O−E (mentee)  χ²    p
M       M       87.24         4.74   0.030  27.32         0.48  0.487
M       F       87.24         21.70  0.000  27.32         2.21  0.137
F       M       14.19         0.53   0.466  1.11          0.00  0.958
F       F       14.19         2.43   0.119  1.11          0.01  0.910

women more than expected. Gender homophily suggests that social interactions are more likely to occur between those of the same sex than between those of different sexes (McPherson et al., ). For the communication network, both men and women have a propensity to send messages to men rather than to women, even accounting for the gender disparity in participation. This is perhaps due to the status differences between men and women in an online setting. The mechanisms for mentorship may differ from chat because mentorship could be considered a more collaborative social network. As discussed previously, online gender identity within a virtual world raises interesting theoretical and methodological issues. In Table 11.1, we compared the propensity to connect with players based on their demographic gender. This could also be compared and contrasted with avatar genders, leading to a general research question: When interacting with other users, do users adhere to patterns consistent with their demographic gender or with their avatar gender? One way to explore this question is to examine gender swappers, players who use an avatar of a different gender than their own, and whether they are more likely to connect to women or men. Among the users in EQII, only a small proportion of the previously discussed sample actually swapped genders with their “main,” or most used, avatar:  (%) men and  (%) women gender swapped. As shown in Table 11.2, in the chat network, male and female gender swappers are more likely to send chats to players of their own gender, as opposed to the opposite gender, when accounting for the proportion of men to women. In the mentorship networks, however, both male and female gender swappers are more likely to mentor women. In summary, the results suggest that both sexes are more likely than expected to communicate with men, but men are more likely to mentor women.
Gender swappers are likely to communicate with their own biological sex, but they also tend to mentor



    

Table 11.2 Gender Swappers’ Propensity to Send Messages and Mentor Other Players (Based on Their Demographic Gender)
(M→F = male player with a female main avatar; F→M = female player with a male main avatar)

Communication network
Sender  Receiver  O−E (sender)  χ²       p
M→F     M         8802.26       1675.68  0.000
M→F     F         768.67        10.59    0.001
F→M     M         −274.13       6.78     0.009
F→M     F         274.13        31.06    0.000

Mentorship network
Mentor  Mentee  O−E (mentor)  χ²     p
M→F     M       −225.00       10.95  0.001
M→F     F       225.00        50.13  0.000
F→M     M       −84.39        6.99   0.008
F→M     F       84.39         31.99  0.000

women. As a first step, chi-square tests allowed us to examine participants’ propensity to connect to either men or women, controlling for the number of men and women available for such connections. It is important to note, however, that many other demographic or game-related factors, such as guild membership, were left out, as chi-square tests do not allow covariates. Even though gender appears to affect social connections, it may be a confounding rather than a causal factor. For example, men may be more likely to mentor women not because they seek to mentor someone of a different gender, but because women are more likely to be at lower levels than men. Other methods, such as exponential random graph models (ERGMs), allow for the inclusion of control variables and can be a more robust approach to social network analysis. Another consideration when conducting social network analysis is whether directed or undirected network ties are of interest. Undirected ties occur when interactions are, by design, reciprocal, such as “friendship” ties that require approval from both parties. Directed ties originate from one actor and lead to another, as when A sends a chat message to B but may not receive a reply from B. Both the communication and mentorship networks are directed. As shown in Tables 11.1 and 11.2, we tested for sender and receiver effects in both social networks. For the gender swappers, we were concerned only with those who initiated the ties: the message senders and the mentors.
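The chi-square logic can be sketched in a few lines: compare the observed counts of directed ties against the counts expected if ties were distributed in proportion to the available men and women. The observed counts and population shares below are invented for illustration and are not the EQII figures.

```python
from scipy.stats import chisquare

# Invented example: chat messages sent by male players, split by
# receiver gender, in a world where 80% of players are men.
observed = [900, 100]                 # messages to male, female receivers
share_male, share_female = 0.80, 0.20
total = sum(observed)
expected = [total * share_male, total * share_female]   # [800.0, 200.0]

stat, p = chisquare(f_obs=observed, f_exp=expected)
o_minus_e = [o - e for o, e in zip(observed, expected)]

print(o_minus_e)                  # [100.0, -100.0]
print(round(stat, 2), p < 0.001)  # 62.5 True
```

The O−E columns in Tables 11.1 and 11.2 report exactly this observed-minus-expected quantity. Because the test admits no covariates, controlling for player level or guild membership requires a model-based approach such as an ERGM instead.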

. F D

.................................................................................................................................. As we have demonstrated in the research example, studying virtual worlds, especially with digital trace data, offers great potential to test and develop theories on gender and

     



social interaction. Still, current computational social science methods are not without limitations. In this section we identify a few limitations of current approaches and suggest promising directions for future research on gender and virtual worlds.

First, despite the rich and complex nature of social structures within virtual worlds, few studies examine gender and social networks online, especially considering the multiple types of social and task-oriented relationships that can be formed between participants. As relationships are increasingly created and maintained in virtual settings, either on their own or as a supplement to offline interactions, they present an exciting area for future research. Virtual gender identity and representation, for example, is often customizable and can be detached from participants’ offline gender identity. To what extent are online social behaviors consistent with offline gender norms? How and why do gender disparities still emerge and evolve over time in virtual worlds? And finally, how could we effectively design virtual worlds to encourage equal participation from various demographic groups? These questions will not only generate important insights about gender roles and social structures in contemporary media environments, but also offer practical recommendations to guide the development of virtual worlds and systems toward equal participation from both genders.

The second limitation in current scholarship is the lack of mixed-methods research on virtual worlds. Despite their unprecedented scale, precision, and unobtrusiveness, large behavioral log data sets should not be considered a panacea. For example, digital traces may help reconstruct relationships between participants, but they cannot always account for the quality of interactions. The frequency of interactions between two virtual world users captures only the quantity, not the quality, of encounters.
If one virtual world user trades an item with another, to what extent is that a meaningful relationship? Such questions cannot be answered solely with digital trace data. Combining traditional methods with big data approaches would help address the weaknesses of each. By pairing a survey of a sample of MMO players with their behavioral log data, for example, players’ perceptions could be compared with their actions: players could be asked in a survey how often they interact with other players of the same gender, and this perception could be tested against their actual social interactions in the game. Similarly, by combining experiments and behavioral logs to test the same mechanism, the artificiality of experiments and the difficulty of establishing causality from logs might be counterbalanced.

Last but not least, an important limitation of current approaches is the lack of cross-cultural and cross-platform research. Studies typically rely on observations from a single virtual world, usually situated in a single culture, which significantly limits the generalizability of findings. Most research has focused on Western-centric MMOs, such as WoW (Ducheneaut et al., ) and EQII (Shen, ). Studies examining virtual worlds in other cultures also typically focus on a single world/game, such as the Chinese Chevalier’s Romance III (Shen and Chen, ) and Dragon Nest (Benefield et al., ). The lack of studies that compare and contrast gendered interactions across virtual world



    

platforms and across cultures is due in part to limited access to behavioral data from multiple worlds. As more digital trace data sets become available, cross-cultural and cross-platform comparative studies will be able to produce generalizable insights. Comparative analyses have the potential to determine whether effects are due to features specific to a virtual world or to other factors. Ideally, this will allow scholars to pinpoint how the social, cultural, and architectural characteristics of specific online environments influence gendered interaction.
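The mixed-methods pairing of surveys with logs described above can be illustrated with a toy comparison of self-reported and logged same-gender interaction shares. Everything here, column names and values alike, is fabricated for illustration.

```python
import pandas as pd

# Hypothetical merged data: one row per surveyed player, with the share of
# same-gender interaction they report and the share computed from logs.
df = pd.DataFrame({
    "player_id": ["p1", "p2", "p3", "p4"],
    "reported_same_gender_share": [0.50, 0.80, 0.30, 0.60],
    "logged_same_gender_share":   [0.65, 0.70, 0.45, 0.60],
})

# Positive gap: the player interacts with the same gender more than they think.
df["gap"] = df["logged_same_gender_share"] - df["reported_same_gender_share"]

print(round(df["gap"].mean(), 2))       # 0.05: mean perception-behavior gap
print(round(df["gap"].abs().max(), 2))  # 0.15: largest individual mismatch
```

A systematic perception-behavior gap of this kind is exactly what log data alone, or survey data alone, could not reveal.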

R Balicer, Ran D. “Modeling Infectious Diseases Dissemination through Online Role-Playing Games.” Epidemiology , no.  (): –. Benefield, Grace A., Cuihua Shen, and Alex Leavitt. “How Group Social Capital Affects Team Success in a Massively Multiplayer Online Game.” Paper presented at the Proceedings of the th ACM Conference on Computer-Supported Cooperative Work and Social Computing [CSCW ’], San Francisco, CA, . Borbora, Z., J. Srivastava, Kuo-Wei Hsu, and Dmitri Williams. “Churn Prediction in Mmorpgs Using Player Motivation Theories and an Ensemble Approach.” Proceedings of the  IEEE International Conference on Privacy, Security, Risk and Trust and IEEE International Conference on Social Computing, Boston, MA, . Britton, Dana M. “The Epistemology of the Gendered Organization.” Gender & Society , no.  (): –. Brown, R. Michael, Lisa R. Hall, Roee Holtzer, Stephanie L. Brown, and Norma L. Brown. “Gender and Video Game Performance.” Sex Roles , nos. – (): –. Castronova, Edward. Synthetic Worlds: The Business and Culture of Online Games. Chicago: University of Chicago Press, , . Chatman, Jennifer A., and Charles A. O’Reilly. “Asymmetric Reactions to Work Group Sex Diversity among Men and Women.” Academy of Management Journal , no.  (): –. Collier, Benjamin, and Julia Bear. “Conflict, Criticism, or Confidence: An Empirical Examination of the Gender Gap in Wikipedia Contributions.” In Proceedings of the ACM  Conference on Computer Supported Cooperative Work (CSCW ’), –. New York: ACM, . Deaux, Kay. “Sex and Gender.” Annual Review of Psychology , no.  (): –, . Ducheneaut, Nicolas, Nicholas Yee, Eric Nickell, and Robert J. Moore. “The Life and Death of Online Gaming Communities: A Look at Guilds in World of Warcraft.” Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, . Eagly, Alice H., and Steven J. Karau. 
“Gender and the Emergence of Leaders: A MetaAnalysis.” Journal of Personality and Social Psychology , no.  (): , . Eckel, Catherine C., Angela De Oliveira, and Philip J. Grossman. “Gender and Negotiation in the Small: Are Women (Perceived to Be) More Cooperative Than Men?.” Negotiation Journal , no.  (): –. Eckel, Catherine C., and Philip J. Grossman. “Differences in the Economic Decisions of Men and Women: Experimental Evidence.” Handbook of Experimental Economics Results  (): –.

     



Fullerton, Tracy, Janine Fron, Celia Pearce, and Jacki Morie. “Getting Girls into the Game: Towards a ‘Virtuous Cycle’.” In Beyond Barbie and Mortal Kombat: New Perspectives on Gender and Gaming, edited by Yasmin B. Kafai, Carrie Heeter, Jill Denner, and Jennifer Y. Sun, –. Cambridge, MA: MIT Press, .
Guadagno, Rosanna E., Nicole L. Muscanell, Bradley M. Okdie, Nanci M. Burk, and Thomas B. Ward. “Even in Virtual Environments Women Shop and Men Build: A Social Role Perspective on Second Life.” Computers in Human Behavior , no.  (): –.
Hartmann, Tilo, and Christoph Klimmt. “Gender and Computer Games: Exploring Females’ Dislikes.” Journal of Computer-Mediated Communication , no.  (): –.
Huang, Yun, Cuihua Shen, Dmitri Williams, and Noshir Contractor. “Virtually There: Exploring Proximity and Homophily in a Virtual World.” Paper presented at the  International Conference on Computational Science and Engineering, .
Huh, Searle, and Dmitri Williams. “Dude Looks Like a Lady: Gender Swapping in an Online Game.” In Online Worlds: Convergence of the Real and the Virtual, edited by William S. Bainbridge, –. London: Springer, .
Hussain, Zaheer, and Mark D. Griffiths. “Gender Swapping and Socializing in Cyberspace: An Exploratory Study.” CyberPsychology & Behavior , no.  (): –.
Ibarra, Herminia. “Homophily and Differential Returns: Sex Differences in Network Structure and Access in an Advertising Firm.” Administrative Science Quarterly , no.  (): –.
Ipsos MediaCT. “The  Essential Facts about the Computer and Video Game Industry.” Entertainment Software Association, . https://cdn.arstechnica.net/wp-content/uploads///esa_ef_.pdf
Kendall, L. Hanging Out in the Virtual Pub: Masculinities and Relationships Online. Berkeley: University of California Press, .
Kuznekoff, Jeffrey H., and Lindsey M. Rose. “Communication in Multiplayer Gaming: Examining Player Responses to Gender Cues.” New Media & Society , no.  (): –.
Lazer, David, Alex Sandy Pentland, Lada Adamic, Sinan Aral, Albert Laszlo Barabasi, Devon Brewer, Nicholas Christakis, et al. “Life in the Network: The Coming Age of Computational Social Science.” Science , no.  (): .
Leavitt, Alex, Joshua Clark, and Dennis Wixon. “Uses of Multiple Characters in Online Games and Their Implications for Social Network Methods.” In Proceedings of the th ACM Conference on Computer-Supported Cooperative Work & Social Computing, .
Lehdonvirta, Vili, Rabindra A. Ratan, Tracy L. M. Kennedy, and Dmitri Williams. “Pink and Blue Pixel$: Gender and Economic Disparity in Two Massive Online Games.” The Information Society , no.  (): –.
Lenhart, A. Teens, Social Media & Technology Overview . Washington, DC: Pew Research Center, .
McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. “Birds of a Feather: Homophily in Social Networks.” Annual Review of Sociology  (): –.
Ogletree, Shirley Matile, and Ryan Drake. “College Students’ Video Game Participation and Perceptions: Gender Differences and Implications.” Sex Roles , nos. – (): –.
Ratan, Rabindra A., N. Taylor, J. Hogan, Tracy Kennedy, and Dmitri Williams. “Stand by Your Man: An Examination of Gender Disparity in League of Legends.” Games and Culture , no.  (): –.
Shaw, Adrienne. “Do You Identify as a Gamer? Gender, Race, Sexuality, and Gamer Identity.” New Media & Society , no.  (): –.



    

Shen, Cuihua. “Network Patterns and Social Architecture in Massively Multiplayer Online Games: Mapping the Social World of EverQuest II.” New Media & Society , no.  (): –.
Shen, Cuihua, and Wenhong Chen. “Gamers’ Confidants: Massively Multiplayer Online Game Participation and Core Networks in China.” Social Networks  (): –.
Shen, Cuihua, Peter Monge, and Dmitri Williams. “Virtual Brokerage and Closure: Network Structure and Social Capital in a Massively Multiplayer Online Game.” Communication Research , no.  (): –.
Shen, Cuihua, Rabindra A. Ratan, Y. D. Cai, and A. Leavitt. “Debunking the Gender Performance Gap in Two Massively Multiplayer Online Games.” Journal of Computer-Mediated Communication , no.  (): –.
Spears, Russell, Martin Lea, Tom Postmes, and Anka Wolbert. “A SIDE Look at Computer-Mediated Interaction.” In Strategic Uses of Social Technology: An Interactive Perspective of Social Psychology, edited by Zachary Birchmeier, Beth Dietz-Uhler, and Garold Stasser, –. New York: Cambridge University Press, .
Steinkuehler, Constance A., and Dmitri Williams. “Where Everybody Knows Your (Screen) Name: Online Games as ‘Third Places’.” Journal of Computer-Mediated Communication , no.  (): –.
Taylor, T. L. “Multiple Pleasures: Women and Online Gaming.” Convergence: The International Journal of Research into New Media Technologies , no.  (): –.
Williams, Dmitri. “The Mapping Principle, and a Research Framework for Virtual Worlds.” Communication Theory , no.  (): –.
Williams, Dmitri, Mia Consalvo, Scott Caplan, and Nick Yee. “Looking for Gender: Gender Roles and Behaviors among Online Gamers.” Journal of Communication , no.  (a): –.
Williams, Dmitri, Noshir Contractor, Marshall Scott Poole, Jaideep Srivastava, and Dora Cai. “The Virtual Worlds Exploratorium: Using Large-Scale Data and Computational Techniques for Communication Research.” Communication Methods and Measures , no.  (): –.
Williams, Dmitri, Nicole Martins, Mia Consalvo, and James D. Ivory. “The Virtual Census: Representations of Gender, Race and Age in Video Games.” New Media & Society , no.  (b): –.
Williams, Dmitri, Nick Yee, and Scott E. Caplan. “Who Plays, How Much, and Why? Debunking the Stereotypical Gamer Profile.” Journal of Computer-Mediated Communication , no.  (): –.
Wohn, Donghee Yvette. “Gender and Race Representation in Casual Games.” Sex Roles , nos. – (): –.
Yee, Nick. “The Demographics, Motivations, and Derived Experiences of Users of Massively Multi-User Online Graphical Environments.” Presence , no.  (): –.
Yee, Nick, and Jeremy Bailenson. “The Proteus Effect: The Effect of Transformed Self-Representation on Behavior.” Human Communication Research , no.  (): –.
Yee, Nick, Nicolas Ducheneaut, Mike Yao, and Les Nelson. “Do Men Heal More When in Drag? Conflicting Identity Cues between User and Avatar.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, .

  .............................................................................................................

INTERACTIONS AND SOCIAL CAPITAL .............................................................................................................

  ......................................................................................................................

    Social Networks, Social Capital, and Social Interactions ......................................................................................................................

 

T last twenty years have seen dramatic changes to human social practices and the technologies that enable, constrain, and ultimately shape (and are shaped by) them. These changes include technological advancements in social media technologies, an expanded and more representative online population, and widespread social media use. For many users, it has become second nature to document their activities and to conduct a significant portion of their social interactions online, resulting in rich, detailed digital traces that can be observed, harvested, and analyzed. These interactions and connections across and between individuals create networks—patterns of social connections—that are leveraged in the design of many social media platforms as a key structure for managing attention and information flows. In an important shift from earlier broadcast technologies, the media that deliver news and information and those that allow us to connect socially are no longer distinct channels; rather, with social media, our immediate networks act as the producers, curators, and feedback mechanisms for the news and information we receive. In addition to distributing news and information, these platforms are also important conduits for social support and social behaviors, the key themes of this section. The chapters in this section engage with research questions that center on the role of the person-centered network as it is articulated, reinforced, and shaped by social media and other online communication technologies. In doing so, these chapters illustrate both the opportunities and the challenges that scholars of technology and society are engaging with as they seek to capitalize on promising new sources of data while still drawing inspiration from long-established theories of social behavior—many of which were developed in a mediascape very different from today’s.



 

The behavioral trace data captured by today’s communication platforms play a key role in these important investigations. Newer social media technologies that derive their value from mediated interactions also archive these interactions, turning them into a rich set of data that reflect components of the human experience and, importantly, are produced in a naturalistic setting for consumption by social peers as opposed to researchers. These data offer insights into social behavior and psychological processes, but also pose a new set of challenges (Ellison & boyd, ). As articulated by Jamieson, Boase, and Kobayashi in their chapter, online digital traces are incomplete and biased in ways that scholars are still seeking to understand. For instance, trace data can offer us very nuanced and detailed information about users’ activities on one particular platform—but nothing about how these interactions influence communication activities that occur via other channels. In addition, although they don’t suffer from the reporting biases found in surveys or other data collection methods, social media trace data are imbued with biases introduced by self-presentational pressures associated with creating content to be consumed by a diverse audience of friends, family, weak ties, work colleagues, and so on. Thus, more than ever, scholars must attend to insights from multiple data sources and turn to theory for guidance in analyzing and interpreting the patterns they identify. The chapters in this section represent exemplars of studies that draw from new data and existing theory in exciting ways, articulating promising future pathways for work that seeks to explore the intersections among social networks, social capital, and social interactions broadly writ. As social capital theory tells us, social networks are important sources of information, support, and other resources.
Of course, research on social capital and social networks existed long before the introduction of online social technologies, but these tools influence social networks and the social capital that flows across them in important ways. Social media platforms capitalize on network patterns in ways that earlier broadcast technologies could not, employing network structures to filter information and connect users to one another in unique configurations. Earlier broadcast technologies such as television distributed a uniform set of content to every user. In contrast, social media content is generated by the network; this means that every single person on any particular platform has the potential to see a completely idiosyncratic set of content and interactions. Moreover, and posing a continuing challenge to social media researchers, there is no way to systematically predict which subset of content will be presented to users on most platforms, due to the complexity and opacity of the algorithms involved. Even search engine results, which we might assume to be fairly uniform across users, differ across individuals (Pariser, ). The network helps manage users’ attention but also complicates researchers’ attempts to study it. These network dynamics make studying interaction and consumption patterns more complex. In addition to looking to established social science theory for guidance, researchers in this space can also benefit from a growing body of work that explicates the specific affordances of online platforms and how they shape social processes in online spaces (e.g., Evans, Pearce, Vitak, & Treem, ). As one example, online social technologies make visible and persistent what was previously invisible and ephemeral,

   



thus eliminating temporal and spatial constraints that served to restrict interaction to specific network clusters. Persistent, visible content and interactions also enable researchers to study social interaction in ways that wouldn’t be possible in face-to-face settings; the chapters in this section represent a diverse set of responses to these (and other) opportunities.

. N (D) O, O (T) G

The study of newer communication platforms provides researchers with new opportunities for re-engaging with extant theory, which offers a structure for understanding new data and in turn prompts re-engagement with and revitalization of established theoretical frameworks. An excellent example is Kwon’s chapter, which considers tie strength, placing it within the context of scholarship on social capital and highlighting the need to complicate and realign our thinking about how we conceptualize tie strength in the era of mediated interaction. Kwon identifies lacunae in the existing work on social capital and online activities, which tends to focus on outcomes—especially positive outcomes of online social activities at the collective or individual level—but does not attend adequately to the “networking itself.” As Kwon argues, we need to complicate our understanding of the relationship between tie strength and social capital, moving away from the assumption that strong ties are necessarily (and exclusively) the source of bonding social capital. Kwon’s chapter highlights an instance in which new technological affordances prompt us to re-engage with an older and useful concept: tie strength. As she writes, “A more refined taxonomy beyond the strong–weak ties dichotomy may contribute to a greater systematic understanding of online social networking or investment patterns.” Relationships have always been complex, but mediated and persistent interaction affordances reshape these patterns in important ways. Highlighting three motivations that characterize social relationships—uncertainty, persistence, and mutuality—Kwon considers different combinations of these factors in an attempt to unpack tie strength and social capital.
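Treating each of these three dimensions as binary yields 2³ = 8 candidate tie types, which can be enumerated mechanically. The labels below are illustrative, not Kwon’s own taxonomy names.

```python
from itertools import product

# Kwon's three relational dimensions, treated here as simple binary flags.
dimensions = ("persistent", "mutual", "uncertain")

tie_types = [
    tuple(dim if on else f"non-{dim}" for dim, on in zip(dimensions, flags))
    for flags in product([True, False], repeat=len(dimensions))
]

print(len(tie_types))  # 8 distinct combinations
# e.g. ('persistent', 'mutual', 'non-uncertain') resembles a classic strong tie
```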
The eight kinds of ties suggested by considering different combinations of persistence, mutuality, and uncertainty illustrate the unique affordances of social media and their ability to support relationships that are unlikely in purely offline contexts. This approach to explicating various kinds of relationships offers a useful framework for better understanding differences between online and offline relationships, their characteristics, and their implications for social capital. As illustrated by the chapter by Jamieson et al., one method for addressing questions about the increasingly complex, and sometimes opaque, relationships among behavior, interactions, and social support dynamics is to leverage multiple data sources. While logged trace data offer researchers the ability to study interactions that were once difficult to observe, they are also biased, offering us an incomplete representation of



 

interactions and cognitive processes. Jamieson et al. use trace phone call and texting data, combined with self-report measures of relationship status and in-person communication, to capture more nuanced and detailed patterns of communication activity among ties of various strengths. Although we know from previous work on media multiplexity theory (Haythornthwaite, ) that strong ties are more likely to use a wider range of communication channels than weak ties, this work confirms the need to consider multiple channels and highlights the need to disentangle communication stemming from coordination needs and social role from interactions that are more social in nature. This chapter also explicates a key, but underacknowledged, challenge introduced by the expanding set of platforms and communication channels available to users in today’s media environment: the limits of studies that focus on a single platform (see also Lampinen, ; Stoycheff, Liu, Wibowo, & Nanni, ). In some cases this may be appropriate, but in others, important pieces of the puzzle will be missing. And—unlike the unfilled areas that mark missing pieces in a jigsaw puzzle—it often isn’t clear which pieces are missing. This work, as the authors note, can guide future research by offering insight into what other kinds of channel data researchers should be collecting or at least highlighting as a limitation. The chapters by De Choudhury and Scholz and Falk are illustrative of the new techniques, made possible by advances in computational methods and neuroscience, for understanding social and psychological dynamics. As previously described, social media and other trace data enable researchers not only to access data that otherwise would be ephemeral, such as playful banter among a group of friends, but also to access hard-to-reach populations, such as the pro-eating disorder communities studied by De Choudhury.
In her chapter on technology-enabled opportunities to study mental health, De Choudhury describes characteristics of social media data that make them especially useful for tracking and studying mental health issues: they are produced in a naturalistic setting, with more finely grained temporal frames than, say, a one-shot survey, and written utterances often include contextual information (such as location) and indicators of emotional state. Given the gap between what we know is good for us (our ideal behaviors) and what we actually do (our actual behaviors), indications of emotional state, especially when paired with data from exercise trackers and other devices, may be key for understanding the complex psychological dynamics that accompany mental and physical health contexts. Finally, as De Choudhury points out, the scale of data is different; this form of data collection can engage a wider range of individuals and is quicker and less expensive than other forms of data collection. De Choudhury’s chapter offers some compelling examples of work that capitalizes on the data produced as a byproduct of online interactions in order to better understand, and attempt to ameliorate, mental health issues. In addition to the new kinds of online activity data scholars now have access to, advances in functional magnetic resonance imaging (fMRI) techniques allow researchers to capture brain activity, offering insights into social and psychological processes that individual users aren’t always willing or able to accurately articulate. Just as, say,
the introduction of the electron microscope enabled scientists to see new detail in existing artifacts, new ways to capture brain activity data can provide researchers with insights into psychological mechanisms. In their chapter on the neuroscience of information sharing, Scholz and Falk use innovative new methods and draw from existing theories to explain the why and how of sharing decisions and behavior at the interpersonal and population levels. Importantly, they focus on understanding the mechanisms behind information sharing, which they argue is important for theory development and design intervention. This approach holds a great deal of promise for unpacking other communication phenomena that may benefit from insights into mechanisms as offered by brain activity data, such as identity shift (Gonzales & Hancock, ). To what extent does behavior change prompted by statements to others about the self relate to feelings about the self, compared to processing that is related to feelings about others? Access to brain data has the potential to untangle various strands of processing that co-occur in online social spaces, where individuals are simultaneously processing statements about the self and seeking to connect with others. Scholz and Falk describe studies that find different brain activity patterns associated with sharing content with one’s entire Facebook network compared to sharing with one friend; this holds great promise for better understanding one’s “imagined audience” (Litt, ). While sharing with one’s friend network means that each of one’s friends could potentially see the status (by going to the poster’s Wall), in reality this is determined by a complicated formula that takes into account past interactions and a host of other factors. 
Users can expect that different kinds of posts will activate different portions of one’s network (e.g., a post about one’s high school reunion will be “liked” by members of one’s high school cluster, which in turn will make it more visible to other members of that cluster). This technique shows promise for helping us understand the complicated relationship among content, imagined audience, and actual visibility and activity patterns.

Conclusion

Taken together, these chapters illustrate promising new directions for research and offer some guidelines for creating compelling scholarship. They represent thoughtful engagement with novel sources of data (such as brain data) that can be used to understand processes difficult to observe otherwise. These new instruments and sources of data represent unexplored opportunities for scientific discovery and also highlight the need to engage with theory, building upon past work as opposed to producing descriptive work that starts from zero knowledge. As these chapters illustrate, scholars should also not shy away from revising theory when it needs to be extended or reconsidered in light of new evidence. In this sense, theoretical frameworks should be considered living entities that adapt over time, not static artifacts in a museum to be dusted off as needed. Kwon's argument about the need to reconceptualize tie strength
in the era of social media is an excellent example of how scholars can productively reengage with established concepts when necessary. In addition to highlighting the ways in which new data and older theories inform one another, these chapters illustrate different aspects of how Internet scholars engage with networked communication concepts and network science. The role of networks is important in many of the topics of interest to scholars in this space. For instance, in the case of the eating disorder communities studied by De Choudhury, the sharing of information within a closed network and the patterns by which messages are amplified and spread across networks are key to understanding the power of these discursive spaces. This example also highlights how the same affordance (such as the ability to broadcast communication) can have both positive and negative outcomes—enabling users to access social support from close ties, for instance, but also to share information about how best to hide harmful eating practices from one's family. Similarly, the network plays a prominent role when considering the dynamics of information sharing via social media, as considered in the Scholz and Falk chapter. De Choudhury's work highlights two important calls to action. First, she articulates the possibilities for scholarship in this space to more explicitly attend to practical interventions that build upon insights to minimize risk and maximize benefits for individuals at risk for health concerns. For many mental health issues, early detection is key, and social media data could be an important tool in these efforts. For instance, automated detection could be helpful by alerting moderators in an online discussion forum about specific users who may be in distress and should be observed closely, or by identifying information about a patient's condition that should be shared with her family and support network.
Considering both the practical as well as the theoretical implications of our scholarship could enhance the impact of scholars working in health and other domains as well. Second, the voice of the user and the desire to support users is an important component of De Choudhury’s work, one that other scholars should attend to. Her chapter highlights the need to consider participant privacy and other concerns, both in the design of our studies and by considering the ways in which our findings can inform the design of technology interventions and communication technology platforms more broadly. As she argues, “Designers, builders, owners and researchers of these systems need to ensure that beyond interventions and the ethical considerations around them, educating users about the privacy risks of sharing sensitive information online that can potentially be linked to their health is of utmost importance.” Educating and protecting users is especially important in domains such as health, where disclosures can have serious implications, such as employment or insurance problems, for users. Although it is tempting to treat social media data as depersonalized, keeping the user in mind is important for research in any domain that has meaningful downstream implications for the kinds of sharing that social media encourages. Finally, these chapters also highlight how inspiration from multiple disciplines can productively inform work by communication scholars. Methods and theories from psychology, sociology, neuroscience, computer science, and information studies are all
represented here. Just as these chapters engage with the network, and just as their data are inextricably networked to one another, they invite scholars to step back and consider knowledge as a networked system, with connections across theories and disciplines providing a more stable base upon which to innovate. Similar to the way in which new scientific instruments enable scientists to look at existing artifacts in new detail, or space exploration or deep-sea craft enable scientists to gather new artifacts, increasing access to social media trace data and powerful new data processing tools are opening up new ways to access and analyze information about social phenomena. But in many cases, these phenomena represent fundamental human activities we already understand quite well. As these chapters illustrate, this is an exciting time for scholars who want to design and build technical interventions that will make a difference in the world, for those who welcome the insights afforded by new sources of data, and for those who are eager to re-engage with established theories in productive ways.

R Ellison, N. B., & boyd, d. (). Sociality through social network sites. In W. H. Dutton (Ed.), The Oxford handbook of internet studies (pp. –). Oxford: Oxford University Press. Evans, S. K., Pearce, K. E., Vitak, J., & Treem, J. W. (). Explicating affordances: A conceptual framework for understanding affordances in communication research. Journal of Computer Mediated Communication, (), –. Gonzales, A. L., & Hancock, J. T. (). Identity shift in computer-mediated environments. Media Psychology, (), –. Haythornthwaite, C. (). Social networks and Internet connectivity effects. Information, Communication & Society, (): –. doi:./ Lampinen, A. (). Why we need to examine multiple social network sites. Communication and the Public, (), –. Litt, E. (). Knock, knock. Who’s there? The imagined audience. Journal of Broadcasting & Electronic Media, (), –. Pariser, E. (). The filter bubble: What the Internet is hiding from you. London: Penguin UK. Stoycheff, E., Liu, J., Wibowo, K. A., & Nanni, D. P. (). What have we learned about social media by studying Facebook? A decade in review. New Media & Society. doi:./ .

  ......................................................................................................................

A Social Investment Approach

.  

Motivated action guides interactions. —Lin (, p. )

1. Introduction

Social capital is networked resources that are produced by the interplay between human agency and social structure (Lin, ). Social capital research has grown multifariously and diverged generally into two categories: outcome-oriented and investment-oriented. Offline social capital research has shown a balance between these two. On the one hand, there is Putnam's () bonding and bridging capital analysis, as well as a similar line of researchers who focus on positive functions of social connectivity. On the other hand, some social network researchers have developed various methods to delve into interpersonal investment patterns based on tie strength or social roles, such as "name generator/interpreter" techniques (e.g., Marin & Hampton, ; Marsden, ). In between are scholars who explore the relationship between social networking patterns and instrumental returns from the relationships (Burt, ; Lin, ). In digital environments, however, social capital literature to date seems to have predominantly overemphasized social capital as outcome-oriented. On an individual level, studies have highlighted the benefits of online social networking for expanding job opportunities (Utz, ), mobilizing like-minded others (Papacharissi, ), finding health support (Chung, ), and meeting dating partners (Ellison, Heino,
& Gibbs, ; Valkenburg & Peter, ). On a collective level, collective action mobilization (Castells, ) and civic and political participation (Gil de Zúñiga, Jung, & Valenzuela, ; Skoric, Ying, & Ng, ) have served as parameters for the evaluation of online social capital. The outcome-oriented studies insightfully reveal the empowering potential of the Internet in generating social support, life betterment, or community building. However, the development of measurements for the antecedents of these positive outcomes—that is, online social networking patterns—has unfortunately not been up to par. This chapter proposes that scholars of Internet social capital should develop a framework that helps researchers delve into online networking activities. The framework should consider not only social relational characteristics but also digital platform affordances. This chapter suggests that such a framework emphasizes users’ purposive decision-making about with whom to connect via which platform. The role of human agency has been tangentially alluded to in existing Internet social capital literature, yet has been somewhat lacking in explication. This chapter argues that social media users often purposively and strategically invest in online social connections with an anticipation of accumulating resources from the invested relationships. This claim is not idiosyncratic. Rather, it aligns with the premise of economic sociology that serves as a foundation for traditional social capital literature: that social capital is a form of capital operated and reproduced by rational individuals (Bourdieu, /; Burt, ; Lin, ; Wellman & Wortley, ). In this chapter I first briefly review the existing outcome-oriented social capital research conducted in digital environments and point out a missing piece from the current literature. 
Second, I introduce an economic sociological view of social capital, primarily by expanding on Lin's () discussion of "social capitalization." Specifically, I argue that purposive action interplays with social structure and platform affordances in the process of digital social capital production. Third, I discuss three cost-associated dimensions underlying online social investment—cost of uncertainty, cost of persistence, and cost of mutuality—in conjunction with widely discussed networking principles, such as homophily versus heterophily, tie strength, and prestige effect. Fourth, I use an empirical example to demonstrate the ways in which the proposed framework may enrich the understanding of social investment patterns underlying digital social capital production. Finally, I conclude the chapter with a few suggestions regarding the ways in which future research may advance the social capitalization framework for its empirical utility.

. S C   I: A O-O A

A majority of existing Internet social capital research defines social capital as outcomes from social interactions. An implicit consensus seems to exist: investigating social
capital should highlight prosocial, positive, or beneficial functions of sociability. This perspective is consistent with Adler and Kwon's () definition of social capital: "the good will [emphasis added] that is engendered by the fabric of social relations" (p. ). Such communitarian optimism seems to be an underlying tone for the majority of current Internet social capital research. In particular, Internet social capital research often adopts Putnam's () notion of bonding capital and bridging capital to operationalize social capital as outcomes. Notably, Williams's () Internet social capital scales (ISCS) is a widely cited survey instrument for measuring online social capital. Drawn from Putnam's bonding-bridging dichotomy, the ISCS evaluates online bonding capital by underscoring positive social outcomes, such as getting emotional support, accessing exclusive resources, and mobilizing solidarity. The ISCS's bridging capital reflects different types of outcomes, which while still prosocial, include outward curiosity, contact with a diverse group of people, perception of the self as a part of the extended world, and generalized norms of reciprocity. The items in the ISCS intend to elucidate the positive effects of online social activities rather than examining the networking patterns configured from these activities. Studies using the ISCS have offered insights by attesting to the positive roles of online social networking in improving quality of life, especially in regard to bridging capital (e.g., Chang & Zhu, ; Ellison, Steinfield, & Lampe, , ; Skoric et al., ). According to Williams (), "while the Internet appears to offer the boundary-crossing engagement that we might all hope for, it does not offer as much deep emotional or affective support like the offline world does" (p. ).
Although Williams’s statement may be refutable with the rise of social software—the primary utility of which is the maintenance of strong ties—the core idea holds validity in that online platforms are particularly useful in uncovering, expanding, and maintaining relationships with broader scopes of social encounters. Nevertheless, equating prosocial outcomes to social capital comes at a price: The outcome-oriented approach offers little room to explore the multitude of networking activities that have emerged in various online platforms. Although much of Internet social capital research acknowledges that bonding capital is embedded in strong ties and bridging capital in weak ties (e.g., Adler & Kwon, ; Chang & Zhu, ; Ellison et al., , ; Scholtz, Berardo, & Kile, ; Williams, ), a superficial review of tie strength does not contribute to the analysis of networking patterns. Instead, the lack of attention to networking activities may result in simplistic, and even faulty, assumptions that “strong ties” is just an interchangeable term for bonding capital (and weak ties for bridging capital), and that the offline version of relational typologies is applicable to digital sociability without modification. If bonding and bridging capitals are the individual benefits, another line of research has focused on the collective outcomes—what Kadushin () refers to as “collective social capital” (p. ). This line of research tends to highlight communitarian values. Ever since Shah, Kwak, and Holbert () investigated the relationship of Internet uses and collective social capital, such as civic engagement and trust building, collective
social capital research has become committed to demonstrating the positive roles of Internet uses in facilitating democracy. Kobayashi, Ikeda, and Miyata (), for example, found that engagement in online communities not only contributes to generalized reciprocity in digital spaces but also has a spillover potential into offline civic participation. Mathwick, Wiertz, and Ruyter () studied virtual peer-to-peer communities, concluding that collective social capital, such as generalized reciprocity, voluntarism, and social trust, are the assets characteristically found in well-designed virtual communities. Collective social capital outcomes seem to have garnered even more attention with the rise of social media, as demonstrated by various literature (e.g., Gil de Zúñiga et al., ; Skoric et al., ; Valenzuela, Park, & Kee, ). These studies are interested in understanding the effect of social networking service uses on civic and political participation and sometimes linking interpersonal bonding and bridging capital to collective social capital (e.g., Skoric et al., ). The collective social capital research, however, often neglects the fact that social capital is the resources invested and embedded in social relations (Lin, ). The overemphasis on collective outcomes results in an unsophisticated examination of social relational dimensions. Moreover, these studies tend to treat the use of the Internet (or social networking services) as a proxy for the social investment put into online social relations. Unfortunately, equating Internet use to online networking activities dismisses the granularity of relational patterns configured in various digital contexts. Similar to the individual-level studies, collective social capital literature seems to overlook networking patterns emergent in digital platforms as a part of social capital production.
In summary, the existing Internet research has emphasized the good will and beneficial consequences of social capital that accrue in digital social contexts. The premise of this line of research is that social capital should manifest itself in positive consequences, such as psychological well-being, life satisfaction, and broadened worldview on an individual level, or enhanced community and better democracy on a collective level. In other words, the empirical preference has been given to examining the positive effects of social capital. While insightful, the existing body of outcome-oriented research reveals a topical bias toward the effects of online networking rather than the networking itself. As a result, the dynamics of social relational patterns in various digital platforms seem to be underrepresented in the current literature.

. A M P: S I P O

Effect-oriented studies have been promoted partly in parallel with Internet researchers' reaction to once-widespread dystopian views on media technologies, including the Internet (e.g., Kraut et al., ; McPherson, Smith-Lovin, & Brashears, ; Nie & Erbring, ; Putnam, ; Turkle, ). In response to early claims that the Internet
induced social displacement and isolation, scholars have provided evidence to the contrary: that Internet use did not displace but instead supplemented the existing social support system (Ellison et al., ; Hampton & Wellman, ; Wang & Wellman, ; Wellman et al., ). The outcome-oriented view of social capital, however, is just one side of the story. Equating social capital to positive effects possibly diminishes the need to understand another essential dimension—the complexities of social networking patterns—that may well deserve a critical assessment in the online social environment. Social networking is understood as a form of social investment because it is an act of expending one's time, effort, or attention for interpersonal or group-level social relationships. In this sense, "social networking" and "social investment" can be used interchangeably. The majority of outcome-oriented literature has stated that bonding capital resides in strong ties and bridging capital in weak ties (Ellison et al., ; Valenzuela et al., ; Williams, ). The logic of tie strength has been customary in the discussion of online social capital; linking tie strength to the outcomes of social interactions is rooted in a well-established offline social capital research tradition. Nonetheless, as widely discussed in social network literature, the definition of tie strength is abstract and not always clear-cut. For example, Granovetter's () seminal research defined weak ties as those who are not family or relatives. Obviously, this early conceptualization of tie strength suggests a rather narrowly defined limit of strong ties. As Marsden and Campbell () later point out, the role-centric categorization is a weak index of tie strength. The most reliable way of measuring tie strength has been to ask respondents to rate emotional closeness (Burke, Kraut, & Marlow, ; Marsden & Campbell, ; Wellman & Wortley, ).
However, various other indicators that are often used to operationalize tie strength, such as homophily, communication frequency, and relational duration, demonstrate that tie strength is indeed a multifaceted concept, different sub-elements of which may lead to a dissimilar understanding of social relations (Marin & Hampton, ). The strong-weak ties dichotomy developed in an offline context is even more ambiguous when applied to the classification of social interactions manifest via digital channels. Mapping the tie strength construct onto digital social environments is a far more obscure task because of the sheer variations in relational forms, as well as levels of commitment—some of which are unique to online interactions thanks to platform characteristics. For example, should a researcher be comfortable defining an anonymous support group member with whom a focal actor shares a high level of self-disclosure and a feeling of closeness as a strong tie simply because the relationship invokes affective intensity? Although an anonymous yet highly self-disclosive online relationship may not necessarily be a conventional strong tie, this relationship may produce an asset of bonding capital. Put differently, the dichotomous assumption that the relational source of bonding capital should be the traditional idea of a strong tie, or conversely that a weak tie should be the source of bridging capital, is too simplistic and abstract to appreciate the complexity of online social networking patterns as antecedents of social capital. In this sense, a more refined taxonomy beyond the strong-weak
ties dichotomy may contribute to a more systematic understanding of online social networking or investment patterns.

. S C F

The lukewarm attention to social investment patterns by Internet scholars may partly result from the equivocal applicability of the offline tie-strength analogy to classify online social relational types (Ellison et al., ). Rather than adhering to the conventional assumption that tie strength is the baseline source of social capital, this chapter reconsiders the antecedents of social capital production by adopting the purposive action proposition inherent in the economic sociological views on social capital. Specifically, Bourdieu (/), Burt (), Coleman (), Portes (), and Lin (), the early founders of social capital theory, suggest that social capital is fungible with other forms of capital (e.g., economic, human, cultural), which necessitates some forms of "investment" (Bourdieu, /). Lin () clarifies the role of social investment by explicitly defining social capital as "investment in social relations by individuals through which they gain access to embedded resources to enhance expected returns of instrumental or expressive actions" (p. ; italics original). His definition illuminates two aspects of social capital: social investment and potential returns. First, he states that the production of social capital must accompany the process of investment in social relations. Second, he explains that social capital is manifest in the anticipation that networked resources should bring greater benefits than networking costs, as opposed to the actual outcome of benefits. The distinction between the latent ability of embedded social resources and the successful actualization of the resources is also explicitly addressed by Portes (): "It is important to distinguish the resources themselves from the ability [emphasis added] to obtain them by virtue of membership in different social structures . . . .
Equating social capital with the resources acquired through it can easily lead to tautological statements" (p. ). In contrast to Lin's () and Portes's () attempts to distinguish social capital from the benefits of social capital, the majority of Internet research on social capital has adopted Putnam's () approach, which is predisposed to treat social capital as interchangeable with positive outcomes of sociability. Given the predominance of this trend, it may be reasonable to adopt a different terminology that highlights another dimension of social capital production—specifically, to borrow Lin's () concept of "social capitalization" (p. ) to differentiate the investment-oriented view from the outcome-oriented view on social capital. Lin () uses the term "social capitalization" to refer to the entire process of social capital production by which a focal actor invests in social relations through a series of social interactions, accumulates social resources that can potentially be transformed into other forms of capital, and gets access to the returns from the accumulated resources.

Several aspects distinguish the social capitalization framework from the outcome-oriented framework:

(1) The social capitalization framework explicitly assumes that self-interest drives social interactions. Social capital "represents purposive actions on the part of actor" (Lin, , p. ). Social investment is decision-making by a rational individual who is aware of the cost as well as anticipated benefits. The most obvious imagery may be the gift economy, notably Guanxi culture in China (Chua & Wellman, ; Smart, ). However, the self-interest proposed here is not limited to economic behaviors. Instead, self-interest broadly refers to any goal-oriented motivations, desires, purposes, and intentions. The goal could be expressive, instrumental, centered on the self, or pertinent to the community to which the self belongs.

(2) Social capitalization is a circular process that involves purposeful decision-making by both the investor and the investee (Milardo, Helms, Widmer, & Marks, ). Suppose Ann is an investor who initiated a relationship with Bill. If Ann gains resources from Bill as a result of the effort Ann has put into maintaining the relationship with him, it is understood that the return is for Ann's benefit. Simultaneously, however, Bill's "gifting act" is another gesture of networking with Ann and is thus translated into a new cycle of social investment. This time, the direction is the opposite: Bill becomes the investor in the relationship with Ann by providing help, and Ann becomes the investee, who may or may not return the favor to Bill in the future. In other words, social investment and resource acquisition are not necessarily separate steps from each other but rather a confluence. The boundary between investment and return seems even less clear in a social media context. For example, is retweeting on Twitter resource acquisition or social investment?
By retweeting, the individual acquires information as well as enhances his or her own visibility in the personal network. However, retweeting signals to the original poster his or her presence as an audience. Therefore, retweeting may also function as a form of social investment in the relationship with the original tweeter.

(3) The social capitalization framework highlights social capital as a product of the interplay between agency and social structure. An actor's cost and benefit assessment is moderated by the social environment in which the actor is positioned. Social investment occurs neither randomly nor limitlessly. Rather, the investment process is constrained by the carrying capacity of the predefined social structure, which existing norms, hierarchy, and power relations govern (Lin, ). For example, actors with high financial capital may perceive the cost to attend a private networking event as low, whereas poorer actors might consider it a costly investment. In this sense, social capitalization is a process that is jointly influenced by structure and rational choice. Highlighting the network structural dimension—either on a macrosocietal level or within an organizational boundary—is not new, as exemplified by the previous

      



work of Borgatti, Jones, and Everett (), Burt (), Lin (), and more recently Fulk and Yuan (). () What is new in digital social capitalization, however, is the role of platform affordances. Actors’ cost-benefit assessment is influenced not only by the embedded social structure but also by online platforms’ technological characteristics. In digital contexts, actors make a decision regarding not only with whom to invest but also through which platform the investment should be made. In other words, the decision for online social investment is the product of the interplay among agency, social structure, and platform affordances. In the digital sphere, the intensity and type of social investment varies greatly, ranging from seconds of “micro-donation” of time to an online community (Margetts, John, Hale, & Yasseri, ), to more committed long-term interactions via social networking services (Ellison et al., ). Such variability in the form and intensity of digitized social investment is linked to the technical advantages and constraints of different platforms and is often the source of difficulty in transplanting the offline tie-strength analogy into digital situations. Accordingly, when a researcher examines social investment dimensions in an online context, a multiplicity of platform choices is as important as human relational choices. () The social capitalization framework does not always advocate positive or successful returns (Portes & Landolt, ). The framework does not underscore positive outcomes for individual or collective betterment, but rather focuses on the act of networking itself to accumulate potential resources. Sometimes the whole process of investment and accruement may result in unintended consequences due to underlying motivations (e.g., excessive resource preservation) or relational structural constraints (e.g., ascribed social status). 
Therefore, as Portes () pointed out nearly two decades ago, the process of social investment and actualizing resources could result in “not-so-desirable consequences of sociability” (p. ).

In summary, to understand social investment patterns as an important dimension of social capital, social capital theory needs to be extended beyond the outcome-oriented perspective. The social capitalization framework proposed in this chapter may serve as a supplementary theoretical lens to the outcome-oriented framework. It is centered on the purposive nature of social investment and its potential to accumulate resources for future returns. This investment-oriented perspective may serve as an alternative to the research trend that equates social capital with positive outcomes. A social capitalization framework may allow Internet researchers to explore online social networking as an interplay among agency, structure, and platform affordances. For example, what technological opportunities and constraints influence the patterns of social investment online? What goals and intentions account for an actor’s selection of certain digital platforms and types of networking activities? What leads actors to make—or to refuse to make—their resources available to others networked online?

Taxonomy of Online Social Investment Patterns

As mentioned previously, conceptualizing social networking activities based on tie strength and associating tie strength with different outcomes of social capital—that is, bonding and bridging capital—is not always straightforward in capturing social dynamics in the digital realm. As an initial stage of social capitalization framework development, this chapter suggests three dimensions of cost-benefit assessment that may influence an actor’s social investment: cost of uncertainty, cost of persistence, and cost of mutuality (reciprocity). While these dimensions resonate with some characteristics of strong versus weak ties, the proposed social investment taxonomy offers a more atomized understanding of the networking types and patterns in digital contexts.

Cost of Uncertainty

Lin () argues that there are two types of goals that drive social capitalization: preserving existing resources and gaining additional resources. Resource preservation is achieved by expressive actions, such as others’ endorsement of one’s entitlement to a resource or confirming one’s legitimacy by sharing sentiments with others (Lin, ). Additional resources are attained through instrumental efforts that encourage others to allocate a new portion of resources to the focal actor. According to Lin, resource preservation is more common and easier to achieve because actors maintain the status quo through recognition and affirmation by others who already share a similar identity or the same community membership—that is, homophily. Homophily refers to the tendency of individuals to have more social interactions with those who share similar attributes or characteristics (McPherson, Smith-Lovin, & Cook, ). Homophilous social interactions occur with a relatively low level of uncertainty and thus demand low cognitive and affective costs. In contrast, adding new resources demands greater investment of effort, because new resources are gained from a connection with someone who possesses dissimilar assets—“resource heterogeneity” (Lin, , p. ). Heterogeneous resources are more likely to be attainable via interactions with those having different identities or backgrounds. The level of uncertainty in heterogeneous interactions is thus higher than in homophilous relations, requiring more effort to reduce the uncertainty. Therefore, the willingness to provide extra effort to reduce uncertainty is directly linked to the anticipated returns from either homophilous or heterophilous social interactions. Homophilous interaction is expected to incur lower uncertainty with the anticipation of resource preservation, while heterophilous
interaction is expected to incur high uncertainty with the anticipation of adding new resources (Lin, ). Note that the uncertainty discussed in this capacity is limited to cross-sectional heterogeneity, as opposed to the social burden caused by a hierarchical discrepancy, which can be considered a separate dimension of cost, as detailed in the cost of mutuality section below. Also, although the uncertainty dimension closely resonates with bonding and bridging capital, it does not necessarily correspond with the dichotomy of tie strength, especially in a digital social context. For example, a Facebook political partisan group is a gathering of like-minded individuals (low uncertainty), but their identity reinforcement (resource preservation; also a bonding capital effect) relies on low-density, dispersed networks (weak ties) through which the political ideology they believe in is reaffirmed and legitimized by networked crowds’ endorsements.

Cost of Persistence

Relational persistence may be a sub-characteristic of tie strength. For example, Granovetter () defines the strength of a tie as a “combination of the amount of time [emphasis added], the emotional intensity, the intimacy, and the reciprocal services” (p. ). Due to the relative ease of terminating relational activities on digital platforms, a maintained online relationship often explicitly signals that the user is willing or motivated to continue the relationship. Conventionally, relational duration has been measured by asking about the longevity of the relationship, such as “how long have you known this person?” Instead of asking the length of acquaintance, the relational persistence dimension in this taxonomy considers whether an actor anticipates the relationship made on a given digital platform to be continuous or transitory. For example, users might befriend their parents on Facebook under the expectation that a Facebook friendship with them will continue without limit. In contrast, if a user perceived the relationship with his or her parents on Facebook as compromising too much privacy (i.e., too costly), the user would not befriend them on Facebook—or would terminate the friendship immediately. However, regardless of relational persistence on this particular platform (Facebook), the nature of the parent-child relationship should remain durable and strong. In this case, it is the platform characteristics, not the inherent relational quality, that largely influence a user’s assessment of the cost of persistence. As another example, participation in an online support group could be relatively transitory—that is, sustained until the actor overcomes the hardship. If an actor maintains social interactions with certain members of the support group even after the problem is solved, the level of relational persistence with these selected members must be distinctive from that with the rest of the group’s members.
Subsequently, social investment in selected members may accumulate a different quality of resources. In this case, even if an actor meets people on the same online platform, the willingness to expend effort on a persistent relationship differs depending on who the partner is. These two examples—Facebook and social support groups—suggest that the assessment of the cost of relational persistence is determined by the interplay among the nature of the relationship, the platform characteristics, and the actor’s intention.

Cost of Mutuality (Reciprocity)

Mutuality, or reciprocity, may be understood as a sub-dimension of homophily as well as of tie strength. However, in this chapter the notion of mutuality is treated distinctly from homophily or tie strength, particularly by highlighting that nonmutuality reflects social structural inequality (e.g., authority, popularity, power, and attractiveness). The nonmutual relationship may produce a prestige effect (Lin, ) or a “positional resource” effect (Wellman & Wortley, , p. ) by which individuals with a higher status have disproportionate advantages in leveraging their social connections. If the aforementioned cost of uncertainty refers to cross-sectional differences, the cost of mutuality is caused by status inequality and thus reflects a vertical gap. Interacting with those of unequal status can sometimes incur an even greater social burden than interacting across cross-sectional differences. A low level of mutuality in social investment is expected from unequal power relations. Conversely, the more equal two actors’ statuses are, the greater the reciprocity that may be anticipated. Therefore, whether or not an actor is willing to invest in a social relationship of unequal status reflects the level of anticipated mutuality from the interaction. Social investment may—or may not—be more likely to be initiated by someone who is lower in status. Digital platforms often facilitate unequal social networking with relatively less effort than in an offline context. For example, top scholars may be followed and retweeted by less influential junior scholars on Twitter, and fans are likely to leave comments on a celebrity’s Instagram posts. These online social activities may be enacted without much expectation of mutuality. As Lin () discusses, status inequality is one of the macrosocial structures that intervene in the process of social capitalization.
An actor with high authority may experience greater readiness to actualize resources than an actor with low authority. Also, despite the same amount of social investment, the sum of returns could be different due to status inequality.

Development of a Social Investment Taxonomy: Online Networking Examples

The taxonomy of social investment patterns is based on the variations in three cost-driven motivations: cost of uncertainty, cost of persistence, and cost of mutuality.

Table 13.1 Social Investment Patterns: Three Cost-Benefit Assessment Dimensions

Low Uncertainty-Persistent-Mutuality
Description: This networking type is the closest to the traditional notion of strong ties, characterized as homogeneous, mutual, and long-term.
Online example: Family friending on Facebook; multi-year collaboration among online gamers

Low Uncertainty-Persistent-Nonmutuality
Description: This networking type is rooted in a long-term community of shared interests, but actors occupy different structural positions in the community.
Online example: Supervisor as a LinkedIn contact; church pastor

Low Uncertainty-Transitory-Mutuality
Description: This networking type characterizes gathering/collaboration for temporarily shared goals.
Online example: Online support group; online protest network

Low Uncertainty-Transitory-Nonmutuality
Description: This networking type temporarily connects to more prestigious actors in the community of shared interest.
Online example: Hyperlinking to a power blogger’s posts

High Uncertainty-Persistent-Mutuality
Description: This networking type connects actors with different resources who anticipate mutual benefits from the connection.
Online example: Listserv among Chamber of Commerce members

High Uncertainty-Persistent-Nonmutuality
Description: This networking type represents a one-directional commitment to a dissimilar actor of unequal status.
Online example: Following an influential actor in a different career on Twitter; committed fan following a celebrity on Instagram

High Uncertainty-Transitory-Mutuality
Description: This networking type aims at instrumental interactions among heterogeneous actors to achieve short-term, mutual goals.
Online example: Guest-host interactions via online sharing services such as Airbnb

High Uncertainty-Transitory-Nonmutuality
Description: This networking type produces the weakest kind of relationship: transitory interactions with dissimilar actors of unequal status.
Online example: Email greeting to an influential actor in a different career; temporary fan following a celebrity on Instagram

In combination, these motivations result in eight differentiable investment patterns. Half of the types (patterns 1 through 4) occur when actors either share a relatively equal status or have mutual consent to be treated as equal. The other half (patterns 5 through 8) characterizes social interactions among actors in unequal power relations. While some of the identified patterns are easily found in offline contexts as well, others are more common to and even unique to online sociability thanks to technological affordances. Table 13.1 summarizes the itemized social investment patterns.
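Because the eight patterns are simply the Cartesian product of the three binary cost dimensions, they can be enumerated mechanically. The sketch below is illustrative only; the labels follow the table, and the ordering follows the chapter's discussion, with the mutuality-sharing patterns listed first:

```python
from itertools import product

# The three binary cost-benefit dimensions discussed in the chapter.
UNCERTAINTY = ("Low Uncertainty", "High Uncertainty")
PERSISTENCE = ("Persistent", "Transitory")
MUTUALITY = ("Mutuality", "Nonmutuality")

# Enumerate all 2 x 2 x 2 = 8 patterns, mutuality-sharing types first,
# matching the order in which the text discusses them.
patterns = [
    f"{u}-{p}-{m}"
    for m in MUTUALITY
    for u, p in product(UNCERTAINTY, PERSISTENCE)
]

for i, name in enumerate(patterns, start=1):
    print(f"({i}) {name}")
```

Enumerating the combinations this way makes clear that the taxonomy is exhaustive: every dyadic networking episode scored on the three binary dimensions falls into exactly one of the eight cells.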

(1) The “Low Uncertainty-Persistent-Mutuality” investment pattern depicts most closely the traditional notion of strong ties. Interaction with close friends on online social networking platforms could be an example. While the most common activities that fall into this typology may be offline-to-online social interactions, online encounters can also reveal this pattern in the online gaming community, for example, where gamers team up regularly over several years.

(2) The “Low Uncertainty-Transitory-Mutuality” investment pattern describes online social interactions that aim for temporary goals shared by the actors. Social support networks or online protest networks that emerge along with a specific agenda may be an example of this pattern. This type of investment is also found offline, such as in attending group meetings of patients.

(3) The “High Uncertainty-Persistent-Mutuality” pattern characterizes reciprocal social investment among those from different backgrounds, yet with similar social positions, who mutually understand that their long-term relationship may bring some benefit to one another in the future. For example, a manager of a local theater company and an editor of a local newspaper may be connected through LinkedIn. A Chamber of Commerce can be another example found offline; as long as members do not break the connection, they share the mutual expectation that knowing each other may bring some returns in the future. Given that networking opportunities among dissimilar actors do not occur as frequently as among similar actors, this type of investment can particularly leverage online sociability.

(4) The “High Uncertainty-Transitory-Mutuality” pattern often aims to seek an instrumental gain from other consenting individuals who possess different resources. The goals may be specific and achievable in a short period of time.
An example could be the instrumental interaction between service providers and beneficiaries who meet through online apps, such as the host-guest relationship through Airbnb or Uber. This type of relational pattern is also observed in an offline context—for example, short friendships at a holiday resort. In some other cases, however, this investment type could be influenced by the motivation to preserve expressive assets, possibly resulting in defensive interactions with dissimilar individuals. For example, online commenting communities, such as those on YouTube, often display different opinions, and various perspectives are encountered and exchanged. Although heterogeneous discursive interactions could encourage the gaining of new knowledge (Kim, Hsu, & Gil de Zúñiga, ), they could also result in negative consequences (e.g., incivility, polarization) driven by an excessive desire to preserve existing resources (Coe, Kenski, & Rains, ).

(5) The “Low Uncertainty-Persistent-Nonmutuality” pattern characterizes an actor’s networking with someone in a higher position in the hierarchy within the same community, under the expectation that the relationship will be sustained over a long period of time. For example, friending a workplace supervisor or the pastor or elders of a church community on Facebook is expected to be a long-term connection with shared interests, yet with unequal status. A server-client relationship at a regularly visited restaurant may fall into the offline version of
this category. This investment type is likely to be an offline-to-online spillover of the relationship.

(6) The “Low Uncertainty-Transitory-Nonmutuality” pattern seems to be a rather uncommon type of social investment offline. However, it is occasionally observed in an online context. For example, in the blogosphere a community is created among like-minded bloggers with a shared topical interest (e.g., law, technology), but each blogger’s individual status in the community of interest differs depending on the blog’s popularity. Hyperlinking among bloggers reveals a power-law tendency (Hindman, ), with the most popular posts getting disproportionately large numbers of inbound links compared to the marginal ones. While marginal bloggers may gain new knowledge from this type of nonmutual investment, the hyperlinked blogger also gains expressive returns by building a reputation for credibility by virtue of knowledge provision. Hyperlinking to a certain blog post is a transitory networking activity unless a blogger subscribes to another blogger over a long period of time.

(7) The “High Uncertainty-Persistent-Nonmutuality” pattern echoes pattern 5 (“Low Uncertainty-Persistent-Nonmutuality”), except that social networking in this category is the boundary-crossing type. For example, a junior social scientist might meet a highly influential physicist at an interdisciplinary conference and maintain the connection to the physicist via Twitter. A completely different scenario is also conceivable; for example, fans who sustain one-way interactions with a celebrity by following and commenting on the celebrity’s Twitter or Instagram posts. This fan-celebrity relationship could be persistent or transitory (the last pattern below), depending on the fan’s willingness to pay long-term attention to the celebrity. This interaction pattern occurs in a highly heterogeneous and unequal relational context, and thus it may be somewhat difficult to sustain offline.
Accordingly, digital platforms are particularly useful for this type of social investment.

(8) The “High Uncertainty-Transitory-Nonmutuality” pattern refers to the investment put into the most dissimilar interaction activities. The examples described for pattern 7 could fall into this category if one decides to invest in the relationship only momentarily. For example, the junior social scientist described above could send just a one-time greeting email to the physicist. The one-way fan interaction with the celebrity may also fall into this category if the fandom is fleeting.
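The power-law tendency in blog hyperlinking noted under pattern 6 can be illustrated with a toy "rich-get-richer" (preferential attachment) simulation. The model is an assumption of this illustration, not the chapter's own analysis: each new post links to one existing post, chosen either in proportion to links already received or, for comparison, uniformly at random:

```python
import random

random.seed(7)

def simulate(preferential: bool, n: int = 3000) -> list:
    """Return inbound-link counts after n posts each add one outbound link."""
    inbound = [0]
    urn = [0]  # node i appears 1 + inbound[i] times, so sampling is preferential
    for new in range(1, n):
        target = random.choice(urn) if preferential else random.randrange(new)
        inbound[target] += 1
        urn.append(target)      # the chosen post becomes even more likely next time
        inbound.append(0)
        urn.append(new)
    return inbound

rich_get_richer = simulate(preferential=True)
uniform_choice = simulate(preferential=False)

# Preferential attachment concentrates inbound links on a few popular posts.
print("max inbound (preferential):", max(rich_get_richer))
print("max inbound (uniform):     ", max(uniform_choice))
```

Under the preferential rule the most-linked post typically accumulates many times more inbound links than under uniform choice, mirroring the disproportionate link counts that marginal versus popular bloggers receive.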

. A E A: S I   F P N

So far this chapter has described the typologies of social investment patterns based on the cost-benefit assessment of digital sociability. The investment-oriented view and
social capitalization framework may complement the outcome-oriented approach that prevails in Internet social capital research. The social capitalization framework may be particularly useful when a researcher is interested in delving into online networking patterns. A study the author conducted with colleagues a few years ago (Stefanone, Kwon, & Lackaff, ) demonstrates possible directions for the empirical utilization of a social capitalization framework. That study is a rudimentary example, however, because it does not fully incorporate the taxonomy of social investment patterns in its analysis. The social capitalization framework presented in this chapter was refined after the study was published, largely based on the lessons learned from it. Specifically, the study was based on a pseudo-experiment conducted on Facebook. In this study, social capital was explicitly defined as networked resources invested and gained through individuals’ purposive actions (Lin, ). The study examined what relational qualities motivated individuals’ social investment online. During the experiment, fifty users (or requesters) sent out a message to twelve selected Facebook friends to request low-stakes help on an image-labeling task. Among the twelve contacted friends, six were chosen as the emotionally closest ties (strong ties) and the other six as the most emotionally detached ties (weak ties). The request message was uniform and read as follows: “Hey, [First Name]—I need your help with a class project I’m working on. I need people to provide labels for a series of online images. I’d really appreciate your help! Please go to [study URL] and take the quick survey and label as many images as you can. Your participation will be a huge help. Thanks!” (Stefanone et al., , p. ).
For the image-labeling task, the research team developed a website that presented Google images one at a time with an open text bar in which participants could write any words associated with the image. The image-labeling system did not specify when to stop the task. Also, any word could be added to the system, and thus no expertise was needed. Technically, anyone could continue to label an infinite number of images. In other words, it was the contacted friend’s decision when to stop the task and how many images to label. The total number of images labeled by each friend indicated the amount of tangible returns the requester mobilized. Simultaneously, however, spending time on this task was also understood as an act of social investment that the contacted friend placed in the requester. Given this circular notion, the intensity of a social investment by a contacted friend was operationalized as the number of images he or she labeled. The study also tested the effect of tie strength on the amount of image labeling. The results revealed that, while tie strength contributed to a friend’s decision to initiate the task, the effect of tie strength dropped once Facebook contact frequency was controlled. Moreover, the tangible act of helping was associated with neither ISCS-based bonding nor bridging capital (Williams, ). Considering that the ISCS has been a widely used instrument for outcome-based Internet social capital, this result suggested that the outcome-based framework might not be the most suitable to explain the social investment dynamics underlying this experiment.
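The kind of finding described here, a tie-strength effect that drops once contact frequency is controlled, can be sketched with a small regression example. The data below are entirely synthetic, an assumption of this illustration rather than the study's data, constructed so that tie strength relates to labeling effort only through contact frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 600

# Synthetic data: tie strength drives contact frequency, and only
# contact frequency drives the number of images labeled.
tie_strength = rng.normal(size=n)
contact_freq = 0.8 * tie_strength + rng.normal(scale=0.6, size=n)
images_labeled = 5.0 + 2.0 * contact_freq + rng.normal(size=n)

def ols_coefs(y, predictors):
    """Least-squares coefficients, with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

raw = ols_coefs(images_labeled, [tie_strength])[1]
controlled = ols_coefs(images_labeled, [tie_strength, contact_freq])[1]

print(f"tie-strength coefficient alone: {raw:.2f}")
print(f"with contact frequency controlled: {controlled:.2f}")
```

The raw coefficient is substantial, but it shrinks toward zero once the mediating variable enters the model, which is the statistical pattern behind the study's reported result.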

Instead, specific networking characteristics explain the result better. One of the findings showed that nonmutual contacts were willing to spend more effort on image labeling. Nonmutuality (originally termed “social prestige”) was measured within the requester-friend dyad and operationalized as the discrepancy in the partners’ perceived attractiveness of each other. The more socially prestigious (more attractive) the requester was, the more returns he or she could gain. Conversely, the friend with low social prestige in the dyad spent more time performing the image-labeling task for the sake of the requester. Given that requesters and contacted friends were from the same university, they were assumed to come from relatively similar backgrounds with a low level of uncertainty. Also, considering that defriending was a relatively uncommon practice on Facebook at the time the experiment was conducted, a Facebook friendship was by default considered to be a relationship with some expectation of persistence. Therefore, among the three cost dimensions, this study was best suited to discussing the dimension of mutuality. From the requesters’ point of view, their prestigious positions were advantageous in mobilizing friends’ time and effort. While requesters gained returns by virtue of their friends’ help, from the friends’ perspective the act of helping was translated into a form of social investment in the relationship with the requester. The contacted friends were willing to spend their time to solidify relationships with a more socially attractive individual; they were willing to pay the cost of mutuality.

Discussion and Future Research

This chapter introduced an investment-oriented understanding of social capital. Social capital theory has served as an insightful theoretical backbone for Internet research that explores the functions of digital sociability. A majority of Internet social capital studies have highlighted the ways in which social media use can result in positive outcomes for individual well-being and societal betterment. This outcome-oriented approach has been successful in confirming the community-enhancing roles of digital connectivity. However, overemphasizing the positive outcomes of social capital may lead scholars to neglect the antecedents of social capital effects, specifically the mechanisms of social investment. This chapter aimed to complement the existing outcome-oriented framework by delving into the variety of social investment patterns curated in a digital environment. The discussion was centered on Lin’s () notion of social capitalization. Although Lin’s original text mentioned the term “social capitalization” only a few times, the concept was nevertheless adopted as primary vocabulary in this chapter in an attempt to differentiate the understanding of social capital as a social investment from the understanding of social capital as a positive outcome. At the bottom of the social capitalization framework lies the premise that, like investment in other forms of capital, social investment is driven by purposive actions.

The chapter highlighted three dimensions of cost-benefit assessment that could influence social investment decisions: cost of uncertainty, cost of persistence, and cost of mutuality. The willingness to accept the cost of uncertainty is closely linked to cross-sectional similarities or differences. The decision to accept the cost of persistence underscores the actor’s willingness to keep the relationship persistent rather than transitory. The cost of mutuality is rooted in status inequality or hierarchical differences. Whether or not an actor endures the cost of uncertainty, persistence, or mutuality incurred by a given networking practice is determined by the interplay among the actor’s intention, social structure, and platform affordances. It is important to note that the social capitalization framework is based on a parsimonious assumption about online users—that they have purposive minds when they engage in social networking activities. Accordingly, it cannot explain situations in which purposeless behaviors result in serendipity. Although the social capitalization framework may not be representative of all possible online networking situations, it nonetheless contributes to the development of a social investment taxonomy configured in the digital sphere, where some social activities are uniquely distinctive from those offline. This chapter is a preliminary introduction to the social investment framework, which needs further tuning for empirical reification. Also, the chapter does not address the ways in which social investment patterns could be represented by network structural analysis. While a structural analytic approach to Internet social capital is beyond the scope of this chapter, it is an important research agenda that calls for future scholarly attention.
Future research is needed in the following three areas to improve the empirical utility of the investment-oriented framework:

(1) The first area of research is to validate whether the three cost dimensions can uniquely distinguish various online networking activities from the traditional understanding of tie strength. The validation requires a survey of a wide range of networking activities occurring on different online platforms. The investment-oriented framework is centered on a networker’s motivation. Therefore, a survey should ask about the actors’ intentions underlying each particular networking event. Online networking activities could be sampled by modifying name generator and interpreter techniques. The name generator/interpreter method is one of the most popular in social network studies (Marin, ). Conventionally, the generator questions are designed to collect a set of social contacts predisposed to certain relational attributes. In order to validate the cost dimensions as the parameters for digital relational typologies, however, the sampling of networking activities must be as random as possible and cover as many varieties of platforms and networking episodes as possible. Specifically, a researcher may be able to collect networking incidents by using platform generator questions (e.g., having respondents come up with a set of frequently used online platforms) and random name generator questions (e.g., having
respondents list a set of social contacts within each platform in a random way). If a computational tool is available, an automated name generation process might improve the random sampling. Once social contacts are identified, a follow-up survey would ask the “interpreter” questions that address the nature of each networking episode based on cost dimension–related questions (e.g., to what extent the interaction with X on platform Y was based on shared interest, an expectation of long-term continuation, or an expectation of reciprocal interaction). For validation purposes, the researcher may additionally ask traditional tie strength–related questions, such as affective closeness and communication frequency. Multidimensional scaling or factor analysis can then be used to examine whether these three costs are indeed distinct from one another and from traditional tie dimensions.

(2) The second—probably highly ambitious—area of research is to ponder the implementation of machine-learning techniques for the large-scale classification of investment patterns. The platform/name generator and interpreter methods can collect only limited data because they require the interpretation of every identified activity based on multiple interpreter questions. While computational classification could help apply the investment framework to a large-scale project, it can unfortunately be very challenging on some platforms. For example, the data-driven operationalization of social investment patterns is relatively straightforward on Twitter. The relationship between two users would incur low uncertainty if their profile descriptions reveal similar backgrounds; their networking would show persistence if the actors’ following or interaction history has been long-lasting; and the relationship would be based on mutuality if their interactions (e.g., retweeting, favoriting, and mentioning) show bidirectionality.
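These Twitter-style decision rules can be expressed as a simple rule-based classifier. Everything below is a hedged sketch: the field names, thresholds, and example dyad are hypothetical illustrations, not validated operationalizations:

```python
from dataclasses import dataclass

@dataclass
class Dyad:
    """Hypothetical signals observable for a pair of Twitter-like accounts."""
    profile_similarity: float  # 0..1 similarity of profile descriptions
    months_interacting: int    # length of following/interaction history
    bidirectional: bool        # retweets/mentions/favorites flow both ways

def classify(d: Dyad, sim_cut: float = 0.5, months_cut: int = 12) -> str:
    """Map the observable signals onto the three-dimension investment taxonomy."""
    uncertainty = "Low Uncertainty" if d.profile_similarity >= sim_cut else "High Uncertainty"
    persistence = "Persistent" if d.months_interacting >= months_cut else "Transitory"
    mutuality = "Mutuality" if d.bidirectional else "Nonmutuality"
    return f"{uncertainty}-{persistence}-{mutuality}"

# A long-running, one-way connection between dissimilar accounts,
# e.g., a committed fan following a celebrity:
fan_celebrity = Dyad(profile_similarity=0.2, months_interacting=36, bidirectional=False)
print(classify(fan_celebrity))  # High Uncertainty-Persistent-Nonmutuality
```

In a real project the hand-set cutoffs would be replaced by measures learned from labeled interpreter-survey data, which is precisely the machine-learning step the text flags as ambitious.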
Conversely, platforms such as Instagram and Facebook prohibit the use of application programming interfaces (APIs) for personal data collection; thus, employing such relational rules for computational assistance is nearly impossible. Although the computational approach will depend on technical availability, this area of research is worth exploring if one considers scaling up the observation of digital data.

(3) The third area of research will be to bridge the investment framework and the outcome framework. The successful validation of the cost-based taxonomy, proposed as the first area for future research, is a prerequisite for integrating networking patterns and social capital outcomes into a single model. Supposing that the cost-based taxonomy effectively characterizes different patterns of digital networking activity, there are two ways to bridge the investment and outcome aspects of social capital. The first is to examine the association between investment motivations and outcomes at the level of individual networking activities. Researchers may sample networking activities via the name generator and interpreter techniques, with additional interpreter questions that measure the bonding, bridging, and collective social capital embedded in each networking activity.



 

The second is to examine the relationship between investment patterns and outcomes at the actor level. An actor-level analysis requires researchers to develop an instrument that surveys individuals’ predispositions toward social investment patterns on online platforms. These investment predispositions are then associated with existing social capital outcome measures (e.g., the ISCS, collective social capital). An actor-level analysis sacrifices some of the detailed understanding of networking activities, but it should nonetheless be valuable because it allows researchers to analyze social investments as antecedents of positive outcomes.
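At its simplest, the actor-level association could be estimated as a correlation between a respondent's investment-predisposition score and an outcome score. The data values below are hypothetical placeholders, and the helper function is a generic Pearson correlation rather than any specific instrument:

```python
import statistics

# Hypothetical actor-level data: each respondent's mean investment
# predisposition score and an ISCS-style social capital outcome score.
predisposition = [2.1, 3.4, 4.0, 2.8, 3.9, 1.7]
outcome        = [2.5, 3.1, 4.2, 2.9, 3.8, 2.0]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson_r(predisposition, outcome), 3))
```

In practice this would be embedded in a multivariate model with controls, but the correlation illustrates the basic actor-level linkage between investment and outcome.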

R Adler, P. S., & Kwon, S. W. (). Social capital: Prospects for a new concept. Academy of Management Review, (), –. Borgatti, S. P., Jones, C., & Everett, M. G. (). Network measures of social capital. Connections, (), –. Bourdieu, P. (/). The forms of capital. In I. Szeman & T. Kaposy (Eds.), Cultural theory: An anthology (pp. –). New York: John Wiley & Sons. Burke, M., Kraut, R., & Marlow, C. (, May). Social capital on Facebook: Differentiating uses and users. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. –). ACM. Burt, R. S. (). The network structure of social capital. Research in Organizational Behavior, , –. Castells M. (). Networks of outrage and hope: Social movements in the Internet age. Malden, MA: Polity Press. Chang, Y. P., & Zhu, D. H. (). The role of perceived social capital and flow experience in building users’ continuance intention to social networking sites in China. Computers in Human Behavior, (), –. Chua, V., & Wellman, B. (). Networked individualism, East Asian style. In Communication: Oxford Research Encyclopedias. Retrieved from http://communication.oxfordre.com/ view/./acrefore/../acrefore--e- Chung, J. E. (). Social networking in online support groups for health: How online social networking benefits patients. Journal of Health Communication, (), –. Coe, K., Kenski, K., & Rains, S. A. (). Online and uncivil? Patterns and determinants of incivility in newspaper website comments. Journal of Communication, (), –. Coleman, J. S. (). Social capital in the creation of human capital. American Journal of Sociology, , S–S. Ellison, N., Heino, R., & Gibbs, J. (). Managing impressions online: Self-presentation processes in the online dating environment. Journal of Computer-Mediated Communication, (), –. Ellison, N. 
B., Steinfield, C., & Lampe, C. (). The benefits of Facebook “friends”: Social capital and college students’ use of online social network sites. Journal of ComputerMediated Communication, (), –. Ellison, N. B., Steinfield, C., & Lampe, C. (). Connection strategies: Social capital implications of Facebook-enabled communication practices. New Media & Society, (), –.




Fulk, J., & Yuan, Y. C. (). Location, motivation, and social capitalization via enterprise social networking. Journal of Computer-Mediated Communication, (), –.
Gil de Zúñiga, H., Jung, N., & Valenzuela, S. (). Social media use for news and individuals’ social capital, civic engagement and political participation. Journal of Computer-Mediated Communication, (), –.
Granovetter, M. S. (). The strength of weak ties. American Journal of Sociology, (), –.
Hampton, K., & Wellman, B. (). Neighboring in Netville: How the Internet supports community and social capital in a wired suburb. City & Community, (), –.
Hindman, M. (). The myth of digital democracy. Princeton, NJ: Princeton University Press.
Kadushin, C. (). Too much investment in social capital? Social Networks, , –.
Kim, Y., Hsu, S. H., & de Zúñiga, H. G. (). Influence of social media use on discussion network heterogeneity and civic engagement: The moderating role of personality traits. Journal of Communication, (), –.
Kobayashi, T., Ikeda, K., & Miyata, K. (). Social capital online: Collective use of the Internet and reciprocity as lubricants of democracy. Information, Communication & Society, (), –.
Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukophadhyay, T., & Scherlis, W. (). Internet paradox: A social technology that reduces social involvement and psychological well-being? American Psychologist, (), .
Lin, N. (). Social capital: A theory of social structure and action. New York, NY: Cambridge University Press.
Margetts, H., John, P., Hale, S., & Yasseri, T. (). Political turbulence: How social media shape collective action. Princeton, NJ: Princeton University Press.
Marin, A. (). Are respondents more likely to list alters with certain characteristics? Implications for name generator data. Social Networks, (), –.
Marin, A., & Hampton, K. N. (). Simplifying the personal network name generator: Alternatives to traditional multiple and single name generators. Field Methods, (), –.
Marsden, P. V. (). Interviewer effects in measuring network size using a single name generator. Social Networks, (), –.
Marsden, P. V., & Campbell, K. E. (). Measuring tie strength. Social Forces, (), –.
Mathwick, C., Wiertz, C., & Ruyter, K. D. (). Social capital production in a virtual P community. Journal of Consumer Research, (), –.
McPherson, M., Smith-Lovin, L., & Brashears, M. E. (). Social isolation in America: Changes in core discussion networks over two decades. American Sociological Review, (), –.
McPherson, M., Smith-Lovin, L., & Cook, J. M. (). Birds of a feather: Homophily in social networks. Annual Review of Sociology, , –.
Milardo, R. M., Helms, H. M., Widmer, E. D., & Marks, S. R. (). Social capitalization in personal relationships. In C. R. Agnew (Ed.), Social influences on romantic relationships (pp. –). Cambridge, UK: Cambridge University Press.
Nie, N. H., & Erbring, L. (). Internet and mass media: A preliminary report. IT & Society, (), –.
Papacharissi, Z. (). The virtual geographies of social networks: A comparative analysis of Facebook, LinkedIn and ASmallWorld. New Media & Society, (–), –.




Portes, A. (). Social capital: Its origins and applications in modern sociology. Annual Review of Sociology, , –. Portes, A., & Landolt, P. (). Social capital: Promise and pitfalls of its role in development. Journal of Latin American Studies, (), –. Putnam, R. D. (). Bowling alone: The collapse and revival of American community. New York: Simon & Schuster, Inc. Scholtz, J. T., Berardo, R., & Kile, B. (). Do networks solve collective action problems? Credibility, search and collaboration. The Journal of Politics, (), –. Shah, D. V., Kwak, N., & Holbert, R. L. (). “Connecting” and “disconnecting” with civic life: Patterns of Internet use and the production of social capital. Political Communication, , –. Skoric, M. M., Ying, D., & Ng, Y. (). Bowling online, not alone: Online social capital and political participation in Singapore. Journal of Computer-Mediated Communication, (), –. Smart, A. (). Gifts, bribes, and guanxi: A reconsideration of Bourdieu’s social capital. Cultural Anthropology, (), –. Stefanone, M. A., Kwon, K. H., & Lackaff, D. (). Exploring the relationship between perceptions of social capital and enacted support online. Journal of Computer-Mediated Communication, (), –. Turkle, S. (). Alone together: Why we expect more from technology and less from each other. New York: Basic Books. Utz, S. (). Is LinkedIn making you more successful? The informational benefits derived from public social media. New Media & Society, online before print. doi:./  Valenzuela, S., Park, N., & Kee, K. F. (). Is there social capital in a social network site? Facebook use and college students’ life satisfaction, trust, and participation. Journal of Computer-Mediated Communication, (), –. Valkenburg, P. M., & Peter, J. (). Online communication and adolescent well-being: Testing the stimulation versus the displacement hypothesis. 
Journal of ComputerMediated Communication, (), –. Wang, H., & Wellman, B. (). Social connectivity in America: Changes in adult friendship network size from  to . American Behavioral Scientist, (), –. Wellman, B., Quan-Haase, A., Boase, J., Chen, W., Hampton, K., Diaz, I. & Miyata, K. (). The social affordances of the Internet for networked individualism. Journal of Computer Mediated Communication, . Retrieved from http://jcmc.indiana.edu/vol/issue/wellman. html Wellman, B., & Wortley, S. (). Different strokes from different folks: Community ties and social support. American Journal of Sociology, (), –. Williams, D. (). On and off the net: Scales for social capital in an online era. Journal of Computer Mediated Communication, (), article . Williams, D. (). The impact of time online: Social capital and cyberbalkanization. CyberPsychology & Behavior, (), –.

  ......................................................................................................................

Tie Strength, Social Role, and Mobile Media Multiplexity

 ,  ,   

1. Introduction

Contemporary relationships are typically maintained and developed using a combination of communication channels, such as in-person communication, social media exchanges, phone calls, and other forms of mediated communication. This phenomenon is called media multiplexity (Haythornthwaite, ). Scholarship on media multiplexity has found that strong ties—those who communicate often and have close emotional bonds—communicate through a greater variety of channels than do weak ties (Haythornthwaite, ). This finding counters concerns that communication technology is separating us from our close relationships: technology tends to be used in conjunction with in-person interaction, helping us stay highly connected to strong-tie relationships. However, researchers often have access only to data sources such as social media or mobile phone logs and lack context about how these channels are used together. This chapter disentangles how communication frequency, cognitive closeness, and social role (e.g., being family or co-workers) are associated with media multiplexity. By identifying associations between these relational dimensions and multiplexity, we can infer situations in which multiplexity is likely or unlikely to be occurring, which can help researchers decide whether they need to address it in their studies. When multiplexity is likely, researchers must consider how multiple channels might affect the results of their analysis and may need to seek additional information about communication that occurs through those channels. Conversely, when multiplexity is unlikely, researchers can be more confident in the validity of their findings. Given that foundational work on media multiplexity was conducted before the widespread adoption of text messaging and similar types of mobile communication,



. , . ,  . 

the rise of these technologies points to a need to revisit this work. Mobile devices are currently the most widely diffused means of mediated communication internationally, with even the most basic models allowing for media multiplexity by way of voice calling and text messaging. The proliferation of mobile communication technologies contributes to increasingly complex opportunities for people to weave together multiple channels. The emergence of new communication technologies also opens up the possibility of using data collection methods that utilize metadata such as the times and dates extracted from text message and voice calling logs. These data sets are valuable for accurately identifying patterns in mobile communication, but it has been argued that log data alone are insufficient for evaluating tie strength (Wiese et al., ). In this chapter we investigate multiplexity using a combination of mobile log data and survey results collected from  adults living in the United States. This unique data set allows us to combine behavioral indicators of logged texting and calling with self-report measures of cognitive closeness, social role, and in-person communication. Given that texting and calling have been widely adopted in both economically developed and developing countries, this analysis provides important insight into the most prevalent forms of media multiplexity. We examine three factors that underlie the weaving together of calling and texting activities: cognitive closeness, communication frequency, and social role (i.e., whether ties are family members or known through work). In doing so we disentangle how these three closely related but clearly distinct aspects of social relationships contribute to the embedding of media multiplexity in personal networks. The relationships examined in this study tend to be strong because the ties included in our analysis are those with whom respondents communicate fairly regularly. 
As a result, tie strength in this study ranges from very strong to less strong among generally close ties. We build upon previous work by examining tie strength within this group in terms of both frequency of contact and feelings of closeness, or what we refer to as cognitive closeness. These two dimensions of tie strength have been shown to be the most salient in an empirical analysis of General Social Survey data conducted by Marsden and Campbell (). Although tie strength is clearly important to understanding media multiplexity, social role may also play a critical part in this phenomenon, given that kin and work institutions require mobile media to coordinate shared activities. For example, while adults are at work they may use mobile calling and texting to coordinate daily activities with kin, such as buying milk or picking up children on the way home (Ling, ; Christensen, ). While at home they may blur the boundary between family and work life by using their mobiles to exchange calls and texts regarding work-related matters (Gluesing, ). One implication is that members of the same social institutions—such as family members or work colleagues—may be motivated to communicate using multiple communication channels to aid the coordination of activities. In some cases, text messaging may be the most useful way of communicating with kin and work ties throughout the day because it does not disrupt the




activities of the recipient in the way that a voice call does, demanding immediate attention. In other cases, a voice call may be the most efficient way of conveying complex information or may be appropriate when the caller knows that the person he or she is calling is available. Understanding the factors underlying media multiplexity requires disentangling tie strength and social role, because the two tend to be closely related. Individuals are typically close—both in terms of frequency of contact and feelings of connection—to their kin and work ties. This raises the question of whether it is really relational closeness that drives media multiplexity, or whether institutionalized social roles demand media multiplexity through the constant need to coordinate daily activities. By considering how dimensions of strong tie relationships—including communication frequency, cognitive closeness, and social role—are associated with multiplexity, this chapter has implications for researchers who want to assess whether multiplexity is likely to be occurring in their studies and whether it needs to be addressed in their analysis.

. M M  T S

Media multiplexity refers to the phenomenon whereby relationships are developed and maintained through multiple communication channels (Haythornthwaite, ). Developments in communication technologies have increased the number of available communication tools, and contemporary relationships are typically mediated through a complex variety of channels (Boase, ; Haythornthwaite, ; Rui et al., ). Empirical work by Haythornthwaite and colleagues (Haythornthwaite, , , , , ; Haythornthwaite & Wellman, ; Haythornthwaite, Wellman, & Mantei, ) has shown that tie strength is a critical determinant of media multiplexity. This is because strong ties are more likely to desire opportunities for communication and variety in expression, which is likely to motivate them to utilize multiple communication channels (Haythornthwaite, ). In contrast, weak ties tend to meet their communication needs through passive opportunities for interaction such as hallway encounters (Haythornthwaite, ). This finding has been supported by a number of studies comparing media use among strong ties, such as close friends, to that among weaker ties (Baym & Ledbetter, ; Miczo, Mariani, & Donahue, ; Van Cleemput, ). Notably, using multiple communication channels has been found to be related to perceived emotional support (Wohn & Peng, ). This finding is congruent with scholarship indicating that strong ties tend to provide emotional support, while weak ties provide opportunities for acquiring new information (Granovetter, ). One of the purposes of Haythornthwaite’s work was to explain how computer-mediated communication (CMC) can simultaneously be “disengaging and engaging,



. , . ,  . 

disruptive of relationships yet also integrative across populations” (, p. ). Previous scholarship, she noted, tended to focus on either the strengths or shortcomings of mediated communications. She urged that researchers move beyond focusing on a single communication medium and instead consider how multiple communication media are used in combination. Research has shown that new communication media are used to supplement, rather than replace, existing communication forms (Boase et al., ). By moving beyond a focus on a single medium to instead consider all available media, it is possible to address personal communication systems (Boase, ), in which individuals combine multiple media channels to connect with their personal networks.

. M T S

Granovetter’s () original definition of tie strength includes four dimensions: “a (probably linear) combination of the amount of time, the emotional intensity, the intimacy (mutual confiding), and the reciprocal services which characterize the tie” (, p. ). Marsden and Campbell () noted that much research using tie strength had failed to conceptualize how it could be measured, and they responded to this by describing how tie strength could be assessed along several dimensions. Notably, they distinguished between indicators—variables that are components of tie strength—and predictors, which are related to, but not components of, tie strength. Social roles such as kinship, for example, are often predictors of tie strength but do not constitute it. Marsden and Campbell’s analysis, by contrast, showed that a measure of “closeness” was the best indicator of tie strength, which reveals that the cognitive element of how people feel about a relationship is a highly salient dimension of its strength. Fortunately, much of the work comparing media use to tie strength has measured tie strength in terms of closeness, identifying strong ties as friends or close friends and weak ties as acquaintances (Baym & Ledbetter, ; Haythornthwaite, ; Van Cleemput, ). In another study (Baym et al., ), respondents were asked to rate both relational closeness and quality on scales from  to . In addition, while most studies have measured media multiplexity as the number of different media used to communicate with a tie, Baym et al. () measured multiplexity as the percentage of the relationship’s communication that took place across a medium. They found that media use was not an indicator of tie strength, but Ledbetter () suggests that relative proportion of media use may not be as important an indicator as frequency.
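The two measurement approaches just described can be contrasted in a brief sketch. The per-tie counts below are hypothetical, and the variable names are illustrative rather than taken from any of the cited studies:

```python
# Hypothetical communication counts for one tie across four channels.
tie_log = {"call": 12, "text": 30, "email": 0, "in_person": 8}

# Measure 1 (most studies): the number of distinct channels used.
channel_count = sum(1 for n in tie_log.values() if n > 0)

# Measure 2 (Baym et al.-style): each channel's share of the tie's
# total communication volume.
total = sum(tie_log.values())
shares = {channel: n / total for channel, n in tie_log.items()}

print(channel_count)
print(shares["text"])
```

The count-based measure treats channels as interchangeable, while the share-based measure captures how communication is distributed among them; the two can diverge sharply for ties that rely heavily on a single channel.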
Although scholarship measuring tie strength mainly in terms of closeness has yielded valuable insights, tie strength can be measured in relation to multiple axes, and it may be valuable to study media multiplexity in relation to a more nuanced conception of tie strength. Communication frequency is another indicator of tie strength, one that can be measured accurately using log data. However, Wiese and colleagues () caution that measuring tie strength purely using frequency and

  



duration of communication can lead to a significant number of strong ties being mislabeled as weak ties. They identified three explanations for these errors:

1) Individuals use many different communication channels, and phone and SMS logs are not representative of their overall communication patterns;
2) Face-to-face communication is important, but is not easily observed; and
3) Individuals feel a lingering sense of closeness to friends from a previous stage in their [lives], though communication has decreased. (, p. )

For these reasons, communication frequency alone is likely to be an incomplete measure of tie strength, but it may be valuable to consider communication frequency and closeness in combination. This study builds on previous scholarship by considering media multiplexity in relation to multiple dimensions of tie strength. Given that closeness itself is a nuanced concept and can be considered according to constituent variables such as trust, enjoying socializing, or discussing important matters (Marin & Hampton, ), our analysis operationalizes cognitive closeness using variables that measure each of these three dimensions. Moreover, as discussed previously, media multiplexity might also be strongly related to frequency of communication, such that more frequently contacted ties tend to be more multiplex. Accordingly, our first research question considers the implications of measuring tie strength in relation to both communication frequency and cognitive closeness.

RQ1: To what extent are communication frequency and cognitive closeness associated with media multiplexity?
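A composite of the three closeness items could be formed in many ways; a minimal sketch, assuming equal-weighted Likert items on a 1-5 scale (the item names and values are hypothetical, not the study's instrument):

```python
import statistics

# Hypothetical survey responses for one tie, each on a 1-5 scale:
# trust, enjoyment of socializing, and discussing important matters.
items = {"trust": 5, "socializing": 4, "important_matters": 4}

# A simple equal-weighted composite: the mean of the three items.
cognitive_closeness = statistics.fmean(items.values())
print(cognitive_closeness)
```

Unequal weights or a factor-analytic score would be alternatives if the items loaded differently on an underlying closeness dimension.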

. S R  I

Another dimension of tie strength is social role, such as whether a tie is a member of one’s family or work institutions. Although role is not an indicator of tie strength, it is a predictor, because ties in certain roles are more likely to have high tie strength than others (Marsden & Campbell, ). Some of the most significant social institutions are school, work, and family. This study describes the communication patterns of adult respondents and focuses on work and family. In addition to being a predictor of tie strength, ties in different roles may have particular goals for their communication, such as participating in shared activities. Institutions such as families or workplaces can act as foci “around which joint activities are organized” (Feld, , p. ). As a result, one of the purposes of communication among family and work ties is to facilitate those activities, and this may shape their use of different media for communication. Our second research question addresses the relationship between social role and media multiplexity:



. , . ,  . 

RQ: To what extent is social role predictive of media multiplexity, particularly pertaining to family and work ties? Finally, given that social role may be highly associated with both frequency of contact and cognitive closeness, our final research question asks: RQ: When taken together, what is most associated with media multiplexity: communication frequency, cognitive closeness, or social role? The purpose of RQ is to assess the extent to which any associations identified in RQ and RQ might better be explained by other variables—for example, whether social role becomes more or less relevant when considered alongside communication frequency and cognitive closeness. When investigating this and our other research questions, we do not point to “mechanisms” of causality among these variables. Instead, by assessing the independence of these variables, we examine which are important for investigating multiplexity.

5. Data and Methods

The study described in this chapter used a combination of anonymized smartphone log data and survey data, both collected using the same Android application, Communication Explorer.1 Using call and text messaging logs allowed the researchers to access accurate data about when and with whom respondents communicated using their smartphones. Logged communication data are much more accurate than self-report accounts, in which people tend to underreport or overreport communication frequency (Boase & Ling, ; Kobayashi & Boase, ). This is a significant strength, but there are also several considerations researchers must bear in mind when working with this type of log data. Most significantly, researchers must address the potential privacy concerns of their respondents. To this end, the Communication Explorer application was designed to record only the data necessary for the project: it did not copy the content of text messages or email, it stored data in an encrypted format on a secure server, and it masked identifying information from phone numbers and email addresses using anonymous numeric codes. Another consideration when using log data is how to incorporate the context necessary to make sense of the data. The log data alone could not provide information about cognitive closeness or social role between respondents and their ties, so they were combined with questionnaire results describing these aspects of respondents’ relationships. In this respect, the application was designed to combine the strengths of log data in accurately counting communication events with the strengths of self-report data in providing context for the logs. This chapter uses data collected from  adults living in the United States in the winter and spring of .
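The masking step described above, replacing identifiers with stable numeric codes, can be sketched as follows. This is a generic illustration of the technique, not the app's actual implementation, and the class name is hypothetical:

```python
from itertools import count

class Anonymizer:
    """Replace identifiers (phone numbers, email addresses) with
    stable, nonidentifying numeric codes.

    The same identifier always maps to the same code, so communication
    events can be linked per contact without storing the identifier.
    """
    def __init__(self):
        self._codes = {}
        self._next = count(1)

    def mask(self, identifier):
        if identifier not in self._codes:
            self._codes[identifier] = next(self._next)
        return self._codes[identifier]

anon = Anonymizer()
print(anon.mask("+1-555-0100"))   # first identifier gets code 1
print(anon.mask("+1-555-0199"))   # a new identifier gets the next code
print(anon.mask("+1-555-0100"))   # a repeat identifier reuses its code
```

In a deployed system the identifier-to-code table itself would be kept out of the research data set (or a keyed hash used instead), so that codes cannot be reversed.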
This sample was randomly selected from a larger panel maintained by a research company that specializes in Internet surveys. Participation was limited to




adults between the ages of twenty and sixty-nine who used Android smartphones and Gmail on a daily basis. Respondents were paid a small amount of money to complete an online survey, then install the Communication Explorer application on their smartphones, which collected nonidentifying voice call, texting, and email2 data and administered a series of on-screen questionnaires every day for approximately one month. The application collected log data for the full length of stored records on the smartphone. Although log data are more accurate than self-report accounts of communication patterns, there are some limitations resulting from how Android-based smartphones store such data. By default, Android phones store the most recent five hundred calls in a user’s log and usually limit the logged SMS messages to two hundred per contact. Because of this, there were significant differences in length for each respondent’s log history (ranging from  days to  days). Given that different log lengths would hinder comparability among respondents’ data, logs were reduced to a consistent length across all respondents. More than % of respondents had communication logs spanning at least twenty-eight days, so respondents with shorter logs were excluded from the analysis, leaving  respondents. Some respondents began and concluded the study at different times than others, so for each respondent the analysis focuses on logged events within the twenty-eight days preceding the conclusion of the application’s data collection. As well as collecting log data, the application displayed daily pop-up surveys to respondents for at least thirty days. Each day the application randomly selected a contact with whom the respondent had had at least one logged communication in the previous twenty-four hours and asked several questions about that tie. 
Only one tie would be selected and used in the on-screen survey for that day, and ties who were contacted frequently had a higher chance of being selected during a thirty-day period than those who were contacted only once or twice. The tie’s name was displayed on the screen, along with questions about topics including the social role of the tie (e.g., whether the tie was a family member), the respondent’s cognitive closeness to the tie (e.g., whether the respondent enjoyed socializing with the tie), and whether the respondent regularly talked to that tie in person. Respondents were required to respond to thirty pop-up surveys in order to receive a small monetary incentive. Data collection generally took thirty days (for thirty surveys) but took longer in some unusual cases where respondents chose not to respond to a pop-up survey question on a particular day or had no logged calling, texting, or email activity in a twenty-four-hour period. In our analysis, only ties for whom there are survey data were considered. Because surveys were displayed only for ties the respondent contacted during the study period, the selected ties tended to be relatively strong. Ties with whom the respondent might communicate every few months, but who were not contacted during the thirty-day study period, had no chance of being selected for a survey. In addition, ties with whom the respondent communicated on multiple days were more likely to be randomly selected than ties who appeared in the communication logs on only one day. As a result, this study uses data from ties with whom there was logged calling, texting, or email during the data collection period, and it is likely that these ties are stronger and more active than ties for whom survey data were not collected.
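The log-windowing step described earlier, trimming each respondent's history to the twenty-eight days preceding the end of data collection, can be sketched as below. The dates, the event tuple layout, and the function name are illustrative assumptions, not the study's actual code:

```python
from datetime import datetime, timedelta

def trim_log(events, collection_end, days=28):
    """Keep only events within the `days` preceding `collection_end`.

    `events` is a list of (timestamp, contact_code) tuples. Respondents
    whose logs do not span the full window would be excluded upstream.
    """
    cutoff = collection_end - timedelta(days=days)
    return [(ts, code) for ts, code in events if cutoff <= ts <= collection_end]

# Illustrative log for one respondent (dates are placeholders).
end = datetime(2013, 4, 30)
log = [
    (datetime(2013, 4, 25), 101),   # inside the 28-day window
    (datetime(2013, 3, 1), 102),    # too old, dropped
    (datetime(2013, 4, 3), 101),    # inside the 28-day window
]
print(len(trim_log(log, end)))
```

Windowing every respondent to the same span is what makes per-tie counts comparable across respondents despite Android's variable log-retention limits.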



. , . ,  . 

Dependent Variable

The dependent variable for our analysis is the amount of media multiplexity between ties. Media multiplexity is measured as the number of communication channels used to communicate with a tie. For each tie, media multiplexity was measured on a scale from 1 to 3, indicating the number of channels used to communicate with that tie. In-person communication, phone calls, and text messages were considered as follows:

Phone calls: included if there was at least one phone call logged between the pair in the twenty-eight days preceding their final in-app survey.
Text messages: included if there was at least one text message logged between the pair in the twenty-eight days preceding their final in-app survey.
In-person communication: included if the respondent answered “yes” to the question, “Do you talk to [this tie] in person during a typical day?”

As explained previously, only ties for whom survey data were collected were included, and respondents only received a survey for ties with whom they had communicated at least once with their mobile phones. As a result, each tie had a media multiplexity score of at least 1. In addition, ties with whom there were not at least two logged communications were excluded, so each tie had a theoretical possibility of a multiplexity score of 3 (two logged communication channels plus in-person communication). Setting this minimum threshold for the scale allowed us to avoid inflated correlations with the independent frequency-of-contact variables described in the following discussion. Measuring multiplexity using a combination of log and survey data had both advantages and limitations. Self-report data have the advantage of addressing in-person communication, which is not logged by the app. However, as discussed previously, self-report responses tend to be inaccurate with regard to communication frequency.
As a result, there were important differences in the precision of measurement: in-person communication relied on self-report, whereas calls and text messages were measured using log data. Using the self-report questionnaire data, in-person communication is measured as a binary variable based on whether or not the respondent answered that he or she talked to the tie in person “during a typical day.” As a result, it was not possible to measure in-person communication more finely, such as distinguishing ties who talked in person weekly or monthly. An alternative way of measuring in-person communication is to use phone sensors to measure physical proximity between respondents, but this is only possible in a closed network where all measured ties have monitoring software installed (see, e.g., Hristova, Musolesi, & Mascolo, ). This would allow greater precision than the survey measure utilized in our study but would pose significant logistical barriers for studying communication across multiple institutions, which is necessary for evaluating the significance of social role. Dichotomizing the calling and texting data allows us to develop a scale on which an amount of calling, texting, or in-person contact greater than zero per
tie would be factored into the scale. The resulting scale was balanced enough to capture a reasonable degree of variability in the mobile media multiplexity of ties. Of the , ties in our analysis,  (%) have a media multiplexity score of 1,  (%) have a score of 2, and  (%) have a score of 3.
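As a rough sketch (a hypothetical function, not the authors' code), the scale awards one point per channel observed, with calls and texts taken from the twenty-eight-day logs and in-person contact from the survey item:

```python
def multiplexity_score(n_calls, n_texts, talks_in_person):
    """Media multiplexity on the 1-3 scale described above: one point per
    channel used (calls and texts from 28-day logs; in-person from survey)."""
    score = 0
    score += 1 if n_calls > 0 else 0   # any logged call counts the channel once
    score += 1 if n_texts > 0 else 0   # any logged text counts the channel once
    score += 1 if talks_in_person else 0
    return score

# Dichotomization means volume does not matter, only channel use:
assert multiplexity_score(12, 0, True) == 2
assert multiplexity_score(3, 40, True) == 3
```

Because every retained tie had at least one mobile-logged channel, scores below 1 do not occur in the analyzed data, even though the function itself can return 0.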

Independent Variables

Independent variables were created to measure three dimensions of tie strength within each respondent-tie dyad: communication frequency, social role, and cognitive closeness. Communication frequency was measured using the logged number of phone calls and text messages exchanged with ties over the twenty-eight days preceding their final in-app survey. In-person communication was not included as an independent variable because a value of 1 on this variable would, by definition, add 1 to the dependent media multiplexity scale. By contrast, the calling and texting variables were continuous, and variation in these variables would not necessarily produce a change in the dependent variable. There is a median of  calls and  texts for each tie. However, the distributions for these scales are positively skewed; as a result, there is a mean of . calls and . texts for the ties examined in this analysis. Social role and cognitive closeness variables were created based on responses to the application’s pop-up surveys. The social role of each tie was determined based on responses to questions about whether that tie was a family member or someone known from work. There are  (%) kin ties,  (%) work-based ties, and  (%) non-kin and non-work ties. Given that these categories are mutually exclusive, we use kin and work ties as dummy variables in our regression analyses, with the “other” (non-kin, non-work) ties as the reference category. Three variables were used to represent different dimensions of cognitive closeness: whether the respondent trusts the tie a lot, whether the respondent talks to the tie about important matters, and whether the respondent enjoys socializing with the tie. There are , (%) trusted ties, , (%) ties with whom the respondent discusses important matters, and , (%) ties with whom the respondent enjoys socializing.
It should be kept in mind that the app tended to select relatively strong ties, which is likely why the large majority of ties examined in this analysis showed high levels of cognitive closeness.
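The dummy coding and the skew point can both be illustrated with a short sketch (toy numbers, not the study's data):

```python
from statistics import mean, median

def role_dummies(is_kin, is_work):
    """Mutually exclusive social roles coded as two dummy variables, with
    non-kin, non-work ("other") ties as the reference category (0, 0)."""
    if is_kin and is_work:
        raise ValueError("roles are treated as mutually exclusive")
    return {"kin": int(is_kin), "work": int(is_work)}

assert role_dummies(True, False) == {"kin": 1, "work": 0}
assert role_dummies(False, False) == {"kin": 0, "work": 0}  # reference group

# Positively skewed contact counts: a few heavy texters pull the mean
# well above the median, as reported for the calling and texting scales.
texts_per_tie = [0, 1, 1, 2, 3, 5, 8, 250]
assert median(texts_per_tie) < mean(texts_per_tie)
```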

Respondent Demographics and Control Variables

The demographics of the  respondents in this sample were as follows: % were female, % were college educated, % were married, % were married with children, and % had a full-time job. The mean age of respondents was thirty-six years.



. , . ,  . 

To control for any possible influence of these traits, we include them as control variables in our analysis. In addition, a dyadic-level variable indicating whether ties lived more than one hour away from respondents was used as a control, since distance influences the potential for in-person interaction. Some % (N = ) of ties lived more than one hour away from the respondent.

Analysis and Results

Using Stata, ordered logistic regressions were conducted to identify relationships between three dimensions of tie strength and media multiplexity. A cluster option was used to account for nonindependence within respondent-level clusters, because the independent and dependent variables were at the tie level while the demographic control variables were at the respondent level. The results of these regression analyses are presented in Table 14.1. Models 1–3 tested communication frequency, cognitive closeness, and social role individually to assess their associations with media multiplexity. Model 4 tested all three sets of independent variables together to assess which had the strongest associations with multiplexity. RQ1 asked, “To what extent are communication frequency and cognitive closeness associated with media multiplexity?” This question is addressed in models 1 and 2. Model 1 shows that ties with whom a respondent had exchanged a large number of text messages are significantly (p < .01) more likely to have high media multiplexity than other ties. The text message count had a low unstandardized coefficient (b = .) because a difference of one text message is unlikely to have a large relationship to multiplexity. However, the standardized coefficient for text messages is much higher (standardized b = .), suggesting that larger variation in text message count is a viable predictor of media multiplexity. The number of phone calls exchanged between ties was also significant, but to a lesser degree (p < .05). Model 2 evaluates how three cognitive closeness variables predict media multiplexity: trust, discussing important matters, and enjoying socializing.
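For readers less familiar with ordered logit, the model estimates cumulative probabilities over the ordered outcome, P(Y ≤ k) = logistic(τ_k − xβ), with category probabilities obtained by differencing. The sketch below uses hypothetical coefficients and cutpoints (not values from Table 14.1) to show how a larger linear predictor shifts probability mass toward higher multiplexity scores:

```python
import math

def ordered_logit_probs(xb, cutpoints):
    """Category probabilities for an ordered logit with increasing cutpoints:
    P(Y <= k) = logistic(tau_k - x*beta); adjacent differences give P(Y = k)."""
    logistic = lambda z: 1.0 / (1.0 + math.exp(-z))
    cum = [logistic(tau - xb) for tau in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Hypothetical linear predictor and cutpoints for a 1-3 multiplexity scale.
probs = ordered_logit_probs(xb=0.8, cutpoints=[-0.5, 1.5])
assert abs(sum(probs) - 1.0) < 1e-9
assert len(probs) == 3  # one probability per score: 1, 2, 3

# A larger x*beta shifts mass toward the highest score:
assert ordered_logit_probs(2.0, [-0.5, 1.5])[2] > probs[2]
```

This is why a single positive coefficient in Table 14.1 can be read as raising the odds of a tie falling in a higher multiplexity category across all cut points simultaneously.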
Because there is some correlation among these variables, a variance inflation test was run to ensure that the three cognitive closeness variables did not exhibit collinearity severe enough to compromise the regression model. A variance inflation test returns a tolerance and a variance inflation factor (VIF) for each variable; a VIF greater than 10 or a tolerance lower than 0.1 would indicate problematic collinearity (Chen et al., ). The test showed that the three cognitive closeness variables did not exhibit collinearity that would compromise the regression analysis (Trust: Tolerance = ., VIF = .; Important matters: Tolerance = ., VIF = .; Enjoy socializing: Tolerance = ., VIF = .). A second variance inflation test, with the control variables added, produced nearly identical results.
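The diagnostic works by regressing each predictor on the others: VIF_j = 1 / (1 − R²_j), and tolerance is its reciprocal. A self-contained sketch with toy binary indicators (illustrative data, not the study's) shows the computation using only the standard library:

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def vif(columns, j):
    """VIF for column j: regress it on the other columns (plus an
    intercept) via ordinary least squares and return 1 / (1 - R^2)."""
    y = columns[j]
    n = len(y)
    X = [[1.0] + [col[i] for k, col in enumerate(columns) if k != j]
         for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    yhat = [sum(X[i][a] * beta[a] for a in range(p)) for i in range(n)]
    ybar = sum(y) / n
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 / (1.0 - ss_res / ss_tot)

# Toy binary indicators: correlated, but far below the VIF > 10 threshold.
trust     = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
important = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
social    = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]
vifs = [vif([trust, important, social], j) for j in range(3)]
```

In practice Stata's collinearity diagnostics report the same quantities; the point of the sketch is only that the VIF is a property of the predictors' mutual correlation, not of the outcome.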




Table 14.1 Ordered Logistic Regression Results Predicting Media Multiplexity

                                    Model 1           Model 2           Model 3           Model 4
Communication Frequency
  Phone calls                       0.01* (0.01)                                          0.01 (0.01)
  Text messages                     0.00** (0.00)                                         0.00** (0.00)
Cognitive Closeness
  Trust                                               0.23 (0.18)                         0.20 (0.20)
  Discuss important matters                           0.81*** (0.19)                      0.44* (0.19)
  Enjoy socializing                                   0.72** (0.21)                       0.69** (0.20)
Social Role (reference = not a kin or work tie)
  Kin ties                                                              0.94*** (0.14)    0.62*** (0.15)
  Work ties                                                             0.56** (0.16)     0.56** (0.18)
Control Variables
  Live >1 hr away                   –1.19*** (0.12)   –1.24*** (0.13)   –1.15*** (0.12)   –1.28*** (0.13)
  Female                            0.01 (0.01)       0.04 (0.16)       0.04 (0.15)       0.05 (0.16)
  College degree                    0.12 (0.13)       0.07 (0.14)       0.18 (0.14)       0.14 (0.15)
  Married                           0.10 (0.17)       0.23 (0.18)       0.36* (0.17)      0.23 (0.17)
  Married with children             0.22 (0.18)       0.14 (0.19)       0.15 (0.17)       0.14 (0.17)
  Full-time job                     0.14 (0.16)       0.21 (0.16)       0.29 (0.16)       0.23 (0.16)
  Age                               0.02* (0.01)      0.02 (0.01)       0.01 (0.01)       0.01 (0.01)
Wald chi-squared                    178.84***         167.79***         172.58***         250.66***
N                                   1,373             1,361             1,375             1,359

Unstandardized coefficients. Standard errors in parentheses. *p < 0.05, **p < 0.01, ***p < 0.001

The results of model 2 show that ties with whom respondents discussed important matters (p < .001) and ties with whom respondents enjoyed socializing (p < .01) were significantly more likely to have higher media multiplexity than other ties, but trust was not a significant predictor of media multiplexity (p > .05).



. , . ,  . 

The results presented in models 1 and 2 illustrate that both cognitive closeness and communication frequency are positively associated with media multiplexity. However, this was not the case for all components of these dimensions. With regard to cognitive closeness, discussing important matters and enjoying socializing are positive predictors of media multiplexity in Table 14.1, but trust was not significant. Of the communication frequency variables, both text messages and phone calls were significant predictors of media multiplexity. RQ2 asked, “To what extent is social role predictive of media multiplexity, particularly pertaining to family and work ties?” Model 3 shows that respondents are significantly more likely to have higher media multiplexity with ties who are family members (p < .001) and ties whom they know from work (p < .01) when compared with the reference category of other ties. One explanation may be the larger amount of in-person communication with kin and work ties. Respondents indicated that they talked in person to % of family ties and % of work ties on a typical day, but to only % of other ties in their phone logs. RQ3 asked, “When taken together, what is most associated with media multiplexity: communication frequency, cognitive closeness, or social role?” This question is addressed in model 4, which combines all of the independent variables tested in the previous models. We performed an ordered logistic regression including all of these variables to address overlaps among the three dimensions. Would the associations identified in RQ2 between social role and multiplexity remain significant when tested together with the tie strength variables? If so, that would suggest the association between social role and multiplexity holds independent of communication frequency and cognitive closeness.
The results of model 4 show that communication frequency, cognitive closeness, and social role remain significant predictors and are therefore independently associated with media multiplexity. In model 4, as in model 1, the number of text messages exchanged with a tie is significantly (p < .01) predictive of media multiplexity. In addition, as in model 1, the low unstandardized coefficient (b = .) for text messaging should be understood in relation to the standardized coefficient (standardized b = .). Although a difference of only one text message has only a small association with media multiplexity, larger variations have greater predictive capacity. The number of phone calls exchanged between ties is not a significant indicator in model 4 (p > .05). The cognitive closeness variables for discussing important matters (p < .05) and enjoying socializing (p < .01) are significant in model 4. It is notable, however, that the significance and coefficient for discussing important matters are lower than in model 2; some of the association between discussing important matters and multiplexity in model 2 may be better explained by other variables. In model 4, the social role variables have the highest significance of all the independent variables. Kin ties (p < .001) and work ties (p < .01) are both likely to have higher multiplexity than other ties. The only variables with a greater effect were enjoying socializing (b = 0.69, p < .01) and the control variable for ties who live farther than one hour away (b = –1.28, p < .001).




Discussion and Conclusion

Haythornthwaite and colleagues (Haythornthwaite, , , , , ; Haythornthwaite & Wellman, ; Haythornthwaite, Wellman, & Mantei, ) have shown that tie strength is associated with media multiplexity. However, much of this work was conducted before the widespread adoption of text messaging. This chapter has built on their work by considering multiple dimensions of tie strength, including communication frequency and cognitive closeness, and by examining how social role relates to mobile media multiplexity. RQ1 asked about the extent to which communication frequency and cognitive closeness were associated with media multiplexity. The results showed that both measures of tie strength were indicators of media multiplexity. This supports Haythornthwaite () and shows that her findings about multiplexity and tie strength hold across periods of time with different available communication media. In addition, our findings suggest that future studies of media multiplexity may benefit from including both communication frequency and cognitive closeness as measures of tie strength. The association between communication frequency and multiplexity is not a surprising result, as previous scholarship has argued that one explanation for high levels of media multiplexity is a desire to seek out opportunities for communication (Haythornthwaite, ). This desire is likely to be related both to frequency of communication and to the use of multiple communication channels. Interestingly, in model 4, where communication frequency was considered alongside cognitive closeness and social role, only text messages (not phone calls) had a significant association with media multiplexity. This suggests that frequency of text messaging is a stronger predictor of multiplexity than frequency of calling.
One explanation for the significance of text messages is that they are often used for expressive communication designed to establish virtual co-presence and emotional bonding (Ito & Okabe, ; Ling & Yttri, ), so exchanging a large number of texts may be linked with emotional closeness. RQ2 investigated the significance of social roles (i.e., kin and work ties) as predictors of media multiplexity. The results showed that kin and work ties had higher media multiplexity than other ties. The significance of social roles was also evident after addressing RQ3, which compared the significance of communication frequency, cognitive closeness, and social role against one another. The results showed that each relational dimension, including social role, was independently predictive of media multiplexity. The fact that kin and work ties were significantly associated with media multiplexity even when the other independent variables were included in the model warrants further discussion. One explanation for the significance of social role is that kin and work ties are likely to participate in shared activities through their common institutional focus (Feld, ). We posit that coordinating these shared activities may encourage communication across a range of channels. As noted
previously, it is particularly notable that family and work ties are significantly more likely than other ties to regularly speak in person. This supports our assumption that membership in the same family or work institution involves daily focused activities. In addition, kin and work ties may weave calls and texting into their daily routines, achieving a sort of perpetual contact even when physically apart (Katz & Aakhus, ). This is consistent with scholarship on the ways that mobile phones are used to communicate with family ties while at work and with work ties while at home (Christensen, ; Gluesing, ; Ling, ; Wajcman, Bittman, & Brown, ; Wajcman et al., ). Furthermore, close institutional ties, especially family, may have enough knowledge of each other’s schedules to know whether it is appropriate to call or to text at different times of day and may vary their choice of communication channel accordingly. A limitation of this study is that only three types of media use were examined: phone calls, text messages, and in-person communication. At the time of data collection, voice calls and text messages dominated mobile phone communication, but in the relatively short period since the data for this study were collected, social media and other messaging apps have become increasingly common among smartphone users. Many of these apps function similarly to texting, so the results of this study would likely resemble those of an updated study that included data from these apps. Nonetheless, it is clear that current and future studies would miss crucial information if they failed to address the importance of communication apps such as Facebook Messenger, WhatsApp, Line, and Snapchat. However, these apps pose significant challenges for accessing communication log data.
First, these applications may store their log data in different formats, so researchers aiming to collect data from these logs must navigate the requirements of each app individually. Second, software updates may lead to changes in data structure or policies, so a method used to retrieve log data from one version of an app may not work with the next. Third, because user data are a valuable commodity for social networking platforms, these data may be intentionally made inaccessible in order to maintain a competitive advantage for platform owners (Manovich, ). As a result, although logged communication data have great potential for researchers, it is virtually impossible to capture all of the communication channels used by the people being studied. This is why identifying variables associated with multiplexity is so important: if multiplexity is occurring, researchers are unlikely to have access to information about all relevant communication channels. This study has contributed to scholarship about media multiplexity by showing that communication frequency, cognitive closeness, and social role are each independently associated with multiplexity. By identifying these variables, researchers can infer whether multiplexity is likely to be important for their study. There are several possible ways researchers can address multiplexity. It may be possible to seek access to additional data sources, such as logs from communication apps. Researchers may supplement log data by asking respondents to describe if and how they use additional communication channels. And following our proposal that organizing around shared activities may be the reason
for the strong association between social role and media multiplexity, studying these shared activities may be important for relationships involving institutional bonds such as family or work. Conversely, multiplexity is less likely to be significant in relationships that do not involve high communication frequency, cognitive closeness, or shared social roles, so data from a single communication channel may be sufficient for investigating these cases. In either case, identifying the likelihood of multiplexity is important for understanding contemporary communication practices.

Notes

1. Communication Explorer was designed by two of this chapter’s authors, Boase and Kobayashi. Researchers may also be interested in the E-Rhythms software, which can be used to collect data similar to Communication Explorer’s but allows for more flexibility. Information about both Communication Explorer and E-Rhythms is available at http://individual.utoronto.ca/jboase/software.html.
2. Although the Communication Explorer application also logged email communications sent or received using the Gmail app on respondents’ smartphones, only % (n = ) of respondents reported that Gmail was their main email address. As a result, the email data were omitted from this study because they did not represent email use for a significant number of respondents.

R Baym, Nancy K., and Andrew Ledbetter. . “Tunes That Bind? Predicting Friendship Strength in a Music-Based Social Network.” Information, Communication & Society  (): –. doi:./. Baym, Nancy K., Yan Bing Zhang, Adrianne Kunkel, Andrew Ledbetter, and Mei-Chen Lin. . “Relational Quality and Media Use in Interpersonal Relationships.” New Media & Society  (): –. Boase, Jeffrey. . “Personal Networks and the Personal Communication System: Using Multiple Media to Connect.” Information, Communication & Society  (): –. doi:./. Boase, Jeffrey, John B. Horrigan, Barry Wellman, and Lee Rainie. . “The Strength of Internet Ties.” Pew Internet and American Life Project. http://www.pewinternet.org/files/ old-media/Files/Reports//PIP_Internet_ties.pdf.pdf. Boase, Jeffrey, and Rich Ling. . “Measuring Mobile Phone Use: Self-Report versus Log Data.” Journal of Computer-Mediated Communication  (): –. doi:./jcc.. Chen, Xiao, Philip B. Ender, Michael Mitchell, and Christine Wells. . “Regression Diagnostics.” In Stata Web Books: Regression with Stata. http://www.ats.ucla.edu/stat/ stata/webbooks/reg/chapter/statareg.htm. Christensen, Toke Haunstrup. . “‘Connected Presence’ in Distributed Family Life.” New Media & Society  (): –. doi:./. Feld, Scott L. . “The Focused Organization of Social Ties.” American Journal of Sociology  (): –.



. , . ,  . 

Gluesing, Julia C. . “Identity in a Virtual World: The Coevolution of Technology, Work, and Lifecycle.” In Mobile Work, Mobile Lives: Cultural Accounts of Lived Experiences, edited by Tracy L. Meerwarth, Julia C. Gluesing, and Brigitte Jordan, –. Malden, MA: Wiley-Blackwell.
Granovetter, Mark. . “The Strength of Weak Ties.” American Journal of Sociology  (): –.
Haythornthwaite, Caroline. . “Online Personal Networks: Size, Composition and Media Use among Distance Learners.” New Media & Society  (): –.
Haythornthwaite, Caroline. . “Exploring Multiplexity: Social Network Structures in a Computer-Supported Distance Learning Class.” The Information Society  (): –.
Haythornthwaite, Caroline. . “Strong, Weak, and Latent Ties and the Impact of New Media.” The Information Society  (): –.
Haythornthwaite, Caroline. . “Supporting Distributed Relationships: Social Networks of Relations and Media Use over Time.” Electronic Journal of Communication  (). http://www.cios.org/EJCPUBLIC///.HTML.
Haythornthwaite, Caroline. . “Social Networks and Internet Connectivity Effects.” Information, Communication & Society  (): –.
Haythornthwaite, Caroline, and Barry Wellman. . “Work, Friendship and Media Use for Information Exchange in a Networked Organization.” Journal of the American Society for Information Science  (): –.
Haythornthwaite, Caroline, Barry Wellman, and Marilyn Mantei. . “Work Relationships and Media Use: A Social Network Analysis.” Group Decision and Negotiation  (): –.
Hristova, Desislava, Mirco Musolesi, and Cecilia Mascolo. . “Keep Your Friends Close and Your Facebook Friends Closer: A Multiplex Network Approach to the Analysis of Offline and Online Social Ties.” arXiv:.. http://arxiv.org/abs/..
Ito, Mizuko, and Daisuke Okabe. . “Technosocial Situations: Emergent Structuring of Mobile E-Mail Use.” In Personal, Portable, Pedestrian: Mobile Phones in Japanese Life, edited by Mizuko Ito, Daisuke Okabe, and Misa Matsuda, –. Cambridge, MA: MIT Press.
Katz, James E., and Mark Aakhus. . Perpetual Contact: Mobile Communication, Private Talk, Public Performance. Cambridge, UK: Cambridge University Press.
Kobayashi, Tetsuro, and Jeffrey Boase. . “No Such Effect? The Implications of Measurement Error in Self-Report Measures of Mobile Communication Use.” Communication Methods and Measures  (): –.
Ledbetter, Andrew M. . “Patterns of Media Use and Multiplexity: Associations with Sex, Geographic Distance and Friendship Interdependence.” New Media & Society  (): –.
Ling, Rich. . The Mobile Connection: The Cell Phone’s Impact on Society. San Francisco, CA: Morgan Kaufmann.
Ling, Rich, and Birgitte Yttri. . “Hyper-Coordination via Mobile Phones in Norway.” In Perpetual Contact: Mobile Communication, Private Talk, Public Performance, edited by James E. Katz and Mark Aakhus, –. Cambridge, UK: Cambridge University Press.
Manovich, Lev. . “Trending: The Promises and the Challenges of Big Social Data.” In Debates in the Digital Humanities, edited by Matthew K. Gold. Minneapolis: University of Minnesota Press. http://manovich.net/content/-projects/-trending-the-promises-and-the-challenges-of-big-social-data/-article-.pdf.




Marin, A., and K. N. Hampton. . “Simplifying the Personal Network Name Generator: Alternatives to Traditional Multiple and Single Name Generators.” Field Methods  (): –.
Marsden, Peter V., and Karen E. Campbell. . “Measuring Tie Strength.” Social Forces  (): .
Miczo, Nathan, Theresa Mariani, and Crystal Donahue. . “The Strength of Strong Ties: Media Multiplexity, Communication Motives, and the Maintenance of Geographically Close Friendships.” Communication Reports  (): –.
Rui, J. R., J. M. Covert, M. A. Stefanone, and T. Mukherjee. . “A Communication Multiplexity Approach to Social Capital: On- and Offline Communication and Self-Esteem.” Social Science Computer Review  (): –.
Van Cleemput, K. . “‘I’ll See You on IM, Text, or Call You’: A Social Network Approach of Adolescents’ Use of Communication Media.” Bulletin of Science, Technology & Society  (): –.
Wajcman, J., M. Bittman, and J. E. Brown. . “Families without Borders: Mobile Phones, Connectedness and Work-Home Divisions.” Sociology  (): –.
Wajcman, Judy, Michael Bittman, Paul Jones, Lynne Johnstone, and Jude Brown. . “The Impact of the Mobile Phone on Work/Life Balance.” Canberra: Australian Mobile Telecommunications Association & Australian National University. https://www.oii.ox.ac.uk/archive/downloads/research/files/Report_on_Mobiles_and_Work_Life_Balance.pdf.
Wiese, Jason, Jun-Ki Min, Jason I. Hong, and John Zimmerman. . “‘You Never Call, You Never Write’: Call and SMS Logs Do Not Always Indicate Tie Strength.” In Proceedings of the th ACM Conference on Computer Supported Cooperative Work & Social Computing, –. Vancouver, Canada: ACM Press.
Wohn, Donghee Yvette, and Wei Peng. . “Understanding Perceived Social Support through Communication Time, Frequency, and Media Multiplexity.” In Proceedings of the rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, –. Seoul, Korea: ACM Press.

  ......................................................................................................................

      ......................................................................................................................

  

1. Introduction

The widespread proliferation of the World Wide Web has touched the everyday lives of large and diverse populations around the globe. Social media platforms such as Twitter, Facebook, and Instagram constitute an important component of Web use. People are increasingly adopting these platforms to share their thoughts and opinions with their contacts. In a way, social media have transformed traditional methods of communication by allowing instantaneous and interactive sharing of information created and controlled by individuals, groups, and organizations. Adoption of these technologies spans various walks of life; according to a Pew Internet study (Lenhart et al., ),  percent of online users currently use at least one online social platform, with Facebook being the most popular. One in six people in the world today is a user of Facebook. Consequently, these sites have emerged as powerful platforms that give millions a voice for relating their opinions, thoughts, ideas, and events. An important attribute of social media is that postings on these sites are made in a naturalistic setting, in the course of daily activities and happenings. As such, social media provide a means of capturing behavioral attributes relevant to an individual’s thinking, mood, communication, activities, and socialization. Moreover, this real-time stream of social information is often annotated with context, such as location information, cues about one’s social environment, and rich collections of multimodal information beyond text, such as images and videos. Given the increasing uptake of social sites and the availability of the rich data on them, social media platforms have shown great potential for analyzing real-world phenomena such as politics, product sentiment, and natural disasters. Over the last few years there

     



has been a corresponding surge of interest in utilizing continuing streams of evidence from social media on posting activity to reflect people’s psyches and social milieus. In fact, the ubiquitous use of social media, as well as the abundance and growing repository of such data, has been found to provide a new type of “lens” and a revolutionary data source for inferring health-related behaviors and mechanisms (Paul and Dredze, ; Paul et al., ; De Choudhury et al., a; Park et al., )—both at the microscopic level (i.e., the individual) and on the collective or population scale. This body of research indicates how computational techniques may be applied to naturalistic data that people share on today’s online platforms in order to make sense of their health behaviors and related experiences. In the context of this surge in research interest, this chapter highlights the potential opportunities and challenges of the use of social media as a novel stream of information for augmenting traditional approaches to health assessment. It frames these opportunities to span early-stage risk assessment (or detection) of a mental health challenge, complementing conventional diagnosis, creating platforms for social and emotional support, and rethinking conventional psychotherapy, as well as examining the possibility of designing social media–based or –driven intervention mechanisms that can enable self-reflection on mental health challenges or create avenues of better treatment and management of mental illness. Challenges include risks of disclosure of sensitive information to unintended audiences, privacy concerns and the ethical considerations related to automated inference of underlying health states of individuals, and the social and behavioral risks that social media systems may pose to already vulnerable communities. I discuss these issues in some detail while reviewing existing literature for each context.

. S M  W-B: P

Leveraging Internet data for modeling and analyzing health behaviors has been a ripe area of research in the recent past. The well-known Google Flu Trends1 is an appropriate example in this context, providing nuanced predictions of flu infections based on online search queries. Social media platforms in particular generate immensely rich data from people’s mundane everyday actions, thoughts, and opinions, including those about small and big happenings in their lives. Social media sites thus provide a rich ecosystem in which the context and content of one’s affective, behavioral, and cognitive reactions, as well as social interactions, can be observed over extended periods of time (Golder and Macy, ). The characteristics of such context and content can be learned for thousands or even millions of people. These social factors are known to be key in the detection and assessment of a number of health conditions and outcomes and can be made to work in a complementary fashion alongside traditional approaches.
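The logic behind Flu Trends–style prediction can be illustrated by correlating an online signal with official case counts. A minimal sketch follows; the two weekly series are synthetic and purely illustrative, not real query or surveillance data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Synthetic weekly series: volume of flu-related queries/posts and
# official influenza case counts over the same eight weeks.
query_volume = [120, 150, 310, 480, 600, 550, 400, 220]
case_counts = [10, 14, 30, 52, 66, 58, 41, 20]

r = pearson(query_volume, case_counts)  # strength of the co-movement
```

A high correlation on historical data is what licenses using the online signal as a fast, inexpensive proxy for the slower official statistic.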



  

Relying on these unconventional data sources, that is, on social media as health assessment tools, has many advantages, beyond new opportunities for understanding the health behaviors of individuals at a scope and on a scale not possible before:

• Self-report methodology in public and behavioral health surveys includes responses that are prompted by the experimenter and typically comprise recollection of (sometimes subjective) health facts. Measurement of health behavior via social media sources can capture personal and social activity and language expression in a naturalistic setting. Such activity is real time and happens in the course of a person’s day-to-day life. Hence it is less vulnerable to memory bias or experimenter demand effects and can help track concerns on a fine-grained temporal scale.

• Surveys and wearable sensing tools that are used extensively for personal health monitoring can indicate one’s physiological traits and affective responses, as well as some lifestyle attributes such as geographic location (Jovanov et al., ; Burke et al., ). However, they generally cannot capture the context and content of these reactions, aspects that may be quantified via social media.

• Large amounts of naturalistic population data can be collected through social media at a faster rate, with little intrusion or intervention, and through inexpensive means, in contrast to traditional data sources such as health and behavioral surveys.

The following subsections review prior literature, including my own, on a number of ways that this new form of data may inform mental health practices. I have organized them into early detection, psychosocial support, and health-related self-disclosure.

2.1 Early Detection

Researchers have become increasingly interested in understanding how social media activities can be used to infer and detect people’s well-being and the conditions and symptoms related to diseases (Paul and Dredze, ) and disease contagion, such as flu (Lamb et al., ). Paul and Dredze () developed a disease-specific topic model based on Twitter posts in order to model behavior related to a variety of diseases of importance in public health. Topic models are a suite of algorithms, specifically probabilistic models based on hierarchical Bayesian analysis, that uncover the hidden thematic structure in document collections (Blei et al., ). These algorithms help researchers develop new ways to search, browse, and summarize large archives of texts. Through language modeling of Twitter posts, Culotta () found evidence of high correlation between social media signals and diagnostic influenza case data. Sadilek et al. () developed statistical models that predicted the spread of infectious disease (e.g., flu) in individuals based on geotagged postings made on Twitter (see also Brennan et al., ).
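To illustrate the idea behind topic models, the following is a minimal collapsed Gibbs sampler for latent Dirichlet allocation (LDA) run on a four-document hypothetical corpus. This is a toy sketch of the general technique, not the model of Paul and Dredze; applied work would use a dedicated library and far more data.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA over tokenized docs.

    Returns one word-count dictionary per discovered topic."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    # Random initial topic assignment for every token.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment, then resample the token's
                # topic from the collapsed conditional distribution.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

# Hypothetical corpus: two health-themed and two politics-themed documents.
docs = [
    "flu fever cough flu fever".split(),
    "cough fever flu cold fever".split(),
    "ballot vote election vote ballot".split(),
    "election vote ballot campaign vote".split(),
]
topics = lda_gibbs(docs, n_topics=2)
```

With such separated themes, the sampler typically concentrates the health vocabulary in one topic and the election vocabulary in the other, which is the "hidden thematic structure" the text describes.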

     



For mental health in particular, Moreno et al. () demonstrated that status updates on Facebook could reveal symptoms of major depressive episodes, while Park et al. () found differences in the perception of Twitter use between depressed and nondepressed users; the former found value in Twitter through the ability to garner social awareness and engage in emotional interaction. Katikalapudi et al. () analyzed patterns of Web activity of college students that could signal emotional concerns. Dao et al. () focused on the depression community on the LiveJournal blog and explored the effects of mood, social connectivity, and age associated with manifested depression. They focused on two properties of messages in the blog: topic and linguistic style. Building statistical and machine learning methods to discriminate posts made by bloggers in low versus high valence mood, they found linguistic style to be the most distinguishing feature across different age categories and different degrees of social connectivity. In other work, Tsugawa et al. () investigated how text features extracted from users’ historical timelines may be used for recognizing depression, as measured via standardized screening tests. They found that topic models may contribute significantly to detecting depression-prone users against control users. An important contribution of this work was the finding that around two months’ observation was sufficient for learning and prediction; in fact, longer observation periods decreased accuracy. Relatedly, though not in the context of social media, Resnik et al. () showed the usefulness of a topic model applied to essays written by university students in detecting depressive tendencies. Further related to the use of topic models, Dinakar et al.
() analyzed posts from a popular teen support community to shed light on how computational techniques could be practically deployed on the network to help its participants. They developed a stacked generalization of an ensemble of models for topic prediction, to extract the most dominant themes behind teenagers’ distress-related stories. Ensemble learning is a machine learning paradigm in which multiple learners are trained to solve the same problem (Dietterich, ). In contrast to conventional machine learning approaches, which try to learn one hypothesis from training data, ensemble methods construct a set of hypotheses and combine them. The generalization ability of an ensemble is therefore usually much stronger than that of its base learners. In the context of inferring psychological states and distress in individuals, ensemble learning is appealing because it can boost weak learners that are only slightly better than random guessing into strong learners that make very accurate predictions. More recently, Coppersmith, Harman, and Dredze () proposed a novel way of obtaining and analyzing mental health–related data from Twitter, including the possibility of developing techniques that can detect various conditions. The authors cleverly utilized the self-identification technique of Beller et al. () and statements (tweets) such as “I was diagnosed with depression” to automatically uncover a large number of users with mental health conditions. They demonstrated success for four mental health conditions: depression, PTSD, bipolar disorder, and seasonal affective disorder. In a follow-up work, Coppersmith, Dredze, and Harman () extended the self-stated diagnosis approach to detecting other, rarer forms of mental illness: attention deficit
hyperactivity disorder, generalized anxiety disorder, bipolar disorder, borderline personality disorder, depression, eating disorders (including anorexia, bulimia, and eating disorders not otherwise specified), obsessive compulsive disorder, post-traumatic stress disorder, schizophrenia, and seasonal affective disorder. These authors developed binary classifiers that could distinguish users who reported a diagnosis from their age- and gender-matched controls, based on signals quantified from Twitter language. The classifiers also allowed the authors to systematically compare the language used by those with the ten conditions investigated, finding some evidence of the systematic comorbidities known in the clinical psychology literature. This entire line of research provides evidence that examining mental health through the lens of language on social media is amenable to further advances that could revolutionize mental health care. Although this research has demonstrated considerable success in detecting depression or depression risk in individuals and populations, researchers have typically focused on distinguishing between an already affected (depression-prone) cohort and a control cohort. Such detection techniques may be useful for identifying who might be at risk; however, critical value may be derived from early detection of depressive symptoms, or of an individual’s or community’s predisposition to a future depressive episode. In this way, help and support may be directed in a timely fashion toward tailored, adaptive, and personalized interventions. The following two subsections discuss my prior research on early detection of two kinds of mental illness risk, on the micro (individual) scale and the macro (population) scale.
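The ensemble idea discussed earlier can be illustrated with a toy majority-vote combiner over deliberately weak classifiers. The single-cue "learners" and their cue words below are hypothetical assumptions for illustration, not features from any of the cited studies.

```python
# Three deliberately weak, single-cue classifiers over tokenized posts.
# Each returns 1 ("at risk") or 0; the cue lists are illustrative only.
def clf_pronoun(tokens):
    # Cue: heavy first-person pronoun use.
    return int(sum(tokens.count(w) for w in ("i", "me", "my")) >= 2)

def clf_negative(tokens):
    # Cue: presence of negative-affect words.
    return int(any(w in tokens for w in ("sad", "alone", "tired", "hopeless")))

def clf_no_positive(tokens):
    # Cue (noisy): absence of positive-affect words.
    return int(not any(w in tokens for w in ("happy", "great", "fun")))

def ensemble(tokens):
    """Majority vote over the three weak learners."""
    votes = clf_pronoun(tokens) + clf_negative(tokens) + clf_no_positive(tokens)
    return int(votes >= 2)

label = ensemble("i feel sad and alone i cry".split())
```

Because each learner errs on different posts, requiring two of three to agree suppresses their individual mistakes, which is the intuition behind combining weak hypotheses into a stronger one.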

2.1.1 Quantifying Individual-Centric Risk

My research with colleagues examined linguistic and emotional correlates of the postnatal course of new mothers, and thereafter built a model to predict extreme behavioral changes in new mothers (De Choudhury et al., a). In our first two studies, we analyzed Twitter postings of new mothers to detect (De Choudhury et al., a) and to predict (De Choudhury et al., b) extreme behavioral changes postpartum. These studies did not have access to ground truth data on clinical postpartum depression (PPD) outcomes, but rather relied on sensing extreme changes on Twitter. Social media correlates of these extreme behavioral changes included lowered positive affect, raised negativity, and greater use of first-person pronouns, indicating higher self-attentional focus (see Figure .). We extended this research by obtaining gold-standard labels of depression for  new mothers who use Facebook, through an online survey (De Choudhury et al., ), using the well-validated depression screening tool, the Patient Health Questionnaire (PHQ-) (Löwe et al., ). The PHQ- depression scores were critical, as they allowed us to distinguish new mothers actually suffering from depression from those who had changed behavior (e.g., posting less on Facebook) for more benign reasons, such as simply being too busy with their new babies. Thereafter we characterized the Facebook behaviors of participants over fifty weeks of the prenatal period and ten weeks of the postnatal period, totaling , postings on Facebook. For this, we employed various measures of social capital, emotion, and linguistic style. We then

Figure . Differences in affective behavior (positive affect, or PA, and activation) of postpartum depression–vulnerable new mothers. The two heat map visualizations show individual-level changes for positive affect (left) and activation (right) in the postnatal period, in comparison to the prenatal phase. The color map uses an RGB scale in which red represents greater values and blue smaller values of each measure. Note the decreases in PA and activation following childbirth for many mothers; for some mothers (%) the decrease is considerably greater than for the majority. Source: De Choudhury et al. (a).

developed several statistical models to predict whether or not a mother would develop PPD. A model that uses only prenatal data was found to explain as much as % of the variance in the data and improved upon a baseline model based on demographic and childbirth history data by %. Broadly, our findings indicated that behavioral concerns such as postpartum depression may be reflected in mothers’ social media (Facebook) use, including lowered positive affect, raised negativity, and greater use of first-person pronouns, indicating higher self-attentional focus. In addition, we found that the behavioral changes of mothers could be predicted by leveraging their activity from the prenatal period alone. Including a short time horizon (~ one month) after childbirth, we were able to achieve better performance; this model was found to explain up to % of the variance in the data. A common thread in this body of research is how computational techniques may be applied to the naturalistic data that millions of people share on today’s online social platforms to infer and detect their health conditions. These methods have demonstrated strong performance and accuracy when applied to a number of health domains. Note that these techniques are rooted in findings in the social science literature, in which computerized analysis of language and social network analysis have revealed markers of depression (Rosenquist et al., ), anxiety, and other psychological disorders (Fowler et al., ; Fiore et al., ). Together, they point to the potential of linguistic and activity analyses, and of notions of social capital manifested in online social platforms, to offer a novel methodology for augmenting traditional approaches to measuring, detecting, and predicting risk of various health concerns.
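The linguistic markers recurring throughout this line of work (positive affect, negativity, first-person pronoun use) can be computed as simple lexicon rates. The mini-lexicons below are hypothetical stand-ins; studies of this kind typically rely on validated lexicons rather than hand-picked word lists.

```python
# Hypothetical mini-lexicons; validated tools would be used in practice.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
POSITIVE = {"happy", "love", "great", "joy", "excited"}
NEGATIVE = {"sad", "hate", "tired", "alone", "hopeless"}

def features(post):
    """Per-token rates of three linguistic markers in a single post."""
    tokens = post.lower().split()
    n = len(tokens) or 1  # guard against empty posts
    return {
        "first_person": sum(t in FIRST_PERSON for t in tokens) / n,
        "positive_affect": sum(t in POSITIVE for t in tokens) / n,
        "negative_affect": sum(t in NEGATIVE for t in tokens) / n,
    }

f = features("I feel so tired and alone my days are sad")
```

Averaging such per-post rates over a user's prenatal timeline yields exactly the kind of feature vector a predictive model of postpartum change can be trained on.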

2.1.2 Population-Scale Measurement

From a public health perspective, social media data and other forms of online data in general have enabled large-scale analyses of a population’s health status beyond what has previously been possible with traditional methods (Ayers and Kronenfeld, ).



  

Another line of our prior research examined the potential of using Twitter as a tool for measuring and predicting major depression in individuals (De Choudhury et al., d). First we used crowdsourcing to collect gold-standard labels of a cohort’s depression, then proposed a variety of social media measures, such as language, emotion, style, ego-network, and user engagement, to characterize depressive behavior. We used the Center for Epidemiologic Studies Depression Scale (CES-D) (Radloff, ) screening test to obtain the gold-standard labels of depression. Our findings showed that individuals with depression exhibited lowered social activity, greater negative emotion, high self-attentional focus, increased relational and medicinal concerns, and heightened expression of religious thoughts (De Choudhury et al., d). They also appeared to belong to highly clustered, close-knit networks, and were typically highly embedded with their audiences in terms of the structure of their ego-networks. Finally, we leveraged these distinguishing attributes to build a support vector machine (SVM) classifier (Duda et al., ) that can predict, ahead of its reported onset, an individual’s likelihood of depression. The classifier yielded promising results, with % classification accuracy. Through this work we extended the prior line of work on PPD by (1) expanding the scope of social media–based mental health measures, describing the relationship between nearly two hundred measures and the presence of depression; and (2) demonstrating that we can use those measures to predict depressive disorders, ahead of onset, in a cohort of individuals diagnosed with depression via a standard psychometric instrument. Complementary to this investigation, we examined whether population-scale assessments may be obtained from the content and activities of individuals on social media platforms.
First, using crowdsourcing techniques, we gathered a ground truth set of , Twitter postings shared by individuals suffering from clinical depression; depression was measured using the CES-D (Radloff, ) screening test. As in the research previously discussed, we then developed statistical models (an SVM classifier) that can predict whether or not a Twitter post in a test set is depression-indicative. To construct and test the predictive models, we harnessed evidence from a variety of measures spanning emotional expression, linguistic style, user engagement, and egocentric social network properties. We demonstrated that our models can predict whether a post is depression-indicative with an accuracy of more than % and a precision of .. Finally, we proposed a metric we refer to as the social media depression index (SMDI), which uses the previously described prediction models to identify depression-indicative postings on Twitter and thereby helps characterize the levels of depression in populations. We conducted a variety of analyses at population scale, examining depression levels (as given by SMDI) across geography (US cities and states), demographics (gender), and time, including diurnal and seasonal patterns (see Figure .). Our findings from these analyses were found to align with Centers for Disease Control and Prevention (CDC)2 reported statistics on depression in the US population, as well as to confirm known characteristics of depression described in the clinical literature.

Figure . Population-scale, social media–based, nonreactive, unobtrusive, naturalistic measure of depression. Panel (a) shows the state-wise prevalence of depression per BRFSS (CDC, ) on the left; on the right is the same based on the social media measure. Darker shades indicate greater depression prevalence. Panel (b) shows diurnal patterns of depression among men and women, based on the same social media measure. Source: De Choudhury et al. (c).

Employing linguistic and affective analysis of text shared on social media, my colleagues and I have also examined the presence of collective affective desensitization in communities exposed to prolonged violence, such as the drug war in Mexico (De Choudhury, Monroy-Hernandez, and Mark, ). In more recent work we have developed an automated rating scale to infer levels of mental illness severity in pro–eating disorder communities on Instagram through unsupervised modeling of textual content (Chancellor et al., ). Going beyond correlational studies of social media behavior and well-being states, we have developed causal analytic models to discover a community’s shifts toward suicidal ideation from mental health discourse on Reddit (De Choudhury et al., ).
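A rough sketch of how an SMDI-style population index might be computed follows: z-score the daily fraction of posts flagged as depression-indicative. This is a simplification under stated assumptions (synthetic counts, plain z-scoring of fractions), not the exact published formulation.

```python
import math

def smdi(indicative_counts, total_counts):
    """Z-scored daily fraction of depression-indicative posts.

    A simplified stand-in for the published SMDI; positive values mark
    days with above-average depression-indicative activity."""
    fracs = [i / t for i, t in zip(indicative_counts, total_counts)]
    mu = sum(fracs) / len(fracs)
    sd = math.sqrt(sum((f - mu) ** 2 for f in fracs) / len(fracs))
    return [(f - mu) / sd for f in fracs]

# Hypothetical daily post counts for one region.
indicative = [12, 15, 30, 9, 14]
total = [400, 420, 410, 390, 405]
index = smdi(indicative, total)
```

Computed per state or per hour of the day, such a standardized index is what permits the geographic and diurnal comparisons described above.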

2.2 Psychosocial Support

Prior research in psychology has examined the important role of social support in combating health challenges such as depression (George et al., ). It is argued that social intimacy, social integration, and the nature of social networks, as well as
individual perception of being supported by others, are important and indispensable in encouraging mental illness recovery (Turner et al., ). The Web is increasingly used for seeking and sharing health information, and such activity is known to have connections to healthcare utilization and health-related behaviors (Sillence et al., ; Liu et al., ). One study suggests that % of US Web users have participated in medical or health-related groups (Chou et al., ). In this light, approaches to community building have been proposed (e.g., Grimes et al., ; Wicks et al., ). Literature on online support groups also notes that they are popular sources of information and support for many Internet users (White and Dorman, ). These forums tend to contrast sharply with similar offline groups; for instance, people are likely to discuss problems online that they do not feel comfortable discussing face to face (Johnson and Ambrose, ). Moreover, such online health communities (OHCs) are known to foster well-being, a sense of control, self-confidence, social interaction, and improved feelings, and further approaches to community building have been proposed accordingly (Smith and Wicks, ). Recent research on social media has demonstrated that they provide a way for people to communicate with their contacts regarding health concerns. Newman et al. () interviewed people with significant health concerns who participated in both OHCs and Facebook. Oh et al. () examined people’s use of Facebook for health purposes and showed that emotional support was a significant predictor of health self-efficacy. Facebook use has also been shown to help those with lower self-esteem attain higher social capital (Ellison et al., ), while distressed teenagers have been known to seek help and advice anonymously in online communities regarding issues such as social and romantic relationships, sexuality, and gender identity (Suzuki and Calzo, ).
Thus information seeking and sharing on social media platforms have the potential to benefit individuals in the areas of self-help, social support, and empathy. The benefits of social support enabled by social media platforms may also be viewed in light of social penetration theory (Altman and Taylor, ), which proposes that as relationships develop, interpersonal communication moves from relatively shallow, nonintimate levels to deeper, more intimate ones, in essence allowing richer discourse. Repeated use of social media may allow social relationships to develop over time, which may manifest through heightened self-disclosure. Hence, as studied in Moon (), social media may act as fora of continual social support, stronger community ties, and better social capital resources for affected populations. In fact, distinct from online fora, these social systems are more holistic, in the sense that millions of people use them to post about the mundane goings-on of their lives. In addition, unlike many online communities, most social media sites associate a permanent personal identity with user profiles. Consequently, social media platforms provide a rich ecosystem for studying the variety of social support that engenders health-related discourse. With these observations in mind, in another work my colleagues and I examined a highly popular social news and entertainment medium: Reddit3 (De Choudhury and De, ). Reddit is an interesting online social system that has the attributes of a forum; it allows sharing blurbs of text and media as posts that invite votes and
commentary. At the same time it is often used as a social feed for information broadcast from people’s contacts and audiences. We studied the nature of discourse on this prominent social medium related to the important health challenge of mental illness. As previously discussed, mental illness in particular is a kind of health concern for which the value of emotional and pragmatic support has long been recognized. Studies have demonstrated that social support is beneficial in improving perceived self-efficacy and quality of life (Turner et al., ). These findings motivated our study. Based on a large data set of several thousand Reddit users, posts, and comments, we found that the Reddit communities we studied allowed a high degree of information exchange about a variety of issues concerning mental health. These ranged from discussion of challenges faced in day-to-day activities, work, and personal relationships to specific queries about mental illness diagnosis and treatment. Feedback on mental health postings also covered a wide spectrum, from emotional and instrumental commentary to informational and prescriptive advice. In fact, posts exhibiting lowered inhibition and greater self-attentional focus received greater support. Our observations thus demonstrated that Reddit in particular, and social media in general, fill an interesting gap between online health forums and conventional health information access avenues when it comes to mental health. Moreover, it is established that any psychological consequence of a technology depends on the activities the technology enables, the attributes of the user, and how the two interact. We therefore believe that research in this space is crucial for determining the effects of the social Web on a grave and stigmatized concern such as mental illness.

2.3 Health-Related Self-Disclosure

Beyond social and emotional support, social media platforms are known to enable increased self-disclosure (Joinson, ), allowing individuals to discuss sensitive topics with communities they identify with. Self-disclosure is the telling of the previously unknown so that it becomes shared knowledge, the “process of making the self known to others” (Jourard, ). The benefits of self-disclosure for health challenges can be tremendous; it is an important therapeutic ingredient (Joinson, ) and is linked to improved physical and psychological well-being. In fact, self-disclosure has received a great deal of attention in counseling research because of its hypothesized benefits for the client during the course of therapy, such as an increase in positive affect and a decrease in distressing symptoms (Joinson and Paine, ). Jourard () reported that the process of self-disclosure was a basic element in the attainment of improved mental health. Ellis and Cromby () reported that discourse on emotionally laden traumatic experiences can be a safe way of confronting mental illness. Along similar lines, seminal work by Pennebaker and Chung () found that participants assigned to a trauma-writing condition (in which they wrote about a traumatic and upsetting experience) showed immune system benefits (see also Pennebaker et al., ).



  

Disclosure in this form has also been associated with reduced visits to medical centers and with psychological benefits in the form of improved affective states (Stricker and Fisher, ). Rodriguez and Kelly () similarly found that revealing personal secrets to an accepting confidant could reduce feelings of alienation and, as a consequence, lead to health benefits. Conversely, self-concealment has been found to be a predictor of disordered eating symptoms (Masuda et al., ). Prior research in computer-mediated communication (CMC) found that medical patients tend to report more symptoms and undesirable behaviors when interviewed by computer rather than face to face (Greist et al., ). Clients at a sexually transmitted disease clinic reported more sexual partners, more previous visits, and more symptoms to a computer than to a doctor (Robinson and West, ). Ferriter () found that preclinical psychiatric interviews conducted via CMC rather than face-to-face contact yielded more honest, candid answers. In the United Kingdom, the Samaritans report that although only % of telephone callers report suicidal feelings, this number increases to around % for email contacts (Joinson, ). Furthermore, the role of self-disclosure as a mediator of mental illness is crucial because of the inherent stigma ascribed to many mental health conditions; there is evidence that people with mental illness tend to be guarded about what they reveal about their condition (Corrigan, ; De Choudhury et al., d). Social stigmas are negative attitudes toward an individual or group on the basis of socially characteristic grounds that distinguish them from others (Corrigan, ). Because users can be essentially anonymous or pseudonymous online, and specifically on many social media, they are less likely to be bothered by stigma, self-presentation concerns, or concerns related to tracking of their history on the site.
These services can thus facilitate fruitful connections among peers with similar stigmatized experiences and provide an open and honest platform for discourse. Berger et al. (), for instance, showed that compared with those with nonstigmatized conditions, those with stigmatized illnesses were more likely to seek health information online. Liu et al. () showed that video logs (which help people share stories, experiences, and knowledge) could support the disclosure of serious illnesses such as HIV, helping those afflicted overcome aspects of social stigma. Motivated by this line of work, my colleague and I examined how platforms such as social media might allow honest and candid expression of thoughts, experiences, and beliefs (Balani and De Choudhury, ). Specifically, we sought to detect the levels of self-disclosure manifested in posts shared on various mental health forums on Reddit. For this purpose we developed a classifier based on content features to automatically detect and characterize self-disclosure in social media content. Our classifier was able to characterize a Reddit post as exhibiting high, low, or no self-disclosure with % accuracy. Applying this classifier to general mental health discourse on Reddit, we found that the bulk of such discourse is characterized by high self-disclosure, and that the community responds distinctively to posts that disclose less or more. These findings revealed the potential of harnessing our proposed self-disclosure detection algorithm in psychological therapy via social media, including design considerations for improved community moderation and support in these vulnerable
self-disclosing communities. For instance, automatic detection of self-disclosure levels in social media posts may help community moderators direct appropriate help and advice in a timely fashion to individuals with mental health challenges. It can also help build tailored recommender tools that, based on self-disclosure levels, match potential parties in the community to a post’s author.

Relatedly, an outstanding question concerns the role of identity construction in health-related self-disclosure. It has been both anecdotally and empirically observed that through anonymity, the ability to avoid being “visible, verifiable, and accountable” leads people to act differently online than they would in offline settings (Suler, ). Does the cover of dissociative anonymity that certain social media such as Reddit allow lead to an increased disinhibition effect and increased self-disclosure? In prior work my colleague and I investigated this question by studying the posts made from Reddit’s characteristic “throwaway” accounts (Pavalanathan and De Choudhury, ). These are temporary identities often used as an “anonymity cloak” to momentarily discuss uninhibited feelings, sensitive information, or socially unacceptable thoughts—information otherwise considered unsuitable for the mainstream. Looking through this lens of anonymity in Reddit, we found that a small but notable fraction of redditors use the feature as a cover for more intimate and open conversations about their experiences of mental illness. Specifically, we observed almost ten times more throwaway Reddit accounts in mental health forums than in other communities. Further, postings from throwaway redditors in mental health forums were found to exhibit increased negativity, greater cognitive bias and self-attentional focus, lowered self-esteem, and greater disinhibition, even to the extent of revealing vulnerability to self-injurious thoughts.
Through these findings, throwaways can be said to allow individuals to be less inhibited by self-presentation and stigma-related concerns, presumably due to lack of identifiability and accountability. However, somewhat surprisingly, despite the negative or caustic nature of content shared by anonymous redditors, we found that online disinhibition of this nature garnered more emotional and instrumental feedback through commentary.
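To make the idea of a content-feature classifier for self-disclosure concrete, the following toy sketch bins a post into high, low, or no self-disclosure by counting first-person pronouns and intimate terms. This is not the supervised model used in the study; the lexicons and thresholds here are invented purely for illustration:

```python
import re

# Invented toy lexicons; real content features (e.g., LIWC categories,
# n-grams) are far richer than this.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
INTIMATE_TERMS = {"depressed", "anxious", "ashamed", "suicidal",
                  "diagnosed", "secret", "afraid", "lonely"}

def self_disclosure_level(post: str) -> str:
    """Bin a post into 'high', 'low', or 'no' self-disclosure using
    crude lexical counts; the thresholds are arbitrary illustrations."""
    tokens = re.findall(r"[a-z']+", post.lower())
    first_person = sum(t in FIRST_PERSON for t in tokens)
    intimate = sum(t in INTIMATE_TERMS for t in tokens)
    if first_person >= 2 and intimate >= 1:
        return "high"
    if first_person >= 1:
        return "low"
    return "no"
```

A trained classifier of the kind described in the text would learn such signals from annotated posts rather than from hand-set rules, but the underlying intuition—self-referential language co-occurring with intimate content—is the same.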

. S M  W-B: C

From the previously discussed studies, we find that large-scale analyses of social media data can yield valuable results and insights that address a variety of health challenges and provide new avenues for scientific discovery relating to the utility of data in personal and societal well-being. However, despite the use of anonymized data, this line of research raises questions about how best to address potential threats to privacy while reaping benefits for individuals and populations. The use of sophisticated statistical and machine learning algorithms to harvest signals and infer latent health
states from otherwise benign information shared online brings to the fore a number of avenues and scenarios in which reflection and further attention need to be directed to balancing the benefits of innovation and social good with the risks they pose to the individuals whose data are being analyzed.

.. Risk to Vulnerable Populations

Despite the benefits in health assessment that can be derived from social media, they may pose a hazard to vulnerable populations through the formation and influence of “extreme communities” on social media that promote and provide support for beliefs, attitudes, and behaviors typically considered harmful or unacceptable by the social mainstream. Examples include pro-anorexic behavior, pro-suicide tendencies, deliberate amputation, and other forms of self-harm. For instance, the impact of increasing use of online platforms, such as websites and blogs, in discourse about the highly controversial subject of “pro–eating disorders” has been examined extensively in prior work (Bardone-Cone and Cass, ; Borzekowski et al., ; Mulveen and Hepworth, ). Once socially or physically isolated, individuals with eating disorders can now easily connect with other sufferers online. Sometimes these users connect in “pro–eating disorder” communities that share content and advice and provide social support for disordered or unusual eating choices as a reasonable lifestyle alternative. It is known that a segment of the population on these platforms takes self-destruction to unimaginable extremes (Norris et al., ), with users encouraging the sharing of content that promotes negative perceptions of body image (Juarascio et al., ). Some of this content even goes to the extent of demonstrating pro-self-mutilation and pro-suicide sentiments (Bardone-Cone and Cass, ; Gavin et al., ; Fox et al., ). Social sharing of such behaviors not only is dangerous for those with eating disorder challenges but also represents a threat of contagion to those who do not currently have these conditions but may be vulnerable.
In the light of these observations, researchers have argued that easy availability of media showing extremely thin models, which reflect the current trend toward very thin beauty canons, is pushing many teenagers and young adults toward unhealthy eating habits (Andrist, ). To investigate this phenomenon more systematically, Harper et al. () examined the association between pro–eating disorder website viewership and concurrent levels of body dissatisfaction and eating disturbance. More recently, social media platforms such as Tumblr and Instagram have unique affordances that make them appropriate platforms to examine pro–eating disorder behavior. For instance, the demographics of Instagram and the demographics of the common eating disorder patient are similar. Approximately % of Instagram users are female, and roughly half of all Internet-using young adults (twelve to eighteen years old) are using Instagram, compared to typical eating disorder patients, who are women fifteen to twenty-four years old (Fox et al., ). In addition, the visual nature of these platforms themselves may predispose pro–eating disorder communities to persist. A  study
found that % of American girls five to twelve years old said pictures influenced their concept of ideal body shape, and % reported that images made them want to lose weight (Martin, ). Further, the use of tags on these sites makes their underlying social network a likely target for deviant behavior such as pro–eating disorder behavior. These communities are often hidden in plain sight; that is, their activities are generally cut off from the mainstream activity of users but are easily accessible by searching for related tags/keywords. Essentially, social media may provide these populations with an environment and avenue to seek and provide support and acceptance that is difficult to obtain through offline means. Although these online groups may provide the benefit of support, they may present a risk to the public by encouraging vulnerable individuals to hurt themselves. In addition, because the Web eliminates geographic barriers to communication between people, the emergence of pro-self-harm social media sites and content may present a new risk to vulnerable people who might otherwise not have been exposed to these hazards. These findings may also be viewed in the light of the “cultivation theory” (Roskos-Ewoldsen et al., ), which suggests that when information is pervasive and repeated, individuals with higher exposure levels are more likely to accept the conveyed messages as normative. Hence the presence of these harmful communities online may enhance the potentially deleterious influence of pro– eating disorder behavior on vulnerable individuals in various social digital spaces. Furthermore, there is substantial qualitative and quantitative research documenting the negative effects of adolescent Internet use related to cyberbullying (or online harassment) and sexual predation. Research has shown that the results of bullying are compounded by the increasing use of the Internet and mobile phones (Görzig and Frumkin, ). 
Cyberbullying is defined as “an act of aggression that is intentional, repetitive, and towards an individual of lower power.” It can take various forms, such as sending unwanted, derogatory, or threatening comments; spreading rumors; and sending pictures or videos that are offensive or embarrassing by text, email, or chat, or by posting them on websites, including social networking sites and social media. Compared with traditional bullying, girls show rates of cyberbullying higher than or equal to those of boys, and it has been argued that this is because bullying on online platforms is constrained to relational aggression (e.g., social exclusion and gossip), which has been observed more in females than in males (Görzig and Frumkin, ). A number of such experiences of victimization have been known to incur distress (Dinakar et al., ). In fact, the types of bullying taking place online, such as verbal and psychological bullying, may have more negative long-term effects on mental health outcomes than traditional (e.g., face-to-face and physical) forms of bullying, owing to factors ranging from the challenges that the anonymity of the perpetrator poses to victims, to the wide reach of cyberbullying strategies such as posts on social media profiles, to the twenty-four-hour presence of the potential humiliation (Willard, ). Finally, research has also proposed a new phenomenon called “Facebook depression” (O’Keeffe et al., ), defined as depression that develops due to spending
considerable time on social media sites such as Facebook and then beginning to exhibit classic symptoms of depression. Individuals at risk for this tendency exhibit social isolation and may turn to risky sites and blogs for emotional support, in turn promoting substance abuse, unsafe sexual practices, or aggression.

.. Privacy, Ethics, and Policy

... Privacy-Preserving, Ethical-Intervention Design

The ability to illustrate and model individual behavior using social media data shows promise in the design and deployment of next-generation wellness-facilitating technologies. Privacy-preserving software applications and services can serve as early warning systems, providing personalized alerts and information to individuals. They can provide affected individuals with just-in-time, personalized, and adaptive information on their risk of encountering significant behavioral changes, such as an adverse mental health episode, as revealed by their social network/media activity. It is important to note here that these tools should be envisioned not as stand-alone diagnostic tools, but as part of a broader awareness, detection, and support system. Beyond monitoring behavioral trends in real time, social media–based measures, such as degrees of activity and emotional expression, can serve as a personal, diary-type narrative resource logging “behavioral fingerprints” over extended periods of time. The application might even assign health risk scores to individuals based on predictions made about forthcoming changes in their behavior. In operation, if the inferred likelihood of forthcoming changes surpassed a threshold, the individual could be warned or engaged, and information might be provided about professional assistance and/or the value of social support from friends and family. In fact, individuals may volunteer to self-monitor their mental illness risk as estimated from their social media content. Those keen on recovery may generate and share abstracted trends of their mental illness risk levels with a trusted friend, family member, or therapist. These logs of risk over time may provide a more temporally nuanced assessment than is possible through surveys, interviews, or other self-reported information.
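The threshold-based warning logic described above can be sketched in a few lines. The score source, rolling window, and threshold below are all hypothetical; a deployed system would derive scores from a validated predictive model and would operate only with the individual's consent:

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative threshold; a real system would require clinical validation.
RISK_THRESHOLD = 0.8

@dataclass
class RiskAlert:
    user_id: str
    score: float
    message: str

def check_risk(user_id: str, daily_scores: List[float]) -> Optional[RiskAlert]:
    """Return an alert when the mean of the most recent week of inferred
    risk scores (hypothetically produced by a behavioral-change model)
    crosses the threshold; otherwise return None."""
    if not daily_scores:
        return None
    window = daily_scores[-7:]  # most recent seven daily scores
    mean_risk = sum(window) / len(window)
    if mean_risk >= RISK_THRESHOLD:
        return RiskAlert(
            user_id,
            mean_risk,
            "Consider reaching out: professional and peer support resources are available.",
        )
    return None
```

Averaging over a window rather than acting on a single day's score is one simple way to avoid spurious alerts from noisy daily inferences.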
This information can complement existing forms of psychological therapy, help establish rapport between therapists and clients, and help overcome difficulties encountered in these settings due to clients’ reluctance to share sensitive information about their mental health. In short, we hope analytic approaches based on social media data can play a role in helping individuals find timely and appropriate support. Furthermore, when it comes to bringing help and support to social media communities engaging in risky and vulnerable behavior (e.g., pro–eating disorder or self-harm behavior), previous research has shown the importance of these sensitive communities as emotional “safety valves” for negative behavior, allowing disinhibiting discourse that averts more drastic or dangerous actions (Emmens and Phippen, ). Hence, instead of suppressing or banning such vulnerable content, a policy adopted by some sites,4 social media platforms might consider alternative intervention techniques such as the following:
• Platforms could issue public service announcements, with pointers to support communities or to an appropriate hotline or other resources, thus increasing the likelihood of the user being exposed to healthier behaviors. Along these lines, searches on content with high mental illness severity may automatically be directed to links hosting helpful and research-supported resources, highlighting the health risks of such activities. Automated methods could also be developed to detect whether a user is attempting to post triggering content; at that point the system could interject with a private message that provides a link to an appropriate psychological disorder helpline.
• Platforms could also limit the exposure of content associated with tags that disseminate harmful or dangerous content, instead introducing recovery, support, or educational content in the suggested recommendation feeds of users to help disseminate helpful information.
• Social computing system designers could work with clinicians, therapists, trusted/identified family members, and close friends to examine how to bring timely and appropriate help to at-risk groups and to work toward altering their attitudes about the impact of deviant health behaviors. As an example, recovery from pro–eating disorder behavior is a challenging experience, and many individuals undergo conflicting perceptions of identity during recovery attempts, including revelation of vulnerability (Norris et al., ). Intervention tools may specifically focus on the needs of such groups, for instance, providing psychosocial support in response to expression of vulnerable behavior in social media content.

Concerns may arise about the ethics of such forms of intervention, as they ultimately leverage information that may be considered sensitive, given their focus on behavior and health.
Can we design effective interventions for people whom we have inferred to be vulnerable to a certain illness in a way that does not compromise their status, yet still raises awareness of this vulnerability to themselves and trusted others (doctors, family, friends)? In extreme situations, when an individual’s inferred vulnerability to an illness with risk-taking attitudes is alarmingly high (e.g., self-harm-prone individuals), what should be our responsibility as a research community? For instance, should there be other kinds of special intervention in which appropriate counseling communities or organizations are engaged? In short, finding the right types of intervention that can actually make a positive impact on people’s behavioral state while abiding by adequate privacy and ethical norms is a research question on its own. Furthermore, there are several policy-related dimensions to research that utilizes social network activities to make inferences about people’s mental and behavioral health. Up to what point can such inferences about illness or disability be deemed to be safe for an individual’s professional and societal identity? At what point do interventions on social media become counterproductive or possibly manipulative? It is also important to balance these interventions on health impacts with boundary regulation concerns. To what extent can we notify trusted friends, family, and clinicians that someone may be suffering from a mental illness? How do we ensure that such
measurements do not introduce new means of discrimination or inequality in society, given that we now have a mechanism to infer such traditionally stigmatic conditions, which are otherwise considered personal or sensitive? These and other potential consequences, such as revealing nuanced aspects of behavior and mental health conditions to insurance companies or other related decision-making agencies, make resolution of these ethical questions critical to the successful use of these new data sources.

Social networks and platforms do not have any moral or ethical obligation to intervene in the case of at-risk or other vulnerable populations. However, certain forms of mental illness, such as eating disorders, are unique in that body perception and self-esteem are negatively impacted by the social comparison enabled by social platforms, as well as by consumption of images of idealized physical appearance. Unlike in other health conditions, there is therefore a collective opportunity for social media designers and researchers to rethink the affordances around discoverability and sharing of dangerous and harmful content, not only to control the spread of such behaviors, but also to promote recovery from and treatment of these forms of mental illness.

It should be borne in mind, though, that mental illness is a controversial topic. Is the manifestation of risky mental illness tendencies on a social media platform a “bad” thing? Who decides what is “good” and what is “bad”? How can interventions be implemented without infringing on the right of individuals to express their ideas? We hope this chapter triggers conversations and involvement with the ethics and clinician communities to investigate opportunities and caution in this regard.

... Securing Disclosure of Personal Information

In general, new practices related to sharing benign to sensitive health information on online social platforms also raise concerns about privacy. The implications of sharing information in open fora such as social media have been examined in general (Morris et al., ) and specifically in the context of health (Young and Quan-Haase, ), most recently by Horvitz and Mulligan (). Young and Quan-Haase () studied factors influencing the disclosure of health information on Facebook and steps that people took to protect their privacy. Hartzler et al. () showed that people often made errors in determining what health information was shared with whom in their social network. Hence, beyond interventions and the ethical considerations related to them, designers, builders, owners, and researchers of these systems need to treat educating users about the privacy risks of sharing sensitive information online that can potentially be linked to their health as a matter of utmost importance. Data security and hacking have essentially been an “arms race.” Multifaceted data sets may be combined in clever ways to re-identify a victim (in our case an individual at an elevated risk for mental illness) even after anonymization, and sensitive information related to mental health status can be inferred from benign data that are routinely and naively shared. Hence there is a need to introduce adequate privacy protection approaches that can regulate data based on sharing practices, identifiability, and potential for risk inference. Moreover, participants’ social media use suggests that they might not be aware of the implications of some of these sharing practices, indicating that they may be unaware of
how some advertising companies may be collecting and distributing their information. Even though many of the health inferences found in prior research were derived from implicit patterns in activity and content, the ability to derive any information about a person’s health state from a public venue such as Twitter may have serious repercussions (e.g., higher insurance rates, denial of employment). Developing interfaces that remind users of these risks through interactive ways of interpreting their own data is an important area of future exploration for the social computing research community. It should also be recognized that introducing transparency of data processing and analysis to individuals can be challenging and is itself a ripe area for investigation (Horvitz and Mulligan, ). Many of the machine learning and statistical approaches adopted in the research covered in this chapter involve nonintuitive, nontrivial, and complex workflows, dynamics, and decision criteria; promoting genuine understanding of such systems and their reasoning methods among nonexperts would require developing methods that abstract operational details while still allowing clarity and intuitive characterization. Importantly, to promote transparency and still be able to tackle the challenges it poses, novel intervention systems will need to revisit the regulations related to the access, precision, and adaptive rights of those individuals who are algorithmically inferred to be at heightened risk of mental illness.

Broadly, I envision the systems described in the previous subsection being designed as privacy-preserving, intuitive, interpretive applications that are deployed by and for individuals, thereby honoring the sensitive aspect of revealing different types of health-related information to them.
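The re-identification risk noted earlier, where multifaceted data sets are combined in clever ways to unmask a nominally anonymized individual, can be illustrated with a toy linkage on quasi-identifiers. All records, names, and field values below are fabricated for illustration:

```python
# "Anonymized" health data: names removed, quasi-identifiers retained.
health_records = [
    {"zip": "30301", "birth_year": 1990, "gender": "F", "condition": "depression"},
    {"zip": "30302", "birth_year": 1985, "gender": "M", "condition": "anxiety"},
]

# A second, public data set (e.g., social media profiles) with names attached.
public_profiles = [
    {"name": "Alice", "zip": "30301", "birth_year": 1990, "gender": "F"},
    {"name": "Bob", "zip": "30302", "birth_year": 1985, "gender": "M"},
]

def reidentify(records, profiles):
    """Link nominally anonymized records back to named profiles by
    matching on overlapping quasi-identifiers."""
    keys = ("zip", "birth_year", "gender")
    matches = []
    for record in records:
        for profile in profiles:
            if all(record[k] == profile[k] for k in keys):
                matches.append((profile["name"], record["condition"]))
    return matches
```

Even a few coarse attributes such as ZIP code, birth year, and gender can be enough to single a person out, which is why removing names alone is generally insufficient protection.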

. Conclusion

Social media platforms are revolutionizing how we think about mental health. This chapter has highlighted a number of recent efforts in this space, as well as noting the challenges that surface as we begin to reap the benefits of this novel, unconventional source of data for mental health help and support.

Revolutionizing mental health is not limited to the case studies and approaches reviewed here. Social media platforms can provide a unique avenue for mental health organizations to reach out and support large and diverse populations. Clinicians, therapists, and caregivers can tune in to social media conversations in real time to listen and collect feedback, identify information gaps, and quell misconceptions about mental health needs of individuals. In addition, due to the multi-way, interactive functionality that is inherent to these platforms, social media can allow these agencies to increase direct engagement to maintain and increase trust and credibility about the variety of mental health information that surfaces on these platforms.

There is also a potential for mental health organizations and nonprofits to engage with opinion leaders and influencers on mental health–related topics in social media and in their conversations. Influencers can be both organizations and individuals and exhibit the characteristics of credibility, persistence in convincing others, and ability to drive conversations so
that others take notice of the topic or idea and show support. Potentially, by engaging with such influencers, mental health officials can discuss ways to promote messaging on shared communication goals to increase the reach of mental health communications. Finally, proactively using social media to increase public awareness of mental health issues and the associated challenges of social media use for the purpose described here is a logical modern public health approach that has the potential to save many lives.

The role of social media and their potential in understanding health behaviors is a relatively new and evolving phenomenon, one that society is only beginning to assess and understand and whose implications are yet to be fully grasped. Because social media are mostly created and controlled by end users, the opportunity for surveillance and prevention can be extended to all users. One way to do this could be the public promotion of direct and easy avenues for people to access not only help and support, but also privacy- and ethics-related education, through social media sites. Hence, despite the ethics- and privacy-related challenges outlined in this chapter, I believe that it is important to bring the potential of social media to the fore, so as to leverage the benefits of this new data source to enhance people’s quality of life. Further, this will stimulate discussion and awareness of the potential role that policies could play in supporting the identities and practices that individuals suffering from certain illnesses develop in the face of social disadvantage. I believe that candid and informed discussions surpassing disciplinary boundaries about the potential use of social media data for mental health assessment, and the algorithmic capabilities that may be used to derive valuable information about mental health status, can lead to the development and deployment of adequate programs and policies.
Such developments may be able to balance the goals of protecting privacy, abiding by rigorous ethical considerations, and ensuring that those who have access to such information and inferences are able to employ them for the greater good of mental health.

N . . . .

1. http://www.google.org/flutrends/.
2. www.cdc.gov/.
3. http://www.reddit.com/.
4. Instagram’s New Guidelines against Self-Harm Images and Accounts, http://blog.instagram.com/post//instagrams-new-guidelines-against-self-harm.

R Altman, Irwin, and Dalmas A. Taylor. Social penetration: The development of interpersonal relationships. Holt, Rinehart & Winston, . Andrist, Linda C. Media images, body dissatisfaction, and disordered eating in adolescent women. MCN: The American Journal of Maternal/Child Nursing,  () –, .
Ayers, Stephanie L, and Jennie Jacobs Kronenfeld. Chronic illness and health-seeking information on the internet. Health,  ():–, .
Balani, Sairam, and Munmun De Choudhury. Detecting and characterizing mental health related self-disclosure in social media. In Proceedings of the rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems, pages –. ACM, .
Bardone-Cone, Anna M, and Kamila M Cass. Investigating the impact of pro-anorexia websites: A pilot study. European Eating Disorders Review,  ():–, .
Beller, Charley, Rebecca Knowles, Craig Harman, Shane Bergsma, Margaret Mitchell, and Benjamin Van Durme. I’m a belieber: Social roles via self-identification and conceptual attributes. In Proceedings of the nd Annual Meeting of the Association for Computational Linguistics, . Available at https://www.aclweb.org/anthology/P-
Berger, Magdalena, Todd H Wagner, and Laurence C Baker. Internet use and stigmatized illness. Social Science & Medicine,  ():–, .
Blei, David M, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, :–, .
Borzekowski, Dina L G, Summer Schenk, Jenny L Wilson, and Rebecka Peebles. E-ana and e-mia: A content analysis of pro–eating disorder web sites. American Journal of Public Health,  ():, .
Brennan, Sean, Adam Sadilek, and Henry Kautz. Towards understanding global spread of disease from everyday interpersonal interactions. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages –. AAAI Press, .
Burke, Jeffrey A, Deborah Estrin, Mark Hansen, Andrew Parker, Nithya Ramanathan, Sasank Reddy, and Mani B Srivastava. Participatory sensing. Center for Embedded Network Sensing, .
Centers for Disease Control (CDC). Behavioral risk factor surveillance system survey data.
Atlanta, GA: US Department of Health and Human Services, Centers for Disease Control and Prevention, .
Chancellor, Stevie, Zhiyuan (Jerry) Lin, Erica Goodman, Stephanie Zerwas, and Munmun De Choudhury. Quantifying and predicting mental illness severity in online pro-eating disorder communities. In Proceedings of the th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages –. ACM, .
Chou, Wen-ying Sylvia, Yvonne M Hunt, Ellen Burke Beckjord, Richard P Moser, and Bradford W Hesse. Social media use in the United States: Implications for health communication. Journal of Medical Internet Research,  ():e, .
Coppersmith, Glen, Mark Dredze, and Craig Harman. Quantifying mental health signals in Twitter. In ACL Workshop on Computational Linguistics and Clinical Psychology, . Available at https://clpsych.wordpress.com/
Coppersmith, Glen, Craig Harman, and Mark Dredze. Measuring post traumatic stress disorder in Twitter. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), . Available at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM/schedConf/presentations
Corrigan, Patrick. How stigma interferes with mental health care. American Psychologist,  ():, .
Corrigan, Patrick W. On the stigma of mental illness: Practical strategies for research and social change. American Psychological Association, .
Culotta, Aron. Estimating county health statistics with Twitter. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages –. ACM, .
Dao, Bo, Thin Nguyen, Dinh Phung, and Svetha Venkatesh. Effect of mood, social connectivity and age in online depression community via topic and linguistic analysis. In Web Information Systems Engineering–WISE , pages –. Springer, .
De Choudhury, Munmun, and Sushovan De. Mental health discourse on Reddit: Self-disclosure, social support, and anonymity. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), . Available at https://www.aaai.org/ocs/index.php/ICWSM/ICWSM/schedConf/presentations
De Choudhury, Munmun, Scott Counts, and Eric Horvitz. Major life changes and behavioral markers in social media: Case of childbirth. In Computer-Supported Cooperative Work and Social Computing (CSCW), pages –. ACM, a.
De Choudhury, Munmun, Scott Counts, and Eric Horvitz. Predicting postpartum changes in emotion and behavior via social media. In Proceedings of the  ACM Annual Conference on Human Factors in Computing Systems, pages –. ACM, b.
De Choudhury, Munmun, Scott Counts, and Eric Horvitz. Social media as a measurement tool of depression in populations. In Proceedings of the th Annual ACM Web Science Conference, pages –. ACM, c.
De Choudhury, Munmun, Michael Gamon, Scott Counts, and Eric Horvitz. Predicting depression via social media. In Proceedings of International Conference on Weblogs and Social Media, d. Available at https://www.aaai.org/Press/Proceedings/icwsm.php
De Choudhury, Munmun, Scott Counts, Eric Horvitz, and Aaron Hoff. Characterizing and predicting postpartum depression from shared Facebook data. In Proceedings of the th ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, .
De Choudhury, Munmun, Emre Kiciman, Mark Dredze, Glen Coppersmith, and Mrinal Kumar.
Discovering shifts to suicidal ideation from mental health content in social media. In Proceedings of the  CHI Conference on Human Factors in Computing Systems, pages –. ACM, .
De Choudhury, Munmun, Andres Monroy-Hernandez, and Gloria Mark. Narco emotions: Affect and desensitization in social media during the Mexican drug war. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages –. ACM, .
Dietterich, Thomas G. Ensemble learning. The Handbook of Brain Theory and Neural Networks, :–, .
Dinakar, Karthik, Birago Jones, Henry Lieberman, Rosalind Picard, Carolyn Rose, Matthew Thoman, and Roi Reichart. You too?! Mixed initiative LDA story-matching to help teens in distress. In Proceedings of International Conference on Weblogs and Social Media, . Available at https://www.aaai.org/Library/ICWSM/icwsmcontents.php
Duda, Richard O, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, .
Ellis, Darren, and John Cromby. Emotional inhibition: A discourse analysis of disclosure. Psychology & Health,  ():–, .
Ellison, Nicole B, Charles Steinfield, and Cliff Lampe. The benefits of Facebook “friends”: Social capital and college students’ use of online social network sites. Journal of Computer-Mediated Communication,  ():–, .
Emmens, T, and A Phippen. Evaluating online safety programs. Harvard Berkman Center for Internet and Society, . [ July ]. Available at https://cyber.harvard.edu/sites/

     



cyber.law.harvard.edu/files/Emmens_Phippen_Evaluating-Online-Safety-Programs_. pdf Ferriter, Michael. Computer aided interviewing and the psychiatric social history. Social Work and Social Sciences Review,  ():–, . Fiore, Joan, Joseph Becker, and David B Coppel. Social network interactions: A buffer or a stress. American Journal of Community Psychology,  ():–, . Fowler, James H, Nicholas A Christakis, et al. Dynamic spread of happiness in a large social network: Longitudinal analysis over  years in the Framingham Heart Study. BMJ, : a, . Fox, Nick, Katie Ward, and Alan O’Rourke. Pro-anorexia, weight-loss drugs and the Internet: An “anti-recovery” explanatory model of anorexia. Sociology of Health & Illness,  ():–, . Gavin, Jeff, Karen Rodham, and Helen Poyer. The presentation of “pro-anorexia” in online group interactions. Qualitative Health Research,  ():–, . George, Linda K, Dan G Blazer, Dana C Hughes, and Nancy Fowler. Social support and the outcome of major depression. The British Journal of Psychiatry,  ():–, . Golder, Scott A, and Michael W Macy. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science,  ():–, . Görzig, Anke, and Lara Frumkin. Cyberbullying experiences on-the-go: When social media can become distressing. Cyberpsychology: Journal of Psychosocial Research on Cyberspace,  (), . Greist, John H, Marjorie H Klein, and Lawrence J Van Cura. A computer interview for psychiatric patient target symptoms. Archives of General Psychiatry,  ():, . Grimes, Andrea, Brian M Landry, and Rebecca E Grinter. Characteristics of shared health reflections in a local community. In Proceedings of the  ACM Conference on Computer Supported Cooperative Work, pages –. ACM, . Harper, Kelley, Steffanie Sperry, and J Kevin Thompson. 
Viewership of pro-eating disorder websites: Association with body image and eating disturbances. International Journal of Eating Disorders,  ():–, . Hartzler, Andrea, Meredith M Skeels, Marlee Mukai, Christopher Powell, Predrag Klasnja, and Wanda Pratt. Sharing is caring, but not error free: Transparency of granular controls for sharing personal health information in social networks. In AMIA Annual Symposium Proceedings, volume , page . American Medical Informatics Association, . Horvitz, Eric, and Deirdre Mulligan. Data, privacy, and the greater good. Science,  ():–, . Johnson, Grace J, and Paul J Ambrose. Neo-tribes: The power and potential of online communities in health care. Communications of the ACM,  ():–, . Joinson, Adam. Social desirability, anonymity, and internet-based questionnaires. Behavior Research Methods, Instruments, & Computers,  ():–, . Joinson, Adam N. Self-disclosure in computer-mediated communication: The role of selfawareness and visual anonymity. European Journal of Social Psychology,  ():–, . Joinson, Adam N, and Carina B Paine. Self-disclosure, privacy and the internet. In The Oxford Handbook of Internet Psychology. Oxford University Press, . Available at https://www. oxfordhandbooks.com/view/./oxfordhb/../oxfordhb-e- Jourard, Sidney M. Healthy personality and self-disclosure. Mental Hygiene. .



  

Jovanov, Emil, Amanda O’Donnell Lords, Dejan Raskovic, Paul G Cox, Reza Adhami, and Frank Andrasik. Stress monitoring using a distributed wireless intelligent sensor system. Engineering in Medicine and Biology Magazine, IEEE,  ():–, . Juarascio, Adrienne S, Amber Shoaib, and C Alix Timko. Pro-eating disorder communities on social networking sites: A content analysis. Eating Disorders,  ():–, . Katikalapudi, Raghavendra, Sriram Chellappan, Frances Montgomery, Donald Wunsch, and Karl Lutzen. Associating internet usage with depressive behavior among college students. Technology and Society Magazine, IEEE,  ():–, . Lamb, Alex, Michael J Paul, and Mark Dredze. Separating fact from fear: Tracking flu infections on Twitter. In Proceedings of the  Conference of the North American Chapter of the Association for Computational Linguistics, pages –. Human Language Technologies, . Lenhart, Amanda, Kristen Purcell, Aaron Smith, and Kathryn Zickuhr. Social media and young adults. Pew Internet & American Life Project, , . Available at https://www. pewinternet.org////social-media-and-young-adults/ Liu, Leslie S, Jina Huh, Tina Neogi, Kori Inkpen, and Wanda Pratt. Health vlogger-viewer interaction in chronic illness management. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages –. ACM, . Löwe, Bernd, Kurt Kroenke, Wolfgang Herzog, and Kerstin Gräfe. Measuring depression outcome with a brief self-report instrument: Sensitivity to change of the patient health questionnaire (phq-). Journal of Affective Disorders,  ():–, . Martin, Jeanne B. The development of ideal body image perceptions in the united states. Nutrition Today,  ():–, . Masuda, Akihiko, Matthew S Boone, and C Alix Timko. The role of psychological flexibility in the relationship between self-concealment and disordered eating symptoms. Eating Behaviors,  ():–, . Moon, Youngme. 
Intimate exchanges: Using computers to elicit self-disclosure from consumers. Journal of Consumer Research,  ():–, . Moreno, Megan A, Dimitri A Christakis, Katie G Egan, Libby N Brockman, and Tara Becker. Associations between displayed alcohol references on Facebook and problem drinking among college students. Archives of Pediatrics & Adolescent Medicine,  ():–, . Morris, Meredith Ringel, Jaime Teevan, and Katrina Panovich. What do people ask their social networks, and why? A survey study of status message Q&A behavior. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages –. ACM, . Mulveen, Ruaidhri, and Julie Hepworth. An interpretative phenomenological analysis of participation in a pro-anorexia internet site and its relationship with disordered eating. Journal of Health Psychology,  ():–, . Newman, Mark W, Debra Lauterbach, Sean A Munson, Paul Resnick, and Margaret E Morris. It’s not that I don’t have problems, I’m just not putting them on Facebook: Challenges and opportunities in using online social networks for health. In Proceedings of the ACM  Conference on Computer Supported Cooperative Work, pages –. ACM, . Norris, Mark L, Katherine M Boydell, Leora Pinhas, and Debra K Katzman. Ana and the Internet: A review of pro-anorexia websites. International Journal of Eating Disorders,  ():–, .

     



Oh, Hyun Jung, Carolyn Lauckner, Jan Boehmer, Ryan Fewins-Bliss, and Kang Li. Facebooking for health: An examination into the solicitation and effects of health-related social support on social networking sites. Computers in Human Behavior,  ():–, . O’Keeffe, Gwenn Schurgin, Kathleen Clarke-Pearson, et al. The impact of social media on children, adolescents, and families. Pediatrics,  ():–, . Park, Minsu, David W McDonald, and Meeyoung Cha. Perception differences between the depressed and non-depressed users in Twitter. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), . Available at https://aaai.org/Press/Proceedings/ icwsm.php Paul, Michael J, and Mark Dredze. You are what you tweet: Analyzing Twitter for public health. In Proceedings of International Conference on Weblogs and Social Media (ICWSM) ICWSM, . Available at https://aaai.org/Press/Proceedings/icwsm.php Paul, Michael J, Mark Dredze, and David Broniatowski. Twitter improves influenza forecasting. PLOS Currents Outbreaks, . doi:./currents.outbreaks.bedfbaeccaaad Pavalanathan, Umashanthi, and Munmun De Choudhury. Identity management and mental health discourse in social media. In Proceedings of the th International Conference on World Wide Web Companion, pages –. International World Wide Web Conferences Steering Committee, . Pennebaker, James W, and Cindy K Chung. Expressive writing, emotional upheavals, and health. In H. S. Friedman & R. C. Silver (Eds.), Foundations of Health Psychology, pages –. Oxford University Press, . Pennebaker, James W, Tracy J Mayne, and Martha E Francis. Linguistic predictors of adaptive bereavement. Journal of Personality and Social Psychology,  ():, . Radloff, LS. Center for epidemiological: A self-report depression scale for research in the general population. Applied Psychological Measurement,  ():–, . 
doi:./  Resnik, Philip, Anderson Garron, and Rebecca Resnik. Using topic modeling to improve prediction of neuroticism and depression. In Proceedings of the  Conference on Empirical Methods in Natural, pages –. Association for Computational Linguistics, . Robinson, Rachael, and Robert West. A comparison of computer and questionnaire methods of history-taking in a genito-urinary clinic. Psychology and Health,  (–):–, . Rodriguez, Robert R, and Anita E Kelly. Health effects of disclosing secrets to imagined accepting versus nonaccepting confidants. Journal of Social and Clinical Psychology,  ():–, . Rosenquist, J Niels, James H Fowler, and Nicholas A Christakis. Social network determinants of depression. Molecular Psychiatry,  ():–, . Roskos-Ewoldsen, Beverly, John Davies, and David R Roskos-Ewoldsen. Implications of the mental models approach for cultivation theory. Communications-Sankt Augustin Then Berlin, :–, . Sadilek, Adam, Henry A Kautz, and Vincent Silenzio. Modeling spread of disease from social interactions. In Proceedings of International Conference on Weblogs and Social Media (ICWSM), . Available at https://aaai.org/Press/Proceedings/icwsm.php



  

Sillence, Elizabeth, Pam Briggs, Peter Richard Harris, and Lesley Fishwick. How do patients evaluate and make use of online health information? Social Science & Medicine,  ():–, . Smith, Catherine Arnott, and Paul J Wicks. Patientslikeme: Consumer health vocabulary as a folksonomy. In AMIA Annual Symposium Proceedings, volume , page . American Medical Informatics Association, . Stricker, George, and Martin Fisher. Self-disclosure in the therapeutic relationship. Springer, . Suler, John. The online disinhibition effect. Cyberpsychology & Behavior,  ():–, . Suzuki, Lalita K, and Jerel P Calzo. The search for peer advice in cyberspace: An examination of online teen bulletin boards about health and sexuality. Journal of Applied Developmental Psychology,  ():–, . Tsugawa, Sho, Yusuke Kikuchi, Fumio Kishino, Kosuke Nakajima, Yuichi Itoh, and Hiroyuki Ohsaki. Recognizing depression from twitter activity. In Proceedings of the rd Annual ACM Conference on Human Factors in Computing Systems, pages –. ACM, . Turner, R. Jay, B. Gail Frankel, and Deborah M Levin. Social support: Conceptualization, measurement, and implications for mental health. Research in Community & Mental Health, :–, . White, Marsha, and Steve M. Dorman. Receiving social support online: Implications for health education. Health Education Research,  ():–, . Wicks, Paul, Michael Massagli, Jeana Frost, Catherine Brownstein, Sally Okun, Timothy Vaughan, Richard Bradley, and James Heywood. Sharing health data for better outcomes on patients like me. Journal of Medical Internet Research,  ():e, . Willard, Nancy E. Cyberbullying and cyberthreats: Responding to the challenge of online social aggression, threats, and distress. Research Press, . Young, Alyson L., and Anabel Quan-Haase. Information revelation and Internet privacy concerns on social network sites: A case study of Facebook. 
In Proceedings of the Fourth International Conference on Communities and Technologies, pages –. ACM, .

  ......................................................................................................................

     ......................................................................................................................

    . 

I sharing is a core human activity (Csibra & Gergely, ) that catalyzes innovation and development. The frequency with which we share information is evident every day, with over  billion Facebook messages (Rao, ), over  million tweets (Krikorian, ), and  billion emails sent to colleagues, acquaintances, friends, family members, and sometimes complete strangers (Radicati Group, ) within a single twenty-four-hour cycle. Furthermore, the effects of information sharing are powerful and manifold in domains such as advertising (Bughin, Doogan, & Vetvik, ), stock prices and returns (see, e.g., Luo, , ; Berger, ), and mass media campaigns (Cappella, Kim, & Albarracín, ; Jeong & Bae, ; Southwell & Yzer, ). Consequently, extensive research in marketing, health, communication, psychology, political science, sociology, and network science documents what information is shared and when. Although immense progress has been realized across these fields, current approaches (e.g., methods from computational social science) have not been as well positioned to uncover the underlying mechanisms that could explain the why and how of sharing decisions and behavior. A better mechanistic understanding is necessary to increase the stability of predictive models across time and contexts, to develop parsimonious theoretical frameworks of interpersonal sharing, and to strategically design interventions based on those theories. Thus, moving beyond the documentation of the importance of interpersonal information sharing and its large-scale patterns and effects, mechanistic approaches to the study of sharing are important in the further development of this field. In this chapter we argue that neuroscientific methods offer one approach to generating novel insights about mechanisms underlying sharing between individuals, as well as across larger populations. 
To this end, we review what is known about the neural mechanisms that support the progression of information through propagation chains such as the one depicted in Figure .. Specifically, we present recent neuroscientific findings that contribute to our understanding of why and how individuals share information with others (interpersonal information sharing), as well as potential mechanisms driving population-level mass sharing events (virality).



    . 

[Figure: propagation-chain schematic. Labels: sharing context; propagation chain; source (e.g., media); information; primary receiver; secondary receivers; decision to share; downstream effects (information reach/impact).]

FIGURE. Information propagation chains in social networks. Information is spread from an initial source (e.g., mass media or a seed individual) to primary receivers, who are directly exposed to information stemming from the source. Primary receivers can further share the information with others (secondary receivers), who are not exposed to the source of information directly, but only through social contact with a primary receiver.

We focus primarily on functional magnetic resonance imaging (fMRI) studies, which have been used most extensively to study questions related to information sharing and virality. fMRI measures a blood-oxygen-level-dependent (BOLD) signal in the brain as a proxy for neural activity, with relatively high temporal and spatial resolution. The neuroscience of information sharing uses knowledge from existing neuroscience work to infer the psychological states involved in sharing and to predict sharing-related outcomes from observed neural activation patterns. One strength of neuroimaging methods, compared with many other approaches, is a more proximal and less disruptive measurement of psychological processes, across the whole brain (i.e., capturing multiple processes), in real time. This adds crucial information to

   



self-reported, retrospective accounts of thought processes produced after exposure, which are more subject to social desirability, memory errors, or the simple inability or unwillingness of respondents to verbalize specific thoughts or experiences (Krumpal, ; Nisbett & Wilson, ; Wilson & Nisbett, ; Wilson & Schooler, ). When sharing information with others, multiple social, emotional, and cognitive factors are integrated in the brain to navigate each social interaction, sometimes outside of conscious awareness. Consequently, adding measures of neural activity to a battery of behavioral measures and computational approaches can help triangulate the underlying mechanisms that drive why and how people share, and can increase the predictive capacity of our models of what gets shared and when.

We define interpersonal information sharing broadly in terms of facts, ideas, preferences, and knowledge that are communicated from a sharer to a receiver in a single interaction. In addition, although multiple external factors influence sharing, this chapter is particularly concerned with the basic psychological and neurocognitive mechanisms that motivate individual sharing decisions. We argue that there is a set of basic neurocognitive mechanisms that is likely to be important across diverse sharing contexts, even if the specific inputs to these processes vary across situations. Likewise, in our discussion of virality (a characteristic of information that is massively shared), we do not make a strong distinction between the notions of popularity (i.e., a large number of independent sharing events) and structural virality (i.e., retransmission from person to person through long propagation chains; see Goel, Anderson, Hofman, & Watts, ), but rather focus on neurocognitive mechanisms that are likely common across the individual decisions comprising each set of effects.

In sum, this chapter offers a review of (1) how sharing decisions are computed in the brain; (2) the role of neural processing in the creation of downstream outcomes of sharing, including information reach (the number of exposures to information in a population or group) and information impact (the effects of shared information on the interactions, behaviors, or attitudes of those who are exposed to it); (3) the effects of contextual factors such as social network structure and individual differences on these processes; and (4) opportunities and limits for productive interaction between neuroscience and other methodological traditions.

. N B  S D: V-B V

What happens in a person's brain during initial exposure to information, and what is it about this neural activity that generates the decision to share with others? We recently integrated existing evidence from social, affective, and cognitive neuroscience to propose



    .  Dorsal MPFC

Right TPJ Right STS

VS

Social cognition

Information

PCC MPFC

Ventral MPFC

Value-related processing

Self-related processing

 . Value-based virality framework. All neural regions of interest depicted here are derived from meta-analyses (Bartra, McGuire, & Kable, ; Murray, Schaer, & Debbané, ) or large-scale studies (Dufour et al., ) of the respective subject area. TPJ = Temporo-parietal junction, MPFC = Medial prefrontal cortex, STS = superior temporal lobe, PCC = posterior cingulate cortex, VS = Ventral striatum

a model of the processes that lead to the decision to share, called the value-based virality framework (Scholz, Baek, O’Donnell, Kim, et al., ). Value-based virality is centered on the sharer’s perceived value of sharing information with others, which is represented in the brain’s valuation and reward system. The higher the perceived value of sharing a piece of information, the more likely it is that it will in fact be shared with others. In addition, to the extent that this value computation is similar across people, information with higher perceived sharing value in the brain is more likely to gain virality in a larger population. Value-based virality further predicts that sharing value is determined based on two key inputs, namely expectations about self-related and social outcomes of sharing. Neural systems supporting self-related processing, social cognition, and valuation have been identified in extensive prior work (see Figure .). This model unifies and extends existing knowledge by suggesting a parsimonious theoretical framework that connects neural systems and associated psychological processes highlighted in prior empirical and theoretical work on virality (Berger, ; Cappella et al., ; Falk, Morelli, Welborn, Dambacher, & Lieberman, ; Meshi, Tamir, & Heekeren, ; Tamir, Zaki, & Mitchell, ) and further posits a clear structure detailing how these mechanisms work together to create sharing decisions.
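The core computation posited by the framework (expected self-related and social outcomes integrated into a single, domain-general value signal that drives the sharing decision) can be made concrete with a toy model. The sketch below is purely illustrative: the weighted-sum integration, the logistic link, and every parameter value are our assumptions for exposition, not quantities specified or estimated in the studies reviewed here.

```python
import math

def sharing_value(self_related, social, w_self=1.0, w_social=1.0):
    """Toy domain-general value signal: a weighted integration of
    expected self-related and social outcomes of sharing.
    Weights are illustrative assumptions, not estimated parameters."""
    return w_self * self_related + w_social * social

def p_share(value, bias=-1.0):
    """Map the integrated value signal to a sharing probability via a
    logistic link (again, an expository choice, not the authors' model)."""
    return 1.0 / (1.0 + math.exp(-(value + bias)))

# Content expected to reflect well on the sharer AND to please the audience
# receives a higher sharing probability than content scoring low on both.
high = p_share(sharing_value(self_related=1.5, social=1.2))
low = p_share(sharing_value(self_related=0.1, social=0.2))
print(round(high, 2), round(low, 2))  # high > low
```

On this toy account, virality emerges when many receivers compute similarly high values for the same content, so the same signal that predicts one person's decision also tracks population-level sharing.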

Valuation

The brain's valuation and reward system is the centerpiece of the value-based virality framework, which proposes a direct link between information-sharing value and individual sharing decisions/virality. A general psychological principle describes the

   



tendency to seek pleasure or reward and to avoid pain or punishment (Elliot, ; Lewin, ). When deciding whether or not to share content with others, an individual is likely to consider the potential positive and negative outcomes of sharing from various perspectives. The notion of a central role for positive valuation or reward in sharing first received support in a neuroimaging study in which one group of participants (referred to as the “interns” because they were asked to pretend to be interns at a TV studio) was exposed to a set of new TV show ideas and asked which ones they would recommend to a producer. A second group of participants (the “producers”) then saw videos in which the “interns” described the shows and were subsequently asked whether they would further recommend each show (Falk et al., ). The shows that the “interns” shared most successfully (i.e., those most popular with the “producers”) were associated with the strongest activations in the value system of the interns' brains when they first learned about the shows. Another recent study suggested that merely sharing information with others produces neural activity in the brain's reward system; study participants were even willing to forgo monetary rewards for the opportunity to share information with others (Tamir et al., ).

How do individuals decide whether information has high sharing value? Value-based virality suggests that people weigh the advantages and disadvantages of sharing, given its expected self-related and social implications. For instance, sharers might wonder whether sharing a piece of information will make them look smart, well informed, or “cool,” or whether the shared content will lead to positive or negative interactions or relationships with others.
To make a final sharing decision, these different types of considerations need to be consolidated into an overall judgment of whether sharing will have net positive/rewarding or negative/punishing consequences. Neuroimaging studies suggest that human brains are well suited for such a computation. There is strong evidence that different kinds of value (e.g., primary, secondary, self-related, and social) are integrated within a general valuation system that includes the ventral striatum (VS) and ventromedial prefrontal cortex (VMPFC) (for a meta-analysis, see Bartra, McGuire, & Kable, ). This system is thought to translate the value of different types of inputs onto a common scale, generating a domain-general value signal that allows direct comparisons between diverse stimuli (Levy & Glimcher, ). Value-based virality suggests that this mechanism also allows those exposed to information to weigh the pros and cons of sharing on different dimensions, such as self-related and social value, and to integrate them into a domain-general information-sharing value signal that is directly linked to individual sharing decisions and virality.

. S-R P

To achieve a high sharing value, information first needs to resonate with its primary receiver. Indeed, in the study just described, “interns” (i.e., primary receivers) were more likely to self-report a high likelihood of sharing when their brains were engaged in



    . 

self-related processing (medial prefrontal cortex/MPFC and posterior cingulate cortex/PCC) during initial information exposure (Falk et al., ). In functional neuroimaging, neural correlates of self-related thought have been identified by asking participants to judge whether stimuli such as personality traits describe them or not (e.g., Murray, Schaer, & Debbané, ; Northoff et al., ). These studies routinely find that activations within the MPFC and PCC increase during self-relevance judgments, relative to judgments that do not require self-related processing.

When making sharing decisions, a range of self-related processes might unfold in a sharer. Information might be perceived as self-relevant, that is, important for the sharer's life, interests, goals, or ideals. Another possibility is that self-related processing is involved in sharing decisions because sharers consider self-enhancement motives. The aim of maintaining a positive image in front of others is a key motive of human interaction (Mezulis, Abramson, Hyde, & Hankin, ) and is thought to be a central driver of interpersonal sharing (Berger, ; Cappella et al., ). Information that, if shared, would reflect positively on the sharer (for example, by demonstrating that the individual is concerned about others, well informed, or high-performing in some domain) should thus have higher sharing value. Indeed, in addition to its association with sharing behavior, sharing self-relevant information has been shown to activate the brain's reward and valuation system (Tamir & Mitchell, ). Consequently, value-based virality suggests that self-related processing is an important input to the calculation of information-sharing value, such that expectations of more positive outcomes of sharing for one's self-image will increase valuation.

. S C

Sharing is by definition a social process. Value-based virality thus argues that, in addition to considering self-related outcomes of sharing, sharers also engage in social cognition when determining information-sharing value. This argument receives support from research on audience tuning, which shows that sharers adjust both the content and wording of their messages depending on audience characteristics such as knowledge or opinions (Barasch & Berger, ; Clark & Schaefer, ; Marwick & Boyd, ). In other words, sharers draw on audience characteristics, possibly to predict the audience's reactions and thoughts if they were to share information with them. This type of social processing is a form of mentalizing (i.e., thinking about the thoughts and mental states of others). The brain's mentalizing system includes the bilateral temporo-parietal junction (TPJ), right superior temporal sulcus (STS), dorsal MPFC (along with other subregions of the MPFC), and PCC, and tends to be activated when people consider what others might know, believe, or desire (Dufour et al., ). Results from the study of “interns” and “producers” described previously show that successful ideas engaged not only the brain's valuation system but also typical mentalizing regions as “interns” were first exposed to each TV show idea

   



(Falk et al., ). In addition, prior work supports a direct link between expectations of social rewards (e.g., in the form of approval) and activity in the brain’s valuation system (Fehr & Camerer, ; Rademacher et al., ). Consequently, value-based virality proposes that determining the impact of information on social connections can be described as an instance of mentalizing, in which the sharer considers whether sharing might lead to favorable or valued social outcomes based on knowledge, needs, desires, and potential reactions of the audience. If desirable social outcomes are expected, information-sharing value will be higher.

. E S  V-B V

We recently tested the value-based virality model empirically in a study of the real-world, population-level retransmission of New York Times articles. In this study, participants were shown the abstracts and headlines of New York Times articles in three experimental conditions. Specifically, respondents thought about whether to share the article with others (either on their Facebook wall or privately with one Facebook friend), thought about whether to read the full text themselves, or were asked to identify the main topic of the article (see Figure .). We found support for the involvement of self-related, social, and value-related neural systems in sharing decisions (relative to the other types of decisions) in our study participants (Baek, Scholz, O'Donnell, & Falk, ; see Figure .A). Activity in the valuation system, the self-related processing system, and regions commonly associated

[Figure: example fMRI trial. Screens: condition cue (“Read: Yourself”), article headline and abstract (“Most Food Illnesses Come From Greens. Leafy vegetables like lettuce, not dreaded spoiled shellfish, cause the most food-borne illness, with contaminated poultry being responsible for the most deaths.”), fixation cross, and a rating scale from 1 (“Very Unlikely”) to 5 (“Very Likely”).]

FIGURE. Experimental design of the New York Times study (Baek, Scholz, O'Donnell, & Falk, ; Scholz et al., ). In each trial of the fMRI task, participants were first told which condition they were in (read yourself; share with others via the Facebook wall, which is pictured here, or with one Facebook friend; or determine article content), then read the headline and abstract of a New York Times article, before answering a question in accordance with the respective condition.



    . 

(a)

(b) 0.3

0.2

*** *** ***

0.1

0 Valuation

Self-related processing

Social cognition

Articles ranked by population-level retransmission counts

60 50 40 30 20 0

20

40

60

80

Articles ranked by activity in the brain’s value sysytem

 . Results of the New York Times Study. A) More neural activity in value-related, self-related, and social cognition regions was observed when participants thought about sharing an article with others than when they were asked to determine the article’s main topic in the control condition (Baek et al., ). B) Neural activity in study participants (N = ) extracted from the brain’s valuation system during exposure to article headlines and abstracts (N =  article) predicted population-level retransmission counts of New York Times articles (N > , shares; Scholz et al., ). ***p < 0.001B = 3.81 (SE = 1.12), p = 0.001

with mentalizing while participants were exposed to the article headlines and abstracts was also significantly positively related to participants’ self-reported intention to share each article with others. Further, whole-brain analyses showed that the effects were most robust in the hypothesized brain systems, reiterating the central role of these three processes in sharing. Next, when looking at the reading condition, which is closest to the natural situation of a reader browsing the homepage of the New York Times, we found support for the mediation model outlined in Figure . when predicting population-level virality (Scholz, Baek, O’Donnell, Kim, et al., ). Specifically, neural data from the small group of imaged participants, extracted while each article headline and abstract was presented, were linked to indicators of population-level virality (number of shares through Facebook, Twitter, and email) derived using the New York Times API (application programming interface) and totaling over , shares. Results from path analyses support the predictions of value-based virality. That is, activity in both the self-related and social cognition systems during initial article exposure was significantly associated with value-related processing. Activity in the valuation system in the imaged participants, in turn, was related to an article’s number of shares in the larger population of New York Times readers (Figure .B) and acted as a mediator for the effects of social cognition and self-related processing on virality. Encouragingly, these results were replicated in a second set of participants who performed a similar task using the same articles, strengthening the evidence for value-based virality. In sum, empirical evidence for value-based virality supports a parsimonious model of decisions to share information with others, in which a domain-general information-sharing
value signal integrates inputs from both self-related and socially relevant cognitions about the act of sharing the information. This domain-general value signal then directly relates to virality, as has been shown for the population of readers of the online New York Times. Further, the fact that neural activity in a small group of people can predict population-level outcomes suggests that large groups of individuals can arrive at similar sharing values for the same information, possibly due to similar social motives and values within a culture.
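The path-analytic logic just described can be made concrete with a product-of-coefficients mediation test: activity in a social-cognition region predicts valuation activity, which in turn predicts (log) share counts. The sketch below runs on simulated data with invented effect sizes and variable names; it illustrates the statistical form of the analysis, not the authors' actual pipeline.

```python
# Illustrative mediation sketch (simulated data, hypothetical effect
# sizes): social-cognition activity -> valuation activity -> shares.
import numpy as np

def ols(y, *preds):
    """Least-squares coefficients for the predictors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *preds])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

rng = np.random.default_rng(0)
n = 80  # hypothetical number of articles

social = rng.normal(size=n)                                   # path input
valuation = 0.6 * social + rng.normal(scale=0.8, size=n)      # mediator
log_shares = 0.5 * valuation + rng.normal(scale=0.8, size=n)  # outcome

a = ols(valuation, social)[0]              # path a: predictor -> mediator
b = ols(log_shares, valuation, social)[0]  # path b: mediator -> outcome
indirect = a * b                           # mediated (indirect) effect

# Percentile-bootstrap confidence interval for the indirect effect
boot = []
for _ in range(2000):
    i = rng.integers(0, n, n)
    boot.append(ols(valuation[i], social[i])[0]
                * ols(log_shares[i], valuation[i], social[i])[0])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect = {indirect:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```

The bootstrap CI is the conventional way to test an indirect effect, since the product of two coefficients is not normally distributed.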

Once Information Is Shared: Reach and Impact

Value-based virality is a neurocognitive model of sharing decisions, which in turn impact how widely information is shared, termed virality or reach. Measures of reach include the total number of shares or the depth of penetration into a network (i.e., the length of a propagation chain). A full discussion of the factors that differentially influence each of these dimensions of reach is beyond the scope of this chapter. Here we assume that similar basic neurocognitive processes drive individual decisions in both broad and deep chains, across communication channels.1 That is, while the specific type and scope of considerations that go into a sharing decision might differ at different locations in a propagation chain, we assume that the basic neurocognitive processes of self-related, social, and value-related considerations are central drivers across these contexts. Once information is shared, downstream outcomes encompass information impact. Measures of impact include behavior, attitude, or intention change in response to information exposure. As with reach, a full discussion of the multiple factors that influence impact is beyond the scope of this chapter. Instead, we focus on specific relations between impact and the neurocognitive antecedents of sharing. Both reach and impact are determined in part by the sharers themselves, their audiences, and the communication between the two.
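The two reach measures mentioned above can be illustrated on a toy sharing cascade represented as child-to-parent links: total reach is the number of reshares, and depth is the length of the longest propagation chain. This is a generic sketch; the cascade data are invented.

```python
# Two common reach measures for a sharing cascade, represented as
# child -> parent links (the seed user maps to None). Toy data.
def cascade_reach(parents):
    """Return (total reshares, depth of the longest propagation chain)."""
    def depth(node):
        d = 0
        while parents[node] is not None:
            node = parents[node]
            d += 1
        return d
    size = len(parents) - 1  # everyone except the seed reshared
    max_depth = max(depth(n) for n in parents)
    return size, max_depth

# seed "a" -> b, c; b -> d; d -> e  (one chain of depth 3 plus a branch)
cascade = {"a": None, "b": "a", "c": "a", "d": "b", "e": "d"}
print(cascade_reach(cascade))  # (4, 3)
```

The distinction matters because the same share count can come from one broad burst around the seed or from a long, deep chain of person-to-person retransmissions.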

Sharers

Sharers can play at least two distinct roles in a propagation chain. First, they can influence audience members. Second, they might engage more intensively with information as a result of sharing it, which can increase the information’s impact on the sharers themselves. Existing neuroimaging work has mainly focused on the former, by examining what is shared (as described previously; Baek et al., ; Scholz, Baek, O’Donnell, Kim, et al., ) and who is persuasive. Specifically, mentalizing activity in sharers is associated with greater persuasiveness, or the ability of sharers to convince their audience of their own opinions about information. For instance, two studies showed increased activation
in the mentalizing system in salespeople with superior sales skills (Dietvorst et al., ) and in participants (“interns”) who were more successful in convincing other participants (“producers”) of their opinions about TV show ideas (Falk et al., ). In conjunction with the work supporting the role of mentalizing in value-based virality, these findings may suggest overlap in the neural antecedents that support sharing decisions and persuasiveness once sharing has occurred. If sharers tend to share information that they expect will lead to positive outcomes (i.e., information with high sharing value), this may also make what they share more persuasive. Comparatively less is known about the impact of interpersonal sharing on the brains of the sharers themselves. Consistent with self-perception theory (Bem, ), discussing information can affect its impact on those involved in the conversation (David, Cappella, & Fishbein, ; Southwell & Yzer, ), including those who shared the information initially (Jeong, ). For instance, according to this view, recommending certain behaviors to others might increase a sharer’s likelihood of engaging in the same behaviors later. Consequently, additional research seeking to differentiate when and why sharers are more or less personally influenced by discussion of information can improve predictions of its overall impact on a population.

Audiences

Audiences can play at least two distinct roles in propagation chains. First, they may be conceptualized as passive receivers who are influenced by sharers. Second, they can be studied as active discussion participants who might influence the initial information sharer. A growing body of literature has described how information takes hold in the brains of receivers (for reviews see Cascio, Scholz, & Falk, ; Izuma, ), highlighting two key processes that increase susceptibility. First, elevated activity in the dorsal anterior cingulate cortex (ACC) and anterior insula (AI) is implicated in conflict detection and serves to signal when individuals are misaligned with others. This neural activity might underlie our sensitivity to social costs of rejection and can lead to conformity and realignment with the group (Berns, Capra, Moore, & Noussair, ; Tomlin, Nedic, Prentice, Holmes, & Cohen, ). Second, elevated activity in the brain’s positive value and reward system, including VS and VMPFC, highlights and rewards expected positive outcomes of conforming (Campbell-Meiklejohn, Bach, Roepstorff, Dolan, & Frith, ; Zaki, Schirmer, & Mitchell, ). Note that a similar valuation circuit has also been implicated in the computation of sharing decisions, as previously described. Translating these findings to the domain of sharing decisions, researchers who have studied susceptibility to social influence on interpersonal sharing decisions have found associations with both neural activity implicated in general susceptibility to influence and activity associated with successful/persuasive sharing. For example, a series of studies examined brain activity as participants learned about and recommended mobile game applications to others in the presence of peer feedback (as might be available through a recommender system on a mobile gaming website). Increased
activity in the brain’s valuation system (VS and VMPFC) when receiving group feedback (i.e., social influence) about the group’s initial recommendations was associated with increased conformity to peer recommendations (Cascio, O’Donnell, Bayer, Tinney, & Falk, ). That is, expected positive social outcomes might have motivated the observed peer-conforming recommendation behavior. In addition, participants who conformed more frequently, on average, showed increased activity in the mentalizing system. This activity might have originated in participants’ consideration of why others had provided recommendations that differed from their own. Note that activity in the mentalizing system also distinguished successful from unsuccessful sharers, as previously reviewed (Dietvorst et al., ; Falk et al., ). The extent to which the same underlying psychological processes drive the partial overlap in neural activations observed in successful sharers and those susceptible to influence remains an open question.2 Nevertheless, the boundaries between what motivates receivers to share and what motivates susceptibility to peer influence on sharing may not be clear-cut.

Sharer-Audience Interactions

One potential explanation for overlap in neural activity is the shared experience created when sharers and audiences engage in interpersonal communication. Another plausible reason is a causal dynamic in which, to be persuasive, sharers need to impact neural processing in receivers’ brains; extant research has not yet distinguished between these accounts.3 What has been shown is that, beyond isolated activation in the brains of either party, successful communication is associated with increased correlation between the time series of neural activity observed in a sharer and in the audience. This includes both sensory and higher-order processing systems in the brain (e.g., those implicated in speech production and comprehension, Silbert, Honey, Simony, Poeppel, & Hasson, ; and in mentalizing and self-related processing, Stephens, Silbert, & Hasson, ). Further, greater anticipatory coupling, that is, the extent to which neural activity in an audience is correlated with future neural activity of a speaker (potentially due to predictions made about what will be said next), is associated with more successful communication (Stephens et al., ).
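The anticipatory-coupling idea can be sketched as a lagged correlation: shift one person's regional time series relative to the other's and find the offset at which they correlate most strongly. The example below uses synthetic signals in which the listener "anticipates" the speaker by two time points; it is a toy illustration of the statistic, not an fMRI analysis pipeline.

```python
# Toy lagged-coupling sketch on synthetic time series. A correlation
# peak at a positive lag means the listener's activity predicts the
# speaker's *future* activity (anticipatory coupling).
import numpy as np

def lagged_corr(listener, speaker, lag):
    """Pearson r between listener[t] and speaker[t + lag]."""
    if lag > 0:
        a, s = listener[:-lag], speaker[lag:]
    elif lag < 0:
        a, s = listener[-lag:], speaker[:lag]
    else:
        a, s = listener, speaker
    return float(np.corrcoef(a, s)[0, 1])

rng = np.random.default_rng(2)
t = 200
speaker = rng.normal(size=t)
# Listener tracks the speaker's signal two steps ahead, plus noise.
listener = np.concatenate([speaker[2:], rng.normal(size=2)])
listener = listener + 0.3 * rng.normal(size=t)

best = max(range(-5, 6), key=lambda k: lagged_corr(listener, speaker, k))
print("best lag:", best)  # expect 2 for this construction
```

In practice the same logic is applied voxel- or region-wise, with appropriate controls for autocorrelation in the hemodynamic signal.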

. S P  I   P

We have considered psychological mechanisms that underlie information sharing by looking at both individual-level outcomes such as correspondence between sharers and their audiences (Falk et al., ) and population-level outcomes such as the number of shares an article received from New York Times readers (Scholz, Baek, O’Donnell,
Kim, et al., ) or the number of tweets about a popular TV show episode (Dmochowski et al., ). These two levels of analysis roughly correspond to the propagation chain consisting of a few individuals on the one hand and the underlying population or sharing context on the other (see Figure .). Multiple studies now show significant relationships between these two dimensions. For instance, the extent to which neural activity during exposure to a TV show episode was correlated between individual study participants predicted scene-by-scene tweet volume about that episode in the population of Twitter users (Dmochowski et al., ). Likewise, even though sharing outcomes in individuals and populations are assessed using different tools, there is some evidence that the psychological processes underlying interpersonal sharing at the individual level and population-level virality overlap. Specifically, as described previously, similar neural responses to New York Times article headlines and abstracts are associated with individual sharing decisions (Baek et al., ; Figure .A) and population-level sharing rates of the same articles in two separate samples (Scholz, Baek, O’Donnell, Kim, et al., ; Figure .B). As such, although the specific inputs to the computation of self-relevance, social relevance, and value, which in turn inform sharing decisions, almost certainly differ depending on contextual factors (e.g., personal characteristics of elite and lay sharers, Katz & Lazarsfeld, ; time, Rogers, ; and broader structural features such as social norms and cultural contexts), these divergent inputs stemming from sources at multiple levels of analysis likely feed into very similar basic processes that drive individual decisions in the brain.
Thus, although neuroimaging studies typically rely on relatively small (though increasing) sample sizes, components of population-level virality and its underlying psychological processes can be studied by examining individual-level propagation chains. In doing so, differences in personal traits and social environments can be studied as moderators of self, social, and valuation processes most relevant to sharing.
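The scene-level logic behind results like the TV-episode study can be sketched as an intersubject correlation (ISC) computation: the mean pairwise correlation of viewers' neural time series, which can then be compared across scenes or related to per-scene tweet volume. The data below are synthetic, and the shared-signal construction is purely illustrative.

```python
# Sketch of intersubject correlation (ISC): mean pairwise Pearson r
# across subjects' time series (rows = subjects). Synthetic data.
from itertools import combinations
import numpy as np

def isc(data):
    """Mean pairwise correlation across subjects for one scene."""
    rs = [np.corrcoef(data[i], data[j])[0, 1]
          for i, j in combinations(range(len(data)), 2)]
    return float(np.mean(rs))

rng = np.random.default_rng(3)
t = 100
shared = rng.normal(size=t)  # stimulus-driven signal all viewers share
engaging_scene = shared + 0.5 * rng.normal(size=(5, t))  # 5 viewers
dull_scene = rng.normal(size=(5, t))  # idiosyncratic responses only

print(f"ISC engaging: {isc(engaging_scene):.2f}, dull: {isc(dull_scene):.2f}")
```

Scenes that drive viewers' brains in lockstep (high ISC) are, on this account, the ones more likely to prompt a population-level response such as a spike in tweets.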

. S C  M  S P

Sharing contexts (see Figure .) are shaped by characteristics of audiences, of the sharer, of the original content, of the communication channel or medium used for sharing, and of the larger cultural context in which sharing takes place. Each of these contextual factors may modulate the relationship between brain activity and sharing decisions or outcomes, for instance by affecting the weight placed on expected social outcomes or self-related consequences, and hence the overall value of sharing.

Audience Characteristics

Audience characteristics as basic as size (i.e., number of audience members) can affect neural mechanisms of interpersonal sharing and virality. For example, one study
examined the neural correlates of sharing with a large audience (one’s entire Facebook wall, labeled broadcasting) or a small audience (one specific Facebook friend, labeled narrowcasting) (Scholz, Baek, O’Donnell, & Falk, ). Although narrow- and broadcasting were both associated with activity in the self-related and social brain regions depicted in Figure ., the narrowcasting condition showed significantly stronger involvement of both systems than did broadcasting. More intensive processing while narrowcasting might be caused by a more vivid and concrete representation of the audience in these situations. If so, potential downstream effects might include more effective tailoring of shared information to specific, small audiences and more favorable sharing outcomes during narrowcasting. On the receiving end, several neuroimaging studies now indicate that individuals systematically differ in their susceptibility to social influence. This may affect information sharing by altering neural processes during the reception of information. In turn, these differences in receivers of shared messages can affect downstream processes in the propagation chain when a receiver decides to further retransmit the shared information (Cascio, O’Donnell, et al., ). Likewise, other audience characteristics may affect information-sharing value by altering the expected social (e.g., likelihood of approval given group opinions) and self-related (e.g., the aspect of identity a sharer wants to present to a given group) outcomes of sharing.

Sharer Characteristics

Characteristics such as personality traits and a sharer’s position in that individual’s social network can influence both the reach and impact of information. As mentioned previously, two studies suggest that sharers differ in their ability to convince others of their own opinions about information and that this ability correlates positively with the extent of social processing during sharing (Dietvorst et al., ; Falk et al., ). Interestingly, two recent studies have identified relationships between neural indicators of persuasiveness and a sharer’s position in that person’s ego-network. First, a study of male teens suggests that those with higher betweenness in their ego-networks—that is, those who connect many of their friends who would otherwise not be directly connected—engaged in more social processing (right TPJ, PCC, and dorsal MPFC) while making recommendations about mobile game applications to peers. This activity might signify a higher tendency to consider the mental states of others during sharing (O’Donnell, Bayer, Cascio, & Falk, ). Further, a second study found that individuals who were more popular in their social networks showed higher sensitivity to status differences of others, as indicated by stronger effects of others’ popularity on activity in their valuation systems (VS, ventral VMPFC, amygdala). In addition, these individuals made more accurate predictions about how others in their networks perceived them (Zerubavel, Bearman, Weber, & Ochsner, ). In sum, personality and social network position may affect key sharing processes, though more research is required to fully understand these relationships and determine causal directions.



    . 

Content Characteristics

Many of the individual effects that make up the current corpus of neuroscientific knowledge about information sharing have been studied within rather narrow topics such as health-related New York Times articles (Scholz, Baek, O’Donnell, Kim, et al., ) and TV show descriptions (Falk et al., ). Replication studies using stimuli from different content areas are needed to properly describe content sensitivity (if any) of the effects described in this chapter. One of the mechanisms by which content characteristics might affect sharing is through altering the information-sharing value profile. For instance, positively valenced information may be more likely to be shared in order to avoid communicating a negative image of oneself to others (Berger, ). That is, the same piece of information framed in terms of its potential positive outcomes might be more likely to engage increased activity in the self-relevance system of the brain and subsequently increase information-sharing value signals that affect sharing likelihood. Another interesting domain is dynamic changes in content and content characteristics that are due to editing and social annotations in the form of comments, recommendations, or ridicule, which might be applied to information as it moves step by step through a propagation chain (see Figure .). Recent work shows that this kind of content mutation occurs frequently in online sharing (Adamic, Lento, Adar, & Ng, ), suggesting that the same piece of information might show variation in its sharing value throughout its progression through a social network or population.

Communication Channel Characteristics

Most of the studies presented here were restricted to a specific mode of communication between sharers and their audiences, such as Twitter (Dmochowski et al., ), Facebook (Scholz, Baek, O’Donnell, Kim, et al., ), or video messages (Falk et al., ). The specific communication channel chosen by sharers both enables and constrains the possibilities for sharing, reactions to shared information, and ensuing dialogue (Meshi et al., ). For instance, complex topics might have higher sharing value in face-to-face rather than text-messaging contexts due to the greater potential for follow-up discussion and explanation. Studying the variability of the neural processes of sharing across different channels is thus likely to uncover interesting dependencies and possibly new, unexpected mechanisms that will help us triangulate more comprehensive theories of sharing. More broadly, as briefly mentioned before, an important characteristic of information is whether it originates from mass media or interpersonal sources (corresponding to different steps in the propagation chain shown in Figure .). Communication scientists have demonstrated that information sources can differ in trustworthiness and persuasiveness (Hesse et al., ; Katz & Lazarsfeld, ), among other characteristics,
and work on the diffusion of innovations suggests that the relative importance of mass media and interpersonal sources may vary over time (Rogers, ). Indeed, there is a complicated interplay between mass media broadcasts and interpersonal communication, involving both mediating and moderating relationships (Southwell & Yzer, ; van den Putte, Yzer, Southwell, de Bruijn, & Willemsen, ). How these dynamics affect neural processes during sharing remains an open question. Nevertheless, as mentioned previously, here we make the assumption that the basic psychological building blocks (self-related, social, and value-related considerations; see Figure .) are useful in evaluating information from any source. The specific input to each of these computations and their relative importance, on the other hand, might differ substantially.

Culture

Finally, cultural characteristics are known to affect social interactions as well as the flow of information in numerous ways (e.g., Rogers, ; Triandis, ), yet the neural mechanisms of sharing have almost exclusively been studied in American college students. To provide an example of a possible hypothesis: in cultures with more independent self-construals that emphasize the individual over the group (Hofstede, Hofstede, & Minkov, ), sharers might rely less on perceived social outcomes when estimating information-sharing value than do sharers in collectivistic cultures, which emphasize groups over individuals.

Strengths and Limitations of Neuroimaging for the Study of Viral Information

As illustrated in this chapter, neuroimaging affords key strengths that complement the existing toolbox of sharing and virality researchers, as has been argued effectively elsewhere for the fields of marketing, economics, communication, and decision-making (Falk, Cascio, & Coronel, ; Kable, ; Plassmann, Venkatraman, Huettel, & Yoon, ). With regard to the study of virality, two critical advantages of incorporating neuroimaging methods into conventional study designs are improved measurement and prediction and enhanced theory development.

Measurement and Prediction

Neuroimaging affords the ability to capture multiple psychological processes as they occur. As such, the addition of neuroimaging to the methods repertoire of sharing and
virality researchers can help to increase the predictive power of explanatory models (Berkman & Falk, ). For example, variation in neural responses to stimuli such as advertisements (e.g., anti-smoking messages) predicts individual-level behavior (e.g., quitting smoking) as well as population-level behavior (e.g., calls to a tobacco quitline) over and above conventionally used self-report measures (Falk, Berkman, & Lieberman, ). Similar results have been documented in diverse contexts such as sunscreen use, smoking cessation, physical activity, and music purchases (Berns & Moore, ; Cascio, Dal Cin, & Falk, ; Falk, O’Donnell, et al., ; Falk et al., ; Falk, Berkman, Mann, Harrison, & Lieberman, ; Falk, Berkman, Whalen, & Lieberman, ). In this chapter we have reviewed preliminary evidence that similar techniques can be applied to the sharing of news articles (Baek et al., ; Scholz, Baek, O’Donnell, Kim, et al., ); however, this only begins to scratch the surface of what is possible.

Theory Development

Neuroimaging techniques can also generate novel theoretical insights that are difficult to access otherwise. For example, although it can be hard for both laypersons and researchers to identify overlap between two phenomenologically different experiences, seemingly distinct processes are sometimes supported by the same neural structures and networks (Lieberman, ). In the realm of sharing and virality, one analysis conducted on the New York Times study mentioned previously (see Figure .) uncovered, somewhat unexpectedly, substantial overlap between the neural processes that support sharing and the selection of content for private consumption (Baek et al., ). Specifically, similar to decisions to share an article (see Figure .A), decisions to read the article oneself were also associated (though to a lesser extent) with neural activity in brain systems that support assessing the self-related and social outcomes and overall value of sharing. Similarly, neuroimaging can be used to dissociate core processes from one another by demonstrating activation of distinct regions or neural networks in reaction to two types of stimuli or between two groups. Researchers found that mentalizing, which involves consideration of the thoughts and beliefs of others, distinguished skilled sharers from those who are less successful in convincing others of their own opinions about shared information (Dietvorst et al., ; Falk et al., ). Dietvorst and colleagues showed that those professional salespeople in their sample who scored higher on a skill called adaptive selling, in which the salesperson adapts the interaction strategy to situational constraints such as the customer’s needs and preferences, also showed more activity in the mentalizing system during an fMRI task.
In the study by Falk and others discussed previously, “interns” who were more successful in convincing “producers” of their opinions about TV shows mentalized more overall during their first exposure to the show ideas. Neuroimaging can further be useful for hypothesis generation, given that it captures activity in the whole brain over time, corresponding to multiple different processes.
That is, in addition to observing neural activity in a priori identified regions of interest to test existing theory, activations in unexpected areas can spur further exploration, hypothesis generation, and subsequent theory testing. In sum, the addition of neuroimaging techniques to the behavioral and computational measures often used in virality research can have important impacts on our understanding of why and how people share. In parallel, adding computational social science and network perspectives to the neuroscience toolbox advances our understanding of brain function by providing clues about how specific regions or networks of regions create certain experiences or compute decisions (O’Donnell & Falk, ).

Limitations

A comprehensive discussion of the limitations of fMRI is available elsewhere (Poldrack, ). Here we highlight the correlational nature of most fMRI studies and the problem of reverse inference, because of their special relevance to the theoretical inferences that can be drawn from the work synthesized in this chapter. First, because fMRI is an observational technique that does not allow the controlled manipulation of brain activity, any relationships discovered between neural activation and subsequent outcomes such as information-sharing behavior are correlational, not causal. Tools such as transcranial direct current stimulation (TDCS) and transcranial magnetic stimulation (TMS), however, do allow the systematic alteration of neural activity in specific regions and can be used to establish causality with more confidence (Kable, ). Thus, promising candidate regions identified through fMRI that show strong relationships with an outcome of interest and that are theoretically meaningful can be examined using TDCS or TMS to establish causal order. In addition, researchers who use fMRI are in a better position to make causal claims regarding the origins of neural activation if it is observed in response to carefully controlled stimuli that are varied across experimental conditions. For instance, one study mentioned previously compared sharing with small (narrowcasting) and large (broadcasting) audiences and observed activation differences in MPFC, VS, and PCC, among other regions, that are most likely due to the experimental manipulation (Scholz, Baek, O’Donnell, & Falk, ). Second, reverse inference is a threat to the correct identification of psychological processes based on observed neural activations (Poldrack, ).
The same brain region can be involved in a variety of psychological processes at any given time, and fMRI does not necessarily allow researchers to determine which one is activated by their experiment or which one is related to their outcome of interest. Confidence in such reverse inferences can be systematically increased by carefully defining a priori hypotheses and identifying regions of interest that have previously been robustly or even selectively associated with a given cognitive process. Further, new resources allow neuroimagers to estimate the level of confidence in a given reverse inference. Based on data from large imaging databases such as www.neurosynth.org, researchers can estimate the proportion of studies in which the manipulation of a given psychological process activated the
region of interest (i.e., studies using forward inferences). For example, research on interpersonal sharing and virality can draw on self-related, social, and value-related processing, which have been studied extensively in social, affective, and cognitive neuroscience (see Figure .).
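The confidence calculation described above is Bayes' rule applied to forward-inference rates: how often a region activates when the process is engaged versus when it is not, weighted by the base rate of studies engaging that process. The numbers below are made-up placeholders for illustration, not values from any meta-analytic database.

```python
# Quantifying a reverse inference via Bayes' rule. All probabilities
# here are illustrative placeholders, not real database values.
def p_process_given_activation(p_act_given_proc, p_act_given_other, base_rate):
    """P(process | activation) from forward-inference activation rates."""
    p_act = (p_act_given_proc * base_rate
             + p_act_given_other * (1.0 - base_rate))
    return p_act_given_proc * base_rate / p_act

# Suppose a region activates in 80% of studies engaging the process of
# interest, in 30% of other studies, and 25% of studies engage it:
print(round(p_process_given_activation(0.8, 0.3, 0.25), 2))  # -> 0.47
```

The sketch makes the familiar point concrete: even a region that activates reliably under a process yields only modest reverse-inference confidence when that region also activates often in other tasks.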

Conclusion

The neuroscience of information sharing and virality has made exciting initial strides. One line of inquiry suggests a parsimonious theoretical framework of the psychological mechanisms that lead to the decision to share (Baek et al., ; Scholz, Baek, O’Donnell, Kim, et al., ). Others have begun to elucidate the mechanisms of social influence in sharing situations (Cascio, O’Donnell, et al., ) and of sharer-audience coupling and its relationship to successful communication (Stephens et al., ). Much more remains to be understood regarding the mechanisms that drive certain types of sharing behavior, especially regarding the interplay among the processes that have been identified so far. For instance: What is the relationship between the processes that drive initial decisions to share information and downstream effects such as the quality of conversations between sharers and their audiences? Is it possible to systematically increase the sharing value and virality potential of information by designing it in such a way that it is likely to engage neural activity in brain areas involved in sharing decisions? Recent trends in functional neuroimaging toward the integration of various methods such as computational social science and behavioral measures (O’Donnell & Falk, ) open the way for more complex and realistic studies that allow us to assess multiple processes simultaneously within a single experiment, as well as from multiple perspectives at the same time. In this chapter we have reviewed existing experimental paradigms and approaches to the neuroscientific study of sharing, though this young and dynamically developing field has substantial room for new, innovative paradigms that go well beyond what we have described here.
Together, this research will advance knowledge of why and how people share information with others and of the likely downstream impact of these processes on individuals, groups, and society at large.

N . We briefly discuss communication channels as moderators of these effects in the moderators section of this chapter. . See also the discussion on reverse inference in the limitations section of this chapter. . Research on social networks suggests that effective influencers and those susceptible to influence are rather distinct entities (Aral & Walker, ). Unfortunately, no extant studies consider both neural processes of sharing information and taking the role of an audience member. If it is broadly true that those who are susceptible to influence are not usually good influencers themselves, potential differences in neural processing of sharing




situations could give more specific insight into why massively shared content usually achieves popularity (i.e., many separate sharing instances of broadcast content) rather than structural virality (i.e., long propagation chains) (Goel, Anderson, Hofman, & Watts, ), though a full exploration of this idea is beyond the scope of this chapter.

R Adamic, L. A., Lento, T. M., Adar, E., & Ng, P. C. (). Information evolution in social networks. ACM Press. http://doi.org/./. Aral, S., & Walker, D. (). Identifying influential and susceptible members of social networks. Science, (), –. http://doi.org/./science. Baek, E. C., Scholz, C., O’Donnell, M. B., & Falk, E. B. (). The value of sharing information: A neural account of information transmission. Psychological Science, (), –. Barasch, A., & Berger, J. (). Broadcasting and narrowcasting: How audience size affects what people share. Journal of Marketing Research, (), –. http://doi.org/./ jmr.. Bartra, O., McGuire, J. T., & Kable, J. W. (). The valuation system: A coordinate-based meta-analysis of BOLD fMRI experiments examining neural correlates of subjective value. NeuroImage, , –. http://doi.org/./j.neuroimage... Bem, D. J. (). Self-perception theory. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. ). New York: Academic Press. Retrieved from http://www.dbem. ws/SP%Theory.pdf Berger, J. (). Word of mouth and interpersonal communication: A review and directions for future research. Journal of Consumer Psychology, (), –. http://doi.org/./ j.jcps... Berkman, E. T., & Falk, E. B. (). Beyond brain mapping using neural measures to predict real-world outcomes. Current Directions in Psychological Science, (), –. http://doi. org/./ Berns, G. S., Capra, C. M., Moore, S., & Noussair, C. (). Neural mechanisms of the influence of popularity on adolescent ratings of music. NeuroImage, (), –. http://doi.org/./j.neuroimage... Berns, G., & Moore, S. E. (). A neural predictor of cultural popularity. Available at SSRN . 
Retrieved from http://papers.ssrn.com/sol/papers.cfm?abstract_id= Bughin, J., Doogan, J., & Vetvik, O. J. (, April). A new way to measure word-of-mouth marketing. McKinsey Quarterly. Retrieved from http://vandymkting.typepad.com/files/ --mckinsey-a-new-way-to-measure-word-of-mouth.pdf Campbell-Meiklejohn, D. K., Bach, D. R., Roepstorff, A., Dolan, R. J., & Frith, C. D. (). How the opinion of others affects our valuation of objects. Current Biology, (), –. http://doi.org/./j.cub... Cappella, J. N., Kim, H. S., & Albarracín, D. (). Selection and transmission processes for information in the emerging media environment: Psychological motives and message characteristics. Media Psychology, , –. http://doi.org/./.. Cascio, C. N., Dal Cin, S., & Falk, E. B. (). Health communications: Predicting behavior change from the brain. In P. A. Hall (Ed.), Social neuroscience and public health (pp. –). New York, NY: Springer New York. Retrieved from http://link.springer.com/./---_



    . 

Cascio, C. N., O’Donnell, M. B., Bayer, J., Tinney, F. J., & Falk, E. B. (). Neural correlates of susceptibility to group opinions in online word-of-mouth recommendations. Journal of Marketing Research, (), –. http://doi.org/./jmr.. Cascio, C. N., Scholz, C., & Falk, E. B. (). Social influence and the brain: Persuasion, susceptibility to influence and retransmission. Current Opinion in Behavioral Sciences, , –. http://doi.org/./j.cobeha... Clark, H. H., & Schaefer, E. F. (). Contributing to discourse. Cognitive Science, (), –. http://doi.org/./scog_ Csibra, G., & Gergely, G. (). Natural pedagogy as evolutionary adaptation. Philosophical Transactions of the Royal Society of London B: Biological Sciences, (), –. http://doi.org/./rstb.. David, C., Cappella, J. N., & Fishbein, M. (). The social diffusion of influence among adolescents: Group interaction in a chat room environment about antidrug advertisements. Communication Theory, (), –. Dietvorst, R. C., Verbeke, W. J. M., Bagozzi, R. P., Yoon, C., Smits, M., & van der Lugt, A. (). A sales force–specific theory-of-mind scale: Tests of its validity by classical methods and functional magnetic resonance imaging. Journal of Marketing Research, (), –. http://doi.org/./jmkr... Dmochowski, J. P., Bezdek, M. A., Abelson, B. P., Johnson, J. S., Schumacher, E. H., & Parra, L. C. (). Audience preferences are predicted by temporal reliability of neural processing. Nature Communications, , . http://doi.org/./ncomms Dufour, N., Redcay, E., Young, L., Mavros, P. L., Moran, J. M., Triantafyllou, C., . . . Saxe, R. (). Similar brain activation during false belief tasks in a large sample of adults with and without autism. PLoS ONE, (), e. http://doi.org/./journal.pone. Elliot, A. J. (). Approach and avoidance motivation. In A. J. 
Elliot (Ed.), Handbook of approach and avoidance motivation (pp. –). New York, NY: Taylor & Francis. Falk, E. B., Berkman, E. T., & Lieberman, M. D. (). From neural responses to population behavior: Neural focus group predicts population-level media effects. Psychological Science, (), –. http://doi.org/./ Falk, E. B., Berkman, E. T., Mann, T., Harrison, B., & Lieberman, M. D. (). Predicting persuasion-induced behavior change from the brain. Journal of Neuroscience, (), –. http://doi.org/./JNEUROSCI.-. Falk, E. B., Berkman, E. T., Whalen, D., & Lieberman, M. D. (). Neural activity during health messaging predicts reductions in smoking above and beyond self-report. Health Psychology, (), –. http://doi.org/./a Falk, E. B., Cascio, C. N., & Coronel, J. C. (). Neural prediction of communicationrelevant outcomes. Communication Methods and Measures, (–), –. http://doi.org/ ./.. Falk, E. B., Morelli, S. A., Welborn, B. L., Dambacher, K., & Lieberman, M. D. (). Creating buzz: The neural correlates of effective message propagation. Psychological Science, (), –. http://doi.org/./ Falk, E. B., O’Donnell, M. B., Cascio, C. N., Tinney, F., Kang, Y., Lieberman, M. D., . . . Strecher, V. J. (). Self-affirmation alters the brain’s response to health messages and subsequent behavior change. Proceedings of the National Academy of Sciences, (), –. http://doi.org/./pnas. Fehr, E., & Camerer, C. F. (). Social neuroeconomics: The neural circuitry of social preferences. Trends in Cognitive Sciences, (), –. http://doi.org/./j.tics...

   



Goel, S., Anderson, A., Hofman, J., & Watts, D. J. (). The structural virality of online diffusion. Management Science, (), –. http://doi.org/./mnsc.. Hesse, B. W., Nelson, D. E., Kreps, G. L., Croyle, R. T., Arora, N. K., Rimer, B. K., & Viswanath, K. (). Trust and sources of health information: The impact of the internet and its implications for health care providers: Findings from the first health information national trends survey. Archives of Internal Medicine, (), –. http://doi.org/ ./archinte... Hofstede, G., Hofstede, G. J., & Minkov, M. (). Cultures and organizations: Software of the mind (Vol. ). Citeseer. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download? doi=....&rep=rep&type=pdf Izuma, K. (). The neural basis of social influence and attitude change. Current Opinion in Neurobiology, (), –. http://doi.org/./j.conb... Jeong, M. (). Sharing in the context of tobacco and e-cigarette communication: Determinants, consequences, and contingent effects (Doctorial dissertation, University of Pennsylvania). Jeong, M., & Bae, R. E. (). The effect of campaign-generated interpersonal communication on campaign-targeted health outcomes: A meta-analysis. Health Communication, (), –. https://www.tandfonline.com/doi/full/./.. Kable, J. W. (). The cognitive neuroscience toolkit for the neuroeconomist: A functional overview. Journal of Neuroscience, Psychology, and Economics, (), –. http://doi.org/ ./a Katz, E., & Lazarsfeld, P. F. (). Personal influence: The part played by people in the flow of mass communications. Glencoe, IL: Free Press Krikorian, R. (, August ). New Tweets per second record, and how! Twitter Official Blog. Retrieved from https://blog.twitter.com//new-tweets-per-second-record-andhow Krumpal, I. (). 
Determinants of social desirability bias in sensitive surveys: A literature review. Quality & Quantity, (), –. http://doi.org/./s--- Levy, D. J., & Glimcher, P. W. (). The root of all value: A neural common currency for choice. Current Opinion in Neurobiology, (), –. http://doi.org/./j. conb... Lewin, K. (). A dynamic theory of personality. Retrieved from http://psycnet.apa.org/ psycinfo/-- Lieberman, M. D. (). Social cognitive neuroscience. In S. T. Fiske, D. T. Gilbert, & G. Lindzey (Eds.), Handbook of social psychology (th ed., pp. –). New York, NY: McGraw-Hill. Luo, X. (). Consumer negative voice and firm-idiosyncratic stock returns. Journal of Marketing, (), –. http://doi.org/./jmkg... Luo, X. (). Quantifying the long-term impact of negative word of mouth on cash flows and stock prices. Marketing Science, (), –. http://doi.org/./ mksc.. Marwick, A. E., & Boyd, D. (). I tweet honestly, I tweet passionately: Twitter users, context collapse, and the imagined audience. New Media & Society, (), –. http://doi.org/ ./ Meshi, D., Tamir, D. I., & Heekeren, H. R. (). The emerging neuroscience of social media. Trends in Cognitive Sciences, (), –. http://doi.org/./j.tics... Mezulis, A. H., Abramson, L. Y., Hyde, J. S., & Hankin, B. L. (). Is there a universal positivity bias in attributions? A meta-analytic review of individual, developmental,



    . 

and cultural differences in the self-serving attributional bias. Psychological Bulletin, (), –. http://doi.org/./-... Murray, R. J., Schaer, M., & Debbané, M. (). Degrees of separation: A quantitative neuroimaging meta-analysis investigating self-specificity and shared neural activation between self- and other-reflection. Neuroscience & Biobehavioral Reviews, (), –. http://doi.org/./j.neubiorev... Nisbett, R. E., & Wilson, T. D. (). Telling more than we can know: Verbal reports on mental processes. Psychological Review, (), –. http://doi.org/./X... Northoff, G., Heinzel, A., de Greck, M., Bermpohl, F., Dobrowolny, H., & Panksepp, J. (). Self-referential processing in our brain—a meta-analysis of imaging studies on the self. Neuroimage, (), –. O’Donnell, M. B., Bayer, J. B., Cascio, C. N., & Falk, E. B. (). Neural bases of recommendations differ according to social network structure. Social, Cognitive, and Affective Neuroscience, (), –. O’Donnell, M. B., & Falk, E. B. (). Big data under the microscope and brains in social context: Integrating methods from computational social science and neuroscience. Annals of the American Academy of Political and Social Science, (), –. Plassmann, H., Venkatraman, V., Huettel, S., & Yoon, C. (). Consumer neuroscience: Applications, challenges, and possible solutions. Journal of Marketing Research, (), –. http://doi.org/./jmr.. Poldrack, R. A. (). The role of fMRI in cognitive neuroscience: Where do we stand? Current Opinion in Neurobiology, (), –. http://doi.org/./j.conb... Poldrack, R. A. (). Inferring mental states from neuroimaging data: From reverse inference to large-scale decoding. Neuron, (), –. Rademacher, L., Krach, S., Kohls, G., Irmak, A., Gründer, G., & Spreckelmeyer, K. N. (). 
Dissociation of neural networks for anticipation and consumption of monetary and social rewards. NeuroImage, (), –. http://doi.org/./j.neuroimage... Radicati Group. (, February). Email statistics report –, executive summary. Retrieved from http://www.radicati.com/wp/wp-content/uploads///Email-StatisticsReport---Executive-Summary.pdf Rao, L. (). Facebook: M people using messaging; more than B messages sent daily. Retrieved from http://techcrunch.com////facebook-m-people-using-messaging-more-than-b-messages-sent-daily/ Rogers, E. M. (). Diffusion of innovations. Simon and Schuster. Retrieved from https:// books.google.com/books?hl=en&lr=&id=viiQsBjIC&oi=fnd&pg=PR&dq=rogers +diffusion+of+innovations&ots=DKYxrIWpR&sig=YmppUVppFa-cPv-fB-RjGuRA Scholz, C., Baek, E. C., O’Donnell, M. B., & Falk, E. B. (). Decision-making about broadand narrowcasting: A neuroscientific perspective. Media Psychology, https://doi.org/ ./.. Scholz, C., Baek, E. C., O’Donnell, M. B., Kim, H. S., Cappella, J. N., & Falk, E. B. (). A neural model of valuation and information virality. Proceedings of the National Academy of Science, (), -. Silbert, L. J., Honey, C. J., Simony, E., Poeppel, D., Hasson, U. (). Coupled neural systems underlie the production and comprehension of naturalistic narrative speech. Proceedings of the National Academy of Sciences, (), –.

   



Southwell, B. G., & Yzer, M. C. (). The roles of interpersonal communication in mass media campaigns. Communication Yearbook, , . Stephens, G. J., Silbert, L. J., & Hasson, U. (). Speaker–listener neural coupling underlies successful communication. Proceedings of the National Academy of Sciences, (), –. Tamir, D. I., & Mitchell, J. P. (). Disclosing information about the self is intrinsically rewarding. Proceedings of the National Academy of Sciences, (), –. http://doi. org/./pnas. Tamir, D. I., Zaki, J., & Mitchell, J. P. (). Informing others is associated with behavioral and neural signatures of value. Journal of Experimental Psychology: General, (), –. http://doi.org/./xge Tomlin, D., Nedic, A., Prentice, D. A., Holmes, P., & Cohen, J. D. (). The neural substrates of social influence on decision making. PLoS ONE, (), e. http://doi.org/./ journal.pone. Triandis, H. C. (). Individualism-collectivism and personality. Journal of Personality, (), –. van den Putte, B., Yzer, M., Southwell, B. G., de Bruijn, G.-J., & Willemsen, M. C. (). Interpersonal communication as an indirect pathway for the effect of antismoking media content on smoking cessation. Journal of Health Communication, (), –. http:// doi.org/./.. Wilson, T. D., & Nisbett, R. E. (). The accuracy of verbal reports about the effects of stimuli on evaluations and behavior. Social Psychology, (), –. Wilson, T. D., & Schooler, J. W. (). Thinking too much: Introspection can reduce the quality of preferences and decisions. Journal of Personality and Social Psychology, (), –. http://doi.org/./-... Zaki, J., Schirmer, J., & Mitchell, J. P. (). Social influence modulates the neural computation of value. Psychological Science, (), –. Zerubavel, N., Bearman, P. S., Weber, J., & Ochsner, K. N. 
(). Neural mechanisms tracking popularity in real-world social networks. Proceedings of the National Academy of Sciences, (), –. http://doi.org/./pnas.

  .............................................................................................................

POLITICAL COMMUNICATION AND BEHAVIOR .............................................................................................................

  ......................................................................................................................

       ......................................................................................................................

 .  

T study of political behavior and its relationship to communication covers a wide range of theory and research, depending on whose behavior is of interest (e.g., political elites or the mass public), what kind of communication is of interest (e.g., interpersonal or mass mediated), and even what one means by “political” (e.g., campaigns and elections or social movements) and/or “behavior” (e.g., voting, protesting, or opinion formation and expression). Further complicating any effort to make broad generalizations about the state of the field are differences in researchers’ disciplinary moorings (among others, political science, communication, psychology, and sociology) and methodological preferences (ranging from quantitative approaches such as surveys, experiments, and more recently computational, data, and neurocognitive science, to qualitative social science approaches such as ethnography, to humanities-based approaches such as rhetoric and discourse analysis). Given the focus of the chapters in this section (and the handbook more generally), I limit my discussion to quantitative research focusing on one crucial and much-studied component of political communication and behavior: the engagement of citizens (broadly defined) in politics, with a particular emphasis on democratic engagement.1 My goal is to link long-standing normative, conceptual, and empirical issues regarding the theory and practice of citizen engagement to recent developments illustrated in the following chapters. To do so I first unpack the definition of democratic engagement and its presumed requisites and attributes. Next I explore the role of communication in the formation of these requisites and attributes, as well as their expression. I then turn to the question “What do we know?,” arguing that for a variety of reasons, the answer is a complicated and ultimately disappointing one. 
Finally, I discuss how developments in information and communication technologies are changing both the process of political engagement and the ways we study it, as exemplified by the chapters in this section.



 .  

Requisites and Attributes of Democratic Engagement

What constitutes a democratically engaged citizen? As I have written elsewhere (Delli Carpini, ), most normative theory and empirical research would include some combination of (1) adherence to democratic norms and values; (2) having a set of empirically grounded attitudes and beliefs about the nature of the political and social world; (3) holding stable (but changeable), consistent, and informed opinions on major public issues of the day; and (4) engaging in behaviors designed to directly or indirectly influence the quality of public life for oneself and others. Underlying all of these elements is the assumption that (5) citizens also have the skills, resources, and opportunities necessary to develop informed values, attitudes, and opinions; connect them together; and translate them into effective action. Research falling under the rubric "democratic norms and values" includes theories and research on such things as internal and external efficacy (e.g., Morrell, ), political (e.g., Mishler & Rose, ) and social (e.g., Uslaner, ) trust, political interest (e.g., Prior, ), civic duty (e.g., Poindexter & McCombs, ), and political tolerance (e.g., Gibson, ). These orientations are seen as providing the emotional and cognitive underpinnings necessary for engagement in public life that balance conflict with consensus, self-interest with collective interests, and a healthy skepticism with faith in the institutions and processes of governance. "Attitudes and beliefs" refer to one's overarching views about the social and political world in which we live and are distinguished from "opinions" in that they are more likely to form early in one's life, are less issue specific (i.e., can be applied to a variety of issues), and are arguably less amenable to short-term change (Delli Carpini, ).
Research on politically relevant attitudes and beliefs includes, for example, studies on the formation of one’s ideological orientation (e.g., Feldman & Johnston, ), partisanship (Bakker, Hopmann, & Persson, ), views on the relative importance of equality versus freedom (e.g., Canache, ), a sense of whether the world is a safe place (e.g., Morgan, Shanahan, & Signorielli, ), relative commitment to individual versus collective rights (Dalton, ), general notions about race and diversity (Kinder & Sanders, ), and so forth. Unlike democratic norms and values, there is no presumption that specific attitudes or beliefs are more or less beneficial; that is, being a conservative is not more or less preferable to being a liberal. This does not mean that they are equally reasoned or reasonable, however. Rather, the hope is that attitudes and beliefs—while containing an affective or emotional component—are also based on an accurate assessment of the empirical world. For example, if a person has a deep-seated commitment to a particular political party, one would expect that this commitment is based on some understanding of what this party stands for and how it relates to his or her own values, beliefs, and opinions.




If values, norms, attitudes, and beliefs form the foundation upon which engagement is based, “opinions” serve as the more proximate and concrete formulation of these orientations as they apply to specific issues, policies, candidates, officeholders, and the like. For example, if a person’s deep-seated attitudes lead him or her to identify as a conservative, one would expect, all things being equal, that this would be reflected in his or her opinions regarding specific issues, such as a potential tax increase, public financing of campaigns, affirmative action, and the like (e.g., Althaus, ). The holding of opinions—especially opinions that are stable, consistent, and informed—is a crucial element of the democratic process and of democratic citizenship. Equally or more important, however, is the “behavioral expression” of these opinions. Opinions can be expressed directly or indirectly. Direct expression includes talking informally with others (e.g., Jacobs, Cook, & Delli Carpini, ), participating in more formal deliberations and meetings (e.g., Fishkin, ), signing a petition (e.g., Puschmann, Bastos, & Schmidt, ), writing a letter to the editor (e.g., Reader, ), and contacting public officials (e.g., Cook, Page, & Moskowitz, ). Indirect expression includes other forms of political or civic activity, from voting, to membership in an organization, to volunteering in the community (e.g., Verba, Schlozman, & Brady, ). Developing foundational values and attitudes, connecting these to specific opinions, and expressing these opinions through appropriate forms of political and civic behavior requires a range of skills and resources. Included here are basic skills such as reasoning, argumentation, and oral and written communication, as well as resources such as knowledge or information about the substance, processes and people of politics, and public life (e.g., Delli Carpini & Keeter, ). 
Such skills and resources increase the likelihood not only that citizens will be engaged, but also that they will do so in effective ways that are connected to their self-interest and their sense of the public interest. In sum, a democratically engaged citizen is one who participates in civic and political life and who has the values, attitudes, opinions, skills, and resources to do so effectively.

. T R  C  D E

Determining what counts as politically relevant "media" or "communications" is no less complex than determining what constitutes an engaged citizen. At a minimum, one must distinguish between face-to-face and mediated communication; one-to-one, one-to-many, many-to-one, and many-to-many communications; types of media (telephones, mail, magazines, newspapers, radio, television, movies, the Internet); and "genres" (news, talk shows, opinion pieces or editorials, documentaries, drama or humor). Each of these types of communication has the potential to affect different aspects of democratic engagement (from foundational values and attitudes to specific civic and political behaviors) and different parts of the population (based on age, income,



 .  

gender, race, and ethnicity), and to do so in different ways. Adding to this complexity is that individual citizens do not limit their media use to single types or genres, but rather live within larger media, communications, or information environments. These environments are shaped in part by available technology, but also by factors such as one’s social, cultural, and economic circumstances, as well as more personal preferences and choices (e.g., Williams & Delli Carpini, ). Finally, the media can serve simultaneously as the channels through which information is transmitted and received, as the source of particular kinds of information, and increasingly, as the public space in which democratic engagement actually occurs. A minimal assumption regarding the ability of citizens to meaningfully engage in politics is the presence of an information environment in which citizens are able to learn about pressing issues of the day, follow the actions of elected and government officials, and communicate their views to these officials. As the demands on citizens increase, so too do the assumptions of a communications environment that can provide citizens with the motivation, ability, and opportunity to meet these demands. In turn, limitations in the communications environment (e.g., those associated with corporately owned, unregulated, and profit-driven media systems; see, e.g., McChesney, ) are often pointed to as a major reason that democratic practice falls short of normative expectations, while enhancements to this environment (e.g., publicly supported and publicly responsive journalism; see, e.g., Couldry, ; Curran, , ch. ) are held out as a way to improve this state of affairs. The implications of the political, economic, cultural, and especially technological changes occurring over the past several decades (discussed in more detail later in this chapter) have been less clear and more hotly debated. 
For example, social media have been praised for their ability to give voice and provide information to frequently marginalized issues and peoples (e.g., Benkler, ) and lamented for their role in the circulation of misinformation and even "fake news" (e.g., Silverman, ).

. W D W K?

It is only a slight exaggeration to say that all of the extensive quantitative social science research on the political behavior of citizens over the past seventy-five years has implicitly or explicitly been about answering four questions: What abilities, motivations, and opportunities are necessary and/or sufficient for citizens to engage meaningfully in politics? Do citizens (or some subset of them) have these attributes and opportunities? Why or why not? And ultimately, what difference does the presence or absence of such attributes and opportunities make to the practice and the outcomes of politics? It is also only a slight exaggeration to say that there are no generally agreed upon answers to these questions. This is not to deny the numerous theoretically grounded, methodologically sophisticated, and substantively informative studies that clearly exist. Collectively, however, John Zaller's () observation regarding the state




of public opinion research could apply equally today to research on political behavior and communication more broadly: Efforts at integration of research findings are uncommon in the public opinion field. With only a handful of exceptions, the trend is in the other direction—toward the multiplication of domain-specific concepts and distinctions . . . . The result of all this specialization is that the public opinion field has devolved into a collection of insular subliteratures that rarely communicate with one another. Hence, we know much more about the details of particular dependent variables than we do about theoretical mechanisms that span multiple research domains. (p. )

For example, numerous studies have found that media use is positively correlated with many core elements of democratic engagement, such as political interest, attention, knowledge, and participation (e.g., De Vreese & Boomgaarden, ; Kenski & Stroud, ; Norris, ; Tworzecki & Semetko, ). At the same time, there is at least as much evidence that media use can also foster cynicism, apathy, ignorance, misinformation, and disengagement (e.g., Cappella & Jamieson, ; Feldman et al., ; Gervais, ; Meirick, ; Putnam, ; Schuck, Boomgaarden, & de Vreese, ; Torcal & Maldonado, ). In addition, most studies on either side of this ledger find effects that are for the most part remarkably small and both context and content dependent. Why is this the case? To my mind, the best answer (or guidepost to an answer) emerges from another of Zaller’s writings, “The Myth of Massive Media Impact Revived: New Support for a Discredited Idea” (). In this piece Zaller argues that making the case for substantively significant media effects requires a clear theoretical model and the ability to specify the conditions under which the model applies, to convincingly demonstrate effects when these conditions are met, and to explain why we should be justified in generalizing from this evidence to similar though less observable effects under other conditions. He goes on to specify the conditions under which we should be able to observe large effects as being those that include significant variation in the content of communication and the ability to accurately measure the reception of this content (p. ). To this last point I add that “conditions” should also include more contextual factors, from the nature of the political and information environments to individual and group differences in abilities, motivations, and opportunities. 
Regarding theories, our field suffers not from a lack of them but from a plethora: agenda setting; priming; framing; elaboration likelihood, affective reasoning, and biased information processing; selective exposure, attention, and retention; the receive-accept-sample (RAS) model; and a host of variations on deliberative, persuasion, and cognitive processing theories. We also have little empirically supported agreement on the individual or structural conditions (or contexts) that are likely to lead to demonstrable and generalizable relationships between communication and engagement. And our ability to conceptualize and measure both relevant communication content and exposure to that content in reliable and valid ways remains a matter of significant debate. In short, despite the philosophy of science underpinning quantitative research in this area, our use of theory, data, and methods has often served as little more than the framework and evidence to make what are essentially highly stylized “common sense” arguments (Watts, ).

As discussed in the next section, however, the availability of digital technologies (for citizens and scholars) and the data they produce may be simultaneously changing the relationship between communication and politics and our ability to conceptualize and measure this relationship and its component parts in reliable and valid ways.

. T C N   I  C E

Further complicating this picture is that one of the key factors in both what we study and how we study it—the information and communication environment—has changed dramatically over the past few decades. Focusing on the United States, Bruce Williams and I (Williams & Delli Carpini, ; see also Bennett & Segerberg, ; Chadwick, ) argue that the emergence and use of new technologies (largely the Internet, mobile devices, and social media) have challenged the way we think about both politics and politically relevant media. This new environment (or “media regime,” as we called it) has blurred traditional distinctions between fact and opinion, news and entertainment, information producers and consumers, and mass-mediated and interpersonal communication, creating a political landscape that is both “multiaxial” (i.e., in which control of the public agenda can emerge from multiple, shifting, and previously invisible or less powerful actors) and “hyperreal” (i.e., in which the mediated representation of reality becomes more important than the facts underlying it).

The implications of this metamorphosis for the study of political communication effects have been debated, most directly in an exchange between Bennett and Iyengar () and Holbert, Garrett, and Gleason () in the pages of the Journal of Communication, with the former arguing that many of our theories may no longer be applicable and the latter arguing for caution in making such claims. Missed in this and similar exchanges is that the generalizable applicability of existing theories—even the consistency among them—was not apparent even prior to recent changes in the media ecosystem. So where does this leave us? To my mind the current disruption in the information environment does two things.
First, it provides the opportunity to address Zaller’s key requisites for demonstrating media effects. The networked information environment (and, crucially, the digital traces it leaves), coupled with the methodologies of “computational” or “data” science, has the potential to greatly increase the variation in communication content we can observe, our ability to accurately measure this content and its reception in often unobtrusive and more contextualized ways, and our ability to demonstrate effects and the conditions under which they occur. These tools also make observable forms of communication that heretofore remained hidden from us, opening up new possibilities for demonstrating effects.

As to theory, the networked information environment provides a natural laboratory for testing existing theories as well as building new ones. But I for one am not troubled by the descriptive nature of some of the still-emerging research in this area. I say this because the second important result of the disruption of our information environment is that it provides more than simply a new source of data; it has changed the very nature of our object of study, by which I mean that the online social world is simultaneously a reflection of the offline world of politics, a facilitator of offline politics, and a space in which politics occurs. So before we can develop and test theories, we need to be able to describe the thing we are theorizing about. It is only through the iterative process of observation, induction, and deduction that reliable and valid measures and, ultimately, generalizable theory are likely to emerge.

This brings us to the chapters in this section. Each addresses a different and well-studied aspect of communication and political behavior: deliberation (Beauchamp), the role of emotions in political discourse (Settle), the dynamics of attention and opinion formation during social protests (Ferrara), the content and form of public opinion formation and expression (Margolin), and the ways in which digital media differentially impact fragile and stable polities (Borge-Holthoefer, Hussain, and Weber). Each draws on existing theories as a starting point, while grappling with how these theories need to be rethought in a digitally networked information environment (e.g., Settle on the role of emotions in interpersonal political communications and opinion expression). Each supports its argument with empirical data and examples (e.g., Ferrara’s analysis of the Twitter conversation during the Gezi Park protests in Turkey).
Each is sensitive to the multiple ways in which the online world makes visible, changes, and/or complements offline politics (e.g., Borge-Holthoefer, Hussain, and Weber on networked communities as both “observatory” and “social disrupter”). Each is thoughtful on the relationship between existing concepts and our ability to measure them effectively using data generated from online sources (e.g., Beauchamp on “argument quality”). And each provides the building blocks for the development of new theories appropriate to the study of political behavior in a networked world (e.g., Margolin’s introduction of the “satisficing semantic search” model for detecting the strength of social and political movements). Finally, collectively they take a significant step in helping political communication scholars think through the complicated and evolving relationship between communication and politics in a digitally networked world, and in the bargain, breathe new life into efforts to build generalizable but context-sensitive theory in our field more broadly.

N . While my emphasis is on individual requisites, attributes, and impacts, it is important to note that effective and sustainable democratic engagement also requires supportive (or exploitable) institutions and processes. In a democracy, such institutions and processes are present by design. But democratic (or proto-democratic) engagement can occur in polities that are not democratic in any obvious sense and can be absent or infirm in nominally democratic ones.




R Althaus, S. L. (). Collective preferences in democratic politic s: Opinion surveys and the will of the people. New York: Cambridge University Press. Bakker, B. N., Hopmann, D. N., & Persson, M. (). Personality traits and party identification over time. European Journal of Political Research, (), –. Benkler, Y. (). The wealth of networks: How social production transforms markets and freedom. New Haven, CT: Yale University Press. Bennett, W. L., & Iyengar, S. (). A new era of minimal effects? The changing foundations of political communication. Journal of Communication, (), –. Bennett, W. L., & Segerberg, A. (). The logic of connective action: Digital media and the personalization of contentious politics. Information, Communication & Society, (), –. Canache, D. (). Citizens’ conceptualizations of democracy structural complexity, substantive content, and political significance. Comparative Political Studies, (), –. Cappella, J. N., & Jamieson, K. H. (). Spiral of cynicism: The press and the public good. New York: Oxford University Press. Chadwick, A. (). The hybrid media system: Politics and power. New York: Oxford University Press. Cook, F. L., Page, B. I., & Moskowitz, R. L. (). Political engagement by wealthy Americans. Political Science Quarterly, (), –. Couldry, N. (). Media and democracy. In S. Jansen, J. Pooley and L. Taub-Pervizpour (eds.), Media and social justice (pp. –). New York: Palgrave Macmillan US. Curran, J. (). Media and democracy. Oxford: Taylor & Francis. Dalton, R. J. (). Citizen politics: Public opinion and political parties in advanced industrial democracies. Washington, D.C.: CQ Press. De Vreese, C. H., & Boomgaarden, H. (). News, political knowledge and participation: The differential effects of news media exposure on political knowledge and participation. Acta Politica, (), –. Delli Carpini, M. X. (). 
Mediating democratic engagement: The impact of communications on citizens’ involvement in political and civic life. In L.L. Kaid (ed.), Handbook of political communication research (pp. –). New York: Routledge. Delli Carpini, M. X. (). The psychology of civic learning. In E. Borgida and J. Sullivan (eds.), The political psychology of democratic citizenship (pp. –). New York: Oxford University Press. Delli Carpini, M. X., & Keeter, S. (). What Americans know about politics and why it matters. New Haven, CT: Yale University Press. Feldman, L., Hart, P. S., Leiserowitz, A., Maibach, E., & Roser-Renouf, C. (). Do hostile media perceptions lead to action? The role of hostile media perceptions, political efficacy, and ideology in predicting climate change activism. Communication Research, (), –. Feldman, S., & Johnston, C. (). Understanding the determinants of political ideology: Implications of structural complexity. Political Psychology, (), –. Fishkin, J. S. (). When the people speak: Deliberative democracy and public consultation. New York: Oxford University Press. Gervais, B. T. (). Following the news? Reception of uncivil partisan media and the use of incivility in political expression. Political Communication, (), –.




Gibson, J. L. (). Measuring political tolerance and general support for pro–civil liberties policies: Notes, evidence, and cautions. Public Opinion Quarterly, (S), –.
Holbert, R. L., Garrett, R. K., & Gleason, L. S. (). A new era of minimal effects? A response to Bennett and Iyengar. Journal of Communication, (), –.
Jacobs, L. R., Cook, F. L., & Delli Carpini, M. X. (). Talking together: Public deliberation and political participation in America. Chicago: University of Chicago Press.
Kenski, K., & Stroud, N. J. (). Connections between Internet use and political efficacy, knowledge, and participation. Journal of Broadcasting & Electronic Media, (), –.
Kinder, D. R., & Sanders, L. M. (). Divided by color: Racial politics and democratic ideals. Chicago: University of Chicago Press.
McChesney, R. D. (). The problem of the media: US communication politics in the twenty-first century. New York: New York University Press.
Meirick, P. C. (). Motivated misperception? Party, education, partisan news, and belief in “death panels.” Journalism & Mass Communication Quarterly, (), –.
Mishler, W., & Rose, R. (). What are the origins of political trust? Testing institutional and cultural theories in post-communist societies. Comparative Political Studies, (), –.
Morgan, M., Shanahan, J., & Signorielli, N. (). Yesterday’s new cultivation, tomorrow. Mass Communication and Society, (), –.
Morrell, M. E. (). Survey and experimental evidence for a reliable and valid measure of internal political efficacy. The Public Opinion Quarterly, (), –.
Norris, P. (). A virtuous circle: Political communications in postindustrial societies. New York: Cambridge University Press.
Poindexter, P. M., & McCombs, M. E. (). Revisiting the civic duty to keep informed in the new media environment. Journalism & Mass Communication Quarterly, (), –.
Prior, M. (). You’ve either got it or you don’t? The stability of political interest over the life cycle. The Journal of Politics, (), –.
Puschmann, C., Bastos, M. T., & Schmidt, J. H. (). Birds of a feather petition together? Characterizing e-petitioning through the lens of platform data. Information, Communication & Society, (), –.
Putnam, R. D. (). Bowling alone: America’s declining social capital. Journal of Democracy, (), –.
Reader, B. (). Free press vs. free speech? The rhetoric of “civility” in regard to anonymous online comments. Journalism & Mass Communication Quarterly, (), –.
Schuck, A. R., Boomgaarden, H. G., & de Vreese, C. H. (). Cynics all around? The impact of election news on political cynicism in comparative perspective. Journal of Communication, (), –.
Silverman, C. (, November ). This analysis shows how fake election news stories outperformed real news on Facebook. BuzzFeed. https://www.buzzfeednews.com/article/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook
Torcal, M., & Maldonado, G. (). Revisiting the dark side of political deliberation: The effects of media and political discussion on political interest. Public Opinion Quarterly, (), –.
Tworzecki, H., & Semetko, H. A. (). Media use and political engagement in three new democracies: Malaise versus mobilization in the Czech Republic, Hungary, and Poland. The International Journal of Press/Politics, (), –.




Uslaner, E. M. (). Trust and civic engagement in East and West. In G. Badescu and E. M. Uslaner (Eds.), Social capital and the transition to democracy (pp. –). London: Routledge.
Verba, S., Schlozman, K. L., & Brady, H. E. (). Voice and equality: Civic voluntarism in American politics. Cambridge, MA: Harvard University Press.
Watts, D. J. (). Common sense and sociological explanations. American Journal of Sociology, (), –.
Williams, B. A., & Delli Carpini, M. X. (). After broadcast news: Media regimes, democracy, and the new information environment. New York: Cambridge University Press.
Zaller, J. (). The nature and origins of mass opinion. New York: Cambridge University Press.
Zaller, J. (). The myth of massive media impact revived: New support for a discredited idea. In D. Mutz, P. Sniderman, and R. Brady (Eds.), Political persuasion and attitude change (pp. –). Ann Arbor, MI: University of Michigan Press.

Modeling and Measuring Deliberation Online

Nicholas Beauchamp

1. Introduction

As the online world continues its exponential growth, the role of interpersonal communication has become increasingly central in understanding how opinions form, change, and affect behavior. While face-to-face communication has always played a fundamental role in shaping our views, verbal interactions have remained difficult to study, for obvious reasons. The rise of social media reveals and documents the sheer scale of these interpersonal interactions, but social media have also bred pessimism about the effects of these interactions, often being depicted as a den of bullying, trolling, flaming, groupthink, and worse. Understanding and fostering social media’s more positive potential, however, requires a robust model of productive conversation. This essay examines that challenge through the lens of deliberative theory, exploring how one might model and measure deliberation in the online world, potentially as a first step toward improving it.

Deliberative theory began as an outgrowth of democratic theory, recognizing that pure procedural voting was often insufficient to achieve collective decisions that best reflect the fundamental interests of participants. Instead, deliberative scholars proposed that through careful, informed discussion, a group can arrive at collective decisions that most participants would agree are more informed and better reflective of their true preferences and beliefs than would have been achieved with a quick, majority-rule vote. As the role of interpersonal communication on social media has become especially vexed recently, it has become essential to discover when, where, and how online conversations might lead to better decisions, more informed participants, and increased mutual understanding.

This essay ultimately argues that there are some grounds for optimism, at least about the potential for online deliberation. To reach that conclusion, though, requires a careful examination of the origins and many competing definitions of deliberation; the challenges of its measurement, especially online; and deeper models of the fundamental deliberative qualities that productive conversation might achieve.

Although each of these steps is challenging, for scholars of deliberation and argument the world of social media would seem to be the perfect data set and test bed. In many online domains conversations are threaded, so one can tell exactly who is addressing whom; identities are clear and persistent (albeit often pseudonymous); and since communities form and interact consistently for weeks, months, or years, changes in speech, opinion, and online behavior can be measured and tracked over long periods of time. Furthermore, the same conditions that make these domains appealing to the social scientist may often make them exceedingly useful and productive for their participants, who can benefit from consistent, long-term, and transparent interactions with other members to build deliberative conversations and communities that can expand and enrich their understanding of the world and provide essential long-term emotional support.

Tempering this optimism is a prevailing sense that social media are often an echo chamber of fact-free, self-reinforcing bubbles of like-minded users (Sunstein, ), or worse, a breeding ground for bullying, misogyny, racism, and other forms of harassment that reinforce existing prejudices and power structures and drive out all those who aren’t aggressive, straight, white cisgender men (Kayany, ; Hobman et al., ; Jones et al., ). Evidence of such unproductive (or destructive) speech online abounds, of course, but as with many aspects of social media, there is also a large degree of selection bias at work here.
Even if such things are prevalent or dominant on some platforms, it may also be the case that many pockets exist in which speech is more deliberative, conversational, and useful for millions of users across billions of posts. One of the goals of this essay is to explore what the conditions and criteria might be for more productive, deliberative conversation online, even if such a thing appears relatively rare today. Once these criteria are better delineated and their real-world conditions better understood, we can work toward enhancing the conditions conducive to productive speech.

Although social media may be relatively new, these debates about deliberation and its pitfalls, as well as the general skepticism toward the whole conversational endeavor, are not at all new. This essay begins with a discussion of deliberative theory in the traditional “offline,” “real-world” setting, in order to better understand the normative goals and practical challenges of deliberative conversation. At their most extensive, theories of deliberation can entail dozens of potential criteria, at levels ranging from the institutional, to the individual, to the content of each sentence spoken; one of the challenges is abstracting from this menagerie a core set of criteria, if that is possible. Another challenge is turning theory into empirics: How is each of these criteria measurable, particularly on the massive scale allowed by social media? A particular challenge is measuring substantive criteria rather than just a series of more easily observable epiphenomenal markers that may be overly specific to various platforms or transcription methods. This leads directly to the various computational methods that may be necessary to measure these deliberative processes in the free-form text one finds in transcripts from deliberative exercises and in social media.

Once the concepts and challenges posed by deliberation more generally have been delineated, we can turn to online deliberation more specifically. This encompasses online environments that have been carefully constructed to further deliberative discussion (and that often emerge out of offline deliberative work), as well as more “natural” online environments, such as social media, that may be more or less deliberative as a function of their membership and online structures. While constructed online deliberative platforms are often accompanied by surveys and other outcome measures, they are often as rare, expensive, and un-self-sustaining as offline exercises. By contrast, social media abound in data, but it is often difficult or impossible to survey participants to see what they have learned, how they have changed, and how they feel about their conversational activities. Because of this, with social media—and often even with purpose-built online platforms—we need to measure deliberative quality and outcomes directly from the textual content and patterns of interaction.

There are many approaches to this, particularly as we move beyond subjective human judgment into more automated computational methods, introducing tools from natural language processing (NLP), argument mining, and network theory. But while many of these methods seem well suited to the empirical analysis of deliberation in the wild, they are hampered both theoretically and empirically. Theoretically, there appears to be an increasing proliferation of criteria and subcriteria, numbering now in the dozens across the literature.
Empirically, there is a tendency to focus on superficial, easily measured markers of argument and deliberation, rather than on deeper structures such as argument strength or conceptual interconnections—understandably enough, given that the latter are much more difficult to measure. This essay finishes, then, with a look at a couple of recent efforts to measure deeper deliberative structures.

In the ideal world, deliberation is valuable because it allows people to better learn facts, ideas, and the underlying conceptual structures connecting them. Participating in or observing a discussion or argument should be beneficial because, ideally, not all arguments are equal, and not all communication is dominated by superficial rhetorical features; in a productive debate, for instance, the better side should win not because the winner is a better rhetorician, but because they are in possession of the better arguments. But to measure these things—if indeed they exist—requires us to have a model of better and worse arguments, in a way that is objective and unbiased by the prejudices of the measurer. And to measure genuine deliberative thought requires not just superficial notions of facts or lists of words associated with reasonable talk, but models of how ideas are interrelated and of how deliberative conversation, like deliberative thought, sifts through and reorganizes these ideas to arrive at better global decisions. The purpose here is not to solve these deeper issues entirely, but to raise and highlight their importance in moving beyond the myriad ad hoc approaches currently employed, toward an objective, scalable, and theoretically coherent model for measuring deliberation online.
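To make the contrast concrete, consider how easy the superficial markers are to compute. The sketch below is purely illustrative: the cue list and scoring rule are invented for this example, not a validated instrument from the deliberation literature. It scores posts by the share of justification cues and the presence of a question, exactly the kind of easily measured surface proxy that says nothing about argument strength or conceptual structure.

```python
import re

# Illustrative cue list (invented for this sketch, not a validated lexicon).
JUSTIFICATION_CUES = {"because", "therefore", "since", "evidence", "reason"}

def shallow_deliberation_score(post: str) -> float:
    """Toy surface-level score: fraction of tokens that are justification
    cues, plus a small bonus if the post asks a question. Captures easily
    measured markers only, not argument strength or interconnection."""
    tokens = re.findall(r"[a-z']+", post.lower())
    if not tokens:
        return 0.0
    cue_share = sum(t in JUSTIFICATION_CUES for t in tokens) / len(tokens)
    question_bonus = 0.1 if "?" in post else 0.0
    return cue_share + question_bonus

posts = [
    "You're wrong and everyone knows it.",
    "I disagree, because the turnout evidence points the other way. What data are you using?",
]
scores = [shallow_deliberation_score(p) for p in posts]
```

A rhetorically skilled but vacuous post can easily game markers like these, which is precisely why the deeper structural measures discussed in the text are needed.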




2. Deliberation

Whether explicitly or implicitly, much recent research into online communication is deeply normative: Are people learning from each other, are they in self-reinforcing bubbles, or are they engaged in combat that is at best polarizing and at worst deeply destructive? Even many descriptive or positive theories of communicative behavior online tend to have normative questions underlying at least an aspect of their hypotheses, particularly as the researchers encounter one or another type of dysfunctional collective online behavior (such as “flaming” or “trolling”). But if we wish to understand more systematically how deliberation online can go wrong, it is useful to develop better theories about how we might want it to go right.

Deliberative theory has been developed over the last thirty years in part as a response to a similar set of problems in the offline world, and it suggests that the problem of developing, measuring, and understanding deliberative communication—and its absence—is considerably more complex than merely the absence of bullying or acrimony (for instance). This section examines existing, traditional “offline” deliberative theory, and in particular the fundamental challenges of measurement that carry over to a more extensive model of online deliberation.

Although the very concept and definition of deliberation is deeply vexed, as we discuss later in this chapter, a rough and minimal definition could be an extended conversation among two or more people aimed at coming to a better understanding of some issue. There are many other aspects that various scholars consider essential—including institutional, formal, and outcome-oriented ones—but most share this core idea of people engaged in conversation with the purpose of increasing their understanding.
But because there remain many disagreements about these criteria and their normative importance, it is worth spending a little time understanding the development of deliberative theory and the current status of the complex and far-ranging discipline it has become.

Theories of deliberative democracy were developed by Habermas and others (Habermas et al., ; Cohen, ; Dryzek, ) in part as a response to limitations in traditional democratic theory that are analogous to some of the limitations in the online world already discussed here. In particular, even when we are satisfied with a particular electoral system as a fair aggregator of existing public opinion, it is clear that these democratic procedures do not always succeed in producing the best outcomes according to various theories of justice (Bohman, ; Ackerman and Fishkin, ; Bächtiger et al., ; Thompson, ; Gutmann and Thompson, ; Mansbridge et al., ). Voters may be deeply uninformed, for instance, producing outcomes that are suboptimal not just according to some objective measure, but even according to what those voters themselves would wish were they to learn all the facts (Fishkin, ). Similarly, voters may not have worked through their own ideas sufficiently (by either objective measures or their own lights); may be ignorant not just of facts but of important ideas, or of connections between those ideas; or may misunderstand the ideas and beliefs of other voters in important and misleading ways (Habermas, ; Manin, ). Mere voting may reflect the current attitudes of the voters, but perhaps not their “better” selves: what they would opt for were they better informed about the world, themselves, and others. Thus the need for deliberation before voting, to allow participants to better understand themselves; each other; and the issues, facts, and ideas at work.

Early deliberative theory was deeply rooted in the rational Enlightenment tradition, in which the main goal of deliberation was to learn information and share reasons, and one of the primary aims was ideally to reach some form of consensus (Habermas et al., ; Cohen, ; Habermas, ; Manin, ; Thompson, ). Out of this framework the original criteria for practical deliberation were developed, with the goal of producing rational, factual discussion that ideally leads to consensual decision-making.

To better understand these deliberative criteria, it is worth distinguishing three stages of the deliberative process, which persist even after the definition of deliberation has been expanded beyond its narrow rationalist origins. Roughly speaking, these stages are (1) the input, such as the environment or institutional framework under which deliberation takes place; (2) the deliberative process itself, including the actions of the participants and the content of their communication; and (3) the output, such as opinion changes, decisions, or votes.
The environmental or institutional input may extend from large-scale social structures down to the rules governing the structure of a single deliberative group; the process may include individual behavior as well as the content of individual speech acts; and the output may range from individual knowledge gain to more consistent or factually correct group decisions (Landwehr and Holzinger, ). Deliberative criteria in the rationalist tradition are often framed most explicitly in terms of both inputs and processes: the environment must be fair, equal, and unbiased; individuals must emphasize reason-giving, respect, and honest expression rather than emotion and strategic manipulation; and the content of what they say should actually contain arguments, ideas, and facts (Habermas et al., ; Cohen, ; Steenbergen et al., ; Mansbridge et al., ).

But none of these criteria can really be considered successful in themselves unless the individuals have gained in their rational understanding of the issues and moved toward a consensual truth (output). For instance, a (purportedly) reasonable, respectful setting that produced acrimony and groupthink would not ultimately be judged a deliberative environment. But conversely, just as we would judge a democracy somehow deficient if a benign dictator merely appointed a representative body or merely dictated policies reflective of the democratic will without actually holding polls or elections, so too in deliberation: if we imagine an outcome in which everyone simply ends up better informed (e.g., via reading some material on their own) without the deliberative process, it may be beneficial, but it isn’t truly deliberative. Thus neither input nor output is itself sufficient, and this sense in which the process itself is the most fundamental part of deliberation becomes even more pronounced in the second wave of deliberative theory.




That a rational consensus exists and that participants should strive to achieve it was central to the first wave of deliberative theory (Cohen, ; Habermas, ; Bächtiger et al., ; Niemann, ; Gastil and Black, ; Thompson, ), but this assumption has since been loosened and expanded to encompass a wider variety of normative goals. In part, this is an empirical concession; as we can see from the online world, not only is consensus difficult to achieve in practice even in the best of worlds, but it may not even be theoretically possible when interests are fundamentally opposed or when the participants occupy a variety of identities or other positions that, while not inherently in opposition, are not something that we would actually want to merge (Holzinger, ; Dryzek, ; Gutmann and Thompson, ; Mansbridge et al., ). Moreover, even where consensus is practically achievable, there are many flawed forms of it, in which participants either naturally converge via groupthink on a suboptimal position or simply bully a subset of participants into agreement (Sunstein, ; Garrett, ).

In reaction to these flaws in the rationalist, consensual version of deliberation, a second wave of deliberative theory arose that allows for a broader conception of the goals and procedures of deliberation. Here, deliberation can potentially progress even with self-interested actors (Mansbridge et al., ), implacable disagreements about the truth (Gutmann and Thompson, ), a diversity of backgrounds, and some degree of continuing ignorance or incomprehension about the experiences or backgrounds of the other participants (Mutz, ). However, without the core idea of rational progression toward a consensual truth, the second wave of deliberative criteria tends to be more fundamentally procedural rather than outcome oriented.
If the process is fair and unbiased; encompasses the full diversity of viewpoints and is respectful to all differences; and in some loose sense encourages justification, exploration, and explanation over rhetoric, attack, and strategy, then it fulfills the environmental criteria, with much less emphasis on the outcome (Mutz, ; Dryzek, ; Gutmann and Thompson, ; Mansbridge et al., ; Fishkin, ; Mansbridge, ). Clearly, an outcome full of acrimony and misunderstanding would be a failure, but beyond that relatively loose criterion, procedural rather than outcome-based criteria tend to dominate. Measuring process, however—especially as the definitions broaden beyond the purely rational—leads to an array of tricky questions: If a process claims to be fair and open, does it actually achieve equal speech from all participants (Steenbergen et al., ; Steiner, ; Gutmann and Thompson, )? If not, is the content of what is said at least diverse enough that it encompasses all the major positions at the table, and in rough proportion to the individual representation of those positions (Manin, ; Mutz, ; Thompson, )? Do participants seem to present justifications, explanations, and exploratory questioning, rather than aggressive or strategic speech designed to “win” (Manin, ; Mansbridge et al., )? Even more fundamentally, do they seem to be exchanging arguments and ideas—“ideas” conceived broadly as things that can be personal, anecdotal, or emotional and not just abstract reasons—rather than rhetorical jousting? And even more fundamental to the deliberative process itself, are their exchanges of arguments and ideas responsive to each other and reflective of underlying concepts and structures, rather than the sorts of purely
performative or rhetorical speech one finds so often online (Holzinger, ; Gutmann and Thompson, ; Mansbridge et al., )? These questions are arguably at the core of deliberation, as discussed later in this chapter. But each is quite difficult to answer empirically, and each leads to multiple nests of thorny theoretical problems. Again, one might be tempted to try to shortcut the discourse analysis by turning back to outcomes—for example, to simply survey participants at the end about their knowledge, understanding of the issues, satisfaction, and so forth. But again, those outcomes can be achieved without a deliberative process at all. If we are interested (as with democratic theory more generally) in achieving the correct outcome via the correct process, then the most fundamental and direct measure is not at the institutional (input) or outcome (output) levels, but at the discourse level—with all of its attendant empirical and theoretical challenges. Perhaps as a result of the challenges of measuring process, much of the early and concrete work on deliberative measurement and institutional design was done at the input and output levels, but that has gradually expanded to capture the more elusive but arguably more fundamental procedural criteria (Steenbergen et al., ; Holzinger, ; Steiner, ; Gutmann and Thompson, ). In addition to more straightforward measures of institutional/environmental suitability (open, fair proceedings that at least attempt to encourage constructive, reasonable conversation) and outcomes (surveys of knowledge gain, satisfaction, and absence of groupthink or polarization), the tricky middle—the discourse itself—has seen important strides in measurement. 
Perhaps the most substantial is the Discourse Quality Index (DQI), which attempts to measure a number of types of quality, including breadth of participation, depth and content of justifications, respect toward other groups and their arguments, and constructive behavior (Steenbergen et al., ). All of these are fairly abstract and subjective criteria, however, and while the DQI and other measures of course subdivide these qualities into more detailed and potentially objective subcriteria and measurements, the evaluation of them has generally required intensive work from fairly expert human coders, each with their various potential biases regarding what counts as justification, constructive behavior, rationality, manipulation, and so forth. Partly as a response to these types of bias, and partly as a response to the sheer cost and effort of such hand-coded measurement, various more automated methods have recently been developed. Such methods are absolutely necessary to expand these measurements to raw online discourse at scale. As an illustration of one of the more comprehensive efforts at multifaceted computational measurement of deliberation, Gold et al. () seek to measure four different core aspects of deliberative discourse: equal participation, mutual respect, justification, and persuasive effects. Like most fundamental deliberative qualities, each of these is a theoretically complex and multifaceted concept, without any obvious computational measure that can be easily automated (hence the DQI’s reliance on human coders). Equal participation is perhaps the most directly operationalized, since it operates on the individual level, and one can use speaking time as a proxy for this. But for the content-based criteria, rather than construct deep substantive measures that somehow reflect the complex concepts at
work in the minds of human coders—a challenging task—the authors find superficial markers that are hopefully associated with the deeper measures. This approach is not specific to them, but rather illustrative of how many automated attempts have grappled with these difficult measurement issues. Respect, or its absence, is measured via interruptions as notated in a transcript; justification is measured via a couple of grammatical constructs (in German) that are sometimes used when justifying remarks; and persuasiveness is tracked by verbs associated with changes in opinion (“accept,” “believe,” etc.). So while it may be that these markers are systematically associated with a number of fundamental measures, such an approach requires extensive validation against human coding, does not export readily to other languages or contexts, and does little to capture the core concepts of deliberation. However, with some work, it can presumably be converted into a variety of languages and function as a quick and large-scale, though approximate, measure of deliberation along a number of different procedural dimensions. As discussed in the next section, as the deliberative field turns to online environments to test and develop its theories, it increasingly employs similar measurement strategies, which are scalable and automatable, but also necessarily somewhat superficial.
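The marker-based strategy just described can be made concrete with a short sketch. Everything below is illustrative: the English wordlists merely stand in for the German grammatical constructs and opinion-change verbs that Gold et al. actually use, and `proxy_measures` is a hypothetical function, not their implementation.

```python
import re
from collections import Counter

# Illustrative stand-ins for the real marker sets (assumptions, not Gold et al.'s lists).
JUSTIFICATION_MARKERS = {"because", "therefore", "since", "thus"}
PERSUASION_VERBS = {"accept", "believe", "concede", "agree"}

def proxy_measures(transcript):
    """transcript: list of (speaker, utterance) pairs.
    Returns crude proxies for equal participation (word share per speaker),
    justification (marker rate), and persuasion (opinion-change-verb rate)."""
    words_by_speaker = Counter()
    justification_hits = persuasion_hits = 0
    for speaker, utterance in transcript:
        tokens = re.findall(r"[a-z']+", utterance.lower())
        words_by_speaker[speaker] += len(tokens)
        justification_hits += sum(t in JUSTIFICATION_MARKERS for t in tokens)
        persuasion_hits += sum(t in PERSUASION_VERBS for t in tokens)
    total = sum(words_by_speaker.values())
    return {
        "participation_shares": {s: n / total for s, n in words_by_speaker.items()},
        "justification_rate": justification_hits / total,
        "persuasion_rate": persuasion_hits / total,
    }
```

Such counts are exactly the kind of superficial-but-scalable proxy at issue here: they say nothing about whether a “because” clause actually justifies anything.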

. O D

Within the body of research into specifically online deliberation, there have been two somewhat distinct communities that have only recently begun to merge. On the one hand are those who emerge out of the deliberation tradition, interested in cataloging the various criteria previously enumerated, distinguishing the pros and cons of online versus offline discussion, and constructing deliberative environments online. On the other hand is work that emerges more from computer science and communications, which pays greater attention to behavior in existing rather than purpose-built online communities and therefore grapples with more varied network topologies connecting interlocutors rather than simple chatrooms modeled on literal deliberative rooms, and which tends to employ more automated content analysis rather than using survey measures of outcomes (Himelboim, , ; Himelboim et al., ; González-Bailón et al., ; Choi, ). But although originally more descriptive than normative, much of the latter work has been drawn to normative deliberative questions very similar to those developed in the deliberation community. Within the outgrowth of traditional deliberative theory into online communication, one encounters many of the same sets of checklists previously mentioned. For instance, in Schneider we have four criteria: equality, diversity, reciprocity, and quality (Schneider, ). In Dahlberg there are six: reasoning, reflexivity, ideal role taking, sincerity, equality, and autonomy (Dahlberg, ). In Janssen we have six: form, dialogue, openness, tone, argumentation, and reciprocity (Janssen and Kies, ). And in the International Association of Public Participation (IAP, ) there are five: inform,
consult, involve, collaborate, and empower (Nabatchi, ). Once again, we are presented with a congeries of theories and criteria, although of course plenty of overlap can be found among these. Each list, though, does tend to focus more on a single level: Schneider more on the environmental or input criteria; Dahlberg and Janssen more on individuals, process, and content; and the International Association of Public Participation more on outcomes. Perhaps the most wide-ranging recent effort to systematize many of these qualities is Friess and Eilders (), who distinguish three phases of deliberation similar to those previously outlined: “Institutional/Input/Design,” “Communicative/Throughput/Process,” and “Productive/Outcome/Results.” Within each of these three basic stages, they in turn enumerate the usual menagerie of criteria, although helpfully targeted for specifically online deliberation. For institutions they propose asynchrony, self-identification, moderation, empowerment, division of labor, and information; for processes they propose rationality, interactivity, equality and inclusion, civility, common good, and constructiveness; and for outputs they propose knowledge gain, reason learning, opinion change, social trust, and political engagement (on the individual level), as well as consensus, error avoidance, epistemic quality, and legitimacy (on the collective level). They present a tidy table summarizing these aspects, but such tidiness somewhat masks the inherent complexity and even messiness of even this “systematic” set of criteria, with its twenty-one variables. The historical trajectory seems to be an ever-increasing list of criteria, with measurement and assessment falling further and further behind. Furthermore, as challenging as it may be to enumerate and categorize all these criteria, much more challenging is implementing their measurement, particularly at the most central stage, the deliberative process. 
This challenge is illustrated by Nelimarkka et al. (), who closely examine three online systems from the perspective of Dahlberg and the DQI: the “Living Voters Guide,” including its earlier iterations Consider.it and Reflect (Kriplean et al., ); the “Open Town Hall” (Vogel et al., ); and the authors’ own “California Report Card (CRC)”. The CRC is notable for the sensitivity of its attention to equality and autonomy, randomizing individual encounters and even their arguments in ways to prevent dominance by certain participants or their ideas. Much more challenging than these input criteria, though, are the procedural measures at the heart of deliberation. The authors devote close attention to the difficulties of operationalizing reason, for instance, which like many others they take to be one of the core criteria for the procedural stage. They divide reason into reciprocity and justification, and while justification can often be measured via fact giving and other superficial syntactic measures (as Gold et al.  do), a deep measure of reciprocity is quite tricky, since simply counting responses in discussions fails to capture the degree to which individuals really are taking in and thoughtfully responding to the ideas of their interlocutors (which Trénel  distinguishes as formal interactivity vs. substantial interactivity). This idea of reciprocity, which in many ways is close to the core ideal of deliberation, will be addressed shortly, but for now it can be taken simply as another instance of how difficult it is to measure process, particularly at the core level of interactive content. Yet as difficult as this may be in
carefully crafted online deliberative environments, it is much more difficult to capture in the wild. Many of the approaches already discussed also turn their sights to natural online settings, but even recent work that carefully applies one of these sets of four, seven, or twenty-two criteria can feel immediately obsolete or irrelevant when applied to ever-changing social media.
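Trénel’s distinction between formal and substantial interactivity can be illustrated with a deliberately naive sketch: formal interactivity as the share of posts that reply to another post, substantial interactivity as lexical overlap between a reply and its parent. The Jaccard proxy and the post format here are assumptions for illustration, not an established operationalization.

```python
def jaccard(a, b):
    """Lexical overlap between two texts (shared words / all words)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def reciprocity_measures(posts):
    """posts: list of dicts with keys 'id', 'parent' (an id or None), 'text'.
    Formal interactivity: fraction of posts that are replies.
    Substantial interactivity: mean word overlap with the parent post,
    a crude stand-in for actually engaging the parent's ideas."""
    by_id = {p["id"]: p for p in posts}
    replies = [p for p in posts if p["parent"] in by_id]
    formal = len(replies) / len(posts) if posts else 0.0
    overlaps = [jaccard(p["text"], by_id[p["parent"]]["text"]) for p in replies]
    substantial = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return formal, substantial
```

A high formal score paired with a near-zero substantial score would flag exactly the hollow turn-taking that simple response counts cannot detect.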

. D  S M

On the one hand, this definitional proliferation and its associated measurement and design problems demonstrate the breadth and ambition of what modern deliberative theory has become. On the other hand, even relatively minimal real-world deliberative polls are sufficiently expensive that it is difficult to do the sort of extensive iterative experimentation necessary to design effective institutions applicable to a wide range of domains. And when moving from experiment to implementation, even a well-designed system can be prohibitively costly to deploy at scale. Even if we take the idea of deliberative “polling” to heart and hope that, like a well-chosen focus group, a well-selected deliberative poll might somehow be representative of a larger polity, the situation is even worse than a focus group, inasmuch as the interaction between people’s opinions leads to a combinatorial explosion of possible outcomes that might be very sensitive to the exact backgrounds and behaviors of the participants. The appeal of online communication therefore lies at both the implementation and experimentation stages. It is easier and cheaper to build and deploy deliberative systems online at the scale necessary for deliberative democracy, and more fundamentally, insofar as we still do not have ideal models or measurements of deliberation, it is far easier not just to design online deliberative experiments, but to use the copious quantities of online observational data to hone and test our theories. One of the most helpful developments in this regard has emerged out of computer science and other fields outside of the explicitly deliberative. This research has traditionally been more interested in a positivistic understanding of online communication, but has been drawn into the same normative issues underlying deliberative work. 
Most usefully, given the complex and varying nature of environments and behaviors found online, a wide array of tools emerging out of network theory, NLP, and other domains has been developed that can operationalize some of these theoretically elusive deliberative criteria. Somewhat less attention, though, has been given to the “outcome” side of things, a point Friess and Eilders () also make about online deliberative research more generally—and a point that is returned to later.

.. Early Work: Flames and Bubbles

Some of the earliest work in this area begins with a striking characteristic of early online communication, an observation that is both descriptive and normative: the emergence
and rapid prevalence of “flaming” (Reinig et al., ; Kayany, ). In addition to trying to explain this phenomenon, some of the earlier work explicitly examines the trade-offs between this tendency online and counterbalancing advantages that may emerge from the same set of underlying online features. On the one hand, in the 1990s Usenet discussion boards and blogs had quickly become famous for “flame wars,” in which conversations dissolve into vitriolic attacks that basically epitomize the opposite of deliberative discussion. On the other hand, while the absence of social cues was often blamed for the flaming tendencies (Kiesler et al., ; Kayany, ), the anonymity—or pseudonymity—of the online fora could produce more equality between participants (Dubrovsky et al., ; Bordia, ; Albrecht, ), potentially leading to more engagement (Price and Cappella, ), diversity (Hargittai et al., ; Garrett, ; Wojcieszak and Mutz, ; Brundidge, ), and participation (Boulianne, ), desiderata also identified in the roughly contemporaneous deliberation literature. But while the prevalence of “flaming” was taken as a given, less well examined was its cause: in particular, was it due more to an absence of social cues and conventions, or more to the very diversity of opinion that had been lauded, for example, bringing people together from further points along the ideological spectrum than would normally encounter each other in everyday life (Hobman et al., ; Papacharissi, )? A second line of investigation that subsequently emerged, while not directly responding to this research question, does seem to have flowed directly out of it. In the early 2000s the dominant form of social media was the weblog, and in the political domain it was soon noted that the most distinctive characteristic of the linkages between political blogs was dense intraparty connections, with weak or absent interparty connections (Adamic and Glance, ). 
This also marked one of the first notable applications of network theory to examine not just homogenous environments, but also the complex, self-selected interpersonal connections online that deeply affect the deliberative outcome (Himelboim, , ; Gonzalez-Bailon et al., ; Eveland and Kleinman, ). These “bubbles” of self-selected interlocutors were taken at the time, and have often been taken since, as self-evidently deleterious, blocking information flow and leading to less-informed participants. While perhaps the earlier forms of online communication led to somewhat greater diversity of participants, as the medium matured and people had greater ability to self-sort, bubbles arguably become more dominant, eroding many of the benefits of diversity. Alongside this, more substantial theoretical work examining the bubble and groupthink phenomena was developing in the deliberative and social science literatures, with perhaps the most well-known early crossover being Sunstein (). This can also be seen as a counterbalance to the lauded “wisdom of the crowd” (Surowiecki, ; Page, ), in which the most stylized result is that a diversity of opinion can produce group judgments more accurate than those of any of the individual participants, but this increased accuracy is destroyed if the participants are allowed to discuss their individual judgments first; discussion is precisely what collapses the wisdom of the crowd back to groupthink. There are of course many more important and theoretically interesting details about exactly how much diversity of opinion is optimal and how much discussion is sufficient to ruin the group wisdom, but this tension between diversity
and self-destruction mirrors the earlier research into flaming and bubbles. How are we to know when we are having a productively diverse discussion versus an unproductive flame war? Except in the most artificial of circumstances, real-world deliberative outcomes are rarely as clear-cut as guessing the number of jelly beans in a jar. Instead of relying on outcome measures, mirroring the progression in deliberative work, online studies have turned more attention to the discourse itself, seeking more automated measures of discourse quality. Perhaps the most direct and prevalent content measure is emotion and sentiment, particularly as automated methods have been developed to measure such things (Papacharissi, ; Berger and Milkman, ; González-Bailón et al., ). More sophisticated content measures have also emerged out of the NLP community, but less as a response to these problems than out of an entirely different subdiscipline, addressed shortly.
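The simplest such automated measures are wordlist counts. The sketch below shows the basic mechanism; the two tiny wordlists are toy stand-ins for the far larger sentiment lexicons these studies actually employ.

```python
# Toy lexicons (assumptions for illustration; real lexicons contain thousands of entries).
NEGATIVE = {"angry", "awful", "hate", "wrong", "stupid"}
POSITIVE = {"great", "agree", "helpful", "thanks", "good"}

def sentiment_score(text):
    """Net sentiment per token: (#positive - #negative) / #tokens, in [-1, 1]."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)
```

Scores like these can be computed over millions of posts, which is precisely their appeal and their limitation: they register tone, not argument.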

.. More Complex Measures of Environment: Network Analysis

Perhaps a more substantial and novel contribution emerging from the study of social media behavior has been on the environmental or input side, as network measures were developed to model and explain deliberative behaviors. Because modern social media since Usenet is arranged not in closed-room-like silos but in open-ended networks of interconnected users, the entire first level of deliberative analysis—the environment—becomes problematic. Can Facebook or Twitter even be subdivided into smaller communities whose deliberative qualities can be independently assessed? And if not, we need to model both the environment and individual behaviors in a more open-ended network structure to discover which topologies are associated with deliberative quality, either as input or potentially as output. But like the more sophisticated and granular analysis of wisdom-of-the-crowd effects in Page (), this is ultimately a good thing: it forces us to understand how the environment and the individual interact, how the former shapes the behavior of the latter, and how the former is in turn constituted and shaped by the self-selecting and linking behavior of those individuals. After all, barring costly deliberative democratic institutions, the most prevalent form of deliberation in actual life is interpersonal and thus “networky,” whether in person or online. A review of network-based approaches to communication, even from a normative deliberation point of view, is beyond the scope of what can be covered here (Himelboim, , ; Brundidge, ; González-Bailón et al., ; Eveland and Kleinman, ; Choi, ). 
Even something as far afield as retweet behavior is apropos, since those behaviors map onto our previous questions of bubbles, knowledge (and error) transmission, and even cascades of vitriol and “flaming”—all of which presumably have analogs in the more artificial and controlled environment of a deliberative meeting. How can the topology be tweaked to boost deliberative outcomes, and how do less deliberative behaviors shape their own self-reinforcing environments? These sorts of
questions remain largely unanswered, but recent work has begun to bring together the network and other computational tools with the more established deliberative concerns. One recent illustrative multimodal example of this is Choi (), who examines four now-familiar criteria—discussion flow, diversity of opinion, rationality of discussion, and persuasion—but from a perspective that mixes network methods and automated content analysis. Since the domain is Twitter, the approach is more descriptive than normative or design oriented, but the fundamental research questions are once again driven by the core deliberative concerns. Perhaps most interesting is the analysis of discussion flow or dynamics, which focuses on retweets but is analogous to many forms of information transmission. Using an exponential random graph model, they examine how different local network topologies linking Twitter users affect tendencies to retweet. This is a common approach in network analysis, but here the normatively relevant question is whether topologies tend to reinforce existing dominant speakers or diffuse communication out toward peripheral players, and, on a second level, whether cliques of speakers tend to form. In both cases, from a deliberative perspective we would prefer the latter options, with diffusion and equality rather than concentration and cliquishness. The predominant result in network analysis is that concentration rather than diffusion dominates, although in this case Choi finds that this is less the case in this specific Twitter data set than expected; whether that is due to increased deliberation or merely an underpowered sample is less clear. 
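The concentration-versus-diffusion question can also be summarized far more crudely than with an exponential random graph model, for instance as a Gini coefficient over how retweets are distributed across authors. This is a rough illustration of the intuition only, not Choi’s method.

```python
def gini(values):
    """Gini coefficient: 0 means perfectly equal; values near 1 mean concentrated."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * weighted) / (n * total) - (n + 1) / n

def retweet_concentration(retweets, users):
    """retweets: list of (retweeter, original_author) pairs.
    Returns how concentrated retweet attention is across the given users."""
    indegree = {u: 0 for u in users}
    for _, author in retweets:
        indegree[author] = indegree.get(author, 0) + 1
    return gini(indegree.values())
```

From a deliberative standpoint, lower values (diffusion toward peripheral speakers) would be preferred over higher ones (reinforcement of dominant speakers).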
Choi’s examination of the other three qualities—diversity, rationality, and persuasion— likewise illustrates the sorts of automated content analysis that are now prevalent, although this entire field remains limited by the same sorts of superficial content measures we saw in Gold et al. (). Diversity is operationalized via the domains in quoted URLs, used to measure the relative prevalence of inter- versus intra-ideological discussions based on the known ideology of different online news sources. While a fine measure, this illustrates how so many of these measures can be very specific to the form of the social media in which they transpire, since obviously URL quotation is a useless measure in live communication. It also illustrates some of the weaknesses of the purely descriptive approach, since while they of course find less inter- than intra-ideological discussion, it is unclear (a) what we would most desire normatively and (b) what variations in conditions might affect these relative quantities. Content—rationality and its alternatives—is operationalized using now-standard NLP measures, in particular the relatively basic tools included in toolkits such as Linguistic Inquiry and Word Count (Pennebaker et al., ), which purport to be able to measure content qualities as diverse as sentiment, anxiety, anger, or sadness, as well as high-level cognitive features such as causal reasoning, reflection, speculation, and assertion. However, although reasonably well validated against human judgments, these automated measures are mainly assessed via simple wordlists (which may be compiled either by human experts or by more computational methods), so it is unlikely that they are able to capture deep conceptual structures, just as the syntactic indicators of
“justification” in Gold et al. () tended to be more superficial than deep. Consistent with other work (Berger and Milkman, ; González-Bailón et al., ), Choi finds that negative emotion tends to increase retransmission of content, and of the cognitive measures, strong assertion rather than causal or speculative thinking seems to dominate. But the generality of these results, particularly given the superficial measures, remains a problem shared by almost all researchers of online deliberation.
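The URL-domain operationalization of diversity can be sketched in a few lines; the domain-to-leaning table below is hypothetical, standing in for the known ideological codings of real news sources.

```python
from urllib.parse import urlparse

# Hypothetical mapping from news domains to ideological leaning (an assumption).
DOMAIN_LEANING = {"leftnews.example": "left", "rightnews.example": "right"}

def cross_ideology_share(shares):
    """shares: list of (user_leaning, url) pairs.
    Fraction of shared links that point across ideological lines,
    counting only URLs whose domain leaning is known."""
    cross = total = 0
    for user_leaning, url in shares:
        leaning = DOMAIN_LEANING.get(urlparse(url).netloc)
        if leaning is None:
            continue  # unknown outlet: excluded, as in most such codings
        total += 1
        cross += leaning != user_leaning
    return cross / total if total else 0.0
```

The measure’s dependence on quoted URLs, noted above, is visible here: without a link there is simply nothing to code, which is why it does not transfer to live conversation.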

.. More Complex Measures of Content: Argument Mining

While the introduction of network analysis has been useful for pushing deliberative theory and practice away from simplistic environments to the sort of complex social structures common in actual human dynamics, the content side has often lagged, relying mainly on either human coding of broad categories or automated methods that amount to little more than word counting. However, another branch of computer science and computational linguistics has recently been making great strides in deeper measures of deliberative content, under the broad umbrella of “argument mining.” Just as network analysis pushes deliberative theory to examine local interpersonal structures and not just homogenous institutions, argument mining in NLP pushes previously superficial, content-based deliberative criteria into greater levels of argumentative detail. Like deliberative theory, argument analysis began a few decades ago as a more theoretical endeavor (Toulmin, ; Douglas, ), and it has really only blossomed as an empirical and computational program in the last twenty years. Some of the earliest work was with legal texts (Moens et al., ; Wyner et al., ; Mochales and Moens, ), attempting to identify arguments and then to identify and classify the substructures of different arguments, such as premises and claims. And as in deliberative theory, these types of arguments quickly proliferated, with for instance ninety-six different “schemes” (e.g., from precedent, from effects, from authority, from fear, ad hominem, slippery slope) in Walton et al. (). These techniques were initially most often applied to well-structured texts such as legal documents, where it is clear that (a) writers are indeed engaged in formal argument, and (b) the forms of those arguments are often sufficiently stereotyped to make for easier automatic retrieval. 
They also have been heavily applied in essay scoring (Shermis and Burstein, ; Stab and Gurevych, a, b; Beigman and Deane, )—in which one wants not just to detect the presence and kinds of arguments, but also to answer more normative questions about quality—and have expanded to many other domains. And of course much contemporary work has now shifted to the domain of online communication. Early empirical work began with human classification, then moved on to simple word-count approaches like those previously discussed, such as looking for reasoning terms such as “because” or “therefore.” But computational methods have subsequently advanced considerably, using machine learning methods to classify argument types (usually trained with human-coded examples) rather than (or in addition to) human-derived terms. Some of the most interesting recent work has been in argument sub- and superstructures: the relationships between premises, evidence, and claims, for example, on the substructure scale, and the relationships between larger arguments that support or contradict each other, for example, on the larger scale (Douglas, ; Walton et al., ; Palau and Moens, ; Feng and Hirst, ; Peldszus, ; Peldszus and Stede, ; Yanase et al., ). What is particularly interesting here from the deliberative perspective is how many of these structures go well beyond what deliberative theory has normatively classified. “Justification,” for instance, is a very broad category (Biran and Rambow, ; Park and Cardie, ; Oraby et al., ; Park et al., ; Eckle-Kohler et al., ; Rinott et al., ), and dozens of Walton’s schemes could plausibly fit under that heading. This approach has also begun to incorporate less logical structures, such as emotional or personal appeals (Wang and Cardie, ; Wachsmuth et al., ; Oraby et al., ), which remain a vexed issue in deliberative theory as well. As with network structure versus institutions, argument classification in the mining domain has expanded far beyond even the twenty-one criteria of Friess and Eilders (), but if anything the overall organizational structure unifying these components is even less clear.
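As a toy version of such supervised classification, a naive Bayes model can be trained on a handful of hand-labeled sentences to separate claims from premises. The training examples, labels, and bag-of-words features here are invented for illustration; real argument-mining systems use far larger corpora and much richer features.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled):
    """labeled: list of (sentence, label) pairs, e.g. 'claim' vs. 'premise'."""
    label_counts, word_counts = Counter(), defaultdict(Counter)
    for text, label in labeled:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    return label_counts, word_counts, vocab

def classify_nb(model, text):
    """Assign the label with the highest Laplace-smoothed log probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label
```

Trained on sentences such as “taxes should rise” (claim) and “because revenue fell” (premise), even this crude model picks up the lexical cues (“should,” “because”) that the word-count approaches above exploit.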

.. Proliferating Criteria and the Core

We have thus seen an extreme proliferation in all of the major branches of deliberative analysis of online discussion. Deliberative theory itself has grown into dozens of potential criteria, and these have only increased as the specific peculiarities of online communication have entered the mix. Even if we aggregate some of these criteria into input, process, and output stages, developments in network theory have greatly complicated the environmental and individual levels, and developments in argument mining and NLP have greatly complicated the sorts of content we can evaluate from a deliberative perspective, even using raw text and large scales. At the same time, as Friess and Eilders () also discuss, the outcome side of the process remains relatively underdeveloped, languishing in older measures such as knowledge and satisfaction surveys, or in crudely unrealistic measures of collective accuracy that have little bearing on the complex, subjective decisions characteristic of deliberative democracy (Page, ). All of these issues are furthermore fundamentally connected to the shift to online content. Deliberative experiments and polls were problematized by the introduction of online deliberative polls, but these issues become much more pressing once we acknowledge that despite their reputation, social media and other emergent modes of online communication are deeply deliberative, albeit erratically, with topologically complex structures and difficult-to-measure content (e.g., Twitter jargon). So what is the long-term solution here? The hierarchical classification approach (twenty-one criteria, ninety-six schemes, etc.) seems to only grow and become more baroque over time, whereas the search for one or a few core deliberative criteria seems to have been long abandoned, particularly with the turn away from the ideal of rational consensus. One advantage of traditional democratic and social choice theory is that it

provides (theoretically) distinct notions of individual preference and collective outcome, so while opinion varies about what collective outcomes are ideal given individual inputs, or even whether those outcomes are theoretically achievable (Arrow, ), at least both ends of the process are relatively stable. With deliberation, particularly online, the extremely complex communicative process becomes the end itself, and diversity of process, as well as opinion input, becomes almost an end in itself. That said, perhaps we can make some headway in trying to narrow down the deliberative process to something closer to a core, hewing as closely as possible to our earlier, rough-and-ready formulation: an extended conversation among two or more people to come to a better understanding of some issue. On the environmental input side, the core criterion most often appears to be something like equality: while in a deliberative poll one may need to seek out diversity of opinion, if we take the participants as given (e.g., in a self-selected online community), most of the subcriteria commonly enumerated are means toward achieving the end of equal participation. From a design point of view, there are questions about how to achieve this—equal speaking time, equalized and moderated content, limitations on self-sorting, and/or enforced cross-ideological communication—but it is unlikely that any of these questions will have single definitive answers without the sorts of collective outcome criteria (such as factual accuracy) that vary from situation to situation. From an online point of view, though—momentarily ignoring the often egregious communicative content—the institutional setting would seem relatively well suited to deliberative communication, with relatively open and equal forums yielding an apparent equality of participation.
The fact that the results are often so poor—from flaming to trolling—would suggest that this equality threshold may not be sufficient, and that indeed there may be no solution without looking more explicitly at communicative content and behavior. On the other end of the process, outcomes seem to vary most widely depending on the interests of the theorist or institutional designer, and in theory seem to potentially encompass every normative good under the sun, from information to engagement to trust on the individual level to consensus, increased agreement, or at least meta-agreement (about the justice of the decision procedure itself) on the collective level. The sine qua non, it would seem, would be some form of opinion change or behavioral change, but apart from this narrow core, there may be very little necessary overlap among these criteria: the deliberative qualities suited to one outcome may be entirely different from those suited to another. And from an online perspective, the outcome is perhaps understandably less well examined—especially in the wild—given how little explicit data we often have about opinion and behavior apart from the communicative activity itself. Even measuring the bare minimum—opinion change of any form—can be quite challenging. Despite having its own array of measurement challenges, though, the communicative process itself might be the most theoretically cohesive and substantive locus for capturing the core deliberative process. Assuming that participation is relatively equal and outcomes involve some sort of opinion shift other than polarization or groupthink, a deliberative process as distinguished from generic communication seems to involve something like collective deliberation in the same sense that an individual deliberates.

For the collective, as for the individual deliberator, the idea is less about acquiring new information and ideas, and more that the facts, ideas, beliefs, and memories that one already has must be considered, brought to bear on one another, and formed into a more internally coherent structure that yields an overall opinion or behavior that is more internally consistent or accurate than what had come before. By these lights, rationality, reason-giving, or justification are simply means to this broader sense of deliberation, as are the more upstream qualities often included in the deliberative process, such as civility, non-negative emotions, and constructive intentions. In the most ideal (if abstract) form of deliberation, one would expect the communicative process to somehow sort through the existing array of ideas so that, based on the individual or collective evaluation of them, the best arguments would “win” and allow the individual or group to select a better course of action than before. That is, we would hope that the better arguments gain credence based on their merits, rather than on the identity of the speaker or style of the presentation. In the online world, there are presumably moments, individuals, and domains that are more or less deliberative in this way, and the goal would be to distinguish more or less deliberative processes to better establish the types of environments, individuals, and communicative content that are more or less conducive to deliberation. To do so, however, requires measurement of deliberative communication at this relatively abstract and general level: not words associated with justification or non-negative emotion, for example, but a way to distinguish in a politically and content-agnostic way better arguments or ideas from worse, as well as to measure the conceptual connections between these ideas so that we can detect when (if ever) the collective or individuals have better sorted through their thoughts after deliberation.

. M A Q

..................................................................................................................................

Rather than attempt yet another effort to sort through and tabulate the numerous, multilevel, multistage criteria for deliberation that have proliferated over the last few decades, the remainder of this essay focuses on better understanding and measuring what is arguably the core deliberative process: the consideration and exchange of ideas in order to discern the better from the worse and to assemble the whole into a more cohesive and consistent, interconnected system. A full model of this process remains beyond the scope of this essay, but the next two sections suggest a couple of the component parts: first, a model that measures the latent persuasive effects of content as separate from style (and thus may discern the side with the more persuasive content); and second, a model that infers the latent connections between ideas, in order to distinguish more interactive, responsive deliberation from less. The hope is that exploring these approaches here may help us move toward more sophisticated models of online deliberation that actually capture something closer to the core deliberative process, rather than crude word-level correlates or rough individual-level behavioral characteristics.

One of the fundamental assumptions of deliberation is that in the process of talking through ideas the group may eventually distinguish the better from the worse and thereby arrive at a better conclusion. We have seen that there are many ways to distinguish more or less persuasive styles, but how are we to measure the inherent value of arguments owing to the merit of their content rather than mere style? That is, how might we distinguish better from worse arguments? We could certainly imagine assigning an army of human coders to sift through vast quantities of text and score arguments, but of course that would lack both generality and objectivity. It’s difficult to even conceptualize what an argument’s inherent strength or value might be independent of the context and background of the evaluator. Indeed, this may be part of the weakness of a theory of deliberation that imagines that arguments win on their own merits. But without something like that assumption, we are left with little more than style and expression, and the entire edifice of rationality, truth, and reason seems lost. So for our purposes here, let’s consider a slightly more modest goal: inferring not the objective quality of arguments, but the inherent, if latent, persuasive effect of topics or ideas that come by merit of their content rather than the style of their presentation. This at least moves us a little closer to a full deliberative model of content quality. To infer this empirically, consider a specific form of argument strength: the inherent persuasiveness of an idea, subtopic, or subissue within the context of a larger debate on a broader issue (Wang et al., ). These data are currently from live rather than online speech—the “Intelligence Squared” Oxford-style public debates—so the discussion here is brief, but the model may be applied to online discussion equally well if we have an overall measure of opinion and persuasion. 
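Such an overall measure can be quite simple in practice: as described below, the audience is polled before and after the debate, and the "winner" is whichever side gained more supporters. As a concrete illustration (all numbers hypothetical):

```python
# Hypothetical pre- and post-debate audience poll shares.
pre  = {"pro": 0.35, "con": 0.40, "undecided": 0.25}
post = {"pro": 0.48, "con": 0.42, "undecided": 0.10}

# "Winner" = the side that gained more supporters over the debate.
gains = {side: post[side] - pre[side] for side in ("pro", "con")}
winner = max(gains, key=gains.get)
print(winner, gains)
```

This binary outcome is what supervises the model sketched in this section: topics and styles are judged persuasive insofar as they predict it.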
Our motivating example is a segment of debate about the death penalty, quoted in Table 18.1. Here both sides are discussing the death penalty from the perspective of the mistaken execution of the

Table 18.1 Excerpt from Debate on Death Penalty about Execution of Innocents*
Motion: Abolish the Death Penalty
Argument 1 (PRO): What is the error rate of convicting people that are innocent? . . . When you look at capital convictions, you can demonstrate on innocence grounds a 4.1% error rate, 4.1% error rate. I mean, would you accept that in flying airplanes? I mean, really . . . .
Argument 2 (CON): The risk of an innocent person dying in prison and never getting out is greater if he’s sentenced to life in prison than it is if he’s sentenced to death. So the death penalty is an important part of our system.
Argument 3 (PRO): I think if there were no death penalty, there would be many more resources and much more opportunity to look for and address the question of innocence of people who are serving other sentences.
Model inferences
Rhetorical features: Questions (1), Numerical evidence (1), Logical reasoning (2, 3)
Strength: (1) & (3) inferred Strong; (2) inferred Weak
* Our model scores each argument according to its rhetorical features as well as its latent persuasive strength.

innocent; both use various rhetorical techniques to strengthen the persuasive effects of their arguments, but both are also constrained by the overall topic. The hypothesis is that some topics of argument (within the overall topic of the death penalty) are inherently more suited to one side or the other, and while those on the disadvantaged side may do their best to bring rhetoric or other smaller pieces of information to bear, they are fundamentally disadvantaged as long as the discussion is on this issue, and therefore would do better to switch the topic to something better suited to their side. In this case, the discussion of innocent executions is better suited to the anti-death-penalty side, as perhaps would be a discussion of racial disparities in execution rates, whereas a discussion of the specific heinous acts committed in various murders would perhaps inherently support the pro-death-penalty side. While this framework, with its emphasis on strategy and rhetoric, may seem somewhat counter to the spirit of deliberation with its emphases on justification and sincerity, it acknowledges that no matter how sophisticated the environment or moderation, or how sincere the participants, the reality is that people do their best to persuade others using many different substantive and rhetorical tools. The fundamental assumption for deliberation to be able to take place, however, is that there be at least some substance beneath the rhetoric, and that we have some way (as listeners or researchers) of distinguishing inherently better and worse arguments and may come closer to the truth (even a multivalent, contingent truth) upon deliberating over these arguments as they are put forth. Each of the public debates in this data set is on a set topic (e.g., the death penalty), with experts arguing either side.
Crucially for these purposes, the audience for these public debates is surveyed about the issue both before and after the debate, with the “winner” being the side that has gained more supporters, and we take this outcome as our measure of persuasive effect. Thus we are able to train the computational model using the observed debate outcomes, to determine which topics and other features are predictive of winning a debate (i.e., which are persuasive). A hidden topic Markov model (Gruber et al., ) is used to automatically segment a debate into “arguments”: chunks of text a few sentences long that are all on the same topic. We also measure per “argument” as many stylistic features as possible that might have persuasive effects—everything from pronouns and basic sentiment to logical and justification terms, hedges, emotions, concrete language, readability, personality, and raw word counts. The idea is that both substance (the inherent persuasive effects of various topics) and style affect outcome, and if we wish to avoid merely finding the superficial stylistic correlates of substance (e.g., counting “reasoning” words such as “because” or “therefore”), we must incorporate those stylistic markers in our model and hope that in controlling for these superficial effects, the residual will be the inherent topic-specific persuasive effects. We then build a latent-variable structural support vector machine (Yu and Joachims, ), which uses both the observed stylistic features and the unobserved, latent argument strengths to predict debate outcomes, in which the observed debate winners serve to train the model to infer which features are persuasive and which topics are strong for one side or the other, and then is tested out-of-sample by using those inferred values to predict the outcomes of some debates and compare those

predictions with the actual outcomes. There are of course many more technical details to this, which are detailed in Wang et al. (). Ultimately, we are able to predict % of the debate outcomes correctly (out of sample), a significant improvement over using just the observed stylistic features (%) or guessing at random (%). How is it possible to predict the inherent persuasive effects of topics in a debate that the model has never encountered before, on a subject entirely new to the data set? Because the model estimates the effects of interactions between our myriad stylistic measures and topics, it learns the many subtle stylistic features that tend to be associated with intrinsically stronger or weaker topics, then uses those observed stylistic features to predict the inherent persuasiveness of new topics. This is different, however, from merely using style to predict persuasion; only when combined with the model of latent strength does the predictive accuracy jump, suggesting that while we may (with lots of computation) be able to identify persuasive content by the style in which the arguers present it, it is nevertheless the merits of the content that cause those persuasive effects, rather than some elaborate interaction of styles alone. If one finds this plausible, what it shows is that argument strength—the idea that some topics are inherently more persuasive for one side than the other—does seem to play a significant role in debate. The winning side is usually (but not always) the one with more strong arguments, and most arguments are strong for one side or the other but not both. It may seem over-idealistic to imagine that the inherent value of content plays any role in persuasion in public debates, or one may be skeptical of the model’s method for inferring these latent values. 
But the larger point is that if one believes that one of the core purposes of deliberation is separating the conceptual wheat from the chaff, one has to believe that individual deliberators have a way of doing so that is not merely reflective of superficial stylistic presentation. And to empirically measure this, one needs a human or computational approach that can discern such latent values, using either the methods described here or some other latent model that goes far beyond the usual superficial word counting common in other measures of deliberative quality. Of course, much of the value of ideas lies not in some atomic, intrinsic quality, but in their interrelationships and mutual consistency, interdependencies, and contextual meanings. Inferring those things empirically leads us to the second deep-modeling approach.
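The full latent-variable structural SVM is involved, but the core idea of this section—learning per-topic strengths jointly with style weights from observed debate outcomes—can be illustrated with a deliberately simplified, entirely synthetic sketch, in which the topic coefficients of a plain logistic model stand in for the latent strengths (all data and parameter values here are hypothetical, not the Intelligence Squared corpus):

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_style, n_debates = 5, 4, 800

# Hypothetical ground truth: signed topic strengths (positive favors PRO,
# negative favors CON) plus stylistic effects on the debate outcome.
s_true = rng.normal(0, 1.5, n_topics)
w_true = rng.normal(0, 0.5, n_style)

X, y = [], []
for _ in range(n_debates):
    x = np.zeros(n_topics + n_style)
    for _ in range(10):                    # ten "arguments" per debate
        side = rng.choice([1.0, -1.0])     # +1 = PRO speaks, -1 = CON
        topic = rng.integers(n_topics)
        style = rng.normal(size=n_style)
        x[topic] += side                   # which side raised this topic
        x[n_topics:] += side * style       # side-weighted style features
    X.append(x)
    margin = x[:n_topics] @ s_true + x[n_topics:] @ w_true
    y.append(1.0 if margin + rng.normal(0, 0.5) > 0 else 0.0)
X, y = np.array(X), np.array(y)

# Logistic regression by gradient descent: the learned topic coefficients
# play the role of inferred argument strengths, estimated jointly with
# (and thus controlled for) the style weights.
theta = np.zeros(n_topics + n_style)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    theta -= 0.05 * (X.T @ (p - y)) / n_debates

s_hat = theta[:n_topics]
print("correlation of inferred vs. true topic strengths:",
      round(float(np.corrcoef(s_hat, s_true)[0, 1]), 2))
```

Even this crude stand-in recovers the direction of the topic strengths from outcome supervision alone; the actual model additionally infers per-argument latent strengths and their interactions with style, which is what produces the out-of-sample gains reported above.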

. M C C

..................................................................................................................................

To measure the connections between ideas requires an additional layer: not just the topics of discussion, but a network of links between them. Others have approached this idea of a conceptual network in similar ways (Shaffer et al., ; Lodge and Taber, ), but the goal here is to infer these networks using purely computational methods. These are similar to the ones previously used: first one automatically infers the topics

under discussion, then one infers the connections between those topics, operationalized as the correlations between their usage, where the assumption is that topics that co-occur tend to be more tightly linked in people’s minds, and those that are anticorrelated are considered to be inconsistent with each other in some practical sense. The hypothesis is that people have these networks of linked ideas in their minds, and when they are truly deliberating, they don’t merely hammer on their own preferences, nor do they mindlessly echo whatever framing content their interlocutor has put forth. Rather, in a truly responsive conversation, a person will consider the topics their interlocutor has raised, then raise ones that are logically or conceptually related—new ideas that support or contradict the position just heard. The empirical test of the model, therefore, is whether we can predict the topics of person B responding to person A better using the network-based approach than by merely guessing that B will either mindlessly echo A or merely beat on the same drum B always beats on. If this can be established, then we might validate both this specific model of network-based conversation and the idea that deliberative responsiveness as a latent quantity can be objectively measured. It may also allow us to discern when and where interlocutors are more or less deliberative and how conceptual networks may differ and influence deliberation. To build this model, we need much more interpersonal communication than the formal debates previously employed provide. Instead, the two largest online political forums were scraped, each with millions of users over the last decade, one on the Left (Dailykos) and one on the Right (Redstate), because there are virtually no bipartisan political forums online (Beauchamp, ).
While the dream of a robust, full-spectrum deliberative environment remains unfulfilled, these uni-partisan forums are still robustly deliberative, probably much more so than domains in which the two sides go to war with each other (e.g., the comment forums for many newspapers). Indeed, because it is less clear what “ideology” means within parties in the United States at the moment, this is precisely what makes these discussions more deliberative and potentially fruitful, since the participants themselves are not aware what labels and identities should apply to themselves (except, arguably, during primary season). Both forums have had extensive discussions with many millions of posts over many years, and discussions tend to be threaded such that one can discern who is speaking to whom. As before, we begin with a topic model, but then construct a network of correlations linking topics with positive or negative edges depending on their empirical correlations within the speech acts of all the users. Deliberative responsiveness is modeled as the raising of topics or ideas related to, but different from, what one’s interlocutor has just said, which is modeled as a Markov process where person A makes a point with some distribution over the topics (e.g., % about the execution of innocents, % about recidivism), and person B responds with topics that are correlated with but in some respect different from those topics (e.g., bringing up life imprisonment in addition to the existing topics under discussion), which is operationalized by multiplying the topic vector through the correlation matrix: b = Ca. Figure . shows how well various models of responsiveness describe the interactions on these two forums, where each

[Figure: Predictive accuracy of four models of user interaction—Random reply, Previous comment, Author mean, and Network prediction—for Dailykos (all users) and Redstate (all users), plotted as MAE relative to guessing the forum mean (0).]

Note: The figure shows how well these models of user interaction explain the content of post B written in response to post A, across two large political forums: Redstate (conservative) and Dailykos (liberal). Models with higher values indicate better accuracy in predicting responses.

model is tested by seeing how well it predicts the topics of a comment by person B responding to a comment by person A. As we can see, there are interesting asymmetries between the Left and Right; while on the Right the best predictor of what B will say in response to A is whatever B usually says (B’s mean response), for the Left the best predictor is the network model of responsiveness. That is, at least for these two forums, the Left—whether due to the environment, individuals, or content under discussion—shows more core deliberative responsiveness. This study also establishes that these deliberative arguments do seem to have long-term effects on users’ ideology, where ideology is measured by examining which posts users tend to like in a fully automated way (much as the vote record can be used to infer ideology without any human supervision), and also shows that the density and structure of these conceptual networks seem to vary across the Left and Right in interesting ways (see Beauchamp () for further details). But most fundamentally for our purposes here, it suggests that not only do individual topics and ideas have intrinsic

strengths that appear to be correlated with real-world persuasive outcomes (at least in formal debates), but also these topics and ideas are connected in conceptual networks that allow us to discern more from less deliberative conversation, where the notion of deliberation here is not some superficial, stylistic measure, but instead a deep model of responsiveness that attempts to show objectively the way in which truly responsive discussants engage in creative and complex ways with what others are saying. Not only are some topics more inherently persuasive than others, but their effects are intertwined, and to understand deliberation—the careful consideration of ideas to work through their connections and implications—requires something like these high-level latent models in order to characterize and predict behavior. Neither approach offers a complete model of core deliberation, nor do these ideas comprise the only possible core. Nor is the first model readily applicable to domains without well-measured persuasive outcomes, nor the second model to the sorts of brief, more diffuse conversations more common on Facebook, Twitter, and so forth, let alone face-to-face conversation. But both suggest an approach that might allow us to focus on some core notion of deliberation that is fairly general yet conceptually rich, and that is neither a collection of superficial stylistic correlates nor a vast collection of heuristics for capturing every aspect of deliberation hypothesized by three decades of theorists.
But whatever the specific approach, a model that focuses on responsive conversation in the service of conceptual improvement, or some other such core, allows us to both discern fundamentally deliberative environments or moments in the wild and take some action to tweak those environments in beneficial ways without requiring that we construct complete online deliberative utopias in order to achieve the civil and democratic purposes for which deliberative theory was originally developed.
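The responsiveness comparison described in this section can be sketched on synthetic data. This is a rough, hypothetical stand-in: the actual study fits topic models to millions of real posts and estimates links from co-occurrence correlations, whereas here the topic-link matrix C in b = Ca is fit by least squares from simulated reply pairs, and a forum-mean baseline stands in for the author mean:

```python
import numpy as np

rng = np.random.default_rng(7)
n_topics, n_train, n_test = 6, 2000, 500

# Hypothetical "conceptual network": each topic is linked to two others,
# so replies raise related-but-different topics. (Synthetic stand-in.)
C_true = np.eye(n_topics) * 0.4
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]:
    C_true[u, v] = C_true[v, u] = 0.6

def simulate(n):
    A = rng.dirichlet(np.full(n_topics, 0.3), size=n)  # post A topic mixtures
    B = A @ C_true.T + rng.normal(0, 0.02, (n, n_topics))
    B = np.clip(B, 1e-9, None)
    return A, B / B.sum(axis=1, keepdims=True)         # replies b ~ C·a

A_tr, B_tr = simulate(n_train)
A_te, B_te = simulate(n_test)

# "b = Ca": estimate the topic-link matrix from training reply pairs.
C_hat, *_ = np.linalg.lstsq(A_tr, B_tr, rcond=None)

def mae(pred, truth):
    return float(np.abs(pred - truth).mean())

results = {
    "random reply": mae(B_tr[rng.integers(n_train, size=n_test)], B_te),
    "previous comment (echo)": mae(A_te, B_te),
    "forum mean": mae(np.tile(B_tr.mean(axis=0), (n_test, 1)), B_te),
    "network prediction": mae(A_te @ C_hat, B_te),
}
for name, m in results.items():
    print(f"{name:>24}: MAE = {m:.4f}")
```

In this simulation the network prediction attains the lowest error, mirroring the qualitative pattern reported for Dailykos; on real forums, which baseline wins is precisely the empirical question.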

. Conclusion

..................................................................................................................................

Deliberation theory began to grapple with online conversation almost as soon as Internet communication became prevalent. While most comparisons have been unfavorable, whether the online environment was found in the wild or purposely constructed (Dubrovsky et al., ; Bordia, ; Hobman et al., ; Min, ), we have also seen that online modes offer many potential advantages, including anonymity, diversity, engagement, cost, and more flexible, open-ended network structures. While deliberative polls and even purpose-built online deliberative forums are prohibitively costly to deploy at the scales common for social media, social media themselves have many deliberative characteristics and subcommunities. The challenge is discerning these more deliberative spaces and moments without getting swamped either by the immense proliferation of deliberative criteria or the immense scale of measuring terabytes of textual content and other online behaviors. While NLP and argument-mining methods present new approaches to measure deliberative behavior in the wild at scale, these domains also

tend to run wild with numerous criteria and measures for gauging argument and discussion quality, many of which are disappointingly superficial when examined in detail. While it is unlikely that this large, multidisciplinary community will converge upon a single core notion of deliberation any time soon, it has been argued here that it is still worth seeking out a core concept of deliberation. This core concept should ideally be distinct from existing political theories, including many aspects of environment or input (such as equality or representativeness), much of the output (such as information gain), and even many procedural criteria (such as civility or respect), which are all potentially well-established norms in existing theories of democracy or civil society. Arguably, this deliberative core is, as the term “deliberation” suggests, less about adding information or behaving well and more about how existing arguments, facts, and ideas are evaluated by the thinking individual or conversing group—how the better arguments and ideas are sifted from the worse, and how the connections of support and contradiction among them are worked through to construct more consistent conceptual networks and courses of action. To measure these abstract qualities, however, requires not just superficial measures of textual features, individual opinions, or environmental structures, but rather more substantial procedural models of the quality and interrelations among arguments. Separate models of these two things were presented here, but ultimately we will need a deeper model combining the strength of ideas, the network of support and contradiction linking them, and their behavioral consequences for the individual and a group trying to come to a collective decision.
While no model will truly be able to measure the content and quality of ideas or the immensely complex logical, conceptual, and cultural connections between them, more substantive models of deliberation will allow us to better discern not just brief moments of deliberative quality online, but also what sorts of network topologies, content moderation, individual characteristics, and content topics are best suited to productive conversation, and to guide whatever interventions we can manage into the fast-evolving jungle of social media.

A The author would like to thank Sarah Shugars, Lu Wang, and Kechen Qin for their help in writing this essay.

R Ackerman, B., and J. S. Fishkin. (). Deliberation day. Journal of Political Philosophy (), –. Adamic, L. A., and N. Glance (). The political blogosphere and the  us election: divided they blog. In Proceedings of the rd International Workshop on Link Discovery, pp. –. Association for Computing Machinery (ACM), Chicago, IL.

Albrecht, S. (). Whose voice is heard in online deliberation? A study of participation and representation in political debates on the Internet. Information, Community and Society (), –.
Arrow, K. J. (). Arrow's theorem. In The new Palgrave: A dictionary of economics, –. Basingstoke: Palgrave Macmillan.
Bächtiger, A., M. Spörndli, M. R. Steenbergen, and J. Steiner. (). The deliberative dimensions of legislatures. Acta Politica (), –.
Beauchamp, N. (). Someone is wrong on the Internet: Modeling argument and persuasion via an exchange of ideas. Working Paper.
Beigman, Y. S. M. H. B., and K. P. Deane. (). Applying argumentation schemes for essay scoring. In Proceedings of the nd Annual Meeting of the Association for Computational Linguistics (ACL).
Berger, J., and K. L. Milkman. (). What makes online content viral? Journal of Marketing Research (), –.
Biran, O., and O. Rambow. (). Identifying justifications in written dialogs by classifying text as argumentative. International Journal of Semantic Computing (), –.
Bohman, J. (). Public deliberation: Pluralism, complexity, and democracy. Cambridge, MA: MIT Press.
Bordia, P. (). Face-to-face versus computer-mediated communication: A synthesis of the experimental literature. Journal of Business Communication (), –.
Boulianne, S. (). Does Internet use affect engagement? A meta-analysis of research. Political Communication (), –.
Brundidge, J. (). Encountering “difference” in the contemporary public sphere: The contribution of the Internet to the heterogeneity of political discussion networks. Journal of Communication (), –.
Choi, S. (). Flow, diversity, form, and influence of political talk in social-media-based public forums. Human Communication Research (), –.
Cohen, J. (). Deliberation and democratic legitimacy. In Deliberative democracy: Essays on reason and politics. Cambridge, MA: MIT Press.
Dahlberg, L. (). The Internet and democratic discourse: Exploring the prospects of online deliberative forums extending the public sphere. Information, Communication & Society (), –.
Douglas, W. (). Argumentation schemes for presumptive reasoning. London, UK: Routledge.
Dryzek, J. S. (). Discursive democracy: Politics, policy, and political science. New York, NY: Cambridge University Press.
Dryzek, J. S. (). Democratization as deliberative capacity building. Comparative Political Studies (), –.
Dubrovsky, V. J., S. Kiesler, and B. N. Sethna. (). The equalization phenomenon: Status effects in computer-mediated and face-to-face decision-making groups. Human-Computer Interaction (), –.
Eckle-Kohler, J., R. Kluge, and I. Gurevych. (). On the role of discourse markers for discriminating claims and premises in argumentative discourse. In Proceedings of the  Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. https://www.aclweb.org/
Eveland, W. P. Jr., and S. B. Kleinman. (). Comparing general and political discussion networks within voluntary organizations using social network analysis. Political Behavior (), –.
Feng, V. W., and G. Hirst. (). Classifying arguments by scheme. In Proceedings of the th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Vol. , pp. –. Association for Computational Linguistics. https://www.aclweb.org/
Fishkin, J. (). Deliberation by the people themselves: Entry points for the public voice. Election Law Journal (), –.
Friess, D., and C. Eilders. (). A systematic review of online deliberation research. Policy & Internet (), –.
Garrett, R. K. (). Politically motivated reinforcement seeking: Reframing the selective exposure debate. Journal of Communication (), –.
Gastil, J., and L. Black. (). Public deliberation as the organizing principle of political communication research. Journal of Public Deliberation ().
Gold, V., M. El-Assady, T. Bögel, C. Rohrdantz, M. Butt, K. Holzinger, and D. Keim. (). Visual linguistic analysis of political discussions: Measuring deliberative quality. Digital Scholarship in the Humanities (), –.
González-Bailón, S., R. E. Banchs, and A. Kaltenbrunner. (). Emotions, public opinion, and US presidential approval rates: A -year analysis of online political discussions. Human Communication Research (), –.
González-Bailón, S., A. Kaltenbrunner, and R. E. Banchs. (). The structure of political discussion networks: A model for the analysis of online deliberation. Journal of Information Technology (), –.
Gruber, A., Y. Weiss, and M. Rosen-Zvi. (). Hidden topic Markov models. In Artificial Intelligence and Statistics, Vol. , pp. –.
Gutmann, A., and D. Thompson. (). Democracy and disagreement. Cambridge, MA: Harvard University Press.
Habermas, J. (). Moral consciousness and communicative action. Cambridge, MA: MIT Press.
Habermas, J., and T. McCarthy. (). The theory of communicative action, Vol. . Boston, MA: Beacon Press.
Hargittai, E., J. Gallo, and M. Kane. (). Cross-ideological discussions among conservative and liberal bloggers. Public Choice (–), –.
Himelboim, I. (). Reply distribution in online discussions: A comparative network analysis of political and health newsgroups. Journal of Computer-Mediated Communication (), –.
Himelboim, I. (). Civil society and online political discourse: The network structure of unrestricted discussions. Communication Research (), –.
Himelboim, I., E. Gleave, and M. Smith. (). Discussion catalysts in online political discussions: Content importers and conversation starters. Journal of Computer-Mediated Communication (), –.
Hobman, E. V., P. Bordia, B. Irmer, and A. Chang. (). The expression of conflict in computer-mediated and face-to-face groups. Small Group Research (), –.
Holzinger, K. (). Bargaining through arguing: An empirical analysis based on speech act theory. Political Communication (), –.
Janssen, D., and R. Kies. (). Online forums and deliberative democracy. Acta Politica (), –.
Jones, L. M., K. J. Mitchell, and D. Finkelhor. (). Online harassment in context: Trends from three youth Internet safety surveys. Psychology of Violence (), .
Kayany, J. M. (). Contexts of uninhibited online behavior: Flaming in social newsgroups on Usenet. Journal of the American Society for Information Science (), –. Kiesler, S., J. Siegel, and T. W. McGuire. (). Social psychological aspects of computermediated communication. American Psychologist (), . Kriplean, T., J. Morgan, D. Freelon, A. Borning, and L. Bennett (). Supporting reflective public thought with Considerit. In Proceedings of the ACM  Conference on Computer Supported Cooperative Work, pp. –. Association for Computing Machinery: https://dl.acm.org/citation.cfm?id=. Landwehr, C., and K. Holzinger. (). Institutional determinants of deliberative interaction. European Political Science Review (), –. Lodge, M., and C. S. Taber. (). The rationalizing voter. New York, NY: Cambridge University Press. Manin, B. (). Democratic deliberation: Why we should promote debate rather than discussion. In Papers delivered at the program in ethics and public affairs seminar, Princeton University, Vol. . Mansbridge, J. (). A minimalist definition of deliberation. In Deliberation and development: Rethinking the role of voice and collective action in unequal societies, pp. –. Mansbridge, J., J. Bohman, S. Chambers, D. Estlund, A. Fllesdal, A. Fung, C. Lafont, B. Manin, et al. (). The place of self-interest and the role of power in deliberative democracy. Journal of Political Philosophy (), –. Min, S.-J. (). Online vs. face-to-face deliberation: Effects on civic engagement. Journal of Computer-Mediated Communication (), –. Mochales, R., and M.-F. Moens. (). Argumentation mining. Artificial Intelligence and Law (), –. Moens, M.-F., E. Boiy, R. M. Palau, and C. Reed. (). Automatic detection of arguments in legal texts. In Proceedings of the th International Conference on Artificial Intelligence and Law, pp. –. Association for Computing Machinery. 
https://dl.acm.org/citation.cfm? id=. Mutz, D. C. (). Hearing the other side: Deliberative versus participatory democracy. New York, NY: Cambridge University Press. Nabatchi, T. (). Putting the “public” back in public values research: Designing participation to identify and respond to values. Public Administration Review (), –. Nelimarkka, M., B. Nonnecke, S. Krishnan, T. Aitamurto, D. Catterson, C. Crittenden, C. Garland, C. Gregory, C.-C. A. Huang, G. Newsom, et al. (). Comparing three online civic engagement platforms using the “spectrum of public participation” framework. In Proceedings of the Oxford Internet, Policy, and Politics Conference (IPP), pp. –. http://ipp.oii.ox.ac.uk/sites/ipp/les/documents/IPP_Nelimarkka.pdf Niemann, A. (). Beyond problem-solving and bargaining: Genuine debate in EU external trade negotiations. International Negotiation (), –. Oraby, S., L. Reed, R. Compton, E. Riloff, M. Walker, and S. Whittaker. (). And that’s a fact: Distinguishing factual and emotional argumentation in online dialogue. Proceedings of the nd Workshop on Argumentation Mining, Association for Computational Linguistics.,–: http://www.aclweb.org/anthology/W- Page, S. E. (). The difference: How the power of diversity creates better groups, firms, schools, and societies. Princeton, NJ: Princeton University Press. Palau, R. M., and M.-F. Moens. (). Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the th International Conference on



 

Artificial Intelligence and Law, pp. –. Association for Computing Machinery: https://dl.acm.org/citation.cfm?id=. Papacharissi, Z. (). Democracy online: Civility, politeness, and the democratic potential of online political discussion groups. New Media & Society (), –. Papacharissi, Z. (). The virtual sphere .: The Internet, the public sphere, and beyond. In Routledge handbook of Internet politics, pp. –. London, UK: Routledge Park, J., and C. Cardie. (). Identifying appropriate support for propositions in online user comments. In Proceedings of the First Workshop on Argumentation Mining, pp. –: http://www.aclweb.org/anthology/W-. Park, J., A. Katiyar, and B. Yang. (). Conditional random fields for identifying appropriate types of support for propositions in online user comments. Proceedings of the nd Workshop on Argumentation Mining. pp. -: http://www.aclweb.org/anthology/W- Peldszus, A. (). Towards segment-based recognition of argumentation structure in short texts. Proceedings of the First Workshop on Argumentation Mining, pp. -: http://www. aclweb.org/anthology/W- Peldszus, A., and M. Stede. (). Joint prediction in MST-style discourse parsing for argumentation mining. In Proceedings of the  Conference on Empirical Methods in Natural Language Processing, –. http://www.aclweb.org/anthology/D- Pennebaker, J. W., M. E. Francis, and R. J. Booth. (). Linguistic inquiry and word count: Liwc . Mahwah, NJ: Lawrence Erlbaum Associates. Price, V., and J. N. Cappella. (). Online deliberation and its influence: The electronic dialogue project in campaign . IT & Society (), –. Reinig, B. A., R. O. Briggs, and J. F. Nunamaker Jr. (). Flaming in the electronic classroom. Journal of Management Information Systems (), –. Rinott, R., L. Dankin, C. Alzate, M. M. Khapra, E. Aharoni, and N. Slonim. (). 
Show me your evidence—an automatic method for context dependent evidence detection. In Proceedings of the  Conference on Empirical Methods in Natural Language Processing, pp. –: http://www.aclweb.org/anthology/D- Schneider, S. M. (). Expanding the public sphere through computer-mediated communication: Political discussion about abortion. PhD thesis, Massachusetts Institute of Technology. Shaffer, D. W., D. Hatfield, G. N. Svarovsky, P. Nash, A. Nulty, E. Bagley, K. Frank, A. A. Rupp, and R. Mislevy. (). Epistemic network analysis: A prototype for st-century assessment of learning. International Journal of Learning and Media (IJLM) () pp. –. Shermis, M. D., and J. Burstein. (). Handbook of automated essay evaluation: Current applications and new directions. London, UK: Routledge. Stab, C., and I. Gurevych, (a). Annotating argument components and relations in persuasive essays. In Proceedings of COLING , the th International Conference on Computational Linguistics: Technical Papers, pp. –. Stab, C., and I. Gurevych, (b). Identifying argumentative discourse structures in persuasive essays. In Proceedings of the  Conference on Empirical Methods in Natural Language Processing (EMNLP), –. http://www.aclweb.org/anthology/D- Steenbergen, M. R., A. Bächtiger, M. Spörndli, and J. Steiner. (). Measuring political deliberation: A discourse quality index. Comparative European Politics (), –. Steiner, J. (). Deliberative politics in action: Analyzing parliamentary discourse. New York, NY: Cambridge University Press. Sunstein, C. R. (). Republic. com .. Princeton, NJ: Princeton University Press. Surowiecki, J. (). The wisdom of crowds. New York, NY: Anchor.
Thompson, D. F. (). Deliberative democratic theory and empirical political science. Annual Review of Political Science , –. Toulmin, S. ([] ). The uses of argument. Cambridge, UK: Cambridge University Press. Trénel, M. (). Measuring the quality of online deliberation. Coding scheme .. Unpublished paper . Vogel, R., E. Moulder, and M. Huggins. (). The extent of public participation. Public Management, (), pp.–. Wachsmuth, H., M. Trenkmann, B. Stein, and G. Engels. (). Modeling review argumentation for robust sentiment analysis. In Proceedings of COLING , the th International Conference on Computational Linguistics: Technical Papers, pp. –. Walton, D., C. Reed, and F. Macagno. (). Argumentation schemes. New Yor, NY: Cambridge University Press. Wang, L., N. Beauchamp, S. Shugars, and K. Qin. (). Transactions of the Association for Computational Linguistics. (): http://aclweb.org/anthology/Q- Wang, L. and C. Cardie (). A piece of my mind: A sentiment analysis approach for online dispute detection. In Proceedings of the nd Annual Meeting of the Association for Computational Linguistics (Volume : Short Papers), vol. , pp. –. Wojcieszak, M. E., and D. C. Mutz (). Online groups and political discourse: Do online discussion spaces facilitate exposure to political disagreement? Journal of Communication (), –. Wyner, A., R. Mochales-Palau, M.-F. Moens, and D. Milward. (). Approaches to text mining arguments from legal cases. In Semantic processing of legal texts. pp. –. Berlin, Heidelberg: Springer. Yanase, T., T. Miyoshi, K. Yanai, M. Sato, M. Iwayama, Y. Niwa, P. Reisert, and K. Inui. (). In Proceedings of the nd Workshop on Argumentation Mining, pp. –: http://www. aclweb.org/anthology/W-. Yu, C.-N. J., and T. Joachims. (). Learning structural SVMs with latent variables. In Proceedings of the th Annual International Conference on Machine Learning, pp. 
–.

  ......................................................................................................................

Social Media and Emotions in Political Communication

 . 

1. Introduction

Social media have revolutionized political communication. The political environment is evolving rapidly, and the traditional gatekeepers of information—the media and elites—have ceded much control over the dissemination of political content (Gainous & Wagner, ; Jacobs & Shapiro, ; Shogan, ; Williams & Delli Carpini, ). Equally important is the transformation in how people access political information and engage in interpersonal political communication. All of these changes call for a more nuanced understanding of the ways that people communicate on social media and the effects of exposure to online communications on subsequent political behavior (Jacobs & Shapiro, ).

A byproduct of these transformations is the generation of novel forms of textual data, opening avenues of inquiry related to mass political communication that were previously inconceivable. To date, however, these data have been underutilized. Researchers studying mass communication largely focus on sentiment analysis, in which the emotional language used in political communication is aggregated as a measure of public mood or opinion. Studies utilizing textual data from social media frequently do not make clear the premises or assumptions made about the data generation process, ignoring what we know from non-text-based research about why elites and the mass public communicate about politics on social media. Consequently, scholars from different disciplines do not always agree about the value added by findings from text-based social media studies, studies often rooted more in data exploration than in hypothesis testing.1

   



It is time to apply our theories of political communication more rigorously to the textual data generated from social media sites. We should move beyond the straightforward categorization of sentiment as a measure of public opinion and instead prioritize a richer consideration of when, how, and why people communicate, and communicate emotionally, about politics on social media sites. In addition to bolstering our understanding of the communication of elites and opinion leaders, this new research will vastly improve our understanding of interpersonal political communication among the mass public, addressing questions that have remained unanswered for decades because of a lack of suitable data to test theoretical claims. For example, how do the type, level, and dynamics of emotion in a political conversation amplify or alter the patterns of influence we expect to see between discussants? When and how do opinion leaders communicate about politics, and in the "two-step flow of communication," what is the effect of their adding their own commentary before passing elite messages along to their networks?

In this chapter I assess the variety of methodological approaches used to characterize online political communication and measure emotion within it. I then outline what I see as some of the important missing theoretical questions that scholars must grapple with to better understand both the data generation and data interpretation processes for analyzing politically oriented textual data created on social media sites. I briefly review pertinent theories about the role of emotion in political communication, focusing on how the application of these theories to online interactions raises new definitional, theoretical, and methodological questions meriting future study.

. P C  S M: M  F

Because of the relatively low cost of accessing public data from Twitter and the widespread interest in the effects of social media on communication generally, thousands of papers analyzing textual social media data have been published in conference proceedings and academic journals across the humanities, social sciences, and information sciences. Here I outline the general methodologies used to identify, categorize, and analyze online political communication. I then discuss the non-text-based methodologies studying the same topic. Synthesizing the findings of these two veins of research suggests that we should not naively interpret the results of text-based studies without considering the particularities of the data generation process underpinning the creation of that text.

2.1 Textual-Analysis-Based Approaches

To fully capitalize on the research possibilities created by this new form of communication, many scholars choose to directly analyze the textual data generated from online
communication on social media sites such as Facebook and Twitter. Grimmer and Stewart () provide an excellent overview of the research design decisions involved in a project using text as data; although they do not focus on the use of social media in particular, many of the issues they raise remain applicable. In the next section I outline some of the decisions that must be made in the social media data collection process, as well as some of the most commonly utilized techniques for coding and quantifying the data.

2.1.1 Recognizing Political Communication

The first challenge of utilizing social media data to study political communication is identifying what counts as "political" text. Scholars studying communication generated by the media or government officials face fewer ambiguities in determining whether text is politically relevant. The most common way to label political speech is to make actor-based assumptions: identify "political" actors and then designate their online text as "political." This is the approach taken in all of the studies examining elite usage of Twitter and Facebook (e.g., Gainous & Wagner, ; Shogan, ; Glassman, Straus, & Shogan, ; Golbeck, Grimes & Rogers, ; Lassen & Brown, ; Groshek & Al-Rawi, ). These studies typically define the universe of available cases for study (e.g., all congressional candidates in a particular election, or all members of Congress who have a certain level of activity on Twitter), collect all text coming from these accounts, and then code for various additional characteristics of this predesignated "political" communication.

Researchers studying the ways that the mass public interacts about politics face a different challenge: Who counts as a "political" actor? While the actor-based approach works well for elite communication, it is much harder to identify analogous universes of cases among the mass public. The typical approach starts by identifying elite actors and examining the users who interact with those elites: users who post comments on elites' Facebook pages, share or retweet elites' messages, or participate in an explicitly political group or forum (Mullen & Malouf, ; Parmalee & Bichard, ; O'Connor et al., ; DiGrazia et al., ).
Yet political scientists have long known that most people don’t care about, don’t pay much attention to, and don’t know much about politics and consequently have relatively unstructured political opinions and ideologies (Converse, ; Campbell et al., ; Zaller, ; Zaller & Feldman, ; Delli Carpini & Keeter, ; Lau & Redlawsk, ; Mondak, ; Kinder, ). Early studies of interpersonal online political communication often focused on the most politically interested and active Internet users, studying the communication that happened at explicitly political chat rooms, designated political webpages, or blog sites (Papacharissi, ; Hill & Hughes, ; Adamic & Glance, ; Meraz, ; Yano, Cohen & Smith, ). Today, however, most people integrate their news seeking and political interaction into the broader realm of their online social interaction (Pew Research Center ), and plenty of political conversation occurs outside of designated political forums. Thus, everyone is potentially exposed to political information and has the ability to make a political
comment online; figuring out which tweets on Twitter or status updates on Facebook are political in nature presents a thornier problem.

Thus, the second approach for identifying political communication on social media is to identify a political topic of interest. This dictionary-based approach begins with a predefined set of keywords, phrases, or hashtags associated with a given topic. Researchers collect all relevant text using those keywords and then code the content of those posts. The difference in this approach is that instead of assuming that the characteristics of the actors bestow on the text its political nature, here the characteristics of the topic make the text "political."

These two approaches have revealed much about the descriptive patterns of online political communication of certain categories of people or certain topics, but they are limited in their scope. Furthermore, the rapid development and ever-changing norms on sites like Facebook and Twitter mean that patterns discovered in one political context do not necessarily hold in the next, and it is unclear how patterns of discussion about one topic can be generalized to a different topic salient in another time and place. Therefore, some scholars have sought broader approaches to identifying political communication on social media by using machine-learning algorithms or topic modeling. These tools were originally developed by computer scientists for the purpose of classifying large numbers of individual documents (Hopkins & King, ), and these procedures have rapidly advanced in recent years as the scope and scale of data available for analysis have exceeded the limits at which it is feasible to employ human coders (Lyman & Varian, ; King & Lowe, ; Cardie & Wilkerson, ).
Studies using a machine-learning approach typically identify the kind of language that is used in texts explicitly labeled as political, then calculate the predicted probability that an unlabeled text is political in nature; Figure ., derived from the data used in Settle et al. (), depicts the results of one example of this kind of approach. Topic modeling instead seeks to extract words and phrases that tend to co-occur, information that the researcher can use to apply appropriate labels to the topics recovered from the text.

The benefit of automated approaches is a reduced dependency on assumptions about who or what is "political." This is more analogous to the way we study interpersonal communication in the offline world. Of course, the estimates derived from these approaches are dependent on the dictionary used to code political language and the threshold used to assess the probability of "politicalness" in machine-learning approaches, and on the choice of labels applied to grouped words and phrases in topic-modeling approaches.

Identifying and collecting political discourse on social media is only the first step. Once political speech is identified, the next challenge for all three of these approaches—the actor-based approach, the dictionary-based approach, and the automated approach—is to characterize it in a meaningful way.
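The identification strategies just described can be illustrated with a short sketch. The keyword list, the token-share scoring rule, and the 0.2 threshold below are invented for demonstration only; they stand in for the much larger dictionaries and trained classifiers used in the studies cited above.

```python
# Illustrative sketch of two ways to flag a post as "political":
# (1) a dictionary-based filter over a hand-built keyword list, and
# (2) a toy score mimicking the thresholding step applied to a
#     machine-learning classifier's predicted probability.
# The keyword set and threshold are made-up assumptions, not those
# of any study cited in this chapter.

POLITICAL_KEYWORDS = {"election", "senate", "vote", "congress", "#debate"}

def dictionary_label(post: str) -> bool:
    """Flag a post as political if any token matches the dictionary."""
    tokens = set(post.lower().split())
    return bool(tokens & POLITICAL_KEYWORDS)

def political_probability(post: str) -> float:
    """Toy stand-in for a classifier's predicted probability:
    the share of tokens that appear in the political dictionary."""
    tokens = post.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in POLITICAL_KEYWORDS)
    return hits / len(tokens)

def classifier_label(post: str, threshold: float = 0.2) -> bool:
    """Apply the researcher-chosen 'politicalness' threshold."""
    return political_probability(post) >= threshold

posts = [
    "remember to vote in the senate election tomorrow",
    "my cat finally learned to fetch",
]
print([dictionary_label(p) for p in posts])   # first is political, second is not
print([classifier_label(p) for p in posts])
```

The sketch makes the chapter's caveat concrete: every labeling decision flows from the researcher's keyword list and threshold, so changing either changes which posts count as "political."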

2.1.2 Sentiment Analysis of Political Text

One of the most common approaches to derive meaning from political social media text is to use sentiment analysis, or opinion mining, defined as the "computational



 .  Political Status Messages by Month

Percent of All Status Messages (%)

5 4 3 2 1

Jan

Dec

Oct

Nov

Sep

Jul

Aug

Jun

Apr

May

Mar

Jan

Feb

0

Month

 . Proportion of all status updates posted on Facebook that are political in nature. Proportions are generated from all status updates posted by a random sample of two million Americans in ten states, during the period January –January . See Settle et al. () and Settle et al. () for more details.

study of opinions, sentiments and emotions expressed in text" (Liu, ). In many studies the implicit assumption underpinning the use of sentiment analysis is that a measure of "public opinion" can be captured by aggregating the sentiment expressed by a large enough set of people. While the concept of sentiment analysis predates the rise of the Internet and social media, the prevalence of data available for textual analysis has caused a rapid increase in the number of algorithms available for detecting sentiment (Pang & Lee, ).

There are machine-learning approaches appropriate for evaluating sentiment in text (Pang, Lee & Vaithyanathan, ; Pang & Lee, ; Wilson, Wiebe & Hoffmann, ), but the vast majority of sentiment analysis of online political content utilizes a dictionary-based approach, relying on a predefined set of positive, negative, or specific affect words. There are significant challenges posed by evaluating sentiment in online text (Thelwall et al., ). One of the most important factors for accurate and effective sentiment analysis is the application of an appropriate and valid word dictionary (Grimmer & Stewart, ), a challenging task in the digital environment, where users employ abbreviations, emoticons, truncations, and nonstandard English grammatical patterns and spellings (see Thelwall et al.,  for an elaboration of the process for, and problems in, successfully extracting sentiment from unstructured text). Computer scientists constantly develop more refined dictionaries while seeking to identify and correct systematic errors in sentiment analysis procedures, and this chapter reviews only the application of well-validated dictionaries to political text on social media.

One of the most commonly used dictionaries is the Linguistic Inquiry and Word Count (LIWC), a well-known and respected text analysis program created by a team of
psychologists (Pennebaker, Francis & Booth ) and validated in dozens of academic studies (for examples, see Alpers et al. ; Bohanek, Fivush & Walker, ; Burke & Dollinger, ; Graves et al., ; Owen et al., ; and Pennebaker & Lay, ). It is designed to analyze larger bodies of text, not just short groups of sentences, but it has been updated and optimized to include phrases frequently used in online communication. Thus, it is an appropriate choice in certain instances for analyzing large volumes of social media data. The LIWC dictionary is a tool that evaluates the overall tone and characteristics of the text, counting the frequency of words in over seventy language categories, including emotional language, psychological processes, and specific content areas. The program is sensitive to many of the challenges faced by most sentiment analysis dictionaries, namely difficulty in identifying negations and sarcasm (Tumasjan et al., ). These kinds of errors are typically considered random noise, but in short bits of text this noise can be amplified and may distort results.

Another widely adopted dictionary and approach is SentiStrength, based on an algorithm its developers argue is better for identifying behaviors instead of simply identifying opinions about products, one of the most popular commercial purposes for which sentiment analysis algorithms are optimized. Another key advantage of SentiStrength is that its algorithm was specifically designed for the informal English text most frequently used online, including both nonstandard spellings and other common textual methods of expressing sentiment.
It has been found to be more accurate than baseline coding and other machine-learning approaches on a sample of MySpace comments (Thelwall et al., ), but while the developers of the program report that it has human-level accuracy in coding most short social web texts, it is considerably weaker for political texts, especially for identifying positive sentiment in news-related discussions (Thelwall et al., ).

Several other dictionaries and algorithms exist for analyzing sentiment in social media text, including the Affective Norms for English Words2 (Bradley & Lang, ), the General Inquirer,3 OpinionFinder4 (Somasundaran & Wiebe, ), SentiWordNet5 (Baccianella, Esuli, & Sebastiani, n.d.), WordNet-Affect6 (Strapparava & Valitutti, ), and SO-CAL (Taboada et al., ).
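The word-counting logic shared by these dictionary-based tools can be sketched in a few lines. The positive, negative, and negation word lists below are tiny invented stand-ins, not the actual LIWC or SentiStrength lexicons, and the one-token negation rule illustrates exactly the kind of heuristic that still misses sarcasm and idiom.

```python
# Minimal sketch of dictionary-based sentiment scoring in the spirit
# of LIWC-style word counting, with a crude negation flip. The word
# lists are illustrative assumptions only; real validated dictionaries
# contain thousands of entries and weighted categories.

POSITIVE = {"great", "hope", "win", "happy"}
NEGATIVE = {"awful", "fear", "lose", "angry"}
NEGATIONS = {"not", "no", "never"}

def sentiment_score(text: str) -> float:
    """Return (positive hits - negative hits) / total tokens, flipping
    a sentiment word's polarity when the preceding token is a negation."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = 0
        if tok in POSITIVE:
            polarity = 1
        elif tok in NEGATIVE:
            polarity = -1
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity  # "not great" counts as negative
        score += polarity
    return score / len(tokens) if tokens else 0.0

print(sentiment_score("great news i hope we win"))  # positive score
print(sentiment_score("this is not great"))         # negative score
```

Aggregating such per-message scores across many users is what studies then treat as a measure of "public mood"; the brittleness of this kind of rule on short, informal text is one source of the distortions discussed above.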

2.1.3 Characterizing Political Text of the Masses

The vast majority of the research using political text generated by social media users is somewhat atheoretical and descriptive, not analytical, often attempting to make predictions about real-world political outcomes (e.g., O'Connor et al., ; Wang et al., ; Tumasjan et al., ; Tjong Kim Sang & Bos, ; Curini, Ceron, & Iacus, ; Ceron et al., ), although the predictive validity of this approach remains highly suspect. Computer scientists appear to be increasingly aware of its pitfalls (Gayo-Avello, Metaxas & Mustafaraj, ; Metaxas, Mustafaraj & Gayo-Avello, ; Gayo-Avello, , ), and there have been some excellent efforts to provide guidance for strengthening the quality of data and interpretation in these studies (for an overview of some of the major methodological challenges, see Ruths & Pfeffer, ; and Tufekci, ).
As of , there had only been one study (of which the author is aware) that incorporates textual analysis to test hypotheses rooted in the mass political behavior literature. Settle et al. () test a foundational question in the study of political behavior: How does experiencing a competitive election affect an individual’s engagement with politics? While this question had been previously explored using a variety of research designs and data sources, there was little evidence supporting the way in which exposure to competition affected people’s day-to-day political engagement. The authors examine a collection of  million Facebook status updates, comparing political discussion during the  election generated by users living in states with competitive presidential elections (“battleground” states) to that generated by those living in uncompetitive states (“blackout” states). Users in the “battleground” states were more likely to discuss politics in the campaign season than were users in “blackout” states, and engaging in political discussion via Facebook mediated approximately % of the relationship between exposure to political competition and selfreported voter turnout. A handful of other studies utilizing social media data bear mentioning, although they do not analyze the text of the social media data per se. Bond and Messing () use support for political pages on Facebook to estimate the ideology of political figures and citizens. Similarly, Barbera () measures the ideological positions of Twitter users based on whom they choose to follow. Bakshy et al. () use a unique data set of . million active Facebook users and the  million website URLs they posted on the site during the second half of , assessing the degree to which users encounter ideological heterogeneity in their networks. 
While these studies do not inform us directly about the content of political interaction in social media, they provide pertinent information about the context in which those interactions take place.

.. Nontextual Approaches to the Study of Online Political Communication

To date, text-based studies have not adequately considered the forces influencing the creation of the data on which they rely. Valid interpretation and application of the findings from text-based studies are contingent on a better understanding of who creates the text, as well as the psychological motivations driving online communication and its behavioral consequences for offline political actions. These questions have begun to be addressed in studies that examine online communication without directly incorporating social media text. I next discuss some of the most pertinent findings from these other methodological approaches.

... Survey-Based Approaches

The results of text-based studies must be contextualized within a broader understanding of who engages in online political communication. The Pew Research Center has




taken the lead in this effort, conducting representative surveys of the American public about how people integrate social media into their lives, particularly in the domain of politics. Within its agenda of studying the evolving nature of journalism and the media, Pew has extensively surveyed Americans on their media habits and the sources of their news. For over a decade, the Pew Internet and American Life Project has pioneered our understanding of how the American public uses the Internet; the organization conducted the earliest nationally representative surveys of social media usage, asking as early as  about the extent to which people interacted about politics on the sites. Unfortunately, however, the emphasis in these surveys has been on describing usage patterns instead of on causally explaining their antecedents or consequences.

A handful of other scholars have used survey methodologies to better understand these questions. Gil de Zúñiga et al. () use survey data—a nonrepresentative sample, but matched to census data to provide greater accuracy and generalizability—to study the relationship between informational use of social media and online and offline political participation, arguing as others have that using these sites increases social capital and both online and offline civic and political participation (Ellison, Steinfield & Lampe, ; Pasek, More & Romer, ; Valenzuela, Park & Kee, ; Wellman et al., ; Vaccari et al., ). Another body of literature explores how personality traits predict social media usage (Correa, Hinsley & Gil de Zúñiga, ; Bachrach et al., ).

Collectively, these studies demonstrate that the people generating the most political text online are different from the general population in meaningful ways. Yet the data from these surveys cannot address important questions related to the social psychological motivations underpinning the behavior.

... Experimental Approaches

Experiments are well suited to exploring these motivations. There are two broad categories of experimental approaches to studying political interaction on social media: laboratory or survey experiments that simulate the experience, and field experiments operating within the social media platform itself. At the time of preparation of this article, there were few published experimental studies that use stimuli derived from simulated social media interactions, although this situation will almost certainly change in the coming years. Most of the existing studies test theories of selective exposure, finding that social endorsements increase the selection of content, reducing the effects of partisan cues on political news selection (Messing & Westwood, ). Other work has focused on the agenda-setting effects of exposure to political information on social media (Feezell, ).

Although rare, there have been a number of important experiments conducted within the Facebook platform. While these experiments do not directly manipulate interpersonal communication on social media and do not utilize stimuli designed to alter political communication, they do shed light on related processes. A large-scale experiment embedded within Facebook on election day in  suggests that social messages were more effective than informational messages in mobilizing people to vote, both for users who received the encouragement directly and for their friends (Bond et al., ). In a smaller scale study, Ryan () finds that evoking anger in a political advertisement shown on Facebook increases users’ likelihood of clicking the ad to visit the associated political website.

In addition, there have been several studies utilizing textual data to study emotional diffusion or contagion outside the context of politics. Coviello et al. () take a creative approach, employing the weather as an instrument, demonstrating that rainfall (and by extension, many other exogenous factors) not only directly influences emotion in status updates but also diffuses to affect the emotion in the updates of users not living in the area receiving rain. Building on this, Kramer, Guillory, and Hancock () find that altering the aggregate level of positive or negative emotion in users’ feeds altered the emotional composition of the users’ own posts, supporting the idea that emotional contagion can occur on social media sites absent face-to-face contact and verbal cues.7

... Network-Based Approaches

A complementary approach to studying the interpersonal communication of politics on social media focuses on the diffusion of information or behavior in the aggregate network structure. Here, the unit of analysis shifts from the individual to an aggregate or larger phenomenon. In an exploratory data analysis based on visualizations of the Twitter network structure of hundreds of conversations on various topics, Smith and colleagues report that different conversational topics appear to exhibit different network properties on Twitter. Political discussions tend to be characterized by two separate, polarized clusters of discussants who rarely interact with each other, link to different websites, and use different hashtags and language (Smith et al., ). Many factors appear to drive the diffusion of certain behaviors through Facebook (e.g., changing profile pictures (State & Adamic, ) or rumor cascades (Friggeri et al., )), as well as on other platforms like instant messaging networks (Aral, Muchnik & Sundararajan, ). Other work seeks to identify the most diffusion-influential and susceptible members of social networks (Aral & Walker, ).
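The two-cluster pattern described above can be quantified with a very simple statistic: the share of interaction edges that cross between the two camps. The sketch below is illustrative only; the user names, edge list, and cluster labels are invented, and real studies derive the clusters from the data (e.g., via community detection) rather than assuming them.

```python
# Toy polarization measure: fraction of interaction edges that cross
# between two labeled clusters. Values near 0 indicate the separated,
# rarely interacting clusters described in the text.

def cross_cluster_fraction(edges, group):
    """edges: iterable of (user_a, user_b) interactions;
    group: dict mapping each user to a cluster label."""
    cross = sum(1 for a, b in edges if group[a] != group[b])
    return cross / len(edges)

# Hypothetical reply network: two camps that mostly talk internally.
group = {"a1": "L", "a2": "L", "a3": "L", "b1": "R", "b2": "R", "b3": "R"}
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
         ("a1", "b1")]  # a single cross-camp interaction

print(round(cross_cluster_fraction(edges, group), 2))  # prints 0.14
```

A low fraction alone does not prove polarization, of course; it must be compared against what random mixing would produce, which is why published work relies on full network statistics rather than this single number.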

. M F: T-D I

The results from the survey, experimental, and network-based approaches to political engagement on social media should make us pause to reconsider the interpretation of text-based analyses of social media data. Sentiment analysis implicitly assumes that people accurately depict their underlying emotional state in their written expressions about politics; when people deviate from this, it should only create noise in the data, making it harder to find patterns or relationships. But the considerations outlined in the preceding section suggest that the data generation process is




considerably more complex, an idea I explore in the following section. If people are posting status updates and tweets as a way to communicate, potentially strategically, with others in their social network—and not just as a way to express their opinion to researchers who may be listening—then we need to develop better conceptual theories for how these online political discussions differ from face-to-face political discussions and how those differences should alter our expectations for the effects of discussion on political behavior and attitudes.

In addition to the need for better theory to explain the data generation process, I argue that a fruitful path forward for future studies using social media text as data is an expansion of the scope of inquiry for which textual data can be used. Failing to move beyond sentiment analysis to understand the dynamics of emotion in political communication would be a disappointing underutilization of an incredible trove of data. Social media data can be used both to better understand phenomena about which we already know a considerable amount in the offline world—such as opinion leadership and emotional response to elite political communication—and to open new doors to phenomena left previously unexplored, namely emotional interpersonal communication. I further explore these two suggestions—theoretically driven consideration of the data generation process paired with a broader consideration of the range of inquiry—in the next sections.
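The sentiment-analysis assumption at issue is easiest to see in the dictionary-based scoring that underlies many of these studies. Below is a minimal sketch; the word lists and posts are invented (real work uses validated lexicons such as LIWC). The point is that the scorer treats a sarcastic or performative post exactly like a sincere one, which is precisely the data-generation problem discussed above.

```python
# Naive dictionary-based sentiment scoring. Each post's score is the
# count of positive-lexicon words minus negative-lexicon words, which
# implicitly treats every post as a sincere report of emotion.

POSITIVE = {"great", "hope", "win", "proud"}
NEGATIVE = {"angry", "fear", "corrupt", "disaster"}

def sentiment(post):
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

posts = [
    "So proud of this win",        # sincere positive: +2
    "What a corrupt disaster",     # sincere negative: -2
    "Oh great another debate",     # sarcasm, but scored as positive: +1
]
print([sentiment(p) for p in posts])  # prints [2, -2, 1]
```

Aggregating such scores as "public sentiment" inherits this sincerity assumption wholesale, which is why the chapter argues for theorizing the data generation process first.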

.. Theoretical Foundations of the Data Generation Process

The greatest tension resulting from the review of the textual data analysis literature is the lack of theoretical consensus about how to characterize what it is people are doing when they communicate with one another about politics on social media, and in the process of doing so, produce data about their behavior. Competing, and frequently underdeveloped, ideas about the data generation processes of social media communication have resulted in a patchwork of results and a lack of consensus about the interpretation of those findings. Researchers employing advanced sentiment analysis techniques are not typically interested in understanding the data generation process itself and often remain agnostic about how that process should influence the interpretation of their data (for exceptions, see Gayo-Avello, Metaxas & Mustafaraj, ; Metaxas, Mustafaraj & Gayo-Avello, ; Gayo-Avello, , ; Ruths & Pfeffer, ; and Tufekci, ). Yet a naive analysis of “text as data” potentially biases our interpretation of aggregated text as a measure of public sentiment. People are responsive to various forces in the political environment that influence what they choose to communicate online; information from the media, campaign communication, and interactions within the social network all affect what and when people choose to post about politics. Posting tweets and status updates is a much more complex social behavior than simply expressing an opinion. Conversely, social scientists studying this behavior tend to characterize



 . 

political interaction on social media based on the assumptions and priorities they carry with them from their traditional fields of study. Thus, political behavior scholars have focused on the impact of online communication on distal political outcomes like voting and offline activism, without clearly classifying or theorizing the novel behavior of online political communication itself.

It would be nearly impossible for scholars from such diverse theoretical backgrounds to arrive at a unified characterization of political communication on social media; even if they could, the diversity of social media platforms, the plethora of formats in which people communicate about politics through social media, and the speed with which those behaviors evolve suggest that we should not seek a single unified framework. However, while scholars do not need to start from identical premises about how data are generated or how that process should influence interpretation, it is imperative that they more methodically address the way these issues matter within their own theoretical frameworks. Before researchers can realize the full potential of social media data to address previously unexplored territory in political communication, they must move beyond simple sentiment analysis to address key questions about the behavior of social media political communication itself. This substantial novel theorizing will strengthen their research designs and the substantive interpretations of their findings. I outline some of these key questions in the sections that follow.

... The Fusion of Political Behaviors on Social Media

In the offline world, scholars largely silo political expression, information seeking, and political discussion into separate, albeit often correlated, behaviors. Social media break down those barriers. In other work (Settle, ), I argue that we should consider interpersonal interactions about politics on Facebook to be a fusion behavior among political expression, news exposure, and discussion (the END Framework). There are many overlaps between this behavior and its offline analogues, but there are important differences related to the content, context, and nature of the communication. Our theorized expectations for how and when we expect emotion to play a role in communication should incorporate and capitalize on these differences and similarities.

For example, while our theories of face-to-face political discussion do not preclude discussions that involve more than two people, in practice our measurement strategies focus on measuring dyadic discussion between people, forcing a discussion network to assume a “starburst” shape (for exceptions to this generalization, see Lazer et al., ; and Song, ). We know that this is not an accurate depiction—a respondent’s discussants may talk with each other, or a respondent may have a conversation about politics with more than one person at a time—but best practices in survey assessment have dictated that we only capture the kind of political discussion that we can most precisely and uniformly measure. Social media data allow us to explore these other configurations of interaction. Similarly, while scholars recognize that political discussion can occur between people who might not identify each other as “discussion partners,” the focus of research has been on people who report talking to each other regularly. The majority of face-to-face




conversation likely does happen between strong ties, and to the extent that our online social network interactions mirror our “real world” connections (Jones et al., ), it is likely that we are also interacting frequently with strong ties on social media. However, the context of social media opens up the possibility of engagement between weak ties in a way not feasible offline. On social media, opinions are expressed and information is exchanged in full awareness—and frequently with full intention—that others will observe the exchange and may join the discussion.

A theory built on the fusion of these particular political behaviors may not be the right framework for every research agenda studying interpersonal political communication on social media. However, scholars should make clear which theories they utilize or create. They should also take seriously how the norms of a given social media platform influence how people communicate. An assessment of the wide variety of data sources used in this volume sheds some light on the diversity and quantity of data available for analysis. Online social media sites continue to evolve and offer new modes of communication; as they do, it will be important to consider the norms of the site and the motivating reasons that people communicate in particular ways. Finally, researchers should justify the assumptions they make in their research design decisions. For example, scholars who use tweets as a measure of public opinion need to be mindful of the unique demographic profile of Twitter users who access news on the site: they are younger, more mobile, and more educated than Facebook news users, let alone the American adult population overall (Mitchell & Guskin, ). This may or may not matter in a particular study, but it should be addressed.
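The demographic-skew concern raised above has a standard partial remedy: post-stratification, which reweights each group's contribution by its share of the target population rather than its (skewed) share of the posting population. A minimal sketch follows; the age groups, shares, and opinion scores are all invented for illustration.

```python
# Post-stratified mean opinion: weight each age group's average opinion
# by its share of the target population rather than its share of posters.

platform_share = {"18-29": 0.60, "30-49": 0.30, "50+": 0.10}   # among posters
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45} # target adults
group_mean_opinion = {"18-29": 0.70, "30-49": 0.50, "50+": 0.30}

# Unweighted aggregate reflects the young-skewed posting population;
# the reweighted figure reflects the target adult population.
raw = sum(platform_share[g] * group_mean_opinion[g] for g in platform_share)
weighted = sum(population_share[g] * group_mean_opinion[g] for g in population_share)

print(round(raw, 3), round(weighted, 3))  # prints 0.6 0.45
```

Reweighting corrects only for the demographic cells one measures; it cannot fix the deeper problem, discussed throughout this section, that posters differ from nonposters in unobserved ways.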

... The Strategic and Instrumental Uses of Emotion

Social media allow politicians to communicate directly with both opinion leaders and their constituents, and the online context, paired with a more polarized political environment, incentivizes different forms of elite political communication. Better understanding of these incentives and their consequences is critical for furthering mass communication research, because in the digital era it is not only elites who communicate strategically; the mass public can be much more involved in disseminating messages and working to persuade people what to think about. A large portion of the information that circulates on social media is generated by politicians and media elites (Gainous & Wagner, ), but the transmission of this information is normally accompanied by users’ commentary on the topic. Thus, while elites still retain significant agenda-setting influence, people engaged in online political discussion gain increasing ability to frame the conversation.

In addition to strategic behavioral decisions related to persuasion and information diffusion, identity management is a major component of how people decide what to communicate, and to whom. In the online context, identity management matters not only in the moment of communication; it is also made instantaneously public and potentially permanent as part of a person’s online profile. Users may desire to project emotion about politics on social media in particular ways, to draw attention to themselves or to seek social support. The durability of a person’s



 . 

comments online—and the way those short bits of text become immortalized as part of their online profile and identity—mean that people are more deliberate in what they choose to post. Related is the question of whether sentiment in online political social media text represents a user’s experience with, or his or her desired expression of, emotion. The sentiment that a person records in an online context may not be an accurate reflection of the actual physiological emotional response that person experiences. People may post about politics on Facebook or Twitter for many reasons other than documenting their underlying opinions; the more instrumental communication becomes, the less valid it is as a measure of public sentiment.

Finally, the norms and rules guiding online communication affect the way people communicate about politics. As a simple comparison, posts on Twitter are limited to  characters; status updates on Facebook are not. Although the functionalities of Twitter have changed considerably since the publication of the article by boyd, Golder, and Lotan (), they suggest there are heterogeneous ways in which “authorship, attribution and communicative fidelity” are negotiated within retweeting conversations. Within Facebook, people cannot screen themselves behind usernames; paired with the visible social network in which political discussions are embedded, this may make people more hesitant to express their opinions freely (Hampton et al., ). Conversely, in some forums for political discussion, the norms appear to demand direct response to a post with a contradictory point of view (Mullen & Malouf, ). A meaningful interpretation of political communication from these sites must take these norms and practices into consideration.

... Emotional Interdependencies: The Spread of Emotion through Social Networks

To the degree that emotion is thought to cluster in social networks and has recently been shown to spread through them (Coviello et al., ; Kramer, Guillory & Hancock, ; Fowler & Christakis, ), the expression of emotion in political discussion is not independent of the emotional expression of our social ties. When Facebook users encounter the emotional political discussion of their friends on the site, it may further prompt their own emotionality toward politics. This interdependency and influence are important processes to consider from both a theoretical and a methodological viewpoint when interpreting textual data.

.. Integrating Theory and Textual Analysis to Expand Mass Political Communication Research

The second necessary guidepost for future studies using social media text as data is an expansion of the scope of inquiry for which textual data can be used. A continued focus on quantifying the content of political communication without considering why and




how it was created bypasses the opportunity to study important new research questions that fuse textual analysis with more theoretically driven hypotheses. Doing the latter is easier said than done. While scholars have well-elaborated theories of elite communication and behavior applicable to elite-generated social media text, there are fewer applicable theories about interpersonal communication in the offline context. The challenge for the field is thus to develop richer theories about why and how people communicate about politics, while keeping in mind how the structural differences between face-to-face and online communication should affect the patterns and relationships we expect. In the following sections I explore some potential areas of expanded or novel inquiry using text as data.

... Opinion Leaders: Using Emotion Strategically

The rise of social media has introduced a relatively novel concept: the idea that the mass public participates in the framing and dissemination of political information. Consistent with the notion of opinion leaders, who have always played an important role in the transmission of information from elites to the uninformed mass public (Campbell et al., ), studies find evidence that a small proportion of online users drive much of the volume of political information (Rainie & Smith, ). For example, Figure . shows that in , most Facebook users posted very few political status updates, but a small proportion posted a large number (see Settle et al.,  and Settle et al.,  for more information about the data used).

[Figure . Log-log plot showing the distribution of the number of status messages posted by each user in a random sample of two million Americans in ten states, during the period January –January . Most people post very few political status updates in the study period, and a small proportion of users generate much of the political discourse. See Settle et al. () and Settle et al. () for more details.]

In the pre-Internet world, researchers measured the behavior of opinion leaders mainly by identifying those people who self-reported a high level of interest in politics and frequent political discussions with others. Today we can use data derived from social media to understand more about when and how opinion leaders actually communicate about politics. These opinion leaders have different goals and motivations than do political elites. When they communicate online, how much of what they write is designed to persuade other people, compared to the proportion that is simply an expression of their own viewpoints? How do these “power users” cultivate their online identities using political expression? To what extent are they simply propagating the messages and frames cultivated from elite communication? What is the effect of adding their own commentary before passing elite messages along to their networks? These questions would have been nearly impossible to answer in the predigital world, because they hinge on detailed and accurate depictions of specific instances of people’s behavior, something difficult to measure in surveys and only artificially measured in a laboratory context. Research suggests that the people most politically engaged on social media are also those most politically engaged in the offline world (Rainie & Smith, ; Smith, ); while there may be differences in strategy or technique between online and face-to-face communication, there are likely to be many similarities as well.

... The Psychology of Emotional Response

The two most dominant theories of the way the mass public incorporates emotional evaluations into political opinions have important implications for emotional communication, but they focus more on how people use emotion to form opinions than on how they express them. The “hot cognition” hypothesis (Lodge & Taber, ) is a theory of motivated reasoning in which all sociopolitical concepts are affectively charged and thus influence the way we encode, retrieve, and express political judgment. Affective intelligence theory posits a dual-processing model in which two different emotional systems regulate the response to cues in the political environment: the disposition system governs levels of enthusiasm and aversion when processing routine information, and the surveillance system governs levels of anxiety in response to novelty or threat. These subsystems in turn affect political behavior; people who report feeling anxious about candidates or political issues report higher levels of information acquisition, more interest in the campaign, and participation beyond voting (Marcus, Neuman & MacKuen, ).

We have learned much from empirical tests of these two theories about the role of emotional response in political attitude formation and behavior. The use of the same question formats and target stimuli in a variety of studies, including the American National Election Study, provides confidence in the generalizability of the empirical patterns found and the ability to track these patterns over time. Asking about specific emotions not only permits an assessment of the contrast between positive and negative




emotions, an important distinction, but also allows for theorizing about the differential effects of anger and anxiety (Huddy, Feldman & Cassese, ). The experimental work has reinforced the causal connection between response to various target stimuli and both implicitly measured and self-reported emotional responses. Taken collectively, we can be confident that we are accurately and reliably measuring something about the reality of emotional engagement with politics.

In principle, these same measurement techniques could be modified and updated to test how people emotionally respond to social media stimuli, such as campaign communications on social media, or information about threatening policies communicated on a person’s Facebook News Feed. However, the measurement problems inherent in these established survey and experimental techniques are only amplified in the context of studying social media political communication. Instead, testing our established theories of emotional response with textual analysis techniques on social media data offers many complementarities to the methods traditionally used to study how people emotionally respond to political stimuli. I note three important advances: the ability to study a wider range of triggers of emotional activation, the ability to bypass self-reported emotional response, and the ability to more richly characterize emotional response by interpreting people’s own descriptions of their experiences.

First, applying our theories of emotional response to research designs incorporating textual data from social media will allow us to study a wider range of triggers of emotional activation. We currently know much more about the consequences of emotional activation than we do about its sources (Wolak & Marcus, ).
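Aggregate emotional response to a real-world event can be approximated with a simple before/after comparison: the share of posters using anxiety-lexicon words in a window around the event. A minimal sketch; the word list, posts, and dates are invented stand-ins, and real analyses use validated lexicons and proper time series rather than a two-window split.

```python
from datetime import date

# Share of posts using anxiety-related words before vs. after an event,
# a crude version of an aggregate emotional time series.

ANXIETY = {"worried", "afraid", "scared", "anxious"}
EVENT = date(2008, 11, 26)  # stand-in event date

posts = [  # (date, text) pairs, invented
    (date(2008, 11, 24), "excited for the holidays"),
    (date(2008, 11, 25), "long day at work"),
    (date(2008, 11, 27), "so worried about the news"),
    (date(2008, 11, 28), "afraid to check the headlines"),
]

def anxious_share(posts, after):
    window = [text for d, text in posts if (d > EVENT) == after]
    hits = sum(any(w in text.split() for w in ANXIETY) for text in window)
    return hits / len(window)

print(anxious_share(posts, after=False), anxious_share(posts, after=True))
# prints 0.0 1.0
```

A jump in the post-event share is only suggestive on its own; attributing it to the event requires ruling out concurrent triggers, which is exactly the wider-range-of-triggers problem discussed in the text.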
Most of the political stimuli studied have been perceptions of a candidate (Marcus, Neuman & MacKuen, )—operationalized in different ways in survey versus lab studies, such as personal characteristics (Civettini & Redlawsk, ), campaign materials (Brader, , ), and policy positions—with a secondary focus on threatening political issues (Huddy et al., ; Huddy, Feldman & Weber, ; Brader et al., ; Miller & Krosnick, ). Experimental work has refined our understanding of the strategies, rhetoric, and issue positions that are most emotionally provocative, but the generalizability of these findings is limited to the experimental stimuli that have been tested. Thus, we are limited in our understanding of what it is that candidates or parties do in the real-world political environment that activates particular emotions. Social media allow us to see how these constructs operate in the real world, to better understand the things to which people are emotionally responding. Instead of crafting experimental stimuli or asking broadly about “the kind of person [a candidate] is, or because of something [a candidate] has done,”8 we can actually see when people are responding in the aggregate with increased anger, anxiety, or enthusiasm to particular real-world political events (see Figures . and . for examples).

Second, an important problem in both the survey and experimental work is the reliance on self-reported emotion. Many of the new approaches in neurophysiology and social psychology suggest that we should move beyond self-report measurements of a limited number of emotions, incorporating both different measurement strategies and a wider conceptualization of affectivity. Previously we had to rely on retrospective



[Figure . Percent of users posting an anxious message over time, as a percent of all posters, contrasting nonpolitical and political users; the series marks events such as the Mumbai attacks and movements in the DJIA.]

… data points per user) that give “real-time credit scoring” (InVenture, ). Tigo/Millicom (a Colombian telecom provider in thirteen African and Latin American countries) is also using airtime to determine insurance premiums; that is, the more you talk, the lower your insurance premium. These all link one’s communication practices to other, perhaps “weightier,” metrics that have real impacts on socioeconomic well-being. As part of the larger development drive that focused on ICTs in the early s as mobile phones began to be taken up across the global South, socioeconomic well-being continues to be of interest as poverty stubbornly remains prevalent. There are opportunities to use these alternate measures to assess whether any gains can be made in linking communication, as an act and a process, to things like economic transfers. The mobile phone itself need not be the focus, as it facilitates largely unanticipated uses and appropriations.

.. Mobility and Location

De Bruijn () described mobile phones as telephones with legs, and “the very mobility that the mobile phone offers became a rich research field for those with various interests in developing countries” (Molony, , p. ). Reliable data on movement and migration in the global South are hard to come by, and this is largely



. , . ,  . 

true for within-country migration as well, as most governments’ census data do not capture specific kinds of migration like circular or temporary migration (Blumenstock, ). However, this need not be the case. Even the most basic phones can provide location simply because they must ping the nearest cell towers to connect calls, and the sequence of towers can show the movements of the cell phone (user). In urban areas where cell phone towers are denser, this sort of inference can be very precise (Blumenstock, ) and can be used to estimate traffic and transportation flows, origins and destinations (e.g., Calabrese et al., ), and large-scale movements and migrations.

Using the concept of “inferred mobility” and a longitudinal data set of phone records from Rwanda, Blumenstock () computed detailed trajectories of user movement that were then compared with existing qualitative data on internal migration in the country. Other work by Eagle et al. (b) and Frias-Martinez et al. () also used individual movement logs to analyze large-scale mobility, with the latter focused on populations of different socioeconomic status and the former on rural versus urban populations in a developing country. These works present quantitative ways of understanding patterns of movement that are hard to find with standard survey techniques and can also be used to “measure patterns of information diffusion, or analyze the impact of mobile-based services” (Molony, , p. ).

Global Positioning System (GPS)-enabled devices provide even more precise locations and are increasingly found in the basic and java-enabled feature phones that are ubiquitous in the global South, where crises are compounded by inadequate infrastructure for coping with both natural and man-made disasters. Some work on disaster and crisis communication has used GPS data to assess the value of mobile phones in aiding recovery work after a crisis (Bengtsson et al., ).
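The tower-sequence logic described above can be sketched in a few lines: treat each user's modal tower in a period as a "home" location, and flag a change of home tower across periods as a candidate migration. Everything here (tower IDs, months, records) is invented, and real call-detail-record analyses such as Blumenstock's involve far more careful filtering and validation than this toy.

```python
from collections import Counter

# Infer a user's "home" tower per month as the modal tower in their call
# records, then flag months where the home tower changes as candidate moves.

records = [  # (month, tower_id) pairs from hypothetical call records
    ("2009-01", "T1"), ("2009-01", "T1"), ("2009-01", "T2"),
    ("2009-02", "T1"), ("2009-02", "T1"),
    ("2009-03", "T7"), ("2009-03", "T7"), ("2009-03", "T1"),
]

def home_towers(records):
    by_month = {}
    for month, tower in records:
        by_month.setdefault(month, []).append(tower)
    # Modal tower per month, in chronological order.
    return {m: Counter(ts).most_common(1)[0][0] for m, ts in sorted(by_month.items())}

homes = home_towers(records)
moves = [m for prev, m in zip(list(homes)[:-1], list(homes)[1:])
         if homes[prev] != homes[m]]
print(homes, moves)
```

For this sample the home tower shifts from T1 to T7 in the third month, so that month is flagged as a candidate move; distinguishing genuine migration from travel or tower-load reassignment is the hard part that the cited studies address.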

Digital Innovation and Technology Entrepreneurship

One outcome of mobile usage in the global South that is only now receiving attention in the literature is the rise of technology entrepreneurship and digital innovation. The ubiquity of the mobile platform and the relative ease and low cost of creating programs and content for mobiles are driving entrepreneurship, particularly among youth across the global South. Other communication technologies have always required huge investments to monetize (TV, radio, and landlines vis-à-vis call centers). From Chile to Zimbabwe, new mobile applications are being designed for a myriad of interests and challenges, catering not just to the small percentage of smartphone users but also to the millions who have feature and basic phones, largely manufactured in China. Mobile phones have been a more accessible platform, and the ways they are being appropriated for entrepreneurial ventures need to be better understood. Thus far, most of the interest in this area has come from the business and development worlds (e.g., Dahlberg Africa, McKinsey Consulting, the World Bank). There has not been much development since Levy and Banerjee () issued a call to turn to new theories about the network society and bring them to bear on mobile phones within the setting of urban entrepreneurial activity. Still, recently there has been interest within communication and information studies in discursive practices around the creation of mobile apps and other information technologies (Avle, ), how entrepreneurs respond to technological change in resource-constrained environments (Zachary, ), how ICT innovation challenges dominant ideas of innovation and design practice (Marchant, ; Avle & Lindtner, ), conceptual connectivity in specialized sectors (Graham & Mann, ), and tech innovation in ICT hubs (Jimenez Cisneros, ). In general, little is known about how technological artifacts emerge out of the global South, which the innovation literature considers the periphery. Postcolonial, feminist, and critical cultural studies theories have served as theoretical starting points for some of this work, located in design and human-computer interaction (HCI) (e.g., Chan, ; Bardzell & Bardzell, ; Lindtner et al., ; Avle & Lindtner, ). Silicon Valley’s strong influence on extant theories of digital innovation and information technologies belies the global nature of technology production. Who is designing and building the myriad software and hardware products used across the global South? Silicon Valley largely does not design for the world’s poorest, so tech entrepreneurs in Accra, Nairobi, Santiago, Bangalore, and Shenzhen are innovating, building on bare-bones infrastructures to serve the billions who cannot afford iPhones. Design and innovation are already happening. As researchers interested in mobile phones and the data their users produce, we have a blind spot if we take for granted where the devices and their software come from, how they are made, and who works to get them into the hands of the populations we study.

Control of Users’ Mobile Data in the Global South: Access and Risks

With increased access comes increased data, and one crucial question we want to address is: Who has access to users’ data across the global South? For the most part, it is understood that telecom providers, and where applicable the third-party content and app providers on phones, hold their customers’ data in exchange for service provision. In many cases, additional user data are collected before they are needed for any immediate use, in anticipation of mining for patterns that might turn out to be useful. For instance, Android apps tell users what access the applications they install require. More often than not, the items requested are not core to the functioning of the app. Opting out of a “request for access” does not mean opting out of releasing certain data, but rather opting out of use altogether. In other words, refusing such a request means the user is no longer able to download the app.
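The gap between what an app requests and what its core function needs can be made concrete with a small audit sketch. The app category and its “core” permission allowlist below are invented for illustration; only the permission strings are real Android identifiers.

```python
# Toy audit of install-time permission requests: which requested
# permissions exceed what the app's core function plausibly needs?
# The allowlist below is illustrative, not an official mapping.
CORE_NEEDS = {
    "flashlight": {"android.permission.CAMERA"},  # the flash is driven via the camera API
}

def excess_permissions(app_kind, requested):
    """Return requested permissions that are not core to this kind of app."""
    return sorted(set(requested) - CORE_NEEDS.get(app_kind, set()))

requested = [
    "android.permission.CAMERA",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
]
print(excess_permissions("flashlight", requested))
# → ['android.permission.ACCESS_FINE_LOCATION', 'android.permission.READ_CONTACTS']
```

For the hypothetical flashlight app, the contacts and location requests are exactly the kind of non-core data collection the paragraph above describes: declining them is not an option short of declining the app itself.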



. , . ,  . 

Third parties that do not directly interface with users but have access to their data are becoming more common. For instance, a company like Jana Mobile, through its mCent app and partnerships with over  telecom providers in  countries, has access to nearly . billion phones in the developing world (Bergen, ; Olson, ). In addition, government agencies are increasingly requesting data from tech companies and telecom providers, often under blurry legal and regulatory guises. In the global South, the nongovernmental organization industrial complex (Gereffi et al., ) holds a lot of data on some of the world’s most vulnerable populations, owing to the development and ICTD interventions that have been ongoing for the last thirty to forty years. There is often some overlap between these players and the academics who produce research on these populations. The data asymmetry whereby corporations and other powerful stakeholders know more about individuals’ lives puts the onus on regulators to ensure as fair a platform as possible for the most vulnerable populations. Other issues, such as data security and protection, are of paramount importance as more and more of the world’s vulnerable populations gain access to mobile data. This is where researchers ought to be more mindful as they work to gain access to user data. The ethics of doing mobile and other ICT research is covered elsewhere in this volume, but we wish to underscore that some data may not be necessary to answer the questions we have posed. Identifying information is typically not needed, and the other issues covered by the Belmont principles and institutional review boards are good reminders when seeking data for our research. With precise location data comes an increased risk of the targeting of populations, particularly the most vulnerable, such as refugees fleeing conflict.
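One concrete mitigation consistent with this advice is to coarsen location data before analysis or sharing, so that precise fixes never enter the research data set. The grid size below is an arbitrary illustration, and coordinate snapping alone is not sufficient protection for vulnerable groups (aggregation thresholds, minimum cell counts, and access controls are also needed); this is only a sketch of the idea.

```python
import math

def coarsen_point(lat, lon, cell_deg=0.1):
    """Snap a GPS fix to the center of a cell_deg-degree grid cell
    (0.1 degrees is roughly 11 km of latitude), discarding precise location."""
    def snap(v):
        return round((math.floor(v / cell_deg) + 0.5) * cell_deg, 6)
    return snap(lat), snap(lon)

# A stored record keeps only the coarse cell center, never the raw fix.
print(coarsen_point(5.6037, -0.1870))  # → (5.65, -0.15)
```

Applied at ingestion time, a step like this means that even a leaked data set reveals only which broad cell a person was in, not a traceable trajectory.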
Biases can be, and have been, built into systems that end up favoring the already well off and powerful in different societies (Blumenstock, ), whether in policing, lending, or voting, in both the global North and South. In a world wracked by war and refugee crises, the same tools that help migrants can, in the wrong hands, be used against them. The potential for political, economic, and other forms of targeting for harm arises where adequate protections of consumer/citizen data are not in place, whether by law or by volition. Access to observable data poses a challenge, particularly in places that might not be researchers’ primary area of expertise. This is one area where collaborative work with global South researchers is valuable. The greater value of working with researchers in the global South lies in the perspectives and theories they bring, as well as their familiarity with sites that provide empirical evidence both for what we already know from the North and for new ideas coming from the South.

Conclusion

Mobile phones have moved from novelty, to “nice to have,” to a mutual expectation of being readily available at all times, almost to the point of being mundane, particularly in the global North (Katz & Aakhus, ; Ling, ). In the global South, an increasing number of people are integrating mobile phones into their lives in ways similar to how others do elsewhere. In this chapter we have focused on personal data, although we acknowledge that mobiles are increasingly being integrated into business workflows in the global South. We have primarily discussed ways that observational data, or data generated by everyday uses (not interventions), can be useful for research. We have reviewed mobile phone usage and industry trends, not to “render the familiar strange” (Riles, ) or to look to the South to reify difference. Rather, our goal has been to reorient us as researchers to emerging trends in contexts that are not part of the everyday for most of the researchers conducting work on communications in the networked age. By highlighting some of these practices, we have shown how the everyday is being changed by mobile phone use on a much more global scale, and pointed to specific areas in which data can inform theory or open new pathways for research.

N . USSD or unstructured supplementary service data is a GSM Association protocol that links a mobile device to the service provider’s servers. It functions somewhat like SMS texts but can accommodate more characters, is more interactive, and is arguably more user friendly than SMS without the development and data cost of a native mobile app. See figures . and . for examples of what the USSD interface looks like. . Zain is a Kuwaiti telecom provider, operating in eight African and Middle Eastern countries. Celtel, a Sudanese-founded telecom company, was acquired by Zain in  and then sold to Bharti Airtel in . . Bharti Airtel is Indian owned and operates in eighteen African countries and three Southeast Asian countries. Zenith is one of Nigeria’s biggest banks, with operations in three other African countries as well as the United Arab Emirates and the United Kingdom.

R App Annie. N.d. “Google Play top app charts.” https://www.appannie.com/apps/google-play/ top/ghana/overall (accessed February , ). Arora, Payal and Nimmi Rangaswamy. . “Digital leisure for development: reframing new media practices in the global South”. Media, Culture & Society, : –. Atieno, Milicent. . “Opera Mini users in Africa saved $ million in mobile data usage”. http://innovtiv.com/opera-mini-users-in-africa-saved-m-in-mobile-datausage/ (accessed August , ). Avgerou, Chrisanthi. . “Discourses on ICT and development.” Information Technologies & International Development , no. : . Avle, Seyram. . “Articulating and enacting development: Skilled returnees in Ghana’s ICT industry.” Information Technologies & International Development, , no. : –.



. , . ,  . 

Avle, Seyram, and Silvia Lindtner. . “Design(ing) ‘here’ and ‘there’: Tech entrepreneurs, global markets, and reflexivity in design processes.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’), –. New York: ACM.
Avle, Seyram. . “‘Radio locked on @citi’: FM radio audiences on Twitter.” In W. Willems and W. Mano (Eds.), From Audiences to Users: Everyday Media Culture in Africa, –. London and New York: Routledge.
Bardzell, Jeffrey, and Shaowen Bardzell. . “What is critical about critical design?” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, –. New York: ACM.
Bengtsson, Linus, Xin Lu, Anna Thorson, Richard Garfield, and Johan Von Schreeb. . “Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: A post-earthquake geospatial study in Haiti.” PLoS Medicine , no. : e.
Bergen, Mark. . “Mobile startup Jana launches new tool to reach next billion consumers, on their phones: Unilever among several top brands to test out the novel platform.” http://adage.com/article/digital/unilever-taps-emerging-market-mobile-platform-jana// (accessed February , ).
Blumenstock, Joshua E. . “Inferring patterns of internal migration from mobile phone call records: Evidence from Rwanda.” Information Technology for Development , no. : –.
Burrell, Jenna, and Kentaro Toyama. . “What constitutes good ICTD research?” Information Technologies & International Development , no. : .
Calabrese, Francesco, Giusy Di Lorenzo, Liang Liu, and Carlo Ratti. . “Estimating origin-destination flows using mobile phone location data.” IEEE Pervasive Computing , no. : –.
Chan, Anita Say. . Networking Peripheries: Technological Futures and the Myth of Digital Universalism. Cambridge, MA: MIT Press.
Chirumamilla, Padma, and Joyojeet Pal. . “Play and power: A ludic design proposal for ICTD.” ICTD () , –.
De Bruijn, Mirjam. . “The telephone has grown legs: Mobile communication and social change in the margins of African society.” African Studies Center, University of Leiden.
Donner, Jonathan. . “The rules of beeping: Exchanging messages via intentional ‘missed calls’ on mobile phones.” Journal of Computer-Mediated Communication , no. : –.
Donner, Jonathan. . “Research approaches to mobile use in the developing world: A review of the literature.” The Information Society , no. : –.
Donovan, Kevin, and Aaron K. Martin. . “The rise of African SIM registration: The emerging dynamics of regulatory change.” First Monday , no. –. doi:https://doi.org/./fm.vi..
Eagle, Nathan, Yves-Alexandre de Montjoye, and Luís M. A. Bettencourt. . “Community computing: Comparisons between rural and urban societies using mobile phone data.” In CSE ’: International Conference on Computational Science and Engineering, vol. , –. IEEE.
Eagle, Nathan, Alex Sandy Pentland, and David Lazer. . “Inferring friendship network structure by using mobile phone data.” Proceedings of the National Academy of Sciences , no. : –.
Frias-Martinez, Vanessa, Jesus Virseda, Alberto Rubio, and Enrique Frias-Martinez. . “Towards large scale technology impact analyses: Automatic residential localization from mobile phone-call data.” In Proceedings of the th ACM/IEEE International Conference on Information and Communication Technologies and Development. New York: ACM.
Gereffi, Gary, Ronie Garcia-Johnson, and Erika Sasser. . “The NGO-industrial complex.” Foreign Policy : –.
Gillwald, Alison (Ed.). . “Towards an African e-Index: Household and individual ICT access and usage across  African countries.” LINK Centre, Wits University, School of Public and Development Management.
Gillwald, Alison, Anne Milek, and Christoph Stork. . “Gender assessment of ICT access and usage in Africa.” Towards Evidence-based ICT Policy and Regulation , no. : –.
Gitau, Shikoh, Gary Marsden, and Jonathan Donner. . “After access—Challenges facing mobile-only Internet users in the developing world.” In G. Fitzpatrick and S. Hudson (Eds.), Proceedings of the th International Conference on Human Factors in Computing Systems (CHI ), –. New York: ACM.
Gomez, Ricardo, Luis F. Baron, and Brittany Fiore-Silfvast. . “The changing field of ICTD: Content analysis of research published in selected journals and conferences, –.” In Proceedings of the Fifth International Conference on Information and Communication Technologies and Development, –. New York: ACM.
Graham, Mark, and Laura Mann. . “Imagining a silicon savannah? Technological and conceptual connectivity in Kenya’s BPO and software development sectors.” The Electronic Journal of Information Systems in Developing Countries , no. : –.
Grameen Foundation. . “Women, mobile phones, and savings: A Grameen Foundation case study.” Grameen Foundation USA. https://grameenfoundation.org/resource/womenmobile-phones-and-savings-case-study (accessed August , ).
GSM Association (GSMA). a. “Definitive data and analysis for the mobile industry.” https://www.gsmaintelligence.com (accessed August , ).
GSM Association (GSMA). b. “The mobile economy: Sub-Saharan Africa.” https://www.gsmaintelligence.com/research/?file=cefbdfdceb&download (accessed August , ).
GSM Association (GSMA). . “Sub-Saharan Africa Mobile Observatory.” London: GSM Association & Deloitte.
Henrich, Joseph, Steven Heine, and Ara Norenzayan. . “The weirdest people in the world?” Behavioral and Brain Sciences : –.
Hofstede, Geert H., and Geert Hofstede. . Culture’s Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations. London: Sage.
Howard, Philip N., Aiden Duffy, Deen Freelon, Muzammil M. Hussain, Will Mari, and Marwa Maziad. . “Opening closed regimes: What was the role of social media during the Arab Spring?” SSRN.
Howard, Philip N., and Muzammil M. Hussain. . Democracy’s Fourth Wave? Digital Media and the Arab Spring. Oxford: Oxford University Press.
Igarashi, Tasuku, Jiro Takai, and Toshikazu Yoshida. . “Gender differences in social network development via mobile phone text messages: A longitudinal study.” Journal of Social and Personal Relationships , no. : –.
International Telecommunications Union (ITU). . “Facts and figures.” http://www.itu.int/en/ITU-D/Statistics/Pages/facts/default.aspx (accessed February , ).
InVenture. . “Modern credit for a mobile world.” https://inventure.com/#learnmore (accessed July , ).



. , . ,  . 

Jack, William, and Tavneet Suri. . “Mobile money: The economics of M-PESA.” NBER Working Paper No. w. National Bureau of Economic Research. https://www.nber.org/papers/w (accessed August , ).
James, Jeffrey, and Mila Versteeg. . “Mobile phones in Africa: How much do we really know?” Social Indicators Research , no. : –.
Jimenez Cisneros, Andrea. . “Technological innovations within ICT hubs—The case of BongoHive, Zambia.” Master’s thesis, Royal Holloway.
Kang, Juhee, and Moutusy Maity. . “Texting among the bottom of the pyramid: Facilitators and barriers to SMS use among low-income mobile users in Asia.” SSRN.
Katz, James E., and Mark Aakhus. . Perpetual Contact: Mobile Communication, Private Talk, Public Performance. Cambridge, UK: Cambridge University Press.
Kelven, Udoh. . “African Opera Mini users saved US$ million in mobile Internet data, says Opera Software.” http://techloy.com////african-opera-mini-users-savedusmillion-in-mobile-internet-data-says-opera-software/ (accessed February , ).
Koranteng, Kweku. . “An inclusive growth approach to understanding network neutrality in Ghana.” Research ICT Africa.
Levy, Mark R., and Indrajit Banerjee. . “Urban entrepreneurs, ICTs, and emerging theories: A new direction for development communication.” Asian Journal of Communication , no. : –.
Lindtner, Silvia, Garnet D. Hertz, and Paul Dourish. . “Emerging sites of HCI innovation: Hackerspaces, hardware startups & incubators.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, –. New York: ACM.
Lindtner, Silvia, and David Li. . “Created in China: The makings of China’s hackerspace community.” Interactions , no. : –.
Ling, Rich. . Taken for Grantedness: The Embedding of Mobile Communication into Society. Cambridge, MA: MIT Press.
Ling, Rich, and Heather Horst. . “Mobile communication in the global South.” New Media & Society : –. doi:./.
Marchant, Eleanor. . “Who is ICT innovation for? Challenges to existing theories of innovation, a Kenyan case study.” CGCS Occasional Paper Series on ICTs, Statebuilding, and Peacebuilding in Africa. Center for Global Communication Studies, University of Pennsylvania. http://www.global.asc.upenn.edu/app/uploads///Marchant_Who-isICT-Innovation-for.pdf (accessed October , ).
Mbiti, Isaac, and David N. Weil. . “Mobile banking: The impact of M-Pesa in Kenya.” NBER Working Paper No. w. National Bureau of Economic Research.
Meeker, Mary. . “Internet trends.” Kleiner Perkins Caufield & Byers.
Mirani, Leo. . “Millions of Facebook users have no idea they’re using the Internet.” http://qz.com//milliions-of-facebook-users-have-no-idea-theyre-using-the-internet/ (accessed February , ).
Molony, Thomas. . “ICT and human mobility: Cases from developing countries and beyond.” Information Technology for Development , no. : –.
Morawczynski, Olga. . “Saving through the mobile: A study of M-PESA in Kenya.” Advanced Technologies for Microfinance: Solutions and Challenges: .
Moyo, Last. . “Introduction: Critical reflections on technological convergence on radio and the emerging digital cultures and practices.” Telematics and Informatics : –.




Nielsen Company. . “Pay-as-you-phone: How global consumers pay for mobile.” http://www.nielsen.com/us/en/insights/news//how-global-consumers-pay-for-mobile.html (accessed December , ).
Ohri, Chandni. . “Role of digital financial services to empower people living below the poverty line.” Grameen Foundation Insights. http://www.grameenfoundation.org/blog/role-digital-financial-services-empower-people-living-below-poverty-line (accessed August , ).
Olson, Parmy. . “This app is cashing in on giving the world free data.” http://www.forbes.com/sites/parmyolson////jana-mobile-data-facebook-internet-org/#cdfca (accessed January , ).
Pearce, Katy. . “Phoning it in: Theory in mobile media and communication in developing countries.” Mobile Media & Communication , no. : –.
Pew Research Center. . “Cell phones in Africa: Communication lifeline.” http://www.pewglobal.org////cell-phones-in-africa-communication-lifeline/ (accessed February , ).
Punathambekar, Aswin. . “Reality TV and the making of mobile publics: The case of Indian Idol.” In Marwan Kraidy and Katherine Sender (Eds.), Real Worlds: Global Perspectives on the Politics of Reality Television, –. New York: Routledge.
Riles, Annelise. . The Network Inside Out. Ann Arbor: University of Michigan Press.
Sey, Araba. . “Exploring mobile phone-sharing practices in Ghana.” Info , no. : –.
Sey, Araba. . “Managing the cost of mobile communications in Ghana.” In M. Fernández-Ardèvol and A. Ros (Eds.), Communication Technologies in Latin America and Africa: A Multidisciplinary Approach, –. Barcelona: IN.
Sey, Araba. . “‘We use it different, different’: Making sense of trends in mobile phone use in Ghana.” New Media & Society , no. : –.
Steinfield, Charles, Susan Wyche, Tian Cai, and Hastings Chiwasa. . “The mobile divide revisited: Mobile phone use by smallholder farmers in Malawi.” In Proceedings of the Seventh International Conference on Information and Communication Technologies and Development. New York: ACM.
Truong, Alice. . “The fastest-growing mobile phone markets barely use apps.” http://qz.com//the-fastest-growing-mobile-phone-markets-barely-use-apps/ (accessed February , ).
University of Ghana Business School (UGBS). . Ghana Business Development Report. University of Ghana, Legon, Ghana.
Willems, Wendy. . “Participation—in what? Radio, convergence and the corporate logic of audience input through new media in Zambia.” Telematics and Informatics : –.
Zachary, Pascal. . “Black star: Ghana, information technology and development in Africa.” First Monday , no. . http://firstmonday.org/ojs/index.php/fm/article/view/ (accessed October , ).

  .............................................................................................................

ETHICS OF DIGITAL RESEARCH

  ......................................................................................................................

     ......................................................................................................................

 . 

C digital research involves important ethical challenges. The emergence of computational systems, such as social network sites, and the role of mobile media in nearly all aspects of human life lead to the potential to ask fascinating new research questions, but also raise new ethics questions, which the social sciences continue to grapple with. While the ethical principles established for research conducted with humans in the Belmont Report—respect for persons, beneficence, and justice—remain core ethical principles, how to implement these principles in digital research is an ongoing conversation. Challenges include securing informed consent when doing research at scale, the difficulties that arise with collaborations between academic and industry researchers, evolving norms around privacy, and establishing the nature of risks and benefits in computational social science. My personal involvement in this ongoing conversation dramatically intensified in the summer of  when a paper I co-wrote on emotional contagion on Facebook was published, and controversy erupted over the ethical aspects of the study Kramer, Guillory, & Hancock (). The worldwide attention to the study highlighted many of the ethical issues associated with conducting digital research. Since that time I have engaged in hundreds of conversations about the study with a wide range of stakeholders, from Facebook users to other researchers, IRB administrators, journalists, industry practitioners, and government regulators. While many others have addressed the specific issues of the Facebook Emotional Contagion study, including excellent papers on the ethical and legal aspects (Grimmelman, ; Meyer, ), in this chapter I focus on sharing my takeaways from the conversations I’ve had since the study was published. I start with what I learned from the many emails I received when the controversy was prominent in the media. 
I then lay out my takeaways from all the other conversations I’ve had, then introduce the five chapters in this section and describe how they contribute to the larger conversation about ethics in digital research.



 . 

What I Found Out from Facebook Users’ Emails

The Facebook emotional contagion study triggered widespread critical attention from users, the media, and the academic community. One effect of the widespread media attention was that I received hundreds of emails from colleagues and Facebook users in response to the study. What did these users have to say about it? In answering this question, I have excluded the emails received from my colleagues in academia, most often professors but also administrators with oversight over ethical aspects of research at their institutions. Academic professionals were almost exclusively interested in institutional review board (IRB)-related processes, especially around informed consent. For example, many inquiries asked whether the study had been approved by an IRB and how the study could have been conducted without informed consent. At the end of this chapter I highlight the issues that these colleagues raised, especially about informed consent. Other emails from professionals included notices of investigation requests to the Office for Human Research Protections or other legal and administrative requests. Here I focus on the emails from Facebook users, to help understand what their concerns were. When I read through all the emails, I noted several important themes. The first was anger or surprise that the newsfeed had been manipulated. An example of the comments made is: “How dare you manipulate my Newsfeed!” Emails with this theme indicated that users were unaware that manipulations were even possible on the newsfeed, suggesting that many of the correspondents conceptualized the newsfeed as an objective window into their social world. The second theme in the emails reflected how concerned individuals were that any manipulation of their newsfeed may have affected their awareness of their social world.
For example, one user noted that if she had been in the experiment she might not have learned that her friend’s father had passed away, and therefore would have missed the funeral. These emails made salient the fact that users value the information they receive in their newsfeed, which helps them stay aware of important events in their social lives. This theme contrasts with the frequent portrayal of the newsfeed as a stream of pointless trivia shared by narcissists. Users were upset not only at having their newsfeed manipulated (without their consent or, in some cases, awareness), but also that a manipulation of their feed may have affected their ability to monitor or manage their social lives. The third theme that emerged was related to the special status of emotions. Many emails indicated anger and worry that users’ emotions may have been manipulated. Several emails referred to emotions as highly personal and indicated that manipulating people’s emotions in an experiment was not acceptable. These emails made clear that the manipulation of emotions represented a violation of users’ autonomy. The last theme comprised queries about whether a user was in the experiment. For example, one user asked, “I want to know if I was in this experiment.” Given the relatively large size of the study, correspondents simply wanted to know whether they had been in the study. This question is unique to big data social science studies, in which there is some possibility, although small, that the person asking was involved in the study. Contrast this study, and others like it, with a typical lab study, in which participants are usually students at a university, recruited specifically for the study. This theme suggests that big data social science research feels potentially more personal than standard laboratory research. When looking across these themes, one thing that became clear to me was how people’s expectations were violated. At the time, users of Facebook were not expecting that their newsfeed was curated by an algorithm. Indeed, in a study conducted soon after the controversy, Eslami and colleagues () observed that many of their participants were unaware that the newsfeed was curated rather than simply a chronologically ordered list of posts from their friends and family. This observation has led my colleagues and me to study how people reason about these kinds of complex computational social systems, such as Facebook’s newsfeed or Twitter. In work with Megan French (French & Hancock, ), we have begun examining the kinds of folk theories people have about such systems by asking them to identify metaphors that best describe how the newsfeed or Twitter works. We ask about metaphors because folk theories are intuitive and implicit but also causal and explanatory of people’s behavior (Gelman & Legare, ). What those metaphors reveal is that people hold very different folk theories about these systems, producing some metaphors with a positive connotation, such as personal shopper, and others with strongly negative implications, such as spy and paparazzi.
By improving our understanding of how people intuitively conceptualize how these systems operate, our goal is to better understand why and how people engage with these kinds of cyber social systems and to help avoid violating their expectations.

Common Themes from Speaking about the Challenges of Digital Research

Since the publication of the Facebook emotional contagion study, I have had the opportunity to speak with hundreds of people about the ethical challenges of digital research. In these conversations several important themes and questions have repeatedly arisen. In this section I lay out what I think are the key takeaways and enduring challenges from these conversations. The chapters in this section of the handbook on the ethics of digital research address many of these questions and challenges and raise some novel ones as well. I point out where these chapters touch on the issues I have heard in my own conversations, though all of the chapters address issues of risk, consent, and some of the novel complexities of digital research.



 . 

Evolving Research Ecosystem

The digital research ecosystem has a host of complexities that need not be considered in more traditional research involving lab studies and surveys. Digital research also has a larger set of stakeholders, which can include not only users and researchers but also industry partners, regulators, and technology developers. Each of these stakeholders has different concerns about ethics, and regulations differ by stakeholder. For example, academic researchers are bound by the Common Rule from the Department of Health and Human Services, while industry partners are regulated by the Federal Trade Commission. The digital research ecosystem is also continuously evolving, with new technical developments making new forms of research feasible. IRBs struggle to keep up with these new research techniques and their associated ethical concerns.

The chapter by Crowcroft, Haddadi, and Henderson in this handbook does an excellent job of reviewing the challenges and dilemmas of the evolving digital research ecosystem. The authors advocate developing a new ethics framework and a privacy protection ecosystem that consists of a supporting legal system, the technology sector, analytics firms, consumer rights groups, researchers, and Internet users.

Heterogeneous Populations and Cultures

A related challenge of digital research is the vastly increased reach of social science research, which can include potentially vulnerable populations and cultures that have deeply different expectations and political sensitivities. The chapter by Mai and Repnikova in this handbook examines doing research in China. They provide a useful analysis and draw on the notion of flexibility and cultural and political sensitivity articulated by Buchanan and Ess (2008). China's pervasive censorship, weak legal frameworks for privacy protection, and political repression make its Internet users especially vulnerable. The chapter reveals that little ethical consideration has been paid to online experiments, surveys, content analysis, and digital ethnography studies in China. The authors call for researchers to thoroughly consider ethical implications and to elaborate the logic and process of protections for research subjects.

The chapter by Pearce in this handbook similarly tackles some of the challenges of doing research in social and cultural environments that may put researchers and participants at risk in ways that we rarely need to consider in countries with strong legal and democratic institutions. The chapter highlights how the very tools that make digital research possible can also be compromised by hostile actors, exposing researchers and their participants to novel forms of substantial risk.

Informed Consent Best Practices

One of the biggest challenges in doing large-scale research is informed consent, one of the most important protections in human subjects research. A key challenge in digital research is seeking informed consent at scale, when thousands or hundreds of thousands of users may be involved in a study. How and when should informed consent be required for participation in a study, and how should it be obtained so that the consent process does not interfere with the users' own goals, such as browsing social media? Crowcroft and colleagues in this handbook lay out some thoughtful ideas regarding different forms of consent, including secured versus sustained consent and novel ideas such as data donation.

A related challenge is the issue of informed consent when the data have already been collected. Much digital research is based on behavioral trace data, and often these data have already been collected without the users' consent. The chapter by Menchen-Trevino in this handbook offers some useful probing questions that researchers can ask to help determine whether research on secondary data for which consent wasn't granted is reasonable, such as whether someone who has provided the data would be comfortable with the intended analysis.

Understanding Risks

Research ethics involve balancing the risks and benefits of a study. One of the central challenges for digital research is assessing risks to participants. These risks, which have been considered carefully by social scientists over the last several decades, have been transformed by advances in digital research methods. In some cases the risks have been exacerbated, such as risks to privacy when working with large sets of social data. In other cases, risks can be minimal in digital research, as with online surveys or small manipulations in online environments (such as modifying a user interface).

In the Facebook Emotional Contagion study the balance of risk and benefit was novel to much of the public, who were unaware that their social feeds were already algorithmically curated. In my own analysis of the risk-benefit balance I considered the risk to individuals of modifying the ranking of posts in their newsfeed to be minimal, based on our laboratory research on emotional contagion. Nonetheless, my perception of the risks did not match the public's assessment. The reaction to the Facebook study made clear that users were upset with the manipulation regardless of the risk involved. Further, while risks to individuals may be minimal, as digital research is conducted on a massive scale there may be aggregate risks to society that are not minimal, and these kinds of societal risk have not been considered in more traditional social science research.

In their chapter in this handbook, Crowcroft and colleagues lay out a useful framework for thinking about the possible risks for participants by considering possible harms, including psychological, physical, and economic. As the reactions from users to the Facebook study make clear, the violation of expectations is also an important risk. As our work on folk theories reveals, if users think of a system such as Facebook using the metaphor of a platform or a window, and that system violates expectations associated with that metaphor (e.g., a platform conducting experiments), this can upset and anger users. The Menchen-Trevino chapter in this handbook provides a technique that may help researchers avoid violating participant expectations: asking non-researchers about their reactions to a proposed study to understand how users might perceive its risks.

Privacy Practices Continue to Evolve

Digital researchers, especially those trained in data science, have made substantial progress in developing protections for user privacy. Nonetheless, privacy concerns remain an important issue as privacy practices continue to evolve, from changes in our social norms and preferences to novel ways in which identities can be shared or detected. The Crowcroft chapter lays out a framework that includes a privacy protection ecosystem built on the legal system, the technology sector, analytics firms, consumer rights groups, researchers, and Internet users.

Technical and Legal Issues

The many technical and legal issues that come up in any given project create a vexing challenge for digital researchers. For example, there is substantial debate about whether researchers need to follow a system's terms of service. And of course the technologies available for digital research are continuously developing and evolving novel capabilities, each of which comes with potentially new ethical issues. Mislove and Wilson, in their chapter in this handbook, lay out a useful, practical guide for researchers in the digital space. The chapter provides a detailed review of the technological, legal, and ethical issues concerning Internet data collection. The authors describe various methods to collect data from different web services and discuss the potential legal and ethical implications of data collection, usage, and sharing. I recommend this chapter as a handy resource for researchers.

Ongoing Challenges

While these are some of the key challenges that have emerged from conversations on digital research and ethics, and these chapters help move that conversation substantially forward, there remain some issues that raise entirely new kinds of ethics questions. The use of crowdworkers in more and more aspects of digital research raises some of these new questions. For instance, if crowdworkers are asked to help with content analysis, do they need to provide informed consent in case they are exposed to content that might carry some risk (e.g., content analysis of health-related messages, such as those about depression)? If crowdworkers become part of a research team, becoming crowd researchers, what ethical training should be required for them, and how should this be regulated?




Another example is autonomous experiments, in which artificial intelligence (AI) is used to automatically develop and implement new experiments. These kinds of autonomous experiments have the potential to dramatically enhance research in some domains, such as online auctions, but they raise key ethical questions. How might ethical constraints be programmed into such a system? While these questions and more will evolve as our conversation on ethics in digital research progresses, the chapters in this section, along with all the other chapters in this handbook that carefully consider research practices, make important and useful contributions to our community’s ongoing conversation.

R Buchanan, E., & Ess, C. (). Internet research ethics: The field and its critical issues. In Kenneth Einar Himma and Herman T. Tavani, The handbook of information and computer ethics. Hoboken, NJ: John Wiley & Sons. Eslami, M., Rickman, A., Vaccaro, K., Aleyasen, A., Vuong, A., Karahalios, K., Hamilton, K., & Sandvig, C. (, April). I always assumed that I wasn’t really that close to [her]: Reasoning about invisible algorithms in news feeds. In Proceedings of the rd annual ACM Conference on Human Factors in Computing Systems (pp. –). New York, NY: ACM. French, M. & Hancock, J. T. (, May). What’s the folk theory? Reasoning about cyber-social systems. Paper presented at the annual meeting of the International Communication Association (ICA), San Diego, CA. Gelman, S. A., & Legare, C. H. (). Concepts and folk theories. Annual Review of Anthropology, , –. Grimmelman, J. (). The law and ethics of experiments on social media users. Colorado Technology Law Journal, , –. Kramer, A., Guillory, J., & Hancock, J. T. (). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Science, , –. Meyer, M. N. (). Two cheers for corporate experimentation: The A/B illusion and the virtues of data driven innovation. Colorado Technology Law Journal, , –.

  ......................................................................................................................

      A Proactive Research Ethics ......................................................................................................................

 -

. I

.................................................................................................................................. D and mobile technologies keep users from getting lost, being bored, or feeling disconnected; they also collect unprecedented amounts of data about user behavior. These digital traces are being analyzed to enable companies to sell more, banks to manage risk, and governments to identify terrorist or criminal activities. Concern has been raised about how this kind of analysis gives institutions more control over individuals, from Facebook recognizing a picture of one’s face to corporations estimating the probability that a person will commit a felony (Richards & King, ). On the other hand, academic use of trace data is beginning to further the understanding of human behavior in a wide range of domains. Researchers are using mobile phone trace data to understand (and prevent) depression (Proudfoot, ; Whittaker et al., ), as well as to see the desires of a society as expressed through search terms (Waller, ) and to help understand public opinion in an age when polls are becoming less reliable (DeSilver & Keeter, ). The industrialized world is rapidly becoming a collection of data-driven societies as people increasingly rely on personalized technologies in their day-to-day lives. Digital technologies impact any study of contemporary life in these societies. Researchers, such as those who have contributed chapters to this volume, are beginning to develop methods to incorporate trace data into their studies. This chapter provides a discussion of the ethical issues particular to the endeavor of designing such methods. Digital trace data are observations of digital behavior (see Menchen-Trevino, ). This includes records of geographic location, web browser history, and the use of particular service features, as well as intentionally contributed traces such as social

     



media posts, email, and search terms. Some digital traces are publicly available, while others remain confidential among the service provider, the user, and the designated recipients.1 Digital traces are an important part of big data, which is usually defined in technological terms, but for the purposes of social science, as I have written previously, “big data are big because their analytical potential is qualitatively different than those researchers have been able to collect and assess in the past” (Menchen-Trevino, , p. ). That is, a film is “just” a sequence of static photographs viewed in succession, but the expressive possibilities of film are quite different than those of photography (Mayer-Schönberger & Cukier, ). Researchers interested in quantitative generalizations and prediction have been quick to embrace these new trace data sets, but they are also relevant to understanding smaller communities and contexts, particularly when combined with other forms of research. Trace data sets can be combined with experiments, surveys, and even indepth interviews or participant observation. Researchers might work with traces such as posts from a social media site, emails from a corporate email server, or a mobile phone provider’s logs of geographic information. Trace data could also be collected from individuals through a commercial research panel of people who have consented to provide such information in exchange for compensation, or participants could be recruited directly by the researcher. Generally, traces document real-world observations of behavior, but experiments can be performed with the cooperation of a service provider (e.g., Bond et al., ; Kramer, Guillory, & Hancock, ). As researchers experiment with new methods, it is particularly important to consider the ethical underpinnings of the norms being created. 
When researchers created modern survey methods in the mid-twentieth century, they scarcely imagined that this would reshape what public opinion is and how it impacts society (Abbott, 2004). This is not to say that in the current methodological firmament researchers will be able to predict all of the consequences of their work. However, an awareness of how powerful and value-laden methodological norms have become should inspire current innovators to make active choices beyond the research ethics protocols required by institutions or the legal frameworks that necessarily lag behind the rapid development of new technologies and methods (for a detailed guide see Zevenbergen et al., ).

Furthermore, academics publish their research, and with this comes public scrutiny not faced by corporate or government researchers, who often do not disclose their findings. Common industry practices may outrage the public, particularly when the most widely used digital services are involved. An example is the public outrage that followed the publication of a study that reported the results of an experiment in which Facebook varied its news feed algorithm to display the posts of friends based in part on the emotion words they contained, to see whether this impacted the emotion words used by the affected users in their status updates (Kramer et al., 2014). Facebook and other websites run experiments frequently in order to improve the newsfeed and other features. However, the study was quickly dubbed "the emotion manipulation study," and the controversy was even addressed by a US senator, who asked the Federal Trade Commission to look into the issues raised by the public reaction to the study (Kerr, ). In response to this controversy Christian Rudder, cofounder of the OkCupid dating website, wrote on the company blog: "We noticed recently that people didn't like it when Facebook 'experimented' with their news feed. Even the FTC is getting involved [see https://www.cnet.com/news/senator-asks-ftc-to-investigate-facebooks-mood-study/]. But guess what, everybody: if you use the Internet, you're the subject of hundreds of experiments at any given time, on every site. That's how websites work" (2014). Rudder went on to give examples of experiments performed by OkCupid that could be far more consequential for the unknowing users than the Facebook study, such as telling users they "matched" with another user according to their algorithm when they actually "matched" by random chance.

The public disclosure of results is one of the principal societal benefits of academic research, particularly if it can contribute to broader public awareness of how these data are used by the companies that provide digital platforms. Preparing to serve in this new role of educating the general public about their digital world is essential for researchers who publish studies based on trace data.

This chapter is organized around ethical considerations relevant to the phases of research, from design to data collection, analysis, and reporting. The chapter ends by outlining the questions that researchers in the different phases of a project that uses digital trace data should consider in order to practice a proactive research ethics of digital trace data collection and analysis. The goal is to provide a tool for thinking through ethical considerations as research methods are developed, not a prescription or recipe to follow.

Every researcher is trained and has experience in a particular field that accepts only certain methodological choices and research goals as valid.
Boundaries are constantly being pushed by new methods, but only steps that are carefully justified in terms familiar to the relevant community of peer reviewers will be published in high-impact venues. These disciplinary limitations can be useful, as beginning each new project from scratch would be extremely laborious. However, for those exploring the potential of new data and methods, a greater awareness of methodological options beyond one’s own subdiscipline is particularly useful. The research design considerations discussed below begin with research goals in order to provide a framework for discussing some fundamental assumptions that are too often unstated.

. R D

Researchers conceive of a goal for their investigation before data are collected or acquired. The specific goal for a project tends to fall within an overarching category of goals deemed appropriate to a particular paradigm or tradition of research. I use the terms qualitative and quantitative when necessary, as they are pervasive in the literature on social science methodology. However, the distinction between the analyses of numerical versus nonnumerical data is not the defining difference between the paradigms. More important are the explanatory programs, approaches to data, and sources of data. Each methodological choice has particular ethical implications. The discussion focuses on issues related to incorporating digital trace data into a research project.

What counts as a valid explanation varies quite a bit among disciplines and between subfields of social research. These differences are in part based on epistemological distinctions beyond the scope of this chapter (see Bryman, ; Creswell, ), but they also have to do with (often unstated) high-level goals, which are related to research methods. Below I summarize two perspectives on what these goals are (Abbott, 2004; Goertz & Mahoney, 2012) and discuss the ethical issues associated with incorporating trace data analysis into research with each goal.

Abbott (2004) defines three types of explanation used in the social sciences. The pragmatic view of explanation is satisfied when control over an outcome is achieved; it has typically used methods such as experiments or surveys. This goal is fairly explicit in applied fields where research is intended to give advice to policy makers, regulators, or businesses with some degree of power over the relevant variables. However, predictive analytics, and causal analysis in general, tend to fall into this category even if applied ends are not explicitly stated. The results of pragmatic research give those with the power to influence the predictor variables an improved ability to control the response variable. For example, trace data for an online course might include each user's login date/time, duration of use, and final grade records. An analysis of these data might find a strong positive correlation between the number of distinct days a student logged into the course and the final grade, holding the duration of use constant.
If this finding is not spurious, it increases control over grades for those with the power to influence login frequency across different days: the students themselves (if they are given this information) or possibly the teachers or administrators who set course requirements.

In some cases, predictions are not intended to give control over the response variable, but to enable those with access to the data to differentiate their behavior toward the individuals from whom the data are collected. For instance, the retailer Target wanted to identify pregnant customers earlier than its competitors, during the second trimester of pregnancy, in order to provide custom advertising during this time of critical change in purchasing habits (Duhigg, 2012). This kind of trace data analysis is descriptive; predictor variables (items purchased) were empirically derived by comparing the purchases of those who registered for a baby registry in the months prior to their registration with those of the average customer. While this analysis does not give Target any power over the outcome variable (likelihood of pregnancy), it does increase its pragmatic power to customize its advertising, or any other action it wants to take, based on this prediction.

The potential for data holders to personalize and customize their treatment of users, customers, or citizens is the double-edged sword of trace data. This is discrimination in the sense of disaggregating a population based on certain characteristics, but it may or may not be used to differentiate by gender, age, race, or other categories a society deems appropriate to protect from discrimination. There are laws in many countries, including European Union countries and the United States, regarding the use of personal data to discriminate against protected classes of people. However, the legal code lags behind the technological possibilities, not just today but by definition. A data analyst at Target said: "We are very conservative about compliance with all privacy laws. But even if you're following the law, you can do things where people get queasy" (quoted in Duhigg, 2012). Similarly, Eric Schmidt, then CEO of Google, said that the laws are written by lobbyists hired by vested interests, and currently:

With your permission you give us more information about you, about your friends, and we can improve the quality of our searches . . . . We don't need you to type at all. We know where you are. We know where you've been. We can more or less know what you're thinking about. (video embedded in Thompson, )

In the design phase of pragmatic research one of the key considerations is to think carefully about who could gain control over what (or whom). This can be difficult to assess as technology changes and revelations emerge, such as those provided by Edward Snowden about covert US government surveillance programs (Hosenball, ). However, a first step is to ask the "queasy" question: Would someone who may have provided these data be uncomfortable with the intended analysis?

This is an empirical question, but useful progress can begin with the researcher simply imagining herself or himself in the shoes of the research population. In some cases, problems with the research design emerge from this simple mental exercise. However, the answer to the queasy question is often not clear or obvious, particularly if the population is not familiar to the researcher, or if it is a very large and diverse group. Answering the question may require preliminary research to contact members of the population in question. This is particularly important in contexts where individuals may face threats that are not easily identified by those outside of the local context (see Pearce, this volume). However, familiarity with the possibilities of various analytical techniques can cause researchers to lose touch with public perceptions even in their own society, so a preliminary assessment of nonresearchers' reactions to the proposed analysis is recommended in all circumstances. Ideally this would be a well-designed qualitative2 study of attitudes toward a clear description of the intended analysis, conducted by a third party who fully understands but does not directly benefit from the proposed project. However, making any attempt to discern how those providing the data would react to the project is preferable to not doing so at all, particularly reaching out to members of the least powerful groups in the population.
If it is not clear who the least powerful are, think about which groups are marginalized in the societies in question, as online activities happen in this context.

The results of an investigation into the queasy question, once complete, may not be easy to interpret. Is one person's discomfort enough to require a redesign of the research, even if all legal and institutional ethics protocols have been followed? The answer depends on the source of the discomfort. Is the discomfort about consent, autonomy, or other core principles of research ethics? If so, this may warrant a reconsideration of the design. If the discomfort comes from a misunderstanding of the research, clearer communication should be used to describe the project. Each case requires individual consideration, and ideally this determination would be made with the input of a third party who does not stand to benefit from the research (e.g., a colleague with a different research focus). If the answer to the queasy question is "yes" (or a strong maybe), this should give the researcher an important reason to reconsider the analysis, regardless of whether it could still pass institutional and legal review processes.

Of course, the queasy question may be less relevant to studies with the intent to speak truth to power. For example, a study of corporate board membership networks or political donors may not need to be redesigned based on the discomfort of the analyzed groups. In these cases the public's right to know may outweigh the discomfort felt by the wealthy and powerful. However, this consideration is highly relevant when the population (or part of it) may be vulnerable or marginalized.

Pragmatic researchers should also consider how their research could maximize its utility to less powerful groups. More than just avoiding outrage, pragmatic research has the potential to inform and aid the marginalized and the institutions that represent or advocate for them. An example of a pragmatic research project designed to benefit marginalized actors is Turkopticon (Irani & Silberman, 2013). Amazon runs a crowdsourcing platform called Mechanical Turk on which individuals or businesses post tasks online, for example, naming the objects in photographs, and offer payment. Workers complete the tasks and receive the payment. The workers cannot easily connect with each other, and the pay is often quite low per hour, with no benefits (Chen, ).
Turkopticon provides a platform for these crowd workers to connect with each other and enables them to review employers and warn other workers about any problems they have encountered.

The second type of explanation Abbott (2004) describes is called semantic explanation, "where an explanation is an account that enables us to stop looking for further accounts" (p. ). In this view the goal is to translate the phenomenon in question into familiar terms that need no further explanation for the intended audience. For some this final realm of explanation is individual self-interest, while for others it is culture, homophily, or socioeconomic status. This type of translation can extend, refine, or generate theories about social life. These theories may be academic or the applied theories of policy makers and practitioners. Ethnography, qualitative comparative analysis (QCA), and exploratory social network analysis tend to use semantic explanation. Issues of power and vulnerability are as applicable to semantic explanations as they were to pragmatic explanations, but the power in this case is about control of the research narrative and which voices are heard in the analysis. The key question is: Which voices are unheard? Qualitative research methodology associated with this explanation type is often quite self-conscious of power issues and has developed techniques to address them, such as member checking and audit trails, discussed below.



 -

Finally, Abbott (2004) describes syntactic explanation, which satisfies the reader based on the elegance of the explanation or the logic of the argument itself. A simple yet powerful explanation is more syntactically satisfying. One can appreciate the quality of a syntactic argument without agreeing that its analysis is correct. For example, agent-based models (ABMs) can supply very syntactically powerful explanations. ABMs assign individual agents (which may represent people, corporations, countries, etc.) properties and interaction patterns. ABM simulations provide the results of larger scale interactions among agents over time and/or space. The effects of various agent-level rules on systems can be compared, which can lead to surprising and compelling results. For example, Schelling (1971) found that even when individual preferences favored racial integration of their community, a pattern of segregation would result at the community level under certain circumstances. Whether or not, in practice, the individual-level process is different than the ABM describes, the surprising explanatory power of the model is quite clear in the article's narrative, such that it retains syntactic value regardless of its empirical value.

Similarly, the force of much historical narration is in the quality of the logic and the elegance of its presentation. The truth of history matters, but even flawed historical arguments may have syntactic value. Abbott (2004) thus places formal modeling and historical narration in the syntactic realm. Trace data, particularly participatory traces such as email and social media, will be analyzed by historians of the digital age. Some of these records are directly analogous to written records and provide transformative potential as well (see Gardiner & Musto, ; Moretti, ).
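A minimal sketch can make the Schelling-style dynamics concrete. The grid size, vacancy rate, and tolerance threshold below are illustrative choices, not Schelling's original parameters: each agent is content if just 30 percent of its neighbors share its type, yet relocating unhappy agents still produces marked segregation.

```python
import random

random.seed(1)
SIZE, THRESHOLD = 20, 0.3  # 20x20 torus; agents want >= 30% similar neighbors

# 0 = empty cell; 1 and 2 are the two agent types (10% of cells left empty).
cells = [1] * 180 + [2] * 180 + [0] * 40
random.shuffle(cells)
grid = [cells[i * SIZE:(i + 1) * SIZE] for i in range(SIZE)]

def neighbors(r, c):
    """The 8 surrounding cells, wrapping around the grid edges."""
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0):
                yield grid[(r + dr) % SIZE][(c + dc) % SIZE]

def unhappy(r, c):
    me = grid[r][c]
    occupied = [n for n in neighbors(r, c) if n != 0]
    if not occupied:
        return False
    return sum(n == me for n in occupied) / len(occupied) < THRESHOLD

def similarity():
    """Mean fraction of same-type neighbors, averaged over all agents."""
    scores = []
    for r in range(SIZE):
        for c in range(SIZE):
            if grid[r][c] == 0:
                continue
            occ = [n for n in neighbors(r, c) if n != 0]
            if occ:
                scores.append(sum(n == grid[r][c] for n in occ) / len(occ))
    return sum(scores) / len(scores)

start = similarity()
for _ in range(50):  # sweeps: move each unhappy agent to a random empty cell
    agents = [(r, c) for r in range(SIZE) for c in range(SIZE) if grid[r][c]]
    empties = [(r, c) for r in range(SIZE) for c in range(SIZE) if not grid[r][c]]
    moved = False
    for r, c in agents:
        if grid[r][c] and unhappy(r, c):
            er, ec = empties.pop(random.randrange(len(empties)))
            grid[er][ec], grid[r][c] = grid[r][c], 0
            empties.append((r, c))
            moved = True
    if not moved:
        break
end = similarity()
print(f"mean same-type neighbor share: {start:.2f} -> {end:.2f}")
```

Despite every agent tolerating a mostly dissimilar neighborhood, the average share of same-type neighbors rises well above its roughly even starting level, which is the counterintuitive, syntactically compelling result the text describes.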
The ethical considerations in the syntactic approach are identical to those in the semantic approach: issues of narrative control and the representativeness of the available records versus the full population. In summary, the goal of the pragmatic approach is control, the goal of the semantic approach is translation from the realm of events or experiences into an appropriate conceptual schema, and the goal of the syntactic approach is the explanatory power and elegance of the argument itself. Abbott’s association of methods with particular goals is illustrative but not comprehensive, and multi-method projects are difficult to categorize. Goertz and Mahoney () divide the goals of social science along different lines, covering less of the social sciences than Abbott, as they limit themselves to research with a goal of valid causal inference, which leaves out interpretive approaches with semantic goals (p. ). Their analysis fits within just one branch of Abbott’s scheme, the pragmatic. Goertz and Mahoney detail how qualitative methods and quantitative methods can contribute to the pragmatic goal and provide an illustration of how different traditions and methods can work toward the same goals. Each research goal (control, translation, and elegance) can be sought by using behavioral or self-reported data and either an inductive or a deductive approach to the data (or some combination of the four, as they are not mutually exclusive). Digital trace data are behavioral.3 This is clearly an advantage when the goal is to study behaviors, but it is a challenge when assessing attitudes, knowledge, and even demographics. The challenge is not just empirical but also ethical. Richards and King () discuss identity as “the ability of individuals to define who they are” (p. ). According

     



to this view, when analysts define a category using only behavior, they can take away individuals’ agency in defining their own identity. Practices in line with identity as an ethical principle can be at odds with those promoting privacy. Removing identifying contact information from trace data sets is a common practice, making it impossible to collect behavioral and self-reported information from the same participants. It may be that the harm to privacy outweighs the benefits of identity in this case, but researchers working with trace data should ask themselves if they can promote identity without compromising privacy. This could involve collecting self-reports from the same population, if not the same individuals, so that behavioral and self-reported identities can be compared. Trace data can be approached deductively, with theories based on prior research developed before the data collection and analysis applied to the new data, or can be inductively explored. In practice all studies use a mix of inductive and deductive approaches, but the main findings of a study tend to be explained based on one approach or the other. Deductive research needs to carefully consider the issues of power and voice at the research design phase. Can the less powerful groups impacted by the analysis be consulted not just about their comfort with the analysis and data gathering, but also about their views on the topic in question? Inductive projects need to address this question during the analysis phase.

. D C

..................................................................................................................................
Digital trace data are collected in many different ways, some of which are quite different from traditional social science methods. These differences impact informed consent and researcher autonomy. Technology users are deluged with privacy policies and terms of service agreements for every app, program, and service they use. These agreements often outline digital trace collection and sharing practices, but they are so onerous and lengthy that they are rarely read. A study (McDonald & Cranor, ) estimated that the average Internet user would need to spend  hours per year to read the privacy policies of all websites visited. Solove () has called this consent-based system “privacy self-management,” in which “[c]onsent legitimizes nearly any form of collection, use, or disclosure of personal data” (p. ). This puts a heavy burden on the individual, yet relieving this burden and putting these decisions in the hands of an institution means taking power as well as burden from individuals. There is no simple solution to this problem, and for now researchers have to navigate a broken system in which what is legally permitted is often quite out of line with users’ comfort and may be out of step with the type of society one would like to build. Traces collected from digital service providers are often acquired by researchers after users have produced the traces. Internet researchers studying email lists and chat rooms ran into these issues many years ago (Ess, ), and similar issues apply to using secondary data sets, or archival material. If the process by which users originally



 -

registered and began contributing information is clear, then the issue, following Ess (), is to consider carefully the expectations of the users. Additional consent for the study may be warranted or not. Furthermore, some methods of gathering traces are subject to arbitrary changes and new limitations at any time. Social media application programming interfaces (APIs) can be changed by the company providing the API, and they can limit particular uses of the data collected from them. A recent change to the Facebook API cut off access to “friends of friends” data, shutting down analysis of one’s own personal networks, as well as the networks of application users (Menchen-Trevino, ). Similarly, Twitter revoked access to its API from a website that exposed the deleted tweets of elected US politicians after it had been operating for over three years (Trotter, ). Furthermore, the sampling procedures of APIs are often held as trade secrets, introducing unknown biases that are just beginning to be researched in the few cases where it is possible to make comparisons (González-Bailón, Wang, Rivero, Borge-Holthoefer, & Moreno, ). This highlights a more general issue. Depending on corporations for data puts researchers in a weak position. Even if companies do grant access in some cases, this relationship of dependency limits the kind of research one can do both technically and conceptually. As Paul Lazarsfeld, a sociologist and pioneering media researcher who had many successful commercial research partnerships, said at a conference for media practitioners: [W]e academic people always have a certain sense of tightrope walking: at what point will the commercial partners find some necessary conclusion too hard to take and at what point will they shut us off from the indispensable sources of funds and data? (, p. )

Yet research partnerships with corporations are indispensable. In , % of online adults in the United States used Facebook (Pew Research Center, ), and US Facebook users spent an average of forty minutes per day on the site (Constine, ). The societal influence of such sites is potentially transformative, and researchers cannot ignore any avenue for doing research with these data. The other route to gathering digital trace data is to ask users to consent to using software that collects their digital traces. One option is to work with participant recruitment companies or media ratings firms that have large panels of respondents who have agreed to trace data collection (e.g., web browsing records) in return for compensation. However, what this compensation is and how the data are collected (the code of the software, how the data are cleaned and processed) are often kept as trade secrets. Whether the informed consent process was done properly is a matter of trust in the company; equally important, if these details are unavailable to other researchers, a full replication of the study is impossible. Another possibility is for researchers to obtain consent from participants to use data collection software themselves. This is a fairly uncommon approach, as it has generally involved writing software for this purpose, but it is one I have used and continue to

     



develop (Menchen-Trevino, ; Menchen-Trevino & Karr, ). While this option offers independence from corporate ties and puts the researcher in control of the recruitment process, gaining genuine informed consent is problematic in today’s digital environment. Technology users are besieged with requests for data, and trust in those seeking data is understandably limited. Building an ethical consent process is no small task. The data collection requests that users are accustomed to are designed to nudge them to share as much as possible without reading the fine print. Genuinely informing users of what will be collected and how it will be used could alarm them, even if they routinely “agree” to sharing even more with less-forthcoming corporations. These challenges can be overcome with significant effort at establishing trust, but this approach is quite labor intensive compared to other options. Working with companies that collect trace data from their users, whether through an individually negotiated agreement or a public API, puts control over the consent process in the hands of the company. This eases the data collection process considerably but puts the researcher in a dependent position. A recruitment company or a company with an existing online panel can collect traces from individuals, but details about recruitment and the software used to collect data may be unavailable. It is possible to approach individuals to ask for their consent to use research data collection software, but that can be quite resource intensive. The researcher must carefully evaluate (or create) the consent process not only for legal or procedural reasons but also to address the “queasy question.” Furthermore, beyond the individual-level queasy question, just because people may be comfortable with providing data for the intended analysis does not necessarily make the analysis ethical.
Perhaps the harm would not fall on the participants, but on others; for example, those who volunteer to use driving or activity trackers in order to receive discounts on car insurance or health insurance may be careful drivers or regular exercisers, but if tracking expands to become opt-out rather than opt-in, this creates a surveillance regime in which some are rewarded and others are punished. Furthermore, the immediate incentives for sharing could cause people to ignore difficult-to-imagine future harms. For example, vehement Internet shaming campaigns do not happen to many people, but they can cause the targeted individuals to lose jobs, relationships, and even careers based on information they voluntarily chose to share online4 (Ronson, ). Both the individual and the social consequences should be considered. Issues surrounding data storage and destruction are often a focus of university ethics panels. The principle is that if identifying information has any potential to harm individuals, it should be destroyed as soon as it is not required. With de-identified data this is not particularly problematic (except in cases of suspected fraud; see Broockman & Kalla, ); however, in some cases the data are infused with identifying information. In  AOL publicly released twenty million search queries for research purposes, without any directly identifying information, but search terms are infused with potentially identifying information, and subsequently individual users were identified and even named and contacted by the press (Barbaro & Zeller, ). The same issue applies to records of geographic location and many other digital traces. This



 -

concern is not limited to digital traces, however. It is true of much qualitative data due to their rich detail. For example, a recent controversial ethnography by Alice Goffman () of an urban neighborhood in Philadelphia, given the pseudonym th Street, was investigated by a journalist, who said, “I’d been wandering around the neighborhood I was pretty sure was ‘th Street,’ handing out photos of Goffman, asking anyone willing to talk to me if they remembered this small white girl who used to hang out with Chuck” (Singal, ). Soon the journalist had identified some of Goffman’s research participants, although Goffman, like AOL, had not published any identifying information. Very detailed individual-level information, whether collected by humans or machines, carries the inherent risk of identification. At the same time, transparency is required to use such data to advance research. A secure data repository for such information would allow auditors and researchers attempting replication to access such data confidentially. Such a repository could be expensive to maintain, however, as it would likely become a target for individuals who wanted to breach the confidentiality of participants.
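The mechanics of such re-identification can be illustrated in a few lines of code. The records below are fabricated; the point is that combinations of seemingly innocuous fields (quasi-identifiers) can single out individuals even when no name is stored, which is the intuition formalized in the k-anonymity literature.

```python
from collections import Counter

# Fabricated "de-identified" records: no names, but quasi-identifiers remain.
records = [
    {"zip": "60201", "age_band": "30-39", "device": "iPhone"},
    {"zip": "60201", "age_band": "30-39", "device": "iPhone"},
    {"zip": "60201", "age_band": "50-59", "device": "Android"},
    {"zip": "60202", "age_band": "30-39", "device": "iPhone"},
]

# Count how many records share each combination of quasi-identifiers.
# A group of size 1 is unique: anyone who knows those three facts about
# a person can pick out that person's record.
groups = Counter(tuple(sorted(r.items())) for r in records)
unique = sum(1 for count in groups.values() if count == 1)
print(unique)  # 2 of the 4 records are uniquely re-identifiable
```

With richer traces (search queries, locations, friendship ties) the number of effective quasi-identifiers grows, and the share of unique records approaches 100 percent.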

Analysis and Reporting

..................................................................................................................................
Ethical issues in trace data analysis and reporting have to do with transparency, validity, and continued consideration of informed consent. Digital traces are often created for purposes other than social science research. The way traces are recorded and stored is designed to allow the company to use them to customize a user’s experience or provide data to advertisers, management, or investors. Data are the building blocks of information and knowledge, but building blocks are manufactured. Bowker () points out that all data are by definition not raw, and thus “raw data” is an oxymoron. The design of these building blocks may or may not be relevant to a particular analysis, but failing to consider the intent and possible blind spots of the data could be problematic. When data analysts refer to “raw data,” what they usually mean is that the data are not yet suitable for their intended purpose. The preparatory work is often called data cleaning. Data cleaning is not a problem of finding the right categories to classify data; it is the process of defining the categories themselves. The main ethical challenge in data cleaning is preserving transparency. The term cleaning is unfortunate because it suggests that just “garbage” is removed, when it can be far from obvious what is or is not “clean.” In my own work I have analyzed web visit logs. If I have an analytical category of “website,” it is not clear how to operationalize this concept using URLs that were designed to deliver web content. For example, the URLs https://www.google.com/calendar/render?tab=mc#main_ and https://news.google.com/ are both from Google, but if I were interested in studying online news, a rule that lumped together the URLs news.google.com and google.com/calendar would not be “clean” for my purposes.
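One way to make such a cleaning rule transparent is to encode it as a short, publishable function. The sketch below shows one possible operationalization of “website” for the Google example above; the list of service prefixes is a made-up assumption that would need to be justified and documented for any given study.

```python
from urllib.parse import urlparse

# Illustrative cleaning rule: collapse a URL to an analytical "website"
# unit. The hostname usually suffices, but for a few multi-service hosts
# the first path segment distinguishes distinct services. This prefix
# set is an example, not a validated classification.
SERVICE_PREFIXES = {"calendar", "maps", "drive"}

def site_key(url):
    parts = urlparse(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    segments = [s for s in parts.path.split("/") if s]
    if segments and segments[0] in SERVICE_PREFIXES:
        return f"{host}/{segments[0]}"
    return host

print(site_key("https://news.google.com/"))                       # news.google.com
print(site_key("https://www.google.com/calendar/render?tab=mc"))  # google.com/calendar
```

Publishing a function like this alongside the analysis is exactly the kind of audit trail for data cleaning discussed below: another researcher can see, run, and contest the rule rather than trust a prose description of it.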
Some data cleaning removes corrupt or unusable data, but in many other cases data cleaning

     



defines a variable for the purposes of a particular study. The way these variables are constructed could significantly impact the results of an analysis, and the way variables were “cleaned” for one study may not be appropriate for other studies. Not providing the rules (programming code) that define the variables in an analysis is akin to not providing the questionnaire in a report of survey research. Yet, unlike questionnaires, a detailed audit trail for the data cleaning process is not standard practice in many publications. One possibility is for publishers to require that such processes be included as an online appendix. There are active open data movements (e.g., the Global Open Data Initiative), so this concern is beginning to be addressed. However, for researchers to provide detailed documentation of the data cleaning process, they must have this documentation themselves. A precise audit trail is extremely difficult and time-consuming to provide for point-and-click analysis tools (e.g., Excel). This effectively requires researchers to use code-based tools (e.g., R). While code-based tools are becoming more popular, this is a barrier to entry for researchers new to this form of analysis. Even intentionally contributed publicly available traces can contain data that many users are unaware of creating. For example, a tweet from Twitter collected using Twitter’s API includes not just the text of the tweet, the date and time it was posted, the user name, and associated user information, but also the language of the tweet (per Twitter’s automated analysis) and the device or program used to post the tweet (e.g., iPhone app, TweetDeck, Twitter’s web interface) (Ford, ). When new methods of analysis are developed after the data have been collected, the original consent provided by the user may no longer be valid, even if it could have been fully informed at the time of collection.
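The gap between what a user types and what an API delivers can be seen by inspecting a single record. The payload below is fabricated, but the field names (`lang`, `source`, `created_at`) follow the shape of the tweet objects Twitter’s REST API returned; the posting client arrives as an HTML link that the user never explicitly provided.

```python
import json
import re

# A trimmed, fabricated payload in the shape of a Twitter API tweet
# object: the user typed only `text`, but the record also carries a
# machine-detected language and the posting client.
raw = json.loads("""
{
  "text": "Good morning!",
  "created_at": "Mon Jan 01 08:15:00 +0000 2018",
  "lang": "en",
  "source": "<a href=\\"http://twitter.com/download/iphone\\">Twitter for iPhone</a>"
}
""")

client = re.sub(r"<[^>]+>", "", raw["source"])  # strip the HTML wrapper
print(raw["lang"], "|", client)  # en | Twitter for iPhone
```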
Data analyst Adrian Short () described some data analysis techniques Twitter users are generally unaware of: Even if you’re a professional data analyst, you’ve got no way to know how any one of these techniques could be used, either in good faith, recklessly or maliciously, to invade the privacy and damage the lives of people who have done nothing more than post to Twitter. I hope it’s clear that your tweets can reveal your legal identity, relationships, group memberships, interests, location, attitudes and health even where you haven’t explicitly or obviously volunteered that information. This can, and of course is, being used to change people’s lives, very often for the worst [sic]. It can affect people’s job prospects, relationships, health, finances, it could cost people their liberty or even their lives. There is no meaningful way to consent to this, no way that any one person could comprehend the genuine risk from their social media exposure, either in the light of current known techniques or of data analysis methods yet to be devised [emphasis added]. Increasingly, opting out isn’t an option either. At best you lose the benefits of being part of social networks online. At worst, your absence flags you as an outsider or someone with something to hide.

In this view the ethical burden is on the analyst, since consent is not possible. The importance of ethical decision-making in the analysis of publicly available online



 -

information is, in some places in the world, a matter of life and death. Secular bloggers have been killed in Bangladesh (Alam, ). Today we know that an analysis identifying the most influential secular bloggers in Bangladesh would endanger the lives of those individuals, but this would not have been obvious to those outside the local context before the news of these incidents spread, and a large number of data analysts may have missed this particular bit of foreign news. Such an analysis would not violate any law and would not likely raise alarms for an institutional ethics review panel. Researchers should invest time in learning the context of the personal data they are analyzing, particularly if the actors are identifiable individuals. Even if researchers make an effort to mask participant identities, a determined investigator can identify individuals by putting together small pieces of information, and if the study involves connected individuals, just one identification can reveal the entire group. Most analysis is not a matter of life and death, however. Never revealing identifying information in the analysis of publicly available data may not always be the correct choice. Such information could allow the individuals to participate in defining themselves, and it can provide more transparency in the analytical process. It may be possible in some cases for individuals to opt in to being identified, promoting the autonomy of individual identity, although the risk of identifying the group via these individuals should be assessed. The needs for transparency and appropriate caution in disclosing identifying information may seem to conflict, but there are ways to reconcile this tension. Holding private data confidential is critical to preserving and understanding privacy in the era of big data (Richards & King, , p. ). The code for data cleaning and analysis can be posted without the underlying data set. 
There may be ways to post a de-identified data set, but even if this is not possible the researcher can provide the original data confidentially and securely to an auditor. The original data set could be identified with a fingerprint (Altman, ) at the time of submission for publication so that the auditor could verify that the submitted data set was the same. Transparency allows for replication for research aiming at generalizations, and it provides an audit trail for research more concerned with a specific case. It also can make research fraud easier to detect (e.g., Broockman & Kalla, ). Qualitative research has long collected sensitive private data and analyzed it largely inductively. Some procedures have been developed within the qualitative and mixed methods traditions for addressing the ethical issues related to this approach, some of which may apply to quantitative research as well. Member checking is appropriate for research that intends to reflect participant perspectives in the analysis. This may be true for a network analysis as well as ethnography. In this approach preliminary results are sent to participants, and they are asked whether they feel the interpretation is accurate (Creswell, , p. ). Some researchers send participants their data (e.g., interview transcript, friendship network), but misunderstandings can result from this approach (see Carlson, ). Examining evidence from different sources involves triangulation, a key concept in mixed-method research (Jick, ; Morse, ). Triangulation is particularly relevant to the analysis of trace data, as important contexts can be invisible,
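Altman’s () fingerprint method normalizes data semantically before digesting them; as a simplified, byte-level sketch of the same idea, even a plain cryptographic hash registered at submission time lets an auditor later confirm that a data file is unchanged without the data themselves ever being published.

```python
import hashlib

def fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Digest a data file in chunks and return its hex fingerprint.

    Registering this value when submitting for publication lets an
    auditor later confirm that the confidentially shared file is
    byte-for-byte identical to the one the analysis used.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A byte-level hash is fragile (re-exporting the same data in a different column order changes it), which is why Altman’s method works on a normalized representation; the sketch above only conveys the verification workflow.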




particularly when there are offline components to the phenomenon, but even with entirely online cases there may be backchannels, such as chat and messaging applications or private email, that may be missing from the available data.

Discussion

..................................................................................................................................
Digital traces are often portrayed in the media as powerful but mysterious. This causes fear, and understandably so. Research with digital traces is likely to face increased scrutiny from university ethics panels, research participants, and the press. This type of scrutiny can be negative if it simply restricts the publication of results and does nothing to change the collection or use of trace data behind closed doors. However, it can be positive if the attention is used to inform the public about what data they are providing and the benefits of research. It is paramount for academics to consider the ethical implications of trace data analysis and hold themselves to transparent standards for their collection and analysis. It is also essential that academic work using digital traces continues, expands, and flourishes apace with the society it intends to understand and illuminate. Some researchers take a reactive approach to research ethics, whereby they adhere to the ethical and legal protocols imposed upon them without considering how broader ethical principles may apply. This approach may work well in times of relative stability in research methods, but today new methods are being developed rapidly, and researchers must proactively consider the ethical norms they are creating. Privacy laws and the research ethics guidelines of university institutions can be a good starting point for thinking through the ethical issues of trace data analysis, but they should not limit the considerations of researchers working with new digital data sets. This chapter has considered the ethical issues that researchers incorporating digital trace data into a research design face during the various phases of the process. Table 29.1 lists the key questions discussed.
Many of the questions in table 29.1 relate directly to the basic ethical principles of the Belmont Report (The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, ): respect for persons, beneficence, and justice. Understanding how these principles apply to publicly or semi-publicly available data has long been a concern of Internet researchers (Ess, ; Markham & Buchanan, ). However, working with trace data often adds a new set of powerful actors to the research process: the corporations that create the data and the platforms within which participants interact. It is important to consider carefully how the presence of these new actors, with their own distinct goals and responsibilities, affects research ethics and researcher autonomy. There is much need for new law and policy concerning the ethical issues trace data sets present (see Solove, ; Richards & King, ), as well as ethics-focused analyses of the ways corporations and (when possible) governments are collecting and using



 -

Table 29.1 Key Ethical Questions Discussed in Each Research Phase

Research Design
• Who could gain control over what (or whom)?
• Would someone who may have provided these data be uncomfortable with the intended analysis?
• Is the discomfort about consent, autonomy, or other core principles of research ethics?
• How can the research benefit marginalized groups?
• Which voices are going unheard, and how can I listen for them?
• Can the research support participant identity?

Data Collection & Storage
• Are (or were) participants able to give meaningful informed consent?
• Is the study replicable and/or auditable?
• How might the data collection process limit the researcher’s autonomy?
• Could the participants’ data cause harm to others or to the larger society?
• Is it possible to store identifying data confidentially?

Analysis & Reporting
• Is it appropriate to allow individuals to identify themselves to promote autonomy?
• If posting the data publicly is not possible, what can be done to increase transparency (e.g., data set fingerprinting, secure storage to enable auditing)?
• Does the way the data were initially created and stored create any possible blind spots?
• Is any “data cleaning” clearly documented?
• Could the data analysis endanger the users? (This requires learning about the local context.)
• Is member checking appropriate?
• Is triangulation with other forms of data possible?

these data that go beyond alarming headlines to discuss systemic issues. It is also important to alert the public to what data are being collected and how they can be used. For example, the Panopticlick project from the Electronic Frontier Foundation helps web users understand how their web browser can be uniquely identified and tracked by websites. In addition, the informed consent process for using digital traces for medical research is being improved to become more visual and participant-centered, rather than document-centered (Sage Bionetworks, ). Also, I have developed an informed consent process for web browsing history data collection that involves interactive data visualizations (Menchen-Trevino, ). Academic social scientists, and even academic computer scientists, are a very small part of the digital trace data analysis sector, compared to corporate and government involvement. Academics also teach students who will join industry and government (see Berendt, Büchler, & Rockwell, ). Incorporating ethical perspectives into the teaching of trace data analysis is important, particularly as the ethical codes of relevant industry associations are just beginning to be developed (Digital Analytics Association, ; Gowans, ). Academic researchers are unique in having a primary objective to

     



publish findings and thus sometimes serve as intermediaries to a justifiably skeptical public. This is a role that must be embraced, despite the difficulties it entails. If trace data analysis is seen as threatening by the public, the small window of transparency that academic research brings may be closed by the companies that control the data, while less publicly accountable research continues apace.

N . Unless they are hacked by a government (see Hosenball, ) or other groups (McHugh, ). . The source of possible discomfort is unknown to the researcher, so the purpose is exploratory, and therefore it is more appropriate for qualitative or inductive-centered methods. . Some traces such as social media status updates, blog posts, or online dating profiles can be viewed as self-reports or creation behaviors, but the overwhelming volume of digital traces is logs of mundane system usage or location reports. . Of course individuals can also be harmed by information shared online without their voluntary participation, but the Ronson () article offers a vivid example of a case in which the sharing was voluntary and extreme consequences befell the individual.

R Abbott, A. (). Of time and space: The contemporary relevance of the Chicago School. Social Forces, (), –. http://doi.org/./sf/.. Abbott, A. (). Methods of discovery: Heuristics for the social sciences. New York: W. W. Norton. Alam, J. (, August ). th blogger killed in Bangladesh by suspected militants. The Big Story. Retrieved from http://bigstory.ap.org/article/dfedddeeffafafecaaad/ secular-blogger-killed-bangladesh-fourth-year Altman, M. (). A Fingerprint Method for Scientific Data Verification. In Advances in Computer and Information Sciences and Engineering (pp. –). Springer, Dordrecht. https://doi.org/./----_ Barbaro, M., & Zeller, T., Jr. (, August ). A face is exposed for AOL searcher no. . New York Times. Retrieved from http://query.nytimes.com/gst/abstract.html?res=ECEDDFFFAABCACB Berendt, B., Büchler, M., & Rockwell, G. (). Is it research or is it spying? Thinking-through ethics in big data AI and other knowledge sciences. KI—Künstliche Intelligenz, (), –. http://doi.org/./s--- Bond, R. M., Fariss, C. J., Jones, J. J., Kramer, A. D. I., Marlow, C., Settle, J. E., & Fowler, J. H. (). A -million-person experiment in social influence and political mobilization. Nature, (), –. http://doi.org/./nature Bowker, G. C. (). Memory practices in the sciences. Cambridge, MA: The MIT Press.



 -

Broockman, D., & Kalla, J. (, July ). We discovered one of social science’s biggest frauds: Here’s what we learned. Vox. Retrieved from http://www.vox.com/////lacour-gay-homophobia-study
Bryman, A. (). The debate about quantitative and qualitative research: A question of method or epistemology? The British Journal of Sociology, (), –. http://doi.org/./
Carlson, J. (). Avoiding traps in member checking. The Qualitative Report, (), –.
Chen, M. (, January ). Is crowdsourcing bad for workers? The Nation. Retrieved from http://www.thenation.com/article/crowdsourcing-bad-workers/
Constine, J. (, June). American users spend an average of  minutes per day on Facebook. Retrieved from http://social.techcrunch.com////facebook-usage-time/
Creswell, J. W. (). Research design: Qualitative, quantitative, and mixed methods approaches (th ed.). Thousand Oaks, CA: Sage Publications.
DeSilver, D., & Keeter, S. (, July ). The challenges of polling when fewer people are available to be polled. Retrieved from http://www.pewresearch.org/fact-tank////the-challenges-of-polling-when-fewer-people-are-available-to-be-polled/
Digital Analytics Association. (). The web analyst’s code of ethics. Retrieved from http://www.digitalanalyticsassociation.org/codeofethics
Duhigg, C. (, February ). How companies learn your secrets. New York Times. Retrieved from http://www.nytimes.com////magazine/shopping-habits.html
Ess, C. (). Ethical decision-making and Internet research: Recommendations from the AoIR ethics working committee. Association of Internet Researchers. Retrieved from www.aoir.org/reports/ethics.pdf
Ford, P. (, November ). The hidden technology that makes Twitter huge. Retrieved from http://www.bloomberg.com/bw/articles/--/the-hidden-technology-that-makes-twitter-huge
Gardiner, E., & Musto, R. G. (). The digital humanities: A primer for students and scholars. New York: Cambridge University Press.
Goertz, G., & Mahoney, J. (). A tale of two cultures?: Qualitative and quantitative research in the social sciences. Princeton, NJ: Princeton University Press. Retrieved from http://proxyau.wrlc.org/login?url=http://search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=&site=ehost-live&scope=site
Goffman, A. (). On the run: Fugitive life in an American city. Chicago and London: University of Chicago Press.
González-Bailón, S., Wang, N., Rivero, A., Borge-Holthoefer, J., & Moreno, Y. (). Assessing the bias in samples of large online networks. Social Networks, , –. http://doi.org/./j.socnet...
Gowans, J. (, November ). DRAFT: Code of ethics & standards for social data. Retrieved from http://blog.bbi.org////draft-code-of-ethics-standards-for-social-data/
Hosenball, M. (, November ). NSA chief says Snowden leaked up to , secret documents. Reuters. Retrieved from http://www.reuters.com/article////us-usasecurity-nsa-idUSBREADB
Irani, L. C., & Silberman, M. S. (). Turkopticon: Interrupting worker invisibility in Amazon Mechanical Turk. In CHI ’: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. –). New York: ACM. http://doi.org/./.

     



Jick, T. D. (). Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly, (), –. http://doi.org/./ Kerr, D. (, July ). Senator asks FTC to investigate Facebook’s mood study. Retrieved from http://www.cnet.com/news/senator-asks-ftc-to-investigate-facebooks-mood-study/ Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (). Experimental evidence of massivescale emotional contagion through social networks. PNAS, (), –. http://doi. org/www.pnas.org/cgi/doi/./pnas. Lazarsfeld, P. F. (). Some notes on the relationships between radio and the press. Journalism Quarterly, , –. Markham, A. N., & Buchanan, E. A. (). Ethical decision-making and Internet research: Recommendations from the AoIR Ethics Working Committee (version .). Association of Internet Researchers. Retrieved from http://aoir.org/reports/ethics.pdf Mayer-Schönberger, V., & Cukier, K. (). Big data: A revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt. McDonald, A. M., & Cranor, L. F. (). The cost of reading privacy policies. ISJLP, , . McHugh, M. (, September ). The dangers of looking at Ashley Madison hack infographics. Retrieved from http://www.wired.com///dangers-looking-ashley-madison-hack-infographics/ Menchen-Trevino, E. (, December). Partisans and dropouts? News filtering in the contemporary media environment. Evanston, IL: Northwestern University Press. Menchen-Trevino, E. (). Collecting vertical trace data: Big possibilities and big challenges for multi-method research. Policy & Internet, (), –. http://doi.org/./.POI Menchen-Trevino, E. (, April ). Facebook privacy changes come with a price for research. Retrieved from http://www.ericka.cc/facebook-privacy-changes-come-with-aprice-for-research/ Menchen-Trevino, E. (). 
Web historian: Enabling multi-method and independent research with real-world web browsing history data. Paper presented at the iConference, Philadelphia, IDEALS. Retrieved from http://hdl.handle.net// Menchen-Trevino, E., & Karr, C. (). Researching real-world Web use with Roxy: Collecting observational Web data with informed consent. Journal of Information Technology & Politics, (), –. http://doi.org/./.. Moretti, F. (). Distant reading (st ed.). London?and New York: Verso. Morse, J. M. (). Approaches to qualitative-quantitative methodological triangulation. Nursing Research, (), –. National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (). The Belmont report: Ethical principles and guidelines for the protection of human subjects of research. Department of Health, Education, and Welfare. Retrieved from https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html Pew Research Center. (). Social networking fact sheet. Retrieved from http://www.pewinternet.org/fact-sheets/social-networking-fact-sheet/ Proudfoot, J. (). The future is in our hands: The role of mobile phones in the prevention and management of mental disorders. Australian & New Zealand Journal of Psychiatry, (), –. http://doi.org/./



 -

Richards, N. M., & King, J. H. (). Big data ethics (SSRN Scholarly Paper No. ID ). Rochester, NY: Social Science Research Network. Retrieved from http://papers.ssrn.com/ abstract= Ronson, J. (, February ). “Overnight, everything I loved was gone”: the internet shaming of Lindsey Stone. The Guardian. Retrieved from http://www.theguardian.com/technology/ /feb//internet-shaming-lindsey-stone-jon-ronson Rudder, C. (, July ). We experiment on human beings! [Web log post]. Retrieved from http://blog.okcupid.com/index.php/we-experiment-on-human-beings/ Sage Bionetworks. (). Participant centered consent toolkit. Retrieved from http://sagebase.org/platforms/governance/participant-centered-consent-toolkit/ Schelling, T. C. (). Dynamic models of segregation. The Journal of Mathematical Sociology, (), –. http://doi.org/./X.. Short, A. (, October ). Unethical uses for public Twitter data [Web log post]. Retrieved from https://adrianshort.org/unethical-twitter/ Singal, J. (, June ). The Internet accused Alice Goffman of faking details In her study of a black neighborhood: I went to Philadelphia to check. The Cut. Retrieved from http:// nymag.com/scienceofus///i-fact-checked-alice-goffman-with-her-subjects.html Solove, D. J. (). Introduction: Privacy self-management and the consent dilemma. Harvard Law Review, , . Thompson, D. (, October ). Google’s CEO: “The laws are written by lobbyists.” The Atlantic. Retrieved from http://www.theatlantic.com/technology/archive///googlesceo-the-laws-are-written-by-lobbyists//#video Trotter, J. (, June ). Twitter just killed Politwoops. Retrieved from http://tktk.gawker. com/twitter-just-killed-politwoops- Waller, V. (). Not just information: Who searches for what on the search engine Google? Journal of the American Society for Information Science and Technology, (), –. 
http://doi.org/./asi. Whittaker, R., Merry, S., Stasiak, K., McDowell, H., Doherty, I., Shepherd, M., . . . Rodgers, A. (). MEMO—a mobile phone depression prevention intervention for adolescents: Development process and postprogram findings on acceptability from a randomized controlled trial. Journal of Medical Internet Research, (), e. Zevenbergen, B. et al. (). Networked systems ethics. Retrieved from http://networkedsystemsethics.net

  ......................................................................................................................

 ’       ......................................................................................................................

    

T Web today offers a tremendous number of sites and services that can provide useful data for researchers. Websites ranging from online social networks (e.g., Facebook, Twitter), to online marketplaces (e.g., eBay, Craigslist), to recommendation services (e.g., Yelp, TripAdvisor) can be used to examine human behavior at scale. However, collecting data from these services brings up a host of technical, legal, and ethical issues for researchers, and many research communities are still grappling with the challenges that obtaining data from these services present. In this chapter we provide an overview of many such issues, aiming to provide practitioners with specific examples of how different services can be accessed and how different communities have dealt with the ethical and legal challenges that they present.

. T A  D C

We begin by examining the technical aspects and options associated with obtaining web-based data across a variety of websites and services. Collecting such data requires careful handling, so we start with an overview of the universal best practices for properly anonymizing and storing such data. We then examine various collection mechanisms, discussing the specific trade-offs of each in separate subsections.




1.1 Universal Best Practices

Regardless of the method by which data are obtained, the collected data should be handled with care. In this section we outline a few challenges that researchers will likely face with any data set.

1.1.1 Anonymizing Data

Collected data may contain private or personal information about human subjects. We discuss human subjects/institutional review board (IRB) issues separately, but in general, researchers should anonymize any data fields that are potential vectors for personal or private information not necessary to conduct the research itself. For example, if one is collecting data from Facebook, the exact names of the users and their associated Facebook user identifiers are often not strictly necessary to conduct the research (i.e., one is interested in how information is exchanged among these users, but their names are not germane). In general, proper anonymization is extremely difficult to accomplish, and even well-reasoned approaches to anonymization have been shown to be reversible in certain circumstances (Hansell; Narayanan & Shmatikov). Often these attacks rely on using external data to re-identify a few "unique" users, then iteratively re-identify more users. As a cautionary example, in 2006 Netflix announced the Netflix Prize for better movie-rating-prediction algorithms. As part of the prize, the company released a data set of movie ratings by its users, in which the users' names were replaced with anonymized identifiers. Researchers were nevertheless able to re-identify users by correlating the anonymized data with data from the Internet Movie Database (IMDb), looking for users who rated (combinations of) rare movies. Despite the challenge of achieving strong anonymization, there are a few best practices. First, data should be anonymized as they are collected, and the only data written to disk should already be anonymized (i.e., data should be anonymized in memory, as they are received).
Second, hash functions1 (e.g., SHA) serve as useful tools for anonymizing arbitrary data like usernames, but researchers should ensure that they salt the hash (e.g., append a single long, random string to all data to be hashed); salting prevents attackers from simply running the hash function over candidate values to test for the presence of data (an attack often implemented with rainbow tables). For example, when the New York City Taxi Commission released taxi data, it hashed but did not salt the identifiers; researchers were almost immediately able to de-anonymize the resulting data and identify individual taxis (Pandurangan).
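To make the salting advice concrete, the following is a minimal sketch in Python using only the standard library; the `pseudonymize` helper and the choice of SHA-256 are illustrative assumptions, not a prescribed scheme:

```python
import hashlib
import secrets

# One random salt per data set, generated once and stored separately
# from (and never released with) the anonymized data.
SALT = secrets.token_hex(32)

def pseudonymize(identifier, salt=SALT):
    """Replace an identifier (e.g., a username) with a salted SHA-256 digest.

    The same input always maps to the same pseudonym, so relationships
    between records are preserved; without the salt, an attacker cannot
    simply hash candidate usernames to test for their presence.
    """
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

print(pseudonymize("alice") == pseudonymize("alice"))  # True: stable mapping
print(pseudonymize("alice") == pseudonymize("bob"))    # False: distinct pseudonyms
```

Because the mapping is deterministic within a data set, network ties among pseudonymized users remain analyzable even though the underlying identities do not.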

1.1.2 Secure Storage

Some sensitive data must be collected and not anonymized, as they are central to the research. In this case, researchers should endeavor to secure such data to the greatest extent possible. There are a few different approaches to securing data, and we provide a brief overview here.

’      



First, a few universities are beginning to offer secure, centralized storage for sensitive research data. In this case, the servers storing the data are professionally managed, and access control is enforced by information technology (IT) professionals. If possible, researchers should take advantage of such services. Second, if these services are not available, researchers can take steps to secure data on their own machines. Sensitive data should always be encrypted, and there are a number of mechanisms for accomplishing this. For machines with only a single user, most modern operating systems offer full-disk encryption (e.g., BitLocker on Windows, FileVault on Mac OS X); if this is enabled and the machine or hard drive is stolen, the thief will not be able to access any data on the disk. For shared machines, many operating systems allow users to create encrypted disk images for storing data. Even if users who should not have access to the data manage to copy these images off the server, they will need the disk image's password in order to decrypt them. Third, a number of cloud providers (e.g., Amazon Web Services) offer cloud-based storage services. Similar to the university-provided services, these are professionally administered, but they require that users exercise proper controls to ensure that access keys and tokens do not fall into the hands of others who should not have access. Researchers should weigh the options available to them, and the associated costs, to determine the best choice for their particular data set.

1.1.3 Backups

Researchers should take steps to ensure that collected data are not lost in the event of a hardware failure, software fault, malware attack, or other external event. The easiest way to do this is to maintain an off-site backup that is updated regularly. A number of professional services offer such backups (e.g., Amazon Glacier), and many universities provide backup storage services to researchers. Failing to properly back up data can be catastrophic, as much web-based information is difficult or impossible to collect a second time.

1.2 Obtaining Data Directly from Companies

When obtaining web-based data, typically the best mechanism is getting it directly from the company or organization that controls it. For example, researchers have previously been able to obtain access to mobile phone traces (Eagle et al.) and social networking data sets (Jiang et al.; Yang et al.) through company contacts. Obtaining data in this way has a number of advantages, such as completeness and access to information that may not be available publicly (e.g., in the social networking context, companies have logs of who views content, which are often not visible on the site itself). However, obtaining access to data directly from a company is often difficult; many companies are hesitant to share proprietary data with researchers. Moreover, even if




one does obtain data, companies often require researchers to sign nondisclosure agreements (NDAs) that may limit their ability to publish results (e.g., if the company does not like the research results, it may be able to prevent publication). In cases where a company is reluctant to give data directly to researchers, there are two potential solutions that we have found effective. The first is for the company to give researchers remote access to company-owned machines where the data are stored; all data then continue to reside at the company and are never copied to researchers' machines. The second is to send a research staff member (typically a student) to the company on an internship. Both arrangements can come with restrictions on dissemination and publication as well, so researchers would be wise to negotiate the terms of such agreements in detail up front.

1.3 Using Application Programming Interfaces

Many companies make their services accessible to programmers via application programming interfaces (APIs). Essentially, these are sets of functions that enable a programmer to download data from or upload data to the service, or to change the service's settings for a given user. Researchers have successfully used APIs as a vehicle for data collection in the past (Liu et al.); we outline some of the trade-offs of APIs in the following sections.

1.3.1 Access Model

A service may offer one or more APIs that have different functionality and cater to different use cases. For example, Twitter and Facebook both offer APIs for embedding content into other websites, as well as separate APIs tailor-made for data collection by third parties. Each of these APIs has a distinct interface and access rules that govern how it may be used. Thus, researchers should carefully read API documentation to determine which methods are appropriate for their tasks. Today, most APIs are authenticated using the OAuth security protocol. Essentially, OAuth allows a user to delegate access to an account to a third party while reserving the right to later revoke that access. Thus, even if a researcher is collecting data that a service makes publicly available (i.e., users do not need to opt in to data collection), the researcher will often have to use OAuth with his or her own account to gain access. In addition, APIs define the data formats that they support and the set of method options that they will recognize. For example, most APIs today use the JavaScript Object Notation (JSON) data representation, which is a concise but human-readable way of conveying data.
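For instance, a JSON response body decodes directly into native data structures. The response below is a fabricated example loosely modeled on a social media API, not actual output from any service:

```python
import json

# A fabricated API response body; real field names vary by service.
response_body = '''
{
  "user": {"id": 12345, "screen_name": "example_user"},
  "statuses": [
    {"id": 1, "text": "first post"},
    {"id": 2, "text": "second post"}
  ]
}
'''

data = json.loads(response_body)                      # decode JSON into dicts and lists
texts = [status["text"] for status in data["statuses"]]
print(texts)  # ['first post', 'second post']
```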

1.3.2 Rate Limits

When using APIs, service providers commonly place rate limits on their use. These can come in a variety of different forms, but in general, the limits are based on the number

’      



of calls that any particular client or application can make to a given API method. For example, Twitter limits the rate at which an application can call the statuses/user_timeline API method (the one that returns the tweets of a given user) to 180 calls per fifteen-minute period, and each call returns up to two hundred tweets. Thus, applications are limited by Twitter to obtaining no more than thirty-six thousand tweets per fifteen-minute period via this API method. Services are typically very explicit about their rate limits and post them prominently in the API documentation. Many services also offer dedicated API calls that tell programmers the current rate limit and the usage remaining in the current period. It is often good practice to write data collection applications that query for the current rate limit and adjust their speed accordingly; this relieves the programmer of having to track usage manually.
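One simple way to follow this advice is to pace requests using the limit information the service reports. The helper below is a generic sketch; the parameter names `remaining` and `reset_epoch` are assumptions for illustration, not any particular API's schema:

```python
import time

def pace_requests(remaining, reset_epoch, now=None):
    """Return the number of seconds to sleep before the next API call.

    Spreads the remaining allowance evenly over the time left in the
    current rate-limit window, so a collector slows down as its quota
    runs low instead of exhausting it and being locked out.
    """
    now = time.time() if now is None else now
    window_left = max(reset_epoch - now, 0.0)
    if remaining <= 0:
        return window_left          # quota exhausted: wait for the reset
    return window_left / remaining  # spread the remaining calls evenly

# With 180 calls left and 900 seconds until the window resets,
# wait about 5 seconds between calls.
print(pace_requests(remaining=180, reset_epoch=900.0, now=0.0))  # 5.0
```

A collector that sleeps for the returned interval after each call will, by construction, never exceed the advertised limit within the window.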

1.3.3 Terms of Service

When using APIs, researchers typically have to agree to the service provider's terms of service (ToS) document, which outlines the rules for accessing the API and what the researcher can and cannot do with the obtained data. Researchers should read such documents carefully, as they often prohibit resharing of the data (even for scientific purposes). We discuss how different scientific communities treat violations of ToS later in this chapter.

1.3.4 Implementations

For most popular services (e.g., Twitter, Facebook), there are existing libraries for accessing the APIs that programmers can use. These libraries make it significantly easier for researchers to quickly access data, removing the burden of implementing an API client from scratch. For example, for Twitter there are libraries2 for all common languages, including Perl, Python, and C++. Similarly, Facebook offers a variety of application development kits3 that allow developers to easily interact with the Facebook API in a variety of languages. Researchers should not reinvent the wheel by implementing their own JSON parser, OAuth client, and so forth.

1.4 Scraping Web-Based Data

There are occasionally times when the service that the researcher wishes to study does not have an API, or the API does not allow access to the data needed to conduct the research. For example, the Pinterest image-sharing service did not have an API for many years, making obtaining data quite difficult (Ottoni et al.). If the service is web-facing, researchers can often use web scraping, programmatically downloading a large number of web pages from the website itself in order to collect research data.




1.4.1 Implementation

A number of tools exist for programmatically accessing websites and recording the results. The simplest tools are programs that make web requests without rendering the content or running any JavaScript; popular examples include the wget and curl programs, as well as the requests library in Python. These are all very lightweight, are easily scriptable, and do not download images and other items that are often not needed by the researcher. However, as websites become more complex, these simple programs are sometimes insufficient to download the content of interest. For example, many websites use JavaScript to dynamically load content in the user's browser; since wget and curl cannot execute JavaScript, they are incapable of capturing this dynamic content. As an alternative, a number of tools perform all of the actions of a real web browser, allowing even complex sites to be crawled. In general, these tools can be divided into two classes. First, there are headless tools that implement full browser functionality but do not use a graphical user interface (GUI); they operate entirely at the command line. They do download all images (if desired) and execute JavaScript on the pages. Popular examples include phantomjs and casperjs. These tools are more lightweight than running a full browser (e.g., they typically use less memory and CPU), but they can be difficult to script and debug. Second, there are tools that allow an actual web browser to be controlled programmatically (e.g., when collecting data, the tools run a normal instance of Google Chrome). The most popular example of such a tool is Selenium. Compared to headless tools, this approach is much more heavyweight, but it is often easier to control and debug (e.g., a researcher can often tell the program what to do by clicking on the appropriate location on the screen).

1.4.2 Accounts

Many popular services today (e.g., Facebook) require that users have an account in order to log in and view content. Researchers face the same restriction; each scraper they run must have a valid account to be able to log in. Due to service abuse by malicious parties, it is often a violation of a service's ToS to create multiple accounts, and services often actively search for users who are creating too many accounts. Researchers should be aware of these limitations when implementing crawlers. There are also websites that sell user accounts in bulk for popular services. However, these marketplaces operate in a legal grey area, and the accounts they provide are often of dubious quality (e.g., they may quickly be banned by the website). Although there are examples of researchers successfully buying accounts, those studies are typically focused on understanding some aspect of the online black market itself (Thomas et al.). As such, we do not recommend that most researchers buy user accounts.

’      



1.4.3 Rate Limits

Web-based services are often resistant to having data collected by scraping and implement techniques to ensure that scrapers cannot collect data at too fast a rate. Typically, these rate limits are implemented at the IP address level, meaning the service monitors the number of requests that come from different IP addresses and blocks addresses that send too many requests. Sometimes these IP bans can be permanent. Thus, researchers should be careful when implementing web-scraping tools and should be cognizant of the load that they are placing on the web service.
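A common defensive pattern is to slow down whenever the service signals overload (for example, with an HTTP 429 "too many requests" status). The sketch below is generic; `fetch` stands in for whatever download function a crawler uses and is an assumption of this example:

```python
import time

def fetch_with_backoff(fetch, url, max_tries=5, base_delay=1.0):
    """Call fetch(url), doubling the wait after each rate-limited response.

    `fetch` is assumed to return a (status_code, body) pair, where an
    HTTP 429 status means "too many requests." Raises RuntimeError if
    the rate limit persists across max_tries attempts.
    """
    delay = base_delay
    for _ in range(max_tries):
        status, body = fetch(url)
        if status != 429:
            return body
        time.sleep(delay)  # back off before retrying
        delay *= 2         # exponential growth: 1s, 2s, 4s, ...
    raise RuntimeError("still rate-limited after %d tries: %s" % (max_tries, url))

# Simulated service that rejects the first two requests, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
body = fetch_with_backoff(lambda url: next(responses), "https://example.com/page",
                          base_delay=0.01)
print(body)  # <html>ok</html>
```

Backing off exponentially both keeps the crawler under the service's radar and reduces the load placed on the service while it is struggling.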

1.4.4 Parsing

Finally, when using web scraping, the results that a researcher obtains are usually encoded in web-native HTML format. Parsing HTML and extracting data from it can be cumbersome, though tools exist to simplify the process. For example, the jQuery JavaScript library offers methods for quickly loading an HTML page and accessing parts of the page at particular locations. The lxml and beautifulsoup Python modules offer similar functionality. Researchers are encouraged to use these tools when possible. Researchers should also be aware that many sites vary their HTML structure over time (and sometimes use multiple layouts at once when conducting A/B tests4). For example, a site like Facebook might try out multiple potential page layouts, and a web scraper would naturally encounter HTML pages that have different structures. As a result, researchers should verify that their HTML parsers produce correct results on a diverse sample of pages.
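As a minimal illustration of the task these libraries handle, even Python's standard library can extract simple structure from a page; real scrapers will generally prefer lxml or BeautifulSoup for anything nontrivial. The page below is a toy example:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A toy page; a real crawler would feed in downloaded HTML instead.
page = '<html><body><a href="/profile/1">Alice</a> <a href="/profile/2">Bob</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/profile/1', '/profile/2']
```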

1.5 Crowdsourcing Data Collection and Analysis

Often researchers find that the data they are trying to access are not provided by a single service, or are not accessible in a single location. Instead, researchers often have to recruit individual users to help them locate the content of interest. For example, if researchers are looking for web pages that discuss a particular topic, or images of a particular location, existing tools such as Google Search often require that a human examine and filter the results. Recruiting humans for these tasks can now be automated, to a large extent, through the use of crowdsourcing websites (e.g., Amazon Mechanical Turk, CrowdFlower). Essentially, these sites allow researchers to pay users to perform small tasks that are difficult for a computer to do. One common crowdsourcing task is image labeling, in which a researcher wishes to determine the contents of a large number of images. While crowdsourcing provides many potential benefits to researchers, it does have a number of limitations. First, crowdsourcing can be somewhat expensive: Studies have shown that most crowdsourcing tasks end up paying users only a few dollars per hour (Paolacci et al.), and sites like Amazon Mechanical Turk (AMT) charge a substantial commission




on top of this rate. Second, crowdsourced workers are of varying skill levels, and some workers will not complete the task but still request payment. Third, there are a number of best practices that researchers should employ when interpreting crowdsourced data; we refer the reader to Wang et al. for a much more detailed discussion of the issues that affect various sites and services. Finally, there are a number of limits on what crowdsourcing services allow requestors to ask users to do; for example, some sites prohibit requestors from asking users to install any software. Thus, researchers should verify that their crowdsourced task is compatible with the site's rules.
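One widely used safeguard against uneven worker quality is redundancy: assign each item to several workers and keep the majority answer. The sketch below is illustrative; the label names and the three-worker redundancy are assumptions of the example, not a standard:

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label among redundant judgments.

    Counter breaks ties by insertion order; in practice researchers
    often request an extra judgment to resolve ties explicitly.
    """
    return Counter(labels).most_common(1)[0][0]

# Three workers labeled each image; one disagreed on the first.
judgments = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
consensus = {item: majority_label(labels) for item, labels in judgments.items()}
print(consensus)  # {'img_001': 'cat', 'img_002': 'dog'}
```

Redundant labeling also makes it easier to spot workers who systematically disagree with the consensus, which helps identify those requesting payment without doing the task.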

1.6 Data Donations from Users

One final approach to data collection is to ask for data donations from the end users themselves. In other words, the researchers ask users to provide their own data for scientific study. When using this approach, researchers need to recruit users in a similar manner to other research studies and implement software that collects the data of users who agree to donate them. There are a number of different approaches to collecting donated data, and the most appropriate choice depends on the particular service being studied. The most common mechanism is to implement a third-party client for a service (e.g., Twitter) that users can install. Once the user approves the application, it has access to the user's data and can transmit the data to the researcher; such an application would typically use the service's API. Another option for implementing a data donation system is to use browser extensions or plugins. These pieces of software are installed in the user's browser and have privileged access to monitor the user's web browsing as the user interacts normally with sites. Researchers should be careful to collect only data relevant to the research study, as a browser plugin can access passwords and other sensitive information that users enter into their browsers.
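As a sketch of the browser-extension route, a minimal hypothetical Chrome extension manifest (Manifest V3) might look like the following; the site pattern and script name are placeholders. Restricting `matches` to the studied site is itself a privacy safeguard, since the browser then never runs the collection script anywhere else:

```json
{
  "manifest_version": 3,
  "name": "Study Data Donation (participant consented)",
  "version": "1.0",
  "content_scripts": [
    {
      "matches": ["https://www.example-service.com/*"],
      "js": ["collect.js"]
    }
  ]
}
```

The `collect.js` content script would then forward only the study-relevant page data to the research server, under the terms the participant consented to.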

. L I

Collecting data from the Web inevitably means interfacing with systems that are owned or provided by third parties. The use of these systems, and of the data they contain, is governed by various agreements, laws, and voluntary standards, and all researchers should be aware of legal encumbrances on the collection and use of web data. In this section we discuss some of the major legal issues that affect the collection and use of web data. The discussion focuses mostly on US law, although we highlight relevant European statutes as well.

’      



2.1 Terms of Service

Perhaps the most well-known form of legal restriction on the use of online services and data is the ToS agreement. Some software and websites force users to read these agreements and click a button signifying that they will abide by the stipulations (i.e., click-through or click-wrap agreements), while many organizations simply post a ToS on their website and expect all users to abide by the rules (i.e., browse-wrap agreements). ToS are often quite verbose, covering issues such as liability, privacy policies (e.g., how user data are collected and shared), and acceptable use policies. Acceptable use policies are typically the portion of the ToS most relevant to researchers, since they enumerate restrictions on how the site may be accessed (e.g., crawlers and scrapers may be forbidden); whether data can be collected, retained, or shared; and restrictions on the use and creation of user accounts. Some ToS do not allow users to grant third parties access to their accounts; this means that study participants may risk repercussions if they allow researchers to access their accounts for the purposes of data collection.5 Violating ToS may open up researchers to legal liability. However, the legal precedents surrounding ToS are complicated, since the enforceability of ToS may depend on how they are presented to users. Courts have been more willing to enforce click-through agreements because users are (1) prominently notified of the agreements' existence and (2) required to give affirmative consent by clicking a button or checking a box (Feldman v. Google). In contrast, courts have been more reluctant to enforce browse-wrap agreements because there is no affirmative consent, and the only notice is often a tiny link at the bottom of the website (Specht v. Netscape; Bailey). Additionally, a number of researchers (including the authors of this chapter) are actively involved in litigation around ToS violations and research (Sandvig v.
Sessions, ); as this litigation is ongoing, the legal treatment of ToS violations in the U.S. is still not entirely clear. Regardless of the legal standing of ToS, researchers would be remiss to ignore them. Even if a site does not have a click-through agreement, a judge may reasonably expect researchers to conduct due diligence before crawling a website, thus opening the door for researchers to be held liable for ToS violations. Furthermore, universities and companies typically have policies that compel researchers to obey relevant laws, which can be construed as forcing researchers to abide by ToS. Finally, some academic communities enforce strict norms that prevent the publication of research that breaches ToS.

APIs versus Websites

Organizations may have different ToS for their websites and their APIs. Researchers should not assume that just because data were obtained via an API, they are unencumbered by restrictions. For example, the ToS for Twitter’s API allow users to store and analyze information downloaded from Twitter, but they may share only tweet IDs and user IDs with third parties, not the content or metadata of those tweets.
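As a concrete illustration, a collected tweet dataset can be “dehydrated” down to bare identifiers before sharing. This is a minimal sketch; the field names (`id`, `user_id`, `text`) are assumptions about how a local dataset might be structured, not Twitter’s actual API schema.

```python
# Sketch: "dehydrating" a collected tweet dataset down to the identifiers
# that can be shared with third parties. The field names ("id", "user_id",
# "text") are illustrative assumptions, not Twitter's API schema.
import json

def dehydrate(tweets):
    """Keep only tweet IDs and user IDs; drop all content and metadata."""
    return [{"tweet_id": t["id"], "user_id": t["user_id"]} for t in tweets]

collected = [
    {"id": 101, "user_id": 7, "text": "hello world", "lang": "en"},
    {"id": 102, "user_id": 9, "text": "second tweet", "lang": "en"},
]
shareable = dehydrate(collected)
print(json.dumps(shareable))  # IDs only; no text or metadata survives
```

A recipient of the dehydrated file would then re-download (“rehydrate”) the tweets from Twitter directly, so that content is always obtained under the platform’s own terms.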




Computer Fraud and Abuse Act

Of particular concern when discussing ToS is the Computer Fraud and Abuse Act of 1986 (CFAA). The CFAA was originally designed to criminalize the hacking of financial and telecommunications companies, but its language is so vague that it has been used to prosecute violations of acceptable use policies (Sandvig et al.). Perhaps the most (in)famous use of the CFAA was the set of federal charges brought against Internet activist Aaron Swartz for crawling the JSTOR academic paper archive in violation of JSTOR’s ToS (USA v. Swartz). In this case, Swartz was charged with CFAA violations for automatically downloading large numbers of scientific articles from JSTOR’s site. While JSTOR and MIT declined to pursue civil litigation under the CFAA, the US attorney for Massachusetts did pursue criminal charges. Although the legal precedents related to the use of the CFAA to criminalize ToS violations are unclear (Lee v. PMSI; USA v. Nosal), researchers should be aware that ToS violations could potentially be construed as federal crimes. As the Swartz case demonstrates, prosecutors have shown a willingness to pursue CFAA claims against those who violate ToS, even if the aggrieved party (JSTOR in this case) does not wish to press civil charges (JSTOR).

Human Subjects and Institutional Review

Data gathered from the Web may be subject to laws that govern human subject experiments. In the United States, human subject experiments are regulated under the principles laid out in the US Department of Health and Human Services (HHS) Code of Federal Regulations, title 45, part 46. These regulations are colloquially referred to as “The Common Rule,” and they are applicable whenever a researcher gathers identifiable private information about an individual or data through intervention/interaction with an individual (US DHHS). The most critical directives of The Common Rule are that (1) federally funded institutions must have IRBs that review and approve research designs, and (2) researchers must obtain informed consent from study participants. In the context of online data collection, this means that any experiments that will gather personally identifiable information or interact with users must be approved by an IRB before data are collected. Researchers who violate IRB rules risk disciplinary action from their institutions, and all data collected without approval must be erased.

Note that some research designs may be exempt from IRB regulations. Specifically, “research involving the collection or study of existing data . . . if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects” (US DHHS) is exempt. However, even if this caveat applies to a given experiment, the researcher must still seek out a formal exemption from the IRB; researchers may not exempt their own experimental designs.

’      



In general, researchers who are collecting data from social media, online forums, and other user-generated content sites should obtain IRB approval before collecting data. In some cases these data gathering tasks will be exempt because the data are public, and IRB approval should be prompt. Experiments that involve crowdsourced workers should also be approved by an IRB, although some academic communities are more lax than others about enforcing this norm (e.g., the CHI and WWW communities often accept studies that leverage crowdsourcing without IRB approval, so long as no personal information is collected from the workers). We recommend that researchers obtain IRB approval when any of their data concern human subjects or when their data are collected using crowdsourced workers. Finally, we recommend that researchers state in their IRB applications that they will follow the universal best practices outlined previously (e.g., data anonymization and encryption), since these are issues that IRBs care deeply about.

Human Subject Rules Outside the United States

Many nations have laws governing the treatment of human subject experiments. In Canada, the relevant rules are specified in the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (Tri-Council Policy Statement). In Europe, each member state has local laws concerning human subject experiments, while EU-level rules are developed by the Secretariat of the European Group on Ethics in Science and New Technologies (EGE).6 Researchers in other countries should contact the office of research at their home institutions to determine what rules and regulations may be applicable.

Toward Uniform Ethical Rules

Some academic communities have grown concerned that federally funded researchers are placed at a disadvantage because of IRB rules. Specifically, researchers in corporate labs or outside the United States may be able to use data sets or conduct experiments that would be disallowed under IRB rules. To level the playing field, some academic communities are moving toward a model in which Common Rule–like requirements must be upheld by all paper writers, regardless of their funding status or country of origin (examples include the SIGCOMM and IMC computer networking conferences7).

Nondisclosure Agreements

In some cases, the simplest (or only) way to get data from an online service may be to request it directly. Some companies are receptive to the idea of working with researchers, but they may require that researchers sign an NDA before code or data are furnished. NDAs are legal contracts, and as such, researchers who violate the terms of an NDA risk being sued for breach of contract. At most institutions, researchers cannot sign NDAs themselves; instead, designated legal authorities of the institution must sign these contracts.




NDAs may contain a wide variety of stipulations, but in general they do not preclude researchers from publishing results gleaned from the covered data. In the past, we have been party to NDAs that prevented us from sharing the covered data with third parties or releasing the data publicly; however, the NDAs allowed us to publish aggregated results from the data, subject to approval by the counterparty. Data owners may be amenable to negotiating specific clauses in NDAs should researchers have specific needs, although owners are under no obligation to honor researchers’ requests.

Before signing an NDA, researchers should carefully consider the impact the NDA will have on their ability to publish and share data. At a minimum, most NDAs state that the covered data cannot be shared or publicly released. Some academic communities enforce norms that require data transparency in order to facilitate reproducibility (two prominent examples are the BMJ and PLOS medical journals [Loder]); it may be impossible to publish results based on NDA-restricted data at these venues.

The Robots Exclusion Standard

The Robots Exclusion Standard, commonly called robots.txt, is a voluntary standard that is designed to control the behavior of web crawlers. The standard’s name comes from its technical implementation: a webmaster may place a text file in the root directory of the website (e.g., https://facebook.com/robots.txt or https://google.com/robots.txt) that contains rules that crawlers are supposed to obey when they traverse the given site. These rules are quite simple: they allow the webmaster to specify the User-Agent of a particular crawler (prominent examples include Googlebot and Bingbot) as well as specific URLs that the given crawler is allowed and disallowed to visit. Administrators may also use the wildcard (User-Agent: *) to specify that the rules apply to all crawlers. robots.txt files are human readable; interested readers can easily find examples by manually typing in the appropriate URL for major websites.

Because the Robots Exclusion Standard is a voluntary protocol, it does not carry the force of law. However, websites sometimes refer to their robots.txt file in their ToS, within the context of acceptable uses of the site. Thus, crawlers that do not obey a website’s posted robots.txt file may be breaching the site’s ToS. Many robots.txt files contain helpful comments from webmasters that point interested parties to the website’s ToS and its policy on scraping. Before researchers begin scraping a website, they should conduct due diligence by reading the site’s ToS and checking for the existence of a robots.txt file.
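Checking a site’s rules before crawling can be automated. The sketch below uses Python’s standard-library robots.txt parser; the rules, user-agent string, and URLs are illustrative assumptions.

```python
# Sketch: checking robots.txt rules before crawling, using Python's
# standard library. The example rules, user-agent name, and URLs are
# illustrative assumptions.
from urllib.robotparser import RobotFileParser

def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example rules: all crawlers are barred from /private/, allowed elsewhere.
rules = "User-Agent: *\nDisallow: /private/\n"

print(can_fetch(rules, "ResearchBot", "https://example.com/index.html"))  # True
print(can_fetch(rules, "ResearchBot", "https://example.com/private/x"))   # False
```

In a real crawler, the rules would be fetched from the site’s /robots.txt URL (e.g., with `RobotFileParser.set_url` and `read`) and consulted before every request.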

Click and Impression Fraud

Many websites are entirely reliant on advertising revenue to support their existence. These websites are known as publishers in the advertising ecosystem, since they publish content like news articles. Publishers make money when users visit their site and see an ad (known as an impression) or click on an ad. Advertisements are typically served by ad networks, like Google AdSense; these networks take care of tracking each ad’s engagement (impressions and clicks), as well as facilitating payments from advertisers to publishers.

The ability of publishers to make money by displaying advertisements has given rise to two types of fraud, known as impression fraud and click fraud. In the former case, a fraudulent publisher makes money by having crawlers or crowd workers repeatedly view their site, thus generating many phony impressions. In the latter case, the crawlers/crowd workers also click on the ads. Although there is no law that criminalizes these forms of fraud, ad networks like Google have successfully sued individuals in civil court for perpetrating click and impression fraud (Google Inc. v. Auction Expert International L.L.C., et al.).

Researchers who use scrapers to collect data should be aware of the potential for their software to cause click and impression fraud. Each time a scraper visits a site with embedded advertisements, it generates an impression, which may cost an advertiser money. If a scraper is very aggressive (i.e., it scrapes a large number of pages quickly), it can potentially exhaust an advertiser’s budget. This is especially true if the researcher’s program clicks on advertisements, since cost-per-click (CPC) advertisements are more expensive than cost-per-impression (CPM) advertisements. These false impressions and clicks could result in liability for researchers, should the ad network choose to bring lawsuits against them.

There are steps that researchers can take to minimize their impact on advertisers. First, scrapers should never click on, or otherwise follow, the URLs present in advertisements. We are not aware of any academic community (even very liberal ones) that believes it is acceptable for researchers to generate clicks on ads.
Second, researchers can minimize impression fraud by building filters into their scrapers that block requests to known advertisers. This is essentially the same technique used by browser extensions like Adblock Plus to filter out advertisements. Researchers are rarely interested in the advertisements embedded in web pages (the primary exception being researchers who study the advertisements themselves), so filtering out requests to ad networks has no impact on data collection.
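Such a filter can be very simple. The sketch below checks each outgoing request against a domain blocklist; the three domains listed are illustrative assumptions, and a real scraper would typically reuse a curated blocklist like those behind browser ad-blocking extensions.

```python
# Sketch of an ad-request filter for a research scraper. The blocklist is a
# small illustrative assumption; real deployments would reuse a curated
# list such as those maintained for browser ad-blocking extensions.
from urllib.parse import urlparse

AD_DOMAINS = {"doubleclick.net", "googlesyndication.com", "adnxs.com"}

def is_ad_request(url: str) -> bool:
    """Return True if the URL's host is (a subdomain of) a known ad domain."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in AD_DOMAINS)

def fetch_filtered(urls):
    """Yield only the URLs the scraper should actually request."""
    for url in urls:
        if not is_ad_request(url):
            yield url

pages = ["https://example.com/article",
         "https://ad.doubleclick.net/ddm/adj/123"]
print(list(fetch_filtered(pages)))  # only the example.com URL survives
```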

. E A

In addition to the legal challenges faced by researchers using web data, there are also ethical considerations that researchers must take into account. Following the rules set by IRBs is an important first step toward conducting ethical research, but it is not a complete solution. Even in cases where research is IRB exempt, or at institutions that do not have IRBs, there are still ethical issues that researchers must address. In this section we discuss some of the most pressing ethical issues that arise when dealing with websites and web-based data, and we present concrete ideas for how to navigate them.




Harm to Users and Consent

The Common Rule is designed to address the dual issues of harm and consent. Historically, this has referred to study participants: in what ways can the study protocol harm people, does the scientific benefit of the research outweigh the harm, and how will participants be informed about the potential harms? However, in the context of research on the Web, harm can also be done to the service itself (e.g., by creating fake accounts to scrape data). We discuss issues related to users and services separately.

Harm to Users

Broadly construed, participants may be harmed by any aspect of a study that directly impacts their environment or risks disclosing personal information. A classic example of harm is deception, for example, tricking a participant into clicking a malicious link to observe whether that person is susceptible to phishing. However, harm can also be much less overt; for example, a research protocol that involves posting content to social media may harm users by exposing them to uncomfortable content or simply by annoying them with unwanted messages. Furthermore, even observational studies may cause harm. For example, if a researcher scrapes data from social media without properly anonymizing them, and the data are later leaked, this could cause tangible harm to the unknowing study participants.

The job of IRBs is to ensure that researchers carefully enumerate all the possible harms of their protocol, mitigate those harms whenever possible, and inform participants of any remaining harms. However, regardless of whether institutional or community standards require studies to be IRB approved, all researchers should strive to obey the ethical standards set out by The Common Rule.
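One concrete mitigation is to pseudonymize identifiers before data are stored, so that a leaked dataset cannot be trivially tied back to individuals. The sketch below uses a keyed hash (HMAC), a variant of the hash functions described in the notes; the key and identifiers are illustrative assumptions, and the key must be stored separately from the dataset.

```python
# Sketch: pseudonymizing user identifiers with a keyed hash (HMAC) so that
# identifiers cannot be reversed, or recomputed by an attacker who lacks
# the secret key. The key and identifiers are illustrative assumptions.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-randomly-generated-key"  # store apart from the data

def pseudonymize(user_id: str) -> str:
    """Map a raw identifier to a stable, non-reversible pseudonym."""
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()

records = [{"user": "alice", "post": "example content"}]
safe = [{"user": pseudonymize(r["user"]), "post": r["post"]} for r in records]
print(safe[0]["user"])  # a 64-character hex pseudonym, not "alice"
```

Because the same input always maps to the same pseudonym, analyses that link records by user still work on the pseudonymized data.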

Consent on the Web

In traditional, in-person user studies, obtaining informed consent is straightforward. However, obtaining informed consent on the Web presents unique challenges. In some cases it may be impractical to obtain consent; for example, a large-scale study of public tweets might necessitate obtaining informed consent from tens or hundreds of thousands of Twitter users. In these cases, it is critical that collected data be anonymized and secured with encryption. However, even these technical countermeasures are not enough to assuage all ethical concerns.

For studies that require the collection of personal information or other nonpublic data about users (e.g., browsing history), informed consent must be obtained. The simplest and most direct approach is to build the consent mechanism into the data collection apparatus and ask users to consent up front. Unfortunately, this direct approach often has the effect of scaring off participants, since the first thing they see is a disclosure that primes them to think about harm and consequences. Fortunately, other models of obtaining online consent are possible. For example, the Volunteer Science platform uses a post hoc consent approach (Keegan et al.): users are free to engage with the research studies hosted on the platform, and all the collected data are kept private. At a later point in time, once users are comfortable with the platform, they are asked to give consent. If a user agrees, then the private data are released to the researchers.

“Public” Data and Context

Researchers often assume that because web-based data are publicly accessible, this obviates the need for obtaining informed consent or protecting user privacy. However, scholars are increasingly raising ethical concerns about these assumptions. danah boyd argues that there is a difference between being in public and being public, with the former implying nothing about a user’s willingness to consent to data collection (boyd & Crawford). Similarly, Helen Nissenbaum argues that contextual integrity is a key tenet of privacy, with the implication that collecting data for research (even if the data are public) strips them of context and therefore violates users’ expectations of privacy, unless users actively consent to the data collection (Nissenbaum).

Concerns about the collection and analysis of public data increase as the object of study becomes more specific. For example, a broad study that presents high-level, aggregate data from all tweets on Twitter is less likely to raise concern than a study that focuses on tweets from specific subpopulations, especially vulnerable ones like minorities or children. Researchers planning to utilize public data sets should carefully consider the ethics of their experimental methodology. Even if there are no legal restrictions that prevent resharing public data, releasing data that are stripped of context may still harm users (Zimmer). Furthermore, researchers should consider adopting novel approaches that attempt to balance the burden of collecting informed consent with the researcher’s ability to collect large data samples from online sources (Hutton & Henderson).

Harm to Services

Although The Common Rule only applies to study participants, researchers must also consider the harm their research may do to the websites they study. For example, scraping a website uses up computational and bandwidth resources, as well as generating false page impressions. When possible, researchers should avoid these ethical issues entirely by using sanctioned data collection mechanisms like official APIs. If this is not possible, ethical research designs should take measures to minimize adverse impacts on service providers, as well as demonstrate that the scientific benefits of the research outweigh the costs to the service provider.

Besides the issue of harm to services, researchers must also consider the ethics of violating ToS agreements. ToS violations are a contentious issue in the academic community, and as previously mentioned, some communities and institutions simply forbid researchers to violate ToS. However, we argue that unyielding adherence to ToS restrictions may prevent research that causes no practical harm but has the potential to bring significant scientific benefits. Furthermore, companies can potentially use ToS restrictions to chill legitimate research (Sandvig et al.), which is an ethically dubious practice in itself.

In summary, there are no easy answers to the ethical questions surrounding the interplay of researchers and websites. Researchers should carefully plan their approach ahead of time to minimize harm, and they should include statements in their papers that describe the ethics of their experimental methodologies and justify why potential benefits outweigh potential harms. Authors should be aware that many academic communities review papers on ethical as well as technical grounds, and thus failure to adhere to ethical best practices may result in paper rejection.

Responsible Disclosure

In many cases, researchers collecting data from a website are only interested in measuring and studying the behavior of humans. In these cases, the website is largely incidental; it is simply the platform on which the phenomenon of interest plays out. However, other studies may focus specifically on the characteristics of the website; in other words, the platform itself is the object of study. These types of studies raise an important ethical question: if researchers uncover evidence of harm or wrongdoing by a website, should they disclose this information, and if so, how? For example, studies have uncovered racial bias on Airbnb (Edelman & Luca) and in advertisements on Google Search (Sweeney), as well as price discrimination by e-commerce sites (Hannak et al.). In all of these cases, it is clearly a social good for the information to be publicly disclosed as soon as possible. However, we also argue that the websites have a right to know that they are being scrutinized (potentially post hoc), so that they can respond to the findings and (hopefully) correct the problems.

In our own work, we follow the practice of responsible disclosure, a protocol pioneered in the computer security community. After we study a website, we inform the site of our findings and give its owners a chance to respond before we publish any results or speak to the press. This gives the website a chance to correct any inaccuracies in our work, as well as to address issues on its end before they are made public. Although many websites simply ignore requests from researchers, our disclosure policy has also led to fruitful conversations with companies that were mutually beneficial.

Reproducibility

One final ethical consideration is reproducibility: Is it ethical to conduct research knowing that the code and/or data cannot be shared? This is a difficult question that defies clear answers. On the one hand, the benefits of transparent research are clear: the entire scientific enterprise is strengthened when studies can be verified and built upon. If at all possible, researchers should strive to be completely open and transparent in their work and thus avoid this ethical debate entirely.

’      



On the other hand, an increasing amount of basic research is conducted inside corporations or by researchers using data sets that are encumbered by ToS or NDA restrictions. Dogmatic adherence to rules that require data sharing effectively precludes publishing any findings based on these data sets. We argue that allowing work that leverages restricted data sets to be published, even if the data are nontransparent, is still a net benefit to the scientific community, since otherwise these measurements, findings, and ideas would not come to light at all. For example, the work published by Facebook’s Data Science team has opened a window into the world’s largest online social network that would otherwise remain closed (Bond et al.; Kramer et al.; Bakshy et al.).

. S D

As discussed in the previous section, scientific research is rooted in reproducibility. In the context of studies that rely on data collected from websites, reproducing the experiments conducted in a given study often requires sharing the analysis code and collected data. However, web-based data are sometimes encumbered by a unique set of challenges in this regard, briefly discussed here.

Community Norms and Sharing Infrastructure

The expectation that research data will be shared, and the infrastructure available to implement such sharing, varies widely across scientific communities. Certain communities require that sufficient data to reproduce results from published work be made available to other researchers (perhaps in anonymized form), while others encourage such sharing but will willingly accept papers whose authors admit they are unable to share the data. Often these expectations are stated in the conference’s call for papers or at the journal’s submission site.

Similarly, certain communities have made efforts to reduce the burden on authors attempting to share data by setting up sites and services that facilitate data sharing between researchers. Notable examples include ICWSM’s Data Sharing Service, which will host data sets from published papers and will both vet applications from requesting researchers and handle the actual distribution. Other examples are the CRAWDAD repository of wireless networking data, the Stanford Large Network Dataset Collection, and the Harvard Dataverse.

Limits to Sharing

Often the goals of data sharing are at odds with the ability of authors to share data from their studies. In particular, data obtained directly from companies are often under an NDA, commonly with provisions that allow the company to review articles before publication. Rarely are raw data obtained in this way made available to other researchers.

Even when data are collected directly by the researchers themselves, there are often limits to their ability to share them. For example, IRBs will often require that certain types of data not be shared, or that all shared data meet strict anonymity rules. We encourage researchers to discuss these issues with their IRBs during the original protocol review process.

Finally, data collected from websites and services may come with ToS restrictions that prohibit the collected data from being reshared. One prominent example is data from Twitter, whose ToS only allow user IDs and tweet IDs to be shared. Thus, if a researcher wanted to enable someone to reproduce a study that used only public tweets, the researcher would only be able to send the list of tweet IDs that were originally included. The recipient would then have to go to Twitter to redownload the tweets themselves; given the rate limits, this can often take a significant amount of time. Researchers should therefore familiarize themselves with the provisions of the ToS to ensure they do not inadvertently violate them.
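The time cost of rehydrating shared IDs is easy to underestimate. The back-of-the-envelope sketch below shows why; the batch size and rate-limit figures are illustrative assumptions, not current Twitter limits, so real values should be taken from the API documentation.

```python
# Back-of-the-envelope sketch of why rehydrating shared tweet IDs is slow.
# The batch size and rate-limit figures are illustrative assumptions, not
# actual Twitter limits; consult the API documentation for real values.
import math

def rehydration_hours(n_ids, batch_size=100, requests_per_window=300,
                      window_minutes=15):
    """Estimate hours needed to re-download n_ids tweets in batches."""
    requests = math.ceil(n_ids / batch_size)          # one batch per request
    windows = math.ceil(requests / requests_per_window)  # rate-limit windows
    return windows * window_minutes / 60

print(rehydration_hours(10_000_000))  # 83.5 hours under these assumed limits
```

Under these assumed limits, a ten-million-tweet corpus takes several days of continuous polling to reconstruct, which is why sharing only IDs is a real barrier to reproduction even when it is permitted.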

Summary

Researchers today have an unprecedented amount of data available via a variety of websites and services. However, collecting data from these services involves a host of technical, legal, and ethical challenges, and many academic disciplines are still developing best practices. In this chapter we have provided an overview of the various issues that researchers should be aware of, along with pointers to other documents, projects, and services that may be useful for researchers.

N . Hash functions are cryptographic primitives that convert arbitrarily large input (e.g., arbitrary text) into a fixed-size output (e.g.,  bytes of data). Hash functions are designed to be pre-image resistant, meaning it is easy to take input and calculate its hash value, but it is extremely hard to take an arbitrary hash value and find input text that will hash to it. . See https://dev.twitter.com/overview/api/twitter-libraries. . See https://developers.facebook.com/docs/apis-and-sdks. . A/B testing is a methodology for web services to test two (or more) variants of a particular service. For example, Facebook may test two different algorithms for ordering a user’s news feed, looking for which one results in the highest user click-through rate. . For example, see sections   of Facebook’s terms of service (https://www.facebook.com/ legal/terms). . See http://ec.europa.eu/epsc/ege_en.htm. . See the SIGCOMM  call for papers at http://conferences.sigcomm.org/sigcomm// cfp.php; and the IMC  call for papers at http://conferences.sigcomm.org/imc/ /cfp.html.

’      



R Bailey, Ed. The Clicks That Bind: Ways Users “Agree” to Online Terms of Service. The Electronic Frontier Foundation, . https://www.eff.org/wp/clicks-bind-ways-usersagree-online-terms-service Bakshy, Eytan, Solomon Messing, and Lada A. Adamic. Exposure to ideologically diverse news and opinion on Facebook. Science, :–, . Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle, and James H. Fowler. -million-person experiment in social influence and political mobilization. Nature, :–, . boyd, danah, and Kate Crawford. Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, ():–, . Christopher Specht, John Gibson, Michael Fagan, Sean Kelly, Mark Gruber, and Sherry Weindorf v. Netscape Communications Corporation and America Online, Inc. United States Court of Appeals for the Second Circuit, . Eagle, Nathan, Alex Pentland, and David Lazer. Inferring friendship structure using mobile phone data. Proceedings of the National Academy of Sciences of the United States, ():–, . Edelman, Benjamin G., and Michael Luca. Digital Discrimination: The Case of Airbnb.com. Social Science Research Network Working Paper Series, . Hannak, Aniko, Gary Soeller, David Lazer, Alan Mislove, and Christo Wilson. Measuring Price Discrimination and Steering on E-commerce Web Sites. In Proceedings of the ACM/ USENIX Internet Measurement Conference, Vancouver, Canada, . New York, NY: ACM. Hansell, Saul. AOL removes search data on vast group of web users. New York Times, August , . https://www.nytimes.com////business/media/aol.html Hutton, Luke, and Tristan Henderson. “I Didn’t Sign Up for This!” Informed Consent in Social Network Research. In Proceedings of the Ninth International AAAI Conference on Weblogs and Social Media (ICWSM), Oxford, UK,. 
Palo Alto, CA: AAAI. Jiang, Jing, Christo Wilson, Xiao Wang, Wenpeng Sha, Peng Huang, Yafei Dai, and Ben Y. Zhao. Understanding latent interactions in online social networks. ACM Transactions on the Web, ()::–:, . JSTOR Statement: Misuse Incident and Criminal Case. JSTOR, . http://docs.jstor.org/ jstor-statement-misuse-incident-and-criminal-case.html Keegan, Brian, Katherine Ognyanova, Brooke Foucault Welles, Christoph Riedl, Ceyhun Karbeyaz, Waleed Meleis, David Lazer, Jason Radford, and Jefferson Hoye. Conducting Massively Open Online Social Experiments with Volunteer Science. In Proceedings of the Ninth International AAAI Conference on Weblogs and Social Media (ICWSM), Oxford, UK, . Palo Alto, CA: AAAI. Kramer, Adam D. I., Jamie E. Guillory, and Jeffrey T. Hancock. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States, ():–, . Lawrence Feldman v. Google, Inc. United States District Court for the Eastern District of Pennsylvania, . Liu, Yabing, Chloe Kliman-Silver, and Alan Mislove. The Tweets They Are a-Changin’: Evolution of Twitter Users and Behavior. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM), Ann Arbor, MI, . Palo Alto, CA: AAAI.




Loder, Elizabeth. Why medical journals must make researchers share data from clinical trials. The Conversation. http://theconversation.com/why-medical-journals-must-make-researchers-share-data-from-clinical-trials-
Narayanan, Arvind, and Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets. In Proceedings of the IEEE Security and Privacy Conference, Oakland, CA. New York, NY: IEEE.
Nissenbaum, Helen. Privacy as contextual integrity. Washington Law Review.
Ottoni, Raphael, Diego de Las Casas, João Paulo Pesce, Wagner Meira Jr., Christo Wilson, Alan Mislove, and Virgílio Augusto Fernandes de Almeida. Of Pins and Tweets: Investigating How Users Behave across Image- and Text-Based Social Networks. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM), Ann Arbor, MI. Palo Alto, CA: AAAI.
Pandurangan, Vijay. On Taxis and Rainbows: Lessons from NYC’s improperly anonymized taxi logs. Medium. https://tech.vijayp.ca/of-taxis-and-rainbows-fbca
Paolacci, Gabriele, Jesse Chandler, and Panagiotis G. Ipeirotis. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making.
Sandvig, Christian, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort. Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms. In Proceedings of “Data and Discrimination: Converting Critical Concerns into Productive Inquiry,” a Preconference of the Annual Meeting of the International Communication Association, Seattle, WA. Washington, DC: ICA.
Sweeney, Latanya. Discrimination in Online Ad Delivery. Social Science Research Network Working Paper Series.
Thomas, Kurt, Damon McCoy, Chris Grier, Alek Kolcz, and Vern Paxson. Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse. In Proceedings of USENIX Security, Washington, D.C. Berkeley, CA: USENIX.
Tri-Council Policy Statement. Ethical Conduct for Research Involving Humans. Government of Canada.
United States of America v. Aaron Swartz. United States District Court, District of Massachusetts.
United States of America v. David Nosal. United States Court of Appeals for the Ninth Circuit.
US Department of Health and Human Services (US DHHS). Public Welfare. 45 C.F.R. § 46.
Wang, Jing, Panagiotis G. Ipeirotis, and Foster Provost. A Framework for Quality Assurance in Crowdsourcing. Social Science Research Network Working Paper Series. https://papers.ssrn.com/sol/papers.cfm?abstract_id=
Wendi J. Lee v. PMSI, Inc. United States District Court for the Middle District of Florida.
Yang, Zhi, Christo Wilson, Xiao Wang, Tingting Gao, Ben Y. Zhao, and Yafei Dai. Uncovering social network Sybils in the wild. ACM Transactions on Knowledge Discovery from Data.
Zimmer, Michael. “But the data is already public”: On the ethics of research in Facebook. Ethics and Information Technology.

  ......................................................................................................................

Dilemmas and Solutions

 ,  ,   

. W A  D!

We have been gathering “big data” throughout history. Land surveys such as the Domesday Book or the Napoleonic Cadastre have long been conducted, while a population census of the Roman Empire is mentioned in the Bible. Of late, however, advances in technology have led to a sharp fall in the relative cost of gathering data about individuals and their traits, leading to increases in the “three Vs” (volume, velocity, and variety) of big data (Laney, ). The use of computers in the systems around us means that data are often generated as a side effect of some other task. Vehicle detectors in intelligent transport systems can be used for monitoring congestion or adjusting speed limits, but also for identifying cars or drivers. Smart electricity meters can be used for charging, but also for correlating usage with demographic data on social deprivation. This extends to individuals themselves, as we increasingly use online social networks (OSNs), and as these become integrated with smartphone applications (apps) and other wearable and Internet of things technologies to create new social computing applications. These applications enable people to collect and generate their own data through the self-reporting of hobbies, interests, activities, and emotions, but at the same time they enable accurate behavioral profiling, gathering browsing and shopping data from across the Web through social widgets (e.g., the Facebook like button) and the rich interconnection between analytics and tracking services (Falahrastegar, Haddadi, Uhlig, & Mortier, ). This increase in the amount and types of data, and the ease with which they can be collected, has led to a corresponding increase in research based on such data collection,



. , . ,  . 

with thousands of papers published using Facebook alone (Caers et al., ). Such research is essential for improving our understanding of this wide techno-social landscape. Indeed, understanding whether human behavior has changed psychologically or socially because of Internet or technology use could be difficult without collecting such data. These new technologies have also enabled improvements in new research data collection methods, such as “citizen science” (Silvertown, ), while the resultant large quantities of data have enabled new methods of analysis, such as “discovery-driven” or “hypothesis-free” research (Aebersold, Hood, & Watts, ). The data collected from people can also be used for societal and individual benefits. For example, location-based services can be used for predicting journeys, optimizing traffic routes, or public transport services, but the data can also be used for delivering targeted advertising or providing detailed (or potentially intrusive) location-based services. Increasingly, the data generated by us, or inferred about us, create a rich information ecosystem that has both benefits and harms. In this chapter we focus on the challenges for conducting responsible and ethical research when using data collected from OSNs for research purposes, from active volunteers and passive participants. We discuss the potential benefits and also the potential pitfalls of such digital research, what it means to conduct responsible research, and the current technical limitations that might impede our ability to do so. We then describe some best practices, both in terms of what researchers can do technically and also legally and socially. Ethics itself is of course a rather overloaded term (in the computer science sense1), with many definitions. For this reason some prefer to use the notion of “responsible research” (Owen, Macnaghten, & Stilgoe, ). 
But Shamoo and Resnik () offer four definitions of ethics in relation to research, and for our purposes we focus on the first: “ethics as standards of conduct that distinguish between right and wrong, good and bad, and so on.” In his well-known book Singer (, p. ) describes such ethical standards as being somehow decided by considering universal preferences and making decisions that are ethical according to “preference utilitarianism.” But ethics and morality are closely linked (Farrimond, ), and when considering preferences, perhaps we should limit ourselves to moral actors.2 Thus, when discussing the benefits and pitfalls of digital research, we need to consider the preferences of all actors involved in these systems.

. T G B

Many of the examples mentioned previously in this chapter, and in the book as a whole, relate to the use of big data for understanding aggregate behaviors (e.g., overall road traffic, overall political patterns, or the statistical distribution of mental or physical health conditions in individuals or society). Such data are useful to governments, industry, and researchers aiming to optimize their services or to improve scientific insight in

    



relevant areas. They can be used to identify the spread of rumors and media in a society (Cha, Pérez, & Haddadi, ), or to understand bias in individuals’ views or those of the media (An, Cha, Gummadi, Crowcroft, & Quercia, ). Similarly, location and picture data from OSNs such as Instagram and Twitter can be used to understand the health habits of individuals and communities with respect to issues such as physical activity and obesity (Widener & Li, ; Mejova, Haddadi, Noulas, & Weber, ), to find investors for crowdfunded projects (An, Quercia, & Crowcroft, ), or to aid relevant organizations in providing fast and efficient disaster relief in emergencies (Díaz, Aedo, & Herranz, ; Imran, Castillo, Lucas, Meier, & Vieweg, ). Data from OSNs can also be used to recommend friends, new content, books, and entertainment to individuals using the social plug-ins provided by these services (Konstas, Stathopoulos, & Jose, ; Seth & Zhang, ). The data collected from these sources can be used to infer public health (Mejova et al., ; Paul & Dredze, ) and physical activity traits (Cavallo et al., ), as well as to provide useful feedback and behavioral interventions to individuals. The vast amount of data available from OSNs, complemented by the wealth of personal data obtainable from wearable devices and smartphones shared online, provides the research community with opportunities bounded only by consumer willingness, scientific creativity, data access regulations, and ethics (Mortier, Haddadi, Henderson, McAuley, & Crowcroft, ). These social computing applications often rely on the collection of personally identifiable information (PII) from individuals.
Though some services and organizations mandate ethical approval for dealing with such sensitive data, and some research fields are developing subject-specific guidelines (Bailey, Dittrich, Kenneally, & Maughan, ), there is as yet no universally agreed-upon framework for the collection, archiving, and use of personal information. The examples above provide a glimpse of the opportunities and huge potential in the use of social media. We encourage the reader to keep the value of these services in mind while reading the rest of this chapter, in which we present the challenges and limitations.

. W C P G W?

Individuals on OSNs are often attracted by the network effect, one of the main attractions being the ability to maintain relationships with offline friends and family; beyond these, much useful information can be found through weak ties (Granovetter, ) and from users self-organizing to create well-connected communities (De Meo, Ferrara, Fiumara, & Provetti, ). In doing so, users may inherently divulge a large set of personal information about themselves (e.g., their location and activity) and their connections (social circles) over time. Researchers enjoy the wealth and depth of these data, the interactions between individuals and organizations, and the content-sharing patterns on social media. However, collecting these data has potential risks for the individuals using these services, as the information can be used to identify small



. , . ,  . 

groups or individuals and thus potentially expose sensitive details. Over the past few years several research studies have attracted attention in the media, perhaps unexpectedly, due to their use of sensitive data, jeopardizing individuals’ privacy or affecting the participants’ emotional well-being. In  a set of university researchers employed students to help them crawl the Facebook “walls” of an entire undergraduate class (Selwyn, ). Similar research has been conducted by creating fake user profiles in order to collect individuals’ profiles in certain geographic areas (Viswanath, Mislove, Cha, & Gummadi, ; Haddadi & Hui, ). This highlights the vulnerability of these OSNs to impersonation and honeypot attempts, even though the intention may have been to conduct scholarly studies. One of the best-known examples of this problem is the  “emotional contagion” Facebook/Cornell study (Kramer, Guillory, & Hancock, ), which faced much criticism over its legal and ethical aspects (Schroeder, ). This experiment involved tuning individuals’ news feeds based on the sentiment of the posts from their networks and observing the effects on their subsequent posts. Participants did not consent to this manipulation, and the Facebook user agreement was only changed after the experiment to allow this use. The wide public reaction highlighted the broader impact and importance of scientific experiments and the way their results are communicated to the public. In a similar manner, an earlier Facebook experiment in  had received criticism when the supposedly anonymous study was quickly de-anonymized, revealing sensitive personal and private information (Zimmer, ). Individuals on OSNs, or on the Internet in general (e.g., using search engines), are often unaware that they may be part of an ongoing experiment. The passive collection of data from OSNs thus poses a threat to individuals’ anonymity and privacy.
Similarly, research on Internet openness or social network data mining can lead to ethically unacceptable practices, may be illegal in many jurisdictions, and may be dangerous for individuals who reside in those locations (Wright, Souza, & Brown, ). Willingness to share OSN data is sometimes temporally sensitive (Bauer et al., ; Ayalon & Toch, ), though this does not reduce the severity of the risks involved. Research on mental health issues such as depression (De Choudhury, Counts, & Horvitz, ), or on sexual orientation, political views, happiness, and personality traits (Kosinski, Stillwell, & Graepel, ), can have consequences for individuals’ employment, relationships, and personal lives. Future inferences and correlations (drawn from data collected today about an individual) may also be harmful due to effects on their future social, economic, or political status. Anonymization and de-identification by removing names, or mapping names to IDs, have long been used to release OSN and similar data for research purposes. However, the de-anonymization of the Netflix Prize data set and the understanding of the privacy risks of models such as k-anonymity (Sweeney, ) have led to widespread exposure of the risks in traditional anonymization techniques (Narayanan & Shmatikov, ; Machanavajjhala, Kifer, Gehrke, & Venkitasubramaniam, ). Despite further efforts in de-identification, this still remains a challenge for large-scale OSN data usage

    



(Narayanan & Felten, ). Location privacy studies have also shown how easily mobility patterns and social ties can be predicted using mobile phone data (De Domenico, Lima, & Musolesi, ), and linking OSN data sets with other mobility traces can make de-identification difficult (Ji, Li, Srivatsa, He, & Beyah, ). In section  we discuss some potential solutions in the literature. Taking a step back and a broader view, we can categorize the risk levels and their probabilities of occurrence by looking at a number of threats:
• Psychological harm ranges from embarrassment at falling for a deceptive experiment, to being exposed to offensive content, to being trolled or stalked.
• Economic harm includes loss of money or property, identity theft, and loss of job or reputation as a consequence of content being shared on social networks.
• Physical harm includes threats to individuals’ lives due to their ideology, political affiliations, or religious views, based on knowledge gained from social media.
These examples are just a few of the high-risk threat categories for individuals using OSNs. What are the challenges and obstacles that need to be overcome to protect individuals using these services? Are the outcomes and resulting inferences worth the potential risks? Can we protect against all potential risks, and should we, if protecting against a risk makes the resulting data less beneficial?
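To make Sweeney’s notion of k-anonymity concrete, the sketch below computes the k for which a small record set is k-anonymous: a data set is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The records and the quasi-identifier fields (`zip`, `age`) are invented for illustration, not drawn from any study cited here.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest equivalence-class size over the quasi-identifiers:
    the dataset is k-anonymous for exactly this k."""
    classes = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return min(classes.values())

# Hypothetical check-in records: ZIP code and age act as quasi-identifiers.
records = [
    {"zip": "02138", "age": 29, "checkin": "cafe"},
    {"zip": "02138", "age": 29, "checkin": "gym"},
    {"zip": "02139", "age": 41, "checkin": "park"},
]
print(k_anonymity(records, ["zip", "age"]))  # → 1
```

The lone 02139/41 record forms an equivalence class of size one, so despite the removal of names the data set is only 1-anonymous: that individual is uniquely identifiable from the quasi-identifiers alone.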

. T L: C W O T?

The preceding sections discussed the pros and cons of using OSNs. One might ask: How can we protect individuals from the privacy and ethics issues of large-scale data analysis, while still enabling research that is beneficial to society? In many studies, testing a certain hypothesis requires a thorough understanding of the relationships in a network, or analyzing the context and sentiment of tweets. Hence we need to consider a large number of varying privacy laws and ethics considerations to distinguish scientific research from cyberespionage. One of the most fundamental techniques applied by researchers to OSN data is graph analysis. Graph techniques can enable understanding of communication patterns and trends, or of the role and influence of individuals in OSNs (Cha, Haddadi, Benevenuto, & Gummadi, ; Cha, Benevenuto, Haddadi, & Gummadi, ). Yet we are unable to study the topological characteristics of graphs that have been anonymized in such a way that nodes cannot be identified and linked back to their neighbors (Narayanan & Felten, ; Narayanan & Shmatikov, ). Advances in applications of differential privacy have enabled progress in this space by achieving a trade-off between maintaining structural similarity and privacy



. , . ,  . 

protection (Sala, Zhao, Wilson, Zheng, & Zhao, ). Other techniques, such as homomorphic encryption and, subsequently, secure multiparty computation (Lindell & Pinkas, ), have enabled some computations to be performed on sensitive data (Kocabaş & Soyata, ; Torres, Bhattacharjee, & Srinivasan, ). However, we are still unable to perform scalable contextual or sentiment analysis on fully encrypted content or large multidimensional data sets while fully preserving privacy. Full reliance on encryption remains an ideal goal in this space. It is important to understand that technology does not, and indeed cannot, provide a panacea for many of the challenges and threats that we have outlined. For data to be useful, they need to be processed. If someone uploads a photo or a status update to an OSN, they expect that photo to be viewed or that status to be read. If someone purchases a product based on a recommendation from an OSN friend, they expect that product to be delivered. It is the processing of these data that enables some fundamental attacks, because however legitimate data access may be, the persons to whom access has been granted may renege, and may do any of the following:
1. Re-broadcast data. This problem is well known to the media industries: no matter how they attempt to secure the storage and delivery of music or films, consumers have to be able to play or view these media at some point. Hence no DRM (digital rights management) scheme, regardless of its complexity or sophistication, has been able to block this “analog hole” (Sicker, Ohm, & Gunaji, ); that is, none has been able to prevent people from recording their music as it comes out of a headphone jack or speakers. Copyright holders have therefore pursued legal remedies rather than technological ones.
2. Re-identify data.
It has been demonstrated that the “power of 4,” for example, four uses of a credit card online or four check-ins on an OSN such as Foursquare or Facebook, is all that is needed to identify a unique individual with high probability (de Montjoye, Hidalgo, Verleysen, & Blondel, ; de Montjoye, Radaelli, Singh, & Pentland, ).
3. Reveal data. Huge data breaches are a regular feature in today’s news.3 This is perhaps an unsurprising effect of the tendency to store large amounts of sensitive data in one system. Even though there may be good reasons for doing so (e.g., efficiencies and better healthcare through storing health data in a central system), the risk of becoming an attractive target for attackers, and the total loss of privacy that results from such a data breach, must be considered.
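As a toy illustration of the secure multiparty computation idea mentioned above (a sketch only, not a hardened protocol), additive secret sharing lets mutually untrusted servers compute an aggregate without any single server seeing an individual’s value. The values, party counts, and modulus below are assumptions made up for the example.

```python
import random

Q = 2**61 - 1  # large prime modulus; each share on its own is uniformly random

def share(value, n_parties):
    """Split an integer into n additive shares that sum to value (mod Q)."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

# Three users each split a private value across three non-colluding servers.
private_values = [23, 58, 11]
all_shares = [share(v, 3) for v in private_values]

# Server i sums the i-th share from every user; it learns nothing individual.
server_totals = [sum(col) % Q for col in zip(*all_shares)]

# Only the recombined server totals reveal the aggregate, never a single value.
print(sum(server_totals) % Q)  # → 92
```

The design point is the one made in the text: the servers can still do useful processing (here, a sum) even though no legitimate data access exposes a raw individual record; a colluding or reneging server, however, could recombine shares, which is exactly the class of attack the numbered list describes.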

. B P   T P

What are the technical solutions for privacy-aware and ethical research on OSNs? How can we guarantee that individuals’ privacy will be respected, their personal data will be

    



protected, and data cannot be used in the future for malicious purposes? There have been a number of efforts in privacy-preserving, or private, information retrieval and analysis. In this section we discuss a number of recent approaches to this challenge. One of the most basic and fundamental responsibilities of researchers dealing with personal data is to ensure that appropriate role-based access control measures are in place for databases of personal data when collecting, analyzing, and releasing such data. This step ensures basic protection against data being accidentally revealed to individuals who were not involved in the research, data collection, or analysis process. The natural follow-on procedure is to apply strong encryption, with careful management of decryption keys. These first steps substantially raise the effort required for an unintended individual to access the data, but they may not be effective in protecting individuals’ identities against the researchers themselves or against a determined attacker. More advanced techniques include the following:
• Data fuzzing and noise addition. The main aim of these techniques is to aggregate data and build distributions in such a way that an individual, or the relationship between two subjects, cannot be identified. Methods such as differential privacy (Dwork, ) also provide guarantees that the results of an analysis are accurate independent of any individual’s inclusion or exclusion in a database. They hence enable the sharing of OSN graph data (Sala et al., ) or the addition of noise to survey results (Haddadi, Mortier, & Hand, ), without a major impact on the usefulness of the final statistics.
• Data retention/deletion policy.
Legislation such as the EU Data Protection Directive (Directive //EC of the European Parliament and of the Council of  October  on the protection of individuals with regard to the processing of personal data and on the free movement of such data, ) provides rules on the collection, use, retention, and disposal of personal data intended for different purposes. In essence, the introduction of stricter requirements such as the “Right To Be Forgotten” in the EU has led to the development of mechanisms for individuals to request the removal of some of their digital footprints. Research in online reputation management and privacy has highlighted the importance of mechanisms for data deletion (Mayer-Schonberger, ).
• Malleable encryption schemes. These schemes provide the ability to perform certain computations on ciphertext, with results that match those that would have been obtained on the raw data. The most important of these schemes are partially and fully homomorphic encryption systems (Gentry, ). These schemes allow the data from research participants to be encrypted at the collection end by the user, hence limiting the exposure of sensitive data such as financial records or health data.
• Privacy by design. Much research and surveying can be done by limiting the data collection process in the first place, moving away from the traditional approach of collecting as much data as possible in case one might find it useful



. , . ,  .  at a later stage. Often, data collection and analysis can be performed at the user end, even for research involving resource-constrained devices such as smartphones or sensors (Haddadi, Hui, Henderson, & Brown, ; Lane & Georgiev, ). Researchers and data practitioners should ideally assess the cost-benefit trade-offs of gathering data in the first place in order to prevent excessive data collection and processing of sensitive and unnecessary data.

These approaches present some of the ways in which researchers can reduce the unintended negative consequences of data collection from OSNs or other sources of personal information. But are technical solutions ideal? Will they stand the test of time? Or should we expect responsible research to be subject to wider scrutiny and transparency?

. B P   L  S P

Technical systems can help us ensure that data are stored or processed appropriately. But to understand what is appropriate, we need to understand how people, such as research participants, want their data to be used. Traditionally, researchers have relied on ethics committee or institutional review board (IRB) approvals and informed consent. The notion of informed consent has long been the standard for determining whether participants’ wishes have been met. “Informed” consent became enshrined in law in the  Salgo case in the United States. Martin Salgo was left permanently paralyzed by a translumbar aortography and sued his physicians for negligence and for failing to inform him of the risk of paralysis (Faden & Beauchamp, ). The court in this malpractice case suggested that doctors in such cases have a duty to disclose the risks and alternative treatments, while at the same time acknowledging that doctors should also use their discretion where appropriate. Similar ideas about informed consent in research ethics were first codified in the Nuremberg Code () and the Declaration of Helsinki (), and later in documents such as the CIOMS guidelines (Council for International Organizations of Medical Sciences, ) and the EU Data Protection Directive (European Parliament and the Council of the European Union, ). The probabilities of the risks involved in de-anonymization and re-identification are not always easily comprehensible to researchers, let alone explainable in plain everyday language to OSN users taking part in an experiment. On the other hand, it is not always clear what the role of researchers should be if they find legally questionable information about individuals while performing these measurements. Informed consent has a strong basis in moral theory, specifically the respect for autonomy.
Although informed consent has long been in use in medical ethics, Manson and O’Neill () argue that it is insufficient. They find that current informed consent requirements are impractical, and that individual autonomy is only one part of the moral issue

    



in much research. One of their proposals is that universal standards for informed consent (e.g., standard forms that apply to all research situations) are unrealistic. Similarly, there is currently some debate about whether informed consent is required for the collection of data from digital systems such as OSNs. Zimmer () discusses the Facebook T study, concluding that just because people share data on an OSN does not mean researchers have the right to access those data for research. Solberg () argues, to the contrary, that Facebook research generally does not require ethics approval, as it is of low risk. Buchanan, Aycock, Dexter, Dittrich, and Hvizdak () look at cybersecurity research and propose that we need a community process to determine standards: “We support a shared responsibility model, where researchers, REBs, and information technology experts work together to disclose, understand, and accommodate the unique ethical issues within CS research.” This is akin to the concerns of Manson and O’Neill (, p. ) about standards, and to the consideration of all actors within Responsible Research and Innovation (RRI). Even if one does determine that informed consent is required or desired for a particular piece of research, it is not always clear whether obtaining consent is meaningful if the way in which people are informed is overly complex (Luger, Moran, & Rodden, ), or if, as in the Facebook example described in section , terms of service are unclear or even inaccurate (Check Hayden, ). Ioannidis () analyzes the claim that consent is meaningless for big data or discovery-driven research and argues that we still need consent, in part to maintain trust between researchers and participants.
Luger and Rodden () further distinguish between secured consent, such as the traditional ticking of a checkbox at the start of an experiment or the acceptance of an EULA (end user license agreement) when installing a piece of software, and sustained consent, in which participants might be probed in a sustained fashion to determine whether they truly consent to sharing data. (A specific OSN example of this is provided by Sleeper et al. [], who studied self-censorship on Facebook by allowing participants to choose when and what to share with researchers over time.) The need for the latter is borne out by Munteanu et al. (), who present a number of case studies arguing that the consent process must be examined over the life cycle of a study, while Neuhaus and Webmoor () propose “agile ethics,” akin to the agile software engineering process, in which ethical considerations are evaluated and revisited over time. Sustained consent can be a burden for participants, who may be discouraged by continuous requests for consent in addition to requests for data. It may thus affect participation rates or even fatigue participants, such that their consent decisions become incorrect. This has led to recent research into methods for reducing the burden of sustained consent while still providing more granular control than secured consent. Morrison, McMillan, and Chalmers () show that allowing participants to periodically view representations of their data over the course of an experiment can improve engagement. Hutton and Henderson () employ Nissenbaum’s () framework of contextual integrity to predict when interactions are likely to violate privacy expectations and thus require new requests for consent, showing that this is a viable



. , . ,  . 

alternative to sustained consent for OSN studies. Gomer, Schraefel, and Gerding () propose the use of artificial intelligence agents to achieve what they term semiautonomous consent, wherein preferences can be elicited from participants and agents determine when consent would and would not be granted. Related approaches to this include the use of machine learning and recommender systems to set privacy preferences themselves, freeing users from the complex interfaces used for designing privacy policies in many OSNs (Toch, ; Zhao, Ye, & Henderson, , ). One particular concern is how data are used beyond the life cycle of an experiment. This will become increasingly common as data sharing becomes mandated by funding bodies in the UK, EU, and the US. Some participants may have concerns about their data being used for other purposes, while others might welcome their use for some studies but not others. Kaye et al. () recognize this tension through their notion of dynamic consent, whereby participants can be engaged in research over time, viewing how their data are used and allowing them to consider different contexts for their data. Although proposed for biobank research, it could also be applicable to OSN data. Beyond the notion of consent to participating in an experiment, we can consider how people would like to have more control of their data. Current systems enable very little control, with hardly any transparency about how or when data are collected (e.g., the various data protection controversies over Google’s Street View data collection using Wi-Fi-enabled cars), or transparency about how data are processed and transformed (e.g., controversies about how Facebook manipulates users’ news feeds to determine what information to present [Kramer et al., ]). We have proposed that understanding and enabling this control warrants a new research area of its own, human-data interaction (HDI) (Mortier et al., ). 
As the Internet of things begins to be deployed, the amounts of data around us will be ever increasing, including information about health, well-being, food, home entertainment usage, and many other areas. We need to understand what people will be willing to share with companies in return for services, or with researchers voluntarily, without assuming that they have to give up all of their privacy to allow such technology into their homes and lives. Focusing on legal aspects, another important recent challenge has been the introduction of the “Right To Be Forgotten” in Europe, which has enabled individuals to request the removal of their data, or of inaccurate articles referring to them, from search engines. Various social media platforms and OSNs have also increasingly been providing privacy tools and supporting data removal requests. Some have gone to the extent of holding researchers responsible for removing data items from their collections long after the data collection has taken place, should an individual request the removal of a social media posting. Control also extends to legal requirements. For instance, people may wish to keep their cloud data in particular places, because they do not trust other governments or because there are legal requirements that mandate the storing of data in particular jurisdictions (Hon & Millard, ).




Table 31.1 Example Tools and Techniques for Best Practices in Dealing with Sensitive Data in Research

Process: Data collection
Techniques: Proportionate data collection, meaningful consent, preprocessing at source

Process: Data storage
Techniques: User-controlled data aggregation, source-based data anonymization

Process: Data analysis
Techniques: Differential privacy, homomorphic encryption, privacy by design

Process: Data release
Techniques: Distribution building, noise addition, differentially private anonymization

The main barrier to treating responsible research and innovation as a purely legal issue is the often delayed response of the legal system to new advances in technology and the risks they can present. Hence researchers and scientists need to take into consideration not only legal aspects, but also ethical considerations, societal impact, and mutual trust when dealing with individuals’ data. Table 31.1 presents some potential tools and techniques for best practices in personal and social data collection, management, and analysis from OSNs. These are just examples of ways in which researchers can limit the usage of individuals’ private data beyond the intended scientific purposes.
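One of the techniques in Table 31.1, differential privacy, can be made concrete with a small sketch. The function names below are ours, not from any particular library; the code illustrates the standard Laplace mechanism for a counting query. Because adding or removing one person changes a count by at most 1, adding Laplace noise with scale 1/ε yields an ε-differentially private release.

```python
import random


def laplace_noise(scale):
    # A Laplace(0, scale) variate, drawn as a randomly signed
    # exponential variate with mean `scale`.
    magnitude = random.expovariate(1.0 / scale)
    return magnitude if random.random() < 0.5 else -magnitude


def dp_count(records, predicate, epsilon):
    """Release how many records satisfy `predicate`, with epsilon-DP.

    A counting query has sensitivity 1 (one person's presence changes
    the result by at most 1), so Laplace noise of scale 1/epsilon
    suffices for the Laplace mechanism.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller ε means stronger privacy but noisier answers, and releasing many such counts consumes a cumulative privacy budget; both trade-offs are why the technique belongs in the analysis and release stages of Table 31.1.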

. W C  O R?

..................................................................................................................................
Today, OSNs occupy the largest part of screen time for hundreds of millions of Internet users across the world.4 With their high levels of engagement, variety of content, and convenient data collection mechanisms, these platforms provide the perfect environment for studying human behavior, testing sociological and psychological theories, and performing large-scale A/B testing. However, there are numerous ethical challenges to be considered when performing research using OSNs. As shown in this chapter, technical or legal solutions alone are not enough to prevent privacy disasters or misplaced trust. In essence, no single organization or stakeholder can protect the rights of individuals; the ecosystem needs co-creation of rules among the legal system, technology sector, analytics firms, consumer rights groups, researchers, and individuals. As researchers and data practitioners, we also need to develop new ethics frameworks to comply with new and evolving legal requirements and technical approaches (e.g., Microsoft Research, described in Bowser & Tsai, ).



. , . ,  . 

Some other challenges exist in the space of ethics-aware and responsible data collection for research on OSNs, including (this list is not exhaustive):

• making secure systems easy to use (for all stakeholders);
• communicating risks to ordinary users in simple terminology;
• obtaining meaningful consent in a fashion that is accurate and yet not burdensome;
• achieving near-perfect, yet feasible and scalable, anonymization;
• coping with legacy systems and technologies;
• establishing the legal duties and liabilities of the researcher;
• achieving near-perfect reliability for big and small data systems; and
• sustaining data and data protection over hundreds of years.

Enabling more stringent requirements and tighter controls on data and processes also implies that people should own their own data. They should be allowed to decide what data are collected about them and with whom those data are shared, and they should be allowed to monetize these data, in similar ways to how supermarket loyalty cards work: with clearly identified parties, for transparent reasons, or through aggregators who turn large amounts of reidentifiable data into statistical information using the statistical techniques described in the previous section. For instance, someone might be happy to share their financial data with service providers for help with financial or budgeting assistance, but not to share their health data with these same services. Conversely, they might be willing to share food consumption data with other interested parties, such as supermarkets, in return for payment. The availability of a personal cloud, combined with techniques such as differential privacy and appropriate use of access control and cryptographic techniques, can all serve to make this work.
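The selective-sharing arrangement just described can be sketched in code. The toy class below is our own illustration, not the API of any real personal data store: raw records never leave the user’s store, and a service receives only aggregate answers, and only for the data categories the owner has approved for that service (finance for a budgeting app, say, but not health).

```python
class PersonalDataStore:
    """Toy sketch of a user-controlled personal data store.

    Raw (category, value) records stay inside the store. A service may
    ask only for an aggregate answer, and only over categories that the
    owner's sharing policy has approved for that service.
    """

    def __init__(self, records, sharing_policy):
        self._records = list(records)     # (category, value) pairs, kept private
        self._policy = sharing_policy     # service name -> set of permitted categories

    def safe_answer(self, service, category, aggregate):
        allowed = self._policy.get(service, set())
        if category not in allowed:
            raise PermissionError(
                f"{service} is not permitted to query {category} data"
            )
        values = [value for cat, value in self._records if cat == category]
        # Only the aggregate result leaves the store, never the raw records.
        return aggregate(values)
```

In practice the aggregate answer itself could additionally be perturbed with differential privacy before release, combining the two techniques mentioned above.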
Recently there have been advances in systems and frameworks, such as OpenPDS (de Montjoye, Shmueli, Wang, & Pentland, ) and Databox (Chaudhry et al., ), which aim to enable a user-centric approach to personal data use. In conclusion, there is an urgent need for the research community to consider personal, societal, and ethical consequences of large-scale use of social media data in order to perform valuable scientific research without losing the respect and trust of the individuals involved in the ecosystem.

N . Overloading is a form of polymorphism in which different functions or methods with the same name can be invoked depending on context. Informally the term can also be used to reference the use of the same term to invoke different meanings. . “It is precisely the moral persons who are entitled to equal justice” (Rawls, , p. ).




3. For instance, as we write this chapter, the US Office of Personnel Management suffered a data breach that affected more than twenty-one million people (Davis, ).
4. In a recent earnings call, Facebook reported that on average people spend fifty minutes a day on its sites (D’Onfro, ).

R Aebersold, R., Hood, L. E., & Watts, J. D. (, April). Equipping scientists for the new biology. Nature Biotechnology, (), . doi:./ An, J., Cha, M., Gummadi, K. P., Crowcroft, J., & Quercia, D. (). Visualizing media bias through twitter. In Proceedings of the th International AAAI Conference on Web and Social Media (ICWSM). An, J., Quercia, D., & Crowcroft, J. (). Recommending investors for crowdfunding projects. In Proceedings of the rd International Conference on World Wide Web (pp. –). doi:./. Ayalon, O., & Toch, E. (). Retrospective privacy: Managing longitudinal privacy in online social networks. In Proceedings of the Ninth Symposium on Usable Privacy and Security. New York: ACM. doi:./. Bailey, M., Dittrich, D., Kenneally, E., & Maughan, D. (, March). The Menlo report. IEEE Security & Privacy Magazine, (), –. doi:./msp.. Bauer, L., Cranor, L. F., Komanduri, S., Mazurek, M. L., Reiter, M. K., Sleeper, M., & Ur, B. (). The post anachronism: The temporal dimension of Facebook privacy. In Proceedings of the th ACM Workshop on Workshop on Privacy in the Electronic Society (pp. –). New York: ACM. doi:./. Bowser, A., & Tsai, J. Y. (). Supporting ethical web research: A new research ethics review. In Proceedings of the th International Conference on the World Wide Web (pp. –). Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. doi:./. Buchanan, E., Aycock, J., Dexter, S., Dittrich, D., & Hvizdak, E. (, June). Computer science security research and human subjects: Emerging considerations for research ethics boards. Journal of Empirical Research on Human Research Ethics, (), –. doi:./ jer.... Caers, R., De Feyter, T., De Couck, M., Stough, T., Vigna, C., & Du Bois, C. (, September). Facebook: A literature review. 
New Media & Society, (), –. doi:./  Cavallo, D. N., Tate, D. F., Ries, A. V., Brown, J. D., DeVellis, R. F., & Ammerman, A. S. (). A social media–based physical activity intervention: a randomized controlled trial. American Journal of Preventive Medicine, (), –. Cha, M., Benevenuto, F., Haddadi, H., & Gummadi, K. (). The world of connections and information flow in Twitter. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions, (), –. doi:./tsmca.. Cha, M., Haddadi, H., Benevenuto, F., & Gummadi, P. K. (). Measuring user influence in Twitter: The million follower fallacy. In Proceedings of the th International AAAI Conference on Web and Social Media (ICWSM). Retrieved from http://aaai.org/ocs/index.php/ ICWSM/ICWSM/paper/viewPaper/



. , . ,  . 

Cha, M., Pérez, J., & Haddadi, H. (). Flash floods and ripples: The spread of media content through the blogosphere. In Proceedings of the  ICWSM Data Challenge Workshop.
Chaudhry, A., Crowcroft, J., Haddadi, H., Howard, H., Madhavapeddy, A., McAuley, D., & Mortier, R. (). Personal data: Thinking inside the box. In Proceedings of the th Decennial Aarhus Conference. doi:./aahcc.vi.
Check Hayden, E. (, June ). Informed consent: A broken contract. Nature, (), –. doi:./a
Council for International Organizations of Medical Sciences. (). International ethical guidelines for biomedical research involving human subjects. Geneva: CIOMS. Retrieved from http://www.cioms.ch/index.php/printable-publications?task=view&id=&catid=
Davis, J. H. (, July ). Hacking exposed  million in U.S., government says. New York Times. Retrieved from http://www.nytimes.com////us/office-of-personnel-management-hackers-got-data-of-millions.html
De Choudhury, M., Counts, S., & Horvitz, E. (). Social media as a measurement tool of depression in populations. In Proceedings of the th Annual ACM Web Science Conference (pp. –). New York: ACM. doi:./.
De Domenico, M., Lima, A., & Musolesi, M. (, December). Interdependence and predictability of human mobility and social interactions. Pervasive and Mobile Computing, (), –. doi:./j.pmcj...
De Meo, P., Ferrara, E., Fiumara, G., & Provetti, A. (, October). On Facebook, most ties are weak. Communications of the ACM, (), –. doi:./
de Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (, March ). Unique in the crowd: The privacy bounds of human mobility. Scientific Reports, (). doi:./srep
de Montjoye, Y.-A., Radaelli, L., Singh, V. K., & Pentland, A. (, January ). Unique in the shopping mall: On the reidentifiability of credit card metadata. Science, (), –. doi:./science.
de Montjoye, Y.-A., Shmueli, E., Wang, S. S., & Pentland, A. S. (, July ). openPDS: Protecting the privacy of metadata through SafeAnswers. PLoS ONE, (), e+. doi:./journal.pone.
Díaz, P., Aedo, I., & Herranz, S. (). Citizen participation and social technologies: Exploring the perspective of emergency organizations. In C. Hanachi, F. Bénaben, & F. Charoy (Eds.), Information systems for crisis response and management in Mediterranean countries (Vol. , pp. –). Springer International Publishing. doi:./---_
D’Onfro, J. (, April ). Here’s how much time people spend on Facebook, Instagram, and Messenger every day. Business Insider. Retrieved from http://businessinsider.com/how-much-time-do-people-spend-on-facebook-per-day--
Dwork, C. (). Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, & I. Wegener (Eds.), Automata, languages and programming (Vol. , pp. –). Berlin, Heidelberg: Springer. doi:./_




European Parliament and the Council of the European Union. (). Directive //EC of the European Parliament and of the Council of  October  on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Union, L , –.
Faden, R. R., & Beauchamp, T. L. (). A history and theory of informed consent. Oxford: Oxford University Press.
Falahrastegar, M., Haddadi, H., Uhlig, S., & Mortier, R. (). Anatomy of the third-party web tracking ecosystem. CoRR, abs/.. Retrieved from http://arxiv.org/abs/.
Farrimond, H. (). Doing ethical research. Basingstoke, UK: Palgrave Macmillan.
Gentry, C. (, March). Computing arbitrary functions of encrypted data. Communications of the ACM, (), –. doi:./.
Gomer, R., Schraefel, M. C., & Gerding, E. (). Consenting agents: Semi-autonomous interactions for ubiquitous consent. In Proceedings of the  ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication (pp. –). New York: ACM. doi:./.
Granovetter, M. S. (, May). The strength of weak ties. American Journal of Sociology, (), –. doi:./
Haddadi, H., & Hui, P. (). To add or not to add: Privacy and social honeypots. In  IEEE International Conference on Communications Workshops (ICC) (pp. –).
Haddadi, H., Hui, P., Henderson, T., & Brown, I. (, September). Targeted advertising on the handset: Privacy and security challenges. In J. Müller, F. Alt, & D. Michelis (Eds.), Pervasive advertising (pp. –). London: Springer London. doi:./---_
Haddadi, H., Mortier, R., & Hand, S. (, April). Privacy analytics. ACM SIGCOMM Computer Communication Review, (), –. doi:./.
Hon, W. K., & Millard, C. (). How do restrictions on international transfers of personal data work in clouds? In C. Millard (Ed.), Cloud computing law. Oxford: Oxford University Press. doi:./acprof:oso/..
Hutton, L., & Henderson, T. (, May). “I didn’t sign up for this!”: Informed consent in social network research. In Proceedings of the th International AAAI Conference on Web and Social Media (ICWSM) (pp. –). Oxford, UK: AAAI.
Imran, M., Castillo, C., Lucas, J., Meier, P., & Vieweg, S. (). AIDR: Artificial intelligence for disaster response. In Proceedings of the Companion Publication of the rd International Conference on World Wide Web (pp. –). doi:./.
Ioannidis, J. P. A. (, March ). Informed consent, big data, and the oxymoron of research that is not research. The American Journal of Bioethics, (), –. doi:./..
Ji, S., Li, W., Srivatsa, M., He, J. S., & Beyah, R. (, April). General graph data deanonymization: From mobility traces to social networks. ACM Transactions on Information and System Security, (). doi:./
Kaye, J., Whitley, E. A., Lund, D., Morrison, M., Teare, H., & Melham, K. (, May ). Dynamic consent: A patient interface for twenty-first century research networks. European Journal of Human Genetics, (), –. doi:./ejhg..
Kocabaş, Ö., & Soyata, T. (). Medical data analytics in the cloud using homomorphic encryption. In Handbook of research on cloud infrastructures for big data analytics (p. ).



. , . ,  . 

Konstas, I., Stathopoulos, V., & Jose, J. M. (). On social networks and collaborative recommendation. In Proceedings of the nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. –). New York: ACM. doi:./.
Kosinski, M., Stillwell, D., & Graepel, T. (, April ). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, (), –. doi:./pnas.
Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (, June ). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, (), –. doi:./pnas.
Lane, N. D., & Georgiev, P. (). Can deep learning revolutionize mobile sensing? In Proceedings of the th International Workshop on Mobile Computing Systems and Applications (pp. –). New York: ACM. doi:./.
Laney, D. (, February ). -D data management: Controlling data volume, velocity and variety (Tech. Rep. No. ). META Group, Inc. Retrieved from http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/
Lindell, Y., & Pinkas, B. (). Secure multiparty computation for privacy-preserving data mining. Journal of Privacy and Confidentiality, (), .
Luger, E., Moran, S., & Rodden, T. (). Consent for all: Revealing the hidden complexity of terms and conditions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. –). New York: ACM. doi:./.
Luger, E., & Rodden, T. (). An informed view on consent for UbiComp. In Proceedings of the  ACM International Joint Conference on Pervasive and Ubiquitous Computing (pp. –). New York: ACM. doi:./.
Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (, March). L-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data, (). doi:./.
Manson, N. C., & O’Neill, O. (). Rethinking informed consent in bioethics. Cambridge, UK: Cambridge University Press. doi:./cbo
Mayer-Schonberger, V. (). Delete: The virtue of forgetting in the digital age. Princeton, NJ: Princeton University Press.
Mejova, Y., Haddadi, H., Noulas, A., & Weber, I. (, May). #foodporn: Obesity patterns in culinary interactions. In Proceedings of the ACM Conference on Digital Health. doi:./.
Morrison, A., McMillan, D., & Chalmers, M. (, October). Improving consent in large scale mobile HCI through personalised representations of data. In Proceedings of the th Nordic Conference on Human-Computer Interaction: Fun, Fast, Foundational (pp. –). New York: ACM. doi:./.
Mortier, R., Haddadi, H., Henderson, T., McAuley, D., & Crowcroft, J. (, October ). Human-data interaction: The human face of the data-driven society. SSRN. doi:./ssrn.
Munteanu, C., Molyneaux, H., Moncur, W., Romero, M., O’Donnell, S., & Vines, J. (, April). Situational ethics: Re-thinking approaches to formal ethics requirements for human-computer interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. –). doi:./.
Narayanan, A., & Felten, E. W. (). No silver bullet: De-identification still doesn’t work. White paper.




Narayanan, A., & Shmatikov, V. (). Robust de-anonymization of large sparse datasets. In Proceedings of the IEEE Symposium on Security and Privacy (pp. –). doi:./SP..
Neuhaus, F., & Webmoor, T. (, February). Agile ethics for massified research and visualization. Information, Communication & Society, (), –. doi:./x..
Nissenbaum, H. F. (, February). Privacy as contextual integrity. Washington Law Review, (), –.
Owen, R., Macnaghten, P., & Stilgoe, J. (, December). Responsible research and innovation: From science in society to science for society, with society. Science and Public Policy, (), –. doi:./scipol/scs
Paul, M. J., & Dredze, M. (). You are what you tweet: Analyzing Twitter for public health. In Proceedings of the th International AAAI Conference on Web and Social Media (ICWSM). Retrieved from https://www.aaai.org/ocs/index.php/ICWSM/ICWSM/paper/view/
Rawls, J. (). A theory of justice (Rev. ed.). Cambridge, MA: Harvard University Press.
Sala, A., Zhao, X., Wilson, C., Zheng, H., & Zhao, B. Y. (, November). Sharing graphs using differentially private graph models. In Proceedings of the ACM SIGCOMM Internet Measurement Conference (IMC) (pp. –). doi:./.
Schroeder, R. (). Big data and the brave new world of social media research. Big Data & Society, (). doi:./
Selwyn, N. (). Faceworking: Exploring students’ education-related use of Facebook. Learning, Media and Technology, (), –. doi:./
Seth, A., & Zhang, J. (). A social network based approach to personalized recommendation of participatory media content. In Proceedings of the nd International AAAI Conference on Web and Social Media (ICWSM) (pp. –). Seattle.
Shamoo, A. E., & Resnik, D. B. (). Responsible conduct of research. Oxford: Oxford University Press. doi:./acprof:oso/..
Sicker, D. C., Ohm, P., & Gunaji, S. (, Spring). The analog hole and the price of music: An empirical study. Journal on Telecommunications and High Technology Law, (), –.
Silvertown, J. (, September). A new dawn for citizen science. Trends in Ecology & Evolution, (), –. doi:./j.tree...
Singer, P. (). Practical ethics (rd ed.). Cambridge, UK: Cambridge University Press.
Sleeper, M., Balebako, R., Das, S., McConahy, A. L., Wiese, J., & Cranor, L. F. (). The post that wasn’t: Exploring self-censorship on Facebook. In Proceedings of the  Conference on Computer Supported Cooperative Work (pp. –). New York: ACM. doi:./.
Solberg, L. (). Data mining on Facebook: A free space for researchers or an IRB nightmare? University of Illinois Journal of Law, Technology & Policy, ().
Sweeney, L. (). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, (), –. doi:./S
Toch, E. (). Crowdsourcing privacy preferences in context-aware applications. Personal and Ubiquitous Computing, (), –. doi:./s---
Torres, W. A. A., Bhattacharjee, N., & Srinivasan, B. (). Effectiveness of fully homomorphic encryption to preserve the privacy of biometric data. In Proceedings of the th International Conference on Information Integration and Web-Based Applications & Services (pp. –). New York: ACM. doi:./.



. , . ,  . 

Viswanath, B., Mislove, A., Cha, M., & Gummadi, K. P. (). On the evolution of user interaction in Facebook. In Proceedings of the nd ACM Workshop on Online Social Networks. doi:./.
Widener, M. J., & Li, W. (). Using geolocated Twitter data to monitor the prevalence of healthy and unhealthy food references across the US. Applied Geography, , –. doi:./j.apgeog...
Wright, J., Souza, T. D., & Brown, I. (). Fine-grained censorship mapping: Information sources, legality and ethics. In Proceedings of the USENIX Workshop on Free and Open Communications on the Internet.
Zhao, Y., Ye, J., & Henderson, T. (, December). Privacy-aware location privacy preference recommendations. In Proceedings of the th International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services (pp. –). Brussels, Belgium: ICST. doi:./icst.mobiquitous..
Zhao, Y., Ye, J., & Henderson, T. (, March). The effect of privacy concerns on privacy recommenders. In Proceedings of the st International Conference on Intelligent User Interfaces (pp. –). New York: ACM. doi:./.
Zimmer, M. (, December). “But the data is already public”: On the ethics of research in Facebook. Ethics and Information Technology, (), –. doi:./s---

  ......................................................................................................................

          ......................................................................................................................

 . 

1. Introduction

..................................................................................................................................
As researchers, we are obligated to minimize the risks associated with participating in our research projects. This is especially the case when participants are considered vulnerable. Two recent trends involving vulnerable research participants are quite worrying. The first disturbing trend is researchers who use digital methods to parachute into sexy yet dangerous research environments without understanding the context, unaware that they may be putting consenting and nonconsenting vulnerable participants at risk. The second worrying trend is researchers with experience working with vulnerable participants in difficult research environments who grasp at digital methods as a way to deal with research challenges but do not fully understand the additional risk those methods carry. As a researcher who does both traditional fieldwork and research using digital methods in difficult environments, I have learned, sometimes painfully, how the combination of vulnerability and digital methods can put participants and researchers in greater danger.

The impetus for writing this chapter came from two events. The first occurred at an academic conference, where some “big data” hotshots were bragging about their multi-million-tweet data set collected from a contested and bloody attempted coup in a country they knew little about, in a language they were interpreting through Google Translate. As I sat looking at their visualization, I thought



 . 

about how useful these patterns of associations would be for the regime’s security services. This made me consider my own work on opposition movements’ use of social media in authoritarian regimes, and I contemplated whether my computational analyses might have identified individuals with social media power invisible to the naked eye. And as the months passed and quiet digital attacks on my research and character grew louder, I became more concerned, not only about what the security services knew about my participants from my research, but also about what they knew about me.

The second event was a discussion with a colleague, a noted social scientist and expert on a particular country, about his recent fieldwork trip to a conflict zone where kidnappings of foreigners are not uncommon. His digital security sloppiness shocked me. When I inquired about the security steps he had taken, he replied, “Having a pin code on my smartphone is so annoying.” Digital security is important even in an environment where there is rule of law; carrying such sloppiness into a conflict zone like this is idiotic. I was enraged that he had not taken basic steps to reduce the likelihood of his, and his participants’, information falling into the wrong hands.

Researchers excited about data drawn from authoritarian elections, or a contentious protest in a nondemocratic environment, or a conflict zone need to be more aware of the potentially severe outcomes for nodes like the fictitious @takedownthedictator. She is a real person with loved ones. And while she may have tweeted an incredibly interesting statement worthy of being included on a conference presentation slide, or has such high eigenvector centrality that she merits an entire paragraph in the results section, highlighting her may lead to severe consequences unforeseen by the researchers, or indeed by the “participant” herself. Is another CV line worth @takedownthedictator’s life?
Certainly not, but this is the reality when studying vulnerable people in difficult environments. In this chapter I hope to provide context for this situation and suggestions that may reduce the likelihood that something bad will happen.

On the other side, imagine an experienced researcher who has studied rebel groups for decades. She takes great care to protect her participants and minimize risk. Yet on her most recent field trip she brought her iPad with her to keep up on email and finally catch up on Game of Thrones. And it seems so much easier to record interviews with her iPhone than to bring an audio recording device and deal with getting the files off it. With the Gmail and Dropbox apps loaded on her devices, she lands in her field site and goes through the typical interrogation and search by “customs” agents at the airport. Despite having experienced this dozens of times, and despite excellent local language skills and well-honed gestures of local politeness, her heart always pounds at these moments. As usual, the agents take her bag. And after twenty-five minutes with her iPad and iPhone, the security services have downloaded her email correspondents’, research assistants’, and participants’ information and have put malware on her devices that turns them into mobile surveillance devices, capturing audio, video, and every word typed, including passwords.

This chapter also aims to help researchers like her, and those who view digital methods as a way to deal with some of the challenges associated with such research. They too need to know more about the additional risk associated with digital tools and methods.




The chapter first describes the risks accompanying research in difficult environments, then the risks associated with digital methods, and then the particular, perhaps unexpected, complications that arise when digital tools and methods are used in difficult environments. Finally, a set of recommendations is presented.

2. Risk

..................................................................................................................................
Risk is understood as the probability and magnitude of harm to research participants, as opposed to benefit, which is the probability and magnitude of a positive outcome for research participants (Weijer, ). In the post–Belmont Report era, minimizing and disclosing risk to potential research participants has become obligatory. Risks for participants are categorized in the Belmont Report as (1) physical risk (bodily harm); (2) psychological risk (harm affecting the participant’s perception of self, emotional suffering); (3) social risk (participation in the research, or the findings themselves, may expose participants to some form of social stigmatization); and (4) economic risk (participants bear financial costs for participating) (Weijer, ).

Yet the definition of risk set forth in the Belmont Report leaves much up to the interpretation of the researcher. Specifically, the risk-benefit calculation stipulates that the risk of harm should not be greater than what the individual would “ordinarily encounter in daily life or during the performance of routine physical or psychological examinations or tests” (Weijer, , p. ). But it is entirely at the researcher’s discretion to determine the content, context, and quantity of risk that individuals ordinarily encounter, despite there being great variability in all of these factors (Haggerty, ; Hemming, ; Kopelman, , ; Labott & Johnson, ; Opsal et al., ; Resnik, ; Weijer, ). We know that risk is typically evaluated within the framework of the researcher’s own experiences and understanding of harm, risk, and daily life, which may be quite different from those of the participants, and possibly paternalistic (Hemming, ; Opsal et al., ).
In addition, researchers are in the “tricky position” of predicting outcomes and managing protocols for a host of highly speculative risks for a wide range of individuals participating in research (Haggerty, , p. ).

. D R E

..................................................................................................................................
These challenges of evaluating ordinary risk and predicting outcomes are problematic for all researchers, but they are particularly arduous when the research context is one in which both present and future ordinary life is permeated with risk and uncertainty. “Difficult” research environments are defined here as research settings in which daily life is unpredictable, uncertain, and permeated by fear



 . 

(what Green [], Koch [a], Kovats-Bernat [], and others call a “culture of fear”), and normal existence is rampant with risk, regardless of choices individuals make to avoid or mitigate it. These difficult research sites can include conflict zones, politically repressive environments, authoritarian states, areas where “terrorist” or ideological groups are active, and more. But all have in common the fact that “social relationships and cultural realities are critically modified by the pervasion of fear, threat of force, or (ir)regular application of violence” (Kovats-Bernat, , pp. –). Recommendations made in this chapter could also likely be extended to studies of vulnerable or marginalized populations, activists, social movements, and criminal or stigmatized groups. Unsurprisingly, all of these contexts remain understudied, yet they are critical as objects of study in order to understand the human condition and extend theory under these circumstances, which are undoubtedly of importance to the researcher, their discipline, and people and societies living in such situations. Participants living in such environments are also considered vulnerable because of the status difference between participant and researcher (Macklin, ; Pittaway, Bartolomei, & Hugman, ); the lack of liberties, which may leave them open to exploitation (Loff, Zion, & Gillam, ); and the fact that as participants they are predisposed to additional harm (Kottow, ). They also may require ongoing protection (Levine et al., ). While the exact definition of vulnerability remains debated (Levine et al., ), certainly participants in difficult environments easily meet most definitions of “vulnerable.” In such environments, it is nearly impossible to conduct an a priori risk assessment. 
It is impossible to predict all of the outcomes, and while researchers are well advised to consider the costs for participants (Lee-Treweek & Linkogle, ), they “can no longer predict the full array of likely consequences for those who participate in the fieldwork encounter” (Pottier, Hammond, & Cramer, , p. ). Moreover, “we must remind ourselves daily that some of the things we jot down can mean harassment, imprisonment, exile, torture, or death for our informants or for ourselves and take our notes accordingly” (Kovats-Bernat, , p. ). Sluka () cautions that even the most innocent research can be used by various forces against participants. While this may be obvious to those who work in the field regularly, digital methods scholars new to such environments should take heed. Participants in difficult sites have good reason to fear taking part in research, as it comes with physical, psychological, social, and economic risks (Cohen & Arieli, ; Drake, ; Huggins & Glebbeek, ; Smeltzer, ). Or as Sriram plainly puts it, “individuals may well have put themselves at significant risk to provide the raw material for our books, articles, and indeed our careers” (Sriram, , p. ). It is, after all, the researchers who benefit most from their research (Browne & Moffett, ; Peritore, ; Wamai, ). While Lee (), Peterson (), and Sluka () claim that risk in difficult research environments can be mitigated by careful planning and thoughtful assessment, we know that researchers in such environments engage in creative and unorthodox methods and often assess risk, perhaps not carefully or thoughtfully, in situ in
order to continue their work. Researchers also engage in disutilization, minimizing the potential usefulness of the research for negative purposes (Lee, ) through self-censorship, or publishing in ways or venues that reduce nonscholarly interest in the work (Scott, Miller, & Lloyd, ; Smeltzer, ; Sriram, ; Turner, ). Participants are not the only ones facing risk in difficult environments (Lee, ; Lee-Treweek & Linkogle, ). Researchers experience physical threats and violence, including of a sexual nature (Browne & Moffett, ; Diphoorn, ; Glasius et al., ; Huggins & Glebbeek, ; Kovats-Bernat, ; Lee, ; Lee-Treweek & Linkogle, ; Mertus, ; Paluck, ; Ross, ; Sehgal, ; Skidmore, ; Warden, ; Wong, ); are monitored by security services and other authorities (Cramer et al., ; Dresch, ; Gentile, ; Glasius et al., ; Peritore, ; Peshkova, ; Sowerwine, ; Thogersen & Heimer, ; Thomson, ); are interrogated by security services (Glasius et al., ; Peshkova, ); have their offices and apartments searched (Roberts, ); are arrested (Glasius et al., ); and experience psychological or emotional risk, either directly or indirectly (Glasius et al., ; Lee-Treweek & Linkogle, ), especially from witnessing or studying difficult or traumatic events (Abdullah, ; Browne & Moffett, ; Diphoorn, ; Glasius et al., ; Lee, ; Lee-Treweek & Linkogle, ; Mertus, ; Ross, ; Sehgal, ; Warden, ; Wood, ). The risk to researchers, Sluka () and Williams et al. () argue, is a real issue that demands greater attention.
There are as many reasons why researchers continue to do research in difficult environments as there are researchers (see Romano, ), but many argue that difficult research environments are fetishized or attract thrill seekers (Browne & Moffett, ; Helbardt, Hellmann-Rajanayagam, & Korff, ; Lee, ; Shaw, ; Swedenburg, ; Nelson, ). Yet interestingly, researchers tend not to talk about the challenges and increased risk encountered in difficult sites (Avruch, ; Browne & Moffett, ; Cramer et al., ; Goode, ), out of fear of losing access to the research site or participants, as well as fear of losing credibility with other researchers or participants (Romano, ). There is also a norm of masking emotions in research that may lead some to not wish to disclose their fears (Glasius et al., ; Meyer, ). Others may fear that disclosing the unconventional or unacceptable methods used will impact their careers (Thogersen & Heimer, ). Regardless of the motivation to conduct research in difficult environments, actually conducting the research requires creativity and unorthodox methods. Those interested in studying these sites must be flexible; be innovative; and constantly assess, mitigate, and plan for risk and danger (Kovats-Bernat, ; Lee, ; Peterson, ; Sluka, ). As Kovats-Bernat () explains, “the customary approaches, methods, and ethics of [anthropological] fieldwork are at times insufficient, irrelevant, inapplicable, imprudent, or simply naïve” (pp. –), or as Goode () describes, research in such environments “may prove challenging to conduct, impossible to verify and unlikely to convince a skeptical audience” (pp. –). In particular, researchers working in challenging environments should engage in what Cordner et al. () call reflexive research ethics—“ethical guidelines
and decision-making principles that depend on continual reflexivity concerning the relationships between researchers and participants” (p. )—moving beyond the formal ethics guidelines and acknowledging complex questions of reciprocity and power.

. D M

Some, out of frustration, have suggested that digital methods can be employed to reduce challenges and risks for both participants and researchers in difficult research environments (Duffield, ; Fischer, ; King, ; Mawdsley, ; Unwin, ). Digital methods may seem especially attractive because they can be viewed as a way to avoid one of the largest obstacles in difficult environment research: access to the field. Researchers often have difficulty obtaining visas to travel even when they obscure or conceal their real reasons for travel (Bonnin, ; Browne & Moffett, ; Glasius et al., ; Norman, ; Romano, ; Sowerwine, ; Turner, , ) or pay bribes (Browne & Moffett, ). Logistical challenges such as these contribute to the desirability of digital methods, which hypothetically eliminate the need for a researcher to physically access a field site (Bengtsson, ; Ignacio, ). Yet these proposals are often made without a full understanding of digital methods and the additional risks that they may present. Similarly, there are certainly numerous examples of researchers parachuting (physically or digitally) into difficult research environments without a sense of the context or risk involved. Both of these entryways into combining digital methods and difficult research environments are problematic, yet discussion of the increased risk is scant, although Stockmann () and Glasius et al. () are notable exceptions. Digital methods—either conducting traditional research with digital tools or using digital tools for new research methods like computational analysis—do present new and exciting opportunities for research.
Some frequently mentioned affordances of digital methods are that they allow for faster and sometimes less expensive data collection, with sometimes better samples being acquired than would otherwise be available (Barratt, ; Buchanan, ; Fileborn, ; Hesse-Biber & Griffin, ; Hewson, ; Hewson et al., ; Hope, ; Kivits, ; Morgan & Lobe, ; Rundall, ), and the ability to reach participants who may otherwise not be accessible (Cook, ; Hesse-Biber & Griffin, ). Yet digital methods present new challenges to traditional understandings of ethics (Buchanan, ; Burgess & Bruns, ; Eynon, Fry, & Schroeder, ; Whiteman, ). In particular, traditional ethical guidelines may be insufficient for digital methods (Morison et al., ; Whiteman, ) or at the least problematic (Ackland, ; Buchanan, ; Buchanan & Hvizdak, ; Eynon, Fry, & Schroeder, ; Hesse-Biber, ; Hewson et al., ; Hine, ; Livingstone & Locatelli, ; Whiteman, ). This is complicated by the fact that institutional review boards sometimes do not understand digital methods well enough to evaluate them (Hesse-Biber, ).
And while Whiteman () suggests that resolving the ethical challenges of digital methods will require localized ethical perspectives that incorporate not only the ethics of the academy, but also those of the institution, the researcher, and the researched, concerted efforts to do so are few and far between, and debates about digital methods ethics abound.

. D M  R

Ethical questions regarding risk are particularly challenging. Digital methods “require researchers to consider risks other than those inherent in the data they intend to collect. They also need to identify and address possible technical and administrative problems with potential ethical or legal impacts” (Charlesworth, , p. ); in fact, digital methods may increase risk for participants (Whiteman, ). And while the Association of Internet Researchers (AoIR) ethical guidelines (Markham & Buchanan, ) ask a researcher to consider whether the connection between a participant’s online data and his or her physical person may enable psychological, economic, or physical harm, the same question of defining risk that exists in nondigital research remains, although the AoIR guidelines do acknowledge that harm is defined contextually. There are two main concerns about risk when combining difficult research environments and digital methods: association and identification.

Risk by Association and Rapport

Participants in difficult environments are often reluctant to speak to researchers, in part because of norms of distrusting outsiders but also because researchers have a reputation for being spies (Clark, ; Dresch, ; Fry, ; Helbardt, Hellmann-Rajanayagam, & Korff, ; Jenkins, ; Lee, ; Norman, ; Peritore, ; Roberts, ; Sluka, ). Therefore trust building can be a challenging and onerous task (Amour, ; Browne & Moffett, ; Clark, ; Cohen & Arieli, ; Norman, ). Nonetheless, building trust in difficult environments is even more important than in other research environments (Glasius et al., ; Mertus, ; Paluck, ; Smeltzer, ), although full trust may not be possible, and thus a compromise—a “partial trust” relationship—may be the best that a researcher can hope for (Chakravarty, ). Once a researcher accesses participants, the culture of fear also impacts the research process, as individuals may be reluctant to disclose information and will self-censor (Bell, ; Belousov et al., ; Chakravarty, ; Koch, b). This may not be intentional, but rather an artifact of a norm of public silence among individuals who live in such environments (Koch, b). Paluck () used nonthreatening coded language to avoid forcing participants to talk about risky topics, but this may not be
enough. Moreover, questions remain about the validity of results from such environments (Cohen & Arieli, ; Goode, ; Roberts & Allen, ). Given these challenges, digital methods seem like an attractive way to reduce risks of association and build rapport. In fact, rapport between researcher and participant can be built via mediated communication (Barratt, ). For example, the mediated nature of online interviewing can make the participant feel more at ease when disclosing information (Bowker & Tuffin, ). And although some would argue that it is difficult to demonstrate authenticity in a mediated environment (Willis, ), the ability of participants to see that the researcher has a backstory and an affiliation is an important affordance of the Internet (Klein et al., ). Kendzior () presents an interesting case. Her research with citizens of an authoritarian state took place mostly online, which meant that she was occasionally believed to be a creature of the local security services or the US government, but communicating with participants via Facebook allowed her to gain trust:

[I]n , I found a new way to prove to [authoritarian state’s] dissidents that I was not a regime agent but a nice girl from [state] with good intentions. I joined Facebook . . . . On my Facebook page, [authoritarian state’s] dissidents whom I have never met are able to see my photos and read posts from me, my friends and my family alluding to the biographical details I have given them. I really am a graduate student, I really live in [city], I really have a son and a daughter. I really am an American who can speak [their language] and who is interested in [their region’s] political affairs. If my identity is a lie, it is an elaborate construction years in the making. Facebook is my character witness. (p. )

However, befriending participants digitally can sometimes be problematic; in particular, it can add complications to the researcher-participant relationship (Robards, ).

Data Security When Introducing Digital Methods: Get Informed

When working in difficult research environments, researchers should maintain strict data security practices; as Peritore () suggests, when it comes to data security in difficult environments, they should assume a worst-case scenario. It is not uncommon for researchers to be monitored by security services and other authorities (Cramer et al., ; Dresch, ; Gentile, ; Glasius et al., ; Peritore, ; Sowerwine, ; Thomson, ; Thogersen & Heimer, ). While Peritore suggested assuming the worst case in the late s with traditional paper data in mind, his recommendation stands in the twenty-first century. Given the physical and electronic surveillance norms in many difficult environments, maintaining confidentiality and practicing good data security are challenging (Browne & Moffett, ; Eynon, Fry, & Schroeder, ; Glasius et al., ; Kunnath, ; Roberts, ; Wood, ). Written field
notes are problematic (Obbo, ), but audio and video recordings may be worse (Kovats-Bernat, ), because written field notes provide obscurity and deniability in a way that audio or video does not. Many researchers who work in difficult research environments are likely already familiar with traditional data security measures but may not be as familiar with digital data security. Digital methods complicate data security because digitally transferring or storing data, which can be more easily compromised, is riskier than other means of transferring and storing data (Ackland, ; Buchanan, ; Buchanan & Hvizdak, ; Hewson et al., ; Lomborg & Bechmann, ). So while meeting digitally may seem less risky than meeting physically when conducting research in a difficult environment (Rundall, ), digital meeting does not eliminate risk and may in fact increase it because of how easily it can be compromised. It is quite simple for an interested party to use digital surveillance tools to discover associations between individuals and to monitor email, short message service (SMS) communication, and even Skype calls. For example, it is well established that authoritarian regimes monitor their citizens electronically (Pearce, ; Youmans & York, ), and it would not be out of the question for digital communication between a participant and researcher to be discovered this way. (Glasius et al. () document one such case at length.) And while the AoIR ethics guidelines suggest that data be stored securely and that unanticipated breaches be considered, it is impossible to anticipate the future of electronic monitoring. Given the ease and efficiency of present-day monitoring, researchers should assume the worst.

What Is to Be Done: In the Field

When researchers want to go into the field, regardless of the methods they will use, they must investigate the most up-to-date digital and data security techniques, devoting special attention to the region of the world in which they are working. As Driscoll () described it, “I decided that if I was going to continue operating in an unfriendly authoritarian environment, I needed to adapt. I quickly educated myself about how internet servers work. I stopped assuming my email communications were private. For important topics I began to rely upon pen and paper” (p. ). This is not optional. In the twenty-first century, any researcher is going to have a phone and a laptop, which need to be secured. But recommendations and IT department assistance about security in one’s home country environment are likely insufficient. Guides for activists and journalists may be the most useful way to familiarize oneself with basic techniques, but note that security concerns change frequently, and practices that worked a few months earlier may be insufficient. The Electronic Frontier Foundation has a good basic guide (https://ssd.eff.org/), from , but it is impossible to know if it will continue to be updated as technologies change. Talking to experts regularly will allow a researcher to learn about the latest threats and techniques. Stanford’s Liberation Technology listserv (https://mailman.stanford.edu/mailman/listinfo/liberationtech) is a good starting point to find
regional digital security experts. However, by and large, journalists are the best allies. They have similar security concerns and have networks of advice from other journalists working in difficult environments. Some general guidelines are provided here. Regarding the devices themselves, while in the field one should carry one’s laptop and other devices (phone, tablet) on one’s person at all times. While this is onerous, security services are known to search or monitor researchers’ offices and homes, and even internationally known chain hotels’ room safes are not secure. (Glasius et al. () discuss one such case at length.) If one’s device is stolen or lost, having a “remote wipe” feature enabled allows one to remotely delete the hard drive contents. Both Android and Apple mobile devices have these features preinstalled, but one should be familiar with the process before entering the field. Computers are another matter; “remote wipe” is not natively installed, and one would need to purchase a program to do this. However, even with a “remote wipe” feature, researchers should consider encrypting computer and mobile device hard drives so that the data are inaccessible. (Although, as Glasius et al. () note, this is subject to local law; in Kazakhstan, the use of encryption is subject to legal restrictions and would potentially draw attention to the researcher.) Android’s and Apple’s mobile operating systems come with encryption, and there are many options for encrypting a Windows or Apple laptop. Similarly, one may want to consider using a more secure operating system than Windows or Apple’s macOS. While changing to an entirely different operating system can be challenging, the increased security may be worth it. As of this writing (March ), Tails (https://tails.boum.org/) is a popular option for an alternative, more secure operating system, but it is likely to require some effort to learn how to use.
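A related precaution when installing an alternative operating system such as Tails: verify the downloaded image against the digest the project publishes before installing it, so that a tampered or corrupted download is caught. A minimal sketch using only Python’s standard library; the file path and digest here are placeholders, not real values:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute a file's SHA-256 digest without loading it all into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, published_digest):
    """Compare a downloaded image against the digest published by the project."""
    return sha256_of(path) == published_digest.strip().lower()


# Hypothetical usage: the path and digest below are illustrative placeholders.
# verify_download("tails.img", "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b")
```

If the digests differ, the download should be discarded rather than installed.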
Even with these security measures, it is good practice, before crossing a border and after arriving back in one’s home country, to reformat all devices and computers to factory settings. The surveillance programs commonly used by authoritarian regimes do not appear in virus scans, so an infected device will not appear to be infected. The University of Toronto’s cybersecurity research group, The Citizen Lab (https://citizenlab.org/), has documented many cases of such infections. Reformatting is the only way to remove the virus. Glasius et al. () suggest having two laptops, one for the Internet and another for any research-related materials, noting that this is particularly useful for keeping participants’ contact information separate from interview notes. In the field or at home, having strong passwords is important. Password management systems like Pass can help to create more secure passwords. More important, though, is that access to one’s accounts can be made more difficult through two-step authentication, in which one must verify one’s identity via an SMS text message or other means. This should be enabled for all Internet accounts: Google, Facebook, Twitter, Dropbox, and so forth. This will reduce the likelihood of one’s accounts being compromised. When connecting to the Internet in the field, one should use a virtual private network (VPN) on a computer, tablet, or mobile device. A VPN securely accesses the Internet, essentially telling the device that it is not in country X, but is instead located in a location of the researcher’s (or the company’s) choosing, such as Dallas, Texas
(A guide to how VPNs work: http://gizmodo.com//vpns-what-they-do-how-they-work-and-why-youre-dumb-for-not-using-one). This will slow down the Internet connection but is an essential safety step. While there are free VPN services, a paid VPN is likely to be faster and more reliable, with technical support and customer service available. Prices start at $US per month, as of March . Configure and test the VPN before going into the field. Communication with participants, such as arranging a meeting, should take place over the most secure channel available (although this may be burdensome for participants), via diverse channels, and potentially via a third party. Once a meeting has been arranged, video recordings are discouraged because of the ease of identification. Audio recordings are better, but a best practice may be to conduct an audio recording, immediately upload the audio file to the cloud, unlink the recording device (mobile device or otherwise) from the cloud, and delete the audio file from the device. Some researchers do not use audio recording at all in order to best protect participants (Glasius et al., ).
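On the strong-password recommendation above: a password manager will generate these for you, but for illustration, here is a minimal sketch of generating high-entropy passwords and diceware-style passphrases with Python’s standard `secrets` module, which is designed for cryptographic randomness. The short word list is a placeholder; a real diceware list contains thousands of words:

```python
import secrets
import string


def random_password(length=20):
    """Draw each character independently from letters, digits, and punctuation."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))


def random_passphrase(wordlist, n_words=6):
    """Join randomly chosen words: easier to memorize, strong if the list is large."""
    return " ".join(secrets.choice(wordlist) for _ in range(n_words))


# Illustrative placeholder word list; use a full diceware list in practice.
words = ["correct", "horse", "battery", "staple", "orbit", "velvet"]
print(random_password(24))
print(random_passphrase(words))
```

The design point is to use `secrets` rather than the `random` module, whose generator is predictable and unsuitable for credentials.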

Data Security When Entering New Spaces: Identification

Data security matters because sloppiness can increase the likelihood of participants’ being identified, thus putting them at greater risk. This is true for traditional research, as described above, but even more so when using digital methods. For example, qualitative studies that analyze content posted online, especially if the content is directly quoted, can easily lead to a participant’s identity being known (Dawson, ; Roberts, ; Trevisan & Reilly, ; Whiteman, ). While identification because of such sloppiness is unfortunate in any environment, in difficult research environments identification can lead to severe consequences for consenting participants or for nonconsenting “participants” who merely were a byte of data in a massive download. Thus, researchers excited to download all the tweets from a protest in an authoritarian regime, or to analyze a Tumblr page of transgender individuals in Iran in order to test hypotheses and write an amazing conference paper, need to be better informed about the far-reaching outcomes of their research. Such research may seem low risk, but as with any unobtrusive research, it brings with it questions of informed consent (Ackland, ; Hewson et al., ; Buchanan, ; Buchanan & Hvizdak, ; Charlesworth, ; Eynon, Fry, & Schroeder, ; Hine, ; Livingstone & Locatelli, ; Lomborg & Bechmann, ; Stevens, O’Donnell, & Williams, ; Whiteman, ). Individuals are not actively consenting to be studied, are not informed of risks and benefits, and cannot withdraw themselves (Ackland, ; Buchanan, ; Oboler, Welsh, & Cruz, ). Moreover, the question of publicness versus privateness of online spaces contributes to the consent issue.
Although social media are “public,” individuals often do not treat these spaces as such (Ackland, ; Buchanan, ; Markham & Buchanan, ; Roberts, ; Stevens, O’Donnell, & Williams, ; Trevisan & Reilly, ; Whiteman, ). And as the intentionality of the content creator is sometimes used as a determinant of the publicness of the data
(Zimmer, ), one must ask if the creator considered that his or her tweet would appear on a PowerPoint slide at an academic conference or in a journal article. Increased identifiability is especially problematic with the use of digital trace data, evidence of human activity that is logged and stored digitally (Howison, Wiggins, & Crowston, ). Use of digital trace data with computational methods, social network analysis, and linguistic or sentiment analysis similarly allows for greater identifiability of participants (Whiteman, ). This is so in part because analyses of digital trace data rarely anonymize, and anonymizing may in fact reduce the utility of the analysis (Ackland, ; Kadushin, ; Saunders, Kitzinger, & Kitzinger, b). And while computational analyses of digital trace data allow for patterns of association beyond the visible spectrum of traditional analyses (Cioffi-Revilla, ; Lazer et al., ; Kennedy & Moss ; Neuhaus & Webmoor, ), illuminating such associations adds additional risk because one is understanding individuals within networks, and research about an individual has implications for all of that person’s connections (Kadushin, ). There is an additional set of challenges associated with computational methods for more vulnerable people (Madden, Gilman, Levy, & Marwick, ). There are also questions of validity of data and findings from digital methods (Hesse-Biber & Griffin, ), particularly when the application programming interfaces (APIs) used to access data are imperfect and constrained (Bruns & Burgess, ; Burgess & Bruns, ; Puschmann & Burgess, ), as is the case in difficult environments. For example, Zeng, Burgess, and Bruns () describe the challenges associated with working with data that are likely censored on the Chinese social networking site Weibo. Moreover, they discuss how much of the public timeline data is actually released via the API.

What Is to Be Done: With Digital Methods

For research that does include informed consent, researchers need to consider that digital methods threaten confidentiality. As a way to better inform participants, Saunders, Kitzinger, and Kitzinger (a, b) have changed their consent form to let participants know that it is possible that links between their research interviews and online presences could be made. During data collection, communication with participants should take place on the most secure channel possible. A researcher must determine what that is for the particular country and which tools are the most current; working with journalists and the Liberation Technology listserv would be useful for this. Greater efforts at anonymization need to be made; be especially careful to anonymize field notes and written outputs. Markham () suggests fabricating composite participants to protect privacy. For unobtrusive research, if one wishes to analyze digital trace data, one should consider anonymizing the data and not submitting the raw data files to any repository.
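One way to act on the anonymization recommendation above without destroying the analytic value of trace data is keyed pseudonymization. A sketch, assuming usernames as identifiers: HMAC-SHA-256 with a secret key maps each identifier to a stable pseudonym, so ties between accounts (and thus network structure) survive, but the mapping cannot be reconstructed by hashing candidate usernames unless one holds the key, which should be stored separately and destroyed when no longer needed. The key and edge list below are hypothetical examples:

```python
import hashlib
import hmac


def pseudonymize(identifier, key):
    """Map an identifier to a stable, non-reversible pseudonym.

    The same identifier always yields the same pseudonym under the same key,
    so repeated appearances of an account remain linkable in the anonymized
    data; without the key, the mapping cannot be recomputed.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:12]


# Hypothetical example: anonymizing a small edge list of interactions.
key = b"replace-with-a-randomly-generated-secret"
edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]
anonymized = [(pseudonymize(a, key), pseudonymize(b, key)) for a, b in edges]
```

A plain unsalted hash would not suffice here, since anyone could hash common usernames and recover identities; the secret key is what blocks that re-identification route.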

Conclusion

Given the tremendous risks that participants and researchers in difficult environments already face as a result of participating, consensually or not, in research, any method that may increase that risk should be approached with caution. As such, a set of best practices has been presented here to help researchers attempting to use digital methods in difficult environments. This is not the first consideration of the amplification of ethical dilemmas in digital research. Livingstone and Locatelli () find similarly magnified ethical challenges when using digital methods to study youth, a vulnerable group from whom consent is challenging to obtain. Trevisan and Reilly’s () study of digital disability activism similarly presents the amplified ethical challenges of using digital methods with a (perceived) vulnerable population. Ignacio () presents a similar paradox in using digital methods to study diasporic communities: it seems logical to use digital methods to study those spread throughout the world, but this may not be the “breakthrough” that it appears, because of the impact on sampling and generalizability, the performative nature of social media, and the loss of context. Yet this discussion of difficult research environments adds to our knowledge, as risk in this case is a question not merely of the vulnerability of participants, but of both researched and researcher. Certainly conducting research in difficult environments is a (perhaps) necessary but problematic endeavor. And while there is no replacement for on-the-ground research, digital methods could help mitigate some of the challenges faced by researchers. But the opportunities presented by digital methods also increase risk for both the researcher and participants.
Taking precautions like those detailed in this chapter is time consuming and therefore slows down the research process. But compared to the severe potential consequences for both researchers and participants, doing anything possible to not increase risk is essential. One must truly consider whether another CV line is worth blood on one’s hands. Digital methods give the illusion of safety but may in fact be loading bullets in a gun.

R Abdullah, N. (). On the “vulnerability” of the social researcher: Observations from the field of spirit interference.” In ISAE Symposium for Sociology (pp. –). http://www.isasociology.org/uploads/files/EBul-Dec-Abdullah.pdf Ackland, R. (). Web social science. Thousand Oaks, CA: Sage. Amour, P. O. (). Practical, theoretical, and methodological challenges of field research in the Middle East. Historical Methods: A Journal of Quantitative and Interdisciplinary History, (), –. doi:./.. Avruch, K. (). Notes toward ethnographies of conflict and violence. Journal of Contemporary Ethnography, (), –. doi:./
Barratt, M. J. (). The efficacy of interviewing young drug users through online chat. Drug and Alcohol Review, (), –. doi:./j.-...x
Bell, K. (). Doing qualitative fieldwork in Cuba: Social research in politically sensitive locations. International Journal of Social Research Methodology, (), –. doi:./..
Belousov, K., Horlick-Jones, T., Bloor, M., Gilinskiy, Y., Golbert, V., Kostikovsky, Y., Levi, M., & Pentsov, D. (). Any port in a storm: Fieldwork difficulties in dangerous and crisis-ridden settings. Qualitative Research, (), –. doi:./
Bengtsson, S. (). Faraway, so close! Proximity and distance in ethnography online. Media, Culture & Society, (), –. doi:./
Bonnin, C. (). Navigating fieldwork politics, practicalities and ethics in the upland borderlands of Northern Vietnam. Asia Pacific Viewpoint, (), –. doi:./j.-...x
Bowker, N., & Tuffin, K. (). Using the online medium for discursive research about people with disabilities. Social Science Computer Review, (), –. doi:./
Browne, B., & Moffett, L. (). Finding your feet in the field: Critical reflections of early career researchers on field research in transitional societies. Journal of Human Rights Practice, (), –. doi:./jhuman/huu
Bruns, A., & Burgess, J. E. (). Methodological innovation in precarious spaces: The case of Twitter. In H. Snee, C. Hine, Y. Morey, S. Roberts, & H. Watson (Eds.), Digital methods for social science: An interdisciplinary guide to research innovation (pp. –). New York: Palgrave Macmillan.
Buchanan, E. A. (). Internet research ethics: Past, present, and future. In C. Ess & M. Consalvo (Eds.), The handbook of Internet studies (pp. –). Malden, MA: Wiley-Blackwell.
Buchanan, E. A., & Hvizdak, E. E. (). Online survey tools: Ethical and methodological concerns of human research ethics committees. Journal of Empirical Research on Human Research Ethics, (), –. doi:./jer....
Burgess, J. E., & Bruns, A. (). Easy data, hard data: The politics and pragmatics of Twitter research after the computational turn. In G. Langlois, J. Redden, & G. Elmer (Eds.), Compromised data: From social media to big data (pp. –). London: Bloomsbury Publishing.
Chakravarty, A. (). "Partially trusting" field relationships: Opportunities and constraints of fieldwork in Rwanda's postconflict setting. Field Methods, (), –. doi:./X
Charlesworth, A. (). Data protection, freedom of information and ethical review committees. Information, Communication & Society, (), –. doi:./X..
Cioffi-Revilla, C. (). Computational social science. Wiley Interdisciplinary Reviews: Computational Statistics, (), –. doi:./wics.
Clark, J. A. (). Field research methods in the Middle East. PS: Political Science & Politics, (), –. doi:./S
Cohen, N., & Arieli, T. (). Field research in conflict environments: Methodological challenges and snowball sampling. Journal of Peace Research, (), –. doi:./
Cook, C. (). Email interviewing: Generating data with a vulnerable population. Journal of Advanced Nursing, (), –. doi:./j.-...x
Cordner, A., Ciplet, D., Brown, P., & Morello-Frosch, R. (). Reflexive research ethics for environmental health and justice: Academics and movement building. Social Movement Studies, (), –. doi:./..
Cramer, C., Johnston, D., Oya, C., & Sender, J. (). Mistakes, crises, and research independence: The perils of fieldwork as a form of evidence. African Affairs, (), adv. doi:./afraf/adv
Dawson, P. (). Our anonymous online research participants are not always anonymous: Is this a problem? British Journal of Educational Technology, (), –. doi:./bjet.
Diphoorn, T. (). The emotionality of participation: Various modes of participation in ethnographic fieldwork on private policing in Durban, South Africa. Journal of Contemporary Ethnography, (), –. doi:./
Drake, G. (). The ethical and methodological challenges of social work research with participants who fear retribution: To "do no harm." Qualitative Social Work, (), –. doi:./
Dresch, P. (). Wilderness of mirrors: Truth and vulnerability in Middle Eastern fieldwork. In P. Dresch, W. James, & D. Parkin (Eds.), Anthropologists in a wider world (pp. –). New York: Berghahn Books.
Driscoll, J. (). Can anonymity promises possibly be credible in police states? Comparative Politics Newsletter: Comparative Politics of the American Political Science Association, –.
Duffield, M. (). From immersion to simulation: Remote methodologies and the decline of area studies. Review of African Political Economy, (supp.), S–S. doi:./..
Eynon, R., Fry, J., & Schroeder, R. (). The ethics of Internet research. In N. G. Fielding, R. N. Lee, & G. Blank (Eds.), The Sage handbook of online research methods (pp. –). Thousand Oaks, CA: Sage.
Fileborn, B. (). Participant recruitment in an online era: A reflection on ethics and identity. Research Ethics, (), –. http://doi.org/./
Fischer, G. R. (). Digital Middle Eastern studies: Challenges, ethics, and the digital humanities. University of Texas, Austin. Retrieved from https://repositories.lib.utexas.edu/handle//
Fry, L. J. (). Spies like us? Respondent perceptions of research sponsors in  African countries. International Journal of Modern Anthropology, (), . doi:./ijma.vi.
Gentile, M. (). Meeting the "organs": The tacit dilemma of field research in authoritarian states. Area, (), –. doi:./area.
Glasius, M., de Lange, M., Bartman, J., Dalmasso, E., Lv, A., Del Sordi, A., . . . Ruijgrok, K. (). Research, ethics and risk in the authoritarian field. Cham: Springer International Publishing. https://doi.org/./----
Goode, J. P. (). Redefining Russia: Hybrid regimes, fieldwork, and Russian politics. Perspectives on Politics, (), –. doi:./SX
Green, L. (). Fear as a way of life. Cultural Anthropology, (), –. doi:./can....a
Haggerty, K. D. (). Ethics creep: Governing social science research in the name of ethics. Qualitative Sociology, (), –. doi:./B:QUAS...a
Helbardt, S., Hellmann-Rajanayagam, D., & Korff, R. (). War's dark glamour: Ethics of research in war and conflict zones. Cambridge Review of International Affairs, (), –. doi:./
Hemming, J. (). Exceeding scholarly responsibility: IRBs and political constraints. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Hesse-Biber, S. (). Emergent technologies in social research: Pushing against the boundaries of research praxis. In S. Hesse-Biber (Ed.), The handbook of emergent technologies in social research (pp. –). Oxford: Oxford University Press.
Hesse-Biber, S., & Griffin, A. J. (). Internet-mediated technologies and mixed methods research: Problems and prospects. Journal of Mixed Methods Research, (), –. doi:./
Hewson, C. (). Qualitative approaches in Internet-mediated research: Opportunities, issues, possibilities. In P. Leavy (Ed.), The Oxford handbook of qualitative research (pp. –). Oxford: Oxford University Press.
Hewson, C., Yule, P., Laurent, D., & Vogel, C. (). Internet research methods: A practical guide for the social and behavioural sciences. Thousand Oaks, CA: Sage.
Hine, C. (). Ethnography for the Internet: Embedded, embodied and everyday. London: Bloomsbury Publishing.
Hope, J. (). Mixing modes to widen research participation. In H. Snee, C. Hine, Y. Morey, S. Roberts, & H. Watson (Eds.), Digital methods for social science: An interdisciplinary guide to research innovation (pp. –). London: Palgrave Macmillan.
Howison, J., Wiggins, A., & Crowston, K. (). Validity issues in the use of social network analysis with digital trace data. Journal of the Association for Information Systems, (), –.
Huggins, M. K., & Glebbeek, M.-L. (). Introduction: Similarities among differences. In M. K. Huggins & M.-L. Glebbeek (Eds.), Women fielding danger (pp. –). Lanham, MD: Rowman & Littlefield.
Ignacio, E. N. (). Online methods and analyzing knowledge-production: A cautionary tale. Qualitative Inquiry, (), –. doi:./
Jenkins, S. A. (). Assistants, guides, collaborators, friends: The concealed figures of conflict research. Journal of Contemporary Ethnography (December), . doi:./
Kadushin, C. (). Understanding social networks: Theories, concepts, and findings. Oxford: Oxford University Press.
Kendzior, S. (). The Uzbek opposition in exile: Diaspora and dissident politics in the digital age. Washington University, St. Louis.
Kennedy, H., & Moss, G. (). Known or knowing publics? Social media data mining and the question of public agency. Big Data & Society, (), . doi:./
King, J. C. (). Demystifying field research. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Kivits, J. (). Online interviewing and the research relationship. In C. Hine (Ed.), Virtual methods: Issues in social research on the Internet (pp. –). New York: Berg.
Klein, H., Lambing, T. P., Moskowitz, D. A., Washington, T. A., & Gilbert, L. A. (). Recommendations for performing Internet-based research on sensitive subject matter with "hidden" or difficult-to-reach populations. Journal of Gay & Lesbian Social Services, (), –. doi:./..
Koch, N. (a). Introduction—field methods in "closed contexts": Undertaking research in authoritarian states and places. Area, (), –. doi:./area.
Koch, N. (b). Technologising the opinion: Focus groups, performance and free speech. Area, (), –. doi:./area.
Kopelman, L. M. (). Moral problems in assessing research risk. IRB: Ethics and Human Research, (), . doi:./
Kopelman, L. M. (). Minimal risk as an international ethical standard in research. The Journal of Medicine and Philosophy, (), –. doi:./
Kottow, M. H. (). The vulnerable and the susceptible. Bioethics, (–), –. doi:./-.
Kovats-Bernat, J. C. (). Negotiating dangerous fields: Pragmatic strategies for fieldwork amid violence and terror. American Anthropologist, (), –. doi:./aa....
Kunnath, G. J. (). Anthropology's ethical dilemmas: Reflections from the Maoist fields of India. Current Anthropology, (). doi:./
Labott, S. M., & Johnson, T. P. (). Psychological and social risks of behavioral research. IRB: Ethics and Human Research, (), . doi:./
Lazer, D., Pentland, A., Adamic, L. A., Aral, S., Barabasi, A.-L., Brewer, D., . . . Van Alstyne, M. (). Social science: Computational social science. Science, (), –. doi:./science.
Lee, R. M. (). Dangerous fieldwork. Thousand Oaks, CA: Sage.
Lee-Treweek, G., & Linkogle, S. (). Putting danger in the frame. In G. Lee-Treweek & S. Linkogle (Eds.), Danger in the field: Risk and ethics in social research (pp. –). New York: Routledge.
Levine, C., Faden, R., Grady, C., Hammerschmidt, D., Eckenwiler, L., & Sugarman, J. (). The limitations of "vulnerability" as a protection for human research participants. The American Journal of Bioethics, (), –. doi:./
Livingstone, S., & Locatelli, E. (). Ethical dilemmas in qualitative research with youth on/offline. International Journal of Learning and Media, (), –. doi:./IJLM_a_
Loff, B., Zion, D., & Gillam, L. (). The declaration of Helsinki, CIOMS and the ethics of research on vulnerable populations. Nature Medicine, (), –. doi:./
Lomborg, S., & Bechmann, A. (). Using APIs for data collection on social media. The Information Society, (), –. doi:./..
Macklin, R. (). Bioethics, vulnerability, and protection. Bioethics, (–), –. doi:./-.
Madden, M., Gilman, M. E., Levy, K. E., & Marwick, A. E. (). Privacy, poverty and big data: A matrix of vulnerabilities for poor Americans. Washington University Law Review, (), –.
Markham, A. N. (). Fabrication as ethical practice. Information, Communication & Society, (), –. doi:./X..
Markham, A. N., & Buchanan, E. A. (). Ethical decision-making and Internet research: Recommendations from the AoIR ethics working committee (version .). http://www.aoir.org/reports/ethics.pdf
Mawdsley, E. (). Using the World Wide Web for development research. In V. Desai & R. B. Potter (Eds.), Doing development research (pp. –). Thousand Oaks, CA: Sage.
Mertus, J. A. (). Maintenance of personal security: Ethical and operational issues. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Meyer, S. D. (). From horror story to manageable risk: Formulating safety strategies for peace researchers. University of Tromsø. http://www.ub.uit.no/munin/handle//
Morgan, D. L., & Lobe, B. (). Online focus groups. In S. Hesse-Biber (Ed.), The handbook of emergent technologies in social research (pp. –). Oxford: Oxford University Press.
Morison, T., Gibson, A. F., Wigginton, B., & Crabb, S. (). Online research methods in psychology: Methodological opportunities for critical qualitative research. Qualitative Research in Psychology, (), –. doi:./..
Nelson, I. L. (). The allure and privileging of danger over everyday practice in field research. Area, (), –. doi:./area.
Neuhaus, F., & Webmoor, T. (). Agile ethics for massified research and visualization. Information, Communication & Society, (), –. doi:./X..
Norman, J. M. (). Got trust? The challenge of gaining access in conflict zones. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Obbo, C. (). Adventures with fieldnotes. In R. Sanjek (Ed.), Fieldnotes: The making of anthropology (pp. –). Ithaca, NY: Cornell University Press.
Oboler, A., Welsh, K., & Cruz, L. (). The danger of big data: Social media as computational social science. First Monday, (). Retrieved from http://firstmonday.org/ojs/index.php/fm/article/view//
Opsal, T., Wolgemuth, J., Cross, J., Kaanta, T., Dickmann, E., Colomer, S., & Erdil-Moody, Z. (). "There are no known benefits . . . ": Considering the risk/benefit ratio of qualitative research. Qualitative Health Research, (), –. doi:./
Paluck, E. L. (). Methods and ethics with research teams and NGOs: Comparing experiences across the border of Rwanda and Democratic Republic of Congo. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Pearce, K. E. (). Democratizing Kompromat: The affordances of social media for state-sponsored harassment. Information, Communication & Society, (), –. doi:./X..
Peritore, N. P. (). Reflections on dangerous fieldwork. The American Sociologist, (), –. doi:./BF
Peshkova, S. (). Women, Islam, and identity. Syracuse: Syracuse University Press.
Peterson, J. D. (). Sheer foolishness: Shifting definitions of danger in conducting and teaching ethnographic field research. In G. Lee-Treweek & S. Linkogle (Eds.), Danger in the field: Risk and ethics in social research (pp. –). New York: Routledge.
Pittaway, E., Bartolomei, L., & Hugman, R. (). "Stop stealing our stories": The ethics of research with vulnerable groups. Journal of Human Rights Practice, (), –. doi:./jhuman/huq
Pottier, J., Hammond, L., & Cramer, C. (). Navigating the terrain of methods and ethics in conflict research. In C. Cramer, L. Hammond, & J. Pottier (Eds.), Researching violence in Africa: Ethical and methodological challenges (pp. –). Leiden and Boston: Brill.
Puschmann, C., & Burgess, J. E. (). The politics of Twitter data. In K. Weller, A. Bruns, J. Burgess, C. Puschmann, & M. Mahrt (Eds.), Twitter and society (pp. –). New York: Peter Lang.
Resnik, D. B. (). Eliminating the daily life risks standard from the definition of minimal risk. Journal of Medical Ethics, (), –. doi:./jme.
Robards, B. (). Friending participants: Managing the researcher-participant relationship on social network sites. Young, (), –. doi:./
Roberts, L. D. (). Ethical issues in conducting qualitative research in online communities. Qualitative Research in Psychology, (), –. doi:./..
Roberts, L. D., & Allen, P. J. (). Exploring ethical issues associated with using online surveys in educational research. Educational Research and Evaluation, (), –. doi:./..
Roberts, S. P. (). Research in challenging environments: The case of Russia's "managed democracy." Qualitative Research, (), –. doi:./
Romano, D. (). Conducting research in the Middle East's conflict zones. PS: Political Science & Politics, (), –. doi:./S
Ross, A. (). Impact on research of security-seeking behaviour. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Rundall, E. (). Key ethical considerations which inform the use of anonymous asynchronous websurveys in "sensitive" research. In J. MacClancy & A. Fuentes (Eds.), Ethics in the field: Contemporary challenges (pp. –). New York: Berghahn Books.
Saunders, B., Kitzinger, J., & Kitzinger, C. (a). Anonymising interview data: Challenges and compromise in practice. Qualitative Research, (), –. doi:./
Saunders, B., Kitzinger, J., & Kitzinger, C. (b). Participant anonymity in the Internet age: From theory to practice. Qualitative Research in Psychology, (), –. doi:./..
Scott, S., Miller, F., & Lloyd, K. (). Doing fieldwork in development geography: Research culture and research spaces in Vietnam. Geographical Research, (), –. doi:./j.-...x
Sehgal, M. (). The veiled feminist ethnographer: Fieldwork among women of India's Hindu right. In M. K. Huggins & M.-L. Glebbeek (Eds.), Women fielding danger (pp. –). Lanham, MD: Rowman & Littlefield.
Shaw, W. S. (). Researcher journeying and the adventure/danger impulse. Area, (), –. doi:./j.-...x
Skidmore, M. (). Secrecy and trust in the affective field: Conducting fieldwork in Burma. In M. K. Huggins & M.-L. Glebbeek (Eds.), Women fielding danger (pp. –). Lanham, MD: Rowman & Littlefield.
Sluka, J. A. (). Participant observation in violent social contexts. Human Organization, (), –. Retrieved from http://sfaa.metapress.com/content/h/
Smeltzer, S. (). Asking tough questions: The ethics of studying activism in democratically restricted environments. Social Movement Studies, (), –. doi:./..
Sowerwine, J. (). Socialist rules and postwar politics: Reflections on nationality and fieldwork among the Yao in Northern Vietnam. In S. Turner (Ed.), Red stamps and gold stars: Fieldwork dilemmas in upland socialist Asia (pp. –). Vancouver, BC: NIAS Press.
Sriram, C. L. (). Maintenance of standards of protection during write-up and publication. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Stevens, G., O'Donnell, V. L., & Williams, L. (). Public domain or private data? Developing an ethical approach to social media research in an inter-disciplinary project. Educational Research and Evaluation, (), –. doi:./...
Stockmann, D. (). Towards area-smart data science: Critical questions for working with big data from China. SSRN Electronic Journal (January). doi:./ssrn. Retrieved from http://papers.ssrn.com/abstract=
Swedenburg, T. (). With Genet in the Palestinian field. In C. Nordstrom & A. C. G. M. Robben (Eds.), Fieldwork under fire: Contemporary studies of violence and survival (pp. –). Berkeley: University of California Press.
Thogersen, S., & Heimer, M. (). Introduction. In Doing fieldwork in China (pp. –). Honolulu: University of Hawaii Press.
Thomson, S. M. (). "That is not what we authorised you to do . . . ": Access and government interference in highly politicised research environments. In C. L. Sriram, J. C. King, J. A. Mertus, O. Martin-Ortega, & J. Herman (Eds.), Surviving field research: Working in violent and difficult situations (pp. –). New York: Routledge.
Trevisan, F., & Reilly, P. (). Ethical dilemmas in researching sensitive issues online: Lessons from the study of British disability dissent networks. Information, Communication & Society, (), –. doi:./X..
Turner, S. (). Red stamps and green tea: Fieldwork negotiations and dilemmas in the Sino-Vietnamese borderlands. Area, (), –. doi:./area.
Turner, S. (). Dilemmas and detours: Fieldwork with ethnic minorities in upland southwest China, Vietnam, and Laos. In S. Turner (Ed.), Red stamps and gold stars: Fieldwork dilemmas in upland socialist Asia (pp. –). Vancouver, BC: NIAS Press.
Unwin, T. (). Doing development research "at home". In V. Desai & R. B. Potter (Eds.), Doing development research (pp. –). Thousand Oaks, CA: Sage.
Wamai, N. (). First contact with the field: Experiences of an early career researcher in the context of national and international politics in Kenya. Journal of Human Rights Practice, (), –. doi:./jhuman/huu
Warden, T. (). Feet of clay: Confronting emotional challenges in ethnographic experience. Journal of Organizational Ethnography, (), –. doi:./JOE---
Weijer, C. (). The ethical analysis of risk. The Journal of Law, Medicine & Ethics, (), –. doi:./j.-X..tb.x
Whiteman, N. (). Undoing ethics: Rethinking practice in online research. Boston: Springer. doi:./----
Williams, T., Dunlap, E., Johnson, B. D., & Hamid, A. (). Personal safety in dangerous places. Journal of Contemporary Ethnography, (), –. doi:./
Willis, P. (). Talking sexuality online—Technical, methodological and ethical considerations of online research with sexual minority youth. Qualitative Social Work, (), –. doi:./
Wong, R. W. Y. (). A note on fieldwork in "dangerous" circumstances: Interviewing illegal tiger skin suppliers and traders in Lhasa. International Journal of Social Research Methodology, (), –. http://doi.org/./..
Wood, E. J. (). The ethical challenges of field research in conflict zones. Qualitative Sociology, (), –. doi:./s---
Youmans, W. L., & York, J. C. (). Social media and the activist toolkit: User agreements, corporate interests, and the information infrastructure of modern social movements. Journal of Communication, (), –. doi:./j.-...x
Zeng, J., Burgess, J. E., & Bruns, A. (). The challenges of Weibo for data-driven digital media research. Unpublished manuscript submitted to IR: Phoenix, October –. Retrieved from http://eprints.qut.edu.au///The_Challenges_of_Weibo_for_DataDriven_Digital_Media_Research.pdf
Zimmer, M. (). "But the data is already public": On the ethics of research in Facebook. Ethics and Information Technology, (), –. doi:./s---

  ......................................................................................................................

     The Case of China ......................................................................................................................

    

. I

Long before the advent of the Internet, the Belmont Report outlined the guiding ethical principles for social science research: respect for persons, beneficence, and justice. These principles speak to the importance of protecting individual autonomy, minimizing harm, and considering fairness in the process of scientific research. This overarching ethical framework, alongside a number of practical guidelines, sparked a gradual institutionalization of human subject protection in academic institutions in the form of institutional review boards (IRBs), which are instrumental in the ethical screening of research projects.

In the past decade, however, the widespread adoption of the Internet has challenged these existing ethical frameworks. The Internet, constituting as much a new communicative channel as a new source for data collection, has become an integral tool for scholars across disciplines. Beyond providing new platforms for academic deliberation, it has evolved into an important source of behavioral and textual data, even for research projects that have little to do with digital contexts. As the global academic community has grown increasingly reliant on the Internet for scientific research, new reflections and debates have emerged on the strengths and limitations of an ethical framework that originated in the pre-Internet era (e.g., Buchanan, ; Eynon, Fry, & Schroeder, ; Thorseth, ).

Early discussions tended to center on the unique ethical challenges the Internet raises as a new means of collecting observational data about human subjects. An influential report from a workshop organized by the National Institutes of Health (NIH) and the American Association for the Advancement of Science (AAAS)
identified a number of ethical challenges deemed characteristic of the Internet setting, including the difficulty of delineating the boundary between public and private domains and of appropriately protecting users' privacy. Based on these discussions, the report provided recommendations for IRBs and researchers, calling for a careful reassessment of ethical guidelines developed for traditional research settings (Frankel & Siang, ).

In response to this pioneering report, other works have drawn attention to the links between context and Internet research ethics. Some scholars, for instance, highlight the ways in which one's philosophical tradition can shape ethical decisions in practice (Buchanan & Ess, ). Others focus on the dynamic relationship between disciplinary and methodological specificities and ethical issues in the online setting, assigning ethical dimensions to methodological choices (e.g., Buchanan, ; Markham, ; Sveningsson, ). Yet others underline the importance of social and cultural differences in shaping ethical considerations (Buchanan & Ess, ). As Eynon et al. () have noted, in the Internet setting the "research object is no longer clearly delineated by national boundaries and protected by national research governance" (p. ), yet the factors researchers must consider when making ethical decisions, including ethical governance and legal frameworks, remain culturally bounded. For instance, studies have shown that Internet users' expectations of privacy vary across cultural contexts (Ess, ), which creates further ethical challenges for research projects involving data collected from diverse cultural communities.
Taking into consideration the diverse contexts that researchers may face in making ethical decisions in their online research, the Association of Internet Researchers (AoIR) has released a document, Ethical Decision-Making and Internet Research, that proposes a few fundamental ethical principles but largely leaves ethical decision-making to the discretion of researchers. The nonmandatory nature and flexibility of the framework rest on the recognition that ethical decision-making in research practice is a complicated process that "interweaves one's world view (ontology, epistemology, values, etc.), one's academic and political environment (purposes), one's defining disciplinary assumptions, and one's methodological stances" (Buchanan & Markham, ). In particular, while this approach recognizes shared ethical norms across academic communities, it also acknowledges that these shared norms may generate different interpretations and judgments in different sociocultural contexts. An Internet ethical framework should therefore be flexible enough to allow for what Ess () called "ethical self-direction and (in case of error) correction" (p. ).

In this chapter we embrace this notion of flexibility and cultural and political sensitivity by examining the case of China, whose unique sociocultural context may complicate ethical considerations in conducting Internet research. Echoing Ess's argument, we believe that practical wisdom and informed ethical judgment cannot be divorced from a nuanced grasp of the social specificities of a research site. In the context of Internet research, the important sociocultural variables that may affect ethical decisions include cultural expectations about privacy, user engagement with the Internet, and the existing legal and political frameworks that regulate data
collection and use. In the following sections, we explore these factors in the Chinese context, followed by a survey of existing methodological approaches to Chinese Internet research and their ethical implications.

. T I  “C C”

While the body of theoretical and practical literature on Internet research ethics has engaged with various ethical issues across disciplines and methodological approaches (e.g., Eynon et al., ; Kraut et al., ; Markham, ), few studies have problematized ethical considerations within specific sociocultural contexts. China represents an important case for engaging with Internet ethics: it is both widely used for data gathering and ethically contested. On the one hand, with the growing importance of China in global economic and political governance, the country features heavily in global Internet research. In recent years we have witnessed an increase in studies that either directly analyze China’s digital practices or use its Internet channels for data collection purposes. On the other hand, China’s Internet has a number of unique characteristics that distinguish it, as both a subject of study and a data source, from the Internet in other regional and global contexts. Specifically, the key contextual variables to consider when using the Chinese Internet for research purposes are the omnipresent surveillance regime and the weak legal protections of Internet privacy. As for the surveillance regime, while the number of Chinese netizens is now the highest in the world ( million in ), the dynamism of the Chinese Internet should not be mistaken for freedom. In fact, the most popular imagery associated with the Chinese Internet is that of the “Great Firewall of China” (GFW), the sophisticated, state-crafted apparatus designed to filter websites and to consistently and pervasively censor online information exchanges.
In addition to the technological architecture dedicated to censoring content, the Chinese Communist Party apparatus has deployed a number of other tools to contain the potential political repercussions of expanding Internet access, including legal measures,1 but also indirect censorship implemented via Internet companies (MacKinnon ) and local officials (Repnikova ). This unprecedented censorship apparatus, moreover, works selectively: the state primarily targets content that can provoke popular mobilization while mildly tolerating some criticism of official policies (King, Pan, & Roberts ). The line between criticism and political mobilization, however, can often be blurry in practice, and the Chinese regime occasionally resorts to offline retribution against its perceived critics. Under Xi Jinping’s administration, for instance, many netizens have been arrested for spreading “false rumors”—an ambiguous category created by the regime. Overall, when using the Chinese Internet, researchers face a highly contested political space, with its subjects of study being
significantly more vulnerable and thereby requiring more consideration than Internet users in other contexts. In addition to pervasive surveillance, the Chinese Internet context is marked by a relatively underdeveloped legal framework for the protection of Internet privacy. According to a recent cross-national UNESCO study, “there is limited protection for privacy in China, with no fully-fledged constitutional guarantee, no proper privacy law and no data protection law” (Mendel, Puddephatt, Wagner, Hawtin, & Torres, , p. ). The report also underlines the relatively powerless position of ordinary netizens when confronted with problematic commercial practices of the private sector, especially “abuses of private data in the form of targeted marketing approaches” (p. ). Although online privacy has become increasingly visible in public discussions, especially after a series of data-disclosure incidents, data privacy has not yet registered as a significant concern for major Internet companies such as Baidu and Alibaba. In parallel with the meager legal framework for online privacy, data mining has emerged as a profitable business in China. Recent research on the data-mining industry has documented how Internet users’ data are systematically collected for marketing purposes. Data are sold to commercial organizations to help them better understand their audiences and maximize the effects of their advertising campaigns (e.g., Turow & Draper, ); these same technologies can also be used to monitor citizens, allowing government agencies to employ cost-effective surveillance techniques (e.g., Gandy, ; van Dijck, ). In the Chinese context, major social media platforms are also delving into the burgeoning “big data” business and looking for new ways to generate profits from user data (Bloomberg, ; Larson, ).
In the absence of a sophisticated privacy protection framework, Chinese Internet users are particularly vulnerable when both the state and private corporations are looking to capitalize on their personal data.

. M A  E C

Despite the growing interest in China as a subject and the important sociopolitical considerations pertaining to the Chinese Internet context, research on the ethical dimensions of this scholarship is largely nonexistent. A rare exception is a recent review essay that examines the ethical challenges of carrying out experimental research in the Chinese context (Lü, ). The study highlights the importance of grasping the Chinese government’s motives for restricting data collection efforts, as well as the prominence of the online censoring apparatus. Although the essay presents several practical suggestions for working around official pressures, the discussion focuses narrowly on one methodological approach (the experiment) and skips over ethical considerations. This chapter, therefore, presents a first step toward grappling with the ethical considerations of Chinese Internet research.

In our analysis of Chinese Internet studies, we adapt Buchanan’s () definition of Internet-based research as “research which utilizes the Internet to collect information through an online tool, such as an online survey; studies about how people use the Internet, e.g., through collecting data and/or examining activities in or on any online environments; and/or, uses of online datasets or databases” (p. ). Specifically, we examine research that (1) uses the Internet as a tool to collect data about human subjects, for example through online survey services or crowdsourcing platforms like Amazon Mechanical Turk; and (2) examines how people use the Internet, for example through the observation of online social interactions or other forms of digital traces. In order to highlight the ethical implications unique to the Chinese social context, we limit the scope of the review to recent empirical Internet studies that involve human subjects from China or data sets collected from Chinese cyberspace. We thus exclude studies such as those concerning how overseas Chinese use the Internet, as well as analyses of Chinese discursive data from social media platforms not accessible from mainland China, such as Twitter and Facebook. To identify published research papers for review, we performed a search for China-related Internet studies published between  and  in high-impact, peer-reviewed journals in communication, such as Journal of Communication, New Media & Society, and Journal of Computer-Mediated Communication. We also looked at highly cited publications from noncommunication journals such as Science and American Political Science Review, as well as top China studies journals, including The China Quarterly and The Journal of Contemporary China. In our engagement with Chinese Internet ethics, we follow the approach that Buchanan and Ess took in their  review and organize our analyses by research method.
When examining research papers on the Chinese Internet, we pay special attention to issues that may evoke ethical concerns. For instance, how do the authors collect their data? What are the possible ethical considerations associated with the data collection or analysis procedures? Have the authors reflected on those issues? Have the authors taken special measures to protect the human subjects involved in the study, such as their privacy? In each methodological category, we critically review exemplary studies from the perspective of research ethics and reflect on how the Chinese contextual factors outlined in the previous section may complicate ethical concerns. We also discuss best practices, as well as potential ethical issues that deserve careful consideration from future researchers conducting Internet research in the Chinese context. Table 33.1 offers a brief summary of the issues discussed in the following sections.

. O E  S

The Internet provides a convenient way to reach a large number of potential subjects, which accounts for its increasing popularity as a tool in experimental and survey studies. Recent research on the Chinese Internet that takes on an experimental or a

Table 33.1 Ethical Concerns across Different Methodological Approaches

Online Experiments/Surveys
Issue 1: Weak legal framework for Internet privacy
— Chinese online survey service providers lack sophisticated measures to protect the privacy of user data.
Recommendations:
— Separate participant recruitment from the collection of survey responses and choose service providers with more sophisticated privacy protection schemes.
— Recruit participants through popular online social media platforms (e.g., QQ, Sina Weibo).
Issue 2: Stringent surveillance of Internet content
— Major Chinese social media platforms such as Sina Weibo require users to provide real names and other personal information during registration.
— Politically sensitive research topics may impose greater risks on participants, as their anonymity may be compromised.
Recommendations:
— Consider proper encryption or other measures to protect digital communication between researchers and participants from censorship.

Content Analysis
Issue 2: Stringent surveillance of Internet content
— Revealing details about participants in published research in the form of verbatim quotes or virtual account names may not only compromise their anonymity but also endanger their physical freedom.
Recommendations:
— Consider removing personally identifiable information, as well as other appropriate anonymization procedures, in published research.

Digital Ethnography
Issue 1: Weak legal framework for Internet privacy
Recommendations:
— Consider users’ expectations about the public/private characteristics of the platform.
Issue 2: Stringent surveillance of Internet content
— The presence of online surveillance may render subjects in politically sensitive research projects particularly vulnerable.
Recommendations:
— Consider alternatives to formal consent forms.
— Consider how the presence of censorship may affect the interaction between researchers and participants.

survey approach engages with a wide range of topics, such as the role of text messages in online health intervention (Lau, Lau, Cai, & Archer, ), online discussions and political change (Hyun & Kim, ), cultural differences in information-seeking behaviors (Han, Zhang, Chu, & Shen, ; Yang, Kahlor, & Li, ), and the mechanisms of Internet censorship (King, Pan, & Roberts, ). For both online experiments and surveys, a common way to recruit research subjects is through private online survey service providers such as Survey Monkey (Hyun & Kim, ). These companies maintain online communities of regular survey takers and help clients identify appropriate survey targets with customizable screening criteria. A major ethical consideration when collecting data through survey service providers, as ethics scholars have noted, is anonymity (Buchanan & Ess, ; Ess, ; Eynon et al., ). In order to receive compensation for their participation, registered users of online survey communities have to provide sensitive personal information such as bank account and cell phone numbers. As a result, the anonymity of participants may be compromised when survey service providers link participants’ offline identities with their responses to survey questions. Even when researchers opt to collect data directly from users, this issue may linger, as researchers may have to access personally identifiable information in order to offer incentives even when the experimental design requires anonymity (Peden & Flashinski, ). While privacy concerns in survey and experimental Internet research are present across political contexts, they loom larger in China. First, as previously discussed, the legal framework for protecting online privacy in China remains relatively weak, leaving loopholes for Internet service providers to take advantage of users’ personal data.
Second, though several incidents involving user information disclosure have recently received intensive media exposure,2 the issue of online privacy has not yet registered as a prominent concern among Chinese Internet companies, especially those whose business model relies on data mining. In a recent study that assesses how well some of the world’s major tech companies protect users’ privacy and freedom of expression, Tencent, the leading Chinese Internet company, ranks at the bottom, indicating a relatively low level of transparency about data collection and use (Thielman, ). The weak concern for data privacy can also be found among popular Chinese survey service providers; not many have detailed descriptions of users’ privacy rights in their user agreements, and even those that do mention privacy do not address important questions such as whether they can share user data with business partners and what type of consent they would need to obtain in order to do that. There are two possible ways to tackle the issue of privacy protection when collecting survey data through private Chinese Internet companies: separating participant recruitment from the collection of survey responses and forgoing Internet survey providers in favor of online social media platforms. As for the former, researchers may rely on Chinese service providers solely to reach potential participants and then reroute them to a non-Chinese survey platform through a link included in a digital invitation. This way researchers would be able to keep the personally identifiable information and
survey responses stored separately; they would also be able to choose a survey provider operating under a more sophisticated privacy protection framework to collect and handle the response data. The UNESCO report previously cited suggests that the United States and European Union countries have a relatively long history of protecting privacy through innovative legislation (Mendel et al., ). Therefore, researchers may consider online survey providers from these regions, such as Qualtrics and QuestionPro, for more responsible data management. The second recruiting strategy is to reach potential participants through online Chinese social media platforms, such as Sina Weibo, WeChat, and QQ (e.g., Lu, ; Mou, Wu, & Atkin, ). Compared with recruiting through Internet survey service providers, this approach appears to pose fewer risks, as users have a certain leeway in the type and amount of personal information they reveal to the platform and are less obliged to provide sensitive information such as ID number and bank account number. However, researchers should be aware that these users are not completely anonymous on these platforms. Under government regulation, major social media service providers in China began to enforce a real-name registration policy in late . In order to have a registered account, users need to submit information that enables the platform to associate their online personae with their offline identities. Therefore, if a topic is politically sensitive, researchers who plan to recruit through Chinese social media may need to consider additional measures to minimize the political risk resulting from compromising anonymity. For instance, when communicating with potential participants about the specifics of the study, researchers may choose technological tools that provide proper encryption and are not subject to the scrutiny of the platforms. 
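The split-storage strategy described above can be sketched concretely. In the sketch below (all names, identifiers, and data structures are hypothetical and not drawn from any study cited in this chapter), the recruiting platform holds only the identity table needed to pay compensation, while survey responses are keyed by random single-use codes, so no single party holds both identities and answers:

```python
import secrets

# Hypothetical split-storage scheme: the recruiting platform keeps the
# identity table (needed to pay compensation); the survey platform keeps
# only the code and the answers. Neither holds both.

def issue_codes(participant_ids):
    """Map each recruited participant to a random single-use code."""
    return {pid: secrets.token_urlsafe(8) for pid in participant_ids}

identity_table = issue_codes(["weibo_user_a", "weibo_user_b"])

# The survey side stores responses keyed only by the code.
responses = {code: None for code in identity_table.values()}

def record_response(code, answers):
    """Accept a submission only for a known, not-yet-used code."""
    if code not in responses or responses[code] is not None:
        raise ValueError("unknown or already-used code")
    responses[code] = answers

# A participant follows the invitation link and submits answers.
record_response(identity_table["weibo_user_a"], {"q1": "agree"})

# Compensation can be verified without ever joining answers to identities:
# the recruiter only learns WHICH codes completed the survey, not what
# those participants answered.
completed = {code for code, answers in responses.items() if answers is not None}
paid = [pid for pid, code in identity_table.items() if code in completed]
```

The design point is that the join between identity and response exists only transiently, in the participant's own browser session; neither stored table can reconstruct it alone.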
Beyond the general concern with participants’ privacy, researchers have to be especially cautious with studies concerning online political participation and activism. As already noted, the Chinese regime pays close attention to online discussions that criticize government conduct or mobilize collective action, which makes participants in politically sensitive studies especially vulnerable to potential retribution from the state. Our analysis of studies that engage with political issues on the Internet suggests that the capacity of the surveillance apparatus is often underestimated. For instance, a recent online survey examining the factors that shape the use of circumvention technologies (Mou et al., ) notes that the anonymity of participants (recruited through Sina Weibo) was assured, but does not explicate how that was achieved. In particular, the authors do not discuss measures undertaken to protect these subjects from state surveillance (e.g., how they explained the purpose of the study in the recruiting post and whether the survey questions were accessible to platform operators). This consideration is especially crucial, as the Chinese government may have a vested interest in identifying who regularly evades censorship, because these individuals are likely to have critical political dispositions. As a result, researchers should weigh the potential risks associated with identification when recruiting participants through social media, particularly when disclosure of personal identity renders participants vulnerable to political censorship.

. C A

Content analysis constitutes another popular method for collecting and analyzing discursive data from Chinese cyberspace. Recent publications deploying this method have explored a diverse range of issues, including the characteristics and influence of online health messages (Na, ; Wang & Liu, ), the role of social media in political discussions (Clothey, Koku, Erkin, & Emat, ; Hassid, ), the cultural values reflected in digital texts (Merolla, Zhang, & Sun, ; Ye, Sarrica, & Fortunati, ; Yuan, Feng, & Danowski, ), the mechanisms of Internet censorship (King, Pan, & Roberts, ), and the dynamic relationship between social media and government legitimacy (Bondes & Schucher, ; Tong & Zuo, ). These studies typically draw on user-generated discursive data from a number of online sources, including microblogs, web pages, forums, streaming websites, and blog posts. While at first glance content analysis may appear less likely to compromise netizen privacy than surveys and experiments, it still carries ethical implications for the Chinese Web. With search engines now able to access and locate identifying information in publicly accessible online data archives, online confidentiality is not guaranteed. Although this is the case across political contexts, in China, once again, the situation is aggravated by routine censorship and government surveillance combined with a weak legal framework, which means that identity disclosure can result in unexpected political retribution against Chinese citizens if they are suspected of engaging in politically destabilizing activities online.
Revealing details about participants in published research, whether in the form of verbatim quotes or virtual account names, may compromise their confidentiality and anonymity and endanger their physical freedom. Our survey of prominent studies using content analysis of the Chinese Internet identified cases in which negligence compromised subjects’ privacy. In a recent study of online political expression, for instance, researchers surveyed several Uygur-language online platforms to explore how ethnic minority groups use the Internet to express politically subversive ideas (Clothey et al., ). The published version included not only the specific names of the online platforms sampled, but also their URLs. Understandably, scholars are expected to be sufficiently transparent about their research procedures so the academic community can evaluate the quality of their work. In this particular case, however, revealing the virtual identity of these online communities may bring them to the attention of censors, which not only puts their anonymity at risk but also threatens the continuity of their online activities. A general reference to the types of platforms surveyed would have sufficed to explain and validate the analysis, and the specific URLs should have been omitted to ensure user anonymity. Even more apparent ethical concerns permeate a study of microblog posts about a controversial social media event that attracted widespread attention among Chinese netizens (Bondes & Schucher, ). The researchers collected and analyzed forty-six hundred posts to identify the pattern of radicalization in online discussions.
When discussing influential microblog users, the researchers quoted real account names directly, without proper anonymization. As the analysis mainly draws on online postings, revealing account names does not add significant analytical value, while potentially endangering the participants. Quoting real account names almost always compromises confidentiality and anonymity in the Chinese context. First, major Chinese social media platforms such as Sina Weibo are mandated by the state to enforce a real-name registration policy, which requires users to register with personally identifiable information (e.g., cell phone number, name on government-issued ID). Virtual account names are therefore directly associated with offline personae, at least for the service providers who operate these platforms. It is also common for popular microbloggers (e.g., celebrities, public intellectuals, journalists) to use their real names for their online accounts. Moreover, the Chinese surveillance scheme tends to pay special attention to opinion leaders—the personal or organizational social media accounts with large numbers of followers—as they are assumed to have a greater impact on the opinions and behaviors of ordinary Internet users. While they attract significant attention from researchers, online opinion leaders are also in the limelight of the surveillance apparatus and should therefore be treated most carefully. Reflecting on possible ways to ensure the anonymity of participants in content analysis, Lawson () recommends that researchers follow appropriate anonymization procedures when using participants’ virtual names and verbatim quotes, such as creating pseudonyms for participants’ virtual names or removing them entirely in published work. This strategy also applies when the research objects are online communities.
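The pseudonymization step recommended above can be applied mechanically before analysis or publication. The sketch below is a minimal illustration (the post structure and the example account names are hypothetical, not taken from any cited study): it derives a stable, non-reversible pseudonym for each account via a keyed hash, so the same account maps to the same label across the data set, and it also scrubs known account names out of quoted text:

```python
import hashlib
import hmac

# Hypothetical anonymization pass for scraped microblog posts.
# SECRET is kept offline by the research team and never published, so
# pseudonyms cannot be reversed by readers yet remain consistent
# across the whole data set.
SECRET = b"rotate-and-store-offline"

def pseudonym(account_name: str) -> str:
    """Stable, non-reversible pseudonym for an account name."""
    digest = hmac.new(SECRET, account_name.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"User_{digest[:8]}"

def anonymize(posts):
    """Replace author names with pseudonyms and scrub any mention of a
    known account name from the quoted text itself."""
    names = {p["author"] for p in posts}
    out = []
    for p in posts:
        text = p["text"]
        for name in names:
            text = text.replace(name, pseudonym(name))
        out.append({"author": pseudonym(p["author"]), "text": text})
    return out

posts = [
    {"author": "famous_blogger", "text": "I agree with famous_blogger's point."},
    {"author": "reader123", "text": "This policy is troubling."},
]
clean = anonymize(posts)
```

For published work, the stronger option discussed in the text, removing names entirely, amounts to dropping the `author` field altogether; the keyed-hash variant is only needed when the analysis itself depends on tracking who said what.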
A good example of a study that has addressed this ethical concern through appropriate anonymization is Yuan et al.’s () work on the conceptualization of privacy on Chinese social media platforms. In this study, the researchers examined the semantic structure of eighteen thousand microblog postings that contain the word “privacy.” In the published paper, the discussion section includes multiple English translations from original Chinese postings, none of which refer to the virtual account names responsible for these comments. This way, researchers were able to include vivid illustrations for their arguments without compromising the virtual identity of their subjects. Another good example comes from King et al.’s () study about the pattern of content deletion across the Chinese social media landscape. While this paper contains a number of direct quotes from blog posts that express critical opinions of government or government officials, researchers concealed the names of bloggers or sites when presenting the texts, protecting research subjects from political risks associated with identity disclosure.

. D E

In addition to quantitative methods such as experiments, surveys, and content analysis, Chinese Internet research also features increasingly prominent use of qualitative
methods. Specifically, digital ethnography is either explicitly or implicitly deployed in much of the recent scholarship. Examples include an analysis of Chinese digital activism (Yang ), a study of paid commentators known as the “50-cent army” (Han ), and official social media use (Repnikova & Fang ), among other topics. The use of digital ethnography even in uncensored contexts has already attracted ethical debate in the Internet ethics scholarship. For instance, some studies have found that when asked whether they would mind being observed for research, chat room participants largely expressed objections, finding participant observation intrusive and unwelcome (Hudson & Bruckman ). Other studies have revealed that the privacy of what has at first been considered anonymous social media content is actually quite fragile, with user identities being traceable with sufficient technological know-how. A  study of selected Facebook user profiles, for instance, sparked controversy over the capacity of researchers, and their responsibility, to protect the identity of the profiles they study on social media networks (Zimmer ). The debate over what constitutes ethical qualitative research on social media platforms is far from settled. One side supports the blanket application of the human subject model to the Internet, which would entail gaining consent from the producers and consumers of any digital data; the other side advocates a more relativist ethical approach, with some facets of the Internet treated as more akin to traditional media or public spaces (Bassett & O’Riordan, ) and others regarded as private/personal spaces (e.g., closed chat rooms). The China case speaks directly to these debates. On the one hand, a more repressive environment means that privacy considerations and consent measures should be even more stringent than those applied in democratic contexts.
On the other hand, the censored nature of the Chinese Internet makes it challenging if not impossible to abide by some of these ethical codes, and the ephemeral nature of much of the sensitive online content can in some cases reduce the risk of it being traced back to its original sources. As for the more stringent ethical considerations, as we have already noted, scholars of the Chinese Internet arguably work with more vulnerable groups (or subjects) than their counterparts analyzing the uncensored Internet. Even if users are not involved in anything political, the Chinese state still closely monitors their online activities for public preferences that could, in turn, be incorporated into policy-making decisions (Chen, Pan, & Xu, ). Observing, analyzing, and publicly sharing information on Chinese Internet users can therefore have a direct impact on users’ personal safety, or a more indirect influence on how the state thinks about public preferences, should a scholar’s report be accessed and shared by Chinese intellectuals and officials. At the same time, the specificity of the Chinese digital sphere complicates ethical considerations. Asking netizens who already straddle gray political boundaries with the state for research consent would likely complicate many research projects, as Internet users tend to actively avoid any unnecessary exposure. Moreover, if netizens are made aware of being watched, their interactions may change significantly, affecting the objectivity of the research. Further, if a researcher vocalizes and exposes his or her
presence online, this may attract instant attention from the authorities, immediately compromising the project. Becoming more publicly visible or known also means that interactions with netizens are more likely to be tracked, which in turn endangers netizens’ safety. Similar considerations often apply to obtaining formal consent from interviewees. Having carried out interview research in the past, one of the authors found that formalizing consent could compromise safety more than enable it, as consent forms could be accessed by authorities, and interviewees expressed discomfort with signing documents. Moreover, while it potentially threatens or compromises a project, the intensive censorship of the Internet might in some ways reduce the possibility of posts being traced to their original users. Many postings, especially those of a semi-sensitive nature, are instantly deleted, which means that by the time a researcher captures, analyzes, and publishes this information (possibly years later), tracing the posts used and cited is less feasible. In fact, one of the key challenges in observing Chinese Internet usage and capturing the flow of interaction is that messages may disappear or be altered within minutes. The censorship of the Chinese Internet, therefore, at once aggravates some ethical considerations and obliterates others, given the speed of keyword filtering and the impermanence of online content. Despite the complications that the China case presents for the observance of an ethical code in digital ethnography research, researchers can still make efforts to protect their subjects of study, as well as their own digital identity, as much as possible. Current studies on the Chinese Internet that draw on digital ethnography tend to abstain entirely from any ethical discussion in their research methods.
For instance, it is unclear in studies that quote social media users whether their identities have been anonymized and whether any observation of online activity may also involve researchers’ interactions with netizens. Digital ethnography is typically referenced in very general and often ambiguous terms, with no ensuing explanation of how it was carried out beyond observation of specific platforms, forums, and digital news outlets. In thinking through the ethical implications and challenges of engaging in digital ethnography of a heavily surveilled digital landscape, it may be useful to distinguish the sensitivity of the data analyzed, as well as between public and private Internet domains. As for sensitivity, only some subjects of digital research in China fall under that category, which includes the study of dissidents, online critics, public discontent, and critical/sensitive events or scandals. In engaging with this sensitive category, protecting the anonymity of data is critical, as well as maintaining caution and awareness in interacting with research subjects online. While obtaining consent may not be possible or advisable in the case of China, a researcher still carries serious ethical responsibility in studying vulnerable political and societal groups. Many subjects, however, fall into politically neutral or nonsensitive zones, such as the study of online gaming, official propaganda initiatives online, and digital entertainment and consumption habits. Such research may allow for more flexibility but at the same time should still ensure that individual privacy is attended to by shielding user identities from exposure.



    

Beyond the political sensitivity of the subject matter, it is important to distinguish between the more private or semiprivate domains of research and the public domains online. Closed chat rooms, for instance, which make for an interesting source of data collection on a variety of subjects, are semiprivate and require permission for entry. Participation in Weixin, a popular chat platform, for instance, is contingent upon an invitation to join a discussion group, signaling a degree of privacy that is built into the interactions. In studying such platforms, regardless of the subject, a researcher needs to be extra vigilant about protecting users’ identities and not compromising them through sensitive interactions aimed at a specific research agenda. In contrast to these semiprivate digital spaces, public domains, such as official Weibo accounts, digital news outlets, and blogs, as well as the new WeChat platforms of official media and commentators, require less stringent ethical consideration. Citing official commentaries or quoting a blog source directly should not be treated as breaking ethical boundaries, considering that this information is purposefully in the public domain. Proper anonymization along the lines discussed previously, however, may still be needed if politically sensitive comments manage to survive the censorship (e.g., those critical of government conduct/officials, or those collected before they were censored). On the whole, in applying a relativist approach to digital ethnography in China, we advocate for distinguishing between political contexts of data, as well as between the degrees of privacy of the Internet domains examined in the study. Table 33.2 presents a set of questions intended to prompt ethical reflection; it may be useful as a checklist when researchers conceive their research projects. 
These questions were adapted from the latest version of Ethical Decision-Making and Internet Research (Markham & Buchanan, ), as well as from the work of Chris Mann (). In particular, we highlight the issues relevant to the sociocultural particularities of the Chinese context as discussed in this chapter.

Discussion

In this chapter we have mapped out ethical considerations for Internet research in the Chinese context that thus far have been little addressed in the existing literature. Specifically, we weighed the challenges and strategies of researchers who engage with the Chinese Internet against the unique contextual variables pertaining to the China case: the pervasive censorship, political repression, and weak legal framework for privacy protection that make Chinese Internet users especially vulnerable research subjects. Our analysis of the most prominent research methods when applied to the Chinese Internet, including online experiments, surveys, content analysis, and digital ethnography, demonstrates that ethical considerations require more attention from researchers. While surveys and experiments pose the highest concerns for user privacy, content analysis and digital ethnography can also compromise identities and even

    



Table 33.2 Ethical Questions Specific to the Context of the Chinese Internet

Technological background
— Are you aware of the technological options available for protecting research subjects’ privacy against online surveillance? If not, do you know of any technical expert with whom to discuss this issue?
— If you are to collect data through private survey service providers, do they have explicit measures to protect the privacy of user data?

Organization of informed consent
— Are there potential risks involved if the authorities are able to access the written consent forms?
— Is it possible to waive the written consent form?
— What options do you have if you decide to have participants submit online signatures?

Description of research
— Do the research projects involve politically sensitive topics? If yes, will you clarify the risks of exposure to participants?
— If you decide to withhold certain information, what will be withheld, and how will you do that? What are the potential risks?

Data collection and management
— If you decide to collect data through Chinese private companies, do they have explicit measures to protect users’ privacy?
— Are there non-China-based service providers operating under more sophisticated legal frameworks regarding privacy protection?
— Will you employ technological means to secure confidentiality?
— Are participants identifiable? What type of identifiable information may be revealed? Are participants’ online identities connected with their offline identities in any way?

Norms in the research context
— What are the ethical expectations users attach to the venue in which they are interacting, particularly around issues of privacy, both for individual participants as well as the community as a whole?

Presentation of findings
— Are there potential risks involved when referencing the verbatim quotes or virtual account names of the participants in published work or prepublication occasions, such as workshops or conferences?
— What options do you have to minimize such risks?

the physical safety of netizens. We therefore advocate that every researcher of Chinese digital space think through the ethical implications of their projects and discuss the logic of, and the steps taken toward, safeguarding their research subjects. In this chapter we introduced some strategies for ethical navigation of the Chinese Internet space, including the use of virtual names for participants, hiding the original websites of sensitive blogs, using safe technology in communicating with survey participants, and differentiating between sensitive and nonsensitive domains when documenting the analysis from digital ethnography. The first step toward engaging with these ethical conundrums, however, remains immersing oneself in the context of Chinese Internet management and developing awareness of the multiple dimensions of sensitivity involved in carrying out research projects.



    

As our analysis represents a first step toward addressing ethical considerations in China’s social media research, we hope it will facilitate new research on this subject in the future, because many questions remain unanswered. One fruitful avenue of research is analyzing Chinese Internet users’ context-specific expectations about privacy. Recent research on Internet privacy has highlighted the critical role of context in shaping users’ privacy concerns and behaviors (e.g., Nissenbaum, ; Smith, Dinev, & Xu, ), and culture is one of the most important contextual factors (Ess, ). In this regard, research that explores the ways in which privacy expectations vary across different cultural communities may contribute to solid, practical recommendations for making ethical decisions. For instance, it is important to delve deeper into whether Chinese netizens are culturally conditioned to value online privacy less than their Western counterparts, or whether their attitudes are merely an artifact of the political system. Another fruitful avenue for future research is mapping out the distinctions between political and commercial surveillance and privacy inhibitions. In this chapter we have demonstrated that both the state and Internet firms actively undermine user privacy. A deeper look into the distinctions between the toolkits deployed by political and commercial actors would shed light on the risks that researchers and their subjects incur in deploying and participating in certain projects. Yet another important avenue would be to investigate grassroots initiatives that attempt to foster better online privacy protection and awareness, and to incorporate their recommendations into social media research projects. How do Chinese netizens themselves try to protect their online identity? What groups and organizations exist that fight for netizen rights, and what can we, as Western researchers, learn from them? 
Finally, stepping beyond China, a critical area of research would be comparative, contrasting ethical considerations of social media research across political systems with limited Internet freedom. For instance, a comparison of China, Russia, Saudi Arabia, and Iran would be of enormous value to the scholarly community. Our preliminary attempts to situate China in a larger comparative framework suggest that the Internet landscapes faced by researchers on China might be most comparable to those faced by researchers on Iran and Saudi Arabia, who similarly navigate a system of keyword filtering and intensive preemptive surveillance, albeit arguably a less sophisticated one than in China. Russia, in contrast, practices more post-factum Internet censorship or preliminary blocking of full sites, but does not deploy keyword filtering technologies akin to those used by these other three states. Even if China is the most innovative in its technological capacity for censorship, however, the threats that scholarly research can pose to netizens are equally grave across authoritarian contexts. Whereas a Chinese censor might be quicker at identifying an individual netizen’s identity, a Russian official might be more arbitrary in responding post factum, online and offline. Moreover, the legal architecture of online protection is weak in all nondemocratic contexts, which further exacerbates ethical concerns, as we showed throughout our analysis. One possible strategy for comparative analysis would be to take the same approach we adopted in our chapter and expand it to more cases, in search of case-specific nuance but also parallels that would help generalize and theorize

    



about Internet ethics in authoritarian systems. Another approach would be to compare the degree of political surveillance and legal protections for Internet users, delineating ethical concerns accordingly. A third comparative angle is to focus mainly on practical strategies for contesting nondemocratic Internet spaces, illuminating whether the tactics we outlined in our China chapter are context-specific or generalizable to cases like Iran or Russia.

N . The recent draft of the cybersecurity law is a good example of the use of law to restrict online information flows at the national level. See Ramzy, A. (, July ). What you need to know about China’s cybersecurity law. The New York Times. Other laws applied in controlling online content include antipornography and antiterrorism legislation. . In  a hacker incident led to massive leakage of user information across several major Chinese Internet forums and social network. Even more unsettling, users’ account information was found to be publicly traded on online commercial platforms. See Yixuan Zhang, Y., and Cheng, C. (). User information leakage from renowned websites: who’s to take actions? Retrieved from news.xinhuanet.com See also Jia, K, and Du, F. (). Investigating the incident of information leakage from social network sites: Online stalking reflects the insufficiency of private information protection. Retrieved from news. xinhuanet.com

R Bassett, H. E., & O’Riordan. K. (). “Ethics of Internet Research: Contesting the Human Subject Research Model.” Ethics and Information Technology (), –. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research. (). The National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. DHEW Publication No (OS) -. Bloomberg. (, June ). Chinese allow phone tracking, other data mining in exchange for credit. Chicago Tribune. Retrieved from http://www.chicagotribune.com/news/snswp-blm-china-credit-cdc-e-e-af-dffcff--story.html Bondes, M., & Schucher, G. (). Derailed emotions: The transformation of claims and targets during the Wenzhou online incident. Information, Communication & Society, (), –. http://doi.org/./X.. Buchanan, E. A. (). Virtual research ethics: Issues and controversies. In E.A. Buchanan (Ed.),Readings in virtual research ethics: Issues and controversies (pp. vi–xii). Hershey, PA: Information Science Publications. Buchanan, E. A. (). Internet research ethics: Past, present, and future. In Mia Consalvo and Charles Ess (Eds), The handbook of Internet studies (pp. –). Malden, MA: WileyBlackwell. Buchanan, E., & Ess, C. (). Internet research ethics: The field and its critical issues. In Kenneth Einar Himma and Herman T. Tavani (Eds), The handbook of information and computer ethics. Hoboken, NJ: John Wiley & Sons.



    

Chen, J., Pan, J., & Xu, Y. (). Sources of authoritarian responsiveness: A field experiment in China. American Journal of Political Science, (), –.
Clothey, R. A., Koku, E. F., Erkin, E., & Emat, H. (). A voice for the voiceless: Online social activism in Uyghur language blogs and state control of the Internet in China. Information, Communication & Society, (), –. http://doi.org/./X..
Ess, C. (). “Lost in translation”? Intercultural dialogues on privacy and information ethics. Ethics and Information Technology, (), –.
Ess, C. (). Ethical pluralism and global information ethics. Ethics and Information Technology, (), –.
Eynon, R., Fry, J., & Schroeder, R. (). The ethics of Internet research. In N. Fielding, R. M. Lee, & G. Blank (Eds.), The SAGE handbook of online research methods (pp. –). London: Sage.
Frankel, M. S., & Siang, S. (). Ethical and legal aspects of human subjects research on the Internet. AAAS Online. Retrieved from http://nationalethicscenter.org/resources//download/ethical_legal.pdf
Gandy, O. H. (). Data mining, surveillance, and discrimination in the post-9/11 environment. In R. V. Ericson & K. D. Haggerty (Eds.), The new politics of surveillance and visibility (pp. –). Toronto: University of Toronto Press.
Han, G., Zhang, J., Chu, K., & Shen, G. (). Self–other differences in H1N1 flu risk perception in a global context: A comparative study between the United States and China. Health Communication, (), –. http://doi.org/./..
Han, R. (). Manufacturing consent in cyberspace: China’s “Fifty-Cent Army.” Journal of Current Chinese Affairs, (), –.
Hassid, J. (). Safety valve or pressure cooker? Blogs in Chinese political life. Journal of Communication, (), –. http://doi.org/./j.-...x
Hudson, J., & Bruckman, A. (). “Go away”: Participant objections to being studied and the ethics of chatroom research. The Information Society, (), –.
Hyun, K. D., & Kim, J. (). The role of new media in sustaining the status quo: Online political expression, nationalism, and system support in China. Information, Communication & Society, (), –. http://doi.org/./X..
King, G., Pan, J., & Roberts, M. E. (). How censorship in China allows government criticism but silences collective expression. American Political Science Review, (), –.
King, G., Pan, J., & Roberts, M. E. (). Reverse-engineering censorship in China: Randomized experimentation and participant observation. Science, (), .
Kraut, R., Olson, J., Banaji, M., Bruckman, A., Cohen, J., & Couper, M. (). Psychological research online: Report of Board of Scientific Affairs’ Advisory Group on the Conduct of Research on the Internet. American Psychologist, (), .
Larson, C. (, May). In China, big data is becoming big business. Retrieved from http://www.bloomberg.com/news/articles/--/in-china-big-data-is-becoming-big-business
Lau, E. Y., Lau, P. W. C., Cai, B., & Archer, E. (). The effects of text message content on the use of an Internet-based physical activity intervention in Hong Kong Chinese adolescents. Journal of Health Communication, (), –. http://doi.org/./..
Lawson, D. (). Blurring the boundaries: Ethical considerations for online research using synchronous CMC forums. In E. A. Buchanan (Ed.), Readings in virtual research ethics: Issues and controversies. Hershey, PA: IGI Global.

    



Lu, H. (). Burgers or tofu? Eating between two worlds: Risk information seeking and processing during dietary acculturation. Health Communication, (), –. http://doi. org/./.. Lü, X. (). Ethical challenges in comparative politics experiments in China. In Ethics and Experiments: Problems and Solutions for Social Scientists and Policy Professionals. London: Routledge. Retrieved from http://www.xiaobolu.com/researchpapers/China_Ethic_in_Experiment_Lu.pdf MacKinnon, R. (). “China’s Censorship .: How Companies Censor Bloggers.” First Monday  (). Mann, C. (). Generating data online: Ethical concerns and challenges for the C researcher. In M. Thorseth (Ed.), Applied ethics in Internet research (pp. –). Programme for Applied Ethics, Norwegian University of Science and Technology. Retrieved from http://www.ntnu.no/anvendtetikk/fileadmin/gamlefiler/www.anvendtetikk.ntnu.no/pdf/Nr-AppliedEthicsInInternetResearch.pdf Markham, A. N. (). The methods, politics, and ethics of representation in online ethnography. In N. K. Desnin & Y.S. Lincoln (Eds), The Sage handbook of qualitative research (pp. –). Thousand Oaks, CA: Sage. Retrieved from http://citeseerx.ist.psu. edu/viewdoc/summary?doi=.... Markham, A., & Buchanan, E. (). Ethical decision-making and Internet research: Recommendations from the AoIR Ethics Working Committee (Version .). Retrieved from http:// pure.au.dk/portal/files//aoirethics.pdf Mendel, T., Puddephatt, A., Wagner, B., Hawtin, D., & Torres, N. (). Global survey on Internet privacy and freedom of expression. Paris: UNESCO. Retrieved from https://books. google.com/books?hl=en&lr=&id=VJdr-noC&oi=fnd&pg=PA&dq=Global+survey +on+Internet+privacy+and+freedom+of+expression&ots=lugAOgcWax&sig=zitChHWyswwRsJkXHYtiJQdao Merolla, A. J., Zhang, S., & Sun, S. (). 
Forgiveness in the United States and China: Antecedents, consequences, and communication style comparisons. Communication Research, (), –. http://doi.org/./ Mou, Y., Wu, K., & Atkin, D. (). Understanding the use of circumvention tools to bypass online censorship. New Media & Society, (), –. Na, L. (). A revolutionary road: An analysis of persons living with hepatitis B in China. Journal of Health Communication, (), –. http://doi.org/./.. Nissenbaum, H. (). Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford: Stanford University Press. Peden, B., & Flashinski, D. P. (). Readings in virtual research ethics: Issues and controversies. E. A. Buchanan (Ed.). Hershey, PA: IGI Global. Repnikova, M. (). Media Politics in China: Improvising Power Under Authoritarianism. Cambridge and New York: Cambridge University Press. Repnikova, M. and Fang, K. (). “Authoritarian Participatory Persuasion .: Netizens as Thought Work Collaborators in China.” Journal of Contemporary China (Forthcoming). Smith, J. H., Dinev. T. & Xu, H. (). “Information Privacy Research: An Interdisciplinary Review.” Management Information Systems Quarterly (), –. Sveningsson, M. (). Ethics in Internet ethnography. In E.A. Buchanan (Ed.), Readings in virtual research ethics: Issues and controversies (–). Hershey, PA: Idea Group. Thielman, S. (, November ). World’s biggest tech companies get failing grade on dataprivacy rights. The Guardian. Retrieved from http://www.theguardian.com/technology/ /nov//data-protection-failure-google-facebook-ranking-digital-rights



    

Thorseth, M. (Ed.). (). Applied ethics in Internet research. Trondheim: Programme for Applied Ethics, Norwegian University of Science and Technology. Retrieved from http://www.ntnu.no/anvendtetikk/fileadmin/gamlefiler/www.anvendtetikk.ntnu.no/pdf/Nr-AppliedEthicsInInternetResearch.pdf
Tong, J., & Zuo, L. (). Weibo communication and government legitimacy in China: A computer-assisted analysis of Weibo messages on two “mass incidents.” Information, Communication & Society, (), –. http://doi.org/./X..
Turow, J., & Draper, N. (). Advertising’s new surveillance ecosystem. In D. Lyon, K. Ball, & K. D. Haggerty (Eds.), Routledge handbook of surveillance studies. London: Routledge.
van Dijck, J. (). Datafication, dataism and dataveillance: Big data between scientific paradigm and ideology. Surveillance & Society, (), –.
Wang, W., & Liu, Y. (). Discussing mental illness in Chinese social media: The impact of influential sources on stigmatization and support among their followers. Health Communication, (), –. http://doi.org/./..
Yang, G. (). The power of the Internet in China: Citizen activism online. New York: Columbia University Press.
Yang, Z. J., Kahlor, L., & Li, H. (). A United States–China comparison of risk information–seeking intentions. Communication Research, (), –. http://doi.org/./
Ye, W., Sarrica, M., & Fortunati, L. (). A study on Chinese bulletin board system forums: How Internet users contribute to set up the contemporary notions of family and marriage. Information, Communication & Society, (), –. http://doi.org/./X..
Yuan, E. J., Feng, M., & Danowski, J. A. (). “Privacy” in semantic networks on Chinese social media: The case of Sina Weibo. Journal of Communication, (), –. http://doi.org/./jcom.
Zimmer, M. (). “But the data is already public”: On the ethics of research in Facebook. Ethics & Information Technology, (), –.

.............................................................................................................

CONCLUSION .............................................................................................................

  ......................................................................................................................

       ......................................................................................................................

 ˊ -ˊ     

1. Introduction

The world has changed much since the 1940s, the decade during which communication grew institutional roots in the United States and adopted the academic contours that still define the field today. Those were the days of mass communication, when the Office of Radio Research became the Bureau of Applied Social Research, with Paul Lazarsfeld at the helm. During his tenure, Lazarsfeld put on the table the main theoretical concerns that drove communication research in the decades that followed (Katz, ). Prominent among those concerns was the question of media effects, that is, the extent to which the media are capable of shaping the minds of the people—or, in the words of Walter Lippmann, whether the media have the ability to manufacture consent (Lippmann, ). What Lazarsfeld and his team uncovered was the importance of primary groups, “represented both as a network of information and a source of social pressure” (Katz, , p. ). Their research suggested that these groups, composed of peers, were more influential than newspapers. This finding was seen as “a good thing for democracy” because it signaled that people could “fend off” media influence (Katz, , p. ). More than seven decades later, the importance of peers and primary groups vis-à-vis the media is still the focus of much research—as is the question of whether peer influence is good or bad for democracy. However, the context of the debate has changed drastically: digital technologies have morphed the domains of interpersonal communication and broadcasting to the point of making them unrecognizable by the old standards.



 ˊ -ˊ     

In many other respects, however, the world has not changed much since the era of mass communication. At a time when social media are under increased scrutiny for their role in the spread of misinformation and “fake news” (Allcott & Gentzkow, ; Guess, Nyhan, & Reifler, ), we often forget that similar debates also took place during the golden age of radio. After all, radio airwaves were the channel that in the late 1930s allowed Orson Welles to broadcast the news of an alien invasion, a dramatization of H. G. Wells’s The War of the Worlds that was not recognized as fiction by some audience members, who succumbed to panic as they listened (Schwartz, ). After the incident, policy measures were taken to keep news broadcasters from misleading the public (even though, in the end, the episode was more consequential for the discussions that followed than for the actual level of collective hysteria it triggered). These policy measures were “based on the idea that the people . . . owned the airwaves, and that the public had a right to be informed” (Schwartz, , p. ). The policies remained in place until the late 1980s, when broadcasters were given “the freedom to tailor news content toward certain demographics, in order to maximize their ratings and appeal to advertisers” (Schwartz, , p. ). This policy change meant that, from then on, segmented audiences could get customized or slanted information. So while the ability to personalize content might have reached a new level of sophistication in our current era of filter bubbles (Pariser, ) and polarized social media (Sunstein, ), the underlying phenomenon carries echoes of times past. The forces that drive communication, in other words, have remained remarkably unchanged across various waves of technological development. What connects this era with previous eras is how technologies underpin the social architecture that allows information to flow. 
The dynamics of information flow are consequential to the extent that they can shape shifts in opinion patterns or facilitate the emergence of new forms of organization, as many of the chapters in this Handbook have discussed. This emphasis on the architecture that facilitates communication does not undermine the value of the message or the impact messages have on opinions and behavior (and many other chapters included in this Handbook pay deserved attention to these elements of the equation; see, for example, chapter  on the mechanisms that make ideas more likely to go viral or chapter  on text production in the context of political campaigns). Instead, the emphasis on the architecture of information diffusion, or networks, highlights the role that technologies play in channeling communication and determining the reach of the content spread. It is to this hidden architecture—and to how it interacts with messages and transmission mechanisms—that we point when we talk about “networked communication.” This epithet gave us the title to this Handbook because it identifies the common thread that connects the six parts organizing the chapters; but also, and perhaps most importantly, because it helps us emphasize the theoretical value of conceptualizing communication, and its effects, as an ever-shifting network of actions and reactions.





. R  P  N C

In the introduction to this volume, we wrote that “networked communication represents a new direction in a research agenda that centers on the complexity, interconnectedness, and dynamism of communication practices.” Now, through the lens of the detailed discussion offered by the preceding chapters of how that statement materializes in a range of substantive domains, we can make an additional assertion: networks are more than just a convenient metaphor. Networks offer a theoretical language and an analytical toolset that allow us to examine many of the communication dynamics that for decades remained hidden or imperfectly mapped. This theoretical language, we believe, cannot be developed in isolation from the research coming out of other disciplines, especially those that are also looking into communication dynamics as they manifest in the digital realm. The background of our contributors includes computer science, political science, sociology, human-computer interaction, physics, epidemiology, and information systems. Their work connects with communication research in a space that, as discussed in the Introduction, is already consolidating under the name of computational social science (Lazer et al., ; Watts, ). Our main goal with this Handbook was to offer a tangible space in which some of these recent developments were brought together to make their convergence explicit as it relates to the research agenda of communication as a field. This, of course, raises the question of how, exactly, communication benefits from that exchange, or why important theoretical developments will follow from it. 
The purpose of this concluding chapter is to directly address those questions and explain, in turn, why the research agenda represented in this Handbook falls in line with the questions that motivated the institutional development of communication as a field. One important element that sets this era of networked communication apart from previous eras (and certainly from the mass media era of the 1940s) is that the same technologies that channel communication can also be used to analyze its dynamics and effects. For example, we can trace the origin and pathways of information cascades (see chapters  and ); uncover organizational dynamics through digital footprints (see chapters  and ); map spatial and mobility patterns as they relate to collective action and urban landscapes (see chapters  and ); or identify the role that emotions play in political communication and deliberation (see chapters  and ). This type of analysis, which relies on observational data generated in natural environments but also on the use of new measurement instruments enabled by the digital revolution, was simply precluded by older technologies. Of course, these new possibilities also create new challenges, for instance, when trying to preserve the ethics of social research (considered, from different angles, in the five chapters that form part VI); but the opportunities to develop our theories of why communication offers a backbone to so many dimensions of social life are many and exciting.



 ˊ -ˊ     

Digital technologies have created new research frontiers that we have just started to explore. They have also created a new media environment. It is not only that networks are now more pervasive, or that “interpersonal and mass communication are increasingly intertwined” (Neuman, , p. ); it is also that the connections between sources and audiences can be mapped in ways that allow us to understand how mediated communication weaves interdependence—the networks of action and reaction to which we referred above. One of the core theories in the field, the two-step flow model (Katz, ; Katz & Lazarsfeld, ), suggests that interpersonal communication is crucial to understanding the reverberating effects of mass media. Until recently, those reverberating waves could only be depicted through broad strokes; the theory was for the most part speculative until reliable ways to connect survey and observational data were developed. We can now reconstruct the temporal and aggregate dynamics of peer-to-peer communication with much more refined devices; as a consequence, we can reconsider the scope of mass communication theory through a new empirical light (see, e.g., chapters  and ). Improvements in measurement inevitably lead to advances in theory, and today we are in a position to revisit the intuitions that motivated Lazarsfeld and his colleagues with richer data and stronger analytical tools. Not only that; we can also revisit the many other theories and models of communication that followed those foundational ideas, as the chapters in this Handbook show.

. T D

So what is the theoretical gain of focusing on the principles of networked communication? The most important is that it forces us to unpack old analogies of how communication operates and to uncover the observed mechanisms through which news and information, as they travel from person to person, exert influence on a large scale (for a more detailed discussion of this argument see González-Bailón, ; many of the ideas discussed in this section draw from that book). When Gabriel Tarde talked in the late s about the laws of imitation and the difference between publics and crowds, he was laying the foundations of how we think about diffusion and collective behavior today—ideas that shaped Lazarsfeld’s approach to what he called “the discipline of communications research” and, more specifically, his two-step flow theory (Clark, ; Katz, ; Katz, Ali, & Kim, ). At the time, all Tarde could do was elucidate through metaphors how social influence underpinned the behavior of publics and crowds. And yet the impressionistic depictions of the social world he presented in his writings responded to objective and radical changes in how communication took place, changes introduced by the technological revolution of the era: the telegraph. The telegraph, and the form of mass newspapers it allowed to flourish, made it possible for a new, powerful type of audience to arise in the form of “the public”: a modern phenomenon that detached audiences from a shared time and space and
allowed them to exist as a distributed form of collective attention synchronized around common issues (van Ginneken, ). The telegraph, however, did not allow mapping shifts in those collective dynamics as, say, social media allow us to do today; there was no obvious way to monitor that activity, to store that information, or to analyze it. It was not even obvious how to make the most of the postal system to track shifts in public attention, something about which Tarde complained explicitly because he thought it would be interesting to have access to “statistics of conversation” as compiled through letter exchange (Clark, , p. ). In his mind, those letters created important pathways through which the public gained some of its strength; they allowed individuals to coordinate their attention through the many long-range ties that postal activity helped maintain. The streams of influence that gave society its lifeblood (a common metaphor at the time) were certainly felt, but they remained intangible for most research purposes. The attempts by Lazarsfeld and his team to monitor interpersonal communication through surveys were an influential step toward making more palpable what was perceived but largely unmeasurable: the part that people played in channeling and amplifying mass media effects (Katz & Lazarsfeld, ). In this conception, as in Tarde’s, the public is not composed of isolated receptors of information but of actors embedded in large structures of interdependence that are woven and rewoven constantly—that is, the interpersonal networks through which messages resonate and information flows. The ability to map those networks, however, was highly limited by the measurement instruments available at the time. Lazarsfeld and his team used a version of the name generator to elicit data on personal ties as recalled by their respondents. 
This approach involved asking people whom they talked to, an obvious way to gather Tarde’s statistics of conversation, but one based on subjective recollection rather than on objective trails like those left by, say, letter exchange. In addition, this information allowed researchers to assess the relative popularity of specific individuals, but not how their networks fit with other networks to assemble the larger structures that give backbone and muscle to the public. We have since advanced far in our measurement and conceptualization of networks—and that is due, in great measure, to the digital revolution (Christakis & Fowler, ; Mejova, Weber, & Macy, ; Watts, ). Digital technologies have not only improved our ability to track and analyze networked communication; they have also changed “the very nature of our object of study,” as Delli Carpini writes in his introduction to part V. Fortunately, he states, “the networked information environment (and crucially, the digital traces it leaves) . . . has the potential to greatly increase the variation in communication content we can observe, our ability to accurately measure this content and its reception in often unobtrusive and more contextualized ways, and our ability to demonstrate effects and the conditions under which these effects are met.” In other words, we can now conduct better research because there is more observational data on the expression, dynamics, and consequences of communication. Today, most forms of mediated communication can be reconstructed as networks that help us trace the origins and effects of information exchange. The chapters in this
Handbook have considered some of the theoretical questions that arise around the study of those networks: How do network structure and evolution affect diffusion processes (part I)? How does the emergence of networks in online spaces change organizational dynamics (part II)? How do online networks complement or displace more traditional structures of support, and how do they amplify media effects (part III)? How do online interactions change the dynamics of political communication and behavior (part IV)? How does geographical space mediate the formation of ties, and how do digital technologies help overcome spatial constraints to rewire networks (part V)? And what are the ethical dilemmas that arise from the uses we can make (for research and practice) of networked data (part VI)? What all the preceding chapters have in common is the use of digital technologies to obtain novel empirical insights about why networked communication is so consequential for social life. They help realize, in other words, the research vision that Tarde put forward as he tried to make sense of an older technological revolution.
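The shift described above, from recall-based name generators to objective communication trails, can be sketched in a few lines of code. The log below is a toy, invented example of Tarde’s “statistics of conversation”: given records of who messaged whom, both the relative popularity that surveys could only estimate and the weighted ties they could not capture are directly computable.

```python
from collections import Counter

# Invented toy log of "statistics of conversation": (sender, receiver) pairs,
# one entry per message exchanged.
messages = [
    ("ana", "ben"), ("ana", "ben"), ("ben", "cruz"),
    ("cruz", "ana"), ("dee", "ana"),
]

# Weighted directed ties: how often each ordered pair communicated.
tie_strength = Counter(messages)

# Distinct contacts who write to each person: a rough analogue of the
# "relative popularity" that survey name generators tried to capture.
in_degree = Counter(receiver for sender, receiver in set(messages))

print(tie_strength[("ana", "ben")])  # ana wrote to ben twice
print(in_degree["ana"])              # ana hears from two distinct people
```

Unlike a respondent’s recollection, such logs also compose: merging records from many sources assembles the larger structures of interdependence that the preceding paragraphs describe.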

. T V  I W

Better measurements lead to better theories because they encourage us to sharpen the methodological tools we use to dissect and understand the world. The analysis of digital traces and the vast amounts of social data that can now be parsed and sifted has prompted researchers to look over the fence of their disciplinary boundaries. Computer and data scientists look for guidance on how to theorize about human behavior, and social scientists look for help in analyzing new sources of data. Another important element the preceding chapters have in common is that they do not necessarily operate within the disciplinary boundaries that have delimited the field of communication since the s. Around the same time that Lazarsfeld and his team published their first research on interpersonal communication and media effects, Claude Shannon published another influential article under the title “A Mathematical Theory of Communication,” later transformed into a book (Shannon & Weaver, ). Shannon’s approach laid the groundwork of information theory, unpacking the mathematical building blocks of communication and, subsequently, the hidden probabilistic structure of language. This work would ultimately lead to developments in cryptography, natural language processing, and computational linguistics, but at the time there were very few bridges connecting the way in which these engineers and emerging communication scholars thought about their work. Such bridges are still scarce today, but luckily they are being erected in increasing numbers. This type of collaboration has already led to the adoption of methodologies like machine learning and large-scale text analysis, which in turn are empowering the type of research that communication scholars can do in realms such as health and political behavior (see, e.g., chapters , , and ).
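The probabilistic core of Shannon’s framework can be stated in a single, standard formula (a textbook result, not specific to any chapter in this volume): the entropy of a source \(X\), the average number of bits needed to encode its messages, is

```latex
H(X) = -\sum_{x} p(x)\,\log_2 p(x)
```

A source that always emits the same symbol has zero entropy; a uniformly random one maximizes it. This is the sense in which language has a “hidden probabilistic structure”: its statistical regularities lower its entropy, and that predictability is what modern language processing exploits.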





We are just starting to create a common language and research standards that can be used across disciplinary domains (Salganik, ). One of the realizations that have surfaced from this digital revolution is that “the scholarly research paradigm is lagging far behind the relentless pace of technical change” (Neuman, , p. ). New computational methods and modeling choices are helping us catch up with the empirical demands of digital data. However, as Lazer writes in his introduction to part I, “there is yet so much to be done even in this territory”; to the extent that digital data are often the “digital refuse” of modern technologies, a careful mapping “between behavior and relevant social science constructs” is necessary, but often problematic. Digital technologies have not dissolved the truth contained in the old adage that “not everything that can be counted counts, and not everything that counts can be counted” (Lohr, , p. ). And yet much in the same way as new technologies have encouraged us to think about complex network dynamics, “we can step back and think about knowledge as a networked system, with connections across theories and disciplines providing a more stable base upon which to innovate,” as Ellison writes in her introduction to part III. Knowledge, too, can be conceived of as a structure connecting ideas, and interdisciplinary work can only improve the density and quality of those connections. Developing a common framework for theory building, with translatable research standards, depends on the existence of those ties bridging disciplinary domains. The chapters in this Handbook are an example of how to develop that type of work.

Current and Future Challenges

It is difficult to build theory around phenomena that cannot be observed, but the vast amounts of data digital technologies have made available present an equally relevant problem: not all observations are useful for the purposes of research, and the size of a data set does not guarantee the quality of the information it contains. Theory is still the safest guide to determine whether the slice of the world we can see through the digital lens is informative enough. Issues like representation and bias are relevant to understanding the limits of digital research, as the various chapters in this Handbook have considered from their respective empirical corners. Also relevant is the question of how to link different forms of communication so that the interplay of online and offline behavior can be more accurately mapped. An important aspect of this problem is how to incorporate geography and space into the analysis. As Marvin notes in her introduction to part IV, rather than render space irrelevant, digital technologies have made it even more salient; most digital research, she writes, is “locally entangled at every point with physical bodies, political cultures, and mobile communication systems.” And as with analog mapping, higher resolution is not always a virtue. Finding the right approach to the fine-grained landscape of digital traces requires a trade-off between detail and efficacy.



 ˊ -ˊ     

Related to this problem is the question of how to go from description to explanation. The changes triggered by the irruption of digital networks have been so drastic in so many domains that much research has taken a descriptive approach to those changes. Statistical methods to infer patterns and yield predictions from large-scale digital data are becoming increasingly powerful, but there is still much work to be done, both in developing new methods and in making them more conventional in the research agenda of communication as a field. An epistemological issue that comes up as part of this discussion is the relationship between data-driven and theory-driven research. Rather than alternatives, these two approaches should complement each other, although it is true that data-driven research often takes priority because of the new sources of large-scale data and the analytical tools developed to sift through those data. The question of who has access to those sources of information immediately follows. The cloistering of data within private companies, non-disclosure agreements (NDAs) that prevent data sharing among researchers, and ongoing changes to API access that limit or, in some cases, completely eliminate researchers’ access to data all challenge the entire research enterprise. Transparency and the ability to reproduce findings are essential features of cumulative research, but these are principles that are difficult to enforce if access to data becomes restricted on proprietary grounds. Moreover—and less frequently discussed—such limitations risk creating a hierarchy of researchers who can access data easily, less easily, or not at all. When access to data hinges on things that advantage already privileged researchers and institutions—including internship relationships, limited data grants, and high fees to purchase data—we directly feed the elitism and exclusivity that has hampered diversity in science in the past.
As we argue in the Introduction, research benefits from removing barriers to access that limit diversity and, as a result, the quality of our collective work. Privacy concerns are another legitimate reason access to data is often restricted. Those concerns are usually expressed in terms of violated expectations, as Hancock explains in his introduction to part VI. Improving our understanding of how people conceptualize technologies, he argues, can help us understand how they engage in the data-generating process and therefore how to avoid violating their expectations when designing research. Likewise, the expanding use of artificial intelligence and machine learning to make predictions that can inform decision-making raises the risk of perpetuating bias and structural discrimination, which is of particular relevance when using research to inform interventions, for example in the design of targeted campaigns. Of course, the core of the challenge is not really new; bad research has always informed poor decision-making, and the use of artificial intelligence and other computational tools is no exception to this rule. However, insofar as algorithms are perceived as “objective,” researchers must take extra care to unpack and critically evaluate biases in the data driving our interventions. One of the new avenues through which networked communication can exert mass influence has taken the form of algorithmic interventions, that is, bots or software designed to seed and spread messages (Ferrara, Varol, Davis, Menczer, & Flammini, )
or encourage some sort of behavioral response (Munger, ; Shirado & Christakis, ). These artificial actors are being deployed to shape the internal logic of networked communication: to the extent that they are centrally controlled by organized interests, they can redirect information flows according to some politically motivated design. This seems to counter Lazarsfeld’s idea that peer influence helps people “fend off” media influence, or the idea that peer discussions are a good antidote to media manipulation; when some of the “peers” in the networks we are exposed to are not humans but actors executing a preprogrammed script, the logic of decentralized communication changes drastically. It is too early to determine the consequences of having these artificial actors shape the way in which information circulates online, although some evidence has started to accumulate across political contexts (e.g., Ferrara, ; Stukal, Sanovich, Bonneau, & Tucker, ). Future research should look more systematically at how software programmed to seem human shapes the dynamics of attention allocation when inserted in online networks. Future research should also consider whether other forms of algorithmic intervention result in networked communication evolving into a hybrid model of mass influence built under the appearance of peer effects. Scholars are still working out how best to deal with these challenges. The task will require an ongoing conversation and research designed to offer cumulative insights on ever-changing technologies. As the chapters in this Handbook show, the advantages that digital data offer and the new methodological possibilities of computational tools are already palpable. Understanding the effects that technological changes have on communication dynamics, and the effects that these in turn have on social phenomena and individual behavior, is a long-distance race with no clear finish line.
We will move faster in that endeavor if we integrate the knowledge that comes out of different disciplinary efforts, especially when they converge around the same questions. This volume has offered an overview of some of the most exciting developments in digital research arising at this intersection of disciplinary approaches. Surely this is just the starting point of many more exciting advances to come. The preceding chapters are all intended to offer an entryway to those developments.

R Allcott, H., & Gentzkow, M. (). Social media and fake news in the  election. The Journal of Economic Perspectives, (), –. Christakis, N. A., & Fowler, J. H. (). Connected: The surprising power of our social networks and how they shape our lives. New York: Little, Brown & Company. Clark, T. N. (). Gabriel Tarde: On communication and social influence. Chicago: University of Chicago Press. Ferrara, E. (). Disinformation and social bot operations in the run up to the  French presidential election. First Monday, (). Ferrara, E., Varol, O., Davis, C., Menczer, F., & Flammini, A. (). The rise of social bots. Communications of the ACM, (), –.



 ˊ -ˊ     

González-Bailón, S. (). Decoding the social world: Data science and the unintended consequences of communication. Cambridge, MA: MIT Press. Guess, A., Nyhan, B., & Reifler, J. (). Selective exposure to misinformation: Evidence from the consumption of fake news during the  U.S. presidential campaign. Working paper. https://www.dartmouth.edu/~nyhan/fake-news-.pdf Katz, E. (). The two-step flow of communication: An up-to-date report on a hypothesis. Public Opinion Quarterly, (), –. Katz, E. (). Communications research since Lazarsfeld. Public Opinion Quarterly, (), S–S. doi:./poq/._PART_.S Katz, E., Ali, C., & Kim, J. (). Echoes of Gabriel Tarde: What we know better or different  years later. Los Angeles, CA: USC Annenberg Press. Katz, E., & Lazarsfeld, P. (). Personal influence. The part played by people in the flow of mass communications. New York: Free Press. Lazer, D., Pentland, A., Adamic, L., Aral, S., Barabási, A.-L., Brewer, D., . . . Van Alstyne, M. (). Computational social science. Science, , –. Lippmann, W. (). Public opinion. New York: Harcourt, Brace. Lohr, S. (). Data-ism: The revolution transforming decision making, consumer behavior, and almost everything else. New York: HarperCollins. Mejova, Y., Weber, I., & Macy, M. W. (). Twitter: A digital socioscope. Cambridge, UK: Cambridge University Press. Munger, K. (). Tweetment effects on the tweeted: Experimentally reducing racist harassment. Political Behavior, –. doi:./s--- Neuman, W. R. (). The digital difference: Media technology and the theory of communication effects. Cambridge, MA: Harvard University Press. Pariser, E. (). The filter bubble: How the new personalized web is changing what we read and how we think. New York: Penguin Press. Salganik, M. J. (). Bit by bit: Social research in the digital age. Princeton, NJ: Princeton University Press. Schwartz, A. B. (). 
Broadcast hysteria: Orson Welles’s “War of the Worlds” and the art of fake news. New York: Farrar, Straus and Giroux. Shannon, C. E., & Weaver, W. (). The mathematical theory of communication. Urbana: University of Illinois Press. Shirado, H., & Christakis, N. A. (). Locally noisy autonomous agents improve global human coordination in network experiments. Nature, , . doi:./nature https://www.nature.com/articles/nature#supplementary-information Sukal, D., Sanovich, S., Bonneau, R., & Tucker, J. A. (). Detecting bots on Russian political Twitter. Big Data, (), –. Sunstein, C. R. (). #Republic: Divided democracy in the age of social media. Princeton, NJ: Princeton University Press. van Ginneken, J. (). Crowds, psychology, and politics, –. Cambridge, UK: Cambridge University Press. Watts, D. J. (). Six degrees. The science of a connected age. London: William Heinemann. Watts, D. J. (). A twenty-first century science. Nature, , .

I

.................

Note: Tables and figures are indicated by an italic ‘t ’ and ‘f ’, respectively, following the page number. AAAS. See American Association for the Advancement of Science Abbott, A. , ,  abductive inference  ABMs. See agent-based models A/B testing , n,  ACC. See anterior cingulate cortex accounts, for web scraping  Accra  Ackland, R.  activity-driven networks – epidemic spreading and –, f preferential attachment model for  random walks in –, f rumor spreading and –, f actors in agenda setting  between centrality of  in crisis events – in politics  reciprocity of  Adamic, L. A.  Adblock Plus  adjacency matrix for epidemic spreading  for unweighted networks –, f Adler, P. S.  Affective Norms for English Words  Agarwal, S. D.  agency in agenda setting  digital trace data and – social capitalization and – agenda convergence – agenda setting  actors in  agency in 

audience fragmentation and ,  audience members for ,  big data and  bottom-up – challenges of – complexity reduction in –, f concept association ties in – cumulative advantage in  degree centrality in  digital trace data for  dyadic attributes in  information sources for ,  interorganizational ties in  issue adoption ties in ,  media use ties in – network model for –, f, f obtrusiveness in  in politics ,  public opinion and  sampling in  agent-based models (ABMs)  aggregation services, retweets by  agile ethics  agility, of communication – AI. See anterior insula; artificial intelligence AirBnB  Albert, R. ,  alcohol advertising, GIS and – Amazon  Amazon Glacier  Amazon Mechanical Turk (AMT) , , – Amazon Web Services  American Association for the Advancement of Science (AAAS) – American National Election Study –, n





AMT. See Amazon Mechanical Turk Al-Ani, B.  annealed networks epidemic spreading on ,  random walks and ,  timescales of  anonymization on Chinese Internet –,  in data collection ,  on social media – anterior cingulate cortex (ACC)  anterior insula (AI)  anticipatory coupling, in information sharing  AoIR. See Association of Internet Researchers application programming interfaces (APIs) authentication of  for data collection – digital trace data from ,  for Facebook , ,  for Instagram  libraries for  for New York Times  for online communities ,  rate limits on – sampling of  for server-side logs  for social media ,  ToS for , – for Twitter –, , , , , ,  apps, for cell phones , , , ,  Arab Spring  hashtags for  on social media – triangulation in  Twitter and –, ,  Aral, S. – ArcGIS  for alcohol advertising  for assault risk – for teen mobility  argument mining, deliberation and – argument quality, in deliberation –, t Armstrong, B. K. 

artificial intelligence (AI) for digital research  organizational dynamics and  aspiration, in satisficing semantic search , – assault risk place for – points of –, f retrospective data collection for –, f, f space for  Association of Internet Researchers (AoIR)  Ethical Decision Making and Internet Research of ,  attention concentration, on Internet – in crisis events –, f, f, t, f, t dynamics of – information sharing and  scarcity, with information proliferation  on social media – strength of weak ties hypothesis and – audience for deliberation  of drag culture  fragmentation of, agenda setting and ,  of information sharing –, – members, for agenda setting ,  audio analysis  audit trail, for data cleaning  authentication, of APIs  automated text  available alternatives, in satisficing semantic search  available dose, from cigarette advertising – avatars gender of  Proteus effect on  in virtual worlds ,  Avle, Seyram  Avnit, A.  Aycock, J. 

 backchannel, social media as  Backstrom, Lars  backups, for data collection  Baker, W. E.  Bakshy, E. S.  Banchs, R. E.  Banerjee, Indrajit – Banet-Weiser, S.  Bangladesh, blogs in  Barabási, A. L. ,  barnstars, in Wikipedia  Baym, Nancy K.  beautifulsoup  Belmont Report , ,  Benevenuto, F. ,  Bennett, W. L. , ,  Bernulli, D.  betweenness centrality  Bezos, Jeff  big data ,  agenda setting and  broad data and  in China  digital trace data in  from Facebook ,  hubris  information flow and – from little teams  for mobility  risks of – from social media , – surveillance from  Bimber, B.  Bingbot  Bit by Bit: Social Research in the Digital Age (Salganik)  BitLocker  blogs in Bangladesh  for drag culture  flaming on  Huffington Post as  Martin, T., and  mass media and  mentions in  in Middle East  for well-being early detection 



blood-oxygen-level-dependent (BOLD)  blue team dynamics  BOLD. See blood-oxygen-level-dependent Borgatti, S. P. ,  Borge-Holthoefer, J. –, ,  Boston Marathon bombing crisis  bots  bottom-up agenda setting – boundary spanning  Bourdieu, P.  Bowker, G. C.  bridges in Middle East – on social media – broadcast media. See mass media broad data  Brooklyn, drag culture in –, – Bruns, Axel –,  Bryant, J. A.  bubbles, on social media – Buchanan, E. A. ,  bullying  Burgess, J. E. ,  burstiness random walks and  of time-varying networks  Burt, R. S. ,  Butler, Brian S. – Butts, C. T. ,  Byrnes, Hilary  C++  California Report Card (CRC)  Campbell, K. E. , ,  Carey, James W.  Carr, C. T.  Carroll, G.  cascade model for agenda setting  for word-of-mouth –, f casperjs  Cattuto, C.  CDC. See Centers for Disease Control and Prevention celebrities on Chinese Internet  influence of , 





celebrities (Continued) Martin, T., and  propagation phenomena and  as trendsetters  cell phones apps for , , , ,  from China ,  cost reduction for – data bundling on , f, f encryption for  entrepreneurship with – Facebook on  face-to-face and  gender and – in Ghana – in global South –, f, f, f, f GPS on  ICT and  innovation with – Internet on – mapping with – mobile data on , f, f, – mobile payments with – mobility and – prepaid –, f, f scratch cards for –, f, f SIM cards for , ,  SNA for – social media on  subsidized connectivity for – tie strength and – time-varying network burstiness for  for well-being  women with – Center for Epidemiologic Students Depression Scale (CES-D)  Center for Open Science  Centers for Disease Control and Prevention (CDC) ,  CES-D. See Center for Epidemiologic Students Depression Scale CFAA. See Computer Fraud and Abuse Act of  Cha, M. ,  Chaffee, S. H.  Chalmers, M. 

Change.org  Chevaliers’ Romance III  female participants in  China big data in  cell phones from ,  cybersecurity law in n data mining in  rumor spreading in  Chinese Internet – anonymization on –,  content analysis for – crowdsourcing on  digital research on –, t ethics with – ethnography on – information leakage from n opinion leaders on  PII on – privacy on , –, – social media on ,  surveys on  vulnerable populations on  Choi, S.  Choudhary, A.  Chung, Cindy J.  cigarette advertising available dose from – GIS and – CIOMS. See Council for International Organizations of Medical Sciences Citizen Lab  citizen reporting, in crisis events  citizen science  CivilServant.io  Clauset-Newman-Moore algorithm  click fraud – cloud-based storage  clusters homophily of  indegree influence in t Jaccard coefficient for –, t Louvain algorithm and , t,  in mass media  persistent , f in politics –, t proximity of 

 SNA for  subgraphs of  CMC. See computer-mediated communication cognitive closeness , ,  Cognos  Coleman, J. S.  “Collaboration of the Week,” on Wikipedia  collective action in Arab Spring  with ICT  social capital and  collective sense-making, in crisis events –,  collective social capital – Colorado Springs Police Department Public Affairs Section (CSPDPIO)  Columbia studies  command-and-control protocols, in crisis events  common pool source management  The Common Rule ,  communication. See also specific topics agility of – changing nature of – channels, information sharing on – data collection for – in deliberation  in democratic engagement – methods for – in Middle East – mobility of – organizational dynamics and –,  in politics –, – SNA and – tie strength and – Twitter for  Communication Explorer –, nn– communication infrastructure theory  community structure  of drag culture n complex networks methods  computational linguistics  Computational Methods Interest Group 



computational social science ,  for communication and organizational dynamics – data collection for –, – DDR and – for individuals  methods of – new theories from – for online communities  PDR and  risks and benefits of  scale of , – social contagion and  TDR and – Computer Fraud and Abuse Act of  (CFAA)  computer-mediated communication (CMC) – in drag culture – for well-being – concept association ties, in agenda setting – conceptual connections, in deliberation –, f conditional probability, of random walks  connective action  connectivity SNA for  of time-varying networks  connectivity driven frameworks  connectors  Conover, M.  consensus-building  contact networks  contagion. See also rumor spreading; social contagion emotional, on Facebook –, –, –,  epidemic spreading –, f,  frameworks, with agenda setting  content cohesion  in deliberation , – of information sharing  personalization  content analysis  for Chinese Internet – of deliberation –





content analysis (continued): in digital research; for gender; for virtual worlds
content polarization
convergent evolution
Coopersmith, Glen
Cordner, A.
core, of deliberation
cost of uncertainty, in social investment
cost-per-click (CPC)
cost-per-impression (CPM)
Couldry, Nick
Council for International Organizations of Medical Sciences (CIOMS)
Coviello, L.
CPC. See cost-per-click
CPM. See cost-per-impression
CRAWDAD
crawlers: on Facebook
CRC. See California Report Card
crisis events: actors in; attention in; citizen reporting in; collective sense-making in; command-and-control protocols in; data collection on; degree dynamics in; digital trace data on; digital volunteerism in; Facebook in; face-to-face in; future research for; ICT in; mass convergence in; mass media in; nodes in; retweets in; rumor spreading in; social convergence in; social media in

  Twitter in; victim requests in; vocabulary in
crisis informatics
Cromby, John
crowdsourcing. See also Amazon Mechanical Turk
  on Chinese Internet; for data collection; IRB for; on mental health; in Qatar
crowdworkers
Cruz, Ted
CSPDPIO. See Colorado Springs Police Department Public Affairs Section
Culotta, Aron
cultivation theory
culture: digital research and; ethics and; ethnography for; information sharing and
culture of fear
cumulative advantage, in agenda setting
curl
customer-made data
cyberbalkanization
cyberbullying
cybersecurity law, in China
Dahlberg, L.
Dao, Bo
dark networks
DARPA. See Defense Advanced Research Project Agency
data bundling, on cell phones
data cleaning: agenda setting and; audit trail for; of digital trace data; Robots Exclusion Standard for
data collection. See also digital research; ethics of data collection; mobile data
  anonymization in; APIs for; backups for

  click fraud with; from companies; for computational social science; on crisis events; crowdsourcing for; data donations for; of digital trace data; on emergency responders; harm to services from; harm to users from; on human subjects; impression fraud with; informed consent for; IRB for; legal issues with; on mass media; for media multiplexity; on mental health; of metadata; on mobility; opportunistic; on organizational dynamics; for politics; privacy in; reproducibility in; responsible disclosure in; retrospective, for assault risk; secure storage for; for sentiment analysis; from social media; technical aspects of; for teen mobility; for tie strength; web scraping for
data donations
data-driven research (DDR): on virtual worlds
datafication
data fuzzing
data mining: in China; of social media; of Twitter Lists
Data Protection Directive, of EU



data retention/deletion policy, on social media
data security: as arms race; with digital methods; for mobile data
data sharing. See information sharing
Data Sharing Service, of ICWSM
data storage: of digital trace data
Dawkins, R.
DDR. See data-driven research
death penalty
De Bruijn, Mirjam
De Choudhury, Munmun
Declaration of Helsinki
decline, in organizational dynamics
deductive approach: abductive inference and; to digital trace data; for politics
Defense Advanced Research Project Agency (DARPA)
degree centrality, in agenda setting
degree dynamics, in crisis events
de-identification, on social media
Delacroix, J.
deliberation: argument mining and; argument quality in; communication in; conceptual connections in; consensual version of; content analysis of; content in; core of; criteria for; on death penalty; defined; on Facebook; focus groups for; hierarchical classification of; models and measures for; outcomes of; phases of; on politics; proliferation of





deliberation (continued): reciprocity in; SNA on; on social media; stages of; talk through ideas in; on Twitter; web links and
deliberative democracy
Delli Carpini, Michael
democratic engagement: in Arab Spring
demographics: of media multiplexity; of tie strength; of online communities
Department of Health and Human Services (HHS)
Department of Homeland Security
dependent variable, for media multiplexity
depersonalization, in virtual worlds
depression: from Facebook; population-scale measurement for; PPD
Desouza, Kevin C.
Dexter, S.
dictionary-based approach, for politics
diffusion (models, frameworks): with agenda setting; of Internet; in Middle East; organizational dynamics and; in population-level analysis; random walks and
digital divide
digital ethnography. See ethnography
“Digital Mapping of Urban Mobility Patterns” (Wiebe and Morrison, M.)
digital methods: data security with; for digital research; ethics of; IRB for; risk of

digital research: AI for; challenges to; on Chinese Internet; content analysis in; culture and; difficult environments for; digital methods for; ecosystem; ethics of; on Facebook; informed consent in; IRB for; on mass media; privacy and; risk of; technical and legal issues of; on terrorist networks; unintended consequences of; on vulnerable populations
Digital Research Confidential (Hargittai and Sandvig)
digital rights management (DRM)
digital trace data: agency and; for agenda setting; analysis and reporting of; from APIs; in big data; on crisis events; data cleaning of; data collection of; data storage of; for discrimination; on emergency responders; ethics of; for ethnography; on Facebook; on gender; generalizability from; identity and; informed consent for; as naturalistic data; partition-specific network analysis of; personalization from; privacy of

  research design for; self-report of; server-side logs and; on social media; transparency with; from triangulation; from Twitter; on virtual worlds
digital volunteerism, in crisis events
Dinakar, Karthik
disasters. See crisis events
Discourse Quality Index (DQI)
discovery-driven research
discrimination, digital trace data for
discursive ego network
disruptions: from CMC; in organizational dynamics
Dittrich, D.
dot-com bubble
DQI. See Discourse Quality Index
drag culture: audience of; blogs for; in Brooklyn; CMC in; community structure of; defined; focus groups for; geotagging of; in Manhattan; media imagery of; performances in; politics of; queer terroir and; reading in; safe spaces in; shade in; social media in; third-set oriented in; urban exceptionalism and
Drag Race (RuPaul)
Dredze, Mark
DRM. See digital rights management
Durkheim, Emile



dust bowl empiricism
dyadic: in agenda setting; media multiplexity and
dynamical processes: on activity-driven networks; with field dependence; in time-varying networks
dynamic consent
Eagle, Nathan
early adopters: of ICT; of organizational dynamics; of social conventions; trendsetters as
eating disorders
Ebola crisis
ecological dynamics, population-level analysis of
ecological networks
Editor & Publisher Yearbook
effort, in satisficing semantic search
Egypt: polarization in; retweets in; Twitter in
eigenvalues: for epidemic spreading; of Laplacian matrix
Eilders, C.
Electronic Frontier Foundation
Ellis, Darren
Ellison, N. B.
email: data security with; to Facebook; hyperedges for; time-varying network burstiness for
emergence, of organizational dynamics
emergency responders: data collection on; digital trace data on; keyword-based samples for





emergency responders (continued): mass convergence and; routines and reactions of; social media for; Twitter and; user-centric approach for
emotion. See sentiment analysis
emotional contagion, on Facebook
emulative computational models
encryption: for cell phones; homomorphic
END Framework
end user license agreement (EULA)
English, Robert C.
English, Wikipedia in
enterprise social media platforms
entrepreneurship, in global South
epidemic spreading: geotagging for; instantaneous networks for
epidemic threshold
epidemiology, mobility and
EQII. See EverQuest II
Erdoğan, Recep Tayyip
Erdős-Rényi network
Eslami, M.
Ess, C.
Ethical Decision Making and Internet Research (AoIR)
ethical self-direction and correction
ethics. See also responsible research
  with Chinese Internet; culture and; of digital methods; of digital research; of digital trace data; of Facebook; of information sharing; on Internet research; mental health and; reflexive research; of vulnerable populations

ethics of data collection: agenda setting; on Internet; organizational dynamics and
ethnography: on Chinese Internet; for culture; digital trace data for; on gender in virtual worlds; network; passing; in protests; semantic explanation for; virtual
EU. See European Union
EULA. See end user license agreement
European Union (EU): Data Protection Directive of; privacy in
evangelists
Everett, M. G.
EverQuest II (EQII): data quality and management for; female participants in; gender in; gender swapping on; PUGS in; SNA on
evolutionary biology
explanation: pragmatic; semantic; syntactic
extramedial level, for agenda setting
Eynon, R.
Facebook: accounts for; APIs for; average time spent on; big data from; bridges on; on cell phones; crawlers on; in crisis events; Data Science of; deliberation on

  depression from; digital research on; digital trace data on; drag culture on; email to; emotional contagion on; ethics of; friendship connection on; functions of; information diet on; information sharing on; ISCS for; Martin, T., and; newsfeed; newsfeed algorithm for; online community of; parsing for; politics on; psychosocial support from; re-identify data on; self-censorship on; social investment in; strong ties in; text analytics for; for well-being early detection
face-to-face: cell phones and; in crisis events; for democratic engagement; effort in; hyperedges for; information sharing; for politics; social media and; tie strength of
failure, in organizational dynamics
Faraj, Samer
Faulkner, R. R.
Faust, Katherine
FDNY. See New York City Fire Department
Federal Emergency Management Agency (FEMA)
Federal Trade Commission
Feldman v. Google
FEMA. See Federal Emergency Management Agency



females. See women
Ferriter, Michael
Fiandaca, Cheryl
field dependence, dynamical processes with
FileVault
first moments, for epidemic spreading
flaming
Flanagin, A. J.
flash organizations
FLOSS. See free/libre open source software
fMRI. See functional magnetic resonance imaging
focus groups: for deliberation; for drag culture
followers: million follower fallacy and; nodes with; on Twitter
formation, of public opinion
Foursquare: re-identify data on
FOX News
fragmentation: of audience, agenda setting and; communication and; of mass media; in Qatar
free/libre open source software (FLOSS)
Freelon, D.
French, Megan
frequency counts
Friess, D.
Fulk, J.
functional magnetic resonance imaging (fMRI)
Garrett, R. K.
gatekeeping: by mass media
Gawker
Gee, Laura K.
gender. See also drag culture
  cell phones and





gender (continued): content analysis for; defined; digital trace data on; in EQII; future research on; homophily for; on online communities; stereotypes for; in virtual worlds
genderfucking
genderqueering
gender role theory, for virtual worlds
gender swapping: messages in; in virtual worlds
General Inquirer
generalizability: from digital trace data; from population-level analysis; of virtual worlds
Gentzkow, M.
geographic information systems (GIS). See also mapping
  alcohol advertising and; cigarette advertising and; in health; lines in; in mobility; place in; points in; polygons in; retrospective study design for; space in; vector layers in
geotagging: of drag culture; for epidemic spreading
geowebs
Gerbaudo, Paolo
Gerbner, George
Gerding, E.
Gezi Park protests
GFW. See Great Firewall of China

Ghana: cell phones in; mobile credit transfers in; mobile payments in; SIM cards in
Gil de Zúñiga, Homero
GIS. See geographic information systems
Github
Glance, N.
Glasius, M.
Gleason, L. S.
Global Open Data Initiative
global positioning system (GPS): on cell phones; in global South; for teen mobility
global South: cell phones in; entrepreneurship in; GPS in; innovation in; mapping in; mobile data in; poverty reduction in; prepaid cell phones in; well-being in
Goffman, Alice
Gold, V.
Golub, Adam
Gomer, R.
González-Bailón, S.
Goode, J. P.
Goodreau, Steven
Google: big data from; image-labeling system in; random walks and; VSMs at
Googlebot
Google Flu Trends
Google Inc. v. Auction Expert International L.L.C., et al, CV
Google Street View
Google Translate
GPS. See global positioning system

 Graeff, Erhardt , ,  Grameen Foundation  Granovetter, Mark S. , , , ,  graph analysis techniques, for social media – graphical user interface (GUI)  grassroots influencers , – Gray, M. L. ,  Great Firewall of China (GFW)  Green, L.  Greenspan, Alan  Grimmer, Justin  group dynamics, of nodes  groupthink ,  GSM Association (GSMA)  Gu, Bin  GUI. See graphical user interface Guillory, Jaime E. ,  Gummadi, K. P. ,  Guo, L.  Habermas, J.  Haddadi, H. ,  Halberstam, J.  Hall, Stuart  Hancock, Jeffrey T. , ,  Hanhardt, C. B.  Hargittai, E. , – HARKing  Harman, Craig – harm from data collection to services – to users  Harper, Kelley  Hartzler, Andrea  Harvard Dataverse  hash functions , n hashtags ,  for Arab Spring  semantic sourcing and  Haythornthwaite, Caroline ,  HCI. See human and computer interaction HDI. See human-data interaction headless tools  health. See also mental health GIS in –



  mapping for; social media and
Health Communities for Teens
heat maps
Henderson, T.
Hendler, Jim
HHS. See Department of Health and Human Services
hierarchical tree model
high-resolution data
Hmielowski, J. D.
Holbert, R. L.
homomorphic encryption, for social media
homophily: of clusters; for gender; hubs and; SNA and; in social investment; of subgraphs; of tie strength
honeypots
Horst, Heather
Horvitz, Eric
hot cognition hypothesis
Howard, Philip N.
HTML
Huang, Wenyi
Huawei
hubs: in activity-driven networks; homophily and; Jaccard coefficient and
Huffington Post
human and computer interaction (HCI)
human-data interaction (HDI)
Hurricane Sandy
Hutton, L.
Hvizdak, E.
hyperedges: for relational event models
hypergraphs: for peer production; for team interlocks
hyperlinks. See web links
hypothesis-free research





IAP. See International Association of Public Participation ICT. See information communication technology ICT development (ICTD) ,  ICWSM. See International Conference on Weblogs and Social Media identity. See also gender; individuals; privacy digital trace data and – ignorant nodes, in rumor spreading  Ikeda, K.  image analysis  image-labeling system, in Google  IMDb. See Internet Movie Database impersonation  impression fraud – inadvertent exposure, on social media – indegree influence , f adjacency matrix for – in clusters t million follower fallacy and – by trendsetters –, f independent variable, for tie strength  Indignados, in Spain  individuals. See also privacy computational social science for  information sharing and – network theory and  organizational dynamics and  PII on , – satisficing semantic search for  inductive approach abductive inference and  to digital trace data  for politics  inferred mobility  influence/influencers. See also indegree influence; opinion leaders; outdegree influence; trendsetters; user influence of celebrities ,  grassroots , – mention , , f of public figures  retweet , , f semantic sourcing and  social contagion and  on social networks n

influentials
information communication technology (ICT): cell phones and; collective action with; in crisis events; early adopters of; ITU and; in Middle East
information diet: on social media; on Twitter
information flow: big data and; networks and
information leakage, from Chinese Internet
information sharing: attention and; audience of; on communication channels; content of; culture and; defined; ethics of; impact of; individuals and; limits to; on mass media; neuroscience of; populations and; reach of; self-related processing of; sharer-audience interactions in; sharers of; sharing contexts in; social cognition of; valuation of; value-based virality for
information sources, for agenda setting
information subsidies, satisficing semantic search and
informed consent: agenda setting; best practices for; for data collection

  in digital research; for digital trace data; for social media
initiators, in hierarchical tree model
Innis, Harold
innovation/innovators: with cell phones; in global South; of organizational dynamics
INSNA. See International Network for Social Network Analysis
Instagram: APIs for; drag culture on; for eating disorders
instantaneous networks: for epidemic spreading; random walks and
institutional review board (IRB): for crowdsourcing; for data collection; for digital methods; for digital research; information sharing and; for social media
insurance premiums
intellective computational models
intellectual property, on Internet
intermedia level, for agenda setting
International Association of Public Participation (IAP)
International Communication Association
International Conference on Weblogs and Social Media (ICWSM)
International Network for Social Network Analysis (INSNA)
International Telecommunications Union (ITU)
Internet. See also Chinese Internet; specific topics
  attention concentration on; on cell phones; data collection on; diffusion of; drag culture on; ethics of data collection on



  intellectual property on; in Middle East; privacy policies on; publishers on; research on, ethics on
Internet Archives
Internet Movie Database (IMDb)
Internet social capital scales (ISCS): for Facebook
Internet Success (Schweik and English)
interorganizational ties, in agenda setting
InVenture
IP address, rate limits and
IRB. See institutional review board
ISCS. See Internet social capital scales
Islamic State of Iraq and Syria (ISIS)
isomorphism
issue adoption ties, in agenda setting
ITU. See International Telecommunications Union
Iyengar, S.
Jaccard coefficient: for clusters; hubs and
Jackson, Michael
Al-Jaidah, Mahmoud
Jana Mobile
Janssen, D.
JavaScript Object Notation (JSON)
Jive
Johnson, B.
Johnson, Steven L.
Jones, C.
jQuery
JSON. See JavaScript Object Notation
JSTOR
Kadushin, C.
Kaggle
Kaltenbrunner, A.
K-anonymity
Karsai, M.
Katikalapudi, Raghavendra
Katz, E.
Kaye, J.





“Keep America American”
Kelly, Anita E.
Kendzior, S.
Kenya
kernel density methods
keyword-based samples, for emergency responders
Kickstarter
King, G.
King, J. H.
Kittur, Aniket
Kitzinger, C.
Kitzinger, J.
KKK. See Ku Klux Klan
Kleinberg, Jon
Kobayashi, T.
Koch, N.
Kony 2012
Kovats-Bernat, J. C.
Kozlowski, S. W. J.
Krackhardt
Kramer, Adam D. I.
Kraut, Robert E.
Ku Klux Klan (KKK)
Kutcher, Ashton
Kwak, N.
Kweli, Talib
Kwon, S. W.
Laplacian matrix
Lasswell, Harold
latency, of time-varying networks
latent Dirichlet allocation
latent semantic analysis
Lawson, D.
Lazarsfeld, Paul F.
Lazer, D.
Ledbetter, Andrew
Lee, R. M.
Lee, Spike
Lee v. PMSI
legal issues: with data collection; with digital research; with social media; with ToS

legitimacy, organizational dynamics and
Leskovec, Jure
Lévi-Strauss, Claude
Levy, Mark
liability of newness
libraries, for APIs
Lin, N.
lines, in GIS
Ling, Rich
Lingel, Jessa
Linguistic Inquiry and Word Count (LIWC): for semantic sourcing
LinkedIn
Linux
Lippmann, Walter
Liu, Leslie S.
LIWC. See Linguistic Inquiry and Word Count
Louvain algorithm: clusters and; nodes and
lxml
Lysenko, Volodymyr V.
machine learning: politics and; for sentiment analysis
Malinowski
malleable encryption schemes, on social media
Manhattan, drag culture in
manipulation, of platforms
Mann, Chris
Manson, N. C.
mapping: with cell phones; in global South; for health; of mobility; of place; of space; of virtual worlds; for well-being
March, J. G.
Markham, A. N.

Markov process
Markowitz, D. M.
Marsden, P. V.
Martin, Trayvon
Marvin, Carolyn
Marx, Karl
mass audience
mass convergence, in crisis events
massively multiplayer online game (MMOG): female participants in; play styles, motivations, and performance in; resiliency in
mass media: clusters in; in crisis events; data collection on; defense of; for democratic engagement; digital research on; end of; evolution of; global concentration of; ICT and; information sharing on; media fragmentation of; network science and; organizational dynamics and; proliferation in; public opinion and; Twitter and
masspersonal communication
“A Mathematical Theory of Communication” (Shannon)
Mathwick, C.
McCombs, Maxwell
mCent
McMillan, D.
Media Cloud
medial prefrontal cortex (MPFC)
media multiplexity: analysis of; data collection for; defined



  dependent variable for; social roles and; tie strength and
media use ties, in agenda setting
memory, of time-varying networks
mental health. See also well-being
  data collection on; ethics and; privacy and; self-report on; social media for; surveys on
mention influence
mentions: in blogs; on Ebola crisis; on Gezi Park protests; in Twitter
mentorship networks
Menzel
Meraz, S.
message design, satisficing semantic search in
metadata: data collection of; on Middle East
metapopulations
Metzger, M. J.
Microsoft Teams
Middle East. See also Qatar
  blogs in; bridges in; communication in; diffusion in; ICT in; Internet in; metadata on; NLP in; polarization in; social media in
Milgram, Stanley
million follower fallacy
Miyata, K.
MMOG. See massively multiplayer online game
Mobile Content Ltd.
mobile credit transfers





mobile data: on cell phones; challenges of; data security for; in global South; security of
mobile payments
mobile phones. See cell phones
mobility: big data for; cell phones and; of communication; data collection on; epidemiology and; future research on; GIS in; inferred; mapping of; One Boy’s Day: A Specimen Record of Behavior on; social media and; of teens; in virtual environments
modularity maximization
Molloy-Reed model, for random walks
Monge, P.
Moody, James
Moon, Youngme
Moore, Michael
Moreno, Megan
Moreno, Y.
Morris, Martina
Morrison, A.
Morrison, Christopher
M-Pesa
MPFC. See medial prefrontal cortex
MSNBC
Mubarak, Hosni
MUDs. See multi-user dungeons
Mulligan, Deirdre
Multichoice
multidimensional network
Multinet
multiplayer online game (MMO)
multi-user dungeons (MUDs)
Munteanu, C.
mutuality. See reciprocity

MySpace
“The Myth of Massive Media Impact Revived: New Support for a Discredited Idea” (Zaller)
name generator/interpreter
NAS. See network agenda setting
National Institutes of Health (NIH)
National Science Foundation (NSF)
naturalistic data: digital trace data as; on mental health
natural language processing (NLP): in Middle East
Nature Human Behaviour
NDAs. See nondisclosure agreements
Nelimarkka, M.
Netflix
network agenda setting (NAS)
networked communication. See also specific topics
  challenges of; defined; interdisciplinary work on; principles of; theoretical developments in
networks. See also specific topics
  brokers; of conflict; effect; ethnography; information flow and; model of, for agenda setting; nodes in; science, mass media and; society; theory, individuals and
NetworkX
Neuhaus, F.
neural networks
neuroscience: of information sharing; reverse inference with; of viral information
Newman, M.
Newman-Girvan algorithm

news cycle acceleration
news media. See mass media
newspapers. See also New York Times
  digital transformation of
New York City Fire Department (FDNY), on Twitter
New York Times: APIs for; on Gezi Park protests; value-based virality of; web links on
Nicolaides, C.
NIH. See National Institutes of Health
Nissenbaum, Helen
NLP. See natural language processing
nodes: of activity-driven networks; agenda convergence of; in crisis events; for epidemic spreading; with followers; group dynamics of; hyperedges for; Louvain algorithm and; million follower fallacy and; in networks; random walks and; of rumor spreading; SNA and; of static networks; subgraphs and; on Twitter
NodeXL
Noelle-Neumann, E.
noise addition, on social media
nondisclosure agreements (NDAs): information sharing and; reproducibility and
nonplayer characters (NPCs): in virtual worlds; in WoW
nonprofit networks: ecological dynamics of; resiliency of
NPCs. See nonplayer characters



NSF. See National Science Foundation
Nuremberg Code
OAuth
Obama, Barack
obtrusiveness, in agenda setting
Occupy Wall Street
off-site backup
Oh, Hyun Jung
OHCs. See online health communities
OkCupid
Pass
One Boy’s Day: A Specimen Record of Behavior
“One Foot on the Streets, One Foot on the Web: Combining Ethnography and Data Analysis in the Study of Protest Movements” (Gerbaudo)
O’Neill, O.
online communities: APIs for; classification of; community-level variables in; computational social science for; demographics of; diffusion in; ecological dynamics of; gender identity on; generalizability of; multilevel processes of; on social media; of Wikipedia
online health communities (OHCs)
online social networks (OSNs). See social media
Onnela, J.-P.
OpenPDS
Open Science Framework
Opera
OpinionFinder
opinion leaders: on Chinese Internet; in politics; semantic sourcing and
opinion mining. See sentiment analysis
opportunistic data collection





“Opportunities and Challenges for Research on Mobile Phone Data in the Global South” (Avle and Quartey)
organizational dynamics: AI and; broad shape of; caution with; communication and; cycle of; data collection for; decline in; diffusion and; across disciplines; disruptions in; early adopters of; failure in; future research on; with ICT; individuals and; innovators of; legitimacy and; mass media and; methods for; new aspects of; new cookbook for study of; old aspects of; resiliency and; stability and; Twitter and
organizational dynamics (change), emergence of
organizational ecology
O’Riordan, K.
Ortega, Felipe
Ostrom, Elinor
O’Sullivan, P. B.
“Our Stage, Our Streets: Brooklyn Drag and the Queer Imaginary” (Lingel)
outdegree influence: adjacency matrix for; million follower fallacy and
Overbey, L. A.
overloading
Page, S. E.
PageRank

Paluck, E. – Panopticlick project  Papacharissi, Z.  Park, Minsu  Parks, Malcolm  parsing, web scraping and  partition-specific network analysis of digital trace data – subgraphs in – passing ethnography  password management systems  Patient Health Questionnaire (PHQ-)  Paul, Michael J.  Paxson, H.  PCC. See posterior cingulate cortex PDR. See phenomenon-driven research peer production  hypergraphs for  peer-to-peer networks ,  public recognition in  Pennebaker, J. W. ,  Pentland, Sandy  Peritore, N. P.  Persily, N.  persistence, in social investment – persistent clusters , f persistent subgraphs , f personal communication systems  personalization content  from digital trace data – personally identifiable information (PII)  on Chinese Internet – Peterson, J. D. – Pew Internet and American Life Project  Pew Project for Excellence in Journalism – Pew Research Center –, – on mobile data  phantomjs  phenomenon-driven research (PDR)  PHQ-. See Patient Health Questionnaire pick-up groups (PUGS)  in EQII  PII. See personally identifiable information place for assault risk –

 in GIS – mapping of  space with – points of assault risk –, f in GIS  polarization communication and  in deliberation  in Egypt , f in Middle East – in politics – politics. See also protests actors in  agenda setting in ,  clusters in –, t communication in –, – data collection for – deliberation on  democratic engagement in – dictionary-based approach for – of drag culture – experiments on – on Facebook , f, , –,  face-to-face for  information diet for  machine learning and  network-based approaches for  nontextual approaches to – opinion leaders in – polarization in – proximity matrix for –, t public opinion in  in Qatar – satisficing semantic search in ,  selective exposure and  semantic sourcing in  sentiment analysis and – social media and , – surveys for –, – TDR on – on Twitter , , ,  two-step flow of communication for  polygons, in GIS ,  population-level analysis community-level variables in – diffusion in –, t



of ecological dynamics – generalizability from – limitations of – of multilevel processes – of social media – populations. See also vulnerable populations defined – information sharing and – metapopulations and – vulnerable – population-scale measurement for depression f for health  of well-being –, f Portes, A. ,  posterior cingulate cortex (PCC) ,  postpartum depression (PPD) –, f poverty reduction, in global South  power-law distribution for time-varying network burstiness  for word-of-mouth  PPD. See postpartum depression preference utilitarianism  preferential attachment model for activity-driven networks  for scale-free social networks  SNT and  for time-varying networks  prepaid cell phones, in global South –, f, f prestige effects  principle of least effort – privacy agenda setting  on Chinese Internet , –, – in data collection  by design, on social media – digital research and  of digital trace data  in EU  of locative media tools  mental health and , – norms for  policies, on Internet  practice evolution for  self-management  on social media –, –





proliferation of data  of deliberation – of information, attention scarcity with  in mass media  of media multiplexity  Prolific.ac  propagation phenomena celebrities and  in social media – prospective cohort study design – for teen mobility – protests. See also Arab Spring ethnography in – Gezi Park , –, , f–f Occupy Wall Street , ,  public opinion in –, , f–f SNT and  on social media – Twitter and –, t, ,  Proteus effect  proximity matrix , t psychosocial support, for well-being – public data  public figures. See also celebrities influence of ,  on Twitter ,  public opinion. See also opinion leaders agenda setting and  defined  formation of –, , f–f mass media and  in politics  in protests –, , f–f satisficing semantic search on – on social media – terrorist networks and  publishers – Puff Daddy  PUGS. See pick-up groups purposive action  in social capitalization  Putnam, R. D. ,  p-values  Python libraries for  lxml and beautifulsoup of 

NetworkX for  TSM for , –, – QAP. See quadratic assignment procedure Qatar crowdsourcing in  fragmentation in –, f social media in –, f Twitter in –, f QCA. See qualitative comparative analysis QQ  quadratic assignment procedure (QAP)  qualitative comparative analysis (QCA)  qualitative text analysis  Qualtrics  Quan-Haase, Anabel  quantitative text analysis  Quartey, Emmanuel  queer terroir, drag culture and – QuestionPro  rainbow tables  Rak, J.  random walks, in activity-driven networks –, f ranking algorithms, on social media – RAS. See reception-accept-sample model raster layers, for assault risk , f rate limits on APIs – for web scraping  raw data – reading, in drag culture – Reardon, K. K.  re-broadcast data, on social media  receivers, in hierarchical tree model –, f reception-accept-sample model (RAS)  reciprocity (mutuality) in deliberation  in social investment  reciprocity, in social investment  Reddit ,  psychosocial support on – self-disclosure on – red team dynamics  Reeder, H. , 

 reflexive research ethics – re-identify data, on social media  relational event models  hyperedges for  Relph, Edward  Ren, Yuqing – replication/replicability  agenda setting and – in satisficing semantic search  in semantic sourcing  on Twitter  reproducibility, in data collection – reproductive number, epidemic spreading and  research. See also digital research; theory driven research DDR –, – discovery-driven  hypothesis-free  PDR  responsible, on social media – research design for digital trace data – prospective cohort, for teen mobility – retrospective – resiliency agility and  organizational dynamics and  tie strength and  Resnik, D. B.  Resnik, Philip  resource heterogeneity  responsible disclosure, in data collection  responsible research, on social media – Responsible Research and Innovation (RRI)  REST API  Restivo, Michael  retrospective data collection, for assault risk –, f, f retrospective study designs – retweets by aggregation services  in crisis events  on Ebola crisis  in Egypt –, f



on Gezi Park protests f influence , , f signals  of social conventions  of web links  reveal data, on social media  reverse inference, with neuroscience – RFID  Richards, N. M. – Right To Be Forgotten ,  risk. See also assault risk of big data – of computational social science  of digital methods – of digital research –, , – Robots Exclusion Standard (robots.txt), for data cleaning  Rodriguez, Robert R.  Rogers, Everett M. , , – role-playing, in virtual worlds  Romney, Mitt , , –,  RRI. See Responsible Research and Innovation Rubio, Marco – Rudder, Christian  rumor spreading activity-driven networks and –, f in China  in crisis events ,  on “Keep America American”  satisficing semantic search and  on social media –, f on Twitter , –,  RuPaul ,  Ruyter, K. D.  Ryan, Timothy J.  Sadelik, Adam  safe spaces, in drag culture – Salganik, M. J. , ,  sampling in agenda setting  of APIs  key-word based, for emergency responders  sandbox type, of virtual world  Sandvig, C.  Sandvig v. Sessions 





satisficing semantic search aspiration in , – available alternatives in  defined – effort in – for individuals  information subsidies and  in message design – in politics ,  principle of least effort and – on public opinion – replication in  rumor spreading and  search costs in – semantic sourcing in – for text analytics – theory and measures of – Saunders, B.  scale for argument mining  of computational social science , – scale-free social networks, preferential attachment model for  Schelling, T. C.  Schmidt, Eric  Schneider, S. M.  Schraefel, M. C.  Schramm, Wilbur  Schumpeter, J.  Schweik, Charles M. – science, technology, engineering, and math (STEM)  Scott, James C.  scratch cards, for cell phones –, f, f scripted type, of virtual worlds  search costs, in satisficing semantic search – search engines attention concentration from  random walks and  Second Life ,  female participants in  second moments, for epidemic spreading ,  secured consent 

secure storage, for data collection – Segerberg, A. ,  selective exposure  communication and  for politics  politics and  Selenium  self-censorship, on Facebook  self-disclosure  on well-being – self-interest, in social capitalization  self-perception theory  self-related processing, of information sharing – self-report for cell phone use  of digital trace data  of information sharing  on mental health  in politics  for sentiment analysis – of social roles  on surveys  semantic aspiration, in semantic sourcing – semantic ego networks  semantic explanation  semantics, VSMs of – semantic sourcing measures of – in satisficing semantic search – semantic aspiration in – statements in –, ,  semiautonomous consent  Sennett, Richard  sentiment analysis data collection for – defined – machine learning for  politics and – self-report for – on social media – TDR on –, f, f, f SentiStrength  SentiWordNet  server-side logs  digital trace data and 

 Settle, Jaime E. , , – sexually transmitted diseases  Sey, Araba  shade, in drag culture – Shah, D. V. – Shamoo, A. E.  Shannon, Claude  Shapiro, J. M.  sharers –,  sharing contexts – Shaw, Donald  Shin, J.  Short, Adrian  SIM cards for cell phones , ,  in Ghana  Simmel, Georg  Simon, H. A. ,  Sina Weibo ,  Singer, P.  SIS. See susceptible-infected-susceptible Skype  big data from  data security with  Slack  Sleeper, M.  Sluka, J. A. – smartphones. See cell phones SMDI. See social media depression index Smith, Marc A.  Smythe, Dallas  SNA. See social network analysis SNAP  SNT. See social network theory social capital analysis of – collective action and  defined  future research on – outcome-oriented approach to – social capitalization and  social investment patterns for – social media and –,  tie strength and – social capitalization agency and – framework for –



platform affordances for  purposive action in  self-interest in  social capital and  social choice theory – social cognition, of information sharing – social contagion . See also rumor spreading computational social science and  influencers and  social conventions on social media –, f, f on Twitter –, f, f social convergence, in crisis events , –, f,  Social Data Initiative, of Social Science Council  social economic status, structural holes and  social investment cost of uncertainty in – defined  diversity in –, t in Facebook – homophily in – persistence in – reciprocity in  for social capital – taxonomy of – tie strength in  uncertainty in –, t, – social levelers, virtual worlds as ,  social media (networks) . See also specific platforms anonymization on – APIs for ,  Arab Spring on – attention on – as backchannel  best practices for –, t big data from , – bridges on – bubbles on – on cell phones  challenges of – on Chinese Internet ,  connective action of 





social media (networks) (Continued) in crisis events –, f, f, f, f, f, f, f data collection from – data fuzzing on  data mining of  data retention/deletion policy on  de-identification on – deliberation on – digital trace data on ,  in drag culture  for emergency responders –, f, f, f, f, f, f, f enterprise platforms for  face-to-face and – flaming on – graph analysis techniques for – health and  homomorphic encryption for ,  inadvertent exposure on – influencers on n information diet on –, f informed consent for – IRB for  legal issues for – malleable encryption schemes on  for mental health – in Middle East – million follower fallacy on –, f, f mobility and  noise addition on  online communities on – politics and – population-level analysis of – privacy by design on – privacy on –, – propagation applications on –, t, f propagation patterns on –, f, f, f, f propagation phenomena in – protests on – public opinion on – in Qatar –, f ranking algorithms on – re-broadcast data on  re-identify data on 

responsible research on – reveal data on  rumor spreading on –, f satisficing semantic search on – sentiment analysis on – social capital and –,  social conventions on –, f, f social interaction on – as social observatory – social recommendations on – structural analysis of  subgraphs for  topical experts on –, t tracking by ,  trendsetters on –, f, f two-step flow of communication of  user-centric approach for  user influence on –, f, f, f, f user types on –, f vulnerable populations on – well-being and – word of mouth on –, f, f social media depression index (SMDI)  social network analysis (SNA) for cell phones – communication and – for connectivity  on deliberation – homophily and – research questions for – semantic explanation for  SNT and – on virtual worlds –, t, t social network theory (SNT) – preferential attachment model and  protests and  SNA and – strength of weak ties hypothesis and  structural holes theory and  social recommendations on social media – on Twitter – social roles – media multiplexity and –,  self-report of  on surveys  tie strength and , –

 Social Science Council, Social Data Initiative of  Social Science One (SS1)  social stigmas  social ties. See tie strength sociometric badge  SocioPatterns  socio-technical assemblages  Solove, D. J.  SOPA/PIPA campaign  SourceForge ,  South Africa, Multichoice in  space for assault risk  in GIS – mapping of  place with – Space Time Adolescent Risk Study (STARS) , – Spain  Specht v. Netscape  Spiro, E. S.  spreaders in hierarchical tree model –, f in rumor spreading – SS1. See Social Science One stability, organizational dynamics and – Stanford Large Network Dataset Collection  STARS. See Space Time Adolescent Risk Study statements in semantic sourcing –, ,  vagueness of  static networks nodes of  random walks and ,  timescales of  STEM. See science, technology, engineering, and math Stempeck, Matt ,  stereotypes, for gender – Stewart, Brandon  stiflers, in rumor spreading – Stockmann, D.  Stohl, C.  Streaming API, of Twitter 



strength of weak ties hypothesis , ,  attention and – SNT and  Strogatz, S. H. , , ,  structural holes opportunities of  SNT and  social economic status and  structural polarization –, f subgraphs changes to over time , –, t, f of clusters  homophily of  nodes and –, – in partition-specific network analysis – persistent , f relationship to one another –, t for social media  tracking of  on Twitter  subsidized connectivity, for cell phones – Sunstein, C. ,  support vector machine (SVM)  Survey Monkey  surveys on Chinese Internet  on gender in virtual worlds  introduction of  on media multiplexity of tie strength – on mental health  for politics –, – self-report on  social roles on  susceptible-infected-susceptible (SIS) ,  sustained consent – Sutton, J.  SVM. See support vector machine Swartz, Aaron  syntactic explanation  Tamburrini, N.  TDCS. See transcranial direct current stimulation TDR. See theory driven research





team interlocks, hypergraphs for  Tecno  teen mobility, prospective cohort study design for – temporo-parietal junction (TPJ)  Tencent  terms of service (ToS) for APIs , – harm to services and – information sharing and  legal issues with – reproducibility and  for Robots Exclusion Standard  of Twitter  terrorist networks digital research on  public opinion and  resiliency of  text analytics  qualitative and quantitative  satisficing semantic search for – on social media – theory driven research (TDR) – on politics – on sentiment analysis –, f, f, f on virtual worlds – third-level agenda setting  third-set oriented, in drag culture  Thomlinson, Roger  threshold model, for agenda setting  tie strength . See also strength of weak ties hypothesis analysis of –, t cell phones and – communication and – data collection for – defined ,  of face-to-face – homophily of  independent variable for  measures of – media multiplexity and – social capital and – in social investment  social roles and , – Tigo/Millicom 

time-respecting path, of time-varying networks  time-varying networks activity-driven networks and – burstiness of  connectivity of  dynamical processes in – latency of  memory of – preferential attachment model for  properties of – random walks in – representations for –, f, f time-respecting path of  TMS. See transcranial magnetic stimulation Tongson, K.  topical experts on social media –, t on Twitter –, t topic modeling  ToS. See terms of service TPJ. See temporo-parietal junction trace data. See digital trace data tracking by social media ,  of subgraphs  transcranial direct current stimulation (TDCS)  transcranial magnetic stimulation (TMS)  transparency agenda setting and – with digital trace data ,  trendsetters on social media –, f, f on Twitter –, f, f triangulation  in Arab Spring  digital trace data from – Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans  Trump, Donald – TSM. See Twitter Subgraph Manipulator Tuan, Yi-Fu  Tufekci, Z.  Tumblr drag culture on 

 for eating disorders  Martin, T., and  Turkopticon  TwapperKeeper  Twitter . See also hashtags; retweets APIs of –, , , , , ,  Arab Spring and –, ,  for communication  connective action of  in crisis events , , –, –, f deliberation on , ,  digital trace data from  drag culture on  Ebola crisis on –, f, f, f, t in Egypt  emergency responders and –, –, f evangelists on – FDNY on –, f filtering techniques for  followers on –, –, ,  functions of  Gezi Park protests on –, f grassroots on  information diet on –, f,  interorganizational communication on –, f Lists feature of –, t Martin, T., and  mass media and , – mentions in ,  million follower fallacy on –, f, f NAS on  newsfeed of  nodes on , f organizational dynamics and  politics on , , ,  propagation applications on –, t, f protests and –, t, ,  public figures on ,  in Qatar –, f rate limit for  on Romney , –,  rumor spreading on , –, 



social conventions on –, f, f social recommendations on – subgraphs on  TDR and  text analytics for  time-varying network burstiness for  topical experts on –, t ToS of  trendsetters on –, f, f Trump on – user IDs for – user influence on –, f, f, f, f user types on –, f as vital media  vocabulary on  web links on –, f, f,  for well-being early detection  word of mouth on –, f, f Twitter Subgraph Manipulator (TSM) , –, – two-step flow of communication ,  with agenda setting  million follower fallacy and  for politics  of social media  TXTGhana  UCINET  uncertainty, in social investment –, t, – unweighted networks, adjacency matrix for –, f URLs. See web links USA v. Nosal  USA v. Swartz  Usenet  flaming on  user-centric approach for emergency responders –, f for social media  user IDs, for Twitter – user influence on social media –, f, f, f, f on Twitter –, f, f, f, f user types on social media –, f on Twitter –, f





Valente, Thomas W.  valuation, of information sharing – value-added services (VAS)  value-based virality for information sharing –, f, – of New York Times –, f, f van de Rijt, Arnout  variance inflation factor (VIF)  VAS. See value-added services vector layers for assault risk , f in GIS , f vector space models (VSMs) – ventral striatum (VS) , , ,  ventro-medial prefrontal cortex (VMPFC) , ,  victim requests, in crisis events  video analysis  VIF. See variance inflation factor viral information neuroscience of – from rumor spreading  viral marketing, million follower fallacy and  viral media broadcast media and  for rumor spreading  Twitter as  virtual environments, mobility in  virtual ethnography – virtual private network (VPN) – Virtual Social Media Working Group  virtual worlds. See also massively multiplayer online game avatars in ,  content analysis for – DDR on – depersonalization in – digital trace data on , – female participants in , –,  future research on – gender in – gender role theory for – gender swapping in , –, –, t generalizability of 

mapping of – NPCs in  play styles, motivations, and performance in – research on – role-playing in  sandbox type of  scripted type of  SNA on –, t, t as social levelers ,  TDR on – Virtual Worlds Observatory (VWO)  Vivienne, S.  VMPFC. See ventro-medial prefrontal cortex vocabulary in crisis events  in semantic sourcing  Volunteer Science , – VPN. See virtual private network VS. See ventral striatum VSMs. See vector space models vulnerable populations on Chinese Internet  digital research on ,  ethics of  on social media – VWO. See Virtual Worlds Observatory Wakita-Tsurumi algorithm  Waldo Canyon fire , f Walker, Scott  Walton, D.  Wang, Jing  Wang, L.  Wang, Xiaoquing – The War of the Worlds (Wells)  Washington Post  Wassmerman, Stanley  Watts, D. J. , , ,  weak ties. See strength of weak ties hypothesis Web .  Weber, Max  Weber, M. S. – web links (hyperlinks, URLs) deliberation and  on New York Times 

 on Twitter –, –, f, f,  by word-of-mouth –, f, f Webmoor, T.  web scraping accounts for  for data collection – parsing and  rate limits for  Web Social Science (Ackland)  WeChat  Weeks, B. E.  weighted networks , f WEIRD. See Western, educated, industrialized, rich, and demographic populations the WELL  well-being cell phones for  early detection for – in global South  mapping for  population-scale measurement of –, f psychosocial support for – self-disclosure on – social media and – Wells, H. G.  Western, educated, industrialized, rich, and demographic populations (WEIRD) ,  wget  WhatsApp  Whiteman, N.  WHO. See World Health Organization Wiebe, Douglas J.  Wiertz, C.  Wiese, Jason – Wikipedia , 

barnstars in  “Collaboration of the Week” on  in English ,  online communities of – WikiProjects  wikis  ecological dynamics of  Williams, Dmitri , ,  Williams, T.  wisdom of the crowd ,  women (females) with cell phones – as participants, in virtual worlds , –,  WordNet-Affect  word of mouth on social media –, f, f on Twitter –, f, f World Health Organization (WHO)  World of Warcraft (WoW) ,  female participants in  NPC in  Xi Jinping  X-Tigi  Yelp  Young, Alyson L.  Yuan, Y. C.  Zaller, John –,  Zeng, J.  Zhu, Haiyi ,  Zimmer, M.  Zipf, G. K. ,  Zuckerman, Ethan ,  Zukin, S. , 

