English Pages 303 [295] Year 2021
Human Dynamics in Smart Cities Series Editors: Shih‐Lung Shaw · Daniel Sui
Atsushi Nara Ming-Hsiang Tsou Editors
Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics
Human Dynamics in Smart Cities Series Editors Shih-Lung Shaw, Department of Geography, University of Tennessee, Knoxville, TN, USA Daniel Sui, Research & Innovation, 340 Burruss Hall, Virginia Polytechnic Institute & State University, Blacksburg, VA, USA
This series covers advances in information and communication technology (ICT), mobile technology, and location-aware technology and ways in which they have fundamentally changed how social, political, economic and transportation systems work in today’s globally connected world. These changes have raised many exciting research questions related to human dynamics at both disaggregate and aggregate levels that have attracted the attention of researchers from a wide range of disciplines. This book series captures this emerging dynamic interdisciplinary field of research as a one-stop depository of our cumulative knowledge on this topic that has profound implications for future human life in general and urban life in particular. Covering topics from theoretical perspectives, space-time analytics, modeling human dynamics, urban analytics, social media and big data, travel dynamics, to privacy issues, development of smart cities, and problems and prospects of human dynamics research, the series includes contributions from various disciplines with research interests related to human dynamics. The series invites contributions of theoretical, technical, or application aspects of human dynamics research for a global and interdisciplinary audience.
More information about this series at http://www.springer.com/series/15897
Editors Atsushi Nara Department of Geography San Diego State University San Diego, CA, USA
Ming-Hsiang Tsou Department of Geography San Diego State University San Diego, CA, USA
ISSN 2523-7780 ISSN 2523-7799 (electronic) Human Dynamics in Smart Cities ISBN 978-3-030-83009-0 ISBN 978-3-030-83010-6 (eBook) https://doi.org/10.1007/978-3-030-83010-6 © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgements
This book project is based upon work supported by the National Science Foundation under Grant No. 1634641, the IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation”, Grant No. 1837577, the CS4All project titled “Encoding Geography: Building Capacity for Inclusive Geo-Computational Thinking with Geospatial Technologies”, and Grant No. 2031407, the CS4All project titled “Collaborative Research: Encoding Geography-Scaling up an RPP to achieve inclusive geocomputational education”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
Contents

1 Introduction: Human Dynamics Research with Social Media and Geospatial Data Analytics
Atsushi Nara

2 Theorizing Social Media: A Formalization of the Multilevel Model of Meme Diffusion 2.0 (M3D2.0)
Brian H. Spitzberg

3 Research on Misinformation and Social Networking Sites
Lourdes S. Martinez

4 Research Trends in Social Media/Big Data with the Emphasis on Data Collection and Data Management: A Bibliometric Analysis
Qiong Peng and Xinyue Ye

5 Similarity Measurement on Human Mobility Data with Spatially Weighted Structural Similarity Index (SpSSIM)
Chanwoo Jin, Atsushi Nara, Jiue-An Yang, and Ming-Hsiang Tsou

6 An Integrated Evacuation Decision Support System Framework with Social Perception Analysis and Dynamic Population Estimation
Atsushi Nara, Xianfeng Yang, Sahar Ghanipoor Machiani, and Ming-Hsiang Tsou

7 Learning Dependence Relationships of Evacuation Decision Making Factors from Tweets
Atsushi Nara, Sahar Ghanipoor Machiani, Nana Luo, Alidad Ahmadi, Karen Robinett, Ken Tominaga, Jaehee Park, Chanwoo Jin, Xianfeng Yang, and Ming-Hsiang Tsou

8 Examining Spatiotemporal and Sentiment Patterns of Evacuation Behavior During 2017 Hurricane Harvey
Chenxiao (Atlas) Guo and Qunying Huang

9 Sentiment Analysis of Social Media Response and Spatial Distribution Patterns on the COVID-19 Outbreak: The Case Study of Italy
Gabriela Fernandez, Carol Maione, Karenina Zaballa, Norbert Bonnici, Brian H. Spitzberg, Jarai Carter, Harrison Yang, Jack McKew, Filippo Bonora, Shraddha S. Ghodke, Chanwoo Jin, Rachelle De Ocampo, Wayne Kepner, and Ming-Hsiang Tsou

10 Conceptualizing an Ecological Model of Google Search and Twitter Data in Public Health
Bo Liang and Ye Wang

11 A Case Study in Belief Surveillance, Sentiment Analysis, and Identification of Informational Targets for E-Cigarettes Interventions
Lourdes S. Martinez, Ming-Hsiang Tsou, and Brian H. Spitzberg

12 Placing Community: Exploring Racial/Ethnic Community Connection Within and Between Racial/Ethnic Neighborhoods
Joseph Gibbons

13 Exploring Gentrification Through Social Media Data and Text Clustering Techniques
Cheng-Chia Huang, Atsushi Nara, Joseph Gibbons, and Ming-Hsiang Tsou

14 Spatial Distribution Patterns of Geo-tagged Twitter Data Created by Social Media Bots and Recommended Data Wrangling Procedures
Ming-Hsiang Tsou, Hao Zhang, Jaehee Park, Atsushi Nara, and Chin-Te Jung

15 The Future of Human Dynamics Study: Research Challenges and Opportunities During and After the COVID-19 Pandemic
Ming-Hsiang Tsou
Editors and Contributors
About the Editors

Atsushi Nara is Associate Professor of Geography at San Diego State University. He also serves as Associate Director for the Center for Human Dynamics in the Mobile Age (HDMA) (http://humandynamics.sdsu.edu). He received his B.S. degree in environmental engineering from Shimane University, Japan; M.S. degree in geography from University of Utah; and Ph.D. degree in geography from Arizona State University. His research interests are in spatiotemporal data analytics, modeling behavioral geography and complex urban and social systems, and geospatial computation. His recent research projects study human mobility, disaster responses, evacuation decision support systems, air pollution, food environment, and geocomputational education. He served on the editorial boards of Computational Urban Science.

Ming-Hsiang Tsou is Professor of Geography and Founding Director of the Center for Human Dynamics in the Mobile Age at San Diego State University (SDSU). He also serves as Program Director for the Big Data Analytics Program at SDSU. He received his B.S. degree in geography from National Taiwan University, Taiwan; M.A. degree in geography from State University of New York at Buffalo; and Ph.D. in geography from University of Colorado at Boulder. His research interests are in big data, human dynamics, social media, visualization, and Internet Geographic Information Systems (GIS). He is co-author of Internet GIS, a scholarly book published in 2003 by Wiley and served on the editorial boards of the Annals of GIS (2008–), Cartography and GIScience (2013–), the Professional Geographers (2011–2019), and International Journal of Geographic Information Science (2019–). His recent NSF research projects focus on health disparities, disaster responses, and spatial social networks.
Contributors

Alidad Ahmadi, Department of Civil, Construction, and Environmental Engineering, San Diego State University, San Diego, CA, USA
Norbert Bonnici, Malta Council for Science and Technology, Kalkara, Malta
Filippo Bonora, University of Bologna, Bologna, Italy
Jarai Carter, Columbia University, New York, NY, USA; John Deere, New York, NY, USA
Rachelle De Ocampo, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Gabriela Fernandez, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Sahar Ghanipoor Machiani, Department of Civil, Construction, and Environmental Engineering, San Diego State University, San Diego, CA, USA
Shraddha S. Ghodke, University College London, London, UK
Joseph Gibbons, Department of Sociology, Center for Human Dynamics in the Mobile Age, San Diego State University, San Diego, CA, USA
Chenxiao (Atlas) Guo, Department of Geography, University of Wisconsin-Madison, Madison, WI, USA
Cheng-Chia Huang, Esri, Redlands, CA, USA
Qunying Huang, Department of Geography, University of Wisconsin-Madison, Madison, WI, USA
Chanwoo Jin, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Chin-Te Jung, Esri, Redlands, CA, USA
Wayne Kepner, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Bo Liang, Department of Business, Nevada State College, Henderson, NV, USA
Nana Luo, Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
Carol Maione, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA; Department of Management, Economics, and Industrial Engineering, Politecnico di Milano, Milan, Italy
Lourdes S. Martinez, School of Communication, San Diego State University, San Diego, CA, USA
Jack McKew, AECOM, Newcastle, Australia
Atsushi Nara, Department of Geography, Center for Human Dynamics in the Mobile Age, San Diego State University, San Diego, CA, USA
Jaehee Park, Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
Qiong Peng, School of Architecture, Planning and Preservation, University of Maryland, College Park, MD, USA
Karen Robinett, Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
Brian H. Spitzberg, School of Communication, San Diego State University, San Diego, CA, USA; Department of Communication, San Diego State University, San Diego, CA, USA
Ken Tominaga, Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
Ming-Hsiang Tsou, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Ye Wang, College of Arts and Sciences, University of Missouri – Kansas City, Kansas City, MO, USA
Harrison Yang, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Jiue-An Yang, Calit2/Qualcomm Institute, University of California San Diego, La Jolla, CA, USA
Xianfeng Yang, Department of Civil & Environmental Engineering, University of Utah, Salt Lake City, UT, USA
Xinyue Ye, Department of Landscape Architecture and Urban Planning, Texas A&M University, College Station, TX, USA
Karenina Zaballa, Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
Hao Zhang, Department of Geography, San Diego State University, San Diego, CA, USA
List of Figures

Fig. 2.1 The multidimensional model of meme diffusion 1.0 (M3D; see Spitzberg 2014)
Fig. 2.2 The multidimensional model of meme diffusion 2.0 (M3D2.0)
Fig. 3.1 Proportion of articles published by year (N = 44)
Fig. 3.2 Proportion of articles by field (N = 44)
Fig. 4.1 Growth of publication outputs (Horizontal axis: year; Vertical axis: number of publications)
Fig. 4.2 Most productive countries during 2010–2019 (TP, total publications; IP, the number of independent publications by country; CP, the number of internationally collaborative publications)
Fig. 4.3 Co-authorship cooperation between productive countries
Fig. 4.4 Institution collaboration network of most 66 central institutions
Fig. 4.5 Co-occurrence network of top 86 high-frequency keywords
Fig. 5.1 Comparison of two pictures using SSIM (Adapted from Pollard et al. (2013))
Fig. 5.2 Comparison of two OD matrices using SpSSIM
Fig. 5.3 Spatial distribution of flows: a Total, b LODES, c Twitter, d Instagram
Fig. 5.4 Heat maps of OD pairs
Fig. 5.5 Localized SpSSIM (Inflows of L-T): a 10 km, b 20 km, c 30 km, d 40 km
Fig. 5.6 Localized SpSSIM (in-flows of T-I): a 10 km, b 20 km, c 30 km, d 40 km
Fig. 5.7 Standardized difference of in-flows: a Southeastern San Diego (LODES-Twitter). b Coastal (Twitter-Instagram)
Fig. 6.1 The design framework of IWEDSS
Fig. 6.2 Average count of unique Twitter user in San Diego, 2015 (n = 1,571 TAZs)
Fig. 6.3 Weekday hourly average Twitter user density distribution in San Diego in four time periods; a 0:00–1:00, b 6:00–7:00, c 12:00–13:00, and d 18:00–19:00
Fig. 6.4 Census based night time (a) and daytime (b) population density distributions
Fig. 6.5 A map of San Diego County wildfire evacuation plan at 3:30 a.m., October 25, 2007 (SanGIS & Office of Emergency Services, 2007)
Fig. 6.6 The SMART dashboard for “Wildfires in California” topic
Fig. 6.7 ReadySD-Social, a mobile app for broadcasting emergency information in San Diego
Fig. 7.1 Evacuation behavior model (modified from Sorensen et al., 1987)
Fig. 7.2 Evacuation decision-making conceptual model
Fig. 7.3 Hourly frequency distribution of tweets coded by user’s attitude toward evacuation
Fig. 7.4 Hourly frequency distribution of tweets coded by user’s evacuation status
Fig. 7.5 Evacuation decision making structure
Fig. 8.1 Number of verified tweets by day in 2017
Fig. 8.2 Regions of interest used in this research
Fig. 8.3 The workflow of generating trajectories and examine spatiotemporal-sentimental patterns
Fig. 8.4 Meteorological characteristics of Hurricane Harvey at different stages
Fig. 8.5 Aggregated trajectories among contiguous U.S. cities in 2017
Fig. 8.6 Aggregated trajectories with Houston during Hurricane Harvey
Fig. 8.7 Aggregated trajectories within the affected area during Hurricane Harvey
Fig. 8.8 Comparison of aggregated trajectories with Houston at different phases of disasters, showing the response phase with the highest connection in general
Fig. 8.9 Destinations from Houston at the county level during Hurricane Harvey
Fig. 8.10 Number of trajectories inbound, outbound, and inside Houston in 2017 (7-day smooth)
Fig. 8.11 Number of trajectories inbound, outbound and inside Houston at different phases of disasters (preparedness, response, recovery)
Fig. 8.12 Percentage of trajectories outbound mandatory and voluntary evacuation areas at different stages
Fig. 8.13 Numbers of “evacuation” trajectories leaving the affected areas
Fig. 8.14 Averaged sentiment score by day in 2017, with Y-axis showing the sentiment score
Fig. 8.15 Averaged scores of negative emotions by day during Hurricane Harvey
Fig. 8.16 Major destinations from Houston with averaged sentiment level
Fig. 8.17 Average sentiment scores inside and outside Houston at different stages of Hurricane Harvey
Fig. 8.18 Change of sentiment scores among trajectories with different spatial relationships to Houston at different stages of Hurricane Harvey
Fig. 8.19 Average sentiment scores among outside, voluntary evacuation area, and mandatory evacuation area at different stages of Hurricane Harvey
Fig. 8.20 Change of sentiment scores among trajectories with different spatial relationships to different evacuation areas at different stages of Hurricane Harvey
Fig. 8.21 Average sentiment scores among outside, moderately affected area, and severely affected area at different stages of Hurricane Harvey
Fig. 8.22 Change of sentiment scores among trajectories with selected spatial relationships to different affected areas (“evacuation”) at different stages of Hurricane Harvey
Fig. 9.1 Geography of COVID-19 spread effects in Italy: social (left), economic (middle), and environmental (right) pandemic impacts
Fig. 9.2 Social, economy, and environment COVID-19 related tweets by geography, and top 5 keyword retweets (yellow = social retweets; orange = economic re-tweets; red = environmental retweets; grey = total tweets)
Fig. 9.3 Sentiments in northern Italy during the time of COVID-19
Fig. 9.4 Sentiments in central Italy during the time of COVID-19
Fig. 9.5 Sentiments in southern Italy during the time of COVID-19
Fig. 9.6 Sentiments in Italy’s islands during the time of COVID-19
Fig. 10.1 An ecological model of Big Data from Google search and Twitter: a case of flu research
Fig. 11.1 Data filter and cleaning procedures in SMART Dashboard (adapted Tsou et al., 2015)
Fig. 11.2 The screen shot of SMART dashboard for E-cigarettes
Fig. 11.3 Key features for interactive query and visualizations in SMART dashboard (Tsou et al., 2015)
Fig. 11.4 Sentiment analysis (N = 973) (Martinez et al., 2018)
Fig. 11.5 Sentiment analysis: confirmation and rejection of common beliefs about e-cigarettes (N = 973) (Martinez et al., 2018)
Fig. 11.6 Activity rate for advocates Twitter accounts (ranked by the numbers of followers)
Fig. 11.7 Activity rate for a normal advocates Twitter account (#1)
Fig. 11.8 Activity rate for a cyborg advocates Twitter account (#4)
Fig. 12.1 Racial/ethnic breakdown of the Philadelphia metropolitan area
Fig. 12.2 Select GWR coefficients in the Philadelphia metropolitan area
Fig. 13.1 Study area
Fig. 13.2 The distribution of Instagram posts between 2013 and 2015
Fig. 13.3 The distribution of the number of users by the number of Instagram posts (Top: Before filtering; Bottom: After filtering)
Fig. 13.4 A map of the text clustering result in 2013
Fig. 13.5 A map of the text clustering result in 2014
Fig. 13.6 A map of the text clustering result in 2015
Fig. 13.7 Land use profile of each group
Fig. 13.8 The average normalized TF-IDF of each word in clustering groups
Fig. 13.9 The word clouds based on the average normalized TF-IDF in each group
Fig. 13.10 Ranks of gentrifying keywords in each cluster group
Fig. 14.1 The spatial distribution of geo-tagged tweets using the Streaming API with the bounding box of San Diego County (October and November 2015) at county, state, and world scales
Fig. 14.2 The spatial distribution of geo-tagged tweets using the streaming API with the bounding box of Columbus City (October and November 2015) at county, state, and world scales
Fig. 14.3 The numbers of Tweets produced by different platforms inside the San Diego bounding box during the month of November, 2015
Fig. 14.4 The numbers of Tweets produced by different platforms inside the City of Columbus bounding box during the month of November, 2015
Fig. 14.5 The spatial distribution of the TweetMyJOBS tweets (blue color dots) in San Diego (left) and Columbus (right) during the month of November, 2015. Red color areas are the hotspots of clustered tweets in these locations
Fig. 14.6 The spatial distribution of the TNN CMH traffic tweets in Columbus during the month of November, 2015. Red dots are the clustered tweets created by the traffic bots
Fig. 14.7 The number of users along with their geo-tagged tweeting rates within November 2015 in San Diego County (left) and in Columbus City (right)
List of Tables

Table 4.1 Scientific outputs descriptors during 2010–2019
Table 4.2 Distribution of the top 20 subject categories
Table 4.3 The 20 most active journals
Table 4.4 The 10 most cited documents
Table 4.5 Top 17 institutions based on the total publications
Table 4.6 Temporal evolution of the 30 most frequently used keywords
Table 5.1 Demographic characteristics of Instagram and Twitter in the U.S. (Greenwood et al., 2016)
Table 5.2 Data summary
Table 5.3 Descriptive statistics of flows
Table 5.4 Top 5 flows between SRAs by data sources
Table 5.5 The values of SSIM by window sizes
Table 5.6 The values of SpSSIM by spatial weight distances
Table 6.1 Spearman’s rank correlation coefficients between the hourly unique Twitter user densities and the nighttime and daytime population densities estimated based on census surveys
Table 7.1 Key factors affecting individuals’ evacuation decisions
Table 7.2 Coding examples: Attitude Toward Evacuation and Evacuation Status with its sample tweets
Table 7.3 Summary of coding results (N = 512)
Table 7.4 Conditional probability table
Table 7.5 Example tweets mentioning a high-risk perception and information received from three sources
Table 8.1 The disaster phases and stages of 2017 Hurricane Harvey
Table 8.2 Number of trajectories among contiguous U.S. cities in 2017 and during Hurricane Harvey
Table 8.3 Statistics of trajectories with Houston at different phases of disasters with aggregated cities (CBSA)
Table 9.1 Socio-economic characteristics for selected Italian case studies based on geography (ISTAT, 2019)
Table 9.2 ANOVA one-way: exploring Italy’s social, economy, and environment factors during the COVID-19 outbreak
Table 9.3 Italian policy timeline by phase based on social, economy, and environment factors
Table 10.1 Application of our proposed indexes of Google and Twitter data
Table 12.1 Descriptives of Independent Variables (N = 8,305)
Table 12.2 Community connection score, by race (N = 8,305)
Table 12.3 Global regression
Table 12.4 Five-number summary of full geographically weighted regression model (Bandwidth = 3,000)
Table 12.5 Select significant coefficients from geographically weighted regression model by neighborhood racial/ethnic composition (Bandwidth = 3,000)
Table 14.1 The percentage of geo-tagged tweets within the original bounding box or within the State boundary in the County of San Diego and the City of Columbus collected in one month (November, 2015)
Table 14.2 The Proportion of social media bots with different “source” names in one month of geo-tagged tweets (November, 2015) in San Diego County
Table 14.3 The Proportion of data noises with different “source” names in one month of geo-tagged tweets (November, 2015) in the City of Columbus
Table 14.4 An example of white lists (partial) and black lists (partial) in 2017 case studies in San Diego using Twitter Search APIs
Table 14.5 User biases comparison between San Diego and Columbus cases
Chapter 1
Introduction: Human Dynamics Research with Social Media and Geospatial Data Analytics

Atsushi Nara
1.1 Introduction

Social media and geospatial big data have given researchers new ways to study human dynamics. The increasing adoption and use of location-aware mobile technologies, sensor devices, and ubiquitous cyberinfrastructure in everyday life generate a massive amount of data about space, place, and human activities dynamically over time (Nara et al., 2018). Social media platforms and open data initiatives further enable access to large amounts of fine-granular, individual-scale data that were not available in the past. Researchers can now explore and discover new knowledge about human dynamics that is buried in very large, high-dimensional, and complex datasets.

As human dynamics have been studied in diverse disciplines, various definitions have been developed (Shaw et al., 2018; Yuan, 2018). According to the Center for Human Dynamics in the Mobile Age (HDMA Center), human dynamics can be defined as a transdisciplinary research field that focuses on the understanding of dynamic patterns, relationships, narratives, changes, and transitions of human activities, behaviors, communications, and interactions (HDMA, 2014; Tsou, 2018). Example research topics include population change, migration, human mobility, human relationships, social and socioeconomic activities and interactions in physical and virtual space, human–computer interactions, and many other variants related to human actions. These topics have been studied in many disciplines, such as business, communication, computer science, geography, physics, planning, psychology, public health, sociology, and transportation (Shaw et al., 2018; Tsou, 2018; Yuan, 2018).
A. Nara (B) Department of Geography and Center for Human Dynamics in the Mobile Age, San Diego State University, San Diego, CA, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_1
Human dynamics research benefits from the availability of social media and big geospatial data, which has led to data-driven, purely inductive, and emergent forms of geospatial analysis that let the data speak for themselves (Kitchin, 2014; Kwan, 2016). Spatially and temporally fine-grained, timestamped location data allow researchers to reconstruct individual trajectories, to describe dynamic human actions and their contexts in detail, and to examine both individual and collective behavior (Nara et al., 2018). Individual-scale data can also avoid the scaling issues of conventional aggregated data, such as the ecological fallacy and the modifiable areal unit problem (MAUP) (Nara et al., 2018). Furthermore, the pervasiveness of smartphone and internet usage, along with technical advancements in cyberinfrastructure, sensor devices, and cloud computing, accelerates the generation of large social media and geospatial datasets in real or near-real time. With these advancements and data availability, researchers can now trace, monitor, map, analyze, and model human and social movements, disease outbreaks, natural hazards, crime incidents, and popular events at relatively low cost.
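To make the idea of reconstructing individual trajectories from timestamped location data concrete, the following is a minimal sketch in Python. The record schema (`Fix` with `user_id`, `timestamp`, `lon`, `lat`), the function name, and the sample coordinates are assumptions made for illustration; they are not taken from any dataset or tool described in this book.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Fix(NamedTuple):
    """One timestamped location record (hypothetical schema)."""
    user_id: str
    timestamp: datetime
    lon: float
    lat: float


def build_trajectories(fixes):
    """Group records by user and order each user's records by time."""
    by_user = defaultdict(list)
    for fix in fixes:
        by_user[fix.user_id].append(fix)
    return {
        user: sorted(points, key=lambda f: f.timestamp)
        for user, points in by_user.items()
    }


# Illustrative records only; coordinates and times are made up.
records = [
    Fix("u1", datetime(2015, 11, 1, 9, 30), -117.07, 32.77),
    Fix("u2", datetime(2015, 11, 1, 8, 0), -117.16, 32.72),
    Fix("u1", datetime(2015, 11, 1, 8, 15), -117.16, 32.71),
]
trajectories = build_trajectories(records)
```

Each value in `trajectories` is a time-ordered list of points for one user, which is the usual starting structure for computing displacements, stay points, or origin–destination flows.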
1.2 Research Challenges
Despite the aforementioned advantages, notable challenges remain, including but not limited to data biases, data and algorithm uncertainties, heterogeneous data integration, and education (Dony et al., 2019a, 2019b; Evans et al., 2019; Kwan, 2016; Nara et al., 2018; Tsou, 2015). Social media data are known to be biased by uneven demographic, socioeconomic, geographic, and temporal distributions. For example, a study by Jiang et al. (2019) indicates that the number of geotagged tweets can be explained by various socioeconomic factors such as age, gender, ethnicity, education, and income. In general, among these factors, the population under 18 years old is negatively associated with the number of geotagged tweets while the population in the 18–29 age group is positively associated; however, their influences vary geographically. In addition, the frequency of social media posts per user often follows a power-law distribution, so hyperactive users contribute disproportionately to the content of the data. Furthermore, the number of posts is affected by date and time (e.g., before/after work versus during work hours, weekday versus weekend, holidays) as well as by the number of active/registered users, which changes over time along with the evolution of social media platforms. Recognizing these data biases is essential for conducting human dynamics research; however, it is challenging to fully understand them because the baseline information in social media data (users’ demographic profiles, the number of active/registered users in a study region, etc.) is typically not available to researchers.
Another significant challenge is related to data and algorithm uncertainty. Most social media and big geospatial data are not the output of instruments designed to produce valid and reliable data amenable to scientific analysis (Lazer et al., 2014).
For instance, the quality of location information in social media data can be controlled by end users, which makes it challenging for researchers to know the
level of uncertainty. In particular, the accuracy of geographic coordinates depends on hardware (e.g., GPS device), software (e.g., mobile apps), and user settings (Craglia et al., 2012). To give another example, the location of an Instagram post is selected by the user from a list of nearby points of interest (POIs) provided by Instagram; therefore, users can easily manipulate their location. While users may decide to fake (or spoof) their location to protect individual geo-privacy, few studies in the existing literature have discussed or accounted for location spoofing (Zhao & Sui, 2017). In regard to algorithm uncertainty, Kwan (2016) argued that big data-driven research ignores the potentially significant influence of algorithms on research results, and thus geographic knowledge generated with big data might be more of an artifact of the algorithms used than of the data itself. For instance, Fischer (2014) suggests that tweet locations might have been fuzzed by Twitter through snapping them to the closest latitude or longitude to prevent the disclosure of people’s exact locations. Algorithms are also increasingly implemented as computerized procedures to deal with big data (Kwan, 2016) and are frequently updated under agile software development processes. Researchers often do not have access to, or detailed knowledge of, such algorithms and their development processes, which are fully controlled by the data providers who generate, process, update, filter, and provide their data through Application Programming Interfaces (APIs) (Nara et al., 2018). Algorithms therefore become increasingly detached from, and less visible to, the researchers who use them (Kwan, 2016), which ultimately introduces greater uncertainty and potentially results in significant differences in research findings.
The integration of heterogeneous data from multiple sources is one way of producing new insights and possibly remedying data biases.
In particular, quantitative and qualitative data integration enriches data analytics on social media and big geospatial data. In the era of big data, a wide variety of data is available from public and private sector organizations as well as from citizens via the internet and social media. However, data analytics that focuses only on numbers and algorithms applied to a single data source is not enough to capture the full range of human activities and relationships, which include unforeseeable factors (Shacklett, 2015). New data analytics frameworks and methodologies are needed to solve data integration issues, which include data scales (e.g., big versus small data size), spatial and temporal resolutions (e.g., mixed use of fine and coarse geographic/temporal scale data), formats (e.g., relational versus object-based), types (e.g., quantitative versus qualitative), and bias handling.
Last but not least, to fully utilize social media and big geospatial data for human dynamics research, geospatial data science (or geocomputation) skills are essential. With the increasing demand for geospatial computational skills in geospatial data science research and industry, developing effective learning pathways toward geocomputationally intensive jobs has become a pressing need (Dony et al., 2019c). This new demand requires interdisciplinary domain knowledge in spatial statistics and analysis, geocomputational skills, and substantive expertise. Two key challenges in geocomputation education are (1) the need for cross-disciplinary educational collaborations between geography and computer science to cultivate spatial and computational
thinking; and (2) broadening participation in the context of geocomputation through improved inclusion and diversity. These challenges are by no means comprehensive. Chapters in this book cover some emerging challenges in addition to some of the aforementioned challenges. Discussed topics include theory development (Chap. 2), the spread of misinformation/disinformation (Chap. 3), biases and data integration (Chaps. 5 and 6), bots (Chap. 14), and new research agenda after the COVID-19 pandemic (Chap. 15).
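One of the biases noted in this section, the disproportionate contribution of hyperactive users when posts per user are heavily skewed, can be checked with a simple concentration measure. The sketch below is an illustration only: the function name, the 10% cut-off, and the toy post list are assumptions, and the measure is a descriptive share rather than a formal power-law fit.

```python
from collections import Counter


def top_user_share(post_user_ids, top_fraction=0.1):
    """Return the share of all posts written by the most active users.

    post_user_ids: one user identifier per post.
    top_fraction: fraction of users treated as "hyperactive" (assumed cut-off).
    """
    counts = Counter(post_user_ids)
    if not counts:
        raise ValueError("no posts supplied")
    ranked = sorted(counts.values(), reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / sum(ranked)


# Toy data: 10 users, 100 posts, one user responsible for 82 of them.
posts = ["a"] * 82 + ["b"] * 10 + list("cdefghij")
share = top_user_share(posts, top_fraction=0.1)
```

A share close to 1 for a small `top_fraction` signals that aggregate counts mostly reflect a handful of accounts, in which case per-user normalization or capping may be worth considering before spatial analysis.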
1.3 Overview of the Chapters
This book is composed of 15 chapters that cover current state-of-the-art human dynamics research utilizing social media and geospatial data collected via information and communication technology (ICT), mobile technology, and location-aware technology. This chapter provides an introductory context and an overview of all chapters in this book. The subsequent chapters are generally organized into three main themes: (1) theoretical backgrounds, (2) techniques and methodologies, and (3) applications. It is worth noting that some chapters discuss more than one theme. The final chapter presents future research agendas on the study of human dynamics with social media and geospatial data.
The first theme is theoretical perspectives, specifically focused on information diffusion and misinformation. Chapter 2 (Spitzberg, 2021) extends a formative conceptualization of social media communication as meme diffusion into a propositional model, animated largely by evolutionary and attention economy explanatory metaphors. The result is an integrative model formalized in 18 propositions, indicating that multiple system factors influence the generation and attrition of social media messages. The system levels include features of the meme itself, its medium, its source, its social network and societal context, and the interference or facilitation of geospatial, technical, and significant societal events. As such, memes sometimes diffuse because of the information value of events (evememic), viral meme cycles (entymemic), or some combination of these processes (polymemic). The model integrates extensive cross-disciplinary research and manifold theoretical influences in the interest of demonstrating a process of theory construction in the context of social media and new media. Chapter 3 (Martinez, 2021) offers a critical review of research on misinformation and social networking sites (SNSs).
Using keywords related to misinformation and SNSs, the review examines relevant scholarship published since 2004. Content of relevant articles is summarized in terms of examined contexts, involved disciplines (e.g., public health, communication), methodological approaches, use of theory, and solutions presently offered for addressing this important problem. Disinformation and fake news are also included in the scope of this review. Current trends in research on misinformation and SNSs are discussed. Results of the review suggest that misinformation on SNSs represents an issue facing many fields without a clear or easy solution. Four recommendations are derived from the present review: (1) performance
of additional research on platforms other than Facebook or Twitter; (2) clarification of conceptualizations of misinformation and increased consistency in usage of terms; (3) greater integration of theory for enhancing understanding of how misinformation spreads and how to best correct misinformation once it proliferates on SNSs; and (4) promotion of interdisciplinary collaborations among researchers investigating misinformation and SNSs. Directions for future research are also provided.
The second theme of this book discusses technical and methodological aspects of human dynamics research with social media and big geospatial data analytics. Chapter 4 (Peng & Ye, 2021) studies trends in human dynamics research utilizing social media and big data, with an emphasis on technical aspects related to data collection and management. The authors conducted a bibliometric analysis to examine the growth of output during 2010–2019, the distribution of output across subject categories and journals, the most cited documents, the geographic and institutional distribution of publications, the institutional collaboration network, and keywords. The keyword analysis reveals that “big data”, “social media”, “data collection”, “Twitter”, “Facebook”, and “privacy” were popular throughout the past 10 years. Additional keywords such as “data management”, “cloud computing”, “machine learning”, “data mining”, “Internet of Things”, “big data analytics”, “crowdsourcing”, “data analytics”, “data science”, “big data management”, “Hadoop”, “MapReduce”, “sentiment analysis”, “surveillance”, “business intelligence”, and “IoT” have attracted increasing attention, further reflecting research trends.
“Understanding diverse characteristics of human mobility provides profound knowledge of urban dynamics and complexity. Human movements are recorded in a variety of data sources and each describes unique mobility characteristics.
Revealing similarity and difference in mobility data sources facilitates grasping comprehensive human mobility patterns” (Jin et al., 2020). Chapter 5, a reprint work, “introduces a new method to measure similarities on two OD (Origin–Destination) matrices by spatially extending an image assessing tool, structural SIMilarity index (SSIM). The new measurement, spatially weighted SSIM (SpSSIM), utilizes weight matrices to overcome the SSIM sensitivity issue due to the ordering of OD pairs by explicitly defining spatial adjacency. To evaluate SpSSIM, the authors compared performances between SSIM and SpSSIM with resampling the orders of OD pairs and conducted bootstrap to test the statistical significance of SpSSIM. As a case study, the authors compared OD matrices generated from three data sources in San Diego County, CA; U.S. Census-based Longitudinal Employer-Household Dynamics Origin–Destination employment statistics, Twitter, and Instagram. The case study demonstrated that SpSSIM was able to capture similarities of mobility patterns between datasets varied by distance. Some regions showed local dissimilarity while the global index indicated they were similar. The results enhance the understanding of complex mobility patterns from various datasets including social media” (Jin et al., 2020). Chapter 6 is another reprint work that presents a novel methodological framework to collect and integrate social media and big geospatial data for developing an evacuation decision support tool during natural disasters. “In designing evacuation plans, it is critical for the responsive agencies to consider the dynamic change of human population within impact areas and understand social perception from local residents.
Although a large number of evacuation models has been reported in the literature, many used census survey data which represent only the nighttime population distribution. To fill this research gap, this study introduces a novel data integration framework for developing an evacuation decision support system for wildfire, Integrated Wildfire Evacuation Decision Support System (IWEDSS). IWEDSS integrates multiple data sources including social media, census survey, geographic information systems (GIS) data layers, volunteer suggestions, and remote sensing data. The integration is based on multi-disciplinary theoretical and modeling approaches including Geographic Information Science, civil and transportation engineering, computer science, social media and communication. IWEDSS includes four core modules: dynamic population estimation, stage-based robust evacuation planning, social perception analysis, and web-based geomatical analytic platform. It offers tools for evacuation planers and resource managers to make better decisions that can reduce the evacuation time and potential number of injuries and deaths. This paper also presents a case study to demonstrate the suitability of incorporating social media data to estimate the dynamic change of human population” (Nara et al., 2017). Continued from the previous chapter discussing the development of an evacuation decision support tool, Chap. 7 (Nara et al., 2021) introduces a methodological framework to study human evacuation decision making. Individuals react very differently to evacuation orders as their decisions depend on a variety of factors. Identifying key contributing factors and understanding how they affect an individual’s evacuation decisions can help emergency response organizations improve evacuation plans and communication strategies. 
Conventionally, researchers have studied human evacuation behaviors by conducting post-disaster surveys, which could be costly, be limited by sampling methods, and be dependent on respondent availability resulting in nontimely responses. Social media data analytics is an alternative approach to examine human behaviors during a disaster as social media have become an important communication channel and researchers can access a large amount of data instantly at a relatively low cost. In this study, the authors explored how social media data can be used to gain insight on human evacuation behavior. The authors designed a conceptual model based on relevant literature, developed a codebook to classify Twitter communications, and employed a Bayesian Network approach to build a model to inductively learn dependence relationships of evacuation decision making factors from tweets. In analyzing tweets during the Lilac Fire in San Diego, CA, the learned Bayesian Network highlighted two key factors, risk perception and received information source, that jointly influenced the individual evacuation decision making. This case study also implies that factors related to individual/family situations, evacuation situations, knowledge, and previous experience may not be primary decision-making factors. Chapters under the third theme of this book present applied interdisciplinary research examples ranging from disaster management and public health to urban geography and spatiotemporal information diffusion. Chapter 8 (Guo & Huang, 2021) presents an applied case study in disaster management (a continued topic following Chaps. 6 and 7). Recent decades have witnessed an increasing risk of natural disasters, causing a large loss of assets all over the world. It is widely
acknowledged that a more comprehensive understanding of evacuation behaviors will significantly mitigate losses from natural hazards and improve the management process for the corresponding agencies. Social media platforms with user-generated content provide great potential for better understanding the spatiotemporal patterns of evacuation. However, existing work mainly focuses on the general spatiotemporal pattern, failing to analyze evacuation movements at different stages of a disaster (e.g., preparedness, response, and recovery), and lacks comprehensive analysis of user-generated content (e.g., sentiment). Taking the 2017 Hurricane Harvey as a case study, the authors generated Twitter-based trajectories of users spatiotemporally affected by Hurricane Harvey, conducted analysis by considering differently affected regions at different disaster stages, and visualized the spatiotemporal and sentiment patterns of evacuation behavior.
Chapter 9 (Fernandez et al., 2021), a public health case study, utilizes Twitter data and sentiment analysis during the COVID-19 pandemic. The authors analyzed over 4 million COVID-19 related tweets in four of Italy’s geographic regions (north, central, south, and islands) to address whether socioeconomic factors and tweet sentiments (fear, anger, and joy) shifted over the course of the pandemic and lagged behind specific policy shifts before and after the lockdown, providing an overview of the apprehension associated with pandemic outbreaks. A dataset was built using keywords related to COVID-19 between March and June. The north of Italy was found to have the highest number of social and environmental impacts when compared to the other regions. However, southern Italy and the islands were hit hardest by the economic crisis, including the shortage of medical supplies, weaker infrastructure, loss of tourism, and a high unemployment rate.
Thus, the COVID-19 pandemic widened the existing gap between the northern and southern regions of Italy, highlighting governments’ lack of preparedness and delays in addressing the emergency, the lack of a response to a nationwide emergency and of a communication plan, and the decentralization of the health care system. The sentiment analysis developed from geographic and social media data during the COVID-19 outbreak informed health communication campaigns that helped policy makers and health care professionals track pandemic-related changes over time through the use of technology.
In Chap. 10, Liang and Wang (2021) formulate eight propositions by developing a group of indexes as “infodemiological indicators” from an ecological perspective. An increasing number of Internet users engage in health-related online activities such as searching for health information and sharing experiences in online communities. Geo-tagged and time-stamped data generated from these online activities could inform public health practitioners of the population’s information needs, opinions, attitudes, and behaviors on specific health issues, as well as its health status, thereby becoming an important source for public health surveillance. The heterogeneity and multifaceted nature of social media and search data require an ecological view to guide big data analysis, interpretation, and visualization. The authors recognize the complexity and dynamics of big data, and the efforts of previous scholars in collecting, cleaning/filtering, analyzing, and modeling big data, especially Google search and Twitter data for public health surveillance. Through eight propositions, this study constructs an ecological model of Google Search and Twitter data for
public health surveillance by differentiating message sources, goals, content, and geo-stamped and time-stamped information. The proposed model monitors public attention of health issues by analyzing information flow from celebrities, media, organizations, and blogs to consumers who react, by retweeting or searching online. In addition, it detects diseases by examining consumer reports of personal experiences on Twitter and online searches. “To illuminate understanding of how social media can be leveraged to glean insights into public health issues such as e-cigarette use, [Chap. 11 uses] social media analytics and research testbed (SMART) dashboard to observe Twitter messages and follow content about e-cigarettes in different cities across the U.S. The case studies indicate that the majority of e-cigarette tweets are positive (68%), which represents a potential problem for public health. Stigma plays the most important roles in both confirmed and rejected messages for e-cigarettes. The authors also noticed that some advocates of e-cigarettes might be hybrid human-bot accounts (or multiple users using one account). The key findings demonstrate the use of the SMART dashboard as a means of public health-related belief surveillance, and identification of campaign targets and informational needs of different communities in real-time. Future uses of this tool include monitoring social messages about e-cigarettes for combating the spread of tobacco-related misinformation and disinformation, and detecting and targeting informational needs of communities for intervention” (Martinez et al., 2019). As an urban geography application, Chap. 12 (Gibbons, 2021) studies the neighborhood community connection using geospatial data and methodology. Ethnic neighborhoods have a long-recognized association with neighborhood community connection as measured by trust, belongingness, and cooperation with neighbors. 
How consistent is community connection for racial/ethnic minorities within ethnic neighborhoods? Does this community connection extend beyond these neighborhoods? To address these questions, this study utilizes the 2014–2015 wave of the Public Health Management Corporation’s (PHMC) Southeastern Pennsylvania Household Health Survey data and Geographically Weighted Regression (GWR) to explore how the relationship between race/ethnicity and community connection varies across space. The author compares the relative strength of these relations to the location of ethnic neighborhoods identified by the 2011–2015 American Community Survey. This study suggests that while Black community connection is generally weaker outside of mostly Black neighborhoods, it is not consistently strong in these neighborhoods, either. For the Hispanic population, even more variation exists both within and beyond mostly non-White neighborhoods.
Chapter 13 (Huang et al., 2021) employs a text mining approach to explore gentrification dynamics through Instagram posts. Gentrification is the transformation of a working-class or abandoned area of a city under the influence of redevelopment and an influx of higher-income residents, which involves economic upgrading and the replacement of long-term residents who are often of lower social status. Researchers utilize various qualitative and quantitative methodologies to measure gentrification dynamics; however, incorporating human perceptions of neighborhoods into a large-scale measurement has not been widely explored. Moreover, there is a lack of
research considering gentrification dynamics at a finer spatio-temporal scale across a large area. This chapter introduces a novel data mining framework to examine gentrification dynamics utilizing social media data. Specifically, the authors developed an indicator, gentrification ambiance, to characterize the sense of gentrification. The indicator was extracted from geotagged Instagram posts by employing text processing and text clustering techniques. The spatial distribution of gentrification ambiance over the years revealed that areas observed with a sense of gentrification closely correspond to the human-perceived gentrifying areas. Furthermore, clustering results delineated two types of gentrification, residential-driven and commercial-driven. Finally, the results portrayed the yearly changes, illustrating the potential expansion of gentrification.
Chapter 14 (Huang et al., 2021) analyzes the spatial distribution patterns of geotagged Twitter data created by bots and illustrates their characteristics in terms of data noises, user biases, and system errors. Spam tweets (data noises) created by bots and cyborgs can be identified by examining their “source” metadata fields. In this case study, the portion of geo-tagged tweets created by bots and cyborgs is significant (29.42% in San Diego, CA and 53.47% in Columbus, OH). The spatial distribution patterns of tweets created by bots are not random; different types of bots created unique spatial tweet patterns. The authors also found that the majority of geotagged tweets are not created by generic Twitter apps on Android or iPhone devices, but by other cross-platform apps, such as Instagram and Foursquare. A multi-step data wrangling procedure for geo-tagged social media is recommended for spatial data analysis.
Chapter 15 (Tsou, 2021) concludes this book by providing future research agendas on the study of human dynamics with social media and geospatial data.
This chapter specifically discusses the significant impacts of the COVID-19 pandemic on human society and addresses new research challenges and opportunities. The COVID-19 pandemic in 2020 and 2021 has affected everyone’s life and significantly changed our activities and behaviors. It is the first example of a “data-rich” pandemic in human history and opens many research opportunities for the study of human dynamics. This chapter provides some representative examples and suggestions for future development over both the short term and the long term. The short-term research agenda includes: (1) establishing a transdisciplinary and interoperable COVID-19 data research consortium; (2) facilitating interdisciplinary research collaboration to develop new theories and new methods; (3) identifying health disparity problems from multiple perspectives; and (4) analyzing human mobility changes from multiple data sources. The long-term research agenda emphasizes the lasting impact of the COVID-19 pandemic on our local and global communities. The long-term agenda includes: (1) tracking changes in the percentage of work-from-home employees and online education (synchronous or asynchronous) students; (2) analyzing the dynamic movements of consumer products and mixed digital online services; (3) using Internet of Things (IoT) devices to study human dynamics; (4) extending the study of human dynamics from the physical world to virtual reality; and (5) facilitating the development of responsive decision support systems for the effective management of public health resources and disaster responses in the future.
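Returning to the data wrangling theme of Chapter 14, the idea of screening tweets by their “source” metadata field can be sketched as a simple allow-list filter. Everything specific here is an assumption made for illustration: the allow list, the bot name, the dictionary layout of a tweet, and the handling of sources wrapped in an HTML anchor are hypothetical and are not the white lists or black lists used in that chapter.

```python
import re

# Hypothetical allow list; the lists used in the case studies are not reproduced here.
ALLOWED_SOURCES = {
    "Twitter for iPhone",
    "Twitter for Android",
    "Instagram",
    "Foursquare",
}

_TAG = re.compile(r"<[^>]+>")


def source_name(raw_source):
    """Strip any HTML tags wrapped around a source label and trim whitespace."""
    return _TAG.sub("", raw_source).strip()


def split_by_source(tweets, allowed=ALLOWED_SOURCES):
    """Separate tweets into (kept, flagged) according to the allow list."""
    kept, flagged = [], []
    for tweet in tweets:
        name = source_name(tweet.get("source", ""))
        (kept if name in allowed else flagged).append(tweet)
    return kept, flagged


# Illustrative records only; "ExampleJobBot" is a made-up source name.
tweets = [
    {"id": 1, "source": '<a href="http://twitter.com/download/iphone">Twitter for iPhone</a>'},
    {"id": 2, "source": "ExampleJobBot"},
    {"id": 3, "source": "Instagram"},
]
kept, flagged = split_by_source(tweets)
```

Flagged records are set aside rather than deleted, so they can be reviewed manually or mapped separately before deciding how to treat them.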
In summary, this book, entitled Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, discusses theoretical backgrounds, techniques and methodologies, and applications of current state-of-the-art human dynamics research utilizing social media and geospatial big data. It describes various forms of social media and big data with location information, theory development, data collection and management techniques, and analytical methodologies to conduct human dynamics research including geographic information systems (GIS), spatiotemporal data analytics, text mining and semantic analysis, machine learning, trajectory data analysis, and geovisualization. The book also covers applied interdisciplinary research examples ranging from disaster management and public health to urban geography and spatiotemporal information diffusion. By providing theoretical foundations, solid empirical research backgrounds, techniques, and methodologies as well as application examples from diverse interdisciplinary fields, we believe that this book will be a valuable resource to students, researchers, and practitioners who utilize or plan to employ social media and geospatial big data in their work.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1634641, IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation”, Grant No. 1837577, CS4All project titled “Encoding Geography: Building Capacity for Inclusive Geo-Computational Thinking with Geospatial Technologies”, and Grant No. 2031407, CS4All project titled “Collaborative Research: Encoding Geography-Scaling up an RPP to achieve inclusive geocomputational education”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References
Craglia, M., Ostermann, F., & Spinsanti, L. (2012). Digital Earth from vision to practice: Making sense of citizen-generated content. Int J Digit Earth, 5, 398–416.
Dony, C. C., Magdy, A., Rey, S., Nara, A., Herman, T., & Solem, M. (2019a). RPP for geocomputation: Partnering on curriculum in geography and computer science. In 2019 Res. Equity Sustain. Particip. Eng. Comput. Technol. RESPECT, pp. 1–2.
Dony, C., Magdy, A., Rey, S., Nara, A., Herman, T., & Solem, M. (2019b). RPP for geocomputation: Partnering on curriculum in geography and computer science.
Dony, C., Nara, A., Amatulli, G., Delmelle, E., Tateosian, L., Rey, S., & Sinton, D. (2019c). Computational thinking in U.S. college geography: An initial education research agenda. Res Geogr Educ, 21, 39–54.
Evans, M. R., Oliver, D., Yang, K., Zhou, X., Ali, R. Y., & Shekhar, S. (2019). Enabling spatial big data via cybergis: Challenges and opportunities. CyberGIS for Geospatial Discovery and Innovation, pp. 143–170.
Fernandez, G., Maione, C., Zaballa, K., et al. (2021). Sentiment analysis of social media response and spatial distribution patterns on the COVID-19 outbreak: The case study of Italy. This volume.
Fischer, E. (2014). Making the most detailed tweet map ever. In Mapbox. https://www.mapbox.com/blog/twitter-map-every-tweet/. Accessed 15 Oct 2016.
Gibbons, J. (2021). Placing community: Exploring racial/ethnic community connection within and between racial/ethnic neighborhoods. This volume.
Guo, C., Huang, Q. (2021). Examining spatiotemporal and sentiment patterns of evacuation behavior during 2017 hurricane Harvey. This volume. HDMA. (2014). The center for human dynamics in the mobile age (HDMA center) website. https:// humanDynamics.sdsu.edu/. Accessed 1 April 2021 Huang, C.-C., Nara, A., Gibbons, J., Tsou, M.-H. (2021). Exploring gentrification through social media data and text clustering techniques. This volume. Jiang, Y., Li, Z., & Ye, X. (2019). Understanding demographic and socioeconomic biases of geotagged Twitter users at the county level. Cartography and Geographic Information Science, 46, 228–242. Jin, C., Nara, A., Yang, J.-A., & Tsou, M.-H. (2020). Similarity measurement on human mobility data with spatially weighted structural similarity index (SpSSIM). Transactions in GIS, 24, 104–122. Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data & Society, 1, 2053951714528481. Kwan, M.-P. (2016). Algorithmic geographies: Big data, algorithmic uncertainty, and the production of geographic knowledge. American Association of Geographers Annals, 106, 274–282. Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of google flu: Traps in big data analysis. Science, 343, 1203–1205. Liang, B., Wang, Y. (2021). Conceptualizing an ecological model of Google search and Twitter data in public health. This volume. Martinez, L. S. (2021). Research on misinformation and social net-working sites. This volume. Martinez, L. S., Tsou, M.-H., Spitzberg, B. H. (2019). A case study in belief surveillance, sentiment analysis, and identification of informational targets for e-cigarettes interventions. In Proc. 10th Int. Conf. Soc. Media Soc. Association for Computing Machinery, New York, NY, USA, pp. 15–23. Nara, A., Ghanipoor Machiani, S., Luo, N., Ahmadi, A., Robinett, K., Tominaga, K., Park, J., Jin, C., Yang, X., Tsou, M.-H. (2021). Learning dependence relationships of evacuation decision making factors from Tweets. 
This volume. Nara, A., Tsou, M.-H., Yang, J.-A., Huang, C.-C. (2018). The opportunities and challenges with social media and big data for research in human dynamics. In S.–L. Shaw, D. Sui (Eds.), Hum. Dyn. Res. Smart Connect. Communities. Springer, Cham, pp. 223–234. Nara, A., Yang, X., Ghanipoor Machiani, S., & Tsou, M.-H. (2017). An integrated evacuation decision support system framework with social perception analysis and dynamic population estimation. Int J Disaster Risk Reduct, 25, 190–201. Peng, Q., Ye, X. (2021). Research trends in social media/big data with the emphasis on data collection and data management: A bibliometric analysis. This volume. Shacklett, M. (2015). Thick data closes the gaps in big data analytics. In TechRepublic. https:// www.techrepublic.com/article/thick-data-closes-the-gaps-in-big-data-analytics/. Accessed 10 Oct 2019. Spitzberg, B.H. (2021). Theorizing social media: A formalization of the multilevel model of meme diffusion 2.0 (M3D2.0). This volume. Shaw, S.-L., & Sui, D. (2018). Introduction: Human dynamics in perspective. In S.-L. Shaw & D. Sui (Eds.), Hum. Dyn. Res. Smart Connect. Communities (pp. 1–11). Cham: Springer International Publishing. Tsou, M.-H. (2015). Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science, 42, 70–74. Tsou, M.-H. (2018). The future development of GISystems, GIScience, and GIServices. In Huang B (ed) Comprehensive Geographic Information Systems. Elsevier, Oxford, pp. 1–4 Tsou, M.-H. (2021). The future of human dynamics study: Research challenges and opportunities during and after the COVID-19 pandemic. This volume. Yuan, M. (2018). Human dynamics in space and time: A brief history and a view forward. Transactions in GIS, 22, 900–912. Zhao, B., & Sui, D. Z. (2017). True lies in geospatial big data: Detecting location spoofing in social media. Annals of GIS, 23, 1–14.
Chapter 2
Theorizing Social Media: A Formalization of the Multilevel Model of Meme Diffusion 2.0 (M3D2.0)
Brian H. Spitzberg
2.1 Introduction

Despite enormous variability across people, their individual patterns of working, sleeping, eating, commuting, playing and communicating are generally routine and relatively predictable in their geospatial (e.g., propinquity effects), chronemic (e.g., circadian and diurnal effects), and functional (e.g., working, eating, commuting, etc.) patterns, most of which are accomplished through mediated or face-to-face (FtF) communication (Spitzberg 2019). Much of the current research on social media “big data” tends to pose a general research question and then describe a variety of findings that emerge from a given data set. Seldom are predictions posed on the basis of an a priori theory in such a way that the theory is at risk of being falsified. Yet predictive, falsifiable theory is arguably even more important in the context of big data than in previous empirical contexts and methodologies (Mazzocchi 2015; Wise and Shaffer 2015). This essay seeks to illustrate a project of theory-building in the context of big data and geospatially informed social media.
B. H. Spitzberg, School of Communication, San Diego State University, San Diego, CA, USA. e-mail: [email protected]
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_2

2.2 Meme Themes

The multidimensional model of meme diffusion (M3D) began with the concept of memes (see Fig. 2.1). Memes were proposed as a neologism by the evolutionary biologist Dawkins (2016) as an analogue to genes (Schlaile and Schlaile 2021; Shifman 2013). As genes replicate and confer biological information from one generation or person to the next, Dawkins reasoned that there are likely to be similar mechanisms of cultural evolution. So analogized, memes are expected to reflect three main characteristics: variation, selection and retention. Variation permits novelty of information to occur, selection reflects that there will be competitive pressures such that not all memes replicate and nonadaptive memes replicate less than adaptive memes, and retention represents the potential for information being passed on through replication to sustain itself through future generations of replication.

Fig. 2.1 The multidimensional model of meme diffusion 1.0 (M3D; see Spitzberg 2014)

A meme is defined here as a replicable act or set of acts that has meaning (i.e., entropic influence). Technically, all digital messages, whether visual, auditory or symbolic, constitute potential memes, and become actual memes when significant portions of their content are replicated. Thus, any digital message transferred across media is, by definition, a meme. The M3D proposes that memes occupy a broader information environment in which fitness is influenced by adaptation to the availability or pressures of attention as a scarce resource (Schlaile et al. 2018). Given an information-rich ecosystem in which attention is limited, memes naturally compete for attention. Precise replication (i.e., repetition) and partial replication (with adaptation, or speciation) are both signs of memetic success. Information space is an ecosystem in which some memes are more adaptive or fit in moving other existing or competing memes out of their ecological niche (i.e., meme extinction). In a competitive memetic information ecosystem some displacement is always likely to be in flux, such that some memes become replicated only at the expense of driving other memes to extinction or diminution (Leskovec et al. 2009). Such patterns are likely to reflect competition between and among
different meme structures, producing a set of macro-level (memeplexes; Schlaile and Schlaile 2021) and micro-level patterns of diffusion (Weng et al. 2012).

The life span of memes can be conceptualized through the lens of the M3D theory, which outlines the meme diffusion process as a function of six levels of facilitating or impeding factors (Spitzberg 2014). The six hierarchical levels are: meme level (i.e., message factors, such as distinctiveness, redundancy, simplicity, media convergence, richness), source level (i.e., motivation, knowledge, skill, credibility, centrality/propinquity, adaptation to media), structural social network level (i.e., past memes, number of connected nodes, network interdependence, centrality, structural homophily), subjective social network level (i.e., counter-memes, frame or narrative fidelity, subjective homophily, relative informational advantage, cascade thresholds), societal level (i.e., rival social networks, counter-memes and counter-frames, diffusion stage, mitigating publicity or issue emergence, media accessibility), and geotechnical level (i.e., system limitations, geospatial scope or span, proximity, etc.). These factors facilitate or impede meme diffusion, and the influence of a given meme can be indexed according to outcome criteria such as its diffusion span (i.e., popularity), diffusion velocity, longevity (i.e., duration), fecundity (i.e., spawning of new mutated or altered versions of the original meme; Adamic 2015), and the effect(s) on the outcome(s) of interest (e.g., policy change, election outcome, product purchase, social movements, etc.).

M3D envisions that meme cycles are sometimes evememic (i.e., events in realspace generate meme activity; e.g., social media regarding a natural disaster or celebrity scandal) and are sometimes entymemic, or meme-generated (e.g., Gangnam Style, a Trump tweet).
When these meme cycles merge or intersect, they become polymemic (e.g., social media promote convergence of people to Washington DC to disrupt Congress, and the subsequent riot generates social media content that reinforces the escalation of the riot).

What follows is a set of conjectures about memes and their functioning, derived from the M3D, with an eye toward formal theory and proposition construction (as M3D2.0), emphasizing propositional model-building and cross-theoretical synthesis (Turner 1990). They are visually formalized in Fig. 2.2.

Fig. 2.2 The multidimensional model of meme diffusion 2.0 (M3D2.0)

An assumption of M3D2.0 is that the memescape ecology represents an attention economy. Memes enter the dynamic social system in an over-crowded information ecosystem. Keeping with the evolutionary theory analogue, not all memes will survive to replicate. For example, the vast majority of social media messages travel no farther than one link or receiver, whereas only a small fraction “go viral” with mass replication (Spitzberg 2019). We exist in a complex communication ecology, in which our absolute amount of time (Miritello et al. 2013) and certain quotidian constraints (Malmgren et al. 2008), cognitive capacity for processing information (Fisher et al. 2018; Marois and Ivanoff 2005) and ability to establish and maintain relationships (Gonçalves et al. 2011; LaRose et al. 2014; Liang and Fu 2015) are limited, while the potential for producing more information is virtually unlimited (Jang and Pasek 2015). Given limits on cognitive capacity for maintaining relationships, (1) social networks tend to reach practical capacity at approximately 250; (2) most people have functional networks closer to 150, including acquaintances, friends and kin; (3) these networks are hierarchically organized in inclusive layers of increasing size but diminishing emotional intensity; (4) these layers scale at an approximate ratio of 3 (e.g., 5, 15, 50, 150); and (5) the increasing size-to-decreasing intensity relationship inverts when available social networks are constrained to relatively small sizes (Tamarit et al. 2018).

All “behavior constitutes time allocation” (Baum 2013, p. 283) of one course of action rather than another (Glorieux 1993), and thus, all behavior involves a consideration of foregone alternatives. This principle parallels evolutionary concepts of survival of the fittest in the sense that memes compete to be replicated (Adamic 2015; Weng et al. 2012). The consequence is that attention to any given meme or medium tends to displace an alternative meme or medium (Hofstadter 1985; Peng et al. 2017; Twenge 2019). As such, an attention economy exists in which every communicative activity competes at multiple levels with its alternatives (Ciampaglia et al. 2015; Falkinger 2007; Simon 1971), especially as information and connection overload nears (Feng et al. 2015; LaRose et al. 2014; Stephens et al. 2017). Given the assumption of an attention economy, the following partial propositional survey summarizes the model in Fig. 2.1.

P1: Meme production (number, rate, repetition, polymediated distribution, etc.) is curvilinearly related to (a) initial meme attention and (b) meme diffusion.

P1 derives from the relatively obvious evolutionary notion that natural reproduction leans toward profligacy such that larger numbers of offspring increase the odds that some survive. Greater production and redundancy across more distinct modalities or formats (Dancygier and Vandelanotte 2017) imply greater exposure, and therefore greater opportunity for attention (Gupta and Jenkins-Smith 2015; Hodas and Lerman 2014; Weng and Menczer 2015).
A component of such production is redundancy of signal production, which reinforces exposure, fidelity and adoption (Unkelbach et al. 2019). The curvilinearity arises because as memes become more popular they are diffused more, making the average population of memes more similar, thereby decreasing their novelty and entropy-reduction value (see P2 and P10), which helps account for the S-curve of eventual diminishing adoption as the potential for information entropy reduction is progressively diminished. In addition, novel memes end up “stealing each other’s chance for success” (Coscia 2017, p. 72). Thus, production and popularity reach thresholds in an attention economy.

P2: Perceived memetic novelty is curvilinearly related to (a) meme attention and (b) meme diffusion.

Messages designed to have valence, potency and activity are more likely to be perceived to have information value or issue salience (Schlaile et al. 2018; Vu and Gehrau 2010; Webberley et al. 2016), to a point. Information value here represents the extent to which a media user or consumer attends to, interprets, interacts with, and intends to apply message contents to some aspect of their lives. From a message design perspective, people attend to information sources that “are ‘loud,’ i.e., sent with relatively strong signal strength” (Falkinger 2007, p. 268). Meme interestingness (Weng and Menczer 2015) is likely related to various design aspects, including information variability (Fisher et al. 2018), multimodality (Pereira 2018), affective content (Berger and Milkman 2012; Kim 2015; Schlaile et al. 2018; Walker et al. 2017), and deviation from a baseline template (e.g., Berger and Packard 2018; Coscia 2017). Such features help explain why most viral memes “carry elements of surprise, humor, or irony, so frame-shifting” (Dancygier and Vandelanotte 2017, p. 593). From a perceiver attention perspective “novelty attracts human attention… When information is novel, it is not only surprising, but also more valuable” (Vosoughi et al. 2018, p. 1149). However, novelty is likely to be curvilinearly related to attention, as too little novelty fails to gain attention whereas too much arousal risks information overload or sensory threat (Andersen et al. 1998; Bauckhage and Kersting 2014; Fisher et al. 2018; Kidd et al. 2012).
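The curvilinear claims in P1 and P2 can be pictured with a toy functional form. The sketch below is an illustration only: it assumes a quadratic (inverted-U) relation between perceived novelty and attention, which the chapter does not specify, and the function name `attention_from_novelty`, the 0-to-1 scaling, and the parameter values are all hypothetical.

```python
def attention_from_novelty(novelty, peak=0.5, max_attention=1.0):
    """Toy inverted-U: attention is highest at `peak` novelty and falls
    off on either side. Inputs are assumed to lie in [0, 1]."""
    # The quadratic is an assumption made purely for illustration;
    # any single-peaked curve would express the same idea.
    width = max(peak, 1.0 - peak)
    return max_attention * max(0.0, 1.0 - ((novelty - peak) / width) ** 2)


if __name__ == "__main__":
    for n in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"novelty={n:.2f} -> attention={attention_from_novelty(n):.2f}")
```

Under these assumptions, attention is lowest at the two extremes of novelty and highest in between, which is the shape P2 describes in words.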
P3: Information relevance is positively related to (a) meme attention and (b) meme diffusion.

Some portion of novelty overlaps with perceived information value, but such value is also likely related to manifold actor motivations, as well as message design aspects oriented toward such motivations (Chiu et al. 2007). Contagions compete at the level of content relevance; that is, contagions suppress irrelevant content and amplify diffusion of relevant content (Myers and Leskovec 2012; Xu et al. 2013). For example, regarding self-health, uncertainty-reducing messages with information about patient efficacy may be preferred to novel messages (Kim 2015). In general, perceived information value/relevance (Xiao and Khazaei 2019) or utility (Kim 2015; Yang et al. 2018) are positively related to memetic attention (Lee and Shin 2014).

P4: The relationships between meme production, meme novelty, and memetic information value (P1-P3) and meme (a) attention and (b) diffusion are moderated positively by meme source network centrality/popularity.

“The number of followers is the most important factor” in predicting retweets (Weng and Menczer 2015, p. 12). Memes distributed by more popular sources, stars, influentials or celebrities are more likely to be seen, to be perceived as novel and informationally valuable, and subsequently to be diffused (Schlaile et al. 2018; Walker et al. 2017; Wang and Zheng 2014). Network centrality and the structural connections it represents are positively related to various characteristics of disproportionate competence and interpersonal influence in social networks (Chiu et al. 2017).

P5: The relationships between meme production, meme novelty, and memetic information value (P1-P3) and meme (a) attention and (b) diffusion are moderated positively by the media competence of the meme source.

More digitally competent and credible sources will be more capable of designing, composing, and diffusing memes to appropriate targets. Media competence is the actual and perceived ability to appropriately and effectively interact with others through digital media (Pereira 2018; Spitzberg et al. 2020). More media literate and competent communicators are likely not only to compose and strategically select messages by medium affordances and intended audiences, but also to be held in higher esteem and ethos by those in the social network.

P6: The relationships between meme production, meme novelty, and memetic information value (P1-P3) and meme (a) attention and (b) diffusion are moderated positively by the degree to which node’s social network is subjectively perceived to have adopted the meme.

People tend to base their behavior and beliefs on perceived thresholds of norms in their social networks or their perceived peer groups (Spitzberg et al. 2013; Yan and Jiang 2018). These are their subjective norms, or the perception of the extent to which their reference groups have adopted a given innovation, behavior or belief. Given attention competition, feedback becomes “essential for determining the allocation of cognitive resources,” and promotes likelihood of engagement and likelihood of response (Hodas and Lerman 2014).
Engagement represents a summative category of all the ways in which users might interact with a text through active or kinetic response (e.g., through a click for any purpose, such as retweeting, replying, following, liking/disliking, adding a URL link, adding hashtag(s), photo(s) or image(s), or modification such as a new meme image or caption modification) (Tomblinson et al. 2019). Attentional virality is likely a multiplicative function of both direct communication with others about a diffusion (i.e., learning or informational externalities) and observation of others adopting the diffusion (i.e., network effects or payoff externalities) (Qiu et al. 2015).

P7: The positively moderated relationship between subjective norms and the relationship between meme production, meme novelty, and memetic information value (P1-P3) and meme (a) attention and (b) diffusion is moderated positively by the node’s attributed homophily with the network.

Subjective norms are likely to be given greater attention and credibility to the extent that the reference group is perceived to be similar to oneself. Homophily, or the degree of structural and/or perceived similarity of a person or group, is a powerful predictor of meme diffusion. In one study, over half of innovation adoption was a product of structural homophily, independent of peer influence (Aral et al. 2009, p. 21547). Consistent with spiral of silence theory and various threshold hypotheses, people are more interested in discussing topics when they believe public opinion is consistent with their own opinions on a given topic (Hayes et al. 2001).
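P4 through P7 each state that a relationship is “moderated positively.” One common statistical reading of such a claim, assumed here rather than stated in the chapter, is an interaction term in a linear model: the slope of the predictor on the outcome grows as the moderator increases. The sketch uses information relevance (P3) as the predictor and source centrality (P4) as the moderator; the function names and all coefficient values are hypothetical.

```python
def predicted_attention(relevance, centrality, b0=0.0, b1=0.2, b2=0.1, b3=0.5):
    """Linear model with an interaction term (all coefficients hypothetical).

    attention = b0 + b1*relevance + b2*centrality + b3*relevance*centrality
    A positive b3 encodes positive moderation by source centrality.
    """
    return b0 + b1 * relevance + b2 * centrality + b3 * relevance * centrality


def relevance_slope(centrality, b1=0.2, b3=0.5):
    """Marginal effect of relevance on attention at a given centrality."""
    return b1 + b3 * centrality


if __name__ == "__main__":
    for c in (0.0, 0.5, 1.0):
        print(f"centrality={c:.1f} -> slope of relevance={relevance_slope(c):.2f}")
```

With a positive interaction coefficient, the same increase in relevance yields a larger gain in predicted attention for a more central source, which is one way to operationalize the moderation the propositions describe.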
P8: Node’s media competence is positively related to node’s (a) attention selectivity, which in turn is positively related to (b) the meme’s matching to node’s affordance needs.

Part of media literacy or competence is the ability to match personal needs with the affordances provided by various media platforms, to adapt topical content to the type of media audiences anticipated, and to understand the various technological affordances provided by different media (Ruppel and Burke 2015). Media competence is likely to translate into perceived or attributed authority, credibility and competence (Xiao et al. 2018).

P9: Information affordance needs are positively related to elaborate surveillance attention to information-relevant memes.

P10: Meme affect arousal (proxy: information value/novelty) is curvilinear (inverse parabolic) to elaborate attention and curvilinear (parabolic) to peripheral attention.

The more relevant memes are to a user’s informational affordance needs (i.e., uses and gratifications; Phua et al. 2017), the greater that user’s attention is likely to be to those memes (Evans et al. 2017; Lu and Cheng 2013; Oh and Syn 2015). Affordance needs are likely to motivate perceptual thresholds that cue users to attend to and reflect upon memes regarding the relevance of such memes to those needs. Some memes carry emotional value or effect independent of a user’s information needs. “Emotions focus attention, i.e., they prime which type of information a person primarily perceives” (Steigenberger 2015, p. 437). Emotions, in turn, are directly tied to physiological arousal, and evidence indicates that “virality is partially driven by physiological arousal” (Berger and Milkman 2012, p. 192), with negatively valenced messages generally garnering greater attention value (Kätsyri et al. 2016). The emotional content of memes tends to activate processing and facilitates the virality of meme diffusion (Berger 2011; Nikolinakou and King 2018; Stieglitz and Dang-Xuan 2013).
The curvilinearity of P10 arises from the information value that emotions have for survival. In general, emotion and arousal that are too intense or constrained blunt responsiveness to partner affect variations (Luginbuehl and Schoebi 2020). Further, arousal and valence are inversely parabolic in their relationship: excessive arousal produces negative subjective valence (Andersen et al. 1998; Simons et al. 2003). Low arousal tends not to be noticed, and extremely high arousal tends to bypass or interfere with more central cognition and directly activate more threat-based peripheral routes of processing (Andersen et al. 1998), thereby impairing attention (Simons et al. 2003) and communication processes (Crenshaw et al. 2019). In contrast, the greater the personal message relevance and media competence, the more likely messages are to be processed via a more elaborate or conscious analytic cognitive route (Shi et al. 2018).

P11: Meme attention is positively related to meme diffusion entropy cycle activity (i.e., speciation fecundity, velocity, popularity, duration, and population saturation/resurgence).

Attention, and thereby diffusion, tends to wax and wane as other issues compete for attention and previous memes’ entropy-reducing value diminishes. The more a given meme dominates a network, the lower the entropy of that network (Golo and Galam 2015). In an attention economy the additional unique information value of any given memetic strain of messages tends to diminish over time as it penetrates a given informational niche, due to both informational redundancy and depletion of new adopters. Furthermore, the more time that elapses as a given memeplex diffuses, the greater the potential for novel external (i.e., evememic) events to emerge to compete with extant memetic diffusion cycles (Bauckhage and Kersting 2016).

P12: The relationship between meme diffusion and attention entropy cycle activity (P13) is moderated positively by network infection-potential.

Network infection potential is a function of numerous factors such as the network size, density, homophily, and its influentials and boundary-spanners or liaisons (Watts and Dodds 2007). The roles of three key variables are discussed here in regard to infection potential: (a) influentials and the size of the uninfected network, (b) the mix of homophily versus heterophily, and related, (c) the role of strong and weak ties.

Influentials, celebrities, and those with large followings and connections are often disproportionately capable of initiating viral cascades (Weng and Menczer 2015). The extent to which they succeed depends substantially on the extent to which their social networks have large numbers of nodes not yet infected by or exposed to a given meme (He et al. 2016; Watts and Dodds 2007; Zhang et al. 2016). Diffusion events have demonstrated that all ordinary users have “the potential to become an influential” (Sasahara et al. 2013, p. 9). However, in general, “it is better to have a large number of easily influenced individuals than to have a few highly connected hubs in a social network” (Liu-Thompkins 2012, p. 475). Thus, “it is the moderately connected majority, not the much smaller number of highly connected people, who hold the greatest potential influence” (Smith et al. 2007, p. 390).
In regard to homophily and heterophily, as indicated in P7, homophily is fundamentally attractive (Luo and Zhang 2009). Similarity in demography, language, attitude, personality, and media platform selection facilitates attention to, and actual, communication, and such communication reinforces the attitude similarities within homophilous social networks (e.g., Liang and Fu 2015; Spitzberg 2019). Thus, homophilous communities efficiently diffuse information within their communities (Liang 2018; Weng et al. 2014). However, homophily reinforces info-niches or echo chambers (Choi et al. 2020; Nikolov et al. 2015) that can trap information (Weng et al. 2013). The degree to which memes diffuse beyond such communities depends on a balancing of homophily and heterophily, and strong and weak ties, in group composition. A balance of homogeneity and heterogeneity may activate subjective norms (P7) (Chiang 2007). As homophily reinforces strong ties, and heterophily supports weak ties, such balance becomes vital to widescale diffusion. Weak ties tend to connect less homophilous nodes across communities (Granovetter 1983). “Weak ties … play a crucial role in spreading information over a network, and their removal reduces the portion of the network that can be reached through information diffusion” (De Meo et al. 2014, p. 80). “It is these weak links that are often the connecting nodes between different network structures” (Smith et al. 2007, p. 391).
Thus, widespread contagion is likely due to heterophilous connectors or liaisons who span multiple communities and network holes such that there is a high degree of homophily within communities but rich heterophilous weak ties spanning community boundaries (Bennett et al. 2018; Susarla et al. 2016; Weng et al. 2013). To the extent that community size is large, and the community has heterophilous (i.e., bridge, linking, connective, liaison) members, such communities reinforce diffusion rather than inhibit or trap memes (Liang 2018; Wang and Zheng 2014). Thus, homophily and heterophily are curvilinearly related to diffusion. That is, diffusion is generally facilitated within networks by homogeneity, homophily and the stronger ties associated with such similarity, whereas diffusion across networks is facilitated by heterogeneity and heterophily at the boundaries (liaisons, bridges, weak ties) of networks (Liu-Thompkins 2012). Too high a degree of either homophily or heterophily within a group will diminish a meme’s potential for global virality.

P13: The greater the speciation fecundity, velocity, popularity, duration, and population saturation/resurgence of a meme, the greater its impact (issue efficacy).

There are many potential avenues or indicators of memetic impact, and these can be operationalized in manifold ways. Five conceptual indicators are specified here, although there are clearly more criteria that can be identified (Schlaile and Schlaile 2021). Fecundity is the extent of speciation, or the degree to which a given seed or initiator meme spawns variants of itself, in which some aspects of design or content are retained but subtracted from, added to, reframed, or transformed via a different medium. Velocity is the speed or rate at which diffusion occurs, which can be assessed over any designated time span or scope (e.g., the velocity of diffusion within a given medium, a given group, or according to some designated campaign schedule).
Popularity is the sheer quantity of nodes touched (i.e., viewed, heard, attended to, rated, re-sent, or speciated) by a given meme. Duration is the lifespan of a meme or its memeplex and its direct or speciated progeny. Saturation/resurgence refers to the proportion of a specified population exposed to or infected by a meme and re-exposed or re-infected by that meme, given that memes often become recycled in a given population. Various aspects of these indicators have been operationalized by researchers (e.g., Schlaile and Schlaile 2021; Wang and Zheng 2014).

These various indicators of memetic diffusion will typically signal relatively parallel tracks through meme attention-entropy cycles, in which new memes emerge, those with greater fitness survive, and a small fraction thrive in viral patterns of success (Gleeson et al. 2014; Gleeson et al. 2014; Weng et al. 2012). This meme cycle is similar to what has been studied as an issue attention cycle in mass media or news cycles (Waldherr 2014). In online memetic virality there is generally a bursty initial trend followed by relatively rapid dissipation (Ciampaglia et al. 2015).

P14: Meme attention entropy activity cycles (i.e., speciation fecundity, velocity, popularity, duration, and population saturation/resurgence) are (a) positively related to the concurrence of cooperative societal meme cycles and (b) negatively related to the concurrence with competitive societal meme cycles.

Memes can be cooperative or competitive. Cooperative memes are topically or narratively linked to other memes in ways that reinforce their conjoint informational value, and thereby their fecundity, longevity and speciation. Such cooperation does not need to imply “consistency” in issue, valence, position, or topic. For example, it seems reasonable to conjecture that MSNBC News would not have the attention it receives were it not for Fox News, and vice versa. Their news cycles of thrust and parry, argument-counterargument, feed off of and into each other’s media memescapes, as well as the legions of followers and social network inhabitants who attend to their content (Clark et al. 2018; Wang and Song 2020). Competitive memes are so independent of the information in an existing memescape that they are capable of drawing attention away from it or redirecting attention in an entirely new direction. Sometimes these competitive meme cycles are due to counterframes, counternarratives, or media campaigns designed to distract attention away from existing meme cycles, or simply to generate their own attention. Given the general rapidity with which existing meme cycles tend to erode their information value, many populations and social networks are essentially primed for new, novel, or arousing meme cycles to be initiated.

P15: Feedback and reciprocity to memes are positively related to node’s continued meme attention and diffusion cycle activity.

The norm of reciprocity is a relatively universal disposition, and a fundamental organizing process in group formation and maintenance (Schonmann and Boyd 2016). Communicative behavior and links are generally reciprocal in structure and content (Friedman and Singh 2003; Liang and Fu 2015; Myers et al. 2014). Research also indicates a broad valence (Ferrara and Yang 2015) and affective (He et al. 2016) reciprocity in tweets, especially in regard to influential users (Bae and Lee 2012). This reciprocity probably reaches various thresholds reflected in P1. As social networks expand in size, high quantity human producers and consumers will become less reciprocal due to time displacement (Spitzberg 2019).
As of 2014 in Twitter, “for users with an inbound degree of less than 100, each new follower a user receives adds an average of 4,770 new secondary followers. Similarly, each new follow a user makes connects her to 3,573 new secondary followings” (Myers et al. 2014, p. 497). The exception to this threshold is the role of bots, or automatically programmed reciprocating memes, which are not constrained by the informational or affordance motives of an information ecology.
P16: Any given meme’s attention entropy activity cycle (i.e., speciation, fecundity, velocity, popularity, duration, and population saturation/resurgence) is negatively related to competitive evememic socio-geospatial events.
P17: Meme attention entropy activity cycles (i.e., speciation, fecundity, velocity, popularity, duration, and population saturation/resurgence) are positively related to daytime work-related diurnal cycles.
P18: Evememic socio-geospatial events are positively related to (a) new meme production, and thereby, (b) competitive attention diversion from extant meme cycles.
Memes tend to be produced proportionate to the societal and cultural significance of the events to which the memes refer. While clearly there are institutional gatekeeping and agenda-setting processes that bias or distort the significance of events, the symbolic capital of certain types of events often drives such processes. Successful memes are “deeply situated in particular cultural contexts” (Wang and Wang 2015,
2 Theorizing Social Media: A Formalization …
p. 270). Most everyday virality is contagion-like, in the sense that messages spread in cascades of messages that diffuse and speciate across social networks. In contrast, sometimes mega-events occur that capture collective attention, characterized by both large numbers of solo messages as well as chains of retweets. Such pattern differences can identify “strong collective attention during natural disasters and major sporting events, as well as moderate attention related to culture, science, technology, politics, and annual regular events” (Sasahara et al. 2013, p. 9).
Given the competition of topics for attention in pluralistic societies, and the relative inelasticity of time compared with economies of scale and functionality in information packaging and processing (Zhu 1992), several tendencies arise. First, as the number of topics introduced into an information ecology increases, the collective or average popularity of any given topic will decrease, “as the addition of any new topic onto the public agenda comes at the cost of other topics” (Xu et al. 2013, p. 3). Importantly, this attention competition is dynamic in multiple ways (Myers and Leskovec 2012). For example, some topics are more cooperative (i.e., as attention to one topic increases, it naturally reinforces attention to another topic, such as economy topics and politics topics in an election year), whereas others are more competitive (i.e., as attention to one topic increases, it naturally diminishes attention to another topic, such as when a school shooting diminishes the attributed importance and salience of more routine matters) (Peng et al. 2017).
Among the most stable factors influencing meme cycles are diurnal or quotidian patterns reflecting contemporary social and occupational activities, such as light–dark, sleep–wake, work–leisure, and dining–socializing routines (Dzogang et al. 2017; Dzogang et al. 2018; 2013; Spitzberg 2019).
Yet, even regular diurnal cycles can give way to significant social, political, cultural and international events, which grab the public’s attention and create media firestorms of meme activity (Myers et al. 2012; Schäfer et al. 2014). General topics are likely to demonstrate relatively regular diffusion cycles and patterns, but event-specific situations are likely to reveal much burstier patterns and their own distinct diffusion curves (Stai et al. 2017; Wang and Zheng 2014) and even distinct emotional profiles.
Studies continue to demonstrate that geospatial proximity is still significantly predictive of communication ties and quantity (Spitzberg 2019). The moderating influence of media on proximity is suggested by “a new corollary to Tobler’s first law of geography; ‘In both real space and cyberspace, everything is related to everything else, but near things are more related in real space than in cyberspace’” (Han et al. 2018, p. 16).
There are also institutional design and technological factors that often mediate the shape of meme diffusion. For example, the ‘invention’ of hashtags vastly facilitated the ability to produce and investigate meme cycles and diffusion. Such hashtags especially facilitate self-organizing meme diffusion during notable events such as crises (Sutton et al. 2015). Various algorithms designed into social media influence meme diffusion cycles (Simkin and Roychowdhury 2015). The flexibility of media contents for visual and textual mimesis or alteration and editing across multiple media platforms also affects diffusion (Susarla et al. 2016), as does the broadband technological capacity of a given social structure (Chiu et al. 2007). Finally, institutional interference in meme cycles by the use of automated bots and computer programs
significantly influences meme production, and thereby, as would be typically intended, meme diffusion cycles (Martinez et al. 2018). These are merely illustrative examples of the ways in which socio-geospatial activities can disrupt or accentuate meme diffusion cycles.
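The diffusion indicators that recur across P14 to P18 (popularity, duration, and population saturation/resurgence) can be given a rough computational reading. The following is only a minimal sketch under stated assumptions, not the operationalization used in any of the cited studies: it assumes a hypothetical log of exposure events, each recording which node was touched by a meme and when, and a known population size. The names `Exposure` and `diffusion_indicators` are illustrative and do not come from the chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exposure:
    """Hypothetical record: one node touched by a meme at a given time."""
    node: str
    time: float  # e.g., hours since the meme was first observed


def diffusion_indicators(exposures, population_size):
    """Return simple popularity, duration, saturation, and resurgence measures.

    Assumed definitions (illustrative only):
      popularity = number of distinct nodes touched
      duration   = time between first and last observed exposure
      saturation = share of the population exposed at least once
      resurgence = share of the population exposed more than once
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    if not exposures:
        return {"popularity": 0, "duration": 0.0,
                "saturation": 0.0, "resurgence": 0.0}

    # Count how many times each node was exposed.
    counts = {}
    for e in exposures:
        counts[e.node] = counts.get(e.node, 0) + 1

    times = [e.time for e in exposures]
    popularity = len(counts)
    re_exposed = sum(1 for c in counts.values() if c > 1)

    return {
        "popularity": popularity,
        "duration": max(times) - min(times),
        "saturation": popularity / population_size,
        "resurgence": re_exposed / population_size,
    }


# Usage: four exposures across three nodes in a population of ten.
log = [Exposure("a", 0.0), Exposure("b", 1.5),
       Exposure("a", 6.0), Exposure("c", 8.0)]
print(diffusion_indicators(log, population_size=10))
```

Treating saturation and resurgence as separate proportions is a design choice made here for clarity; the text treats them as a single compound indicator, and speciation, fecundity, and velocity would require additional information (e.g., meme variants and timestamps of re-sending) that this sketch does not model.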
2.3 Conclusion
Consistent with general systems theory and its assumptions of interdependence, holism, equifinality and multifinality, there are numerous moderating and mediating hypotheses that could be further elaborated from M3D 2.0 and its constructs. Changing any downstream variable has implications for changes in all of the upstream variables in the model and vice versa, given both the explicit and implicit feedback loops among variables. As one example of the interdependency of variables, one study found that social network site intensity, trust, tie strength, homophily, privacy concerns, and attention to social comparison moderated the relationship between social network site use and online bridging and bonding social capital (Phua et al. 2017). There are likely to be additive and multiplicative interactions among the variables of such a model that cannot be fully anticipated or articulated in simple bivariate propositions. Furthermore, any given empirical pursuit of the model might be able to begin its focus at any given starting point of the model. “Internet memes are often bottom-up cultural creations, moving from individuals to wider crowds. However, once memes achieve a certain level of popularity and become part of meme generators, they transform into top-down repertoires” (Nissenbaum and Shifman 2018, p. 306). Thus, some research of political social media shows complex interactions between social media and traditional media, indicating that “attentional spikes of the blogs, tweets, and discussion board posts are as likely to precede the traditional media as to follow it” (Neuman et al. 2014, p. 210). Memes can drive cyberspace memes or realspace events, and realspace events can drive memes, and together they can form their own dynamic self-reinforcing symbolic systems. Meme diffusion cycles can be evememic, entymemic, or polymemic, belying the implied linearity of the model’s arrows.
Such complex interdependencies reflect a growing recognition in the study of communication geography that “an emerging paradigm of media/communication geography is the idea of a metaphysics of encounter” (Adams 2017, p. 366). “Not only do communications mediate between space and place, but they do so in ways that are part of a second dialectic structured by the tension between communications as context and communications as contents. Space is where communication happens but is also one of the things created by communications” (Adams 2011, p. 39). In such a complex context, “the need for integrating theory becomes paramount to sensemaking in the realm of big data” (Spitzberg et al. 2020, p. 175). The challenge for developing theory in a manner compatible with scientific inquiry has never been greater, nor more open to the imagination.
References
Adamic, L. A. (2015). The information life of social networks. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. Association for Computing Machinery, New York, NY, USA, pp. 273–274.
Adams, P. C. (2011). A taxonomy for communication geography. Progress in Human Geography, 35, 37–57.
Adams, P. C. (2017). Geographies of media and communication I: Metaphysics of encounter. Progress in Human Geography, 41, 365–374.
Andersen, P. A. (1998). The cognitive valence theory of intimate communication. In M. T. Palmer & G. A. Barnett (Eds.), Mutual influence in interpersonal communication: Theory and research in cognition, affect, and behavior (pp. 39–72). Westport: Greenwood Publishing Group.
Aral, S., Muchnik, L., & Sundararajan, A. (2009). Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106, 21544–21549.
Bae, Y., & Lee, H. (2012). Sentiment analysis of Twitter audiences: Measuring the positive or negative influence of popular twitterers. Journal of the American Society for Information Science and Technology, 63, 2521–2535.
Bauckhage, C., & Kersting, K. (2014). Strong regularities in growth and decline of popularity of social media services. arXiv:1406.6529 [physics].
Bauckhage, C., & Kersting, K. (2016). Collective attention on the web. Foundations and Trends® in Web Science, 5, 1–136.
Baum, W. M. (2013). What counts as behavior? The molar multiscale view. Behavior Analyst, 36, 283–293.
Bennett, W. L., Segerberg, A., & Yang, Y. (2018). The strength of peripheral networks: Negotiating attention and meaning in complex media ecologies. The Journal of Communication, 68, 659–684.
Berger, J. (2011). Arousal increases social transmission of information. Psychological Science, 22, 891–893.
Berger, J., & Milkman, K. L. (2012). What makes online content viral? Journal of Marketing Research, 49, 192–205.
Berger, J., & Packard, G. (2018). Are atypical things more popular? Psychological Science, 29, 1178–1184.
Chiang, Y.-S. (2007). Birds of moderately different feathers: Bandwagon dynamics and the threshold heterogeneity of network neighbors. Journal of Mathematical Sociology, 31, 47–69.
Chiu, C.-Y. (Chad), Balkundi, P., & Weinberg, F. J. (2017). When managers become leaders: The role of manager network centralities, social power, and followers’ perception of leadership. The Leadership Quarterly, 28, 334–348.
Chiu, H.-C., Hsieh, Y.-C., Kao, Y.-H., & Lee, M. (2007). The determinants of email receivers’ disseminating behaviors on the internet. Journal of Advertising Research, 47, 524–534.
Choi, D., Chun, S., Oh, H., Han, J., & Kwon, T. (2020). Rumor propagation is amplified by echo chambers in social media. Scientific Reports, 10, 310.
Ciampaglia, G. L., Flammini, A., & Menczer, F. (2015). The production of information in the attention economy. Scientific Reports, 5, 9452.
Clark, J. C., Spitzberg, B. H., & Tsou, M. H. (2018). The digipolitical animal: Investigating the memetic diffusion of political messages on Twitter. Western States Communication Association Conference.
Coscia, M. (2017). Popularity spikes hurt future chances for viral propagation of protomemes. Communications of the ACM, 61, 70–77.
Crenshaw, A. O., Leo, K., & Baucom, B. R. W. (2019). The effect of stress on empathic accuracy in romantic couples. Journal of Family Psychology, 33, 327–337.
Dancygier, B., & Vandelanotte, L. (2017). Internet memes as multimodal constructions. Cognitive Linguistics, 28, 565–598.
Dawkins, R. (2016). The selfish gene. Oxford: Oxford University Press.
De Meo, P., Ferrara, E., Fiumara, G., & Provetti, A. (2014). On Facebook, most ties are weak. Communications of the ACM, 57, 78–84.
Dzogang, F., Lightman, S., & Cristianini, N. (2017). Circadian mood variations in Twitter content. Brain and Neuroscience Advances, 1, 2398212817744501.
Dzogang, F., Lightman, S., & Cristianini, N. (2018). Diurnal variations of psychometric indicators in Twitter content. PLoS One, 13, e0197002.
Evans, S. K., Pearce, K. E., Vitak, J., & Treem, J. W. (2017). Explicating affordances: A conceptual framework for understanding affordances in communication research. Journal of Computer-Mediated Communication, 22, 35–52.
Falkinger, J. (2007). Attention economies. Journal of Economic Theory, 133, 266–294.
Feng, L., Hu, Y., Li, B., Stanley, H. E., Havlin, S., & Braunstein, L. A. (2015). Competing for attention in social media under information overload conditions. PLoS One, 10, e0126090.
Ferrara, E., & Yang, Z. (2015). Measuring emotional contagion in social media. PLoS One, 10, e0142390.
Fisher, J. T., Keene, J. R., Huskey, R., & Weber, R. (2018). The limited capacity model of motivated mediated message processing: Taking stock of the past. Annals of the International Communication Association, 42, 270–290.
Friedman, D., & Singh, N. (2003). Negative reciprocity: The coevolution of memes and genes. SSRN. https://doi.org/10.2139/ssrn.509242
Gleeson, J. P., Ward, J. A., O’Sullivan, K. P., & Lee, W. T. (2014). Competition-induced criticality in a model of meme popularity. Physical Review Letters, 112, 048701.
Glorieux, I. (1993). Social interaction and the social meanings of action: A time budget approach. Social Indicators Research, 30, 149–173.
Golo, N., & Galam, S. (2015). Conspiratorial beliefs observed through entropy principles. Entropy, 17, 5611–5634.
Gonçalves, B., Perra, N., & Vespignani, A. (2011). Modeling users’ activity on Twitter networks: Validation of Dunbar’s Number. PLoS One, 6, e22656.
Granovetter, M. (1983). The strength of weak ties: A network theory revisited. Sociological Theory, 1, 201–233.
Gupta, K., & Jenkins-Smith, H. (2015). Anthony Downs, Up and down with ecology: The ‘issue-attention’ cycle. In M. Lodge, E. C. Page, & S. J. Balla (Eds.), The Oxford Handbook of Classics in Public Policy and Administration. Oxford, pp. 1–12.
Han, S. Y., Tsou, M.-H., & Clarke, K. C. (2018). Revisiting the death of geography in the era of big data: The friction of distance in cyberspace and real space. International Journal of Digital Earth, 11, 451–469.
Hayes, A. F., Shanahan, J., & Glynn, C. J. (2001). Willingness to express one’s opinion in a realistic situation as a function of perceived support for that opinion. International Journal of Public Opinion Research, 13, 45–58.
He, S., Zheng, X., & Zeng, D. (2016). A model-free scheme for meme ranking in social media. Decision Support Systems, 81, 1–11.
Hodas, N. O., & Lerman, K. (2014). The simple rules of social contagion. Scientific Reports, 4, 4343.
Hofstadter, D. R. (1985). Metamagical Themas: Questing for the Essence of Mind and Pattern. New York: Basic Books.
Jang, S. M., & Pasek, J. (2015). Assessing the carrying capacity of Twitter and online news. Mass Communication and Society, 18, 577–598.
Kätsyri, J., Kinnunen, T., Kusumoto, K., Oittinen, P., & Ravaja, N. (2016). Negativity bias in media multitasking: The effects of negative social media messages on attention to television news broadcasts. PLoS One, 11, e0153712.
Kidd, C., Piantadosi, S. T., & Aslin, R. N. (2012). The goldilocks effect: Human infants allocate attention to visual sequences that are neither too simple nor too complex. PLoS One, 7, e36399.
Kim, H. S. (2015). Attracting views and going viral: How message features and news-sharing channels affect health news diffusion. The Journal of Communication, 65, 512–534.
LaRose, R., Connolly, R., Lee, H., Li, K., & Hales, K. D. (2014). Connection overload? A cross cultural study of the consequences of social media connection. Information Systems Management, 31, 59–73.
Lee, E.-J., & Shin, S. Y. (2014). When the medium is the message: How transportability moderates the effects of politicians’ Twitter communication. Communication Research, 41, 1088–1110.
Leskovec, J., Backstrom, L., & Kleinberg, J. (2009). Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, pp. 497–506.
Liang, H. (2018). Broadcast versus viral spreading: The structure of diffusion cascades and selective sharing on social media. The Journal of Communication, 68, 525–546.
Liang, H., & Fu, K. (2015). Testing propositions derived from Twitter studies: Generalization and replication in computational social science. PLoS One, 10, e0134270.
Liu-Thompkins, Y. (2012). Seeding viral content: The role of message and network factors. Journal of Advertising Research, 52, 465–478.
Lu, J., & Cheng, L. (2013). Perceiving and interacting affordances: A new model of human–affordance interactions. Integrative Psychological & Behavioral Science, 47, 142–155.
Luginbuehl, T., & Schoebi, D. (2020). Emotion dynamics and responsiveness in intimate relationships. Emotion, 20, 133–148.
Luo, S., & Zhang, G. (2009). What leads to romantic attraction: Similarity, reciprocity, security, or beauty? Evidence from a speed-dating study. Journal of Personality, 77, 933–964.
Malmgren, R. D., Stouffer, D. B., Motter, A. E., & Amaral, L. A. N. (2008). A Poissonian explanation for heavy tails in e-mail communication. Proceedings of the National Academy of Sciences, 105, 18153–18158.
Marois, R., & Ivanoff, J. (2005). Capacity limits of information processing in the brain. Trends in Cognitive Sciences, 9, 296–305.
Martinez, L. S., Hughes, S., Walsh-Buhi, E. R., & Tsou, M.-H. (2018). Okay, we get it. You vape: An analysis of geocoded content, context, and sentiment regarding e-cigarettes on Twitter. Journal of Health Communication, 23, 550–562.
Mazzocchi, F. (2015). Could big data be the end of theory in science? EMBO Reports, 16, 1250–1255.
Miritello, G., Lara, R., Cebrian, M., & Moro, E. (2013). Limited communication capacity unveils strategies for human interaction. Scientific Reports, 3, 1950.
Myers, S. A., & Leskovec, J. (2012). Clash of the contagions: Cooperation and competition in information diffusion. In 2012 IEEE 12th International Conference on Data Mining, pp. 539–548.
Myers, S. A., Zhu, C., & Leskovec, J. (2012). Information diffusion and external influence in networks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, New York, NY, USA, pp. 33–41.
Myers, S. A., Sharma, A., Gupta, P., & Lin, J. (2014). Information network or social network? The structure of the Twitter follow graph. In Proceedings of the 23rd International Conference on World Wide Web. Association for Computing Machinery, New York, NY, USA, pp. 493–498.
Neuman, R. W., Guggenheim, L., Jang, S. M., & Bae, S. Y. (2014). The dynamics of public attention: Agenda-setting theory meets big data. The Journal of Communication, 64, 193–214.
Nikolinakou, A., & King, K. W. (2018). Viral video ads: Emotional triggers and social media virality. Psychology and Marketing, 35, 715–726.
Nikolov, D., Oliveira, D. F. M., Flammini, A., & Menczer, F. (2015). Measuring online social bubbles. PeerJ Computer Science, 1, e38.
Nissenbaum, A., & Shifman, L. (2018). Meme templates as expressive repertoires in a globalizing world: A cross-linguistic study. Journal of Computer-Mediated Communication, 23, 294–310.
Oh, S., & Syn, S. Y. (2015). Motivations for sharing information and social support in social media: A comparative analysis of Facebook, Twitter, Delicious, YouTube, and Flickr. Journal of the Association for Information Science and Technology, 66, 2045–2060.
Peng, T.-Q., Sun, G., & Wu, Y. (2017). Interplay between public attention and public emotion toward multiple social issues on Twitter. PLoS One, 12, e0167896.
Pereira, A. (2018). Exploring the multimodal argument: The interplay of multimodality and attention economy. Pedagogies: An International Journal, 13, 201–221.
Phua, J., Jin, S. V., & Kim, J. (Jay) (2017). Uses and gratifications of social networking sites for bridging and bonding social capital: A comparison of Facebook, Twitter, Instagram, and Snapchat. Computers in Human Behavior, 72, 115–122.
Qiu, L., Tang, Q., & Whinston, A. B. (2015). Two formulas for success in social media: Learning and network effects. Journal of Management Information Systems, 32, 78–108.
Ruppel, E. K., & Burke, T. J. (2015). Complementary channel use and the role of social competence. Journal of Computer-Mediated Communication, 20, 37–51.
Sasahara, K., Hirata, Y., Toyoda, M., Kitsuregawa, M., & Aihara, K. (2013). Correction: Quantifying collective attention from tweet stream. PLoS One, 8. https://doi.org/10.1371/annotation/ 25b6b59d.
Schäfer, M. S., Ivanova, A., & Schmidt, A. (2014). What drives media attention for climate change? Explaining issue attention in Australian, German and Indian print media from 1996 to 2010. International Communication Gazette, 76, 152–176.
Schlaile, M. P. (2021). A case for economemetics? Why evolutionary economists should re-evaluate the (f)utility of memetics. In M. P. Schlaile (Ed.), Memetics and Evolutionary Economics: To Boldly Go Where no Meme has Gone Before (pp. 33–68). Cham: Springer International Publishing.
Schlaile, M. P., Knausberg, T., Mueller, M., & Zeman, J. (2018). Viral ice buckets: A memetic perspective on the ALS Ice Bucket Challenge’s diffusion. Cognitive Systems Research, 52, 947–969.
Schonmann, R. H., & Boyd, R. (2016). A simple rule for the evolution of contingent cooperation in large groups. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 371, 20150099.
Shi, J., Hu, P., Lai, K. K., & Chen, G. (2018). Determinants of users’ information dissemination behavior on social networking sites: An elaboration likelihood model perspective. Internet Research, 28, 393–418.
Shifman, L. (2013). Memes in a digital world: Reconciling with a conceptual troublemaker. Journal of Computer-Mediated Communication, 18, 362–377.
Simkin, M. V., & Roychowdhury, V. P. (2015). Why does attention to web articles fall with time? Journal of the Association for Information Science and Technology, 66, 1847–1856.
Simon, H. A. (1971). Designing organizations for an information-rich world. In M. Greenberger (Ed.), Computers, Communications, and the Public Interest. Johns Hopkins Press, pp. 37–72.
Simons, R. F., Detenber, B. H., Cuthbert, B. N., Schwartz, D. D., & Reiss, J. E. (2003). Attention to television: Alpha power and its relationship to image motion and emotional content. Media Psychology, 5, 283–301.
Smith, T., Coyle, J. R., Lightfoot, E., & Scott, A. (2007). Reconsidering models of influence: The relationship between consumer social networks and word-of-mouth effectiveness. Journal of Advertising Research, 47, 387–397.
Spitzberg, B. H. (2014). Toward a Model of Meme Diffusion (M3D). Communication Theory, 24, 311–339.
Spitzberg, B. H. (2019). Trace of pace, place, and space in personal relationships: The chronogeometrics of studying relationships at scale. Personal Relationships, 26, 184–208.
Spitzberg, B. H., & Record, R. A. (2020). Mediated communication. In B. H. Spitzberg, D. J. Canary, H. E. Canary, & P. A. Andersen (Eds.), The communication capstone: The communication inquiry and theory experience (pp. 321–335). San Diego: Cognella Academic Publishing.
Spitzberg, B. H., Tsou, M.-H., Gupta, D. K., An, L., Gawron, J. M., & Lusher, D. (2013). The map is not which territory?: Speculating on the geo-spatial diffusion of ideas in the Arab Spring of 2011. Studies in Media and Communication, 1, 101–115.
Stai, E., Karyotis, V., Bitsaki, A.-C., & Papavassiliou, S. (2017). Strategy evolution of information diffusion under time-varying user behavior in generalized networks. Computer Communications, 100, 91–103.
Steigenberger, N. (2015). Emotions in sensemaking: A change management perspective. Journal of Organizational Change Management, 28, 432–451.
Stephens, K. K., Mandhana, D. M., Kim, J. J., Li, X., Glowacki, E. M., & Cruz, I. (2017). Reconceptualizing communication overload and building a theoretical foundation. Communication Theory, 27, 269–289.
Stieglitz, S., & Dang-Xuan, L. (2013). Emotions and information diffusion in social media—Sentiment of microblogs and sharing behavior. Journal of Management Information Systems, 29, 217–248.
Susarla, A., Oh, J.-H., & Tan, Y. (2016). Influentials, imitables, or susceptibles? Virality and word-of-mouth conversations in online social networks. Journal of Management Information Systems, 33, 139–170.
Sutton, J., Gibson, C. B., Phillips, N. E., Spiro, E. S., League, C., Johnson, B., Fitzhugh, S. M., & Butts, C. T. (2015). A cross-hazard analysis of terse message retransmission on Twitter. Proceedings of the National Academy of Sciences, 112, 14793–14798.
Tamarit, I., Cuesta, J. A., Dunbar, R. I. M., & Sánchez, A. (2018). Cognitive resource allocation determines the organization of personal networks. Proceedings of the National Academy of Sciences, 115, 8316–8321.
Tomblinson, C. M., Wadhwa, V., Latimer, E., Gauss, C. H., & McCarty, J. L. (2019). Publicly available metrics underestimate AJNR Twitter impact and follower engagement. American Journal of Neuroradiology, 40, 1994–1997.
Turner, J. H. (1990). The misuse and use of metatheory. Sociological Forum, 5, 37–53.
Twenge, J. M. (2019). More time on technology, less happiness? Associations between digital-media use and psychological well-being. Current Directions in Psychological Science, 28, 372–379.
Unkelbach, C., Koch, A., Silva, R. R., & Garcia-Marques, T. (2019). Truth by repetition: Explanations and implications. Current Directions in Psychological Science, 28, 247–253.
Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359, 1146–1151.
Vu, H. N. N., & Gehrau, V. (2010). Agenda diffusion: An integrated model of agenda setting and interpersonal communication. Journalism & Mass Communication Quarterly, 87, 100–116.
Waldherr, A. (2014). Emergence of news waves: A social simulation approach. The Journal of Communication, 64, 852–873.
Walker, L., Baines, P. R., Dimitriu, R., & Macdonald, E. K. (2017). Antecedents of retweeting in a (political) marketing context. Psychology and Marketing, 34, 275–293.
Wang, X., & Song, Y. (2020). Viral misinformation and echo chambers: The diffusion of rumors about genetically modified organisms on social media. Internet Research, 30, 1547–1564.
Wang, J., & Wang, H. (2015). From a marketplace to a cultural space: Online meme as an operational unit of cultural transmission. Journal of Technical Writing and Communication, 45, 261–274.
Wang, Y., & Zheng, B. (2014). On macro and micro exploration of hashtag diffusion in Twitter. In 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2014), pp. 285–288.
Watts, D. J., & Dodds, P. S. (2007). Influentials, networks, and public opinion formation. The Journal of Consumer Research, 34, 441–458.
Webberley, W. M., Allen, S. M., & Whitaker, R. M. (2016). Retweeting beyond expectation: Inferring interestingness in Twitter. Computer Communications, 73, 229–235.
Weng, L., & Menczer, F. (2015). Topicality and impact in social media: Diverse messages, focused messengers. PLoS One, 10, e0118410.
Weng, L., Flammini, A., Vespignani, A., & Menczer, F. (2012). Competition among memes in a world with limited attention. Scientific Reports, 2, 335.
Weng, L., Menczer, F., & Ahn, Y.-Y. (2013). Virality prediction and community structure in social networks. Scientific Reports, 3, 2522.
Weng, L., Menczer, F., & Ahn, Y.-Y. (2014). Predicting successful memes using network and community structure. arXiv:1403.6199 [physics].
Wise, A. F., & Shaffer, D. W. (2015). Why theory matters more than ever in the age of big data. The Journal of Learning Analytics, 2, 5–13.
Xiao, L., & Khazaei, T. (2019). Changing others’ beliefs online: Online comments’ persuasiveness. In Proceedings of the 10th International Conference on Social Media and Society. Association for Computing Machinery, New York, NY, USA, pp. 92–101.
Xiao, M., Wang, R., & Chan-Olmsted, S. (2018). Factors affecting YouTube influencer marketing credibility: A heuristic-systematic model. Journal of Media Business Studies, 15, 188–213.
Xu, P., Wu, Y., Wei, E., Peng, T. Q., Liu, S., Zhu, J. J. H., & Qu, H. (2013). Visual analysis of topic competition on social media. IEEE Transactions on Visualization and Computer Graphics, 19, 2012–2021.
Yan, X., & Jiang, P. (2018). Effect of the dynamics of human behavior on the competitive spreading of information. Computers in Human Behavior, 89, 1–7.
Yang, Q., Tufts, C., Ungar, L., Guntuku, S., & Merchant, R. (2018). To retweet or not to retweet: Understanding what features of cardiovascular tweets influence their retransmission. Journal of Health Communication, 23, 1026–1035.
Zhang, L., Zhao, J., & Xu, K. (2016). Who creates trends in online social media: The crowd or opinion leaders? Journal of Computer-Mediated Communication, 21, 1–16.
Zhu, J.-H. (1992). Issue competition and attention distraction: A zero-sum theory of agenda-setting. Journalism Quarterly, 69, 825–836.
Chapter 3
Research on Misinformation and Social Networking Sites
Lourdes S. Martinez
3.1 Introduction
Since Facebook’s introduction to the information landscape in 2004, scholars have recognized the increasing importance of studying social networking sites and how users may use them to seek and share information. However, properties of online platforms (e.g., Nahon & Hemsley, 2013) and rapid information diffusion across users in social networks unfortunately may also serve to enable the propagation of misinformation (e.g., Al-Rawi, 2019; Bessi et al., 2015; Lewandowsky et al., 2012; Sharma et al., 2017; Vosoughi et al., 2018), which can take the form of inaccurate content (e.g., Loeb et al., 2019; Wang et al., 2019) or fake news (e.g., Broniatowski et al., 2018; Shao et al., 2018). The spread of misinformation (e.g., Allcott & Gentzkow, 2017) may produce a number of deleterious effects on society. Given the popularity of certain social networking sites (SNSs) such as Facebook, Twitter, and Instagram (Pew Research Center, 2019), continued scholarly attention dedicated to understanding the impact of misinformation spreading through social networking sites on individuals and society is both urgent and warranted. This chapter will provide a critical review of the extant literature on misinformation and social networking sites, and discuss directions for future research.
L. S. Martinez, School of Communication, San Diego State University, San Diego, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_3
3.2 Background
3.2.1 Social Networking Sites and Misinformation
Social networking sites represent a specific class of web-based services that permit users to “(1) construct a public or semi-public profile within a bounded system, (2) articulate a list of other users with whom they share a connection, and (3) view and traverse their list of connections and those made by others within the system” (Boyd & Ellison, 2007, p. 211). This definition is not limited to platforms intended to merely reflect pre-existing offline social ties, but also extends to platforms allowing users to connect online with others with whom they may not yet share an offline social tie. In the past 15 years, user bases of social networking sites have continued to rise as more individuals adopt and integrate the use of these platforms into daily life. Although only five percent of adults in the U.S. reported using at least one of these platforms in 2005, this figure has increased to over seventy percent today (Pew Research Center, 2019). Social networking sites with user bases capturing the largest share of U.S. adults include Facebook (69%), YouTube (73%), Instagram (37%), LinkedIn (27%), and Twitter (22%) (Pew Research Center, 2019). Daily use of these platforms is also fairly high, with over half of U.S. adults reporting daily use of Facebook (74%), Instagram (63%), Snapchat (61%), and YouTube (51%). Additionally, over forty percent of U.S. adults report using Twitter on a daily basis. Due to their rising popularity, this review examines social networking sites that offer users a way to maintain their current offline social connections with potential to initiate new ones online.
In addition to managing social connections, social networking sites offer a number of affordances to their users, including but not limited to opportunities to socially interact with others, entertain themselves, pass the time, relax, express their opinion, and engage in social surveillance of other users (Whiting & Williams, 2013). Users also use these platforms to seek information and share information with other users (Whiting & Williams, 2013), including information seeking (e.g., Gil de Zúñiga et al., 2017) and sharing (e.g., Lee & Ma, 2012) of news content. Current estimates place the proportion of U.S. adults who consult social media as a news source at sixty-eight percent (Shearer & Matsa, 2018). Across platforms, Facebook captures the largest proportion of U.S. adults looking for news (43%), followed by YouTube (21%), Twitter (12%), Instagram (8%), LinkedIn (6%), Reddit (5%), and Snapchat (5%). In addition to various technological, social, and communicative affordances (e.g., Bucher & Helmond, 2017), social networking sites also allow users to curate content that appears on their feeds. This curation can present unique challenges to addressing the issue of misinformation on these platforms, especially once misinformation spreads, takes hold within a social network, and leads to misperceptions among members of that particular community. The concept of misinformation has been previously defined as any content that is considered unequivocally false by domain experts (Tan et al., 2015), and promulgated inadvertently or by design (Southwell
et al., 2018). In instances where misinformation is spread deliberately, the literature refers to this form of content as disinformation, which is shared in an effort to deceive others (e.g., Fallis, 2009). Although the exact amount of misinformation and disinformation present on social networking sites is unclear, emerging research suggests the overall problem of misinformation in terms of scale has likely declined in recent years due to efforts by platforms such as Facebook to actively remove such content (Allcott et al., 2019). While these efforts represent a step in the right direction, the issue of misinformation will likely remain a looming concern across a range of areas and require energy and resources to combat effectively. In addition, despite the convenience and ease of following news on social media, the majority (57%) of users in the U.S. who consult social media as a news source expect the content they encounter to be largely inaccurate, an issue that represents a top concern for social media news consumers and points toward a reduced level of trust in social media content (Shearer & Matsa, 2018). The challenge in designing effective responses to the misinformation problem on social media and restoring public trust demands a thorough examination of what is currently known and unknown about this issue, and serves as the impetus for this review. While some recent domain-specific reviews (e.g., Wang et al., 2019) and recommendations (e.g., Chou et al., 2018; Merchant & Asch, 2018) related to misinformation and social media have appeared in the literature, the current review seeks to move beyond examining any one specific field to provide a more holistic summary and set of recommendations for research on misinformation and SNSs.
3.3 Methods
3.3.1 Data Collection
Articles for this review were identified by searching EBSCO, ProQuest, JSTOR, and Gale Academic OneFile using keywords relevant to misinformation (e.g., “misinform*”; “disinform*”; “fake news”; “inaccurate”; “incorrect”) combined with terms capturing social media broadly (e.g., “social media”; “social networking site*”) or specific names of popular platforms (e.g., “Facebook”; “Twitter”; “Instagram”; “YouTube”; “Snapchat”; “WhatsApp”; “Reddit”; “Tumblr”; “LinkedIn”). These platforms represent the most commonly used social networking sites among U.S. adults (Shearer & Matsa, 2018). To be retained in the sample, each article had to include in its title at least one word relevant to misinformation and one word referencing social media or social networking sites. This was done in order to retain only articles for which misinformation and social networking sites (SNSs) represented the main thrust of the study. Searches were specified to retrieve only articles published between Facebook’s debut and the writing of this review (January 1, 2004 to December 31, 2019), and produced 2,196 results. When constrained to only include articles from peer-reviewed journals, these searches produced 101 results. Book reviews,
news features, and conference proceedings were also removed. Articles were also excluded if they were written in a language other than English. This resulted in a total of 44 unique articles¹ for which full-text versions were readily accessible for analysis. A coding scheme was created using guidelines from prior research (MacQueen et al., 1998) in order to code for key features of each article included in the sample. Briefly, this coding scheme outlined instructions for coding critical information related to the article (e.g., name of lead author, journal name, article title, year of publication, discipline for which the article was written, and geographical location where research was conducted). The coding scheme also included categories to capture the type (e.g., editorial, conceptual piece, empirical, case study) and primary purpose of the article (e.g., descriptive, inferential, methodological validation, theory proposal). In order to determine the extent to which articles in the sample relied on theory to guide research on misinformation and social networking sites, articles were also coded for whether they mentioned and/or used a theory as part of the research. Aspects of methodology were coded as well, including study design, sample composition (for articles using human subjects), social networking site(s) under observation, type(s) of misinformation examined, main findings of the research, and whether recommendations for resolving the issue of misinformation on social networking sites were offered as part of the study.
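The title-based inclusion rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the procedure actually used for the review: the regular expressions, the function name `title_meets_criteria`, and the three sample titles are assumptions chosen to approximate the keyword lists given in the text.

```python
import re

# Assumed approximations of the review's keyword lists. Wildcard terms such
# as "misinform*" are expressed here as regular-expression prefixes.
MISINFO_PATTERN = re.compile(
    r"\b(misinform\w*|disinform\w*|fake news|inaccurate|incorrect)\b",
    re.IGNORECASE,
)
SNS_PATTERN = re.compile(
    r"\b(social media|social networking sites?|facebook|twitter|instagram|"
    r"youtube|snapchat|whatsapp|reddit|tumblr|linkedin)\b",
    re.IGNORECASE,
)


def title_meets_criteria(title: str) -> bool:
    """Return True when a title has both a misinformation term and an SNS term."""
    has_misinfo_term = MISINFO_PATTERN.search(title) is not None
    has_sns_term = SNS_PATTERN.search(title) is not None
    return has_misinfo_term and has_sns_term


# Hypothetical input: three titles taken from the chapter's reference list.
titles = [
    "Social media and fake news in the 2016 election",
    "Twitter as a vector for disinformation",
    "Social networking sites and our lives",
]
retained = [t for t in titles if title_meets_criteria(t)]
print(retained)
```

Under these assumed patterns, the first two titles are retained and the third is dropped because it contains no misinformation term. A real screening pass would also need the peer-review, document-type, and language filters described above, which depend on database metadata rather than on the title.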
3.4 Results
3.4.1 Article Description
3.4.1.1 Publication Profile
Social networking sites have been in existence since 2004; however, misinformation on these platforms appears to have received attention from scholars only recently. Although search parameters for this review were open to any articles published since 2004, the oldest article in this sample was published in 2010. Since then, the number of articles examining misinformation and social networking sites has consistently increased, with an acceleration in the number of articles in this area occurring during or after the 2016 U.S. Presidential Election (see Fig. 3.1). Despite the relatively recent attention to misinformation on SNSs, articles in this sample were diverse in terms of the geographical location where research was performed. While scholarship from every continent in the world was represented by at least one article, the majority of studies were conducted in the U.S. (59.1%). In addition, most studies had more than one author (70.5%); of these articles, however, most were collaborations among researchers from within the same discipline (70.5%). The fields most frequently represented by articles in this sample were Communication (20.5%), Medicine (13.6%), Business (13.6%), and Public Health (9.1%). Figure 3.2 displays the complete breakdown of disciplines present in the sample.
[Figure 3.1: bar chart; y-axis: articles published (%), 0 to 40; x-axis: year of publication, 2010 and 2012 to 2019.]
Fig. 3.1 Proportion of articles published by year (N = 44)
¹ See Appendix.
3.4.1.2 Research Design
Most articles were empirical in nature (63.6%), while the remaining studies were classified as conceptual pieces, case studies, or editorials (36.4%). The primary purpose of most articles was to describe (34.1%) or make inferences (34.1%) about misinformation on SNSs, while a smaller number of studies were dedicated to proposing new theoretical frameworks and/or methodologies, or serving as calls to action for additional research (31.8%). With regard to the use of theory, fewer than half of studies explicitly referenced a theory (40.9%). Of these studies, most went beyond merely mentioning a theory (55.6%) and used theory as a guiding framework, or integrated theory into research through theory application or testing. Sample composition also varied across studies. Most studies did not rely on human subjects research (59.1%); among those that did, the most commonly examined study population was the general adult population (50%), followed by adolescents (16.7%), students (16.7%), general SNS users (11.1%), and mixed samples (11.1%). Studies included in this review tended to discuss SNSs in general terms or explore more than one SNS platform (45.5%), with the most common combination consisting of Facebook and Twitter. For studies focused on one platform, Twitter (25%) and Facebook (22.7%) were most frequently featured as the central SNS under exploration. Only a small portion (9%) elaborated on other SNS platforms (e.g., YouTube, WhatsApp). Type of misinformation also differed across studies. Close to half the sample (45.5%) examined both misinformation and disinformation, while just under a third focused solely on disinformation (29.5%). The remaining quarter of studies looked only at misinformation (25%).
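Because several of these percentages are nested (a share of a subset rather than of all 44 articles), it can help to convert them back to approximate article counts. The sketch below is illustrative only: `count_from_percent` is a hypothetical helper, and the counts it produces are inferred from the rounded percentages in the text rather than reported by the review.

```python
N_ARTICLES = 44  # sample size reported in the review


def count_from_percent(percent: float, base: int) -> int:
    """Recover the nearest whole-article count behind a rounded percentage."""
    return round(percent / 100 * base)


# Percentages of the full sample.
empirical = count_from_percent(63.6, N_ARTICLES)
referenced_theory = count_from_percent(40.9, N_ARTICLES)

# 55.6% is a share of the theory-referencing subset, not of all 44 articles.
used_theory = count_from_percent(55.6, referenced_theory)

print(empirical, referenced_theory, used_theory)  # prints: 28 18 10
```

Read this way, roughly 10 of the 44 articles used theory as a guiding framework or applied or tested it, which is a smaller share of the whole sample than the 55.6% figure might suggest at first glance.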
[Figure 3.2: pie chart, "Articles by Field (%)". Fields in legend: Business; Business/Economics; Communication; Computer Science; Criminal Justice; Economics; Education; Engineering; International Affairs; International Relations; Journalism; Law; Library Sciences; Medicine; Political Science; Public Health; Public Policy; Statistics/Mathematics.]
Fig. 3.2 Proportion of articles by field (N = 44)
3.4.2 Thematic Patterns
A thematic analysis was performed in order to uncover major themes of studies exploring misinformation and SNSs. This analysis was informed by recommendations outlined by Braun and Clarke (2006). Categories of themes were identified, examined for similarities, and then refined through an iterative process, resulting in the identification of four non-mutually exclusive themes. The first theme considered the critical role of trust (or mistrust) in informational content and/or sources as a factor in understanding the problem of misinformation and potential solutions to this issue. Of the 44 articles, 29.5% examined some aspect of trust, with a number of studies examining the role of trust in understanding or addressing the issue of misinformation (e.g., Bode & Vraga, 2015; Marchi, 2012; Sullivan, 2019; Talwar et al., 2019).
A second theme revealed an emphasis on the development of tools or strategies to aid in the detection of misinformation. Within this sample, 20.5% of articles described an approach or tool for detecting misinformation. Of these studies, the majority (66.7%) offered tools or strategies aimed at organizations and at improving their efforts to detect misinformation and/or design content or interventions to combat the misinformation problem. For example, Ozbay and Alatas (2019) proposed an approach for fake news detection on social media that could serve as a framework for designing tools that rapidly detect misinformation for organizations, while other studies such as Wang and Zhuang (2018) focused on detecting the spread of misinformation among users, with potential to aid organizations in rumor-debunking interventions. Fewer than half (44.4%) focused on vulnerabilities of SNS users and strategies aimed directly at them. One example by Fielden et al. (2018) discussed how users validated information they found on SNS platforms, pointing to a need for users to become more critical consumers of the information they encounter and of the sources from which it originates.
The third theme described approaches for correcting misinformation once detected. Among the articles reviewed here, 38.6% displayed characteristics of this theme. A large portion of these articles focused on corrective action that could be taken at the organizational level (88.2%). For example, Vraga and Bode (2017) examined corrective statements that organizations such as the Centers for Disease Control and Prevention could deliver in order to combat health-related misperceptions. Other studies, such as an article by Kraski (2018) comparing German and U.S. legal approaches to addressing misinformation on SNSs, emphasized societal-level action through policy change that could be enacted in order to correct misinformation. Corrective action at the individual level (35.3%) was also explored by articles in this sample. Studies like Lutzke et al. (2019) examined user factors that could help combat misinformation, such as enhancing critical thinking skills of individuals who heavily engage with SNSs as sources of news and educational content.
The last theme involved the presence of recommendations to fix the misinformation problem on SNSs based on the analysis of misinformation discussed in the article. Most articles provided at least one specific recommendation (79.5%). When offered, recommendations to address this problem often took the form of proposing regulation of SNSs (through government intervention or self-regulatory measures on the part of SNSs) and their content (e.g., Allcott & Gentzkow, 2017; Andorfer, 2017), and/or launching interventions targeting SNS users (e.g., Chen et al., 2015; Lutzke et al., 2019; Wang et al., 2019).
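Because the themes and their sub-categories are not mutually exclusive, the sub-category percentages can add up to more than 100%. A short inclusion-exclusion calculation shows how many articles must belong to both sub-categories at once. This is a sketch under stated assumptions: the helper names are hypothetical, and the article counts are inferred from the rounded percentages above rather than reported in the review.

```python
N_ARTICLES = 44  # sample size reported in the review


def count_from_percent(percent: float, base: int) -> int:
    """Recover the nearest whole-article count behind a rounded percentage."""
    return round(percent / 100 * base)


def minimum_overlap(count_a: int, count_b: int, total: int) -> int:
    """Smallest number of items that must fall in both categories."""
    return max(0, count_a + count_b - total)


# Detection theme: 20.5% of the sample, split by intended audience.
detection = count_from_percent(20.5, N_ARTICLES)
detection_org = count_from_percent(66.7, detection)
detection_user = count_from_percent(44.4, detection)

# Correction theme: 38.6% of the sample, split by level of corrective action.
correction = count_from_percent(38.6, N_ARTICLES)
correction_org = count_from_percent(88.2, correction)
correction_ind = count_from_percent(35.3, correction)

detection_overlap = minimum_overlap(detection_org, detection_user, detection)
correction_overlap = minimum_overlap(correction_org, correction_ind, correction)
print(detection_overlap, correction_overlap)  # prints: 1 4
```

On these inferred counts, at least one detection article addressed both organizations and users, and at least four correction articles discussed both organizational and individual action.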
3.5 Discussion and Directions for Future Research
3.5.1 Recommendations
In recent years, emerging research on misinformation and SNSs has attracted scholarly attention from researchers spanning an array of disciplines, including but not limited to political science, journalism, communication, medicine, law, international relations, public health, business, criminal justice, library science, education, statistics/mathematics, economics, engineering, and computer science. This review suggests that misinformation on SNSs represents an issue facing many fields without a clear or easy solution. The discussion that follows provides four recommendations derived from the present review: (1) performance of additional research on platforms other than Facebook or Twitter; (2) clarification of conceptualizations of misinformation and increased consistency in usage of terms; (3) greater integration of theory for enhancing understanding of how misinformation spreads and how to best correct misinformation once it proliferates on SNSs; and (4) promotion of interdisciplinary collaborations among researchers investigating misinformation and SNSs.
3.5.1.1 Advance Research on Misinformation and SNSs Beyond Facebook and Twitter
This review shows that Facebook and Twitter currently represent the most studied platforms in the area of misinformation and SNSs. As the most commonly used SNS among U.S. adults (Shearer & Matsa, 2018), it stands to reason that Facebook is among the more studied platforms in this body of work. Similarly, although Twitter is not as popular as Facebook, the platform currently remains one of the more accessible sources of web-based data available to scholars (McCormick et al., 2017). The focus on Facebook and Twitter, however, raises potential questions of generalizability, especially when one considers the demographics characterizing use of other SNSs (Hampton et al., 2011; Rains & Brunner, 2015). Users of different platforms may differ in meaningful ways across factors such as age, race/ethnicity, and education (Duggan & Brenner, 2013), and current research may not adequately capture the full picture of misinformation on SNSs when two platforms dominate the research. More research that advances the literature beyond Facebook and Twitter is needed in order to more thoroughly understand the effects of misinformation and how it spreads on SNSs. Studies that examine platforms more commonly used by younger generations (e.g., Millennials, Generation Z) such as YouTube, Instagram, and Snapchat (Anderson & Jiang, 2018) will become increasingly important for answering lingering questions regarding the perceived trustworthiness of SNSs as news sources and their role in propagating disinformation or fake news.
3.5.1.2 Clarify Conceptualizations and Increase Consistency in Usage of Terms
The current review also reveals that terms used to study misinformation require greater clarification. While a number of definitions for misinformation have been provided (e.g., Giglietto et al., 2016), certain conceptualizations may be more applicable to specific types of research questions. For example, studies investigating questions of science communication such as Vraga and Bode (2017) may be more aptly guided by a conceptualization emphasizing “cases in which people’s beliefs about factual matters are not supported by clear evidence and expert opinion” (Nyhan & Reifler, 2010, p. 305). Defining misinformation in this manner can help illuminate information gaps between expert and lay audiences that may be ripe for the emergence and spread of scientific misinformation. However, studies examining misinformation in the context of news may need to consider arriving at some consensus on how best to conceptualize “fake news” (Wang et al., 2019), and how it fits with existing understandings of misinformation and disinformation. Current definitions provide an assortment of characterizations of “fake news”, ranging from content devoid of factual information but relayed as fact (e.g., Allcott & Gentzkow, 2017) to content combining both factual and false information (Benkler et al., 2017). Other research by Tandoc et al. (2018) has identified the presence of six definitions characterizing “fake news” as content containing deception, manipulation, advertising, propaganda, and news-related satire and parody. Given that the term “fake news” was named word of the year by Collins Dictionary as a result of its record usage (Roy et al., 2018) and that no cohesive definition of this term exists among scholars (Al-Rawi, 2019), it is imperative that the literature moving forward strive to reach a level of agreement on how to define this concept in relation to misinformation research.
Future research should seek a higher level of clarification that would be helpful for building theory and uncovering mechanisms explaining effects of misinformation on SNSs.
3.5.1.3 Integrate Contemporary Theory into More Research on Misinformation and SNSs
The majority of articles included in this review did not reference an existing theory in the context of their research, suggesting an atheoretical approach to the study of misinformation on SNSs. This is problematic because it undermines the development of a systematic approach to explaining or predicting effects of misinformation on SNSs. In addition, among articles that did make reference to at least one theory, nearly half (44.4%) merely mentioned the theory without connecting it directly to their research. Theories that were mentioned drew from a diverse array of disciplines, including but not limited to psychology [e.g., cognitive dissonance theory (Festinger, 1962), social comparison theory (Festinger, 1954), social judgement theory (Sherif et al., 1965), self-determination theory (Deci & Ryan, 1980), rational choice theory (Homans, 1961)], communication [e.g., diffusion of innovations (Rogers, 1962), inoculation theory (McGuire, 1961), uses and gratification theory (Katz et al., 1973), self-actualizing-dutiful citizen model (Bennett, 2007)], sociology [e.g., homophily (Lazarsfeld & Merton, 1954)], and statistics [e.g., extreme value theory (Fisher & Tippett, 1928)]. Many of the theories referenced and/or used in current articles were developed prior to the advent of SNSs, and while potentially relevant for foundational purposes, such theories may not be adequate for fully explicating how the unique properties of SNSs contribute to the issue of misinformation when compared to newer theories. For example, more recently developed theories such as the M3D multilevel model of meme diffusion (Spitzberg, 2014) may have more to say about how misinformation becomes viral on SNSs, as this model was constructed with SNSs taken into account. Future research should consult theories contemporary to the rise of SNSs, or seek to extend classical theories to take into account the current information environment.
3.5.1.4 Foster Interdisciplinary Research on Misinformation and SNSs
Results from this review also show that although misinformation on SNSs represents a problem spanning several fields, researchers in this area are largely working within disciplinary silos. This lack of interdisciplinary collaboration represents lost opportunities to identify patterns that cut across fields, cross-pollinate theories and methods, and build a more comprehensive and integrated body of literature on misinformation and SNSs (Wang et al., 2019). Echoing Chou et al.’s (2018) call for greater interdisciplinary collaborations in the context of health-related misinformation on social media, this review calls for expanding this recommendation to research on other forms of misinformation (e.g., news, political, science). As with health-related misinformation (Chou et al., 2018), future research should engage scholars from across disciplines in order to effectively determine prevalence, proliferation, and effects of other types of misinformation on SNSs, as well as any interventions designed to combat it.
3.5.2 Limitations
Some limitations of this review are worthy of discussion. First, although search terms used to collect articles for this review included broad terms such as “social media” and “social networking site”, only the most commonly used platforms were searched by name. As a result, articles examining misinformation on less popular SNSs may not have appeared in search results. Similarly, studies investigating misinformation and SNSs that did not include search terms in their titles were not included in this review. It is unclear whether studies with a focus other than misinformation and SNSs may have provided additional insight. Second, this review is limited to publications written in English about SNSs with English-speaking users. Results may not generalize to platforms with user bases comprised of non-English speaking individuals, nor would they likely represent studies in journals published in other languages. Last, policies and features for content management on SNSs remain in a state of continuous flux, so the significance and relevance of the findings of this review may be subject to change.
Appendix: 44 Reviewed Articles
1. Allcott, H., & Gentzkow, M. (2017). Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31, 211–236.
2. Al-Rawi, A. (2019). Gatekeeping fake news discourses on mainstream media versus social media. Social Science Computer Review, 37, 687–704.
3. Andorfer, A. (2017). Spreading like wildfire: Solutions for abating the fake news problem on social media via technology controls and government regulation notes. Hastings Law Journal, 69, 1409–1432.
4. Arianto, R., Warnars, H. L. H. S., Abdurachman, E., Heryadi, Y., & Gaol, F. L. (2019). The architecture social media and online newspaper credibility measurement for fake news detection. Telkomnika, 17, 738–744.
5. Barfar, A. (2019). Cognitive and affective responses to political disinformation in Facebook. Computers in Human Behavior, 101, 173–179.
6. Bessi, A. (2017). On the statistical properties of viral misinformation in online social media. Physica A: Statistical Mechanics and its Applications, 469, 459–470.
7. Bode, L., & Vraga, E. K. (2015). In related news, that was wrong: The correction of misinformation through related stories functionality in social media. Journal of Communication, 65, 619–638.
8. Borges, P. M., & Gambarato, R. R. (2019). The role of beliefs and behavior on Facebook: A semiotic approach to algorithms, fake news, and transmedia journalism.
9. Bradshaw, S., & Howard, P. N. (2018). The global organization of social media disinformation campaigns. Journal of International Affairs, 71, 23–32.
10. Brummette, J., DiStaso, M., Vafeiadis, M., & Messner, M. (2018). Read all about it: The politicization of “fake news” on Twitter. Journalism & Mass Communication Quarterly, 95, 497–517.
11. Chamberlain, P. R. (2010). Twitter as a vector for disinformation. Journal of Information Warfare, 9, 11–17.
12. Chen, X., Sin, S.-C. J., Theng, Y.-L., & Lee, C. S. (2015). Why students share misinformation on social media: Motivation, gender, and study-level differences. The Journal of Academic Librarianship, 41, 583–592.
13. Chou, W.-Y. S., Oh, A., & Klein, W. M. P. (2018). Addressing health-related misinformation on social media. JAMA, 320, 2417–2418.
14. Colliander, J. (2019). “This is fake news”: Investigating the role of conformity to other users’ views when commenting on and spreading disinformation in social media. Computers in Human Behavior, 97, 202–215.
15. Densley, J., Dexter, K., & Eckberg, D. A. (2018). When legend becomes fact, tweet the legend. Journal of Behavioral and Social Sciences, 5, 148–156.
16. Fielden, A., Grupač, M., & Adamko, P. (2018). How users validate the information they encounter on digital content platforms: The production and proliferation of fake social media news, the likelihood of consumer exposure, and online deceptions. Geopolitics, History, and International Relations, 10, 51–57.
17. Frish, Y., & Greenbaum, D. (2017). Is social media a cesspool of misinformation? Clearing a path for patient-friendly safe spaces online. The American Journal of Bioethics, 17, 19–21.
18. Fung, I. C.-H., Fu, K.-W., Chan, C.-H., Chan, B. S. B., Cheung, C.-N., Abraham, T., & Tse, Z. T. H. (2016). Social media’s initial reaction to information and misinformation on Ebola, August 2014: Facts and Rumors. Public Health Reports, 131, 461–473.
19. Gerbaudo, P. (2018). Fake news and all-too-real emotions: Surveying the social media battlefield. Brown Journal of World Affairs, 25, 85.
20. Grinberg, N., Joseph, K., Friedland, L., Swire-Thompson, B., & Lazer, D. (2019). Fake news on Twitter during the 2016 U.S. presidential election. Science, 363, 374–378.
21. Guarda, R. F., Ohlson, M. P., & Romanini, A. V. (2018). Disinformation, dystopia and post-reality in social media: A semiotic-cognitive perspective. Education and Information Technologies, 34, 185–197.
22. Ho, A., McGrath, C., & Mattheos, N. (2017). Social media patient testimonials in implant dentistry: Information or misinformation? Clinical Oral Implants Research, 28, 791–800.
23. Jones, M. O. (2019). The Gulf information war| propaganda, fake news, and fake trends: The weaponization of Twitter Bots in the Gulf Crisis. International Journal of Communication, 13, 27.
24. Kim, A., & Dennis, A. R. (2019). Says who? The effects of presentation format and source rating on fake news in social media. MIS Quarterly, 43, 1025–1039.
25. Kraski, R. (2018). Combating fake news in social media: U.S. and German legal approaches. St. John’s Law Review, 91.
26. Loeb, S., Sengupta, S., Butaney, M., et al. (2019). Dissemination of misinformative and biased information about prostate cancer on YouTube. European Urology, 75, 564–567.
27. Lutzke, L., Drummond, C., Slovic, P., & Árvai, J. (2019). Priming critical thinking: Simple interventions limit the influence of fake news about climate change on Facebook. Global Environmental Change, 58, 101964.
28. Marchi, R. (2012). With Facebook, blogs, and fake news, teens reject journalistic “objectivity.” Journal of Communication Inquiry, 36, 246–262.
29. Merchant, R. M., & Asch, D. A. (2018). Protecting the value of medical science in the age of social media and “fake news.” JAMA, 320, 2415–2416.
30. Nunan, D., & Yenicioglu, B. (2013). Informed, uninformed and participative consent in social media research. International Journal of Research in Marketing, 55, 791–808.
31. Ortiz-Martínez, Y., & Jiménez-Arcia, L. F. (2017). Yellow fever outbreaks and Twitter: Rumors and misinformation. American Journal of Infection Control, 45, 816–817.
32. Oyeyemi, S. O., Gabarron, E., & Wynn, R. (2014). Ebola, Twitter, and misinformation: A dangerous combination? BMJ, 349, g6178.
33. Ozbay, F. A., & Alatas, B. (2019). A novel approach for detection of fake news on social media using metaheuristic optimization algorithms. Elektronika ir Elektrotechnika, 25, 62–67.
34. Peña, A. M. (2016). Misinformed users: Improving informed decision-making on social media. Transplant International, 29, 35.
35. Shin, J., Jian, L., Driscoll, K., & Bar, F. (2018). The diffusion of misinformation on social media: Temporal pattern, message, and source. Computers in Human Behavior, 83, 278–287.
36. Steffens, M. S., Dunn, A. G., Wiley, K. E., & Leask, J. (2019). How organisations promoting vaccination respond to misinformation on social media: A qualitative investigation. BMC Public Health, 19, 1348.
37. Sullivan, M. C. (2019). Leveraging library trust to combat misinformation on social media. Library & Information Science Research, 41, 2–10.
38. Talwar, S., Dhir, A., Kaur, P., Zafar, N., & Alrasheedy, M. (2019). Why do people share fake news? Associations between the dark side of social media use and fake news sharing behavior. Journal of Retailing and Consumer Services, 51, 72–82.
39. Trethewey, S. P. (2018). ‘Cough CPR’: Misinformation perpetuated by social media. Resuscitation, 133, e7–e8.
40. Vraga, E. K., & Bode, L. (2017). Using expert sources to correct health misinformation in social media. Science Communication, 39, 621–645.
41. Vraga, E. K., Kim, S. C., & Cook, J. (2019). Testing logic-based and humor-based corrections for science, health, and political misinformation on social media. Journal of Broadcasting & Electronic Media, 63, 393–414.
42. Wang, B., & Zhuang, J. (2018). Rumor response, debunking response, and decision makings of misinformed Twitter users during disasters. Natural Hazards, 93, 1145–1162.
43. Wang, Y., McKee, M., Torbica, A., & Stuckler, D. (2019). Systematic literature review on the spread of health-related misinformation on social media. Social Science & Medicine, 240, 112552.
44. Walters, R. M. (2018). How to tell a fake: Fighting back against fake news on the front lines of social media. Texas Review of Law and Politics, 23, 111–180.
References
Allcott, H., & Gentzkow, M. (2017). Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31, 211–236.
Allcott, H., Gentzkow, M., & Yu, C. (2019). Trends in the diffusion of misinformation on social media. Research & Politics, 6, 2053168019848554.
Al-Rawi, A. (2019). Gatekeeping fake news discourses on mainstream media versus social media. Social Science Computer Review, 37, 687–704.
Anderson, M., & Jiang, J. (2018). Teens, social media & technology 2018. Pew Research Center. Retrieved April 17, 2021, from https://www.pewresearch.org/internet/2018/05/31/teens-social-media-technology-2018/
Andorfer, A. (2017). Spreading like wildfire: Solutions for abating the fake news problem on social media via technology controls and government regulation notes. Hastings Law Journal, 69, 1409–1432.
Benkler, Y., Faris, R., Roberts, H., & Zuckerman, E. (2017). Study: Breitbart-led right-wing media ecosystem altered broader media agenda. Columbia Journalism Review, 3, 2017.
Bennett, W. L. (2007). Civic learning in changing democracies: Challenges for citizenship and civic education. In P. Dahlgren (Ed.), Young citizens and new media (pp. 59–77). Routledge.
Bessi, A., Coletto, M., Davidescu, G. A., Scala, A., Caldarelli, G., & Quattrociocchi, W. (2015). Science vs. conspiracy: Collective narratives in the age of misinformation. PLOS ONE, 10, e0118093.
Bode, L., & Vraga, E. K. (2015). In related news, that was wrong: The correction of misinformation through related stories functionality in social media. Journal of Communication, 65, 619–638.
Boyd, D. M., & Ellison, N. B. (2007). Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13, 210–230.
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research in Psychology, 3, 77–101.
Broniatowski, D. A., Jamison, A. M., Qi, S., AlKulaib, L., Chen, T., Benton, A., Quinn, S. C., & Dredze, M. (2018). Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate. American Journal of Public Health, 108, 1378–1384.
Bucher, T., & Helmond, A. (2017). The affordances of social media platforms. In A. Marwick & T. Poell (Eds.), The SAGE handbook of social media (pp. 233–253). Sage.
Chen, X., Sin, S.-C. J., Theng, Y.-L., & Lee, C. S. (2015). Why students share misinformation on social media: Motivation, gender, and study-level differences. The Journal of Academic Librarianship, 41, 583–592.
Chou, W.-Y. S., Oh, A., & Klein, W. M. P. (2018). Addressing health-related misinformation on social media. JAMA, 320, 2417–2418.
Deci, E. L., & Ryan, R. M. (1980). Self-determination theory: When mind mediates behavior. Journal of Mind and Behavior, 1, 33–43.
Duggan, M., & Brenner, J. (2013). The demographics of social media users, 2012. Pew Research Center’s Internet & American Life Project.
Fallis, D. (2009). A conceptual analysis of disinformation.
Festinger, L. (1954). A theory of social comparison processes. Human Relations, 7, 117–140.
Festinger, L. (1962). A theory of cognitive dissonance. Stanford University Press.
Fielden, A., Grupač, M., & Adamko, P. (2018). How users validate the information they encounter on digital content platforms: The production and proliferation of fake social media news, the likelihood of consumer exposure, and online deceptions. Geopolitics, History, and International Relations, 10, 51–57.
Fisher, R. A., & Tippett, L. H. C. (1928). Limiting forms of the frequency distribution of the largest or smallest member of a sample. Mathematical Proceedings of the Cambridge Philosophical Society, 24, 180–190.
Giglietto, F., Iannelli, L., Rossi, L., & Valeriani, A. (2016). Fakes, news and the election: A new taxonomy for the study of misleading information within the hybrid media system. Social Science Research Network.
Gil de Zúñiga, H., Weeks, B., & Ardèvol-Abreu, A. (2017). Effects of the news-finds-me perception in communication: Social media use implications for news seeking and learning about politics. Journal of Computer-Mediated Communication, 22, 105–123.
Hampton, K. N., Goulet, L. S., Rainie, L., & Purcell, K. (2011). Social networking sites and our lives. Pew Internet & American Life Project.
Homans, G. (1961). Social behavior: Its elementary forms. Routledge and Kegan Paul.
Katz, E., Blumler, J. G., & Gurevitch, M. (1973). Uses and gratifications research. Public Opinion Quarterly, 37, 509–523.
Kraski, R. (2018). Combating fake news in social media: U.S. and German legal approaches. St. John’s Law Review, 91.
Lazarsfeld, P. F., & Merton, R. K. (1954). Friendship as a social process: A substantive and methodological analysis. In M. Berger (Ed.), Freedom and control in modern society (pp. 18–66). Van Nostrand.
Lee, C. S., & Ma, L. (2012). News sharing in social media: The effect of gratifications and prior experience. Computers in Human Behavior, 28, 331–339.
Lewandowsky, S., Ecker, U. K. H., Seifert, C. M., Schwarz, N., & Cook, J. (2012). Misinformation and its correction: Continued influence and successful debiasing. Psychological Science in the Public Interest, 13, 106–131.
Loeb, S., Sengupta, S., Butaney, M., et al. (2019). Dissemination of misinformative and biased information about prostate cancer on YouTube. European Urology, 75, 564–567.
Lutzke, L., Drummond, C., Slovic, P., & Árvai, J. (2019). Priming critical thinking: Simple interventions limit the influence of fake news about climate change on Facebook. Global Environmental Change, 58, 101964.
MacQueen, K. M., McLellan, E., Kay, K., & Milstein, B. (1998). Codebook development for team-based qualitative analysis. CAM Journal, 10, 31–36.
Marchi, R. (2012). With Facebook, blogs, and fake news, teens reject journalistic “objectivity.” Journal of Communication Inquiry, 36, 246–262.
McCormick, T.
H., Lee, H., Cesare, N., Shojaie, A., & Spiro, E. S. (2017). Using Twitter for demographic and social science research: Tools for data collection and processing. Sociological Methods & Research, 46, 390–421. McGuire, W. J. (1961). Resistance to persuasion conferred by active and passive prior refutation of the same and alternative counterarguments. Journal of Abnormal Psychology, 63, 326–332. Merchant, R. M., & Asch, D. A. (2018). Protecting the value of medical science in the age of social media and “fake news.” JAMA, 320, 2415–2416. Nahon, K., & Hemsley, J. (2013). Going viral. Polity Press. Nyhan, B., & Reifler, J. (2010). When corrections fail: The persistence of political misperceptions. Political Behavior, 32, 303–330. Ozbay, F. A., & Alatas, B. (2019). A novel approach for detection of fake news on social media using metaheuristic optimization algorithms. Elektron Ir Elektrotechnika, 25, 62–67. Pew Research Center. (2019). Social media fact sheet. Pew Research Center Internet Science & Technology. Retrieved December 31, 2019, from https://www.pewresearch.org/internet/fact-sheet/soc ial-media/ Rains, S. A., & Brunner, S. R. (2015). What can we learn about social network sites by studying Facebook? A call and recommendations for research on social network sites. New Media & Society, 17, 114–131. Rogers, E. M. (1962). Diffusion of innovations. Free Press. Roy, A., Basak, K., Ekbal, A., & Bhattacharyya, P. (2018). A deep ensemble framework for fake news detection and classification. ArXiv181104670 Cs Shao, C., Hui, P.-M., Wang, L., Jiang, X., Flammini, A., Menczer, F., & Ciampaglia, G. L. (2018). Anatomy of an online misinformation network. PLOS ONE, 13, e0196087. Sharma, M., Yadav, K., Yadav, N., & Ferdinand, K. C. (2017). Zika virus pandemic—Analysis of Facebook as a social media health information platform. American Journal of Infection Control, 45, 301–302.
46
L. S. Martinez
Shearer, E., & Matsa, K. E. (2018). News use across social media platforms 2018. Pew Research Center’s Journal Project. Retrieved December 31, 2019, from https://www.journalism.org/2018/ 09/10/news-use-across-social-media-platforms-2018/ Sherif, C. W., Sherif, M., & Nebergall, R. E. (1965). Attitude and attitude change: The social judgment-involvement approach. Saunders Philadelphia. Southwell, B. G., Thorson, E. A., & Sheble, L. (2018). Misinformation among mass audiences as a focus for inquiry. In B. G. Southwell, E. A. Thorson, & L. Sheble (Eds.), Misinformation mass audiences (pp. 1–14). University of Texas Press. Spitzberg, B. H. (2014). Toward a model of meme diffusion (M3D). Communication Theory, 24, 311–339. Sullivan, M. C. (2019). Leveraging library trust to combat misinformation on social media. Library & Information Science Research, 41, 2–10. Talwar, S., Dhir, A., Kaur, P., Zafar, N., & Alrasheedy, M. (2019). Why do people share fake news? Associations between the dark side of social media use and fake news sharing behavior. Journal of Retailing and Consumer Services, 51, 72–82. Tan, A. S. L., Lee, C., & Chae, J. (2015). Exposure to health (mis)information: Lagged effects on young adults’ health behaviors and potential pathways. Journal of Communication, 65, 674–698. Tandoc, E. C., Lim, Z. W., & Ling, R. (2018). Defining “fake news.” Digit Journal, 6, 137–153. Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news online. Science, 359, 1146–1151. Vraga, E. K., & Bode, L. (2017). Using expert sources to correct health misinformation in social media. Science Communication, 39, 621–645. Wang, Y., McKee, M., Torbica, A., & Stuckler, D. (2019). Systematic literature review on the spread of health-related misinformation on social media. Social Science & Medicine, 240, 112552. Wang, B., & Zhuang, J. (2018). Rumor response, debunking response, and decision makings of misinformed Twitter users during disasters. 
Natural Hazards, 93, 1145–1162. Whiting, A., & Williams, D. (2013). Why people use social media: A uses and gratifications approach. Qualitative Market Research International Journal, 16, 362–369.
Chapter 4
Research Trends in Social Media/Big Data with the Emphasis on Data Collection and Data Management: A Bibliometric Analysis

Qiong Peng and Xinyue Ye

Q. Peng, School of Architecture, Planning and Preservation, University of Maryland, College Park, MD, USA
X. Ye (corresponding author), Department of Landscape Architecture and Urban Planning, Texas A&M University, College Station, TX, USA; e-mail: [email protected]

© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_4

4.1 Introduction

Advances in social sensing and data acquisition technologies have led to an enormous amount of human dynamics data. Social media data, in particular, have been used extensively in human dynamics research (Shaw et al., 2016). Analysis of social media data can be especially helpful in detecting anomalous events such as panic resulting from natural disasters (Wang & Ye, 2018; Wang et al., 2019) and pandemics (Depoux et al., 2020), and in studying the interaction among cities and regions (Gong et al., 2020; Ye et al., 2021). Social media data can also be used to study shopping behavior (Ye et al., 2020a), travel recommendations (Bao et al., 2012), political activism (Thorson et al., 2013), and business intelligence (Karamshuk et al., 2013). However, relying on these data has potential drawbacks, such as unrepresentative samples, rumors, and location spoofing (Ye et al., 2020b).

On the technical side of data collection and data management, text messages, photos, and videos posted on social media platforms can be collected and managed using several techniques. Furthermore, social media content may be associated with location information, such as through a check-in function or geo-tagging; this information can offer a solid foundation for geographical analysis. Still, some strategic decisions need to be made before data collection methods are established, such as the period of data collection and the search criteria for collecting data (i.e., based on lists of user accounts or filtered by topics and corresponding hashtags) (Mayr & Weller, 2017).

To access social media data, researchers can fetch the data with tools such as the Python framework Scrapy. In addition, there are coding-free web tools that enable data collection and visualization, such as the Social Data Analytics Tool (SODATO) and Netlytic (Hussain & Vatrapu, 2014). Developed by the Copenhagen Business School, SODATO can fetch data from Facebook and Twitter. Those interested in utilizing this tool may contact the research group, but public access to SODATO is currently limited. Netlytic can capture data from YouTube and Twitter, and the data are geo-coded.

Efficient data management is also critical to handling large quantities of social media data. The number of established data management techniques is increasing, and the techniques vary with data structure. For example, Zheng et al. (2014) conclude that there are three common data structures: stream, trajectory, and graph data. For each data structure, they introduce corresponding data management techniques, including data reduction techniques for trajectories,¹ noise filtering techniques for trajectories, techniques for indexing and querying trajectories, techniques for managing spatiotemporal graphs, and hybrid indexing structures that can organize different data sources.

Social data collection and data management research is a multidisciplinary field that covers a wide range of subjects, including information science, computer science, and geography. It is necessary to identify the cutting-edge trends in this research. Although there is increasing interest in big data and social media data, particularly in data collection and data management, there is still limited research on the "big picture" of social media data collection and data management. In addition, human dynamics research spans multiple disciplines of study.
Hence, it is necessary to examine global development and research trends comprehensively when discussing social media data collection and data management. A systematic review of social media data collection and management techniques could help readers gain a better understanding of research achievements, research directions, and the development of research methods.

Bibliometric analysis incorporates both visual and quantitative analytics to summarize trends in selected research fields (Garfield, 1970; Pritchard, 1969). This type of analysis can reveal temporal dynamics of scholarly outputs, spatial and institutional distributions of publications, scientific collaborations, and major research trends (Li et al., 2017). Furthermore, bibliometric network analysis, such as co-word analysis (Ding et al., 2001), co-citation analysis (He & Cheung Hui, 2002), co-authorship analysis (Glänzel et al., 2005), and co-publication analysis (Schmoch & Schubert, 2007), can be conducted to shed light on relationships between keywords, as well as other identifiers such as country, research institute, and author.

In this study, we used a bibliometric method to examine global research trends in big data and social media data research over the last decade, with an emphasis on data collection and data management. The purposes of this study are to (1) evaluate research performance by country, institute, journal, subject category, and keywords; and (2) briefly identify future research directions in big data and social media data research with an emphasis on data collection and data management.

¹ A spatial trajectory is a trace of a moving vehicle or individual in geographical space.
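The collection decisions mentioned in this introduction (a collection period, plus search criteria based on lists of user accounts or on topic hashtags) can be illustrated with a small filter. This is only a sketch under stated assumptions: the `Post` record, its field names, and the sample posts are hypothetical and do not correspond to any platform's API or to the tools named above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Post:
    """A collected post; field names are illustrative, not a platform schema."""
    account: str
    text: str
    posted_on: date


def extract_hashtags(text):
    """Return the lowercase hashtags in a post, ignoring trailing punctuation."""
    return {
        token.rstrip(".,!?;:").lower()
        for token in text.split()
        if token.startswith("#")
    }


def matches_criteria(post, start, end, accounts=frozenset(), hashtags=frozenset()):
    """Keep a post dated within [start, end] that comes from a listed account
    or carries at least one listed hashtag."""
    if not (start <= post.posted_on <= end):
        return False
    from_listed_account = post.account in accounts
    on_listed_topic = bool(extract_hashtags(post.text) & set(hashtags))
    return from_listed_account or on_listed_topic


# Hypothetical sample data.
posts = [
    Post("city_alerts", "Road closed downtown #flood", date(2019, 3, 2)),
    Post("someone", "Lunch time!", date(2019, 3, 2)),
    Post("someone", "Stay safe everyone #Flood.", date(2019, 3, 5)),
    Post("city_alerts", "Old notice", date(2018, 1, 1)),
]

selected = [
    p for p in posts
    if matches_criteria(
        p,
        start=date(2019, 1, 1),
        end=date(2019, 12, 31),
        accounts={"city_alerts"},
        hashtags={"#flood"},
    )
]
print(len(selected), "of", len(posts), "posts kept")
```

Fixing the period and the criteria in one function makes the sampling decision explicit and repeatable, which matters because the choice of accounts and hashtags shapes how representative the resulting sample is.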
4.2 Methodology, Data Collection, and Analysis

4.2.1 Applications

Bibliometric analysis was carried out to evaluate the characteristics and trends in big data and social media data research. Bibliometric analysis was introduced by Pritchard (1969) as a mathematical and statistical approach to analyzing pertinent literature and understanding the global research trends in a specific area. It has been applied to environmental engineering and science, soil science, ecology, food safety, new energy utilization, and other fields. The bibliometric indicators analyzed in this study include publication number, subject categories, source journals, countries, institutions, and keywords.
4.2.2 Data Collection

The dataset was derived from the Science Citation Index (SCI) and Social Science Citation Index (SSCI) databases of the Web of Science. The following query was used to search all archived documents: Topic = (“social media*” OR “big data*”) AND (“data collection*” OR “data management*”). Publications that contain these keywords or their variants (matched by *) in their titles, abstracts, or keyword lists were included. The following information was downloaded: title, authors, institutions, abstract, keywords, and cited references. The study period spanned 2010–2019. Our bibliographic search resulted in 1,436 records.
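The logic of the topic query can be sketched locally for readers who want to re-screen exported records. This is a rough approximation, assuming that a record is a dictionary with hypothetical `title`, `abstract`, and `keywords` fields and that a trailing `*` means "any word ending"; it does not reproduce the Web of Science search engine itself.

```python
import re


def term_to_regex(term):
    """Compile a search phrase into a case-insensitive pattern.
    A trailing * is treated as 'zero or more further word characters'."""
    stem = term.rstrip("*")
    suffix = r"\w*" if term.endswith("*") else ""
    return re.compile(r"\b" + re.escape(stem) + suffix + r"\b", re.IGNORECASE)


# The two OR-groups of the query, joined by AND.
GROUP_A = [term_to_regex(t) for t in ("social media*", "big data*")]
GROUP_B = [term_to_regex(t) for t in ("data collection*", "data management*")]


def matches_topic(record):
    """True if title, abstract, or keywords satisfy (A1 OR A2) AND (B1 OR B2)."""
    text = " ".join(
        [record.get("title", ""), record.get("abstract", "")]
        + list(record.get("keywords", []))
    )
    in_group_a = any(p.search(text) for p in GROUP_A)
    in_group_b = any(p.search(text) for p in GROUP_B)
    return in_group_a and in_group_b


# Hypothetical records for illustration only.
records = [
    {"title": "Big data management in smart cities", "abstract": "", "keywords": []},
    {"title": "Social media use among teens", "abstract": "A survey.", "keywords": ["Facebook"]},
    {"title": "Data collection protocols", "abstract": "", "keywords": ["sensors"]},
]
print([matches_topic(r) for r in records])
```

Only the first record satisfies both groups; the other two match one group each and are excluded by the AND.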
4.2.3 Analysis Tools

To conduct bibliometric analysis, the R package “bibliometrix” (Aria & Cuccurullo, 2017), VOSviewer (Eck & Waltman, 2010), Ucinet (Borgatti et al., 2002), or other packages can be applied (see Cobo et al., 2011, for a review of bibliometric software). In this study, we use the R package “bibliometrix” and VOSviewer. VOSviewer is a freely available software tool for generating bibliometric maps and analyzing trends in scientific literature. Natural language processing techniques are built into VOSviewer to enable the creation of term co-occurrence networks based on English-language textual data. State-of-the-art techniques for network layout and network clustering are available in the software. VOSviewer uses a circle and a label to represent an element; the circle size represents importance, and circles with the same color belong to the same cluster. In the following section, we focus on aspects that describe global scientific production for big data and social media data research:

(1) Growth of output during 2010–2019
(2) Distribution of output in subject categories and journals
(3) Most cited documents
(4) Geographic and institutional distribution of publications
(5) Institution collaboration network analysis
(6) Keywords analysis
4.3 Results and Discussion

4.3.1 Characteristics of Article Outputs

A total of 1,436 publications on data collection and data management for social media and big data were identified for 2010–2019. These documents include 1,233 articles, 154 reviews, 10 book chapters, 28 early-access items, 30 proceedings papers, two letters, 39 editorial materials, one software review, and one book review. After removing records that lack complete authorship and publication-year information, we were left with 1,408 records. The characteristics of the article outputs are shown in Fig. 4.1 and Table 4.1. The number of annual publications rose from four in 2010 to 362 in 2019, illustrating dramatic growth in this area of research over the past decade. The average annual growth rate of all SCI and SSCI publications in the field was 56.74%. In addition, Fig. 4.1 shows that the annual growth of publications has accelerated since 2014. The average numbers of authors and references per article increased from 3.000 and 50.000 in 2010 to 4.439 and 59.948 in 2019, respectively (Table 4.1). The growth in the average number of authors per article indicates that collaboration in the field has increased. The growth in the average number of references per article suggests a growing body of knowledge on this topic. It is interesting to see that the average number of citations per article has decreased overall since 2010.

Fig. 4.1 Growth of publication outputs (horizontal axis: year, 2010–2019; vertical axes: annual number of publications and cumulative number of publications)

Table 4.1 Scientific outputs descriptors during 2010–2019

| PY | TP | AU | AU/TP | NR | NR/TP | TC | TC/TP |
|---|---|---|---|---|---|---|---|
| 2010 | 4 | 12 | 3.000 | 200 | 50.000 | 453 | 113.250 |
| 2011 | 9 | 35 | 3.889 | 301 | 33.444 | 555 | 61.667 |
| 2012 | 8 | 25 | 3.125 | 329 | 41.125 | 792 | 99.000 |
| 2013 | 35 | 102 | 2.914 | 1476 | 42.171 | 1793 | 51.229 |
| 2014 | 77 | 326 | 4.234 | 3128 | 40.623 | 3123 | 40.558 |
| 2015 | 157 | 600 | 3.822 | 7616 | 48.510 | 3123 | 19.892 |
| 2016 | 175 | 813 | 4.646 | 9053 | 51.731 | 3119 | 17.823 |
| 2017 | 244 | 1017 | 4.168 | 13,739 | 56.307 | 2626 | 10.762 |
| 2018 | 337 | 1622 | 4.813 | 19,529 | 57.950 | 1769 | 5.249 |
| 2019 | 362 | 1607 | 4.439 | 21,701 | 59.948 | 442 | 1.221 |

PY: year; TP: number of publications; AU: number of authors; NR: number of cited references; TC: total citation count; AU/TP, NR/TP, and TC/TP: average number of authors, references, and citations per paper
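The per-paper indicators in Table 4.1 are simple ratios and can be recomputed from the raw counts, as the sketch below does. The chapter does not state which formula produced the 56.74% average annual growth rate, so the compound growth function shown here is an assumption (one common definition) and is not expected to reproduce that figure exactly.

```python
# (PY, TP, AU, NR, TC) as listed in Table 4.1.
TABLE_4_1 = [
    (2010, 4, 12, 200, 453),
    (2011, 9, 35, 301, 555),
    (2012, 8, 25, 329, 792),
    (2013, 35, 102, 1476, 1793),
    (2014, 77, 326, 3128, 3123),
    (2015, 157, 600, 7616, 3123),
    (2016, 175, 813, 9053, 3119),
    (2017, 244, 1017, 13739, 2626),
    (2018, 337, 1622, 19529, 1769),
    (2019, 362, 1607, 21701, 442),
]


def per_paper_averages(rows):
    """Map each year to (AU/TP, NR/TP, TC/TP), rounded to three decimals."""
    return {
        year: (round(au / tp, 3), round(nr / tp, 3), round(tc / tp, 3))
        for year, tp, au, nr, tc in rows
    }


def cumulative_publications(rows):
    """Map each year to the running total of publications up to that year."""
    total, running = 0, {}
    for year, tp, *_ in rows:
        total += tp
        running[year] = total
    return running


def compound_growth_rate(first, last, intervals):
    """Constant yearly rate that turns `first` into `last` over `intervals` years.
    Assumed definition; the chapter does not specify its formula."""
    return (last / first) ** (1 / intervals) - 1


averages = per_paper_averages(TABLE_4_1)
cumulative = cumulative_publications(TABLE_4_1)
growth = compound_growth_rate(TABLE_4_1[0][1], TABLE_4_1[-1][1], len(TABLE_4_1) - 1)

print(averages[2019])      # ratios for the final year
print(cumulative[2019])    # total records across the decade
print(f"{growth:.2%}")     # compound rate under the assumed definition
```

The cumulative total for 2019 equals the 1,408 cleaned records, which is a useful consistency check between the table and the text.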
4.3.2 Subject Categories and Major Journals

Based on the classification of the Web of Science categories, the sample documents covered 194 subject categories. The research domain covered a wide variety of themes and disciplines. The top subject categories are presented in Table 4.2. The results show that data collection and management research for big data and social media spanned a wide range of disciplines, but such studies mainly stemmed from computer science information systems, engineering electrical electronic, telecommunications, computer science theory methods, computer science interdisciplinary applications, and information science library science.

Table 4.2 Distribution of the top 20 subject categories

| Subject category | TP | (%) |
|---|---|---|
| Computer science information systems | 233 | 16.19 |
| Engineering electrical electronic | 158 | 10.98 |
| Telecommunications | 126 | 8.76 |
| Computer science theory methods | 108 | 7.51 |
| Computer science interdisciplinary applications | 90 | 6.25 |
| Information science library science | 90 | 6.25 |
| Computer science software engineering | 76 | 5.28 |
| Health care sciences services | 68 | 4.73 |
| Medical informatics | 60 | 4.17 |
| Environmental sciences | 51 | 3.54 |
| Public environmental occupational health | 51 | 3.54 |
| Computer science artificial intelligence | 50 | 3.47 |
| Communication | 49 | 3.41 |
| Management | 49 | 3.41 |
| Computer science hardware architecture | 44 | 3.06 |
| Business | 43 | 2.99 |
| Medicine general internal | 40 | 2.78 |
| Multidisciplinary sciences | 34 | 2.36 |
| Environmental studies | 33 | 2.29 |
| Green sustainable science technology | 29 | 2.02 |
| Engineering civil | 28 | 1.95 |
| Operations research management science | 28 | 1.95 |

The most active journals are summarized in Table 4.3. This table shows that IEEE (Institute of Electrical and Electronics Engineers) Access is the most productive journal, followed by Future Generation Computer Systems-The International Journal of eScience, Journal of Medical Internet Research, ISPRS (International Society for Photogrammetry and Remote Sensing) International Journal of Geo-information, Sensors, and Sustainability. In terms of average citations per article, IEEE Transactions on Knowledge and Data Engineering, Computers in Human Behavior, and Big Data are the three most highly cited journals, with averages of 112.400, 70.000, and 44.429 citations per article, respectively.
Table 4.3 The 20 most active journals

| Journal | TP | TP (%) | TC | TC (%) | TC/TP |
|---|---|---|---|---|---|
| IEEE Access | 43 | 3.054 | 234 | 1.315 | 5.442 |
| Future Generation Computer Systems-The International Journal of eScience | 29 | 2.060 | 399 | 2.242 | 13.759 |
| Journal of Medical Internet Research | 29 | 2.060 | 384 | 2.158 | 13.241 |
| ISPRS International Journal of Geo-information | 14 | 0.994 | 46 | 0.258 | 3.286 |
| Sensors | 14 | 0.994 | 124 | 0.697 | 8.857 |
| Sustainability | 14 | 0.994 | 51 | 0.287 | 3.643 |
| BMJ Open | 13 | 0.923 | 42 | 0.236 | 3.231 |
| PLOS ONE | 12 | 0.852 | 235 | 1.321 | 19.583 |
| Cluster Computing-The Journal of Networks Software Tools and Applications | 10 | 0.710 | 24 | 0.135 | 2.400 |
| Concurrency and Computation: Practice and Experience | 10 | 0.710 | 63 | 0.354 | 6.300 |
| IEEE Transactions on Knowledge and Data Engineering | 10 | 0.710 | 1124 | 6.316 | 112.400 |
| International Journal of Information Management | 10 | 0.710 | 118 | 0.663 | 11.800 |
| Transportation Research Record | 10 | 0.710 | 39 | 0.219 | 3.900 |
| IEEE Internet of Things Journal | 9 | 0.639 | 222 | 1.248 | 24.667 |
| Computers in Human Behavior | 8 | 0.568 | 560 | 3.147 | 70.000 |
| IEEE Transactions on Industrial Informatics | 8 | 0.568 | 274 | 1.540 | 34.250 |
| SIGMOD Record | 8 | 0.568 | 56 | 0.315 | 7.000 |
| Big Data | 7 | 0.497 | 311 | 1.748 | 44.429 |
| Computer Law & Security Review | 7 | 0.497 | 54 | 0.303 | 7.714 |
| Frontiers in Psychology | 7 | 0.497 | 17 | 0.096 | 2.429 |
| International Journal of Distributed Sensor Networks | 7 | 0.497 | 20 | 0.112 | 2.857 |
| Online Information Review | 7 | 0.497 | 38 | 0.214 | 5.429 |

TP: number of publications; TC: total citation count; TC/TP: average number of citations per paper

4.3.3 Most Cited Documents

Based on the number of citations, we list the 10 most cited documents relating to data collection and data management in Table 4.4. The citation counts were updated based on Google Scholar figures. Readers who wish to study big data collection and data management in depth may find it worthwhile to read these highly cited documents first to gain an overview of the topic. For example, one could read Chen and Zhang (2014) to learn more about challenges, techniques, and technologies as they relate to big data. To learn how to process and generate large datasets, it could prove beneficial to learn more about the MapReduce programming model by reading Dean and Ghemawat (2008).
Table 4.4 The 10 most cited documents

| Title | Year | Publication | Author | Citation count (a) |
|---|---|---|---|---|
| Users of the world, unite! The challenges and opportunities of Social Media | 2010 | Business horizons | AM Kaplan, M Haenlein | 17,762 |
| MapReduce: Simplified data processing on large clusters | 2004 | usenix.org | J Dean, S Ghemawat | 12,758 |
| Big data: A revolution that will transform how we live, work, and think | 2013 | Houghton Mifflin Harcourt | V Mayer-Schönberger, K Cukier | 5,340 |
| Big data: The next frontier for innovation, competition, and productivity | 2011 | McKinsey | J Manyika | 5,319 |
| Business intelligence and analytics: From big data to big impact | 2012 | MIS quarterly | H Chen, RHL Chiang, VC Storey | 4,620 |
| Big data: the management revolution | 2012 | Harvard Business Review | A McAfee, E Brynjolfsson, TH Davenport | 4,130 |
| Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon | 2012 | Information, communication & society | D Boyd, K Crawford | 3,254 |
| Big data: A survey | 2014 | Mobile networks and applications | M Chen, S Mao, Y Liu | 2,403 |
| Data-intensive applications, challenges, techniques and technologies: A survey on Big Data | 2014 | Information sciences | CLP Chen, CY Zhang | 2,255 |
| Beyond the hype: Big data concepts, methods, and analytics | 2015 | International journal of information management | A Gandomi, M Haider | 2,212 |

(a) The citation count is updated based on Google Scholar on January 9, 2020

4.3.4 Geographic and Institutional Distribution of Publications

The geographic and institutional distributions of publications were generated based on author affiliation information. We summarize the 10 most productive countries in Fig. 4.2 in terms of the number of total publications, single-country articles, and international collaborations. Of these 10 countries, five were from Europe, two from North America, one from Australia, and two from Asia. The U.S. was the most productive country with a total of 506 articles. China ranked second with 234 articles, followed by the UK with 195 articles. Figure 4.2 also reveals that some countries achieved higher rates of international collaboration. These countries include France (collaboration rate: 65.00%), Spain (collaboration rate: 63.16%), Canada (collaboration rate: 62.65%), Australia (collaboration rate: 60.31%), and Germany (collaboration rate: 60.32%).

Fig. 4.2 Most productive countries during 2010–2019 (TP, total publications; IP, the number of independent publications by country; CP, the number of internationally collaborative publications)

Co-authorship analysis enabled the study of the cooperation network of the most influential countries, as plotted in Fig. 4.3. The size of each node reflects the productivity of the country, while the thickness of the curved lines between countries indicates the strength of collaboration. The U.S., China, and the U.K. had the largest number of papers with co-authorships. As Fig. 4.3 shows, there are clusters of inter-country collaboration. European countries cluster together, whereas Asian countries and Australia cluster together. This suggests that countries from the same continent tend to collaborate more with one another than with countries from distant continents. Note that the U.S. is at the center of the collaboration network, which illustrates that U.S. researchers conduct an enormous amount of collaborative research with researchers from both Asian and European countries.

Fig. 4.3 Co-authorship cooperation between productive countries
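The country-level indicators discussed in this section can be derived from author affiliations alone. The sketch below assumes that the collaboration rate is CP divided by TP, which the text implies but does not state, and it uses a small set of hypothetical papers rather than the study's records. The edge weights it returns correspond to the line thickness in a co-authorship map.

```python
from collections import Counter
from itertools import combinations

# Hypothetical papers: each entry lists the countries of the authors' affiliations.
papers = [
    ["USA", "China"],
    ["USA"],
    ["USA", "UK", "China"],
    ["France", "Spain"],
    ["China"],
    ["UK", "USA"],
]


def country_stats(papers):
    """Per country: TP (all papers), IP (single-country papers),
    CP (internationally co-authored papers), and rate = CP / TP (assumed)."""
    total, collaborative = Counter(), Counter()
    for countries in papers:
        unique = set(countries)
        for country in unique:
            total[country] += 1
            if len(unique) > 1:
                collaborative[country] += 1
    return {
        country: {
            "TP": total[country],
            "IP": total[country] - collaborative[country],
            "CP": collaborative[country],
            "rate": collaborative[country] / total[country],
        }
        for country in total
    }


def coauthorship_edges(papers):
    """Count, for every unordered country pair, the papers they share."""
    edges = Counter()
    for countries in papers:
        for a, b in combinations(sorted(set(countries)), 2):
            edges[(a, b)] += 1
    return edges


stats = country_stats(papers)
edges = coauthorship_edges(papers)
print(stats["USA"])
print(edges.most_common(2))
```

Counting each country once per paper (via `set`) is a design choice: a paper with three authors from one country still contributes a single publication to that country's TP.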
4.3.5 Institution Collaboration Network

A collaboration network of the 80 most productive institutions is visualized using VOSviewer (Fig. 4.4). The most productive institution proved to be the University of Michigan with 21 papers, followed by the Chinese Academy of Sciences and the University of Sydney, each of which produced 17 papers (Table 4.5). Each node in Fig. 4.4 indicates an institution. The size of each node indicates the institution's productivity: the bigger the node, the more productive the organization. The distance between two organizations in the visualization approximately indicates the relatedness of the organizations in terms of co-authorship links. The closer two organizations are located to each other, the stronger their relatedness. The co-authorship links between organizations are also represented by curved lines. The institutions are clustered into two groups, each with a unique color. Asian institutions are represented by the green group, while North American and European institutions are represented by the red group. Institutions working on social media data collection and management within the same continent are clustered closer together and have more connections than institutions scattered across different continents. This means that institutions from the same continent are more likely to collaborate with one another than institutions from different continents.

Fig. 4.4 Institution collaboration network of the 66 most central institutions

Table 4.5 Top 17 institutions based on the total publications

| Rank | Organization | Country | Number of publications | Total citations |
|---|---|---|---|---|
| 1 | University of Michigan | USA | 21 | 226 |
| 2 | Chinese Academy of Sciences | China | 17 | 370 |
| 3 | University of Sydney | Australia | 17 | 266 |
| 4 | University of Washington | USA | 15 | 350 |
| 5 | University of Oxford | UK | 15 | 177 |
| 6 | University of Arizona | USA | 14 | 230 |
| 7 | University of Cambridge | UK | 14 | 312 |
| 8 | Columbia University | USA | 14 | 178 |
| 9 | University of Melbourne | Australia | 14 | 502 |
| 10 | University of Toronto | Canada | 14 | 84 |
| 11 | University of Queensland | Australia | 13 | 105 |
| 12 | University of Wisconsin | USA | 13 | 243 |
| 13 | University College London | UK | 12 | 443 |
| 14 | Huazhong University of Science & Technology | China | 12 | 54 |
| 15 | University of Maryland | USA | 12 | 379 |
| 16 | Monash University | Australia | 12 | 56 |
| 17 | Wuhan University | China | 12 | 72 |
4.3.6 Keywords Analysis—Network Analysis Keywords supplied by the authors provide a very basic idea of the topics covered within the article. The 30 most frequently used keywords in the study period are calculated and ranked in Column (1) of Table 4.6. Co-occurrence links indicate the frequency whereby keywords occur simultaneously in a study. The co-occurrence relationships between keywords can be shown by the co-occurrence links in the co-occurrence word network. In this study, the co-occurrence relationships between the top 86 high-frequency keywords were examined, and the co-occurrence word networks were visualized by VOSviewer software (Fig. 4.5). The nodes represent high-frequency keywords, and the size of each represents the degree of frequency
58
Q. Peng and X. Ye
Table 4.6 Temporal evolution of the 30 most frequently used keywords Keywords
Periods 2010–2019 N
2010–2012 R
(%)
N
2013–2015 R
N
Rising Trend
2016–2019 R
N
R
Big data
383
1 5.75 ---
--- 69
1 314
1
Social media
155
2 2.33 7
1
42
2 106
2
Data management
61
3 0.92 1
22
9
7
3 X
51
Data collection
58
4 0.87 1
19 14
3
43
4
Cloud computing
43
5 0.65 ---
–
9
35
6 X
Twitter
42
6 0.63 2
3
11
5
29
8
Machine learning
40
7 0.60 ---
---
4
18
36
5 X
Facebook
35
8 0.53 2
2
13
4
20
Data mining
34
9 0.51 ---
---
7
10
27
9 X
Internet of things
34
10 0.51 ---
---
3
32
31
7 X
Privacy
32
11 0.48 ---
---
9
8
23
11
Big data analytics
28
12 0.42 ---
---
4
15
24
10 X
Crowdsourcing
24
13 0.36 ---
---
2
45
22
12 X
Data analytics
21
14 0.32 ---
---
3
26
18
14 X
Data analysis
20
15 0.30 ---
---
5
12
15
16
Internet
20
16 0.30 ---
--- 10
6
10
29
Ethics
19
17 0.29 ---
---
16
15
18
8 -
4
13
(continued)
the keyword is used. The high-frequency keywords are selected based on how many times keywords have. A higher value for a keyword represents a higher frequency at which the word was referenced as a keyword in the last 10 years. The distance between two keywords in the visualization approximately indicates the relatedness of the keywords in terms of co-occurrence links. The closer two keywords are located to each other, the stronger their relatedness. The strongest co-occurrence links between keywords are also represented by curved lines. As shown in Fig. 4.5, the 86 most frequent author keywords are grouped into four clusters. The red cluster mainly focuses on social media; the green and yellow clusters focus mainly on big data dimensions, and the blue cluster is mainly about data analytics. The keywords that are referenced with highest frequency are “big data”, “social media”, “data management”, “data collection”, “cloud computing”, “Twitter”, “machine learning”, “Facebook”, and “data mining”. Twitter and Facebook are the two most popular social data platforms referenced in human dynamics research. It is not surprising to see these two words occurring in high frequency. We also see
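Table 4.6 (continued)

| Keywords | 2010–2019 N | 2010–2019 R | 2010–2019 (%) | 2010–2012 N | 2010–2012 R | 2013–2015 N | 2013–2015 R | 2016–2019 N | 2016–2019 R | Rising trend |
|---|---|---|---|---|---|---|---|---|---|---|
| Data science | 17 | 18 | 0.26 | --- | --- | 2 | 53 | 15 | 17 | X |
| Big data management | 16 | 19 | 0.24 | --- | --- | 1 | 162 | 15 | 15 | X |
| Citizen science | 16 | 20 | 0.24 | 1 | 10 | 3 | 22 | 12 | 20 | |
| Hadoop | 15 | 21 | 0.23 | --- | --- | 1 | 420 | 14 | 19 | X |
| Data | 14 | 22 | 0.21 | --- | --- | 6 | 11 | 8 | 36 | |
| MapReduce | 13 | 23 | 0.20 | --- | --- | 2 | 67 | 11 | 24 | X |
| Sentiment analysis | 13 | 24 | 0.20 | --- | --- | 1 | 737 | 12 | 21 | X |
| Surveillance | 13 | 25 | 0.20 | --- | --- | 2 | 95 | 11 | 26 | X |
| Business intelligence | 12 | 26 | 0.18 | --- | --- | 2 | 39 | 10 | 28 | X |
| IoT | 12 | 27 | 0.18 | --- | --- | 1 | 491 | 11 | 23 | X |
| Recruitment | 12 | 28 | 0.18 | 1 | 70 | 3 | 34 | 8 | 40 | |
| Security | 12 | 29 | 0.18 | --- | --- | 3 | 35 | 9 | 34 | |
| Technology | 12 | 30 | 0.18 | --- | --- | 4 | 20 | 8 | 41 | |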
N: the number of articles in the study period; R: the absolute rank of author keywords; ---: no such author keyword in a specific period
Fig. 4.5 Co-occurrence network of top 86 high-frequency keywords
that machine learning is frequently referenced, which coincides with the fact that machine learning approaches have been popular and heavily applied in social media data analysis and human dynamics research. Other significant topics over the past 10 years include “Internet of things”, “privacy”, “big data analytics”, “crowdsourcing”, “data analytics”, “Internet”, “ethics”, “data science”, “big data management”, and “citizen science”. The high frequency of these topics suggests that they are critical areas of social media data collection and data management throughout human dynamics research. For example, the keywords “ethics” and “privacy” both occur frequently in the literature, which reflects that privacy and ethics are among the critical concerns relating to social media data collection and management.
4.3.7 Keywords Analysis: Temporal Evolution
Examining the temporal evolution of these keywords gives insight into trending areas of research. To closely examine the temporal evolution, we divide the 10 studied years into three consecutive periods (2010–2012, 2013–2015, and 2016–2019). The 30 most frequently used keywords throughout the entire study period (2010–2019) are calculated and ranked in Columns (2), (3), and (4) of Table 4.6. We identify keyword trends on the basis of whether or not the rank of a keyword rises across the three consecutive periods. Sixteen keywords (“data management”, “cloud computing”, “machine learning”, “data mining”, “Internet of things”, “big data analytics”, “crowdsourcing”, “data analytics”, “data science”, “big data management”, “Hadoop”, “MapReduce”, “sentiment analysis”, “surveillance”, “business intelligence”, and “IoT”) have become increasingly popular over the past 10 years. For example, the keyword “machine learning” does not occur in articles in the 2010–2012 period, but its rank rose from 18th in 2013–2015 to fifth in 2016–2019, suggesting that machine learning emerged as a trending research topic in recent years. This dramatic increase coincided with the growing popularity of AI and machine learning. It is not surprising that the keyword “cloud computing” has also been referenced frequently in recent years. This can be attributed to recent advances in cloud computing technology and its applications in social data analysis. In addition, “crowdsourcing” and “surveillance” rose in popularity. Human crowdsourcing and surveillance cameras are two important approaches for sensing and data acquisition. Surveillance cameras generate a huge volume of images and videos, while human crowdsourcing generates data via mobile devices. These data acquisition approaches provide information for traffic analysis, human mobility, and urban structure research.
Another example of a trending research topic is the keyword “sentiment analysis”. With social media data, researchers are able to analyze changes in human sentiments and mobility patterns as a result of an event (e.g., a disaster, a pandemic, or a presidential campaign).
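The rank-based trend rule described above can be sketched in a few lines of Python. The ranks below are copied from Table 4.6 for a handful of keywords. The function name `is_rising` and the treatment of a period without the keyword (ranked below every present keyword) are illustrative assumptions, since the text does not spell out the exact criterion, and this literal reading need not reproduce every mark in Table 4.6.

```python
from typing import Dict, Optional, Sequence, Tuple

# Ranks per period (2010-2012, 2013-2015, 2016-2019), copied from Table 4.6.
# None means the keyword did not appear as an author keyword in that period.
RANKS: Dict[str, Tuple[Optional[int], Optional[int], Optional[int]]] = {
    "data management": (22, 7, 3),
    "machine learning": (None, 18, 5),
    "crowdsourcing": (None, 45, 12),
    "Twitter": (3, 5, 8),
    "privacy": (None, 8, 11),
}


def is_rising(ranks: Sequence[Optional[int]]) -> bool:
    """True if the rank strictly improves (gets smaller) in every consecutive period.

    Assumption: an absent keyword is treated as ranked below every present one.
    """
    filled = [float("inf") if r is None else r for r in ranks]
    return all(later < earlier for earlier, later in zip(filled, filled[1:]))


rising = [keyword for keyword, ranks in RANKS.items() if is_rising(ranks)]
print(rising)  # ['data management', 'machine learning', 'crowdsourcing']
```

Under this reading, a keyword that holds the same rank in two consecutive periods is not counted as rising.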
4.4 Conclusions
In conclusion, this study examined social data collection and data management as they are referenced in human dynamics research through a bibliometric approach. We presented an overview of existing studies in this area. Readers looking to understand social media data collection and management as referenced in human dynamics research can use this study to help determine which articles they should read, which journals they should consider for publication submissions, and which research trends are most significant. In summary, the number of annual publications about big data and social media research with an emphasis on data collection and data management increased from just four in 2010 to 362 in 2019. The annual growth rate for such publications has accelerated since 2014. The studies covered a wide variety of subjects, such as computer science information systems, engineering electrical electronic, telecommunications, computer science theory methods, computer science interdisciplinary applications, and information science library science. The three most productive journals in these areas were IEEE Access, Future Generation Computer Systems-The International Journal of eScience, and Journal of Medical Internet Research. This study suggests that “big data”, “social media”, “data collection”, “Twitter”, “Facebook”, and “privacy” have been the most popular topics in this area over the past 10 years. Some keywords, such as “data management”, “cloud computing”, “machine learning”, “data mining”, “Internet of things”, “big data analytics”, “crowdsourcing”, “data analytics”, “data science”, “big data management”, “Hadoop”, “MapReduce”, “sentiment analysis”, “surveillance”, “business intelligence”, and “IoT”, attracted increasing attention, reflecting research trends.
Furthermore, most of the 30 most frequently referenced keywords in this area were not listed as keywords during the 2010–2012 period; rather, they grew in popularity in the more recent periods.

Acknowledgments This material is partially based upon work supported by the National Science Foundation under Grant Nos. 1739491 and 1937908. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References
Aria, M., & Cuccurullo, C. (2017). bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11, 959–975.
Bao, J., Zheng, Y., & Mokbel, M. F. (2012). Location-based and preference-aware recommendation using sparse geo-social networking data. In International Conference on Advances in Geographic Information Systems (pp. 199–208). ACM.
Borgatti, S. P., Everett, M. G., & Freeman, L. C. (2002). Ucinet for windows: Software for social network analysis. Harvard, MA: Analytic Technologies, 6.
Cobo, M. J., López-Herrera, A. G., Herrera-Viedma, E., & Herrera, F. (2011). Science mapping software tools: Review, analysis, and cooperative study among tools. Journal of the Association for Information Science and Technology, 62, 1382–1402.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51, 107–113.
Depoux, A., Martin, S., Karafillakis, E., Preet, R., Wilder-Smith, A., & Larson, H. (2020). The pandemic of social media panic travels faster than the COVID-19 outbreak. Journal of Travel Medicine. https://doi.org/10.1093/jtm/taaa031
Ding, Y., Chowdhury, G. G., & Foo, S. (2001). Bibliometric cartography of information retrieval research by using co-word analysis. Information Processing and Management, 37, 817–842.
Garfield, E. (1970). Citation indexing for studying science. Nature, 227, 669–671.
Glänzel, W., & Schubert, A. (2005). Analysing scientific networks through co-authorship. In H. F. Moed, W. Glänzel, & U. Schmoch (Eds.), Handbook of quantitative science and technology research (pp. 257–276). The Use of Publication and Patent Statistics in Studies of S&T Systems.
Gong, J., Li, S., Ye, X., & Peng, Q. (2020). Measuring the dynamic impact of high-speed railways on urban interactions in China. arXiv:2010.08182 [cs].
He, Y., & Cheung Hui, S. (2002). Mining a web citation database for author co-citation analysis. Information Processing and Management, 38, 491–508.
Hussain, A., & Vatrapu, R. (2014). Social data analytics tool: Design, development, and demonstrative case studies. In 2014 IEEE 18th International Enterprise Distributed Object Computing Conference Workshops and Demonstrations (pp. 414–417).
Karamshuk, D., Noulas, A., Scellato, S., Nicosia, V., & Mascolo, C. (2013). Geo-spotting: Mining online location-based services for optimal retail store placement. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 793–801).
Li, Q., Wei, W., Xiong, N., Feng, D., Ye, X., & Jiang, Y. (2017). Social media research, human behavior, and sustainable society. Sustainability, 9, 384.
Mayr, P., & Weller, K. (2017). Think before you collect: Setting up a data collection approach for social media studies. The SAGE Handbook of Social Media Research Methods, 679.
Philip Chen, C. L., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314–347.
Pritchard, A. (1969). Statistical bibliography or bibliometrics. Journal of Documentation, 25, 348–349.
Schmoch, U., & Schubert, T. (2007). Are international co-publications an indicator for quality of scientific research? Scientometrics, 74, 361–377.
Shaw, S.-L., Tsou, M.-H., & Ye, X. (2016). Editorial: Human dynamics in the mobile and big data era. International Journal of Geographical Information Science, 30, 1687–1693.
Thorson, K., Driscoll, K., Ekdale, B., Edgerly, S., Thompson, L. G., Schrock, A., Swartz, L., Vraga, E. K., & Wells, C. (2013). Youtube, Twitter and the occupy movement. Information, Communication & Society, 16, 421–451.
Van Eck, N. J., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84, 523–538.
Wang, Z., Lam, N. S. N., Obradovich, N., & Ye, X. (2019). Are vulnerable communities digitally left behind in social responses to natural disasters? An evidence from Hurricane Sandy with Twitter data. Applied Geography, 108, 1–8.
Wang, Z., & Ye, X. (2018). Social media analytics for natural disaster management. International Journal of Geographical Information Science, 32, 49–72.
Ye, X., She, B., Li, W., Kudva, S., & Benya, S. (2020a). What and where are we tweeting about black friday? In R. R. Thakur, A. K. Dutt, S. K. Thakur, & G. M. Pomeroy (Eds.), Urban and regional planning and development. 20th century forms 21st century transform (pp. 173–186). Springer International Publishing.
Ye, X., Li, S., & Peng, Q. (2021). Measuring interaction among cities in China: A geographical awareness approach with social media data. Cities, 109, 103041.
Ye, X., Zhao, B., Nguyen, T. H., & Wang, S. (2020b). Social media and social awareness. In Manual of digital earth (pp. 425–440). Springer.
Zheng, Y., Capra, L., Wolfson, O., & Yang, H. (2014). Urban computing: Concepts, methodologies, and applications. ACM Transactions on Intelligent Systems and Technology, 5, 38:1–38:55.
Chapter 5
Similarity Measurement on Human Mobility Data with Spatially Weighted Structural Similarity Index (SpSSIM)
Chanwoo Jin, Atsushi Nara, Jiue-An Yang, and Ming-Hsiang Tsou
Reprinted from: Jin, C., Nara, A., Yang, J.-A., & Tsou, M.-H. (2020). Similarity measurement on human mobility data with spatially weighted structural similarity index (SpSSIM). Transactions in GIS, 24(1), 104–122. https://doi.org/10.1111/tgis.12590. With permission from John Wiley and Sons.
C. Jin · A. Nara (corresponding author) · M.-H. Tsou: Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
J.-A. Yang: Calit2/Qualcomm Institute, University of California San Diego, La Jolla, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_5

5.1 Introduction
Urban dynamics are built on diverse types of human movement, including migration, commuting, and leisure activities; hence, the study of human mobility is essential for understanding complex urban systems (Dodge et al., 2016; Miller & Shaw, 2015). Various data sources exist for examining human mobility and urban dynamics, for instance census surveys, activity diaries, Call Detail Records (CDRs) from cellphones, Global Positioning System (GPS) points, and new media data. Among these, researchers have recently adopted social media data, which provide a massive amount of publicly available, individual-scale, timestamped, geo-referenced data, for discovering previously unrevealed travel patterns and activities. For example, finding major human flows (Gao et al., 2018; Poon & Pandit, 1996; Xu et al., 2015) can support decisions across multiple fields including public health, epidemiology, and disaster response (Martín et al., 2017; Nara et al., 2017; Panigutti et al., 2017; Wesolowski et al., 2015). While previous studies have often examined and analyzed spatial distributions of movements, less research has focused on measuring similarities and differences between multiple data sources (Gao et al., 2018).
Each data source has unique characteristics for describing human mobility, owing to the diversity of survey participants or platform users who provide mobility data and of the methodologies used to collect, manage, and publish geo-referenced data. For instance, in terms of demographic characteristics, Instagram is more popular among women and younger populations than Twitter (Table 5.1); such variations may produce significant differences in mobility patterns and play a significant role in shaping the complexity of human mobility. Therefore, it is crucial to understand the capabilities and limitations of each data source for describing human mobility. This can further help in grasping comprehensive human mobility patterns, where multiple mobility data sources complementarily explain different types of human movements.

Table 5.1 Demographic characteristics of Instagram and Twitter in the U.S. (Greenwood et al., 2016). Values are % of online adults who use social media.

| Group | Instagram | Twitter |
|---|---|---|
| All online adults | 32 | 24 |
| Gender: Men | 26 | 24 |
| Gender: Women | 38 | 25 |
| Age: 18–29 | 59 | 36 |
| Age: 30–49 | 33 | 23 |
| Age: 50–64 | 18 | 21 |
| Age: 65+ | 8 | 10 |

There have been attempts to develop new methodologies to measure similarities between mobility patterns from diverse sources (Xia et al., 2011; Yuan & Raubal, 2014) and to address demographic biases in multiple social media data (Crooks et al., 2015; Gao et al., 2017). However, effectively and quantitatively measuring mobility similarities from multiple data sources continues to be a research challenge.
To address this research gap, this paper proposes a new method, the Spatially weighted Structural SIMilarity index (SpSSIM), which measures the similarity of Origin–Destination (OD) flow matrices in order to compare mobility patterns from multiple data sources. SpSSIM adapts the Structural SIMilarity index (SSIM), an image quality assessment technique that measures the similarity between two images (Wang et al., 2004). SSIM has been applied to measure the similarity of human mobility by mapping OD matrices into arrays of image values; however, previous works do not consider the spatial configuration of flows in the OD matrices. We extended SSIM to incorporate spatial adjacency by utilizing a series of spatial weight matrices.
A key advantage of SpSSIM is that local similarities can be properly measured and investigated. While SSIM utilizes a square moving window to calculate the similarity of images or matrices, SpSSIM employs a geographically defined range with spatial weights. This enables SpSSIM to calculate similarities of flows in a certain geographic boundary. As our case study, we compared each pair of OD matrices of human
daily mobility extracted from three mobility data sources: U.S. Census-based Longitudinal Employer-Household Dynamics Origin–Destination Employment Statistics (LODES), Twitter, and Instagram, aggregated at the sub-regional area (SRA) scale in San Diego County, CA. The two geo-referenced social media datasets, Twitter and Instagram, were collected via publicly available Application Programming Interfaces (APIs) in 2015. The case study demonstrated the capability of SpSSIM to measure the mobility similarities between each pair of data sources and to provide an underlying knowledge of human mobility extracted from social media data, which can ultimately facilitate the understanding of the complexities of human mobility.
The remainder of this article is organized as follows. Section 5.2 describes related works on measuring the similarity of human mobility and studying human mobility with social media data. Section 5.3 introduces the SpSSIM methodology, and Sect. 5.4 describes the data used in this study. Section 5.5 details the results of comparative experiments between SSIM and SpSSIM and the interpretation of SpSSIM values with a case study of San Diego County, CA. The final section discusses implications of the results and conclusions with future works.
5.2 Related Work
5.2.1 Methodological Approaches for Quantifying Similarity of Mobility
Methodological approaches to characterize and compare mobility patterns have, in essence, quantified similarity in mobility data, which can identify major trends in movements and allow the trends from diverse data sources to be compared. As a traditional approach, dominant flow analysis (Nystuen & Dacey, 1961) counted the amount of flow and detected major trends of human mobility such as trading (Smith, 1970; Xu et al., 2015) and tourist traveling (Pearce, 1996). Another approach employed principal component analysis (PCA) to identify centers of a mobility network (Garrison & Marble, 1964). Components derived from PCA represent the similarity of regions regarding the amount of flow (Poon & Pandit, 1996). For example, Clayton (1977) categorized US states in terms of the similarity of origins and the numbers of interstate immigrants. More recently, Gao et al. (2018) employed spatial scan statistics (Kulldorff, 1997) to compare major patterns of taxi trips in the morning and afternoon in New York City, and county-to-county migration flows between age groups in the United States, by clustering origins and destinations. While these methodological approaches have been effective in summarizing mobility patterns with a few major trends, the number of clusters was potentially arbitrary and the comparison was limited to detected clusters (Salvador & Chan, 2004). In other words, the similarities measured by clusters can be sensitive to the number of clusters and to the clustering methods.
Various methodologies have been explored to calculate the similarity between mobility data, which are often treated as trajectories, i.e., two sets of temporally sequenced location points. Common trajectory similarity measurements calculate distances between points on each trajectory. For example, Euclidean distance has been widely used to measure geographic gaps between two points from each trajectory (Zheng & Zhou, 2011). Meanwhile, Levenshtein distance, or edit distance, which originated in informatics as a measure of the distance between two strings, has also been applied to geographic trajectories by matching each of their intermediate points and calculating the differences (Yuan & Nara, 2015; Yuan & Raubal, 2014). These approaches provide similarity measurements of individual sequential movements. More recently, Behara et al. (2018) proposed the Mean Normalized Levenshtein distance for OD matrices (MNLdOD), which applies Levenshtein distance to measure the similarity of two OD matrices. This measurement treats the descending order of destinations by normalized flow volume from each origin (i.e., a row of an OD matrix) as a string and calculates the similarity row by row to capture differences in the order of destination choices from the same origin. Since MNLdOD employs the order of flow volumes rather than the actual amounts of flow, it is limited in fully incorporating flow volume variations in its similarity index.
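To make the row-wise idea concrete, the following minimal Python sketch ranks the destinations of one origin by flow volume and compares the two rankings with Levenshtein distance. It is an illustration of the general principle only, not the exact MNLdOD formulation of Behara et al. (2018); the helper names and the toy flow values are assumptions.

```python
from typing import List, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def destination_order(row: Sequence[float]) -> List[int]:
    """Destination indices sorted by descending flow (ties keep index order)."""
    return sorted(range(len(row)), key=lambda j: -row[j])


# Hypothetical out-flows from one origin to four destinations in two data sources.
row_a = [5, 40, 20, 10]
row_b = [6, 35, 9, 25]

order_a = destination_order(row_a)  # [1, 2, 3, 0]
order_b = destination_order(row_b)  # [1, 3, 2, 0]
distance = levenshtein(order_a, order_b)
print(distance, distance / len(order_a))  # 2 0.5
```

Multiplying `row_a` by ten leaves `order_a` unchanged and the distance at zero, which illustrates the limitation noted above: an order-based measure does not see differences in flow volume.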
5.2.2 Human Mobility and Social Media
Human mobility has long been discussed in the social and geographic sciences, and complex mobility dynamics and behaviors have been studied in a variety of applications including migration, traveling, and evacuation (Cresswell, 2012; Noulas et al., 2012). In daily human mobility, fundamental activities of human living such as commuting, shopping, and leisure trips often involve movements, which further contribute to forming complex urban dynamics through human–human and human–environment interactions in space and time (Huang & Wong, 2016; Sun et al., 2015; Wu et al., 2014). Thus, investigating human mobility is key to comprehending complex urban systems, yet it has been challenging (Larsen et al., 2006; Yuan & Raubal, 2014), especially since traditional data collection methods (e.g., census surveys and travel diaries) were limited in their ability to observe and record human mobility at full scale due to their high cost (Miller & Shaw, 2015). Recent advancement and adoption of mobile Information and Communication Technologies (mICTs) and location-aware technologies (LATs) have enabled the recording of human mobility via mobile devices. These larger and finer-scale spatiotemporal data can support the investigation of daily human mobility and reveal undiscovered patterns not captured in data collected through traditional methods (Hawelka et al., 2014; Liu et al., 2012; Wu et al., 2014). Social media can be one of the data sources capturing human mobility by taking advantage of mICTs and LATs. Recent studies have explored the potential of social media data to reconstruct individual trajectories and describe dynamic human movement behaviors in detail (Nara et al., 2017). For example, traveling patterns and
behaviors have been studied using check-in data to detect hotspots and unusual visiting places (Sun et al., 2015), and geo-tagged Twitter posts to estimate the volume of country-to-country flows (Hawelka et al., 2014). Geotagged social media posts have been applied to investigate human mobility and evacuation behavior during disastrous events (Li et al., 2018; Martín et al., 2017; Wang & Taylor, 2014). Despite the usefulness of social media data for investigating human and urban dynamics, they are known to be biased by uneven demographic, geographic, and temporal distributions. Furthermore, since each social media platform has its own usages and demographics, human mobility patterns extracted from social media potentially differ by platform; however, few studies have investigated the similarities and differences in human mobility across social media platforms. To address the bias of a single social media data source, some studies have integrated multiple social media data sources with other spatial and aspatial data to gain deeper knowledge of human activities and urban contexts. For instance, Gao et al. (2017) synthesized multiple data sources from Flickr, Instagram, Twitter, Travel Blogs, and Wikipedia to extract semantics and identify cognitive regions, based on a belief that each source represented different user groups. They assumed that Flickr was more tourism-oriented than other social media such as Twitter and Instagram, which showed daily activities, news, and visited places. Crooks et al. (2015) utilized open-source crowdsourcing datasets ranging from Global Positioning System (GPS) tracks and Foursquare to Twitter, Flickr, and weblogs to demonstrate how social media, trajectory, and traffic data could be analyzed to capture the evolving nature of a city’s form and function. They argued that each data source represented a part of dynamic and complex urban functions.
These approaches highlighted the importance of integrating multiple data sources to gain deeper insights into urban dynamics. Despite the importance and necessity of data integration in mobility studies, it has been less utilized due to limited understanding of the capabilities and limitations of each data source and their similarity and difference.
5.3 Methodology
This research proposes a novel index, spatially weighted SSIM (SpSSIM), to measure the similarity of two OD flow matrices in order to compare mobility patterns. Our method adopts the concept of the structural similarity index (SSIM), which was originally developed to assess image quality by comparing local patterns of image structure. We extended SSIM spatially by utilizing a series of spatial weight matrices that define the range of the neighborhood geographically. The following sub-sections describe the SSIM algorithm, the process of spatial extension, and the verification of SpSSIM.
5.3.1 Spatially Weighted Structural Similarity Index (SpSSIM)
Wang et al. (2004) developed SSIM to measure the similarity between two images for assessing the quality of images copied or generated from an original image. Because the human visual system compares images by their overall arrangement rather than by the values of single cells, SSIM considers the arrangement of image values by quantifying the local patterns of pixel intensities with a moving window, using three components: luminance, contrast, and structure. The SSIM between two images X and Y is expressed as Eq. 5.1.

$$\mathrm{SSIM}(x, y) = l(x, y)^{\alpha} \cdot c(x, y)^{\beta} \cdot s(x, y)^{\gamma} \tag{5.1}$$

$$l(x, y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x, y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$

where $l(x, y)^{\alpha}$, $c(x, y)^{\beta}$, and $s(x, y)^{\gamma}$ represent the three components luminance, contrast, and structure, respectively. $x$ and $y$ denote the grayscale values of pixels (0 to 255) in images X and Y. $\mu$, $\sigma^2$, and $\sigma_{xy}$ refer to the mean, variance, and covariance, respectively. $C_1$, $C_2$, and $C_3$ are small constants that stabilize the ratios when the denominators are close to zero. The value of SSIM equals 1 when the two images are exactly the same, and it gets closer to 0 as they become less similar. When the three components are regarded as equally important ($\alpha = \beta = \gamma = 1$) and $C_3$ is equal to $C_2/2$, Eq. 5.1 can be simplified to Eq. 5.2.

$$\mathrm{SSIM}(x, y) = \frac{\left(2\mu_x \mu_y + C_1\right)\left(2\sigma_{xy} + C_2\right)}{\left(\mu_x^2 + \mu_y^2 + C_1\right)\left(\sigma_x^2 + \sigma_y^2 + C_2\right)} \tag{5.2}$$

Once SSIM is calculated in a window, the window moves to the next cells and SSIM is computed again, until the last cell of the images is reached. Then, the overall similarity of images X and Y is represented by the mean of the local SSIM values, where M denotes the total number of local windows (Eq. 5.3).

$$\mathrm{MSSIM}(X, Y) = \frac{1}{M} \sum_{j=1}^{M} \mathrm{SSIM}\left(x_j, y_j\right) \tag{5.3}$$
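As a rough illustration of Eqs. 5.2 and 5.3, the following Python sketch computes SSIM in each square window of two small matrices and averages the local values. The constants are the common defaults from Wang et al. (2004) for 8-bit images and are an assumption here, as are the 2-by-2 window, the use of population (1/N) moments, and the toy matrices; none of these are reported as the chapter's own settings.

```python
from statistics import fmean
from typing import List, Sequence

Matrix = List[List[float]]

# Assumption: usual defaults for 8-bit images, C1 = (0.01 * 255)^2, C2 = (0.03 * 255)^2.
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def ssim_window(x: Sequence[float], y: Sequence[float]) -> float:
    """SSIM of two equally sized windows in the simplified form of Eq. 5.2."""
    mu_x, mu_y = fmean(x), fmean(y)
    # Population (1/N) moments are used here for simplicity.
    var_x = fmean((a - mu_x) * (a - mu_x) for a in x)
    var_y = fmean((b - mu_y) * (b - mu_y) for b in y)
    cov_xy = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    luminance = (2 * mu_x * mu_y + C1) / (mu_x * mu_x + mu_y * mu_y + C1)
    contrast_structure = (2 * cov_xy + C2) / (var_x + var_y + C2)
    return luminance * contrast_structure


def mean_ssim(X: Matrix, Y: Matrix, size: int = 2) -> float:
    """Mean of the local SSIM values over all size-by-size windows (Eq. 5.3)."""
    rows, cols = len(X), len(X[0])
    scores = []
    for r in range(rows - size + 1):
        for c in range(cols - size + 1):
            cells = [(i, j) for i in range(r, r + size) for j in range(c, c + size)]
            scores.append(ssim_window([X[i][j] for i, j in cells],
                                      [Y[i][j] for i, j in cells]))
    return fmean(scores)


X = [[10, 20, 30], [20, 40, 10], [30, 10, 50]]
Y = [[12, 18, 33], [25, 35, 5], [28, 60, 45]]
print(mean_ssim(X, X))  # identical matrices give 1 (up to rounding)
print(mean_ssim(X, Y))  # below 1 because the local patterns differ
```

Because each window covers cells that are adjacent in the matrix, the result depends on how rows and columns are ordered.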
SSIM has been widely utilized to assess the quality of images and to evaluate the performance of image processing due to its simplicity and accuracy (Brunet et al., 2012). The index has recently been employed to compare movements because the amounts of flow in OD matrices can be treated as image values (Djukic, 2014; Pollard et al., 2013). For example, Djukic (2014) used SSIM with a square window to evaluate the estimation of OD demands in traffic flows (Fig. 5.1), since traditional statistics such as the mean squared error (MSE) cannot fully account for the spatial correlation between OD pairs. Yet SSIM is still problematic because it is sensitive to the ordering of OD pairs: the order in a matrix is not always arranged by spatial contiguity or distance. For example, contiguous cells in an OD matrix can be far from each other in real space when the order is based on population size or is randomly distributed. In this case, a square window cannot capture the spatial correlation between OD pairs because contiguous cells are not spatially adjacent. Moreover, the same values of OD matrices in different orders result in different SSIM values. To solve the problem, Djukic (2014) suggested finding the best way to represent the spatial dependency of OD pairs by testing various window sizes, and Behara et al. (2017) re-ordered OD pairs by grouping them into an upper level and calculated SSIM within the new level; however, selecting the optimal window size and the optimal order of OD pairs remains challenging.

Fig. 5.1 Comparison of two pictures using SSIM (Adapted from Pollard et al. (2013))

To overcome the SSIM ordering issue when comparing OD matrices, SpSSIM (Eq. 5.4) utilizes a series of spatial weight matrices instead of the moving window of SSIM. The range of locality is defined by multiplying a spatial weight matrix with an OD matrix using the Hadamard product (Eq. 5.5). The spatial weight matrix consists of 0s and 1s, where flows ($f_{ij}$) whose distance ($d_{ij}$) falls within the distance range have a weight value of 1 (Eq. 5.6). In other words, SpSSIM computes the similarity between two OD matrices only within a specific geographic range by blocking values outside the range, multiplying them by 0 (Fig. 5.2). As a result, SpSSIM takes a value between 0 and 1 and is close to 1 when the two matrices are similar.

Fig. 5.2 Comparison of two OD matrices using SpSSIM
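$$\mathrm{SpSSIM}(x, y, w) = \frac{\left(2\mu_{wx} \mu_{wy} + C_1\right)\left(2\sigma_{wx,wy} + C_2\right)}{\left(\mu_{wx}^2 + \mu_{wy}^2 + C_1\right)\left(\sigma_{wx}^2 + \sigma_{wy}^2 + C_2\right)} \tag{5.4}$$

$$WF = W \ast F = \begin{bmatrix} w_{11} & \cdots & w_{1j} \\ \vdots & \ddots & \vdots \\ w_{i1} & \cdots & w_{ij} \end{bmatrix} \ast \begin{bmatrix} f_{11} & \cdots & f_{1j} \\ \vdots & \ddots & \vdots \\ f_{i1} & \cdots & f_{ij} \end{bmatrix} = \begin{bmatrix} w_{11} \ast f_{11} & \cdots & w_{1j} \ast f_{1j} \\ \vdots & \ddots & \vdots \\ w_{i1} \ast f_{i1} & \cdots & w_{ij} \ast f_{ij} \end{bmatrix} \tag{5.5}$$

$$w_{ij}^{D_{\min}, D_{\max}} = \begin{cases} 1, & D_{\min} \le d_{ij} < D_{\max} \\ 0, & \text{otherwise} \end{cases} \tag{5.6}$$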
For the global similarity between two OD matrices, a series of spatial weight matrices is required to encompass the whole study area (Eq. 5.7). For instance, if the first spatial weight matrix takes the value 1 when the distance between two regions is less than 10 km (bin 1), the flows in the range of 0 to 10 km are included in a local SpSSIM value. Then, the second matrix represents spatial relationships between two regions in the range of 10 to 20 km to calculate another SpSSIM value for the next bin (bin 2). When the series of bins covers the whole region (bin n), the average of the SpSSIM values over all bins is calculated to denote the overall similarity between the two OD matrices. As with SSIM, the SpSSIM value equals 1 when the two OD matrices have exactly the same patterns.

$$\text{Global SpSSIM}(X, Y) = \frac{1}{n} \sum_{b=1}^{n} \mathrm{SpSSIM}\left(X, Y, W^{b}\right) \tag{5.7}$$
Moreover, SpSSIM can compare local inbound flows (in-flows) and outbound flows (out-flows) (Localized SpSSIM). The similarity of local directions of movements is measured by calculating SpSSIM with only columns or rows in a spatial weight matrix (Eq. 5.8). When the i-th row of two matrices are compared in a geographic bin, the value of localized SpSSIM represents the similarity of out-flows starting from the i-th region. On the other hand, the similarity of flows moving into the j-th region can be calculated by comparing the jth columns of two matrices.
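$$\text{Localized SpSSIM}(X, Y) = \begin{cases} \mathrm{SpSSIM}\left(X_i, Y_i, W^{b}\right), & \text{outflow} \\ \mathrm{SpSSIM}\left(X_j, Y_j, W^{b}\right), & \text{inflow} \end{cases} \tag{5.8}$$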
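The pieces in Eqs. 5.4 to 5.8 can be put together in a short Python sketch. It follows a literal reading of Eq. 5.5, in which the statistics are taken over every cell of the masked matrices (cells outside the distance bin contribute zeros). That reading, the constants, the helper names, and the three-region toy data are all assumptions made for illustration, not the authors' implementation.

```python
from statistics import fmean
from typing import List, Sequence

Matrix = List[List[float]]

# Assumption: small stabilizing constants; the chapter does not report its values.
C1, C2 = 1e-4, 9e-4


def ssim(x: Sequence[float], y: Sequence[float]) -> float:
    """Simplified SSIM (same form as Eq. 5.4) on two flattened cell lists."""
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) * (a - mu_x) for a in x)
    var_y = fmean((b - mu_y) * (b - mu_y) for b in y)
    cov = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    return ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / (
        (mu_x * mu_x + mu_y * mu_y + C1) * (var_x + var_y + C2))


def bin_weights(dist: Matrix, d_min: float, d_max: float) -> Matrix:
    """Eq. 5.6: weight 1 where d_min <= d_ij < d_max, otherwise 0."""
    return [[1.0 if d_min <= d < d_max else 0.0 for d in row] for row in dist]


def spssim(X: Matrix, Y: Matrix, W: Matrix) -> float:
    """Eq. 5.4 applied to the Hadamard-masked matrices of Eq. 5.5."""
    wx = [w * x for w_row, x_row in zip(W, X) for w, x in zip(w_row, x_row)]
    wy = [w * y for w_row, y_row in zip(W, Y) for w, y in zip(w_row, y_row)]
    return ssim(wx, wy)


def global_spssim(X: Matrix, Y: Matrix, dist: Matrix, edges: Sequence[float]) -> float:
    """Eq. 5.7: mean SpSSIM over consecutive distance bins [edges[b], edges[b+1])."""
    return fmean(spssim(X, Y, bin_weights(dist, lo, hi))
                 for lo, hi in zip(edges, edges[1:]))


def localized_spssim(X: Matrix, Y: Matrix, W: Matrix, index: int,
                     direction: str = "outflow") -> float:
    """Eq. 5.8: compare only row `index` (out-flows) or column `index` (in-flows)."""
    if direction == "outflow":
        pick = lambda M: [M[index]]
    else:
        pick = lambda M: [[row[index]] for row in M]
    return spssim(pick(X), pick(Y), pick(W))


def reorder(M: Matrix, order: Sequence[int]) -> Matrix:
    """Apply the same region ordering to rows and columns."""
    return [[M[i][j] for j in order] for i in order]


# Hypothetical three-region example: distances in km and two OD flow matrices.
dist = [[0, 5, 15], [5, 0, 12], [15, 12, 0]]
X = [[50, 30, 5], [20, 60, 10], [4, 8, 40]]
Y = [[45, 35, 2], [25, 55, 14], [6, 5, 42]]
edges = [0, 10, 20]  # two bins: 0-10 km and 10-20 km

print(global_spssim(X, Y, dist, edges))
print(localized_spssim(X, Y, bin_weights(dist, 0, 10), 0, "outflow"))
```

Since the statistics are computed over a set of cells selected by distance rather than by position in the matrix, reordering the regions with `reorder` (applied to `X`, `Y`, and `dist` alike) leaves the global value unchanged, which is the property that motivates replacing the square window.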
5.3.2 Bootstrap Verification
To verify the statistical significance of SpSSIM, we employed the bootstrap to estimate its distribution. The bootstrap generates a random distribution of a parameter
by iteratively resampling the observed data (Westfall & Young, 1989). It simulates samples of the same size as the observations and allows replacement (Efron, 1979). For example, when a set of observations is $X = \{x_1, x_2, x_3, \ldots, x_n\}$, a sample can be $\tilde{X}_1 = \{x_3, x_2, x_2, \ldots, x_n\}$. The sampling process is then iterated a large number of times, and statistics of the created samples are computed. It is similar to Monte Carlo simulation in that a process is repeated, but Monte Carlo simulation generates random cases under a null hypothesis rather than resampling from the existing dataset. In this study, we resampled the observed amounts of flow from each OD matrix because it is difficult to assume a null hypothesis for mobility patterns. We randomly resampled 999 times to generate the probability distribution of SpSSIM and estimated p-values of the index based on the mean and standard deviation of the resampled values. The p-values indicate whether or not the difference between the observed mobility patterns from two data sources could have arisen at random.
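The resampling procedure can be sketched as follows. The text does not give the exact resampling scheme, so this Python example makes several assumptions: cell values of each matrix are resampled independently with replacement, the index is recomputed for each of 999 resamples, and a two-sided p-value is derived from a normal approximation using the mean and standard deviation of the resampled values. The compact `ssim` helper, its constants, and the flow values are stand-ins for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev
from typing import Callable, List, Sequence, Tuple


def ssim(x: Sequence[float], y: Sequence[float],
         c1: float = 1e-4, c2: float = 9e-4) -> float:
    """Simplified SSIM on two flattened cell lists (constants are assumptions)."""
    mu_x, mu_y = fmean(x), fmean(y)
    var_x = fmean((a - mu_x) * (a - mu_x) for a in x)
    var_y = fmean((b - mu_y) * (b - mu_y) for b in y)
    cov = fmean((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2))


def bootstrap_p_value(x: Sequence[float], y: Sequence[float],
                      index: Callable[[Sequence[float], Sequence[float]], float],
                      n_iter: int = 999, seed: int = 0) -> Tuple[float, float]:
    """Return the observed index and a two-sided p-value from a normal approximation."""
    rng = random.Random(seed)
    observed = index(x, y)
    samples: List[float] = []
    for _ in range(n_iter):
        # Resample each flow list with replacement, keeping the original size.
        boot_x = rng.choices(x, k=len(x))
        boot_y = rng.choices(y, k=len(y))
        samples.append(index(boot_x, boot_y))
    mean, sd = fmean(samples), stdev(samples)
    if sd == 0:
        raise ValueError("resampled values have zero variance")
    z = (observed - mean) / sd
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return observed, p_value


# Hypothetical flattened OD flows from two data sources.
flows_a = [50, 30, 5, 20, 60, 10, 4, 8, 40]
flows_b = [45, 35, 2, 25, 55, 14, 6, 5, 42]
observed, p_value = bootstrap_p_value(flows_a, flows_b, ssim)
print(observed, p_value)
```

Fixing the seed makes the sketch reproducible; in practice the index passed as `index` would be the SpSSIM for a given distance bin rather than the plain `ssim` used here.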
5.4 Data

5.4.1 Build OD Matrices from Social Media Data

As a case study demonstrating how SpSSIM measures similarity in human mobility, we compared OD matrices generated from three data sources: Twitter, Instagram, and LODES. First, we collected georeferenced Twitter and Instagram posts in San Diego County using APIs from 12/07/2014 to 05/17/2015 (161 days). To avoid duplicated posts possibly generated by cross-posting from Instagram to Twitter, we considered only tweets posted from mobile Twitter applications (e.g., Twitter for Android, Twitter for iPhone). This removed tweets cross-posted from other social media platforms, including Instagram and Foursquare. The numbers of posts in the period were 1,916,580 for Twitter and 4,362,176 for Instagram.

From these social media posts, we extracted individual daily trip segments by connecting two temporally adjacent georeferenced points within the same day. For example, if a person posted a message on social media at location A in the morning and another at location B at night, the segment from A to B is regarded as a movement. If the person posts at three locations A, B, and C sequentially in a day, the sequence is regarded as two movements, A-B and B-C. These extracted trip segments include irrelevant data, such as segments with no movement and movements with an unrealistically high velocity. To clean up the irrelevant trip segments, we removed segments with zero distance, whose origin and destination are at the same location. We also removed segments with an average velocity greater than 65 miles per hour (mph), which is the state’s general maximum speed limit in California (California Department of Transportation, 2019). We then aggregated these trip segments to build OD matrices using 41 San Diego sub-regional areas (SRAs) as the spatial unit. SRAs represent local communities/neighborhoods, which are suitable for describing regional contexts of human mobility. From the social media data, we extracted 116,253 and 297,339 individual daily movements between SRAs from Twitter and Instagram respectively. Table 5.2 summarizes the number of data collected, processed, and generated for each social media platform.

Table 5.2 Data summary

| Source of social media | Number of posts originally collected | Number of posts after removing cross-platform data | Number of extracted daily movements |
|---|---|---|---|
| Twitter | 2,202,719 | 1,916,580 | 116,253 |
| Instagram | 4,362,176 | – | 297,339 |

To compare human mobility extracted from social media data with non-social media-based mobility data, we used the LODES data. It represents commuting patterns, or home-to-work flows, based on employer reporting records at the census block level, which cover more than 90% of all employment categories except self-employment and military personnel (Horner & Schleith, 2012). We aggregated LODES flows to the SRA level to investigate the regional context of human mobility patterns. A flow between two SRAs was defined as $f_{ij}$, indicating that a person moved from the i-th SRA to the j-th SRA in a day. We define origin regions as rows and destination regions as columns in an OD matrix, so that the sum of the i-th row denotes the total amount of out-flows from the i-th SRA and the sum of the j-th column represents the total amount of in-flows to the j-th SRA. To understand flows between neighborhoods, we removed internal flows, whose origins and destinations are the same region ($f_{ii}$). Then, we normalized the flows between SRAs into flow probabilities, which scales the flow values of all datasets to between zero and one.
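The processing steps described above (daily segments, zero-distance and speed filters, internal-flow removal, normalization) can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: the record layout, the function names, the coordinates, and the use of great-circle distance are all hypothetical, and "probability of flows" is read here as each flow's share of the total.

```python
import math
from collections import defaultdict

MAX_SPEED_MPH = 65.0      # speed threshold stated in the text
KM_PER_MILE = 1.609344


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed distance measure)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def daily_segments(posts):
    """posts: (user, day, hour, lat, lon, region) records.
    Yields (origin_region, destination_region) for valid segments."""
    by_user_day = defaultdict(list)
    for user, day, hour, lat, lon, region in posts:
        by_user_day[(user, day)].append((hour, lat, lon, region))
    for pts in by_user_day.values():
        pts.sort()  # temporal order within the day
        for (t0, la0, lo0, r0), (t1, la1, lo1, r1) in zip(pts, pts[1:]):
            km = haversine_km(la0, lo0, la1, lo1)
            hours = t1 - t0
            if km == 0:
                continue  # no movement
            if hours <= 0 or (km / KM_PER_MILE) / hours > MAX_SPEED_MPH:
                continue  # unrealistically fast
            yield r0, r1


def od_probabilities(segments, regions):
    """Count flows between regions, drop internal flows, scale to shares."""
    idx = {r: k for k, r in enumerate(regions)}
    n = len(regions)
    counts = [[0] * n for _ in range(n)]
    for o, d in segments:
        if o != d:  # internal flows f_ii are removed
            counts[idx[o]][idx[d]] += 1
    total = sum(map(sum, counts))
    return [[c / total if total else 0.0 for c in row] for row in counts]


# Hypothetical posts; coordinates are illustrative only.
posts = [
    ("u1", "2015-01-05", 8.0, 32.7157, -117.1611, "Central"),
    ("u1", "2015-01-05", 19.0, 32.8328, -117.1400, "Kearny Mesa"),
    ("u2", "2015-01-05", 9.0, 32.7157, -117.1611, "Central"),
    ("u2", "2015-01-05", 9.05, 33.1192, -117.0864, "Escondido"),  # too fast
    ("u3", "2015-01-05", 10.0, 32.7157, -117.1611, "Central"),
    ("u3", "2015-01-05", 12.0, 32.7157, -117.1611, "Central"),    # no movement
    ("u4", "2015-01-05", 7.0, 32.8328, -117.1400, "Kearny Mesa"),
    ("u4", "2015-01-05", 18.0, 32.7157, -117.1611, "Central"),
]
regions = ["Central", "Kearny Mesa", "Escondido"]
P = od_probabilities(daily_segments(posts), regions)
print(P)
```

Rows of `P` are origins and columns are destinations, following the convention defined in the text; only the two slow, non-zero segments survive the filters in this toy input.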
5.4.2 Data Description

Table 5.3 presents descriptive statistics of flows in the three data sources. Generally, all probability distributions of flows were positively skewed. Almost all movements from the three sources covered less than 50 km. Twitter and Instagram have more movements within 20 km (78.8% and 73.0% respectively) than LODES (57.5%). However, LODES has the largest total amount of flows and the lowest percentage of zero cells, which denote no movements between two SRAs. In LODES, 23 (1.4%) of the 1,640 OD pairs of SRAs (41 × 40, excluding internal flows) have no flow. In contrast, many OD pairs have no flow in Twitter and Instagram. This indicates that commuting-based mobility (LODES) was more ubiquitously distributed in San Diego County than social media-based mobility, and further that social media were used more frequently within geographically confined regions than across the whole county. Between Twitter and Instagram, Instagram-based flows (zero OD flows = 27.5%) were geographically sparser than Twitter-based flows (zero OD flows = 13.4%), even though the amount of flows in Instagram is about 2.5 times that in Twitter. One potential explanation for these patterns is that each social media platform has different usages and purposes. For example, Twitter users are more likely to post messages at their ordinary locations such as home and work, while Instagram users are more willing to share their extraordinary experiences by posting pictures at places for social activities and entertainment. Section 5.5 discusses the difference in flows between Twitter and Instagram further.

Table 5.3 Descriptive statistics of flows

| Statistic | LODES | Twitter | Instagram |
|---|---|---|---|
| Total amount | 836,974 | 116,253 | 297,339 |
| Mean | 497.9 | 69.2 | 176.9 |
| Median | 91 | 8 | 14 |
| Std. dev | 1,135.601 | 180.722 | 551.491 |
| Max | 10,438 | 1,539 | 7,163 |
| 1st quartile | 19 | 2 | 0 |
| 3rd quartile | 420 | 44 | 108 |
| # of zero (ratio) | 23 (1.4%) | 219 (13.4%) | 451 (27.5%) |
| Skewness | 4.444 | 4.605 | 7.418 |
| By 20 km (%) | 57.5 | 78.8 | 73.0 |
| By 50 km (%) | 96.5 | 98.0 | 98.1 |

The four maps in Fig. 5.3 describe the density of the total probability of flows in San Diego County and the spatial distributions of out-flows from LODES, Twitter, and Instagram respectively. The majority of flows were concentrated in the western region of San Diego County, corresponding with the population distribution. Arrows in the maps represent major flows, whose amounts exceed the mean of each source by more than 1.5 standard deviations; to avoid cluttering, only these flows are displayed. The most frequent flows from LODES moved into two regions (highlighted by the yellow border in Fig. 5.3a): Central San Diego, known as the Central Business District (CBD) of San Diego, and Kearny Mesa, known as a new business district. Compared to LODES, flows from social media revealed that frequent destinations are not limited to those two business districts. Twitter users visited comparatively diverse destinations, whereas Instagram users preferred traveling to coastal areas such as the Coastal and Peninsula SRAs (Table 5.4).
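The rule used to select the major flows drawn in Fig. 5.3 (amounts more than 1.5 standard deviations above the mean of each source) can be written as a short filter. The sketch below uses made-up flow amounts and the population standard deviation; the text does not say whether the population or sample form was used.

```python
import statistics


def major_flows(flows, k=1.5):
    """Keep OD pairs whose amount exceeds mean + k * standard deviation.

    flows: dict mapping (origin, destination) to a flow amount.
    """
    values = list(flows.values())
    threshold = statistics.mean(values) + k * statistics.pstdev(values)
    return {pair: v for pair, v in flows.items() if v > threshold}


# Hypothetical flow amounts between three regions.
flows = {
    ("A", "B"): 10, ("B", "A"): 12,
    ("A", "C"): 9, ("C", "A"): 11,
    ("B", "C"): 100, ("C", "B"): 8,
}
print(major_flows(flows))
```

Because the flow distributions are strongly right-skewed (Table 5.3), a threshold of this kind retains only a small number of dominant OD pairs, which is what keeps the maps legible.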
Fig. 5.3 Spatial distribution of flows: a Total, b LODES, c Twitter, d Instagram

5.5 Results

5.5.1 SSIM and Sensitivity

To test the sensitivity to the ordering of OD pairs, we generated two sets of reordered OD matrices for each data source and compared the results of SSIM and SpSSIM.
Table 5.4 Top 5 flows between SRAs by data sources

| Rank | LODES origin | LODES destination | Twitter origin | Twitter destination | Instagram origin | Instagram destination |
|---|---|---|---|---|---|---|
| 1 | Southeastern | Central | Central | Kearny Mesa | Coastal | Central |
| 2 | Kearny Mesa | Central | San Marcos | Escondido | Kearny Mesa | Central |
| 3 | Mid-city | Central | Kearny Mesa | Central | Central | Coastal |
| 4 | Central | Kearny Mesa | Kearny Mesa | Mid-city | Peninsula | Central |
| 5 | Del Mar-Mira Mesa | Kearny Mesa | Escondido | San Marcos | Central | Peninsula |
We reordered the pairs in two ways, based on (1) the population size and (2) the alphabetical order of the SRA names. For SSIM, we tested 8 square windows, increasing the window size from 5 to 40 cells in steps of 5 cells. For SpSSIM, we set 12 distance bins with a bin width of 10 km over a distance range of 0 to 120 km. For each bin, we calculated SpSSIM using a spatial weight matrix determined by Eq. 5.6. Cells of the nine heatmaps in Fig. 5.4 represent the probabilities of movements, with rows and columns being origins and destinations respectively. The LODES heatmaps show distinct lines that reflect an apparent tendency to move into central places, whereas the Twitter maps reveal relatively diverse patterns and the Instagram maps look like mixed versions of the other two.

Fig. 5.4 Heat maps of OD pairs

Table 5.5 The values of SSIM by window sizes

| OD order | Pair | Win size 5 | 10 | 15 | 20 | 25 | 30 | 35 | 40 |
|---|---|---|---|---|---|---|---|---|---|
| Reference | L-T | 0.675 | 0.663 | 0.684 | 0.692 | 0.697 | 0.701 | 0.681 | 0.675 |
| Reference | L-I | 0.715 | 0.729 | 0.754 | 0.758 | 0.760 | 0.743 | 0.703 | 0.668 |
| Reference | T-I | 0.823 | 0.797 | 0.785 | 0.776 | 0.788 | 0.803 | 0.806 | 0.793 |
| Reorder 1 | L-T | 0.617 | 0.635 | 0.637 | 0.642 | 0.658 | 0.666 | 0.667 | 0.679 |
| Reorder 1 | L-I | 0.682 | 0.692 | 0.694 | 0.692 | 0.683 | 0.670 | 0.650 | 0.644 |
| Reorder 1 | T-I | 0.778 | 0.776 | 0.779 | 0.782 | 0.788 | 0.785 | 0.780 | 0.779 |
| Reorder 2 | L-T | 0.637 | 0.642 | 0.660 | 0.682 | 0.695 | 0.695 | 0.691 | 0.681 |
| Reorder 2 | L-I | 0.687 | 0.676 | 0.658 | 0.646 | 0.636 | 0.632 | 0.630 | 0.644 |
| Reorder 2 | T-I | 0.779 | 0.737 | 0.758 | 0.763 | 0.767 | 0.770 | 0.772 | 0.780 |

We calculated SSIM with varying window sizes to test its sensitivity (Table 5.5). LODES and Twitter (L-T) have the lowest similarity, while Twitter and Instagram (T-I) show similar patterns. However, the SSIM values differ by order. For example, the L-T values of SSIM change at every window size when the order is shuffled. Reordered OD matrices represent the same phenomenon, but SSIM is unable to identify them as the same patterns, although the differences are slight. Moreover, SSIM fails to consider spatial correlations between OD pairs. Since the two matrices were reordered by population and by name, there is no guarantee that contiguous cells in a window are spatially close to each other. This sensitivity to pair ordering makes SSIM less reliable, and it is challenging to define a pair ordering.

Contrary to SSIM, SpSSIM produces the same value regardless of the order of OD pairs. Because the weight matrices in SpSSIM define the spatial relationships between SRAs, reordering does not affect the spatial arrangement of OD pairs, whereas SSIM compares contiguous cells in OD matrices regardless of the spatial configuration of the OD pairs. Table 5.6 shows the SpSSIM values calculated for 12 travel distance bins. Most SpSSIM values for travel distance ranges up to 80 km were significant at a 95% confidence level or better when compared to a random distribution. This demonstrates that the similarity between mobility patterns from each pair of sources is statistically significant. These results imply that SSIM is less suitable as a measurement of similarity than SpSSIM, because the former fails to produce the same values for the same events while the latter succeeds.
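The ordering argument can be illustrated with a small experiment. The sketch below is a simplified stand-in, not the chapter's SSIM or SpSSIM code. It compares a window-based SSIM over contiguous cells with an SSIM over cells selected by a binary weight matrix, before and after the regions are reordered. The constants, function names, region coordinates, and random matrices are all hypothetical.

```python
import random

C1, C2 = 1e-4, 9e-4  # assumed stabilising constants


def ssim_cells(xs, ys):
    """SSIM-style index over two equally long lists of cell values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((x - mx) * (x - mx) for x in xs) / n
    vy = sum((y - my) * (y - my) for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx * mx + my * my + C1) * (vx + vy + C2))


def windowed_ssim(X, Y, size):
    """Mean SSIM over all size x size windows of contiguous cells."""
    n = len(X)
    scores = []
    for r in range(n - size + 1):
        for c in range(n - size + 1):
            cells = [(i, j) for i in range(r, r + size)
                     for j in range(c, c + size)]
            scores.append(ssim_cells([X[i][j] for i, j in cells],
                                     [Y[i][j] for i, j in cells]))
    return sum(scores) / len(scores)


def weighted_ssim(X, Y, W):
    """SSIM over the cells flagged by the binary weight matrix W."""
    n = len(X)
    cells = [(i, j) for i in range(n) for j in range(n) if W[i][j]]
    return ssim_cells([X[i][j] for i, j in cells],
                      [Y[i][j] for i, j in cells])


def reorder(M, order):
    """Apply the same region ordering to rows and columns."""
    return [[M[i][j] for j in order] for i in order]


rng = random.Random(42)
n = 6
X = [[0.0 if i == j else rng.random() for j in range(n)] for i in range(n)]
Y = [[0.0 if i == j else rng.random() for j in range(n)] for i in range(n)]
coords = [0, 4, 9, 15, 22, 30]  # regions on a line, positions in km
W = [[int(i != j and abs(coords[i] - coords[j]) < 10) for j in range(n)]
     for i in range(n)]
order = [3, 0, 5, 1, 4, 2]      # an arbitrary reordering of regions

# Window-based comparison generally changes with the ordering.
print(windowed_ssim(X, Y, 3),
      windowed_ssim(reorder(X, order), reorder(Y, order), 3))
# Weight-based comparison selects the same OD pairs either way.
print(weighted_ssim(X, Y, W),
      weighted_ssim(reorder(X, order), reorder(Y, order), reorder(W, order)))
```

The weight-based value is unchanged up to floating-point rounding, because reordering the regions reorders the weight matrix with them. The window-based value depends on which cells happen to be adjacent in the matrix.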
Table 5.6 The values of SpSSIM by spatial weight distances

| Dmin–Dmax (km) | L-T SpSSIM | L-T Mean | L-T Std. dev | L-I SpSSIM | L-I Mean | L-I Std. dev | T-I SpSSIM | T-I Mean | T-I Std. dev |
|---|---|---|---|---|---|---|---|---|---|
| 0–10 | 0.655*** | 0.335 | 0.085 | 0.682** | 0.371 | 0.113 | 0.841*** | 0.290 | 0.107 |
| 10–20 | 0.744*** | 0.257 | 0.054 | 0.680*** | 0.328 | 0.076 | 0.732*** | 0.205 | 0.064 |
| 20–30 | 0.467*** | 0.237 | 0.049 | 0.617*** | 0.318 | 0.069 | 0.793*** | 0.192 | 0.060 |
| 30–40 | 0.377* | 0.252 | 0.052 | 0.610*** | 0.333 | 0.079 | 0.657*** | 0.206 | 0.066 |
| 40–50 | 0.466** | 0.271 | 0.060 | 0.717*** | 0.344 | 0.087 | 0.623*** | 0.221 | 0.070 |
| 50–60 | 0.523*** | 0.285 | 0.062 | 0.704*** | 0.348 | 0.093 | 0.713*** | 0.228 | 0.073 |
| 60–70 | 0.520** | 0.298 | 0.068 | 0.395 | 0.357 | 0.101 | 0.774*** | 0.251 | 0.085 |
| 70–80 | 0.594** | 0.339 | 0.088 | 0.715** | 0.377 | 0.117 | 0.878*** | 0.290 | 0.107 |
| 80–90 | 0.568 | 0.371 | 0.115 | 0.207 | 0.385 | 0.134 | 0.458 | 0.328 | 0.133 |
| 90–100 | 0.637 | 0.401 | 0.147 | 0.515 | 0.390 | 0.157 | 0.772* | 0.371 | 0.157 |
| 100–110 | 0.753 | 0.419 | 0.246 | 0.000 | 0.411 | 0.257 | 0.000 | 0.406 | 0.251 |
| 110–120 | 0.000 | 0.425 | 0.316 | 0.000 | 0.392 | 0.328 | 0.000 | 0.000 | 0.000 |
| Global | 0.525** | 0.344 | 0.018 | 0.487** | 0.366 | 0.005 | 0.603*** | 0.249 | 0.098 |

*** 99.9%, ** 99%, * 95% significance level
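As a quick arithmetic check of Eq. 5.7 against Table 5.6, the per-bin SpSSIM values can be averaged directly. The numbers below are copied from the table; only the variable names are introduced here. Averaging over all 12 bins, including those reported as 0.000, reproduces the Global row.

```python
# SpSSIM per 10 km bin (0-10 ... 110-120 km), copied from Table 5.6.
per_bin = {
    "L-T": [0.655, 0.744, 0.467, 0.377, 0.466, 0.523,
            0.520, 0.594, 0.568, 0.637, 0.753, 0.000],
    "L-I": [0.682, 0.680, 0.617, 0.610, 0.717, 0.704,
            0.395, 0.715, 0.207, 0.515, 0.000, 0.000],
    "T-I": [0.841, 0.732, 0.793, 0.657, 0.623, 0.713,
            0.774, 0.878, 0.458, 0.772, 0.000, 0.000],
}

# Eq. 5.7: the global value is the mean over the n = 12 distance bins.
global_values = {pair: round(sum(v) / len(v), 3) for pair, v in per_bin.items()}
print(global_values)  # {'L-T': 0.525, 'L-I': 0.487, 'T-I': 0.603}
```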
5.5.2 SpSSIM in San Diego County

The SpSSIM values in Table 5.6 represent the degree of similarity between the mobility patterns derived from two different data sources by distance. Overall, the mobility patterns of the two social media sources in San Diego County were more similar to each other (Global SpSSIM (T-I) = 0.603) than to LODES flows (Global SpSSIM (L-T) = 0.525; Global SpSSIM (L-I) = 0.487). The mobility similarity between LODES and Twitter is relatively high under 20 km (0.655 and 0.744). This suggests that, in the travel distance range from 0 to 20 km, Twitter users were more likely to make short trips whose origins and destinations were similar to the home and work locations reported in LODES. The SpSSIM values, however, decrease steeply from 20 to 40 km (0.467 and 0.377). The dissimilarity increases because there are far fewer Twitter-based flows than LODES commuter-based flows in this distance range (Table 5.3). In addition, the majority of flows in LODES headed to business districts such as Central San Diego and Kearny Mesa (see Fig. 5.3b), while the destinations of Twitter users were diverse, including beach areas (South Bay, Oceanside, and Del Mar-Mira Mesa) and parks (Sweetwater and Poway) as well as business districts (see Fig. 5.3c). From 40 km, the SpSSIM value gradually increases again up to 110 km, since the probabilities of longer-distance trips were close to zero in both data sources. Note that the SpSSIM values are not statistically significant beyond 80 km.

Compared to Twitter, mobility patterns from Instagram were less similar to LODES. As in the other comparisons, the SpSSIM values of LODES and Instagram are relatively high within 0 to 20 km. Unexpectedly, however, the similarity between LODES and Instagram peaked within 40–60 km. A potential explanation of this pattern is that remarkable places in San Diego are concentrated in the downtown and coastal areas, which also have many jobs. This similarity is also observed in Table 5.3 and Fig. 5.3b, d. The most frequently visited destinations in the two datasets are quite similar. This indicates that Instagram users are more willing than Twitter users to travel longer distances if attractions are far away.

On the other hand, travel patterns derived from Twitter and Instagram resemble each other. The SpSSIM value of Twitter and Instagram in the range of 0 to 10 km (0.841) is the highest in that distance bin and the second highest among all SpSSIM values. This indicates that short daily trips observed from Twitter and Instagram users share similar origins and destinations, which are clustered in Central San Diego, Kearny Mesa, Peninsula, and Mid-City (Fig. 5.3c, d). The SpSSIM value of Twitter and Instagram decreased slightly in the range of 10 to 20 km. Travel destinations of Twitter users in this distance range included suburban regions such as Escondido, San Marcos, Pendleton, and North San Diego, whereas those of Instagram users were concentrated in the downtown region of San Diego City, such as Central San Diego, Coastal, and Peninsula. The SpSSIM values decreased further as the distance range increased to 30 to 50 km, because the total probability of mobility derived from Twitter in this range (0.06) was smaller than that from Instagram (0.09). In addition, destinations of Twitter movements were more scattered than those of Instagram (Table 5.4).
5.5.3 Localized SpSSIM

Localized SpSSIM helps investigate similarity in terms of local in-flows and out-flows. Figure 5.5 shows an example comparing localized in-flow SpSSIM between LODES and Twitter from 10 to 40 km. Since in-flows describe the number of people moving into a region, they represent the popularity of places. Within 20 km, the mobility patterns from LODES and Twitter were less similar in Southeastern San Diego (highlighted by the blue border) than in other regions, with the lowest SpSSIM values in each range (0.074 and 0.120 respectively), even though the global SpSSIM is the highest in the 10–20 km range (0.744). On the other hand, the Del Mar-Mira Mesa region (green border in Fig. 5.5) showed a different pattern. While the SpSSIM value was high in the range of 0 to 20 km, the value dropped from 30 km. In the range of 20 to 40 km, SpSSIM values were 0.289 and 0.181 respectively, while the global values are 0.467 and 0.377. Although the dissimilarities were not as large as in Southeastern San Diego, this indicates that the flows moving into the region changed dramatically from 30 km.

Fig. 5.5 Localized SpSSIM (in-flows of L-T): a 10 km, b 20 km, c 30 km, d 40 km

Figure 5.6 illustrates localized SpSSIM of in-flows between Twitter and Instagram in the range of 0 to 40 km. Unlike the comparison of LODES and Twitter, most regions show relatively high values, indicating that Twitter and Instagram have similar mobility patterns. Within 10 km, the lowest localized SpSSIM value is 0.577 at the Southeastern San Diego SRA, denoting that short trip patterns of Twitter and Instagram users are relatively similar. However, in the range of 10 to 20 km, Coastal (blue border in Fig. 5.6, localized SpSSIM = 0.298) and Central San Diego (green border in Fig. 5.6, localized SpSSIM = 0.404) display dissimilar patterns compared to the global value in the range (0.732). This suggests that in-flow mobility patterns from the two social media sources in these two regions are notably dissimilar when travel distances are 10–20 km (Fig. 5.6b) or longer (Fig. 5.6c, d).

Fig. 5.6 Localized SpSSIM (in-flows of T-I): a 10 km, b 20 km, c 30 km, d 40 km

To further investigate the dissimilarity of the localized SpSSIM in those two regions, we mapped standardized differences of in-flows between two data sources, obtained by dividing the differences by the standard deviation of the differences. Figure 5.7a illustrates the standardized differences between LODES and Twitter for in-flows into the Southeastern San Diego SRA. The negative values in Fig. 5.7a indicate that the probability of movements derived from Twitter was larger than that from LODES. In other words, more Twitter users than LODES-based commuters entered Southeastern San Diego within 20 km. In particular, the movements of Twitter users from Central San Diego and National City, within 20 km of the origin, outnumbered LODES movements. The localized SpSSIM detects these large differences with markedly low values (0.074 and 0.120). Similarly, Fig. 5.7b shows the differences between Twitter and Instagram flows moving into the Coastal SRA. Blue lines with negative values indicate that more Instagram users than Twitter users entered the region. The localized SpSSIM also reflects a large in-flow of Twitter users from Mid-City to the Coastal SRA (orange line) when traveling distances are within 20 km (localized SpSSIM = 0.298), while most shorter-distance in-flows to the Coastal SRA are dominated by Instagram users.

Fig. 5.7 Standardized difference of in-flows: a Southeastern San Diego (LODES-Twitter), b Coastal (Twitter-Instagram)

Although the localized SpSSIM values measure the differences between mobility patterns formed from OD matrices of diverse sources, the measurement itself does not provide the contexts behind the (dis)similarity. Here we provide potential explanations for the discovered patterns. Dissimilarities between LODES and Twitter in Southeastern San Diego (Fig. 5.7a) can be explained by socioeconomic background. The Southeastern San Diego SRA has been one of the most impoverished areas in San Diego County (Joassart-Marcelli et al., 2014). Due to its economic decline, it has fewer job opportunities and attractive places. Furthermore, the region has historically been more ethnically diverse than other areas. According to the ACS 5-Year (2007–2011) estimates, the Hispanic population made up over 50% of the total population, followed by the African and African-American population (18.2%). Due to the lack of job opportunities in the region, LODES commuter-based in-flows were low. On the other hand, Twitter has been more popular among Hispanic, African, and African-American users than among other ethnic groups (Krogstad, 2015), which can explain the relatively larger in-flows of Twitter users into the region.

The discovered patterns can also be explained by differences in social media platform usage. Gao (2015) and Steiger et al. (2015) pointed out that tweets are most likely associated with home and workplace activities. Instagram, on the other hand, has more geotagged pictures (18.8%) than Twitter (0.6%). Since Twitter posts are more related to home and workplace activities, the similarity of LODES-Twitter is larger than that of LODES-Instagram (Table 5.6). Moreover, the dissimilarity of Twitter and Instagram in terms of in-flows into the Coastal SRA (Fig. 5.6) may stem from the many photogenic spots in the coastal region, where people are willing to share their experiences.
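The standardized differences mapped in Fig. 5.7 (differences of in-flow probabilities divided by the standard deviation of those differences) can be sketched as follows. The matrices, function name, and the choice of the population standard deviation are assumptions made for illustration.

```python
import statistics


def standardized_inflow_diff(P_a, P_b, j):
    """Standardized difference of in-flows into region j.

    For every origin i != j, compute P_a[i][j] - P_b[i][j] and divide by
    the standard deviation of those differences.
    """
    diffs = {i: P_a[i][j] - P_b[i][j] for i in range(len(P_a)) if i != j}
    sd = statistics.pstdev(list(diffs.values()))
    if sd == 0:
        return {i: 0.0 for i in diffs}
    return {i: d / sd for i, d in diffs.items()}


# Hypothetical flow probabilities (rows: origins, columns: destinations).
P_lodes = [[0.00, 0.10, 0.10, 0.05],
           [0.10, 0.00, 0.05, 0.05],
           [0.05, 0.10, 0.00, 0.10],
           [0.02, 0.08, 0.10, 0.00]]
P_twitter = [[0.00, 0.12, 0.08, 0.05],
             [0.04, 0.00, 0.10, 0.06],
             [0.09, 0.08, 0.00, 0.10],
             [0.02, 0.10, 0.06, 0.00]]

z = standardized_inflow_diff(P_lodes, P_twitter, j=0)
print(z)  # negative values: the second source sends relatively more flow
```

With the first argument taken as LODES and the second as Twitter, a negative value for an origin matches the reading given for Fig. 5.7a, where Twitter-derived probabilities exceed those from LODES.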
5.6 Conclusion

In this research, we introduced a new method, SpSSIM, to measure the similarity between two OD matrices. We demonstrated the capability of SpSSIM through a case study comparing three mobility data sources, LODES, Twitter, and Instagram, in San Diego County. In addition, we assessed the OD matrix ordering problem that arises when SSIM is applied to spatial datasets. Our sensitivity test verified that SpSSIM is robust to the arrangement of OD pairs, while conventional SSIM is sensitive to it. This is achieved by employing a series of spatial weight matrices to resolve the sensitivity to the spatial configuration. SpSSIM is also statistically verified through the bootstrap, which generates the hypothetical distribution of SpSSIM.

Our case study revealed notable similarities and differences in the mobility patterns from the three data sources. In general, the mobility patterns of the two social media sources, Twitter and Instagram, resembled each other more than either resembled LODES. The most frequent destinations of LODES were located in business districts, while social media users traveled to diverse points of interest. This is an expected result, since LODES flows consist specifically of employment-based home-work trips, whereas Twitter and Instagram flows are based on social media users who have various purposes for their trips. SpSSIM can depict similarity over travel distances. For example, the similarity between LODES and Twitter increased in the range of 0 to 20 km and decreased steeply in the range of 20 to 40 km, which is explained by the sparseness of mobility probability and the diversity of destinations in Twitter. Furthermore, SpSSIM can help discover local outliers by mapping localized values. We demonstrated in-flow (dis)similarities of LODES-Twitter and Twitter-Instagram. The localized SpSSIM values quantify and characterize the local mobility from different data sources by geographic distance.

While SpSSIM can successfully measure the similarity between two mobility datasets in the form of OD matrices, it has two limitations. The first is that the distance ranges are arbitrarily defined and SpSSIM values are sensitive to them. SpSSIM uses a series of pre-defined distance bins to overcome the sensitivity to OD pair ordering and window sizes in SSIM. Further research is required to understand the sensitivity to the defined distance bins. However, we argue that using distance instead of the window size in SSIM provides an explainable frame in terms of spatial context. Thus, SpSSIM helps measure the (dis)similarity of movements occurring within a specific spatial boundary in which researchers are interested. The second limitation is that SpSSIM does not provide the amount of flow difference. Therefore, it is necessary to map the differences to understand the contexts of discovered mobility patterns (Fig. 5.7). Nevertheless, SpSSIM, as an exploratory tool, provides the spatial distribution of similarity with localized values and a better understanding of human dynamics and complexity in urban systems. By detecting outliers, researchers can selectively focus on regions with high (dis)similarity and further study mobility contexts in those regions. In sum, this study provides a methodological approach to comparing mobility patterns in spatial context and deepens our understanding of social media data in mobility analysis.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1634641, IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References

Behara, K. N., Bhaskar, A., & Chung, E. (2017). Insights into geographical window based SSIM for comparison of OD matrices.
Behara, K. N., Bhaskar, A., & Chung, E. (2018). Levenshtein distance for the structural comparison of OD matrices.
Brunet, D., Vrscay, E. R., & Wang, Z. (2012). On the mathematical properties of the structural similarity index. IEEE Transactions on Image Processing, 21, 1488–1495.
California Department of Transportation. (2019). California highways with 70 mph speed limits. http://www.dot.ca.gov/hq/roadinfo/70mph.htm
Clayton, C. (1977). The structure of interstate and interregional migration: 1965–1970. The Annals of Regional Science, 11, 109–122.
Cresswell, T. (2012). Geographic thought: A critical introduction. Wiley.
Crooks, A., Pfoser, D., Jenkins, A., Croitoru, A., Stefanidis, A., Smith, D., Karagiorgou, S., Efentakis, A., & Lamprianidis, G. (2015). Crowdsourcing urban form and function. International Journal of Geographical Information Science, 29, 720–741.
Djukic, T. (2014). Dynamic OD demand estimation and prediction for dynamic traffic management. Delft University of Technology.
Dodge, S., Weibel, R., Ahearn, S. C., Buchin, M., & Miller, J. A. (2016). Analysis of movement data. International Journal of Geographical Information Science, 30, 825–834.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1–26.
Gao, S. (2015). Spatio-temporal analytics for exploring human mobility patterns and urban dynamics in the mobile age. Spatial Cognition and Computation, 15, 86–114.
Gao, S., Janowicz, K., Montello, D. R., Hu, Y., Yang, J. A., McKenzie, G., Ju, Y., Gong, L., Adams, B., & Yan, B. (2017). A data-synthesis-driven method for detecting and extracting vague cognitive regions. International Journal of Geographical Information Science, 31, 1245–1271.
Gao, Y., Li, T., Wang, S., Jeong, M. H., & Soltani, K. (2018). A multidimensional spatial scan statistics approach to movement pattern comparison. International Journal of Geographical Information Science, 32, 1304–1325.
Garrison, W. L., & Marble, D. F. (1964). Factor-analytic study of the connectivity of a transportation network. Papers in Regional Science, 12, 231–238.
Greenwood, S., Perrin, A., & Duggan, M. (2016). Social media update 2016.
Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., & Ratti, C. (2014). Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science, 41, 260–271.
Horner, M. W., & Schleith, D. (2012). Analyzing temporal changes in land-use-transportation relationships: A LEHD-based approach. Applied Geography, 35, 491–498.
Huang, Q., & Wong, D. W. S. (2016). Activity patterns, socioeconomic status and urban spatial structure: What can social media data tell us? International Journal of Geographical Information Science, 30, 1873–1898.
Joassart-Marcelli, P., Bosco, F. J., & Delgado, E. (2014). Southeastern San Diego’s food landscape: Challenges and opportunities.
Krogstad, J. M. (2015). Social media preferences vary by race and ethnicity.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics - Theory and Methods, 26, 1481–1496.
Larsen, J., Urry, J., & Axhausen, K. (2006). Mobilities, networks, geographies. Ashgate.
Li, Z., Wang, C., Emrich, C. T., & Guo, D. (2018). A novel approach to leveraging social media for rapid flood mapping: A case study of the 2015 South Carolina floods. Cartography and Geographic Information Science, 45, 97–110.
Liu, Y., Kang, C., Gao, S., Xiao, Y., & Tian, Y. (2012). Understanding intra-urban trip patterns from taxi trajectory data. Journal of Geographical Systems, 14, 463–483.
Martín, Y., Li, Z., & Cutter, S. L. (2017). Leveraging Twitter to gauge evacuation compliance: Spatiotemporal analysis of Hurricane Matthew. PLoS ONE, 12, 1–22.
Miller, H. J., & Shaw, S.-L. (2015). Geographic information systems for transportation in the 21st century. Geography Compass, 9, 180–189.
Nara, A., Yang, X., Ghanipoor Machiani, S., & Tsou, M.-H. (2017). An integrated evacuation decision support system framework with social perception analysis and dynamic population estimation. International Journal of Disaster Risk Reduction, 25, 190–201.
Noulas, A., Scellato, S., Lambiotte, R., Pontil, M., & Mascolo, C. (2012). A tale of many cities: Universal patterns in human urban mobility. PLoS ONE. https://doi.org/10.1371/journal.pone.0037027
Nystuen, J. D., & Dacey, M. F. (1961). A graph theory interpretation of nodal regions. Paper. Regional Science Association, 7, 29–42.
Panigutti, C., Tizzoni, M., Bajardi, P., Smoreda, Z., & Colizza, V. (2017). Assessing the use of mobile phone data to describe recurrent mobility patterns in spatial epidemic models.
Pearce, D. G. (1996). Domestic tourist travel in Sweden: A regional analysis. Geografiska Annaler: Series B, Human Geography, 78, 71–84.
Pollard, T., Taylor, N., & van Vuren, T. (2013). Comparing the quality of OD matrices in time and between data sources (pp. 1–15).
Poon, J., & Pandit, K. (1996). The geographic structure of cross-national trade flows and region states. Regional Studies, 30, 273–285.
Salvador, S., & Chan, P. (2004). Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms.
Smith, H. T. R. (1970). Concepts and methods in commodity flow analysis. Economic Geography, 46, 404–416.
Steiger, E., de Albuquerque, J. P., & Zipf, A. (2015). An advanced systematic literature review on spatiotemporal analyses of Twitter data. Transactions in GIS, 19, 809–834.
Sun, Y., Fan, H., Li, M., & Zipf, A. (2015). Identifying the city center using human travel flows generated from location-based social networking data. Environment and Planning B: Planning and Design, 43, 480–498.
Wang, Q., & Taylor, J. E. (2014). Quantifying human mobility perturbation and resilience in Hurricane Sandy. PLoS ONE, 9, 1–5.
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600–612.
Wesolowski, A., Qureshi, T., Boni, M. F., Sundsøy, P. R., Johansson, M. A., Rasheed, S. B., Engø-Monsen, K., & Buckee, C. O. (2015). Impact of human mobility on the emergence of dengue epidemics in Pakistan. Proceedings of the National Academy of Sciences, 112, 11887–11892.
Westfall, P. H., & Young, S. S. (1989). P value adjustments for multiple tests in multivariate binomial models. Journal of the American Statistical Association, 84, 780–786.
Wu, L., Zhi, Y., Sui, Z., & Liu, Y. (2014). Intra-urban human mobility and activity transition: Evidence from social media check-in data. PLoS ONE. https://doi.org/10.1371/journal.pone.0097010
Xia, Y., Wang, G.-Y., Zhang, X., Kim, G.-B., & Bae, H.-Y. (2011). Spatio-temporal similarity measure for network constrained trajectory data. International Journal of Computational Intelligence Systems, 4, 1070–1079.
Xu, M., Li, Z., Shi, Y., Zhang, X., & Jiang, S. (2015). Evolution of regional inequality in the global shipping network. Journal of Transport Geography, 44, 1–12.
Yuan, M., & Nara, A. (2015). Space-time analytics of tracks for the understanding of patterns of life. In M.-P. Kwan, D. Richardson, D. Wang, & C. Zhou (Eds.), Space-time integration in geography and GIScience (pp. 373–398). Springer Netherlands.
Yuan, Y., & Raubal, M. (2014). Measuring similarity of mobile phone user trajectories–a Spatiotemporal Edit Distance method. International Journal of Geographical Information Science, 28, 496–520.
Zheng, Y., & Zhou, X. (2011). Computing with spatial trajectories. Springer Science & Business Media.
Chapter 6
An Integrated Evacuation Decision Support System Framework with Social Perception Analysis and Dynamic Population Estimation
Atsushi Nara, Xianfeng Yang, Sahar Ghanipoor Machiani, and Ming-Hsiang Tsou
Reprinted from: Nara, A., Yang, X., Ghanipoor Machiani, S., & Tsou, M.-H. (2017). An integrated evacuation decision support system framework with social perception analysis and dynamic population estimation. International Journal of Disaster Risk Reduction, 25, 190–201. https://doi.org/10.1016/j.ijdrr.2017.09.020. With permission from Elsevier.
A. Nara (corresponding author) · M.-H. Tsou: Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
X. Yang: Department of Civil & Environmental Engineering, University of Utah, Salt Lake City, UT, USA
S. Ghanipoor Machiani: Department of Civil, Construction, and Environmental Engineering, San Diego State University, San Diego, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_6
6.1 Introduction
Effective evacuation during disastrous events is one of the most challenging issues for many local government agencies and large city traffic control centers in the U.S. To build an effective evacuation model and response plans, the response agencies need to consider the dynamic change of human population in impact areas and the social perception of local residents when designing traffic assignment plans, evacuation procedures, and shelter locations (Chen et al., 2008). Conventionally, population data come from cross-sectional, episodic government census surveys. Census data represent only the nighttime population distribution, which hardly reflects how the population changes during a day, between weekdays and weekends, or across seasons and holidays. Emerging Big Data from cellphone calls (Pei et al., 2014), social media (Tsou & Leitner, 2013), volunteered geographic information (VGI) (Elwood et al., 2012), and sensor networks (Bonomi et al., 2012) open unprecedented opportunities to analyze and model human dynamics in space and time (Yuan et al., 2014) and, furthermore, to capitalize on crowdsourcing intelligence for hazard information reporting, sharing, and modeling during disastrous events.
This paper introduces a novel data integration framework for developing an evacuation decision support system for wildfire, the Integrated Wildfire Evacuation Decision Support System (IWEDSS). IWEDSS integrates multiple data sources including social media, census surveys, geographic information systems (GIS) data layers, volunteer suggestions, and remote sensing data. It consists of four core modules: (1) dynamic population estimation, (2) stage-based robust evacuation models, (3) social perception analysis, and (4) a web-based geospatial analytics platform. The system provides key functions for data collection, traffic demand modeling, evacuation operation, and information dissemination. IWEDSS offers scientifically based and data-driven analytic tools that help evacuation planners and resource managers make better decisions, which can reduce the evacuation time and the potential number of injuries and deaths. This paper also presents a case study to demonstrate the suitability of incorporating social media data to estimate the dynamic change of human population.
6.2 Relationship to Other Research
This section reviews and discusses the literature for each of the four core modules of the IWEDSS framework.
6.2.1 Population Estimation
The widespread use of social media and mobile phone data gives researchers a great opportunity to map and analyze dynamic human behaviors, communications, and movements (Tsou, 2015). People use smartphones, mobile devices, and personal computers to build up their digital lives and leave digital footprints on the Internet. These human-made digital records provide a foundation for human dynamics research. Human dynamics is a new transdisciplinary research field attracting scientists and researchers from different domains, including complex systems (Barabási, 2005), video analysis (Bregler, 1997; Wang & Singh, 2003), and geography (Tsou, 2015). One key research question in human dynamics is the dynamic change of population distribution in urban areas. Conventionally, the change of population distribution is estimated from census surveys with data sampling and forecasting techniques. Recently, scientists have started to use satellite images (Bhaduri et al., 2007), mobile phone data (Bengtsson et al., 2011; Deville et al., 2014), or vehicle probe data (Hara & Kuwahara, 2015) to estimate the dynamic change of population distribution at the small-area level. One example is the use of mobile phone-based call detail records (CDR) to detect spatial and temporal differences in everyday activities among multiple cities (Ahas et al., 2015). Another example is the estimation of seasonal, weekend, and daily changes in population distribution over multiple timescales with aggregated and anonymized mobile phone data (Deville et al., 2014).
In GIS and cartographic research, dasymetric mapping methods have been applied to estimate population density using census data and ancillary data sources (Eicher & Brewer, 2001; Holt et al., 2004; Wright, 1936). Integrating vector-based census tracts with raster-based land cover data and satellite images for dasymetric mapping is a challenging problem. Mennis (2003) introduced a raster surface representation of population density that combines categorical ancillary data with population density. To address the traditional problems of binary values in categorical data and areal weighting, Mennis and Hultgren (2006) introduced an intelligent dasymetric mapping (IDM) technique with a data-driven methodology to calculate the ratio of class densities. Applying the concept of IDM, IWEDSS calculates population density from social media data (geotagged and check-in data) combined with other GIS data sources to estimate the dynamic distribution of the human population at different times.
Using social media for population estimation has several advantages. The real-time updates of social media messages can reflect dynamic population changes better than expensive remote sensing imagery, which requires time-consuming data collection and processing (Dong et al., 2010). Mobile phone data, such as CDR, are also very expensive, and CDR do not include the actual communication content of each phone record. In contrast, social media data are easy to collect, free (using public access methods), content-rich, and updated in real time (Tsou, 2015).
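The dasymetric idea discussed in this subsection can be illustrated with a minimal sketch. It is not the IDM algorithm of Mennis and Hultgren (2006); it only shows the basic step of redistributing a zone's known population across sub-areas in proportion to area times a relative class density. The function name, the land cover classes, and the weight values are hypothetical choices made for illustration.

```python
def dasymetric_allocate(zone_population, cells, class_weight):
    """Split one zone's population across its sub-areas.

    cells: list of (cell_id, land_cover_class, area_km2) tuples.
    class_weight: relative population density per land cover class
                  (illustrative values, not taken from the chapter).
    """
    # Weight each cell by its area times the relative density of its class.
    weighted = {cid: area * class_weight.get(lc, 0.0) for cid, lc, area in cells}
    total = sum(weighted.values())
    if total == 0:
        # No populated class present: fall back to plain areal weighting.
        weighted = {cid: area for cid, _, area in cells}
        total = sum(weighted.values())
    # The allocation preserves the zone total (assumes total area > 0).
    return {cid: zone_population * w / total for cid, w in weighted.items()}


# Hypothetical zone with three land cover patches.
cells = [("a", "residential", 2.0), ("b", "commercial", 1.0), ("c", "water", 1.0)]
weights = {"residential": 3.0, "commercial": 1.0, "water": 0.0}
allocation = dasymetric_allocate(700, cells, weights)
```

Because the shares are normalized by the sum of the weights, the allocated values always add up to the original zone population, which is the property that distinguishes this kind of redistribution from simply multiplying by a density.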
6.2.2 Stage-Based Evacuation Planning
Developing an effective evacuation plan is an important task during disaster events. Relocating people within the affected areas to safe places or shelters can significantly reduce the impact of a disaster. For evacuation planning and operations, the system input should include both the zoning of impact areas and the estimation of evacuation demand (Murray-Tuite & Wolshon, 2013). To model traffic demand at the aggregate level, dividing the impact area into a set of geographic zones is always critical (Arlikatti, 2006). However, the size and number of zones vary with the evacuation location and the type of emergency event. For example, in evacuations from natural disasters such as wildfires (Li et al., 2015) and hurricanes (Urbina & Wolshon, 2003), the evacuation zones are larger than those in downtown evacuations (Zhang & Chang, 2014). Evacuation demand modeling usually provides the number of evacuees and their departure time choice within each zone. For mandatory evacuation, the number of evacuees is obtained directly from the population size, while non-mandatory evacuation often requires estimating people's evacuate/stay decisions. In practice, many factors, such as the influence of neighbors (Baker, 1991) and the strength of social networks (Rogers et al., 1991), may affect evacuation decisions. A comprehensive review of this issue can be found in Murray-Tuite and Wolshon (2013). Given the total evacuation demand, estimating evacuees' departure times distributes the demand onto the transportation network over time. Based on empirical evidence, stated intention surveys, planner judgment, and simulation of warning message diffusion (Southworth, 1991), studies have often assumed an S-curve in various evacuation operations (e.g., wildfire: Wolshon & Marchive, 2007; Dennison et al., 2007; Cova et al., 2011; hurricane: Dixit et al., 2011; Lindell et al., 2005).
Given the zoning and traffic demand information, evacuation planning must address two critical issues: the selection of traffic routes and the determination of evacuation strategies (Murray-Tuite & Wolshon, 2013). Using simulation-based optimization techniques, one category of studies adopted microscopic and mesoscopic models to design evacuation routings. Representative tools for such applications include VISSIM (Elmitiny et al., 2007), CORSIM (Gu, 2004), DYNASMART (Murray-Tuite & Mahmassani, 2004), DynusT (Chiu et al., 2005), and DynaMIT (Balakrishna et al., 2008). In addition to those traditional methods, recent studies have also implemented agent-based simulation models for evacuation planning (Chen & Zhan, 2008; Chen et al., 2006). Instead of using simulation-based models, another category of research formulates the evacuation process with linear or nonlinear programming models. Those models often have an objective function, such as minimization of evacuation time (Sherali et al., 1991), and a set of constraints formulated with CTM (Liu et al., 2006) or other network optimization techniques (Hsu & Peeta, 2014).
Recognizing that severe congestion may occur on transportation networks during evacuation, existing studies have introduced various strategies to reduce the evacuation time. On the supply side, effective strategies include contraflow operations (Wolshon, 2008), crossing elimination (Cova & Johnson, 2003), intersection signal optimization (Chen et al., 2007), and ramp closure (Ghanipoor Machiani et al., 2013), among others. Among demand-side strategies, the effectiveness of stage-based operations has been demonstrated by many studies (Chen & Zhan, 2008; Hsu & Peeta, 2014; Liu et al., 2006; Sbayti & Mahmassani, 2006). By ranking the disaster impact in different zones, this strategy optimizes evacuation sequences in order to reduce traffic demand on roadways. However, such staging operations often require the collaboration of evacuee social networks to disseminate evacuation information.
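The S-curve departure assumption mentioned earlier in this subsection can be sketched with a logistic function, one common way to draw such a curve. The cited studies use a range of functional forms and calibrations; the logistic form, the parameter names `rate` and `half_time`, and their default values are assumptions made here for illustration only.

```python
import math


def cumulative_departure_share(t_hours, rate=1.5, half_time=3.0):
    """Share of evacuees who have departed by time t (logistic S-curve).

    rate and half_time are illustrative parameters: half_time is the time
    at which half of the demand has departed, rate controls the steepness.
    """
    return 1.0 / (1.0 + math.exp(-rate * (t_hours - half_time)))


def departures_per_interval(total_vehicles, horizon_hours, step_hours=1.0,
                            rate=1.5, half_time=3.0):
    """Vehicles loaded onto the network in each time step."""
    n_steps = int(round(horizon_hours / step_hours))
    loads = []
    previous = cumulative_departure_share(0.0, rate, half_time)
    for k in range(1, n_steps + 1):
        current = cumulative_departure_share(k * step_hours, rate, half_time)
        loads.append(total_vehicles * (current - previous))
        previous = current
    return loads


# Hypothetical zone with 10,000 evacuating vehicles over a six-hour horizon.
loads = departures_per_interval(10000, horizon_hours=6)
```

A logistic curve never reaches exactly 0 or 1, so the interval loads sum to slightly less than the total demand; a practical model would truncate or rescale the curve over the planning horizon.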
6.2.3 Social Perception Analysis and Feedback Based Evacuation Plans
The insights provided by social media data have been applied in various scientific fields; examples include disease outbreaks (Chunara et al., 2012), travel-related information through social media (Xiang & Gretzel, 2010), social tie strength evaluation (Gilbert & Karahalios, 2009), the relationship between happiness and life patterns (Frank et al., 2013), and the political power of social media (Shirky, 2011). Social media applications are also attracting rapidly growing interest in disaster research, such as studies of time-sensitive waiting times in information propagation (Spiro et al., 2012), mechanisms of information production and distribution during flooding of the Red River Valley (Starbird et al., 2010), identification of information that enhances situational awareness during the Oklahoma Grassfires and the Red River Floods (Vieweg et al., 2010), online information exchange behaviors of public organizations during the Deepwater Horizon oil spill disaster (Sutton et al., 2013), and mapping of natural disasters using geoparsed real-time tweet data streams (Middleton et al., 2014). Instead of characterizing social media in the context of a disaster, Guan and Chen (2014) characterized a disaster using social media. They introduced a "degree of disaster" measured using social media data to understand the evolution of a disaster (Guan & Chen, 2014). More closely related to behavioral studies, Liu and colleagues studied disaster behaviors by introducing a social-mediated crisis communication (SMCC) model (Jin et al., 2014; Liu et al., 2013, 2016). They examined how publics communicate about crises (Liu et al., 2013), how they consume crisis information depending on its origin (initiated from an internal organizational issue or from an issue external to the organization), and how the origin affects the preferred information form and source (Jin et al., 2014).
Although considerable research has related social media to disasters, we could not identify any research that uses social media information in real time in the transportation planning process. IWEDSS employs the power of social media in evacuation planning in real time by introducing a feedback based evacuation planning system. Decision making during an evacuation is a complex task carried out by both the authorities and individuals/households (Lim et al., 2013). In current evacuation plans, the latter usually depends on the former, but our framework incorporates the interdependency between the two decision making entities. This interdependency is expected by society. According to the American Red Cross survey, 69% of adults believe that emergency responders should be monitoring social media sites to quickly send help (Jin et al., 2014). Building this connection is expected to produce a more compliant response from the public to disaster warnings.
Disaster warnings are deemed to be a social process (Murray-Tuite & Wolshon, 2013). The interpretation of a message and the subsequent actions vary among individuals. An individual's decision making process includes several stages (Murray-Tuite & Wolshon, 2013): (1) receiving an initial message, (2) interpreting the message, (3) assessing personal risk, (4) determining whether protection is attainable, (5) determining whether protective action can be handled, (6) determining whether the action will significantly reduce the consequences, (7) assessing options, and (8) choosing an action (Perry, 1985). Any information related to these stages that is obtained through the analysis of social media will allow for more efficient evacuation planning. Sutton et al. (2008) conducted a survey about disaster information and communications technology, which showed that a majority of participants sought information online. Part of their search effort was aimed at filling gaps in official news sources. Feedback based evacuation planning makes it possible to recognize these gaps and fill them with complementary messages.
6.2.4 Web-GIS and Spatial Decision Support Systems
GIS plays a crucial role in disaster management by supporting geographical decision making in mitigation, preparedness, response, and recovery (Cova, 1999; Cutter, 2003). GIS helps decision makers conduct risk mapping, emergency planning, emergency plan activation, and damage assessment through multilayer geographical data integration composed of physical, social, demographic, and/or economic information (Cutter, 2002); spatial and spatiotemporal data analytics (Andrienko et al., 2007); simulation models (Fiedrich & Burghardt, 2007; Silva & Eglese, 2000; Torrens, 2014); social media (McClendon & Robinson, 2013); cartographic visualizations (Becker & König, 2015); and crowdsourced geographic information (Goodchild & Glennon, 2010; Zook et al., 2010). In past years, great efforts have been made in both the public and private sectors to develop GIS software, toolkits, and spatial decision support systems for disaster management (Tomaszewski, 2014; Tomaszewski et al., 2015). Hazus-Multi-Hazard (Hazus-MH) is an example developed by the Federal Emergency Management Agency (FEMA). Hazus-MH is a commonly used, standardized, standalone desktop decision support software package designed to assess the physical, economic, and social impacts of earthquakes, hurricanes, floods, and tsunami emergencies in the U.S. (Hazus, 2016). It is an add-on extension to the commercial GIS software ESRI ArcGIS and uses mapping and spatial analysis functions to produce loss estimates of the total cost of damages and casualties based on plausible disaster scenarios.
The capability of existing decision support tools like Hazus-MH, however, is often limited by their underlying assumptions and system design. First, most existing tools are based on incomplete offline information. For example, the loss estimation in Hazus-MH does not consider the dynamics of human activities or social interactions in physical and virtual space. In fact, human dynamics vary by time of day, day of the week, and month of the year, and establish congruent and incongruent mobility patterns (Yuan & Nara, 2015; Yuan et al., 2014). Furthermore, such human dynamics can be affected by accessible dynamic information, such as live traffic information, place/event recommendations, and emergency alerts and evacuation orders, which are conveyed through physical and online social interactions (e.g., meeting with friends, mobile phone applications, and social media) (Goetzke et al., 2015). These dynamic human activities, mobility patterns, and their interactions are crucial factors for decision making in emergency evacuation. Second, the combined hardware, software, and training requirements for a full GIS implementation are an obstacle for local and state emergency management personnel (Tate et al., 2011). Desktop GIS mapping and analysis tools require sophisticated knowledge of software, hardware, and databases, with a steep learning curve and a substantial time commitment (Emrich et al., 2011). Many local communities lack the resources to fully support the implementation of traditional decision support tools (Maclachlan et al., 2007). Closed, standalone desktop disaster management systems can be a major barrier to data and information sharing, participatory decision making, and timely situational awareness.
Web-GIS, an alternative approach, offers the potential to reduce these limitations. Web-GIS is a collection of network-based geographic information services that use the Internet to access geographic information, spatial analytical tools, and GIS web services (Páez et al., 2013; Peng & Tsou, 2003). By extending traditional desktop GIS functionality to the web, applications can be developed that are dynamic, accessible, interactive, and interoperable (Tate et al., 2011). Compared with desktop GIS, web-GIS offers improved spatial data access and dissemination, spatial data exploration and geovisualization, and spatial data processing, analysis, and modeling (Aye et al., 2016; Dragićević, 2004). These web-based geospatial information technologies and services are built upon the principles of Web 2.0, namely individual production and user-generated content, crowdsourcing, big data, architecture of participation, network effects, and openness (Andersen, 2007; Batty et al., 2010). Nevertheless, designing and implementing a spatial decision support system over the web presents challenges such as performance, technology integration, interoperability, security and privacy, and quality of service (Sugumaran & Sugumaran, 2007).
6.3 Framework of Integrated Wildfire Evacuation Decision Support System (IWEDSS)
The IWEDSS framework incorporates novel dynamic population estimation, evacuation models, social perception analysis, and web-GIS techniques to build a robust evacuation plan. IWEDSS provides key functions for data collection, traffic demand modeling, evacuation operation, and information dissemination, and offers scientifically based and data-driven analytic tools. IWEDSS aims to support decision making by evacuation planners and resource managers, which ultimately helps to reduce the evacuation time and the potential number of injuries and deaths. It integrates multiple data sources including social media, census surveys, GIS data layers, volunteer suggestions, and remote sensing data.
Figure 6.1 shows the four core modules of IWEDSS: dynamic population estimation, stage-based robust evacuation models, social perception analysis, and a web-based geospatial analytics platform. Using Big Data driven techniques, the first module of IWEDSS estimates the hourly population density distribution in small urban areas. This dynamic population distribution serves as the demand input for designing evacuation models. Adopting disaster impact models to predict the spatiotemporal impact of wildfire events, the second module performs a risk assessment of the urban area and determines the evacuation risk zones. A stage-based robust plan, which accounts for the uncertainty of the traffic demand estimated from population density, is then initialized for the evacuation operation. The third module analyzes public opinions and feedback from local residents based on social media text analysis and volunteer suggestions. IWEDSS utilizes the social media analytic research testbed (SMART) dashboard 2.0 (Tsou et al., 2015; Yang et al., 2016) and a mobile app (ReadySD-Social) (Tsou, 2017a, b), which collects suggestions from registered local volunteers, to monitor public opinions and suggestions from local residents near disastrous events. By integrating the hourly dynamic population model from the first module and the social perception analysis from the third module, IWEDSS estimates the movement of people during disasters and adjusts the evacuation plan and shelter locations in real time. Implemented as a web-based geospatial analytics platform, it provides an integrated computational modeling environment and user-friendly web-based analysis tools for disaster mitigation planning and emergency response. With scientifically based estimation, visualization, and analysis tools, IWEDSS delivers actionable knowledge to decision makers, resource managers, and public officers by fostering the understanding of the impacts of hazards on their communities of interest and measuring the effectiveness of mitigation strategies before, during, and after a wildfire event.
Fig. 6.1 The design framework of IWEDSS
6.3.1 Dynamic Population Distribution (Density) Estimation Model
The first module, which plays a key role in IWEDSS, estimates the hourly population density distribution in small urban areas by utilizing big data collected from historical and near real-time information, including social media data, remote sensing imagery, existing GIS data, Census demographic data, and volunteered, crowdsourced geographic information. This module includes four processes to integrate and clean heterogeneous geotagged or check-in social media data (including Twitter, Instagram, Foursquare, and Flickr) for population density estimation. The first process is collecting geographically referenced social media data through social media APIs (Application Programming Interfaces). The social media APIs provide access to various types of geographic information tied to social media posts, such as geographic coordinates (i.e., longitude and latitude), street address, city name, and state name. To analyze these data for dynamic population estimates, it is necessary to conduct geocoding, a process that converts location descriptions that are not coordinates into geographic coordinates, i.e., latitudes and longitudes. However, different types of social media data require different geocoding procedures. Thus, this module implements multi-level geocoding methods for Twitter, Instagram, Foursquare, and Flickr data by using their geotagged coordinates and the bounding boxes of check-in places. Specifically, the geocoding module utilizes five types of geocoding sources at multiple spatial scales: (1) geotagged coordinates at a point of interest level, (2) place check-in location at a point of interest level or a defined bounding box, (3) user profile location, often at a city or state level, (4) time zones, and (5) texts containing locational information (explicit or implicit) at a point of interest level using a text location centroid.
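The multi-level geocoding described above can be sketched as a fallback chain that returns the finest location source available for a post. The record keys (`geotag`, `place_bbox`, `text_location`, `profile_location`, `time_zone`), the assumption that the coarser sources have already been resolved to centroids, and the order in which the sources are tried are all hypothetical choices made for this sketch; the chapter lists the five source types but does not specify a ranking or a data schema.

```python
def bbox_centroid(bbox):
    """bbox = (min_lon, min_lat, max_lon, max_lat) -> (lon, lat) of its centre."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return ((min_lon + max_lon) / 2.0, (min_lat + max_lat) / 2.0)


def geocode_post(post):
    """Return (lon, lat, source_level) from the finest source present, or None.

    `post` is a dict with optional, hypothetical keys. Coarser sources are
    assumed to be already resolved to (lon, lat) centroids by a gazetteer.
    """
    if post.get("geotag") is not None:
        # (1) Exact geotagged coordinates.
        lon, lat = post["geotag"]
        return (lon, lat, "geotag")
    if post.get("place_bbox") is not None:
        # (2) Check-in place given as a bounding box: use its centre.
        lon, lat = bbox_centroid(post["place_bbox"])
        return (lon, lat, "place")
    # (5), (3), (4): text location, profile location, time zone (assumed order).
    for level in ("text_location", "profile_location", "time_zone"):
        if post.get(level) is not None:
            lon, lat = post[level]
            return (lon, lat, level)
    return None


posts = [
    {"geotag": (-117.16, 32.72), "profile_location": (-117.1, 32.8)},
    {"place_bbox": (-117.2, 32.7, -117.0, 32.9)},
    {"time_zone": (-120.0, 37.0)},
    {},
]
results = [geocode_post(p) for p in posts]
```

Returning the source level alongside the coordinates lets later steps filter by spatial precision, for example keeping only point-level sources when assigning posts to small zones.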
After the geocoding procedure, the second process is data cleaning, which reduces or removes sources of noise and error in the social media data, such as advertisements and marketing messages. The third process calculates the hourly average density of unique social media users for a given spatial unit. In the IWEDSS framework, Traffic Analysis Zones (TAZs), commonly used by transportation planners and modelers, are chosen as the basic spatial units of analysis. Since social media users may post multiple messages within an hour at multiple locations (Leetaru et al., 2013), we count only one geotagged message per unique user for a given hour in a TAZ. The hourly average density of social media users for each TAZ is then calculated as the total count of unique users in that hour divided by the area of the TAZ and by the number of days in the data collection period.
To demonstrate the calculation of the hourly average density of social media users, we conducted a case study based on Twitter data collected throughout 2015 (from 2015/1/1 to 2015/12/31) within the bounding box of San Diego County. The collection consists of 7,833,449 geotagged Twitter posts, or tweets, of which 2,494,011 remained after the data cleaning procedure. Figure 6.2 shows the weekday (Monday to Friday) and weekend (Saturday and Sunday) average hourly counts of unique Twitter users in San Diego County. The weekday and weekend trends show a similar pattern, with two peaks around 13:00 and 19:00 and low counts in the early morning. However, values during weekend daytime hours are higher than those on weekdays. These differences over time between weekdays and weekends suggest that population density estimates based on social media data need to be adjusted for temporal variation.
Fig. 6.2 Average count of unique Twitter users in San Diego, 2015 (n = 1,571 TAZs)
Figure 6.3 displays the spatial distributions of the average hourly Twitter user density within TAZs in San Diego County on weekdays in four time periods: (a) 0:00–1:00, (b) 6:00–7:00, (c) 12:00–13:00, and (d) 18:00–19:00. These maps depict realistic dynamic population changes by capturing higher Twitter user densities in TAZs that have very small populations according to the Census survey data. Those areas include popular points of interest such as Balboa Park, San Diego Zoo, shopping malls, and San Diego International Airport.
Finally, based on the hourly average density of unique social media users, we propose to calculate the social media based hourly population density estimate by applying temporal and spatial variation models, defined as follows (Tsou et al., 2017):
ρ(t,s) = D(t,s) ϕ(t) φ(s)
where ρ(t,s) represents the population density estimate in a temporal unit t (i.e., a certain hour) at a spatial unit s (i.e., a TAZ), D(t,s) is the average density of unique social media users in t at s, and ϕ(t) and φ(s) are scaling factors that adjust the population density estimate based on temporal and spatial variations, respectively. ϕ(t) is defined as a multiplying factor based on the hourly average frequency of social media users in each TAZ.
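The hourly average density D(t,s) and the multiplicative form of the estimate ρ(t,s) can be sketched as follows. The input layout (tuples of user, TAZ, and timestamp), the function names, and the numeric scaling factors are assumptions for illustration; the chapter does not give values for ϕ(t) or φ(s), so the ones used here are placeholders.

```python
from collections import defaultdict
from datetime import datetime


def hourly_user_density(posts, taz_area_km2, n_days):
    """Average hourly density of unique users per TAZ.

    posts: iterable of (user_id, taz_id, timestamp) for cleaned, geocoded posts.
    Returns {(taz_id, hour_of_day): users per km2 per day}.
    """
    seen = set()
    counts = defaultdict(int)
    for user_id, taz_id, ts in posts:
        # Count a user at most once per TAZ, per calendar day, per hour.
        key = (user_id, taz_id, ts.date(), ts.hour)
        if key in seen:
            continue
        seen.add(key)
        counts[(taz_id, ts.hour)] += 1
    return {
        (taz_id, hour): count / (taz_area_km2[taz_id] * n_days)
        for (taz_id, hour), count in counts.items()
    }


def population_density_estimate(d_ts, phi_t, phi_s):
    """rho(t,s) = D(t,s) * phi(t) * phi(s); the factor values are placeholders."""
    return d_ts * phi_t * phi_s


# Hypothetical two-day sample for one TAZ of 2 km2.
posts = [
    ("u1", "A", datetime(2015, 1, 5, 13, 5)),
    ("u1", "A", datetime(2015, 1, 5, 13, 40)),  # same user, same hour: ignored
    ("u2", "A", datetime(2015, 1, 5, 13, 10)),
    ("u1", "A", datetime(2015, 1, 6, 13, 20)),
]
density = hourly_user_density(posts, {"A": 2.0}, n_days=2)
rho = population_density_estimate(density[("A", 13)], phi_t=2.0, phi_s=1.5)
```

Keying the de-duplication on the calendar date as well as the hour means the same user can be counted again on another day, which is what makes the result an average over the collection period rather than a count of distinct users overall.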
The spatial factor φ(s) is defined using a dasymetric mapping method (Mennis, 2003; Tapp, 2010; Yuan et al., 1997), a geospatial technique that distributes data more accurately by using ancillary information such as land use and land cover. We specifically employ the basic concept of the intelligent dasymetric mapping (IDM) technique (Mennis & Hultgren, 2006) to refine the population density estimate based on different types of land use data (residential areas, commercial areas, etc.) and census data.
Fig. 6.3 Weekday hourly average Twitter user density distribution in San Diego in four time periods: a 0:00–1:00, b 6:00–7:00, c 12:00–13:00, and d 18:00–19:00
While the proposed approach can produce realistic hourly dynamic population density estimates, a key challenge is evaluating the outcome. In particular, model validation, which assesses the goodness-of-fit of the model to real data, is difficult because real-world dynamic population data at such a fine temporal scale and covering a large area are extremely hard to obtain. As an alternative, this paper attempts to validate the IWEDSS framework by comparing two population densities derived from Census survey data, nighttime and daytime, with the weekday and weekend average hourly unique Twitter user densities. First, we compute the nighttime "residential" population density for TAZs by aggregating 2010 Decennial Census population data from Census blocks. The daytime population refers to the number of people who are present in an area during typical business hours. It can be estimated by the commuter-adjusted population method, defined as follows (McKenzie et al., 2010):
Commuter-adjusted population = Total area population + Total workers working in area − Total workers living in area
To obtain the number of workers working or living in each TAZ, we use the U.S. Census Longitudinal Employer-Household Dynamics Origin–Destination Employment Statistics (LODES). LODES data are available at the Census block level and are therefore aggregated to TAZs to calculate the commuter-adjusted population density, which represents the daytime population. Figure 6.4a and b show the Census-based population density estimates for nighttime and daytime, respectively.
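The commuter-adjusted daytime population and its aggregation from Census blocks to TAZs can be sketched as below. The dictionary layout, the block-to-TAZ lookup, and the sample numbers are hypothetical; real LODES files have their own column names and would need to be joined to block geography first.

```python
from collections import defaultdict


def commuter_adjusted_density(blocks, block_to_taz, taz_area_km2):
    """Daytime population density per TAZ from block-level counts.

    blocks: {block_id: (residents, workers_working_in_block, workers_living_in_block)}
    block_to_taz: {block_id: taz_id}, a hypothetical spatial lookup.
    """
    daytime_population = defaultdict(float)
    for block_id, (residents, working_here, living_here) in blocks.items():
        # Commuter-adjusted population =
        #   total population + workers working in area - workers living in area
        adjusted = residents + working_here - living_here
        daytime_population[block_to_taz[block_id]] += adjusted
    return {taz: pop / taz_area_km2[taz] for taz, pop in daytime_population.items()}


# Hypothetical example: two blocks that fall in the same 2 km2 TAZ.
blocks = {"b1": (100, 50, 30), "b2": (200, 10, 80)}
daytime = commuter_adjusted_density(
    blocks, {"b1": "T1", "b2": "T1"}, {"T1": 2.0}
)
```

Block b1 gains daytime population because more people work there than leave for work, while b2 loses population, which is the contrast between "work-oriented" and "residential" areas that the validation relies on.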
We employ Spearman's rank correlation coefficients to test the relationship between the weekday and weekend hourly unique Twitter user densities and the census-based nighttime and daytime population densities. Table 6.1 shows that the hourly unique Twitter user densities are more strongly related to the daytime population density than to the nighttime "residential" population density for all hours on both weekdays and weekends. In addition, the hourly unique Twitter user densities are more strongly correlated with the nighttime population during evening hours (17:00–0:00) than during the morning to afternoon hours (4:00–16:00). These results indicate that more geotagged Twitter messages are posted from "work-oriented" locations, supporting our assumption that geotagged social media data are suitable for estimating dynamic human activities. Furthermore, we found that the correlation coefficients are generally lower for weekend densities. This suggests that, compared with weekdays, the distributions of Twitter users during weekends differ more from the "residential" and "work-oriented" population distributions.
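Spearman's rank correlation, used for the comparison above, is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below implements it with the standard library only; the function names and the sample densities are illustrative, it does not compute p-values, and it is not the code used to produce Table 6.1.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rank correlation between two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical densities for four TAZs: Twitter users vs. daytime population.
twitter_density = [0.1, 0.5, 2.0, 9.0]
daytime_density = [10.0, 20.0, 25.0, 400.0]
rho_s = spearman_rho(twitter_density, daytime_density)
```

A rank-based coefficient suits this comparison because Twitter user densities and census densities are on very different scales and are strongly skewed; only the ordering of TAZs matters.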
Fig. 6.4 Census based night time (a) and daytime (b) population density distributions
6.3.2 Stage-Based Robust Evacuation Operation Model The second module is a stage-based robust evacuation operation model using the estimated population density derived from the first module. By dividing the entire city into a set of TAZs, where a zone of under 3,000 people is in common, the module determines the evacuation sequence for each TAZ along with suggested routings. In addition, to fully recognize the potential estimation errors of population density, the module design the evacuation plan with a robust optimization framework which considers the population input as uncertainty. After identifying the location of the occurred disaster (i.e., wildfire), IWEDSS first determines the impact areas which need to be evacuated. For example, the 2007 San Diego wildfire had over 1 million of evacuees and Fig. 6.5 shows a map of 2007 San Diego County wildfire evacuation plan (red color: burning area; orang color: fire perimeters; purple color: mandatory evacuation area; green color: reopened area). To predict the spread of wildfire over time (hourly based extent and intensity estimation), IWEDSS employs a well-established commercial software, Wildfire Analyst™, which utilizes real time weather information (wind speed and direction), land cover, terrain data, and other related factors. Then, the predicted wildfire spread areas over time are used to determine the evacuation risk zones (ERZ) in the region threatened by the disaster. The ERZ is defined as the zone containing population with highest evacuation risk which is measured by whether they can be safely evacuated before the reach of disaster impact. With time dependent assessment of the risk rate in all ERZs, this module optimizes their
Table 6.1 Spearman’s rank correlation coefficients between the hourly unique Twitter user densities and the nighttime and daytime population densities estimated based on census surveys

Time           Weekday                    Weekend
               Night time   Day time      Night time   Day time
0:00–1:00      0.615***     0.640***      0.572***     0.604***
1:00–2:00      0.564***     0.597***      0.498***     0.537***
2:00–3:00      0.451***     0.496***      0.442***     0.471***
3:00–4:00      0.399***     0.468***      0.311***     0.344***
4:00–5:00      0.317***     0.429***      0.222***     0.294***
5:00–6:00      0.453***     0.560***      0.231***     0.348***
6:00–7:00      0.473***     0.601***      0.351***     0.464***
7:00–8:00      0.446***     0.611***      0.462***     0.554***
8:00–9:00      0.428***     0.606***      0.488***     0.595***
9:00–10:00     0.427***     0.616***      0.485***     0.608***
10:00–11:00    0.423***     0.619***      0.504***     0.637***
11:00–12:00    0.431***     0.634***      0.470***     0.615***
12:00–13:00    0.435***     0.637***      0.500***     0.643***
13:00–14:00    0.450***     0.645***      0.497***     0.642***
14:00–15:00    0.473***     0.661***      0.493***     0.636***
15:00–16:00    0.496***     0.673***      0.488***     0.628***
16:00–17:00    0.501***     0.667***      0.515***     0.651***
17:00–18:00    0.529***     0.672***      0.521***     0.639***
18:00–19:00    0.555***     0.684***      0.524***     0.639***
19:00–20:00    0.576***     0.685***      0.551***     0.652***
20:00–21:00    0.608***     0.690***      0.584***     0.659***
21:00–22:00    0.619***     0.682***      0.600***     0.665***
22:00–23:00    0.653***     0.695***      0.617***     0.665***
23:00–24:00    0.646***     0.684***      0.602***     0.648***

(n = 1,571); ***p < 0.001
evacuation time according to the location of shelters, roadway capacities, and the potential traffic demand generated. The literature shows that evacuation demand modeling is a persistent challenge because traffic flows under evacuation conditions differ from those on regular days. Conventional methods usually rely on historical survey information for estimation. However, because up-to-date data are difficult to collect, such models may fall short in accuracy given the dynamic nature of population distributions. Therefore, the hourly population density estimation derived in the first module serves as a key input to evacuation demand modeling. As in most existing studies on this subject, the demand model includes two primary steps: determining the number of evacuating vehicles and predicting evacuee departure times. The first step involves an estimation of the average vehicle
Fig. 6.5 A map of San Diego County wildfire evacuation plan at 3:30 a.m., October 25, 2007 (SanGIS & Office of Emergency Services, 2007)
occupancy rate in each zone. The rate multiplied by the zone’s population then produces the total traffic demand. After assigning an evacuation sequence to each ERZ, the next step is to distribute the evacuees over time according to the prediction of their departure time choices. Many existing studies employ cumulative departure S-curves (Sorensen, 2000). IWEDSS follows the same approach but recalibrates the curve based on people’s perception of the disaster, as captured in the social media information acquired in the third module. One feature that many disasters share is their uncertain nature: exact data are unlikely to be available, and social disruption is highly likely. Under such conditions, using unreliable estimates as the input of an evacuation planning system may require much more effort to change plans during real-time operations. To overcome this issue, a robust optimization function in this module, along with the stage-based evacuation strategy, accounts for the input data uncertainties. Specifically, the uncertainty stems from the estimated traffic demand generated in each zone and from the departure time choices of evacuees. To design the evacuation routings of ERZs, this module first develops a base model, which is deterministic and formulated as a mesoscopic simulation model for computational efficiency. Given the defined uncertain input set, the module then formulates the so-called robust counterpart, an extension of the base model that takes the uncertainties into account. In strict robustness, the objective is to find a solution that is feasible for all possible cases and is able to
provide the best performance in the worst-case scenario. For this evacuation problem, the worst case in terms of time-dependent traffic demand is the one that causes the largest delay on the roadway network, for which some vehicles will need to be guided to a longer route. However, because of the tradeoff between efficiency and robustness, the strict robustness definition may lead to an overly conservative plan. Thus, this module defines a set of open balls in Cartesian space for the uncertain inputs so as to limit their variations. This robust optimization model offers an innovative way to limit the impact of traffic demand estimation errors.
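The idea of bounding the uncertain demand with a ball and then planning for the worst case inside it can be illustrated with a deliberately small sketch. It is not the mesoscopic simulation model described above. It assumes, purely for illustration, that total delay is linear in the zone demands and that the uncertainty set is a Euclidean ball around the nominal estimate; under those assumptions the worst-case delay has a closed form. The plan names, cost coefficients, and demand values are hypothetical.

```python
from math import sqrt


def worst_case_delay(cost_per_vehicle, nominal_demand, radius):
    """Supremum of sum(c_i * d_i) over demands d with ||d - nominal||_2 < radius.

    For a linear delay function, the supremum over a Euclidean ball equals the
    nominal delay plus the radius times the Euclidean norm of the cost vector.
    (Non-negativity of demand is ignored in this toy example.)
    """
    nominal = sum(c * d for c, d in zip(cost_per_vehicle, nominal_demand))
    norm = sqrt(sum(c * c for c in cost_per_vehicle))
    return nominal + radius * norm


def pick_robust_plan(plans, nominal_demand, radius):
    """Return the plan name with the smallest worst-case delay."""
    return min(plans, key=lambda name: worst_case_delay(plans[name], nominal_demand, radius))


# Hypothetical inputs: estimated vehicles from three ERZs and, for each
# candidate routing plan, minutes of delay caused per vehicle from each zone.
nominal_demand = [400.0, 250.0, 150.0]
plans = {
    "shortest_routes": [0.5, 0.5, 4.0],
    "balanced_routes": [1.2, 1.2, 1.2],
}

for radius in (0.0, 100.0):
    choice = pick_robust_plan(plans, nominal_demand, radius)
    print(f"uncertainty radius {radius:5.1f} vehicles -> {choice}")
```

With a radius of zero the sketch reduces to the deterministic base case; enlarging the radius trades nominal efficiency for protection against demand estimation errors, which is the tradeoff the ball-shaped uncertainty set is meant to control.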
6.3.3 Social Perception Analysis
The third module is composed of public opinion monitoring, semantic and trend analysis, and an evacuation adjustment plan based on social perception. People behave differently in times of crisis based on their perception of risk and danger. Therefore, efficient wildfire evacuation planning and the development of mitigation strategies require both technical transportation network analysis and an understanding of the social perception of local residents. Social opinion, and the dissemination of that opinion among individuals, can facilitate or constrain the successful implementation of an evacuation plan. Thus, an understanding of social perception can serve as a powerful decision support tool for responding effectively to wildfire evacuation. IWEDSS embeds a public opinion monitor based on an existing system, the Social Media Analytics Research Testbed (SMART) dashboard (Tsou et al., 2015; Yang et al., 2016). The SMART dashboard has been applied to many public opinion monitoring tasks, including flu outbreaks, vaccine exemptions, flooding, and wildfires. Figure 6.6 shows a screenshot of the SMART dashboard monitoring California wildfire events daily using predefined keywords. IWEDSS implements an improved SMART dashboard (SMART dashboard 2.0) by adding new hourly data updating and dynamic keyword searching functions for real-time and near-real-time analysis. In addition, it implements a social perception analysis model that applies semantic and trend analysis techniques to extract knowledge from social media texts in order to understand evacuees’ general perception, sentiments, and attitudes toward evacuation-related subjects, as well as the extent of social confirmation of official warnings and recommendations. The Multilevel Model of Meme Diffusion (M3D) is employed as a theoretical guideline in developing the semantic analysis model.
M3D is a new framework designed to describe online communication and the diffusion of memes (social media messages) via different social networks (Spitzberg, 2014). In real-time applications, the social perception analysis model is adjusted and updated using two sources: (1) social media data from the SMART dashboard 2.0 and (2) volunteers’ direct feedback through the ReadySD-Social mobile application.
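A minimal sketch of hourly keyword-based trend monitoring is given below. It is not the SMART dashboard implementation and does not attempt semantic analysis; it only counts, per hour, the messages that mention at least one keyword and reports the change between consecutive observed hours. The keyword list, the message records, and the function names are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical keyword list; a real monitor would maintain it dynamically.
KEYWORDS = ("wildfire", "evacuate", "evacuation", "shelter")


def hourly_keyword_counts(messages, keywords=KEYWORDS):
    """Count messages per hour that mention at least one keyword (case-insensitive)."""
    counts = Counter()
    for timestamp, text in messages:
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
            counts[hour] += 1
    return dict(sorted(counts.items()))


def hour_over_hour_change(counts):
    """Change in matching-message volume between consecutive observed hours.

    Hours with no matching message are absent from `counts` and are skipped.
    """
    hours = list(counts)
    return {hours[i]: counts[hours[i]] - counts[hours[i - 1]] for i in range(1, len(hours))}


# Invented (timestamp, text) records; not real social media data.
messages = [
    ("2021-01-01T09:05:00", "Smoke everywhere, is there an evacuation order?"),
    ("2021-01-01T09:40:00", "Traffic looks normal on the freeway"),
    ("2021-01-01T10:02:00", "We decided to evacuate now"),
    ("2021-01-01T10:15:00", "Shelter at the stadium is open"),
    ("2021-01-01T10:48:00", "Wildfire moving west fast"),
]

trend = hourly_keyword_counts(messages)
change = hour_over_hour_change(trend)
for hour, count in trend.items():
    print(hour.isoformat(), count)
```

Simple substring matching is used here for brevity; it over-matches (for example, keywords inside longer words) and says nothing about sentiment, which is why the module described above adds semantic analysis on top of keyword monitoring.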
Fig. 6.6 The SMART dashboard for “Wildfires in California” topic
The social perception analysis model is updated and improved by analyzing direct feedback and comments from registered volunteers, who submit them through ReadySD-Social, a mobile app for broadcasting emergency information in the San Diego region that is available for Android and iOS (Fig. 6.7) (Tsou, 2017a, b). Registered local volunteers in San Diego can use the ReadySD-Social app to retweet emergency announcements and evacuation messages from the
Fig. 6.7 ReadySD-Social, a mobile app for broadcasting emergency information in San Diego
San Diego County Office of Emergency Services. The app can be downloaded on both iOS and Android smartphones. ReadySD-Social also allows registered volunteers to send feedback and suggestions about evacuation orders and shelter locations directly to an online forum managed inside IWEDSS. These feedback texts are analyzed and integrated into the social perception model to monitor the course of the evacuation plan implementation and to track how attitudes evolve over time. Information obtained through the ReadySD-Social mobile app and the SMART dashboard 2.0 (public opinion monitor) serves as a supporting tool for evacuation planning. This information allows for greater clarity in evaluating the messages. Furthermore, it helps evaluate which channel draws more public attention when disseminating important messages, and whether people follow recommendations from authorities. It can also guide the frequency of warnings by identifying the boundary between adequate warning and over-warning or excessive fear appeals. Excessive warning frequencies could lead people to disregard warnings (the cry-wolf effect). The extracted information is incorporated in real time to tailor the evacuation planning toward more efficient and socially acceptable strategies. The results are integrated into wildfire evacuation strategies and logistics by recalibrating the plan developed in the second module.
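One way to picture this recalibration is a cumulative departure S-curve whose timing shifts with a perception score. The sketch below uses a logistic curve, a common functional form for such S-curves, and a made-up rule that pulls the half-loading time earlier as perceived risk rises. The rule, the parameter values, and the function names are assumptions for illustration and are not the calibration used in IWEDSS. The sketch also assumes the occupancy rate is expressed as persons per vehicle, so vehicle demand is the zone population divided by it; with a rate stored as vehicles per person, the same step becomes a multiplication.

```python
from math import exp


def vehicles_in_zone(population, persons_per_vehicle):
    """Vehicle demand of a zone, assuming occupancy is given in persons per vehicle."""
    return population / persons_per_vehicle


def cumulative_departure(t_hours, half_time, steepness):
    """Logistic S-curve: share of evacuees that has departed by time t."""
    return 1.0 / (1.0 + exp(-steepness * (t_hours - half_time)))


def recalibrated_half_time(base_half_time, perceived_risk):
    """Hypothetical rule: perceived risk in [0, 1] advances departures by up to 50%."""
    return base_half_time * (1.0 - 0.5 * perceived_risk)


def hourly_departures(vehicles, half_time, steepness, horizon_hours):
    """Vehicles departing in each one-hour interval of the planning horizon."""
    shares = [cumulative_departure(t, half_time, steepness) for t in range(horizon_hours + 1)]
    return [vehicles * (shares[t + 1] - shares[t]) for t in range(horizon_hours)]


# Hypothetical zone: 3,000 residents, 2.5 persons per vehicle.
vehicles = vehicles_in_zone(3000, 2.5)
baseline = hourly_departures(vehicles, half_time=4.0, steepness=1.2, horizon_hours=8)
urgent = hourly_departures(
    vehicles,
    half_time=recalibrated_half_time(4.0, perceived_risk=0.8),
    steepness=1.2,
    horizon_hours=8,
)
print("first 3 h, baseline:", round(sum(baseline[:3])))
print("first 3 h, high perceived risk:", round(sum(urgent[:3])))
```

Note that a logistic curve is slightly above zero at time 0 and never quite reaches 1, so the hourly departures sum to a little less than the total vehicle demand over a finite horizon.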
6.3.4 A Web-Based Geospatial Analytics Platform
Implemented as a web-based geospatial analytics platform, IWEDSS offers an integrated computational modeling environment and an interactive decision support system for disaster mitigation planning. The integrated computational modeling environment has three core components: (1) spatiotemporal databases; (2) analytical models; and (3) high-performance computing (HPC). The spatiotemporal databases store, update, manage, and efficiently query the multiple layers of geospatial information needed to predict the extent and intensity of disaster impacts over time and to determine ERZs. This multilayered geospatial information includes census-based sociodemographic data, transportation networks, land use/land cover data, dynamic population estimates, dynamic evacuation demand estimates, and evacuees’ perception of disasters. As the second component, a set of server-side modules on Linux servers implements the evacuation demand model and the semantic analysis model. In addition, IWEDSS implements data integration modules (described in Sect. 6.3.1) in the spatiotemporal databases to spatially and temporally integrate multiple geospatial data sources that are heterogeneous in scale, format, structure, and quality. These implementations rely on the object-relational database framework, Structured Query Language (SQL), and database functions. The third core component is the HPC solution, which expedites data preparation, database queries, and analytical computations by implementing algorithms that use database partitioning, multiple central processing units (CPUs), and graphics processing units (GPUs).
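The kind of time-dependent query such a database supports can be sketched with Python's built-in sqlite3 module. SQLite is used here only as a stand-in so the example runs anywhere; it does not reproduce the PostgreSQL/PostGIS setup of the platform, and spatial geometry is reduced to a per-zone flag. The table names, columns, and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE zone (
        zone_id INTEGER PRIMARY KEY,
        is_erz  INTEGER NOT NULL            -- 1 if the zone is an evacuation risk zone
    );
    CREATE TABLE hourly_population (
        zone_id    INTEGER NOT NULL REFERENCES zone(zone_id),
        day_type   TEXT    NOT NULL,        -- 'weekday' or 'weekend'
        hour       INTEGER NOT NULL,        -- 0..23
        population REAL    NOT NULL,        -- dynamic population estimate
        PRIMARY KEY (zone_id, day_type, hour)
    );
    """
)

# Hypothetical zones and hourly population estimates.
conn.executemany("INSERT INTO zone VALUES (?, ?)", [(1, 1), (2, 0), (3, 1)])
conn.executemany(
    "INSERT INTO hourly_population VALUES (?, ?, ?, ?)",
    [
        (1, "weekday", 14, 2100.0), (1, "weekday", 2, 900.0),
        (2, "weekday", 14, 1500.0), (2, "weekday", 2, 2600.0),
        (3, "weekday", 14, 400.0), (3, "weekday", 2, 1800.0),
    ],
)


def population_at_risk(connection, day_type, hour):
    """Total estimated population inside ERZs for a given day type and hour."""
    row = connection.execute(
        """
        SELECT COALESCE(SUM(p.population), 0.0)
        FROM hourly_population AS p
        JOIN zone AS z ON z.zone_id = p.zone_id
        WHERE z.is_erz = 1 AND p.day_type = ? AND p.hour = ?
        """,
        (day_type, hour),
    ).fetchone()
    return row[0]


print(population_at_risk(conn, "weekday", 14))  # afternoon exposure
print(population_at_risk(conn, "weekday", 2))   # night exposure
```

Keying the population table by zone, day type, and hour lets the same query answer how many people are exposed at any hour, which is the input the demand model needs when a fire perimeter changes.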
IWEDSS is designed as a web-based geographic information service that acts as an interactive decision support system for disaster mitigation planning. The web-GIS service provides geospatial tools and user-friendly graphical user interfaces (GUIs) that allow users to interactively explore and simulate “what if” scenarios to assess spatial, temporal, and social vulnerabilities before, during, and after a disaster event, with a particular focus on evacuation. The platform can quantify community functionality, evacuation effectiveness, and system dynamics to evaluate community resilience. The core system is built on a mixed configuration that takes advantage of both open-source and proprietary software resources, including ArcGIS, Wildfire Analyst™, PostgreSQL, PostGIS, MongoDB, OpenLayers, Node.js, HTML5, JavaScript, and Python.
6.4 Conclusion
This paper presents an innovative decision support system framework for wildfire evacuation that integrates social media data, GIScience, transportation, and human behavior analysis. As long as the communication grid is available, the framework can be extended to other types of disasters (e.g., tsunamis, hurricanes, and technical hazards) with some modifications. The dynamic population density model developed in IWEDSS can be applied in many areas, including urban planning, elections, business marketing, and facility management. The social perception analysis model and public opinion monitors can also help other research domains such as traffic incident detection and public campaigns, as well as other social crises and natural disasters. One of the most valuable components of the IWEDSS framework is the resident feedback network it establishes by connecting registered volunteers through a mobile phone application and an online forum. The method of building such a community network is replicable in many U.S. cities; it will provide valuable social capital for helping local communities during disaster events and make society more resilient to natural disasters.

Acknowledgements
This work was supported by the National Science Foundation under Grant No. 1634641, IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References
Ahas, R., Aasa, A., Yuan, Y., Raubal, M., Smoreda, Z., Liu, Y., Ziemlicki, C., Tiru, M., & Zook, M. (2015). Everyday space–time geographies: Using mobile phone-based sensor data to monitor urban activity in Harbin, Paris, and Tallinn. International Journal of Geographical Information Science, 29, 2017–2039.
Andersen, P. (2007). What is Web 2.0?: ideas, technologies and implications for education. JISC Bristol.
Andrienko, G., Andrienko, N., Jankowski, P., Keim, D., Kraak, M.-J., MacEachren, A., & Wrobel, S. (2007). Geovisual analytics for spatial decision support: Setting the research agenda. International Journal of Geographical Information Science, 21, 839–857.
Arlikatti, S. (2006). Risk area accuracy and hurricane evacuation expectations of coastal residents. Environment and Behavior, 38, 226–247.
Aye, Z. C., Sprague, T., Cortes, V. J., Prenger-Berninghoff, K., Jaboyedoff, M., & Derron, M.-H. (2016). A collaborative (web-GIS) framework based on empirical data collected from three case studies in Europe for risk management of hydro-meteorological hazards. International Journal of Disaster Risk Reduction, 15, 10–23.
Baker, E. J. (1991). Hurricane evacuation behavior. International Journal of Mass Emergencies and Disasters, 9, 287–310.
Balakrishna, R., Wen, Y., Ben-Akiva, M., & Antoniou, C. (2008). Simulation-based framework for transportation network management in emergencies. Transportation Research Record: Journal of the Transportation Research Board, 2041, 80–88.
Barabási, A.-L. (2005). The origin of bursts and heavy tails in human dynamics. Nature, 435, 207–211.
Batty, M., Hudson-Smith, A., Milton, R., & Crooks, A. (2010). Map mashups, Web 2.0 and the GIS revolution. Annals of GIS, 16, 1–13.
Becker, T., & König, G. (2015). Generalized cartographic and simultaneous representation of utility networks for decision-support systems and crisis management in urban environments. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 1, 19–28.
Bengtsson, L., Lu, X., Thorson, A., Garfield, R., & von Schreeb, J. (2011). Improved response to disasters and outbreaks by tracking population movements with mobile phone network data: A post-earthquake geospatial study in Haiti. PLoS Med, 8, e1001083.
Bhaduri, B., Bright, E., Coleman, P., & Urban, M. L. (2007). LandScan USA: A high-resolution geospatial and temporal modeling approach for population distribution and dynamics. GeoJournal, 69, 103–117.
Bonomi, F., Milito, R., Zhu, J., & Addepalli, S. (2012). Fog computing and its role in the internet of things. In Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing (pp. 13–16). ACM.
Bregler, C. (1997). Learning and recognizing human dynamics in video sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 568–574).
Chen, X., Meaker, J. W., & Zhan, F. B. (2006). Agent-based modeling and analysis of hurricane evacuation procedures for the Florida keys. Natural Hazards, 38, 321–338.
Chen, M., Chen, L., & Miller-Hooks, E. (2007). Traffic signal timing for urban evacuation. Journal of Urban Planning and Development, 133, 30–42.
Chen, R., Sharman, R., Rao, H. R., & Upadhyaya, S. J. (2008). Coordination in emergency response management. Communications of the ACM, 51, 66–73.
Chen, X., & Zhan, F. B. (2008). Agent-based modelling and simulation of urban evacuation: Relative effectiveness of simultaneous and staged evacuation strategies. Journal of the Operational Research Society. https://doi.org/10.1057/palgrave.jors.2602321
Chiu, Y.-C., Korada, P., & Mirchandani, P. B. (2005). Dynamic traffic management for evacuation. In 84th Annual Meeting of the Transportation Research Board.
Chunara, R., Andrews, J. R., & Brownstein, J. S. (2012). Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak. American Journal of Tropical Medicine and Hygiene, 86, 39–45.
Cova, T. J. (1999). GIS in emergency management. Geographic Information System, 2, 845–858.
Cova, T. J., & Johnson, J. P. (2003). A network flow model for lane-based evacuation routing. Transportation Research Part A: Policy and Practice, 37, 579–604.
Cova, T. J., Dennison, P. E., & Drews, F. A. (2011). Modeling evacuate versus shelter-in-place decisions in wildfires. Sustainability, 3, 1662–1687.
Cutter, S. L. (2003). GI science, disasters, and emergency management. Transactions in GIS, 7, 439–446.
Cutter, S. L. (Ed.). (2002). American hazardscapes: The regionalization of hazards and disasters. Joseph Henry Press.
de Silva, F. N., & Eglese, R. W. (2000). Integrating simulation modelling and GIS: Spatial decision support systems for evacuation planning. The Journal of the Operational Research Society, 51, 423–430.
Dennison, P. E., Cova, T. J., & Moritz, M. A. (2007). WUIVAC: A wildland-urban interface evacuation trigger model applied in strategic wildfire scenarios. Natural Hazards, 41, 181–199.
Deville, P., Linard, C., Martin, S., Gilbert, M., Stevens, F. R., Gaughan, A. E., Blondel, V. D., & Tatem, A. J. (2014). Dynamic population mapping using mobile phone data. Proceedings of the National Academy of Sciences, 111, 15888–15893.
Dixit, V., Montz, T., & Wolshon, B. (2011). Validation techniques for region-level microscopic mass evacuation traffic simulations. Transportation Research Record: Journal of the Transportation Research Board, 2229, 66–74.
Dong, P., Ramesh, S., & Nepali, A. (2010). Evaluation of small-area population estimation using LiDAR, Landsat TM and parcel data. International Journal of Remote Sensing, 31, 5571–5586.
Dragićević, S. (2004). The potential of Web-based GIS. Journal of Geographical Systems, 6, 79–81.
Eicher, C. L., & Brewer, C. A. (2001). Dasymetric mapping and areal interpolation: Implementation and evaluation. Cartography and Geographic Information Science, 28, 125–138.
Elmitiny, N., Ramasamy, S., & Radwan, E. (2007). Emergency evacuation planning and preparedness of transit facilities: Traffic simulation modeling. Transportation Research Record: Journal of the Transportation Research Board, 1992, 121–126.
Elwood, S., Goodchild, M. F., & Sui, D. (2012). Prospects for VGI research and the emerging fourth paradigm. In Crowdsourcing geographic knowledge. Springer Netherlands (pp. 361–375).
Emrich, C. T., Cutter, S. L., & Weschler, P. J. (2011). GIS and emergency management. In The SAGE handbook of GIS and society (pp. 321–343). Sage.
Fiedrich, F., & Burghardt, P. (2007). Agent-based systems for disaster management. Communications of the ACM, 50, 41–42.
Frank, M. R., Mitchell, L., Dodds, P. S., & Danforth, C. M. (2013). Happiness and the patterns of life: A study of geolocated tweets. Scientific Reports. https://doi.org/10.1038/srep02625
Ghanipoor Machiani, S., Murray-Tuite, P., Jahangiri, A., Liu, S., Park, B., Chiu, Y.-C., & Wolshon, B. (2013). No-notice evacuation management: Ramp closures under varying budgets and demand scenarios. Transportation Research Record: Journal of the Transportation Research Board, 2376, 27–37.
Gilbert, E., & Karahalios, K. (2009). Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 211–220). ACM.
Goetzke, F., Gerike, R., Páez, A., & Dugundji, E. (2015). Social interactions in transportation: Analyzing groups and spatial networks. Transportation N Y, 42, 723–731.
Goodchild, M. F., & Glennon, J. A. (2010). Crowdsourcing geographic information for disaster response: A research frontier. International Journal of Digital Earth, 3, 231–241.
Gu, Y. (2004). Integrating a Regional Planning Model (TRANSIMS) with an operational model (CORSIM). Retrieved February 3, 2016, from https://theses.lib.vt.edu/theses/available/etd-02202004-160557/
Guan, X., & Chen, C. (2014). Using social media data to understand and assess disasters. Natural Hazards, 74, 837–850.
Hara, Y., & Kuwahara, M. (2015). Traffic monitoring immediately after a major natural disaster as revealed by probe data: A case in Ishinomaki after the Great East Japan Earthquake. Transportation Research Part A: Policy and Practice, 75, 1–15.
Hazus. (2016). FEMA.gov. Retrieved February 3, 2016, from http://www.fema.gov/hazus
Holt, J. B., Lo, C. P., & Hodler, T. W. (2004). Dasymetric estimation of population density and areal interpolation of census data. Cartography and Geographic Information Science, 31, 103–121.
Hsu, Y.-T., & Peeta, S. (2014). Risk-based spatial zone determination problem for stage-based evacuation operations. Transportation Research Part C: Emerging Technologies, 41, 73–89.
Jin, Y., Liu, B. F., & Austin, L. L. (2014). Examining the role of social media in effective crisis management: The effects of crisis origin, information form, and source on publics’ crisis responses. Communication Research, 41, 74–94.
Leetaru, K., Wang, S., Cao, G., Padmanabhan, A., & Shook, E. (2013). Mapping the global Twitter heartbeat: The geography of Twitter. First Monday, 18.
Li, D., Cova, T. J., & Dennison, P. E. (2015). A household-level approach to staging wildfire evacuation warnings using trigger modeling. Computers, Environment and Urban Systems, 54, 56–67.
Lim, H., Lim, M. B., & Piantanakulchai, M. (2013). A review of recent studies on flood evacuation planning. Journal of the Eastern Asia Society for Transportation Studies, 10, 147–162.
Lindell, M. K., Lu, J.-C., & Prater, C. S. (2005). Household decision making and evacuation in response to hurricane Lili. Natural Hazards Review, 6, 171–179.
Liu, Y., Lai, X., & Chang, G.-L. (2006). Cell-based network optimization model for staged evacuation planning under emergencies. Transportation Research Record: Journal of the Transportation Research Board, 1964, 127–135.
Liu, B. F., Jin, Y., & Austin, L. L. (2013). The tendency to tell: Understanding publics’ communicative responses to crisis information form and source. Journal of Public Relations Research, 25, 51–67.
Liu, B. F., Fraustino, J. D., & Jin, Y. (2016). Social media use during disasters: How information form and source influence intended behavioral responses. Communication Research, 43, 626–646.
Maclachlan, J. C., Jerrett, M., Abernathy, T., Sears, M., & Bunch, M. J. (2007). Mapping health on the Internet: A new tool for environmental justice and public health research. Health & Place, 13, 72–86.
McClendon, S., & Robinson, A. C. (2013). Leveraging geospatially-oriented social media communications in disaster response. International Journal of Information Systems for Crisis Response and Management, 5, 22–40.
McKenzie, B., Koerber, W., Fields, A., Benetsky, M., & Rapino, M. (2010). Commuter-adjusted population estimates: ACS 2006–10. Journey to Work and Migration Statistics Branch, US Census Bureau.
Mennis, J. (2003). Generating surface models of population using dasymetric mapping. The Professional Geographer, 55, 31–42.
Mennis, J., & Hultgren, T. (2006). Intelligent dasymetric mapping and its application to areal interpolation. Cartography and Geographic Information Science, 33, 179–194.
Middleton, S. E., Middleton, L., & Modafferi, S. (2014). Real-time crisis mapping of natural disasters using social media. IEEE Intelligent Systems, 29, 9–17.
Murray-Tuite, P., & Mahmassani, H. (2004). Transportation network evacuation planning with household activity interactions. Transportation Research Record: Journal of the Transportation Research Board, 1894, 150–159.
Murray-Tuite, P., & Wolshon, B. (2013). Evacuation transportation modeling: An overview of research, development, and practice. Transportation Research Part C: Emerging Technologies, 27, 25–45.
Páez, A., Moniruzzaman, Md., Bourbonnais, P.-L., & Morency, C. (2013). Developing a web-based accessibility calculator prototype for the Greater Montreal Area. Transportation Research Part A: Policy and Practice, 58, 103–115.
Pei, T., Sobolevsky, S., Ratti, C., Shaw, S.-L., Li, T., & Zhou, C. (2014). A new insight into land use classification based on aggregated mobile phone data. International Journal of Geographical Information Science, 28, 1988–2007.
Peng, Z.-R., & Tsou, M.-H. (2003). Internet GIS: Distributed geographic information services for the internet and wireless networks. Wiley.
Perry, R. W. (1985). Comprehensive emergency management: Evacuating threatened populations (Contemporary Studies in Applied Behavioral Science, Vol. 3). Jai Pr.
Rogers, G. O., & Sorensen, J. H. (1991). Diffusion of emergency warning: Comparing empirical and simulation results. In C. Zervos, K. Knox, L. Abramson, & R. Coppock (Eds.), Risk analysis (pp. 117–134). Springer.
SanGIS, Office of Emergency Services. (2007). San Diego County Fire Map.
Sbayti, H., & Mahmassani, H. (2006). Optimal scheduling of evacuation operations. Transportation Research Record: Journal of the Transportation Research Board, 1964, 238–246.
Sherali, H. D., Carter, T. B., & Hobeika, A. G. (1991). A location-allocation model and algorithm for evacuation planning under hurricane/flood conditions. Transportation Research Part B: Methodological, 25, 439–452.
Shirky, C. (2011). The political power of social media. Foreign Affairs, 90, 28–41.
Sorensen, J. H. (2000). Hazard warning systems: Review of 20 years of progress. Natural Hazards Review, 1, 119–125.
Southworth, F. (1991). Regional evacuation modeling: A state of the art review. https://doi.org/10.2172/814579
Spiro, E., Irvine, C., DuBois, C., & Butts, C. (2012). Waiting for a retweet: Modeling waiting times in information propagation. In 2012 NIPS workshop of social networks and social media conference. Retrieved 12, from http://snap.stanford.edu/social2012/papers/spiro-dubois-butts.pdf
Spitzberg, B. H. (2014). Toward a model of Meme Diffusion (M3D). Communication Theory, 24, 311–339.
Starbird, K., Palen, L., Hughes, A. L., & Vieweg, S. (2010). Chatter on the red: What hazards threat reveals about the social life of microblogged information. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (pp. 241–250). ACM.
Sugumaran, V., & Sugumaran, R. (2007). Web-based Spatial Decision Support Systems (WebSDSS): Evolution, architecture, examples and challenges. Communications of the Association for Information Systems, 19, 40.
Sutton, J., Spiro, E., Butts, C., Fitzhugh, S., Johnson, B., & Greczek, M. (2013). Tweeting the spill: Online informal communications, social networks, and conversational microstructures during the Deepwater Horizon oilspill. International Journal of Information Systems for Crisis Response and Management, 5, 58–76.
Sutton, J., Palen, L., & Shklovski, I. (2008). Backchannels on the front lines: Emergent uses of social media in the 2007 southern California wildfires. In Proceedings of 5th International ISCRAM Conference (pp. 624–632).
Tapp, A. F. (2010). Areal interpolation and dasymetric mapping methods using local ancillary data sources. Cartography and Geographic Information Science, 37, 215–228.
Tate, E., Burton, C. G., Berry, M., Emrich, C. T., & Cutter, S. L. (2011). Integrated hazards mapping tool. Transactions in GIS, 15, 689–706.
Tomaszewski, B. (2014). Geographic Information Systems (GIS) for disaster management. CRC Press.
Tomaszewski, B., Judex, M., Szarzynski, J., Radestock, C., & Wirkus, L. (2015). Geographic information systems for disaster response: A review. Journal of Homeland Security and Emergency Management, 12, 571–602.
Torrens, P. M. (2014). High-resolution space–time processes for agents at the built–human interface of urban earthquakes. International Journal of Geographical Information Science, 28, 964–986.
Tsou, M.-H. (2015). Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science, 42, 70–74.
Tsou, M.-H., & Leitner, M. (2013). Visualization of social media: Seeing a mirage or a message? Cartography and Geographic Information Science, 40, 55–60.
Tsou, M.-H. (2017a). ReadySD social for Android (1.1) [Mobile Application Software].
Tsou, M.-H. (2017b). ReadySD social for iOS (1.1) [Mobile Application Software].
Tsou, M.-H., Jung, C.-T., Allen, C., Yang, J.-A., Gawron, J.-M., Spitzberg, B. H., & Han, S. (2015). Social Media Analytics and Research Test-bed (SMART Dashboard). In Proceedings of the 2015 International Conference on Social Media & Society (pp. 2:1–2:7). ACM.
Tsou, M.-H., Zhang, H., Nara, A., & Han, S.-Y. (2017). Estimating hourly population distribution change at high spatiotemporal resolution in urban areas using geo-tagged tweets, land use data, and dasymetric maps.
Urbina, E., & Wolshon, B. (2003). National review of hurricane evacuation plans and policies: A comparison and contrast of state practices. Transportation Research Part A: Policy and Practice, 37, 257–275.
Vieweg, S., Hughes, A. L., Starbird, K., & Palen, L. (2010). Microblogging during two natural hazards events: What twitter may contribute to situational awareness. In Proceedings of SIGCHI Conference on Human Factors Computing System (pp. 1079–1088). ACM.
Wang, J. J., & Singh, S. (2003). Video analysis of human dynamics: A survey. Real-Time Imaging, 9, 321–346.
Wolshon, B. (2008). Empirical characterization of mass evacuation traffic flow. Transportation Research Record: Journal of the Transportation Research Board, 2041, 38–48.
Wolshon, B., & Marchive, E. (2007). Emergency planning in the urban-wildland interface: Subdivision-level analysis of wildfire evacuations. Journal of Urban Planning and Development ASCE. https://doi.org/10.1061/(ASCE)0733-9488(2007)133:1(73)
Wright, J. K. (1936). A method of mapping densities of population: With Cape Cod as an example. Geographical Review. https://doi.org/10.2307/209467
Xiang, Z., & Gretzel, U. (2010). Role of social media in online travel information search. Tourism Management, 31, 179–188.
Yang, J.-A., Tsou, M.-H., Jung, C.-T., Allen, C., Spitzberg, B. H., Gawron, J. M., & Han, S.-Y. (2016). Social media analytics and research testbed (SMART): Exploring spatiotemporal patterns of human dynamics with geo-targeted social media messages. Big Data & Society, 3, 2053951716652914.
Yuan, Y., Smith, R. M., & Limp, W. F. (1997). Remodeling census population with spatial information from Landsat TM imagery. Computers, Environment and Urban Systems, 21, 14.
Yuan, M., Nara, A., & Bothwell, J. (2014). Space–time representation and analytics. Annals of GIS, 20, 1–9.
Yuan, M., & Nara, A. (2015). Space-time analytics of tracks for the understanding of patterns of life. In M.-P. Kwan, D. Richardson, D. Wang, & C. Zhou (Eds.), Space-Time Integration in Geography and GIScience (pp. 373–398). Springer Netherlands.
Zhang, X., & Chang, G. (2014). A dynamic evacuation model for pedestrian–vehicle mixed-flow networks. Transportation Research Part C: Emerging Technologies, 40, 75–92.
Zook, M., Graham, M., Shelton, T., & Gorman, S. (2010). Volunteered geographic information and crowdsourcing disaster relief: A case study of the Haitian earthquake. World Medical & Health Policy, 2, 7–33.
Chapter 7
Learning Dependence Relationships of Evacuation Decision Making Factors from Tweets
Atsushi Nara, Sahar Ghanipoor Machiani, Nana Luo, Alidad Ahmadi, Karen Robinett, Ken Tominaga, Jaehee Park, Chanwoo Jin, Xianfeng Yang, and Ming-Hsiang Tsou

A. Nara (corresponding author) · N. Luo · K. Robinett · K. Tominaga · J. Park · C. Jin · M.-H. Tsou
Center for Human Dynamics in the Mobile Age and Department of Geography, San Diego State University, San Diego, CA, USA
e-mail: [email protected]
S. G. Machiani · A. Ahmadi
Department of Civil, Construction, and Environmental Engineering, San Diego State University, San Diego, CA, USA
X. Yang
Department of Civil & Environmental Engineering, University of Utah, Salt Lake City, UT, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_7

7.1 Introduction
Individuals react very differently to evacuation orders. Their decisions to evacuate depend on a variety of factors such as risk perception, personal knowledge and experience, age, medical condition, presence of children or elderly, and pet ownership. Identifying and quantifying such key factors and understanding how they affect an individual’s evacuation decisions can help emergency response organizations improve evacuation plans and communication strategies. Conventionally, researchers have studied human evacuation behaviors by conducting post-disaster surveys, which can be costly, limited by sampling methods, and dependent on respondent availability, resulting in delayed responses. Social media data analytics is an alternative approach to examining human behaviors during a disaster: social media have become an important communication channel, and researchers can access a large amount of data almost instantly at a relatively low cost.
In recent years, Twitter has become a popular communication tool during emergency events such as hurricanes (Alam et al., 2018; Lachlan et al., 2014; Sadri et al., 2018) and wildfires (Latonero & Shklovski, 2011; Vieweg et al., 2010; Wang et al., 2016). For example, evacuees learn about the most recent news and evacuation orders by following their local emergency responding organizations’ Twitter accounts. This social media based communication is not only one-way. Emergency responding organizations may also use information that is shared by social media users. In this regard, studies have examined how analyzing public tweets can improve evacuation decisions. For instance, in a study of tweets posted during two incidents, the 2009 Red River flood and the 2009 Oklahoma grass fires, Vieweg et al. (2010) demonstrated how analyzing Twitter data can improve situational awareness in safety-critical situations and identified features that should be taken into account in systems that improve situational awareness. In another study, Latonero and Shklovski (2011) discussed how emergency responding organizations such as the Los Angeles Fire Department (LAFD) use Twitter as a two-way communication tool, both to disseminate information to the public and to monitor the public’s tweets for emergency management purposes.
Investigating tweets that are shared by social media users within emergency event areas can provide useful information. For example, during the 2014 wildfire in San Diego, CA, a user tweeted “There’s still a fire going on in 4S. However, my neighborhood hasn’t received an evacuation notice …”. This tweet contains several pieces of information, one of the most important being that there are likely people within a certain area who have not evacuated yet and who might be in danger. In another example, a person tweeted “Got the evacuation call and packed the essentials …”. This tweet suggests that the user plans to evacuate, if they have not done so already. These are only a few examples of tweets from users in an emergency area.
However, if such tweets are investigated on a larger scale, they can provide emergency responding organizations with useful information, including evacuees’ behavior, their compliance with evacuation orders, evacuation rates, neighborhood situations, and factors that prevent evacuees from following evacuation orders.
The objective of this study is to utilize social media data to examine the relationship between human decision-making factors for wildfire evacuations and evacuation response behaviors. We examine previous studies on evacuation behavior, evacuees’ decision making, the use of social media, and emergency evacuation modeling and methodologies. Building on previous models and frameworks, we propose a new evacuation decision-making conceptual model that categorizes and demonstrates the relationship between factors that have an impact on evacuees’ decisions regarding evacuations. We then employ a Bayesian Network model to analyze Twitter data obtained from the 2017 Lilac Fire incident in San Diego, California, and to learn dependence relationships among individuals’ evacuation decision-making factors.
7 Learning Dependence Relationships of Evacuation …
7.2 Literature Review
7.2.1 Evacuation Decision Making Models and Key Factors
To examine evacuees’ responses to evacuation orders based on Twitter data, we first reviewed previous evacuation behavior models and empirical studies to assess the decision-making process of evacuees and the key factors affecting an individual’s decision to evacuate. In one of the early modeling efforts, Sorensen et al. (1987) developed an evacuation behavior model (see Fig. 7.1) demonstrating how different factors influence an individual’s behavior during evacuation. In another study, Perry (1979) developed a similar framework that demonstrates the relationship between important factors that affect an individual’s decision on evacuation. Both frameworks included similar evacuation factors in one form or another, but neither broke down or connected all of the factors. Figure 7.1 displays Sorensen’s model, which clearly draws pathways of evacuation decision-making. These models, however, are limited by rigid one-way connections that make it difficult to evaluate the relationship between the decision-making factors and the response of individuals. The models are also based on general factors, which can be further specified with detailed factors to better understand an individual’s decision-making process. For example, “social ties” as one of the factors could include a variety of categories such as family, friends, pets, and community.
While the aforementioned studies are concerned with the development of decision-making models, other studies have focused on identifying how different factors may affect individuals’ decisions on evacuation. A number of articles review the literature on factors affecting evacuation decisions (Alawadi et al., 2020; Huang et al., 2016; Murray-Tuite & Wolshon, 2013; Thompson et al., 2017). Table 7.1 provides a summary of empirical studies that examined key factors affecting individuals’ evacuation decisions.
The effect of each factor was evaluated based on how an evacuee’s decision is impacted by the factor; +, −, and +/− in the effect column denote “more likely to evacuate”, “less likely to evacuate”, and “mixed
Fig. 7.1 Evacuation behavior model (modified from Sorensen et al., 1987)
–
Presence of elderly
Hurricane Bonnie, North Carolina, 1998
Whitehead et al. (2000)
895
Hurricanes Bonnie and Floyd
Van Willigen et al. (2002)
Hurricane Bonnie 1998
Bateman and Edwards (2002)
1,050
Hurricane Floyd and six-post hurricanes in South Carolina 1999
Dow and Cutter (2000)
Yuba County Flooding, California 1997
397
+
Significant others (family or friends) decide to leave
Heath et al. (2001)
−
No place to evacuate
–
Mandatory evacuation order
(continued)
+
Presence of physically disabled family members –
+
Low risk perception Gender: Female
+
High risk perception
–
Presence of pets
+
–
Presence of elderly Previous experience with disaster
+
Presence of children
–
−
Presence of pets
Less warning time
−
Low risk perception
Effect +
Factor
Hurricane Andrew 1992, Hurricane Erin 1995 High risk perception
Incident
Cyclones on the north-east coast of Queensland
944
Gladwin et al. (2001)
Raggatt et al. (1993)
N
Study
Table 7.1 Key factors affecting individuals’ evacuation decisions
350
Sadri et al. (2017)
Baker (1991)
N
Study
Table 7.1 (continued)
12 hurricanes from 1961 to 1989 throughout the United States
Hurricane Sandy, New York, 2012
Incident – +
Presence of pets Higher intensity of hurricanes
+
Received the evacuation order from officials Living in a mobile home
(continued)
+ +
Living in high risk areas
+
+
Social network size (number of social ties) of three or more High risk perception
+ +
Previous evacuation experience Living near ocean or large water body
+
Mandatory evacuation order Living in a mobile home
– +
Married individuals
+
+
Living in a mobile home
Elderly individuals
Effect
Factor
9,048
Smith and McCarty (2009)
Cohn et al. (2006)
N
Study
Table 7.1 (continued)
3 wildfires in Colorado in 2002, Arizona in 2002, and Montana in 2000
4 hurricanes in Florida, 2004
Incident
+ + – –
Gender: Female Presence of children younger than 18 years old Homeownership Presence of pets
+ – +/−
Previous experience with wildfires
–
No means of transportation to evacuate Low risk perception
–
No place to evacuate High risk perception
–
Medical condition that impedes ability to leave
Job responsibilities that do not permit leaving –
+
Living in a mobile home
Effect +
Factor Higher storm strength
effect”, respectively. For example, Gladwin et al. (2001) interviewed 944 residents of Florida who were present during two hurricanes, Andrew in 1992 and Erin in 1995, and employed an ethnographic decision tree analysis with a series of questions to predict whether an individual would eventually decide to evacuate or stay. The study found that a high risk perception and an evacuation decision made by significant others (e.g., family, friends) positively affected an individual’s decision to evacuate, whereas a low risk perception, the presence of pets or elderly, and having no place to evacuate affected it negatively.
The previous empirical works listed in Table 7.1 identified key individual evacuation decision-making factors from a variety of emergency events including hurricanes, cyclones, flooding, and wildfires. Some of the factors are specific to the type of disaster event (e.g., severity of hurricanes, living near a water body); however, most factors are less event specific and relate to many types of events, such as family safety or the evacuee’s perceived risks. Several key factors are listed in multiple studies, including high risk perception (e.g., perceived high severity of the disaster), low risk perception (e.g., perceived safety of home), presence of children, elderly, and disabled people, pet ownership, previous experience, demographic characteristics (e.g., gender), and social networks based on kin relationships, friends, and communities. The effects of these factors are generally in agreement across different studies.
7.3 Methodology
To build a model that can examine the relationship between individuals’ decision-making factors and their effects on wildfire evacuation and response behaviors using social media data, our methodological approach consists of three phases: (1) building a conceptual model; (2) collecting and coding Twitter data (described in Sects. 7.4.2 and 7.4.3, respectively); and (3) learning decision-making structures from Twitter data using a Bayesian network model.
7.3.1 Evacuation Decision-Making Model
Based on the literature review of evacuation decision-making models and key factors that affect individuals’ evacuation decisions, we propose a new evacuation decision-making model that categorizes and demonstrates the relationship between factors that have an impact on evacuees’ attitudes and decisions (Fig. 7.2). The factors are grouped into four categories. The first is the personal experience and knowledge of individuals, including the perceived risk. The second category describes evacuation information, including whether individuals received evacuation information, the source of the information, the credibility of the information or source, and the type of emergency notification. The third category is the individual/family situation. This includes pet ownership, medical condition, disability, home ownership, living in a mobile home, job responsibility, family safety, and whether there are seniors, children, or disabled individuals/family members in the household. The fourth category involves the evacuation situation: whether there is enough time to evacuate, whether there is a place and a way to evacuate, and how crowded the area is. These factors independently or jointly affect an individual’s evacuation decisions.
The proposed model consists of two types of decision-making outcomes, Attitude Toward Evacuation and Evacuation Status. The former is a positive, negative, or neutral attitude toward evacuation, describing whether individuals are more or less likely to decide to evacuate. The latter is the actual evacuation status (pre-evacuation or in preparation to evacuate, evacuating, evacuated, or not evacuated). Evacuation Status is a direct variable for explaining an individual’s evacuation decision-making outcome; however, because this study used tweets and Twitter users may not directly mention their evacuation status in their messages, Attitude Toward Evacuation was introduced to the model.
Fig. 7.2 Evacuation decision-making conceptual model
7.3.2 Bayesian Networks
We employed Bayesian Networks (BNs) to describe how evacuation decision-making factors are related to each other and affect an individual’s evacuation decision. A BN is a probabilistic graphical model that learns and estimates conditional dependencies among a set of variables using Bayesian statistical inference. A BN is defined by a network structure consisting of a directed acyclic graph (DAG), in which nodes represent variables and arcs represent probabilistic relationships between pairs of variables. The DAG contains only directed arcs and does not contain any loop or cycle. A parent node in a DAG is the node at the tail of an arc, and a child node is the node at the head. In general, each arc in a DAG represents a direct dependence relationship between two nodes. Indirect dependence relationships are implicitly represented as sequences of arcs, or a path, connecting two nodes through one or more mediating nodes. Note that the directionality of arcs could be intuitively interpreted as a cause-and-effect relationship; however, it is challenging to justify causal effects in a BN learned from observational data (Pearl, 2009; Scutari & Denis, 2014). In a BN, a probability distribution P on a set of variables X = {X_1, …, X_n} is defined as follows:

$$P(X) = \prod_{k=1}^{n} P\left(X_k \mid \Pi_{X_k}\right)$$

where $\Pi_{X_k}$ is the set of parents of $X_k$; when $X_k$ has more than one parent node, the probability depends on the parents’ joint distribution.
BNs are learned from observations in a two-step process: structure learning and parameter learning. The first step learns the structure of the DAG that best describes the data, while the second learns the conditional probability distributions of the variables implied by the structure of the DAG learned in the first step. Structure learning methods are classified into three approaches: (1) score-based algorithms that use goodness-of-fit scores as objective functions to maximize; (2) constraint-based algorithms that use conditional independence tests to learn the dependence structure of the data; and (3) hybrid algorithms that combine both approaches (Scutari et al., 2019). The score-based approach is often more accurate and robust than the constraint-based approach but becomes slower as the number of variables increases. In addition, prior knowledge can be combined with the machine-based structure learning process. For example, based on prior knowledge, certain arcs can be forced to be included in or excluded from the network during the structure learning process by setting whitelists or blacklists, or they can be manually specified. Using the structure learned from the data and prior knowledge, the parameter learning process estimates the parameters of the conditional probability distributions. Two common methodologies for this process are Maximum Likelihood Estimation (MLE), which chooses parameters by maximizing the likelihood function, and Bayesian estimation, which fits parameters using the expected value of the posterior distribution. In this study, we used the bnlearn R package (Scutari, 2010) with a Tabu greedy search score-based algorithm for structure learning and Bayesian estimation for parameter learning. We used tenfold cross-validation to evaluate how well the learned BN represents the dependence structure of the data.
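The factorization and the two parameter-learning options described above can be illustrated with a minimal Python sketch. It is not the bnlearn workflow used in the study: the three-node network, the variable levels, the four records, and the uniform Dirichlet prior with total weight `iss` are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy network (not the DAG learned in the chapter):
# risk -> attitude -> status
PARENTS = {"risk": [], "attitude": ["risk"], "status": ["attitude"]}
LEVELS = {
    "risk": ["high", "low"],
    "attitude": ["positive", "negative"],
    "status": ["evacuated", "not_stated"],
}


def fit_cpts(rows, iss=1.0):
    """Estimate P(node | parents) for every node.

    iss is the total weight of a uniform Dirichlet prior (an assumed,
    illustrative choice). iss=0 gives the maximum likelihood estimate;
    iss>0 gives the posterior mean (Bayesian estimation).
    """
    cpts = {}
    for node, parents in PARENTS.items():
        values = LEVELS[node]
        configs = list(product(*(LEVELS[p] for p in parents)))
        counts = Counter((tuple(r[p] for p in parents), r[node]) for r in rows)
        alpha = iss / (len(configs) * len(values))  # prior weight per cell
        table = {}
        for cfg in configs:
            total = sum(counts[(cfg, v)] for v in values) + alpha * len(values)
            for v in values:
                # Fall back to a uniform distribution if a parent
                # configuration was never observed and no prior is used.
                table[(cfg, v)] = (
                    (counts[(cfg, v)] + alpha) / total if total else 1 / len(values)
                )
        cpts[node] = table
    return cpts


def joint_probability(cpts, assignment):
    """Chain-rule factorization: product over nodes of P(node | parents)."""
    p = 1.0
    for node, parents in PARENTS.items():
        cfg = tuple(assignment[q] for q in parents)
        p *= cpts[node][(cfg, assignment[node])]
    return p


# Four made-up coded tweets.
rows = [
    {"risk": "high", "attitude": "positive", "status": "evacuated"},
    {"risk": "high", "attitude": "positive", "status": "not_stated"},
    {"risk": "high", "attitude": "negative", "status": "not_stated"},
    {"risk": "low", "attitude": "negative", "status": "not_stated"},
]
query = {"risk": "high", "attitude": "positive", "status": "evacuated"}
mle = joint_probability(fit_cpts(rows, iss=0), query)
bayes = joint_probability(fit_cpts(rows, iss=1.0), query)
print(f"MLE: {mle:.3f}  Bayesian: {bayes:.3f}")
```

With so few records the two estimates differ noticeably; the prior matters most for parent configurations that are rarely observed, which is the situation for several sparse variables in this dataset.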
7.4 Study Area and Data
7.4.1 Study Area
The 2017 Lilac Fire started in the northern part of San Diego County. It was first reported on December 7, 2017 at 11:15 a.m., and the wildfire was active for a total of 33 days. With the area experiencing severe Santa Ana wind conditions (warm, dry, and strong winds) and extremely low humidity, the Lilac Fire grew rapidly and burned across 4,100 acres, destroying 157 structures and damaging an additional 64+ homes/structures. Emergency response agencies issued a total of 17 evacuation notifications (12 mandatory evacuation orders, 4 warnings, and 1 update) that affected more than 44,000 households. They broadcast those notifications via Wireless Emergency Alerts (WEA), the Internet, and social media (e.g., Twitter). All evacuation notifications were issued on December 7, 2017. More than 77,000 people were affected, and more than 1,300 evacuees were served at multiple shelters (County of San Diego, 2019).
7.4.2 Data Collection
A web-based application called SMART Dashboard (Yang et al., 2016) was used to collect publicly available geo-referenced tweets posted within a 50 km radius of the ignition point of the Lilac Fire from December 6 through December 9, 2017 (a total of 4 days). 16 keywords associated with wildfires and evacuations were initially selected to collect tweets. These included: “brush fire”, “wildfire”, “evacuations”, “evacuation”, “evacuate”, “fire”, “vegetation fire”, “evac”, “shelter”, “fire map”, “fire shelter”, “road closed”, “road closure”, “LilacFire”, and “power”. Over the course of the 4-day period, this generated 77,715 tweets, many of which were not directly related to the wildfire. The keyword search was then narrowed down to 3 keywords, “evac”, “pack”, and “LilacFire”, to filter out irrelevant tweets. This resulted in tweets more focused on evacuation related to the Lilac Fire. Over the 4 days, a total of 53,539 tweets were collected. Of those, 47,428 were retweets, which were removed, leaving a total of 6,111 original tweets, that is, self-generated tweets posted by individuals and non-individual entities.
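The narrowing and retweet-removal steps above can be sketched as follows. This is an illustrative sketch only, not the SMART Dashboard implementation: the record fields `text` and `is_retweet`, the `@user` handle, and the use of case-insensitive substring matching are assumptions.

```python
# Narrowed keyword list reported in the text; matching is assumed to be
# case-insensitive substring matching.
KEYWORDS = ("evac", "pack", "lilacfire")


def matches_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword occurs anywhere in the tweet text."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def select_original_tweets(tweets):
    """Keep keyword-matching tweets that are not retweets."""
    return [
        t for t in tweets
        if matches_keywords(t["text"]) and not t["is_retweet"]
    ]


# Hypothetical records; the field names are illustrative only.
tweets = [
    {"text": "Got the evacuation call and packed the essentials", "is_retweet": False},
    {"text": "RT @user: We are evacuating #LilacFire", "is_retweet": True},
    {"text": "Power is out downtown", "is_retweet": False},
]
originals = select_original_tweets(tweets)
print(len(originals))

# Consistency check on the counts reported in the text.
collected, retweets = 53_539, 47_428
print(collected - retweets)
```

Substring matching is a deliberately loose choice here: it lets “evac” catch “evacuate” and “evacuation”, but “pack” would also match unrelated words such as “package”, which is one reason manual coding of the remaining tweets is still needed.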
7.4.3 Coding Tweets
A codebook (see Appendix A) was created to code original tweets into variables based on the proposed evacuation decision making model (Fig. 7.2). Guided by the developed codebook, two coders went through 1,000 tweets and coded each tweet.
The coders achieved an inter-coder reliability of κ = 0.75 (p < 0.05) and a percent agreement of 85.5% based on 1,000 randomly sampled coded tweets, indicating a moderate level of agreement between the coders about the content of the tweets. In instances of coding disagreement, the coders discussed discrepancies until they were resolved and, if necessary, revised the codebook. The remaining original tweets (n = 5,111) were coded by those trained coders. Table 7.2 shows coding examples for Attitude Toward Evacuation and Evacuation Status. Bold text represents keywords that were used to determine a coding value. To model individuals’ evacuation decision making, we further filtered the 6,111 coded tweets by removing tweets posted by non-individual entities (n = 1,861) and then those not containing the individual user’s attitude toward evacuation (n = 3,728). Our final Twitter dataset for modeling includes 512 tweets containing an individual’s Attitude Toward Evacuation.
Table 7.2 Coding examples: Attitude Toward Evacuation and Evacuation Status with sample tweets
Factors Attitude toward evacuation
Tweets Evacuation status
Negative
“@… I don’t think we’re gonna have to evacuate it’s just super duper smokey everywhere so it seems closer than it is?? …” “My family and I didn’t end up having to evacuate, but thank you to everyone who opened their homes to us…”
Positive
Neutral
Not evacuated
“Was close to evacuating but the fire is heading west, great-full and sad at the same time I hope this fire get contained”
Pre-evacuation
“Ready to leave when necessary. Packing supplies for a fire evac really puts things into perspective. Most everything is replaceable. Praying for our firefighters & that everyone stays safe. #lilacfire”
Evacuating
“We are evacuating and the dog is driving #LilacFire”
Evacuated
“I had to evacuate I hope and pray our house is still standing as to other ppl too, this fire ain’t no joke” “So um I might have to evacuate” “@cityofvista does Vista need to evacuate?”
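The two inter-coder agreement statistics reported in Sect. 7.4.3 (percent agreement and Cohen’s kappa) can be computed as in the following sketch. The six label pairs are made up for illustration and are unrelated to the study’s 1,000 double-coded tweets.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items to which both coders assigned the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(a)
    observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement from each coder's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical codes from two coders for six tweets.
coder_a = ["positive", "positive", "negative", "neutral", "positive", "negative"]
coder_b = ["positive", "negative", "negative", "neutral", "positive", "negative"]

print(round(percent_agreement(coder_a, coder_b), 3))
print(round(cohens_kappa(coder_a, coder_b), 3))
```

Kappa is lower than raw agreement because some matches would occur by chance alone, which is why the chapter reports both figures.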
7.5 Results
7.5.1 Coding Results
Table 7.3 summarizes the results of tweet coding by variable. The 512 tweets with any sign or mention of the Twitter user’s attitude toward evacuation were posted by 429 unique individual users. Of the 429 unique users, during the 4-day period, 369 users posted only once (86.0%), 45 users posted twice (10.5%), 9 users posted three times (2.1%), 4 users posted four times (0.9%), and 2 users posted five times (0.5%). Regarding the Twitter user’s Attitude Toward Evacuation, of 512 tweets, 323 (63.1%) are positive, 89 (17.4%) are negative, and 106 (20.7%) are neutral. No tweet contained any sign or mention of the Twitter user living in a mobile home, individuals/family members with disabilities in the household, or traffic congestion in the user’s neighborhood; therefore, three variables, Mobile Home, Disability, and Traffic Congestion, were excluded from the Bayesian Network model.
Figures 7.3 and 7.4 show the hourly frequency distributions of the coded tweets by Attitude Toward Evacuation and by Evacuation Status, respectively. Along with the frequency, each figure displays the number of unique users per hour and key events related to the Lilac Fire evacuation. Most tweets were posted around the times when evacuation orders and warnings were issued. The peak was observed on December 7th at 9 pm, right after the final evacuation order was issued, and the number of tweets dropped markedly on December 8th and after. Of 341 tweets posted on December 7th, when evacuation warnings and orders were issued, 204 (59.8%) expressed a positive Attitude Toward Evacuation and 154 (45.2%) were coded as a pre-evacuation status, mentioning packing belongings, being ready to evacuate, or waiting for an evacuation order and further notification.
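As a quick arithmetic check, the per-user posting counts reported above can be recombined to recover the number of unique users, the number of tweets, and the reported percentages. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Number of users who posted exactly k tweets, as reported in the text.
posts_per_user = {1: 369, 2: 45, 3: 9, 4: 4, 5: 2}

users = sum(posts_per_user.values())
tweets = sum(k * n for k, n in posts_per_user.items())
shares = {k: round(100 * n / users, 1) for k, n in posts_per_user.items()}

print(users, tweets)
print(shares)
```

The totals agree with the 429 users and 512 tweets stated in the text.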
7.5.2 Bayesian Network Model
The learned DAG and the conditional probability tables are shown in Fig. 7.5 and Table 7.4, respectively. The DAG in Fig. 7.5 was first learned purely from the data without the use of prior knowledge. We then reversed the direction of the arc between Perceived Risk and Attitude Toward Evacuation (shown as a dashed arrow in Fig. 7.5) to reflect the proposed conceptual model (Fig. 7.2), which changed the BIC (Bayesian Information Criterion) score of the model from −3032.7 to −3113.7. To evaluate the learned BN, we used tenfold cross-validation, randomly separating the data into 10 even sets, with 1 set for testing and 9 sets for training. Since each data allocation influences the validation results, we ran the cross-validation 200 times and calculated the average statistics. The average log-likelihood loss (or negative entropy), which is the negated expected log-likelihood of the test set for the BN fitted from the training set, was 5.442 (σ = 0.017). The average prediction accuracy for Attitude Toward Evacuation was 0.874
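Table 7.3 Summary of coding results (N = 512)
Variable | Category | N | %
Attitude toward evacuation | Negative | 90 | 17.6
Attitude toward evacuation | Neutral | 124 | 24.2
Attitude toward evacuation | Positive | 298 | 58.2
Evacuation status | Pre-evacuation | 204 | 39.8
Evacuation status | Evacuating | 9 | 1.76
Evacuation status | Evacuated | 81 | 15.8
Evacuation status | Not evacuated | 4 | 0.8
Evacuation status | Not stated | 214 | 41.8
Perceived risk | High | 350 | 68.4
Perceived risk | Low | 45 | 8.8
Perceived risk | Not stated | 117 | 22.9
Information received | No | 352 | 68.8
Information received | Yes | 160 | 31.3
Information source | Family/Relative | 2 | 0.4
Information source | Other Individual | 7 | 1.4
Information source | Government Agency | 24 | 4.7
Information source | Non-Profit Organization | 1 | 0.2
Information source | Media | 14 | 2.7
Information source | Other | 26 | 5.1
Information source | Unknown | 80 | 15.6
Information source | Not stated | 358 | 69.9
Information type | Evacuation order | 48 | 9.4
Information type | Evacuation warning | 18 | 3.5
Information type | Other Evac. notification | 94 | 18.4
Information type | Other notification | 7 | 1.4
Information type | Not stated | 345 | 67.4
Information credibility | Negative | 17 | 3.3
Information credibility | Positive | 87 | 17.0
Information credibility | Not stated | 408 | 79.7
Previous experience | No | 503 | 98.2
Previous experience | Yes | 9 | 1.86
Knowledge | No | 509 | 99.4
Knowledge | Yes | 3 | 0.59
Pet ownership | No | 488 | 95.3
Pet ownership | Yes | 24 | 4.7
Medical issue | No | 510 | 99.6
Medical issue | Yes | 2 | 0.4
Home ownership | No | 474 | 92.6
Home ownership | Yes | 38 | 7.4
Mobile home | No | 512 | 100.0
Mobile home | Yes | 0 | 0.0
Job responsibility | No | 499 | 97.5
Job responsibility | Yes | 13 | 2.5
Family safety | No | 452 | 88.3
Family safety | Yes | 60 | 11.7
Elderly in household | No | 508 | 99.2
Elderly in household | Yes | 4 | 0.8
Children in household | No | 501 | 97.9
Children in household | Yes | 11 | 2.2
Disability | No | 512 | 100.0
Disability | Yes | 0 | 0.0
Place to evacuate | No | 4 | 0.8
Place to evacuate | Yes | 23 | 4.5
Place to evacuate | Not stated | 485 | 95.7
Traffic congestion | No | 512 | 100.0
Traffic congestion | Yes | 0 | 0.0
Means of transportation | No | 2 | 0.4
Means of transportation | Yes | 18 | 3.5
Means of transportation | Not stated | 492 | 96.1
Time availability | No | 9 | 1.8
Time availability | Yes | 0 | 0.0
Time availability | Not stated | 503 | 98.2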
Fig. 7.3 Hourly frequency distribution of tweets coded by user’s attitude toward evacuation
Fig. 7.4 Hourly frequency distribution of tweets coded by user’s evacuation status
Fig. 7.5 Evacuation decision making structure
(Cohen’s kappa coefficient, κ = 0.776) and for Evacuation Status was 0.812 (κ = 0.684).
The BN learned from tweets reveals dependence relationships among individuals’ evacuation decision-making variables. The BN structure learning identifies two networks. One network includes the variables related to information, perceived risk, evacuation situation (Place to Evacuate), and evacuation decision making, implying that perceived risk and received information influence an individual’s evacuation decision making. The other network, with the variables related to individual/family situation and evacuation situation (Means of Transportation, Time Availability), is disconnected from the network with the evacuation decision-making nodes. This implies that those individual/family situation and evacuation situation factors did not play a significant role in making evacuation decisions in our case study. This result is partially due to the low frequency of tweets that mentioned “Yes” on those factors.
Focusing on direct dependence relationships of the evacuation decision-making outcome variables, the learned BN identifies two key evacuation decision-making factors, Perceived Risk and Source of Information. Note that information in our data refers mostly to evacuation notifications such as official evacuation warnings and orders. These two factors directly influence an individual’s decision making positively or negatively, which is in line with findings from previous studies (see Table 7.1). The result also shows how these two factors jointly influence evacuation decision making. Attitude Toward Evacuation is more likely to be positive when individuals perceived a high risk and received information from media (p = 0.635) or unknown sources (p = 0.666) than from government agencies (p = 0.063). This low probability for government agencies as the information source is due to information-seeking behavior. We found that Twitter users directly communicated with government agencies to seek further information about evacuation. Of 24 tweets
Table 7.4 Conditional probability table Parent node
Conditional probability
Attitude toward evacuation
Information source (perceived risk = high) Family/relative
Other individual
Government agency
Non-profit organization
Negative
0.333
0.007
0.001
0.333
Neutral
0.333
0.497
0.936
0.333
Positive
0.333
0.497
0.063
0.333
Media
Other
Unknown
Not stated
Negative
0.092
0.001
0.167
0.066
Neutral
0.273
0.943
0.167
0.091
Positive
0.635
0.056
0.666
0.844
Attitude toward evacuation
Information source (perceived risk = low) Family/relative
Other individual
Government agency
Non-profit organization
Negative
0.333
0.333
0.986
0.333
Neutral
0.333
0.333
0.007
0.333
Positive
0.333
0.333
0.007
0.333
Media
Other
Unknown
Not stated
Negative
0.973
0.333
0.886
0.727
Neutral
0.013
0.333
0.112
0.061
Positive
0.013
0.333
0.002
0.212
Attitude toward evacuation
Information source (perceived risk = not stated) Family/relative
Other individual
Government agency
Non-profit organization
Negative
0.007
0.201
0.168
0.013
Neutral
0.497
0.796
0.830
0.973
Positive
0.497
0.003
0.002
0.013
Media
Other
Unknown
Not stated
Negative
0.007
0.002
0.364
0.268
Neutral
0.986
0.997
0.364
0.341
Positive
0.007
0.002
0.273
0.390
Evacuation status
Attitude toward evacuation Negative
Positive
Neutral
Pre-evacuation
0.001
0.684
0.001
Evacuating
0.001
0.030
0.001
Evacuated
0.001
0.272
0.001
Not Evacuated
0.001
0.014
0.001
Not Stated
0.997
0.000
0.998 (continued)
7 Learning Dependence Relationships of Evacuation …
129
Table 7.4 (continued) Parent node
Conditional probability
Attitude toward evacuation
Information source (perceived risk = high)
Information type
Information received No
Yes
Evacuation order
0.017
0.262
Evacuation warning 0.003
0.107
Other evacuation notification
0.000
0.044
Other notification
0.000
0.586
Not stated
0.979
0.001
Information credibility
Information received No
Yes
Negative
0.006
0.094
Positive
0.052
0.431
Not Stated
0.942
0.475
Information source
Information received
Family/relative
No
Other individual
Government agency
Non-profit organization
Yes
Family/relative
0.000
0.013
Other Individual
0.000
0.044
Government Agency 0.000
0.150
Non-profit organization
0.000
0.088
Media
0.000
0.007
Other
0.000
0.162
Unknown
0.017
0.461
Not Stated
0.982
0.075
Pet ownership
Children in household No
Yes
No
0.964
0.457
Yes
0.036
0.543
Elderly in household
Family safety No
Yes
No
0.999
0.930
Yes
0.001
0.070 (continued)
130
A. Nara et al.
Table 7.4 (continued)

Children in household, given Family safety
Children in household | No | Yes
No | 0.999 | 0.814
Yes | 0.001 | 0.186

Means of transportation, given Pet ownership
Means of transportation | No | Yes
No | 0.004 | 0.007
Yes | 0.025 | 0.252
Not Stated | 0.971 | 0.741

Time availability, given Means of transportation
Time availability | No | Yes | Not mentioned
No | 0.500 | 0.173 | 0.010
Not Stated | 0.500 | 0.827 | 0.990

Place to evacuate, given Attitude toward evacuation
Place to evacuate | Negative | Neutral | Positive
No | 0.046 | 0.001 | 0.000
Yes | 0.012 | 0.001 | 0.074
Not Stated | 0.942 | 0.998 | 0.926

Home ownership, given Time availability
Home ownership | No | Not mentioned
No | 0.447 | 0.934
Yes | 0.553 | 0.066

Perceived risk
Low | High | Not Stated
0.088 | 0.683 | 0.229

Information received
No | Yes
0.687 | 0.313

Family safety
No | Yes
0.882 | 0.118

Experience
No | Yes
0.981 | 0.019
Knowledge
No | Yes
0.993 | 0.007

Medical issue
No | Yes
0.995 | 0.005
mentioning that the source of received information was from government agencies, 15 asked questions directly to a government agency's Twitter account to clarify whether they should evacuate. These information-seeking tweets were coded as a neutral Attitude Toward Evacuation, which in turn contributed to a higher conditional probability of a neutral Attitude Toward Evacuation (p = 0.936). Table 7.5 compares example tweets that mentioned a high-risk perception and information received from one of three information sources: media, unknown sources, and government agencies. Although these information-seeking tweets do not by themselves influence individuals' evacuation decisions, the direct communication with government agencies and their responses would strongly influence individuals' evacuation decision making. Our model does not incorporate this effect because response tweets and follow-up conversations were not observed in our data.

Table 7.5 Example tweets mentioning a high-risk perception and information received from three sources
Attitude toward evacuation | Perceived risk | Information source | Tweets
Positive | High | Media | “Watching news. Fires are close. Gotta drink all the beer before evacuation.” “we evacuated last night the MOMENT we got put in the voluntary evacuation zone (news said”
Positive | High | Unknown | “I just got word I have to evacuate now…” “We just got an alert and me and my family might have to evacuate…”
Neutral | High | Government Agency | “@CALFIRESANDIEGO Is this still a warning? Or is it an evacuation order?” “@CALFIRESANDIEGO @SDSheriff So outside of blue lines is non-mandatory evacuation area?”

When individuals perceive a low risk, their attitude toward evacuation is most likely negative regardless of the source of received information (Government Agency: p = 0.986, Media: p = 0.973, Unknown: p = 0.886). Neither Information Credibility nor Information Type is causally related, even indirectly, to Attitude Toward Evacuation, because the paths connecting them do not follow the direction of the arrows. Attitude Toward Evacuation is also directly related to Place to Evacuate: individuals expressing a negative attitude are more likely to mention having no place to evacuate (p = 0.046), and those with a positive attitude are more likely to mention having a place to evacuate (p = 0.074). Nevertheless, these conditional probabilities are very low because few tweets mentioned Place to Evacuate together with a positive (n = 23) or negative (n = 4) Attitude Toward Evacuation in a single message. Evacuation Status is directly related to Attitude Toward Evacuation: individuals with a positive attitude are most likely to be in a pre-evacuation status (p = 0.684) or an evacuated status (p = 0.272). Factors related to individual/family situations or evacuation situations do not influence individuals' evacuation decision making, as they are disconnected from the evacuation decision-making outcome variables. Previous Experience, Knowledge, and Medical Issue are presumed to have no connection to other variables and hence are isolated. While these factors were identified as key decision-making factors in previous studies (see Table 7.1), our result implies that they are not primary factors in our case study. This result, however, is largely due to the lack of enough tweets mentioning those factors.
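As a worked illustration of how the conditional probability tables in Table 7.4 combine, the sketch below chains two of them to obtain marginal probabilities. It assumes the chain Family safety → Children in household → Pet ownership, which is how the parent nodes are listed in Table 7.4; the function name `marginalize` and the dictionary layout are illustrative choices, not part of the original analysis (which used the bnlearn R package).

```python
# Probabilities copied from Table 7.4.
# Assumed chain: Family safety -> Children in household -> Pet ownership.
P_FAMILY_SAFETY = {"No": 0.882, "Yes": 0.118}

# Outer key: parent state; inner key: child state.
P_CHILDREN_GIVEN_FAMILY_SAFETY = {
    "No": {"No": 0.999, "Yes": 0.001},
    "Yes": {"No": 0.814, "Yes": 0.186},
}
P_PET_GIVEN_CHILDREN = {
    "No": {"No": 0.964, "Yes": 0.036},
    "Yes": {"No": 0.457, "Yes": 0.543},
}


def marginalize(parent_dist, cpt):
    """Return P(child) = sum over parent states of P(parent) * P(child | parent)."""
    child_states = next(iter(cpt.values())).keys()
    return {
        child: sum(parent_dist[parent] * cpt[parent][child] for parent in parent_dist)
        for child in child_states
    }


p_children = marginalize(P_FAMILY_SAFETY, P_CHILDREN_GIVEN_FAMILY_SAFETY)
p_pet = marginalize(p_children, P_PET_GIVEN_CHILDREN)

print(f"P(Children in household = Yes) = {p_children['Yes']:.4f}")
print(f"P(Pet ownership = Yes) = {p_pet['Yes']:.4f}")
```

Under these assumptions, roughly 2% of tweets would be expected to mention children in the household and roughly 5% to mention pet ownership, which is consistent with the low mention frequencies discussed above.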
7.6 Discussions

Understanding the basis of human behavior during evacuations allows decision-makers and response teams to provide better guidance and better-informed evacuation orders. A wide range of factors, from personal characteristics (e.g., age, gender) to risk perception and past experience, have been shown to affect evacuation behavior. Conventional methods of obtaining information about these influential factors (e.g., surveys of evacuees) are better suited to post-event processing. In the era of big data collected through online mobile platforms, such valuable information can be obtained with up-to-the-minute insight. Information extracted from the content of potential evacuees' tweets can explain why they are or are not motivated to evacuate. Harnessing this information can also show responders how to better disseminate disaster information over social media platforms.

In this study, we explored how social media data can be used to extract insight on evacuation behavior. We designed a conceptual model based on the literature review, developed a codebook to code Twitter messages, and employed a BN to inductively learn dependence relationships among evacuation decision-making factors from Twitter communications. In our case study of tweets posted during the Lilac Fire, the learned BN highlighted two key factors, risk perception and information source, that jointly influenced individuals' evacuation decision making. This case study also implies that factors related to individual/family situations, evacuation situations, knowledge, and previous experience might not be primary decision-making factors, although this result may be explained by the lack of sufficient data.

In addition to these key findings, our research framework of social media data analysis and BN-based probabilistic modeling can be integrated into decision support systems to help understand (near) real-time evacuation situations, such as individuals' responses to notifications, and to estimate dynamic evacuation demands (Nara et al., 2017). Furthermore, the mixed effects of human decision-making probabilities can be incorporated into simulation models such as agent-based models to explore what-if scenarios and examine evacuation plans and strategies from the bottom up. These efforts can ultimately contribute to improving the current intervention strategy for the evacuation notification system.

Although this study presents a novel research framework and promising results that shed light on the human decision-making process and behavior toward wildfire evacuation, a few notable challenges remain. First, social media data are known to be biased (Hargittai, 2020; Jiang et al., 2019). In this study, the 512 tweets posted by 429 unique individual Twitter users under an emergency situation could contain demographic and socioeconomic biases, which may lead to unreliable inferences. It is challenging to identify, understand, and unlearn implicit bias in social media data because user profiles are not publicly available at the individual level.
Second, the small sample size (n = 512) remaining after filtering the big data (n = 53,539) may not provide a sufficient basis for learning conditional probability distributions for evacuation decision-making factors that were rarely expressed in our case study. Third, unlike systematically collected structured survey data, Twitter messages must be carefully and systematically evaluated and coded to produce structured data for analysis and modeling. Although the two coders achieved a moderate level of agreement about the content of the tweets, the coding results could be swayed by the subjective nature of language. To examine the validity of analysis and modeling based on social media data, the results can be compared with those obtained by traditional methods such as survey questionnaires (Martín et al., 2020).

Acknowledgements

This work was supported by the National Science Foundation under Grant No. 1634641, IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation” and the San Diego State University Summer Undergraduate Research Program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or San Diego State University. We would also like to thank Christian Mejia for the hours he spent supporting data preprocessing.
Appendix: Tweets Codebook

Variable | Code | Description
English | No = 0; Yes = 1 | Contained English words
Retweet | No = 0; Yes = 1 | Contained “RT @ USERNAME” at the beginning of the text
Twitter user account type | Individual = 1; Law enforcement/Fire Dept. = 2; CalFire = 3; 211 San Diego = 4; American Red Cross = 5; Office of Emergency Services = 6; Media = 7; Community group = 8; Local government agency = 9; Other government agency = 10; Other non-profit = 11; Other = 12 | Type of the Twitter user account
Evacuated | No = 0; Yes = 1 | Contained any sign or mention that the Twitter user evacuated
Evacuating | No = 0; Yes = 1 | Contained any sign or mention that the Twitter user was evacuating
Pre-evacuation | No = 0; Yes = 1 | Contained any sign or mention that the Twitter user was in pre-evacuation status (e.g., in preparation)
Not evacuated | No = 0; Yes = 1 | Contained any sign or mention that the Twitter user decided not to evacuate
(continued)
(continued)
Variable | Code | Description
Attitude toward evacuation | Negative = 0; Positive = 1; Neutral = 2; Not stated = −1 | Contained any sign or mention whether the Twitter user’s attitude toward evacuation was positive (likely to evacuate), negative (unlikely to evacuate), neutral (undecided/uncertain), or not stated
Perceived risk/threat | Low = 0; High = 1; Not stated = −1 | Contained any sign or mention that the Twitter user perceived high/low risk of the incident
Information received | No = 0; Yes = 1 | Contained any sign or mention that the user received information from other(s) or shared other URLs regarding evacuation or wildfire
Information source | Family/relative = 0; Other individual = 1; Governmental/official agency = 2; Non-profit organization = 3; Media = 4; Other = 5; Unknown source = 6; Not stated = −1 | The source of received information
Information type | Official evacuation order = 0; Official evacuation warning = 1; Other evac. notifications = 2; Others = 3; Not stated = −1 | Emergency notification type
Information credibility | Negative = 0; Positive = 1; Not stated = −1 | Contained any sign or mention of the Twitter user’s negative or positive attitude toward the information source
Previous experience | No = 0; Yes = 1 | Contained any sign or mention of the Twitter user’s previous experience with evacuation
Evacuation knowledge | No = 0; Yes = 1 | Contained any sign or mention of the Twitter user’s knowledge about evacuation
Pet ownership | No = 0; Yes = 1 | Contained any sign or mention of pet ownership
(continued)
(continued)
Variable | Code | Description
Medical issue | No = 0; Yes = 1 | Contained any sign or mention of medical issue
Home ownership | No = 0; Yes = 1 | Contained any sign or mention of home ownership
Mobile home | No = 0; Yes = 1 | Contained any sign or mention that the user is living in a mobile house
Job responsibility | No = 0; Yes = 1 | Contained any sign or mention of job responsibility
Family safety | No = 0; Yes = 1 | Contained any sign or mention of concerning family safety
Elderly in household | No = 0; Yes = 1 | Contained any sign or mention of senior citizen(s) in a household
Children in household | No = 0; Yes = 1 | Contained any sign or mention of children in a household
Disability | No = 0; Yes = 1 | Contained any sign or mention of individual/family member with disabilities in household
Place to evacuate | No = 0; Yes = 1 | Contained any sign or mention of having a place to evacuate
Traffic congestion | No = 0; Yes = 1 | Contained any sign or mention of the traffic congestion in the Twitter user’s neighborhood
Means of transportation | No = 0; Yes = 1 | Contained any sign or mention of the possession of reliable means of transportation to evacuate
Time availability | No = 0; Yes = 1 | Contained any sign or mention of the time availability to evacuate
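To show how the codebook turns a coded tweet into structured data for modeling, the following sketch decodes numeric codes back into their labels. The code-to-label mappings are transcribed from the codebook above for four of the variables; the snake_case field names, the `decode` helper, and the example row are hypothetical and are not taken from the study's actual data files.

```python
# Code-to-label mappings transcribed from the codebook (four variables only).
# The snake_case variable names are hypothetical, chosen for this sketch.
CODEBOOK = {
    "attitude_toward_evacuation": {
        0: "Negative", 1: "Positive", 2: "Neutral", -1: "Not stated",
    },
    "perceived_risk": {0: "Low", 1: "High", -1: "Not stated"},
    "information_received": {0: "No", 1: "Yes"},
    "information_source": {
        0: "Family/relative",
        1: "Other individual",
        2: "Governmental/official agency",
        3: "Non-profit organization",
        4: "Media",
        5: "Other",
        6: "Unknown source",
        -1: "Not stated",
    },
}


def decode(coded_row):
    """Translate numeric codes into labels; reject unknown variables or codes."""
    decoded = {}
    for variable, code in coded_row.items():
        if variable not in CODEBOOK:
            raise KeyError(f"variable not in codebook: {variable}")
        labels = CODEBOOK[variable]
        if code not in labels:
            raise ValueError(f"invalid code {code} for {variable}")
        decoded[variable] = labels[code]
    return decoded


# Hypothetical coded tweet: an information-seeking question sent to an agency.
example = decode({
    "attitude_toward_evacuation": 2,
    "perceived_risk": 1,
    "information_received": 1,
    "information_source": 2,
})
print(example)
```

Rejecting unknown codes at decode time is one simple way to catch coder typos before the data reach the structure-learning step.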
References

Alam, F., Ofli, F., Imran, M., & Aupetit, M. (2018). A Twitter tale of three hurricanes: Harvey, Irma, and Maria. arXiv:1805.05144 [cs].
Alawadi, R., Murray-Tuite, P., Marasco, D., Ukkusuri, S., & Ge, Y. (2020). Determinants of full and partial household evacuation decision making in Hurricane Matthew. Transportation Research Part D: Transport and Environment, 83, 102313.
Baker, E. J. (1991). Hurricane evacuation behavior. International Journal of Mass Emergencies and Disasters, 9, 287–310.
Bateman, J. M., & Edwards, B. (2002). Gender and evacuation: A closer look at why women are more likely to evacuate for hurricanes. Natural Hazards Review, 3, 107–117.
Cohn, P. J., Carroll, M. S., & Kumagai, Y. (2006). Evacuation behavior during wildfires: Results of three case studies. Western Journal of Applied Forestry, 21, 39–48.
County of San Diego (2019). 2017 Lilac Fire after action report.
Dow, K., & Cutter, S. L. (2000). Public orders and personal opinions: Household strategies for hurricane risk assessment. Global Environmental Change Part B: Environmental Hazards, 2, 143–155.
Gladwin, C. H., Gladwin, H., & Peacock, W. G. (2001). Modelling hurricane evacuation decisions with ethnographic method. International Journal of Mass Emergencies and Disasters, 19, 117–143.
Hargittai, E. (2020). Potential biases in big data: Omitted voices on social media. Social Science Computer Review, 38, 10–24.
Heath, S. E., Kass, P. H., Beck, A. M., & Glickman, L. T. (2001). Human and pet-related risk factors for household evacuation failure during a natural disaster. American Journal of Epidemiology, 153, 659–665.
Huang, S.-K., Lindell, M. K., & Prater, C. S. (2016). Who leaves and who stays? A review and statistical meta-analysis of hurricane evacuation studies. Environment and Behavior, 48, 991–1029.
Jiang, Y., Li, Z., & Ye, X. (2019). Understanding demographic and socioeconomic biases of geotagged Twitter users at the county level. Cartography and Geographic Information Science, 46, 228–242.
Lachlan, K. A., Spence, P. R., & Lin, X. (2014). Expressions of risk awareness and concern through Twitter: On the utility of using the medium as an indication of audience needs. Computers in Human Behavior, 35, 554–559.
Latonero, M., & Shklovski, I. (2011). Emergency management, Twitter, and social media evangelism. International Journal of Information Systems for Crisis Response and Management, 3, 1–16.
Marieke, V. W., Terri, E., Bob, E., & Shawn, H. (2002). Riding out the storm: Experiences of the physically disabled during hurricanes Bonnie, Dennis, and Floyd. Natural Hazards Review, 3, 98–106.
Martín, Y., Cutter, S. L., & Li, Z. (2020). Bridging Twitter and survey data for evacuation assessment of Hurricane Matthew and Hurricane Irma. Natural Hazards Review, 21, 04020003.
Mohaimin, S. A., Ukkusuri, S. V., & Hugh, G. (2017). The role of social networks and information sources on hurricane evacuation decision making. Natural Hazards Review, 18, 04017005.
Murray-Tuite, P., & Wolshon, B. (2013). Evacuation transportation modeling: An overview of research, development, and practice. Transportation Research Part C: Emerging Technologies, 27, 25–45.
Nara, A., Yang, X., Ghanipoor Machiani, S., & Tsou, M.-H. (2017). An integrated evacuation decision support system framework with social perception analysis and dynamic population estimation. International Journal of Disaster Risk Reduction, 25, 190–201.
Pearl, J. (2009). Causality. Cambridge University Press.
Perry, R. W. (1979). Evacuation decision-making in natural disasters. Mass Emergencies, 4, 25–38.
Raggatt, P., Butterworth, E., & Morrissey, S. (1993). Issues in natural disaster management: Community response to the threat of tropical cyclones in Australia. Disaster Prevention and Management: An International Journal. https://doi.org/10.1108/09653569310040955
Sadri, A. M., Hasan, S., Ukkusuri, S. V., & Cebrian, M. (2018). Crisis communication patterns in social media during Hurricane Sandy. Transportation Research Record, 2672, 125–137.
Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35, 1–22.
Scutari, M., & Denis, J.-B. (2014). Bayesian networks: With examples in R (1st ed.). Chapman and Hall/CRC.
Scutari, M., Graafland, C. E., & Gutiérrez, J. M. (2019). Who learns better Bayesian network structures: Accuracy and speed of structure learning algorithms. International Journal of Approximate Reasoning, 115, 235–253.
Smith, S. K., & McCarty, C. (2009). Fleeing the storm(s): An examination of evacuation behavior during Florida’s 2004 hurricane season. Demography, 46, 127–145.
Sorensen, J. H., Vogt, B. M., & Mileti, D. S. (1987). Evacuation: An assessment of planning and research. Oak Ridge National Lab., TN (USA).
Thompson, R. R., Garfin, D. R., & Silver, R. C. (2017). Evacuation from natural disasters: A systematic review of the literature. Risk Analysis, 37, 812–839.
Vieweg, S., Hughes, A. L., Starbird, K., & Palen, L. (2010). Microblogging during two natural hazards events: What Twitter may contribute to situational awareness. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 1079–1088). New York, NY, USA: ACM.
Wang, Z., Ye, X., & Tsou, M.-H. (2016). Spatial, temporal, and content analysis of Twitter for wildfire hazards. Natural Hazards, 83, 523–540.
Whitehead, J. C., Edwards, B., Van Willigen, M., Maiolo, J. R., Wilson, K., & Smith, K. T. (2000). Heading for higher ground: Factors affecting real and hypothetical hurricane evacuation behavior. Global Environmental Change Part B: Environmental Hazards, 2, 133–142.
Yang, J.-A., Tsou, M.-H., Jung, C.-T., Allen, C., Spitzberg, B. H., Gawron, J. M., & Han, S.-Y. (2016). Social media analytics and research testbed (SMART): Exploring spatiotemporal patterns of human dynamics with geo-targeted social media messages. Big Data & Society, 3, 2053951716652914.
Chapter 8
Examining Spatiotemporal and Sentiment Patterns of Evacuation Behavior During 2017 Hurricane Harvey
Chenxiao (Atlas) Guo and Qunying Huang
C. Guo · Q. Huang (✉), Department of Geography, University of Wisconsin-Madison, Madison, WI, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_8

8.1 Introduction

Recent decades have witnessed an increasing risk of natural disasters, leading to many fatalities and severe damage to infrastructure and private property. Among natural disasters, hurricanes (tropical cyclones) cause the highest economic loss, with a temporal range of almost half a year and a wide spatial distribution of affected areas, especially coastal regions of the United States (Weinkle et al., 2018). Advances in atmospheric science make it possible to predict the intensity and track of a hurricane about three to five days ahead, allowing people to take effective action in advance. Compared to other, relatively small-scale disasters, deaths caused by hurricanes can be minimized by evacuating properly rather than sheltering in place. Therefore, much attention has been paid to evacuation behavior during hurricanes. One major research direction is the simulation of evacuation behavior, focusing mainly on whether, when, and where people evacuate, and many studies concentrate on the traffic, accidents, or facilities related to evacuation. Traditionally, researchers have relied on post-event questionnaire surveys, which can be expensive and slow and reach only a limited number of participants (Henderson et al., 2009). Official traffic volume data offer an alternative data source, but they are centered on vehicles rather than evacuees.

In the era of big data, social media platforms play a significant role not only in releasing guidance information to the public, such as weather, traffic, and emergency updates, but also in giving researchers the opportunity to gather large amounts of information from the public and to detect patterns of public behavior. In the context of disaster management, user-generated content such as text, images, and video, along with its spatiotemporal information, has great potential to reveal and help explain evacuation behaviors. Much progress has been made in exploring “whether”, “when”, and “where” people evacuate during natural disasters using social media data, and in better understanding the spatiotemporal pattern of evacuation (Cheng et al., 2016; Houston et al., 2015; Lindsay, 2011; Pantti et al., 2012). However, existing work focuses mainly on the general spatiotemporal pattern without fully considering evacuation movements at different stages of a disaster (e.g., preparedness, response, and recovery), and lacks comprehensive analysis of user-generated content (e.g., sentiment). To this end, this research uses geo-tagged Twitter data to explore evacuation patterns during the 2017 Hurricane Harvey.
8.2 Related Work

Traditionally, natural disaster research has relied mainly on post-event surveys, which provide first-hand, event-specific data with informative details (Henderson et al., 2009). However, whether conducted by questionnaire or interview, a post-event survey yields only a limited number of responses, becomes available some time after the event, and is costly in both time and money. Besides survey data, other data sources such as GPS are also leveraged, but it is generally hard to acquire rich GPS datasets that capture the behavior of victims. Although GPS data can be used effectively in natural disaster monitoring (Kafi & Gibril, 2016), they provide little information on people’s behavior during an emergency. Social media platforms, with the unique advantage of providing a huge amount of data almost instantly during a disaster event at relatively low cost, have been heavily used in natural disaster studies (Murray-Tuite & Wolshon, 2013). Although there are issues such as the general representativeness of users, accessibility during emergencies, lower data quality, and less stability, the benefits of using social media data are substantial, including disseminating information, receiving feedback, and serving as an emergency management tool (Lindsay, 2011). Recent trends also show the increasingly important role of social media data in both academia and practice (Zhang et al., 2019). In general, social media data are used mainly from the following analytical perspectives in disaster management.
8.2.1 Event Detection and Spatiotemporal Pattern Identification

When a disaster happens, detecting its location and magnitude as early as possible is a vital task for emergency management. Social media can provide instant first-hand information from direct or indirect victims (Rossi et al., 2018). In practice, this approach can sometimes be even faster than official announcements. After the event, social media data are also helpful in analyzing the spatiotemporal patterns of relevant phenomena (Schempp et al., 2019). There are many successful examples. Earle et al. (2011) use Twitter to detect multiple earthquake events around the world based on keywords in tweet content. Xiao et al. (2015) build a regression model to examine spatial heterogeneity and explain the number of tweets by mass, material, access, and motivation during Hurricane Sandy. Furthermore, Sadri et al. (2018) explore the evolution of various communication patterns and determine the user concerns that emerged over the course of Hurricane Sandy.
8.2.2 Topic Modeling and Semantic-Based Content Analysis

Besides time and location information, social media data also include rich information in other formats, such as text, images, audio, and video, which can support a deeper understanding of semantics, such as sentiment and situational awareness (Verma et al., 2011). Buntain and Lim (2018) study the commonalities in responses across disasters on Twitter, extracting vocabularies for different types of disasters that are found to correlate better with casualties than baseline crisis lexica. Alam et al. (2018) apply machine learning algorithms to conduct a semantic analysis of text and images, using Twitter data from three hurricanes in 2017. Dong et al. (2013) analyze the causal correlation between the approaching Hurricane Sandy and public sentiment using Twitter data. Xiao et al. (2015) use logistic regression to classify social media messages during disasters into categories corresponding to various phases (preparedness, response, impact, recovery). Huang et al. (2019) apply a convolutional neural network (CNN) to automatically tag tweets posted during a flood event based on their text and image information. Nguyen et al. (2019) present a new sequence-to-sequence framework for forecasting people’s needs during disasters using social media and weather data.
8.2.3 Social Vulnerability and Social Network Analysis

When a disaster happens, victims care not only about themselves but also about their families, relatives, friends, neighborhoods, and other communities and groups. Communication via social media provides an opportunity to explore the social connections and vulnerability within these relationships (Kirschenbaum, 2004). Demuth et al. (2018) analyze tweets posted during a hurricane to examine people’s hurricane risk communication, risk assessment, and response. Liu et al. (2018) conduct a semantic network analysis of government organizations’ social media messaging during Hurricane Harvey.
8.2.4 Information Visualization Platform

With rich spatiotemporal and semantic information, social media data can be used to visualize people’s behavior, showing specific patterns under the circumstances of natural disaster events. Şahin et al. (2019) build an integrated visualization system that detects emergencies from social media data and produces an evacuation plan using a multi-agent algorithm. Anderson et al. (2019) develop a system to visualize tweets related to multiple types of disasters and spatially cluster the tweets to identify important locations and movements. Charalabidis et al. (2014) illustrate that social media has contributed significantly to natural disaster management as a tool for information communication. Huang and Xiao (2015) develop a spatial web portal to visualize tweets posted during Hurricane Sandy in different disaster phases. In short, many studies have shown that real-time analytics using social media data provide good opportunities to automatically detect and monitor events such as natural disasters (Yu et al., 2019).
8.2.5 Communication and Interaction as Crowdsourcing Platform

Social media platforms support bidirectional communication, in which victims can both release and acquire information, so they can work as interactive crowdsourcing platforms. Gao et al. (2011) summarize the advantages and shortfalls of crowdsourcing for disaster relief and point out the issues of geo-tag determination, report verification, automated report summarization, social behavior prediction, scalability, and safety. Samuels et al. (2018) explore how decreases in tweets contribute to local crisis identification by analyzing the correlation between daily tweet counts and FEMA Building Level Damage Assessments during Hurricane Harvey.
To sum up, social media, as an important platform for information interaction and communication, has shown great potential to benefit the research and practice of natural disaster management.
8.3 Data

This research uses geotagged Twitter data to generate evacuation trajectories. In the first round of data collection, tweets were collected with the official Twitter Streaming application programming interface (API) using keywords related to Hurricane Harvey. A spatiotemporal filter was then applied to keep only the tweets posted within Hurricane Harvey’s affected areas during its active days, in order to select potential victims. Next, all 1,726 Twitter users were manually validated to further identify eligible users. In our case, 1,320 users (76%) were recognized as real persons, while other accounts, such as weather information, traffic information, local news, commercial companies, and other groups and organizations, were filtered out and not considered in this research. The second round of collection was conducted using the Twitter Standard Search API to retrieve up to the latest 3,200 historical tweets of each of the 1,320 validated users. By expanding the temporal range of the dataset, the long-term trajectories of these active victims could be captured, including much older ones. Another temporal filter was applied to narrow the time range to the year 2017, and 153,732 tweets were finally selected as verified tweet records. Statistics show a clear peak in the number of daily tweets, matching the hurricane’s peak days (Fig. 8.1).

Fig. 8.1 Number of verified tweets by day in 2017

To perform the research, we also include hurricane track data, administrative unit data, and several regions of interest relevant to Hurricane Harvey. For the hurricane track dataset, we use NOAA’s Tropical Cyclone Best Track Data (HURDAT2), which provides each track point’s geographic coordinates, UTC timestamp, category, storm type, intensity (maximum central wind speed), and minimum sea-level pressure. For the administrative units used in analysis and visualization, we consider land (as opposed to sea), country, state, county, and core-based statistical area (CBSA, which represents the metropolitan statistical area of cities), as well as the urban range for visualization. These datasets were acquired from the United States Census Bureau’s official releases. We are interested in several related regions (Fig. 8.2), such as the boundary of the Houston Metropolitan Statistical Area (MSA), since Houston was the worst-hit city in this event. We also analyze county-level evacuation zones, including a mandatory evacuation zone (near Houston) and a voluntary evacuation zone (near Corpus Christi). To further analyze the magnitude of the disaster’s effects, we also refer to county-level affected areas defined by an evaluation group after the disaster: the Moderately Affected Area and the Severely Affected Area.

Fig. 8.2 Regions of interest used in this research
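The two filtering steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bounding box, the active-day window, and the tweet field names (`user`, `lon`, `lat`, `time`) are all hypothetical, since the chapter does not give the actual extent or dates of its spatiotemporal filter, and a rectangle is used here as a stand-in for the affected-area polygons.

```python
from datetime import datetime, timezone

# Hypothetical bounding box (west, south, east, north) and active-day window;
# both are placeholders, not the values used in the chapter.
BBOX = (-100.0, 25.0, -88.0, 34.0)
START = datetime(2017, 8, 17, tzinfo=timezone.utc)
END = datetime(2017, 9, 3, tzinfo=timezone.utc)


def in_bbox(lon, lat, bbox):
    """Return True if the point lies inside the (west, south, east, north) box."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def spatiotemporal_filter(tweets, bbox=BBOX, start=START, end=END):
    """First-round filter: keep tweets inside the box and inside the time window."""
    return [
        t for t in tweets
        if in_bbox(t["lon"], t["lat"], bbox) and start <= t["time"] < end
    ]


def keep_year(tweets, year=2017):
    """Second-round filter: keep historical tweets posted in the given year."""
    return [t for t in tweets if t["time"].year == year]


sample = [
    # Houston during the window: kept by both filters.
    {"user": "a", "lon": -95.37, "lat": 29.76,
     "time": datetime(2017, 8, 27, 12, tzinfo=timezone.utc)},
    # San Francisco during the window: outside the box.
    {"user": "b", "lon": -122.42, "lat": 37.77,
     "time": datetime(2017, 8, 27, 12, tzinfo=timezone.utc)},
    # Houston in 2016: outside the window and outside the year.
    {"user": "a", "lon": -95.37, "lat": 29.76,
     "time": datetime(2016, 5, 1, tzinfo=timezone.utc)},
]

candidates = spatiotemporal_filter(sample)
in_2017 = keep_year(sample)
print(len(candidates), len(in_2017))
```

In practice, a point-in-polygon test against the affected-area boundaries would replace the rectangle check.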
8.4 Methodology Figure 8.3 shows how this research is designed, including several basic types: raw data, intermediate data, processing, and output. There are mainly four data sources, including geotagged tweets, hurricane track data, administrative unit data, and interested regions data, which are introduced in the data section. The first step is joining the additional spatiotemporal information to the geotagged points data, then after
Fig. 8.3 The workflow of generating trajectories and examining spatiotemporal-sentimental patterns
the validation and sentiment analysis, the points are connected into segment-level trajectories. Based on this trajectory dataset, we conduct unit-based and region-based spatiotemporal and sentiment analytics and visualization. In this research, the geographic extent is limited to the contiguous United States.
8.4.1 Case Event
This study uses Hurricane Harvey in the 2017 Atlantic hurricane season as the case event. Harvey was a Category 4 hurricane that formed in late August and dissipated in early September. It has become the costliest Atlantic hurricane so far (tied with Hurricane Katrina), with an economic loss of 125 billion dollars and 107 deaths. Hurricane Harvey’s substantial cost is associated with its spatiotemporal and meteorological characteristics. After the initial stage, the small storm gradually weakened, but it then suddenly intensified dramatically. Besides the severe storm surge along coastal Texas, it hovered around the city of Houston for an extremely long time, bringing record-breaking precipitation within several days. To conduct temporal analysis effectively, we split this event chronologically from two perspectives, as listed in Table 8.1: (1) According to the phases of natural disasters, we use preparedness, response, and recovery to
Table 8.1 The disaster phases and stages of 2017 Hurricane Harvey

| UTC date | Key event | Disaster phase | Hurricane stage |
|---|---|---|---|
| 13-Aug | Harvey started to develop from a tropical wave | Preparedness | N/A |
| 14-Aug | – | Preparedness | N/A |
| 15-Aug | – | Preparedness | N/A |
| 16-Aug | – | Preparedness | N/A |
| 17-Aug | Harvey reached tropical storm | Preparedness | Stage I |
| 18-Aug | – | Preparedness | Stage I |
| 19-Aug | – | Preparedness | Stage I |
| 20-Aug | – | Preparedness | Stage I |
| 21-Aug | – | Preparedness | Stage I |
| 22-Aug | – | Preparedness | Stage I |
| 23-Aug | Harvey re-attained tropical storm | Response | Stage II |
| 24-Aug | – | Response | Stage II |
| 25-Aug | – | Response | Stage II |
| 26-Aug | Harvey made landfall at peak intensity | Response | Stage III |
| 27-Aug | – | Response | Stage III |
| 28-Aug | – | Response | Stage III |
| 29-Aug | – | Response | Stage III |
| 30-Aug | Harvey weakened to a tropical depression | Response | Stage IV |
| 31-Aug | – | Response | Stage IV |
| 1-Sep | – | Response | Stage IV |
| 2-Sep | – | Recovery | Stage IV |
| 3-Sep | Harvey was dissipating | Recovery | Stage IV |
| 4-Sep | – | Recovery | N/A |
| 5-Sep | – | Recovery | N/A |
| 6-Sep | – | Recovery | N/A |
| 7-Sep | – | Recovery | N/A |
| 8-Sep | – | Recovery | N/A |
| 9-Sep | – | Recovery | N/A |
| 10-Sep | – | Recovery | N/A |
| 11-Sep | – | Recovery | N/A |
generally refer to the pre-event, during-event, and post-event periods, each of which lasts 10 days; and (2) based on the meteorological characteristics of Hurricane Harvey (Fig. 8.4), as well as how it affected local people, we mark four stages of Hurricane Harvey in order to observe the change of variables at finer temporal scales.
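The two chronological splits can be expressed as a simple date lookup. The following is a minimal sketch, not the authors' code: the interval boundaries are transcribed from Table 8.1, while the function names `label` and `phase_and_stage` are hypothetical.

```python
from datetime import date

# Inclusive UTC date ranges transcribed from Table 8.1.
PHASES = [
    (date(2017, 8, 13), date(2017, 8, 22), "Preparedness"),
    (date(2017, 8, 23), date(2017, 9, 1), "Response"),
    (date(2017, 9, 2), date(2017, 9, 11), "Recovery"),
]
STAGES = [
    (date(2017, 8, 17), date(2017, 8, 22), "Stage I"),
    (date(2017, 8, 23), date(2017, 8, 25), "Stage II"),
    (date(2017, 8, 26), date(2017, 8, 29), "Stage III"),
    (date(2017, 8, 30), date(2017, 9, 3), "Stage IV"),
]


def label(day, intervals):
    """Return the name of the interval containing `day`, or None (N/A)."""
    for start, end, name in intervals:
        if start <= day <= end:
            return name
    return None


def phase_and_stage(day):
    """Hypothetical helper: map a UTC date to (disaster phase, hurricane stage)."""
    return label(day, PHASES), label(day, STAGES)
```

For example, `phase_and_stage(date(2017, 9, 2))` falls in the recovery phase but still in Stage IV, which shows that the two splits do not share boundaries.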
Fig. 8.4 Meteorological characteristics of Hurricane Harvey at different stages
8.4.2 Trajectory Generation
After the data collection and preprocessing steps, the geotagged Twitter dataset consists of the tweets of valid users who were spatiotemporally affected by Hurricane Harvey. Before generating the trajectories, we conduct sentiment analysis on the tweet content. We use two relatively simple and reliable methods to calculate the sentiment indicators: Valence Aware Dictionary and sEntiment Reasoner (VADER) and Linguistic Inquiry and Word Count (LIWC). VADER is a rule-based package designed specifically for social media text analysis that generates scores for positive, negative, and neutral sentiments. LIWC is a lexicon-based linguistic analysis library with rich results indicating emotions such as happiness, anxiety, sadness, and anger. Under the assumption that Twitter-based trajectories represent the Twitter users’ real trajectories, the general idea of generating segment-level trajectories is to connect every two temporally successive locations of each validated user with a line. However, Twitter allows users to set any geotag location, even one far away from their real location, so a geotag may record a user’s spatial cognition or sense of belonging instead of the real spatial location. This characteristic produces two types of easily detected fake trips: trips with an implausibly high moving speed and trips without a valid moving distance. Therefore, we check all the trips between temporally adjacent tweets of the same user and exclude those with a moving speed greater than 600 mph or with no moving distance, considering that commercial passenger aircraft usually have a cruising speed of less than 600 mph.
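The segment generation and the two filters described above can be sketched as follows. This is an illustration under stated assumptions rather than the chapter's implementation: distances are great-circle (haversine) distances, timestamps are Unix seconds, and the names `haversine_miles` and `build_segments` are hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius
MAX_SPEED_MPH = 600.0        # speed threshold stated in the text


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def build_segments(tweets):
    """Connect temporally successive tweets of each user into segments.

    `tweets` is an iterable of (user_id, unix_seconds, lat, lon) tuples
    (an assumed input layout). Segments with no moving distance, or with
    an implied speed above MAX_SPEED_MPH, are dropped as fake trips.
    """
    by_user = defaultdict(list)
    for user, ts, lat, lon in tweets:
        by_user[user].append((ts, lat, lon))

    segments = []
    for user, points in by_user.items():
        points.sort()  # chronological order per user
        for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
            dist = haversine_miles(lat0, lon0, lat1, lon1)
            hours = (t1 - t0) / 3600.0
            if dist == 0.0:
                continue  # trip without a valid moving distance
            if hours <= 0 or dist / hours > MAX_SPEED_MPH:
                continue  # trip faster than a commercial aircraft
            segments.append((user, (lat0, lon0, t0), (lat1, lon1, t1)))
    return segments
```

In this sketch a rejected segment does not remove the underlying tweets, so the next pair of adjacent tweets is still evaluated on its own.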
8.4.3 Spatiotemporal Analytics with Sentiments
As described in the research flowchart, the analysis based on generated trajectories will be conducted from two dimensions: (1) general unit-based or region-based
(Houston, evacuation areas, affected areas) analysis; and (2) spatiotemporal analytics or spatiotemporal-sentimental analytics. First, the segment-level trajectories are aggregated to the administrative unit level, and the unit pairs connected by the most trajectories are identified, both at the national scale and within the widely affected regions. In this way, the most popular destinations are recognized. The overall yearly pattern is also compared with the patterns during Hurricane Harvey. Second, focusing on several regions of interest, we explore the spatiotemporal pattern by comparing the number of trajectories over specific regions of different types at an appropriate temporal scale. Third, we analyze the change of sentiment and emotion scores over time, highlighting the significant patterns related to Hurricane Harvey. We also associate the sentiment score with space, exploring the destinations with different levels of sentiment when people arrive. Finally, the sentiment analytics is combined with the regions of interest, taking into account the different stages of Hurricane Harvey. Specifically, we are interested in both the average sentiment in different types of regions and the average change of sentiment between different types of regions.
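The first step, aggregating segments to administrative units and ranking unit pairs, can be illustrated with a short counting routine. This is only a sketch: it assumes each segment has already been reduced to an (origin unit, destination unit) pair of names, treats both directions as one connection, and skips movement within a single unit; `count_unit_pairs` is a hypothetical name.

```python
from collections import Counter


def count_unit_pairs(segments):
    """Count trajectories per undirected pair of administrative units.

    `segments` is an iterable of (origin_unit, destination_unit) names.
    Sorting the two names makes A->B and B->A fall under the same key,
    so the counts combine bidirectional trajectories.
    """
    counts = Counter()
    for origin, dest in segments:
        if origin == dest:
            continue  # assumption: within-unit movement is not a connection
        counts[tuple(sorted((origin, dest)))] += 1
    return counts
```

Calling `count_unit_pairs(...).most_common(10)` would then give a ranking of the kind reported for city pairs in the results.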
8.5 Results
Based on this segment-level trajectory dataset, we conduct general spatiotemporal analysis with aggregated administrative units, as well as analysis on regions of interest (Houston, evacuation areas, affected areas), and then visualize the spatiotemporal and sentiment patterns of the evacuation behaviors.
8.5.1 General Spatiotemporal Analytics
Since this research focuses only on evacuation behaviors in the contiguous United States, all trajectories outside of this range are discarded from analysis and visualization. To examine the general pattern of the trajectories, we use the CBSA to aggregate the trajectories within the metropolitan statistical areas of U.S. cities. Table 8.2 shows the number of trajectories among the contiguous U.S. cities (CBSA) during the entire year of 2017, as well as during Hurricane Harvey. Figure 8.5 displays the distribution of these aggregated trajectories in 2017. Recalling that all the selected users are spatiotemporally related to Hurricane Harvey, it makes sense that the city of Houston becomes the center of our research. Combining bidirectional trajectories, we recognize the strong connections between Houston and the cities of Austin, Dallas, and San Antonio, and then weaker connections with Los Angeles, New York, New Orleans, Beaumont and Port Arthur, Las Vegas, Chicago, and Atlanta, as well as some non-Houston
Table 8.2 Number of trajectories among contiguous U.S. cities in 2017 and during Hurricane Harvey

| Entire year of 2017: city pair | Number of trajectories | During Hurricane Harvey: city pair | Number of trajectories |
|---|---|---|---|
| Houston–Austin | 872 | Houston–Austin | 124 |
| Houston–Dallas | 571 | Houston–New York | 103 |
| Houston–San Antonio | 529 | Houston–Dallas | 93 |
| Houston–Los Angeles | 460 | Houston–Los Angeles | 76 |
| Houston–New York | 393 | Houston–San Antonio | 62 |
| Los Angeles–Las Vegas | 297 | Houston–Beaumont and Port Arthur | 49 |
| Houston–New Orleans | 241 | Houston–Atlanta | 46 |
| Houston–Beaumont and Port Arthur | 236 | Houston–Chicago | 37 |
| Houston–Las Vegas | 231 | Houston–Miami | 35 |
| Houston–Chicago | 222 | Houston–Las Vegas | 32 |
| Houston–Atlanta | 209 | | |
Fig. 8.5 Aggregated trajectories among contiguous U.S. cities in 2017
connections like Los Angeles–Las Vegas and Austin–San Antonio. Results show that on normal days, the strongest connections are found between Houston and nearby regional Texas cities, and then between Houston and major cities in the contiguous United States. Focusing on Hurricane Harvey, we generate a series of figures showing the connections between Houston and other cities (CBSA, as before) during the hurricane period. Figure 8.6 displays the number and distribution of the trajectories between Houston and other cities during Hurricane Harvey, and the pattern differs slightly from the overall pattern: the strongest connection is between Houston and the city (CBSA) of Austin, followed by New York, Dallas, Los Angeles, San Antonio, Beaumont and Port Arthur, Atlanta, Chicago, Miami, and Las Vegas. Compared to the patterns on non-hurricane days, some major U.S. cities, such as New York and Los Angeles, become more connected with Houston during the hurricane, rather than the regional cities in Texas. In general, Austin, Dallas, San Antonio, and Beaumont and Port Arthur still play the role of important regional cities that are strongly connected to Houston. Considering both the rank and the share of trajectories during Hurricane Harvey, Beaumont and Port Arthur (CBSA) have higher importance during Hurricane Harvey, while Austin and San Antonio become less important. Such patterns might be explained by the impact of Hurricane Harvey, as Austin and San Antonio
Fig. 8.6 Aggregated trajectories with Houston during Hurricane Harvey
were initially under the track of the storm while Beaumont and Port Arthur were in the opposite direction, but the fact is that Hurricane Harvey later turned to the east, causing worse conditions in Beaumont and Port Arthur than in Austin. Zooming in to the generally affected area, Fig. 8.7 displays the spatial pattern at the county level (still described using the main city names): almost all the strong connections are between Houston and other regional cities. The strongest connection is between Houston and Austin, followed by the connections between Houston and Dallas, Beaumont and Port Arthur (both located in one county), San Antonio, and San Angelo, in that order. From the perspective of the three 10-day disaster phases (preparedness, response, and recovery), we see a change in the temporal pattern. From the preparedness phase (August 13–22) to the response phase (August 23–September 1), the number of trajectories with Houston increases dramatically, and it then drops back in the recovery phase (September 2–September 11). After aggregation by cities (CBSA), we measure and compare the maximum count of trajectories over all cities, the count of connected cities, the average count of trajectories per city, and the total count of trajectories within the different disaster phases (Table 8.3 and Fig. 8.8). To provide a comprehensive understanding of the spatial pattern of the destinations from Houston during Hurricane Harvey, we also generate a proportional symbol map at the county level (Fig. 8.9). Besides the similar pattern of the farther cities, we can
Fig. 8.7 Aggregated trajectories within the affected area during Hurricane Harvey
Table 8.3 Statistics of trajectories with Houston at different phases of disasters with aggregated cities (CBSA)

| Phase | Maximum count of trajectories of individual city | Count of connected cities | Average count of trajectories of individual city | Total count of trajectories |
|---|---|---|---|---|
| Preparedness phase (Aug 13–Aug 22) | 38 | 99 | 4.37 | 433 |
| Response phase (Aug 23–Sep 1) | 69 | 117 | 5.58 | 654 |
| Recovery phase (Sep 2–Sep 11) | 54 | 95 | 4.75 | 452 |
Fig. 8.8 Comparison of aggregated trajectories with Houston at different phases of disasters, showing the response phase with the highest connection in general
also clearly see the clustering pattern of the destination counties right around the city of Houston. Because the spatial units differ (CBSA and county), the pattern of individual cities might look different, but the overall pattern remains unchanged.
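The four statistics compared across phases in Table 8.3 follow directly from per-city trajectory counts. The sketch below assumes a mapping from city (CBSA) name to the number of trajectories with Houston in one phase; `phase_summary` and the dictionary keys are hypothetical names, and the reading of the average as total divided by connected cities is an inference that agrees with the table (for the preparedness phase, 433 / 99 ≈ 4.37).

```python
def phase_summary(counts_by_city):
    """Summarize trajectory counts with Houston for one disaster phase.

    `counts_by_city` maps a city name to its trajectory count; cities
    with a zero count are treated as not connected.
    """
    values = [c for c in counts_by_city.values() if c > 0]
    total = sum(values)
    return {
        "max_count": max(values) if values else 0,
        "connected_cities": len(values),
        "average_count": total / len(values) if values else 0.0,
        "total_count": total,
    }
```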
Fig. 8.9 Destinations from Houston at the county level during Hurricane Harvey
8.5.2 Spatiotemporal Analytics on Interested Regions
After examining the general spatiotemporal pattern, we conduct analytics and visualizations for several regions of interest, including the greater Houston area, evacuation areas, and affected areas.
8.5.2.1 Greater Houston Area
Based on the spatial relationship with the greater Houston area, all trajectory segments are categorized into four groups: inbound, outbound, inside, and outside. Specifically, the inbound group refers to trajectories entering the Houston area; the outbound group refers to those leaving it; the inside group refers to trajectories entirely within the Houston area; and the outside group refers to those entirely outside the Houston area. Focusing on the first three relevant groups, we count the number of trajectories in 2017 by day and plot them with a 7-day smoothing strategy (Fig. 8.10). The event around the end of August is easy to identify, as all three groups peak during the Hurricane Harvey period. In addition, there is a small
Fig. 8.10 Number of trajectories inbound, outbound, and inside Houston in 2017 (7-day smooth)
peak of inside-Houston trajectories in early February 2017, which is probably related to a famous event held in Houston, Super Bowl LI. Figure 8.11 further zooms in to the days of the three disaster phases: preparedness, response, and recovery. Interestingly, the inbound peak occurs earlier than the outbound peak. The outbound peak is a relatively good representation of the evacuation behavior during the hurricane and of the relocation after the hurricane due to the catastrophic loss of private property and public infrastructure. The inbound peak might reflect real movement towards the city, but it is also related to the characteristics of Twitter-based trajectories: people may have stayed in Houston for a long time but posted tweets intensively only during the disaster; meanwhile, their previous geotagged posts were made outside Houston, producing such “fake inbound trajectories”.
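The four-way grouping and the daily smoothing can be sketched as below. Two points are assumptions rather than details given in the text: region membership of the endpoints is taken as already computed (for example by a point-in-polygon test), and the 7-day smoothing is implemented as a trailing moving average. The names `classify` and `moving_average` are hypothetical.

```python
def classify(origin_in_region, dest_in_region):
    """Group a segment by whether its endpoints fall inside the region."""
    if origin_in_region and dest_in_region:
        return "inside"
    if dest_in_region:
        return "inbound"
    if origin_in_region:
        return "outbound"
    return "outside"


def moving_average(daily_counts, window=7):
    """Trailing moving average; early days use the values available so far."""
    smoothed = []
    for i in range(len(daily_counts)):
        chunk = daily_counts[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

A centered window would shift the smoothed peaks slightly earlier, so the choice matters when comparing the timing of inbound and outbound peaks.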
Fig. 8.11 Number of trajectories inbound, outbound and inside Houston at different phases of disasters (preparedness, response, recovery)
8.5.2.2 Evacuation Areas
As stated in the description of the case event, evacuation policies differed widely across areas during Hurricane Harvey and were controversial, which calls for further examination. Considering that evacuees were encouraged or forced to move from the evacuation areas (mandatory and voluntary) to safer places outside, we calculate the percentage of outbound trajectories among the relevant trajectories in each stage, for both the mandatory evacuation area (e.g., Corpus Christi) and the voluntary evacuation area (e.g., Houston). The relevant trajectories exclude the outside trajectories; specifically, they include the inside, inbound, and outbound trajectories of the evacuation areas. Results show that the mandatory evacuation area has a higher level of outbound behavior than the voluntary evacuation area, which illustrates the effectiveness of the mandatory evacuation policy (Fig. 8.12). Regarding time, the mandatory evacuation areas have the highest level of outbound movement in Stage III, when Hurricane Harvey had just made landfall, which then drops to the lowest level in Stage IV, when Hurricane Harvey was weakening; for voluntary evacuation areas, almost the opposite happens: the outbound trajectories reach the lowest level in Stage III and then increase significantly to a peak in Stage IV. This comparison suggests that evacuation movements were practiced more in mandatory evacuation areas than in voluntary evacuation areas.
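The percentage described above can be computed per stage once each segment carries a stage label and its group relative to one evacuation area. The following is a minimal sketch with an assumed record layout of (stage, group), where group is one of "inside", "inbound", "outbound", or "outside"; `outbound_share_by_stage` is a hypothetical name.

```python
from collections import defaultdict


def outbound_share_by_stage(records):
    """Percentage of outbound trajectories among relevant ones, per stage.

    Relevant trajectories are the inside, inbound, and outbound groups;
    "outside" trajectories are excluded from the denominator.
    """
    outbound = defaultdict(int)
    relevant = defaultdict(int)
    for stage, group in records:
        if group == "outside":
            continue
        relevant[stage] += 1
        if group == "outbound":
            outbound[stage] += 1
    return {stage: 100.0 * outbound[stage] / relevant[stage] for stage in relevant}
```

Running it once on the segments grouped against the mandatory area and once against the voluntary area would give the two series compared in Fig. 8.12.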
Fig. 8.12 Percentage of trajectories outbound mandatory and voluntary evacuation areas at different stages
8.5.2.3 Affected Areas
By looking at the moderately affected and severely affected areas published after the event, we are able to simplify the spatiotemporal pattern of evacuation behaviors. We categorize the main “evacuation” trajectories into three types, “severe → outside”, “severe → moderate”, and “moderate → outside”, and count the number of each type over time (Fig. 8.13). The result shows that the “severe → outside” trajectories grow earlier, with an early and significant peak, and the “severe → moderate” trajectories grow later, with a late and moderate peak, while there is no peak for the “moderate → outside” trajectories during this event. Read in reverse, this result suggests that people who evacuated early and actively turned out to live in severely affected areas, illustrating the accuracy and effectiveness of the evacuation orders.
8.5.3 General Sentimental Analytics
Based on the tweet content, we use two packages with Python to calculate the sentiment (VADER) and emotion (LIWC) scores. After plotting the averaged sentiment score (from VADER) by day in 2017, a dramatic valley is observed during Hurricane Harvey’s period, indicating relatively negative sentiments (Fig. 8.14). Hurricane Harvey significantly impacts people’s sentiment expression on Twitter. This result is consistent with the previous step of keeping only the Twitter users who were spatiotemporally affected by Hurricane Harvey. Zooming in to Hurricane Harvey’s days and focusing on the emotions, we calculate the average score of three negative emotions: anger, sadness, and anxiety. The score is calculated by LIWC (lexicon-based emotional analysis), so we can only interpret the relative temporal pattern instead of directly comparing the absolute values of
Fig. 8.13 Numbers of “evacuation” trajectories leaving the affected areas
Fig. 8.14 Averaged sentiment score by day in 2017, with Y-axis showing the sentiment score
different indexes. Results illustrate that the anger emotion has an early peak on August 27, the sadness emotion remains high from August 29 to September 1, and the anxiety emotion has a small peak on August 31 (Fig. 8.15). This result pictures the average change of emotions expressed on Twitter: people got angry when the hurricane had just made landfall and caused extensive damage, and felt sad about the aftermath once the hurricane weakened and moved away, with slightly more anxiety. We also explore the relationship between the evacuation destination and the tweet-based sentiment. To keep the result reliable, we only consider the major destination cities (CBSA) with over 300 trajectories with Houston. The destination cities with their averaged sentiment score levels are displayed in Fig. 8.16. Destinations with higher sentiment scores include Las Vegas, Los Angeles, New York, Denver, Atlanta, etc.; low-sentiment-score destinations include Chicago, Little Rock, New Orleans, Orlando, Dallas, etc. The positive expressions in tweets include thanks to people who helped them, blessings to the impacted areas, and appreciation of having survived safely; the negative expressions include general discussion of the disaster, exclamations about being a victim (e.g., loss of property), and complaints about the rescue work.
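The destination-level averaging with a minimum trajectory count can be sketched as follows. The input layout, one (destination, sentiment score) pair per arriving trajectory, and the name `mean_sentiment_by_destination` are assumptions for illustration; the threshold is passed in as a parameter rather than fixed.

```python
from collections import defaultdict


def mean_sentiment_by_destination(arrivals, min_trajectories):
    """Average sentiment per destination, keeping only major destinations.

    `arrivals` is an iterable of (destination, sentiment_score) pairs.
    A destination is kept only if it has more than `min_trajectories`
    trajectories, so that averages over few tweets are not reported.
    """
    scores = defaultdict(list)
    for dest, score in arrivals:
        scores[dest].append(score)
    return {
        dest: sum(vals) / len(vals)
        for dest, vals in scores.items()
        if len(vals) > min_trajectories
    }
```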
Fig. 8.15 Averaged scores of negative emotions by day during Hurricane Harvey
Fig. 8.16 Major destinations from Houston with averaged sentiment level
8.5.4 Sentimental Analytics on Interested Regions
Combined with the regions of interest and the hurricane stages, the corresponding average sentiment scores are calculated and compared, showing the average level of sentiment in different regions, as well as the average change of sentiment when moving from one region to another.
8.5.4.1 Greater Houston Area
Figure 8.17 illustrates that the sentiment score is lower inside Houston than outside of Houston during the four stages of Hurricane Harvey, and that both witness a significant drop from Stage II to Stage III. According to this result, we argue that Hurricane Harvey has a more negative impact on the sentiment of Houston citizens. In terms of the sentiment changes, the outbound trajectories have a relatively large variance, with an intense drop at Stage II and a large increase at other stages of Hurricane Harvey, and the inbound trajectories also have a drop at Stage IV (Fig. 8.18). This result might reflect the significant influence of the hurricane’s threat on people who
Fig. 8.17 Average sentiment scores inside and outside Houston at different stages of Hurricane Harvey
Fig. 8.18 Change of sentiment scores among trajectories with different spatial relationships to Houston at different stages of Hurricane Harvey
were moving away from their homes in the days when the hurricane was approaching its peak energy right before making landfall.
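The change of sentiment along a trajectory is not defined explicitly in the text. One plausible reading, used in the sketch below purely as an assumption, is the sentiment score of the tweet at the end of a segment minus the score of the tweet at its start, averaged per hurricane stage and spatial group. The record layout and the name `mean_sentiment_change` are hypothetical.

```python
from collections import defaultdict


def mean_sentiment_change(segments):
    """Average sentiment change per (stage, group).

    `segments` is an iterable of (stage, group, origin_score, dest_score).
    The change of one segment is assumed to be dest_score - origin_score.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stage, group, origin_score, dest_score in segments:
        key = (stage, group)
        sums[key] += dest_score - origin_score
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

A negative value for ("Stage II", "outbound") would correspond to the kind of drop described for outbound trajectories before landfall.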
8.5.4.2 Evacuation Areas
All three types of area witness a clear decrease of average sentiment score during Hurricane Harvey (Fig. 8.19), especially at Stage III, at which the mandatory evacuation area has the lowest level of sentiment, followed by the voluntary evacuation area. This result might reveal the negative impact of evacuation orders, especially mandatory ones, on people’s sentiment, and such negative impact is more significant after the hurricane makes landfall.
Fig. 8.19 Average sentiment scores among outside, voluntary evacuation area, and mandatory evacuation area at different stages of Hurricane Harvey
Regarding the change under these categories (Fig. 8.20), the “voluntary → mandatory” trajectories experience a substantial decrease at Stage II, and the trajectories in the opposite direction have a similar level of increase at Stage III. Trajectories of other types do not show a clear pattern. Usually, the mandatory evacuation areas face a more challenging situation than the voluntary evacuation areas, so it is not wise to travel
Fig. 8.20 Change of sentiment scores among trajectories with different spatial relationships to different evacuation areas at different stages of Hurricane Harvey
Fig. 8.21 Average sentiment scores among outside, moderately affected area, and severely affected area at different stages of Hurricane Harvey
from a region with a voluntary evacuation order to one with a mandatory order, which may explain why the “voluntary → mandatory” trajectories have a significant drop of sentiment score before the hurricane made landfall. In contrast, when the hurricane is weakening at Stage III, people moving from mandatory evacuation areas back to voluntary evacuation areas show a more positive average sentiment level.
8.5.4.3 Affected Areas
Figure 8.21 displays the interesting sentiment pattern of moderately and severely affected areas. Up to and including Stage II, only the severely affected area has a lower average sentiment than the outside, while from Stage III onward, the moderately affected area drops to the same level as the severely affected area until the end of the hurricane. The different trends of sentiment are in accordance with the assumption that the severely affected areas witness a more significant drop of sentiment score than the moderately affected areas, especially when the disaster actually struck. As discussed in the previous section, here we compare the three types of “evacuation” trajectories (Fig. 8.22): “severe → outside”, “severe → moderate”, and “moderate → outside”. In general, for the “evacuation movements”, the more severely the relevant areas are affected, the more significant the sentiment changes, with a lower sentiment level.
Fig. 8.22 Change of sentiment scores among trajectories with selected spatial relationships to different affected areas (“evacuation”) at different stages of Hurricane Harvey
8.6 Conclusion
This research utilized a geotagged Twitter dataset to draw a picture of evacuation behaviors during 2017 Hurricane Harvey. To provide a comprehensive understanding of the spatiotemporal and sentiment patterns of the evacuation behaviors, the trajectories were categorized into different moving directions at different disaster stages, taking into account the evacuation policy and the disaster impacts. Specifically, we extracted the evacuation trajectories of 1,320 verified Twitter users impacted by Hurricane Harvey and built a framework to analyze the spatiotemporal and sentiment patterns of these users’ movement behaviors, while considering important event-specific regions. First, the aggregated trajectories at the city level help compare the interactions among cities and how they changed during the hurricane. For example, Houston has strong connections with many major cities nationwide as well as nearby local cities; however, this pattern changed during the hurricane period, when the connections with big cities generally became stronger than those with local ones, matching the hurricane’s historical track. When we aggregate the days into several disaster phases, the response phase arguably has the strongest connections, though this might be related to the natural characteristics of Twitter data. Second, Twitter-based trajectories show great potential for analyzing the spatiotemporal patterns relevant to regions of interest. For instance, Houston has relatively stronger inside movements than inbound or outbound ones during the disaster response phase. Areas with a mandatory evacuation order tend to have increasing outbound trajectories even after the landfall, while voluntary evacuation areas conversely have fewer and decreasing outbound trajectories till the hurricane
moved away, which mainly justifies the mandatory evacuation order. The analysis of affected areas also indicates that people who evacuated early and actively turned out to live in the places with severe disaster impact. Third, the sentiment and emotion scores generated from tweet content help draw an integrative understanding of the evacuation patterns. In general, a hurricane, as a disaster, significantly decreases the sentiment score and evokes negative emotions like anger, sadness, and anxiety differently at different stages, and trajectories moving towards different destinations may also have different expressions of sentiment. In terms of the regions of interest, the hurricane raises more negative sentiments in the worst-hit areas around the landfall period. A higher negative impact on citizens’ sentiment is observed in areas with mandatory evacuation orders than in areas with voluntary ones and in outside areas. Similarly, the regions with more severe actual hurricane damage show a deeper negative influence on sentiment expressions than the moderately affected regions. To conclude, this framework provides an applicable approach for natural disaster research and practice to better understand evacuation behaviors. Though limitations exist in general social media data applications and in specific applications to natural disaster management, such a social sensing approach shows great potential for revealing more hidden patterns and playing an increasingly important role. In the future, deeper mining of the geotagged Twitter data could be conducted to reveal more semantic information such as the discussed topics [31], awareness of the situation, and even transportation mode. Also, user-level analysis is worth exploring, providing clearer user portraits in the context of natural disasters. 
Further studies could also include comparisons among multiple data sources, as well as applications in more complicated multi-event scenarios, to better utilize social media data in natural disaster applications.
References
Alam, F., Ofli, F., Imran, M., & Aupetit, M. (2018). A twitter tale of three hurricanes: Harvey, Irma, and Maria. arXiv:1805.05144 [cs].
Anderson, J., Casas Saez, G., Anderson, K., Palen, L., & Morss, R. (2019). Incorporating context and location into social media analysis: A scalable, cloud-based approach for more powerful data science. https://doi.org/10.24251/HICSS.2019.275
Buntain, C. L., & Lim, J. K. R. (2018). #pray4victims: Consistencies in response to disaster on Twitter. Proceedings of the ACM Human Computer Interaction, 2:25:1–25:18.
Charalabidis, Y., Loukis, E. N., Androutsopoulou, A., Karkaletsis, V., & Triantafillou, A. (2014). Passive crowdsourcing in government using social media. Transform Gov People Process Policy, 8, 283–308.
Cheng, J., East, P., Blanco, E., Kang Sim, E., Castillo, M., Lozoff, B., & Gahagan, S. (2016). Obesity leads to declines in motor skills across childhood. Child: Care, Health and Development, 42, 343–350.
Demuth, J. L., Morss, R. E., Palen, L., et al. (2018). “Sometimes da #beachlife ain’t always da wave”: Understanding people’s evolving hurricane risk communication, risk assessments, and responses using Twitter narratives. Weather, Climate, and Society, 10, 537–560.
Dong, H., Halem, M., & Zhou, S. (2013). Social media data analytics applied to hurricane Sandy. In: 2013 international conference on social computing, pp. 963–966.
Earle, P. S., Bowden, D. C., & Guy, M. R. (2011). Twitter earthquake detection: Earthquake monitoring in a social world. Annals of Geophysics, 54:708715
Gao, H., Barbier, G., & Goolsby, R. (2011). Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26, 10–14.
Henderson, T. L., Sirois, M., Chen, A.C.-C., Airriess, C., Swanson, D. A., & Banks, D. (2009). After a disaster: Lessons in survey methodology from hurricane Katrina. Population Research and Policy Review, 28, 67–92.
Houston, J. B., Hawthorne, J., Perreault, M. F., et al. (2015). Social media and disasters: A functional framework for social media use in disaster planning, response, and research. Disasters, 39, 1–22.
Huang, Q., & Xiao, Y. (2015). Geographic situational awareness: Mining tweets for disaster preparedness, emergency response, impact, and recovery. ISPRS International Journal of Geo-Information, 4, 1549–1568.
Huang, X., Wang, C., Li, Z., & Ning, H. (2019). A visual–textual fused approach to automated tagging of flood-related tweets during a flood event. International Journal of Digital Earth, 12, 1248–1264.
Kafi, K. M., & Gibril, M. B. A. (2016). GPS application in disaster management: A review. Asian Journal of Applied Science, 4.
Kirschenbaum, A. (2004). Generic sources of disaster communities: A social network approach. International Journal of Sociology and Social Policy, 24, 94–129.
Lindsay, B. R. (2011). Social media and disasters: Current uses, future options, and policy considerations. 1–14.
Liu, W., Lai, C.-H., & Xu, W. (Wayne) (2018). Tweeting about emergency: A semantic network analysis of government organizations’ social media messaging during Hurricane Harvey. The Public Relations Review, 44, 807–819.
Murray-Tuite, P., & Wolshon, B. (2013). Evacuation transportation modeling: An overview of research, development, and practice. Transportation Research Part C Emerging Technologies, 27, 25–45.
Nguyen, C., Schlesinger, K. J., Han, F., Gür, I., & Carlson, J. M. (2019). Modeling individual and group evacuation decisions during wildfires. Fire Technology, 55, 517–545.
Pantti, M., Wahl-Jorgensen, K., & Cottle, S. (2012). Disasters and the media, First (printing). Peter Lang Inc.
Rossi, C., Acerbo, F. S., Ylinen, K., Juga, I., Nurmi, P., Bosca, A., Tarasconi, F., Cristoforetti, M., & Alikadic, A. (2018). Early detection and information extraction for weather-induced floods using social media streams. The International Journal of Disaster Risk Reduction, 30, 145–157.
Sadri, A. M., Hasan, S., Ukkusuri, S. V., & Cebrian, M. (2018). Crisis communication patterns in social media during Hurricane Sandy. Transportation Research Record, 2672, 125–137.
Şahin, C., Rokne, J., & Alhajj, R. (2019). Emergency detection and evacuation planning using social media. In T. Özyer, S. Bakshi, & R. Alhajj (Eds.), Soc (pp. 149–164). Springer International Publishing, Cham.
Samuels, R., Taylor, J. E., & Mohammadi, N. (2018). The sound of silence: Exploring how decreases in Tweets contribute to local crisis identification. ISCRAM.
Schempp, T., Zhang, H., Schmidt, A., Hong, M., & Akerkar, R. (2019). A framework to integrate social media and authoritative data for disaster relief detection and distribution optimization. The International Journal of Disaster Risk Reduction, 39:101143.
Verma, S., Vieweg, S., Corvey, W. J., Palen, L., Martin, J. H., Palmer, M., Schram, A., & Anderson, K. M. (2011). Natural language processing to the rescue? Extracting “situational awareness” tweets during mass emergency. ICWSM.
Weinkle, J., Landsea, C., Collins, D., Musulin, R., Crompton, R. P., Klotzbach, P. J., & Pielke, R. (2018). Normalized hurricane damage in the continental United States 1900–2017. Nature Sustainability, 1, 808–813.
8 Examining Spatiotemporal and Sentiment Patterns …
165
Xiao, Y., Huang, Q., & Wu, K. (2015). Understanding social media data for disaster management. Natural Hazards, 79, 1663–1679. Yu, M., Huang, Q., Qin, H., Scheele, C., & Yang, C. (2019). Deep learning for real-time social media text classification for situation awareness—using Hurricanes Sandy, Harvey, and Irma as case studies. International Journal of Digital Earth, 12, 1230–1247. Zhang, C., Fan, C., Yao, W., Hu, X., & Mostafavi, A. (2019). Social media for intelligent public information and warning in disasters: An interdisciplinary review. International Journal of Information Management, 49, 190–207.
Chapter 9
Sentiment Analysis of Social Media Response and Spatial Distribution Patterns on the COVID-19 Outbreak: The Case Study of Italy

Gabriela Fernandez, Carol Maione, Karenina Zaballa, Norbert Bonnici, Brian H. Spitzberg, Jarai Carter, Harrison Yang, Jack McKew, Filippo Bonora, Shraddha S. Ghodke, Chanwoo Jin, Rachelle De Ocampo, Wayne Kepner, and Ming-Hsiang Tsou
G. Fernandez (B) · C. Maione · K. Zaballa · H. Yang · C. Jin · R. De Ocampo · W. Kepner · M.-H. Tsou
Department of Geography, Center for Human Dynamics in the Mobile Age, Metabolism of Cities Living Lab, San Diego State University, San Diego, CA, USA
e-mail: [email protected]

C. Maione
Department of Management, Economics, and Industrial Engineering, Politecnico di Milano, Milan, Italy

B. H. Spitzberg
Department of Communication, San Diego State University, San Diego, CA, USA

N. Bonnici
Malta Council for Science and Technology, Kalkara, Malta

J. Carter
Columbia University, New York, NY, USA
John Deere, New York, NY, USA

S. S. Ghodke
University College London, London, UK

F. Bonora
University of Bologna, Bologna, Italy

J. McKew
AECOM, Newcastle, Australia

© Springer Nature Switzerland AG 2021
A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_9
9.1 Introduction

At least since the Black Death plague of the mid-1300s, epidemics have been a source of extreme sentiments among those at risk. In contemporary times, both mainstream and social media provide important outlets in which the population gives voice to those sentiments. As devastating as any given disease may be, how people feel about the disease and its reverberating implications throughout society is likely to play a significant role in how they behave toward the disease. How people respond to their feelings thereby provides an important window into the understanding, surveillance, and alteration of such health-relevant behavior. The COVID-19 (aka SARS-CoV-2, Coronavirus) pandemic offers new and advanced opportunities to investigate the potential role of such sentiments in the dynamic properties of the disease and their relationship to various public, media, and governmental responses to the disease. This study examines one mode of social media, COVID-19 tweets, in 10 major Italian cities during the pandemic and their relationship to a number of geographic indicators in order to demonstrate the potential epidemiological value of such social media surveillance of disease-related sentiments.

There are at least four interrelated ways in which sentiments provide an important insight into disease outbreaks. First, they provide a window into the emotional experience of epidemics and related threats to health. They provide an effective profile of population-level phenomenology in response to disease. Second, active surveillance of such sentiments provides an opportunity to see how and to what extent such emotions provide lagging or leading signals of the progression and dynamics of the disease itself. Such diagnostic and potentially predictive information may be integrated into more comprehensive health care policies and interventions. Third, several emotions, and in particular fear, play key roles in leading theories of health messaging and influence.
To the extent that certain messaging campaigns can tap into and activate those emotions, health behavior can be significantly influenced toward more effective strategies of treatment and prevention. Fourth, to the extent that such surveillance can track the disease outbreak in association with geospatial, chronological and demographic features, it can provide essential information on where, when and how to target health infrastructure, resource commitments, and health communication campaign characteristics. Understanding the unfolding dynamics of the disease outbreak can facilitate health policy at all levels, from global to national to regional to local. There are many sentiments that may be associated with and expressed about a threatening disease. Among the most common epidemic-related emotions are fear and its closely affiliated emotions of worry, anxiety and anger (Ahorsu et al., 2020). In addition, as prospects arise regarding news of successful therapies, prevention tactics, governmental interventions and subsiding infection or mortality rates, feelings of joy, happiness, relaxation or relief may also play a role in how the public responds to the epidemic. This study will focus on three emotional states: fear, anger, and joy. Fear has a long history as a topic of academic interest as a linguistic construction (Chamberlain, 1899). Both fear and anger expressed in social media have been found
to distinguish highly influential nodes in social networks (Chung & Zeng, 2020) and appear uniquely predictive of social media cascades and contagions (Fan et al., 2018). Prior studies of epidemics have demonstrated that fear is a significant component of people’s hearing about, experiencing, and reacting to a disease outbreak. Early studies of the COVID-19 pandemic also indicate that fear, anger, and joy are common elements of people’s emotional response (Pakpour & Griffiths, 2020). Such sentiments appear to correspond to people’s expressions in social media. For example, fear and stress were common in Chinese social media during the COVID-19 outbreak (Abd-Alrazaq et al., 2020). Such worry- and fear-related posts in Weibo, a local microblogging website, demonstrated geospatial concentrations in the most affected cities in China (Han et al., 2020). Another study of Weibo texts found that “as the COVID-19 epidemic began to spread throughout the country after January 20, 2020, the public eased their concerns and fears caused by their uncertainty toward and ignorance of the epidemic, and responded to the epidemic with a more objective attitude” (Zhao et al., 2020). Similarly, an analysis of tweets in the U.S. found that “keyword and sentiment analysis factoring the timeline, showed an increase in seriousness, and fear […], indicating that public sentiment changed as the consequences of the rapid spread of Coronavirus, and the damaging impact upon COVID-19 patients became more evident” (Samuel et al., 2020). Indeed, during the early days of COVID-19 entering the U.S., “the fear-sentiment […] was the most dominant emotion across the entire Tweets data” (Samuel et al., 2020). A global study of sentiments in 20 M Twitter messages found that fear was evident in over 50% of messages in January and slightly under 30% in early April, suggesting that sentiments respond to changes in the epidemic over time (Lwin et al., 2020).
Social media increasingly demonstrate their potential for cyberspace signaling of what is occurring in real space (Spitzberg, 2014, 2019). Clearly, the public is increasingly relying on social media as a primary way in which they receive and diffuse news and information about matters of concern in general (Allcott & Gentzkow, 2017), and about the pandemic in particular (Jurkowitz & Mitchell, 2020). There is evidence that since the pandemic began, people have begun spending more time on social media (Koeze & Popper, 2020; Watson, 2020), perhaps both because they have more leisure time in quarantine and to engage in personal information-seeking to reduce their uncertainty about the disease. Interest in the Coronavirus has diffused rapidly in social media topic networks (Park et al., 2020). Of the approximately 50 M people online and 35 M people active on social media in Italy, almost all of whom use mobile devices to connect with one or more platforms (YouTube, WhatsApp, Facebook, Instagram, Twitter, and Messenger), the average amount of daily time engaging in such media is almost 2 h, and increasing (Starri, 2020). Across a variety of digital epidemiology studies, there has been “a positive correlation between digital data and the temporal evolution of epidemics or outbreaks” (Towers et al., 2015). If activity in cyberspace correlates with activity in real space, then tracking such media provides the potential for important health intervention. Sentiments can be contagious in somewhat the same ways as diseases (Christakis & Fowler, 2013; Ferrara & Yang, 2015). It is little surprise, therefore, that social media that diffuse widely are considered “viral.” Inquiry into the extent to which
sentiments are expressed in social media during health crises such as disease epidemics becomes vital to managing such crises. The study reported here provides an initial exploration into the feasibility of such surveillance. The chapter is structured as follows: Sect. 9.2 describes the socio-economic indicators that characterize the different geographies of Italy and the spatial distribution and spread of COVID-19 during the period of March–June 2020. The methodology of our study is reported in Sect. 9.3. Section 9.4 then presents and analyzes our main findings. Finally, Sect. 9.5 draws conclusions and suggests future research directions.
9.2 COVID-19 Outbreak and Spatial Distribution

Italy has been one of the hardest-hit countries in Europe and the first one to present diffused COVID-19 sites. February 21st marked the beginning of a dreadful chapter in Italian history, with the country’s first COVID-related death. Since then, cases quickly escalated with disproportionate impacts on the Northern regions, where the highest rates of positive cases, deaths, and hospitalized patients were registered (Dipartimento della Protezione Civile, 2020; Lab24, 2020). On February 23rd, the Italian Government announced a “Code Red” state of emergency mandate for a number of municipalities in the Northern regions, including Lombardy, Veneto, and Emilia-Romagna, which was further extended to a national lockdown on March 11th (Italian Ministry of Health, 2020). Following the “Stay at home” ordinance, the lockdown restricted the movement of individuals and goods to essential activities only, banned all forms of public events and mass gathering, and imposed the use of masks and social distancing in all public spaces (Italian Ministry of Health, 2020).

The spread and distribution of COVID-19 cases varied dramatically by geography. A clear difference can be seen between the number of cases and deaths that occurred in the South and in the North of Italy, with the islands being the least affected areas. A possible explanation relates to the timing of policy implementation at the local and regional level. Moreover, major social, environmental, and economic effects took place due to the socio-economic conditions that have characterized the North–South divide for centuries. Historically, Italy has suffered a North–South divide that resulted from massive migrations of people from the Southern regions to the North in search of better livelihoods, jobs, and incomes (Musolino, 2018).
As a result, cities in Northern Italy have developed more populated urban environments, with better infrastructure and well-established industrial systems, widening the North–South divide. It is not surprising that the greatest share of GDP and economic activities is concentrated in the North and Center, while the South and Islands present higher unemployment rates and have less developed infrastructure, including health care and medical capacity. The study aims to understand how the COVID-19 response changes across these geographies.
9.3 Materials and Methods

9.3.1 Case Selection

To capture and understand the differences between geographies in Italy during the pandemic, the study classified 10 major Italian metropolitan cities into four geographical categories: north, central, south, and islands. The northern cities are Milan, Turin, Venice, and Bologna; the central cities are Florence and Rome; the southern cities are Naples and Bari; and the island cities are Palermo and Cagliari. Table 9.1 shows the socio-demographic characteristics and geography for the selected Italian city case studies.
Table 9.1 Socio-economic characteristics for selected Italian case studies based on geography (ISTAT, 2019)

| Type | City | Region | Area (km²) | Pop (n) | Radius surveyed (km) | Lat | Long |
|---|---|---|---|---|---|---|---|
| North (west) | Milan | Lombardy | 1,575 | 3,190,340 | 38 | 45.46 | 9.18 |
| North (west) | Turin | Piedmont | 6,829 | 2,293,340 | 59 | 45.07 | 7.68 |
| North (east) | Venice | Veneto | 2,462 | 858,455 | 56 | 45.44 | 12.31 |
| North (east) | Bologna | Emilia Romagna | 3,702 | 1,005,831 | 9 | 44.49 | 11.34 |
| Central | Florence | Tuscany | 3,514 | 1,007,435 | 56 | 43.76 | 11.25 |
| Central | Rome | Lazio | 5,352 | 4,336,915 | 67 | 41.90 | 12.49 |
| South | Naples | Campania | 1,171 | 3,128,702 | 45 | 40.85 | 14.26 |
| South | Bari | Apulia | 3,821 | 1,251,004 | 48 | 41.11 | 16.87 |
| Islands | Cagliari | Sardinia | 1,248 | 431,302 | 37 | 39.22 | 9.12 |
| Islands | Palermo | Sicily | 5,009 | 1,276,525 | 11 | 38.11 | 13.36 |

9.3.2 Data Collection

We collected a total of 4,227,882 coronavirus-related tweets posted between March 2, 2020, and June 15, 2020, using the Twitter standard search application programming interface (API). The search keywords were selected from the terms relating to the novel coronavirus that were most widely used by the scientific and news media communities. They consist of a set of predefined coronavirus search key terms in both English and Italian. Predefined Twitter search key terms/hashtags in English include: COVID-19, Coronavirus, CoronavirusOutbreak, coronavirusitaly,
racism, COVID2019, COVID19italy, Flu, ItalyCoronavirus, Lombardy, Italyquarantine, quarantineItaly, COVID. Predefined Twitter search key terms in Italian include: razzismo, Italiani all’estero, Influenza, Amuchina, Codogno, Contagiati, Contagio, Coronaviriusitalia, COVID19italia, COVID2019italia, Coronavirusitalia, CoronavirusItalla, Lombardia, zonarossa, focolai, and quarentena. Our script used a number of Python libraries, including Tweepy, Pandas, and BeautifulSoup. Tweepy provided API access to Twitter, Pandas handled our dataframes, and BeautifulSoup handled the extraction of data from HTML and XML files. Auxiliary libraries were re (regular expressions), json (to harvest from exact locations), sys, and datetime. We extracted and stored the text and metadata of the tweets. Moreover, we searched for the most popular Twitter hashtags used by people, including those with incorrect spellings of coronavirus. All code used in this study to generate the Twitter translation and sentiment analysis is freely available in our GitHub repository: https://github.com/HDMA-SDSU/Translate-Tweets.
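As a rough illustration of how such a search could be parameterized, the sketch below builds a query string and the `geocode` value (`latitude,longitude,radius`) accepted by the Twitter standard search endpoint, using three of the city centres and radii from Table 9.1. Pairing each city with a geocoded search is an assumption made here; the chapter does not spell out the exact request. The names `CITIES`, `KEYWORDS_IT`, `build_query`, and `geocode_for` are hypothetical, and no network call is made.

```python
# Illustrative sketch only: builds search parameters, makes no API call.
# City centres and radii (km) are copied from Table 9.1.
CITIES = {
    "Milan": (45.46, 9.18, 38),
    "Rome": (41.90, 12.49, 67),
    "Palermo": (38.11, 13.36, 11),
}

# A few of the Italian key terms listed above (subset chosen for brevity).
KEYWORDS_IT = ["Codogno", "Contagio", "zonarossa", "Italiani all'estero"]


def build_query(keywords):
    """Join keywords with OR, quoting multi-word phrases for exact matching."""
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return " OR ".join(terms)


def geocode_for(city):
    """Return the 'lat,long,radius' string for one city (hypothetical helper)."""
    lat, lon, radius_km = CITIES[city]
    return f"{lat},{lon},{radius_km}km"


params = {"q": build_query(KEYWORDS_IT), "geocode": geocode_for("Milan")}
print(params)
```

With Tweepy, a dictionary like `params` could be passed to the standard search wrapper (`API.search` in Tweepy 3.x, renamed `API.search_tweets` in 4.x); which version the authors used is not stated.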
9.3.3 Data Analysis

To study public sentiments, we chose Twitter as our target field. As one of the world’s biggest social media platforms, Twitter hosts abundant user-generated posts, which closely reflect the public’s reactions towards COVID-19 with low latency. As one of our analyses, we developed a sentiment analysis to classify tweets related to fear, anger, and joy. The sentiment analysis was performed using the NRC Emotion Lexicon (Mohammad & Turney, 2013; Mohammad et al., 2010), which scores a list of 14,182 unigrams based on sentiments (positive and negative) and emotions. A score for each emotion was produced from the number of unigrams scored for that emotion, normalized by the number of unigrams present in each tweet. The mean score for the three sentiments fear, anger, and joy was calculated across all of the tweets for a given day. These measures were cross-analyzed with COVID-19 total cases, Italian policy and mandated regulatory measures related to COVID-19, and socio-economic characteristics.

Our study explored a selected number of socio-economic indicators to cover a wide spectrum of social (percentage of people aged 65+, total number of COVID-19 deaths, and cumulative number of COVID-19 positive cases), environmental (population density, industry manufacture expenditure, and industry services expenditure), and economic factors (GDP per capita, unemployment rate, and intensive care unit (ICU) beds) to explore the differences and capacities between the four geographic regions of Italy (north, central, south, and islands). A one-way analysis of variance (ANOVA) was employed to determine whether there are any statistically significant differences between the means of the four regions and to assess which factors are most dependent on the selected independent variables. Data for all selected indicators, including socio-economic data, were collected between the months of March and June 2020.
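The per-tweet emotion score and daily mean described in this section can be sketched as follows. This is a minimal illustration, not the authors' script: `TOY_LEXICON` is a hypothetical five-word stand-in for the NRC Emotion Lexicon, the tokenizer is a simple regular expression, and the function names are invented for the example.

```python
import re
from collections import defaultdict
from statistics import mean

EMOTIONS = ("fear", "anger", "joy")

# Toy stand-in for the NRC Emotion Lexicon (hypothetical entries, not the real file).
TOY_LEXICON = {
    "death": {"fear", "anger"},
    "virus": {"fear"},
    "furious": {"anger"},
    "hope": {"joy"},
    "recovered": {"joy"},
}


def emotion_scores(text, lexicon=TOY_LEXICON):
    """Share of a tweet's unigrams that the lexicon associates with each emotion."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {e: 0.0 for e in EMOTIONS}
    return {
        e: sum(1 for t in tokens if e in lexicon.get(t, ())) / len(tokens)
        for e in EMOTIONS
    }


def daily_means(tweets, lexicon=TOY_LEXICON):
    """Average the per-tweet scores over all tweets posted on the same day."""
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(emotion_scores(text, lexicon))
    return {
        day: {e: mean(s[e] for s in scores) for e in EMOTIONS}
        for day, scores in by_day.items()
    }


sample = [
    ("2020-03-11", "death virus hope today"),
    ("2020-03-11", "hope hope"),
]
print(daily_means(sample))
```

Normalizing by tweet length keeps long and short tweets comparable, which matters when the daily mean mixes both.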
9.4 Results and Analysis

9.4.1 Spatial Analysis Across Italy

Social impacts. In an aging country like Italy, where a fourth of the population is aged 65 or older (Dowd et al., 2020), the nation’s population structure may explain the high mortality and the severity of the virus’ impacts. Italy’s northern regions paid the highest cost in victims, registering between 120 and 568% more deaths compared to 2019 (Istituto Superiore di Sanità, 2020a), with the 80–89 age group presenting the highest lethality (approximately 40%) (Istituto Superiore di Sanità, 2020b). The North region of Italy recorded 5–10 times more COVID-19 cases and deaths compared to the south, central, and island regions, with an average of 46,220 COVID-19 cases and 6,894 deaths across the four phases (Phase 0, 1, 2, and 3) (Dipartimento della Protezione Civile, 2020). The Center region of Italy recorded an average of 10,906 COVID-19 cases and 1,008 deaths. The South region of Italy recorded an average of 5,641 COVID-19 cases and 499 deaths. The Islands recorded an average of 3,018 COVID-19 cases and 210 deaths (Dipartimento della Protezione Civile, 2020) (Fig. 9.1). As COVID-19 claimed increasingly more victims among the older groups, the Residenze Sanitarie Assistenziali (RSAs), or elderly residences, became virus hotspots. The Istituto Superiore di Sanità, the nation’s most important health institute, found that more than 40% of the deaths that occurred in RSAs during the months of the lockdown were related to COVID-19, with the northern regions showing higher mortality (La Repubblica, 2020a). Furthermore, with COVID-19 an entire generation is slowly disappearing right before our eyes, bringing alarming changes to the country’s socio-cultural identity.

Economic impacts. Economic disparities are well rooted in national history.
A closer look at Italy’s economic geographies shows that the greatest share of the national GDP is concentrated in the northern and central regions, where the GDP per capita ranges between €31,000 and €39,000, compared to the southern and island regions, which have a GDP per capita of €17,000–21,000 (Statista, 2020a). The
Fig. 9.1 Geography of COVID-19 spread effects in Italy: social (left), economic (middle), and environmental (right) pandemic impacts
economic differences were further emphasized during the months of the lockdown. While local economies were not flourishing across the entire country, GDP losses occurred unevenly. The South and the Islands were most affected by the economic crisis, reporting an average unemployment rate of 16.8% and 16.2% respectively, compared to 5.7% in the North and 8% in the Center region of Italy (ISTAT, 2020a) (Fig. 9.1). A possible explanation is that the North of Italy is the economic engine hosting the majority of the country’s production activities, which remained open during the lockdown, thereby providing jobs, services, and incomes. In contrast, the South and Island regions of Italy, which depend heavily on tourism, suffered the negative impacts of the sector’s decline (Armitage & Esposito, 2020). It is also important to mention that many workers who lost their jobs relocated from northern to southern regions following the lockdown, triggering an inverse migration, which may bear medium- and long-term impacts on the country’s economy (D’Angerio, 2020).

The economic divide also translates into regional differences in the management of the National Health System. Italy’s system is composed of several Aziende Socio Sanitarie Territoriali (ASST), regional-based medical agencies that have significant autonomy regarding emergency management on the local territory (Gagliano et al., 2020). The decentralization of the national medical system can lead to further fragmentation of the system itself, especially during pandemic times, leading to the formation of independent and autonomous nuclei, sometimes under military management (e.g., Giuzzi & Ravizza, 2020; Longo, 2020). We can also observe regional differences across ASSTs concerning hospital capacity, including the number of beds, and in particular ICU beds, access to medical and personal protection equipment, ventilators, medical workers and staff, and financial resources, among others.
As of March 2020, it was estimated that Italy’s ICU capacity was equivalent to 12.5 beds for every 100,000 inhabitants, with a total of 5,200 beds (Remuzzi & Remuzzi, 2020; Statista, 2020b). During the lockdown, more than half of the ICU beds were occupied by COVID-19 patients, bringing the units to the brink of collapse. On April 13, out of the total hospitalized COVID-19 patients, 2,098 occupied ICU beds in the North, 425 in the Center, 151 in the South, and 78 in the Islands (Lab24, 2020) (Fig. 9.1). Some regions exceeded their bed capacity: for example, the Lombardy region (north) recorded a total of 1,143 ICU COVID-19 patients in a single day, in addition to ICU non-COVID patients, well beyond its ICU capacity of 724 (Armocida et al., 2020). A possible explanation lies in the increasingly large cuts made to the National Health System’s finances over the period 2010–2019, which caused a massive disinvestment in emergency health care (Armocida et al., 2020; Prante et al., 2020).

Environmental impacts. Since the first days of the lockdown, a stark change was observed in mobility and emissions across the country. With the closure of many activities, including schools and non-essential businesses, many Italians were confined to their homes, drastically reducing the number of cars and private vehicles on Italy’s most-crowded roads. In particular, the levels of particulate matter in the air, which reached alarming highs in the weeks before the lockdown, were curbed as an indirect impact of the lockdown (Becchetti et al., 2020; Fronte, 2020).
Scientists have advanced theories correlating high levels of atmospheric pollution with higher mortality from COVID-19 (Becchetti et al., 2020; Intini, 2020a, b; La Repubblica, 2020b; Talignani, 2020). In highly polluted regions, exposure to poor air quality (e.g., PM2.5, PM10, O3, SO2, and NO2) accelerated the speed of contagion from COVID-19 (Becchetti et al., 2020; D’Aria, 2020; Iannaccone, 2020): PM2.5 and PM10 acted like “virus carriers” in northern regions like Lombardy and Emilia Romagna (Conticini et al., 2020; D’Aria, 2020). During Phase 0, the average PM values recorded in Lombardy were 81.24 PM2.5 (AQI) and 34.47 PM10 (AQI), and 74.38 PM2.5 (AQI) and 29.47 PM10 (AQI) in Emilia Romagna (World Air Quality Index, 2020). In addition, since the virus has an incubation period of up to 14 days, the high levels of particulate matter recorded in the days before the national shutdown caused an upsurge in the number of cases during the first weeks of the lockdown.

Alongside transportation, another primary source of air pollution is nearby industrial plants. National statistics showed that northern regions invest more in industrial development, mainly with reference to the manufacturing and services (e.g., transport) sectors (Fig. 9.1). On average, the North invests €42,030 in the manufacturing industry and €132,421 in the services industry every year, with the Lombardy region alone investing twice as much as the North’s average (ISTAT, 2020b). A lower concentration of industries is observed across the rest of the country, with the Center investing €14,977 in manufacturing and €113,527 in services, the South €8,305 in manufacturing and €65,370 in services, and the Islands only €3,183 in manufacturing and €45,908 in services (ISTAT, 2020b).
9.4.2 Factors in Spatial Distribution Across Italy

The study employed a one-way ANOVA test to explore the relationship between Italian geography (north, central, south, and islands) and the social, economic, and environmental factors influenced by the COVID-19 pandemic outbreak. The alternative hypothesis posited differences in nine factors across the different Italian geographic areas. In particular, we expected to observe differences in at least 1 out of the 3 factors used for each category. When testing for differences, ANOVA presented significance (p < 0.05) for social and economic factors, but did not show significance for environmental factors. However, our spatial analysis found correspondence between the higher concentration of industries and services, higher PM values during Phase 0, and higher numbers of COVID-19 cases and COVID-19 deaths. Table 9.2 shows that the significant factors were: cumulative number of COVID-19 cases (p = 0.022), GDP per capita (p = 0.013), and total number of ICU beds (p = 0.018).
Table 9.2 ANOVA one-way: exploring Italy’s social, economy, and environment factors during the COVID-19 outbreak

| Category | Factors | Significance (p) |
|---|---|---|
| Social | Percentage of people aged 65+ | 0.603 |
| Social | Total number of COVID-19 deaths | 0.076 |
| Social | Cumulative number of COVID-19 cases | 0.022 |
| Economy | GDP per capita | 0.013 |
| Economy | Number of ICU beds | 0.018 |
| Economy | Unemployment rate | 0.127 |
| Environment | Population density | 0.665 |
| Environment | Manufacturing expenditure | 0.103 |
| Environment | Services expenditure | 0.456 |
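For readers unfamiliar with the test, the F statistic behind a one-way ANOVA can be computed by hand as the ratio of between-group to within-group mean squares. The sketch below is a generic implementation with a hypothetical function name, `one_way_anova`. The input, population density derived from Table 9.1 as Pop / Area and grouped by geography, is only an illustration: the chapter does not state which city-level or regional values were fed to the test, so the result need not match Table 9.2.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative input: inhabitants per km2 computed from Table 9.1 (Pop / Area).
density = {
    "north": [3190340 / 1575, 2293340 / 6829, 858455 / 2462, 1005831 / 3702],
    "central": [1007435 / 3514, 4336915 / 5352],
    "south": [3128702 / 1171, 1251004 / 3821],
    "islands": [431302 / 1248, 1276525 / 5009],
}
f_stat, df_b, df_w = one_way_anova(list(density.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

Turning F into a p-value requires the F distribution; in practice a statistics library such as SciPy (`scipy.stats.f_oneway`) would typically be used for the whole test.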
9.4.3 Twitter Social Media Descriptive Statistics by Geography

Over 4 million tweets were collected in the north, central, south, and island geographies of Italy. A bubble word cloud visualization was developed to highlight the top five social, environmental, and economic retweet keywords related to COVID-19 for Italy’s north, central, south, and islands. Figure 9.2 shows that the majority of the total tweets and retweets were collected in the north and central areas of Italy. The regions included topics related to sports, religion, COVID-19,
Fig. 9.2 Social, economy, and environment COVID-19 related tweets by geography, and top 5 keyword retweets (yellow = social retweets; orange = economic re-tweets; red = environmental retweets; grey = total tweets)
health care, policy, entertainment, fake news, vaccines, symptoms, university graduations, Donald Trump, and others. The Twitter data were categorized into social, environmental, and economic COVID-19 related tweets to capture the changes across geographies. The top five retweet keywords were selected from data collected between March and June 2020.
9.4.4 Sentiment Analysis by Geography

This section analyzes millions of COVID-19 tweets classified as joy, anger, and fear in the north, central, south, and island regions of Italy. Our sentiment analysis related COVID-19 tweets expressing fear, anger, and joy to the total number of daily COVID-19 cases, the total number of daily COVID-19 deaths, and governmental policy measures during Phase 0, Phase 1, Phase 2, and Phase 3. We wanted to understand whether the sentiments of tweets shift over the course of the pandemic and, when lagged to specific policy shifts, before and after local governments announce preventative policies based on geography.

Figure 9.3 shows the sentiments in northern Italy during the time of the COVID-19 outbreak. The north region of Italy was impacted the most during the pandemic when compared to the rest. The north included a total of 1,802,038 tweets (Milan: 1,372,402; Turin: 229,171; Veneto: 127,251; and Emilia Romagna: 73,214), or 42.62% of the total data, related to COVID-19 with sentiments on fear, anger, and joy based on time. The data show that the number of tweets related to fear increased between the months of March and April, in Phase 1 (lockdown enforced) and Phase 2 (lockdown lifted), as the total number of daily COVID-19 cases increased.

The central region of Italy included a total of 2,021,940 (Florence: 122,251; Rome: 1,899,689) tweets related to COVID-19 with sentiments on fear, anger, and
Fig. 9.3 Sentiments in northern Italy during the time of COVID-19
Fig. 9.4 Sentiments in central Italy during the time of COVID-19
joy based on time. Figure 9.4 shows the sentiments in central Italy during the time of the COVID-19 outbreak; with 47.82% of the tweets, this region accounts for the largest share of the data. The central region was also impacted, but at different times. The data show that the highest peaks occurred in late April, right after the marine transfers and industry measures were enforced. Simultaneously, there was a slight rise in COVID-19 cases. A notable peak in fear, deaths, and cases appeared on May 3rd, which coincides with the lockdown being lifted. Cases and fear rose in March and began to decline in mid-April, following the pattern of the North.

The southern regions of Italy included a total of 306,110 tweets (Naples: 263,889; Bari: 42,221), or 7.24% of the total data, related to COVID-19 with sentiments on fear, anger, and joy based on time. Figure 9.5 shows the sentiments in southern Italy
Fig. 9.5 Sentiments in southern Italy during the time of COVID-19
9 Sentiment Analysis of Social Media Response and Spatial …
Fig. 9.6 Sentiments in Italy’s islands during the time of COVID-19
during the time of the COVID-19 outbreak. While cases and deaths follow the trends of the North and Central regions, the dataset shows a significant peak on March 30–31st (Phase 1), when anger, fear, COVID-19 cases, and deaths were at their highest. This was a significant period: it came before worldwide COVID-19 cases reached 1 million and in the wake of the travel ban extension enforced in the South. The island regions of Italy included a total of 97,794 tweets (Cagliari: 32,759; Sicily: 65,035), or 2.31% of the total data, related to COVID-19 with sentiments on fear, anger, and joy based on time. Figure 9.6 shows the sentiments in the islands of Italy during the time of the COVID-19 outbreak. COVID-19 cases here declined faster than in the other regions around March 28th. The data show a general decline in cases and deaths after the lockdown was lifted, even though a rise in fear and anger followed.
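The regional shares reported in this section can be checked directly from the per-city tweet counts. The following Python sketch is illustrative and is not the authors' own code; the dictionary layout and the function names are assumptions introduced here, while the counts are those given in the text.

```python
# Tweet counts per city or area, as reported in Sect. 9.4.4.
REGION_COUNTS = {
    "North": {"Milan": 1_372_402, "Turin": 229_171,
              "Veneto": 127_251, "Emilia Romagna": 73_214},
    "Central": {"Florence": 122_251, "Rome": 1_899_689},
    "South": {"Naples": 263_889, "Bari": 42_221},
    "Islands": {"Cagliari": 32_759, "Sicily": 65_035},
}


def regional_totals(counts):
    """Sum the city-level counts within each region."""
    return {region: sum(cities.values()) for region, cities in counts.items()}


def regional_shares(counts):
    """Return each region's percentage of all tweets, rounded to two decimals."""
    totals = regional_totals(counts)
    grand_total = sum(totals.values())
    return {region: round(100 * total / grand_total, 2)
            for region, total in totals.items()}


if __name__ == "__main__":
    totals = regional_totals(REGION_COUNTS)
    for region, share in regional_shares(REGION_COUNTS).items():
        print(f"{region}: {totals[region]:,} tweets ({share}%)")
```

Run on the counts above, the sketch reproduces the regional totals and the shares of 42.62%, 47.82%, 7.24%, and 2.31% reported for the north, central, south, and island regions.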
9.4.5 Policy Overview During the Pandemic
The Italian government’s response can be characterized in four phases: Phase 0 (February 21st–March 10th), after the surge of the virus outbreak, is identified with initial containment measures, including the cancellation of public social, sport, and religious events and a shift to online education at all school levels; Phase 1 (March 11th–May 3rd) is identified with the lockdown and strict containment measures; Phase 2 (May 4th–June 14th) encompasses post-lockdown measures that allow some movement across the country; and Phase 3 (June 15th–ongoing) allows cross-country movement with respect for social distancing and personal safety
Table 9.3 Italian policy timeline by phase based on social, economy, and environment factors
[Color-coded table; cell colors are not reproduced here. Columns: Phase 0, Phase 1, and Phase 2, each divided into Social, Economy, and Env. categories covering the policy types Stay at home, Protection, Education, Social events, Essentials, Non-essentials, Mobility, Unemployment, Pollution, and Tourism. Rows (RGN): North (LOM, PIE, VEN, EMI); Central (TOS, LAZ); South (CAM, PUG); Islands (SAR, SIC).]
Legend: red = N/A; yellow = national level policy; green = regional level policy [LOM: Lombardy; PIE: Piedmont; VEN: Veneto; EMI: Emilia Romagna; TOS: Tuscany; LAZ: Lazio; CAM: Campania; PUG: Apulia; SAR: Sardinia; and SIC: Sicily]
(Italian Ministry of Health, 2020). Some regions adopted more stringent containment measures, in addition to the national level policies. Table 9.3 shows the variations between national and regional policy levels by phase and social, economic, and environmental factors.
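As a minimal sketch, the four phases described in this section can be encoded as a lookup that assigns a policy phase to each tweet date. The function name and the open-ended treatment of Phase 3 are assumptions for illustration, and the year 2020 is inferred from the study period; this is not the authors' implementation.

```python
from datetime import date

# Phase start dates as given in Sect. 9.4.5 (year 2020 assumed from context).
PHASE_STARTS = [
    (date(2020, 2, 21), "Phase 0"),  # initial containment measures
    (date(2020, 3, 11), "Phase 1"),  # lockdown and strict containment
    (date(2020, 5, 4), "Phase 2"),   # post-lockdown measures
    (date(2020, 6, 15), "Phase 3"),  # cross-country movement allowed, ongoing
]


def policy_phase(day):
    """Return the policy phase in force on `day`, or None before Phase 0."""
    label = None
    for start, name in PHASE_STARTS:
        # Starts are in ascending order, so the last one reached wins.
        if day >= start:
            label = name
    return label
```

Tagging each tweet with `policy_phase(tweet_date)` would make it possible to group sentiment counts by phase and region before comparing them with case and death counts.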
9.5 Psychological Burdens of the COVID-19 Pandemic
Overall, the COVID-19 outbreak around the world gave rise to a range of sentiments related to mental health: loneliness stemming from social isolation, fear of contracting the disease, economic strain, and uncertainty about the future. Our study therefore analyzed fear, anger, and joy sentiments between March and June, before and after the lockdown. We calculated differences in demographic characteristics between the four regions of Italy, exploring sentiments related to psychological distress and socio-economic factors. Across the different geographies and the dependent and independent socio-economic factors, the governmental policies created to prevent the spread of the virus introduced new life stressors and disrupted daily living for most people. As a result, the COVID-19 outbreak caused physical, emotional, and psychological harm, and the virus itself can be
considered a traumatic event. This led people to act out and express themselves on social media based solely on emotion, in an angry, fearful, or joyous manner, because of the uncertainty surrounding COVID-19 policy mandates.
9.6 Conclusion
Overall, the temporal and geographical trends in sentiments show that there are major differences between the north, central, south, and island areas of Italy. Our study employed a number of statistical methods to visualize the socio-economic differences across Italy’s four regions and the effects of COVID-19 before, during, and after the mandated lockdown. Monitoring jurisdiction-level numbers of COVID-19 cases, deaths, and changes in incidence is critical for understanding community risk and for making decisions about community mitigation, including social distancing and strategic health care resource allocation. Sentiments track policy and societal events, as well as case and case fatality rates, to differing degrees from one region to another. Sentiment analysis of social media can therefore be a useful way to understand the effects of the COVID-19 outbreak. Furthermore, it can inform health officials and policy makers, who can use social media to track how the pandemic changes over time in response to policy mandates and health campaigns. Future research should analyze sentiment trends in relation to various policy strategies, such as lockdowns and social distancing, and to epidemiological data, in order to help combat the coronavirus disease and understand people’s behaviors. Finally, social media analysis through various artificial intelligence approaches could present fresh avenues for research that would be useful in the fields of life sciences, social sciences, and machine learning.
References
Abd-Alrazaq, A., Alhuwail, D., Househ, M., Hamdi, M., & Shah, Z. (2020). Top concerns of tweeters during the COVID-19 pandemic: Infoveillance study. Journal of Medical Internet Research, 22, e19016.
Ahorsu, D. K., Lin, C.-Y., Imani, V., Saffari, M., Griffiths, M. D., & Pakpour, A. H. (2020). The fear of COVID-19 scale: Development and initial validation. International Journal of Mental Health and Addiction. https://doi.org/10.1007/s11469-020-00270-8
Allcott, H., & Gentzkow, M. (2017). Social media and fake news in the 2016 Election. Journal of Economic Perspectives, 31, 211–236.
Armitage, R., & Esposito, B. (2020). Coronavirus has transformed Italy’s Amalfi Coast from a tourist hotspot to a virtual ghost town. In News. Retrieved August 25, 2020, from https://www.abc.net.au/news/2020-03-12/coronavirus-empties-italys-stunning-amalfi-coastline/12044986.
Armocida, B., Formenti, B., Ussai, S., Palestra, F., & Missoni, E. (2020). The Italian health system and the COVID-19 challenge. Lancet Public Health, 5, e253.
Becchetti, L., Conzo, G., Conzo, P., & Salustri, F. (2020). Understanding the heterogeneity of adverse COVID-19 outcomes: The role of poor quality of air and lockdown decisions. https://doi.org/10.2139/ssrn.3572548.
Chamberlain, A. F. (1899). On the words for “Fear” in certain languages. A study in linguistic psychology. American Journal of Psychology, 10, 302–305.
Christakis, N. A., & Fowler, J. H. (2013). Social contagion theory: Examining dynamic social networks and human behavior. Statistics in Medicine, 32, 556–577.
Chung, W., & Zeng, D. (2020). Dissecting emotion and user influence in social media communities: An interaction modeling approach. Information & Management, 57, 103108.
Conticini, E., Frediani, B., & Caro, D. (2020). Can atmospheric pollution be considered a co-factor in extremely high level of SARS-CoV-2 lethality in Northern Italy? Environmental Pollution, 261, 114465.
D’Angerio, V. (2020). Effetto Covid: il South working alla riscossa grazie al lavoro agile. Benvenuti Al Sud.
D’Aria, I. (2020). Coronavirus, lo smog accelera il contagio? Non è vero, anzi sì. In La Repubblica. Retrieved August 25, 2020, from https://www.repubblica.it/salute/medicina-e-ricerca/2020/03/20/news/coronavirus_lo_smog_accelera_il_contagio_lopalco_l_inquinamento_fa_male_ma_il_virus_corre_sulle_nostre_gambe_-251786376/.
Dipartimento della Protezione Civile. (2020). COVID-19 Italia–Monitoraggio della situazione. Retrieved May 25, 2020, from https://opendatadpc.maps.arcgis.com/apps/dashboards/b0c68bce2cce478eaac82fe38d4138b1.
Dowd, J. B., Andriano, L., Brazel, D. M., Rotondi, V., Block, P., Ding, X., Liu, Y., & Mills, M. C. (2020). Demographic science aids in understanding the spread and fatality rates of COVID-19. Proceedings of the National Academy of Sciences, 117, 9696–9698.
Fan, R., Xu, K., & Zhao, J. (2018). An agent-based model for emotion contagion and competition in online social media. Physica A: Statistical Mechanics and Its Applications, 495, 245–259.
Ferrara, E., & Yang, Z. (2015). Quantifying the effect of sentiment on information diffusion in social media. PeerJ Computer Science, 1, e26.
Fronte, M. (2020). Coronavirus: l’inquinamento gioca un suo ruolo? In Focus. Retrieved August 25, 2020, from https://www.focus.it/scienza/salute/coronavirus-covid-19-smog-inquinamentolombardia.
Gagliano, A., Villani, P. G., Col, F. M., Manelli, A., Paglia, S., Bisagni, P. A. G., Perotti, G. M., Storti, E., & Lombardo, M. (2020). COVID-19 epidemic in the middle province of Northern Italy: Impact, logistics, and strategy in the First Line Hospital. Disaster Medicine and Public Health Preparedness, 14, 372–376.
Giuzzi, C., & Ravizza, S. (2020). Coronavirus, all’ospedale militare di Baggio contagiati alcuni medici e infermieri. In Corriere Della Sera. Retrieved August 30, 2020, from https://milano.corriere.it/notizie/cronaca/20_marzo_13/coronavirus-milano-chiude-l-ospedale-militare-baggioalcuni-dipendenti-contagiati-6c52f46c-651b-11ea-ac89-181bb7c2e00e.shtml.
Han, X., Wang, J., Zhang, M., & Wang, X. (2020). Using social media to mine and analyze public opinion related to COVID-19 in China. International Journal of Environmental Research and Public Health, 17, 2788.
Iannaccone, S. (2020). Inquinamento: cala il biossido di azoto, ma i venti portano le polveri sottili. In La Repubblica. Retrieved August 30, 2020, from https://www.repubblica.it/ambiente/2020/03/30/news/meno_biossido_di_azoto_piu_polveri_sottili_cosa_sta_accadendo_nei_nostri_cieli-252701516/.
Intini, E. (2020a). COVID-19: mortalità più elevata nelle aree inquinate. In Focus. Retrieved August 30, 2020, from https://www.focus.it/scienza/salute/coronavirus-covid-19-la-mortalitapiu-elevata-nelle-aree-piu-inquinate.
Intini, E. (2020b). Gli effetti dell’inquinamento atmosferico sulla salute. In Focus. Retrieved August 30, 2020, from https://www.focus.it/scienza/salute/smog-e-salute-gli-effetti-dell-inquinamentoatmosferico-sul-corpo-umano.
ISTAT. (2019). Population and households. Retrieved May 27, 2020, from https://www.istat.it/en/population-and-households.
ISTAT. (2020a). National Accounts regional main aggregates: Value added by industry. Retrieved August 27, 2020, from http://dati.istat.it/Index.aspx?QueryId=11479&lang=en.
ISTAT. (2020b). Unemployment rate: Unemployment rate – regional level. Retrieved August 27, 2020, from http://dati.istat.it/Index.aspx?QueryId=20744&lang=en.
Istituto Superiore di Sanità. (2020). Epidemia COVID-19. Aggiornamento nazionale 7 maggio 2020 – ore 16:00.
Istituto Superiore di Sanità. (2020). Integrated surveillance of COVID-19 in Italy.
Italian Ministry of Health. (2020). Gazzetta Ufficiale. Retrieved May 20, 2020, from https://www.gazzettaufficiale.it/.
Jurkowitz, M., & Mitchell, A. (2020). Americans who primarily get news through social media are least likely to follow COVID-19 coverage, most likely to report seeing made-up news. Pew Res. Cent.
Koeze, E., & Popper, N. (2020). The virus changed the way we internet. N. Y. Times.
La Repubblica. (2020a). Coronavirus, indagine sulle Rsa: il 41% dei morti sono sospetti Covid. In La Repubblica. Retrieved August 27, 2020, from https://www.repubblica.it/cronaca/2020/06/17/news/indagine_di_iss_e_garante_sulle_rsa_il_41_dei_morti_sono_sospetti_covid-259498018/.
La Repubblica. (2020b). Coronavirus, lo studio: più colpite le aree inquinate. In La Repubblica. Retrieved August 27, 2020, from https://www.repubblica.it/ambiente/2020/04/10/news/coronavirus_lo_studio_piu_colpite_le_aree_inquinate-253673938/.
Lab24. (2020). Coronavirus in Italy: Updated map and case count. In www.ilsole24ore.com. Retrieved May 27, 2020, from https://lab24.ilsole24ore.com/coronavirus/en/.
Longo, G. (2020). Coronavirus, l’ospedale militare Celio apre una struttura con 120 posti letto. In lastampa.it. Retrieved August 30, 2020, from https://www.lastampa.it/cronaca/2020/04/08/news/coronavirus-l-ospedale-militare-celio-apre-una-struttura-con-120-posti-letto-1.38694312.
Lwin, M. O., Lu, J., Sheldenkar, A., Schulz, P. J., Shin, W., Gupta, R., & Yang, Y. (2020). Global sentiments surrounding the COVID-19 pandemic on Twitter: Analysis of Twitter trends. JMIR Public Health and Surveillance, 6, e19447.
Mohammad, S., & Turney, P. (2010). Emotions evoked by common words and phrases: Using mechanical turk to create an emotion lexicon. In Proc. NAACL HLT 2010 Workshop Comput. Approaches Anal. Gener. Emot. Text. (pp. 26–34). Los Angeles, CA.
Mohammad, S. M., & Turney, P. D. (2013). Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29, 436–465.
Musolino, D. (2018). The north-south divide in Italy: Reality or perception? European Spatial Research and Policy, 25, 29–53.
Pakpour, A. H., & Griffiths, M. D. (2020). The fear of COVID-19 and its role in preventive behaviors. Journal of Concurrent Disorders, 2, 58–63.
Park, H. W., Park, S., & Chong, M. (2020). Conversations and medical news frames on Twitter: Infodemiological study on COVID-19 in South Korea. Journal of Medical Internet Research, 22, e18897.
Prante, F. J., Bramucci, A., & Truger, A. (2020). Decades of tight fiscal policy have left the health care system in Italy ill-prepared to fight the COVID-19 outbreak. Intereconomics, 55, 147–152.
Remuzzi, A., & Remuzzi, G. (2020). COVID-19 and Italy: What next? The Lancet, 395, 1225–1228.
Samuel, J., Ali, G. G. M. N., Rahman, M. M., Esawi, E., & Samuel, Y. (2020). COVID-19 public sentiment insights and machine learning for tweets classification. Information, 11, 314.
Spitzberg, B. H. (2014). Toward a model of meme diffusion (M3D). Communication Theory, 24, 311–339.
Spitzberg, B. H. (2019). Trace of pace, place, and space in personal relationships: The chronogeometrics of studying relationships at scale. Personal Relationships, 26, 184–208.
Starri, M. (2020). Report Digital 2020: in Italia cresce l’utilizzo dei social.
Statista. (2020a). Gross domestic product (GDP) per capita of Italy in 2018, by region. In Statista. Retrieved August 30, 2020, from https://www.statista.com/statistics/658274/gross-domestic-product-gdp-per-capita-of-italy-by-region/.
Statista. (2020b). The countries with the most critical care beds per capita. In Stat. Infographics. Retrieved August 30, 2020, from https://www.statista.com/chart/21105/number-of-critical-carebeds-per-100000-inhabitants/.
Talignani, G. (2020). Coronavirus, lo studio italiano: “Letalità e inquinamento atmosferico, il Nord soffre di più.” In La Repubblica. Retrieved May 30, 2020, from https://www.repubblica.it/ambiente/2020/04/06/news/coronavirus_e_inquinamento_atmosferico_l_analisi_italiana_ecco_perche_lombardia_ed_emilia_soffrono_di_piu_-253322995/.
Towers, S., Afzal, S., Bernal, G., et al. (2015). Mass media and the contagion of fear: The case of Ebola in America. PLoS One, 10, e0129179.
Watson, A. (2020). Consuming media at home due to the coronavirus worldwide 2020, by country. In Statista. Retrieved April 30, 2020, from https://www.statista.com/statistics/1106498/homemedia-consumption-coronavirus-worldwide-by-country/.
Zhao, Y., Cheng, S., Yu, X., & Xu, H. (2020). Chinese Public’s attention to the COVID-19 epidemic on social media: Observational descriptive study. Journal of Medical Internet Research, 22, e18825.
Chapter 10
Conceptualizing an Ecological Model of Google Search and Twitter Data in Public Health
Bo Liang and Ye Wang
10.1 Introduction
Public health surveillance is “the ongoing, systematic collection, analysis, and interpretation of health-related data essential to planning, implementation, and evaluation of public health practice.” (WHO webpage)
B. Liang (✉)
Department of Business, Nevada State College, Henderson, NV, USA
Y. Wang
College of Arts and Sciences, University of Missouri – Kansas City, Kansas City, MO, USA
© Springer Nature Switzerland AG 2021
A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_10

With the growing popularity of Internet use, consumer-generated data has become an important data source for researchers (Chevalier & Mayzlin, 2006; Jansen et al., 2009). As health is an important part of people’s daily lives, more and more Internet users are engaging in health-related online activities, such as searching online for health information or sharing experiences in online communities (Fox & Duggan, 2013). Geo-tagged and time-stamped data generated from these online activities could inform public health practitioners of the population’s information needs, opinions, attitudes, and behaviors on specific health issues, as well as its health status, and thus become an important source for public health surveillance (Gittelman et al., 2015; Hay et al., 2013; Velasco et al., 2014). Existing research has primarily focused on using Google Search and Twitter data in public health surveillance because of the large volume and easy accessibility of the data. According to a Pew report, 72% of American Internet users search for health information online (Fox & Duggan, 2013). Google dominates the search engine market with almost 90% of the market share (eMarketer.com, 2018). Twitter is a social media platform on which users publish their real-time activities using short messages.
B. Liang and Y. Wang
More than 23% of online adults use Twitter (Duggan, 2015). Google and Twitter provide tools to capture geo-tagged and time-stamped search and tweet data. One common mistake in Big Data research is ignoring the context of data collection and of users (Tsou, 2015). The social complexity of Twitter and Google data needs to be taken into account in order to improve the accuracy of public health surveillance. Health-related tweets and searches are motivated by a diverse range of human conditions. Patients may look for and exchange messages because of their own illnesses and infections. Meanwhile, studies have shown that the volume of tweets and searches is often associated with the volume of news coverage of specific diseases (Metcalfe et al., 2011; Southwell et al., 2016; Wilson & Brownstein, 2009). On social media, sources of tweets include not only regular individual users but also organizations (e.g., medical, political, and cultural groups), online influencers, and news media. These organizations and influencers create social and informational environments that may affect public awareness, knowledge, and attitudes towards health issues. All of these social factors need to be considered when monitoring public health concerns and developing models of disease detection and prediction. The heterogeneity and multifaceted nature of social media and search data require an ecological view to guide big data analysis, interpretation, and visualization. Drawing upon the literature on online health communication, this study proposes a conceptual framework of Google Search and Twitter data in public health surveillance that mirrors an ecological system of health behaviors. The ecological perspective situates health behaviors in a multi-level system, inclusive of biological, social, cultural, informational, policy, and physical environments, for the purpose of more effective behavior change (Sallis et al., 2015). Adopting this perspective, our framework emphasizes a combined strategy of analyzing the sources, subjects and predicates, sentiments, and topics of social media data, in conjunction with search data. This framework can guide public health practitioners in purposively screening out “noise” and choosing appropriate strategies of analysis. Compared with clinical data, data from Google Search and Twitter are often not clean and contain more noise for epidemic monitoring. Studies on Twitter have used keyword-filtering methods to separate noise data (e.g., concern for flu) from clean data (e.g., flu infection) (Broniatowski et al., 2013; Paul & Dredze, 2012; Wilson & Brownstein, 2009). Building upon previous work, this study proposes a systematic approach that analyzes, interprets, and visualizes “noise” in big data for various uses of public health surveillance (e.g., monitoring infection incidence or public sentiment).
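The keyword-filtering idea mentioned above, separating reports of infection from expressions of concern, can be illustrated with a small rule-based sketch. The patterns, labels, and function name below are hypothetical and chosen only for illustration; they are not the keyword lists used in the cited studies, and a real application would need validated lexicons and handling of negation.

```python
import re

# Hypothetical cue patterns, for illustration only.
SELF_REFERENCE = re.compile(r"\b(i|my|me)\b", re.IGNORECASE)
INFECTION_CUE = re.compile(r"\b(have|got|getting|caught|sick with)\b", re.IGNORECASE)
CONCERN_CUE = re.compile(r"\b(concerned|worried|afraid|scared)\b", re.IGNORECASE)


def classify_tweet(text, disease="flu"):
    """Label a tweet as 'infection', 'awareness', or 'unrelated'."""
    # Tweets that never mention the disease are out of scope.
    if not re.search(rf"\b{re.escape(disease)}\b", text, re.IGNORECASE):
        return "unrelated"
    # Expressions of concern take priority and count as awareness.
    if CONCERN_CUE.search(text):
        return "awareness"
    # A first-person reference plus an infection cue suggests a self-report.
    if SELF_REFERENCE.search(text) and INFECTION_CUE.search(text):
        return "infection"
    # Any other mention of the disease is treated as general awareness.
    return "awareness"
```

Under these assumed rules, a tweet such as "I think I'm getting the flu" is labeled as an infection report, while "Concerned about flu season this year" is labeled as awareness. Keeping both labels, rather than discarding the second group, reflects the point that "noise" for incidence monitoring can still be useful for monitoring public sentiment.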
10.2 Infodemiology and Ecological Models of Health Behavior
The complexity of social media and search data calls for a framework that offers a comprehensive approach to public health. The adoption of ecological models in health research and practice highlights the strength of these models in taking a
10 Conceptualizing an Ecological Model of Google Search …
comprehensive approach to health issues (Sallis et al., 2015). The ecological models examine “multi-level influences, including intrapersonal (biological, psychological), interpersonal (social, cultural), organizational, community, physical environment, and policy” (Sallis et al., 2015, p. 466), and thereby guide the development of encompassing strategies to address health issues. Using an ecological perspective, we can situate social media and search data in the ecological system of health behavior and define their coverage and usage in measuring and monitoring health behaviors and issues. Our idea of introducing the ecological perspective to big data analysis echoes and expands Eysenbach’s view of infodemiology (Eysenbach, 2009). Eysenbach (2009) defined infodemiology as “the science of distribution and determinants of information in an electronic medium, specifically the Internet, or in a population, with the ultimate aim to inform public health and public policy” (p. 2). Eysenbach (2009) suggested that robust metrics, or “infodemiology indicators,” should be developed in a way that reflects information and communication patterns in real time. Our study expands Eysenbach’s view of infodemiology by describing this “electronic medium” from an ecological perspective. The most important principle of ecological perspectives is that “multiple levels of factors influence health behaviors” (Sallis et al., 2015, p. 470). In other words, individuals’ behaviors are influenced by the informational, social, physical, and policy environments in which they live (Sallis et al., 2015). In particular, the “electronic medium” allows organizations, news media, social influencers, and individual users to exchange information and communicate with each other, making up the social and informational environments for health behaviors and issues. The “electronic medium” shapes the social and informational environments of health behaviors and issues in various ways. Social support that individuals receive online, for example, from an online patient group, may influence their self-efficacy and control over health behaviors (Guan & So, 2016; Willis, 2016). The more social support a person feels that he or she obtains from a social group, the stronger that person’s self-efficacy is (Guan & So, 2016). Online patient support groups allow people to learn vicariously from others’ experiences and to obtain encouragement and suggestions for managing health conditions (Willis, 2016). Social norms, communicated via the “electronic medium,” prescribe the expected behavior of a social group and thus may influence individuals’ subjective beliefs about health issues such as revaccination (Goldstein et al., 2015). However, posts and reposts of certain health beliefs or behaviors can tighten the community norms of smaller online groups that hold an alternative view, for example, vaccine refusal (Reich, 2020). Social persuasion and advocacy spread ideas all over the virtual world and push people towards different views and practices (Wang & Li, 2016). These elements make up the social environment in which individuals reside. Similarly, a person’s health decisions do not happen in an information vacuum; an individual cannot be separated from the health information that he or she gets from the news, doctors, education, and friends and family (Sallis et al., 2015). The “electronic medium” simply makes the dissemination of health information easier than ever before. The abundance of health information available online makes the Internet a popular venue for obtaining health information (McMullan
et al., 2019). Sometimes, people’s obsession with online health information can cause cyberchondria, distress, or anxiety (McMullan et al., 2019). In this study, we formulate eight propositions by developing a group of indexes as “infodemiological indicators” from an ecological perspective. The main contribution of this study is to better capture the social and informational aspects of Big Data originating from the “electronic medium” for public health surveillance.
10.3 An Ecological Model of Google Search and Twitter Data in Public Health
In this study, we interpret big data analysis of tweets and searches for public health surveillance along the informational and social dimensions of ecology. Through eight propositions, we construct an ecological model of Google Search and Twitter data for public health surveillance by differentiating message sources, goals, and content. This model monitors public attention to health issues by analyzing the flow of information from celebrities, media, organizations, and blogs to consumers, who react by retweeting or searching online. In addition, it detects diseases by examining consumers’ reports of personal experiences on Twitter and in online searches.
10.3.1 Search Data and Informational Environment
As discussed above, the major data sources of a big data approach, namely social media and search data, are part of the social and informational environment in the ecological system of health. Search data are a record of users’ information-seeking behavior on search engine websites and are thus part of the informational environment in the ecological system of health (Liang et al., 2019). Online searching is a human–machine interaction between searchers (information demand) and search engine ranking algorithms (information supply). Web search activities reflect the public’s real-time exposure, interests, concerns, and intentions at regional and national levels (Askitas & Zimmermann, 2009; Battelle, 2005; Ettredge et al., 2005; Liang & Scammon, 2013; Scheitle, 2011). Searching online is a goal-oriented behavior. Rose and Levinson (2004) identified three types of web search goals: navigational (i.e., going to a specific web site), informational (i.e., learning something), and resource (i.e., obtaining a resource, such as recipes from a website). Further, they distinguished five sub-types of informational goals: directed (i.e., learning something in particular about a topic), undirected (i.e., learning anything/everything about a topic), advice, locate (e.g., finding out whether/where some real-world service is provided), and list (i.e., getting a list of plausible suggested web sites).
Based on Rose and Levinson’s study of web search goals, online health information seeking is an informational goal-oriented behavior. We propose to differentiate two types of online health information seeking: directed and non-directed information seeking. According to Health Online 2013 by Pew Research (Fox & Duggan, 2013), the top goal of online health information seekers (35% of US adults) is to find out what medical condition they or someone else they know might have, which is a directed informational goal. The second most important directed informational goal of health information seeking is identifying treatment options (Choudhury et al., 2014). We define online health information seekers with directed informational goals, including diagnosing and finding treatment options, as “online diagnosers”. As online diagnosers are often motivated by individual infection cases, studies have been able to find high correlations between search data and infection incidence. Researchers in medical informatics have found a high correlation between the occurrence of selected search queries and the incidence of certain diseases, especially infectious diseases (e.g., influenza), and have thus suggested the use of search query data for syndromic surveillance, or early detection of outbreaks (Brownstein et al., 2009; Ginsberg et al., 2009; Hulth et al., 2009; Pelat et al., 2009). Research in this field has primarily focused on correlation analyses of weekly Google search data and CDC Weekly U.S. Influenza Surveillance data (Carneiro & Mylonakis, 2009; Dugas et al., 2013; Ginsberg et al., 2009). Research has also been conducted on other pandemic diseases and at global scales. For example, a study found that Google search volume for dengue-related queries was highly correlated with the number of reported cases of dengue in five countries: Bolivia, Brazil, India, Indonesia, and Singapore (Chan et al., 2011).
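The correlation analyses described above can be sketched in a few lines. The example below computes a Pearson correlation between a weekly search-volume series and a weekly case-count series, with an optional lag so that search volume in one week is compared with cases in a later week. The function names and the lag option are assumptions introduced here for illustration; the cited studies each used their own data and models.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_r(search, cases, lag=0):
    """Correlate search volume in week t with reported cases in week t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson_r(search, cases)
    # Drop the last `lag` search weeks and the first `lag` case weeks.
    return pearson_r(search[:-lag], cases[lag:])
```

In practice, `search` might hold weekly relative search volumes for symptom and treatment queries and `cases` the weekly counts from a surveillance agency; a high coefficient at a positive lag would be consistent with search data anticipating reported cases.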
Another study found that Google Search data approximated the seasonality and geographic distribution of previously identified Lyme disease cases reported by the CDC (Seifter et al., 2010). Thus, our first proposition is:
Proposition 1 Disease-related search data generated from online diagnosers with directed informational goals, including diagnosing and finding treatment options, could be used as an indicator of disease infection incidence in public health surveillance: the Google Disease Infection Index (see Table 10.1).
The non-directed informational goals of online health seekers include undirected goals (e.g., generally learning about a disease or its prevention) and advice goals (e.g., how to avoid getting a disease). Internet users with such non-directed informational goals are often motivated by their concerns about the current status of a pandemic. These concerns are triggered by news coverage and/or announcements from public health agencies (Metcalfe et al., 2011; Southwell et al., 2016; Wilson & Brownstein, 2009). People may want to know anything/everything about a disease and to obtain advice. As non-directed informational goals are not necessarily linked with individual infection cases, they become a possible source of “noise” in search data for the purpose of public health surveillance. Studies have reported the limitations of Google search data in public health surveillance. Google Flu Trends (GFT) is a Web tool that monitors search engine records using a computational search term query model. Studies have found substantial flaws in GFT. In particular, GFT dramatically overestimated the intensity of the
Table 10.1 Application of our proposed indexes of Google and Twitter data
No. | Index | Keywords selection | Data source | Suggested data analysis methods | Implication
1 | Google-Twitter disease infection composite | – | Computed by standardizing and averaging data from Google and Twitter Disease Infection Index | – | Monitoring disease incidence with data representative of the overall population
2 | Google disease infection | Symptoms and treatment (e.g., flu fever, flu medicine) | Google search data (e.g., Google Trends) | Correlation/regression/time series analysis | Monitoring disease incidence with data representative of the overall population
3 | Twitter disease infection | True infection (e.g., getting (present tense), I/my (self)) | Twitter API | Correlation/regression/time series analysis | Monitoring disease incidence with data representative of the population aged 18–49, females, higher income, better educated
4 | Google disease awareness | General (e.g., flu, flu shot) | Google search data (e.g., Google Trends) | Time series analysis | Monitoring public awareness with data representative of the overall population
5 | Twitter disease awareness | False infection (e.g., concerned about flu) | Twitter API | Topic/time series analysis | Monitoring public awareness with data representative of the population aged 18–49, females, higher income, better educated
6 | Google search results credibility | Keywords used for Indexes 2 and 4 | Google search results pages | Source analysis | Monitoring the source credibility of information
7 | Twitter source sentiment | Keywords used for Indexes 3 and 5 | Twitter API | Source/social network/topic/sentiment analysis | Monitoring changes in social influence on public opinions over time
A/H3N2 epidemic during the 2012/2013 season (Butler, 2013; Lazer et al., 2014; Olson et al., 2013). Some researchers have suggested that the problem may lie in the widespread media coverage of the severe US flu season of 2013 and the declaration of public-health emergencies. The large volume of news coverage may have caused spikes in flu-related searches by people who were not sick. Wilson and Brownstein (2009) reported that a massive increase in searches for the word “Listeria” coincided perfectly with news media attention during a listeriosis outbreak in Canada in 2008. The overestimation of disease incidence from search data reflects the multiple information sources in the informational environment (see Fig. 10.1). First, health information may come from the mass media, in the form of TV, the Internet, radio, etc. These established communication channels could gain public attention and raise public
Fig. 10.1 An ecological model of Big Data from Google search and Twitter: a case of flu research
awareness about health issues at regional, national, and global levels. Second, health organizations, such as hospitals, clinics, and health-related NGOs, also disseminate health information to the public. In particular, official websites and social media sites make it easy for these organizations to maintain owned media (for example, an organization’s website or Facebook page), circumvent the mass media, and reach individuals directly. Hence, we put forward our second proposition.
Proposition 2 Disease-related search data generated by Internet users with non-directed informational goals could be used as an indicator of public disease awareness in public health surveillance: the Google Disease Awareness Index (see Table 10.1).
Research has found that people are most likely to click high-ranking search results, and that more than 90% of clicks go to the first five links (Granka et al., 2004; Pan et al., 2007; Silverstein et al., 1998; Spink et al., 2001). Accordingly, the ranking of search results can be used to judge the popularity of information sources. One study showed that Wikipedia and the National Library of Medicine ranked highly in search results for drug-related queries (Law et al., 2011). Modave et al. (2014) evaluated the quality of weight loss-related search results in the top five positions and found that the majority of web pages were not from reputable organizations such as medical and government websites. The public needs access to information from reputable organizations to guide its pandemic preparedness and response activities. Besides collecting and analyzing data on information demand (what people are searching for), public health surveillance should also monitor the information supply side of Google Search: the information sources listed in the top-ranking positions of Google search results. Our third proposition concerns information supply in public health surveillance.
Proposition 3 Sources of disease-related search results at high-ranking positions (e.g., the top five search links) could be used as an indicator of information source credibility in public health surveillance: the Google Search Results Credibility Index (see Table 10.1).
Besides search engines, social media sites also create opportunities for individuals to exchange health information. The literature on health communication has widely documented social media users posting updates on their own health conditions or those of someone they know, or sharing informational and social resources with others (Pershad et al., 2018). Social media is therefore another data source for public health surveillance.
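As a rough illustration of how the credibility index in Proposition 3 might be computed, the sketch below scores a ranked list of result URLs by the share of the top five that come from reputable sources. The chapter does not specify a scoring rule, so everything here is an assumption: the function names `is_reputable` and `credibility_index`, the rule that government and education domains plus a short allow-list count as reputable, and the made-up result list.

```python
from urllib.parse import urlparse

# Hypothetical credibility rule: government and education domains plus a
# short allow-list count as "reputable". A real study would need a vetted list.
REPUTABLE_SUFFIXES = (".gov", ".edu")
REPUTABLE_DOMAINS = {"who.int"}


def is_reputable(url: str) -> bool:
    """Return True if the URL's host matches the assumed reputable-source rule."""
    host = (urlparse(url).hostname or "").lower()
    if host.endswith(REPUTABLE_SUFFIXES):
        return True
    return any(host == d or host.endswith("." + d) for d in REPUTABLE_DOMAINS)


def credibility_index(ranked_urls, top_n=5):
    """Share of the top_n ranked results that come from reputable sources."""
    top = ranked_urls[:top_n]
    if not top:
        return 0.0
    return sum(is_reputable(u) for u in top) / len(top)


# Made-up result list for a flu-related query, in rank order.
results = [
    "https://www.cdc.gov/flu/",
    "https://flu-remedies.example.com/cures",
    "https://www.who.int/influenza",
    "https://medicine.example.edu/flu-guide",
    "https://shop.example.com/flu-tea",
    "https://health.example.gov/flu",  # rank 6: outside the top five
]
print(credibility_index(results))  # 0.6
```

Tracking this share for the same keywords over time would give one possible reading of how the supply of credible information changes during an outbreak. A domain-suffix rule is only a crude proxy for the source analysis the chapter has in mind.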
10.3.2 Twitter Users: Exchanging Health Information and Personal Experiences
The value of social media in public health surveillance has been widely acknowledged. Twitter is a frequently used social media data source in big data methods of
public health surveillance. Twitter is a platform on which users publish their real-time activities in short messages (originally no longer than 140 characters; the limit was raised to 280 in 2017). A user’s tweets are publicly available to any other user. A user’s “followers” receive that user’s updates through their homepage feed. Conversations on Twitter.com, or tweets, can be accessed through Twitter’s Application Programming Interface (API). Researchers have used tweets to investigate or monitor the public’s opinions and behaviors on specific issues such as political elections, earthquakes, and health (e.g., Bosley et al., 2013). We attempt to address the complexity of Twitter data systematically. A user may tweet about a disease (e.g., flu) for various reasons. Our proposed model differentiates two types of tweets: true infection tweets and false infection tweets. First, tweets containing disease-related keywords could be posted by users who have recently caught a disease (e.g., flu) or who know someone (e.g., a friend or family member) who has. We define these tweets as true infection tweets. Data from these users would help monitor disease activity. For example, Chew and Eysenbach (2010) found that H1N1 incidence rates were highly correlated with the volume of tweets sharing personal experiences (r = 0.77) and concern (r = 0.66). Second, tweets containing disease-related keywords could also be posted by users who have heard about or are concerned about recent disease activity, or who want to share information, knowledge, and rumors about a disease (Lamb et al., 2012). We define these tweets as false infection tweets. False infection tweets largely reflect increased media attention and could be misleading in flu surveillance (Broniatowski et al., 2013; Lamb et al., 2012).
For example, Chew and Eysenbach (2010) found that peaks in the volume of H1N1-related tweets in different content categories (e.g., concern) coincided with different types of news events and viral campaigns. Thus, using disease-related keywords as filters is only as good as the assumption that all disease mentions are true infection tweets. So far, different approaches and conditions have been applied to filter out false infection flu tweets. Aramaki et al. (2011) kept only tweets that met two conditions: they were (1) about the poster or a person close to the poster having the flu and (2) written in the present tense (current) or recent past (within 24 h) (r = 0.89). Doan et al. (2012) filtered out false infection flu tweets using semantic features such as negation (e.g., “do not have a flu”), hashtags (non-flu-related hashtags), emoticons (flu-related smileys), and humor (correlation of 98.46%). Achrekar et al. (2011) filtered out false infection flu tweets by removing retweets of previous posts and tweets from the same users within a certain period (r = 0.9846). Broniatowski et al. (2013) created a supervised classification model to separate tweets indicating that the author had caught the flu from other tweets (tweets about flu, flu infection, and flu infection of others) (r = 0.93). Lamb et al. (2012) identified nuanced differences between flu tweets using a set of word class features: infection (e.g., getting, recovered), possession (e.g., bird, flu), concern (e.g., afraid, fear), vaccination (e.g., vaccine, shot), past tense, present tense, self (e.g., I, my), and others (e.g., you’re, her). This model retained tweets about the author’s own current flu infection and filtered out other tweets. Results from this model showed a high correlation between the volume of tweets and CDC ILI data
(r = 0.9887). Paul et al. (2014) extended this study by building a flu prediction model that integrates the Twitter surveillance system of Lamb et al. with CDC ILI data. This integrative model could reduce forecasting error by 17–30% over a baseline that uses only historical CDC ILI data. Thus, Twitter data, aided by appropriate keyword-filtering strategies, can be a reliable source for public health surveillance. Accordingly, our study proposes:
Proposition 4 Twitter data of true infection could be used as an indicator of disease infection incidence in public health surveillance: the Twitter Disease Infection Index (see Table 10.1).
Proposition 5 Twitter data of false infection could be used as an indicator of disease awareness in public health surveillance: the Twitter Disease Awareness Index (see Table 10.1).
However, we would also like to point out a few limitations of Twitter data. First, Twitter data has a low signal-to-noise ratio. Second, there is a sampling bias in Twitter data. According to a recent Pew report (Perrin & Anderson, 2019), only 22% of American adults reported using Twitter. In addition, Twitter discourse is dominated by a small number of users. The top 10% of users by number of tweets are responsible for 80% of all tweets created by U.S. adults (Wojcik & Hughes, 2019). Women are more active users than men: among the top 10% most active users, 65% are women (Wojcik & Hughes, 2019). Twitter users are younger, with 38% of Twitter users aged 18–29 and 26% aged 30–49 (Perrin & Anderson, 2019). Twitter users are also more educated and have higher incomes than U.S. adults overall (Perrin & Anderson, 2019). When big data is “noisy, dynamic, heterogeneous, untrustworthy,” a cross-checking approach could improve data accuracy (Paul et al., 2014, p. 480).
To overcome the limitations of Twitter data, we propose a cross-checking approach to public health surveillance:
Proposition 6 Cross-checking directed informational searches and true infection tweets could improve the accuracy of data on disease infection incidence in public health surveillance: the Google-Twitter Disease Infection Composite Index (see Table 10.1).
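The filtering and cross-checking ideas in this section can be sketched in a few lines. The first function is a toy keyword filter in the spirit of the word classes the text attributes to Lamb et al. (2012). The second builds a composite series by standardizing and averaging, as Table 10.1 describes for Index 1. The word lists, the function names, and the choice of z-scores as the standardization are all assumptions made for illustration, and the filter is far simpler than the published classifiers.

```python
import re
from statistics import mean, pstdev

# Hypothetical word classes; the published lexicons are not given in the text.
SELF_WORDS = {"i", "i'm", "my", "me"}
INFECTION_WORDS = {"have", "got", "getting", "caught", "sick"}
CONCERN_WORDS = {"afraid", "fear", "worried", "concerned", "scared"}
VACCINATION_WORDS = {"vaccine", "vaccinated", "shot", "shots"}


def is_true_infection(tweet: str, keyword: str = "flu") -> bool:
    """Crude rule: mentions the keyword, refers to self and infection,
    and shows no concern or vaccination wording. Negation is not handled."""
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    if keyword not in tokens:
        return False
    if tokens & (CONCERN_WORDS | VACCINATION_WORDS):
        return False
    return bool(tokens & SELF_WORDS) and bool(tokens & INFECTION_WORDS)


def zscores(series):
    """Standardize a series to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def composite_index(google_series, twitter_series):
    """Average the standardized Google and Twitter series period by period."""
    if len(google_series) != len(twitter_series):
        raise ValueError("series must cover the same periods")
    pairs = zip(zscores(google_series), zscores(twitter_series))
    return [(g + t) / 2 for g, t in pairs]


# Made-up weekly values: search interest and counts of true infection tweets.
weekly_search_interest = [40, 55, 70, 90]
weekly_true_infection_tweets = [2, 3, 5, 6]
print(composite_index(weekly_search_interest, weekly_true_infection_tweets))
```

Standardizing first puts the two sources on a common scale, so neither dominates the average simply because its raw numbers are larger.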
10.3.3 Twitter Users: Social Influence and Public Opinion
Social media is also a source of social norms, social support, and social influence. A study found that in an all-peer weight-management online community, members formed a collective understanding of the social meaning of weight-loss behaviors, weight-loss products/brands, and services (Willis, 2016). These collective understandings create a sense of “community” that fosters a set of social norms and creates a social climate (Preece et al., 2004). Within these online social networks, the social
influence on social media has become a powerful force for changing public opinion on health issues. Reich’s (2020) study of vaccine-refusing mothers revealed that social media played a crucial role in forming, upholding, and reinforcing social norms against vaccination by providing informational, emotional, and appraisal support. Meanwhile, these mothers reached outside of the group, joined the public conversation about vaccination, and projected their social influence by posting and reposting vaccine-refusing messages, which were quickly disseminated online (Reich, 2020). The vaccine-refusal groups are one example of the growing social influence exerted on social media. A body of literature in marketing and strategic communication has documented and analyzed how virtual opinion leaders, traditional opinion leaders, organizational advocates, market mavens (experts), and others actively inject their influence into online social networks and shape public opinions and behaviors (Wang & Li, 2016). Effective monitoring of social influence and public opinion requires us to identify the sources of social influence and the sentiment of public opinions. Accordingly, our framework proposes combining sentiment-, topic-, and source-based big data approaches to monitor changes in public health attitudes and behaviors. Specifically, we suggest that sources of tweets be categorized into four groups: individual users (including regular users, health professionals, celebrities, and bloggers), legacy news media, digital-native news media, and authorities (e.g., CDC). Legacy media refers to news media established before the digital age, and digital-native media refers to online-only news media born in the digital age. We also suggest that overall tweet sentiment and content topics be examined regularly and continuously, on a daily or weekly basis. First, the sentiment-based big data approach examines the influence of a Twitter user’s network on that user’s health attitudes.
Salathé and Khandelwal (2011) collected data on the tweets, followers, and friends of users. This study found that users who were linked (by being a follower or a friend) shared the same sentiments on vaccinations. Eichstaedt et al. (2015) found that the volume of tweets reflecting negative social relationships, disengagement, and negative emotions was highly correlated with the atherosclerotic heart disease (AHD) rate at the county level. Second, topical analysis can be applied to assess the quality of online health information. The lack of critical evaluation of online health information by Internet users is a well-documented problem. Oyeyemi et al. (2014) collected Ebola-related tweets in English from Guinea, Liberia, and Nigeria during September 2014. They found that most tweets contained misinformation, and that misinformation had a much larger reach than correct information. Recent research has investigated the influence of “bots” (automated accounts) on the dissemination of information about public issues on Twitter, such as the 2016 US presidential election (Del Vicario et al., 2016) and e-cigarette promotion (Allem et al., 2017). Chew and Eysenbach (2010) integrated content and sentiment analysis of tweets about the 2009 H1N1 outbreak. They classified tweet content into six categories: resources, personal experience, personal opinion, jokes, marketing, and spam. This study found that H1N1 resources were the most common type of content (52.6%), followed by personal experiences (22.5%). Individual users often cited legacy news media (23% of links), and news feeds and niche news (12% of links), as health information sources
in tweets (Chew & Eysenbach, 2010). In contrast, public health and government authorities such as the CDC and WHO were rarely referenced directly by users (1.5% of links). Third, the source-based big data approach has been used to measure different users’ influence on Twitter. Influential users hold significant influence over a variety of topics, including health (Cha et al., 2012). The most influential users (those with the most followers and retweets) are elite users, including news media, celebrities, bloggers, and organizations; among these elite users, news media is the most influential category across a variety of topics (Cha et al., 2012; Kwak et al., 2010; Wu et al., 2011). The source-based big data approach also reveals how the influence of sources changes over the course of an epidemic (Chen et al., 2013; Lamb et al., 2012). Studies found that during the early stage of an epidemic, legacy news media paid a lot of attention to the epidemic and dominated the Twitter discourse about it; over time, attention to legacy news media decreased and the Twitter discourse became dominated by digital-native news media (e.g., news blogs). Chunara et al. (2012) investigated data collected from HealthMap news media reports, tweets, and government reports during the 2010 Haitian cholera outbreak. Their findings showed that the volume of legacy news media sources was highly correlated (R-squared above 60%) with the volume of officially reported cases in the early stage of the epidemic and slightly correlated (below 40%) with the volume of reported cases in the later stages of the outbreak because of decreasing media attention. Chew and Eysenbach (2010) found that, over the course of the H1N1 flu season, legacy news media’s websites were cited significantly less, whereas references to digital-native news sites (news blogs/feeds/niches, social networks, and other web pages) increased.
These findings are consistent with Liang and Scammon’s (2016) finding that at an early stage of a food recall incident, legacy news websites are most likely to obtain the higher-ranking positions in search results, while at a later stage, links to social media sites and to online news and information media occupy most positions in search results. Thus, combining source-, sentiment-, and topic-based big data approaches, we propose:
Proposition 7 A combined source-, sentiment-, and topic-based approach to Twitter data could be used to monitor changes in social influence on public opinions over time: the Twitter Source-Sentiment Index (see Table 10.1).
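One way to picture the source and sentiment parts of Proposition 7 is a weekly profile giving each source category's share of tweets and its mean sentiment. The sketch below assumes each tweet has already been assigned one of the four source groups named in this section and a sentiment score between -1 and 1. The category labels, the function name `weekly_source_profile`, the input layout, and the sample data are all hypothetical, and the classification and sentiment scoring steps themselves are out of scope here.

```python
from collections import Counter, defaultdict

# The four source groups suggested in the text, as hypothetical labels.
CATEGORIES = ("individual", "legacy_news", "digital_native_news", "authority")


def weekly_source_profile(tweets):
    """tweets: iterable of (week, category, sentiment) triples.

    Returns {week: {category: {"share": float, "mean_sentiment": float or None}}}.
    """
    counts = defaultdict(Counter)
    sentiment_totals = defaultdict(lambda: defaultdict(float))
    for week, category, sentiment in tweets:
        if category not in CATEGORIES:
            raise ValueError(f"unknown source category: {category}")
        counts[week][category] += 1
        sentiment_totals[week][category] += sentiment

    profile = {}
    for week, counter in counts.items():
        n = sum(counter.values())
        profile[week] = {
            c: {
                "share": counter[c] / n,
                # Mean sentiment is undefined when a category has no tweets.
                "mean_sentiment": (
                    sentiment_totals[week][c] / counter[c] if counter[c] else None
                ),
            }
            for c in CATEGORIES
        }
    return profile


# Made-up sample: legacy news dominates week 1, digital-native news week 2.
sample = [
    (1, "legacy_news", 0.5), (1, "legacy_news", -0.5),
    (1, "individual", 1.0), (1, "authority", 0.0),
    (2, "digital_native_news", -0.5), (2, "digital_native_news", -0.5),
    (2, "legacy_news", 0.5), (2, "individual", 0.0),
]
profile = weekly_source_profile(sample)
print(profile[1]["legacy_news"], profile[2]["legacy_news"])
```

Comparing the shares from week to week would show the kind of shift from legacy to digital-native sources that the studies above describe.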
10.3.4 Geo-analysis and the Ecological Perspectives
Our proposed framework can also benefit research from ecological perspectives in health communication. The big data approach provides a sufficient volume of data to test hypotheses about informational and social variations across units of study. An ecological view of public health emphasizes that information (e.g., local news
coverage), social (e.g., religious orientation, race and ethnicity), physical (e.g., accessibility of doctors, hospitals, gyms), and public policy (e.g., health insurance policy) environments vary across geographic locations. The health information environment is made up of health messages from various sources, including news, advertisements, interpersonal discussions, healthcare professionals, and the Internet (Sallis et al., 2015). The spread of health information varies across populations. Wang and Rodgers (2013) found that there were more local health news stories in newspapers serving Hispanic communities or the national audience than in Black newspapers. Research has also examined the physical environments of health behavior, such as the density of food outlets (De Vogli et al., 2014; Janssen et al., 2018), access to physical activity facilities, and walkability of neighborhoods (Matsaganis & Wilkin, 2015; Saelens et al., 2003). A big challenge facing ecological models is the lack of a large quantity of data from a wide range of geographical regions for hypothesis testing (Sallis et al., 2015). Often, health researchers find that there is insufficient variation in social, environmental, and policy variables across units of study when using a random sample from a single location, for example, Houston, TX (Sallis et al., 2015). With the big data approach, the large volume of data available nationwide and globally may make it possible to examine informational and social variables of health behaviors at multiple geographic scales. For example, Google search data is available at the country, state, metropolitan, and city levels. Twitter data can carry geographic information (geotags). Tsou et al. (2013) proposed a research framework for analyzing the spatial distribution of web pages and Twitter messages with predefined keywords in a case study of the 2012 US Presidential Election.
Public health agencies could use Google search and geotagged Twitter data to make plans and respond appropriately at the municipal and state levels of government. Building on previous work, more hypothesis testing can be conducted using search and social data to examine the role of information and social environments in creating or mitigating health disparities across populations (Marmot, 2005; Reidpath et al., 2002). Some initial studies have applied geo-analysis to Google search and Twitter data. Liang and Scammon (2013) investigated the extent to which Google search data can be explained by the incidence of the flu, information and social factors (e.g., media attention), and resource factors (e.g., health insurance coverage, number of hospital beds). Liang et al. (2019) studied Google search and geotagged Twitter data to examine the association between information and social environments and the regional prevalence of overweight and obesity. Thus, we propose that within the ecology of public health, analyzing and interpreting geotagged Google search and Twitter data could reveal the influence of information and social environments at the city or state level.
Proposition 8 Geo-tagged Google search and Twitter data could be used to examine the relationships between information and social environments and public health conditions, public awareness, and opinions at multiple geographic scales.
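A minimal sketch of the kind of region-level analysis Proposition 8 points to is given below: it aligns a search-based index with a health outcome by region and computes their Pearson correlation. The function names, the state codes used as region keys, and all numbers are invented for illustration. A real analysis would need many more regions and controls for the other ecological factors discussed in this section.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def correlate_by_region(search_index, health_outcome):
    """Correlate two {region: value} mappings over the regions they share."""
    regions = sorted(set(search_index) & set(health_outcome))
    xs = [search_index[r] for r in regions]
    ys = [health_outcome[r] for r in regions]
    return regions, pearson_r(xs, ys)


# Invented values: a search-interest index and an outcome rate per state.
search_index = {"UT": 1, "TX": 2, "CA": 3, "NY": 4}
health_outcome = {"UT": 2, "TX": 4, "CA": 6, "FL": 9}

regions, r = correlate_by_region(search_index, health_outcome)
print(regions, r)
```

Only regions present in both mappings enter the calculation, which matters in practice because search and social data rarely cover exactly the same set of places.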
10.4 Conclusions and Implications
As people spend more time online searching for or sharing health information, systematically collecting, analyzing, and interpreting online big data on health issues (infodemiology) has become a key issue in public health surveillance. Tsou defined big data as “a large dynamic data set created by or derived from human activities, communications, movements, and behaviors” (Tsou, 2015, p. S70). In our work, we recognize the complexity and dynamics of big data, and the efforts of previous scholars in collecting, cleaning/filtering, analyzing, and modeling big data, especially Google search and Twitter data, for public health surveillance. From an ecological perspective, we propose an integrated framework for how Google search and Twitter data could be collected, analyzed, interpreted, and applied in public health surveillance. As we see people posting and searching on the Internet, we should be aware of the undercurrent of information sources and social influences running through search engine result pages and social media posts. In particular, we should appreciate the roles of diverse information sources in ecological models of health behavior, in order to build more accurate models and produce more meaningful interpretations of the data used in public health surveillance. Santillana et al. (2015) proposed a machine learning approach to accurately forecast flu incidence by integrating multiple sources, including Google Search, Twitter posts, real-time hospital visit records, and data from Flu Near You, a website that allows the public to report their health information using a quick weekly survey. Recognizing human dynamics in big data, researchers have combined various data sources (e.g., call detail records and geo-tagged tweets) in studies of event detection (e.g., natural disasters) (Wachowicz & Liu, 2016). All in all, filtered and sourced data could be used to monitor the incidence of pandemics and their informational and social context.
Table 10.1 shows how the set of indexes in our proposed model could be applied in public health surveillance along the dimensions of keyword selection, data source, suggested data analysis methods, and implication. Future research is needed to validate these indexes with empirical data and to examine how to use them to inform public health policies and practices across geographic regions. Research is also needed to investigate the concurrence and correlation of the set of Twitter sources (e.g., legacy news media, digital-native news media) to better understand the interplay between information sources in the ecology of public health.
References
Achrekar, H., Gandhe, A., Lazarus, R., Ssu-Hsin, Y., & Liu, B. (2011). Predicting flu trends using Twitter data. In 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS) (pp. 702–707).
Allem, J.-P., Ferrara, E., Uppu, S. P., Cruz, T. B., & Unger, J. B. (2017). E-Cigarette surveillance with social media data: social bots, emerging topics, and trends. JMIR Public Health Surveillance, 3, e98.
Aramaki, E., Maskawa, S., & Morita, M. (2011). Twitter catches the flu: Detecting influenza epidemics using Twitter. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 1568–1576). Edinburgh, Scotland, UK: Association for Computational Linguistics.
Askitas, N., & Zimmermann, K. F. (2009). Google econometrics and unemployment forecasting. https://doi.org/10.2139/ssrn.1465341
Battelle, J. (2005). The birth of Google. Wired-San Francisco, 13, 102.
Bertino, E. (2013). Big data—Opportunities and challenges panel position paper. In 2013 IEEE 37th Annual Computer Software Applications Conference (pp. 479–480). IEEE.
Bosley, J. C., Zhao, N. W., Hill, S., Shofer, F. S., Asch, D. A., Becker, L. B., & Merchant, R. M. (2013). Decoding twitter: Surveillance and trends for cardiac arrest and resuscitation communication. Resuscitation, 84, 206–212.
Broniatowski, D. A., Paul, M. J., & Dredze, M. (2013). National and local influenza surveillance through Twitter: An analysis of the 2012–2013 influenza epidemic. PLoS One, 8, e83672.
Brownstein, J. S., Freifeld, C. C., & Madoff, L. C. (2009). Digital disease detection—Harnessing the web for public health surveillance. New England Journal of Medicine, 360, 2153–2157.
Butler, D. (2013). When Google got flu wrong: US outbreak foxes a leading web-based method for tracking seasonal flu. Nature, 494, 155–157.
Carneiro, H. A., & Mylonakis, E. (2009). Google Trends: A web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases, 49, 1557–1564.
Cha, M., Benevenuto, F., Haddadi, H., & Gummadi, K. (2012). The world of connections and information flow in Twitter. IEEE Transactions on Systems, Man, and Cybernetics—Part A Systems and Humans, 42, 991–998.
Chan, E. H., Sahai, V., Conrad, C., & Brownstein, J. S. (2011). Using web search query data to monitor dengue epidemics: A new model for neglected tropical disease surveillance. PLoS Neglected Tropical Diseases, 5, e1206.
Chen, F., Griffith, A., Cottrell, A., & Wong, Y.-L. (2013). Behavioral responses to epidemics in an online experiment: using virtual diseases to study human behavior. PLoS One, 8, e52814.
Chevalier, J. A., & Mayzlin, D. (2006). The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research, 43, 345–354.
Chew, C., & Eysenbach, G. (2010). Pandemics in the age of Twitter: Content analysis of tweets during the 2009 H1N1 outbreak. PLoS One, 5, e14118.
Chunara, R., Andrews, J. R., & Brownstein, J. S. (2012). Social and news media enable estimation of epidemiological patterns early in the 2010 Haitian cholera outbreak. American Journal of Tropical Medicine and Hygiene, 86, 39–45.
De Choudhury, M., Morris, M. R., & White, R. W. (2014). Seeking and sharing health information online: Comparing search engines and social media. In Proceedings of SIGCHI Conference on Human Factors Computing Systems (pp. 1365–1376). New York, NY, USA: Association for Computing Machinery.
De Vogli, R., Kouvonen, A., & Gimeno, D. (2014). The influence of market deregulation on fast food consumption and body mass index: A cross-national time series analysis. Bulletin of the World Health Organization, 92, 99-107A.
Del Vicario, M., Bessi, A., Zollo, F., Petroni, F., Scala, A., Caldarelli, G., Stanley, H. E., & Quattrociocchi, W. (2016). The spreading of misinformation online. Proceedings of the National Academy of Sciences USA, 113, 554–559.
Doan, S., Ohno-Machado, L., & Collier, N. (2012). Enhancing Twitter data analysis with simple semantic filtering: Example in tracking influenza-like illnesses. In 2012 IEEE Second International Conference on Healthcare Informatics, Imaging and System Biology (pp. 62–71).
Dugas, A. F., Jalalpour, M., Gel, Y., Levin, S., Torcaso, F., Igusa, T., & Rothman, R. E. (2013). Influenza forecasting with Google Flu Trends. PLoS One, 8, e56176.
Duggan, M. (2015). The demographics of social media users. Pew Research Center’s Internet Science & Technology.
Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., et al. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological Science, 26, 159–169.
eMarketer.com (2018). Search referral share, by search engine, US performance metrics, estimates and historical data. In Inside Intell. Retrieved Sep 1, 2019, from https://www.emarketer.com/performance/channel/59ee1f37bfce890eb411f134/58e39a6f2357af0f1484d953.
Ettredge, M., Gerdes, J., & Karuga, G. (2005). Using web-based search data to predict macroeconomic statistics. Communications of the ACM, 48, 87–92.
Eysenbach, G. (2009). Infodemiology and infoveillance: Framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the internet. Journal of Medical Internet Research, 11, e11.
Fox, S., & Duggan, M. (2013). Health online 2013. Health (N. Y.), 2013, 1–55.
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457, 1012–1014.
Gittelman, S., Lange, V., Crawford, C. A. G., Okoro, C. A., Lieb, E., Dhingra, S. S., & Trimarchi, E. (2015). A new source of data for public health surveillance: Facebook likes. Journal of Medical Internet Research, 17, e98.
Goldstein, S., MacDonald, N. E., & Guirguis, S. (2015). Health communication and vaccine hesitancy. Vaccine, 33, 4212–4214.
Granka, L. A., Joachims, T., & Gay, G. (2004). Eye-tracking analysis of user behavior in WWW search. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 478–479). New York, NY, USA: Association for Computing Machinery.
Guan, M., & So, J. (2016). Influence of social identity on self-efficacy beliefs through perceived social support: A social identity theory perspective. Communication Studies, 67, 588–604.
Hay, S. I., George, D. B., Moyes, C. L., & Brownstein, J. S. (2013). Big data opportunities for global infectious disease surveillance. PLoS Medicine, 10, e1001413.
Hulth, A., Rydevik, G., & Linde, A. (2009). Web queries as a source for syndromic surveillance. PLoS One, 4, e4378.
Jansen, B. J., Zhang, M., Sobel, K., & Chowdury, A. (2009). Twitter power: Tweets as electronic word of mouth. Journal of the American Society of Information Science and Technology, 60, 2169–2188.
Janssen, H. G., Davies, I. G., Richardson, L. D., & Stevenson, L. (2018). Determinants of takeaway and fast food consumption: A narrative review. Nutrition Research Reviews, 31, 16–34.
Kwak, H., Lee, C., Park, H., & Moon, S. (2010). What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (pp. 591–600). New York, NY, USA: Association for Computing Machinery.
Lamb, A., Paul, M. J., & Dredze, M. (2012). Investigating Twitter as a source for studying behavioral responses to epidemics. In AAAI Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text (pp. 81–83). Citeseer.
Law, M. R., Mintzes, B., & Morgan, S. G. (2011). The sources and popularity of online drug information: An analysis of top search engine results and web page views. Annals of Pharmacotherapy, 45, 350–356.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google flu: Traps in big data analysis. Science, 343, 1203–1205.
Liang, B., & Scammon, D. L. (2016). Food contamination incidents: What do consumers seek online? Who cares? International Journal of Nonprofit and Voluntary Sector Marketing, 21, 227–241.
Liang, B., Wang, Y., & Tsou, M.-H. (2019). A “fitness” theme may mitigate regional prevalence of overweight and obesity: Evidence from Google search and tweets. Journal of Health Communication, 24, 683–692.
Liang, B., & Scammon, D. L. (2013). Incidence of online health information search: A useful proxy for public health risk perception. Journal of Medical Internet Research, 15, e114. Marmot, M. (2005). Social determinants of health inequalities. The Lancet, 365, 1099–1104. Matsaganis, M. D., & Wilkin, H. A. (2015). Communicative social capital and collective efficacy as determinants of access to health-enhancing resources in residential communities. Journal of Health Communication, 20, 377–386. McMullan, R. D., Berle, D., Arnáez, S., & Starcevic, V. (2019). The relationships between health anxiety, online health information seeking, and cyberchondria: Systematic review and metaanalysis. Journal of Affective Disorders, 245, 270–278. Metcalfe, D., Price, C., & Powell, J. (2011). Media coverage and public reaction to a celebrity cancer diagnosis. Journal of Public Health, 33, 80–85. Modave, F., Shokar, N. K., Peñaranda, E., & Nguyen, N. (2014). Analysis of the accuracy of weight loss information search engine results on the internet. American Journal of Public Health, 104, 1971–1978. Olson, D. R., Konty, K. J., Paladini, M., Viboud, C., & Simonsen, L. (2013). Reassessing Google flu trends data for detection of seasonal and pandemic influenza: A comparative epidemiological study at three geographic scales. PLoS Computational Biology, 9, e1003256. Oyeyemi, S. O., Gabarron, E., & Wynn, R. (2014). Ebola, Twitter, and misinformation: A dangerous combination? BMJ, 349, g6178. Pan, B., Hembrooke, H., Joachims, T., Lorigo, L., Gay, G., & Granka, L. (2007). In Google we trust: Users’ decisions on rank, position, and relevance. Journal of Computer-Mediated Communication, 12, 801–823. Paul, M. J., Dredze, M., & Broniatowski, D. (2014). Twitter improves influenza forecasting. PLoS Currents. https://doi.org/10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117 Paul, M. J., & Dredze, M. (2012). A model for mining public health topics from Twitter. Health (N. Y.), 11. 
Pelat, C., Turbelin, C., Bar-Hen, A., Flahault, A., & Valleron, A.-J. (2009). More diseases tracked by using Google Trends. Emerging Infectious Diseases, 15, 1327–1328. Perrin, A., & Anderson, M. (2019). Share of U.S. adults using social media, including Facebook, is mostly unchanged since 2018. In Pew Research Center. Retrieved April 15, 2021, from https://www.pewresearch.org/fact-tank/2019/04/10/share-of-u-s-adults-using-socialmedia-including-facebook-is-mostly-unchanged-since-2018/. Pershad, Y., Hangge, P. T., Albadawi, H., & Oklu, R. (2018). Social medicine: Twitter in healthcare. Journal of Clinical Medicine, 7, 121. Preece, J., Nonnecke, B., & Andrews, D. (2004). The top five reasons for lurking: Improving community experiences for everyone. Computers in Human Behavior, 20, 201–223. Reich, J. A. (2020). “We are fierce, independent thinkers and intelligent”: Social capital and stigma management among mothers who refuse vaccines. Social Science & Medicine, 257, 112015. Reidpath, D. D., Burns, C., Garrard, J., Mahoney, M., & Townsend, M. (2002). An ecological study of the relationship between social and environmental determinants of obesity. Health & Place, 8, 141–145. Rose, D. E., & Levinson, D. (2004). Understanding user goals in web search. In Proceedings of the 13th International Conference on World Wide Web (pp. 13–19). New York, NY, USA: Association for Computing Machinery. Saelens, B. E., Sallis, J. F., & Frank, L. D. (2003). Environmental correlates of walking and cycling: Findings from the transportation, urban design, and planning literatures. Annals of Behavioral Medicine, 25, 80–91. Salathé, M., & Khandelwal, S. (2011). Assessing vaccination sentiments with online social media: Implications for infectious disease dynamics and control. PLoS Computational Biology. https:// doi.org/10.1371/journal.pcbi.1002199 Sallis, J. F., Owen, N., & Fisher, E. (2015). Ecological models of health behavior. Health Behavior: Theory, Research, and Practice, 5.
Santillana, M., Nguyen, A. T., Dredze, M., Paul, M. J., Nsoesie, E. O., & Brownstein, J. S. (2015). Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Computational Biology, 11, e1004513. Scheitle, C. P. (2011). Google’s Insights for search: A note evaluating the use of search engine data in social research*. Social Science Quarterly, 92, 285–295. Seifter, A., Schwarzwalder, A., Geis, K., & Aucott, J. (2010). The utility of “Google Trends” for epidemiological research: Lyme disease as an example. Geospatial Health, 4, 135–137. Silverstein, C., Henzinger, M., Marais, H., & Moricz, M. (1998). Analysis of a very large AltaVista query log. Technical Report 1998-014, Digital SRC. Southwell, B. G., Dolina, S., Jimenez-Magdaleno, K., Squiers, L. B., & Kelly, B. J. (2016). Zika virus-related news coverage and online behavior, United States, Guatemala, and Brazil. Emerging Infectious Diseases, 22, 1320–1321. Spink, A., Wolfram, D., Jansen, M. B. J., & Saracevic, T. (2001). Searching the web: The public and their queries. Journal of the American Society of Information Science and Technology, 52, 226–234. Tsou, M.-H. (2015). Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science, 42, 70–74. Tsou, M.-H., Yang, J.-A., Lusher, D., Han, S., Spitzberg, B., Gawron, J. M., Gupta, D., & An, L. (2013). Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): A case study in 2012 US Presidential Election. Cartography and Geographic Information Science, 40, 337–348. Velasco, E., Agheneza, T., Denecke, K., Kirchner, G., & Eckmanns, T. (2014). Social media and internet-based data in global systems for public health surveillance: A systematic review. Milbank Quarterly, 92, 7–33. Wachowicz, M., & Liu, T. (2016). Finding spatial outliers in collective mobility patterns coupled with social ties. 
International Journal of Geographical Information Science, 30, 1806–1831. Wang, Y., & Li, Y. (2016). Proactive engagement of opinion leaders and organization advocates on social networking sites. International Journal of Strategic Communications, 10, 115–132. Wang, Y., & Rodgers, S. (2013). Reporting on health to ethnic populations: A content analysis of local health news in ethnic versus mainstream newspapers. Howard Journal of Communications, 24, 257–274. Willis, E. (2016). Patients’ self-efficacy within online health communities: Facilitating chronic disease self-management behaviors through peer education. Health Communication, 31, 299–307. Wilson, K., & Brownstein, J. S. (2009). Early detection of disease outbreaks using the Internet. CMAJ, 180, 829–831. Wojcik, S., & Hughes, A. (2019). Sizing up Twitter users. Pew Research Center Internet, Science & Technology. Wu, S., Hofman, J. M., Mason, W. A., & Watts, D. J. (2011). Who says what to whom on twitter. In Proceedings of the 20th International Conference on World Wide Web (pp. 705–714). New York, NY, USA: Association for Computing Machinery.
Chapter 11
A Case Study in Belief Surveillance, Sentiment Analysis, and Identification of Informational Targets for E-Cigarettes Interventions
Lourdes S. Martinez, Ming-Hsiang Tsou, and Brian H. Spitzberg
Reprinted from: Martinez, L. S., Tsou, M.-H., & Spitzberg, B. H. (2019). A Case Study in Belief Surveillance, Sentiment Analysis, and Identification of Informational Targets for E-Cigarettes Interventions. Proceedings of the 10th International Conference on Social Media and Society, 15–23. https://doi.org/10.1145/3328529.3328540.
L. S. Martinez · B. H. Spitzberg: School of Communication, San Diego State University, San Diego, CA, USA
M.-H. Tsou: Department of Geography, San Diego State University, San Diego, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_11

11.1 Introduction
Social media technology has ushered in an era of unprecedented opportunities for individuals to connect with one another. This interconnection, however, has arrived with exigencies related to the misuse and abuse of user data gathered through social media platforms (Allem & Ferrara, 2016; Armstrong et al., 2018), and to the emergence of social media platforms as avenues for spreading misinformation (i.e., inaccurate information) and disinformation (i.e., deceptive information) (Bessi et al., 2015; Chou et al., 2018; Rainie et al., 2017; Rich, 2018). This is particularly problematic when such content becomes memetic and spreads virally across social networks (Garrett et al., 2019; Shin et al., 2018; Vicario et al., 2016). Bots, in particular, present a challenge to reversing negative trends in users' sense of privacy and in public trust in social media content. From a privacy perspective, bots are designed to mimic authentic human activity, and may use online user data to be effectively engaging (Broniatowski et al., 2018; Stella et al., 2018; Sutton, 2018). There are various reasons why users may
not wish their data to be used to guide the activity of bots in this manner. From a trust perspective, while not all bots are designed with veiled intent or for nefarious purposes (Lima Salge & Berente, 2017), public health experts and scholars are expressing increasing concern about the potential for bots to disseminate misinformation and disinformation related to public health issues (Broniatowski et al., 2018). Trust in health information suffers when misinformation and disinformation proliferate to a point where significant correction is required (Allem & Ferrara, 2016), which possibly fuels perceptions that online health information is neither reliable nor accurate (Allem et al., 2017; Ferrara et al., 2016).

Recent theorizing in the communication discipline suggests that in order to understand when and how public health-related online information (including misinformation and disinformation) is likely to spread virally across a social network, researchers need to consider human processes and how they employ unique properties of social media in realspace (Kopp et al., 2018; Shin et al., 2018). The multilevel model of meme diffusion (M3D) is distinctively positioned for this purpose.

Answering the question of what drives the diffusion of information is an ongoing objective of various fields and theories. Distinct models emphasize different variables: aspects of the message or meme itself; the influential sources from which such messages derive; the structural features or human dynamics of the social networks to which such messages are sent; the societal and cultural dynamics that contextualize such messages; and the geotechnical context surrounding such dynamics. Given that research identifies features at each of these levels of analysis, it follows that a multilevel approach is likely needed to fully model such phenomena. The M3D aims to provide such a framework.
The M3D proposes that sets of variables influence the diffusion of a message or meme. Following Dawkins’ (2016) proposal that memes are comparable to genes in that they transfer information from one person to another, Spitzberg (Schlaile et al., 2018; Sharag-Eldin et al., 2018; Spitzberg, 2014) further conjectured several conditions under which memes are likely to spread. Memes that are more novel and affect-laden travel farther and faster. Their spread is further facilitated when they are shared by users who are considered more credible, trustworthy, and competent. Network characteristics can also help or hinder the spread of memes. For example, social networks that are internally homophilous but externally diverse (e.g., with many bridgers) help memes spread faster and farther. Additionally, a lack of counterarguments in the larger information environment, as well as access to technology among a range of proximal users, also promotes the diffusion of memes.

Our web-based social media analytics research testbed (SMART) dashboard includes geotargeted social media (specifically Twitter) application programming interfaces (APIs) that allow for real-time tracking of various topics (URL: http://vision.sdsu.edu/hdma/smart). It has previously been used to monitor a range of topics, using keywords to gather data on disease outbreaks, drug abuse, and emergencies related to natural disasters (Tsou et al., 2015). The SMART dashboard is also well suited to investigating many of the variables hypothesized by the M3D. The purpose of the current article is to demonstrate the utility of the SMART dashboard for
examining and tracking social media messages in a previously unexplored context: e-cigarettes.
11.2 Background
E-cigarettes (or “electronic cigarettes”) are a type of electronic nicotine delivery system (ENDS) that has seen increasing use by adolescents (Arrazola et al., 2015) and emerging adults (Choi & Forster, 2013; King et al., 2013). E-cigarette use is a growing public health concern because these devices may contain potentially toxic chemicals (Cobb et al., 2010) and their long-term effects on health remain unestablished (Rachel et al., 2014). For young people, e-cigarettes may act as a gateway to the use of combustible tobacco products (Barrington-Trimis et al., 2016). Young people are also heavily engaged users of social networking sites such as Twitter (Duggan & Brenner, 2013), which are often viewed as important information sources (Kim et al., 2014; Westerman et al., 2014) and settings for socialization (Barkhuus & Tashiro, 2010). In addition, a lack of distinction between online and offline life among young individuals is likely to increase the impact of messages shared over social media (Wright & Li, 2011). However, more research is needed to examine the types of messages about e-cigarettes that are shared over social media, who is sharing these messages (e.g., bots versus authentic humans), and the role of e-cigarette advocates (Kim et al., 2017a) and their promotion of these devices. We introduce two case studies using the SMART dashboard to examine content about e-cigarettes and the sources of these messages on Twitter, as well as how they operate in real time.
11.3 Case Studies in E-Cigarettes
The SMART dashboard is constructed with several data mining programs, GIS methods, and geo-targeted social media APIs in order to monitor topics of interest through space and across time. Capabilities of this tool permit researchers to generate visualizations, and perform descriptive and predictive analyses of these topics in various U.S. cities across time. Stakeholders, such as government officials and those involved in healthcare delivery and first response, can easily access this tool. We note the following features of the SMART dashboard to further emphasize its unique and important capabilities:
1. Collect and update social media messages daily, along with their spatial attributes and geographic patterns of diffusion in different cities.
2. Present evolution of social media trends (daily, weekly, monthly) over time as they occur in real-time.
3. Filter data to extract noise and errors to improve accuracy of analyses and tracking of social media messages.
4. Display temporal trends of social media messages by individual cities or by aggregation of data across all listed cities.
5. Provide insight into social media messages and how they differ between cities, which may be used by local health agencies and organizations to map geographical hot spots or areas in need of intervention.
Prior research has already demonstrated the utility of the SMART dashboard for collecting and analyzing social media messages in other contexts (e.g., Flu, Whooping Cough, Wildfire, Drugs, and Aztecs) (Allen et al., 2016; Aslam et al., 2014; Han et al., 2018; Issa et al., 2017; Kim et al., 2017b; Martinez et al., 2017, 2018; Nagel et al., 2013; Sanguinet, 2016; Sharag-Eldin et al., 2018; Shi et al., 2019; Wang & Ye, 2018, 2019; Ye et al., 2018), and details of the SMART dashboard’s original development are available elsewhere (Tsou et al., 2017; Yang et al., 2016). In the present study, we expand on this prior work by using the SMART dashboard to examine social media messages in the context of e-cigarettes. By entering the following list of keywords (Cole-Lewis et al., 2015; Huang et al., 2014) into the SMART dashboard, we collected a total of 193,051 tweets between October 2015 and February 2016 across all regions in the U.S.: Vaping, Vape, Vaper, Vapers, Vapin, Vaped, Evape, Vaporing, e-cig*, ecig*, e-pen, epen, e-juice, ejuice, e-liquid, or eliquid. The SMART dashboard retained only tweets that included at least one of these listed keywords. The filtering and data cleaning functions of the SMART dashboard are based on past work (Tsou et al., 2015); they are summarized in Fig. 11.1 as adapted for the current study. Figures 11.2 and 11.3 further detail these and additional functions of the SMART dashboard. The following sections introduce two case studies in the context of e-cigarettes to illustrate how the
• Filter: Remove retweets, tweets containing only URL addresses, and users with any keyword in their username.
• Machine Learning: Collect training sample, identify relevant keywords, and use machine learning.
• Statistics: Display top retweet, URL, hashtag, and mention.
• Spatial Analysis: Conduct hot spot analysis and overlay with layers of other data.
Fig. 11.1 Data filter and cleaning procedures in SMART Dashboard (adapted from Tsou et al., 2015)
Fig. 11.2 The screen shot of SMART dashboard for E-cigarettes
• Top Index Numbers: Displays number of tweets collected per day, week, or month.
• Trend Function: Queries actual tweeting texts, displayed in daily, weekly, or monthly views.
• Word Cloud Function: Presents most prominent keywords tweeted per day, week, month, or aggregate.
• Tweets in Cities Function: Displays normalized tweeting rates per city.
• Additional Functions: Shows top 10 list of top URLs, hashtags, retweets, mentions, and images.
Fig. 11.3 Key features for interactive query and visualizations in SMART dashboard (Tsou et al., 2015)
SMART dashboard can be used to study public health topics. These examples focus on public perceptions of and beliefs about e-cigarettes, guided by the concepts of infodemiology and infoveillance (Kim, 2018). Infodemiology refers to scientific approaches in the study of online content (collected and analyzed in real-time) used to draw insights in
order to advise public health and public policy. Infoveillance uses infodemiological methods with the objective of surveillance. Although surveillance is an important public health activity, traditional approaches to disease detection are typically expensive and can be labor intensive. In contrast, social media content offers data that can be collected and analyzed in real time, and may provide public health practitioners and researchers with an alternative way to monitor and survey disease outbreaks. We elected to discuss these two case studies to emphasize the advantages and opportunities for disease surveillance offered by tools employing social media analytics, including the SMART dashboard.
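The keyword retention and cleaning steps described for the e-cigarette collection (keep tweets that match at least one keyword, then drop retweets, URL-only tweets, and accounts whose usernames contain a keyword; see Fig. 11.1) can be sketched in Python. This is only a minimal illustration and not the SMART dashboard's actual code: the tweet fields `text`, `username`, and `is_retweet`, and all function names, are hypothetical, and reading a trailing `*` as a prefix wildcard is an assumption.

```python
import re

# Keyword list from the text; a trailing "*" is read as a prefix wildcard (assumption).
KEYWORDS = [
    "vaping", "vape", "vaper", "vapers", "vapin", "vaped", "evape",
    "vaporing", "e-cig*", "ecig*", "e-pen", "epen",
    "e-juice",  # hyphenated variant assumed
    "ejuice", "e-liquid", "eliquid",
]


def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole token."""
    parts = []
    for kw in keywords:
        if kw.endswith("*"):
            parts.append(re.escape(kw[:-1]) + r"\w*")
        else:
            parts.append(re.escape(kw))
    # The lookarounds prevent matches inside longer words or hyphenated compounds.
    return re.compile(r"(?<![\w-])(?:" + "|".join(parts) + r")(?![\w-])", re.IGNORECASE)


KEYWORD_RE = build_keyword_pattern(KEYWORDS)
URL_RE = re.compile(r"https?://\S+")
STEMS = [kw.rstrip("*") for kw in KEYWORDS]


def is_url_only(text):
    """True when nothing but URLs and whitespace remains in the tweet text."""
    return URL_RE.sub("", text).strip() == ""


def username_has_keyword(username):
    """Broad substring check; it may over-filter (e.g. 'epen' inside 'independent')."""
    name = username.lower()
    return any(stem in name for stem in STEMS)


def keep_tweet(tweet):
    """Apply the keyword match and the three removal rules to one tweet dict."""
    text = tweet["text"]
    if not KEYWORD_RE.search(text):
        return False
    if tweet.get("is_retweet", False):
        return False
    if is_url_only(text):
        return False
    if username_has_keyword(tweet["username"]):
        return False
    return True


def filter_tweets(tweets):
    """Return only the tweets that pass every rule, preserving input order."""
    return [t for t in tweets if keep_tweet(t)]
```

Keeping each rule in its own small function makes it easy to switch one off or tighten it (for example, replacing the substring test on usernames with a stricter match) without touching the others.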
11.3.1 Belief Surveillance
Figures 11.4 and 11.5 show results of a sentiment analysis for a random sample of tweets (N = 973) collected between October 2015 and February 2016 across the U.S. We performed a general sentiment analysis (Fig. 11.4) and an analysis specifically examining whether tweets confirmed or rejected commonly held beliefs about e-cigarettes (e.g., stigma associated with e-cigarettes, perceived harmfulness, capacity for generating second-hand smoke, helpfulness as a cessation tool, versatility, and potential for addiction) (Fig. 11.5). The general sentiment analysis examined whether the tone of the tweet was positive (approval of e-cigarettes), negative (disapproval of e-cigarettes), neutral (neither approving nor disapproving of e-cigarettes), ambiguous (both approving and disapproving of e-cigarettes), or other (nonsensical or incomprehensible) (Martinez et al., 2018). The categories for a more specific analysis of sentiment included: (a) the effectiveness
Fig. 11.4 Sentiment analysis (N = 973) (Martinez et al., 2018). [Bar chart; vertical axis: percentage of tweets, 0–80%; categories: Positive, Neutral, Negative, Ambiguous, Other; bar labels as printed: 68%, 12%, 15%, 4%, 2%.]
Fig. 11.5 Sentiment analysis: confirmation and rejection of common beliefs about e-cigarettes (N = 973) (Martinez et al., 2018). [Grouped bar chart; series: Confirmed, Rejected; vertical axis: percentage of tweets, 0–9%.]
or ineffectiveness of e-cigarette use as a cessation tool; (b) increased or reduced addictiveness of e-cigarettes compared to traditional cigarettes; (c) presence or lack of stigma regarding e-cigarettes; (d) whether e-cigarettes cause or reduce second-hand smoke; (e) whether e-cigarettes produce beneficial or harmful effects on health; (f) freedom or restrictions on users and when or where they can vape; and (g) general satisfaction or dissatisfaction from using e-cigarettes instead of traditional cigarettes. A similar graph could be used to compare sentiment related to e-cigarettes in tweets with official regional health census data, as well as with data from Monitoring the Future, an ongoing National Institute on Drug Abuse and NIH study of young adult attitudes, values, and behaviors (Collins et al., 2019), which captures perceptions of e-cigarettes as a cessation tool, along with personal disapproval of and perceived risk from regularly using e-cigarettes.

Figure 11.4 shows that the majority of sampled e-cigarette tweets (68%) are positive, expressing support for the use of e-cigarettes. Such pro-vaping content could have a very negative impact on public health. Figure 11.5 illustrates that “Stigma” and “Second-Hand Smoke” are the major reasons given in the sampled tweets for supporting the use of e-cigarettes. On the other hand, “Harmfulness” and “Stigma” are the primary reasons expressed for rejecting e-cigarettes. “Stigma” plays the most important role in both confirmed and rejected messages about e-cigarettes. Our interactive SMART dashboard can provide practitioners with a way to collect social media messages in real time to monitor and visualize trends in e-cigarette sentiment. The daily, weekly, and monthly monitoring functions of the SMART dashboard offer public health authorities at state and local levels a tool for collecting data on beliefs about
e-cigarettes. Sentiment analysis of these data can indicate how beliefs are changing over time, point toward opportunities for intervention, and present an additional way to evaluate existing intervention efforts.

Using the M3D’s meme-level concepts to distinguish which types of tweets are likely to spread, we can also examine the content of tweets for these characteristics before they become viral. For example, the finding that stigma was central to both confirmatory and disconfirmatory messages related to e-cigarettes suggests it may help fuel a larger debate, which is likely to garner attention among users and increase the likelihood of spreading across social networks. In this way, the identification of tweets before they potentially become viral can help thwart the spread of health misinformation, or even suggest message designs that could effectively counter-argue such viral trends. This would also allow public health practitioners to get ahead of the conversation online with argumentation that can potentially slow or stop tweets that endorse e-cigarettes and that exhibit memetic potential from becoming viral.
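As a rough illustration of how hand-coded sentiment labels could be turned into the category shares that such monitoring would track over time, the sketch below tallies labels by month. The five category names follow the general sentiment coding scheme described above; the function names and the `(month, label)` record layout are hypothetical, and no study data are reproduced here.

```python
from collections import Counter, defaultdict

# The five general sentiment categories used in the coding scheme.
CATEGORIES = ("positive", "negative", "neutral", "ambiguous", "other")


def sentiment_shares(labels):
    """Return the percentage of labels in each category (all 0.0 for empty input).

    Labels outside CATEGORIES are ignored and do not count toward the total.
    """
    counts = Counter(labels)
    total = sum(counts[c] for c in CATEGORIES)
    return {c: (100.0 * counts[c] / total if total else 0.0) for c in CATEGORIES}


def shares_by_month(records):
    """Group (month, label) pairs by month and compute category shares for each."""
    grouped = defaultdict(list)
    for month, label in records:
        grouped[month].append(label)
    return {month: sentiment_shares(grouped[month]) for month in sorted(grouped)}
```

Comparing the per-month dictionaries side by side is one simple way to see whether, say, the share of positive tweets is rising or falling between collection periods.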
11.3.2 Promotion and Advocates
A second case study uses the SMART dashboard to identify proponents of behaviors that may undermine public health goals. In the context of e-cigarettes, advocates (Kim et al., 2017a) on social media may voice positions promoting the use of e-cigarettes that could shape views and risk perceptions of vulnerable populations, including younger populations. Figure 11.6 shows a summary of activity rates for authentic
Fig. 11.6 Activity rate for advocates Twitter accounts (ranked by the numbers of followers)
Fig. 11.7 Activity rate for a normal advocates Twitter account (#1)
human accounts acting as advocates of e-cigarettes, as generated by the SMART dashboard, including network size and daily, weekly, and monthly activity. The daily rates are calculated using the whole lifespan of user accounts. The weekly rates are based on the most recent seven days, and the monthly rates on the most recent 30 days. We found that most advocates post between 3 and 53 times per day. Their daily, last-seven-day, and last-30-day rates are very consistent, except for the #4 account, which might be a hybrid human-bot account (or multiple users sharing one account) with an average of 355 daily posts. In addition, the #4 account was created within 30 days of our collection period, so its last-30-day activity rate is missing.

The SMART dashboard also provides data for word clouds, which can offer insight into the words featured most prominently by advocates. Figures 11.7 and 11.8 illustrate two sample profiles for advocates, including the five most recent tweets posted by the user and a word cloud generated from the last 3,200 tweets shared. Figure 11.7 illustrates the activities of a human advocate (#1), and Fig. 11.8 illustrates the activities of a potential cyborg account (#4).
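The three activity rates described above can be computed from an account's creation date, its lifetime post count, and the timestamps of its recent posts. The sketch below assumes one plausible reading of those definitions (lifetime posts divided by account age in days; posts in the last 7 or 30 days divided by 7 or 30, left undefined when the account is younger than the window). The function and argument names are hypothetical, and the dashboard's actual computation may differ.

```python
from datetime import datetime, timedelta


def activity_rates(created_at, total_posts, post_times, now):
    """Return average posts per day over the account lifetime, last 7 and last 30 days.

    created_at  -- datetime the account was created
    total_posts -- lifetime number of posts for the account
    post_times  -- datetimes of the recent posts that were actually retrieved
    now         -- reference datetime for the "last N days" windows
    """
    # Guard against division by a near-zero age for brand-new accounts.
    age_days = max((now - created_at).total_seconds() / 86400.0, 1.0)

    def recent_rate(days):
        # A window longer than the account's age has no meaningful rate.
        if now - created_at < timedelta(days=days):
            return None
        cutoff = now - timedelta(days=days)
        count = sum(1 for t in post_times if cutoff < t <= now)
        return count / days

    return {
        "daily": total_posts / age_days,
        "last_7_days": recent_rate(7),
        "last_30_days": recent_rate(30),
    }
```

A large gap between the lifetime rate and the recent-window rates, or an unusually high lifetime rate, is the kind of signal that would prompt a closer look at whether an account is automated or shared.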
11.4 Conclusion
Social media analytics provide opportunities for collecting data that can be used in disease surveillance and improving understanding of behavioral determinants of diseases. The SMART dashboard provides one tool to advance some of these
Fig. 11.8 Activity rate for a cyborg advocates Twitter account (#4)
opportunities by serving as a means of public health-related belief surveillance, bot detection, and identification of campaign targets and informational needs of different communities in real time. This project extends research in the area of e-cigarettes and social media analytics by gathering geo-tagged tweets and offering spatiotemporal insight into social media messages posted about e-cigarettes. Specifically, we are able to track beliefs and risk perceptions about e-cigarettes, bot activity, and campaign targets with a spatiotemporal view. Linking time and place together using the M3D model in this manner allows discovery of important patterns that illuminate understanding of disease transmission and social media activities. We have demonstrated the use of the SMART dashboard in this capacity within the context of e-cigarettes as an example of how public health practitioners and researchers may consider using this tool in the future for other public health issues. The individual level of the M3D model can be used to study the motivation and skills of e-cigarette advocates.

The use of the SMART dashboard for these purposes, however, does present certain challenges that merit discussion. The first challenge relates to the issue of privacy and its protection, which remains a significant concern for all social media analytic tools. Throughout the process of developing the SMART dashboard, our team has striven to protect the privacy of individuals using social media. Specifically, we only gather public tweets made available through the public Twitter APIs. We also provide a Privacy Policy inviting concerned users to reach out: “If you have any concerns about the privacy issues in our web applications, please Email us. After verify your information, we will remove specific social media contents based on your requests.” One option for enhancing privacy protection is to assign anonymous IDs to
all users. This approach, however, could undermine research seeking to understand social networks and the attributes that contribute to the creation and diffusion of social media messages. Such challenges will require future researchers to weigh the suitability of social media messages as a data source against the need to protect user privacy.

Acknowledgements This material is based upon work supported by the National Science Foundation under Grant No. 1416509, project titled “Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks” and No. 1634641, “Integrated Stage-based Evacuation with Social Perception Analysis and Dynamic Population Estimation”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors are grateful for contributions from other HDMA team members in the development of the SMART dashboard.
References
Allem, J.-P., & Ferrara, E. (2016). The importance of debiasing social media data to better understand e-cigarette-related attitudes and behaviors. Journal of Medical Internet Research, 18, e6185.
Allem, J.-P., Ferrara, E., Uppu, S. P., Cruz, T. B., & Unger, J. B. (2017). E-cigarette surveillance with social media data: Social bots, emerging topics, and trends. JMIR Public Health Surveill, 3, e8641.
Allen, C., Tsou, M.-H., Aslam, A., Nagel, A., & Gawron, J.-M. (2016). Applying GIS and machine learning methods to Twitter data for multiscale surveillance of influenza. PLOS ONE, 11, e0157734.
Armstrong, M. P., Tsou, M.-H., & Seidl, D. E. (2018). Geoprivacy. In B. Huang (Ed.), Comprehensive geographic information systems (pp. 415–430). Oxford: Elsevier.
Arrazola, R. A., Singh, T., Corey, C. G., et al. (2015). Tobacco use among middle and high school students—United States, 2011–2014. Morbidity and Mortality Weekly Report, 64, 381–385.
Aslam, A. A., Tsou, M.-H., Spitzberg, B. H., et al. (2014). The reliability of tweets as a supplementary method of seasonal influenza surveillance. Journal of Medical Internet Research, 16, e250.
Barkhuus, L., & Tashiro, J. (2010). Student socialization in the age of facebook. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 133–142). New York, NY, USA.
Barrington-Trimis, J. L., Urman, R., Leventhal, A. M., et al. (2016). E-cigarettes, cigarettes, and the prevalence of adolescent tobacco use. Pediatrics. https://doi.org/10.1542/peds.2015-3983
Bessi, A., Coletto, M., Davidescu, G. A., Scala, A., Caldarelli, G., & Quattrociocchi, W. (2015). Science versus conspiracy: Collective narratives in the age of misinformation. PLOS ONE, 10, e0118093.
Broniatowski, D. A., Jamison, A. M., Qi, S., AlKulaib, L., Chen, T., Benton, A., Quinn, S. C., & Dredze, M. (2018). Weaponized health communication: Twitter Bots and Russian trolls amplify the vaccine debate. American Journal of Public Health, 108, 1378–1384.
Choi, K., & Forster, J. (2013). Characteristics associated with awareness, perceptions, and use of electronic nicotine delivery systems among young US Midwestern adults. American Journal of Public Health, 103, 556–561.
Chou, W.-Y. S., Oh, A., & Klein, W. M. P. (2018). Addressing health-related misinformation on social media. JAMA, 320, 2417–2418.
Cobb, N. K., Byron, M. J., Abrams, D. B., & Shields, P. G. (2010). Novel nicotine delivery systems and public health: The rise of the e-cigarette. American Journal of Public Health, 100, 2340–2342.
Cole-Lewis, H., Varghese, A., Sanders, A., Schwarz, M., Pugatch, J., & Augustson, E. (2015). Assessing electronic cigarette-related tweets for sentiment and content using supervised machine learning. Journal of Medical Internet Research, 17, e4392.
Collins, L., Glasser, A. M., Abudayyeh, H., Pearson, J. L., & Villanti, A. C. (2019). E-cigarette marketing and communication: How e-cigarette companies market e-cigarettes and the public engages with e-cigarette information. Nicotine & Tobacco Research, 21, 14–24.
Dawkins, R. (2016). The selfish gene. Oxford University Press.
de Lima Salge, C. A., & Berente, N. (2017). Is that social bot behaving unethically? Communications of the ACM, 60, 29–31.
Duggan, M., & Brenner, J. (2013). The demographics of social media users, 2012. Pew Research Center’s Internet & American Life Project, Washington, DC.
Ferrara, E., Varol, O., Davis, C., Menczer, F., & Flammini, A. (2016). The rise of social bots. Communications of the ACM, 59, 96–104.
Garrett, B., Murphy, S., Jamal, S., MacPhee, M., Reardon, J., Cheung, W., Mallia, E., & Jackson, C. (2019). Internet health scams—Developing a taxonomy and risk-of-deception assessment tool. Health and Social Care in the Community, 27, 226–240.
Han, S. Y., Tsou, M.-H., & Clarke, K. C. (2018). Revisiting the death of geography in the era of Big Data: The friction of distance in cyberspace and real space. International Journal of Digital Earth, 11, 451–469.
Huang, J., Kornfield, R., Szczypka, G., & Emery, S. L. (2014). A cross-sectional examination of marketing of electronic cigarettes on Twitter. Tob Control, 23, iii26–iii30.
Issa, E., Tsou, M.-H., Nara, A., & Spitzberg, B. (2017). Understanding the spatio-temporal characteristics of Twitter data with geotagged and non-geotagged content: Two case studies with the topic of flu and Ted (movie). Annals of GIS, 23, 219–235.
Kim, J.-N. (2018). Digital networked information society and public health: Problems and promises of networked health communication of lay publics. Health Communication, 33, 1–4.
Kim, K.-S., Sin, S.-C. J., & Tsai, T.-I. (2014). Individual differences in social media use for information seeking. The Journal of Academic Librarianship, 40, 171–178.
Kim, I.-H., Feng, C.-C., Wang, Y.-C., Spitzberg, B. H., & Tsou, M.-H. (2017b). Exploratory spatiotemporal analysis in risk communication during the MERS outbreak in South Korea. The Professional Geographer, 69, 629–643.
Kim, A., Miano, T., Chew, R., Eggers, M., & Nonnemaker, J. (2017a). Classification of Twitter users who tweet about e-cigarettes. JMIR Public Health Surveill, 3, e8060.
King, B. A., Alam, S., Promoff, G., Arrazola, R., & Dube, S. R. (2013). Awareness and ever-use of electronic cigarettes among U.S. adults, 2010–2011. Nicotine & Tobacco Research, 15, 1623–1627.
Kopp, C., Korb, K. B., & Mills, B. I. (2018). Information-theoretic models of deception: Modelling cooperation and diffusion in populations exposed to “fake news.” PLOS ONE, 13, e0207383.
Martinez, L. S., Hughes, S., Walsh-Buhi, E. R., & Tsou, M.-H. (2018). Okay, we get it. You vape: An analysis of geocoded content, context, and sentiment regarding e-cigarettes on Twitter. Journal of Health Communication, 23, 550–562.
Martinez, L. S., Spitzberg, B. H., Tsou, M. H., Issa, E., & Peddecord, M. (2017). Vax Populi: The social [media](de) construction of public health policy. In The International Communication Association. San Diego, CA.
Nagel, A. C., Tsou, M.-H., Spitzberg, B. H., et al. (2013). The complex relationship of realspace events and messages in cyberspace: Case study of influenza and pertussis using tweets. Journal of Medical Internet Research, 15, e237.
Rachel, G., Neal, B., & Glantz, S. A. (2014). E-cigarettes. Circulation, 129, 1972–1986.
Rainie, H., Anderson, J. Q., & Albright, J. (2017). The future of free speech, trolls, anonymity and fake news online. Pew Research Center, Washington, DC.
Rich, M. D. (2018). Truth decay: An initial exploration of the diminishing role of facts and analysis in American public life. Rand Corporation.
11 A Case Study in Belief Surveillance, Sentiment Analysis …
215
Sanguinet, M. E. (2016). Hashtags, tweets and movie receipts: Social media analytics in predicting box office hits. San Diego State University. Schlaile, M. P., Knausberg, T., Mueller, M., & Zeman, J. (2018). Viral ice buckets: A memetic perspective on the ALS Ice Bucket Challenge’s diffusion. Cognitive Systems Research, 52, 947– 969. Sharag-Eldin, A., Ye, X., & Spitzberg, B. (2018). Multilevel model of meme diffusion of fracking through Twitter. Chin Sociol Dialogue, 3, 17–43. Shi, X., Xue, B., Tsou, M.-H., Ye, X., Spitzberg, B., Gawron, J. M., Corliss, H., Lee, J., & Jin, R. (2019). Detecting events from the social media through exemplar-enhanced supervised learning. International Journal of Digital Earth, 12, 1083–1097. Shin, J., Jian, L., Driscoll, K., & Bar, F. (2018). The diffusion of misinformation on social media: Temporal pattern, message, and source. Computers in Human Behavior, 83, 278–287. Spitzberg, B. H. (2014). Toward a Model of Meme Diffusion (M3D). Communication Theory, 24, 311–339. Stella, M., Ferrara, E., & Domenico, M. D. (2018). Bots increase exposure to negative and inflammatory content in online social systems. Proceedings of the National Academy of Sciences, 115, 12435–12440. Sutton, J. (2018). Health Communication trolls and bots versus public health agencies’ trusted voices. American Journal of Public Health, 108, 1281–1282. Tsou, M.-H., Jung, C.-T., Allen, C., Yang, J.-A., Gawron, J.-M., Spitzberg, B. H., & Han, S. (2015). Social media analytics and research test-bed (SMART dashboard. Proceedings of International Conference on Social Media (pp. 1–7). Association for Computing Machinery. Tsou, M.-H., Jung, C.-T., Allen, C., Yang, J.-A., Han, S. Y., Spitzberg, B. H., & Dozier, J. (2017). Building a real-time geo-targeted event observation (Geo) viewer fosr disaster management and situation awareness. In International Cartographic Conference (pp. 85–98). Vicario, M. 
D., Bessi, A., Zollo, F., Petroni, F., Scala, A., Caldarelli, G., Stanley, H. E., & Quattrociocchi, W. (2016). The spreading of misinformation online. Proceedings of the National Academy of Sciences, 113, 554–559. Wang, Z., & Ye, X. (2018). Social media analytics for natural disaster management. International Journal of Geographical Information Science, 32, 49–72. Wang, Z., & Ye, X. (2019). Space, time, and situational awareness in natural hazards: A case study of Hurricane Sandy with social media data. Cartography and Geographic Information Science, 46, 334–346. Westerman, D., Spence, P. R., & Van Der Heide, B. (2014). Social media as information source: Recency of updates and credibility of information. Journal of Computer-Mediated Communication, 19, 171–183. Wright, M. F., & Li, Y. (2011). The associations between young adults’ face-to-face prosocial behaviors and their online prosocial behaviors. Computers in Human Behavior, 27, 1959–1962. Yang, J.-A., Tsou, M.-H., Jung, C.-T., Allen, C., Spitzberg, B. H., Gawron, J. M., & Han, S.Y. (2016). Social media analytics and research testbed (SMART): Exploring spatiotemporal patterns of human dynamics with geo-targeted social media messages. Big Data & Society, 3, 2053951716652914. Ye, X., Sharag-Eldin, A., Spitzberg, B., & Wu, L. (2018). Analyzing public opinions on death penalty abolishment. Chin Sociol Dialogue, 3, 53–75.
Chapter 12
Placing Community: Exploring Racial/Ethnic Community Connection Within and Between Racial/Ethnic Neighborhoods
Joseph Gibbons
J. Gibbons, Department of Sociology, San Diego State University, San Diego, CA, USA. e-mail: [email protected]
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_12
12.1 Introduction
The ethnic neighborhood, marked by a largely homogenous racial/ethnic population, has long been associated with community connection for people of color (Bois, 1899; Lin, 1998; Small, 2004; Whyte, 1943; Zhou, 1992). A strong sense of community connection increases the likelihood that residents will trust their neighbors and be willing to work with them to address local issues (Sampson, 2012). It can be drawn on for a number of benefits, from offsetting feelings of discrimination (Hunt et al., 2007) and managing local crime (Sharkey, 2018) to looking out for one another’s health (Klinenberg, 2003).
Despite this association, there is also evidence that ethnic neighborhoods do not always have strong community connection. They have been found to be sites of intense social alienation and abandonment for people of color (Anderson, 1999; Black, 2010; Klinenberg, 2003; Sampson, 2012). Complicating matters, ethnic communities have been found outside of areas that are considered ethnic neighborhoods (Foster et al., 2015).
One reason for these conflicting findings is that the existing methodologies used to study the association between racial/ethnic minorities and their neighborhoods do not sufficiently account for space. Hierarchical linear modelling (HLM) is a principal method for determining how minorities relate to their neighborhoods (Bakker & Dekker, 2012; Gibbons & Yang, 2015; Laurence, 2011; Wu et al., 2011). Neighborhood effects under HLM are assumed to be discrete and mutually independent across space (Yang & Matthews, 2012). This is problematic for several reasons. First, HLM relies on administrative units like Census tracts to identify neighborhoods. Administrative units fall victim to the Modifiable Areal Unit Problem (MAUP): their boundaries are drawn arbitrarily and need not reflect local perceptions (Vogel, 2016). With administrative data, racial/ethnic neighborhoods could be misidentified or overlooked outright (Small, 2004). Second, even if neighborhoods are correctly identified, HLM misses the underlying spatial structure of community. A stationary approach to measuring community overlooks the possibility that it can vary in intensity within a neighborhood as well as extend beyond the neighborhood’s boundaries.
To better understand the local strength of racial/ethnic community, I propose Geographically Weighted Regression (GWR). GWR is an exploratory method for evaluating the spatial heterogeneity of regression models (Fotheringham et al., 2003). A chief advantage of GWR is that it does not require administrative units to measure local effects; instead, it estimates them by weighting the observations for a resident against those of other nearby residents. In short, I can determine how the relationship between race/ethnicity and community connection varies across space. Through GWR, I will visually compare this spatial variation to racial/ethnic neighborhoods identified through Census tracts to determine how well they correspond (Gibbons & Schiaffino, 2016).
12.2 Background
A sense of community is a sense of emotional security: a space where one feels they belong and can rely on peers. It can be distilled into a sense of belonging to the community, trust in other community members, and cooperation. For ethnic groups, this sense of community is derived from their shared racial and ethnic background, including shared experiences (Anderson, 1999; Ray, 2014). America’s history of racial/ethnic discrimination against minority groups has contributed to the need for this belongingness: people of color pool together in response to outside hostility.
Neighborhoods are seen as a key source of community because spatial proximity affords regular day-to-day interaction among residents (Tilly, 2010). Neighborhood community is born out of regular interactions from walking down the street (Jacobs, 1961), regularly visiting the local market (Saegert et al., 2002), visiting the same local bars (Oldenburg, 1989), or worshipping at the same church (Lin, 1998, 2011; Ray, 2014). For ethnic neighborhoods, the spatial proximity of people of the same racial/ethnic background had a role in fostering this community (Heringa et al., 2018). Indeed, ethnic neighborhoods had a reputation well into the twentieth century as population centers for these groups (Bois, 1899; Suttles, 1968; Whyte, 1943). Park et al. (1921) once observed a Chicago neighborhood dominated by Italians who had immigrated from the same Sicilian village.
However, since the mid-twentieth century, it has been argued that the neighborhood has lost its central role in the formation of community (Wellman, 1979). A key component of this change is a shift in the spatial nature of relationships. American cities have seen a large-scale shift from dense center cities to more sprawled, car-centric suburban areas, which lack spaces where residents have the chance to spontaneously meet and interact. As a result, walkable spatial proximity has a less central role in community formation than it once had. Instead, people drive to their communities (Jackson, 1985; Putnam, 2000). To this end, research from recent decades has found that racial/ethnic minorities travel some distance by car to visit cultural centers such as community nonprofits or churches (Chung, 2007; McRoberts, 2003). This decentered way of living has been linked with reduced community in neighborhoods. McRoberts’s (2003) research on a non-Hispanic Black (henceforth Black) church, in particular, found that the parishioners who commuted in for worship felt considerable social distance from the local Black population outside the church walls, even though the church was located in a Black population center.
However, it is still not clear, overall, how much the importance of racial/ethnic neighborhoods for communities has truly diminished. Research has found a strong association between non-Whites and predominately non-White neighborhoods (Gibbons & Yang, 2015). One reason for the continued endurance of racial/ethnic neighborhoods as a site of minority community could be residential segregation (Uslaner, 2011). The existing discussion on lost local community has either explicitly or implicitly assumed a non-Hispanic White (henceforth White) population that has been able to move to less dense suburban communities with relative ease (Jackson, 1985; Putnam, 2000). Many non-White residents, in particular those who are Black, endure discrimination in the housing market, including from home sellers, realtors, and financiers, which largely confines their residential options to predominately Black or mixed Black neighborhoods (Massey & Denton, 1993). A result of this de facto residential segregation is the continued relevance of the spatial proximity found in neighborhoods for community formation.
There are some questions to be raised about the spatial nature of segregation and non-White communities. For one, can this racial/ethnic sense of neighborhood spill over into areas adjacent to an ethnic neighborhood? Classic work on ethnic neighborhoods found that these populations can rigidly enact a sense of local boundaries in their day-to-day lives, for example by ignoring nearby stores that fall outside their neighborhood’s perceived boundary (Suttles, 1968). Next, while mobility for some non-Whites can be limited, the boundaries of their neighborhoods do change over time, even if only in a gradual, block-by-block fashion (Sugrue, 1996). Past research has shown that new minorities in White neighborhoods can be met with considerable hostility (Reider, 1987; Sugrue, 1996), as an extension of neighborhood boundary maintenance (Suttles, 1968). Is the sense of community for minorities weaker in border/transition areas? Lastly, is a sense of community uniformly consistent across ethnic neighborhoods? The previous research on ethnic neighborhoods implicitly assumes that community strength will be uniform throughout a neighborhood, overlooking the potential for an underlying spatial structure of this community.
Another consideration is how consistently residential segregation and community carry across different racial/ethnic groups. Black populations have the highest degree of residential segregation in the United States (Logan et al., 2011), and this segregation is more likely to take on a clustered character than that of other minority groups (Lee et al., 2008). Hispanic and Asian populations, meanwhile, are argued to have a greater opportunity to self-select into ethnic neighborhoods than their Black peers (Iceland & Nelson, 2008; Zhou, 1992). This could result in a stronger sense of ethnic community in these neighborhoods than for Black populations.
Residential segregation is also related to concentrated structural disadvantages for non-White neighborhoods, including institutional neglect, which can erode social connectivity (Klinenberg, 2003; Sampson, 2012; Small, 2004). These disruptive effects may not so much eliminate networks as make individual connections far more transactional and ‘disposable’ (Desmond, 2012). The loss of robust social connections can, in turn, make it difficult for residents to form a shared sense of priorities, undermining local community (Sampson, 2012). As such, the sense of community found in non-White neighborhoods may be weaker than that found elsewhere because of the segregating forces that contributed to their formation.
Another point of debate is how strong community can be outside of central cities. Recall that suburbs are argued to lack spatially concentrated community due to the dependence on cars for travel, lower housing density, and a lack of places to meet (Oldenburg, 1989; Putnam, 2000). What is more, they are commonly thought to be predominately White (Jackson, 1985). However, suburbs have not only been found to have robust community (Gans, 1967), but also to have significant non-White populations (Brown, 2007; Pattillo-McCoy, 2000; Woldoff, 2011). For example, Hispanic populations across suburban and rural North Carolina have been found to have patterns of community similar to those of their peers in cities (Flippen & Parrado, 2012). One key difference of these suburban/rural communities is that their boundaries are more fluid than those of their urban peers, which may lead to differences from cities (Foster et al., 2015).
12.3 Data and Methods
12.3.1 Data Source
To empirically examine how race/ethnicity relates to neighborhood community connection, I focused on the 8,305 adult respondents from the Public Health Management Corporation (PHMC) survey in the five counties that comprise the Philadelphia metropolitan area: Philadelphia (coterminous with the city), Bucks, Chester, Delaware, and Montgomery. The PHMC has been found to be both reliable and valid. A recent study reported that several socioeconomic indicators drawn from PHMC data were comparable with those estimated by the Centers for Disease Control and Prevention (Gibbons & Yang, 2015). The PHMC calculates balancing weights that adjust for sampling bias and non-response, which I use in my multivariate analysis (PHMC, 2015).
Philadelphia is an ideal location for this study. The metropolitan area has a clearly defined core city, Philadelphia, surrounded by a decentered suburban area, allowing for a comparison of spatial patterns between these places. Also, the metropolitan area is characterized by a high degree of racial segregation between Whites and non-Whites (Logan, 2011).
Neighborhoods are proxied by Census tracts, using the 2010–2014 five-year estimates of the American Community Survey (ACS). Delineated by the U.S. Census Bureau, Census tracts are the commonly used proxy for neighborhoods based on administrative data (Massey & Denton, 1993). While the PHMC does not make respondent addresses available, it does offer geocodes for Census tracts. To simulate the spatial dispersion of respondents within their respective tracts, I used ArcGIS 10.4 to randomly generate a set of coordinates that fall within each respondent’s tract, using a method introduced by Yang and Matthews (2012). Multiple coordinates were generated for each respondent to conduct a sensitivity analysis on the accuracy of this method; the results are available upon request.
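The random placement of respondents inside their tracts can be illustrated with a short sketch. This is not the ArcGIS 10.4 procedure used in the chapter; it is a plain-Python stand-in that assumes rejection sampling from the tract's bounding box. The function names and the L-shaped `tract` outline are hypothetical.

```python
import random


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order around the boundary.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_in_tract(polygon, rng):
    """Draw uniform points from the bounding box until one falls inside."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    while True:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):
            return x, y


# Hypothetical L-shaped tract outline in arbitrary projected units.
tract = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
rng = random.Random(42)  # fixed seed so the simulated locations are reproducible

# Several candidate locations for one respondent, as a sensitivity check would need.
candidates = [random_point_in_tract(tract, rng) for _ in range(5)]
```

Repeating the draw per respondent, as in the last line, gives alternative coordinate sets on which the same model can be re-estimated to see how much the results depend on the simulated locations.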
12.3.2 Measurements
My dependent variable, community connection, was derived through principal components analysis (PCA) of three survey items. The first is the perceived Willingness of neighbors to help: “Would you say that most people in your neighborhood are always, often, sometimes, rarely, or never willing to help their neighbors?”, coded on a scale from 5 (always) to 1 (never) (loading 78%). The second is the feeling of Belonging to the neighborhood: “Do you strongly agree, agree, disagree, or strongly disagree that you belong and are part of your neighborhood?”, coded from 4 (strongly agree) to 1 (strongly disagree) (loading 77%). The third is Trust of neighbors: “Do you strongly agree, agree, disagree, or strongly disagree with the statement that most people in your neighborhood can be trusted?”, also coded from 4 (strongly agree) to 1 (strongly disagree) (loading 75%). The PCA results suggested that one factor is sufficient to capture over 60% of the variance among these three questions. I used the regression method to obtain the factor score as my dependent variable (with a mean of 0 and a standard deviation of 1). A higher score indicates stronger community connection. Cronbach’s alpha was 0.6783, suggesting modest internal consistency for this measure.
My independent variables (see Table 12.1) include dichotomous measures of race/ethnicity, classified into White (reference group), Black, Hispanic, and Other non-Hispanic minorities; educational attainment, classified as no high school (reference group), high school graduate, some college, and bachelor’s degree or greater; marital status, categorized into single (reference group), married or living with a partner, widowed/divorced/separated (WDS), and other marital status; self-rated health, coded as poor/fair = 1 and good/very good/excellent = 0; gender (reference female); living below the federal poverty line; employment status, categorized into unemployed (reference group), full-time employed, part-time employed, retired, and other employment status; and homeownership. In addition to these dichotomous measures, I include the continuous measures of self-reported Age and Group Membership. Group membership is the total number of local groups that a respondent participates in: “How many local groups or organizations in your neighborhood do you currently participate in such as social, political, religious, school-related, or athletic organizations?”.
Table 12.1 Descriptives of independent variables (N = 8,305)
Variable                                          Mean     Standard deviation
Demographics
  Race/ethnicity
    White                                         0.708    0.455
    Black                                         0.199    0.399
    Hispanic                                      0.052    0.222
    Other non-Hispanic                            0.041    0.198
  Age                                             53.188   15.024
  Self-rated health (1 = poor/fair, 0 otherwise)  0.371    0.483
  Gender (males = 1, females = 0)                 0.611    0.488
  Marital status
    Single                                        0.232    0.422
    Married/living with a partner                 0.564    0.496
    Widowed/divorced/separated                    0.193    0.395
    Other marital status                          0.010    0.099
Socioeconomic status
  Poverty (yes = 1, no = 0)                       0.104    0.305
  Employment status
    Unemployed                                    0.136    0.343
    Full-time employed                            0.486    0.500
    Part-time employed                            0.126    0.332
    Retired                                       0.213    0.409
    Other employment status                       0.039    0.194
  Educational attainment
    No high school                                0.050    0.218
    High school graduates                         0.254    0.436
    Some college                                  0.214    0.410
    Bachelor’s degree or greater                  0.482    0.500
Residential attainment
  Homeowner (yes = 1, no = 0)                     0.777    0.416
  Group membership                                1.670    6.284
In addition to the PHMC variables used in my analysis, I use the ACS data to adopt a typology of racial composition used in previous HLM studies to identify how individual community connection relates to neighborhood-level measures (Gibbons & Yang, 2015). I will visually compare my individual-level GWR findings with tract data to determine how well racial/ethnic associations with community connection correspond to ethnic neighborhoods identified with administrative data (Gibbons & Schiaffino, 2016). Tract-level racial/ethnic neighborhoods include Mostly White neighborhoods, defined as tracts that are at least 60% White, with no minority group representing more than 20%. Next, Mostly Black neighborhoods have at least 50% Black residents and no more than 20% of another racial/ethnic group. Mostly non-Black minority neighborhoods consist of at least 50% Hispanic or Asian residents and no more than 20% Black residents. Finally, Mixed neighborhoods are the tracts that cannot be classified into any of the types above.
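The tract typology can be written as a small decision rule. The sketch below is illustrative only: `classify_tract` is a hypothetical helper, shares are assumed to be proportions between 0 and 1, the "Hispanic or Asian" threshold is assumed to apply to the two groups combined, and "another racial/ethnic group" is assumed to mean each other group taken separately. The chapter does not spell out these details.

```python
def classify_tract(white, black, hispanic, asian):
    """Return the neighborhood type for one tract.

    Arguments are population shares as proportions (0 to 1).
    Assumptions (not confirmed by the chapter): Hispanic and Asian shares are
    summed for the non-Black minority rule, and the 20% caps apply to each
    other group individually.
    """
    if white >= 0.60 and max(black, hispanic, asian) <= 0.20:
        return "Mostly White"
    if black >= 0.50 and max(white, hispanic, asian) <= 0.20:
        return "Mostly Black"
    if hispanic + asian >= 0.50 and black <= 0.20:
        return "Mostly non-Black minority"
    return "Mixed"


# Hypothetical tracts, one per category.
examples = {
    "tract_a": classify_tract(white=0.85, black=0.05, hispanic=0.05, asian=0.05),
    "tract_b": classify_tract(white=0.10, black=0.80, hispanic=0.05, asian=0.05),
    "tract_c": classify_tract(white=0.25, black=0.10, hispanic=0.55, asian=0.10),
    "tract_d": classify_tract(white=0.45, black=0.30, hispanic=0.15, asian=0.10),
}
```

Because the rules are checked in order and fall through to Mixed, every tract receives exactly one label.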
12.3.3 Analytic Methods and Strategy
The main analysis for this study takes place in two steps. I start with a conventional, or ‘global’, ordinary least squares (OLS) model. Next, I conduct GWR with the software GWR4 to examine how the relationships between the predictors and community connection vary across the Philadelphia metropolitan area (Fotheringham et al., 2003). GWR is an extension of generalized regression models, meaning that GWR coefficients are interpreted in the same way as OLS coefficients (Brunsdon et al., 1998, 2008; Chen & Yang, 2012). However, while a global OLS coefficient reports the overall association between a variable and community connection, GWR produces a set of coefficients for each of the 8,305 respondents in the PHMC. This enables me to detect variations in the magnitude and significance of the associations. GWR coefficients are determined through an iteratively reweighted least squares estimation strategy. Essentially, the parameters for one respondent are estimated from nearby respondents, with closer neighbors weighted more heavily. Specifically, I used a bi-square weighting function, which is a commonly used weighting scheme (Fotheringham et al., 2003). The sensitivity, or bandwidth, of this weighting is determined by GWR4 through its ‘golden section search’ function. To explore how ethnic community relates to ethnic neighborhoods, I will then visually compare the resulting GWR coefficients to Census tract-level data (Gibbons & Schiaffino, 2016).
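The idea of a locally weighted estimate can be sketched in a few lines. This is a toy illustration, not the GWR4 implementation: it assumes a fixed distance bandwidth, a single predictor, and made-up observations, and the names `bisquare_weight` and `local_fit` are hypothetical.

```python
import math


def bisquare_weight(distance, bandwidth):
    """Bi-square kernel: 1 at the focal point, falling to 0 at the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    return (1.0 - (distance / bandwidth) ** 2) ** 2


def local_fit(focal, observations, bandwidth):
    """Weighted least squares of y on one predictor x around a focal location.

    observations is a list of (east, north, x, y) tuples.
    Returns (intercept, slope) for the focal location.
    """
    fx, fy = focal
    weights = [bisquare_weight(math.hypot(e - fx, n - fy), bandwidth)
               for e, n, _, _ in observations]
    total = sum(weights)
    x_bar = sum(w * x for w, (_, _, x, _) in zip(weights, observations)) / total
    y_bar = sum(w * y for w, (_, _, _, y) in zip(weights, observations)) / total
    s_xx = sum(w * (x - x_bar) ** 2
               for w, (_, _, x, _) in zip(weights, observations))
    s_xy = sum(w * (x - x_bar) * (y - y_bar)
               for w, (_, _, x, y) in zip(weights, observations))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Made-up data: x is a 0/1 group indicator, y is a connection score.
# In the western cluster the group scores lower; in the eastern one, higher.
data = [
    (0, 0, 0, 1.0), (1, 0, 1, 0.5), (0, 1, 0, 1.0), (1, 1, 1, 0.5),
    (10, 0, 0, 1.0), (11, 0, 1, 1.5), (10, 1, 0, 1.0), (11, 1, 1, 1.5),
]

west_fit = local_fit((0.5, 0.5), data, bandwidth=3.0)
east_fit = local_fit((10.5, 0.5), data, bandwidth=3.0)
```

With this toy data a single global slope would be zero, while the two local fits recover a negative slope in the west and a positive slope in the east, which is the kind of spatial variation GWR is designed to expose.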
12.4 Results
12.4.1 Racial/Ethnic Distribution of Philadelphia
The distribution of racial/ethnic minorities in the Philadelphia metropolitan area is reported in Fig. 12.1. Based on the classifications from the typology, reported in the top left panel, White and non-White neighborhoods are clearly defined, indicating the racial/ethnic division of the area. The mostly Black neighborhoods are largely confined to municipal Philadelphia or immediately adjacent areas. This supports previous research on residential segregation, which highlighted the spatially clustered segregation of Black populations (Lee et al., 2008). The two primary Black neighborhoods of Philadelphia are West Philadelphia and North Philadelphia. These have been predominately Black neighborhoods for roughly 60 years (Anderson, 1990, 1999; Baltzell, 1967; Bois, 1899; Henderson, 2015; Hernandez, 2005). North Philadelphia has a larger middle-class population than West Philadelphia (Anderson, 1990, 1999). Meanwhile, the predominately non-Black minority neighborhoods are also largely confined to the city. These communities are historically Puerto Rican (Hernandez, 2005). The suburban regions outside of Philadelphia are largely White.
I use Kernel Density Estimation (KDE) to explore how well the Census tract-defined ethnic neighborhoods correspond to the concentration of racial/ethnic minorities in the PHMC data. As reported in the remaining panels of Fig. 12.1, the KDE of Black respondents from the PHMC survey shows that their locations largely correspond to the predominantly Black neighborhoods reported by the ACS, albeit with some spillover into adjacent tracts. This further supports past assertions of high segregation and clustering of this population (Lee et al., 2008). Meanwhile, the KDE of the Hispanic population is less consistent with the tract classifications. While the predominately non-Black minority neighborhoods have a high density of Hispanic residents, this population is spread throughout many other parts of the city, including predominately Black and mixed neighborhoods. These findings are in line with the past segregation literature, which argues that Hispanic populations are less segregated than Black populations. The KDE for the non-Hispanic other racial/ethnic group displays similar trends to the Hispanic KDE.
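As a rough illustration of what a kernel density surface measures, the sketch below evaluates a two-dimensional density at chosen locations. It assumes an isotropic Gaussian kernel and made-up respondent coordinates; the chapter does not state which kernel or bandwidth was used, and `gaussian_kde` is a hypothetical helper rather than any GIS tool's function.

```python
import math


def gaussian_kde(location, points, bandwidth):
    """Density estimate at location from a list of (x, y) points.

    Uses an isotropic Gaussian kernel (an assumption for this sketch).
    """
    x, y = location
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2 * len(points))
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2.0 * bandwidth ** 2))
    return norm * total


# Made-up respondent locations: four near the origin and one outlier.
respondents = [(0.0, 0.0), (0.5, 0.2), (-0.3, 0.4), (0.1, -0.6), (5.0, 5.0)]

density_in_cluster = gaussian_kde((0.0, 0.0), respondents, bandwidth=1.0)
density_far_away = gaussian_kde((10.0, 10.0), respondents, bandwidth=1.0)
```

Evaluating the function over a grid of locations and mapping the values would give a surface that can be laid over tract boundaries, which is the comparison described for Fig. 12.1.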
12.4.2 Description of PHMC Data
The descriptive statistics of the predictor variables are presented in Table 12.1. Approximately 71% of respondents identify as White, 20% identify as Black, 5% as Hispanic, and 4% identify with another race/ethnicity. On average, respondents belonged to 1.67 local groups. There was substantial variation in membership: the number of local groups has a standard deviation of 6.28. Turning to the other predictors, the mean age of respondents was 53 years old and 61% were female. Fifty-six percent of respondents were married or cohabitating at the time of the survey, while roughly 20% were widowed, divorced, or separated. The poverty rate among PHMC respondents was about 10%. Employment statistics identify about 49% of respondents as employed full-time, about 13% as employed part-time, and 21% as retired. In terms of educational attainment, almost half of the sample had a bachelor’s degree or higher, while 21% had at least some college and another 25% had a high school degree. A large majority of respondents, 78%, owned their home.
As my dependent variable, community connection, is a mean-centered score, there is little direct interpretation that can be done of the variable itself. However, in Table 12.2, I break the score down by race/ethnicity to provide more context on how community connection varies. The average score of community connection for Whites was 0.281 points above the mean. Meanwhile, all other racial/ethnic groups had negative, or below-mean, scores. For example, Hispanic respondents had a score of −0.587. This suggests racial/ethnic disparities in community connection.
Table 12.2 Community connection score, by race (N = 8,305)
            Mean      Standard deviation
White       0.281     1.256
Black       −0.513    1.460
Hispanic    −0.587    1.453
Other       −0.304    1.474
Average     0.000     1.000
12.4.3 Global OLS Results
Table 12.3 shows the results of the global OLS regression. In keeping with the results from Table 12.2, non-White populations have significantly weaker community connection than White respondents. Hispanic respondents, for example, have a coefficient of −0.637 (P ≤ 0.001). This means that, when controlling for all other variables in the model, the community connection score for a Hispanic resident of the Philadelphia region is 0.637 points lower than a White respondent’s score. Socioeconomic characteristics also carry the expected relations with community connection: poverty has a significant negative relationship (−0.134, P ≤ 0.010), while being married, having a bachelor’s degree or greater, and home ownership have positive relations. Also, group membership is significantly related to community connection. With my next models, I use GWR to determine how the magnitude of these relationships varies across the Philadelphia metropolitan area.
Table 12.3 Global regression
Variables                                         Coefficient   Standard error
Intercept                                         −1.074        0.101            ***
Demographics
  Race/ethnicity (ref.: White)
    Black                                         −0.486        0.036            ***
    Hispanic                                      −0.637        0.067            ***
    Other non-Hispanic                            −0.155        0.075            **
  Age                                             0.021         0.001            ***
  Self-rated health (1 = poor/fair, 0 otherwise)  0.009         0.028
  Gender (males = 1, females = 0)                 0.173         0.028            ***
  Marital status (ref.: single)
    Married/living with a partner                 0.237         0.038            ***
    Widowed/divorced/separated                    −0.204        0.043            ***
    Other marital status                          −0.485        0.158            ***
Socioeconomic status
  Poverty (yes = 1, no = 0)                       −0.134        0.047            **
  Employment status (ref.: unemployed)
    Full-time employed                            −0.013        0.048
    Part-time employed                            −0.403        0.056            ***
    Retired                                       −0.183        0.051            ***
    Other employment status                       −0.411        0.081            ***
  Educational attainment (ref.: no high school)
    High school graduates                         −0.315        0.063            ***
    Some college                                  −0.169        0.064            **
    Bachelor’s degree or greater                  0.253         0.064            ***
Residential attainment
  Homeownership (yes = 1, no = 0)                 0.146         0.036            ***
  Group                                           0.004         0.002            **
(AIC = 26,595.939)
12.4.4 GWR OLS Results
To confirm the appropriateness of GWR for my data, I compare the corrected AIC score of my GWR results, reported in Table 12.4, with that of the global results, reported in Table 12.3 (Fotheringham et al., 2003). When the difference in AICs between two models is larger than 4, the model with the smaller AIC is strongly preferred (Burnham & Anderson, 2002). By this criterion, the GWR model (AIC = 24,367.566) fits my data better than the global model (AIC = 26,595.939), and it should provide more robust answers to my research questions and hypotheses. As the GWR model generates results for each respondent
Table 12.4 Five-number summary of full geographically weighted regression model (Bandwidth = 3,000)
Variable                                          Min       Q1        Median    Q3        Max
Intercept                                         −3.351    −1.974    −0.736    −0.129    1.418
Demographics
  Race/ethnicity (ref.: White)
    Black                                         −1.15     −0.63     −0.369    −0.176    0.171
    Hispanic                                      −1.828    −0.621    −0.221    0.197     1.207
    Other non-Hispanic                            −1.248    −0.301    −0.082    0.105     0.431
  Age                                             0.003     0.013     0.019     0.03      0.046
  Self-rated health (1 = poor/fair, 0 otherwise)  −0.343    −0.082    0.048     0.123     0.328
  Gender (males = 1, females = 0)                 −0.462    −0.037    0.107     0.303     0.647
  Marital status (ref.: single)
    Married/living with a partner                 −0.606    −0.074    0.18      0.402     0.653
    Widowed/divorced/separated                    −0.88     −0.603    −0.367    −0.07     0.255
    Other marital status                          −2.07     −1.264    −0.674    −0.125    0.684
Socioeconomic status
  Poverty (yes = 1, no = 0)                       −1.414    −0.582    −0.172    0.112     0.938
  Employment status (ref.: unemployed)
    Full-time employed                            −0.772    −0.366    0.054     0.327     0.621
    Part-time employed                            −1.291    −0.723    −0.364    −0.17     0.205
    Retired                                       −0.883    −0.392    −0.098    0.021     0.366
    Other employment status                       −1.967    −0.881    −0.504    −0.22     0.362
  Educational attainment (ref.: no high school)
    High school graduates                         −1.229    −0.452    −0.276    −0.03     0.607
    Some college                                  −0.924    −0.406    −0.149    0.085     0.661
    Bachelor’s degree or greater                  −0.502    −0.185    0.16      0.419     0.877
Residential attainment
  Homeownership (yes = 1, no = 0)                 −0.288    0.033     0.126     0.256     1.106
  Group                                           −0.031    0         0.006     0.017     0.243
(AIC = 24,367.566)
Table 12.5 Select significant coefficients from geographically weighted regression model by neighborhood racial/ethnic composition (Bandwidth = 3,000)

                                  Race/ethnicity coefficients (Ref. White)
Neighborhood type                 Black        Hispanic     Other non-Hispanic
Predominately black               −8.272       −6.516       8.332
Predominately Hispanic            1.767        −11.099      2.000
Predominately racially mixed      −15.639      −50.085      11.730
Predominately white               −103.991     −23.325      −336.022
in my data, it is impractical to show all local estimates. Following previous studies (Gibbons & Schiaffino, 2016), I report a summary of the GWR results (i.e., minimum, three quartiles, and maximum) in Table 12.4. As Table 12.4 shows, the GWR estimates vary quite dramatically across the metropolitan area. All racial/ethnic coefficients range from negative to positive. The Hispanic coefficients vary most notably, from −1.828 to a high of 1.207. This means that there are areas in the Philadelphia metropolitan area where Hispanic respondents’ community connection score exceeds that of Whites by 1.207. Recall that in the global model Hispanic respondents had a significant negative coefficient of −0.637. The magnitudes of the coefficients for the other measures also vary across space; the poverty coefficient, for example, ranges from −1.414 to 0.938. These results raise several questions about the relationship of race/ethnicity to community connection.
To determine whether ethnic neighborhoods help explain the variations identified with the GWR model, I summarize the race/ethnicity GWR coefficients by neighborhood type in Table 12.5. Unlike in Table 12.4, I report only values that were statistically significant. Black and Hispanic respondents have, on average, negative coefficients in predominantly Black tracts. Interestingly, Blacks have a positive coefficient in non-Black minority tracts. To put these results in perspective, all significant minority coefficients are negative in White neighborhoods, and the magnitude of these associations is much stronger than that found for the negative coefficients in non-White tracts.
To better contextualize my GWR estimates and identify how significant they are across the region, I created maps, reported in Fig. 12.2, using ArcGIS 10.4 depicting spatially smoothed local coefficients for race/ethnicity and organizational membership (Matthews & Yang, 2012).
I then overlaid the local coefficients with spatially smoothed t-values to report only coefficients with a t-value that is greater than 1.96 (p-value < 0.05). That is, the colored areas were estimated to have statistically significant coefficients. I used a red-to-green gradient scheme to show the different magnitudes of the local estimates, with green indicating strong associations and red indicating weak to negative associations.
The resulting maps show how the significance of race/ethnicity varies across the Philadelphia metropolitan area. For Black respondents, the negative coefficients lie mostly outside the city/county of Philadelphia. The most visible area with negative
coefficients coalesces in the suburbs near the city proper, in a mostly White area straddling the borders of Delaware, Chester, and Montgomery Counties. This specific cold spot extends into the western reaches of the city, overlapping with the census tracts that compose a mostly Black section of West Philadelphia. As Whites are the reference group, this means that Black respondents’ community connection in this mostly Black area carries a negative relation compared to Whites. Meanwhile, the predominately Black neighborhood of North Philadelphia has more positive coefficients for Black respondents.
The Hispanic coefficients are presented in Fig. 12.2. The coefficients for Hispanic respondents vary dramatically by location. All the positive significant Hispanic coefficients are located outside of the city of Philadelphia. The largest presence of negative community coefficients for Hispanics can be found in Bucks County and a part of far Northeast Philadelphia. Save for one exception in Montgomery County, no significant Hispanic coefficients overlap with predominantly non-Black minority tracts.
Another trend worth discussing is the relationship between group membership and community connection. Recall that in the global model group membership has a significant and positive coefficient. Figure 12.2, in contrast, identifies a dramatic amount of local variation, from positive to negative group membership coefficients, across the region. Indeed, while for much of the map the relation of group membership to community connection varies from moderately to strongly positive, there is a large area with negative coefficients in the northern section of the metropolitan area, mostly in Bucks County.
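The model comparison, the five-number summary, and the significance filter described in this section can be illustrated with a minimal sketch. It is not the author's workflow (the chapter used ArcGIS for mapping): the function names, the sample coefficients and t-values, the use of the absolute t-value, and the "inclusive" quartile convention are all assumptions made for illustration. Only the two AIC values come from Tables 12.3 and 12.4.

```python
import statistics

# AIC values reported for the global model (Table 12.3) and the GWR model (Table 12.4).
aic_global = 26595.939
aic_gwr = 24367.566
delta_aic = aic_global - aic_gwr  # a difference larger than 4 favours the smaller-AIC model


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of local coefficients.

    The quartile convention ("inclusive") is an assumption; the chapter
    does not state which convention was used.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def mask_insignificant(coefs, t_values, threshold=1.96):
    """Keep a local coefficient only where |t| exceeds the threshold, else None.

    Using the absolute value is an assumption, made so that significant
    negative coefficients are retained as well as positive ones.
    """
    return [c if abs(t) > threshold else None for c, t in zip(coefs, t_values)]


# Hypothetical local estimates for five respondents (illustrative numbers only).
local_coefs = [-1.2, -0.4, 0.1, 0.6, 1.0]
local_t = [-2.5, -1.0, 0.3, 2.1, 1.9]

summary = five_number_summary(local_coefs)
masked = mask_insignificant(local_coefs, local_t)
print(round(delta_aic, 3), summary, masked)
```

With the reported AIC values the difference is roughly 2,228, far above the threshold of 4, which is why the GWR model is preferred here.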
12.5 Discussion and Conclusions
Through GWR, I explored the underlying spatial structures of race/ethnicity’s relationship to community connection. This allowed me to determine how well racial/ethnic community connection corresponds to ethnic neighborhoods identified by administrative units like Census tracts. The GWR models reveal considerable nuance in how community connection is perceived across a segregated metropolitan area.
This nuance can be seen with Philadelphia’s Black community. On the one hand, I find that the location of the Black sample generally corresponds to the ACS tract estimates. This reflects past segregation research, which found a high degree of spatial clustering of Black populations (Lee et al., 2008). On the other hand, my GWR models find that Blacks on average have less community connection than Whites in mostly Black tracts. However, mapping the GWR coefficients reveals that this negative effect is not consistent across these neighborhoods. Local Black community connection is generally positive in North Philadelphia but negative in West Philadelphia. It is not clear from my data why Black residents of these neighborhoods associate with community connection so differently. Past research offers some room for speculation. West Philadelphia has been noted to experience substantial gentrification by Whites in recent decades, which can undermine community strength (Anderson, 1990; Gibbons et al., 2019).
In addition, North Philadelphia has an arguably larger and more established Black middle-class population (Anderson, 1999). It is also important to note that community connection for Blacks is more negative in mostly White areas than in mostly Black areas. Indeed, Black community connection appears weakest in an area in Montgomery and Delaware counties adjacent to the mostly Black population in Philadelphia. This could reflect the influx of Black residents into these places, leading to the kind of hostile relationship with White neighbors found in other cities (Reider, 1987; Sugrue, 1996). Put simply, while we cannot say that Black residence in mostly Black neighborhoods has a uniformly positive association with community, there is less of a community disparity for this group than if they lived in a mostly White neighborhood. While segregation is related to clustered Black neighborhoods, these neighborhoods are not consistently related to Black community.
The association of administratively defined neighborhoods and racial/ethnic community connection is even weaker for other racial/ethnic minorities. Neither Hispanic nor other non-Hispanic racial/ethnic minorities strongly corresponded to predominately non-Black minority areas. Moreover, there is no significant relationship between Hispanic population and community connection in the non-Black minority communities of Philadelphia. These results offer strong evidence that existing Hispanic neighborhoods are not lining up with tract boundaries and are thus being overlooked in demographic research (Small, 2004). However, as with the Black findings, Hispanic community connection is still weaker in predominately White areas. Meanwhile, suburbs contain both the strongest and the weakest Hispanic community connection. The strongest Hispanic community connection is found in Montgomery, Chester, and Delaware counties, while the weakest is in Bucks County.
Why the suburbs of Philadelphia carry such variation for the Hispanic community is not clear. One possible explanation could be ethnic differences among Hispanics. According to the 2011–2015 ACS, while Puerto Ricans constitute 65% of Philadelphia’s Hispanic population, they constitute only 33% of the suburban Hispanic population. Mexicans, the next largest Hispanic group of the region, constitute 34% of the suburban Hispanic population, compared to 9% of Philadelphia’s Hispanic population. Linguistic similarities aside, Mexicans and Puerto Ricans carry a number of cultural differences which could result in disparate community outcomes (Portes & Rumbaut, 2006). Also, the Mexican population is relatively newer to the region compared to Puerto Ricans, which could also impact the strength of their community (Henderson, 2015).
Comparing the Hispanic findings to the Black findings encourages more consideration not only of residential segregation’s influence on cities, but also of how we view the city/suburb divide. The Black population of the Philadelphia metropolitan area, similar to that of other U.S. cities, endures much higher segregation, leading to its clustering in the city of Philadelphia. Perhaps as an extension of this, Black community is generally stronger in the city of Philadelphia, even if its strength in the city is not consistent. The Hispanic population, meanwhile, is less subject to this segregation and able to spread across the region (Lee et al., 2008). Their communities, in turn, take on the kind of fluid character identified in other regions (Foster et al., 2015). The large
size of the suburban Hispanic community could also reflect greater car dependence, with people driving to friends, family, and other community centers (Flippen & Parrado, 2012). In sum, these findings show that while the broad strokes of how segregation and suburbs matter for community discussed in previous research generally hold, how these associations unfold reveals considerable variation. These findings not only point to the limitations in how ethnic neighborhood boundaries are commonly framed, but also call into question how effectively ethnic concentrations determined by administrative data predict minority community connection. Future research should more closely explore how the granular spatial variations of racial/ethnic community connection correspond to racial/ethnic concentration. One way to achieve this would be through the household-level ethnic neighborhood measures advocated by Logan et al. (2011). However, the limited availability of contemporary household-level count data makes such an undertaking on a mass scale prohibitive.
Another key finding is that group membership’s role in community is not constant across space. For much of the region, the association of group membership with community connection varies from moderately positive to strongly positive. However, there are several areas where local group membership lacks significance or has an outright negative relation to community connection. It is difficult to discern why group membership’s role in community connection would vary so much, as I cannot say for sure what groups my respondents are members of. The negative coefficients can largely be found in suburban areas. This could support the argument of previous research pointing to the detrimental influence of suburbs on community organizations (Oldenburg, 1989; Putnam, 2000). However, my findings also show that this adverse effect is not consistent across all suburban areas.
There are several conclusions that can be gleaned from this study.
There is no one-size-fits-all model for understanding how racial/ethnic community corresponds to minority neighborhoods. Not only does the usefulness of aggregate data for identifying non-White neighborhoods vary by ethnic group, but these boundaries also have an inconsistent relation to community formation, which varies greatly within and between tracts. In addition, the urban/suburban divide used to characterize community strength is simply not adequate for understanding the variations in local community. Whether suburbs affect one’s sense of community appears to depend on one’s race/ethnicity and the degree to which that group is segregated. While there seems to be some difference between city and suburban sense of community, these differences do not neatly correspond to Philadelphia’s city/county limits. Tools like GWR present a promising way to unpack these variations.
Fig. 12.1 Racial/ethnic breakdown of the Philadelphia metropolitan area
Fig. 12.2 Select GWR coefficients in the Philadelphia metropolitan area
References
Anderson, E. (1990). Streetwise: Race, class, and change in an urban community. University of Chicago Press.
Anderson, E. (1999). Code of the street: Decency, violence, and the moral life of the inner city. New York, NY: W.W. Norton.
Bakker, L., & Dekker, K. (2012). Social trust in urban neighbourhoods: The effect of relative ethnic group position. Urban Studies, 49, 2031–2047.
Baltzell, E. D. (1967). Introduction to the 1967 Edition. In The Philadelphia Negro (1967 Edn, pp. ix–xliv). Philadelphia, PA: University of Philadelphia Press.
Black, T. (2010). When a heart turns rock solid: The lives of three Puerto Rican brothers on and off the streets. Vintage.
Brown, S. K. (2007). Delayed spatial assimilation: Multigenerational incorporation of the Mexican-origin population in Los Angeles. City & Community, 6, 193–209.
Brunsdon, C., Fotheringham, S., & Charlton, M. (1998). Geographically weighted regression. Journal of the Royal Statistical Society Series, 47, 431–443.
Brunsdon, C., Fotheringham, S., & Charlton, M. (2008). Geographically weighted regression: A method for exploring spatial nonstationarity. Encyclopedia of Geographic Information Science, 558.
Burnham, K., & Anderson, D. (2002). Model selection and multimodel inference: A practical information-theoretic approach. New York, NY: Springer.
Chen, V. Y.-J., & Yang, T.-C. (2012). SAS macro programs for geographically weighted generalized linear modeling with spatial point data: Applications to health research. Computer Methods and Programs in Biomedicine, 107, 262–273.
Chung, A. Y. (2007). Legacies of struggle: Conflict and cooperation in Korean American politics. Stanford University Press.
Desmond, M. (2012). Disposable ties and the urban poor. American Journal of Sociology, 117, 1295–1335.
Du Bois, W. E. B. (1899). The Philadelphia Negro: A social study. University of Pennsylvania Press.
Flippen, C. A., & Parrado, E. A. (2012). Forging Hispanic communities in new destinations: A case study of Durham, North Carolina. City & Community, 11, 1–30.
Foster, K. A., Pitner, R., Freedman, D. A., Bell, B. A., & Shaw, T. C. (2015). Spatial dimensions of social capital. City & Community, 14, 392–409.
Fotheringham, S., Brunsdon, C., & Charlton, M. (2003). Geographically weighted regression: The analysis of spatially varying relationships. John Wiley & Sons.
Gans, H. (1967). The Levittowners: Ways of life and politics in a new suburban community. Columbia University Press.
Gibbons, J., & Yang, T.-C. (2015). Connecting across the divides of race/ethnicity: How does segregation matter? Urban Affairs Review, Online First, 1–28.
Gibbons, J. R., Barton, M. S., & Reling, T. T. (2019). Do gentrifying neighbourhoods have less community? Evidence from Philadelphia. Urban Studies, Online First, 1–21.
Gibbons, J. R., & Schiaffino, M. K. (2016). Determining the spatial heterogeneity underlying racial and ethnic differences in timely mammography screening. International Journal of Health Geographics. https://doi.org/10.1186/s12942-016-0067-3
Henderson, T. (2015). Puerto Rican newcomers seek work, family on the Mainland. Philadelphia, PA: The Pew Charitable Trusts.
Heringa, A., Bolt, G., & Dijst, M. (2018). Path-dependency in segregation and social networks in the Netherlands. Social and Cultural Geography, 19, 668–690.
Hernandez, V. V. (2005). From Pan-Latino enclaves to a community: Puerto Ricans in Philadelphia, 1910–2000. In The Puerto Rican diaspora: Historical perspectives (pp. 88–106). Temple University Press.
Hunt, M. O., Wise, L. A., Jipguep, M.-C., Cozier, Y. C., & Rosenberg, L. (2007). Neighborhood racial composition and perceptions of racial discrimination: Evidence from the black women’s health study. Social Psychology Quarterly, 70, 272–289.
Iceland, J., & Nelson, K. A. (2008). Hispanic segregation in metropolitan America: Exploring the multiple forms of spatial assimilation. American Sociological Review, 73, 741–765.
Jackson, K. T. (1985). Crabgrass frontier: The suburbanization of the United States. Oxford University Press.
Jacobs, J. (1961). The death and life of great American cities. Random House.
Klinenberg, E. (2003). Heat wave: A social autopsy of disaster in Chicago. University of Chicago Press.
Laurence, J. (2011). The effect of ethnic diversity and community disadvantage on social cohesion: A multi-level analysis of social capital and interethnic relations in UK communities. European Sociological Review, 27, 70–89.
Lee, B. A., Reardon, S. F., Firebaugh, G., Farrell, C. R., Matthews, S. A., & O’Sullivan, D. (2008). Beyond the census tract: Patterns and determinants of racial segregation at multiple geographic scales. American Sociological Review, 73, 766–791.
Lin, J. (1998). Reconstructing Chinatown: Ethnic Enclave. University of Minnesota Press, Minneapolis, MN.
Lin, J. (2011). The power of urban ethnic places: Cultural heritage and community life. Routledge.
Logan, J. R., Spielman, S., Xu, H., & Klein, P. N. (2011). Identifying and bounding ethnic neighborhoods. Urban Geography, 32, 334–359.
Logan, J. R. (2011). Separate and unequal: The neighborhood gap for blacks, Hispanics and Asians in Metropolitan America. Providence: Brown University.
Massey, D. S., & Denton, N. A. (1993). American Apartheid: Segregation and the making of the underclass. Harvard University Press.
Matthews, S. A., & Yang, T.-C. (2012). Mapping the results of local statistics: Using geographically weighted regression. Demographic Research, 26, 151–166.
McRoberts, O. M. (2003). Streets of glory: Church and community in a black urban neighborhood. University of Chicago Press.
Oldenburg, R. (1989). The personal benefits. In The great good place: Cafes, coffee shops, bookstores, bars, hair salons, and other hangouts at the heart of a community (pp. 43–65). Da Capo Press.
Park, R., Miller, H. A., & Thompson, K. (1921). Old world traits transplanted: The early sociology of culture. Harper.
Pattillo-McCoy, M. (2000). Black picket fences: Privilege and peril among the black middle class. University of Chicago Press.
PHMC. (2015). 2014–2015 household health survey documentation. Public Health Management Corporation.
Portes, A., & Rumbaut, R. G. (2006). Immigrant America: A portrait. University of California Press, Berkeley, CA.
Putnam, R. D. (2000). Bowling alone: The collapse and revival of American community. Simon & Schuster.
Ray, R. (2014). An intersectional analysis to explaining a lack of physical activity among middle class black women: Intersectionality, physical activity, and middle class black women. Sociology Compass, 8, 780–791.
Reider, J. (1987). Canarsie: The Jews and Italians of Brooklyn against liberalism. Harvard University Press, Cambridge, MA.
Saegert, S., Thompson, J. P., & Warren, M. R. (2002). Social capital and poor communities. Russell Sage Foundation.
Sampson, R. J. (2012). Great American City: Chicago and the Enduring neighborhood effect (1st ed.). University of Chicago Press.
Sharkey, P. (2018). Uneasy peace: The great crime decline, the renewal of city life, and the next war on violence. New York, NY: W.W. Norton.
Small, M. L. (2004). Villa Victoria: The transformation of social capital in a Boston Barrio. University of Chicago Press.
Sugrue, T. J. (1996). The origins of the urban crisis: Race and inequality in Postwar Detroit. Princeton University Press.
Suttles, G. (1968). The social order of the slum. University of Chicago Press.
Tilly, C. (2010). Cities, states, and trust networks: Chapter 1 of cities and states in world history. Theory and Society, 39, 265–280.
Uslaner, E. M. (2011). Trust, diversity, and segregation in the United States and the United Kingdom. Comparative Sociology, 10, 221–247.
Vogel, M. (2016). The modifiable areal unit problem in person-context research. Journal of Research in Crime and Delinquency, 53, 112–135.
Wellman, B. (1979). The community question: The intimate networks of East Yorkers. American Journal of Sociology, 1201–1231.
Whyte, W. F. (1943). Street corner society: The social structure of an Italian Slum. University of Chicago Press.
Woldoff, R. A. (2011). White flight/black flight: The dynamics of racial change in an American neighborhood. Cornell University Press.
Wu, Z., Hou, F., & Schimmele, C. M. (2011). Racial diversity and sense of belonging in urban neighborhoods. City & Community, 10, 373–392.
Yang, T.-C., & Matthews, S. A. (2012). Understanding the non-stationary associations between distrust of the health care system, health conditions, and self-rated health in the elderly: A geographically weighted regression approach. Health & Place, 18, 576–585.
Zhou, M. (1992). Chinatown: The socioeconomic potential of an Urban Enclave. Temple University Press.
Chapter 13
Exploring Gentrification Through Social Media Data and Text Clustering Techniques
Cheng-Chia Huang, Atsushi Nara, Joseph Gibbons, and Ming-Hsiang Tsou
C.-C. Huang: Esri, Redlands, CA, USA
A. Nara (B) · M.-H. Tsou: Department of Geography, Center for Human Dynamics in the Mobile Age, San Diego State University, San Diego, CA, USA
J. Gibbons: Department of Sociology, Center for Human Dynamics in the Mobile Age, San Diego State University, San Diego, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_13

13.1 Introduction and Research Background
Gentrification, a term coined by Glass (1964), originally referred to an urban process that involves property price increases and the replacement of working-class residents by incoming middle classes. Glass used “gentry-fication” to describe the phenomenon in which a new “urban gentry” moves back to inner cities and replaces the existing residents. The term new “urban gentry” describes a group of people who differ from the conventional middle class. Unlike affluent middle-class households, who prefer to live in the suburbs, the new “urban gentry” find urban areas more attractive because of the shorter commute, the higher return on housing investment, or a lifestyle different from that of the mundane suburbs (Lees et al., 2008). Researchers also use different terms to refer to the “urban gentry,” such as “gentrifiers,” “yuppies” (Short, 1989), “hipsters” (Laam, 2011), or “creative workers” (Florida, 2002). Today, although different interpretations exist in the gentrification literature and various types of gentrification have been reported (Barton, 2016; Loretta et al., 2008), gentrification is generally understood as a transformation of a working-class or abandoned area of a city under the influence of redevelopment and an influx of higher-income residents, which involves economic upgrading and replacement of long-term residents who
were often of lower social status. Furthermore, gentrification-related displacement involves both residential and cultural facets (Gibbons & Barton, 2016). Residential displacement refers to low-income residents of a neighborhood being pushed out by increasing housing costs (Anderson, 1990; Betancur, 2011; Loretta et al., 2008), whereas cultural displacement refers to a change of culture in the neighborhood (Freeman, 2006; Sharon, 2011). Researchers have also emphasized that this transformation is spatially and temporally dynamic, since the spatial patterns and characteristics of neighborhoods continue to change during a gentrification process (Gale et al., 1980; Kerstein, 1990; Liu & O’Sullivan, 2016; O’Sullivan, 2002; Pattison et al., 1983; Torrens & Nara, 2007).
Despite the rich gentrification literature and its significant contributions to examining gentrification phenomena, previous studies often focused on either a qualitative or a quantitative approach, and few employed mixed methods to take advantage of both. Incorporating human perception is one major advantage of a qualitative perspective for detecting the cultural shift that results from gentrification. Barton (2016) pointed out that qualitative approaches can confirm not only that the identified areas experienced a natural form of improvement, such as incumbent economic upgrading, but also that they underwent a cultural change. Nevertheless, qualitative approaches are typically labor-intensive and time-consuming when applied to a large area, compared to quantitative approaches. Quantitative approaches, on the other hand, use secondary data and systematic statistical methods to study gentrification; they can be applied to a larger area but may overlook subtle cultural shifts. In recent years, researchers have sought to bridge the gap between qualitative and quantitative methods. For example, Papachristos et al.
(2011) used the number of coffee shops as the cultural measurement of gentrification in their quantitative study. While their method incorporated human perception in a quantitative way, the counts of neighborhood coffee shops reflect only certain changing consumption patterns (Barton, 2016). In other words, the attempt to quantitatively measure cultural shifts did not examine other cultural characteristics of gentrification, such as artists (Sharon, 1989), yoga (Kern, 2012), and nightlife (Hae, 2011; Laam, 2011).
Another issue in previous gentrification studies is that spatial and temporal scales were often either confined or relatively coarse due to limited data availability. For instance, plenty of qualitative studies examined gentrification dynamics at a fine temporal scale but focused on only a single or a few small gentrifying areas (Boyd, 2008; Hamnett & Whitelegg, 2007). Quantitative studies looked into a large area and calculated gentrification indices; nevertheless, the temporal scales of those indices were based on multi-year aggregated data (Ley, 1986; Ley & Dobson, 2008). Therefore, there is a lack of research considering gentrification dynamics at a finer spatiotemporal scale across a large area. Without examining fine spatial and temporal scales across a large area, those studies overlooked the variance of gentrification dynamics during a short period in a large region.
The emergence of location-based social media now provides an opportunity to address these issues in gentrification studies (e.g., Gibbons et al., 2018; Zukin et al., 2015). Social media has both “big” and “deep” characteristics (Sui & Goodchild, 2011). In the twentieth century, social science relied on two types of data: “surface
data” about lots of people and “deep data” about few individuals. “Surface data” is a large volume of data used in disciplines that adopted quantitative methods, such as sociology, economics, political science, and computational geography, to explore a phenomenon on a macro level. In contrast, “deep data,” which records detailed information about a small group of people, has been used in disciplines that adopted qualitative approaches, such as psychology, anthropology, ethnography, and art history, to explore a phenomenon on a micro level (Manovich, 2011). In other words, researchers had to sacrifice detail to construct a macro picture, and vice versa. Today, however, we can follow the opinions, ideas, and feelings of hundreds of millions of people through social media (Manovich, 2011). As a consequence, we no longer need to choose between data volume and data depth (Sui & Goodchild, 2011). Yet, in order to take advantage of “big” and “deep” social media data, there is a need to develop an analytical framework to examine fine-grained gentrification dynamics and to quantify human perception for identifying gentrifying areas across a large area.
With publicly available social media posts, this study develops a data mining framework to identify and characterize gentrifying areas. The framework incorporates human perception to recognize a sense of gentrification through social media posts at a finer spatial and temporal scale, and it can be easily applied to a large region. Specifically, this study examined gentrification dynamics in Salt Lake City, UT by collecting and analyzing Instagram posts. Our methodological approach employed text processing and text mining techniques to explore gentrification ambiance, which facilitates capturing the sense of gentrification.
Gentrification ambiance can be described by, for example, the presence of gentrifiers (e.g., new urban gentry (Glass, 1964), yuppies (Short, 1989), hipsters (Hae, 2011), gay and lesbian communities (Castells, 1983a; Knopp, 1990; Lees et al., 2008; Rothenberg et al., 1995)), a specific art style (e.g., historical or Victorian buildings (Carpenter & Lees, 1995), postmodern landscape (Mills, 1988), marginal and bohemian art (Sharon, 1989)), vibrant nightlife activities (Currid-Halkett, 2008; Florida, 2002; Hae, 2011; Sharon, 2011), and businesses for certain leisure activities (e.g., trendy restaurants, galleries, coffee shops, bars, and yoga; Kern, 2012; Ley, 1986; Papachristos et al., 2011; Zukin et al., 2015). By mapping the identified gentrification ambiance, we visually investigated its spatial distribution over the years between 2013 and 2015 and compared it with human-perceived gentrification areas.
Our work adds to the current literature on gentrification, social media, urban geography, and space–time GIS. In addition, although discussing the consequences of gentrification often leads to controversial debates, the analytical results about fine-grained gentrification dynamics acquired from the framework can support decision-making by local residents, communities, and governments. For example, the knowledge gained can inform stakeholders where gentrifying areas are at a very early stage, which could further help prevent or minimize potential negative impacts such as displacement, segregation, and discrimination (Atkinson, 2000; Betancur, 2002; Butler & Robson, 2003; Davidson, 2010; Laam, 2011; Powell & Spencer, 2002; Slater, 2006; Wyly & Hammel, 2004).
13.2 Methodology
13.2.1 A Text Mining Approach to Explore Gentrification Ambiance Through Instagram Posts
In this study, we employed text mining to examine gentrification dynamics. The proposed methodological approach is based on the assumption that gentrifying areas have similar gentrification ambiance; therefore, their Instagram textual contents are similar and can be distinguished from contents posted from non-gentrifying places. Our approach consists of two steps. The first step is to identify and explore the spatial distribution of gentrification by applying text processing and text clustering techniques to geotagged Instagram posts. The second step is to analyze the identified clusters together with land use and word usage to further characterize gentrification.
In the text processing step, we combined all Instagram text, including captions and hashtags, per census block group (CBG) per year; each combined text is called a document. The text data combined from all Instagram posts in the entire study area per year is called a corpus. The bag-of-words refers to all words in the corpus, and the term frequency vector is a vector that represents how many times each word appears in a document. We employed standard text pre-processing techniques to achieve better data mining performance. In particular, we removed punctuation, stop words, emoji, and non-English words and applied stemming. After the pre-processing, each document was transformed into a term frequency vector. To reflect how important a word is to a document in the entire corpus, every term frequency vector was converted into a TF-IDF (Term Frequency–Inverse Document Frequency) vector. In TF-IDF, larger weights were given to the important words and smaller weights were given to common words.
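The document-building and pre-processing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the chapter used NLTK, whereas the stop-word list and the suffix-stripping `crude_stem` function here are simplified stand-ins, non-English word filtering is not implemented, and the sample posts and CBG identifiers are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller list.
STOP_WORDS = {"the", "a", "an", "and", "at", "in", "of", "to", "is", "my"}


def crude_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep runs of ASCII letters (drops punctuation, emoji, digits),
    remove stop words, and stem. Non-English word removal is not implemented."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def build_documents(posts):
    """Combine the text of all posts per (census block group, year) into one document."""
    docs = {}
    for cbg, year, text in posts:
        docs.setdefault((cbg, year), []).extend(preprocess(text))
    return docs


def term_frequency_vectors(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    vectors = {}
    for key, toks in docs.items():
        counts = Counter(toks)
        vectors[key] = [counts[word] for word in vocab]
    return vocab, vectors


# Hypothetical posts: (CBG identifier, year, caption with hashtags).
posts = [
    ("CBG_A", 2014, "Brewing coffee at the gallery! #coffee #art"),
    ("CBG_A", 2014, "Coffee and yoga ☕"),
    ("CBG_B", 2014, "Hiking the canyon trails"),
]
docs = build_documents(posts)
vocab, tf = term_frequency_vectors(docs)
print(vocab)
print(tf)
```

Each key of `tf` corresponds to one document (one CBG in one year), and each vector position corresponds to one word of the yearly bag-of-words.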
Important words (i.e., words with a high TF-IDF value) satisfy two conditions: (1) appearing frequently in a document (common locally), and (2) appearing rarely in the entire corpus (rare globally). The equation for calculating TF-IDF for a word is as follows.
$$\mathrm{TFIDF} = W \times \log\left(\frac{W}{1 + D}\right)$$
where W is the number of occurrences of a word in a document, and D is the number of documents using the word within the corresponding corpus. To reduce the influence of the increasing number of users over the years and the different sizes of documents in each year, the TF-IDF of each word in a vector was normalized by the number of unique users in a census block group during a year. The normalized TF-IDF is calculated as follows.
$$\mathrm{TFIDF} = \frac{W \times \log\left(\frac{W}{1 + D}\right)}{U}$$
13 Exploring Gentrification Through Social Media Data …
where U represents the number of unique users in each census block group during a year. In the text clustering step, the normalized TF-IDF vectors were converted into a cosine distance matrix. We then applied hierarchical clustering to the distance matrix. This clustering method first assigns all of the documents to a single cluster and then partitions that cluster into the two least similar clusters. The process runs recursively on each cluster until every document forms its own cluster. To find an optimal number of clusters, we used the Elbow method (Ketchen & Shook, 1996), which calculates the total within-cluster sum of squares (WSS) for each number of clusters, k, plots WSS against k, and identifies the significant turning point of the curve, which indicates the optimum number of clusters. The text preprocessing and text clustering were performed in Python with the NLTK (Natural Language Toolkit) and GraphLab-Create libraries and in R with the NbClust package. To evaluate how well the identified clustering groups describe gentrifying places, we compared the clustering results with human-perceived gentrifying areas, which were identified by conducting a qualitative content analysis of news, reports, and online discussions posted during 2000–2015 and containing the words “Salt Lake City” and “gentrification”. To further analyze the text mining results, we investigated the land use and the word usage in each clustering group. The land use information on gentrifying clusters can help us determine the type of gentrification, while the word usage analysis facilitates understanding how Instagram users perceive place and describe gentrification ambiance. The yearly land use information was obtained from parcel data collected through the Salt Lake City Assessor’s comprehensive appraisal process. In the word usage analysis, we compared the words used in each group with gentrification keywords.
We determined those keywords according to the existing literature (Beauregard, 1986; Castells, 1983b; Criekingen, 2009; Hae, 2011; Holt, 2008; Jager, 1986; Kern, 2012; Knopp, 1990; Laam, 2011; Ley, 1986; Loretta et al., 2008; Rothenberg et al., 1995; Sharon, 1989), and the list was further revised by domain experts. Specifically, there are 25 keywords: tonight, night, gentrification, gentrifier(s), gentrifying, gentrified, bar(s), coffee, café, restaurant(s), gallery/galleries, young, youth, trendy, aesthetic(s), art, hipster(s), beer, gay(s), lesbian(s), Victorian, bohemian, yoga, yuppie, and expensive. After stemming with the Porter stemmer (Porter, 1980, 2001), the gentrification keywords were converted into 21 words: tonight, night, gentri, bar, coffe, cafe, restaur, galleri, young, youth, trendi, aesthet, art, hipster, beer, gay, lesbian, victorian, bohemian, yoga, and expens.
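The weighting and distance steps described in this subsection can be sketched in plain Python. This is a minimal illustration, not the authors' GraphLab-Create pipeline: the function names, the dictionary-based sparse vectors, and the toy tokens are all assumptions made here. The weight follows the equation exactly as printed in the text, W × log(W/(1 + D))/U, which differs from the more common textbook form that places the total number of documents in the numerator of the logarithm; as printed, a weight can be zero or negative when W does not exceed 1 + D.

```python
import math
from collections import Counter


def normalized_tfidf(documents, unique_users):
    """Weight each word as W * log(W / (1 + D)) / U, as printed in the text.

    documents: dict mapping a document id (one census block group in one
        year) to its list of preprocessed tokens.
    unique_users: dict mapping the same ids to the number of unique users (U).
    """
    # D: number of documents in the corpus that use each word
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)      # W: occurrences of each word in this document
        users = unique_users[doc_id]  # U: unique users for this document
        vectors[doc_id] = {
            word: w * math.log(w / (1 + doc_freq[word])) / users
            for word, w in counts.items()
        }
    return vectors


def cosine_distance(vec_a, vec_b):
    """Return 1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * vec_b.get(word, 0.0) for word, value in vec_a.items())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here for all-zero vectors
    return 1.0 - dot / (norm_a * norm_b)


# Toy corpus with hypothetical stemmed tokens (not data from the study).
documents = {
    "cbg_a_2013": ["coffe", "coffe", "coffe", "art"],
    "cbg_b_2013": ["coffe", "templ", "templ", "templ", "templ"],
}
unique_users = {"cbg_a_2013": 2, "cbg_b_2013": 4}

vectors = normalized_tfidf(documents, unique_users)
distance_ab = cosine_distance(vectors["cbg_a_2013"], vectors["cbg_b_2013"])
```

Pairwise distances computed this way would form the distance matrix that the hierarchical clustering step takes as input.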
13.2.2 Study Area
Figure 13.1 illustrates the study area, parts of Salt Lake City, UT, which has experienced rapid population growth and urban development. According to the report on the
Fig. 13.1 Study area
U.S. Census website,¹ Utah is the fastest-growing state. Its population increased 2.03 percent from 2015 to 2016. The population growth is associated with urban development, and the rapid growth can result in a significant gentrification phenomenon in the state’s urban areas. Geographically, Salt Lake City and its suburban neighborhoods are located within the topographically confined Salt Lake Valley. In terms of urban structure, Salt Lake City can be considered a simple mono-centric city (i.e., a single-core Central Business District in the Valley). Therefore, its urban dynamics are simpler than those of topographically complex and interconnected urban areas such as the San Francisco Bay Area. The dashed line shown in Fig. 13.1 is the boundary of Salt Lake City according to the Salt Lake City government. The study area includes only those census block groups with more than 50% of their area within the city’s boundary. Additionally, census block groups with a population density below the 5th percentile (446.02 people/km²) were excluded from this research.
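The two selection rules for the study area can be written as a simple filter. The sketch below assumes that the share of each census block group's area inside the city boundary and its population density have already been computed (for example, in a GIS); the `BlockGroup` record, its field names, and the sample values are hypothetical. The default density threshold is the 5th-percentile value reported in the text.

```python
from dataclasses import dataclass


@dataclass
class BlockGroup:
    geoid: str                # hypothetical identifier
    share_inside_city: float  # fraction of the CBG's area inside the city boundary
    density_per_km2: float    # population density in people per square kilometre


def select_study_area(block_groups, min_share=0.5, min_density=446.02):
    """Keep CBGs with more than half their area inside the city boundary
    and a population density that is not below the threshold."""
    return [
        bg for bg in block_groups
        if bg.share_inside_city > min_share and bg.density_per_km2 >= min_density
    ]


# Made-up candidates for illustration only.
candidates = [
    BlockGroup("A", 0.95, 2100.0),
    BlockGroup("B", 0.40, 3500.0),  # mostly outside the city boundary
    BlockGroup("C", 0.80, 120.0),   # below the density threshold
    BlockGroup("D", 0.51, 446.02),
]
study_area = select_study_area(candidates)
```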
¹ https://www.census.gov/newsroom/press-releases/2016/cb16-214.html.
13.2.3 Data
We collected 1,236,070 posts between 2013 and 2015 in Salt Lake County from Instagram, a mobile photo, text, and video sharing service. Only Instagram text data, which include captions and hashtags, were used. Instagram captions are the text describing the pictures and videos. Instagram hashtags are text prefixed with the hash symbol (#) to identify messages on a specific topic. Compared with other social media, Instagram data are more suitable for this research because Instagram is the most popular social
Fig. 13.2 The distribution of Instagram posts between 2013 and 2015
media in Utah² except for Facebook, in which most posts are not publicly accessible through the API. An Instagram retrieval engine was developed to collect Instagram posts between 2013 and 2015 in Salt Lake County through the Instagram API. MongoDB, a NoSQL database, was used to temporarily store and manage the collected Instagram data. Of the 1,236,070 posts, we used the 652,190 posts made within the study area. Figure 13.2 shows a map of the collected Instagram posts. The black dots indicate that most Instagram posts were made in the Central Community. Before analyzing these posts, we filtered out the posts defined as noise. The sources of noise in social media include advertisements, marketing, messages, robots, and non-relevant conversations (Tsou, 2015). Extreme users were defined as noise in this research because Instagram posts were used to measure the posting patterns of all users, and a large number of posts from a small group of users could seriously influence the research outcome. As the overview of the raw dataset in Fig. 13.3 (Top) shows, the number of Instagram posts per user follows a power-law distribution, indicating that there are extreme users who contributed a much higher number of posts. In this research, extreme users were filtered by removing users with post counts above the 90th percentile of posting frequencies. The 90th percentile of posting frequencies is 576.6, which means that the users with more than 577 posts from 2013 to 2015 were removed from the raw dataset. Figure 13.3 (Bottom) shows the dataset after filtering extreme users, depicting a steady decrease in user counts as the number of Instagram posts increases. After filtering, our data consist of 640,353 Instagram posts.
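The extreme-user filter can be illustrated with the standard library alone. This is a sketch under assumptions: posts are represented as hypothetical `(user_id, text)` tuples, and the percentile is obtained by linear interpolation between observed post counts (`method="inclusive"`), since the text does not state which interpolation rule was used.

```python
from collections import Counter
from statistics import quantiles


def filter_extreme_users(posts, percentile=90):
    """Drop every post written by a user whose post count exceeds the percentile.

    posts: list of (user_id, text) tuples.
    Returns (kept_posts, threshold).
    """
    posts_per_user = Counter(user for user, _ in posts)
    # quantiles(..., n=100) returns the 99 cut points P1..P99;
    # method="inclusive" interpolates linearly between observed counts.
    cut_points = quantiles(list(posts_per_user.values()), n=100, method="inclusive")
    threshold = cut_points[percentile - 1]
    kept = [(user, text) for user, text in posts if posts_per_user[user] <= threshold]
    return kept, threshold


# Toy data: nine ordinary users and one very active account (made-up counts).
counts = {"u0": 1, "u1": 1, "u2": 2, "u3": 2, "u4": 3,
          "u5": 3, "u6": 4, "u7": 4, "u8": 5, "u9": 100}
posts = [(user, "post") for user, n in counts.items() for _ in range(n)]

kept_posts, threshold = filter_extreme_users(posts)
```

In this toy dataset only the account with 100 posts lies above the 90th percentile, so its posts are the only ones removed.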
² https://www.similarweb.com/blog/second-most-popular-social-network-by-state.
Fig. 13.3 The distribution of the number of users by the number of Instagram posts (Top: Before filtering; Bottom: After filtering)
13.3 Results
13.3.1 Human-Perceived Gentrifying Areas
To evaluate the clustering results, we first identified gentrifying areas by human perception to serve as references. Based on 11 articles collected from news, reports, and online discussions, we identified 9 places that have experienced gentrification in the study area. Those 9 places are highlighted in Figs. 13.4, 13.5, and 13.6 with yellow dashed circles. The first is the Avenues. A report on the Great American Country website listed five great neighborhoods that experienced gentrification, including the Avenues. It described the Avenues as having the typical landscape of gentrification, stating, “The Avenues is an affluent, walkable, outdoor-centric section of town… The Avenues has more diversity in the style of home, with bungalows, Victorians, ramblers/ranches, and two-story houses” (Cutler, 2015). The Visit Salt Lake website also mentioned the aesthetics and leisure activities available in the Avenues, which are distinct cultural factors in a gentrifying area, stating, “Perhaps the quirkiest and artsiest neighborhood of Salt Lake, The Avenues combines beautiful historic residential neighborhoods with hip contemporary features like yoga and Pilates studios, spas and bed-and-breakfasts” (Kukura, 2016). The Avenues was also referred to as a gentrifying area by the Zillow community in 2011 (Bcgallo et al., 2011). Sugar House is another neighborhood listed in the Great American Country report. It was described as a “walkable/bike-able” area where “a mix of young professionals and young families” lived (Cutler, 2015). According to this description, the
Fig. 13.4 A map of the text clustering result in 2013
Fig. 13.5 A map of the text clustering result in 2014
Fig. 13.6 A map of the text clustering result in 2015
walkable streets and the diversity of young professionals are the characteristics of gentrification. There are other reports about Sugar House. In 2013, a Utah Stories report mentioned that gentrification is happening in Sugar House, stating, “I fear what is happening in Sugar House isn’t the rebirth of a neighborhood. It’s surpassing gentrification and heading straight towards homogenization (A Local’s Perspective on the Sugar House Development, 2013).” In 2014, the Reddit community also discussed the gentrification of Sugar House (Kimberlyjo et al., 2014).
Downtown Salt Lake City is another place that has experienced gentrification. In 2011, a real estate broker mentioned the gentrification of the eastern part of downtown on Zillow (Bcgallo et al., 2011). Another Zillow user in the discussion supported this opinion by describing the new condominiums popping up and the refurbishment going on in the south of downtown. The gentrification of downtown is also supported by the Great American Country report, which stated that a large number of renovations, walkability, and a vibrant nightlife are found in downtown. Additionally, the Artspace project might contribute to the gentrification of downtown: “Artspace has long been involved in Salt Lake City’s redevelopment. It revitalizes former industrial sites or abandoned buildings and builds spaces that appeal to tenants who help transform the community (Markosian, 2007).” According to Markosian’s report on the Utah Stories, while it is a nonprofit organization and it aims to provide affordable places for artists to prevent increases in rent, the strategy for redeveloping run-down areas is highly similar to the other cases of gentrification in such neighborhoods as SoHo, New York City. Besides downtown, the south of downtown, Trolley Square, and its surrounding areas were also referred to as gentrifying areas (Bcgallo et al., 2011). The Marmalade District on the western side of the Capitol was regarded as another gentrifying area. The real estate broker who mentioned the gentrification in downtown Salt Lake City also mentioned the Marmalade District: “The Marmalade area has been seeing a lot of remodels and upgrades in quality; I recently sold a home there that is currently being upgraded. Even West of 300 West has seen some gentrification (Bcgallo et al., 2011)”. The Summit Sotheby’s International Realty (Living in Salt Lake City, 2012) and Great American Country reports (Cutler, 2015) have described the gentrification there. 
In the reports, the Marmalade District was regarded as an “up-and-coming” section of town. It was “a blighted industrial area,” but “rapidly becoming an artsy community with a bohemian feel.” In the same manner, the Marmalade District experienced a transformation from an industrial or disinvested area to a place full of gentrification ambiance. According to the reports of Summit Sotheby’s International Realty (Living in Salt Lake City, 2012) and Great American Country (Cutler, 2015), the landscapes of the 9th and 9th (at the intersection of 900 East and 900 South Streets) and 15th and 15th (at the intersection of 1500 East and 1500 South Streets) areas also match the characteristics of gentrification. They were described with phrases such as “foot-traffic friendly”, “culturally diversity,” and “the gay community” and as full of “quirky shops”, “independent movie theaters”, “art galleries” and “coffee shops.” Based on the online reports, 9th and 9th and 15th and 15th are walkable and culturally diverse places. The coffee shops and art galleries there are key cultural factors associated with gentrification.
13.3.2 Text Clustering Results
As a result of text clustering and the Elbow method, we determined 7 cluster groups. Figures 13.4, 13.5, and 13.6 show the clustering results for 2013, 2014, and 2015, respectively. The yellow dashed circles on the maps represent the human-perceived gentrifying areas. The results indicate that the text clustering method grouped most areas of human-perceived gentrification into two groups, Groups 5 and 7, which were distinguished from other areas. As the map in Fig. 13.4 shows, in 2013, some areas of human-perceived gentrification, including downtown, the south of downtown, and the core of Sugar House, were clustered into Group 5; the other areas of human-perceived gentrification, including West of 300 West, the Marmalade District, a part of the Avenues, and 9th and 9th, were clustered into Group 7. Only Trolley Square and 15th and 15th were clustered with other groups. In 2014 and 2015, Trolley Square was also clustered into Group 5. In other words, the text clustering method identified seven areas of human-perceived gentrification in 2013 (Fig. 13.4) and eight areas of human-perceived gentrification in 2014 and 2015. Only one area of human-perceived gentrification, 15th and 15th, was overlooked in 2014 and 2015. The text clustering approach is also capable of exploring gentrification dynamics on a fine-grained temporal scale. As the patterns in Figs. 13.4, 13.5, and 13.6 show, Group 7 continued to expand over the years. This expansion indicates that more and more census block groups have similar Instagram texts, which might result from the expansion of gentrification ambiance. Although we cannot be sure whether the expanding areas are really gentrifying, they are more likely to become gentrified because their sense of place is more akin to that of the gentrifying areas. In addition, the expanding areas grew outward from the human-perceived gentrifying areas.
This finding confirms the arguments of previous studies, which pointed out that poor neighborhoods near an area that experiences gentrification are more likely to be gentrified (Guerrieri et al., 2013; Hackworth, 2007). The text clustering approach also differentiates between the two types of gentrification: commercial gentrification and residential gentrification. It revealed two clusters, Group 5 and Group 7, representing commercial and residential gentrification respectively. Neighborhoods in Group 5 including downtown, south of downtown, Trolley Square, and Sugar House are business and shopping districts with large shopping centers and trendy shops, while those in Group 7 consisting of West of 300 West, the Marmalade District, the Avenues, and 9th and 9th are upscale residential areas with diverse architectural styles, historic houses, and small retail shops. The details of the land use of each group are shown in Fig. 13.7. Group 5, representing commercial gentrification, has the largest percentage of commercial areas, making up almost 50% of the entire area, while 30% of the area is residential. This indicates that Group 5 is a mixed-use location characterized by commercial activities. Residential and condominium areas make up more than 50% of Group 7, meaning this area contains large residential neighborhoods, similar to Groups 2, 3, and 4. However, the percentage of commercial land use is larger in Group 7 than in Groups
Fig. 13.7 Land use profile of each group
2, 3, and 4, indicating Group 7 is a residential place with a considerable amount of commercial land-use area. We further analyzed the word usage in each clustering group to understand how Instagram users perceive gentrification ambiance. Figure 13.8 presents the average of the normalized TF-IDF vectors of every word per clustering group. The horizontal
Fig. 13.8 The average normalized TF-IDF of each word in clustering groups
axis refers to all words used in the entire corpus. Specifically, there are 41,205 words on the horizontal axis. If a word was not used in a group, its average TF-IDF is zero. For example, Group 2, which temporarily appeared in 2013 (see Fig. 13.4), has an average TF-IDF of zero for each word. This means that almost no words were used in this group. The vertical axis is the average TF-IDF. Each bar in the graph represents the mean TF-IDF of one word. A significant peak in a group means that the word has a high average TF-IDF, indicating the word’s relative importance for the group. For example, the two peaks in Group 4 represent two significantly important words, “templ” and “templesur,” where “templ” means “temple” and “templesur” means “templesure” after the stemming process. This shows that the location names “temple” and “temple square” are the two most important words in Group 4. The result shown in Fig. 13.8 differs from the original expectation. Because the mean TF-IDF refers to the importance of a word in a group, some gentrification keywords were expected to show significant peaks in the gentrifying groups (Group 5 and Group 7). However, the distributions of the bars in Groups 5 and 7 are more even than those in the other groups. There are still some peaks, but they are lower than the peaks in the other groups. This result indicates that Instagram users used more diverse words in gentrifying areas, which might result from the greater diversity of activities and events occurring there. This result is similar to what Hristova et al. (2016) found in London. According to their case study, places that experience gentrification have the most diverse venues and visitors. The socio-economic diversity and class diversity in gentrifying areas were also identified in previous studies (Freeman, 2009; McKinnish et al., 2010).
Nevertheless, the even distributions of the average normalized TF-IDF for every word in gentrifying groups (i.e., Groups 5 and 7) do not mean that gentrification keywords play no role in gentrifying groups. Figure 13.9 depicts word clouds based on
Fig. 13.9 The word clouds based on the average normalized TF-IDF in each group
Fig. 13.10 Ranks of gentrifying keywords in each cluster group
the top 100 words by average normalized TF-IDF. As shown in these word clouds, more gentrification keywords appear in gentrifying groups. Specifically, only one gentrification keyword appears in each of Group 1 and Group 3 (restaurant and coffee), and there are no gentrification keywords in Groups 2, 4, and 6. By contrast, six gentrification keywords appear among the top 100 words in Group 5, and three appear in Group 7. In other words, the gentrification keywords play a more important role in gentrifying groups than in non-gentrifying groups. Figure 13.10 helps us to examine the ranking of gentrification keywords in each group. The horizontal axis represents all 21 gentrification keywords that appeared in the Instagram dataset, and the vertical axis represents the ranking. Each line refers to a cluster group, and the dashed lines represent gentrifying groups (i.e., Groups 5 and 7). Group 2 is excluded as it does not contain any gentrification keywords. In the graph, the dashed lines lie above all other lines for most of the gentrification keywords, such as night, tonight, bar, coffee, café, aesthetics, art, hipster, beer, gay, and yoga. This indicates that when differentiating gentrifying from non-gentrifying areas, these keywords play a more important role than other keywords.
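The keyword comparison behind Figs. 13.9 and 13.10 amounts to ranking all words in a group by their average normalized TF-IDF and then looking up the gentrification keywords. The sketch below is one possible way to do this; the function names and the toy scores are assumptions for illustration and are not values from the study.

```python
def keyword_ranks(group_mean_tfidf, keywords):
    """Rank every word in a group by mean normalized TF-IDF (1 = highest)
    and return the rank of each keyword that occurs in the group."""
    ordered = sorted(group_mean_tfidf, key=group_mean_tfidf.get, reverse=True)
    rank_of = {word: position for position, word in enumerate(ordered, start=1)}
    return {kw: rank_of[kw] for kw in keywords if kw in rank_of}


def keywords_in_top(group_mean_tfidf, keywords, top_n=100):
    """List the keywords that fall within the top_n words of a group."""
    ranks = keyword_ranks(group_mean_tfidf, keywords)
    return sorted(kw for kw, rank in ranks.items() if rank <= top_n)


# Hypothetical mean scores for one cluster group (stemmed words).
group_scores = {"templ": 0.9, "coffe": 0.5, "art": 0.4, "ski": 0.2, "beer": 0.1}
gentrification_keywords = ["coffe", "art", "beer", "yoga"]

ranks = keyword_ranks(group_scores, gentrification_keywords)
top_three = keywords_in_top(group_scores, gentrification_keywords, top_n=3)
```

Repeating the lookup for every cluster group gives one ranking profile per group, which is the kind of comparison the text draws between gentrifying and non-gentrifying groups.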
13.4 Conclusion
13.4.1 Key Findings
This study introduced a novel text mining framework that identifies a sense of gentrification from Instagram data. The framework relies on a gentrification indicator, gentrification ambiance, that facilitates capturing the cultural characteristics of gentrification. Gentrification
ambiance was explored by applying text processing and text clustering techniques to the Instagram posts. By comparing the spatial distributions of gentrification ambiance over the years with areas of human-perceived gentrification derived from news, reports, and online discussions, the results showed that areas observed as having a sense of gentrification closely correspond to the human-perceived gentrifying areas. Additionally, our approach differentiated two types of gentrification, namely residential and commercial gentrification. This is an advantage over conventional census-based gentrification typologies that typically focus on classifying places into, for example, (not) gentrified, (not) gentrifying, or (not) potentially gentrifiable using a set of pre-determined rules (e.g., Bostic & Martin, 2003; Ding et al., 2016; Freeman, 2005; Voorhees, 2014). Furthermore, our approach is capable of exploring spatially and temporally fine-grained gentrification dynamics. While conventional studies often look at gentrification dynamics at a coarser scale using, for example, decennial census data at the census tract level, we examined them at the scale of the CBG by year using social media posts. This allows us to capture the expansion of gentrification, a shifting pattern in relation to a sense of gentrification.
13.4.2 Limitations
While there are several findings in this research, the results are affected by some limitations. First of all, our study does not consider Instagram users’ behavior. Shifts in Instagram users’ behavior might influence the text clustering results. The results of text clustering in 2013 coincide almost perfectly with areas of human-perceived gentrification, but the gentrifying groups expanded rapidly in 2014 and 2015. This indicates that more places have Instagram texts similar to those in gentrifying areas. This phenomenon might be due to the expansion of the gentrification ambiance, which helps us to identify places undergoing gentrification. Yet, this phenomenon could also result from the emergence of certain Instagram user behaviors. In other words, trendy content (sometimes related to gentrification) is frequently posted by a large number of Instagram users, which might lead to homogeneous Instagram text in gentrifying and non-gentrifying places. To address this problem, more research on Instagram user behaviors is needed. The uncertainty of the data and the algorithm also affects the research results. Because social media data are not designed for research purposes, the check-in data can be controlled by users, and it is difficult to know the level of uncertainty of the data (Lazer et al., 2014). Even though data filtering was applied in this research to reduce the influence of robots and extremely active users, fake Instagram check-in posts might lead to an inaccurate result. Conducting research to explore the level of data uncertainty is the next step in addressing this problem. Another challenge is the lack of exact boundaries for areas of human-perceived gentrification. In this research, these gentrification areas, derived from news, reports, and online discussions, do not have a clear boundary. They are the names of places
with general locations; therefore, it is difficult to evaluate the performance of the Instagram analysis results. Hence, further research is necessary to draw the boundaries of the gentrifying areas as reference data to validate data mining-based results. This can be done, for example, qualitatively by surveying residents or quantitatively by analyzing fine-scale housing value dynamics. Once the physical boundary of gentrification is available, the comparison can be done using spatial methods like Spatial Association Between Regionalizations (SABRE) (Nowosad & Stepinski, 2018). Last but not least, although images are the main content of Instagram posts, we did not utilize images or image recognition techniques because of the difficulty of obtaining permission from Instagram to apply computer vision techniques to Instagram photos. Incorporating publicly available Instagram photos with image-based machine learning could enrich the input data and ultimately enhance our understanding of gentrification ambiance. In addition, applying spatio-temporal check-in patterns and semantic analysis can also help us to check the typology of a place (Cranshaw et al., 1977; McKenzie et al., 2015).
Acknowledgements This work was partially supported by the National Science Foundation under Grant No. 1634641, IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation.” Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
References
A local’s perspective on the sugar house development (2013). Retrieved May 1, 2017, from http://utahstories.com/2013/11/a-locals-perspective-on-the-sugar-house-development/
Anderson, E. (1990). The village setting. Streetwise race Cl. Change urban community (pp. 7–55). The Univ. of Chicago Press.
Atkinson, R. (2000). Measuring gentrification and displacement in greater London. Urban Studies, 37, 149–165.
Barton, M. (2016). An exploration of the importance of the strategy used to identify gentrification. Urban Studies, 53, 92–111.
Bcgallo, et al. (2011). Best area in Salt Lake City for gentrification play. In Zillow.com. Retrieved May 5, 2016, from https://www.zillow.com/advice-thread/Best-Area-in-Salt-Lake-City-for-Gentrification-Play/413670/
Beauregard, R. A. (1986). The chaos and complexity of gentrification. In Gentrification city (pp. 35–55).
Betancur, J. J. (2002). The politics of gentrification the case of West Town in Chicago. Urban Affairs Review, 37, 780–814.
Betancur, J. (2011). Gentrification and community fabric in Chicago. Urban Studies, 48, 383–406.
Bostic, R. W., & Martin, R. W. (2003). Black home-owners as a gentrifying force? Neighbourhood dynamics in the context of minority home-ownership. Urban Studies, 40, 2427–2449.
Boyd, M. (2008). Defensive development: The role of racial conflict in gentrification. Urban Affairs Review, 43, 751–776.
Butler, T., & Robson, G. (2003). Plotting the middle classes: Gentrification and circuits of education in London. Housing Studies, 18, 5–28.
Carpenter, J., & Lees, L. (1995). Gentrification in New York, London and Paris: An international comparison. International Journal of Urban and Regional Research, 19, 286–303.
Castells, M. (1983a). The city and the grassroots: A cross-cultural theory of urban social movements. Univ of California Press.
Castells, M. (1983b). The city and the grassroots: A cross-cultural theory of urban social movements. Arnold.
Cranshaw, J., Hong, J. I., & Sadeh, N. (1977). The livehoods project: Utilizing social media to understand the dynamics of a city. In Proceedings of ICWSM (pp. 58–65).
Currid-Halkett, E. (2008). The Warhol economy: How fashion, art, and music drive New York City-new edition. Princeton University Press.
Cutler, A. (2015). Salt Lake City, Utah: Neighborhoods to know. Retrieved May 1, 2017, from http://www.greatamericancountry.com/places/local-life/living-in-salt-lake-city--utah
Davidson, M. (2010). Love thy neighbour? Social mixing in London’s gentrification frontiers. Environment and Planning A, 42, 524–544.
Ding, L., Hwang, J., & Divringi, E. (2016). Gentrification and residential mobility in Philadelphia. Regional Science and Urban Economics, 61, 38–51.
Florida, R. (2002). Bohemia and economic geography. Journal of Economic Geography, 2, 55–71.
Freeman, L. (2005). Displacement or succession?: Residential mobility in gentrifying neighborhoods. Urban Affairs Review, 40, 463–491.
Freeman, L. (2006). There goes the ’hood: Views of gentrification from the ground up. Temple University Press.
Freeman, L. (2009). Neighbourhood diversity, metropolitan segregation and gentrification: What are the links in the US? Urban Studies, 46, 2079–2101.
Gale, D. E. (1980). Neighborhood resettlement: Washington, D.C. In S. B. Laska & D. Spain (Eds.), Back city (pp. 95–115). Burlington.
Gibbons, J., & Barton, M. S. (2016). The Association of minority self-rated health with black versus white gentrification. Journal of Urban Health, 93, 909–922.
Gibbons, J., Nara, A., & Appleyard, B. (2018). Exploring the imprint of social media networks on neighborhood community through the lens of gentrification. Environment and Planning B: Urban Analytics and City Science, 45, 470–488.
Glass, R. (1964). London: Aspects of change. University College.
Guerrieri, V., Hartley, D., & Hurst, E. (2013). Endogenous gentrification and housing price dynamics. Journal of Public Economics, 100, 45–60.
Hackworth, J. (2007). The neoliberal city: Governance, ideology, and development in American urbanism. Cornell University Press.
Hae, L. (2011). Gentrification and politicization of nightlife in New York city. ACME, 10, 564–584.
Hamnett, C., & Whitelegg, D. (2007). Loft conversion and gentrification in London: From industrial to postindustrial land use. Environment and Planning A, 39, 106–124.
Holt, L. (2008). Embodied social capital and geographic perspectives: Performing the habitus. Progress in Human Geography, 32, 227–246.
Hristova, D., Williams, M. J., Musolesi, M., Panzarasa, P., & Mascolo, C. (2016). Measuring urban social diversity using interconnected geo-social networks. In Proceedings of 25th International Conference on World Wide Web—WWW 16 (pp. 21–30).
Jager, M. (1986). Class definition and the esthetics of gentrification: Victoriana in Melbourne. Gentrification City (pp. 78–91). Allen and Unwin.
Kern, L. (2012). Connecting embodiment, emotion and gentrification: An exploration through the practice of yoga in Toronto. Emotion, Space and Society, 5, 27–35.
Kerstein, R. (1990). Stage models of gentrification: An examination. Urban Affairs Review, 25, 620–639.
Ketchen, D., & Shook, C. (1996). The application of cluster analysis in strategic management research: An analysis and critique. Strategic Management Journal, 17, 441–458.
Kimberlyjo, et al. (2014). Why is everyone here so obsessed with Sugarhouse? Retrieved May 1, 2017, from https://www.reddit.com/r/SaltLakeCity/comments/2cfra0/why_is_everyone_here_so_obsessed_wi%0Ath_sugarhouse/%0A
Knopp, L. (1990). Some theoretical implications of gay involvement in an urban land market. Political Geography Quarterly, 9, 337–352.
Kukura, J. (2016). 9 hippest neighborhoods of Salt Lake. Retrieved May 1, 2017, from https://www.visitsaltlake.com/blog/post/2016/9/9-Hippest-Neighborhoods-of-Salt-Lake/8367/
Laam, H. (2011). Dilemmas of the nightlife fix: post-industrialisation and the gentrification of nightlife in New York City. Urban Studies, 48, 3449–3465.
Lazer, D., Kennedy, R., King, G., & Vespignani, A. (2014). The parable of Google Flu: Traps in big data analysis. Science, 343, 1203–1205.
Lees, L., Slater, T., & Wyly, E. K. (2008). Gentrification. Routledge.
Ley, D. (1986). Alternative explanations for inner-city gentrification: A Canadian assessment. Annals of the Association of American Geographers, 76, 521–535.
Ley, D., & Dobson, C. (2008). Are there limits to gentrification? The contexts of impeded gentrification in Vancouver. Urban Studies, 45, 2471–2498.
Liu, C., & O’Sullivan, D. (2016). An abstract model of gentrification as a spatially contagious succession process. Computers, Environment and Urban Systems, 59, 1–10.
Living in Salt Lake City. (2012). Retrieved May 1, 2017, from http://www.summitsothebysrealty.com/eng/article/living-in-salt-lake-city
Loretta, L., Slater, T., & Wyly, E. K. (2008). Gentrification. Routledge/Taylor & Francis Group.
Manovich, L. (2011). Trending: The promises and the challenges of big social data. Retrieved May 1, 2017, from http://manovich.net/content/04-projects/067-trending-the-promises-and-the-challenges-of-big-social-data/64-article-2011.pdf
Markosian, R. (2007). Artspace in Salt Lake City. Retrieved May 1, 2017, from http://utahstories.com/2007/01/artspace-in-salt-lake-city/
McKenzie, G., Janowicz, K., Gao, S., Yang, J.-A., & Hu, Y. (2015). POI pulse: A multi-granular, semantic signature-based information observatory for the interactive visualization of big geosocial data. Cartographica, 50, 71–85.
McKinnish, T., Walsh, R., & Kirk White, T. (2010). Who gentrifies low-income neighborhoods? Journal of Urban Economics, 67, 180–193.
Mills, C. A. (1988). “Life on the upslope”: The postmodern landscape of gentrification. Environment and Planning D: Society and Space, 6, 169–190.
Nowosad, J., & Stepinski, T. F. (2018). Spatial association between regionalizations using the information-theoretical V-measure. International Journal of Geographical Information Science, 32, 2386–2401.
O’Sullivan, D. (2002). Toward micro-scale spatial modeling of gentrification. Journal of Geographical Systems, 4, 251–274.
Papachristos, A. V., Smith, C. M., Scherer, M. L., & Fugiero, M. A. (2011). More coffee, less crime? The relationship between gentrification and neighborhood crime rates in Chicago, 1991 to 2005. City & Community, 10, 215–240.
Pattison, T. J. (1983). The stages of gentrification: The case of Bay Village. In P. L. Clay & R. M. Hollister (Eds.), Neighborhood policy plan (pp. 77–92). LexingtonBooks.
Porter, M. (1980). An algorithm for suffix stripping. Program, 14, 130–137.
Porter, M. (2001). Snowball: A language for stemming algorithms. Snowball, 1–15.
Powell, J., & Spencer, M. (2002). Giving them the old “One-Two”: Gentrification and the K.O. of impoverished urban dwellers of color. Howard Law Journal, 46, 433.
Rothenberg, T. (1995). “And she told two friends”: Lesbian creating urban social space. In D. Bell, G. Valentine, & J. Silk (Eds.), Mapping Desire: Geog Sexuality (pp. 165–181). Routledge.
Sharon, Z. (1989). Loft living: Culture and capital in urban change. Rutgers University Press.
Sharon, Z. (2011). Naked city: The death and life of authentic urban places. Oxford University Press.
Short, J. R. (1989). Yuppies, Yuffies and the new urban order. Transactions of the Institute of British Geographers, 14, 173–188. Slater, T. (2006). The eviction of critical perspectives from gentrification research. International Journal of Urban and Regional Research, 30, 737–757. Sui, D., & Goodchild, M. (2011). The convergence of GIS and social media: Challenges for GIScience. International Journal of Geographical Information Science, 25, 1737–1748. Torrens, P. M., & Nara, A. (2007). Modeling gentrification dynamics: A hybrid approach. Computers, Environment and Urban Systems, 31, 337–361. Tsou, M.-H. (2015). Research challenges and opportunities in mapping social media and big data. Cartography and Geographic Information Science, 42, 70–74. Van Criekingen, M. (2009). Moving in/out of Brussels’ historical core in the early 2000s: Migration and the effects of gentrification. Urban Studies, 46, 825–848. Voorhees, N. P. (2014). The socioeconomic change of Chicago’s community areas (1970–2010): Gentrification index. Chic. Univ. Ill. Wyly, E. K., & Hammel, D. J. (2004). Gentrification, segregation, and discrimination in the American urban system. Environment and Planning A, 36, 1215–1241. Zukin, S., Lindeman, S., & Hurson, L. (2015). The omnivore’s neighborhood? Online restaurant reviews, race, and gentrification. Journal of Consumer Culture, 1469540515611203.
Chapter 14
Spatial Distribution Patterns of Geo-tagged Twitter Data Created by Social Media Bots and Recommended Data Wrangling Procedures

Ming-Hsiang Tsou, Hao Zhang, Jaehee Park, Atsushi Nara, and Chin-Te Jung
M.-H. Tsou (corresponding author) · H. Zhang · J. Park · A. Nara: Department of Geography, San Diego State University, San Diego, CA, USA
C.-T. Jung: Esri, Redlands, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_14

14.1 Introduction

Geo-tagged social media data are important for social media analytics because data scientists and GIScience researchers can aggregate social media messages or users by city, region, or nearby points of interest (POIs) for location-based analysis and regional trend analysis. Currently, the public Twitter Application Programming Interfaces (APIs) support five types of geocoding methods:

1. Geo-tagged coordinates
2. Place check-in location (bounding box)
3. User profile location
4. Time zones
5. Texts containing locational information (explicit or implicit)

Among the five geocoding methods, tweets with geo-tagged coordinates are the most popular data source in location-based social media research. Geo-tagged tweets have precise latitude and longitude coordinates (decimal degrees) stored in a metadata field of the tweet, called “geo” (a deprecated field name in the APIs) or “coordinates” (the current field name in the APIs). When users turn on the precise location tag function on their Twitter accounts (which is off by default), their tweets are geo-tagged using the GPS or Wi-Fi signals of their mobile devices. Since many users do not enable precise location tags, only around 1% of tweets contain geo-tagged information. The percentage of geo-tagged tweets may vary among topics or keywords. For example, during a wildfire event in San Diego, the percentage of geo-tagged tweets can reach 4% or higher (Wang et al. 2016).

GIScientists and geographers can use geo-tagged tweets to study the geographic context and spatial association of users or messages. One popular method of collecting geo-tagged tweets is to use Twitter’s Streaming API with a predefined bounding box or multiple predefined keywords. Previous work has demonstrated that social media data collected by the Twitter Streaming API (the free version, with a 1% sampling rate) are a good sample of Twitter’s Firehose API (a very expensive version that provides 100% of the tweet data on Twitter servers) (Morstatter et al. 2013). In academia, many GIScience researchers have used geo-tagged tweets for spatial analysis and GIS operations in their research projects. For example, the 2013 special issue “Mapping Cyberspace and Social Media” in Cartography and Geographic Information Science (Tsou and Leitner 2013) includes seven refereed research papers, four of which use geo-tagged tweets as their main data source.

There are two general types of Twitter APIs: the Streaming API, for collecting real-time feeds of Twitter messages, and the Search API, for collecting historical tweets (up to 7 or 9 days before the search date) that match specific keywords or user names through database query methods. This paper focuses mainly on the characteristics of geo-tagged tweets collected from the Twitter Streaming API in two cities, San Diego and Columbus.
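The two metadata fields described above can be read with a few lines of code. The following is a minimal sketch, assuming each tweet has already been parsed from JSON into a Python dictionary. The helper names `tweet_point` and `geotagged_share` are hypothetical, and the sample values are invented for illustration. The sketch also assumes that the deprecated “geo” field stores latitude first, whereas “coordinates” uses GeoJSON order (longitude first).

```python
from typing import Iterable, Optional, Tuple


def tweet_point(tweet: dict) -> Optional[Tuple[float, float]]:
    """Return (longitude, latitude) for a geo-tagged tweet, or None."""
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]  # GeoJSON order: longitude first
        return float(lon), float(lat)
    geo = tweet.get("geo")
    if geo and geo.get("coordinates"):
        lat, lon = geo["coordinates"]  # assumed order for the deprecated field
        return float(lon), float(lat)
    return None


def geotagged_share(tweets: Iterable[dict]) -> float:
    """Fraction of tweets that carry precise coordinates."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    tagged = sum(1 for t in tweets if tweet_point(t) is not None)
    return tagged / len(tweets)


# Invented sample records, not real tweets.
sample = [
    {"coordinates": {"type": "Point", "coordinates": [-117.16, 32.71]}},
    {"geo": {"type": "Point", "coordinates": [39.96, -83.0]}},
    {"coordinates": None, "geo": None},
    {},
]
print([tweet_point(t) for t in sample])
print(geotagged_share(sample))
```

Normalizing both fields to one (longitude, latitude) convention early avoids swapped-axis errors in later GIS steps.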
14.2 Twitter Spams, Bots, and Cyborgs

Previous research has identified spam created by bots and cyborgs as a major type of noise in Twitter data. Twitter spam, like email spam, takes different forms. Some spam is transparent and easy to identify, while some is sophisticated (Yardi et al. 2010). Spam is usually created to reach more users and to increase the financial gain of spammers. Spammers now use multiple platforms, including social media services, to disseminate spam (Lumezanu and Feamster 2012). Many researchers have addressed the importance of eliminating spam and bots for a cleaner and safer online networking environment (Chu et al. 2012).

The existence of spam and bots on Twitter was first noticed in 2008 in a study of over 100,000 Twitter users that grouped them based on follower-to-following ratios (Krishnamurthy et al. 2008). The groups are broadcasters (users with a large number of followers), acquaintances (users with a follower-to-following ratio close to 1), and miscreants (users who follow many accounts but have few followers) (Krishnamurthy et al. 2008).

Many Twitter spam messages include URLs. According to Grier et al. (2010), 8% of the URLs shared on Twitter link to malicious websites that contain malware or scams. Clicking on those embedded URLs may cause serious problems for users’ privacy and local computers (Thomas and Nicol 2010). The harm of spam and bots is not limited to malware and scams: spam messages also waste storage space on the server side (Lin and Huang 2013). Twitter released rules for eliminating and controlling spam created by bots, which include a definition of spam accounts and a policy of banning them permanently (Twitter 2016). Unfortunately, some spam accounts have changed their messages to evade these rules. Previously, researchers tried to detect spam accounts by building URL blacklists.
However, about 90% of the users who reach a spam URL have already clicked on it before it is added to a blacklist (Grier et al. 2010). Furthermore, URL shortening services add uncertainty to the detection of spam URLs (Uribl 2013). Currently, the three major approaches to detecting spam accounts are data compression algorithms (Bratko et al. 2006), machine learning (Goodman et al. 2005), and statistics (Fetterly et al. 2004). In recent years, a growing number of spam detection methods have been built on these three approaches. Lin and Huang (2013) evaluated tweets by examining the URL rate and the interaction rate. Lumezanu and Feamster (2012) categorized tweets by characterizing publishing behavior and analyzing the effectiveness of spam. Wang (2010) used Bayesian statistics to classify tweets.

It is worth noting that most spam tweets are sent by bot accounts rather than human accounts. However, social media bots do not only send spam. They sometimes send useful information, such as weather information, traffic updates, and earthquake alerts. The existence of social media bots may not be widely recognized by Twitter users. The Japan Times defines social media bots as follows: “Twitterbots are small software programs that are designed to mimic human tweets. Anyone can create bots, though it usually requires programming knowledge. Some bots reply to other users when they detect specific keywords. Others may randomly tweet preset phrases such as proverbs. Or if the bot is designed to emulate a popular person (celebrity, historic icon, anime character etc.) their popular phrases will be tweeted. Not all bots are fully machine-generated, however, and interestingly the term “bot” has also come to refer to Twitter accounts that are simply “fake” accounts” (Akimoto 2011). Meanwhile, cyborgs, a mix of humans and bots, are either bot-assisted humans or human-assisted bots (Chu et al. 2010). A bot can send tweets automatically by calling Twitter APIs.
Sometimes, after a bot gains an audience, its (human) creator may tweet through the bot, which leads to a merged version of humans and bots: cyborgs. The emergence of bots and cyborgs can be attributed to the growing user population and the open nature of Twitter (Chu et al. 2010). Bots and cyborgs generate many tweets every day that provide various kinds of information, including news updates, advertisements, and emergency information (Chu et al. 2010). Some researchers study how to increase the influence of bots in social media (Messias et al. 2013). On the other hand, identifying bots and cyborgs is not easy (Boshmaf et al. 2011). Chu et al. (2010) found that 10.5% of Twitter accounts are bots and 36.2% are cyborgs. The existence of bots and cyborgs has both pros and cons. The information they tweet, such as news and job postings, gives people access to the latest updates. However, the spam tweets and fake news sent by bots and cyborgs are harmful to social media users.

Azmandian et al. (2013) introduced a two-step procedure for eliminating bots from Twitter data: (1) every Twitter user whose tweets contain URLs more than 70% of the time is identified as a bot, and (2) every user who traveled faster than 120 km/h is identified as a bot. However, this previous research did not focus on eliminating spam, bots, and cyborgs in geo-tagged tweets. This paper focuses on how to identify and remove bots and cyborgs in geo-tagged social media data.
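The two-step rule attributed to Azmandian et al. (2013) can be illustrated with a short sketch. This is not their implementation. It assumes that each tweet is a dictionary with hypothetical keys `text`, `timestamp` (Unix seconds), `lon`, and `lat`, that a URL is detected by a simple substring test, and that travel speed is the great-circle distance between consecutive tweets of one user divided by the elapsed time.

```python
from math import asin, cos, radians, sin, sqrt

URL_RATE_LIMIT = 0.70     # step 1 threshold: share of tweets containing URLs
SPEED_LIMIT_KMH = 120.0   # step 2 threshold: implied travel speed


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0088 * asin(sqrt(a))


def url_rate(tweets):
    """Share of one user's tweets whose text contains a URL."""
    if not tweets:
        return 0.0
    with_url = sum(1 for t in tweets if "http://" in t["text"] or "https://" in t["text"])
    return with_url / len(tweets)


def max_speed_kmh(tweets):
    """Highest implied speed between consecutive tweets of one user."""
    ordered = sorted(tweets, key=lambda t: t["timestamp"])
    top = 0.0
    for a, b in zip(ordered, ordered[1:]):
        hours = (b["timestamp"] - a["timestamp"]) / 3600.0
        if hours <= 0:
            continue  # identical timestamps give no usable speed
        km = haversine_km(a["lon"], a["lat"], b["lon"], b["lat"])
        top = max(top, km / hours)
    return top


def looks_like_bot(tweets):
    """Apply both rules to the tweets of a single user."""
    return url_rate(tweets) > URL_RATE_LIMIT or max_speed_kmh(tweets) > SPEED_LIMIT_KMH


# Invented example: San Diego to Los Angeles in ten minutes.
jumper = [
    {"text": "hello", "timestamp": 0, "lon": -117.1611, "lat": 32.7157},
    {"text": "hello again", "timestamp": 600, "lon": -118.2437, "lat": 34.0522},
]
print(looks_like_bot(jumper))
```

Both thresholds are kept as named constants because, as the chapter argues later, suitable cut-offs depend on the city and the collection method.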
14.3 System Errors in Twitter Streaming APIs for Geo-tagged Tweets

The Twitter Streaming API provides a bounding box option that lets users search for and collect only geo-tagged tweets within the bounding box area, so many people assume that every geo-tagged tweet collected by the Streaming API is located within the user-defined bounding box. However, we compared two case studies that used the Streaming API with bounding boxes for the County of San Diego and the City of Columbus during one month (November, 2015). Only 42.7% of the collected tweets fall within San Diego County, and 83.8% fall within the City of Columbus (Table 14.1).

Table 14.1 The percentage of geo-tagged tweets within the original bounding box or within the State boundary in the County of San Diego and the City of Columbus collected in one month (November, 2015)

| GeoViewer geography setting | Region | # of tweets in November | Percentage of total tweets (%) |
|---|---|---|---|
| @SanDiego | San Diego County (SDC) | 97,944 | 42.7 |
| @SanDiego | California State (excluding SDC) | 56,382 | 24.6 |
| @SanDiego | Other regions | 75,082 | 32.7 |
| @SanDiego | Outside boundary | 131,464 | |
| @SanDiego | Total | 229,408 | |
| @Columbus | Columbus City (CC) | 53,291 | 83.8 |
| @Columbus | Ohio State (excluding CC) | 10,043 | 15.8 |
| @Columbus | Other regions | 236 | 0.4 |
| @Columbus | Outside boundary | 10,279 | |
| @Columbus | Total | 63,570 | |

Figure 14.1 illustrates the spatial distribution of geo-tagged tweets collected by the Twitter Streaming API using the bounding box of San Diego County during two months (October and November, 2015). Around 42.7% of the geo-tagged tweets are within the boundary of San Diego County, and 57.3% are outside the originally defined bounding box. We also created kernel density maps from the geo-tagged tweets and found that the major hot spot (clustered area) is located around downtown San Diego, with two other hot spots around the cities of Los Angeles and San Francisco. Figure 14.1 shows that the tweets outside the bounding box are mainly located in California, with a few in Mexico. A small number of tweets are located in South America, Europe, and Southeast Asia.

Fig. 14.1 The spatial distribution of geo-tagged tweets using the Streaming API with the bounding box of San Diego County (October and November 2015) at county, state, and world scales

Figure 14.2 illustrates the spatial distribution of geo-tagged tweets collected with the bounding box of the City of Columbus in Ohio, USA during October and November, 2015. 83.8% of the tweets are within the actual boundary of the City of Columbus, 15.8% are elsewhere within the State of Ohio, and only 0.4% are outside Ohio. Comparing the two case studies, the San Diego case has a much higher ratio of tweets outside the bounding box, and the Columbus case shows better results from using a bounding box to collect geo-tagged tweets.

There is one possible explanation for this system error generated by the Twitter APIs. According to the Twitter API documents, the Streaming API uses the following heuristic rules to determine whether a given tweet falls within a bounding box (https://dev.twitter.com/streaming/overview/request-parameters#locations):

1. If the coordinates field is populated, the values there will be tested against the bounding box. Note that this field uses geoJSON order (longitude, latitude).
2. If coordinates is empty but place is populated, the region defined in place is checked for intersection against the locations bounding box. Any overlap will match.
3. If none of the rules listed above match, the Tweet does not match the location query. Note that the geo field is deprecated, and ignored by the streaming API.
Rule 2 may be the key reason why tweets outside the bounding box are collected. For example, if a user selects “California” as the place to check in, the place bounding box of California overlaps the bounding box of San Diego.
Fig. 14.2 The spatial distribution of geo-tagged tweets using the streaming API with the bounding box of Columbus City (October and November 2015) at county, state, and world scales
Therefore, a tweet posted outside San Diego will be selected. However, several geo-tagged tweets from other parts of the world cannot be explained by these heuristic rules. We recommend that future users of the Twitter APIs add a geo-filter procedure after collecting geo-tagged tweets with the bounding box method in order to remove these system errors.
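The matching heuristic and the recommended geo-filter can be contrasted in a small sketch. This is a simplified model rather than Twitter’s implementation. It assumes each tweet has been reduced to a dictionary with hypothetical keys `point` (longitude, latitude) and `place_box` (west, south, east, north), and the two bounding boxes are rough illustrative values rather than official extents.

```python
# Approximate boxes as (west, south, east, north); illustrative values only.
SAN_DIEGO_BOX = (-117.6, 32.53, -116.08, 33.51)
CALIFORNIA_BOX = (-124.41, 32.53, -114.13, 42.01)


def point_in_box(lon, lat, box):
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def boxes_overlap(a, b):
    """True if two (west, south, east, north) boxes share any area or edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def streaming_rule_matches(tweet, box):
    """Model of the three documented rules for the locations parameter."""
    point = tweet.get("point")
    if point is not None:
        return point_in_box(point[0], point[1], box)   # rule 1
    place_box = tweet.get("place_box")
    if place_box is not None:
        return boxes_overlap(place_box, box)           # rule 2
    return False                                       # rule 3


def strict_geo_filter(tweets, box):
    """Post-collection filter: keep only tweets whose point lies in the box."""
    return [
        t for t in tweets
        if t.get("point") is not None and point_in_box(t["point"][0], t["point"][1], box)
    ]


downtown = {"point": (-117.1611, 32.7157), "place_box": None}
statewide = {"point": None, "place_box": CALIFORNIA_BOX}   # checked in to "California"
collected = [t for t in (downtown, statewide) if streaming_rule_matches(t, SAN_DIEGO_BOX)]
print(len(collected), len(strict_geo_filter(collected, SAN_DIEGO_BOX)))
```

The strict filter discards place-only tweets entirely. A study that needs them could instead test the place polygon against the city boundary in GIS software, as the chapter recommends.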
14.4 Identify Commercial Bots and Cyborgs in Geo-tagged Tweets

After manually reviewing hundreds of geo-tagged tweets in our social media data collection, we found that one easy way to detect bot or cyborg tweets created for advertisement or spam is to examine a metadata field of the tweet called “source”. If a tweet was created on an iPhone, the source field will be “Twitter for iPhone”. If a tweet was created by specific bots or web programs (for cyborgs), the source field could be “TweetMyJOBS”, “dlvr.it”, or “AutoCarSale”. Therefore, by classifying the different “source” values in the metadata, we can remove the noise or spam created by bots or cyborgs for commercial purposes (Table 14.2).

We collected one month of geo-tagged tweets in San Diego County in November 2015. After removing the tweets outside San Diego County, 97,944 tweets remain. After reviewing hundreds of different source names in these tweets, we created a black list of possible bots and advertisement cyborgs based on the source names in San Diego (Table 14.2). Of the total tweets in San Diego, 29.42% are recognized as created by bots or cyborgs according to our list.

Table 14.2 The proportion of social media bots with different “source” names in one month of geo-tagged tweets (November, 2015) in San Diego County

| Source category | Source name (tweet frequency) | Tweet frequency (total) | (%) |
|---|---|---|---|
| Job | TweetMyJOBS (16,005); SafeTweet by TweetMyJOBS (4,726); CareerCenter (6) | 20,737 | 21.17 |
| Advertisement | dlvr.it; Coupon; dine here; Simply Best Coupons; Auto City Sales; sp_california; Golfstar | 3,421 | 3.49 |
| Weather | Cities (2,105); iembot (24); Sandaysoft Cumulus (7) | 2,136 | 2.18 |
| Earthquake | everyEarthquake (762); Earthquake (203); EarthquakeTrack.com (69); QuakeSOS (9) | 1,043 | 1.06 |
| News | San Diego Trends (843); WordPress.com (111) | 954 | 0.97 |
| Traffic | TTN SD traffic (512); TTN LA traffic (11) | 523 | 0.53 |
| Percentage of noise | | | 29.42 |
Table 14.2 lists the top six categories of bot sources in San Diego: Jobs (21.17%), Advertisement (3.49%), Weather (2.18%), Earthquake information (1.06%), News (0.97%), and Traffic (0.53%).

As in the previous comparison, we also collected one month of geo-tagged tweets in the City of Columbus in November 2015. Table 14.3 shows the black list of bots and advertisement tweets in Columbus. Based on our black list, 53.47% of the tweets in Columbus are recognized as messages created by bots, a much higher percentage than in the San Diego case. The top five categories of bot sources in Columbus are Jobs (43.23%), Advertisement (3.63%), Traffic (2.79%), News (2.10%), and Weather (1.72%). We did not find any Earthquake information bots in the City of Columbus.

Table 14.3 The proportion of data noises with different “source” names in one month of geo-tagged tweets (November, 2015) in the City of Columbus

| Source category | Source name (tweet frequency) | Tweet frequency (total) | (%) |
|---|---|---|---|
| Job | TweetMyJOBS (16,789); SafeTweet by TweetMyJOBS (6,250) | 23,039 | 43.23 |
| Advertisement | dlvr.it; Coupon; dine here; Beer Menus; DanceDeets; JCScoop; LeadingCourses.com; SmartSearch; sp_oregon; circlepix; sp_ohio | 1,934 | 3.63 |
| Traffic | TTN CMH traffic (1,486) | 1,486 | 2.79 |
| News | Columbus Trends (1,021); eLobbyist (80); WordPress.com (10); twitterfeed (8); stolen_bike_alerter (1) | 1,120 | 2.10 |
| Weather | Cities (578); iembot (337) | 915 | 1.72 |
| Percentage of noise | | | 53.47 |

Figure 14.3 illustrates the proportion of “source” categories, including both noise and non-noise tweets, within one month of geo-tagged tweets (November 2015) collected from the Streaming API within the San Diego County boundary. Red indicates tweets labelled as bots or cyborgs based on our black list. Green indicates tweets created on generic Twitter platforms (such as the Android or iOS Twitter apps). Blue indicates tweets from other third-party apps or services, such as Instagram or Foursquare.

The most popular source (platform) of geo-tagged tweets in San Diego is Instagram (46,484 tweets, 47.46%), a very popular social media platform for sharing photos and videos, either publicly or privately, on mobile devices. Instagram users can extend their photo sharing to other platforms, including Twitter and Facebook. Because geotagging is enabled by default in Instagram, most Instagram messages shared on Twitter include detailed geolocations. The second most popular platform is “TweetMyJOBS”, which posts job market advertising tweets using the geolocations of the recruiting companies. Foursquare is a location-based search-and-discovery service with personalized recommendations and tips. The generic Twitter apps rank fifth (Android) and seventh (iPhone) (Fig. 14.3).

Figure 14.4 illustrates the proportion of “source” categories within the City of Columbus. The most popular platform in Columbus is the “TweetMyJOBS” bot, and the second most popular is Instagram. The ranking of the remaining categories is similar to the San Diego case study.
Fig. 14.3 The numbers of Tweets produced by different platforms inside the San Diego bounding box during the month of November, 2015
Fig. 14.4 The numbers of Tweets produced by different platforms inside the City of Columbus bounding box during the month of November, 2015
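The source-based classification described in this section can be sketched as a small tally. The sketch assumes that the raw “source” value may arrive either as a plain name or wrapped in an HTML anchor tag, as in Twitter API payloads we are aware of. The helper names and the abbreviated black list are hypothetical, and the sample records are invented.

```python
import re
from collections import Counter

# Matches an HTML anchor and captures its visible text.
ANCHOR = re.compile(r"<a[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

# Abbreviated, illustrative black list: source name -> source category.
BLACK_LIST = {
    "TweetMyJOBS": "Job",
    "SafeTweet by TweetMyJOBS": "Job",
    "dlvr.it": "Advertisement",
    "Cities": "Weather",
    "San Diego Trends": "News",
    "TTN SD traffic": "Traffic",
}


def source_name(raw):
    """Strip an optional anchor tag and return the bare source name."""
    raw = raw or ""
    match = ANCHOR.search(raw)
    return (match.group(1) if match else raw).strip()


def noise_by_category(tweets):
    """Return {category: (tweet count, percent of all tweets)}."""
    total = len(tweets)
    counts = Counter()
    for t in tweets:
        category = BLACK_LIST.get(source_name(t.get("source")))
        if category is not None:
            counts[category] += 1
    return {c: (n, round(100.0 * n / total, 2)) for c, n in counts.items()}


def noise_share(tweets):
    """Percent of tweets whose source is on the black list."""
    if not tweets:
        return 0.0
    flagged = sum(1 for t in tweets if source_name(t.get("source")) in BLACK_LIST)
    return round(100.0 * flagged / len(tweets), 2)


sample = [
    {"source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'},
    {"source": '<a href="https://example.com" rel="nofollow">TweetMyJOBS</a>'},
    {"source": '<a href="https://example.com">dlvr.it</a>'},
    {"source": "Cities"},
]
print(noise_by_category(sample))
print(noise_share(sample))
```

The percentages in Tables 14.2 and 14.3 follow the same arithmetic. For example, the 20,737 Job tweets out of 97,944 tweets in San Diego County give 21.17%.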
14.5 Creating a White List or a Black List for Social Media Bots?

The previous section illustrated a method of identifying and removing social media bots by creating a black list based on the “source” metadata field. The alternative is to create a white list, which includes only valid devices or computing environments in the Twitter source field, such as “Twitter for iPhone”, “Twitter for Android”, “Twitter Web Client”, “Twitter for iPad”, and “Twitter for Windows”. However, the selection rules for white lists and black lists are subjective and difficult to standardize. Different cities and different events will yield different types of bots and cyborgs, so their black lists and white lists will differ. In addition, data collected from different types of Twitter APIs (Streaming APIs vs. Search APIs with keywords) need different black lists or white lists. Table 14.4 shows an example of a white list and a black list built from data collected through the Twitter Search APIs with keywords, which differ considerably from the lists created using the Streaming APIs.

Table 14.4 An example of white lists (partial) and black lists (partial) in 2017 case studies in San Diego using Twitter Search APIs

| White list: source name | Tweet count | Black list: source name | Tweet count |
|---|---|---|---|
| Twitter for iPhone | 8,760 | dlvr.it | 191 |
| Twitter Web Client | 5,958 | San Diego Now | 129 |
| Twitter for Android | 5,659 | OOYUZNEWSTWEETS | 123 |
| Twitter for iPad | 1,387 | trueAnthem | 64 |
| Twitter Lite | 667 | WordPress.com | 23 |
| Facebook | 624 | Parkbench.com | 19 |
| IFTTT | 430 | SocialNewsDesk | 18 |
| TweetDeck | 333 | LinkedIn | 16 |
| Google | 187 | Convey: Make it post for you | 13 |
| Buffer | 120 | Twitshot.com | 11 |
| Hootsuite | 114 | Great SDSU Tweets | 9 |
| SocialFlow | 98 | Periscope | 7 |
| Twitter for Windows | 96 | Pullquote: save and share quotes | 7 |
| Tweetbot for iOS | 63 | CMS Dicom Medios | 5 |
| Paper.li | 54 | FeedBlitz | 5 |
| Instagram | 46 | MantSanDiego | 5 |
| Mobile Web (M2) | 35 | GroupTweet | 4 |
| TweetCaster for Android | 34 | HubSpot | 4 |
| Twitter for Mac | 28 | TMV Twitter Publish Beta | 4 |
| RoundTeam | 26 | $tup!dTwitB0t$ | 3 |
| Echofon | 18 | AC/DC Bag | 3 |
| iOS | 16 | Crowdfire - Go Big | 3 |
| Tumblr | 13 | GalacticConnectionLiveApp | 3 |
| Twitter for Windows Phone | 13 | NetworkedBlogs | 3 |

In general, tweets from white list sources can be considered human-made tweets posted by individual users. White lists can be further divided into mobile devices and desktop devices. If a social media research project focuses only on mobile users, it can use a white list that contains only sources on mobile devices. Most bots are not able to create messages directly on mobile devices. Black lists include Twitter client platforms that provide automatic tweeting functions for multiple accounts or bots. These platforms are mostly used by commercial accounts and generate repeated content. For example, “dlvr.it” is a Twitter client platform that provides an automatic tweeting function for sending large numbers of messages on a schedule or for recycling previous posts.

It is difficult to decide in advance whether a white list or a black list is appropriate for a specific research task. Data scientists should examine the distribution of the different sources closely and also consider the nature of the case study and the purpose of the research. For geo-tagged tweets collected by the Streaming APIs, a black list might be the better choice because the top ten bots and cyborgs account for a significant and identifiable percentage of tweets. Tweets collected using Search APIs with keywords might call for a white list because the list of devices and platforms that would have to be black-listed could be very long.
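The practical difference between the two list types is how they treat sources that nobody has reviewed yet: a white list drops them and a black list keeps them. The sketch below illustrates this under stated assumptions. The key `source_name`, the function names, and the split of source names into mobile and desktop sets are hypothetical choices made for illustration.

```python
from collections import Counter

# Illustrative partitions; the mobile/desktop split is our assumption.
WHITE_LIST_MOBILE = {"Twitter for iPhone", "Twitter for Android", "Twitter for iPad"}
WHITE_LIST_DESKTOP = {"Twitter Web Client", "Twitter for Windows", "Twitter for Mac"}
BLACK_LIST = {"dlvr.it", "TweetMyJOBS", "SafeTweet by TweetMyJOBS"}


def filter_by_source(tweets, names, mode):
    """Keep listed sources (mode='white') or drop listed sources (mode='black')."""
    if mode == "white":
        return [t for t in tweets if t["source_name"] in names]
    if mode == "black":
        return [t for t in tweets if t["source_name"] not in names]
    raise ValueError("mode must be 'white' or 'black'")


def unreviewed_sources(tweets, white, black):
    """Count sources on neither list, i.e. candidates for manual review."""
    return Counter(
        t["source_name"] for t in tweets
        if t["source_name"] not in white and t["source_name"] not in black
    )


sample = [
    {"source_name": "Twitter for iPhone"},
    {"source_name": "dlvr.it"},
    {"source_name": "Instagram"},
]
mobile_only = filter_by_source(sample, WHITE_LIST_MOBILE, "white")
without_bots = filter_by_source(sample, BLACK_LIST, "black")
print(len(mobile_only), len(without_bots))
print(unreviewed_sources(sample, WHITE_LIST_MOBILE | WHITE_LIST_DESKTOP, BLACK_LIST))
```

In this invented sample, the Instagram tweet survives the black list but not the white list, which mirrors the trade-off discussed above.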
14.6 Spatial Distribution Patterns of Geo-tagged Social Media Bots

This section focuses on visualizing the actual locations of bot and cyborg tweets in our collected data. From a spatial analysis perspective, if this noise is not removed, the spatial distribution patterns of messages created by bots may cause significant problems or biases in the outcomes of many location-based social media studies. We started by mapping the most popular bot in our black list, TweetMyJOBS, a Twitter-based recruiting service that posts job advertisement tweets. We found that most TweetMyJOBS tweets are located around city centers (downtown San Diego or downtown Columbus) (Fig. 14.5) because of the higher density of business buildings and addresses there. The red areas in Fig. 14.5 are the high kernel density areas created by clusters of tweets.

Other bots and cyborgs show different spatial distribution patterns, depending on their source names:

• Cities is a Twitter bot for weather forecasting. Its tweets follow the locations of major weather stations or the centers of local neighborhoods throughout each city.
• dlvr.it is a content sharing service that helps users distribute their posted social media messages to other platforms automatically. All dlvr.it tweets are located at a single point (the center of downtown San Diego and the center of downtown Columbus).
• San Diego Trends and Columbus Trends are used for sending out local news. Each tweets from only one location: San Diego City Hall and Columbus City Hall, respectively.
Fig. 14.5 The spatial distribution of the TweetMyJOBS tweets (blue color dots) in San Diego (left) and Columbus (right) during the month of November, 2015. Red color areas are the hotspots of clustered tweets in these locations
Fig. 14.6 The spatial distribution of the TTN CMH traffic tweets in Columbus during the month of November, 2015. Red dots are the clustered tweets created by the traffic bots
• TTN CMH traffic is a platform for posting traffic accidents. It generated 1,486 tweets in the City of Columbus during November 2015. According to the spatial distribution and kernel density map, its tweets are highly correlated with major roads and road intersections (Fig. 14.6).

Based on these findings, we conclude that the spatial distribution patterns of geo-tagged messages created by bots and cyborgs are not random. Different types of bots generate their own characteristic spatial distribution patterns. Researchers should remove the messages created by bots or cyborgs before conducting any spatiotemporal analysis.
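One simple way to surface the patterns listed above is to count how many distinct locations each source tweets from. The sketch below is only an illustration and does not reproduce the kernel density analysis used for the figures. It assumes hypothetical tweet keys `source_name`, `lon`, and `lat`, rounds coordinates to four decimal places to define a location, and uses invented sample points.

```python
from collections import defaultdict


def location_profile(tweets, decimals=4):
    """Return {source: {'tweets': n, 'unique_points': k}} per source name."""
    points = defaultdict(set)
    counts = defaultdict(int)
    for t in tweets:
        key = (round(t["lon"], decimals), round(t["lat"], decimals))
        points[t["source_name"]].add(key)
        counts[t["source_name"]] += 1
    return {s: {"tweets": counts[s], "unique_points": len(points[s])} for s in counts}


def single_point_sources(tweets, min_tweets=3, decimals=4):
    """Sources that post repeatedly but always from one location."""
    profile = location_profile(tweets, decimals)
    return sorted(
        s for s, p in profile.items()
        if p["tweets"] >= min_tweets and p["unique_points"] == 1
    )


sample = [
    {"source_name": "dlvr.it", "lon": -117.1611, "lat": 32.7157},
    {"source_name": "dlvr.it", "lon": -117.1611, "lat": 32.7157},
    {"source_name": "dlvr.it", "lon": -117.1611, "lat": 32.7157},
    {"source_name": "TTN SD traffic", "lon": -117.10, "lat": 32.75},
    {"source_name": "TTN SD traffic", "lon": -117.12, "lat": 32.80},
    {"source_name": "TTN SD traffic", "lon": -117.20, "lat": 32.90},
    {"source_name": "Twitter for iPhone", "lon": -117.15, "lat": 32.72},
]
print(single_point_sources(sample))
```

A source that posts many tweets from one coordinate is a strong candidate for the black list, while road-following or station-following patterns need a map or a spatial join to confirm.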
14.7 User Biases and Tweeting Frequency in Geo-tagged Tweets

To explore potential user biases in geo-tagged tweets, we calculated the frequency of geo-tagged tweets for each unique user. Geo-tagged tweets were collected within the bounding boxes of San Diego County, CA and the City of Columbus, OH in the U.S. throughout November 2015. After removing the bots and cyborgs, 69,317 human-made tweets remain within the San Diego bounding box, and 15,916 unique Twitter users (accounts) were identified in the collected tweets. Figure 14.7 (left) shows the number of users by their geo-tagged tweeting rates throughout November 2015 in San Diego. Over 7,900 users posted only one tweet during the whole month, which accounts for 49% of all users. More than 80% of Twitter users created fewer than 5 tweets in the whole month, but 1% of Twitter users created 16% of all tweets. This finding is very similar to those reported in other studies of Twitter messages (Li et al. 2013; Sloan and Morgan 2015).

Using the bounding box of the City of Columbus, we identified 29,902 human-made tweets and 8,758 unique Twitter users for November 2015. Over 5,000 users in Columbus tweeted only once during the whole month, which accounts for 58% of all users (Fig. 14.7, right). Over 86% of Twitter users created fewer than 5 tweets in the whole month. Meanwhile, 1% of Twitter users created 19% of all tweets.

Fig. 14.7 The number of users along with their geo-tagged tweeting rates within November 2015 in San Diego County (left) and in Columbus City (right)

The patterns of tweets per human user in the San Diego and Columbus cases are similar (Fig. 14.7). The San Diego case has a larger number of human tweets and unique human users. The Columbus case has a higher proportion of users with only one tweet per month and a higher percentage of tweets created by the top 1% of human users. Table 14.5 shows a side-by-side comparison of the two case studies.

Table 14.5 User biases comparison between San Diego and Columbus cases

| | Human tweets | Human users | Ratio of users who tweeted 1 time (%) | Ratio of users who tweeted 1–5 times (%) | Most active user |
|---|---|---|---|---|---|
| San Diego | 69,317 | 15,916 | 49 | 84.2 | 903 |
| Columbus | 29,902 | 8,758 | 58 | 89.5 | 964 |

To avoid user biases (a small percentage of users creating a large share of tweets) in social media analytics, we can use the number of unique users rather than the number of tweets for statistical analysis or sentiment analysis (Tsou 2015). We can also remove the top 1% or 5% of active users if we need to analyze the common messages of the general population. Another possible method is to select one tweet per unique user for sentiment analysis or hotspot analysis.
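The user-bias measures reported in this section, and one of the suggested remedies, can be computed from a list of tweets with a few counters. This is a minimal sketch under stated assumptions: each tweet is a dictionary with a hypothetical `user_id` key, the top 1% of users is taken as the ceiling of 1% of the user count (at least one user), and “one tweet per user” keeps the first tweet encountered. The sample data are invented.

```python
from collections import Counter
from math import ceil


def user_bias_summary(tweets, top_share=0.01):
    """Summarize how unevenly tweets are spread across users."""
    per_user = Counter(t["user_id"] for t in tweets)
    n_users = len(per_user)
    n_tweets = sum(per_user.values())
    if n_users == 0:
        return {"users": 0, "tweets": 0,
                "single_tweet_user_share": 0.0, "top_user_tweet_share": 0.0}
    single = sum(1 for n in per_user.values() if n == 1)
    top_k = max(1, ceil(n_users * top_share))
    top_tweets = sum(n for _, n in per_user.most_common(top_k))
    return {
        "users": n_users,
        "tweets": n_tweets,
        "single_tweet_user_share": single / n_users,
        "top_user_tweet_share": top_tweets / n_tweets,
    }


def one_tweet_per_user(tweets):
    """Keep the first tweet seen for each user."""
    seen = set()
    kept = []
    for t in tweets:
        if t["user_id"] not in seen:
            seen.add(t["user_id"])
            kept.append(t)
    return kept


sample = [{"user_id": "a"}] * 7 + [{"user_id": "b"}, {"user_id": "c"}, {"user_id": "d"}]
print(user_bias_summary(sample))
print(len(one_tweet_per_user(sample)))
```

Counting unique users per spatial unit, as recommended above, amounts to applying the same per-user grouping inside each unit.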
14.8 Recommended Data Wrangling and Cleaning Procedures

Data wrangling (Kandel et al. 2011) is an important procedure in data science for cleaning and processing big data. When social media researchers and geographers use geo-tagged tweets in their research projects, they need to adopt appropriate data wrangling procedures to remove data noise, spam, user biases, and system errors from the collected geo-tagged tweets. This paper recommends the following procedure for handling geo-tagged tweets collected from Twitter’s Streaming APIs:

1. Use the Twitter Streaming APIs with a bounding box to collect geo-tagged tweets.
2. Use the same bounding box to filter out (remove) tweets outside the bounding box or outside the targeted city boundary using GIS software.
3. Manually review the Source field in the remaining tweets to identify the top 10 or more bots or cyborgs, and create a black list or a white list for the data cleaning procedure.
4. Remove tweets created by bots or cyborgs using the black list, or select the meaningful tweets using the white list (depending on the purpose of the research and the nature of the events).
5. If the research goal is to focus on aggregated messages or trending topic analysis, researchers can conduct social media analysis using the total number of tweets within a spatial unit (such as county boundaries or zip codes) for spatial analysis.
6. If the research needs to analyze common messages from the general population or representative public opinions, researchers should consider user biases and convert the tweet data into the number of unique users within a spatial unit rather than using the total number of tweets within a spatial unit. Another possible solution is to remove the top 1% or 5% of active users from the collected tweets.
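Steps 2 to 4 of this procedure can be sketched as follows. The sketch assumes that each tweet is a dictionary with hypothetical keys `lon`, `lat`, and `source` (a plain client name); the bounding box values and the black list entry are placeholders for illustration, not the ones used in this study. The manual review in step 3 still has to be done by a person; the helper only lists the most frequent Source values as candidates.

```python
from collections import Counter

def in_bbox(lon, lat, bbox):
    """Return True if the point falls inside bbox = (west, south, east, north)."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north

def top_sources(tweets, n=10):
    """Step 3 helper: list the n most frequent Source values for manual review."""
    return Counter(t["source"] for t in tweets).most_common(n)

def clean_tweets(tweets, bbox, blacklist=None, whitelist=None):
    """Steps 2 and 4: drop tweets outside the bounding box, then apply the lists."""
    kept = []
    for t in tweets:
        if not in_bbox(t["lon"], t["lat"], bbox):
            continue  # step 2: outside the study area
        if whitelist is not None and t["source"] not in whitelist:
            continue  # step 4 (white list): keep only approved sources
        if blacklist is not None and t["source"] in blacklist:
            continue  # step 4 (black list): drop known bots or cyborgs
        kept.append(t)
    return kept

# Illustrative bounding box and records (all values are made up).
BBOX = (-117.6, 32.5, -116.1, 33.5)
tweets = [
    {"lon": -117.16, "lat": 32.72, "source": "Twitter for iPhone"},
    {"lon": -117.10, "lat": 32.80, "source": "ExampleJobBot"},
    {"lon": -118.24, "lat": 34.05, "source": "Twitter for Android"},
    {"lon": -117.20, "lat": 32.75, "source": "Twitter for Android"},
]
print(top_sources(tweets, n=3))
print(clean_tweets(tweets, BBOX, blacklist={"ExampleJobBot"}))
```

Whether a black list or a white list is the better choice depends on the research purpose, as step 4 notes; the function accepts either so that both strategies can be compared on the same data.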
Geo-tagged social media data have been used by many GIScientists and researchers. However, many previous research projects did not adopt a comprehensive data wrangling procedure for cleaning geo-tagged tweets collected from Twitter’s Streaming APIs. This paper examined the spatial distribution patterns of geo-tagged tweets created by bots and cyborgs and recommended a comprehensive procedure for identifying and cleaning these tweets. By identifying these system errors, user biases, and data noise, researchers can remove tweets created by bots and cyborgs from their raw data before conducting the actual spatial analysis, and thereby improve their research findings and outcomes.
Acknowledgements This material is based upon work supported by the National Science Foundation under Grant No. 1416509, project titled “Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks” and Grant No. 1634641, IMEE project titled “Integrated Stage-Based Evacuation with Social Perception Analysis and Dynamic Population Estimation”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation. The authors thank the other HDMA team members for their contributions to this research.
References

Akimoto, A. (2011). Japan the Twitter nation. In Japan Times. http://www.japantimes.co.jp/life/2011/05/18/digital/japan-the-twitter-nation/.
Azmandian, M., Singh, K., Gelsey, B., Chang, Y-H., & Maheswaran, R. (2013). Following human mobility using tweets. In L. Cao, Y. Zeng, A. L. Symeonidis, V. I. Gorodetsky, P. S. Yu, & M. P. Singh (Eds.), Agents data mining interaction (pp. 139–149). Heidelberg, Berlin: Springer.
Boshmaf, Y., Muslukhov, I., Beznosov, K., & Ripeanu, M. (2011). The socialbot network: When bots socialize for fame and money. In Proceedings of the 27th Annual Computer Security Applications Conference (pp. 93–102). New York, NY, USA: Association for Computing Machinery.
Bratko, A., Filipič, B., Cormack, G. V., Lynam, T. R., & Zupan, B. (2006). Spam filtering using statistical data compression models. Journal of Machine Learning Research, 7, 2673–2698.
Chu, Z., Gianvecchio, S., Wang, H., & Jajodia, S. (2010). Who is tweeting on Twitter: Human, bot, or cyborg? In Proceedings of the 27th Annual Computer Security Applications Conference (pp. 21–30). New York, NY, USA: Association for Computing Machinery.
Chu, Z., Widjaja, I., & Wang, H. (2012). Detecting social spam campaigns on twitter. In F. Bao, P. Samarati, & J. Zhou (Eds.), Applied cryptography and network security (pp. 455–472). Heidelberg, Berlin: Springer.
Fetterly, D., Manasse, M., & Najork, M. (2004). Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases Coloca ACM SIGMODPODS 2004 (pp. 1–6). New York, NY, USA: Association for Computing Machinery.
Goodman, J., Heckerman, D., & Rounthwaite, R. (2005). Stopping spam. Scientific American, 292, 42–49.
Grier, C., Thomas, K., Paxson, V., & Zhang, M. (2010). @spam: The underground on 140 characters or less. In Proceedings of the 17th ACM Conference Computer and Communications Security (pp. 27–37). New York, NY, USA: Association for Computing Machinery.
Kandel, S., Heer, J., Plaisant, C., Kennedy, J., van Ham, F., Riche, N. H., Weaver, C., Lee, B., Brodbeck, D., & Buono, P. (2011). Research directions in data wrangling: Visualizations and transformations for usable and credible data. Information Visualization, 10, 271–288.
Krishnamurthy, B., Gill, P., & Arlitt, M. (2008). A few chirps about twitter. In Proceedings of the 1st Workshop on online society network (pp. 19–24). New York, NY, USA: Association for Computing Machinery.
Li, L., Goodchild, M. F., & Xu, B. (2013). Spatial, temporal, and socioeconomic patterns in the use of Twitter and Flickr. Cartography and Geographic Information Science, 40, 61–77.
Lin C. P., & Huang H. P. (2013). A study of effective features for detecting long-surviving Twitter spam accounts. In 2013 15th International Conference on Advanced Communication Technology (pp. 841–846). Pyeong Chang, South Korea: ICACT.
Lumezanu, C., & Feamster, N. (2012). Observing common spam in Twitter and email. In Proceedings of the 2012 Internet Measurement Conference (pp. 461–466). Association for Computing Machinery.
Messias, J., Schmidt, L., Oliveira, R., & Souza, F. B. D. (2013). You followed my bot! Transforming robots into influential users in Twitter.
Morstatter, F., Pfeffer, J., Liu, H., & Carley, K. (2013). Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s Firehose. In Proceedings of the 7th International AAAI Conference on Weblogs and Society Media 7.
Sloan, L., & Morgan, J. (2015). Who Tweets with their location? Understanding the relationship between demographic characteristics and the use of Geoservices and Geotagging on Twitter. PLOS ONE, 10, e0142209.
Thomas, K., & Nicol, D. M. (2010). The Koobface botnet and the rise of social malware. In 2010 5th International Conference on Malicious and Unwanted Software (pp. 63–70).
Tsou, M.-H. (2015). Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science, 42, 70–74.
Tsou, M.-H., & Leitner, M. (2013). Visualization of social media: seeing a mirage or a message? Cartography and Geographic Information Science, 40, 55–60.
Twitter. (2016). The Twitter rules. https://support.twitter.com/articles/20170467.
Uribl. (2013). Realtime URL blacklist. http://uribl.com/about.shtml.
Wang, A. H. (2010). Don’t follow me: Spam detection in Twitter. In 2010 International Conference on Security and Cryptography SECRYPT (pp. 1–10).
Wang, Z., Ye, X., & Tsou, M.-H. (2016). Spatial, temporal, and content analysis of Twitter for wildfire hazards. Natural Hazards, 83, 523–540.
Yardi, S., Romero, D., Schoenebeck, G., & Boyd, D. (2010). Detecting spam in a Twitter network. First Monday. https://doi.org/10.5210/fm.v15i1.2793
Chapter 15
The Future of Human Dynamics Study: Research Challenges and Opportunities During and After the COVID-19 Pandemic

Ming-Hsiang Tsou

M.-H. Tsou, Department of Geography, Center for Human Dynamics in the Mobile Age, San Diego State University, San Diego, CA, USA
© Springer Nature Switzerland AG 2021. A. Nara and M.-H. Tsou (eds.), Empowering Human Dynamics Research with Social Media and Geospatial Data Analytics, Human Dynamics in Smart Cities, https://doi.org/10.1007/978-3-030-83010-6_15

15.1 The Impact of COVID-19 Pandemic for Human Dynamics Study

Human Dynamics is an exciting research domain bridging social computational science, big data analytics, public health, transportation, machine learning methods, etc. As illustrated in the previous chapters, Human Dynamics is a transdisciplinary research field focusing on the understanding of dynamic patterns, relationships, narratives, changes, and transitions of human activities, behaviors, and communications (originally defined on the HDMA Center website; HDMA, 2014; Tsou, 2018). There are many promising research opportunities in Human Dynamics, but challenges and obstacles remain. One of the major recent challenges (2020 and 2021) is the unexpected worldwide COVID-19 pandemic. Starting in December 2019, the COVID-19 pandemic has changed everyone’s life significantly, both globally and locally. As of April 1, 2021, over 30 million people had been infected and more than 549,000 deaths had been reported in the United States (CDC, 2020). The impacts of the COVID-19 pandemic are tremendous in many aspects, including public health, economic development, travel, social inequality, education, and academic research. With the new policies and restrictions in many countries and regions (such as social distancing, stay-at-home orders, lock-down policies, work-from-home (WFH), online classes, etc.), the spatial and temporal patterns of human movement, activities, and behaviors have been altered and transformed significantly. These changes create a great opportunity for researchers to understand the complex relationships and interactions of human movements and behaviors in the real world and to compare the dynamic changes before and after the pandemic. The following sections provide some suggestions about the short-term and long-term research agendas, opportunities, and challenges in the study of Human Dynamics.
15.2 The Specialized COVID-19 Research Agendas for Human Dynamics (The Short-Term and the Long-Term Agendas)

The study of Human Dynamics relies heavily on the processing and analysis of observable real-world data and information about human behaviors, communication, and activities. Unlike earlier pandemics (such as the 1918 Spanish Flu), the COVID-19 pandemic has unfolded at a time when advances in data science, medical sensors, biotechnologies, and information technologies provide a tremendous amount of data and medical records for its study. Public health agencies can monitor and track COVID-19 outbreaks in local neighborhoods and provide daily updates and near real-time reports to the public via web-based dashboards or updated web maps. The COVID-19 pandemic is the first example of a “data-rich” and “real-time trackable” pandemic in the twenty-first century. The following sections propose two research agendas for the study of Human Dynamics during and after the data-rich COVID-19 pandemic. The short-term research agenda focuses on immediate actions and responses to mitigate the pandemic or to provide data-oriented suggestions to public health agencies regarding their policy updates or new regulations during the pandemic. The long-term research agenda emphasizes the long-term impact of the COVID-19 pandemic on our local and global communities in the domains of transportation, education, economic activities, and public health.
15.2.1 Short Term (0–3 Years) Research Agenda

There are four key topics suggested for the short-term COVID-19 research agenda, focusing on the immediate actions, responses, and opportunities to understand and mitigate the pandemic. Some research outcomes can also be used to provide guidance or facilitate resource management for government agencies.

1. Establishing a transdisciplinary and interoperable COVID-19 data research consortium for the study of the COVID-19 pandemic. The data consortium should bring together medical, public health, socioeconomic, governmental, social media, and transportation research institutes and data centers. Currently, many research agencies and institutes have collected and archived various COVID-19 related datasets individually. However, it is very difficult to combine or integrate different types of COVID-19 data from multiple agencies for data fusion or information synthesis (Sui, 2017). For example, a local government agency may provide a daily Zip Code level datasheet of COVID-19 confirmed cases, which is not compatible with the human mobility data provided by SafeGraph at the daily census block group level. The weekly unemployment rate data from Applied Geographic Solutions, Inc. (AGS) (at the block group level or Zip Code level) do not match some local public health data at the sub-regional area (SRA) level. Different cities and counties in the US provide different types of COVID-19 datasets with different spatial and temporal units. There is an urgent need to develop a comprehensive data sharing mechanism and an interoperable standard for COVID-19 related datasets. The new data consortium should focus on the development of interoperable data standards and open-source or sharable data conversion platforms, algorithms, and tools.

2. Facilitating interdisciplinary research collaboration and synergy to develop new theories and new methods for analyzing COVID-19 outbreaks. In addition to data interoperability, interdisciplinary research synergy is more important for advancing our medical and scientific knowledge to mitigate the COVID-19 pandemic. Research on COVID-19 outbreaks spans many fields, such as medical science, epidemiology, sociology, geography, psychology, economics, urban studies, communication, and public affairs. Many traditional mathematical models and theories for epidemiology and public health are no longer applicable to the COVID-19 outbreaks due to dramatic changes in modern society and in data collection methods, including international travel, cultural impacts, public transportation systems, social media, and network communications. To understand the complicated relationships between Human Dynamics and the COVID-19 outbreaks, scientists and researchers from different disciplines need to collaborate to develop new methods and new tools. For example, we need to develop new communication theories and intervention messages targeting salient beliefs for social media channels in order to improve COVID-19 vaccination rates among the public and minority groups. Another example is to develop new GIS methods and spatial diffusion models to study how COVID-19 outbreaks diffuse and spread at different temporal scales (daily, weekly, monthly) and different spatial scales (countries, states, counties, cities, Zip Codes, census tracts, etc.).

3. Identifying the health disparity problems in the COVID-19 pandemic from multiple perspectives, including the study of infection rates in different neighborhoods and different ethnic groups, the arrangement of medical resources (hospital beds, ICU beds, testing clinics, etc.), mortality rates, and the setting of vaccination priorities. Many recent COVID-19 outbreak research projects have revealed serious health disparity problems in terms of ethnicity, socioeconomic status, education, and age groups at different spatial and temporal scales (Cordes & Castro, 2020; Tsou et al., 2021). However, many published results focus mainly on a single perspective of health disparity rather than on the interconnected health disparity factors and their integrated impacts on different communities and groups. We need to study the health disparity issues of the COVID-19 pandemic by inspecting the complex associations between all possible related outcomes (confirmed cases, hospitalization cases, ICU cases, death cases, testing rates, positive rates, vaccination rates, etc.) and all potential factors together (socioeconomic status, ethnicity, personal medical history, local neighborhood environment, etc.). These associations may change across geographic scales (countries, counties, cities, Zip Codes, health service regions, etc.) and temporal phases (early outbreaks, slow growth, rapid outbreaks, etc.) due to the modifiable areal unit problem (MAUP) and changes in social restrictions or governmental policies.

4. Analyzing human mobility changes from multiple data sources, such as mobile phones and social media, and their possible associations with governmental restrictions, events, vacations, or holidays. Since the spread of COVID-19 is highly related to human contact and human movements, many social media platforms and mobile apps are providing dynamic human movement datasets to help researchers monitor the effectiveness of stay-at-home orders or travel restrictions, including SafeGraph mobility data, Google Community Mobility Reports, Apple Mobility Trends Reports, Facebook Movement Range data, etc. These data need to be further analyzed and integrated in order to provide a more meaningful explanation of human movement during the COVID-19 pandemic. Researchers need to explore possible connections between human movements and infection rates, and their associations with stay-at-home orders, presidential election events, holidays, and governmental restrictions at the local neighborhood or municipal level. The dynamic change of human mobility patterns might also be associated with economic activities or societal indexes (such as unemployment rates, GDP growth, or business closures) related to COVID-19 outbreaks.
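The spatial-unit mismatch described in the first topic (block group mobility data versus Zip Code case data) is often handled by aggregating the finer unit to the coarser one with a weighted crosswalk table. The sketch below illustrates that idea only; the crosswalk weights, identifiers, and visit counts are hypothetical, and apportioning counts by a fixed weight is itself an assumption that may not suit every variable.

```python
from collections import defaultdict

def aggregate_to_zip(block_group_values, crosswalk):
    """Apportion block-group counts to Zip Codes using crosswalk weights.

    crosswalk maps each block group to {zip_code: weight}; the weights of
    one block group are assumed to sum to 1 (for example, population shares).
    """
    totals = defaultdict(float)
    for block_group, value in block_group_values.items():
        for zip_code, weight in crosswalk[block_group].items():
            totals[zip_code] += value * weight
    return dict(totals)

# Hypothetical daily visit counts at the census block group level.
visits = {"bg1": 100, "bg2": 40}

# Hypothetical crosswalk: bg2 straddles two Zip Codes.
crosswalk = {
    "bg1": {"92101": 1.0},
    "bg2": {"92101": 0.25, "92103": 0.75},
}

print(aggregate_to_zip(visits, crosswalk))
```

Once both datasets share the same spatial unit, mobility and case counts can be joined by Zip Code and date. A shared, documented crosswalk of this kind is one example of the conversion tools that the proposed data consortium could maintain.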
15.2.2 Long Term (4–10 years) Research Agenda and Impacts

The long-term research agenda (4–10 years) of Human Dynamics should emphasize the long-term impacts of the COVID-19 pandemic on our local and global communities in the domains of transportation, education, public health, economic activities, etc. Some impacts could be fixed or recovered through the collaborative efforts of government agencies and local communities. Some long-term impacts are irreversible and irrevocable. Researchers need to focus on the following long-term impacts and make adjustments to cope with the fundamental changes in our society after the COVID-19 pandemic.

1. The percentage of work-from-home (WFH) employees and online education (synchronous or asynchronous) students will increase dramatically after the COVID-19 pandemic. During the stay-at-home period, many people realized that their daily work and office routines can be done remotely, and the new working style (work-from-home) is also welcomed by employers because it saves the cost of office space and transportation. Many students and teachers like online courses for their flexibility and accessibility, although many students also express a preference for traditional face-to-face education. The dramatically increased acceptance of WFH and online education will have significant impacts on the patterns of human movement and transportation in the long run. Researchers need to analyze the change of human transportation needs and the adjustment of urban transportation systems due to the new trends of WFH and online education. The study of Human Dynamics in workers’ transportation patterns and student enrollments may help us to better understand these changes, to rebuild the urban transportation infrastructure, and to re-open schools, colleges, and universities after the COVID-19 pandemic.

2. Analyzing the dynamic movements of consumer products and mixed digital online services (such as Uber Eats, Amazon, Netflix, etc.) as part of Human Dynamics studies. During the COVID-19 pandemic, many people use online order services to purchase their food, clothes, essentials, and entertainment products, rather than going physically to stores, shopping centers, or movie theaters. The patterns of consumer movements and behaviors in the retail sector have changed dramatically during the pandemic. Researchers should focus on the study of dynamic transportation routes and delivery destinations of consumer products (such as Uber Eats and Amazon orders) from providers to consumers. The changes of customer behaviors and purchase preferences during and after the COVID-19 pandemic in different regions and cities can provide valuable insight for the long-term recovery and rebuilding of local communities. However, many consumer datasets might be controlled by the service platforms and will not be available to researchers. In addition, researchers should also analyze the usage patterns of online entertainment streaming services, such as Netflix, YouTube, HBO Go, Disney Plus (+), etc. During the COVID-19 pandemic, the usage of these streaming video services has increased dramatically due to the stay-at-home or WFH orders. The analysis of these usage patterns can provide a better understanding of how to improve and customize these video streaming services for various users and neighborhoods using their geolocations (Zip Codes, cities, etc.), socioeconomic status (incomes, ethnic groups, etc.), and personal profiles.

3. Focusing on the study of Human Dynamics using Internet of Things (IoT) devices. IoT devices are embedded physical sensors with network connections that provide fast status updates of devices or exchange data with other systems (Ashton, 2009). There is great potential to study Human Dynamics using IoT devices after the COVID-19 pandemic. For example, smart home devices (such as thermostats and humidifiers) can be used to monitor the energy consumption and humidity of each household. Some research projects have indicated that a high-humidity environment may reduce the chance of COVID-19 infection. There are other possible studies in the fields of smart transportation (connected vehicles), mobile wearable trackers (Fitbit trackers, Apple watches, etc.), and smart appliances (refrigerators, TVs, smart speakers, etc.). Due to the COVID-19 outbreaks, many households have spent significant amounts of money to remodel their houses or improve their homes by adding smart home devices or smart appliances. The data collected from these home-based IoT devices can be used to study collective human behaviors and human activities, which are essential to understanding the complicated patterns of Human Dynamics. Another example is the Strava global heatmap (Rettberg, 2020). Strava is a popular exercise app that uses GPS data to monitor and track users’ exercise performance. The Strava global heatmap is created by combining 700 million human activity records from Strava’s global network of athletes (https://www.strava.com/heatmap).

4. Extending the study of Human Dynamics from the real world to Virtual Reality (VR). With the constraints and restrictions on physical interaction and social distancing due to the COVID-19 pandemic, many human activities and interactions have been transformed into online forms. Traditional online communication tools and environments are not ideal for the needs of education, skill training, and business activities (Martín-Gutiérrez et al., 2017). It is essential to build a comprehensive Virtual Reality (VR) environment (Ashtari et al., 2020) that is customizable for various online services after the pandemic. Along with the growth of VR environments, the study of Human Dynamics should focus on how to collect and measure human activities, movements, and communications inside these VR platforms. What kinds of VR user experiences and feedback can be collected and analyzed for different applications? Different applications will have different types of user interfaces and user interactions. Possible VR applications include online education, healthcare services, telemedicine, remote consulting, entertainment, business activities, etc.

5. Facilitating the development of responsive decision support systems for the effective management of public health resources and services. During 2020, the failure of many local and federal governments to monitor and mitigate the COVID-19 outbreaks indicated that traditional public health policy making procedures and medical resource management are problematic and sluggish. Many practices and regulations in epidemiology and health intervention methods are not able to cope with the fast change of COVID-19 outbreaks. Public health researchers and staff need to integrate the study of Human Dynamics with public health management and epidemiology. We need to develop responsive and adjustable decision support systems for monitoring disease outbreaks and public health crises. For example, the daily updates of confirmed COVID-19 cases at the census tract level can be used to adjust the next week’s locations for COVID-19 testing sites and vaccination sites. We can relocate these testing sites and vaccination sites to the neighborhoods that need them most. By combining web-based GIS functions and real-time tracking of outbreaks, public health officials and staff can use the information to create effective strategies and plans to mitigate the impacts of disease outbreaks and to prevent shortages of medical resources. In addition, by linking to other IoT devices and citizen science sensors, we might be able to create an advanced syndromic surveillance system to predict potential unknown outbreaks in advance.
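The testing-site example in the last topic can be illustrated with a very small ranking rule: compute the case rate of the most recent week for each census tract and list the tracts with the highest rates as candidates for next week's sites. This is a sketch under simplifying assumptions, not an operational method; the tract identifiers, case series, and populations are made up, and a real decision support system would also weigh accessibility, site capacity, and equity.

```python
def rank_tracts(daily_cases, population, window=7, top_k=2):
    """Rank tracts by cases per 10,000 residents over the last `window` days.

    Returns the top_k tract ids and the full dictionary of rates.
    """
    rates = {}
    for tract, series in daily_cases.items():
        recent_cases = sum(series[-window:])
        rates[tract] = 10000 * recent_cases / population[tract]
    ranked = sorted(rates, key=rates.get, reverse=True)
    return ranked[:top_k], rates

# Hypothetical daily confirmed cases for the last seven days.
daily_cases = {
    "T1": [1, 2, 3, 4, 5, 6, 7],
    "T2": [0, 0, 1, 0, 0, 1, 0],
    "T3": [5, 5, 5, 5, 5, 5, 5],
}
population = {"T1": 4000, "T2": 2000, "T3": 10000}

priority, rates = rank_tracts(daily_cases, population)
print(priority)
print(rates)
```

In this sample, tract T3 has the most cases in absolute terms, but T1 ranks first once population is taken into account, which shows why the unit of comparison matters when sites are relocated.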
15.3 The Future Challenges of Human Dynamics Study After the “Data-Rich” COVID-19 Pandemic

The “data-rich” COVID-19 pandemic created many research opportunities for the study of Human Dynamics. However, there are still several research challenges and obstacles for COVID-19 related studies, such as data privacy and data sharing issues. The following paragraphs highlight two key challenges for the study of the COVID-19 pandemic and Human Dynamics.

The first challenge is data privacy and data protection. Many COVID-19 datasets may include medical testing data and hospital treatment records. Researchers need to remove identifiable information first and then aggregate individual records into groups or regions. Following the data protection policies defined by federal laws or local regulations, researchers need to consider many aspects of data privacy, including the categories of identifiable information, locational privacy, and data security and protection methods. One major concern for the study of Human Dynamics is the protection of locational privacy, which may include individuals’ residential locations, GPS data, or physical activities. There is no clear guidance on how to protect locational privacy in COVID-19 related data. One preliminary suggestion is to develop data suppression methods for the mapping of sensitive confirmed cases, mortality cases, or hospitalization cases. For example, if the total number of cases within a region (Zip Code units, census tracts, or sub-regional areas) is below 5, the actual number is replaced with “N/A” (not available) to protect patient locations in a small region. Other cartographic representation methods could also be used to protect locational privacy, such as heatmaps, hexagon maps, and kernel density maps. Geo-masking algorithms (Seidl et al., 2016) are another possible option for the protection of locational privacy.
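The suppression rule described above (case counts below 5 are replaced with “N/A”) can be written as a short function. This is a minimal sketch; the region identifiers and counts are hypothetical, and the threshold is a parameter because agencies may choose a different cutoff.

```python
def suppress_small_counts(counts, threshold=5, placeholder="N/A"):
    """Replace any regional count below `threshold` with a placeholder."""
    return {
        region: (placeholder if n < threshold else n)
        for region, n in counts.items()
    }

# Hypothetical confirmed-case counts by Zip Code.
cases = {"92101": 12, "92103": 3, "92104": 5, "92105": 0}
print(suppress_small_counts(cases))
```

Note that this simple rule does not cover situations in which a suppressed value could be recovered by subtracting the published values from a published total; handling that would require additional suppression, which is outside the scope of this sketch.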
The second challenge is data sharing, especially for the human mobility data available from for-profit corporations, such as SafeGraph, Twitter, Facebook, Apple, and Google. Although these mobility data are free to use and available from the companies’ websites, the actual ownership of the data still belongs to these companies. Unlike open data (which can be re-distributed or modified freely), commercial data carry many restrictions for the study of Human Dynamics and COVID-19 outbreaks. For example, downloaded original Twitter data cannot be shared or re-distributed online, so other researchers cannot validate or verify previously published Twitter data analysis procedures. Some companies do not provide a detailed explanation of their data collection methods and data cleaning procedures. For-profit corporations might also stop or change their data services at any time. On the other hand, data from the public sector are more straightforward to share. However, there are still some minor concerns in the public sector regarding data sharing policy. Some local public health agencies may not want to provide detailed COVID-19 data at a smaller region level (such as the census tract level) in order to prevent potential litigation from citizens regarding their privacy protection.
15.4 The Conclusion

The COVID-19 pandemic is an unexpected disaster in human history. It has dramatically altered everyone’s behaviors, activities, communications, and movements. It also opens a great opportunity for researchers to study Human Dynamics with real-world datasets and to develop more effective tools and systems to mitigate the damage of the COVID-19 pandemic. This final chapter provides some representative examples and suggestions for the future development of Human Dynamics in both the short term and the long term.

In addition, future research activities in Human Dynamics need to follow two high-level research principles to guide their research goals: transformative research and translational research. Adopting the definition from the National Science Board (2007), NSF states that “Transformative research involves ideas, discoveries, or tools that radically change our understanding of an important existing scientific or engineering concept or educational practice or leads to the creation of a new paradigm or field of science, engineering, or education” (NSF, 2007). The research agendas proposed for Human Dynamics should become a catalyst to transform scientific research (National Academies of Sciences, Engineering, and Medicine, 2019) in the domains of public health, epidemiology, transportation, and other disciplines into new paradigms or new discoveries. Translational research (Woolf, 2008) can be defined as “effective translation of the new knowledge, mechanisms, and techniques generated by advances in basic science research into new approaches for prevention, diagnosis, and treatment of disease is essential for improving health” (Fontanarosa & DeAngelis, 2002, p. 1728). We can extend the definition of translational research from the health science domain to other fields, such as public affairs, transportation, economics, etc.

The study of Human Dynamics should extend the theories and methods of basic science to practical applications and systems for improving our health, economy, social justice, etc. Both individuals and humanity as a whole should benefit directly from the research findings and outcomes of Human Dynamics. By adopting both transformative and translational research objectives, the study of Human Dynamics could become an important key to resolving the current medical challenges and social problems caused by the COVID-19 pandemic. Researchers and scientists will collaborate to develop new methods, new theories, and new tools to advance the knowledge of Human Dynamics and to build a safe and resilient community in a post-pandemic world.
References

Ashtari, N., Bunt, A., McGrenere, J., Nebeling, M., & Chilana, P. K. (2020). Creating augmented and virtual reality applications: Current practices, challenges, and opportunities. In Proceedings of 2020 CHI Conference on Human Factors in Computing Systems (pp. 1–13). Association for Computing Machinery, New York, NY, USA.
Ashton, K. (2009). That ‘internet of things’ thing. RFID J, 22, 97–114.
CDC. (2020). Coronavirus disease 2019 (COVID-19). Centers for Disease Control and Prevention. Retrieved April 1, 2021, from https://www.cdc.gov/coronavirus/2019-ncov/index.html
Cordes, J., & Castro, M. C. (2020). Spatial analysis of COVID-19 clusters and contextual factors in New York City. Spatial and Spatio-Temporal Epidemiology, 34, 100355.
Fontanarosa, P. B., & DeAngelis, C. D. (2002). Basic science and translational research in JAMA. JAMA, 287, 1728.
HDMA. (2014). The center for human dynamics in the mobile age (HDMA center) website. Retrieved April 1, 2021, from https://humanDynamics.sdsu.edu/
Martín-Gutiérrez, J., Mora, C. E., Añorbe-Díaz, B., & González-Marrero, A. (2017). Virtual technologies trends in education. EURASIA Journal of Mathematics, Science and Technology Education, 13, 469–486.
National Academies of Sciences, Engineering, and Medicine. (2019). Fostering transformative research in the geographical sciences. https://doi.org/10.17226/21881
National Science Board. (2007). Enhancing support of transformative research at the National Science Foundation. National Science Foundation.
NSF. (2007). Transformative research: Definition. Retrieved April 15, 2021, from https://www.nsf.gov/about/transformative_research/definition.jsp
Rettberg, J. W. (2020). Situated data analysis: A new method for analysing encoded power relationships in social media platforms and apps. Humanities & Social Sciences Communications, 7, 1–13.
Seidl, D. E., Jankowski, P., & Tsou, M.-H. (2016). Privacy and spatial pattern preservation in masked GPS trajectory data. International Journal of Geographical Information Science, 30, 785–800.
Sui, D. (2017). Information synthesis. In International encyclopedia of geography (pp. 1–13). American Cancer Society.
Tsou, M.-H. (2018). The future development of GISystems, GIScience, and GIServices. In B. Huang (Ed.), Comprehensive geographic information systems (pp. 1–4). Elsevier, Oxford.
Tsou, M.-H., Xu, J., Lin, C.-D., Daniels, M., Embury, J., Ko, E., & Gibbons, J. (2021). Analyzing socioeconomic factors and health disparity of COVID-19 spatiotemporal spread patterns at neighborhood levels in San Diego County. medRxiv 2021.02.22.21251757.
Woolf, S. H. (2008). The meaning of translational research and why it matters. JAMA. https://doi.org/10.1001/jama.2007.26