English Pages 910 [896] Year 2021
Studies in Computational Intelligence 1072
Rosa Maria Benito · Chantal Cherifi · Hocine Cherifi · Esteban Moro · Luis M. Rocha · Marta Sales-Pardo Editors
Complex Networks & Their Applications X Volume 1, Proceedings of the Tenth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2021
Studies in Computational Intelligence Volume 1072
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence, quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at https://link.springer.com/bookseries/7092
Editors
Rosa Maria Benito, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid, Madrid, Madrid, Spain
Chantal Cherifi, IUT Lumière, University of Lyon, Bron Cedex, France
Hocine Cherifi, LIB, UFR Sciences et Techniques, University of Burgundy, Dijon, France
Esteban Moro, Grupo Interdisciplinar de Sistemas Complejos, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
Luis M. Rocha, Thomas J. Watson College of Engineering and Applied Science, Binghamton University, Binghamton, USA
Marta Sales-Pardo, Department of Chemical Engineering, Universitat Rovira i Virgili, Tarragona, Tarragona, Spain
ISSN 1860-949X ISSN 1860-9503 (electronic) Studies in Computational Intelligence ISBN 978-3-030-93408-8 ISBN 978-3-030-93409-5 (eBook) https://doi.org/10.1007/978-3-030-93409-5 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This is the tenth edition of the “International Conference on Complex Networks & their Applications”, one of the major international events in network science. Every year, it brings together researchers from a wide variety of scientific backgrounds, ranging from finance, medicine and neuroscience, biology and earth sciences, and sociology and politics to computer science and physics, among many others, to review the field’s current state and formulate new directions. The great diversity of the attendees is an opportunity for cross-fertilization between fundamental issues and innovative applications.

The proceedings of this edition, hosted by the Polytechnic University of Madrid in Spain from November 30 to December 02, 2021, contain a good sample of high-quality contributions to the multiple issues of complex networks research. This edition attracted authors from all over the world, with 424 submissions from 56 countries. Each submission has been peer-reviewed by at least three independent reviewers from the international program committee. The 137 papers included in the proceedings are the results of this rigorous selection process.

The success of an edition depends above all on the quality of its contributors. Credit also goes to the keynote speakers for their fascinating plenary lectures:
• Marc Barthélémy (CEA, France): “Challenges in Spatial Networks”
• Ginestra Bianconi (Queen Mary University of London, UK): “Higher-Order Networks and their Dynamics”
• João Gama (University of Porto, Portugal): “Mining Evolving Large-Scale Networks”
• Dirk Helbing (ETH Zürich, Switzerland): “How Networks Can Change Everything for Better or for Worse”
• Yizhou Sun (UCLA, USA): “Graph-based Neural ODEs for Learning Dynamical Systems”
• Alessandro Vespignani (Northeastern University, USA): “Computational Epidemiology at the time of COVID-19”
Our thanks also go to the speakers of the traditional tutorial sessions for delivering insightful talks on November 29, 2021:
• Elisabeth Lex (Graz University of Technology, Austria) and Markus Schedl (Johannes Kepler University, Austria): “Psychology-informed Recommender Systems”
• Giovanni Petri (ISI Foundation, Italy): “A crash-course in TDA and higher-order dynamics”

Organizing this edition during a pandemic has been quite challenging. The deep involvement of many people, institutions, and sponsors is the key to its success. We sincerely thank the advisory board members, Jon Crowcroft (University of Cambridge, UK), Raissa D’Souza (University of California, Davis, USA), Eugene Stanley (Boston University, USA), and Ben Y. Zhao (University of Chicago, USA), for inspiring the essence of the conference. We record our thanks to our fellow members of the organizing committee: José Fernando Mendes (University of Aveiro, Portugal), Jesús Gomez Gardeñes (University of Zaragoza, Spain), and Huijuan Wang (TU Delft, Netherlands), the lightning sessions chairs; Manuel Marques Pita (Universidade Lusófona, Portugal), José Javier Ramasco (IFISC, Spain), and Taha Yasseri (University of Oxford, UK), the poster sessions chairs; Luca Maria Aiello (ITU Copenhagen, Denmark) and Leto Peel (Université Catholique de Louvain, Belgium), the tutorial chairs; Sabrina Gaito (University of Milan, Italy) and Javier Galeano (Universidad Politécnica de Madrid, Spain), the satellite chairs; Benjamin Renoust (Osaka University, Japan) and Xiangjie Kong (Dalian University of Technology, China), the publicity chairs; and Regino Criado (Universidad Rey Juan Carlos, Spain) and Roberto Interdonato (CIRAD - UMR TETIS, Montpellier, France), the sponsor chairs. Our profound thanks go to Matteo Zignani (University of Milan, Italy), publication chair, for his tremendous work in managing the submission system and the proceedings publication process.
Thanks to Stephany Rajeh (University of Burgundy, France), Web chair, for maintaining the website. We would also like to record our appreciation for the work of the local committee chair, Juan Carlos Losada (Universidad Politécnica de Madrid, Spain), and all the local committee members, David Camacho (UPM, Spain), Fabio Revuelta (UPM, Spain), Juan Manuel Pastor (UPM, Spain), Francisco Prieto (UPM, Spain), Leticia Perez Sienes (UPM, Spain), Jacobo Aguirre (CSIC, Spain), and Julia Martinez-Atienza (UPM, Spain), for their work in managing the sessions. They contributed greatly to the success of this edition. We are also indebted to our partners, Alessandro Fellegara and Alessandro Egro from Tribe Communication, for their passion and patience in designing the conference’s visual identity. We would like to express our gratitude to our partner journals involved in sponsoring keynote talks: Applied Network Science, EPJ Data Science, Social Network Analysis and Mining, and Entropy.
We are thankful to all those who have contributed to the success of this meeting. Sincere thanks to the authors for their creativity. Finally, we would like to express our most sincere thanks to the program committee members for their considerable efforts in producing high-quality reviews in minimal time. These volumes gather the most advanced contributions of the international community to the research issues surrounding the fascinating world of complex networks. Their breadth, quality, and novelty demonstrate the profound contribution of complex networks to understanding our world. We hope you will enjoy reading the papers as much as we enjoyed organizing the conference and putting this collection of articles together.

Rosa M. Benito
Hocine Cherifi
Esteban Moro
Chantal Cherifi
Luis Mateus Rocha
Marta Sales-Pardo
The original version of the book was revised: The volume number has been changed from 1015 to 1072. The correction to the book is available at https://doi.org/10.1007/978-3-030-93409-5_72
Organization and Committees
General Chairs
Rosa M. Benito, Universidad Politécnica de Madrid, Spain
Hocine Cherifi, University of Burgundy, France
Esteban Moro, Universidad Carlos III, Spain

Advisory Board
Jon Crowcroft, University of Cambridge, UK
Raissa D’Souza, Univ. of California, Davis, USA
Eugene Stanley, Boston University, USA
Ben Y. Zhao, University of Chicago, USA

Program Chairs
Chantal Cherifi, University of Lyon, France
Luis M. Rocha, Binghamton University, USA
Marta Sales-Pardo, Universitat Rovira i Virgili, Spain

Satellite Chairs
Sabrina Gaito, University of Milan, Italy
Javier Galeano, Universidad Politécnica de Madrid, Spain

Lightning Chairs
José Fernando Mendes, University of Aveiro, Portugal
Jesús Gomez Gardeñes, University of Zaragoza, Spain
Huijuan Wang, TU Delft, Netherlands
Poster Chairs
Manuel Marques-Pita, University Lusófona, Portugal
José Javier Ramasco, IFISC, Spain
Taha Yasseri, University of Oxford, UK

Publicity Chairs
Benjamin Renoust, Osaka University, Japan
Andreia Sofia Teixeira, University of Lisbon, Portugal
Michael Schaub, MIT, USA
Xiangjie Kong, Dalian University of Technology, China

Tutorial Chairs
Luca Maria Aiello, Nokia-Bell Labs, UK
Leto Peel, UC Louvain, Belgium

Sponsor Chairs
Roberto Interdonato, CIRAD - UMR TETIS, France
Regino Criado, Universidad Rey Juan Carlos, Spain

Local Committee Chair
Juan Carlos Losada, Universidad Politécnica de Madrid, Spain

Local Committee
Jacobo Aguirre, CSIC, Spain
David Camacho, UPM, Spain
Julia Martinez-Atienza, UPM, Spain
Juan Manuel Pastor, UPM, Spain
Leticia Perez Sienes, UPM, Spain
Francisco Prieto, UPM, Spain
Fabio Revuelta, UPM, Spain

Publication Chair
Matteo Zignani, University of Milan, Italy

Web Chair
Stephany Rajeh, University of Burgundy, France
Program Committee
Jacobo Aguirre, Centro de Astrobiología, Spain
Amreen Ahmad, Jamia Millia Islamia, India
Masaki Aida, Tokyo Metropolitan University, Japan
Luca Maria Aiello, IT University of Copenhagen, Denmark
Marco Aiello, University of Stuttgart, Germany
Esra Akbas, Oklahoma State University, USA
Mehmet Aktas, University of Central Oklahoma, USA
Tatsuya Akutsu, Kyoto University, Japan
Reka Albert, The Pennsylvania State University, USA
Laura Alessandretti, ENS Lyon, Italy
Antoine Allard, Université Laval, Canada
Aleksandra Aloric, Institute of Physics Belgrade, Serbia
Claudio Altafini, Linköping university, Sweden
Benjamin Althouse, Institute for Disease Modeling, USA
Luiz G.A. Alves, Northwestern University, USA
G. Ambika, Indian Institute of Science Education and Research, India
Fred Amblard, University Toulouse 1, France
Andre Franceschi De Angelis, Unicamp, Brazil
Marco Tulio Angulo, National Autonomous University of Mexico, Mexico
Alberto Antonioni, Carlos III University of Madrid, Spain
Nino Antulov-Fantulin, ETH Zurich, Switzerland
Nuno Araujo, Universidade de Lisboa, Portugal
Elsa Arcaute, University College London, UK
Saul Ares, Centro Nacional de Biotecnología, Spain
Panos Argyrakis, Aristotle University of Thessaloniki, Greece
Malbor Asllani, University College Dublin, Ireland
Tomaso Aste, University College London, UK
Martin Atzmueller, Osnabrueck University, Germany
Konstantin Avrachenkov, INRIA, France
Giacomo Baggio, University of Padova, Italy
Rodolfo Baggio, Bocconi University and Tomsk Polytechnic University, Italy
Franco Bagnoli, University of Florence, Italy
Annalisa Barla, Università di Genova, Italy
Paolo Barucca, University College London, UK
Nikita Basov, St. Petersburg State University, Russia
Giulia Bassignana, INSERM, France
Gareth Baxter, University of Aveiro, Portugal
Marya Bazzi, University of Oxford and University of Warwick, UK
Andras A. Benczur, Hungarian Academy of Sciences, Hungary
Rosa M. Benito, Universidad Politécnica de Madrid, Spain
Kamal Berahmand, Queensland University of Technology, Australia
Marc Bertin, Université Claude Bernard Lyon 1, France
Ginestra Bianconi, Queen Mary University of London, UK
Ofer Biham, The Hebrew University of Jerusalem, Israel
Hanjo Boekhout, Leiden University, Netherlands
Johan Bollen, Indiana University Bloomington, USA
Anthony Bonato, Ryerson University, Canada
Christian Bongiorno, Università degli studi di Palermo, Italy
Anton Borg, Blekinge Institute of Technology, Sweden
Javier Borge-Holthoefer, Internet Interdisciplinary Institute, Spain
Stefan Bornholdt, Universität Bremen, Germany
Javier Borondo, UPM, Spain
Cecile Bothorel, IMT Atlantique, France
Federico Botta, The University of Warwick, UK
Julien Bourgeois, UBFC, France
Alexandre Bovet, Mathematical Institute, University of Oxford, UK
Dan Braha, NECSI, USA
Ulrik Brandes, ETH Zürich, Switzerland
Rion Brattig Correia, Instituto Gulbenkian de Ciência, Portugal
Markus Brede, University of Southampton, UK
Piotr Bródka, Wroclaw University of Science and Technology, Poland
Iulia Martina Bulai, University of Basilicata, Italy
Javier M. Buldu, Center for Biomedical Technology, Spain
Francesco Bullo, UCSB, USA
Raffaella Burioni, Università di Parma, Italy
Noah Burrell, University of Michigan, USA
Fabio Caccioli, University College London, UK
Rajmonda Sulo Caceres, Massachusetts Institute of Technology, USA
Carmela Calabrese, University of Naples Federico, Italy
M Abdullah Canbaz, Indiana University Kokomo, USA
Carlo Vittorio Cannistraci, TU Dresden, Germany
Vincenza Carchiolo, Università di Catania, Italy
Giona Casiraghi, ETH Zürich, Switzerland
Douglas Castilho, Federal University of Minas Gerais, Brazil
Costanza Catalano, Central Bank of Italy, Italy
Remy Cazabet, Université Lyon 1, France
Po-An Chen, National Chiao Tung University, Taiwan
Xihui Chen, University of Luxembourg, Luxembourg
Chantal Cherifi, Lyon 2 University, France
Hocine Cherifi, University of Burgundy, France
Guanghua Chi, Facebook, USA
Peter Chin, Boston University, USA
Matteo Chinazzi, Northeastern University, USA
Matteo Cinelli, University of Venice, Italy
Richard Clegg, Queen Mary University of London, UK
Reuven Cohen, Bar-Ilan University, Israel
Alessio Conte, University of Pisa, Italy
Michele Coscia, IT University of Copenhagen, Denmark
Christophe Crespelle, Université Claude Bernard Lyon 1, France
Regino Criado, Universidad Rey Juan Carlos, Spain
Pascal Crépey, EHESP, France
Mihai Cucuringu, University of Oxford, USA
Marcelo Cunha, Federal Institute of Bahia, Brazil
Bhaskar Dasgupta, University of Illinois at Chicago, USA
Joern Davidsen, University of Calgary, Canada
Toby Davies, University College London, UK
Pasquale De Meo, Vrije Universiteit Amsterdam, Italy
Fabrizio De Vico Fallani, Inria - ICM, France
Guillaume Deffuant, Cemagref, France
Charo I. del Genio, Coventry University, UK
Pietro Delellis, University of Naples Federico II, Italy
Yong Deng, Xi’an Jiaotong University, China
Mathieu Desroches, Inria Sophia Antipolis Méditerranée Research Centre, France
Patrick Desrosiers, Université Laval, Canada
Karel Devriendt, University of Oxford, UK
Kuntal Dey, Accenture Tech Labs, India
Riccardo Di Clemente, University of Exeter, UK
Matías Di Muro, Universidad Nacional de Mar del Plata, Argentina
Constantine Dovrolis, Georgia Institute of Technology, USA
Ahlem Drif, Sétif 1 University, Algeria
Johan Dubbeldam, Delft University of Technology, Netherlands
Jordi Duch, Universitat Rovira i Virgili, Spain
Nicolas Dugué, LIUM, France
Victor M Eguiluz, IFISC, Spain
Frank Emmert-Streib, Tampere University of Technology, Finland
Alexandre Evsukoff, COPPE/UFRJ, Brazil
Mauro Faccin, Université de Paris, France
Giorgio Fagiolo, Sant’Anna School of Advanced Studies, Italy
Guilherme Ferraz de Arruda, ISI Foundation, Italy
Daniel Figueiredo, UFRJ, Brazil
Marco Fiore, IMDEA Networks Institute, Spain
Alessandro Flammini, Indiana University Bloomington, USA
Manuel Foerster, Bielefeld University, Germany
Angelo Furno, University of Lyon, France
Sabrina Gaito, University of Milan, Italy
Javier Galeano, Universidad Politécnica de Madrid, Spain
Lazaros Gallos, Rutgers University, USA
Riccardo Gallotti, Fondazione Bruno Kessler, Italy
José Manuel Galán, Universidad de Burgos, Spain
Joao Gama, University of Porto, Portugal
Yerali Gandica, Cergy Paris Université, France
Jianxi Gao, Rensselaer Polytechnic Institute, USA
Floriana Gargiulo, University of Paris Sorbonne, France
Alexander Gates, Northeastern University, USA
Vincent Gauthier, Telecom Sud Paris, France
Ralucca Gera, Naval Postgraduate School, USA
Raji Ghawi, Technical University of Munich, Germany
Tommaso Gili, IMT School for Advanced Studies Lucca, Italy
Silvia Giordano, SUPSI, Switzerland
David Gleich, Purdue University, USA
Antonia Godoy, University College London, Spain
Kwang-Il Goh, Korea University, South Korea
Jesus Gomez-Gardenes, Universidad de Zaragoza, Spain
Antonio Gonzalez-Pardo, Universidad Rey Juan Carlos, Spain
Bruno Gonçalves, Data for Science, Inc., USA
Joana Gonçalves-Sá, Nova School of Business and Economics, Portugal
Przemyslaw Grabowicz, University of Massachusetts, Amherst, USA
Carlos Gracia-Lázaro, Institute for Biocomputation and Physics of Complex Systems, Spain
Jean-Loup Guillaume, Université de la Rochelle, France
Mehmet Gunes, Stevens Institute of Technology, USA
Sergio Gómez, Universitat Rovira i Virgili, Spain
Meesoon Ha, Chosun University, South Korea
Jürgen Hackl, University of Liverpool, Switzerland
Aric Hagberg, Los Alamos National Laboratory, USA
Chris Hankin, Imperial College London, UK
Yukio Hayashi, Japan Advanced Institute of Science and Technology, Japan
Mark Heimann, Lawrence Livermore National Laboratory, USA
Torsten Heinrich, Chemnitz University of Technology, Germany
Denis Helic, Graz University of Technology, Austria
Chittaranjan Hens, Indian Statistical Institute, India
Samuel Heroy, University College London, UK
Takayuki Hiraoka, Aalto University, Finland
Philipp Hoevel, University College Cork, Ireland
Petter Holme, Tokyo Institute of Technology, Japan
Seok-Hee Hong, University of Sydney, Australia
Ulrich Hoppe, University Duisburg-Essen, Germany
Yanqing Hu, Sun Yat-sen Univ., China
Sylvie Huet, Inrae, France
Yuichi Ikeda, Kyoto University, Japan
Roberto Interdonato, CIRAD - UMR TETIS, France
Antonio Iovanella, University of International Studies of Rome, Italy
Gerardo Iñiguez, Central European University, Austria
Mahdi Jalili, RMIT University, Australia
Jaroslaw Jankowski, West Pomeranian University of Technology, Poland
Marco Alberto Javarone, Centro Ricerche Enrico Fermi, UCL - CBT, Italy
Hawoong Jeong, Korea Advanced Institute of Science and Technology, South Korea
Tao Jia, Southwest University, China
Chunheng Jiang, Rensselaer Polytechnic Institute, USA
Jiaojiao Jiang, University of New South Wales, Australia
Ming Jiang, University of Illinois at Urbana-Champaign, USA
Di Jin, University of Michigan, USA
Hang-Hyun Jo, The Catholic University of Korea, South Korea
Ivan Jokić, Delft University of Technology, Netherlands
Jason Jung, Chung-Ang University, South Korea
Marko Jusup, Tokyo Institute of Technology, Japan
Arkadiusz Jędrzejewski, Wrocław University of Science and Technology, Poland
Rushed Kanawati, Université Sorbonne Paris Nord, France
Eytan Katzav, The Hebrew University of Jerusalem, Israel
Mehmet Kaya, Firat University, Turkey
Domokos Kelen, Institute for Computer Science and Control, Hungary
Dror Kenett, Johns Hopkins University, USA
Mohammad Khansari, University of Tehran, Iran
Khaldoun Khashanah, Stevens Institute of Technology, USA
Hamamache Kheddouci, Université Claude Bernard, France
Hyoungshick Kim, Sungkyunkwan University, South Korea
Jinseok Kim, University of Michigan, USA
Maksim Kitsak, Northeastern University, USA
Mikko Kivela, Aalto University, Finland
Konstantin Klemm, IFISC, Spain
Peter Klimek, Medical University of Vienna, Austria
Andrei Klishin, University of Pennsylvania, USA
Dániel Kondor, SMART, Singapore
Xiangjie Kong, Zhejiang University of Technology, China
Onerva Korhonen, Aalto University/Universidad Politécnica de Madrid, Finland
Elka Korutcheva, Universidad Nacional de Educación a Distancia, Spain
Dimitris Kotzinos, Paris University, France
Reimer Kuehn, King’s College London, UK
Prosenjit Kundu, State University of New York at Buffalo, USA
Ryszard Kutner, University of Warsaw, Poland
Haewoon Kwak, Singapore Management University, Singapore
Richard La, University of Maryland, USA
José Lages, Université Bourgogne Franche-Comté, France
Renaud Lambiotte, University of Oxford, UK
Aniello Lampo, Universitat Oberta de Catalunya, Spain
Valentina Lanza, Université du Havre, France
Paul Laurienti, Wake Forest, USA
Anna T. Lawniczak, University of Guelph, Canada
Eric Leclercq, University of Burgundy - Laboratory of Informatics (LIB), France
Deok-Sun Lee, Korea Institute for Advanced Study, South Korea
Sune Lehmann, Technical University of Denmark, Denmark
Balazs Lengyel, Hungarian Academy of Sciences, Hungary
Maxime Lenormand, INRAE, France
Juergen Lerner, University of Konstanz, Germany
Claire Lesieur, CNRS, France
Haiko Lietz, GESIS, Germany
Fabrizio Lillo, University of Bologna, Italy
Ji Liu, Stony Brook University, USA
Run-ran Liu, Hangzhou Normal University, China
Giacomo Livan, University College London, UK
Lorenzo Livi, University of Manitoba, Canada
Juan Carlos Losada, Universidad Politécnica de Madrid, Spain
Meilian Lu, Beijing University of Posts and Telecommunication, China
Luca Luceri, SUPSI, Switzerland
John C.S. Lui, The Chinese University of Hong Kong, Hong Kong
Leonardo Maccari, University of Venice, Italy
Matteo Magnani, Uppsala University, Sweden
Cécile Mailler, UVSQ, France
Fragkiskos Malliaros, Paris-Saclay University, France
Giuseppe Mangioni, University of Catania, Italy
Rosario Nunzio Mantegna, Palermo University, Italy
Madhav Marathe, University of Virginia, USA
Manuel Sebastian Mariani, University of Zurich, Switzerland
Radek Marik, Czech Technical University, Czechia
Daniele Marinazzo, Ghent University, Belgium
Andrea Marino, University of Florence, Italy
Antonio Marques, Universidad Rey Juan Carlos, Spain
Manuel Marques-Pita, Universidade Lusofona, Portugal
Christoph Martin, Hamburg University of Applied Sciences, Germany
Cristina Masoller, Universitat Politècnica de Catalunya, Spain
Rossana Mastrandrea, IMT Institute of Advanced Studies, Italy
Ruth Mateos de Cabo, Universidad CEU San Pablo, Spain
Michael Mathioudakis, University of Helsinki, Finland
John Matta, Southern Illinois University Edwardsville, USA
Fintan McGee, Luxembourg Institute of Science and Technology, Luxembourg
Matúš Medo, Bern University Hospital and University of Bern, Switzerland
Jose Fernando Mendes, University of Aveiro, Portugal
Ronaldo Menezes, University of Exeter, UK
Humphrey Mensah, Syracuse University, USA
Engelbert Mephu Nguifo, University Clermont Auvergne - LIMOS CNRS, France
Anke Meyer-Baese, FSU, USA
Salvatore Micciche, University of Padova, Italy
Radosław Michalski, Wrocław University of Science and Technology, Poland
Letizia Milli, CNR Pisa, Italy
Andreea Minca, Cornell University, USA
Boris Mirkin, Higher School of Economics Moscow, UK
Shubhanshu Mishra, University of Illinois at Urbana-Champaign, USA
Lewis Mitchell, The University of Adelaide, Australia
Bivas Mitra, Indian Institute of Technology Kharagpur, India
Marija Mitrović Dankulov, Institute of Physics Belgrade, Serbia
Andrzej Mizera, University of Luxembourg, Luxembourg
Osnat Mokryn, University of Haifa, Israel
Roland Molontay, Budapest University of Technology and Economics, Hungary
Raul Mondragon, Queen Mary University of London, UK
Misael Mongiovì, Consiglio Nazionale delle Ricerche, Italy
Alfredo Morales, MIT Media Lab, USA
Andres Moreira, Universidad Tecnica Federico Santa Maria, Chile
Esteban Moro, Universidad Carlos III de Madrid, Spain
Greg Morrison, University of Houston, USA
Sotiris Moschoyiannis, University of Surrey, UK
Igor Mozetič, Jozef Stefan Institute, Slovenia
Peter Mucha, Dartmouth College, USA
Alberto P. Munuzuri, Univ. of Santiago de Compostela, Sudan
Masayuki Murata, Osaka University, Japan
Tsuyoshi Murata, Tokyo Institute of Technology, Japan
Matthieu Nadini, City, University of London, UK
Alexandre Nicolas, Université Claude Bernard Lyon 1, France
Peter Niemeyer, Leuphana Universität Lüneburg, Germany
Jordi Nin, Universitat Ramon Llull, Spain
Rogier Noldus, Ericsson, Netherlands
El Faouzi Nour-Eddin, Université Gustave Eiffel, France
Neave O’Clery, University College London, UK
Masaki Ogura, Osaka University, Japan
Marcos Oliveira, University of Exeter/GESIS, USA
Andrea Omicini, Alma Mater Studiorum–Università di Bologna, Italy
Gergely Palla, Statistical and Biological Physics Research Group of HAS, Hungary
Pietro Panzarasa, Queen Mary University of London, UK
Fragkiskos Papadopoulos, Cyprus University of Technology, Cyprus
Symeon Papadopoulos, Information Technologies Institute, Greece
Michela Papandrea, SUPSI, Switzerland
Philip Pare, Purdue University, USA
Han Woo Park, Yeungnam University, South Korea
Juyong Park, Korea Advanced Institute of Science and Technology, South Korea
Pierre Parrend, EPITA Strasbourg, France
Leto Peel, Maastricht University, Belgium
Tiago Peixoto, Central European University, and ISI Foundation, Germany
Matjaz Perc, University of Maribor, Slovenia
Anthony Perez, LIFO - Université d’Orléans, France
Lilia Perfeito, LIP - Laboratório de Instrumentação e Física Experimental de Partículas, Portugal
Giovanni Petri, ISI Foundation, Italy
Juergen Pfeffer, Technical University of Munich, Germany
Carlo Piccardi, Politecnico di Milano, Italy
Bruno Pinaud, Université de Bordeaux, France
Flavio Pinheiro, Universidade NOVA de Lisboa, USA
Clara Pizzuti, National Research Council of Italy (CNR), Italy
Pawel Pralat, Ryerson University, Canada
Francisco Prieto-Castrillo, Universidad Politécnica de Madrid, Spain
Christophe Prieur, Telecom ParisTech, France
Natasa Przulj, University College London, Spain
Oriol Pujol, University of Barcelona, Spain
Rami Puzis, Ben Gurion University of the Negev, Israel
Christian Quadri, University of Milan, Italy
Filippo Radicchi, Northwestern University, USA
Jose J. Ramasco, IFISC (CSIC-UIB), Spain
Asha Rao, RMIT University, Australia
Miguel Rebollo, Universitat Politècnica de València, Spain
Gesine Reinert, University of Oxford, UK
Juan Restrepo, University of Colorado Boulder, USA
Daniel Rhoads, Universitat Oberta de Catalunya, Spain
Pedro Ribeiro, University of Porto, Portugal
Laura Ricci, Università di Pisa, Italy
Marian-Andrei Rizoiu, University of Technology, Sydney, Australia
Alessandro Rizzo, Politecnico di Torino, Italy
Luis M. Rocha, Binghamton University, USA
Fernando Rosas, Imperial College London, UK
Giulio Rossetti, KDD Lab ISTI-CNR, Italy
Fabrice Rossi, Université Paris Dauphine, France
Camille Roth, CNRS, Germany
Celine Rozenblat, University of Lausanne Institut de Géographie, Switzerland
Giancarlo Ruffo, Università degli Studi di Torino, Italy
Alex Rutherford, Max Planck Institute, Germany
Marta Sales-Pardo, Universitat Rovira i Virgili, Spain
Iraj Saniee, Bell Labs, Nokia, USA
Francisco C. Santos, Universidade de Lisboa, Portugal
Hiroki Sayama, Binghamton University, USA
Nicolas Schabanel, École Normale Supérieure de Lyon, France
Michael Schaub, RWTH Aachen University, Germany
Maximilian Schich, Prof. for Cultural Data Analytics, Estonia
Frank Schweitzer, ETH Zurich, Switzerland
Hamida Seba, University Lyon1, France
Emre Sefer, Ozyegin University, Turkey
Santiago Segarra, Rice University, USA
Irene Sendiña-Nadal, Rey Juan Carlos University, Spain
Termeh Shafie, University of Manchester, UK
Saray Shai, Wesleyan University, USA
Yilun Shang, Northumbria University, UK
Aneesh Sharma, Google, USA
Rajesh Sharma, University of Tartu, Estonia
Erez Shmueli, Tel-Aviv University, Israel
Julian Sienkiewicz, Warsaw University of Technology, Poland
Anurag Singh, National Institute of Technology Delhi, India
Tiratha Raj Singh, Jaypee University of Information Technology, India
Rishabh Singhal, Dayalbagh Educational Institute, India
Per Sebastian Skardal, Trinity College, USA
Oskar Skibski, University of Warsaw, Poland
Keith Smith, Nottingham Trent University, UK
Igor Smolyarenko, Brunel University, UK
Zbigniew Smoreda, Orange Labs, France
Annalisa Socievole, National Research Council of Italy, Italy
Albert Sole, Universitat Oberta de Catalunya, Spain
Sucheta Soundarajan, Syracuse University, USA
Jaya Sreevalsan-Nair, IIIT Bangalore, India
Massimo Stella, University of Exeter, UK
Fabian Stephany, University of Oxford, UK
Cedric Sueur, Université de Strasbourg, France
Kashin Sugishita, Tokyo Institute of Technology, Japan
Xiaoqian Sun, Beihang University, China
Xiaoqian Sun, Chinese Academy of Sciences, China
Michael Szell, IT University of Copenhagen, Denmark
Bosiljka Tadic, Jozef Stefan Institute, Slovenia
Andrea Tagarelli, University of Calabria, Italy
Patrick Taillandier, INRAE/MIAT, France
Kazuhiro Takemoto, Kyushu Institute of Technology, Japan
Frank Takes, Leiden University, Netherlands
Fabien Tarissan, ENS Paris-Saclay (ISP), France
Ana Maria Tarquis, Universidad Politécnica de Madrid, Spain
Claudio Juan Tessone, Universität Zürich, Switzerland
François Théberge, Tutte Institute for Mathematics and Computing, Canada
I-Hsien Ting, National University of Kaohsiung, Taiwan
Michele Tizzoni, ISI Foundation, Italy
Olivier Togni, Burgundy University, France
Joaquin J. Torres, University of Granada, Spain
Vincent Antonio Traag, Leiden University, Netherlands
Jan Treur, Vrije Universiteit Amsterdam, Netherlands
Janos Török, Budapest University of Technology and Economics, Hungary
Stephen Uzzo, New York Hall of Science, USA
Lucas D. Valdez, IFIMAR, UNMdP-CONICET, Argentina
Pim van der Hoorn, Eindhoven University of Technology, Netherlands
Piet Van Mieghem, Delft University of Technology, Netherlands
Onur Varol, Sabanci University, Turkey
Balazs Vedres, University of Oxford, UK
Wouter Vermeer, Northwestern University, USA
Christian Lyngby Vestergaard, Institut Pasteur, France
Nathalie Vialaneix, INRAE, France
Javier Villalba-Diez, Hochschule Heilbronn, Germany
Johannes Wachs, Central European University, Hungary
Huijuan Wang, Delft University of Technology, Netherlands
Lei Wang, Beihang University, China
Ingmar Weber, Qatar Computing Research Institute, Qatar
Guanghui Wen, Southeast University, China
Karoline Wiesner, University of Potsdam, Germany
Gordon Wilfong, Nokia Bell Labs, USA
Mateusz Wilinski, Los Alamos National Laboratory, USA
Richard Wilson, University of York, UK
Dirk Witthaut, Forschungszentrum Jülich, Germany
Bin Wu, Beijing University of Posts and Telecommunications, China
Jinshan Wu, Beijing Normal University, China
Haoxiang Xia, Dalian University of Technology, China
Fei Xiong, Beijing Jiaotong University, China
Xiaoke Xu, Dalian Minzu University, China
Gitanjali Yadav, University of Cambridge, UK
Gang Yan, Tongji University, Shanghai, China
Xiaoran Yan, Zhejiang Lab, China
Taha Yasseri, University College Dublin, Ireland
Ying Ye, Nanjing University, China
Sandro Zampieri, University of Padova, Italy
An Zeng, Beijing Normal University, China
Paolo Zeppini, GREDEG Group de Recherche en Droit Economie et Gestions, France
Yuexia Zhang, Beijing Information Science and Technology University, China
Zi-Ke Zhang, Hangzhou Normal University, China
Matteo Zignani, Università degli Studi di Milano, Italy
Eugenio Zimeo, University of Sannio, Italy
Lorenzo Zino, University of Groningen, Netherlands
Fabiana Zollo, Ca’ Foscari University of Venice, Italy
Arkaitz Zubiaga, Queen Mary University of London, UK
Contents

Network Analysis

A Fair-Cost Analysis of the Random Neighbor Sampling Method
    Yitzchak Novick and Amotz Bar-Noy
Analysis of Radiographic Images of Patients with COVID-19 Using Fractal Dimension and Complex Network-Based High-Level Classification
    Weiguang Liu, Jianglong Yan, Yu-tao Zhu, Everson José de Freitas Pereira, Gen Li, Qiusheng Zheng, and Liang Zhao
Dynamical Influence Driven Space System Design
    Ruaridh A. Clark, Ciara N. McGrath, and Malcolm Macdonald
Classification of Dispersed Patterns of Radiographic Images with COVID-19 by Core-Periphery Network Modeling
    Jianglong Yan, Weiguang Liu, Yu-tao Zhu, Gen Li, Qiusheng Zheng, and Liang Zhao
Small Number of Communities in Twitter Keyword Networks
    Linda Abraham, Anthony Bonato, and Alexander Nazareth
Finding Cross-Border Collaborative Centres in Biopharma Patent Networks: A Clustering Comparison Approach Based on Adjusted Mutual Information
    Zhen Zhu and Yuan Gao
Professional Judgments of Principled Network Expansions
    Ian Coffman, Dustin Martin, and Blake Howald
Attributed Graphettes-Based Preterm Infants Motion Analysis
    Davide Garbarino, Matteo Moro, Chiara Tacchino, Paolo Moretti, Maura Casadio, Francesca Odone, and Annalisa Barla
Dynamics of Polarization and Coalition Formation in Signed Political Elite Networks
    Ardian Maulana, Hokky Situngkir, and Rendra Suroso
Navigating Multidisciplinary Research Using Field of Study Networks
    Eoghan Cunningham, Barry Smyth, and Derek Greene
Public Procurement Fraud Detection: A Review Using Network Analysis
    Marcos S. Lyra, Flávio L. Pinheiro, and Fernando Bacao
Characterising Different Communities of Twitter Users: Migrants and Natives
    Jisu Kim, Alina Sîrbu, Giulio Rossetti, and Fosca Giannotti
Evolution of the World Stage of Global Science from a Scientific City Network Perspective
    Hanjo D. Boekhout, Eelke M. Heemskerk, and Frank W. Takes
Propagation on Multi-relational Graphs for Node Regression
    Eda Bayram
Realistic Commodity Flow Networks to Assess Vulnerability of Food Systems
    Abhijin Adiga, Nicholas Palmer, Sanchit Sinha, Penina Waghalter, Aniruddha Dave, Daniel Perez Lazarte, Thierry Brévault, Andrea Apolloni, Henning Mortveit, Young Yun Baek, and Madhav Marathe

Structural Network Measures

PageRank Computation for Higher-Order Networks
    Célestin Coquidé, Julie Queiros, and François Queyroi
Fellow Travelers Phenomenon Present in Real-World Networks
    Abdulhakeem O. Mohammed, Feodor F. Dragan, and Heather M. Guarnera
The Fréchet Mean of Inhomogeneous Random Graphs
    François G. Meyer
A Simple Extension of the Bag-of-Paths Model Weighting Path Lengths by a Poisson Distribution
    Sylvain Courtain and Marco Saerens
Spectral Rank Monotonicity on Undirected Networks
    Paolo Boldi, Flavio Furia, and Sebastiano Vigna
Learning Centrality by Learning to Route
    Liav Bachar, Aviad Elyashar, and Rami Puzis
On the Exponential Ranking and Its Linear Counterpart
    Dmitry Gromov and Elizaveta Evmenova
FPPR: Fast Pessimistic PageRank for Dynamic Directed Graphs
    Rohith Parjanya and Suman Kundu

Community Structure

An Extension of K-Means for Least-Squares Community Detection in Feature-Rich Networks
    Soroosh Shalileh and Boris Mirkin
Selecting Informative Features for Post-hoc Community Explanation
    Sophie Sadler, Derek Greene, and Daniel Archambault
Community Detection by Resistance Distance: Automation and Benchmark Testing
    Juan Gancio and Nicolás Rubido
Analysis of the Co-authorship Sub-networks of Italian Academic Researchers
    Vincenza Carchiolo, Marco Grassia, Michele Malgeri, and Giuseppe Mangioni
Dissecting Graph Measure Performance for Node Clustering in LFR Parameter Space
    Vladimir Ivashkin and Pavel Chebotarev
Analyzing Community-Aware Centrality Measures Using the Linear Threshold Model
    Stephany Rajeh, Ali Yassin, Ali Jaber, and Hocine Cherifi
CoVerD: Community-Based Vertex Defense Against Crawling Adversaries
    Pegah Hozhabrierdi and Sucheta Soundarajan

Link Analysis and Ranking

wsGAT: Weighted and Signed Graph Attention Networks for Link Prediction
    Marco Grassia and Giuseppe Mangioni
Link Predictability Classes in Complex Networks
    Elizaveta Stavinova, Elizaveta Evmenova, Andrey Antonov, and Petr Chunaev
Vertex Entropy Based Link Prediction in Unweighted and Weighted Complex Networks
    Purushottam Kumar and Dolly Sharma
Network Models

Population Dynamics and Its Instability in a Hawk-Dove Game on the Network
    Tomoko Sakiyama
Context-Sensitive Mental Model Aggregation in a Second-Order Adaptive Network Model for Organisational Learning
    Gülay Canbaloğlu and Jan Treur
A Leading Author Model for the Popularity Effect on Scientific Collaboration
    Hohyun Jung, Frederick Kin Hing Phoa, and Mahsa Ashouri
Factoring Small World Networks
    Jerry Scripps
Limitations of Chung Lu Random Graph Generation
    Christopher Brissette and George Slota
Surprising Behavior of the Average Degree for a Node’s Neighbors in Growth Networks
    Sergei Sidorov, Sergei Mironov, and Sergei Tyshkevich
Towards Building a Digital Twin of Complex System Using Causal Modelling
    Luka Jakovljevic, Dimitre Kostadinov, Armen Aghasaryan, and Themis Palpanas
Constructing Weighted Networks of Earthquakes with Multiple-parent Nodes Based on Correlation-Metric
    Yuki Yamagishi, Kazumi Saito, Kazuro Hirahara, and Naonori Ueda

Motif Discovery in Complex Networks

Motif-Role Extraction in Uncertain Graph Based on Efficient Ensembles
    Soshi Naito and Takayasu Fushimi
Analysing Ego-Networks via Typed-Edge Graphlets: A Case Study of Chronic Pain Patients
    Mingshan Jia, Maité Van Alboom, Liesbet Goubert, Piet Bracke, Bogdan Gabrys, and Katarzyna Musial
Analyzing Escalations in Militarized Interstate Disputes Using Motifs in Temporal Networks
    Hung N. Do and Kevin S. Xu
Experiments on F-Restricted Bi-pattern Mining
    Guillaume Santini, Henry Soldano, and Stella Zevio
Temporal Networks

Finding Colorful Paths in Temporal Graphs
    Riccardo Dondi and Mohammad Mehdi Hosseinzadeh
Quantitative Evaluation of Snapshot Graphs for the Analysis of Temporal Networks
    Alessandro Chiappori and Rémy Cazabet
Convergence Properties of Optimal Transport-Based Temporal Networks
    Diego Baptista and Caterina De Bacco
A Hybrid Adjacency and Time-Based Data Structure for Analysis of Temporal Networks
    Tanner Hilsabeck, Makan Arastuie, and Kevin S. Xu

Modeling Human Behavior

Markov Modulated Process to Model Human Mobility
    Brian Chang, Liufei Yang, Mattia Sensi, Massimo A. Achterberg, Fenghua Wang, Marco Rinaldi, and Piet Van Mieghem
An Adaptive Mental Network Model for Reactions to Social Pain
    Katarina Miletic, Oleksandra Mykhailova, and Jan Treur
Impact of Monetary Rewards on Users’ Behavior in Social Media
    Yutaro Usui, Fujio Toriumi, and Toshiharu Sugawara
Versatile Uncertainty Quantification of Contrastive Behaviors for Modeling Networked Anagram Games
    Zhihao Hu, Xinwei Deng, and Chris J. Kuhlman
Neural-Guided, Bidirectional Program Search for Abstraction and Reasoning
    Simon Alford, Anshula Gandhi, Akshay Rangamani, Andrzej Banburski, Tony Wang, Sylee Dandekar, John Chin, Tomaso Poggio, and Peter Chin
Success at High Peaks: A Multiscale Approach Combining Individual and Expedition-Wide Factors
    Sanjukta Krishnagopal
Data-Driven Modeling of Evacuation Decision-Making in Extreme Weather Events
    Matthew Hancock, Nafisa Halim, Chris J. Kuhlman, Achla Marathe, Pallab Mozumder, S. S. Ravi, and Anil Vullikanti
Effects of Population Structure on the Evolution of Linguistic Convention
    Kaloyan Danovski and Markus Brede
Quoting is not Citing: Disentangling Affiliation and Interaction on Twitter
    Camille Roth, Jonathan St-Onge, and Katrin Herms

Network in Finance and Economics

The COVID-19 Pandemic and Export Disruptions in the United States
    John Schoeneman and Marten Brienen
Default Prediction Using Network Based Features
    Lorena Poenaru-Olaru, Judith Redi, Arthur Hovanesyan, and Huijuan Wang
Can You Always Reap What You Sow? Network and Functional Data Analysis of Venture Capital Investments in Health-Tech Companies
    Christian Esposito, Marco Gortan, Lorenzo Testa, Francesca Chiaromonte, Giorgio Fagiolo, Andrea Mina, and Giulio Rossetti
Asymmetric Diffusion in a Complex Network: The Presence of Women on Boards
    Ricardo Gimeno, Ruth Mateos de Cabo, Pilar Grau, and Patricia Gabaldon
Marginalisation and Misperception: Perceiving Gender and Racial Wage Gaps in Ego Networks
    Daniel M. Mayerhoffer and Jan Schulz
A Networked Global Economy: The Role of Social Capital in Economic Growth
    Jaime Oliver Huidobro, Alberto Antonioni, Francesca Lipari, and Ignacio Tamarit
The Role of Smart Contracts in the Transaction Networks of Four Key DeFi-Collateral Ethereum-Based Tokens
    Francesco Maria De Collibus, Alberto Partida, and Matija Piškorec

Resilience, Synchronization and Control

Synchronization of Complex Networks Subject to Impulses with Average Characteristics
    Bangxin Jiang and Jianquan Lu
Retrieval of Redundant Hyperlinks After Attack Based on Hyperbolic Geometry of Web Complex Networks
    Mahdi Moshiri and Farshad Safaei
Deep Reinforcement Learning for FlipIt Security Game
    Laura Greige and Peter Chin
Accelerating Opponent Strategy Inference for Voting Dynamics on Complex Networks
    Zhongqi Cai, Enrico Gerding, and Markus Brede
Need for a Realistic Measure of Attack Severity in Centrality Based Node Attack Strategies
    Jisha Mariyam John and Divya Sindhu Lekha
Mixed Integer Programming and LP Rounding for Opinion Maximization on Directed Acyclic Graphs
    Po-An Chen, Ya-Wen Cheng, and Yao-Wei Tseng
Correction to: Complex Networks & Their Applications X
    Rosa Maria Benito, Chantal Cherifi, Hocine Cherifi, Esteban Moro, Luis M. Rocha, and Marta Sales-Pardo

Author Index
Network Analysis
A Fair-Cost Analysis of the Random Neighbor Sampling Method

Yitzchak Novick¹,² (✉) and Amotz Bar-Noy¹

¹ CUNY Graduate Center, New York, NY 10016, USA
[email protected]
² Touro College and University System, New York, NY 10018, USA
Abstract. Random neighbor sampling, or RN, is a method for sampling vertices with an average degree greater than the mean degree of the graph. It samples a vertex, then exchanges it for one of its neighbors, which is assumed to be of higher degree. While considerable research has analyzed various aspects of RN, the extra cost of sampling a second vertex is typically not addressed. In this paper, we offer an analysis of RN from the perspective of cost. We define three separate costs: the cost of sampling a vertex, the cost of sampling a neighbor, and the cost of selecting a vertex (as opposed to discarding it in exchange for another). We introduce variations to RN: RVN, which retains the first sampled vertex, and RN-RV and RVN-RV, which are ‘hybrid’ methods that sample differently in two separate phases. We study all of these methods in terms of the costs we define. The cost-benefit analysis highlights the methods’ strengths and weaknesses, with a specific focus on their respective performances for finding high-degree vs. low-degree vertices. The analyses and results provide a far richer understanding of RN and open a new area of exploration for researching additional sampling methods that are best suited for cost-effectiveness in light of particular goals.

Keywords: Fair cost comparison · Random neighbor sampling · High-degree vertex sampling

1 Introduction
It can often be beneficial to locate high-degree vertices within a graph. When total knowledge of a graph is not available, one is seemingly limited to selecting a randomly sampled vertex (RV) from the collection of vertices, with the expected degree being merely the mean degree of all vertices. Among the alternative methods that have been suggested is that of Cohen et al. [6], where a vertex is sampled at random as with RV, but this vertex is discarded and a randomly sampled neighbor (RN) of the vertex is selected instead.¹ The method is inspired by Scott Feld’s friendship paradox [10], the phenomenon that friends of people have a higher mean degree than people themselves do. Although the friendship paradox does not prove RN’s superiority to RV [15] (its actual manifestation as a vertex sampling technique would be random edge sampling [16,20]), it is still true that the expected degree of a vertex obtained by RN is greater than or equal to the mean degree of the graph, $E[RN] \geq E[RV]$. In fact, in any graph with at least one edge connecting two vertices of unequal degree the inequality is strict, $E[RN] > E[RV]$. These inequalities were demonstrated in [6] and [18], among others, and a proof was also presented in a work in progress by Kumar et al. [15], where it is further attributed to a comment on an online article [23].

¹ In this paper we will distinguish between the acts of ‘sampling’ and ‘selecting’ vertices. Sampling will refer to isolating a vertex from among a set of vertices, and selecting will refer to retaining a vertex for whatever purpose the sampling is intended. Thus, RN would be described as sampling a vertex, then sampling and selecting one of its neighbors.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 3–15, 2022. https://doi.org/10.1007/978-3-030-93409-5_1

What is ignored about RN, though, is the extra cost that the method requires. While RN outperforms RV in terms of results, it also requires additional computational steps; namely, it requires sampling twice as many vertices. This concept is well illustrated with an example. Suppose our goal in sampling is to maximize the degree of the maximum-degree vertex in the collection of vertices we select. We could allocate some budget b for how many vertices we are allowed to sample, and compare RN to RV for different budgets. We conducted this experiment on a set of Barabási-Albert [3] graphs. We repeatedly sampled for different budgets b and recorded the average maximum degree of the collections for each b. The first scatter plot in Fig. 1 shows the results. The green dots represent b vertices collected with RV. A naïve view of RN, one that does not take cost into account, would compare these b vertices with b neighbors collected with RN, represented by the blue dots. However, a budget b should actually yield a collection comprised of $b/2$ vertices and $b/2$ neighbors, represented by the red dots.
The comparison of RV and RN should be between these values, which present a fair assessment of RN and demonstrate its strength even in light of its additional cost. It should be noted that the inclusion of the initially sampled vertices, a subject we will discuss later in this paper, gives negligible benefit for this experiment. This is demonstrated by the chart on the right of Fig. 1. The max-degree vertex is much more likely to be found as a neighbor than as an initially sampled vertex, so the collections with and without the initial vertices give similar results.
Fig. 1. Max-degree vertices in a collection for a budget b
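The budgeted comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' experimental code: the generator `barabasi_albert`, the helper names, the graph size (n = 500, m = 2), the budgets, and the number of trials are all hypothetical choices made for the example, and $C_v = C_n = 1$ is assumed so that each neighbor selected by RN consumes two units of budget.

```python
import random

def barabasi_albert(n, m, rng):
    """Adjacency lists of a BA-style preferential-attachment graph (requires n > m >= 1)."""
    adj = {v: set() for v in range(n)}
    targets = list(range(m))   # the first new vertex attaches to vertices 0..m-1
    endpoints = []             # each vertex appears once per incident edge
    for v in range(m, n):
        for t in targets:
            adj[v].add(t)
            adj[t].add(v)
        endpoints.extend(targets)
        endpoints.extend([v] * m)
        chosen = set()
        while len(chosen) < m:  # m distinct targets, drawn proportionally to degree
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return {v: sorted(nbrs) for v, nbrs in adj.items()}

def max_degree_rv(adj, budget, rng):
    """RV: spend the whole budget on uniform vertex samples (budget >= 1)."""
    vertices = list(adj)
    return max(len(adj[rng.choice(vertices)]) for _ in range(budget))

def max_degree_rn(adj, budget, rng, keep_first=False):
    """RN under a fair budget: every selected neighbor costs two samples."""
    vertices = list(adj)
    best = 0
    for _ in range(budget // 2):
        v = rng.choice(vertices)      # first sample: a uniform vertex
        u = rng.choice(adj[v])        # second sample: one of its neighbors
        best = max(best, len(adj[u]))
        if keep_first:                # optionally keep the initial vertex as well
            best = max(best, len(adj[v]))
    return best

def average(fn, adj, budget, trials, rng, **kwargs):
    return sum(fn(adj, budget, rng, **kwargs) for _ in range(trials)) / trials

rng = random.Random(0)
graph = barabasi_albert(n=500, m=2, rng=rng)
for b in (10, 20, 40):
    rv = average(max_degree_rv, graph, b, 200, rng)
    rn = average(max_degree_rn, graph, b, 200, rng)
    rn_keep = average(max_degree_rn, graph, b, 200, rng, keep_first=True)
    print(f"b={b}: RV={rv:.1f}  RN (b/2 neighbors)={rn:.1f}  RN + initial vertices={rn_keep:.1f}")
```

The `keep_first` flag mirrors the two variants compared in the right-hand chart of Fig. 1: the collection with and without the initially sampled vertices.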
The original context of Cohen et al.’s paper was the immunization of a population in order to arrest a spreading pandemic. In this context, the cost of sampling the neighbor is negligible. However, as RN has gained popularity and been suggested as a sampling method in diverse contexts (for example [5,12,17]), a cost analysis is important. This paper explores the sampling costs, as well as the cost of selecting and utilizing a vertex. We explore the costs’ impact on the performance of RN, which reveals specific strengths and weaknesses. Through analyzing these costs, we also discover potential improvements to RN. These suggest alternative sampling methods, which we will present and analyze in light of the associated costs.
2 RN’s Inefficiency for Finding Leaves
There is an important characteristic of RN that bears significantly on the cost analysis we perform. While RN’s strength for finding the high-degree vertices, or ‘hubs’, within a graph is well established, this strength translates into a weakness if the goal is finding the low-degree vertices, or ‘leaves’, instead. Two graph traits that contribute to RN’s ability to find hubs are the power-law distribution [3] and negative assortativity [19]. The abundance of leaves suggests a low expected degree for an initially sampled vertex, and some amount of disassortativity suggests that a leaf’s neighbor will not be another leaf but rather a hub. The connection to assortativity has been mentioned previously in [15] and [20]. In truth, there is a loose connection between these two graph traits. Both the Erdős-Gallai [8] and Havel-Hakimi [11] theorems are predicated on the simple fact that, if there are not enough hubs to satisfy all of the edge-endpoints of the hubs themselves, the hubs must connect disassortatively with leaves. While it is possible for a graph to have a power-law distribution of degree and still be assortative, in most practical cases the relatively low number of hubs will necessitate some amount of disassortativity.

In our research, we will perform many experiments using the Barabási-Albert [3] random graph model, which generates graphs with the power-law degree distribution that is common in real-world networks [1,9]. Although BA graphs have been found to be non-assortative on average [19], assortativity and degree ratio phenomena have also been studied on the local level [13,21,22,24], and it has been demonstrated that hubs in BA graphs are locally disassortative [2,4]. As a result, RN performs particularly well in these graphs. But, by the same token, RV is far superior at finding leaves because RN is so attuned to finding hubs.
As an analytical illustration of the phenomenon, contrast the performance of RN and RV in a star graph, which is a highly oversimplified illustration of a power-law distribution of degree. When sampling with RV, all vertices have an equal probability of being sampled, $P(v_i) = \frac{1}{n}$. The expected number of samples required to find the center is $E[S_c] = n$, and the expected number of samples required to find all leaves is $E[S_L] = n(H_n - 1) = \Theta(n \log n)$. Contrast this with sampling by RN. The probability of selecting the center is $P(S_c) = \frac{n-1}{n}$ and the probability of selecting a leaf is $P(S_l) = \frac{1}{n}$. The expected number of samples required to find the center approaches 1 at $E[S_c] = \frac{n}{n-1}$, but the expected number of samples required to collect all leaves is

$$E[S_L] = \sum_{i=1}^{n-1} \left( \frac{1}{n} \cdot \frac{n-i}{n-1} \right)^{-1} = n(n-1) \sum_{i=1}^{n-1} \frac{1}{n-i} = n(n-1) \sum_{i=1}^{n-1} \frac{1}{i} = n(n-1) H_{n-1} = \Theta(n^2 \log n) \tag{1}$$
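Eq. 1 can be checked with exact rational arithmetic. The sketch below sums the expected waiting time for each new leaf under RN on a star graph and compares the total with the closed form $n(n-1)H_{n-1}$; the function names and the values of n are assumptions made for this illustration.

```python
from fractions import Fraction

def harmonic(k):
    """k-th harmonic number H_k as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, k + 1))

def rn_expected_samples_for_all_leaves(n):
    """Expected number of RN selections needed to collect every leaf of a star on n vertices."""
    total = Fraction(0)
    for i in range(1, n):
        # RN returns some leaf with probability 1/n (only when the center is sampled first),
        # and (n - i) of the (n - 1) leaves have not been collected yet.
        p_new_leaf = Fraction(1, n) * Fraction(n - i, n - 1)
        total += 1 / p_new_leaf   # mean of a geometric waiting time
    return total

for n in (5, 10, 100):
    direct = rn_expected_samples_for_all_leaves(n)
    closed_form = n * (n - 1) * harmonic(n - 1)
    print(n, float(direct), direct == closed_form)
```

Using `Fraction` avoids floating-point noise, so the comparison with the closed form is an exact equality rather than an approximate one.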
An experimental illustration of the phenomenon with BA graphs is shown in Fig. 2. The vertices are ranked by degree and the x-axis represents the percent of top vertices successfully selected. RV, shown in red, samples uniformly and finds hubs slowly compared to RN, shown in blue. However, RV will find all vertices with on the order of n log n samples (per the coupon collector’s problem), while RN struggles to complete the collection. We also include results from Erdős-Rényi [7] graphs to show the connection to RN’s strength. ER graphs typically have a weaker effect of the friendship paradox on a local level [21], which means the $\frac{RN}{RV}$ ratio is far weaker than in BA graphs [20]. So the overall effect is the same as in the BA graphs, but it is less pronounced.
Fig. 2. Required number of samples to collect the top x high-degree vertices in a graph
In our cost analysis, it will often be relevant to focus on this distinction, analyzing methods and costs for selecting hubs vs. selecting leaves.

Hybrid Sampling Methods. This shift between the respective strengths of RV and RN suggests sampling via a hybrid method, which we will call RN-RV. RN could be used in an initial phase in order to prioritize hubs, but then RV could be used in a second phase in order to cover more of the vertices. Of course, this method is not as straightforward as either RV or RN, which may present some challenges in practice. A more significant challenge is determining the optimal point at which to switch methods. We will use RN-RV in our cost analysis to highlight important theoretical points, but a full analysis of the method and the challenge of finding an optimal switching point is beyond the scope of this paper.
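The two-phase idea behind RN-RV can be sketched as follows. This is only an illustration: the function name `hybrid_rn_rv` and the fixed switching parameter `switch_after` are hypothetical, and the paper deliberately leaves the choice of an optimal switching point open.

```python
import random

def hybrid_rn_rv(adj, total_selections, switch_after, rng):
    """Select with RN for the first `switch_after` selections, then with RV.

    `adj` maps each vertex to a non-empty list of neighbors.
    `switch_after` is a hypothetical, user-chosen switching point.
    """
    vertices = list(adj)
    selected = []
    for step in range(total_selections):
        if step < switch_after:
            v = rng.choice(vertices)              # phase 1 (RN): sample a vertex ...
            selected.append(rng.choice(adj[v]))   # ... and select one of its neighbors
        else:
            selected.append(rng.choice(vertices)) # phase 2 (RV): select the sampled vertex
    return selected

# Star graph on 6 vertices: vertex 0 is the center, vertices 1..5 are leaves.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
picks = hybrid_rn_rv(star, total_selections=12, switch_after=2, rng=random.Random(0))
print(len(set(picks)), "distinct vertices selected out of", len(star))
```

Setting `switch_after` to 0 reduces the sketch to plain RV, and setting it to `total_selections` reduces it to plain RN.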
3 Sampling Costs - $C_v$ and $C_n$
The first two costs we introduce are the cost of sampling a vertex, $C_v$, and the cost of sampling a neighbor, $C_n$. In our first example in the introduction, we made an implicit assumption that these two costs are the same, or $C_v = C_n$. While this is likely true in many scenarios, sampling vertices and neighbors may not always be identical processes, so we will generalize to two separate costs to allow for cases where $C_v < C_n$ and $C_v > C_n$.

RkN Sampling. The introduction of these two costs already suggests a tweak to RN. RN’s strength is the higher expected degree of the neighbor over the vertex, a gain for which it is worth paying an additional $C_n$ above the $C_v$ already spent if the method has any benefit at all. With this in mind, an alternative method would be to sample a vertex and then select k of its neighbors instead of just 1. This would allow us to better capitalize on the already spent $C_v$, lowering the cost of each selected neighbor to $\frac{C_v}{k} + C_n$ as opposed to $C_v + C_n$. Clearly, RkN is a generalization of RN, with RN being the specific case of k = 1. RkN will not be a focus of this work, but we intend to offer an analysis of it in an extended version of this paper.

3.1 Critical $C_n$
Let us fix $C_v = 1$ so that we are expressing $C_n$, and total cost, both in terms of $C_v$. We will seek a ‘critical’ $C_n$ ($CC_n$) where RV and RN perform equally well in light of their associated costs. $CC_n$ is, ultimately, a measure of RN’s superiority. The better RN performs compared to RV, the higher the cost one would pay for a neighbor. And a negative $CC_n$ would indicate a scenario where RN performs worse than RV.

$CC_n$ Using Expected Degree. We will first calculate a single $CC_n$ value that applies to a graph as a whole. Using $E[RV]$ and $E[RN]$:

$$E[RV] = \frac{E[RN]}{1 + CC_n}, \quad \text{so} \quad CC_n = \frac{E[RN]}{E[RV]} - 1 \tag{2}$$
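For a graph whose adjacency lists are fully known, Eq. 2 can be evaluated exactly. The sketch below assumes RN means "uniform vertex, then uniform neighbor" and that the graph has no isolated vertices; the function names and the small path graph are illustrative choices, not part of the paper.

```python
from fractions import Fraction

def expected_degree_rv(adj):
    """E[RV]: the mean degree of the graph."""
    return Fraction(sum(len(nbrs) for nbrs in adj.values()), len(adj))

def expected_degree_rn(adj):
    """E[RN]: average, over uniformly sampled vertices, of the mean degree of their neighbors.

    Assumes every vertex has at least one neighbor.
    """
    total = Fraction(0)
    for nbrs in adj.values():
        total += Fraction(sum(len(adj[u]) for u in nbrs), len(nbrs))
    return total / len(adj)

def critical_cn(adj):
    """Eq. 2 with C_v fixed at 1: CC_n = E[RN] / E[RV] - 1."""
    return expected_degree_rn(adj) / expected_degree_rv(adj) - 1

# Path on four vertices: E[RV] = 3/2 and E[RN] = 7/4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(critical_cn(path))  # 1/6
```

On this path graph a neighbor is therefore worth sampling only if it costs at most one sixth of a vertex sample, under the premise that worthwhile cost scales with expected degree.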
This definition is intuitive because, as noted, $CC_n$ is a function of the ratio $\frac{E[RN]}{E[RV]}$. As RN’s strength over RV increases, the cost of sampling a neighbor can increase and still be worthwhile. Notice also that if we grant the premise of perfect correlation between expected degree and worthwhile costs, RN has to perform at least twice as well as RV to justify a $C_n$ equal to $C_v$.
$CC_n$ for Canonical Graphs. We will apply Eq. 2 to some classic graphs.

d-regular Graph. In a d-regular graph, or any perfectly assortative graph for that matter, RN reduces to RV and $CC_n = 0$. The intuition is obvious. In a graph where RN offers no advantage over RV, any positive cost would be wasted.

Star Graph. In a star graph, $E[RV] = \frac{2(n-1)}{n}$ and $E[RN] = \frac{(n-1)^2 + 1}{n}$, so

$$CC_n = \frac{1 + (n-1)^2}{2(n-1)} - 1 \tag{3}$$
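As a quick numerical check of Eq. 3, consider a star graph with n = 5 (a value chosen here purely for illustration). Substituting into the two expected degrees and then into Eq. 2 gives the same result as substituting directly into Eq. 3:

```latex
\begin{align*}
E[RV] &= \frac{2(5-1)}{5} = \frac{8}{5}, \qquad
E[RN] = \frac{(5-1)^2 + 1}{5} = \frac{17}{5}, \\
CC_n &= \frac{E[RN]}{E[RV]} - 1 = \frac{17}{8} - 1 = \frac{9}{8}, \\
CC_n &= \frac{1 + (5-1)^2}{2(5-1)} - 1 = \frac{17}{8} - 1 = \frac{9}{8}.
\end{align*}
```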
As n increases, the expression grows linearly with n, approaching $\frac{n-1}{2} - 1$, which is roughly half the degree of the center vertex, n − 1. This reflects the high $C_n$ price that is worth paying to exchange the leaf vertex one would initially sample with high probability for the center.

Complete Bipartite Graph. Let L and R denote the two sides of the graph and also their sizes. All vertices in L are of degree R, and all vertices in R are of degree L. $E[RV] = \frac{2LR}{L+R}$, and $E[RN] = \frac{L^2+R^2}{L+R}$. Therefore, in a complete bipartite graph, $CC_n = \frac{L^2+R^2}{2LR} - 1$.

3.2 $CC_n$ for Different Sampling Amounts and Results
We define $CC_n$ as a function of either a number of samples taken, or a desired result. To this end, we will define three metrics of success of a sampling method:

1. Total Degrees - We repeatedly sample vertices from a graph with replacement using RV and RN and track the sum of the degrees of all selected vertices. $CC_n$ for this metric should converge on the $CC_n$ value based on expected degree defined in Eq. 2.
2. Total Unique Degrees - We repeatedly sample vertices from a graph with replacement using RV and RN and track the sum of the degrees of any new vertices selected. Here we will present resulting values as a percentage of the total degrees in the graph.
3. Max Degree - We repeatedly sample vertices from a graph with replacement using RV and RN and track the maximum degree vertex selected. Here too we will present resulting degree values as a percentage of the max-degree vertex in the graph.

The second metric corrects for the inclusion of duplicates in the first metric. If our goal is to immunize a network, for example, we surely can’t take credit for inoculating the same vertex twice. We include the first metric mostly for comparison, but still suggest it might be useful in some scenarios where repeatedly selecting vertices, particularly hubs, has value.

To calculate $CC_n$ as a function of sampling iterations, let $RV(i)$ and $RN(i)$ be the resulting values of selecting i vertices with RV and with RN respectively. The cost of i vertices selected with RV is i and the cost of i neighbors selected by RN is $i(1 + C_n)$. For any i, we can calculate $CC_n(i)$ as follows:

$$\frac{RV(i)}{i} = \frac{RN(i)}{i(1 + CC_n(i))}, \quad \text{so} \quad CC_n(i) = \frac{RN(i)}{RV(i)} - 1 \tag{4}$$
To calculate $CC_n$ as a function of resulting values, assume some resulting value V requires i sampling iterations of RV and j sampling iterations of RN, or $V = RV(i) = RN(j)$. For this value V, we can calculate $CC_n(V)$ as

$$\frac{V}{i} = \frac{V}{j(1 + CC_n(V))}, \quad \text{so} \quad CC_n(V) = \frac{i}{j} - 1 \tag{5}$$
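The three metrics and Eqs. 4 and 5 translate directly into a few helper functions. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the input curves are made-up numbers rather than results from the paper, and for Eq. 5 the first iteration at which a curve reaches at least V stands in for the exact equality $V = RV(i) = RN(j)$, which discrete curves rarely satisfy.

```python
def running_metrics(selections):
    """Running curves for the three metrics.

    `selections` is a list of (vertex, degree) pairs in the order they were selected.
    Returns (total degrees, total unique degrees, max degree), one value per iteration.
    """
    total, unique, best = 0, 0, 0
    seen = set()
    totals, uniques, maxima = [], [], []
    for vertex, degree in selections:
        total += degree
        if vertex not in seen:        # only a newly selected vertex adds unique degrees
            seen.add(vertex)
            unique += degree
        best = max(best, degree)
        totals.append(total)
        uniques.append(unique)
        maxima.append(best)
    return totals, uniques, maxima

def critical_cn_by_samples(rv_curve, rn_curve):
    """Eq. 4: CC_n(i) = RN(i) / RV(i) - 1 for each iteration (RV values must be positive)."""
    return [rn / rv - 1 for rv, rn in zip(rv_curve, rn_curve)]

def first_iteration_reaching(curve, value):
    """1-based index of the first iteration whose result is at least `value`, or None."""
    for i, reached in enumerate(curve, start=1):
        if reached >= value:
            return i
    return None

def critical_cn_by_value(rv_curve, rn_curve, value):
    """Eq. 5: CC_n(V) = i / j - 1, or None if either method never reaches V."""
    i = first_iteration_reaching(rv_curve, value)
    j = first_iteration_reaching(rn_curve, value)
    if i is None or j is None:
        return None
    return i / j - 1

# Made-up curves in which RN accumulates value twice as fast as RV.
rv_curve = [1, 2, 3, 4]
rn_curve = [2, 4, 6, 8]
print(critical_cn_by_samples(rv_curve, rn_curve))
print(critical_cn_by_value(rv_curve, rn_curve, 4))
```

In practice the curves would come from averaging `running_metrics` over many repeated sampling runs of RV and RN on the same graph.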
We experimented with ER and BA graphs with varying parameters. Figure 3 shows results for a set of BA graphs. These results were typical for most of the experiments on both BA and ER graphs, and a number of real networks taken from the Koblenz online collection [14] followed these patterns as well.
Fig. 3. $CC_n$ values calculated for BA graphs for total degrees, total unique degrees, and max-degree. The top charts plot by samples, the bottom charts plot by results.

The results for total degrees (the leftmost charts) correlated with $\frac{E[RN]}{E[RV]} - 1$ as predicted. The middle charts plot the sum of unique degrees, by samples on top and by results on the bottom. As we continue to sample, the number of new vertices returned obviously decreases, until both methods are failing to add degrees to our total and $CC_n = 0$. However, there is a range where $CC_n < 0$, which corresponds to the range where all of the hubs have been found and RV is performing better than RN because it is returning new leaves while RN is repeating hubs.

The last charts show the results for max-degree. $CC_n$ starts out high as RN selects a higher max-degree than RV does. But, similar to the total unique degrees metric, the max-degree metric causes $CC_n$ to approach and eventually reach 0 once the max-degree vertex has been found by RV and neither method is gaining anything with continued sampling. This explains the mostly monotonically decreasing nature of the function in the top chart. The bottom chart plots by results. There is considerably more noise in this chart than the others because sampling for max-degree is harder to normalize with repetition, but the generally increasing nature of the function shows that RN has more relative value compared to RV as the maximum degree being sought increases.
4 Selection Cost and RVN Sampling
In our introductory example in Fig. 1 we mentioned that, given a budget b which we would use to collect vertices, and assuming Cv = Cn = 1, we could collect at most b/2 neighbors. But we also selected the b/2 vertices we sampled as a means of collecting the neighbors and our final collection was size b. We will refer to the sampling method where we select both the originally sampled vertex as well as its sampled neighbor as RVN. The obvious explanation for why one would opt for RN over RVN is a cost that would be associated with selecting a vertex, which we will call Cs. Even having spent Cv + Cn to sample the vertex and its neighbor, we select only the neighbor in order to capitalize on the Cs we spend to do so.

Recall that we suggested that Cohen et al. [6] probably considered the extra cost of sampling a neighbor negligible, which is why costs were not considered. By defining Cv and Cn, we can explain this suggestion more formally by saying that perhaps Cn ≪ Cv, and the extra cost of sampling the neighbor can be ignored. However, the far more likely explanation is that Cs ≫ Cv ≈ Cn. If the immunizations are in very short supply, it is wasteful to administer one to the lower-degree vertex instead of paying an additional Cv + Cn to find another higher-degree subject.

4.1 RVN Versus RN
In comparing RVN and RN, we are considering the collective cost of sampling a vertex and neighbor, Cv + Cn, on the one hand, and the cost of selecting the vertex, Cs, on the other. For the purpose of this analysis, we will ignore any comparison between Cv and Cn directly, and define α as the cost of collectively sampling a vertex and neighbor, so α = Cv + Cn. Intuitively, we would assume a preference for spending the lower expense in order to capitalize more on the greater expense. Therefore, if α < Cs we are more likely to sample again in order to spend Cs on the higher-degree neighbor, and if α > Cs we are more likely to spend Cs on selecting the vertex in order to capitalize more on the α we have already spent. But of course the direct comparison of RV and RN influences this decision as well. We will define a 'critical α' (Cα) as the α value for which RVN and RN are equal and use it to relate α to Cs and the RN/RV ratio.

Cα for Expected Degree. We will calculate Cα using the expected values of RVN and RN. We will compare a single selection of RVN to a single selection of RN. Every odd selection of RVN is the selection of a vertex and every even selection of RVN is a neighbor, so we will express the expected degree of a single selection of RVN as E[RVN] = 1/2 (E[RV] + E[RN]). Therefore

$$\frac{E[\mathrm{RV}] + E[\mathrm{RN}]}{C\alpha + 2C_s} = \frac{E[\mathrm{RN}]}{C\alpha + C_s}, \quad \text{so} \quad C\alpha = C_s \left( \frac{E[\mathrm{RN}]}{E[\mathrm{RV}]} - 1 \right) \tag{6}$$
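The algebra behind Eq. 6 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the expected degrees are those of a 5-vertex star graph (E[RV] = 1.6, E[RN] = 3.4), used as a convenient worked example rather than a value reported in the paper.

```python
def critical_alpha(cs: float, e_rv: float, e_rn: float) -> float:
    """Eq. 6: the alpha at which RVN and RN give equal expected degree per unit cost."""
    return cs * (e_rn / e_rv - 1.0)


def value_per_cost_rvn(alpha: float, cs: float, e_rv: float, e_rn: float) -> float:
    """One RVN round pays alpha once and Cs twice, and selects a vertex and a neighbor."""
    return (e_rv + e_rn) / (alpha + 2.0 * cs)


def value_per_cost_rn(alpha: float, cs: float, e_rn: float) -> float:
    """One RN round pays alpha and Cs once, and selects only the neighbor."""
    return e_rn / (alpha + cs)


# Star graph with 5 vertices: E[RV] = (4 + 4 * 1) / 5 = 1.6, E[RN] = 0.8 * 4 + 0.2 * 1 = 3.4.
cs, e_rv, e_rn = 2.0, 1.6, 3.4
c_alpha = critical_alpha(cs, e_rv, e_rn)
print(c_alpha)                                      # 2.25
print(value_per_cost_rvn(c_alpha, cs, e_rv, e_rn))  # 0.8
print(value_per_cost_rn(c_alpha, cs, e_rn))         # 0.8
```

For α above this threshold the RVN ratio is the larger of the two, and for α below it the RN ratio is larger, which matches the intuition stated above.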
This uses the same expression as the one we found for CCn in Eq. 2. We see that a stronger RN as compared to RV leads to a higher Cα, that is, α must be significantly larger than Cs for the selection of the first vertex to be worthwhile. As RN weakens vis-à-vis RV, Cα decreases and the importance of capitalizing on α increases relative to the importance of capitalizing on Cs.

RVN vs. RN for Unique Degrees. As a practical sampling method, we will analyze RVN for the goal of accumulating unique degrees. For the goal of non-unique degrees, the value of RVN versus RN will depend on Cα. For the goal of a high max-degree, we already noted in the introduction that RVN adds negligible benefit over RN as shown in Fig. 1, as the max-degree of the collection is very likely to come from the neighbors. Experimental results on BA and ER graphs for varied values of n, m, and p showed that the max-degree vertex would be a neighbor in BA graphs with probabilities that typically were well above 0.9, and even in ER graphs typically above 0.7.

However, for the goal of accumulating unique degrees, RVN is an important sampling method to explore. RVN would seem to give the practical benefits of RN-RV, but might be more cost effective because many leaves are selected at the point where we first encounter them and pay Cv for them. Figure 2 included RVN results represented by the black lines. We see that RVN finds high-degree vertices faster than RV but still more slowly than RN. But, as the goal of accumulating vertices approaches 100%, RVN appears slightly weaker than RV, though far stronger than RN.

Of course this suggests a second two-phase hybrid method as well, RVN-RV. Like RN-RV, we take advantage of the power of RN to accumulate hubs before switching to RV to get the leaves, but we also pay the extra Cs to select the sampled leaves in the first phase. Clearly, choosing between these methods is a question of the specific priorities of any given scenario. If capitalizing on Cs in the initial phase is of greater importance, RN-RV makes more sense; if capitalizing on Cv in the initial phase is the priority, then RVN-RV is better. The same tweak can be applied to RkN as well: RVkN would be a sampling method that selects a vertex and k of its neighbors, again spending an extra Cs in order to better capitalize on the spent Cv.
5 Fair Cost Analysis
We have discussed costs, introduced new sampling methods, and discussed the methods' strengths and weaknesses, particularly in light of the goals of finding hubs and leaves. While Fig. 2 illustrates this contrast, it does not conform to a consistent cost model. In terms of the costs discussed in this paper, it could be described as charging Cv for RV and RN, and both Cv and Cn for RVN. We will now perform an actual cost analysis to demonstrate the methods' characteristics in terms of the costs we have defined. Once again, we will perform an analytical analysis using the star graph, and an experimental analysis using BA graphs. For this analysis we will assume we pay Cv and Cn anytime we sample a vertex, even if we have sampled it before, but we pay Cs only once per vertex.

5.1 Cost Analysis on the Star Graph
Table 1 summarizes the analysis for star graphs. Because we are paying Cs only once per vertex, Cs expenses are uninteresting and therefore omitted.

Table 1. An analysis of costs Cv and Cn in a star graph of n vertices

| Method | Selecting center: E[Cv] | Selecting center: E[Cn] | Selecting all leaves: E[Cv] | Selecting all leaves: E[Cn] |
| RV | n | 0 | Θ(n log n) | 0 |
| RN | n/(n − 1) | n/(n − 1) | Θ(n² log n) | Θ(n² log n) |
| RVN | 1 | 1 | Θ(n log n) | Θ(n log n) |
| RN-RV | n/(n − 1) | n/(n − 1) | Θ(n log n) | 0 |
| RVN-RV | 1 | 1 | Θ(n log n) | 0 |
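The "selecting center" entries of Table 1 for RV and RN can be checked empirically. The following is a small Monte Carlo sketch, not the authors' code; it assumes vertex 0 is the center of the star, and the function names, trial count, and seed are arbitrary choices made for illustration.

```python
import random


def rv_samples_until_center(n: int, rng: random.Random) -> int:
    """RV on a star: draw uniform vertices until the center (vertex 0) appears."""
    samples = 1
    while rng.randrange(n) != 0:
        samples += 1
    return samples


def rn_samples_until_center(n: int, rng: random.Random) -> int:
    """RN on a star: draw a uniform vertex, then take a uniform neighbor of it."""
    samples = 1
    # A leaf's only neighbor is the center, so RN succeeds unless the
    # sampled vertex is the center itself (whose neighbors are all leaves).
    while rng.randrange(n) == 0:
        samples += 1
    return samples


def mean_samples(sampler, n: int, trials: int = 20000, seed: int = 0) -> float:
    """Average number of samples needed over many independent trials."""
    rng = random.Random(seed)
    return sum(sampler(n, rng) for _ in range(trials)) / trials


n = 20
print(mean_samples(rv_samples_until_center, n))  # should be close to n
print(mean_samples(rn_samples_until_center, n))  # should be close to n / (n - 1)
```

Each RN sample counted here incurs one unit of Cv and one unit of Cn, whereas each RV sample incurs only Cv.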
For RV, the expected number of samples required to obtain the center is n, and getting all leaves requires Θ(n log n) samples. These are all Cv costs; there are no Cn costs. RN finds the center in just over one sampling on average (n/(n − 1) samplings), each incurring both Cv and Cn. But, as noted above, it requires Θ(n² log n) units of both Cv and Cn to find all leaves. RVN necessarily finds the center after spending one unit of Cv and one of Cn, and will require Θ(n log n) iterations to find all leaves. The value of the hybrid methods over RVN is evident, as they stop spending Cn after the first phase, so finding leaves incurs only Cv costs. The contrast between RN-RV and RVN-RV is in selecting the center: RN-RV finds the center with very close to one sampling with high probability, whereas RVN-RV necessarily finds it in a single iteration, as noted above.

5.2 Cost Analysis on BA Graphs
We experimented with a set of BA graphs to illustrate results similar to those of the star graph for random graphs that are more realistic. Figure 4 demonstrates a comparison for RV, RN, and RVN. The first plot shows the costs of Cv which, as discussed, do not represent RN fairly. This is even truer for RVN, where a single sampled vertex yields both the vertex itself and its neighbor. On the other hand, Cn applies only to methods that include a neighbor. If this were the only expense, RV would cost nothing, and RVN would be a better value than RN because of the vertex it includes at no cost. The third plot captures both of these expenses together, truly highlighting the strength of RVN. It performs well for high-degree vertices and appears only slightly worse than RV for collecting the whole graph, far closer than RN. The fourth chart shows Cs expenses. Because we are assuming that Cs is only paid once per vertex, all methods pay the same amount to collect the entire graph. But the experiments showed that methods that collect neighbors spend fewer units of Cs for the hubs. This is because they find them sooner, before they have spent Cs on the many leaves that are necessarily sampled while seeking hubs with RV. If the hubs are the priority and Cs is a significant concern, RVN is weaker than RN in these results.
Fig. 4. Costs for collecting the top x high-degree vertices in a graph
Hybrid Methods in BA Graphs. While the hybrid methods were simple to analyze in a star graph, it is more complicated to apply them in random or real-world graphs. The complication is determining the point at which to switch to RV, as mentioned above. Clearly, as the switching point approaches 0 the method reduces to RV, and as it approaches infinity the method reduces to RN or RVN. In the star graph, there is a clear delineation between a single hub and multiple leaves. It is also simple to calculate an expected number of samples required to find the center. In more intricate graphs it is not as simple. When sampling consistently with a method whose strength is hubs, even if it misses a few early on, the law of averages will correct for this. However, switching to RV before all high-degree vertices have been found will not allow for this correction. If one's goal is acquiring all of whatever subset of vertices one deems 'hubs', switching to RV too soon can completely negate the benefit of the first sampling phase. On the other hand, if some tolerance is defined, for example if we are satisfied with 90% of the hubs, then perhaps a reasonable switching point could be determined. Obviously this is a rich research endeavor on its own, and we leave it too as a future extension of this work.
References
1. Albert, R., Jeong, H., Barabási, A.L.: The diameter of the world wide web. Nature 401, 130–131 (1999)
2. Babak, F., Rabbat, M.G.: Degree Correlation in Scale-Free Graphs. arXiv:1308.5169 (2013)
3. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
4. Bertotti, M.L., Modanese, G.: The Bass Diffusion Model on Finite Barabási-Albert Networks. Physics and Society. arXiv:1806.05959 (2018)
5. Christakis, N.A., Fowler, J.H.: Social network sensors for early detection of contagious outbreaks. PLoS ONE 5, 9 (2010)
6. Cohen, R., Havlin, S., Ben-Avraham, D.: Efficient immunization strategies for computer networks and populations. Phys. Rev. Lett. 91(24), 247901-1–247901-4 (2003)
7. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 5(1), 17–60 (1960)
8. Erdős, P., Gallai, T.: Gráfok, Előírt Fokú Pontokkal. Matematikai Lapok 11, 264–674 (1960)
9. Faloutsos, M., Faloutsos, P., Faloutsos, C.: On power-law relationships of the internet topology. In: The Structure and Dynamics of Networks, pp. 195–206. Princeton University Press (2011)
10. Feld, S.L.: Why your friends have more friends than you do. Am. J. Soc. 96(6), 1464–1477 (1991)
11. Hakimi, S.L.: On realizability of a set of integers as degrees of the vertices of a linear graph. Soc. Ind. Appl. Math. 10, 196–506 (1962)
12. Han, B., Li, J., Srinivasan, A.: Your friends have more friends than you do: identifying influential mobile users through random-walk sampling. IEEE/ACM Trans. Netw. 22(5), 1389–1400 (2014)
13. Jackson, M.O.: The Friendship Paradox and Systematic Biases in Perceptions and Social Norms. arXiv preprint arXiv:1605.04470 (2016)
14. Kunegis, J.: Koblenz Network Collection (2013). http://konect.uni-koblenz.de/
15. Kumar, V., Krackhardt, D., Feld, S.L.: Network Interventions Based on Inversity: Leveraging the Friendship Paradox in Unknown Network Structures. https://vineetkumars.github.io/Papers/NetworkInversity.pdf (2018)
16. Leskovec, J., Faloutsos, C.: Sampling from large graphs. ACM Proc. 12, 631–636 (2006)
17. Lu, L., Chen, D., Ren, X., Zhang, Q., Zhang, Y., Zhou, T.: Vital nodes identification in complex networks. Phys. Rep. 650, 1–63 (2016)
18. Momeni, N., Rabbat, M.G.: Effectiveness of Alter Sampling in Social Networks. arXiv:1812.03096 (2018)
19. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 208701 (2002)
20. Novick, Y., Bar-Noy, A.: Finding high-degree vertices with inclusive random sampling. In: Complex Networks and Their Applications IX, vol. 943 (2021)
21. Pal, S., Yu, F., Novick, Y., Swami, A., Bar-Noy, A.: A study on the friendship paradox - quantitative analysis and relationship with assortative mixing. Appl. Netw. Sci. 4, 71 (2019)
22. Piraveenan, M., Prokopenko, M., Zomaya, A.Y.: Classifying complex networks using unbiased local assortativity. In: ALife-XII Conference, pp. 329–336 (2010)
23. Strogatz, S.: Friends You Can Count On. https://opinionator.blogs.nytimes.com (2012)
24. Thedchanamoorthy, G., Piraveenan, M., Kasthuriratna, D., Senanayake, U.: Node assortativity in complex networks: an alternative approach. Procedia Comp. Sci. 29, 2449–2461 (2014)
Analysis of Radiographic Images of Patients with COVID-19 Using Fractal Dimension and Complex Network-Based High-Level Classification

Weiguang Liu1, Jianglong Yan1,4, Yu-tao Zhu2, Everson José de Freitas Pereira4, Gen Li3, Qiusheng Zheng3, and Liang Zhao2,4(B)

1 School of Computer Science, Zhongyuan University of Technology, Zhengzhou, China [email protected]
2 China Branch of BRICS Institute of Future Networks, Shenzhen, China [email protected]
3 Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhongyuan University of Technology, Zhengzhou, China [email protected]
4 Department of Computing and Mathematics, University of Sao Paulo (USP), Ribeirao Preto, Brazil {eversonpereira,zhao}@usp.br

Abstract. An important task in combating COVID-19 involves the quick and correct diagnosis of patients, which is not only critical to the patient's prognosis, but can also help to optimize the configuration of hospital resources. This work aims to classify chest radiographic images to help the diagnosis and prognosis of patients with COVID-19. In comparison to images of healthy lungs, chest images of lungs infected by COVID-19 present geometrical deformations, like the formation of filaments. Therefore, fractal dimension is applied here to characterize the levels of complexity of COVID-19 images. Moreover, real data often contains complex patterns beyond physical features. Complex networks are suitable tools for characterizing data patterns due to their ability to capture the spatial, topological and functional relationships between the data. Therefore, a complex network-based high-level data classification technique, capable of capturing data patterns, is modified and applied to chest radiographic image classification. Experimental results show that the proposed method can obtain high classification precision on X-ray images. A comparative study between the proposed method and state-of-the-art classification techniques is also carried out. The results show that the performance of the proposed method is competitive. We hope that the present work generates relevant contributions to combat COVID-19.

Keywords: COVID-19 · Fractal dimension · Complexity · High-level classification · Complex networks
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 16–26, 2022. https://doi.org/10.1007/978-3-030-93409-5_2
1 Introduction
COVID-19 is caused by a new type of coronavirus capable of infecting a wide variety of domestic and wild animal species. Unlike most coronaviruses, which cause only cold-like symptoms, this new variant in several cases causes pneumonia, respiratory syndrome and several other serious inflammations in the body [1]. A major aggravating factor of this new virus is its transmission rate R-naught (basic reproduction number), which is quite high [2–4]. The disease is often asymptomatic, which can further aggravate its spread, because a person who believes he or she is healthy ends up exposing several others to the pathogen. Containment measures are therefore very important to reduce contact between infected people and vulnerable ones (without antibodies). Quick and precise testing and patient identification are essential for this purpose. Many studies have achieved great success in identifying COVID-19 through chest X-ray image processing [5–7]. In this work, we seek to diagnose the new coronavirus through analysis of chest X-ray images, using a complex network-based high-level classification technique.

Data classification is a daily activity of animals and humans. Our brain classifies objects, colors, and people, among other things, all the time. Classification is also a common activity for computers; however, the way a human and a machine perform classification is quite different. Traditionally, computational classification techniques use only physical characteristics, such as similarity, distance, or distribution, to define data classes, while the human brain looks for data patterns. So it is possible for a computer to make semantic mistakes, such as confusing a monkey walking on two legs with a human. On the other hand, such tasks are easy for human beings, because they classify by analyzing the data at an organizational and semantic level, no matter how similar or dissimilar the data are when measured by physical features. Classification methods that consider only physical features are called low-level classification, while those that take into account not only physical features but also the pattern formation of the data are called high-level classification [11].

One of the important steps in data classification is feature extraction from the original data set. Chest X-ray images of healthy lungs show a regular pattern, while COVID-19 images present irregular patterns; a salient feature is the formation of filaments, as can be seen by comparing Fig. 2 (normal images) with Fig. 3 (COVID-19 images). Such a finding implies that the two classes of images present different geometrical complexity. For this reason, we extract fractal dimensions [8,9] of the training and testing images for classification.

Nowadays, many real systems appear in, or can be converted to, network form, which has given rise to complex network research. Complex networks are large-scale graphs with nontrivial connection patterns [12–16]. Such networks are suitable tools for characterizing data patterns due to their ability to capture the spatial, topological and functional relationships among the data. The interconnections between nodes in a network allow patterns to be identified through suitably defined measures, thus producing a high-level classifier.
The original idea of high-level classification was proposed by Silva and Zhao [11,17] and extended by Carneiro, Colliri and Zhao [19,20]. In this scheme, the low-level classification can be implemented by any traditional classification technique, while the high-level technique explores the complex topological properties of the network built from the input data. In the work introduced in [11], the high-level classification is performed using three network measures: assortativity, clustering coefficient and average degree. In [17], the transient and cycle lengths of a tourist walk were used for characterizing network patterns. In both cases, the classification is performed by checking the conformity of each test data sample with the pattern formed by each network (each class), i.e., a test sample is assigned to the class whose network shows the least variation of the measures under consideration when the sample is inserted. In [19,20], the low-level classification part was eliminated and pure high-level classification techniques were proposed. However, those works employ several network measures, which introduces a set of parameters representing the weight of each measure, and such weights are hard to determine correctly. For this reason, we introduce a modified high-level classification technique using only one network measure, the communicability measure [22], which largely reduces the parameter calibration task.

This work aims to analyze radiographic images to assist the diagnosis and prognosis of patients with COVID-19. For this purpose, a network-based high-level classification technique is modified and applied to classify chest X-ray images of healthy patients and patients with COVID-19. Experimental results show that the proposed method classifies images with high precision. A comparative study has also been carried out between the proposed method and state-of-the-art ones. The obtained results show that the proposed method is competitive. We expect that the present work and future developments will contribute to combating COVID-19.
2 The Proposed Method

In this section, we will present the proposed method for classifying chest X-ray images step by step.

2.1 Image Feature Extraction
In image processing, fractal dimension [8,9] can be used to describe the geometrical complexity of the images. For calculating the fractal dimension, we use the box counting method [9]. The method is applied to binary images. Therefore, given a gray-level chest X-ray image, we first generate several binary images simply using a series of threshold values. For each binary image, we cover the image with a grid, and then count how many boxes of the grid are covering the pattern in the image. Then we repeat the process but using smaller boxes. By shrinking the size of the boxes repeatedly, we can accurately capture the structure of the pattern. The fractal dimension D is the slope of the line when we plot the value of log(N) against the value of log(r):

$$D = \frac{\log(N)}{\log(r)} \tag{1}$$

where N is the number of boxes that cover the pattern and r is the inverse of the box size. Finally, for each gray-level image, we get a vector of values, each of which is the box-counting dimension of one of its binary images.

2.2 The Modified High-Level Classification Technique
The high-level classification is divided into two phases: the training phase and the testing phase. In the training phase, we construct a network for each class of image features, where the feature vector of each image is a node and the connections between nodes are formed by a technique that uses either k Nearest Neighbors (kNN) or Radius Neighbors (RN). The kNN is used for sparse regions and RN for dense regions. If a feature vector of an image has a small number of similar feature vectors of other images, the corresponding node in the network falls in a sparse region. In this case, this node is connected to its k most similar nodes. On the other hand, if a feature vector has a large number of similar ones, it falls in a dense region and it is connected to all the nodes within a predefined similarity radius. The network construction criterion is defined by the following equation:

$$N(x_i) = \begin{cases} RN(x_i, Y_{x_i}), & \text{if } |RN(x_i, Y_{x_i})| > k \\ kNN(x_i, Y_{x_i}), & \text{otherwise} \end{cases} \tag{2}$$
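The construction rule of Eq. 2 can be sketched as follows for the feature vectors of a single class. This is an illustrative sketch only: it assumes Euclidean distance as the (dis)similarity measure and undirected edges, and the names `build_class_network`, `k`, and `radius` are hypothetical rather than taken from the paper.

```python
import math


def build_class_network(points, k, radius):
    """Connect each node by Eq. 2: radius neighbors if there are more than k, else kNN.

    points: list of equal-length feature vectors belonging to one class.
    Returns an undirected adjacency map {node index: set of neighbor indices}.
    """
    n = len(points)
    edges = {i: set() for i in range(n)}
    for i in range(n):
        # Other nodes of the same class, ordered from most to least similar.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        in_radius = [j for j in others if math.dist(points[i], points[j]) <= radius]
        # Dense region: keep every radius neighbor. Sparse region: fall back to kNN.
        chosen = in_radius if len(in_radius) > k else others[:k]
        for j in chosen:
            edges[i].add(j)
            edges[j].add(i)
    return edges


# Three nearby vectors form a dense region; the fourth is isolated (sparse region).
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 0.0)]
print(build_class_network(points, k=1, radius=0.25))
```

In this toy example the isolated point has no radius neighbors, so it is attached to its single nearest node, which keeps the class network connected.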
Here, we have two classes: the normal class and the COVID-19 class. Therefore, two networks are constructed. Figure 1 shows the networks generated using the rule described above and the training data set described in the next section. After the network construction, the measure of each network (each class), $G_{before}(class_i)$, i = 1, 2, is calculated. In our method, we use the average communicability measure $M_{v_i}$ [22] as $G_{before,after}(class_i)$, which accounts not only for the shortest paths connecting two nodes but also for the longer paths, with a lower contribution. The communicability $M_{v_i}$ from node $v_i$ to all other nodes of the network is described by

$$M_{v_i} = \frac{1}{N - 1} \sum_{j \in N,\; i \neq j} \left( \frac{1}{s!} P_{v_i v_j} + \sum_{k > s} \frac{1}{k!} W_{v_i v_j} \right) \tag{3}$$

where s is the length of the shortest path between $v_i$ and $v_j$, $P_{v_i v_j}$ is the number of shortest paths between $v_i$ and $v_j$, and $W_{v_i v_j}$ is the number of paths connecting $v_i$ and $v_j$ of size k > s. The reasoning behind this choice is that the shortest paths are significantly affected by structural changes in a network.

Fig. 1. The constructed networks using the training data set described in the next section. The red network is formed by the normal class training samples and the blue network is formed by the COVID-19 class training samples.

In the testing phase, we classify the unlabeled data samples one by one. Firstly, a new data sample is inserted into each of the two networks constructed so far. Then, the same measure of each network after the insertion, $G_{after}(class_i)$, i = 1, 2, is calculated. Now, we have the impact of the insertion of the new sample on each class:

$$\Delta G(class_i) = \lVert G_{before}(class_i) - G_{after}(class_i) \rVert, \quad i = 1, 2. \tag{4}$$
Finally, the new sample is classified to class j, where

$$\Delta G(class_j) = \min_i \{ \Delta G(class_i) \}, \quad i = 1, 2. \tag{5}$$

In other words, the new sample conforms to the pattern formed by network j, in the sense that its insertion perturbs network j less than it perturbs the other network. Observe that the new sample can even stay far from the elements of class j in the physical space.
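The measure and the decision rule can be sketched together in a few lines. Several assumptions are made here and should be read as such: the path counts in the communicability formula are taken to be walk counts, as in the standard communicability definition, so that the bracketed sum equals the (i, j) entry of the matrix exponential of the adjacency matrix; the network-level measure G is taken to be the mean of the per-node values; and the matrix exponential is approximated by a truncated power series, which is adequate only for small networks. All function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Plain-Python product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def communicability_matrix(adj, terms=30):
    """Truncated series sum_k A^k / k!, an approximation of exp(A)."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # A^0 / 0!
    for k in range(terms):
        for i in range(n):
            for j in range(n):
                total[i][j] += power[i][j]
        # Move from A^k / k! to A^(k+1) / (k+1)!.
        power = [[x / (k + 1) for x in row] for row in mat_mul(power, adj)]
    return total


def average_communicability(adj, terms=30):
    """Mean over nodes of the per-node communicability to all other nodes."""
    n = len(adj)
    g = communicability_matrix(adj, terms)
    per_node = [sum(g[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    return sum(per_node) / n


def classify(before, after):
    """Pick the class whose network measure changes least after inserting the sample.

    before / after: dicts mapping a class label to the adjacency matrix of that
    class network without / with the test node.
    """
    impact = {c: abs(average_communicability(before[c]) - average_communicability(after[c]))
              for c in before}
    return min(impact, key=impact.get), impact


# Toy example: class "A" is a triangle that becomes a 4-clique after insertion,
# class "B" is a 3-node path that becomes a 4-node path.
before = {"A": [[0, 1, 1], [1, 0, 1], [1, 1, 0]],
          "B": [[0, 1, 0], [1, 0, 1], [0, 1, 0]]}
after = {"A": [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 0, 1], [1, 1, 1, 0]],
         "B": [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]}
label, impact = classify(before, after)
print(label)  # B
```

In the toy example the inserted node disturbs the path far less than it disturbs the triangle, so the sample is assigned to class "B". In practice the "after" networks would be produced by inserting the test sample with the same rule as Eq. 2.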
3 Experimental Results

In this section, we present the computational experimental results of chest X-ray image classification using the proposed method.

3.1 Database
In all the simulations of this paper, we use a public data set: the COVID-19 Chest X-ray Database of the COVID-19 Radiography Database [21]. Specifically, we randomly choose 150 healthy lung images and 150 COVID-19 images to form the training and testing data sets. In total, there are 300 images in the data set. To apply our high-level classification technique, we split the data set into a training set and a testing set using a ratio of 9:1, i.e., for each class, we randomly select 135 training samples and 15 testing samples. Our training set has two classes: normal and COVID-19.

3.2 Fractal Feature Extraction
Figures 2 and 3 show some examples of the normal and COVID-19 chest X-ray images, respectively. From these images, we see that the healthy lungs present a regular pattern, while the COVID-19 images present irregular patterns, such as the formation of filaments. Therefore, we extract the fractal dimension features for characterizing the two classes. In order to calculate the fractal dimension (box-counting dimension in this paper) for a gray-level image, we generate a series of binary images from it, then we calculate the box-counting dimension for each binary image. Therefore, a box-counting dimension curve is generated for each original gray-level image. Figures 4 and 5 show the corresponding binary images of the gray-level normal and COVID-19 images, respectively. Figure 6 shows the calculated box-counting dimension curves for those original gray-level images. From this figure, we see that the normal images and the COVID-19 images have quite different complexity levels in terms of fractal dimension.
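The feature extraction described above can be sketched in plain Python as follows. This is a minimal illustration rather than the code used in the experiments: it assumes that pixels at or above the threshold form the pattern, uses a fixed set of box sizes, and estimates D as the least-squares slope of log N against log r with r the inverse box size. The default thresholds mirror the values listed in the captions of Figs. 4 and 5; the function names are hypothetical.

```python
import math


def binarize(gray, threshold):
    """Return a 0/1 image; pixels at or above the threshold count as pattern."""
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray]


def box_count(binary, size):
    """Count the size x size grid boxes that contain at least one pattern pixel."""
    height, width = len(binary), len(binary[0])
    count = 0
    for top in range(0, height, size):
        for left in range(0, width, size):
            rows = binary[top:top + size]
            if any(any(row[left:left + size]) for row in rows):
                count += 1
    return count


def box_counting_dimension(binary, sizes=(1, 2, 4, 8)):
    """Least-squares slope of log N against log r, with r = 1 / box size."""
    counts = [box_count(binary, s) for s in sizes]
    if min(counts) == 0:
        raise ValueError("image contains no pattern pixels")
    xs = [math.log(1.0 / s) for s in sizes]
    ys = [math.log(c) for c in counts]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def fractal_feature_vector(gray, thresholds=(100, 110, 120, 130, 140, 150)):
    """One box-counting dimension per threshold, giving the image's feature vector."""
    return [box_counting_dimension(binarize(gray, t)) for t in thresholds]


filled = [[1] * 16 for _ in range(16)]             # a filled square
line = [[1] * 16] + [[0] * 16 for _ in range(15)]  # a single filled row
print(round(box_counting_dimension(filled), 6))  # 2.0
print(round(box_counting_dimension(line), 6))    # 1.0
```

The two sanity checks recover the expected dimensions of a filled square and a straight line; real radiographs would yield non-integer values that vary with the threshold, which is what the curves in Fig. 6 display.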
Fig. 2. Four chest X-ray images of healthy lungs, panels (a)–(d). The images are taken from the COVID-19 Radiography Database [21].
Fig. 3. Four images of patients infected by COVID-19, panels (a)–(d). The images are taken from the COVID-19 Radiography Database [21].
Fig. 4. Binary normal class images generated using different threshold values on the respective gray-level images (the leftmost ones). The threshold values used for generating the binary images from left to right are 100, 110, 120, 130, 140, and 150, respectively.

3.3 Classification Results
Fig. 5. Binary COVID-19 images generated using different threshold values on the respective gray-level images (the leftmost ones). The threshold values used for generating the binary images from left to right are 100, 110, 120, 130, 140, and 150, respectively.

Fig. 6. Calculated fractal dimensions against the binary image threshold values for the 8 gray-level images shown in Figs. 2 and 3. In this figure, "N" means normal samples and "C" means COVID samples.

After feature extraction, we construct a similarity matrix for each class by calculating the distance between the fractal dimension vectors of each pair of images. Then, we generate a network for each class by applying Eq. 2. Figure 7 shows the constructed normal and COVID-19 networks, respectively, as well as the insertion of two testing data samples (black colored points) for illustration purposes. For the testing data sample shown in Fig. 7(a), $\Delta G(class_{normal}) = 0.0726$ and $\Delta G(class_{COVID-19}) = 0.0534$. Therefore, it is classified to the COVID-19 class. For the second example, shown in Fig. 7(b), $\Delta G(class_{normal}) = 0.0372$ and $\Delta G(class_{COVID-19}) = 0.0752$. It is therefore classified to the normal class. To measure the performance of our high-level classifier, we also performed the training and classification using several classic and state-of-the-art techniques. The obtained results are shown in Table 1.
Fig. 7. The constructed networks using the training data set, panels (a) and (b). In each of the two subfigures, the red network is formed by the normal class training samples and the blue network is formed by the COVID-19 class training samples. To illustrate the classification process, in each of the subfigures the black node is the testing node, which is inserted into the training networks.

Table 1. Classification accuracy comparison. In this experiment, the 150 normal images and the 150 COVID-19 images are each randomly split into two subgroups: 90% of each class (135 images) forms the training group and the other 10% (15 images) forms the testing group. For the new high-level classification technique, the result is averaged over 50 executions, each with randomly selected training and testing samples.

| Classification technique | Classification accuracy |
| AdaBoost | 93.9% |
| Decision Tree | 93.4% |
| Logistic Regression | 96.8% |
| Multilayer Perceptron | 89.1% |
| Naive Bayes | 89.6% |
| Random Forest | 93.7% |
| SVM | 94.7% |
| New High-Level Classification Technique | 97.0% |
As shown in the table, our algorithm presents competitive performance in terms of classification accuracy compared to several classic classification methods for COVID-19 identification. This is due to the fact that the proposed method can identify the pattern of each class instead of relying only on physical features.
4 Conclusions
The results obtained in this work make a relevant contribution to the area of machine learning and, mainly, to combating the pandemic. The paper shows the possibility of extracting patterns from X-ray images through complex network construction and the analysis of impacts on the measures of the networks. The higher accuracy of the proposed method, compared to the traditional and state-of-the-art data classification techniques, shows that pattern identification is an efficient and robust way to predict classes. As future work, a technique will be developed to find the best value of the k parameter in a fully automatic way for each problem, making the algorithm more accurate and removing the need for human adjustment. We will also test the effectiveness of pattern identification using network measurements computed locally, in the vicinity of the inserted node instead of over the entire network, so that network growth does not impact computational cost as much. We will also investigate the possibility of subdividing each class into more than one network, such that not only the entire pattern, but also the subpatterns embedded in each data class can be identified. Last but not least, the possibility of predicting the severity of patients with COVID-19 from the classification results will be studied. This is not only useful for the prognosis of patients, but also important for configuring hospital resources in an optimized way.

Acknowledgment. This work is supported in part by the Sao Paulo Research Foundation (FAPESP) under grant number 2015/50122-0, the Brazilian National Council for Scientific and Technological Development (CNPq) under grant number 303199/2019-9, and the Ministry of Science and Technology of China under grant number G20200226015.
Dynamical Influence Driven Space System Design

Ruaridh A. Clark, Ciara N. McGrath, and Malcolm Macdonald

Electronic and Electrical Engineering, University of Strathclyde, Glasgow G1 1XW, UK
[email protected]
Abstract. Complex networks are emerging in low-Earth orbit, with many thousands of satellites set for launch over the coming decade. These data transfer networks vary based on spacecraft interactions with targets, ground stations, and other spacecraft. While constellations of a few large, precisely deployed satellites often produce simple, grid-like networks, new small-satellite constellations are being deployed on an ad-hoc basis into various orbits, resulting in complex network topologies. By modelling these space systems as flow networks, the dominant eigenvectors of the adjacency matrix identify influential communities of ground stations. This approach provides space system designers with much-needed insight into how differing station locations can better achieve alternative mission priorities and how inter-satellite links are set to impact upon constellation design. Maximum flow and consensus-based optimisation methods are used to define system architectures that support the findings of eigenvector-based community detection.
1 Introduction
Historically, constellations were composed of a few large satellites that produced simple, grid-like communication network topologies. New small-satellite constellations present as complex data transfer networks due to the variety of orbital positions and heterogeneous capabilities of the satellites involved. This paper demonstrates how holistic assessment of these complex networks can aid space system designers.

Data transfer is a spreading process that can be represented by a network in order to detect the relative influence of nodes [3]. A network of contacts averaged over time enables the network's adjacency matrix to provide insights into the major pathways for spread, as in [2] for the identification of influential disease spreaders. For space system flow networks, where targets are sources of data and ground stations are sinks, the eigenvectors of the adjacency matrix can detail the relative influence of ground stations in terms of receiving target data. Specifically, the concept of dynamical influence (the influence of a node's dynamical state on the rest of the network [7]) is employed to detect influence and divide the system according to the data received from targets. This form of community detection was introduced in [3] as the communities of dynamical influence (CDI), whereby communities were detected based on node alignment in a Euclidean space defined by the system's dominant eigenvectors (i.e. those associated with the largest eigenvalues). A network-based approach is proposed herein, partly because it is intractable to evaluate a wide range of feasible architectures through the use of high-fidelity simulations of data transfer.

Constellation design has predominantly focused on large, latency-prioritising constellations that maintain continuous contact between targets and ground stations (referred to as bent-pipe systems). Examples of these systems include OneWeb and Starlink, where target-ground station geographical proximity [1,14] has been shown to drive ground station placement, and minimum-cost, maximum-flow optimisation has been used to define effective inter-satellite link topologies [14]. For many other applications involving data collection, latency is important but not critical as long as it falls within reasonable bounds. These store-and-forward systems, where spacecraft gather information from one location (e.g. ship AIS beacons or Earth monitoring images) and deliver it to another surface location (referred to as a ground station), are the focus of this paper, as ground station placement must account for both target coverage and data throughput.

In the past, ground station network design has relied on engineering judgement and best practices. Lacoste et al. demonstrated the difficulties in applying best practices for selecting multiple ground stations [8]. They found it difficult to predict the contribution of a given ground station to an existing set, highlighting the need for combinatorial optimisation methods for the ground station selection problem. Our approach aims to allow designers to select stations in strategic locations, hence reducing the number of stations or the lease time they require to deliver their service.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 27–38, 2022. https://doi.org/10.1007/978-3-030-93409-5_3
2 Methods
This paper analyses flow networks that represent the data transfer capacities of entities in a space system, where the sink nodes (ground stations) are connected to source nodes (targets) via intermediary spacecraft nodes that can also share inter-satellite link (ISL) connections. A toy example of a space system without ISLs is displayed in Fig. 1. Ground station selection methods are presented that consider data volume and target coverage. The toy example highlights how one ground station receives data from both targets and therefore, regardless of the data volume transferred along each connection, will achieve better target coverage. The loop-back edges create the cycles that are necessary for spectral identification of popular pathways. Loop-back edge weights are far smaller than data transfer edge weights, to minimise the impact of these artificial connections on assessments of target coverage.

2.1 Space System Definition
A space system is defined for this study based on the orbital positions and targets of the Spire Global, Inc. constellation that collects AIS data from ships globally.
Fig. 1. Toy example of a data transfer flow network. Dashed lines indicate artificial loop-back edges that ensure each ground station is part of a cycle.
All 111 spacecraft that were operated by Spire Global, Inc. as of July 2021 are included in this case study. The two-line elements (TLEs) for these spacecraft are obtained from [6]. The Keplerian orbit elements of the spacecraft at epoch are detailed in data set [11]. All spacecraft in the constellation are in approximately circular orbits but, due to the use of rideshare launches, are at a variety of altitudes, Right Ascensions of the Ascending Node (RAANs), and inclinations. The spacecraft are well distributed in RAAN, with 74 in sun-synchronous orbits, 22 at around 51.6° inclination, 8 at around 37°, 4 in near-polar orbits, and 3 in near-equatorial orbits.

The target locations for the case study are based on data provided by Spire Global, Inc. [15] for the 24-h period from 11-August-2019 14:09 UTC to 12-August-2019 14:08 UTC, which give the last reported position of all ships detected in this 24-h window. From this, 250 targets are positioned to approximate the locations of ships worldwide that cannot be seen from land, with these locations detailed in data set [11].

Ninety-four ground station sites are considered for this study, including 77 detailed by Portillo et al. as possible ground station locations [13] and an additional 17 locations estimated from the ground station network published by Spire Global, Inc. These locations are detailed in data set [11].

2.2 Data Transfer Capacity Network
The data transfer capacity networks are graphs defined as G = (V, E), where V is a set of vertices and E is a set of edges, which are ordered pairs of elements of V when considering data transfer. The adjacency matrix, A, is a square N × N matrix, where N is the number of vertices and is equal to the total number of ground stations, spacecraft, and targets in the system. The adjacency matrix captures the network's connections, where $(A)_{ij} > 0$ if there exists an edge connecting vertices i and j and $(A)_{ij} = 0$ otherwise. This matrix is representative of the data transfer capacity of communication links that emerge during a defined period of time. Data transfer capacities are calculated by simulating the motion of all spacecraft. The spacecraft initial locations and orbit paths are propagated using a fixed-step integrator based on the Livermore Solver for Ordinary Differential Equations [17]. The equations are formulated using the Gauss equations as derived from general perturbation methods [16]. Only perturbations due to the Earth's oblateness to the second order ($J_2$) are included.
Fig. 2. Variation in CDI communities (denoted by node colour) with number of input eigenvectors from 3 to 7 (a to e). v1 & v2 are the first two dominant eigenvectors.
From the simulation, the cumulative time that each spacecraft, target, or ground station is in view of a spacecraft is determined. A spacecraft is considered in view of a target or a ground station if the elevation angle is greater than 15°. Therefore, $(A)_{ij} = c_{ij} d_{ij}$, where $c_{ij}$ is the cumulative time in view and $d_{ij}$ is the data rate between nodes i and j. Artificial loop-back edges from all ground stations (sinks) to all targets (sources), see Fig. 1, create cycles that are captured by the system's eigenvectors. A weight of 0.001 is used for the artificial loop-back edges; this value has a minor influence on community assignment, as described in the following section, as long as it is far smaller than the data transfer weights.
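As a minimal sketch of the construction described above, the snippet below assembles a toy capacity matrix in which each entry is the product of cumulative time in view and data rate, and every ground station is given a small artificial loop-back edge to every target. The node indices, contact times, data rates, and the function name `capacity_matrix` are hypothetical choices for illustration and are not taken from the case study.

```python
import numpy as np

LOOPBACK_WEIGHT = 0.001  # artificial sink -> source weight quoted in the text


def capacity_matrix(time_in_view, data_rate, sinks, sources,
                    loopback=LOOPBACK_WEIGHT):
    """Return A with A[i, j] = c_ij * d_ij plus sink -> source loop-back edges."""
    A = np.asarray(time_in_view, dtype=float) * np.asarray(data_rate, dtype=float)
    # Close a cycle from every ground station (sink) back to every target (source).
    A[np.ix_(sinks, sources)] = loopback
    return A


# Hypothetical toy system: nodes 0-1 are targets, 2-3 spacecraft, 4-5 ground stations.
n = 6
c = np.zeros((n, n))  # cumulative time in view (illustrative seconds)
d = np.zeros((n, n))  # data rate (illustrative units)
links = [
    (0, 2, 600, 1.0), (1, 2, 300, 1.0), (1, 3, 400, 1.0),  # target -> spacecraft
    (2, 4, 500, 2.0), (3, 4, 200, 2.0), (3, 5, 700, 2.0),  # spacecraft -> ground
]
for i, j, seconds, rate in links:
    c[i, j] = seconds
    d[i, j] = rate

A = capacity_matrix(c, d, sinks=[4, 5], sources=[0, 1])
```

The loop-back weight is several orders of magnitude below the smallest data transfer weight in this toy matrix, which mirrors the condition stated in the text.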
2.3 Communities of Dynamical Influence
Communities of dynamical influence (CDI), introduced in [3], are used to provide insight into the flow pathways through the network. CDI identifies communities based on their alignment in a Euclidean space defined by the system's dominant eigenvectors. The nodes in this space that are further from the origin of the coordinate system than any of their connections are defined as leaders of separate communities. This is assessed by comparing the magnitude of each node's position vector with the scalar projection onto this vector from all other node position vectors. Each community is then defined with respect to the leaders, by assessing which leader each node is in closest alignment with, using the scalar product of position vectors. Communities are ranked in terms of their influence by evaluating the largest entry of the first dominant eigenvector (v1) for each community (i.e. eigenvector centrality (EC)), where v1 is known to be a non-negative vector [9]. The community that contains the node with the largest EC value (v1) is ranked as the most influential community, with the other communities ranked in descending order according to their largest EC value.

Increasing the number of eigenvectors used by CDI equates to increasing the number of dimensions in the eigenvector-defined Euclidean space, and this can lead to variation in community assignment. For example, the yellow (least influential) community in Fig. 2 b, c, & d is part of the purple (most influential) community in a and grows to include the negative v2 nodes in e. By incorporating more eigenvectors, a more nuanced picture of community structure can be revealed. However, the more eigenvectors that are included, the less prominent the most dominant eigenvectors become and, hence, the greater the risk that the communities no longer reflect the most popular data pathways.

ISL Space Systems. By including ISLs, spacecraft nodes can become more prominent (i.e. larger eigenvector entries) than ground station nodes for the first few dominant eigenvectors. Ground station community assignment using CDI is prone to error if the ground station nodes are not prominent in any of the eigenvectors used. Therefore, an adaptation of CDI is required to detect consistent ground station communities. Previously, 5 dominant eigenvectors were used by CDI; for systems with ISLs, this is updated to include all eigenvectors up to the 5th dominant eigenvector whose largest-magnitude entry belongs to a ground station node. For the example presented later in Sect. 3, 500 kb/s ISL data rates (see Fig. 5a) result in the first 33 dominant eigenvectors being used to evaluate CDI, while for 5 kb/s ISL data rates (see Fig. 5b) the largest-magnitude entries of the first 5 dominant eigenvectors all belong to ground station nodes, so the first 5 dominant eigenvectors are used by CDI.
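The leader detection, community assignment, and ranking steps described at the start of this section can be sketched on a small example. This is a simplified illustration under stated assumptions, not the implementation of [3]: the toy network is symmetric so that its eigenvectors are real, projections are compared only across directly connected nodes, and the function name `cdi_sketch` and the edge weights are hypothetical. A directed flow network would need a general eigen-solver and a choice between left and right eigenvectors, which this section does not specify.

```python
import numpy as np


def cdi_sketch(A, k):
    """Simplified CDI-style assignment for a symmetric, non-negative matrix A."""
    A = np.asarray(A, dtype=float)
    n = len(A)
    _, eigvecs = np.linalg.eigh(A)            # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :k].copy()        # row i = position of node i
    if V[:, 0].sum() < 0:                     # orient v1 to be non-negative
        V[:, 0] = -V[:, 0]
    norms = np.linalg.norm(V, axis=1)         # distance of each node from origin

    # Leader: further from the origin than the projection of every neighbour.
    leaders = []
    for i in range(n):
        neighbours = np.flatnonzero(A[i] > 0)
        projections = V[neighbours] @ V[i] / norms[i]
        if np.all(projections < norms[i]):
            leaders.append(i)

    # Each node joins the leader with which its scalar product is largest.
    alignment = V @ V[leaders].T
    labels = [leaders[j] for j in np.argmax(alignment, axis=1)]

    # Rank communities by their largest v1 entry (eigenvector centrality).
    def top_centrality(leader):
        return max(V[i, 0] for i in range(n) if labels[i] == leader)

    ranked = sorted(leaders, key=top_centrality, reverse=True)
    return leaders, labels, ranked


# Hypothetical toy network: two weighted stars (hubs 0 and 3) joined by a weak link.
A = np.zeros((6, 6))
for i, j, w in [(0, 1, 3.0), (0, 2, 2.0), (3, 4, 1.5), (3, 5, 1.0), (2, 5, 0.1)]:
    A[i, j] = A[j, i] = w

leaders, labels, ranked = cdi_sketch(A, k=2)
```

In this toy case the two hubs emerge as leaders, each leaf joins its own hub, and the community of the more heavily weighted star is ranked as the most influential.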
2.4 Ground Station Selection
A few methods of ground station selection are considered. These methods use the flow network to make a ground station selection based on differing design priorities. Maximum flow aims to maximise the data throughput from any targets to the ground stations. Mean consensus leadership performs a trade-off between data volume and coverage to achieve high data throughput and good target coverage. Minimum consensus leadership prioritises target coverage by improving the connectivity of the least connected target.

Maximum Flow. Maximum flow is assessed using the Ford-Fulkerson algorithm [5] by considering all targets as a single source. This enables maximum flow to be calculated, from the data transfer capacity network, for each sink node (ground station) separately. Maximum flow considers the bottlenecks for data flow by considering every link's transfer capacity from source to sink.

Linear Consensus Protocol. The consensus leadership selections are identified by assessing the ability of ground station nodes to lead target nodes to consensus according to the following consensus protocol. We consider a network where each node $v_i$ has a state $x_i \in \mathbb{R}$ and continuous-time integral dynamics, $\dot{x}_i(t) = u_i(t)$, where $u_i \in \mathbb{R}$ is the control input for agent $i$. The linear consensus protocol is

$$u_i(t) = \sum_{v_j \in N_i} a_{ij} \left( x_j - x_i \right)$$

and describes how each node adjusts its state based on the state of its neighbours, as presented in [12], where $A_2 = [a_{ij}]$ is the weighted adjacency matrix (for paths of length 2) and the set of neighbours for node $v_i$ is $N_i$. This adjacency matrix for paths of length 2, $A_2$, can be created by squaring the adjacency matrix, $A_2 = A^2$. (Note: This approach is only proposed for systems without inter-satellite links (ISLs), as target to ground station paths using ISL transmissions will have length greater than 2.) Given the linear consensus protocol, the state of the network develops according to $\dot{x}(t) = -L x(t)$ with the graph Laplacian matrix, $L$, defined as $L = D - A_2$, where $D = \mathrm{diag}(\mathrm{out}(v_1), \ldots, \mathrm{out}(v_n))$ is a diagonal matrix composed of the out-degrees of each node, i.e. $\mathrm{out}(v_i) = \sum_j a_{ij}$. Given the definitions for the continuous-time integral dynamics and $\dot{x}_i(t)$, the discrete-time agent dynamics are given in [4] as $x_i(t+1) = x_i(t) + \epsilon u_i(t)$, provided that $0 < \epsilon < (\max_i d_{ii})^{-1}$, where $d_{ii}$ is an element of $D$. The choice of $\epsilon$ affects the number of steps required for nodes to reach convergence; therefore, $\epsilon = 0.999 \times (\max_i d_{ii})^{-1}$ is used, as the number of computational steps can be reduced while still guaranteeing convergence of the system [4]. Convergence is defined here as $\bar{x}_i > 0.99 \;\forall\, i \in \tau$, where $\tau$ is the set of all source nodes, when $x_j = 1 \;\forall\, j \in g$ with $g$ the set of all sink nodes.

Consensus Leadership. To select ground stations, all ground station nodes are provided with variable resources that define their contribution in leading the target nodes to a new state. In fact, all nodes have resources assigned in a resource vector, $r = \{r_1, \ldots, r_n\}$, where $n$ is the total number of nodes in the system. However, for all non-ground station nodes these resources are set to $r_i = 1 \;\forall\, i \notin g$, where $g$ is the set of all sink nodes. For the ground station nodes, $0 \le r_i \le 1 \;\forall\, i \in g$ and $\sum_{i} r_i = n_d \;\forall\, i \in g$, where $n_d$ is the desired number of sink nodes (i.e. the number to be selected). These resources define each ground station's contribution by scaling each row of the adjacency matrix by $r$,

$$A_w = A_2 \odot R, \quad \text{where} \quad R = \begin{bmatrix} r \\ r \\ \vdots \\ r \end{bmatrix} \;\; (n \text{ rows}),$$

where $\odot$ indicates a Hadamard product, i.e. element-wise multiplication, which in this case only varies the weight of elements in columns corresponding to sink nodes. This altered adjacency matrix, $A_w$, replaces the adjacency matrix, $A_2$, for assessing state dynamics using the linear consensus protocol. The initial state of all sink nodes is set at $x_i = 1 \;\forall\, i \in g$, where $g$ is the set of all sink nodes. Therefore, the system will reach consensus (i.e. $\bar{x}_i \approx 1 \;\forall\, i \in \tau$, with $\bar{x}$ the mean value and $\tau$ the set of all source nodes) as long as all of the source nodes have a directed path to one of the sink nodes in $g$, which act as network leaders.
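The consensus protocol and resource weighting above can be illustrated with a short simulation. This is a sketch under several assumptions that are not stated in the text: non-sink nodes start at state 0, the toy network omits loop-back edges so that sink states remain fixed at 1, and states are evaluated after a fixed number of steps. The function names, node indices, and edge weights are hypothetical.

```python
import numpy as np


def consensus_states(A, sinks, r, steps):
    """Iterate x <- x - eps * L x with L built from the resource-weighted A^2."""
    A = np.asarray(A, dtype=float)
    A2 = A @ A                                # weights of paths of length 2
    Aw = A2 * np.asarray(r, dtype=float)      # broadcasting scales column j by r[j]
    D = np.diag(Aw.sum(axis=1))               # out-degrees on the diagonal
    L = D - Aw                                # graph Laplacian
    eps = 0.999 / D.diagonal().max()          # step size just inside the bound
    x = np.zeros(len(A))                      # assumed initial state for non-sinks
    x[sinks] = 1.0                            # sinks act as leaders
    for _ in range(steps):
        x = x - eps * (L @ x)
    return x


def objectives(x, targets):
    """Return the mean and minimum consensus leadership objective values."""
    xt = x[targets]
    return float(np.mean(1.0 - xt)), float(1.0 - xt.min())


# Hypothetical toy system: nodes 0-1 are targets, 2-3 spacecraft, 4-5 ground stations.
A = np.zeros((6, 6))
for i, j, w in [(0, 2, 1.0), (1, 2, 1.0), (1, 3, 1.0),
                (2, 4, 2.0), (3, 4, 1.0), (3, 5, 3.0)]:
    A[i, j] = w
targets, sinks = [0, 1], [4, 5]

# Two candidate allocations for a single selected station (resources sum to 1).
r_station_4 = [1, 1, 1, 1, 1, 0]
r_station_5 = [1, 1, 1, 1, 0, 1]
mean_4, min_4 = objectives(consensus_states(A, sinks, r_station_4, steps=3), targets)
mean_5, min_5 = objectives(consensus_states(A, sinks, r_station_5, steps=3), targets)
```

Station 4 can be reached from both targets, so it scores lower (better) on both objectives, whereas station 5 leaves target 0 unreached and its minimum objective stays at 1.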
Algorithm 1. Consensus leadership optimisation

Input: Resource vector (r), desired no. of sink nodes (nd)
Set ns ← 0 (number of selected sink nodes)
while nd > ns do
    if nd − ns > 1 then
        Gradient-based numerical optimisation with input uniform vector
        if function evaluations/iterations limit is exceeded then
            Power optimisation of resources
        end if
        if ri < 1 ∀ i ∈ v, where v are un-selected sink nodes then
            Gradient-based numerical optimisation with input zero vector
            if function evaluations/iterations limit is exceeded then
                Power optimisation of resources
            end if
        end if
        Add i to selected nodes and set ri ← 1, where ri = maxi ri ∀ i ∈ v
    else
        Brute force final node selection
    end if
end while
return Optimised resource allocation
The resource vector r is primarily optimised using a gradient-based numerical optimiser [10], supported by heuristic algorithms that improve efficiency and mitigate against the solver finding local minima far from the global optimum. The optimiser attempts to minimise separate objective functions to produce either a mean or minimum consensus leadership selection. A ground station node i is selected whenever the optimiser assigns ri > 1, or when the optimisation converges and ri has the largest entry of r from the pool of unselected ground stations. An overview of this algorithm is provided in Algorithm 1. Note that Power Optimisation is a heuristic algorithm, described in [3], that efficiently increases the proportion of resources assigned to the largest entries of r. For the final ground station selection it is possible to assess each node individually (brute force) rather than use a gradient-based optimiser.

Mean Consensus Leadership. The mean consensus leadership aims to maximise the mean consensus state of all target nodes, with the optimisation defined as follows,

$$
\begin{aligned}
\min \quad & \frac{\sum_{i \in \tau} (1 - x_i)}{n_\tau} \\
\text{s.t.} \quad & r_j \le 1 \quad \forall\, j \in g \\
& \sum_{j} r_j = n_g \quad \forall\, j \in g, \; n_g \in \mathbb{Z}
\end{aligned}
\tag{1}
$$
where $n_\tau$ is the number of source nodes, $n_g$ is the number of ground stations, and m is the total available resources. The source node states, $x_i$, are evaluated at a point prior to convergence, defined as the closest step to $0.9 \times s_{ref}$, where $s_{ref}$ is the number of steps to convergence. Initially, $s_{ref}$ is defined for a system with a uniform resource vector, $r_i = n_d / n_g \;\forall\, i \in g$, where $n_d$ is the desired number of sink nodes and $n_g$ is the total number of sink nodes considered (i.e. possible ground station locations). Note that $s_{ref}$ and the evaluation step are updated during the optimisation, after each sink node selection, when for a given sink node i the allocated resources become $r_i = 1$ for the first time.

Minimum Consensus Leadership. The minimum consensus leadership aims to maximise the consensus state of the target with the lowest state value at the evaluation step, with the optimisation defined as follows,

$$
\begin{aligned}
\min \quad & 1 - \min_{i \in \tau} (x_i) \\
\text{s.t.} \quad & r_j \le 1 \quad \forall\, j \in g \\
& \sum_{j} r_j = n_g \quad \forall\, j \in g, \; n_g \in \mathbb{Z}.
\end{aligned}
\tag{2}
$$
3 Results
The division of ground stations into communities of dynamical influence (CDI) reveals the differences in ground station connectivity to spacecraft and targets. For the space system described in Sect. 2.1, Fig. 3 a & b shows how community assignment relates to the network embedding according to v1, v2, and v3. The ability of these communities to reach targets globally is captured by v1, where Fig. 3 c shows that the largest v1 entries are attributed to the community of northern and southern ground stations (community 1). These ground station communities can be understood by considering the spacecraft inclinations in the constellation. Inclination provides an estimate of the highest latitude ground station that will be visible to a given spacecraft, where the field of view can allow for contact to be made with high latitude ground stations. Spacecraft inclination also provides insight into the ground stations that will be seen for longest, as those will be the ground stations with latitudes similar to a spacecraft's inclination. Therefore, the 1st community (purple) achieves long contact times with polar-orbiting spacecraft (74 sun-synchronous and 4 near-polar orbit in this constellation), with the lower latitudes in this community serviced most readily by the 22 spacecraft at 51.6°. The 2nd community in terms of influence (cyan) contains equatorial stations that are serviced by 3 equatorial-orbiting spacecraft, which provide a strong connection through to equatorial shipping targets. Finally, the 3rd community (yellow) forms a band that is most connected to the 8 spacecraft at 37° inclination.

Three methods of ground station selection, as described in Sect. 2.4, are employed to propose ground station locations for differing mission priorities.
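To make the first of these methods concrete, the sketch below computes the maximum flow to each ground station separately on a toy network, with all targets joined to a single super-source as described in Sect. 2.4. It uses the Edmonds-Karp variant of Ford-Fulkerson (breadth-first augmenting paths), which is a substitution made here for simplicity. The node names and capacities are hypothetical and are not results from the case study.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph."""
    # Residual graph: copy forward capacities and add zero-capacity reverse edges.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)

    flow = 0.0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow

        # Recover the path, find its bottleneck, and update residual capacities.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck


# Hypothetical toy system: super-source S, targets T0-T1, spacecraft SC0-SC1,
# and ground stations G0-G1, with illustrative capacities.
INF = float("inf")
capacity = {
    "S": {"T0": INF, "T1": INF},
    "T0": {"SC0": 600.0},
    "T1": {"SC0": 300.0, "SC1": 400.0},
    "SC0": {"G0": 500.0},
    "SC1": {"G0": 200.0, "G1": 700.0},
}
flows = {g: max_flow(capacity, "S", g) for g in ("G0", "G1")}
```

Here station G1 has the largest single downlink, yet its flow is capped by the single target-to-spacecraft link feeding it, which is the kind of bottleneck the maximum flow assessment is intended to capture.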
Fig. 3. Ground stations are assigned communities using CDI; colour denotes influence with respect to target connectivity. Nodes are embedded in a with v1 & v2 and in b with v2 & v3. In c, the nodes are placed according to their location, and the dot size for each node i is proportional to $(v_1)_i^3$.
In Fig. 4, selections of 1, 5, 10, 20, and 30 ground stations are presented both in a Euclidean space, defined by v1 & v2, and on a map of the Earth with community assignment noted. Figure 4 demonstrates how CDI and v1 can combine to provide intuitive insights for space system designers. Maximum flow selects (Fig. 4 a & b) exclusively from the most influential community (coloured purple), as its objective improves data throughput without considering target coverage. The selections largely align with v1 magnitude, but v1 is still a global measure of influence that considers connectivity to all targets. Therefore, maximum flow selects exclusively from high latitude ground stations due to the increased time in contact with spacecraft versus other locations. The mean consensus leadership (Fig. 4 c & d) is shown to select ground stations that cover all three communities. For the small selection sizes of 1, 5, & 10, the most influential community is still prioritised, but this method achieves the most even division of ground stations across the three CDI communities for the 20 & 30 ground station selections, resulting in 9, 12, and 9 ground stations in the communities, ordered by influence, for the 30 selection. The minimum consensus leadership (Fig. 4 e & f) shows the same initial prioritisation of the most influential community. However, the selections then diverge from mean consensus leadership by prioritising the least influential community, which likely equates to improving connectivity to the least connected targets in the system. Ground stations are still selected from all three communities to ensure global target coverage.

Fig. 4. Differing ground station selections are shown for various sizes. Selections: a & b maximum flow; c & d mean consensus leadership; e & f minimum consensus leadership. In a, c, & e the smallest selection size employing each ground station is noted, alongside the node's position in a Euclidean space defined by v1 & v2. In b, d, & f the location and community designation using CDI (see Fig. 3b) are highlighted for the ground stations in each selection size. Communities are denoted by colour.

Incorporating inter-satellite links (ISLs) with 1000 km range on every spacecraft, for the space system described in Sect. 2.1, alters the distribution of influential ground stations, as shown in Fig. 5. The updated community maps differ depending on the data rate set for the ISLs, where 500 kb/s is modelled in a and 5 kb/s in b, while downlink data rates to ground stations remain at 1000 kb/s. Comparing these maps with Fig. 3c indicates that the ground stations with the greatest influence continue to be at the highest latitude bands. In contrast to Fig. 3c, the equatorial ground stations are less important, which is likely due to the ISLs enabling the equatorial satellites to pass equatorial target data onto polar-orbiting spacecraft. This claim is also supported by the comparison between Fig. 5a & b. The equatorial ground stations have the smallest v1 values in a with 500 kb/s ISLs, but their v1 values are equivalent to or larger than those of many higher latitude stations in b, when less data can be transmitted between spacecraft with only 5 kb/s ISLs. This increase in equatorial influence in b also results in only one community being detected at latitudes below 50°.

Fig. 5. Ground station community maps are presented for ISL links with data rates of a 500 kb/s and b 5 kb/s. Ground station nodes are assigned to communities using CDI; colour denotes community influence with respect to target connectivity. Nodes are embedded according to a v1 & v2 and b v2 & v3. In c, the nodes are placed according to their location, and the dot size for each node i is proportional to $(v_1)_i^3$.
4 Conclusions
The differing contact patterns of a spacecraft constellation can result in the formation and detection of communities of dynamical influence (CDI). These CDI communities are informative when considering space system design, specifically ground station selection, as they identify both high data throughput locations and locations with different target coverage. These insights were highlighted through comparison with different ground station selection methods. For the selections that maximised data throughput (maximum flow), all ground stations were shown to belong to the most influential community. For the selections that prioritised improving connectivity of the least connected targets (minimum consensus leadership), the largest proportion of stations in the large selection sizes (20 & 30 ground stations) were based in the least influential community. For the selections that aimed to achieve a balance between high data throughput and target coverage (mean consensus leadership), there was an even division of ground stations across the three CDI communities in the 20 & 30 ground station selections. Finally, it was shown using CDI that inter-satellite links (ISLs) should be accounted for by space system designers, as they will likely impact the best performing ground station locations.
References
1. Chen, Q., Yang, L., Liu, X., Guo, J., Wu, S., Chen, X.: Multiple gateway placement in large-scale constellation networks with inter-satellite links. Int. J. Satell. Commun. Netw. 39(1), 47–64 (2021)
2. Clark, R.A., Macdonald, M.: Identification of effective spreaders in contact networks using dynamical influence. Appl. Netw. Sci. 6(1), 1–18 (2021). https://doi.org/10.1007/s41109-021-00351-0
3. Clark, R.A., Punzo, G., Macdonald, M.: Network communities of dynamical influence. Sci. Rep. 9(17590), 1–13 (2019). https://doi.org/10.1038/s41598-019-53942-4
4. Di Cairano, S., Pasini, A., Bemporad, A., Murray, R.M.: Convergence properties of dynamic agents consensus networks with broken links. In: 2008 American Control Conference, pp. 1362–1367. IEEE (2008). https://doi.org/10.1109/ACC.2008.4586682
5. Ford, L.R., Fulkerson, D.R.: Maximal flow through a network. In: Classic Papers in Combinatorics, pp. 243–248. Birkhäuser, Boston (2009). https://doi.org/10.1007/978-0-8176-4842-8_15
6. Kelso, T.S.: Norad two-line element sets current data. https://celestrak.com/. Accessed 07 Nov 2019
7. Klemm, K., Serrano, M.Á., Eguíluz, V.M., San Miguel, M.: A measure of individual role in collective dynamics. Sci. Rep. 2(292) (2012). https://doi.org/10.1038/srep00292
8. Lacoste, F., Guérin, A., Laurens, A., Azema, G., Periard, C., Grimal, D.: FSO ground network optimization and analysis considering the influence of clouds. In: Proceedings of the 5th European Conference on Antennas and Propagation (EUCAP), pp. 2746–2750. IEEE (2011)
9. Lohmann, G., Margulies, D.S., Horstmann, A., Pleger, B., Lepsien, J., Goldhahn, D., Schloegl, H., Stumvoll, M., Villringer, A., Turner, R.: Eigenvector centrality mapping for analyzing connectivity patterns in fMRI data of the human brain. PLoS ONE 5(4), e10232 (2010)
10. MathWorks: 'fmincon', constrained minimization (2020). https://uk.mathworks.com/help/optim/ug/fmincon.html
11. McGrath, C.N., Clark, R.A.: Location of ground stations, targets and spacecraft for spire global case study, July 2021. https://doi.org/10.5281/zenodo.5243314
12. Olfati-Saber, R., Murray, R.M.: Consensus problems in networks of agents with switching topology and time-delays. IEEE Trans. Autom. Control 49(9), 1520–1533 (2004). https://doi.org/10.1109/TAC.2004.834113
13. del Portillo, I., Cameron, B., Crawley, E.: Ground segment architectures for large LEO constellations with feeder links in EHF-bands. In: 2018 IEEE Aerospace Conference. IEEE (2018). https://doi.org/10.1109/AERO.2018.8396576
14. del Portillo, I., Cameron, B.G., Crawley, E.F.: A technical comparison of three low earth orbit satellite constellation systems to provide global broadband. Acta Astronaut. 159, 123–135 (2019). https://doi.org/10.1016/j.actaastro.2019.03.040
15. Spire Global Ltd.: Spire maritime website. https://maritime.spire.com/. Accessed 10 Dec 2019
16. Vallado, D.A., McClain, W.D.: Fundamentals of Astrodynamics and Applications, 3rd edn, pp. 615–633, pp. 642–667. Microcosm Press and Springer (2007)
17. Virtanen, P., et al.: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020)
Classification of Dispersed Patterns of Radiographic Images with COVID-19 by Core-Periphery Network Modeling
Jianglong Yan1,4, Weiguang Liu1, Yu-tao Zhu2, Gen Li3, Qiusheng Zheng3, and Liang Zhao2,4(B)
1 School of Computer Science, Zhongyuan University of Technology, ZhengZhou, China
2 China Branch of BRICS Institute of Future Networks, ShenZhen, China
3 Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhongyuan University of Technology, ZhengZhou, China
4 Department of Computing and Mathematics, University of Sao Paulo (USP), Ribeirao Preto, Brazil
Abstract. In real-world data classification tasks, we often face situations where the data samples of the normal cases present a well-defined pattern, while the features of abnormal data samples vary from one to another, i.e., do not show a regular pattern. Up to now, the general data classification hypothesis requires the data features within each class to present a certain level of similarity. Therefore, such real situations violate the classic classification condition and make classification a hard task. In this paper, we present a novel solution for this kind of problem through a network approach. Specifically, we construct a core-periphery network from the training data set in such a way that the core node set is formed by the normal data samples and the peripheral node set contains the abnormal samples of the training data set. The classification is made by checking the coreness of the testing data samples. The proposed method is applied to classify radiographic images for COVID-19 diagnosis. Computer simulations show promising results of the method. The main contribution is to introduce a general scheme to characterize pattern formation of data “without pattern”.
Keywords: Data classification · Core-periphery network · Dispersed class pattern · COVID-19
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 39–49, 2022. https://doi.org/10.1007/978-3-030-93409-5_4
1 Introduction
COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has become an unprecedented public health crisis. On 30 Aug 2021, the number of cases of COVID-19 exceeded two hundred million and the death toll exceeded four million worldwide. The outbreak of COVID-19 represents a major and urgent threat to global health. As the coronavirus pandemic brings floods of people to hospital emergency rooms around the world, health professionals are struggling to triage patients to determine which ones will need intensive care. At the peaks of the crisis, health professionals faced terrible decisions on who should receive help and resources. Such a severe and complex situation requires the development of suitable tools to respond in a quick and accurate way. Artificial Intelligence (AI), and Machine Learning (ML) in particular, plays an important role in this battle [1]. Quick and correct clinical decisions are not only critical for patient prognosis, but can also help to make collective decisions on hospital management by analyzing the correlation between the severity of patients and the availability of hospital resources. Such clinical decisions are based on patient diagnoses, and one of the important methods is the analysis and classification of X-ray chest images to detect whether a patient has COVID-19 and what the level of severity is [2–5]. Data classification is a supervised machine learning task characterized by the use of a given training set with labeled samples to generate a model (classifier), which is later used to classify unlabeled data [6]. Due to the importance of this learning paradigm in real applications, many classification techniques have been developed [6,7], such as K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naive-Bayes, Multi-Layer Perceptron, Support Vector Machines (SVM), Decision Tree, Graph Neural Networks, and Deep Learning.
Essentially, traditional data classification techniques perform training and classification using physical attributes of the data (for example, distance, similarity or distribution), which are called low-level classification techniques. Often, data items are not isolated points in the attribute space, but tend to form certain patterns. The classification of data that considers the patterning of the input data in addition to the physical attributes is referred to as high-level classification [8]. Although each traditional classification technique has its own feature, all of them share the same heuristic: basically, the classification process consists of dividing the data space into subspaces, each representing a class. The trained classifier serves to define the decision boundaries in the data space and the label induction checks the relative position of each unlabeled instance in relation to the boundaries. These subspaces are not overlapped in the case of crisp classification, but can be slightly overlapped in the case of fuzzy classification. In any case, strong distortions in the shapes of the overlapping classes or subspaces are generally not allowed. In other words, traditional data classification techniques work according to the physical features of the training data, ignoring many other intrinsic and semantic relationships between the data items, which normally generate classes in complex forms in the data space. On the other hand, it is known that the human (animal) brain can identify patterns according to the semantic meaning of the input data. Therefore, it is interesting to perform the classification in addition to the widely applied space-dividing concept. In this context, network-based techniques can make contributions from quite different points of
view. The original idea of building a hybrid technique that combines low-level and high-level classification was proposed by Silva and Zhao [8,9]. In this scheme, the low-level classification can be implemented by any traditional classification technique, while the high-level technique explores the complex topological properties of the network built from the input data. Later, the authors of [10] proposed a network-based classification method according to the “importance” concept instead of the feature similarity applied in almost all classification techniques. Although the above-mentioned works have advanced pattern-based classification, capturing data patterns is in general still a hard problem. Basically, data classification requires finding a data pattern with similar features within each class and dissimilar features between different classes. However, in real-world applications, we frequently encounter the following situation: the data samples of the normal class form a regular pattern, where similar features can be easily found, while the data sample features of abnormal classes vary so much from one to another that it is difficult, if not impossible, to discover similar features among the data samples. This happens in COVID-19 diagnosis by classifying X-ray chest images. The images of normal lungs present high similarity in the original images or extracted features, while the images of COVID-19 present large differences among them. This property has been checked by calculating the distances and the standard deviations of the original images as well as of several features extracted from the original images: histogram of the image pixels, frequency components of the Fourier transform, histogram of Quadtree division [11], and the fractal dimension [12,13] of the images. Therefore, it is hard to capture similar features for the COVID-19 class. Nowadays, many real systems appear in or can be converted to network form, which has given rise to complex network research.
A complex network is a large-scale graph with nontrivial connection patterns [14,15]. One of the well-studied network models is the core-periphery network, and core-periphery structure has been detected in many complex systems [16–21]. This kind of network consists of one or more subsets of tightly connected nodes with high centrality scores, called cores, and one or more subsets of low-degree peripheral nodes, which are connected to some cores but have few or even no connections to other peripheral nodes. In this work, we present a simple but general solution for characterizing a class of data with large variance using core-periphery networks. The normal class forms the core and the abnormal class forms the periphery. The classification can be made by checking the coreness level of the testing data samples. In this way, we present a pattern-finding scheme for data “without pattern”. It is worth noting that the proposed method differs fundamentally from outlier detection [22,23] in at least two points: 1) Outliers are usually rare events. Therefore, there are few outliers compared to the normal data samples. On the other hand, an abnormal class can have many data samples. 2) Following from the first point, outlier detection usually focuses on finding only the normal data pattern while ignoring any outlier pattern. An outlier is considered to be a sample that does not obey the normal data pattern. However, in a data classification problem, we have to characterize the pattern of each class. In this sense, the
proposed method not only provides a novel classification strategy, but also can be used for outlier detection.
2 Methods
In this work, we propose a method for representing a data class with dispersed features and apply it to X-ray chest image classification. The data set consists of two classes: the normal class (healthy lungs) and the COVID-19 class (lungs infected by COVID-19), which are denoted as C_N and C_C, respectively.
1. Feature extraction. For each X-ray chest image, various features are extracted. In this work, we consider the following ones: pixel histogram, frequency components of the Fourier transform, fractal dimension [12,13] and histogram of Quadtree division [11]. The former two features characterize statistical properties of the pixels, while the latter two take into account the geometrical complexity of the images. For calculating the fractal dimension, we use the box-counting method [13]. We cover the image with a grid, and then count how many boxes of the grid cover the pattern in the image. Then we repeat the process using smaller boxes. By shrinking the size of the boxes repeatedly, we can accurately capture the structure of the pattern. The fractal dimension D is the slope of the line when we plot the value of log(N) against the value of log(r): $D = \log(N)/\log(r)$, where N is the number of boxes that cover the pattern and r is the inverse of the box size. Basically, for the original images and each of the four features, images of healthy lungs have high similarity, while the similarity among the lungs with COVID-19 is very low. One of the features, which presents the biggest variance within each data class, is selected for classification.
2. Training. The training phase consists of constructing a core-periphery network for the selected data sample feature of the training data set. We here consider the binary classification problem. After an image feature is chosen, a similarity matrix M is formed containing the distance between each pair of image features. Euclidean distance is used here.
In the core-periphery network, each data sample (the feature vector of an original image) is a node. Each node is connected to all other nodes falling within a distance threshold value ε. After a network is formed with a specific value of ε, we calculate the core-periphery ρ-measure. This process is repeated for other values of ε. The final network is chosen as the one where the ρ-measure starts to reach its highest value. The ρ-measure can be calculated by the following formula [16]:
$$\rho = \sum_{i,j} a_{ij}\,\delta_{ij}, \qquad (1)$$
where A is the adjacency matrix of a network G = (V, E) with N nodes, and its element a_ij = 1 if node i and node j are linked and 0 otherwise. δ_ij = c_i c_j, where c_i measures the coreness of node i; c_i = 1 or c_i = 0 means that node
i belongs to the core or the periphery, respectively. An ideal core/periphery network model consists of a fully connected core and a periphery that is fully connected to the core, with no links between any two nodes in the periphery. The ρ-measure checks how close a network is to the ideal core-periphery structure. Specifically, the ρ-measure reaches its maximum value when A and Δ = [δ_ij] are identical. In other words, the core-periphery structure seeks a configuration vector c that maximizes ρ.
3. Classification. In the classification phase, we insert the testing data sample x into the core-periphery network constructed so far. Then we calculate its coreness measure. The testing sample is classified into the core class (i.e., the normal class in the image classification problem treated in this paper) if it gets a high coreness value; otherwise, it is classified into the peripheral class (i.e., the COVID-19 class in this paper). It is well known that, in a core-periphery network, the core nodes have high centrality values, but it is not true that the peripheral nodes always have low centrality values. Therefore, in this work, the coreness of a node is determined using the k-core concept [21,24]. A k-core of a network is a maximal component in which all vertices have degree at least k. A vertex has coreness k if it belongs to a k-core but not to any (k + 1)-core. Since the constructed networks are expected to be of core-periphery type, there is a K_c-core that contains most of the nodes of the normal class (the core class), while the nodes of the peripheral class are outside the K_c-core. The classification can be done by just checking which group of nodes the testing sample belongs to.
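The box-counting estimate used in the feature extraction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the image has already been binarised into a list of rows of 0/1 values, and the function names `box_count` and `fractal_dimension` as well as the choice of box sizes are hypothetical.

```python
import math


def box_count(image, box):
    """Count grid boxes of side `box` that contain at least one foreground pixel."""
    rows, cols = len(image), len(image[0])
    count = 0
    for r0 in range(0, rows, box):
        for c0 in range(0, cols, box):
            if any(image[r][c]
                   for r in range(r0, min(r0 + box, rows))
                   for c in range(c0, min(c0 + box, cols))):
                count += 1
    return count


def fractal_dimension(image, box_sizes):
    """Least-squares slope of log(N) against log(r), with r = 1 / box size.

    Assumes every box size yields at least one occupied box (log(0) is undefined).
    """
    xs = [math.log(1.0 / b) for b in box_sizes]
    ys = [math.log(box_count(image, b)) for b in box_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Toy sanity checks: a filled square behaves like a 2-D object,
# a single straight line like a 1-D object.
filled = [[1] * 16 for _ in range(16)]
line = [[1 if r == 0 else 0 for _ in range(16)] for r in range(16)]
dim_filled = fractal_dimension(filled, [1, 2, 4, 8])
dim_line = fractal_dimension(line, [1, 2, 4, 8])
```

Fitting the slope over several box sizes, rather than taking a single ratio, follows the description of plotting log(N) against log(r).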
3 Experimental Results
In this section, we present the simulation results of applying the proposed method to COVID-19 image pattern characterization and classification. In all the simulations of this paper, we use a public data set: the COVID-19 Radiography Database [25]. Specifically, we randomly choose 150 healthy lung images and 150 COVID-19 images to form the training and testing data sets.
Illustration of COVID-19 Pattern Dispersion
In this subsection, we show that the features of the data samples in the normal class present high similarity, while those in the COVID-19 class are widely dispersed. In this way, the traditional classification hypothesis does not hold. We first present some illustrative examples. In Fig. 1, the upper four X-ray chest images are healthy lungs and the lower four are patients infected by COVID-19. Figures 2 and 3 show the Quadtree division and the corresponding quadrant size histogram for each of the 2 × 4 images. We see that the normal chest images present similar features, but the COVID-19 images show larger variations from one to another. Figure 4 shows the fractal dimensions calculated for the four healthy and the four COVID-19 images. One can perceive that the
COVID-19 curves are much more dispersed than the normal ones. The same behavior can also be seen in the histograms of the frequency components of the Fourier transform.
Fig. 1. (a)-(d): Four chest images of healthy lungs. (e)-(h): Four images of patients infected by COVID-19. The images are taken from the COVID-19 Radiography Database [25] (https://www.kaggle.com/tawsifurrahman/covid19-radiography-database).
Now we calculate the average distance and the standard deviation of each extracted feature for each of the normal and COVID-19 classes over the whole training data set. The results are presented in Table 1. We see that, in all cases, the images of the normal class have smaller average distances and standard deviations, while the COVID-19 class presents large average distances and standard deviations, indicating that we cannot correctly classify a COVID-19 image according to the similarity between the test image and the training images. In other words, the classical data classification hypothesis does not hold in this kind of case, and new mechanisms to deal with data classes with large variances need to be designed.
Core-Periphery Network Formation
Now we construct the core-periphery network for the training data set. Here, each data sample (an image or a vector of the image feature) is a node. All pairs of nodes whose distance is smaller than ε are connected, where ε is a threshold value applied to the similarity matrix of the training data samples. For each network constructed using a different ε value, we calculate the core-periphery ρ-measure by Eq. 1. The final core-periphery network is the one at which the ρ-measure starts to reach its highest value.
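The network construction and the ρ-measure of Eq. 1 can be sketched as below for a single threshold. This is only an illustration under stated assumptions: the coreness vector `core` is taken directly from the training labels (1 for normal, 0 for COVID-19), δ_ij = c_i c_j is used as written in Sect. 2, and the names `epsilon_network` and `rho_measure` and the toy feature vectors are hypothetical.

```python
import math


def epsilon_network(features, eps):
    """Adjacency matrix linking every pair of samples closer than eps."""
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(features[i], features[j]) < eps:
                adj[i][j] = adj[j][i] = 1
    return adj


def rho_measure(adj, core):
    """Eq. (1): sum over ordered pairs i != j of a_ij * c_i * c_j."""
    n = len(adj)
    return sum(adj[i][j] * core[i] * core[j]
               for i in range(n) for j in range(n) if i != j)


# Toy data: three tightly grouped "normal" samples, three dispersed "abnormal" ones.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (2.0, 2.0), (-3.0, 1.0), (1.0, -4.0)]
core = [1, 1, 1, 0, 0, 0]  # assumption: coreness taken from the class labels

adj = epsilon_network(features, eps=0.5)
rho = rho_measure(adj, core)
```

With this threshold the three normal samples form a clique and the dispersed samples stay isolated, so ρ counts the three core-core links once in each direction.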
Fig. 2. Corresponding Quadtree divisions of the images shown by Fig. 1.
Fig. 3. Corresponding histograms of the quadrant sizes of Quadtree divisions shown by Fig. 2.
Figure 5 shows the ρ-measure of the networks obtained by varying the distance threshold value ε. We see that the ρ-measure reaches its highest value around ε = 1.14. Taking this value, the corresponding generated network and its adjacency matrix can be seen in Fig. 6. Both present a clear core-periphery pattern.
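The threshold scan summarised by Fig. 5 can be sketched as follows: evaluate the ρ-measure for a list of candidate thresholds and keep the smallest one at which the maximum is first reached. This is a sketch under assumptions, not the authors' procedure: the coreness vector is fixed from the labels, the distance matrix and candidate thresholds are toy values, and the helper names are hypothetical. It does not reproduce the reported value of 1.14.

```python
def rho_for_threshold(dist, core, eps):
    """Rho of the network that links all pairs with distance below eps."""
    n = len(dist)
    return sum(core[i] * core[j]
               for i in range(n) for j in range(n)
               if i != j and dist[i][j] < eps)


def first_threshold_at_max(dist, core, candidates):
    """Smallest candidate threshold at which rho attains its maximum."""
    scores = [(eps, rho_for_threshold(dist, core, eps))
              for eps in sorted(candidates)]
    best = max(score for _, score in scores)
    for eps, score in scores:
        if score == best:
            return eps, score


# Toy symmetric distance matrix: nodes 0-2 are "normal", node 3 is "abnormal".
dist = [
    [0.0, 0.2, 0.5, 2.0],
    [0.2, 0.0, 0.9, 2.5],
    [0.5, 0.9, 0.0, 3.0],
    [2.0, 2.5, 3.0, 0.0],
]
core = [1, 1, 1, 0]  # assumption: coreness taken from the class labels

chosen = first_threshold_at_max(dist, core, [0.3, 0.6, 1.0, 1.5, 2.2])
```

Under these assumptions ρ grows as core nodes become linked and then plateaus once the core is fully connected, which is one way to read "starts to reach its highest value".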
Fig. 4. Fractal dimensions against the binary image thresholds of the images shown by Fig. 1. Given a grey-level image, it is first transformed to a binary image using a threshold value. Then, the box-counting dimension is calculated on the binary image.

Table 1. Average distance and standard deviation of the training images. The training set is formed by randomly selecting 150 images of the normal class and 150 images of the COVID-19 class from the public data set [25].

Feature              Mean distance           Standard deviation
                     Normal    COVID-19      Normal    COVID-19
Original Images      0.135     0.656         0.150     0.673
FFT                  0.088     0.611         0.139     0.936
Quadtree             0.574     0.834         0.309     0.502
Fractal dimension    0.196     0.45          0.105     0.273
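The kind of dispersion statistic reported in Table 1 can be sketched as below. The sketch assumes that the mean and the standard deviation are both taken over all pairwise Euclidean distances within one class, which the text does not state explicitly; the function name `dispersion` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def dispersion(vectors):
    """Mean and population standard deviation of all pairwise distances."""
    distances = [math.dist(u, v) for u, v in combinations(vectors, 2)]
    return mean(distances), pstdev(distances)


# Toy feature vectors: a compact class and a widely spread class.
compact = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
spread = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0)]

mean_compact, std_compact = dispersion(compact)
mean_spread, std_spread = dispersion(spread)
```

A class whose samples resemble each other yields small values for both statistics, which is the qualitative behaviour Table 1 shows for the normal class.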
Classification Results
After the core-periphery network is constructed, we can classify each test sample by inserting it into the network and calculating its coreness measure. The testing sample is classified into the core class if it gets a high coreness value; otherwise, it is classified into the periphery class (the COVID-19 class in this paper). Using this simple criterion, we get the classification results for two different configurations of the training and testing data sets, which are shown in the second and third lines of Table 2, respectively. We see that the proposed method achieves good classification accuracy in comparison to the classic and state-of-the-art techniques. The classification techniques under comparison are: AdaBoost (Ada), Decision Tree (DT), Multi-Layer Perceptron (MLP), Naive-Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The proposed technique is denoted as Core-Periphery Network Method (CP).
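The coreness-based decision described above can be sketched with a plain k-core peeling routine. This is a minimal sketch under assumptions: the training network is given as a dictionary of neighbour sets, the test sample is linked to the training nodes assumed to lie within the distance threshold, and `k_c` is a hypothetical parameter standing for the K_c of Sect. 2 (how it is chosen is not specified here).

```python
def core_numbers(adj):
    """Coreness of every node, obtained by repeatedly removing a minimum-degree node."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    coreness = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)
        k = max(k, degree[v])      # the shell index never decreases while peeling
        coreness[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return coreness


def classify(adj, test_neighbours, k_c):
    """Insert a test node and label it by comparing its coreness with k_c."""
    extended = {v: set(nbrs) for v, nbrs in adj.items()}  # leave the input untouched
    extended["test"] = set(test_neighbours)
    for u in test_neighbours:
        extended[u].add("test")
    return "core" if core_numbers(extended)["test"] >= k_c else "periphery"


# Toy training network: nodes 0-3 form a clique (core), nodes 4 and 5 hang off it.
training = {0: {1, 2, 3, 4}, 1: {0, 2, 3, 5}, 2: {0, 1, 3},
            3: {0, 1, 2}, 4: {0}, 5: {1}}

label_near_core = classify(training, {0, 1, 2, 3}, k_c=3)
label_far = classify(training, {0}, k_c=3)
```

A sample linked to the whole clique joins the innermost shell, while a sample with a single link stays in the outer shell.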
Fig. 5. ρ-measures of the networks generated by varying the ε values.

Fig. 6. (a) The adjacency matrix of the generated network with ε = 1.14. The white points represent the links between pairs of nodes and the black points mean the absence of links. (b) The generated network with ε = 1.14. The red nodes are the normal class data samples and the blue nodes are the COVID-19 data samples.

Table 2. Classification accuracy comparison. The second line of this table shows the classification accuracy for the following data configuration: each of the 150 normal and 150 COVID-19 image sets is randomly split into two subgroups, where 90% of each set of 150 images (135 images) forms the training group and the other 10% (15 images) forms the testing group. The third line shows the classification accuracy for the following data configuration: each of the 150 normal and 150 COVID-19 image sets is randomly split into two subgroups, where 80% of each set of 150 images (120 images) forms the training group and the other 20% (30 images) forms the testing group. For the proposed CP method, the result is averaged over 50 executions, each with randomly selected training and testing samples.

Technique       Ada    DT     MLP    NB     RF     SVM    CP
Accuracy (%)    83.3   83.3   50.0   93.3   93.3   96.6   100.0
Accuracy (%)    83.3   83.3   50.0   93.3   90.0   96.6   98.3
4 Conclusions
In this work, we have presented a novel method to handle data classification problems when the classic classification hypothesis does not hold. Specifically, a core-periphery network is constructed in which the core subnetwork represents the normal class and the peripheral subnetwork represents abnormal classes with different levels of internal dispersion. In other words, we present a method to capture the data class pattern for data “without pattern”. Our approach, in spite of its simplicity, paves a way to deal with general classification problems. For multi-class problems, where the data set is divided into different levels of variance (different classes), the single core-periphery network presented in this paper can be generalized to pairwise ones. Another idea for solving multi-class problems is to construct a multi-level core-periphery training network. These will be considered in future work.
Acknowledgment. This work is supported in part by FAPESP under grant number 2015/50122-0, the Brazilian National Council for Scientific and Technological Development (CNPq) under grant number 303199/2019-9, and the Ministry of Science and Technology of China under grant number G20200226015.
References
1. Hu, Y., Jacob, J., Parker, G.J.M., Hawkes, D.J., Hurst, J.R., Stoyanov, D.: The challenges of deploying artificial intelligence. Nat. Mach. Intell. 2, 298–300 (2020)
2. Pereira, R.M., Bertolini, D., Teixeira, L.O., Silla, C.N., Jr., Costa, Y.M.G.: COVID-19 identification in chest X-ray images on flat and hierarchical classification scenarios. Comput. Methods Programs Biomed. 194, 105532 (2020)
3. Abbas, A., Abdelsamea, M.M., Gaber, M.M.: Classification of COVID-19 in chest X-ray images using DeTraC deep convolutional neural network. Appl. Intell. 51(2), 854–864 (2020). https://doi.org/10.1007/s10489-020-01829-7
4. Elaziz, M.A., Hosny, K.M., Salah, A., Darwish, M.M., Lu, S., Sahlol, A.T.: New machine learning method for image-based diagnosis of COVID-19. Plos One 15, e0235187 (2020)
5. Loey, M., Smarandache, F., Khalifa, N.E.M.: Within the lack of chest COVID-19 X-ray dataset: a novel detection model based on gan and deep transfer learning. Symmetry 12(4), 651 (2020)
6. Bishop, C.M.: Pattern Recognition and Machine Learning. 2nd edn. Springer (2011)
7. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
8. Silva, T.C., Zhao, L.: Network-based high level data classification. IEEE Trans. Neural Netw. Learn. Syst. 23(6), 954–970 (2012)
9. Silva, T.C., Zhao, L.: High-level pattern-based classification via tourist walks in networks. Inform. Sci. 294, 109–126 (2015)
10. Carneiro, M.G., Zhao, L.: Organizational data classification based on the importance concept of complex networks. IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3361–3373 (2018)
11. Finkel, R.A., Bentley, J.L.: Quad trees: a data structure for retrieval on composite keys. Acta Informatica 4, 1–9 (1974)
12. Mandelbrot, B.: How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science 156, 636–638 (1967)
13. Falconer, K.: Fractal Geometry: Mathematical Foundations and Applications. 1st edn. John Wiley & Sons (2003)
14. Barabasi, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
15. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(5439), 440–442 (1998)
16. Borgatti, S.P., Everett, M.G.: Models of core/periphery structures. Soc. Netw. 21, 375–395 (1999)
17. Csermely, P., London, A., Wu, L.-Y., Uzzi, B.: Structure and dynamics of core/periphery networks. J. Complex Netw. 1, 93–123 (2013)
18. Zhou, S., Mondragon, J.R.: The rich-club phenomenon in the Internet topology. IEEE Comm. Lett. 8, 180–182 (2004)
19. Dorogovtsev, S.N., Goltsev, A.V., Mendes, J.F.F.: k-core organization of complex networks. Phys. Rev. Lett. 96, 040601 (2006)
20. Holme, P.: Core-periphery organization of complex networks. Phys. Rev. E 72, 046111 (2005)
21. Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E.: A model of Internet topology using k-shell decomposition. PNAS 104, 11150–11154 (2007)
22. Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41, 1–58 (2009)
23. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126 (2004)
24. Kong, Y.-X., Shi, G.-Y., Wu, R.-J., Zhang, Y.-C.: k-core: theories and applications. Phys. Rep. 832, 1–32 (2019)
25. COVID-19 Radiography Database. https://www.kaggle.com/tawsifurrahman/covid19-radiography-database. Accessed 4 Feb 2021
Small Number of Communities in Twitter Keyword Networks
Linda Abraham, Anthony Bonato(B), and Alexander Nazareth
Ryerson University, Toronto, ON, Canada
Abstract. We investigate networks formed by keywords in tweets and study their community structure. Based on datasets of tweets mined from over seven hundred political figures in the U.S. and Canada, we hypothesize that such Twitter keyword networks exhibit a small number of communities. Our results are further reinforced by considering so-called pseudo-tweets, generated randomly and using AI-based language generation software. We speculate as to the possible origins of the small community hypothesis and discuss further attempts at validating it.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 50–61, 2022. https://doi.org/10.1007/978-3-030-93409-5_5
1 Introduction
Twitter is a dominant social media and micro-blogging platform, allowing users to present their views in concise 280-character tweets. An active social media presence has become the mainstay of modern political discourse in the United States and Canada; many politicians, such as members of Congress and members of Parliament, frequently tweet. The corpus of tweets by such political figures forms a massive data source of regular updates on government strategy and messaging. Tweets may reveal approaches to reinforce political platforms, describe policy, or either bolster support from followers or antagonize political adversaries.
Besides their political content, the mining and analysis of tweets by political figures may lead to fresh insights into the structure and evolution of networks formed by Twitter keywords. In Twitter keyword networks, the nodes are keywords, which are significant words, distinguished from common stop words such as “and” or “the.” Nodes are adjacent if they are in the same tweet; we may consider this a weighted graph, where multiple edges arise from multiple occurrences of keyword pairs. These are co-occurrence networks of keywords in tweets, and the extraction and analysis of co-occurrence networks provide a quantitative method for the large-scale analysis of such tweets. Networked data may be mined from Twitter, and algorithms applied to probe the community structure of the resulting networks.
See Fig. 1 for an example of the approach described in the previous paragraph, taken from tweets by then-President Donald J. Trump in March 2020. We chose this month as it was the beginning of major lockdowns owing to the COVID-19 pandemic in the U.S. In the figure, 100 of the top keywords used by Trump are linked by co-occurrence. The graph in Fig. 1 clusters into communities focused on the following five themes: Republican endorsements (pink), COVID-19 response (green), attacks on news media or Democrats (orange), the economy (dark green), and White House announcements (blue). We will describe more fully how the keyword networks we investigated were formed in the next section.
Fig. 1. Keyword network of tweets by President Trump from March of 2020. (Color figure online)
Word co-occurrence networks have been studied extensively within network science. Network analysis of co-occurrence networks was used to map knowledge structure in scientific fields; see, for example, [11,15,25]. Hashtag co-occurrence networks in Twitter were studied in [22,23]. Transcript data of patients with potential Alzheimer’s disease was studied from the context of co-occurrence networks in [12]. Co-occurrence networks of character names in novels were employed in [2,16] to determine communities and key protagonists. Model selection techniques using machine learning were employed in [5] to align co-occurrence networks in movie scripts with various complex network models, such as preferential attachment and random graphs with given expected degree sequences. A detailed review of the literature up to 2014 on word co-occurrence networks may be found in [9].
In the present paper, our analysis of Twitter keyword networks does not focus on the meaning of specific tweets, but more on the overarching structure of Twitter keyword networks from political figures over the year 2020. We chose political figures as they tend to generate a consistent number of tweets over time. We analyzed patterns in the narrative and vocabulary used over the year across hundreds of accounts. A multi-year study of the keyword networks from
the tweets of President Trump revealed a low number of communities; see [6]. On average, and over several years, Trump’s tweets clustered into at most five communities. The results of [6] and the much larger dataset discussed in Sect. 3 led us to hypothesize that tweets organize themselves into a small number of communities for a distinct user. In particular, the hypothesis proposes that users on Twitter post about a small number of topics, using whatever keywords are relevant to them. We organize the discussion in this paper as follows. In Sect. 2, we consider our framework for the analysis of networks of Twitter keywords. We hypothesize that tweets organize themselves into a low number of communities for a distinct user, and refer to this thesis as the small community hypothesis. Our methods are detailed in Sect. 3, which describes the mining of keyword networks derived from over seven hundred political figures in the U.S. and Canada. Our results support the small community hypothesis and are strengthened by considering control data such as random words and tweets formed by AI using GPT-2. We finish with a discussion of our results and propose future work. We consider undirected graphs with multiple undirected edges throughout the paper. Additional background on graph theory and complex networks may be found in the book [4].
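The keyword network described in the introduction (keywords as nodes, co-occurrence in the same tweet as weighted edges) can be sketched as follows. This is only an illustration: the stop-word list, the tokenisation and the function names are assumptions made for the example, and the paper's own extraction procedure is described in its later sections.

```python
from collections import Counter
from itertools import combinations

# Illustrative stop words only; a real analysis would use a fuller list.
STOP_WORDS = {"and", "the", "a", "of", "to", "in", "is", "on", "for"}


def keywords(tweet):
    """Lower-cased, de-punctuated, de-duplicated non-stop words of one tweet."""
    words = (w.strip(".,!?:;\"'()").lower() for w in tweet.split())
    return sorted({w for w in words if w and w not in STOP_WORDS})


def cooccurrence_network(tweets):
    """Map each unordered keyword pair to the number of tweets containing both."""
    edges = Counter()
    for tweet in tweets:
        for u, v in combinations(keywords(tweet), 2):
            edges[(u, v)] += 1
    return edges


# Made-up example tweets, not taken from the dataset.
tweets = [
    "The economy is strong",
    "Jobs and the economy",
    "Strong jobs, strong economy!",
]
network = cooccurrence_network(tweets)
```

Sorting the keywords of each tweet keeps every edge key in a canonical order, so the pair (economy, jobs) is never stored separately as (jobs, economy).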
2 Small Community Hypothesis
There is no universal consensus on a precise definition for a community in a network. According to Flake, Lawrence, and Giles [10], a community in a graph is a set of vertices with more links to community members than to non-members. Modularity is defined as
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{i,j} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j),$$
where $A_{i,j}$ is the weight of the edge between $i$ and $j$, $k_i$ is the sum of the weights of all edges incident to vertex $i$, $c_i$ is the community of which vertex $i$ is a part, $\delta$ is the Kronecker delta function, and $m$ is the sum of all edge weights, given by $m = \frac{1}{2} \sum_{i,j} A_{i,j}$. Modularity is used to measure the quality of a partition of a graph and also as an objective function to optimize. Exact modularity optimization is a computationally hard problem, so we look to approximation algorithms to calculate a graph partition. For this, we used the Louvain algorithm [3] that takes a greedy approach to modularity optimization. As discussed in the introduction, an analysis of keywords taken from Trump’s tweets over multiple years revealed a small number of communities. The small community hypothesis (or SCH) states that the tweets from any user, given sufficient volume, will group themselves into a low number of thematically related clusters. While the SCH does not predict the exact number of clusters, data presented in Sect. 3 suggests that in Twitter keyword networks, the number of
communities is typically less than ten. For example, the keyword networks in Fig. 3 of Prime Minister Justin Trudeau and Fig. 1 of President Trump resolve into four and five communities, respectively. Note that the SCH does not predict what communities occur in an individual user’s Twitter account, or how such communities change over time. Instead, we view it as an emergent, quantitative property of Twitter keyword networks. In the next section, we will describe our methods and data, which we extended to a much wider dataset of Twitter users. Further, we considered two other datasets generated as control groups to test our methodology.
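To make the modularity definition of this section concrete, the following minimal sketch evaluates it directly for a small weighted graph stored as a symmetric matrix. The function name and the toy graph (two triangles joined by one edge) are illustrative assumptions and are not taken from the paper; the Louvain algorithm itself is not implemented here.

```python
def modularity(A, communities):
    """Evaluate (1/2m) * sum_{i,j} [A[i][j] - k_i*k_j/(2m)] * delta(c_i, c_j).

    A: symmetric weight matrix as a list of lists.
    communities: list giving the community label of each vertex.
    """
    n = len(A)
    k = [sum(row) for row in A]      # weighted degree k_i of each vertex
    two_m = sum(k)                   # 2m, the sum of all entries of A
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # Kronecker delta
                q += A[i][j] - k[i] * k[j] / two_m
    return q / two_m


# Toy graph (assumption): triangles {0,1,2} and {3,4,5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = [[0] * 6 for _ in range(6)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0] * 6)             # everything in one community
```

In this toy case the two-triangle partition scores higher than the single-community partition, which is the kind of comparison a modularity-optimizing method such as Louvain relies on.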
3 Methods and Data
Twitter has an Application Programming Interface (or API) that can be accessed for free, with some restrictions. An API is a way of requesting information through a computer program, which we used for retrieving tweets from our users of interest. To take advantage of this API, we used the Tweepy Python library [20], which allows ready access to Twitter data with the Python programming language. Occasionally, there was an issue within the code or API, and as a result, nothing was returned. The most notable case of this was for the account of President Trump with handle @realDonaldTrump. For his account, we downloaded the tweets manually from the Trump Twitter Archive (or TTA), a third-party website which keeps a record of all of President Trump’s tweets, including those that may have been deleted [19]. This is an especially valuable resource now that @realDonaldTrump has been suspended from the platform, and it is no longer possible to view his historical Twitter feed on the official site.
For a particular user, the Python code we wrote performed the data processing as follows:
1. Collected tweets from Twitter API or Trump Twitter Archive.
2. Removed retweets and non-English tweets if necessary.
3. Tokenized tweets, removed stop words, calculated monthly keyword frequency.
4. For each month, generated 100 × 100 adjacency matrices of the top 100 keywords by frequency.
The first step used Tweepy and the Twitter API to gather data between dates. For President Trump, we downloaded the data in CSV format from the TTA, again specifying a range of dates. We obtained a CSV file including all relevant data. This includes the tweet itself, along with metadata including date, number of retweets, and number of likes. In the second step, we filtered out retweets. We only considered tweets in English (including other languages would skew the number of communities detected). Additionally, for Canadian and American politicians, typically non-English tweets are just duplicates of English tweets. Prime Minister Trudeau is a good example of that phenomenon, as he tweets everything once in English and a second time in French. The third step involved
common NLP techniques to break down the sentences into something more manageable for a machine to process. Tokenization is the process of extracting words from a sequence of characters [7]. Once we have identified the words, we can remove the stop words, which are the common and less meaningful words. For our purposes, we used the Natural Language Toolkit (NLTK) Python library, which includes a corpus of stop words [13]. These stop words are mostly articles and pronouns (for example, “the” and “he”), together with others that we added manually.
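The third processing step can be sketched as follows. This is a dependency-free illustration under stated assumptions: the tiny `STOP_WORDS` set is a stand-in for the NLTK stop word corpus plus the manually added words, the regular expression is one simple tokenization choice among many (it does not handle URLs or emoji), and the function names and sample tweets are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Stand-in stop word list (assumption); the paper uses the NLTK corpus [13]
# extended with manually chosen words.
STOP_WORDS = {"the", "a", "an", "and", "he", "she", "it", "we",
              "to", "of", "in", "is", "with"}


def tokenize(text):
    """Lowercase a tweet and extract word-like tokens, keeping inner hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def monthly_top_keywords(tweets, top_n=100):
    """tweets: iterable of (month, text) pairs, with month given as 'YYYY-MM'.

    Returns {month: [keyword, ...]} holding the top_n most frequent keywords.
    """
    counts = defaultdict(Counter)
    for month, text in tweets:
        counts[month].update(t for t in tokenize(text) if t not in STOP_WORDS)
    return {m: [w for w, _ in c.most_common(top_n)] for m, c in counts.items()}


# Hypothetical sample data for illustration only.
sample = [
    ("2020-03", "Stay safe and follow the COVID-19 guidance"),
    ("2020-03", "We are working to keep Canadians safe"),
    ("2020-04", "The economy is reopening"),
]
top = monthly_top_keywords(sample, top_n=3)
```

With real data, `top_n` would stay at its default of 100 to match the top 100 keywords per month described in the processing steps.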
Fig. 2. A tweet from the official Twitter account of Prime Minister Trudeau [21].
Once the keywords were identified, we calculated the monthly frequency of each word, and kept the top 100 for each month. In the last step, we created the 100 × 100 adjacency matrices to represent a graph. For two words i and j, and an adjacency matrix M, the position M(i, j) contains a numeric value indicating the number of times that both words i and j appeared in the same tweet by the user in the month. This represents the weight of the edge between i and j in the weighted graph described by M. For the tweet in Fig. 2, we would add 1 to the weight of the edge between every pair of keywords in this tweet, for instance between the vertices for “covid-19” and “safe.” Recall that we only consider pairs where both words appear in the top 100 keywords by frequency in the month. With steps (1) through (4) completed, data was visualized with Gephi using the ForceAtlas2 (FA2) layout algorithm and the Louvain community detection algorithm; see [1] for more on Gephi. March 2020 marked the beginning of preventative measures taken in North America, such as Ontario closing public schools on March 12th, the day after the WHO declared a global pandemic [18]. The Twitter keyword networks of President Trump and Prime Minister Trudeau are presented in Fig. 1 and Fig. 3, respectively. The Trump Twitter keyword network had five communities in March 2020, while that of Trudeau had four; these are the colored sets of nodes in the figures. Similar patterns were detected in the later months of 2020, as will be described in the next section.
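The construction of the monthly adjacency matrix described above can be sketched as follows. The function name and sample data are hypothetical, and one detail is an assumption: a keyword repeated within a single tweet is counted once for that tweet, which the text does not specify.

```python
from itertools import combinations


def cooccurrence_matrix(tokenized_tweets, keywords):
    """Build a symmetric keyword co-occurrence matrix for one month.

    tokenized_tweets: list of token lists, one per tweet.
    keywords: ordered list of the month's top keywords (100 in the paper).
    M[i][j] is the number of tweets containing both keyword i and keyword j.
    """
    index = {w: i for i, w in enumerate(keywords)}
    n = len(keywords)
    M = [[0] * n for _ in range(n)]
    for tokens in tokenized_tweets:
        # Each keyword counts once per tweet; words outside the list are ignored.
        present = sorted({index[t] for t in tokens if t in index})
        for i, j in combinations(present, 2):
            M[i][j] += 1
            M[j][i] += 1
    return M


# Hypothetical example with three keywords instead of 100.
keywords = ["covid-19", "safe", "economy"]
month_tweets = [
    ["stay", "safe", "covid-19"],
    ["covid-19", "safe", "safe"],
    ["economy", "jobs"],
]
M = cooccurrence_matrix(month_tweets, keywords)
```

Here "covid-19" and "safe" share two tweets, so the corresponding edge has weight 2, while "economy" stays isolated because "jobs" is not among the keywords.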
3.1 Political Figures
In an effort to validate the SCH, we formed Twitter keyword networks from 703 political figures from Canada and the U.S. From Canada, there were 190 accounts mainly composed of Members of Parliament, with a few other figures from initial testing, such as Ontario Premier Doug Ford. The data from Canadian politicians included 94 Liberals, 71 Conservatives, 16 NDP members, two Green Party members, one from the Saskatchewan Party, and six Independents. From the United States, there were 513 accounts including state governors and members of Congress. The data from American politicians included 242 Republicans, 268 Democrats, and three Independents. The accounts of President Trump and Prime Minister Trudeau were also included. The total number of tweets scraped in our analysis was 562,425. We refer the reader to https://cutt.ly/VWtT4bA where we list approximately 12,000 community numbers for the accounts we scraped.
Fig. 3. Keyword network created from tweets by Prime Minister Trudeau in March of 2020.
Bilingual accounts (especially among Canadian politicians) posed a challenge since our analysis is based on English tweets only; if the non-English tweets occupied only a minority of the feed, then the rest of the tweets by the author could be kept in the dataset. We also removed retweets, though in some cases this is not ideal since some accounts use this feature for a large share of their communications. For this dataset, we employed a Python package for community detection with NetworkX [8] to automate the process described in the previous section. One challenge was the aspect of randomness associated with the Louvain algorithm,
where we could run the algorithm twice on identical networks and derive different results. To combat this, for each network, we ran the detection algorithm one hundred times and took the most frequently derived result. We analyzed tweets grouped by quarter and by month. See Fig. 4 for the monthly and quarterly distribution of communities found in the keyword networks of Canadian and American politicians. Comparing these two sets of graphs, we observed that, in moving from monthly to quarterly networks, the overall distribution shifted left, with the most frequent number of communities changing from five to four, thus supporting the SCH. This also had the effect of reducing the size of the right tail, with fewer outlying networks containing large community numbers.
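The repeated-run strategy can be sketched as below. Two assumptions are made: "the most frequently derived result" is read here as the most frequent number of communities, and the detector is passed in as a function so the sketch stays dependency-free. `noisy_detector` is a hypothetical stand-in that merely mimics run-to-run variation; it is not the Louvain algorithm.

```python
import random
from collections import Counter


def most_frequent_community_count(detect, graph, runs=100, seed=0):
    """Run a randomized detector `runs` times and return the modal number
    of communities.

    detect(graph, rng) must return a dict mapping node -> community label.
    """
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(runs):
        partition = detect(graph, rng)
        tally[len(set(partition.values()))] += 1
    return tally.most_common(1)[0][0]


def noisy_detector(graph, rng):
    """Hypothetical stand-in for Louvain: usually 4 communities, sometimes 5."""
    k = 4 if rng.random() < 0.8 else 5
    return {node: i % k for i, node in enumerate(graph)}


toy_graph = list(range(20))   # only the node list matters for the stand-in
modal_count = most_frequent_community_count(noisy_detector, toy_graph)
```

In practice, `detect` could wrap the `best_partition` function of the python-louvain package cited as [8], which also returns a node-to-community dictionary; that wiring is left out here and has not been tested as part of this sketch.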
Fig. 4. Frequency of number of communities found using the Louvain algorithm on Twitter keyword networks from 703 politicians in 2020, from January to December. Results are shown for monthly and quarterly keyword networks.
(a) U.S. governors.
(b) Canadian premiers.
Fig. 5. Heatmap revealing the number of communities based on U.S. gubernatorial tweets and those of Canadian premiers. The monthly community counts were averaged over 2020.
Overall these findings are in line with our observations from President Trump and Prime Minister Trudeau in Sect. 3, with the relatively low number of communities, centering around four to six. Though the number of communities may appear to decrease with a higher volume of tweets (see quarterly versus monthly numbers), we did not find any significant correlation between these variables. We observed that if there were a relatively low number of tweets in a month, the network was more likely to contain a larger number of communities. A heat map recording the number of communities using gubernatorial tweets and those of premiers may be found in Fig. 5. In the case of sparse data, the resulting network was often disconnected, which led to unpredictable results. In particular, the Louvain algorithm occasionally assigned small connected components, such as those associated with individual tweets, to their own communities. For example, U.S. Representative Danny K. Davis of Illinois had only 35 tweets written in October 2020, which led to 16 communities detected by the Louvain algorithm; see Fig. 6.
Fig. 6. Keyword networks for October 2020 of U.S. House Representative Danny K. Davis of Illinois’ 7th congressional district. The data was made up of 35 tweets, and 16 communities were detected.
3.2 Pseudo-Tweets
To further test the validity of the SCH, we generated other related datasets as control groups. These other datasets consist of what we refer to as pseudo-tweets that did not come from an account on Twitter. The goal of the analysis of this data was to detect what phenomena influence the small community hypothesis. In particular, the SCH may be influenced by the vocabulary of a specific user, or how language is implemented. The first dataset, and perhaps most obvious, is
one composed of randomly generated tweets. These are simply a set of random English words adhering to the restriction on length of tweets (that is, 280 characters). The expected result from the analysis of this dataset was a large number of communities, or no pattern at all, due to the pseudo-tweets not using language as a human would with regard to word choice and sentence structure. The method for generating these messages does not weight word choice by any measure of popularity, nor does it adhere to grammar rules. We generated six “months” worth of data with 100 tweets per month and ran the community detection on the results. See Fig. 7 for the resulting number of communities from each month. The algorithm revealed 57 to 67 communities, which is certainly not small when compared to the number of communities found in politicians.
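A generator of this kind of random pseudo-tweet might look like the sketch below. The word list, the choice to fill each message up to the character limit, and the function names are assumptions; the paper does not state its word source or how the message length was chosen within the 280-character limit.

```python
import random

# Hypothetical vocabulary; any list of English words could be substituted.
VOCABULARY = ["river", "paper", "orange", "quickly", "window", "seven",
              "garden", "silent", "market", "yellow", "bridge", "thunder"]


def random_pseudo_tweet(vocabulary, rng, max_chars=280):
    """Join uniformly chosen words until the next word would exceed max_chars."""
    words = []
    length = 0
    while True:
        word = rng.choice(vocabulary)             # no popularity weighting
        extra = len(word) + (1 if words else 0)   # +1 for the separating space
        if length + extra > max_chars:
            break
        words.append(word)
        length += extra
    return " ".join(words)


def generate_months(vocabulary, months=6, tweets_per_month=100, seed=0):
    """Return a list of `months` lists, each with `tweets_per_month` messages."""
    rng = random.Random(seed)
    return [[random_pseudo_tweet(vocabulary, rng) for _ in range(tweets_per_month)]
            for _ in range(months)]


pseudo = generate_months(VOCABULARY)
```

Each simulated month can then be passed through the same tokenization, keyword selection, and community detection steps as real tweets.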
Fig. 7. Number of communities found each “month” using pseudo-tweets composed of random English words.
Fig. 8. A sample pseudo-tweet generated using the public GPT-2 model, fine tuned on real tweets from @realDonaldTrump.
Another dataset we considered was one containing computer-generated tweets. In recent years, there has been a rapid improvement in the quality of linguistic algorithms mimicking human speech. A well-known example of this is the use of deep learning to create nearly indistinguishable fake videos of celebrities and politicians, also known as deepfakes [17]. A related application of deep learning is an advanced text-based language engine developed by OpenAI. GPT-2 is the second generation of an unsupervised learning algorithm which has been pre-trained on 40 GB of internet data and will predict the next word given a piece of text [14]. Due to the authors’ concerns about malicious application of
the full algorithm, only a smaller version is available publicly for research and testing. Using a Python implementation of GPT-2 in [24], we generated a large number of pseudo-tweets. These were limited by character length, though a bit more roughly since we wanted to allow the AI to finish its sentences. We used the model to generate tweets mimicking those of President Trump by feeding in all of the @realDonaldTrump tweets we had gathered in the initial analysis, using data from January to June of 2020. See Fig. 8 for a sample pseudo-tweet generated by GPT-2. We observed that the messages were generally quite coherent and, in our view, their content was in line with President Trump’s preferred vocabulary. We ran the model 50 times, each one creating six months of 100 tweets each. That is, there were 300 networks and 30,000 total pseudo-tweets. We then applied our community analysis algorithms; see Fig. 9 for the resulting frequency of communities. Compared to the random English words from the last analysis, this distribution is more in line with what we expect from a human-authored Twitter account.
Fig. 9. Frequency of communities found in GPT-2 generated keyword networks.
4 Discussion and Future Work
Twitter is one of the most popular social networks owing in part to its accessibility and short format. The restriction on message length creates an information-dense medium that lends itself well to a networked keyword analysis. Using community detection and other network science tools, we analyzed this rich data source as a set of complex networks of keywords. We proposed the small community hypothesis, where Twitter keyword networks cluster into a low number of communities. We tested the hypothesis on data scraped from Twitter, comprising tweets of politicians in the U.S. and Canada. The data set contained 562,425 tweets from 703 accounts for all of 2020. From the results of the Louvain community algorithm, we found that over 75% of months fell between four and six communities across all accounts.
Our results suggest that the SCH is an observable phenomenon within Twitter keyword networks. We also tested the hypothesis on two other datasets, one of random English words and another of pseudo-tweets generated by the GPT-2 deep learning model. The former gave large community numbers, and the latter dataset gave results closer to the original data. One direction for future work would be to probe the origins of the SCH and whether it is a random occurrence within Twitter keyword networks or, for example, a consequence of how humans use language. The SCH may be foundational in how users approach social media, focusing their messaging on a small number of topics. The fact that the randomized pseudo-tweet datasets had a much larger number of communities compared to those generated by GPT-2 (which more closely follows actual tweets) indicates that the SCH could be an artefact of language. Another direction would be to expand our analysis to Twitter accounts of non-political figures. These may include journalists, actors, or accounts of public figures that tweet with enough volume. We had hoped to include more data from the accounts of prominent public figures outside of the political sphere, but we encountered problems with the API and with the code for data processing, mostly due to sporadic user inactivity. We did examine 24 Twitter accounts of public figures who are not politicians but who tweet often, and found analogous results with on average seven communities monthly across 2020.
Acknowledgments. The research for this paper was supported by grants from NSERC and Ryerson University.
References
1. Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the International AAAI Conference on Weblogs and Social Media (2009)
2. Beveridge, A., Shan, J.: Network of thrones. Math Horiz. Mag. 23, 18–22 (2016)
3. Blondel, V., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 10, P10008 (2008)
4. Bonato, A.: A Course on the Web Graph. American Mathematical Society Graduate Studies Series in Mathematics. Providence, Rhode Island (2008)
5. Bonato, A., D’Angelo, D.R., Elenberg, E.R., Gleich, D.F., Hou, Y.: Mining and modeling character networks. In: Proceedings of the 13th Workshop on Algorithms and Models for the Web Graph (2016)
6. Bonato, A., Roach, L.: The math behind Trump’s tweets. The Conversation (2018)
7. Chiarcos, C., Ritz, J., Stede, M.: By all these lovely tokens...Merging conflicting tokenizations. Lang. Resour. Eval. 46, 53–74 (2012)
8. Community Detection for NetworkX. https://python-louvain.readthedocs.io/en/latest/. Accessed 1 Sep 2021
9. Cong, J., Liu, H.: Approaching human language with complex networks. Phys. Life Rev. 11, 598–618 (2014)
10. Flake, G.W., Lawrence, S., Giles, C.L.: Efficient identification of Web communities. In: Proceedings of the Sixth ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (2000)
11. Law, J., Bauin, S., Courtial, J., Whittaker, J.: Policy and the mapping of scientific change: a co-word analysis of research into environmental acidification. Scientometrics 14, 251–264 (1988)
12. Millington, T., Luz, S.: Analysis and classification of word co-occurrence networks from Alzheimer’s patients and controls. Front. Comput. Sci. 3, 649508 (2021)
13. Natural Language Toolkit. https://www.nltk.org/. Accessed 1 Sep 2021
14. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI (14 February 2019). https://openai.com/blog/better-language-models/. Accessed 1 Sep 2021
15. Radhakrishnan, S., Erbis, S., Isaacs, J.A., Kamarthi, S.: Novel keyword co-occurrence network-based methods to foster systematic reviews of scientific literature. PLOS ONE 12(3), e0172778 (2017)
16. Ribeiro, M.A., Vosgerau, R.A., Andruchiw, M.L.P., Ely de Souza Pinto, S.: The complex social network from the Lord of the Rings. Rev. Bras. Ensino Fís. 38, 1304 (2016)
17. Sample, I.: What are deepfakes – and how can you spot them? The Guardian (13 January 2020). https://www.theguardian.com/technology/2020/jan/13/what-aredeepfakes-and-how-can-you-spot-them. Accessed 1 Sep
18. Timeline: How Canada has changed since coronavirus was declared a pandemic. Global News (11 April 2020). https://globalnews.ca/news/6800118/pandemic-onemonth-timeline/. Accessed 1 Sep 2021
19. Trump Twitter Archive V2. https://www.thetrumparchive.com/. Accessed 1 Sep
20. Tweepy Documentation. https://docs.tweepy.org/en/latest/api.html. Accessed 1 Sep
21. Twitter account of Justin Trudeau “I spoke on the phone with @POTUS Trump again today...”. https://twitter.com/JustinTrudeau/status/1238926089637986306. Accessed 1 Sep 2021
22. Wang, R., Liu, W., Gao, S.: Hashtags and information virality in networked social movement. Online Inf. Rev. 40, 850–866 (2016)
23. Weng, L., Menczer, F.: Topicality and impact in social media: diverse messages, focused messengers. PLOS ONE 10, e0118410 (2015)
24. Woolf, M.: How to make custom AI-generated text with GPT-2 (4 September 2019). https://minimaxir.com/2019/09/howto-gpt2/. Accessed 1 Sep 2020
25. You, T., Yoon, J., Kwon, O.-H., Jung, W.-S.: Tracing the evolution of physics with a keyword co-occurrence network. J. Korean Phys. Soc. 78(3), 236–243 (2021). https://doi.org/10.1007/s40042-020-00051-5
Finding Cross-Border Collaborative Centres in Biopharma Patent Networks: A Clustering Comparison Approach Based on Adjusted Mutual Information
Zhen Zhu1(B) and Yuan Gao2
1 University of Kent, Canterbury CT2 7NZ, UK
2 University of East Anglia, Norwich NR4 7TJ, UK
Abstract. The recent speedy development of COVID-19 mRNA vaccines has underlined the importance of cross-border patent collaboration. This paper uses the latest edition of the REGPAT database from the OECD and constructs the co-applicant patent networks for the fields of biotechnology and pharmaceuticals. We identify the cross-border collaborative regional centres in these patent networks at NUTS3 level using a clustering comparison approach based on adjusted mutual information (AMI). In particular, we measure and compare the AMI scores of the clustering before and after arbitrarily removing cross-border links of a focal node against the default clustering defined by national borders. The region with the largest difference in AMI scores is identified as the most cross-border collaborative centre, hence the name of our measure, AMI gain. We find that our measure both correlates with and has advantages over the traditional measure of betweenness centrality and a simple measure of foreign share.
Keywords: Patent networks · Clustering comparison · Adjusted mutual information · Cross-border
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 62–72, 2022. https://doi.org/10.1007/978-3-030-93409-5_6
Globalisation and the knowledge-based economy have stimulated the process of knowledge diffusion in the form of research and development (R&D) collaboration. Knowledge spillovers have been found to be geographically localised [1] and easier within firms than between [2]. R&D collaboration between organisations in different countries (across national borders, or simply cross-border hereafter) could expose the participating parties to more heterogeneous resources, knowledge and skill sets. The data from the European Regional Innovation Survey from 1995 to 1997 has already shown that manufacturing firms with an external innovation network are more successful [3]. Conducting research on cross-border
knowledge diffusion is especially meaningful as R&D cooperation and dissemination of innovation have been identified as key indicators in the National Innovation System (NIS) studies [4,5]. More recently, the development of COVID-19 mRNA vaccines on an unprecedented timescale has showcased the importance of cross-border patent collaborations [6]. In this paper, we focus on identifying regional centres in the cross-border collaborative networks as such centrality is associated with higher level of innovation intensity and quality. Our proposed identification method is based on the adjusted mutual information (AMI) gain by comparing each pair of elective partitions. In quantitative innovation studies, patent information has been a widely used data source [7–12]. In the literature of R&D collaboration, researchers have been building linkages based on patent co-invention and co-application. In particular, the location information of patent inventors and applicants allows for accurate studies on cross-regional, co-inventionship and talent mobility. For example, Chessa et al. constructed five networks using the OECD REGPAT database [13] to explore the R&D integration in the European Union. These include the patent co-inventor and publication co-author networks, the patent co-applicant network, the patent citation network and the patent inventor mobility network. Singh’s analysis of patents filed to the U.S. Patent and Trademark Office (USPTO) uses patent citation data to measure the knowledge flow and builds interpersonal networks between inventors. In line with the previous literature like Kogut and Zander [2], this analysis shows intra-regional and intra-firm knowledge flows are stronger than those across regional or firm boundaries [14]. 
On the temporal dimension, a study based on patents originating from OECD countries and filed through the European Patent Office (EPO) found that the negative impact of geographical distance and institutional borders on R&D collaboration decreased from the end of the 1980s until the mid-1990s before it started to grow [15]. Further analysis looks into how the quality of inter-regional knowledge networks (also based on the REGPAT patent database) affects regional research productivity [16]. REGPAT is also used in combination with the Eurostat database, with a focus on European regions lagging behind in innovation, to suggest that wider inter-regional co-patenting networks and closer collaboration with knowledge-intensive regions could help the less innovative regions to close the gap [17]. As the aforementioned literature shows, a growing number of studies have come to recognise the importance of knowledge spillovers. The earlier works look into various knowledge transmission channels (e.g., citation, collaboration, inventor mobility, etc.), and the more recent studies began to leverage the power of network methods. Still, relatively few studies have proposed a method to measure regional R&D network centrality. So far the most common approaches derive from conventional social network analysis (SNA), such as degree centrality or betweenness centrality [18,19]. Berge et al. argued that such studies could miss the conceptual problems at the aggregated regional level and lose the information regarding the structure of network relations [20].
They propose a new method based on the concept of inter-regional bridging paths, defined as the indirect connections between two regions via a third region as the bridge. Our analysis constructs networks based on the co-applicant linkages, as they represent the collaboration between institutions. To identify network centres, we take a different approach from the existing literature. Clustering comparison measures have traditionally been used for external validation as well as for clustering solutions search [21]. In this paper, we propose another application of clustering comparison as a way of identifying central nodes in networks. In particular, we measure and compare the similarity scores of the clustering before and after arbitrarily removing cross-border links of a focal node against the default clustering defined by national borders. The widely used adjusted mutual information (AMI) is chosen here as the clustering comparison measure, hence the name of our measure, AMI gain. Using the examples of co-applicant patent networks in the fields of biotechnology and pharmaceuticals, we find that our measure, AMI gain, both correlates with and has advantages over the traditional measure of betweenness centrality and a simple measure of foreign share. The rest of the paper is organised as follows: Sect. 2 introduces the database and our measure. Section 3 presents the results and statistically compares our measure with betweenness centrality and a simple measure of foreign share. Finally, Sect. 4 concludes the paper with further discussions.
2 Data and Methods
2.1 REGPAT Database
In this study, we use the latest edition of the OECD REGPAT database (released in January, 2021) which has been widely used in the relevant prior works. This database enables researchers to link patent data to regions based on the addresses of the patent applicants and inventors at NUTS3 level, covering more than 5,500 regions across OECD countries, EU 28 countries, Brazil, China, India, the Russian Federation and South Africa [13]. The patent data component in this database comes from the EPO Worldwide Statistical Patent Database (PATSTAT Global, Autumn 2020), which covers patent applications filed to the EPO and patent applications filed under the Patent Co-operation Treaty (PCT) at international phase, both from 1977 (priority date). We focus the analysis on 30 countries in Europe, i.e., the pre-Brexit EU28 countries except for Cyprus, plus Iceland, Norway and Switzerland. As a result, we have 1389 NUTS3 level regions, i.e., the nodes in the networks. A link is cross-border when it connects two regions belonging to two different countries. We construct two co-applicant patent networks for the two fields of biotechnology and pharmaceuticals according to the IPC concordance
table published by the WIPO [22], where the nodes are the NUTS3 regions in these 30 countries and the links are weighted by the accumulated number of co-applicant collaboration instances between regions over time (i.e., from 1977 onward). Note that a patent may have one applicant (contributing no links), two applicants (contributing one link), or more (contributing more than one link). Also note that self-loops are considered and weighted. We further restrict our attention to the largest components of the two networks, with 765 nodes for biotechnology and 608 nodes for pharmaceuticals respectively.
2.2 Methods
We denote a network as $G = (V, E)$, where $V$ is the set of nodes (or vertices) and $E$ is the set of links (or edges). To describe our measure, we further denote $v_i \in V$ as node $i$ in the network and $e_{v_i,v_j} \in E$ as the edge between node $i$ and node $j$. The weight of $e_{v_i,v_j}$ is denoted as $w_{v_i,v_j}$, and $w_{v_i,v_j} = w_{v_j,v_i}$ for an undirected network. The set of node $i$'s neighbouring (directly connected) nodes is denoted as $N(v_i)$. The largest component of the network is denoted as $C_1$. A partition $i$ of the network is denoted as $P_i$. Finally, the partition obtained after removing the cross-border links of node $i$ is denoted as $P_{-v_i}$. Regarding clustering comparison, we use adjusted mutual information (AMI), which calculates the similarity score between two partitions (or clusterings), say $P_i$ and $P_j$, as follows:
$$AMI(P_i, P_j) = \frac{MI(P_i, P_j) - E\{MI(P_i, P_j)\}}{\max(H(P_i), H(P_j)) - E\{MI(P_i, P_j)\}}$$
where $E\{\cdot\}$ calculates the expected value, $H(\cdot)$ calculates the entropy and $MI(\cdot)$ calculates the (unadjusted) mutual information [21]. The AMI equals 1 when the two partitions are identical and is close to 0 when they agree no more than expected by chance, so larger values imply more similar partitions. Algorithm 1 shows the pseudocode of calculating the AMI gain for each node. Note that for each node we conduct a counterfactual exercise by arbitrarily removing its cross-border links. The rationale behind our measure is that such a counterfactual exercise will produce a partition more similar to the default partition defined by national borders, which we denote as $P_d$. Therefore, the difference between the AMI scores when compared with the default partition will more often than not be positive after the removal, and we call this difference the AMI gain. As a result, the cross-border collaborative centres are identified with the largest AMI gains. Note that we use the Louvain method [23] at the default resolution level 1.0 for community detection, which also takes into account link weights.
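The AMI formula can be illustrated with the short, dependency-free sketch below. It assumes that the expected mutual information is taken under the usual permutation (hypergeometric) model for two partitions with fixed cluster sizes; the paper defers the details to [21], so this choice and all function names are assumptions. The score is undefined when both partitions consist of a single cluster, a case the sketch does not handle.

```python
from collections import Counter
from math import comb, log


def entropy(labels):
    """Shannon entropy H of a partition given as a list of cluster labels."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Unadjusted mutual information MI between two labelings of the same items."""
    n = len(a)
    ca, cb, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(nij / n * log(n * nij / (ca[x] * cb[y]))
               for (x, y), nij in joint.items())


def expected_mutual_information(a, b):
    """E{MI} under the permutation model (assumed here, see lead-in)."""
    n = len(a)
    emi = 0.0
    for ai in Counter(a).values():
        for bj in Counter(b).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Hypergeometric probability of an overlap of size nij.
                prob = comb(ai, nij) * comb(n - ai, bj - nij) / comb(n, bj)
                emi += prob * nij / n * log(n * nij / (ai * bj))
    return emi


def ami(a, b):
    """Adjusted mutual information with max-entropy normalisation."""
    mi = mutual_information(a, b)
    emi = expected_mutual_information(a, b)
    return (mi - emi) / (max(entropy(a), entropy(b)) - emi)


same = ami([0, 0, 1, 1, 2, 2], [5, 5, 7, 7, 9, 9])   # identical up to relabeling
crossed = ami([0, 0, 1, 1], [0, 1, 0, 1])            # maximally "crossed" clusters
```

Identical partitions score 1 regardless of how the clusters are named. Partitions that agree less than chance, such as the crossed example, can score below 0, which is a known property of the adjustment.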
Algorithm 1. Calculating AMI gain
  P0 ← Louvain(C1)                      ▷ Get P0 by applying Louvain to the largest component C1
  AMI_0 ← AMI(P0, Pd)                   ▷ AMI between P0 and the default partition Pd
  for vi ← v1, vn do                    ▷ Loop through the nodes of C1
    for N(vi)j ← N(vi)1, N(vi)m do      ▷ Loop through the neighbours of vi
      if N(vi)j is cross-border then    ▷ Drop cross-border neighbours of vi
        remove e_{vi, N(vi)j}
      end if
    end for
    P−vi ← Louvain(C−vi)
    AMI_−vi ← AMI(P−vi, Pd)
    ΔAMI_vi = AMI_−vi − AMI_0           ▷ AMI gain for node vi
  end for
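The counterfactual loop of Algorithm 1 can be sketched in plain Python as follows. To keep the example self-contained, community detection and clustering comparison are passed in as functions, and the two stand-ins used in the toy run are deliberately simple substitutes: connected components instead of Louvain, and the share of agreeing node pairs (the Rand index) instead of AMI. These stand-ins, the function names, and the four-region toy network are assumptions for illustration and do not reproduce the paper's results.

```python
from itertools import combinations


def connected_components(adj):
    """Stand-in detector: label each node by its connected component."""
    label = {}
    for start in adj:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in label:
                    label[w] = start
                    stack.append(w)
    return label


def rand_index(p, q):
    """Stand-in similarity: share of node pairs on which p and q agree."""
    pairs = list(combinations(p, 2))
    agree = sum((p[u] == p[w]) == (q[u] == q[w]) for u, w in pairs)
    return agree / len(pairs)


def similarity_gain(adj, country, detect, similarity):
    """For each node, drop its cross-border links, re-detect communities, and
    report the change in similarity to the national-border partition.

    adj: symmetric dict of dicts, adj[u][v] = weight of the link u-v.
    country: dict node -> country code (the default partition).
    """
    base = similarity(detect(adj), country)
    gains = {}
    for v in adj:
        pruned = {u: dict(nbrs) for u, nbrs in adj.items()}   # fresh copy
        for u in list(pruned[v]):
            if country[u] != country[v]:                      # cross-border link
                del pruned[v][u]
                del pruned[u][v]
        gains[v] = similarity(detect(pruned), country) - base
    return gains


# Toy network (assumption): two countries joined by the single link A2-B1.
adj = {"A1": {"A2": 1}, "A2": {"A1": 1, "B1": 1},
       "B1": {"A2": 1, "B2": 1}, "B2": {"B1": 1}}
country = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}
gains = similarity_gain(adj, country, connected_components, rand_index)
```

In the toy run only the two bridging regions obtain a positive gain, since removing their cross-border link restores the national partition. Substituting a Louvain implementation for `detect` and an AMI function for `similarity` would give the AMI gain as defined in the text.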
For comparison, we also consider the traditional measure of betweenness centrality (which also takes into account link weights) and a simple measure of foreign share. Algorithm 2 shows the pseudocode of calculating the foreign share for each node.

Algorithm 2. Calculating foreign share
  for v_i ← v_1, v_n do                                       ▷ Loop through the nodes of C_1
    sum_{v_i} ← 0
    sumf_{v_i} ← 0
    for N(v_i)_j ← N(v_i)_1, N(v_i)_m do                      ▷ Loop through the neighbours of v_i
      if N(v_i)_j is cross-border then
        sumf_{v_i} ← sumf_{v_i} + w_{v_i, N(v_i)_j}           ▷ Add up foreign neighbour edge weights
        sum_{v_i} ← sum_{v_i} + w_{v_i, N(v_i)_j}             ▷ Add up total neighbour edge weights
      else
        sum_{v_i} ← sum_{v_i} + w_{v_i, N(v_i)_j}             ▷ Add up total neighbour edge weights
      end if
    end for
    FS_{v_i} = sumf_{v_i} / sum_{v_i}                         ▷ Compute foreign share for node v_i
  end for

3 Results
As a concrete example, Fig. 1 and Fig. 2 show the community detection results before and after we remove the cross-border links of the region BE234 (Arr. Gent) from the biotechnology patent network. Each color represents a different detected community. Note that the communities are still mostly characterised by national borders, even though cross-border links sometimes separate certain regions from their default national communities (e.g., different colors within France and Germany). Also note that a significant difference between the two figures is that most regions in the Netherlands share a community with the UK before the counterfactual removal (Fig. 1) but are separated from the UK after the removal (Fig. 2), which helps BE234 (Arr. Gent) attain a high AMI gain score.

Table 1 shows the top 10 regions identified by our measure, AMI gain, as well as by betweenness centrality in the field of biotechnology. Although not shown in the table, foreign share identifies 32 regions, all with 100% cross-border connections, in the field of biotechnology, including, for example, Kelheim in Germany and Malta. Similarly, Table 2 shows the top 10 regions identified by our measure, AMI gain, versus by betweenness centrality, in the field of pharmaceuticals. Again not shown in the table, foreign share identifies 37 tied regions in the field of pharmaceuticals, including, for example, Plymouth in the UK and Malta.

There is some overlap between the results by AMI gain and by betweenness centrality: for example, Vienna and Stockholm in Table 1, and Stockholm, Milan and Paris in Table 2. On the other hand, the local measure of foreign share cannot differentiate the regions at the top very well, as many regions are tied. Moreover, foreign share does not take into account structural and global properties of the network. A small region, such as Malta, whose links are all cross-border would top the table by foreign share.
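The foreign share of Algorithm 2 is a purely local ratio, which a short sketch makes concrete. The region names, country codes and weights below are hypothetical and only serve to show why a small region with nothing but cross-border links reaches the maximum value.

```python
def foreign_share(weights, country):
    """Share of each node's total link weight that goes to foreign neighbours.

    weights: dict node -> dict neighbour -> link weight (undirected, symmetric)
    country: dict node -> country code
    """
    shares = {}
    for v, nbrs in weights.items():
        total = sum(nbrs.values())
        foreign = sum(w for u, w in nbrs.items() if country[u] != country[v])
        shares[v] = foreign / total if total else 0.0
    return shares


# Hypothetical toy data: a large hub with mixed links and a small region
# whose single link is cross-border.
edges = [("hub", "home_a", 5.0), ("hub", "abroad_a", 3.0),
         ("hub", "abroad_b", 2.0), ("small", "abroad_a", 1.0)]
country = {"hub": "X", "home_a": "X", "abroad_a": "Y", "abroad_b": "Z", "small": "W"}

weights = {}
for u, v, w in edges:
    weights.setdefault(u, {})[v] = w
    weights.setdefault(v, {})[u] = w

shares = foreign_share(weights, country)
```

Here the small region scores 1.0 while the hub scores 0.5, even though the hub carries far more cross-border weight in absolute terms.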
Fig. 1. Biotechnology patent network community detection result
Fig. 2. Biotechnology patent network community detection result (cross-border links removed for BE234)

Table 1. Top 10 regions in biotechnology

| # | AMI gain: NUTS3 code | AMI gain: Region name | Betweenness centrality: NUTS3 code | Betweenness centrality: Region name |
|---|---|---|---|---|
| 1 | BE234 | Arr. Gent | FR101 | Paris |
| 2 | DE126 | Mannheim Stadtkreis | CH031 | Basel-Stadt |
| 3 | UKD31 | Greater Manchester South | DE212 | München Kreisfreie Stadt |
| 4 | DEE02 | Halle (Saale) Kreisfreie Stadt | UKI11 | Inner London - West |
| 5 | AT130 | Vienna | ITI43 | Rome |
| 6 | DK011 | City of Copenhagen | ES300 | Madrid |
| 7 | SE110 | Stockholm County | SE110 | Stockholm County |
| 8 | UKF22 | Leicestershire CC and Rutland | UKJ14 | Oxfordshire |
| 9 | DE125 | Heidelberg Stadtkreis | DE300 | Berlin |
| 10 | AT323 | Salzburg und Umgebung | AT130 | Vienna |
Table 2. Top 10 regions in pharmaceuticals

| # | AMI gain: NUTS3 code | AMI gain: Region name | Betweenness centrality: NUTS3 code | Betweenness centrality: Region name |
|---|---|---|---|---|
| 1 | UKJ33 | Hampshire CC | FR101 | Paris |
| 2 | SE110 | Stockholm County | UKH12 | Cambridgeshire CC |
| 3 | DK011 | City of Copenhagen | CH031 | Basel-Stadt |
| 4 | DEA22 | Bonn Kreisfreie Stadt | SE110 | Stockholm County |
| 5 | DEA2B | Rheinisch-Bergischer Kreis | CH011 | Vaud |
| 6 | ITC4C | Milan | ES300 | Madrid |
| 7 | FR101 | Paris | DE300 | Berlin |
| 8 | DE926 | Holzminden | ITC4C | Milan |
| 9 | ITC33 | Genoa | AT130 | Vienna |
| 10 | ITG2C | Carbonia-Iglesias | DE125 | Heidelberg Stadtkreis |
More systematically, Fig. 3 shows the scatter plots between our measure, AMI gain, and betweenness centrality or foreign share for biotechnology and pharmaceuticals respectively. The Pearson correlation coefficient (denoted by r) as well as the Spearman correlation coefficient (denoted by ρ) between AMI gain and either of the two alternative measures are positive. Note that the Spearman correlations are stronger than the Pearson ones, as the former only considers the ranking of the values. Therefore, our measure, AMI gain, captures some of the same information as betweenness centrality or foreign share but also differs from either of them in a nontrivial way.

Fig. 3. Correlations between the three measures

Fig. 4. Empirical cumulative distribution functions of the three measures

Furthermore, Fig. 4 shows the empirical cumulative distribution functions (ECDFs) of our measure, AMI gain, betweenness centrality and foreign share for biotechnology and pharmaceuticals respectively. For both fields, betweenness centrality results are dominated by a few regions (as its ECDF curve is bent towards the top left corner). AMI gain and foreign share have, relatively speaking, more uniform distributions of values (i.e., closer to the 45° line). As a result, AMI gain helps identify and differentiate the central cross-border collaborative regions on the top (better than foreign share) and for a large percentile range (better than betweenness centrality).
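The statistics used in this comparison can be sketched with the standard library alone. The snippet below computes Pearson and Spearman coefficients and an ECDF; the data are made up purely to show how a monotone but nonlinear relation yields a perfect rank correlation and a weaker linear one, and none of the numbers come from the patent networks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r (assumes neither input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ecdf(values):
    """Points (value, cumulative share) of the empirical CDF, one per observation."""
    xs = sorted(values)
    return [(x, (k + 1) / len(xs)) for k, x in enumerate(xs)]


# Made-up scores: y grows monotonically with x but is dominated by one large value.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]
r, rho = pearson(x, y), spearman(x, y)
curve = ecdf(y)
```

A distribution dominated by a few large values, like `y` here, produces an ECDF that rises quickly and then flattens, which is the pattern described for betweenness centrality.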
4 Conclusions
R&D collaborations beyond national borders are critical for knowledge spillovers at a large scale, which is well demonstrated by the recent development of COVID-19 mRNA vaccines at an unprecedented timescale. This paper uses the latest edition of the REGPAT database from the OECD and constructs the co-applicant patent networks in Europe at NUTS3 level for the fields of biotechnology and pharmaceuticals. We contribute to the literature on finding cross-border collaborative centres in patent networks by proposing a clustering comparison approach based on adjusted mutual information. The rationale behind our approach is that a counterfactual exercise of removing cross-border links from a focal node will produce a partition more similar to the default partition defined by national borders. Therefore, the difference between the AMI scores when compared with the default partition will more often than not be positive after the removal. The results based on our measure, AMI gain, are positively correlated with those by betweenness centrality or by a simple measure of foreign share. Nevertheless, when compared with betweenness centrality, AMI gain better differentiates cross-border centres from local ones and offers a more uniform distribution of values. On the other hand, when compared with foreign share, AMI gain is more of a global and structural measure and better differentiates the nodes at the top. Our future research will further explore the robustness of our measure with more variations of the parameters and across contexts.
References
1. Jaffe, A.B., Trajtenberg, M., Henderson, R.: Geographic localization of knowledge spillovers as evidenced by patent citations. Q. J. Econ. 108(3), 577–598 (1993)
2. Kogut, B., Zander, U.: Knowledge of the firm, combinative capabilities, and the replication of technology. Organ. Sci. 3(3), 383–397 (1992)
3. Koschatzky, K., Sternberg, R.: R&D cooperation in innovation systems-some lessons from the European regional innovation survey (ERIS). Eur. Plan. Stud. 8(4), 487–501 (2000)
4. Organisation for Economic Co-operation and Development (OECD). Managing national innovation systems. OECD Publishing (1999)
5. Chang, P.-L., Shih, H.-Y.: The innovation systems of Taiwan and China: a comparative analysis. Technovation 24(7), 529–539 (2004)
6. Gaviria, M., Kilic, B.: A network analysis of COVID-19 mRNA vaccine patents. Nat. Biotechnol. 39(5), 546–548 (2021)
7. Griliches, Z., Pakes, A., Hall, B.H.: The value of patents as indicators of inventive activity (1986)
8. Fleming, L.: Recombinant uncertainty in technological search. Manag. Sci. 47(1), 117–132 (2001)
9. Jaffe, A.B., Trajtenberg, M.: Patents, Citations, and Innovations: A Window on the Knowledge Economy. MIT Press, Cambridge (2002)
10. Hall, B.H., Jaffe, A., Trajtenberg, M.: Market value and patent citations. RAND J. Econ. 36, 16–38 (2005)
11. Gao, Y., Zhu, Z., Riccaboni, M.: Consistency and trends of technological innovations: a network approach to the international patent classification data. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds.) COMPLEX NETWORKS 2017 2017. SCI, vol. 689, pp. 744–756. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-72150-7_60
12. Gao, Y., Zhu, Z., Kali, R., Riccaboni, M.: Community evolution in patent networks: technological change and network dynamics. Appl. Netw. Sci. 3(1), 1–23 (2018). https://doi.org/10.1007/s41109-018-0090-3
13. Maraut, S., Dernis, H., Webb, C., Spiezia, V., Guellec, D.: The OECD REGPAT database: a presentation. OECD Sci. Technol. Ind. Work. Pap. 2008(2), 0 1 (2008)
14. Singh, J.: Collaborative networks as determinants of knowledge diffusion patterns. Manag. Sci. 51(5), 756–770 (2005)
15. Morescalchi, A., Pammolli, F., Penner, O., Petersen, A.M., Riccaboni, M.: The evolution of networks of innovators within and across borders: evidence from patent data. Res. Policy 44(3), 651–668 (2015)
16. Sebestyén, T., Varga, A.: Research productivity and the quality of interregional knowledge networks. Ann. Reg. Sci. 51(1), 155–189 (2013)
17. De Noni, I., Orsi, L., Belussi, F.: The role of collaborative networks in supporting the innovation performances of lagging-behind European regions. Res. Policy 47(1), 1–13 (2018)
18. Wanzenboeck, I., Scherngell, T., Brenner, T.: Embeddedness of regions in European knowledge networks: a comparative analysis of inter-regional R&D collaborations, co-patents and co-publications. Ann. Reg. Sci. 53(2), 337–368 (2014)
19. Wanzenböck, I., Scherngell, T., Lata, R.: Embeddedness of European regions in European union-funded research and development (R&D) networks: a spatial econometric perspective. Reg. Stud. 49(10), 1685–1705 (2015)
20. Bergé, L.R., Wanzenböck, I., Scherngell, T.: Centrality of regions in R&D networks: a new measurement approach using the concept of bridging paths. Reg. Stud. 51(8), 1165–1178 (2017)
21. Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010)
22. WIPO: IPC concordance table (2019). https://www.wipo.int/ipstats. Accessed 06 Aug 2021
23. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech.: Theory Exp. 2008(10), P10008 (2008)
Professional Judgments of Principled Network Expansions

Ian Coffman, Dustin Martin, and Blake Howald

Thomson Reuters Special Services, LLC, 1410 Spring Hill Road, Suite 125, Mclean, VA 22102, USA
{ian.coffman,dustin.martin,blake.howald}@trssllc.com
http://trssllc.com

Abstract. As data increase in both size and complexity, it is essential to determine optimal solutions for processing and interacting with network representations by human analysts. In this paper, we explore different ways of growing networks from a small set of “seed” nodes to create tractable sub-networks for human analysis. In particular, we define a method for generating different network expansions for preference testing and statistical analysis in a paired-comparison task. To illustrate this method, we generate a range of expansions from a database of politically exposed persons and conduct the preference study with professional criminal network analysts. The resulting statistical analysis demonstrates an avoidance of purely random generated networks and preference for naive network expansion methods (depth-limited traversals based on one or two “hops” from the seed nodes) over more statistically rigorous methods (e.g., edge- and node-based sampling). From these results and inferential analysis, we propose practice considerations and define future research.

Keywords: Network expansion · Network sampling · Network analysis · Network visualization · Social network · Human assessment
1 Introduction
Very large networks are increasingly common objects of study in the network science literature, given both their availability (e.g., dblp - 317,080 nodes, 1,049,866 edges [5], LiveJournal - 3,997,962 nodes, 34,681,188 edges [14]) and the development of and access to robust graph database representation and analysis tools like Neo4j [15], Cytoscape [4], and Gephi [2]. However, as networks grow larger, so does the importance of their ease of use, as some networks may be too large to fit into physical memory, too large to efficiently compute certain network properties, or too visually complex to be human-interpretable.

Recognition of issues such as those above has led to numerous research contributions aimed at identifying a means of reducing the size of intractable networks in order to effectively apply a desired analysis or algorithmic procedure. Many such efforts center on sampling as a means of achieving this goal, whereby a subset of the complete network’s nodes and edges are selected according to some well specified procedure (e.g., selecting nodes with uniform probability up to some proportion of the complete network [13,18]). Crucial to this line of investigation is how well such sampling procedures yield networks that preserve important topological properties of the larger network. Estimates of network statistics (e.g. clustering, assortativity) based on a sample can be biased depending on factors such as the type of sampling procedure employed or the proportion of the complete network sampled. Several numerical studies have been published exploring this issue in detail (e.g., [1,12]).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 73–81, 2022. https://doi.org/10.1007/978-3-030-93409-5_7

This study is similarly concerned with the well-formedness of sampled networks, but unlike previous approaches that are largely theoretical in nature, we take an applied perspective by focusing on how well certain network representations facilitate the completion of a criminal network targeting task. While studies such as those cited above rely on strictly numerical comparisons, the task we consider introduces a distinctively cognitive component. Thoroughly addressing this issue through informal interviews or knowledge elicitation sessions is likely difficult because the principles guiding analysts’ decision making may be tacit and difficult to explicate, so what is needed is a means of formally exploring professional judgements to better understand which types of network expansions are best suited to a particular applied task. Here, we establish a method for such investigations and provide a set of initial results by employing an experimental paradigm centered on forced-choice, pairwise comparisons of candidate network expansions.
We recruited a sample of professional criminal network analysts and asked them to judge pairs of sampled networks and indicate which better allows them to carry out a simulated analytical task based on the disruption of a criminal network, where the stimuli consisted of images that were systematically varied based on properties of the network generating procedure. This paper is structured as follows: Sect. 2 presents the elements of our method including expanded network generation (Sect. 2.1), network stimuli creation (Sect. 2.2), experimental design (Sect. 2.3), and analysis (Sect. 2.4). Section 3 provides the results of our method applied to a data set of politically exposed persons (PEPs) with subsequent statistical analysis of the preference judgments. Section 4 discusses the results relative to perceived strengths and limitations of the method, paving the way for additional research. Section 5 concludes.
2 Methods

Our method centers on a simulated analytical task whereby participants are asked to choose which of two side-by-side network images better facilitates its completion. Following a typical real-world task, participants were first presented with an initial seed network and informed that their task would be to view and judge candidate expansions of that seed network, which are larger networks that include the seed network as a subnetwork. Participants were instructed to choose the expansion that better indicates which node’s removal from the seed network would maximally disrupt the flow of communication.

2.1 Network Generation
Both the seed network and its expansions were derived from a graph database instance of Refinitiv World Check [20], a curated database of PEPs containing ∼1.5 million nodes (PEPs) and ∼5 million edges (shared relationships among PEPs). The seed network consisted of an American politician and four of their associates. The unlabeled, bidirectional networks that served as experimental stimuli fall into three general categories with associated generation procedures:

– Sampling-Based Expansions. These expansions were constructed by sampling from the complete World Check network utilizing the following methods:
  • Node sampling: Sample nodes with uniform probability up to some proportion of the complete network. Then, induce any edges present in the complete network between them to form a sub-network.
  • Edge sampling: Sample edges with uniform probability, inducing nodes connected to sampled edges up to some proportion of the complete network.
  • Totally induced edge (TIE) sampling [1]: Sample edges with uniform probability, including connected nodes up to some proportion of the complete network. Then, induce any edges in the complete network not originally sampled.
  For node and edge sampling, the procedure was performed at 25%, 50%, and 75% of the complete World Check network. For TIE sampling, the procedure was carried out at 25% (50% and 75% proportions were too large to appropriately visualize given our drawing parameters described in Subsect. 2.2). From these samples, our expansion procedure determined which, if any, sampled nodes were also present in the seed network; all nodes that were present served as the origination point of a breadth-first search through the sampled network. Any discovered nodes and edges were added to the seed network to complete the “expanded” network.
– Depth-Limited Traversal Expansions. These expansions of either one or two “hops” were created according to the following procedure: for one hop, create a new network by including every seed node’s neighbors and the edges connecting those neighbors, and for two hops, create a new network by including every seed node’s neighbors and those neighbors’ neighbors with connecting edges.
– Random Networks. While not true expansions of the seed network, we included two random networks, which are Erdős-Rényi model-generated networks [6] containing either 30 or 60 nodes with Binomial linkage probability set at 0.1. These networks facilitate a “baseline” measure, with the expectation that participants would strongly avoid network representations with no natural structure.
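The generation procedures listed above can be sketched on a toy adjacency structure. This is one possible reading of the descriptions, using only the standard library; the function names, the stopping rule for edge sampling and the five-node path network are assumptions made for illustration, and the original study worked on a graph database rather than in-memory dictionaries.

```python
import random
from collections import deque


def node_sample(adj, fraction, rng):
    """Sample nodes uniformly, then induce every edge among the sampled nodes."""
    kept = set(rng.sample(sorted(adj), round(fraction * len(adj))))
    return {u: adj[u] & kept for u in kept}


def edge_sample(adj, fraction, rng, induce=False):
    """Sample edges uniformly until the node target is reached.
    With induce=True, also add all remaining edges among the reached nodes (TIE)."""
    edges = sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})
    rng.shuffle(edges)
    target = round(fraction * len(adj))
    sub = {}
    for u, v in edges:
        if len(sub) >= target:
            break
        if len(set(sub) | {u, v}) > target:   # edge would overshoot the target
            continue
        sub.setdefault(u, set()).add(v)
        sub.setdefault(v, set()).add(u)
    if induce:
        kept = set(sub)
        sub = {u: adj[u] & kept for u in kept}
    return sub


def expand_seed(seed_adj, sample_adj):
    """Breadth-first search through the sample from every seed node it contains;
    all discovered nodes and edges are merged into the seed network."""
    expanded = {u: set(nbrs) for u, nbrs in seed_adj.items()}
    queue = deque(u for u in seed_adj if u in sample_adj)
    seen = set(queue)
    while queue:
        u = queue.popleft()
        for v in sample_adj[u]:
            expanded.setdefault(u, set()).add(v)
            expanded.setdefault(v, set()).add(u)
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return expanded


def k_hop(adj, seeds, hops):
    """Depth-limited traversal: nodes within `hops` steps of a seed, with induced edges."""
    reached = frontier = set(seeds)
    for _ in range(hops):
        frontier = {v for u in frontier for v in adj[u]} - reached
        reached = reached | frontier
    return {u: adj[u] & reached for u in reached}


# Toy path network 0-1-2-3-4 (illustrative only).
rng = random.Random(0)
toy = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
nodes_60 = node_sample(toy, 0.6, rng)
edges_80 = edge_sample(toy, 0.8, rng)
tie_80 = edge_sample(toy, 0.8, rng, induce=True)
one_hop = k_hop(toy, [0], 1)
expanded = expand_seed({0: {1}, 1: {0}}, {1: {2}, 2: {1, 3}, 3: {2}})
```

Skipping edges that would overshoot the node target is a design choice of this sketch; the text only states that sampling proceeds "up to some proportion" of the complete network.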
2.2 Stimuli

The visual representations of the networks described above that served as the experimental stimuli were created with the spring layout drawing function of the Python package NetworkX [10], which is an implementation of the Fruchterman-Reingold force-directed algorithm [9]. Seed network nodes were in fixed positions across all stimuli and were colored red in contrast to the blue nodes of the expanded networks. All stimuli are shown in Fig. 1.
Fig. 1. Experimental stimuli. Note: (a) E25 = edge sampling at 25% of the complete network, (b) E50 = edge sampling at 50% of the complete network, (c) E75 = edge sampling at 75% of the complete network, (d) DLT1 = one-hop depth-limited traversal, (e) DLT2 = two-hop depth-limited traversal, (f) T25 = TIE sampling at 25% of the complete network, (g) N25 = node sampling at 25% of the complete network, (h) N50 = node sampling at 50% of the complete network, (i) N75 = node sampling at 75% of the complete network, (j) R30 = random network with 30 nodes, (k) R60 = random network with 60 nodes, (l) is the example network given with the preference task instructions. See Table 2 in the appendix for basic network summary statistics.
2.3 Experiment
Study participants were recruited from a pool of professional criminal network analysts employed by Thomson Reuters Special Services, LLC. The participant analysts had experience working with U.S. federal law enforcement and national security agencies, specifically in areas related to targeting with networks (e.g., criminal networks, supply chain, etc.). The study was determined to be exempt from Institutional Review Board review, but participants were nonetheless provided with a brief consent form outlining the experiment, possible risks, etc. The experimental task consisted of a two-alternative forced-choice design where all possible pairwise combinations of stimuli are presented; for our 11 stimuli, this yielded 11 × 10 / 2 = 55 trials that were presented in random order. At the beginning of the experiment, participants were familiarized with the task by viewing an image of a network that served as a clear example of communication disruption via node removal (node “A” in Fig. 1(l)). Then, for each trial in the data collection phase, two expanded networks appeared side-by-side with the letter “q” under the left image and the letter “p” under the right image; preferences were indicated by pressing either of these letters on the keyboard to select a network. At the top of the screen, participants were posed the question “which network better helps you determine which RED entity to remove to MOST DISRUPT network communication?”. The experiment was implemented in PsychoPy, a Python-based software package originally developed for the administration of psychophysical experiments [17], and hosted online with PsychoPy’s sister service Pavlovia [16]. Considerations due to the COVID-19 pandemic required remote administration of the study; however, the experimenters held virtual meetings with each participant outlining the task and providing a walkthrough of several practice trials.
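The trial list for such a design is straightforward to generate. The sketch below builds all unordered pairs of the 11 stimulus labels and shuffles their order; randomising which stimulus appears on the left is an assumption added here, since the text does not say how screen sides were assigned, and the function name is illustrative.

```python
import itertools
import random

STIMULI = ["E25", "E50", "E75", "DLT1", "DLT2", "T25",
           "N25", "N50", "N75", "R30", "R60"]


def make_trials(stimuli, rng):
    """All pairwise comparisons in random order, as (left, right) tuples."""
    trials = []
    for pair in itertools.combinations(stimuli, 2):
        pair = list(pair)
        rng.shuffle(pair)          # assumed: random left/right placement
        trials.append(tuple(pair))
    rng.shuffle(trials)            # random presentation order
    return trials


trials = make_trials(STIMULI, random.Random(42))
```

With 11 stimuli this yields the 55 trials mentioned in the text, each unordered pair appearing exactly once.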
Finally, in addition to the primary task, participants were administered a brief demographic survey, collecting information on years of experience as network analysts, familiarity with network concepts/statistics, etc.

2.4 Data Analysis
Results are modeled at the stimulus-level via the Bradley-Terry model for paired comparisons, which predicts, for each pair, the probability of one stimulus “beating” another [3,11]:

$$P(\text{stim}_1 \text{ ``beats'' } \text{stim}_2) = \frac{a_{\text{stim}_1}}{a_{\text{stim}_1} + a_{\text{stim}_2}}, \qquad (1)$$
where a can be thought of as the “ability” of a certain stimulus to better facilitate the analytical task of network communication disruption. While it is possible to obtain direct estimates of these ability parameters via maximum likelihood, we instead use the R package BradleyTerry2 [19] to employ a logit-linear form of this model that allows us to estimate the log odds of one stimulus “beating” any other and test these estimates for statistical significance.
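To make the model concrete, the ability parameters can be estimated directly with the minorisation-maximisation (MM) iteration discussed in [11]. This is a plain-Python sketch of that direct route, not the logit-linear fit obtained with BradleyTerry2; the win counts are invented, and the function names are illustrative.

```python
from math import log


def fit_bradley_terry(wins, iterations=200):
    """MM estimate of abilities; wins[i][j] = times stimulus i beat stimulus j.
    Assumes every stimulus has at least one win and one comparison."""
    n = len(wins)
    ability = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (ability[i] + ability[j])
                        for j in range(n) if j != i)
            updated.append(total_wins / denom)
        scale = sum(updated)
        ability = [a / scale for a in updated]   # abilities are defined up to scale
    return ability


def win_probability(ability, i, j):
    """Equation (1): probability that stimulus i beats stimulus j."""
    return ability[i] / (ability[i] + ability[j])


def log_odds(ability, i, reference=0):
    """Log odds of stimulus i beating the reference stimulus."""
    return log(ability[i] / ability[reference])


# Invented counts for three stimuli (row beats column).
toy_wins = [[0, 3, 4],
            [1, 0, 3],
            [0, 1, 0]]
toy_ability = fit_bradley_terry(toy_wins)
```

On the logit scale the model reads log-odds(i beats j) = log a_i − log a_j, so fixing one stimulus as the reference turns every other log-ability into a coefficient of the kind reported in a logit-linear fit.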
3 Results
A total of 15 analysts with an average of 6.9 years of network analysis experience (interquartile range = 5.5) participated in the study. Table 1 summarizes the Bradley-Terry logit models. For ease of interpretation, we chose the random network with 30 nodes to serve as the reference level; thus, each coefficient estimate is the log odds of its associated stimulus “beating” this baseline stimulus. As seen, the difference between estimates of log odds and zero is significant at the 95% confidence level for all stimuli with the exception of the random network with 60 nodes.

Table 1. Summary of the Bradley-Terry model fit to the data. Note: Significance tests are two-tailed, and all estimates are compared to the random network with 30 nodes, which serves as the reference level.

| Stimulus | Estimate | Std. error | z | p |
|---|---|---|---|---|
| R60 - Random, 60 nodes | 0.55 | 0.30 | 1.87 | 0.06 |
| N50 - Node sampling, 50% of complete network | 1.21 | 0.29 | 4.15 | < 0.001 |
| E25 - Edge sampling, 25% of complete network | 1.21 | 0.29 | 4.15 | < 0.001 |
| T25 - TIE sampling, 25% of complete network | 1.41 | 0.29 | 4.80 | < 0.001 |
| N75 - Node sampling, 75% of complete network | 1.47 | 0.29 | 5.00 | < 0.001 |
| E75 - Edge sampling, 75% of complete network | 2.25 | 0.30 | 7.46 | < 0.001 |
| N25 - Node sampling, 25% of complete network | 2.28 | 0.30 | 7.55 | < 0.001 |
| E50 - Edge sampling, 50% of complete network | 2.72 | 0.31 | 8.72 | < 0.001 |
| DLT2 - Depth-limited traversal, 2 hops | 2.79 | 0.31 | 8.89 | < 0.001 |
| DLT1 - Depth-limited traversal, 1 hop | 2.83 | 0.31 | 8.98 | < 0.001 |
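The log-odds estimates in Table 1 are easier to read once converted to probabilities. The following sketch applies the inverse logit to two of the reported estimates and recomputes a two-tailed p-value from a reported z statistic under a standard normal reference distribution; the helper names are illustrative, and the inputs are the rounded values from the table.

```python
from math import erfc, exp, sqrt


def probability_from_log_odds(log_odds):
    """Inverse logit: probability of beating the reference stimulus (R30)."""
    return 1.0 / (1.0 + exp(-log_odds))


def two_tailed_p(z):
    """Two-tailed p-value of a z statistic under the standard normal."""
    return erfc(abs(z) / sqrt(2.0))


p_dlt1 = probability_from_log_odds(2.83)   # about 0.94
p_r60 = probability_from_log_odds(0.55)    # about 0.63
p_value_r60 = two_tailed_p(1.87)           # about 0.06
```

Read this way, the one-hop traversal is estimated to be preferred over the 30-node random network in roughly 94% of direct comparisons, whereas the 60-node random network is preferred in only about 63%.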
Summaries of regression models incorporating categorical variables only allow for comparisons of each estimate to the reference category; therefore, as a follow-up analysis, we used the R package qvcalc [7] to compute the quasi-variances [8] of each estimate, which facilitate approximate inference by visual comparison for any potential contrast. Figure 2 plots each stimulus’ log odds estimate plus or minus 1.96 quasi-standard errors. If any two estimates’ error bars do not overlap, they may be considered distinguishable at the 95% confidence level.
Fig. 2. Estimates ± 1.96 quasi-standard errors.
4 Discussion
The main generalization that emerges from our results is that when the stimuli are ranked in order of ability as estimated by the model summarized above, it is clear that the random networks very poorly facilitate the analytical task while the depth-limited traversal network expansions are the best-performing relative to the rest of the stimulus set. These would seem to be natural endpoints on a continuum of social network structure: random networks are expected to contain little such structure, while the depth-limited traversal expanded networks preserve that structure entirely, at least up to some proportion of the complete network. We take this general finding to partially validate our experimental method insofar as it demonstrates its ability to yield a basic, interpretable result with statistically rigorous certainty. The relative performance of many of the stimuli, mainly the sampling-based network expansions, is substantially less clear. Many of these networks are statistically indistinguishable from the worst- and best-performing stimulus categories; for example, the edge-sampled network at 50% of the complete network cannot be differentiated from either of the depth-limited traversal expansions in terms of overall performance. Further, there are no clear generalizations related to the parameters of the sampling-based expansions; it is not the case, for example, that all edge-sampled networks or that all networks sampled at 25% of the complete network tend to cluster anywhere in the performance ranking. These results should be considered in the context of several important limitations of our study. A central issue is that the space of experimental parameters is very large, and our experiment represents only one set of possible configurations. 
For example, we chose to sample from the World Check network, but it is possible that its specific topological properties may, in some way, influence analysts’ decision making, thus limiting the extent to which our findings can be generalized to all social networks. Further, the specific task given to the participants, while organic to their expertise, is only one of several possible tasks; others (e.g., identification of communities, connection strength, etc.) may be better facilitated by different network expansion procedures. An additional limitation concerns randomness in the generation of our expanded networks: the sampling procedures that generate some of the subnetworks as well as the drawing procedure that creates the stimuli both have a random component, meaning that multiple applications of the same expansion procedure would result in different-looking stimuli at each iteration. This raises the concern that participants may be reacting to idiosyncratic features of each stimulus rather than the general expansion category it is meant to represent. Thus, improvements on our design would generate multiple stimuli for each sampling category to nullify any stimulus-specific effects. Finally, future efforts would ideally explore which properties of the expanded networks relate to their performance relative to others. For example, it is reasonable to suppose that the best-performing network expansions may represent the largest amount of workable data that is neither too large to process nor too parsimonious (relative to the seed set) to be relevant. Such propositions could
be formally tested if relevant features of the expanded networks were identified and recorded to be included as covariates in an analysis. To be successful, statistical power considerations would likely require a larger sample of analysts than utilized here.
5 Conclusion
We have presented a method for creating expanded networks from seed nodes in order to extract well-structured sub-networks from large datasets. In presenting different networks generated from this method to criminal analysts for preference judgments, we have statistically justified expectations that random networks (exhibiting no structure other than the seed nodes) are avoided and depth-limited traversals (preserving most of the network connected to the seed nodes) are strongly preferred. Future research will focus on the generalizability of these results to additional datasets and different types of criminal and non-criminal analysts. These endeavors will work to clarify patterns associated with different downsampling methods, size of data, and features of networks that influence cognitive preferences.

Acknowledgments. The authors would like to thank Nick Ratliff and Alli Zube for productive feedback during early pilot studies. Thank you also to Brian Josey, Spencer Torene, Nick Zube, and three anonymous reviewers for critical substantive and editorial insights on earlier drafts.
Appendix
Table 2. Basic summary statistics for each network.

| Network | Nodes | Edges | Average degree |
|---|---|---|---|
| N25 - Node sampling, 25% of complete network | 17 | 18 | 2.12 |
| N50 - Node sampling, 50% of complete network | 19 | 32 | 3.37 |
| N75 - Node sampling, 75% of complete network | 68 | 103 | 3.03 |
| E25 - Edge sampling, 25% of complete network | 16 | 19 | 2.38 |
| E50 - Edge sampling, 50% of complete network | 44 | 100 | 4.55 |
| E75 - Edge sampling, 75% of complete network | 91 | 274 | 6.02 |
| T25 - TIE sampling, 25% of complete network | 41 | 73 | 3.56 |
| R30 - Random network, 30 nodes | 30 | 25 | 3.21 |
| R60 - Random network, 60 nodes | 60 | 171 | 5.7 |
| DLT1 - Depth-limited traversal, 1 hop | 81 | 98 | 2.42 |
| DLT2 - Depth-limited traversal, 2 hops | 144 | 178 | 2.47 |
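For reference, the average degree of an undirected network follows from the node and edge counts as 2|E|/|V|, since every edge contributes to the degree of two nodes. A minimal sketch, using the N25 and DLT1 counts from Table 2 as worked examples (the function name is illustrative):

```python
def average_degree(n_nodes, n_edges):
    """Average degree of an undirected network: each edge adds 2 to the degree sum."""
    return 2 * n_edges / n_nodes


avg_n25 = average_degree(17, 18)    # 36 / 17, about 2.12
avg_dlt1 = average_degree(81, 98)   # 196 / 81, about 2.42
```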
References
1. Ahmed, N., Neville, J., Kompella, R.R.: Network sampling via edge-based node selection with graph induction. Technical Report CSD TR 11-016, Computer Science Department, Purdue University (2011)
2. Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks. In: Proceedings of the International AAAI Conference on Weblogs and Social Media (2009). http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154
3. Bradley, R.A., Terry, M.E.: Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39(3/4), 324–345 (1952)
4. Cytoscape: An open source platform for complex network analysis. https://cytoscape.org
5. DBLP Computer Science Bibliography. https://dblp.org/
6. Erdős, P., Rényi, A.: On random graphs I. Pub. Math. 6, 290–297 (1959)
7. Firth, D.: qvcalc: Quasi variances for factor effects in statistical models. R package version 1.0.2 (2020). https://CRAN.R-project.org/package=qvcalc
8. Firth, D., De Menezes, R.X.: Quasi-variances. Biometrika 91(1), 65–80 (2004)
9. Fruchterman, T.M.J., Reingold, E.M.: Graph drawing by force-directed placement. Softw. Pract. Exper. 21, 1129–1164 (1991)
10. Hagberg, A., Swart, P., Chult, D.: Exploring network structure, dynamics, and function using NetworkX. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (2008)
11. Hunter, D.R.: MM algorithms for generalized Bradley-Terry models. Ann. Stat. 32(1), 384–406 (2004)
12. Lee, S., Kim, P.J., Jeong, H.: Statistical properties of sampled networks. Phys. Rev. E, Stat. Nonlinear Soft Matter Phys. 73, 016102 (2006)
13. Leskovec, J., Faloutsos, C.: Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 631–636 (2006)
14. LiveJournal social network and ground-truth communities. https://snap.stanford.edu/data/com-LiveJournal.html
15. Neo4j Graph Data Platform. https://neo4j.com/
16. Pavlovia. http://www.pavlovia.org
17. Peirce, J.W., MacAskill, M.R.: Building Experiments in PsychoPy. Sage, London (2018)
18. Rafiei, D.: Effectively visualizing large networks through sampling. In: VIS 05 IEEE Visualization, pp. 375–382 (2005)
19. Turner, H., Firth, D.: Bradley-Terry models in R: the BradleyTerry2 package. J. Stat. Softw. 48(9), 1–21 (2012)
20. Refinitiv World Check. https://www.refinitiv.com/content/dam/marketing/en_us/documents/brochures/world-check-risk-intelligence-brochure.pdf
Attributed Graphettes-Based Preterm Infants Motion Analysis

Davide Garbarino1,2, Matteo Moro1,2(B), Chiara Tacchino3, Paolo Moretti3, Maura Casadio1, Francesca Odone1,2, and Annalisa Barla1,2

1 Department of Informatics, Bioengineering, Robotics and Systems Engineering (DIBRIS), University of Genoa, Genova, Italy
{davide.garbarino,matteo.moro}@edu.unige.it, {maura.casadio,francesca.odone,annalisa.barla}@unige.it
2 Machine Learning Genoa (MaLGa) Center, Genova, Italy
3 Istituto Giannina Gaslini, Genova, Italy
{chiaratacchino,paolomoretti}@gaslini.org

Abstract. The neuro-motor status of preterm infants can be studied by analyzing their spontaneous movements. Automatic methods for assessing infants' motion patterns are still limited. We present a novel pipeline for the characterization of infants' spontaneous movements which, given RGB videos, leverages network analysis and NLP. First, we describe a body configuration for each frame by considering landmark points on the infant's body as nodes of a network and connecting them depending on their proximity. Each configuration can be described by means of attributed graphettes. We identify each attributed graphette by a string, which allows us to study videos as texts, i.e. sequences of strings. This in turn lets us exploit NLP methods such as topic modelling to obtain interpretable representations. We analyze topics to describe both global and local differences between infants with normal and abnormal motion patterns. We find encouraging correspondences between our results and evaluations performed by expert physicians.

Keywords: Temporal networks · Attributed graphettes · Preterm infants motion · Markerless human motion analysis · Deep learning · Natural language processing · Latent Dirichlet allocation

1 Introduction
The analysis of preterm infants' motion is a crucial and complex task. The World Health Organization (WHO) [28] has highlighted that preterm infants (i.e., infants born before 37 completed weeks of gestation) have a 5–15% chance of developing motor alterations caused by permanent lesions of the developing brain [5], which commonly involve the areas of the brain that control movement. An early diagnosis of abnormal motion patterns would allow rehabilitation treatments to start early, increasing the chances of recovery.

D. Garbarino and M. Moro contributed equally.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 82–93, 2022. https://doi.org/10.1007/978-3-030-93409-5_8
In this direction, given the increase of the preterm survival rate in high-income countries [4], considerable effort has been devoted to finding an automatic and reliable way to assess preterm infants' neuro-motor status from their spontaneous movements [18]. An accurate quantitative analysis of human motion is usually performed with wearable sensors, markers and motion capture systems [10,21]. Unfortunately, markers and sensors placed on the skin are cumbersome and can affect the naturalness of the motion [8], especially in infants [18]. For these reasons, marker-less techniques for human motion analysis based on computer vision have recently been studied [10,12] and applied to the analysis of infants' motion [1,2,9,14]. These techniques have the potential to solve or reduce some of the issues of marker-based approaches: they allow a more natural, person-friendly interaction, and they are non-invasive and inexpensive. Furthermore, they can be adopted to characterize motor configurations that are not easily detectable with markers.

In this paper we approach the problem of representing sequences of spontaneous movements of preterm infants as a temporal network analysis problem. More precisely, we map each frame of a video to a 5-node graph whose nodes are landmark points and whose edges are inserted based on the distance between landmark points on the image plane. As far as we know, this approach has not been used before. We model the networks as sequences of 5-node attributed graphettes [16], defined as not necessarily connected, non-isomorphic induced subgraphs of a larger graph, whose nodes are equipped with attributes. We exploit this modelling choice to obtain an interpretable, low-dimensional representation of each video that conveys information about the local dynamics of each infant.

In this sense, the authors of [19] define a representation of a large social network by using topic modelling methods [3]. Specifically, they build topic models (which they call structural topics) upon graphette occurrences in node neighborhoods, using anonymous walks to approximate graphette concentrations in the network. Such topic models, in which graphettes are encoded as words of a text that is the network, go beyond describing networks only through graphette concentrations [24] by also including the distribution over graphettes themselves. The result is a representation of the network as a mixture of structural topics, which in turn are outlined as mixtures of graphettes. In our problem, we use the method described in [19], instantiated with a Latent Dirichlet Allocation [7] model, to identify local motion patterns that characterize infants' spontaneous movements. This method allows us to highlight insightful differences between the classes of infants with normal (N) and abnormal (Ab) motion patterns. In particular, the motion of infants in the Ab class is better characterized by highly symmetric configurations and lower variability. On the contrary, infants in the N class are characterized by less symmetric configurations and higher motion variability.
2 Materials and Methods

2.1 Dataset
Data acquisition was performed 3 months after birth and involved 118 preterm infants (one video for each infant; 78 females; born at 29 ± 2 weeks and weighing 1150 ± 303 g). The acquisition setup consisted of a single RGB camera (Canon Legria HF R37, acquiring at 25 frames per second with a resolution of 1080 × 1920 pixels) placed on a support above a treatment table. We excluded from the analysis those portions of the videos where interventions of the operators occluded part of the scene or where the infants were crying. Among the 118 infants included in this study, 53 had a clinical diagnosis of neuro-motor disorders, covering a wide spectrum of intensities but with only a minority showing major impairments. The neuro-motor assessments, performed 30 months after the video recording, were based on different clinical evaluations and tests, including the Bayley test [6] for the majority of the infants and Magnetic Resonance Imaging (MRI) at birth. The study and the consent form signed by parents were approved by the Giannina Gaslini Hospital Institutional Review Board on 20/06/2013 (protocol number: IGGPM01).

2.2 Pre-processing: Landmark Points Detection and Filtering
In our setting, the nodes of the network representation are a set of meaningful landmark points detected on the infants' bodies. To this aim, we rely on a deep model for semantic feature detection, DeepLabCut (DLC) [20], suitably trained to detect the nose (N), left hand (LH), right hand (RH), left foot (LF) and right foot (RF). From our dataset, we randomly select 10 frames from 100 videos and manually label the points of interest. Among the deep architectures provided in DLC, we select a ResNet-50 [17] pretrained on ImageNet [11], with a final deconvolutional layer that extracts spatial probability density maps associated with each landmark point. The architecture is trained with the parameters suggested in [20]. We use the trained model to extract the positions of the five landmark points in all frames of each infant's video, as shown in Fig. 1. For each video, the outputs of the model are $\{(x_l^t, y_l^t, c_l^t)\}_{t=1}^{T}$, with $l \in \{N, LH, RH, LF, RF\}$. The $l$-th point in the $t$-th frame is identified by its position $(x_l^t, y_l^t)$ and its likelihood $c_l^t$, a number in the interval $[0, 1]$.

The coordinates are then filtered to improve the stability of the estimated points across time and to discard mispredictions. The likelihood $c_l^t$ quantifies the uncertainty of the detection of each point in each frame, and we consider points with $c_l^t < 0.75$ to be wrongly detected. Then, to recognize other possible mispredictions, we also drop points corresponding to high peaks in the speed profile of each coordinate. We discard points with these characteristics and, if the information loss lasts less than 2 s (50 frames), we interpolate the trajectories to reconstruct the missing information. Finally, we smooth the resulting signals with a low-pass filter (Butterworth, 4th order, 10 Hz cut-off frequency).
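The first two filtering steps above (likelihood thresholding and interpolation of short gaps) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, linear interpolation is assumed because the text does not name the interpolation scheme, and the speed-peak rejection and the Butterworth smoothing are left out.

```python
from typing import List, Optional, Sequence


def mask_low_confidence(values: Sequence[float],
                        likelihoods: Sequence[float],
                        min_likelihood: float = 0.75) -> List[Optional[float]]:
    """Replace samples whose detection likelihood is below the threshold with None."""
    return [v if c >= min_likelihood else None
            for v, c in zip(values, likelihoods)]


def interpolate_short_gaps(values: Sequence[Optional[float]],
                           max_gap: int = 50) -> List[Optional[float]]:
    """Linearly fill runs of None shorter than max_gap frames.

    Assumption: only gaps bounded by valid samples on both sides are filled;
    longer gaps and gaps at the start or end of the signal stay None.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        end = i  # index of the first valid sample after the gap, or n
        if start > 0 and end < n and (end - start) < max_gap:
            left, right = out[start - 1], out[end]
            span = end - (start - 1)
            for k in range(start, end):
                out[k] = left + (k - (start - 1)) / span * (right - left)
    return out


# One coordinate of one landmark: the third sample is a low-likelihood outlier.
x = [0.0, 1.0, 9.0, 3.0, 4.0]
c = [0.9, 0.8, 0.2, 0.95, 0.99]
cleaned = interpolate_short_gaps(mask_low_confidence(x, c))
```

The same two functions would be applied independently to the x and y coordinate of every landmark; 50 frames corresponds to 2 s at the 25 frames per second stated for the camera.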
Fig. 1. Examples of detected landmarks in the image plane. Images cropped for visualization purpose.
2.3 Networks Definition
For each video, we build a temporal sequence of networks (one per frame) describing the relations among the landmark points in the image plane. The landmark points are the nodes, and they are connected through edges depending on their relative proximity. More specifically, edges are obtained by computing the Euclidean distance between every pair of landmark points in all the images composing our dataset. For each infant, all the computed distances are normalized by the maximum distance, across the whole video, between the nose and the virtual middle point between the feet in the image plane. This normalization compensates for possible differences both in infants' body size and in the distance between the camera and the acquisition plane. We use the distribution of normalized distances to identify which points are close to or far from each other at each timepoint. To favour sparser networks, which are easier to analyze, we assume that two landmark points are unlikely to be close to each other often. Therefore, if the distance between two landmark points is greater than the 25th percentile (i.e., the first quartile) of the corresponding empirical distribution, we consider them far from each other and do not connect them with an edge. Conversely, we link two nodes with an edge if their normalized Euclidean distance is lower than that percentile. This assumption allows us to define a binary temporal network for each infant in the dataset: Fig. 2 shows the first 10 configurations of an infant's sequence, represented by their adjacency matrices. The choice of the threshold is driven by the need for sparsity but is otherwise arbitrary. Possible alternatives will be explored in future work.
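A minimal sketch of this construction follows. It rests on two assumptions that the text leaves open: the "corresponding empirical distribution" is taken to be the per-pair distribution of normalized distances over one video, and the first quartile is estimated with the default method of Python's `statistics.quantiles`. All function names are illustrative.

```python
import math
from itertools import combinations
from statistics import quantiles

LANDMARKS = ("N", "LH", "RH", "LF", "RF")
PAIRS = list(combinations(LANDMARKS, 2))  # the 10 unordered landmark pairs


def reference_length(frames):
    """Maximum, over the video, of the distance from the nose to the feet midpoint."""
    best = 0.0
    for f in frames:
        mid = ((f["LF"][0] + f["RF"][0]) / 2, (f["LF"][1] + f["RF"][1]) / 2)
        best = max(best, math.dist(f["N"], mid))
    return best


def binary_temporal_network(frames):
    """Return one edge set per frame; frames map landmark name -> (x, y)."""
    ref = reference_length(frames)
    # Normalized pairwise distances, one dictionary per frame.
    dists = [{(a, b): math.dist(f[a], f[b]) / ref for a, b in PAIRS}
             for f in frames]
    # Assumed: one threshold per pair, the first quartile over the whole video.
    thresholds = {p: quantiles([d[p] for d in dists], n=4)[0] for p in PAIRS}
    # An edge is present when the pair is closer than its threshold.
    return [{p for p in PAIRS if d[p] < thresholds[p]} for d in dists]


# Toy video: the hands move apart frame after frame, everything else is fixed.
frames = [{"N": (0.0, 0.0), "LH": (-float(t), 5.0), "RH": (float(t), 5.0),
           "LF": (-1.0, 10.0), "RF": (1.0, 10.0)} for t in range(1, 9)]
layers = binary_temporal_network(frames)
```

In the toy video the LH-RH edge is present only in the first two frames, where the hands are closest, which is the intended sparsifying effect of the threshold.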
Fig. 2. Ten consecutive layers of an infant network represented as a sequence of adjacency matrices.
Each layer in an infant network represents a configuration at a specific timepoint (frame) and is defined as a 5-node graph, which we interchangeably call an attributed graphette or a configuration. In the remainder of the paper, we leverage this representation to describe infants' motion in an unsupervised fashion.

2.4 Attributed Graphettes-Based Representation
Each temporal network built for each infant as defined in Sect. 2.3 can be represented as a sequence of configurations describing the infant's motion patterns. More formally, the $t$-th layer of a temporal network $G$, corresponding to the $t$-th frame, is represented by a graph $g_t = (V, E_t, L)$, where $V = \{1, 2, 3, 4, 5\}$ is the set of nodes, $E_t$ is the set of edges and $L = \{N, RH, LH, RF, LF\}$ is the set of node attributes. It is important to note that at each timepoint $t$, the map assigning a node $n \in V$ to a label $l \in L$ is a bijection. Furthermore, $|E_t| \in \{0, \ldots, 10\}$, so disconnected subgraphs can occur in every infant video. Such subgraphs are called graphettes [16], defined as not necessarily connected, non-isomorphic induced subgraphs of a larger graph. Similarly to graphlets [24] and motifs [22], graphettes are a suitable tool for giving a local and global description of large complex networks. Indeed, by computing node-level graphette concentrations in a network we are able to describe local wiring patterns [27] and, at the same time, by aggregating this local information, we get a global description of the network based on the occurrences of these substructures [24]. Developing exhaustive graphette enumeration algorithms is usually a hard task, especially for attributed graphettes, as they are much larger in number than those without attributes [13]. Nevertheless, we exploit the work of [16] and the nature of our problem to define an exhaustive enumeration algorithm.

Attributed Graphettes Enumeration Algorithm. Knowing that the number of possible configurations representing one frame is limited to $2^{10}$, we generate all the possible 5-node attributed graphettes and represent each of them with the upper triangle of its adjacency matrix unravelled into a bit vector. Each bit vector is then associated with a word, i.e., a string composed of the letter g and an integer ranging from 0 to 1023. A practical example of this process is shown in Fig. 3.
Fig. 3. Canonical representation of one instance of a 5-node attributed graphette.
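The encoding of a configuration as a word can be sketched as below. The node order (N, LH, RH, LF, RF) and the bit order (upper triangle read row by row, first entry as most significant bit) are assumptions made for illustration; the text only fixes that the empty graph is g0 and that indices range from 0 to 1023.

```python
from itertools import combinations

# Assumed fixed node order; any fixed order gives a valid encoding.
LABELS = ("N", "LH", "RH", "LF", "RF")
# The 10 upper-triangle positions of the 5 x 5 adjacency matrix, row by row.
PAIRS = list(combinations(LABELS, 2))


def graphette_word(edges):
    """Map a set of undirected edges between labelled nodes to a word such as 'g512'."""
    present = {frozenset(e) for e in edges}
    bits = "".join("1" if frozenset(p) in present else "0" for p in PAIRS)
    return "g" + str(int(bits, 2))


def word_to_edges(word):
    """Inverse mapping: recover the edge list from a word."""
    bits = format(int(word[1:]), "010b")
    return [p for p, b in zip(PAIRS, bits) if b == "1"]


# The configuration in which each hand is linked to the foot on the same side.
hands_to_feet = graphette_word([("LH", "LF"), ("RH", "RF")])
```

Because the node labels are fixed, two configurations get the same word only if they have exactly the same labelled edges, so no isomorphism test is needed and the enumeration is exhaustive by construction.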
Given an infant video $G$, we sequentially associate each frame $t$ of the video with an attributed graphette, represented by the corresponding string $g_n$. For instance, the attributed graphette 0000000000 corresponds to the string $g_0$. A network is then defined as an ordered sequence of elements from the set $\{g_0, \ldots, g_{1023}\}$, whose length is equal to the number of frames of the corresponding video. $G$ thus becomes a collection of configuration names that we treat as text, resorting to Natural Language Processing (NLP) methods for text representation in order to enumerate attributed graphettes and describe infants' motion in terms of their occurrences.

In this regard, the Bag-Of-Words (BOW) [15] model is a histogram representation that transforms any text into a fixed-length vector by counting how many times each word appears in it. This vectorization is performed by fixing or inferring a vocabulary, which is contained in or equal to the set of all words found in the documents. In our case, the vocabulary of all configurations appearing in the dataset consists of 650 attributed graphettes. Therefore, after fitting a BOW model, every infant's network becomes a vector of size 1 × 650. Figure 4 (left panel) offers a visual representation of a video as a BOW vector.

To identify the configurations that are discriminative for networks in the dataset, we need to normalize the raw counts in the BOW vectors properly. For this purpose, we use Term Frequency - Inverse Document Frequency (tf-idf), a common scheme for transforming word counts into meaningful real numbers [26]. More specifically, given a configuration $g_n$ and a network $G$, tf-idf measures the originality of $g_n$ by comparing the number of times $g_n$ appears in $G$ (i.e. term frequency) to the number of networks $g_n$ appears in (i.e. document frequency). To reduce the dimensionality of these representations, we set thresholds on the minimum and maximum document frequency of configurations. In the tf-idf case, we also retain the ability to weight graphettes based on how common they are in the dataset. Figure 4 (right panel) illustrates a tf-idf transform (with minimum and maximum document frequencies set to 45% and 70%, respectively) of a network in the infants dataset.
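The BOW and tf-idf representations with document-frequency thresholds can be sketched in a few lines. The weighting used here, raw count times log(N / df), is one common textbook variant and is an assumption: the text does not state which tf-idf formula or library was used. Function names are illustrative.

```python
import math
from collections import Counter


def fit_vocabulary(docs, min_df=0.0, max_df=1.0):
    """Keep the words whose document frequency (as a fraction) lies in [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a word at most once
    vocab = sorted(w for w, c in df.items() if min_df <= c / n_docs <= max_df)
    return vocab, df


def bow_vector(doc, vocab):
    """Histogram of word counts over a fixed vocabulary."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]


def tfidf_vector(doc, vocab, df, n_docs):
    """Assumed weighting: term count multiplied by log(n_docs / document frequency)."""
    counts = Counter(doc)
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]


# Four toy "videos", each a sequence of configuration words.
docs = [["g0", "g0", "g5"], ["g0", "g7"], ["g0", "g5", "g5"], ["g0"]]
full_vocab, df = fit_vocabulary(docs)
kept_vocab, _ = fit_vocabulary(docs, min_df=0.5, max_df=0.75)
weights = tfidf_vector(docs[2], kept_vocab, df, len(docs))
```

In the toy corpus, g0 occurs in every document and g7 in only one, so both fall outside the thresholds and only g5 is kept. This mirrors the remark in the caption of Fig. 4 that a configuration that is either very frequent or rare across networks receives zero weight.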
Fig. 4. BOW (left) and tf-idf (right) word cloud visualization of an infant’s temporal network. The size of configuration names is proportional to their weights in the corresponding representation. Note that the configuration g512 is either very frequent or rare in the collection of infants networks and therefore it has weight equal to 0 in the tf-idf representation.
Latent Dirichlet Allocation. Although the tf-idf approach reduces the description length by an arbitrary amount, it does not reveal any information on the distribution over attributed graphettes within each network. To overcome this limitation, we resort to topic modeling [3] to define an interpretable low-dimensional representation of videos that describes the distribution of attributed graphettes for each infant and also gives local information on the infant's dynamics by considering co-occurrences of configurations. Topic models [3] are probabilistic generative models for large collections of textual data (i.e., text corpora). A notable topic model is Latent Dirichlet Allocation (LDA) [7], defined as a 3-level hierarchical Bayesian model in which every item in a corpus is modelled as a mixture over an underlying set of topics, which are, in turn, described by a probability distribution over words. Topic probabilities offer an explicit low-dimensional representation of texts, which has recently been adopted to analyse large social networks [19]. In the remainder of the paper we adopt an LDA representation of infant networks built upon the tf-idf transformations of attributed graphette counts.

Data Augmentation. Typically, LDA needs to be trained on a large amount of data to obtain reliable and stable topics. Our dataset is composed of 118 infants (65 with normal and 53 with abnormal motion patterns), which is too small to infer meaningful topics from LDA. Therefore, to augment the dataset, we simulate videos from the two classes (i.e., infants with normal (N) and abnormal (Ab) motion patterns) until we obtain a balanced dataset of 1130 videos (118 original videos, plus 500 and 512 simulated videos from the classes N and Ab, respectively). Simulated networks are composed of 10710 consecutive configurations, which is the average number of frames in the original infant videos.
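The simulation behind this augmentation samples each new sequence from an infant-specific conditional distribution over configurations estimated from bigram counts, as the text that follows formalises. A minimal sketch is given here under two assumptions: bigram counts are normalized per preceding configuration (so that each row is a conditional distribution), and a configuration with no observed successor is followed by a uniformly chosen observed configuration. Neither detail is stated in the paper, and the function names are illustrative.

```python
import random
from collections import Counter, defaultdict


def bigram_transitions(sequence):
    """Row-normalized bigram counts: configuration -> {next configuration: probability}."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return {g: {h: c / sum(nxt.values()) for h, c in nxt.items()}
            for g, nxt in counts.items()}


def simulate(transitions, start, length, rng):
    """Sample a sequence of the given length, one configuration at a time."""
    seq = [start]
    while len(seq) < length:
        row = transitions.get(seq[-1])
        if not row:
            # Assumed fallback for a configuration that was only seen as a last frame.
            seq.append(rng.choice(sorted(transitions)))
            continue
        words = sorted(row)
        seq.append(rng.choices(words, weights=[row[w] for w in words])[0])
    return seq


original = ["g0", "g1", "g0", "g1", "g0", "g2"]
transitions = bigram_transitions(original)
simulated = simulate(transitions, start="g0", length=20, rng=random.Random(0))
```

For a simulated video of the size used in the paper, `length` would be 10710, and `original` would be the sequence of the randomly picked infant of the chosen class.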
We simulate temporal networks by leveraging normalized counts of bigrams (i.e., pairs of adjacent configurations) from the original dataset. More specifically, given a temporal network $G$, we compute bigram frequencies and associate every configuration $g_n$ with a vector $v_{g_n} = (bf_i)_{i=1}^{650}$, where $bf_i$ is the normalized frequency of the bigram $(g_n, g_{n(i)})$ in $G$ and $g_{n(i)}$ is the $i$-th configuration in the vocabulary. Thus, for every infant, we obtain a $650 \times 650$ matrix $X_G$ describing an infant-specific conditional distribution over configurations. We generate networks by first picking at random an infant from a chosen class and a starting configuration, and then iteratively sampling configurations from the probability distribution identified by $X_G$.

Number of Topics Selection. One of the most crucial LDA hyperparameters to tune is the number of topics. Many metrics have been defined in the literature to find an optimal number of topics [7,25]. We focus on maximizing the Intrinsic Topic Coherence Measure (ITCM) [23], a metric based on the co-occurrence of words within the documents being modeled. For every topic $p$, ITCM is defined as

$$ITCM(p, V^p) = \sum_{m=2}^{M} \sum_{h=1}^{m-1} \log \frac{df(v_m^p, v_h^p) + 1}{df(v_h^p)}, \qquad (1)$$

where $V^p = (v_1^p, \ldots, v_M^p)$ is the list of the $M$ most probable configurations in topic $p$, $df(v_m^p, v_h^p)$ is the number of documents in which the configurations $v_m^p$ and $v_h^p$ appear together, and $df(v_h^p)$ is the document frequency of the configuration $v_h^p$. For each topic, the co-occurrence frequencies of the $M$ most probable configurations ($df(v_m^p, v_h^p)$ in Eq. (1)) are computed within fixed-size temporal windows for every network. We consider $M = 10$ and a temporal window size of 110 frames. We select an optimal number of topics ($NoT$) by studying how ITCM varies as $NoT$ ranges in $\{2, 3, 4, 5, 6, 7\}$ when applying LDA to the tf-idf transform of the augmented dataset, for different settings of the maximum and minimum document frequencies. These values are chosen based on the statistics of the original dataset. More specifically, minimum document frequencies range in {1%, 10%, 18%, 27%, 36%, 45%}, where 1% corresponds to 1 infant in the original dataset and 45% corresponds to the total number of infants with abnormal motion patterns. Similarly, maximum document frequencies range in {60%, 65%, 70%, 75%}, where we set 60% as the lower bound because we assume that every configuration appearing in more than half of the infants is non-discriminative. As shown in Fig. 5, the optimal ITCM is reached at $NoT = 5$, with maximum and minimum document frequencies equal to 70% and 45%, respectively.

Fig. 5. Average ITCM evaluated for number of topics ($NoT$) ranging in {2, 3, 4, 5, 6, 7} and for different values of maximum and minimum document frequencies in the tf-idf representation. As shown in the bottom right corner, the optimal choice is $NoT = 5$, maximum df equal to 70% and minimum df equal to 45%.
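The coherence score of Eq. (1) is straightforward to compute once the documents are fixed. In the sketch below each document is the set of configurations seen in one temporal window; cutting a sequence into consecutive, non-overlapping windows is an assumption, since the text only says that windows have a fixed size. Function names are illustrative.

```python
import math


def split_into_windows(sequence, size=110):
    """Assumed windowing: consecutive, non-overlapping windows, each reduced to a set."""
    return [set(sequence[i:i + size]) for i in range(0, len(sequence), size)]


def itcm(top_words, documents):
    """Eq. (1) for one topic.

    top_words: the M most probable configurations of the topic, most probable first.
    documents: list of sets of configurations (one set per window).
    Every word in top_words is assumed to occur in at least one document,
    otherwise the denominator df(v_h) would be zero.
    """
    def df(*words):
        return sum(1 for d in documents if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):      # m = 2, ..., M in the paper's indexing
        for h in range(m):                  # h = 1, ..., m - 1
            score += math.log((df(top_words[m], top_words[h]) + 1) / df(top_words[h]))
    return score


documents = [{"a", "b"}, {"a"}, {"a", "b", "c"}, {"c"}]
coherence = itcm(["a", "b", "c"], documents)
```

The quantity plotted in Fig. 5 would then be the average of this score over the topics of a fitted model, computed for each candidate number of topics and each pair of document-frequency thresholds.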
3 Results: Topics Analysis

By fitting LDA with these hyperparameters to the augmented infants dataset, we obtain the 5 topics selected by the ITCM maximization, which describe local motion patterns. Figure 6 shows the topics summarized by their 5 most probable configurations. The most probable configurations of a topic differ from each other by only a few edges and appear as small modifications of a basic configuration. This is evident from the 2 most probable configurations of each topic in Fig. 6. For instance, Topic 2 is well summarized by the configuration whose only edges connect each hand with the corresponding foot, i.e. LH-LF and RH-RF: its 2 most probable configurations appear as slight deviations from this basic configuration.

Fig. 6. Visual representation of the five obtained topics described by their 5 most probable configurations: the top 2 are depicted as graphs whereas the last 3 are summarized by their encoding. The size of a configuration encoding is proportional to its weight in the topic-configurations probability distribution.

Then, we study the topic proportions of every network in the original dataset to look for differences between the network representations of infants with normal and abnormal motion patterns. The topic proportions of a network provide a global description of the infant's movement. Indeed, for each network in the dataset, larger mixture components correspond to topics whose most probable configurations are peculiar to the corresponding infant's motion sequence. Furthermore, topic proportions can be interpreted as probabilistic assignments to clusters, which are identified by the corresponding topics. We perform a class-specific analysis of topic proportions, as reported in Table 1. In particular, for each network in the dataset, we take the largest mixture component in its topic representation, which gives the confidence in assigning the network to the corresponding topic. Once the infants are assigned to their prevalent topic, we compute the intra-topic, class-specific mean, minimum and maximum assignment probabilities. We claim that such statistics are good descriptors of the variety of intra-class motion. Also, for each topic, we compute the concentrations of infants in the N and Ab classes assigned to it. Differences in such concentrations would indicate different global motion patterns between the two classes. Furthermore, for each topic, we evaluate the mean global symmetry and density of the 5 most probable configurations, as well as the mean symmetry of the hands and feet neighborhoods.

Table 1. Results of topic analysis. For each topic, we report statistics on: Intra-class assignment probability (mean, minimum, maximum, and concentrations), Symmetry (global, hands, and feet), and Density of the 5 most probable configurations.

        Intra-class probability                                   Symmetry                Density
        Class N                      Class Ab
Topic   Mean   Min    Max    Conc    Mean   Min    Max    Conc    Global  Hands   Feet
T0      0.59   0.34   0.9    0.15    0.67   0.6    0.91   0.15    0.94    0.95    0.78    0.62
T1      0.8    0.41   1.0    0.28    0.74   0.42   1.0    0.28    0.94    0.87    0.60    0.28
T2      0.73   0.46   1.0    0.17    0.74   0.36   1.0    0.13    0.90    0.80    0.75    0.34
T3      0.6    0.4    0.99   0.23    0.71   0.33   0.93   0.17    0.80    0.50    0.58    0.34
T4      0.7    0.44   0.98   0.17    0.82   0.47   1.0    0.26    0.92    0.20    0.68    0.24

In general, from Table 1 we can observe that:
a) No significant differences are detected in the concentrations of infants assigned to each topic.
b) Infants with normal motion patterns are more uniformly distributed among the 5 topics, meaning that they present a higher variability in terms of motion patterns.
c) Infants with abnormal motion patterns are well represented in Topic 0 and Topic 4 (considering the minimum and the mean assignment probabilities, respectively).
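The intra-class statistics reported in Table 1 can be reproduced from per-infant topic proportions with a short routine like the one below. It is a sketch with illustrative names: "conc" is taken to be the fraction of a class assigned to a topic, an interpretation consistent with the Conc columns, which sum to roughly 1 within each class.

```python
from collections import defaultdict


def dominant_topic(proportions):
    """Return (topic index, probability) of the largest mixture component."""
    k = max(range(len(proportions)), key=proportions.__getitem__)
    return k, proportions[k]


def class_topic_summary(infants):
    """infants: list of (class label, topic proportions).

    Returns {(class label, topic): {"mean", "min", "max", "conc"}}, where
    "conc" is the assumed fraction of that class assigned to the topic.
    """
    assigned = defaultdict(list)
    class_sizes = defaultdict(int)
    for label, proportions in infants:
        topic, prob = dominant_topic(proportions)
        assigned[(label, topic)].append(prob)
        class_sizes[label] += 1
    return {
        cell: {
            "mean": sum(probs) / len(probs),
            "min": min(probs),
            "max": max(probs),
            "conc": len(probs) / class_sizes[cell[0]],
        }
        for cell, probs in assigned.items()
    }


# Toy example with two topics; the proportions are made up for illustration.
infants = [("N", [0.7, 0.3]), ("N", [0.2, 0.8]), ("N", [0.9, 0.1]), ("Ab", [0.6, 0.4])]
summary = class_topic_summary(infants)
```

Topics to which no infant of a class is assigned simply do not appear in the returned dictionary, which corresponds to a concentration of zero.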
4 Discussion and Conclusions

The class-specific topic proportions analysis associates each infant with a predominant topic. The structural features that we computed (symmetry and density) attempt to reflect some qualitative aspects considered by expert physicians during motor evaluation [21]. Considering the results in Table 1, we can comment that: (a) the higher variability associated with infants with normal motion patterns is also a qualitative aspect that expert physicians usually consider during their evaluations [21]. (b) Infants with abnormal motion patterns are well represented by Topic 0, since its minimum assignment probability (0.60) is higher than in the other cases. Topic 0 is also characterized by dense configurations with a higher level of symmetry. Here too, our results correspond to the visual evaluation of expert physicians, because abnormal movements are characterized by higher symmetry. (c) Topic 4 is one of the two most frequent topics among infants with abnormal motion patterns and has the highest mean assignment probability. As for Topic 0, we notice highly symmetric configurations; in this case the symmetry is almost entirely concentrated on feet connections.

In future work we plan to include more infants in the study and to refine the network representation by detecting more landmark points on the infants' bodies. By increasing the size of configurations, we expect to gain enough information to consolidate the analysis and investigate possible discriminative properties of the identified topics.
References
1. Adde, L., Helbostad, J.L., Jensenius, A.R., Taraldsen, G., Grunewaldt, K.H., Støen, R.: Early prediction of cerebral palsy by computer-based video analysis of general movements: a feasibility study. Dev. Med. Child Neurol. 52(8), 773–778 (2010)
2. Ahmedt-Aristizabal, D., Denman, S., Nguyen, K., Sridharan, S., Dionisio, S., Fookes, C.: Understanding patients' behavior: vision-based analysis of seizure disorders. IEEE J. Biomed. Health Inform. 23(6), 2583–2591 (2019)
3. Alghamdi, R., Alfalqi, K.: A survey of topic modeling in text mining. International Journal of Advanced Computer Science and Applications (IJACSA), vol. 6, no. 1 (2015)
4. Allen, M.C.: Neurodevelopmental outcomes of preterm infants. Current Opinion Neurol. 21(2), 123–128 (2008)
5. Bax, M., et al.: Proposed definition and classification of cerebral palsy, April 2005. Dev. Med. Child Neurol. 47(8), 571–576 (2005)
6. Bayley, N.: Bayley scales of infant and toddler development: administration manual. Harcourt assessment (2006)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
8. Carse, B., Meadows, B., Bowers, R., Rowe, P.: Affordable clinical gait analysis: an assessment of the marker tracking accuracy of a new low-cost optical 3d motion analysis system. Physiotherapy 99(4), 347–351 (2013)
9. Chambers, C., et al.: Computer vision to automatically assess infant neuromotor risk. IEEE Trans. Neural Syst. Rehabil. Eng. 28(11), 2431–2442 (2020)
10. Colyer, S.L., Evans, M., Cosker, D.P., Salo, A.I.: A review of the evolution of vision-based motion analysis and the integration of advanced computer vision methods towards developing a markerless system. Sports Med.-open 4(1), 1–15 (2018)
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
12. Desmarais, Y., Mottet, D., Slangen, P., Montesinos, P.: A review of 3d human pose estimation algorithms for markerless motion capture. arXiv preprint arXiv:2010.06449 (2020)
13. Dimitrova, T., Petrovski, K., Kocarev, L.: Graphlets in multiplex networks. Sci. Rep. 10(1), 1–13 (2020)
14. Garello, L., et al.: A study of at-term and preterm infants' motion based on markerless video analysis (2021)
15. Goldberg, Y.: Neural network methods for natural language processing. Synthesis Lect. Hum. Lang. Technol. 10(1), 1–309 (2017)
16. Hasan, A., Chung, P.C., Hayes, W.: Graphettes: constant-time determination of graphlet and orbit identity including (possibly disconnected) graphlets up to size 8. PloS One 12(8), e0181570 (2017)
17. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
18. Hesse, N., Bodensteiner, C., Arens, M., Hofmann, U.G., Weinberger, R., Sebastian Schroeder, A.: Computer vision for medical infant motion analysis: State of the art and rgb-d data set. In: Proceedings of the ECCV (2018)
19. Long, Q., Jin, Y., Song, G., Li, Y., Lin, W.: Graph structural-topic neural network. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1065–1073 (2020)
20. Mathis, A., et al.: Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21(9), 1281 (2018)
21. Meinecke, L., Breitbach-Faller, N., Bartz, C., Damen, R., Rau, G., Disselhorst-Klug, C.: Movement analysis in the early detection of newborns at risk for developing spasticity due to infantile cerebral palsy. Hum. Mov. Sci. 25(2), 125–144 (2006)
22. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
23. Mimno, D., Wallach, H., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 262–272 (2011)
24. Pržulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508–3515 (2004)
25. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pp. 399–408 (2015)
26. Salton, G., Harman, D.: Information retrieval. In: Encyclopedia of Computer Science, pp. 858–863 (2003)
27. Tu, K., Li, J., Towsley, D., Braines, D., Turner, L.D.: gl2vec: Learning feature representation using graphlets for directed networks. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 216–221 (2019)
28. World-Health-Organization: Preterm birth. https://www.who.int/news-room/fact-sheets/detail/preterm-birth (2018)
Dynamics of Polarization and Coalition Formation in Signed Political Elite Networks
Ardian Maulana1(✉), Hokky Situngkir2, and Rendra Suroso1,2
1 Department of Computational Sociology, Bandung Fe Institute, 40151 Bandung, Indonesia [email protected]
2 Department of Cognitive Science, Bandung Fe Institute, 40151 Bandung, Indonesia
Abstract. We study political elite networks within the framework of signed temporal networks to investigate the dynamics of coalition formation and polarization during the 2014 Indonesian General Elections. We construct a signed network of inferred relations of agreement or disagreement between Indonesian political actors, based on their opinions as reported by news media during the election. For each temporal network, we apply a community detection algorithm for signed networks to identify conflicting groups of political actors, and we characterize the identified groups based on party attributes. We visualize the networks and measure political tensions within and between clusters to examine the dynamics of polarization over time. We find that no coalition pattern is present during the legislative election period, when political actors are more likely to group within their respective party clusters. The intensity of polarization between clusters in this period is lower than in the following two periods, with a downward trend ahead of the legislative election day. The cleavage line between coalition clusters begins to form in the presidential election period and lasts into the post-election period, and the emerging pattern resembles the configuration of party coalitions in the 2014 Indonesian Presidential Election. The process of coalition formation is accompanied by an increase in the intensity of polarization between coalition clusters.
Keywords: Signed networks · Coalition formation · Polarization · Community structure · Political elite networks · Election
1 Introduction
Political polarization and coalition formation in elections are relational phenomena rooted in the dynamics of relations between elites or political parties [1, 2]. Polarization refers to the existence (or process of formation) of distinguishable groups of political actors or political parties that differ on one or more characteristics [1]. Through the lens of network science, such groups can be obtained by partitioning a network so that members of the same group are linked to each other by positive relations (e.g. ideological alignment, collaboration, or cooperation), and members of different groups are not [3–5]. The intensity of collaboration or conflict within and between groups is the basis for measuring political polarization [6, 7].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 94–103, 2022. https://doi.org/10.1007/978-3-030-93409-5_9
Since relations between political actors can, in principle, be coded as positive or negative, it is natural to use a signed network model to represent agreement or disagreement among political actors on contentious political issues. In this study, we infer relations between actors from their opinions in the media, specifically their support for or rejection of certain political issues. Through political discourse reported by news media, political actors indirectly interact with each other, both collaboratively and in conflict, in order to influence public discourse. This paper analyzes the network of Indonesian political elites within the framework of a temporal signed network to investigate the dynamics of coalition formation and polarization during the 2014 Indonesian General Election. Specifically, we build a signed network of inferred relations of agreement or disagreement among Indonesian political actors, based on their opinions as reported by news media during the election. We apply a community detection algorithm to identify internally cohesive, mutually opposing groups of political actors, and characterize the identified groups to understand the dynamics of coalition formation during elections. We also measure political tensions within and between groups to investigate the dynamics of political polarization over time. This paper is structured as follows. We describe the data and methodology in Sect. 2, and present the results in Sect. 3. Finally, we present our conclusions in Sect. 4.
2 Data and Methods
The data¹ used here come from the Newsmedia Processing Suite (NPS) database, an application developed by BFI Technologies² to collect and extract basic information contained in news articles. The news article attributes relevant to this research are as follows: (i) feed_date: news article publication date; (ii) tag: news topic; (iii) a_name: the person making the statement; (iv) a_description: the organization to which the person belongs; (v) quote: political statement made by the person; (vi) concept: a concept related to the issue the person is commenting on; (vii) l_value: the degree to which the person agrees with the concept, $l \in [0, 1]$.
Fig. 1. Procedure of signed network construction.
¹ The dataset is available in limited form at https://github.com/ardianeff/indoelitenet.
² http://talimerah.com/media-analytics/.
We represent the network of political elites as an undirected signed graph $G = (V, E, \sigma)$, where each vertex $v_i \in V$ denotes an individual who is a member of a political party, and each edge $e_{ij} \in E$ denotes a relation between two individuals. The edge set $E$ contains $m^+$ positive edges and $m^-$ negative edges, and $\sigma$ is the sign function that maps edges to $\{-1, +1\}$. Figure 1 shows the procedure for constructing the Indonesian political elite network, as follows:
1. Collecting and filtering news articles. Given a set of N news articles contained in the database, we retrieve and filter the data set based on the following criteria: (i) feed_date: March 1, 2014 to October 31, 2014; (ii) tag: 2014 Indonesian Election; (iii) a_description: member of political party.
2. Each actor $v_i$ has an opinion on concept $c_j \in C$, where $l_{ij}$ indicates the degree to which the actor agrees with the concept. Considering the changing nature of actors' opinions, we take the mean of $l_{ij}$ over the observed time range, apply a threshold, and map the result to $\{-1, 1\}$, as follows:
$$ l_{ij} = \begin{cases} 1 & \text{if } l_{ij} \in [0.8, 1] \\ -1 & \text{if } l_{ij} \in [0, 0.3] \end{cases} \tag{1} $$
We deliberately take the values of 0.3 (0.8) as thresholds to ensure differences (similarities) between political actors.
3. For the set of concepts, we calculate the edge connecting two actors as follows:
$$ e_{ij} = \frac{\sum_k l_{ik}\, l_{jk}}{\sqrt{\sum_k l_{ik}^2}\,\sqrt{\sum_k l_{jk}^2}}, \quad -1 \le e_{ij} \le 1 \tag{2} $$
We map edges to $\{-1, 1\}$, as follows:
$$ e_{ij} = \begin{cases} -1 & \text{if } e_{ij} < 0 \\ 1 & \text{if } e_{ij} > 0 \end{cases} \tag{3} $$
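Steps 2 and 3 of the construction procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `threshold_opinion` and `signed_edge` are hypothetical, opinions are assumed to be stored as one dictionary per actor keyed by concept, and mean agreement values that fall strictly between the two thresholds are assumed to contribute 0 (the text does not say how they are handled).

```python
import math


def threshold_opinion(mean_l, low=0.3, high=0.8):
    """Map a mean agreement level in [0, 1] to +1 or -1 as in Eq. (1).

    Assumption: values strictly between the thresholds map to 0,
    i.e. they carry no signal in the similarity computation.
    """
    if mean_l >= high:
        return 1
    if mean_l <= low:
        return -1
    return 0


def signed_edge(opinions_i, opinions_j):
    """Return the edge sign between two actors (Eqs. 2 and 3).

    opinions_i, opinions_j: dicts mapping concept -> thresholded opinion.
    Returns +1, -1, or None when the cosine similarity is zero or undefined
    (assumption: no edge is created in that case).
    """
    # Numerator of Eq. (2); concepts missing for one actor count as 0.
    dot = sum(value * opinions_j.get(concept, 0)
              for concept, value in opinions_i.items())
    norm_i = math.sqrt(sum(value * value for value in opinions_i.values()))
    norm_j = math.sqrt(sum(value * value for value in opinions_j.values()))
    if norm_i == 0 or norm_j == 0:
        return None
    cosine = dot / (norm_i * norm_j)
    if cosine > 0:
        return 1
    if cosine < 0:
        return -1
    return None
```

Treating a zero similarity as "no edge" is a design choice made here so that the resulting graph contains only signed edges, which matches the definition of $\sigma$ as a map to $\{-1, +1\}$.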
A positive edge means that both actors have the same preference over a number of political concepts, while a negative edge means the opposite.
To investigate the dynamics of political networks, we disaggregate the news data into several time windows. We use a rolling-window technique, in which the observation period is divided into a number of overlapping intervals of the same size. Each interval overlaps with the next one on a fixed sub-interval, and the difference between their starting dates is a constant step size. We chose a 30-day window and a one-day step to aggregate sufficient data in each window and to preserve daily resolution in the time series.
We implement the community detection algorithm proposed by Esmailian and Jalili [8] to identify cohesive groups of political actors with as few internal negative ties as possible. The algorithm uses an improved version of the Louvain method [9] to optimize the Map Equation, reformulated to account for negative edges in signed graphs. The Louvain method first assigns a unique label to each node, then expands each label to those neighbors that maximally improve the objective value, and finally folds each module into a node and repeats the procedure until no further improvement is made. The best partition is the one with the lowest Minimum Description Length (MDL), which is the minimum expected code length required to describe each step of a random walker.
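The rolling-window scheme described above can be sketched as follows. This is an illustrative helper, not the authors' code: the name `rolling_windows` is hypothetical, and the sketch assumes that only windows lying entirely inside the observation period are kept. The text does not state how windows at the boundaries of the period are handled, so the number of windows produced here need not match the number reported in the paper.

```python
from datetime import date, timedelta


def rolling_windows(start, end, window_days=30, step_days=1):
    """Yield (window_start, window_end) date pairs, both inclusive.

    Consecutive windows start `step_days` apart and each spans
    `window_days` days. Assumption: a window is emitted only if it
    fits entirely within [start, end].
    """
    span = timedelta(days=window_days - 1)
    step = timedelta(days=step_days)
    current = start
    while current + span <= end:
        yield current, current + span
        current += step


# Usage: 30-day windows with a one-day step over a short period.
windows = list(rolling_windows(date(2014, 3, 1), date(2014, 3, 31)))
```

Each window would then be used to select the news articles whose feed_date falls inside it before the signed network for that window is built.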
3 Results
Legislative elections were held in Indonesia on 9 April 2014, in which 12 parties contested seats in the People's Representative Council (DPR). The result is highly consequential for the presidential election, because a presidential candidate must be supported by a party or coalition of parties that won at least 20% of the seats or 25% of the popular vote in the legislative election.
The Indonesian presidential election was held on July 9, 2014. Former general Prabowo Subianto (PS-HR) was supported by the following coalition of parties: Great Indonesia Movement Party (GER), Golkar Party (PG), United Development Party (PPP), Prosperous Justice Party (PKS), National Mandate Party (PAN), Crescent Star Party (PBB), and Democratic Party (PD). He contested the election against Joko Widodo (JW-JK), who was supported by the following coalition of parties: People's Conscience Party (HAN), NasDem Party (NAS), National Awakening Party (PKB) and Indonesian Justice and Unity Party (PKPI). On 22 July the General Elections Commission (KPU) announced Joko Widodo's victory. He and his vice president, Jusuf Kalla, were sworn in on 20 October 2014 for a 5-year term.
3.1 Cluster Analysis
Table 1 shows the statistics of Indonesia's political elite network over 245 observation windows. In general, these networks are sparse, and most of the edges are positive. We applied the community detection algorithm to identify cohesive groups of political actors with as few internal negative ties as possible [8]. Table 1 also shows that, despite the large number of clusters, the majority of nodes are in the four largest clusters.
Table 1. The statistics of Indonesia's political elite network across the observation windows.
Statistics                                               Value
Average number of nodes                                  236.4
Average number of edges                                  2261.6
Average number of negative edges                         408.8
Average clustering coefficient                           0.47
Average diameter                                         6.18
Average degree of node                                   18.45
Average number of clusters                               37.5
Average number of actors in the 4 largest clusters (%)   69.5
Next, we characterize each cluster based on the party attributes of the political actors within it. We find that although party members are spread across a number of clusters, the majority of a party's members tend to group together, either forming a homogeneous party cluster or joining the majority of another party's members to form a coalition cluster. Based on the composition of party members in each cluster, we assign party $p_i \in P$ to the cluster $c_j \in C$ that contains more members of $p_i$ than any other cluster. In this way we can evaluate the process of coalition formation and the political polarization between these coalitions. We conduct the analysis in three time periods: (i) legislative-election period: March–May, 2014; (ii) presidential-election period: June–July, 2014; (iii) post-election period: August–October, 2014.
Period I (March–May, 2014). Figure 2 shows the dynamics of party affiliation in the four largest clusters during the legislative-election period. As shown in Fig. 2, the clusters of actors in this period do not show a clear and stable coalition pattern. Political elites tend to group together with their own party members, forming homogeneous party clusters. This is understandable because, during the legislative election period, political parties focus on competing for legislative seats rather than building coalitions for the presidential election. However, the tendency of parties to form coalitions begins to appear in early May. Figure 2 also shows that most parties are in small clusters (the white color in Fig. 2), while the four largest clusters are dominated by large parties (e.g. PDIP, GER, PG, DEM).
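The rule used above to assign each party to a cluster (the cluster holding the largest number of that party's members) can be illustrated with a short sketch. The function name `assign_parties_to_clusters` and the dictionary-based input layout are assumptions made for illustration; how ties are broken is not specified in the text, so here the cluster encountered first wins.

```python
from collections import Counter, defaultdict


def assign_parties_to_clusters(actor_party, actor_cluster):
    """Assign each party to the cluster containing most of its members.

    actor_party:   dict mapping actor -> party label
    actor_cluster: dict mapping actor -> cluster id (for one time window)
    Returns a dict mapping party -> cluster id.
    Assumption: ties go to the cluster that was counted first.
    """
    members_per_cluster = defaultdict(Counter)
    for actor, party in actor_party.items():
        cluster = actor_cluster.get(actor)
        if cluster is not None:  # skip actors absent from this window
            members_per_cluster[party][cluster] += 1
    return {
        party: counts.most_common(1)[0][0]
        for party, counts in members_per_cluster.items()
    }
```

Two parties mapped to the same cluster in a given window can then be read as belonging to the same coalition cluster in that window.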
Fig. 2. The dynamics of coalition formation in the legislative-election period. Colors represent the four largest actor groups (red: 1st largest cluster; green: 2nd largest cluster; blue: 3rd largest cluster; cyan: 4th largest cluster; white: other clusters).
The meso-structure of the political elite network in Figs. 3a-c (left) shows that actor clusters tend to have homogeneous party attributes, meaning that political actors are more likely to group within their respective party clusters. Meanwhile, the community networks in Figs. 3a-c (right) show that tensions between actor clusters eased as the legislative election day approached. On March 3, the relations between clusters are dominated by negative edges. A different picture emerges in the community network on April 9 (legislative election day), where all relations are positive edges. This can be explained by two things: first, the political elite began to open political communication for coalition purposes in the presidential election; second, informal agreements existed among political elites to reduce tension in order to prevent the emergence of political conflicts at the grassroots.
Fig. 3. Visualization of political elite networks (left) and community networks (right) in legislative-election period (March–May, 2014): a. March 3; b. April 9; c. May 9. Colored nodes encode party attributes, colored edges encode the sign of relation. Community networks only present affiliated clusters where node size indicates the number of actors in the cluster. The node number in the community network indicates the community number as shown in the actor network, while the number on the edge indicates the weight of the edge.
Period II (June–July, 2014). In this period, the party attributes in the actor cluster show a pattern that resembles the party coalition in the 2014 Indonesian Presidential Election. As shown in Fig. 4, this grouping pattern is relatively stable, with the exception of a number of smaller parties such as PPP, HAN and PBB. This is because there are internal conflicts in these parties related to support for presidential candidates (Fig. 5).
Fig. 4. The dynamics of coalition formation in the presidential-election period. Colors represent the size of the four largest actor groups (red: 1st largest cluster; green: 2nd largest cluster; blue: 3rd largest cluster; cyan: 4th largest cluster; white: other clusters).
Fig. 5. Visualization of political elite networks (left) and community networks (right) in presidential-election period (June–July, 2014): a. June 25; b. July 9. Colored nodes encode party attributes, colored edges encode the sign of relation. Community networks only present affiliated clusters where node size indicates the number of actors in the cluster. The node number in the community network indicates the community number as shown in the actor network, while the number on the edge indicates the weight of the edge.
Period III (August–October, 2014). As shown in Fig. 6, two coalition clusters consistently emerge in the post-election period, at least until October 20, when the elected president was inaugurated. The cleavage pattern in the elite network in Figs. 7a-c differs little from that of the previous period. However, the increasing intensity of negative relations between clusters indicates that political polarization strengthened in this period. One of the important post-election events that led to increased political tensions was the failure of the election-winning coalition to seize leadership positions in the legislature.
Fig. 6. The dynamics of coalition formation in the post-election period. Colors represent the size of the four largest actor groups (red: 1st largest cluster; green: 2nd largest cluster; blue: 3rd largest cluster; cyan: 4th largest cluster; white: other clusters).
Fig. 7. Visualization of political elite networks (left) and community networks (right) in post-election period (Aug–Oct, 2014): a. Aug. 5; b. Sept. 25; c. Oct. 8. Colored nodes encode party attributes, colored edges encode the sign of relation. Community networks only present affiliated clusters where node size indicates the number of actors in the cluster. The node number in the community network indicates the community number as shown in the actor network, while the number on the edge indicates the weight of the edge.
3.2 Polarization Analysis
We quantify political polarization using a dyad-based tension index [2], which is the proportion of negative relations to the total relations within (between) clusters, as follows:
$$ CID = \frac{\sum_{e_{ij} < 0} |e_{ij}|}{\sum_{e_{ij}} |e_{ij}|}, \quad CID \in [0, 1] \tag{4} $$
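As a concrete reading of Eq. (4), the sketch below computes the tension index separately for within-cluster and between-cluster edges. It is an illustration under stated assumptions: the function name `tension_index`, the `(u, v, sign)` edge tuples, and the `scope` argument are hypothetical, and the index is assumed to be 0 when no edges fall in the requested scope (the text does not cover that case).

```python
def tension_index(edges, membership, scope="between"):
    """Proportion of negative relations among within- or between-cluster edges.

    edges:      iterable of (u, v, sign) with sign in {-1, +1} (or a signed weight)
    membership: dict mapping node -> cluster id
    scope:      "within" or "between"
    Assumption: returns 0.0 if there are no edges in the chosen scope.
    """
    negative = 0.0
    total = 0.0
    for u, v, sign in edges:
        same_cluster = membership[u] == membership[v]
        # Keep within-cluster edges for scope="within", the rest for "between".
        if (scope == "within") != same_cluster:
            continue
        total += abs(sign)
        if sign < 0:
            negative += abs(sign)
    return negative / total if total else 0.0
```

Evaluating this function once per observation window, for both scopes, would give two time series of the kind plotted in a polarization figure.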
As shown in Fig. 8, the proportion of negative relations within clusters is very small, while the tension index between clusters is relatively large and dynamic. In the legislative election period, polarization occurs among party clusters, with the value of polarization decreasing towards election day. The polarization between coalition clusters then increases sharply in the presidential-election period. In the post-election period, political polarization remains high and continues to increase ahead of the election for the leadership of the legislative body. After the inauguration of the elected president at the end of October, political polarization begins to decline, as indicated by the decreasing tension index and the emergence of a single cluster within the network of Indonesian political elites.
Fig. 8. The dynamics of polarization in the network of Indonesian political elites in the 2014 General Election. Vertical lines mark election day.
4 Conclusions
In this study we investigate the dynamics of coalition formation and political polarization in the networks of Indonesian political elites during the 2014 Indonesian General Elections. For that purpose, we build a signed network of inferred relations between Indonesian political actors based on their opinions (agree/disagree) on contentious political issues. We apply a community detection algorithm to identify conflicting groups of political actors and characterize the identified clusters based on the party attributes of political actors. We visualize the networks of political elites and the community networks, and quantify political tensions within and between clusters to trace the dynamics of polarization over time. Analysis of the identified clusters shows that no coalition pattern is present during the legislative-election period, when political actors are more likely to cluster within their respective party groups. The intensity of polarization between clusters in this period is lower than in the following two periods, with a downward trend ahead of the legislative election day. We find that coalition groups begin to form during the presidential-election period, and the emerging pattern resembles the configuration of party coalitions in the 2014 Indonesian presidential election. The process of coalition formation is accompanied by an increase in the intensity of polarization between coalition clusters. This grouping pattern persists and is stable in the post-election period, when political polarization continues to increase ahead of the election for the leadership of the legislature.
References
1. Neal, Z.: A sign of the times? Weak and strong polarization in the U.S. Congress, 1973–2016. Soc. Networks 60, 103–112 (2020)
2. Maulana, A., Khanafiah, D.: The dynamics of political parties' coalition in Indonesia: the evaluation of political party elites' opinion. SSRN Electron. J. (2009)
3. Moody, J., Mucha, P.: Portrait of political party polarization. Netw. Sci. 1, 119–121 (2013)
4. Waugh, A., Pei, L., Fowler, J., Mucha, P., Porter, M.: Party polarization in congress: a network science approach. arXiv (2009)
5. Maulana, A., Situngkir, H.: Media polarization on Twitter during 2019 Indonesian election. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) Complex Networks & Their Applications IX, pp. 660–670. Springer International Publishing (2021). https://doi.org/10.1007/978-3-030-65347-7_55
6. Traag, V., Bruggeman, J.: Community detection in networks with positive and negative links. Phys. Rev. E 80 (2009)
7. Maoz, Z.: Network polarization, network interdependence, and international conflict, 1816–2002. J. Peace Res. 43, 391–411 (2006)
8. Esmailian, P., Jalili, M.: Community detection in signed networks: the role of negative ties in different scales. Sci. Rep. 5 (2015)
9. Rosvall, M., Bergstrom, C.: Fast stochastic and recursive search algorithm (2009)
Navigating Multidisciplinary Research Using Field of Study Networks
Eoghan Cunningham1,2(✉), Barry Smyth1,2, and Derek Greene1,2
1 School of Computer Science, University College Dublin, Dublin, Ireland [email protected]
2 Insight Centre for Data Analytics, University College Dublin, Dublin, Ireland
Abstract. This work proposes Field of Study networks as a novel network representation for use in scientometric analysis. We describe the formation of Field of Study (FoS) networks, which relate research topics according to the authors who publish in them, from corpora of articles where fields of study can be identified. FoS networks are particularly useful for the distant reading of large datasets of research papers, through the lens of exploring multidisciplinary science. To support this, we include case studies which explore multidisciplinary research in corpora of varying size and scope; namely, 891 articles relating to network science research and 166,000 COVID-19 related articles.
Keywords: Network analysis · Scientometrics · Multidisciplinarity
1 Introduction
In line with the recognised benefits of multidisciplinary and interdisciplinary collaboration in scientific research [10,15], a trend has emerged towards greater levels of interdisciplinary research [11]. A common means of understanding these research processes is through the lens of network analysis. For instance, given a collection of research papers and their associated metadata, we can construct a variety of different network representations, including co-authorship networks [6,7] and citation networks [8]. Such representations serve to highlight the collaboration patterns between individual researchers at a micro level. However, in other cases we might be interested in examining collaboration patterns between researchers coming from different disciplines at the macro level. In particular, we might wish to study how these patterns evolve over time in response to changing research funding landscapes or exogenous events, such as the COVID-19 pandemic. In this work, our aim is to propose a practical "distant reading" approach to help reveal collaborative research patterns in large scientific corpora, in order to better understand the nature and implications of these patterns. This concept of distant reading has been considered in other contexts as a means of exploring large volumes of data from a macro level perspective, to identify specific niche areas of interest for closer inspection [13]. In this work, we present a novel graph representation, the Field of Study (FoS) network, which facilitates the investigation of multidisciplinary and interdisciplinary research in corpora of scientific research articles at the macro level. A core contribution of Field of Study networks is the use of author-topic relations; a FoS network is populated by fields of study (or research topics), which are related to one another according to the authors who publish in them. In Sect. 3 we describe how these networks can be constructed from the topics/fields of study that have been assigned to research papers. In Sect. 4 we describe two exploratory case studies, which analyse the FoS networks arising from datasets of differing scope and size. These case studies suggest that FoS networks can provide a useful tool for the distant reading of large corpora of research articles, as well as for conducting quantitative analysis to understand the relationship between scientific disciplines.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 104–115, 2022. https://doi.org/10.1007/978-3-030-93409-5_10
2 Related Work
Multidisciplinary research is most commonly defined as research which draws on expertise, data or methodology from two or more disciplines. Most formal definitions distinguish interdisciplinary research as an extension of multidisciplinary research, which involves the integration of methodologies from the contributing disciplines [4]. There are numerous analyses which explore multi- or interdisciplinary research, and investigate the relationship between scientific disciplines. Many studies define metrics to quantify research interdisciplinarity at the author or paper level [16,17], often in order to investigate a correlation between interdisciplinarity and research impact [10,15], productivity or visibility [12]. Typically, works which integrate methods and ideas from a diverse set of disciplines are found to have greater research impact and visibility compared to those that do not [12,15]. As such, we can identify several examples of analyses which investigate cross-disciplinary collaboration and map areas of multidisciplinary research, often drawing on methods from network science [6,8,9,18,20]. Co-authorship networks can provide an effective means of representing research collaborations. Here researchers are represented by nodes and collaborations are encoded via the edges between them. Thus, research teams are identified as fully-connected components of the graph. In cases where research backgrounds can be identified among the authors in the network, this can be used to quantify the level of multidisciplinary collaboration. These methods have been used to reveal a strong disciplinary homophily between researchers, despite showing that those with diverse neighbourhoods tend to have higher research impact [6]. Another common representation used to investigate interdisciplinary research is the citation network, typically constructed at the article or journal level.
Analyses of citation networks can highlight influential or “disruptive” articles in interdisciplinary research [20], as well as “boundary” papers which span multiple disciplines [8]. Indeed community finding approaches have been employed to automatically group articles in citation networks into their respective fields of study [18], so that interdisciplinary interactions can then be explored at the macro level. An alternative strategy is to apply text analysis to article abstracts in order to cluster articles together which relate to similar research topics [9,18].
This is typically based on term co-occurrence patterns, rather than based on article citation patterns. Of course, connections between topics in each of these representations can differ greatly, as fields of study which are distant in their citation patterns may be closely linked semantically. Here we propose an alternative network representation, which relates fields of study according to the authors who typically publish in those fields. This Field of Study network may be used in conjunction with more conventional network representations—in much the same way that semantic networks have been shown to complement citation networks [18]—but in Sect. 4 we show that, on their own, FoS networks can provide an effective means of exploring large collections of research articles, particularly in revealing author multidisciplinarity.
3 Methods
In this section we formalise the definition of a Field of Study (FoS) network and explain how these networks can be generated from existing research resources. In Sects. 3.2 and 3.3 we describe two FoS variations: the static FoS network and the temporal FoS network respectively.
3.1 Field of Study Networks
Formally, a Field of Study (FoS) network is defined as a general graph representation of a collection of research articles (R), written by a set of authors (A), and denoted F = (N, E). The nodes (N) represent identifiable research topics (i.e. the fields of study) and the edges (E) represent authorship relations between pairs of topics. These relations are aggregated across multiple associated research papers. Below we describe how a FoS network can be constructed from a more conventional authorship graph, and we argue that FoS networks are particularly well-suited to analysing the nature of collaboration within the scientific literature, especially as they relate scientific fields of study according to the researchers/authors who publish in them.
The formation of a FoS network depends on the availability of field of study labels for a given set of research papers. These could be derived via manual annotations, the application of automated text mining methods, or some combination of the two. For instance, topic modelling techniques have been shown to be successful in extracting research topics from corpora of research articles and assigning papers to those fields [9]. In fact, many research databases and search engines employ these techniques (or manual classification) to assign research articles or academic journals to fields of study. For example, the Microsoft Academic Graph (MAG)¹ maintains a deep hierarchy of Fields of Study which it assigns to papers; Web of Science (WOS)² groups journals into 258 Subject Categories; Scopus³ employs experts to assign All Science Journal Classification (ASJC) codes to all journals covered by its index.
For the purpose of the case studies described later in Sect. 4, we use MAG fields of study to categorise research papers and construct FoS networks. The deep MAG field of study hierarchy is desirable as it supports the construction of FoS networks at varying levels of detail, from the broadest research disciplines (level 0) to the specific topics and sub-topics that exist within a particular discipline (levels 4 and 5). It is worth noting that the Microsoft Academic Graph may not always be an appropriate source for field of study data. For instance, the corpus does not provide full coverage of all research disciplines, and the massive FoS hierarchy may contain some spurious connections due to its size and semi-automated construction. However, the methods that we propose are not specific to the MAG hierarchy, and are designed to generalise to any scenario where fields of study can be identified at the appropriate level of detail.
¹ https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/.
² https://clarivate.com/webofsciencegroup/solutions/web-of-science/.
³ https://www.scopus.com/home.uri.
3.2 Static FoS Networks
The formation of a static FoS network from a collection of research articles is best described as the two-step process illustrated in Fig. 1. In the first step, an unweighted bipartite graph is generated from identifiable fields of study and their contributing authors; see Fig. 1a. In the second step, this graph is used to generate a projection (the FoS network) in which a weighted undirected edge exists between two fields if and only if at least one author has published research in both fields; see Eq. 1, where $a \in A$ and $N$ is the set of fields identifiable in $R$. The resulting edge weights correspond to the number of such authors who publish in both fields (Eq. 2).
$$ E = \{ (n_i, n_j) : \exists\, a \in A,\ \mathrm{published}(a, n_i) \wedge \mathrm{published}(a, n_j) \} \tag{1} $$
$$ w(n_i, n_j) = |\{ a \in A : \mathrm{published}(a, n_i) \wedge \mathrm{published}(a, n_j) \}| \tag{2} $$
Fig. 1. The formation of a static Field of Study (FoS) network involving two steps: (a) creation of a bipartite network of authors and fields; (b) projection to an undirected network of fields.
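The two-step construction in Fig. 1 and Eqs. 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `static_fos_network`, the `(author, field)` input format and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def static_fos_network(publications):
    """Project an author-field bipartite graph onto the fields.

    publications: iterable of (author, field) pairs (hypothetical input format).
    Returns a dict mapping frozenset({n_i, n_j}) to the number of authors who
    published in both fields, i.e. the edge weight of Eq. 2.
    """
    # Step 1: bipartite graph, stored as the set of fields per author.
    fields_by_author = defaultdict(set)
    for author, field in publications:
        fields_by_author[author].add(field)

    # Step 2: projection. Each author adds 1 to every pair of fields they cover.
    weights = defaultdict(int)
    for fields in fields_by_author.values():
        for n_i, n_j in combinations(sorted(fields), 2):
            weights[frozenset((n_i, n_j))] += 1
    return dict(weights)


# Toy data, invented for illustration only.
publications = [
    ("alice", "Statistics"), ("alice", "Virology"),
    ("bob", "Statistics"), ("bob", "Virology"), ("bob", "Algorithm"),
    ("carol", "Algorithm"),
]
network = static_fos_network(publications)
```

Undirected edges are keyed by a `frozenset` so that (n_i, n_j) and (n_j, n_i) refer to the same edge; an author who publishes in a single field contributes no edge.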
Fig. 2. Illustrative example of a temporal Field of Study (FoS) network, involving two steps: (a) creation of a bipartite network of authors and fields; (b) projection to a directed network of fields.
3.3 Temporal FoS Networks
It is further possible to encode temporal information in a FoS network as directed edges, which allows us to study changes in multidisciplinary research patterns over time. Temporal FoS networks can be visualised in a time-unfolded representation, where the data is divided into a sequence of two or more discrete time steps, as frequently employed in dynamic network analysis tasks. Nodes are duplicated for each time step so that authors can be connected to any fields in which they publish research during a given time step. As an example, Figs. 2a and 2b illustrate the two stages in the formation of a temporal FoS network, showing an instance of a temporal FoS network with respect to two time-points (t_n and t_{n+1}) on either side of some event e, so that t_n < t_e < t_{n+1}. The temporal FoS network in Fig. 2b contains a directed edge between two fields (n_i, n_j) if an author published in field n_i at time t_n (before event e) and in field n_j at time t_{n+1} (after event e), as given in Eq. 3. Later, in Sect. 4.3, we present COVID-19-related research in the context of the research backgrounds of the contributing authors with the start of the pandemic serving as the defining event.

\[ E = \{ (n_i, n_j) : \exists\, a \in A,\ \mathrm{published}(a, n_i, t_n) \wedge \mathrm{published}(a, n_j, t_{n+1}) \} \tag{3} \]
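As a rough sketch of Eq. 3, the directed edges can be derived by splitting each author's publications at the event and pairing every earlier field with every later one. The function name `temporal_fos_network`, the `(author, field, time)` input format, the choice to weight each edge by the number of contributing authors, and the toy data are assumptions made for illustration; the text above does not prescribe them.

```python
from collections import defaultdict


def temporal_fos_network(publications, event_time):
    """Build directed edges (n_i, n_j) across a single event.

    publications: iterable of (author, field, time) triples (hypothetical format).
    event_time: records with time < event_time count as "before", all others
    as "after" (an assumption about how the event boundary is handled).
    Returns a dict mapping (n_i, n_j) to the number of authors who published
    in n_i before the event and in n_j after it.
    """
    before = defaultdict(set)
    after = defaultdict(set)
    for author, field, time in publications:
        side = before if time < event_time else after
        side[author].add(field)

    weights = defaultdict(int)
    # Only authors active on both sides of the event can contribute an edge.
    for author in before.keys() & after.keys():
        for n_i in before[author]:
            for n_j in after[author]:
                weights[(n_i, n_j)] += 1
    return dict(weights)


# Toy data, invented for illustration only.
records = [
    ("alice", "Pathology", 2018), ("alice", "Algorithm", 2020),
    ("bob", "Pathology", 2019), ("bob", "Algorithm", 2020),
    ("bob", "Virology", 2020),
    ("carol", "Algorithm", 2017),
]
temporal = temporal_fos_network(records, event_time=2020)
```

Because nodes are duplicated per time step, an edge such as ("Pathology", "Pathology") would be legitimate here: it links the pre-event copy of a field to its post-event copy.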
4 Case Studies
In what follows we describe two illustrative examples to demonstrate the utility of FoS representations. In the first case study, presented in Sect. 4.1, we consider the use of static FoS networks to explore aspects of multidisciplinary research in the area of network science. The second case study demonstrates the
use of both static and temporal FoS networks in the context of a large-scale dataset of research articles relating to the COVID-19 pandemic and is presented in Sects. 4.2 and 4.3.

4.1 Multidisciplinary Research in Network Science
Figure 3 presents two static FoS networks produced using Microsoft Academic Graph metadata for 891 research articles published in the area of network science. To form this dataset, we collect all available papers published in 5 network science journals between the years 2015 and 2019 inclusive: Applied Network Science, Social Network Analysis and Mining, Network Science, Complex Systems, and Computational Social Networks. We use MAG fields of study metadata to categorise these research papers. The MAG uses hierarchical topic modelling to identify and assign research topics to individual papers, each of which represents a specific field of study [19]. To date, this approach has identified a hierarchy of over 700,000 topics within the Microsoft Academic Knowledge corpus, and the average paper in the set of 891 network science articles is assigned to 9 such topics. To produce a more useful categorisation of articles, we first reduce the number of topics, by replacing each field with its parent, to consider topics at two levels in the FoS hierarchy:

1. The 19 FoS labels at level 0, which we refer to as ‘disciplines’.
2. The 292 FoS labels at level 1, which we refer to as ‘sub-disciplines’.

In this way, each article is associated with a set of disciplines (e.g. ‘Medicine’, ‘Physics’, ‘Engineering’) and sub-disciplines (e.g. ‘Virology’, ‘Particle Physics’, ‘Electronic Engineering’), which are identified by traversing the FoS hierarchy from the fields originally assigned to the paper. Note that some MAG sub-disciplines belong to more than one discipline. For example, Biochemistry is a child of both Chemistry and Biology.

Figure 3a illustrates the resulting FoS network when network science articles are categorised at the discipline level. Each node (or discipline) in this FoS network can then be decomposed into its sub-disciplines as shown in Fig. 3b. From Fig. 3, we can begin to understand the respective roles of the many fields of study represented in network science. Highly central in Fig. 3b are the fields which represent the technical and methodological foundations of network science research. The sub-disciplines of Mathematics and Computer Science such as ‘Algorithm’, ‘Combinatorics’, and ‘Statistics’ have high degree centrality (ranked 3rd, 8th and 9th respectively), because they are identified across the majority of network science research papers. Some fields beyond the disciplines of Computer Science and Mathematics, such as ‘Social Psychology’, ‘Social Science’, and ‘Law’ have high betweenness centrality in the FoS network (ranked 1st, 4th and 6th, respectively). This is likely because they help to bridge network science methods to their interdisciplinary applications. In particular, in the upper right corner of Fig. 3b we can see a group of fields which reflect the proliferation of recent
studies of social media networks from the perspective of sociology and political science.

Community detection methods can be used to categorise the topics in the FoS network too. Figure 4 shows the network from Fig. 3b, but with the nodes colour-coded to show cluster memberships identified using the Louvain method [2]. This technique identified 6 clusters in the graph, containing as few as 2, and as many as 16 topics. The clusters shown in Fig. 4 differ from the MAG categorisation illustrated in Fig. 3b, because they show how these topics relate in the context of network science specifically, rather than in the MAG hierarchy as a whole. Broadly, the clusters could be categorised as: (i) the core theoretical and methodological topics in network science (15 topics: statistics, theoretical computer science, combinatorics, etc.), (ii) research relating to computer networks [3] (5 topics: telecommunications, distributed computing, etc.), (iii) social network analysis (16 topics: social science, social psychology, media studies, etc.), (iv) networks in machine learning [1] (6 topics: artificial intelligence, computer vision, natural language processing), (v) applications in biology (3 topics: molecular biology, genetics, biochemistry), (vi) applications in medicine (2 topics: pathology and surgery).

4.2 COVID-19 Research and the Effect on Multidisciplinarity
FoS networks can be used to evaluate the degree of an author’s multidisciplinarity, that is, the extent to which they publish in different disciplines. For example, [5] describes an in-depth analysis of the effect of COVID-19 research on author multidisciplinarity using static FoS networks. For completeness, we summarise the construction and use of FoS networks in this way for this case study. We construct five annual FoS networks from all available research articles by authors who published work related to COVID-19 using the COVID-19 Open Research Dataset (CORD-19)⁴. From CORD-19, we identify all authors who published COVID-19 related research in 2020, and collect MAG metadata for any available research articles they published between 2016 and 2020 inclusive. In total, we collect 5,389,445 articles published 2016–2020, including 166,356 articles which relate to COVID-19. Next, using the 292 MAG sub-disciplines, we build a FoS network for each year in the dataset. The nodes in these networks represent MAG sub-disciplines, and they can be divided into 19 overlapping communities based on their assignment to MAG disciplines. This facilitates the characterisation of edges in the FoS network: an edge within a community represents an author publishing in two sub-disciplines within the same parent discipline, while an edge between communities represents an author publishing in two sub-disciplines from different parent disciplines. For example, if an author publishes research in ‘Machine Learning’ and ‘Databases’, then the resulting edge is within the community/discipline of ‘Computer Science’. Conversely, if an author publishes in ‘Machine Learning’ and ‘Radiography’, the resulting edge is between the ‘Medicine’ and ‘Computer
⁴ https://www.semanticscholar.org/cord19.
(a) Disciplines or level 0 fields of study.
(b) Sub-disciplines or level 1 fields of study.
Fig. 3. FoS Networks for research published in 5 network science journals during 2015–2019. Node size encodes the number of papers attributed to a field of study. In (b) nodes are coloured to represent the parent discipline of the field of study. Edges are coloured to show the parent discipline if the edge is within a discipline/community. Edges between communities are not coloured.
Science’ communities. In this way, an edge between disciplines may represent either a single instance of interdisciplinary research or two separate instances of research, in two different disciplines, by the same author. To explore changes in author multidisciplinarity, we compare the proportion of the total number of edges in the network that are external (i.e. between communities). Figure 5 plots the odds ratio effect sizes when the proportion of external edges in an annual FoS network is compared with that of the previous year. We report these
Fig. 4. FoS Network for research published in 5 network science journals during 2015–2019. Nodes are coloured to show clusters identified by Louvain.
scores per community/discipline. We also include a second FoS network for 2020 which excludes any research related to COVID-19, and report an additional odds ratio for the comparison of the 2020-non-COVID network with the 2019 network. Thus, FoS networks have been used to reveal a trend towards greater multidisciplinarity year-on-year. This trend appears to have been accelerated by COVID-19 research, and the increase is shown to be greater in some disciplines.

4.3 Close Reading Case Studies in COVID-19 Research
Figure 5 shows an increase in author multidisciplinarity in many fields of study as a result of COVID-19 research. In this section we illustrate how this phenomenon can be explored further by using temporal FoS networks to compare the pre-COVID (2016–2019) and COVID (COVID-19 related research in 2020) time periods. As an illustration, Fig. 6 presents COVID-19 related research in the field of Computer Science, with pre-COVID nodes on the left (representing the authors’ research backgrounds) and COVID nodes on the right (representing the FoS characterisation of the COVID related research). To highlight the strongest trends that exist, the FoS network shows only the top-50 edges by weight. We note that authors from diverse research backgrounds contribute articles related primarily to ‘Surgery’, ‘Pathology’, and ‘Machine Learning’. To conduct further close reading, we can narrow the list of articles by considering only those papers that contribute a particular edge to the FoS network. For example, we can search for COVID-related papers which result in the edge
Fig. 5. Multidisciplinarity of authors who published COVID-19-related research, by discipline. For each year, we report the odds ratio effect size when the proportion of edges that are between communities is compared with that of the previous year. ‘All disciplines’ reports these scores for the entire network. Also reported are scores for individual communities in the graph, which represent disciplines. Bars are plotted to show a 95% confidence interval.
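The year-on-year comparison summarised in Fig. 5 can be illustrated with a short sketch. Several choices below are assumptions rather than details taken from the text: an edge is treated as internal when its two sub-disciplines share at least one parent discipline (one way to handle overlapping communities), the interval is the standard Wald interval on the log odds ratio, and the function names and toy edge lists are hypothetical.

```python
import math


def count_edges(edges, parents):
    """Return (external, internal) edge counts.

    edges: iterable of (n_i, n_j) sub-discipline pairs.
    parents: dict mapping each sub-discipline to its set of parent disciplines.
    Assumption: an edge is internal if the two fields share any parent.
    """
    external = 0
    internal = 0
    for n_i, n_j in edges:
        if parents[n_i] & parents[n_j]:
            internal += 1
        else:
            external += 1
    return external, internal


def odds_ratio(ext_now, int_now, ext_prev, int_prev, z=1.96):
    """Odds ratio of an edge being external, current year vs previous year,
    with a Wald confidence interval (95% for z = 1.96).
    All four counts must be non-zero."""
    ratio = (ext_now / int_now) / (ext_prev / int_prev)
    se = math.sqrt(1 / ext_now + 1 / int_now + 1 / ext_prev + 1 / int_prev)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Toy hierarchy and edge lists, invented for illustration only.
parents = {
    "Machine Learning": {"Computer Science"},
    "Databases": {"Computer Science"},
    "Radiography": {"Medicine"},
    "Biochemistry": {"Chemistry", "Biology"},
    "Genetics": {"Biology"},
}
edges_2019 = [
    ("Machine Learning", "Databases"),
    ("Biochemistry", "Genetics"),
    ("Machine Learning", "Radiography"),
]
edges_2020 = [
    ("Machine Learning", "Databases"),
    ("Machine Learning", "Radiography"),
    ("Databases", "Radiography"),
    ("Genetics", "Radiography"),
]
ext_2019, int_2019 = count_edges(edges_2019, parents)
ext_2020, int_2020 = count_edges(edges_2020, parents)
ratio, lower, upper = odds_ratio(ext_2020, int_2020, ext_2019, int_2019)
```

An odds ratio above 1 indicates that a larger share of edges cross discipline boundaries than in the previous year. With counts as small as in this toy example the interval is very wide, so the sketch shows the calculation only, not a meaningful effect size.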
Fig. 6. Temporal FoS Network presenting COVID-19-related research in Computer Science, produced from 9,004 COVID-related research papers which were attributed to the MAG field ‘Computer Science’.
between Pathology and Algorithm; these are COVID-related articles containing the topic Algorithm, in which the authors have previously published research in the field of Pathology. To better understand the papers in this subset, we can explore the lower-level MAG topics that are most commonly identified amongst them, or the keywords which occur most frequently in their titles and abstracts. For additional discussion of close reading of this corpus, see [5]. One approach to close reading is to search for articles which cite a large proportion of the papers in a given subset. For instance, in the case of the papers linking ‘Pathology’ to ‘Algorithm’, we find a review paper describing the push for machine learning solutions to COVID-19 detection: “Artificial Intelligence in the Battle Against Coronavirus” [14]. In this way, it is possible to understand in detail the patterns of multidisciplinarity that were identified at the distant reading level. FoS networks can also help to identify novel review papers that bring together ideas from several different fields, papers which may have been hidden in more traditional citation network representations.
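The two close-reading steps described above (selecting the papers behind one edge, then ranking citing articles by how much of that subset they cite) might be sketched as follows. The record layout, the function names `papers_for_edge` and `citing_coverage`, and the toy data are hypothetical; real MAG or CORD-19 metadata would need to be mapped into this shape first.

```python
def papers_for_edge(covid_papers, author_background, source_field, target_field):
    """Return ids of post-event papers that contribute the edge
    (source_field, target_field) to the temporal FoS network.

    covid_papers: list of dicts with keys 'id', 'authors', 'fields'.
    author_background: dict mapping author to the set of fields they
    published in before the event.
    """
    return [
        paper["id"]
        for paper in covid_papers
        if target_field in paper["fields"]
        and any(
            source_field in author_background.get(author, set())
            for author in paper["authors"]
        )
    ]


def citing_coverage(subset_ids, citations):
    """Rank citing articles by the fraction of the subset that they cite.

    citations: dict mapping a citing article id to the set of ids it cites.
    Returns a list of (citing_id, fraction) pairs, highest fraction first.
    """
    subset = set(subset_ids)
    if not subset:
        return []
    scores = {
        citing: len(cited & subset) / len(subset)
        for citing, cited in citations.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for illustration only.
covid_papers = [
    {"id": "p1", "authors": ["alice"], "fields": {"Algorithm"}},
    {"id": "p2", "authors": ["bob", "carol"], "fields": {"Algorithm", "Surgery"}},
    {"id": "p3", "authors": ["dave"], "fields": {"Surgery"}},
]
author_background = {
    "alice": {"Pathology"},
    "bob": {"Virology"},
    "carol": {"Pathology"},
    "dave": {"Pathology"},
}
citations = {
    "review": {"p1", "p2", "x"},
    "other": {"p1"},
    "unrelated": {"x"},
}
subset = papers_for_edge(covid_papers, author_background, "Pathology", "Algorithm")
ranked = citing_coverage(subset, citations)
```

In this toy example the article labelled "review" cites every paper in the subset and so ranks first, which mirrors how a review paper would surface under this approach.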
5 Conclusions and Future Work
In this work we propose Field of Study (FoS) networks as a novel representation for exploring the relationship between research topics at the macro-level. We describe the formation of two different types of FoS network, and provide case studies which illustrate how these networks can be used in the distant-reading of large corpora of research articles. In the case of network science research, we use FoS networks to explore the roles of different fields of study in multidisciplinary network science, and identify broad topics and applications in network science research. Similarly, in the case of COVID-19 research we investigate the relationship between fields of study within and between scientific disciplines to show an increase in multidisciplinarity in the context of COVID-19 research. Finally, we summarise the use of temporal FoS networks and methods of close-reading conducted on the COVID-19 research dataset in order to understand artefacts of multidisciplinarity identified in FoS networks.

There are a number of avenues for potential further research in this area. For example, in a corpus where full paper texts or abstracts are available, it may be informative to explore semantic relationships between the fields of study represented in the network. Similarly, citation information could be used to explore the flow or diffusion of information between communities. A multi-dimensional approach, which combines these methods (similar to that proposed by [18]), may prove a useful tool for scientometric analysis. Moreover, the FoS network we present may be used to explore multidisciplinarity, but not interdisciplinarity (as per the distinction offered in Sect. 2). Extending FoS networks to incorporate citation information may allow for the quantification of interdisciplinarity as many studies have used citation information to assess how articles “integrate” methods from different disciplines [16,17].

Acknowledgments. This research was supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289_P2.
References

1. Arora, M., Kansal, V.: Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis. Soc. Netw. Anal. Min. 9, 1–14 (2019)
2. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
3. Celik, A., Tetzner, J., Sinha, K., Matta, J.: 5G device-to-device communication security and multipath routing solutions. Appl. Netw. Sci. 4, 11 (2019)
4. Choi, B.C., Pak, A.W.: Multidisciplinarity, interdisciplinarity and transdisciplinarity in health research, services, education and policy: 1. Definitions, objectives, and evidence of effectiveness. Clin. Invest. Med. 29(6), 351–364 (2006)
5. Cunningham, E., Smyth, B., Greene, D.: Collaboration in the time of COVID: a scientometric analysis of multidisciplinary SARS-CoV-2 research. Humanit. Soc. Sci. Commun. 8, 240 (2021). https://doi.org/10.1057/s41599-021-00922-7
6. Feng, S., Kirkley, A.: Mixing patterns in interdisciplinary collaboration networks: assessing interdisciplinarity through multiple lenses. arXiv preprint arXiv:2002.00531 (2020)
7. Glänzel, W., Schubert, A.: Analysing scientific networks through co-authorship. In: Moed, H.F., Glänzel, W., Schmoch, U. (eds.) Handbook of Quantitative Science and Technology Research, pp. 257–276. Springer, Dordrecht (2004). https://doi.org/10.1007/1-4020-2755-9_12
8. Karunan, K., Lathabai, H.H., Prabhakaran, T.: Discovering interdisciplinary interactions between two research fields using citation networks. Scientometrics 113(1), 335–367 (2017)
9. Lafia, S., Kuhn, W., Caylor, K., Hemphill, L.: Mapping research topics at multiple levels of detail. Patterns 2(3), 100210 (2021)
10. Larivière, V., Haustein, S., Börner, K.: Long-distance interdisciplinarity leads to higher scientific impact. PLoS ONE 10(3), e0122565–e0122565 (2015)
11. Leahey, E.: From sole investigator to team scientist: trends in the practice and study of research collaboration. Ann. Rev. Sociol. 42(1), 81–100 (2016)
12. Leahey, E., Beckman, C.M., Stanko, T.L.: Prominent but less productive: the impact of interdisciplinarity on scientists’ research. Adm. Sci. Q. 62(1), 105–139 (2017)
13. Moretti, F.: Distant Reading. Verso Books, Brooklyn (2013)
14. Nguyen, T.T., Nguyen, Q.V.H., Nguyen, D.T., Hsu, E.B., Yang, S., Eklund, P.: Artificial Intelligence in the Battle against Coronavirus (COVID-19): A Survey and Future Research Directions. arXiv preprint arXiv:2008.07343 (2021)
15. Okamura, K.: Interdisciplinarity revisited: evidence for research impact and dynamism. Palgrave Commun. 5(1), 141 (2019)
16. Porter, A., Cohen, A., David Roessner, J., Perreault, M.: Measuring researcher interdisciplinarity. Scientometrics 72(1), 117–147 (2007)
17. Rafols, I., Meyer, M.: Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics 82(2), 263–287 (2010)
18. Raimbault, J.: Exploration of an interdisciplinary scientific landscape. Scientometrics 119(2), 617–641 (2019)
19. Shen, Z., Ma, H., Wang, K.: A web-scale system for scientific knowledge exploration. arXiv preprint arXiv:1805.12216 (2018)
20. Wu, L., Wang, D., Evans, J.A.: Large teams develop and small teams disrupt science and technology. Nature 566(7744), 378–382 (2019)
Public Procurement Fraud Detection: A Review Using Network Analysis

Marcos S. Lyra(✉), Flávio L. Pinheiro, and Fernando Bacao

Information Management School (IMS), Universidade Nova de Lisboa, Campus de Campo Lide, 10170-312 Lisboa, Portugal
[email protected]
Abstract. Public procurement fraud is a plague that produces significant economic losses in any state and society, but empirical studies to detect it in this area are still scarce. This article presents a review of the most recent literature on public procurement to identify techniques for fraud detection using Network Science. Applying the PRISMA methodology and using the Scopus and Web of Science repositories, we selected scientific articles and compared their results over a period from 2011 to 2021. Employing a compiled search string, we found cluster analysis and centrality measures as the most adopted techniques.

Keywords: Public procurement · Corruption · Network analysis · PRISMA
1 Introduction

Public procurement is a key step in providing public services to the population in the health, education, and infrastructure sectors (World Bank 2017). It is an instrument by which the public administration seeks to obtain the highest volume of goods, services, and construction at the lowest cost (Costa et al. 2020). Its sustainability is one of the elements that lead to the efficiency of the public sector (Silva Filho 2017). This process is considered one of the areas of highest risk of fraud (Whiteman 2019). It can occur at any point in the supply chain, making it difficult to detect and measure, especially after awarding a contract (Whiteman 2019). Scams in this sector are complex and sophisticated; they might occur at any stage of the purchase process (Rustiarini et al. 2019) and can be one of the most problematic crimes to investigate (Mufutau 2016). These stratagems occur widely, from corruption, bribery, embezzlement, collusion, abuse of power, favoritism, misappropriation, and nepotism (Padhi and Mohapatra 2011). However, even a suitably defined public procurement legislation is insufficient to control fraud, and efficient mechanisms for its prevention are inaccurate (Wensink and Vet 2013). That said, it is essential to carry out scientific investigations to provide theoretical and methodological bases to identify the behavior of companies and service providers in line with potential fraudulent actions in the bidding markets. In that sense, this systematic review aims to survey the available literature regarding methodologies used to identify and combat fraud in public procurement through network analysis, being a valuable instrument to detect solutions conceived and unveil possible existing gaps.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 116–129, 2022. https://doi.org/10.1007/978-3-030-93409-5_11
Public Procurement Fraud Detection
The challenge of this article in helping the fight against corruption is to conduct a review of the state-of-the-art to identify the latest scientific contributions based on Network Analysis techniques. According to the overview of the current subject and with the help of some literature on the issue (Shaffril et al. 2020), we came up with a research question: How can network science help in the disclosure of fraud in public procurement? To answer this question, we performed a cross-sectional consultation and a qualitative analysis following the adoption of 18 scientific articles.
2 Method

This systematic literature survey intends to capture the contributions of the scientific community to the topic of corruption in public procurement, where we sought to identify the most recurrent authors, their interactions, number of citations, identification of keywords, and their repetitions in the various articles. We used an auditable methodology based on the set of PRISMA items (Preferred Reporting Items for Systematic Reviews and Meta-Analyses). This method comprises a checklist with 27 topics distributed in 7 sections, which is used to improve the reporting of systematic reviews and meta-analyses. Its framework guides the setup of each section based on information from scientific articles: Title, Abstract, Introduction, Methods, Results, Discussion, and other information. The PRISMA 2020 approach was designed to carry out systematic reviews, including synthesis (meta-analyses or other forms of statistical synthesis), and consists of five phases (Page et al. 2021):

1. Identification of related articles in the field of interest;
2. Sorting titles and abstracts;
3. Acceptance analysis;
4. Exclusion analysis;
5. Final selection.

Afterwards, we started to identify and collect research articles to assist in the assembly of the current study. For this purpose, we used a systematic computerized query in two search repositories, Web of Science (W.O.S) and Scopus, chosen as primary search sources due to their extensive coverage of scientific writings, faster indexing methods, and curated databases of published and peer-reviewed content, often not considered in more comprehensive repositories.

2.1 Bibliometric Analysis and Keyword Search Strategy
We carried out an iterative search process restricted to the period from 2011 to 2021. The query identified publications whose titles, keywords, and abstracts contain terms and expressions logically combined by Boolean connectors. The resulting search string used in the repositories mentioned above is shown below:

Search String: (corruption OR fraud OR collusion OR bribe OR bribery OR rigging OR cartel OR collusive) AND (Procurement OR tender OR bid OR bidding OR auction) AND (network OR SNA OR detection OR detecting OR detect OR prediction OR predict).

To perform network analysis on the collected data set and represent the interactions between the actors or nodes of the network, we used the open-source engine VOSviewer (Van Eck and Waltman 2010). It allows us to identify the publications, authors, journals, institutions, keywords, and countries with the most significant impact on the research in
M. S. Lyra et al.
repositories of scientific articles. Using parameters such as degree centrality (the number of times a paper is mentioned, represented by the size of the nodes) and edge weight (the number of times that two articles are cited together in different documents, depicted by the thickness of the link), we used VOSviewer to generate graphs based on the data collected from the repositories and interpreted them by applying network analysis. We also analyzed the co-occurrence of keywords and authors. For keywords, the larger the node, the more often the keyword appears in publications, and the thicker the link between two nodes, the greater the number of articles in which that pair of keywords occurs together. The same concept was applied to the authors: the larger the node, the more published articles the author has in the selected database, and the thicker the connection between two authors, the more often they publish jointly.
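The node sizes and link thicknesses described above come down to two counts, which the following sketch computes directly. It is not VOSviewer's implementation or API; the function name `cooccurrence`, the input format (one list of keywords or author names per article) and the toy records are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count occurrences and pairwise co-occurrences.

    records: iterable of lists, one per article, each holding that article's
    keywords (or its author names).
    Returns (occurrences, links): occurrences[item] is the number of articles
    containing the item (node size), and links[(a, b)] is the number of
    articles containing both items (link thickness), with a < b alphabetically.
    """
    occurrences = Counter()
    links = Counter()
    for items in records:
        unique = sorted(set(items))  # count each item once per article
        occurrences.update(unique)
        links.update(combinations(unique, 2))
    return occurrences, links


# Toy records, invented for illustration only.
records = [
    ["corruption", "public procurement", "network analysis"],
    ["corruption", "public procurement"],
    ["collusion", "network analysis"],
]
occurrences, links = cooccurrence(records)
```

Passing author lists instead of keyword lists yields the co-authorship counts in the same way, since only the content of each record changes.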
3 Results

Data collection took place in August 2021 and was complemented with more articles in September 2021. Subsequently, we proceeded to the selection of relevant publications following the three PRISMA stages. In the first identification phase, we applied the search string to the two electronic repositories. Filtering the years 2011 to 2021, we found 351 articles in Scopus and 311 in the Web of Science. Next, we pooled all records found in both repositories and identified 279 identical publications. Counting each duplicated publication once, we reached 383 different articles (351 + 311 − 279) and entered the screening phase. At this stage, we made exclusions based on titles, eliminating 301 articles and keeping 82. Afterwards, we read the abstracts of the 82 articles and rejected 64 that were out of scope, leaving 18. After reading the 18 articles in full, we made no further exclusions and kept them all. Of these, 14 were published in scientific journals, while four were published in conference proceedings.

3.1 Publications by Author, Number of Citations & Journal/Conference
The number of selected articles related to the words used in the search query totals 18 publications. It starts with one publication in 2011 and grows to four in 2017, peaking in 2020 with six selected articles. This indicates that the topic has been researched more over time. The articles collected in the Scopus and Web of Science repositories, both from journals and conference proceedings, embrace an extensive variety of scientific research, from Computer Science to Government and Law, passing through Business, Economics, and Criminology. The scrutiny of scientific articles was essential for the identification of works relevant to our research area. We found articles cited up to 37 times, resulting in the publications shown in Table 1. The articles collected comprise the most up-to-date concepts currently applied to identify fraud in public contracts using network analysis. The most cited publications in Table 1 were: Fazekas and Toth (2016) with 37 citations; Sedita and Apa (2015) with 27 and Lin et al. (2012) with 22 mentions, all according to Scopus. Besides, citations are also analyzed in Sect. 3.4.2, referring individually to the authors and co-authors.
Table 1. Articles by year of publication

Title | Year | Author | Citations (Scopus) | Citations (W.O.S)
Corruption risk in contracting markets: a network science perspective | 2021 | Johannes Wachs; Mihály Fazekas; János Kertész | 3 | 4
Corruption and complexity: a scientific framework for the analysis of corruption networks | 2020 | Issa Luna-Pla; José R. Nicolás-Carlock | 10 | 8
Distinguishing Characteristics of Corruption Risks in Iranian Construction Projects: A Weighted Correlation Network Analysis | 2020 | Hosseini, M. Reza; Martek, Igor; Banihashemi, Saeed; et al. | 10 | 8
Bidder Network Community Division and Collusion Suspicion Analysis in Chinese Construction Projects | 2020 | Zhu, Jiwei; Wang, Bing; Li, Liang; et al. | 2 | 1
Corruption and the network structure of public contracting markets across government change | 2020 | Fazekas, Mihaly; Wachs, Johannes | 1 | 1
Network Analysis for Fraud Detection in Portuguese Public Procurement | 2020 | Carneiro, D., Veloso, P., Ventura, A., et al. | 1 | N/A
Betweenness to assess leaders in criminal networks: New evidence using the dual projection approach | 2019 | Grassi, R., Calderoni, F., Bianchi, M., & Torriero, A. | 16 | 14
Social capital predicts corruption risk in towns | 2019 | Wachs, Johannes; et al. | 9 | 5
A network approach to cartel detection in public auction markets | 2019 | Wachs, Johannes; Kertesz, Janos | 9 | 5
The Analysis of Water Project Bid Rigging Behavior Based on Complex Network | 2017 | Cheng, Tiexin; Liu, Ting; Meng, Lingzhi; et al. | N/A | 0
Bid-rigging networks and state-corporate crime in the construction industry | 2017 | Reeves-Latour, Maxime; Morselli, Carlo | 28 | 26
Detecting the collusive bidding behavior in below average bid auction | 2017 | Lei, Ming; Yin, Zihan; Li, Shalang; Li, Hao | 2 | 0
Adaptation of cluster analysis methods in respect to vector space of social network analysis indicators for revealing suspicious government contracts | 2017 | Davydenko, V. I.; Morozov, N. V.; Burmistrov, M. I. | 0 | 0
From Corruption to State Capture: A New Analytical Framework with Empirical Applications from Hungary | 2016 | Fazekas, Mihaly; Toth, Istvan Janos | 37 | 29
Improving Fraudster Detection in Online Auctions by Using Neighbor-Driven Attributes | 2016 | Lin, Jun-Lin; Khomnotai, Laksamee | 1 | 1
The impact of inter-organizational relationships on contractors’ success in winning public procurement projects: The case of the construction industry in the Veneto region | 2015 | Sedita, Silvia Rita; Apa, Roberta | 27 | 25
Combining ranking concept and social network analysis to detect collusive groups in online auctions | 2012 | Lin, Shi-Jen; Jheng, Yi-Ying; Yu, ChengHsien | 22 | 19
Corrupt police networks: uncovering hidden relationship patterns, functions and roles | 2011 | Lauchs, Mark; Keast, Robyn; Yousefpour, et al. | 20 | 18

3.2 Major Journals
Concerning journals, a total of 14 articles were collected from 13 different journals and published by ten publishers. 38% of the journals are ranked by Scimago Journal & Country Rank (SJR) in the first quartile Q1, 48% in the second quartile Q2, and 14% in the Q3 quartile. The main fields of investigation are Computer Science, Government & Law and Business & Economics, Table 2.
Table 2. Major journals

Journal | Qt. | Rank | Publisher | Country | Field
Expert Systems With Applications | 1 | Q1 | Elsevier Science LTD | England | Computer Science;
Social Networks | 2 | Q1 | Elsevier | Netherlands | Anthropology; Sociology
Politics and Governance | 1 | Q2 | Cogitatio Press | Portugal | Government & Law
International Journal of Project Management | 1 | Q1 | Elsevier Science LTD | England | Business & Economics
Science and Engineering Ethics | 1 | Q1 | Springer | Netherlands | Multidisciplinary Sciences
Scientific Reports | 1 | Q1 | Nature Publishing Group | England | Science & Technology
Applied Network Science | 1 | Q2 | Springer | Switzerland | Computer Science
Entropy | 1 | Q2 | MDPI | Switzerland | Multidisciplinary
International Journal of Data Science and Analit | 1 | Q2 | Springer | Switzerland | Computer Science
Royal Society Open Science | 1 | Q2 | Royal Soc | England | Science & Technology
Policing & Society | 1 | Q2 | Routledge Journals | England | Criminology & Penology
Advances in Civil Eng | 1 | Q3 | Hindawi LTD | England | Construction & Building
Political Research Quarterly | 1 | Q3 | Sage Publications Inc | United States | Government & Law

3.3 Conference Proceedings
Regarding conferences, a total of four articles were collected from four different proceedings and published by three publishers belonging to three countries. The main field of investigation is computer science as shown in Table 3.
Table 3. Main conferences

| Conference | Qt. | Publisher | Country | Field |
|---|---|---|---|---|
| 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) | 1 | IEEE | USA | Computer Science |
| IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud) | 1 | IEEE | USA | Computer Science; Telecommunications |
| International Conference on Applied Mathematics, Modeling and Simulation (AMMS) | 1 | Atlantis Press | France | Computer Science; Mathematics |
| International Conference on Intelligent Data Engineering and Automated Learning | 1 | Springer International | Switzerland | Computer Science |
The four conference proceedings identified in this study were listed in the Computer Science field of investigation. Regarding publishers, IEEE published two papers, in the conferences on Natural Computation, Fuzzy Systems and Knowledge Discovery and on Future Internet of Things and Cloud. Springer International Publishing AG published one paper in the proceedings on Intelligent Data Engineering and Automated Learning. Atlantis Press published one paper in the proceedings on Applied Mathematics, Modeling and Simulation.

3.4 Analysis of Word Incidence in Sections of Varied Articles
To further explore the most sensitive topics, we conducted a co-occurrence analysis. Here we use the concept of keyword occurrence, which is based on the count of words in parts of the articles (Keyword, Title, and Abstract). The analysis is applied to many articles together, which supports review and statistical exploration.

3.4.1 Analysis of Keyword Occurrences
The keyword occurrence study was carried out with VOSviewer, a software tool for constructing and visualizing bibliometric networks. We performed the search using the complete counting method, comprising 12 keywords in the Scopus and Web of Science repositories, with the minimum occurrence threshold set to two. Table 4 presents the 10 keywords with the highest occurrence and link strength.
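The counting described above can be illustrated with a minimal sketch. It is not VOSviewer's implementation: the function name `keyword_stats`, the toy corpus, and the choice to sum link strength only over links between retained keywords are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def keyword_stats(docs, min_occurrences=2):
    """Return {keyword: (occurrences, total_link_strength)}.

    docs is a list of keyword lists, one list per article.
    Assumption: every co-occurring pair in an article adds 1 to its link.
    """
    occurrences = Counter()
    links = Counter()
    for keywords in docs:
        # Normalise and deduplicate keywords within one article.
        unique = sorted({k.strip().lower() for k in keywords})
        occurrences.update(unique)
        links.update(combinations(unique, 2))

    # Keep only keywords that reach the minimum occurrence threshold.
    kept = {k for k, n in occurrences.items() if n >= min_occurrences}

    # Assumption: link strength is summed over links between kept keywords.
    strength = Counter()
    for (a, b), weight in links.items():
        if a in kept and b in kept:
            strength[a] += weight
            strength[b] += weight
    return {k: (occurrences[k], strength[k]) for k in kept}


# Hypothetical toy corpus, not the reviewed articles.
toy_docs = [
    ["Corruption", "Public procurement", "Networks"],
    ["Corruption", "Public procurement"],
    ["Corruption", "Collusion"],
]
stats = keyword_stats(toy_docs)
```

With a threshold of two, only "corruption" and "public procurement" survive in this toy corpus, and their single shared link carries weight two because they co-occur in two articles.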
Table 4. Keywords with the highest occurrences

| Scopus keyword | Link strength | Occurrence | Web of Science keyword | Link strength | Occurrence |
|---|---|---|---|---|---|
| Public procurement | 13 | 7 | Corruption | 12 | 7 |
| Corruption | 9 | 7 | Government | 7 | 4 |
| Crime | 12 | 4 | Networks | 6 | 3 |
| Social network analysis | 8 | 4 | Social network analysis | 4 | 3 |
| Social networking | 8 | 3 | Markets | 2 | 3 |
| Construction industry | 7 | 3 | Public procurement | 5 | 2 |
| Risk management | 8 | 2 | Construction industry | 4 | 2 |
| Anti-corruption | 7 | 2 | Management | 4 | 2 |
| Networks | 5 | 2 | Centrality | 3 | 2 |
| Human | 4 | 2 | Collusion | 2 | 2 |
The most frequent keywords in the two repositories, Scopus and Web of Science, had the following occurrences and link strengths, respectively: public procurement (Scopus: 7, 13 and W.O.S: 2, 5), government (Scopus: 0, 0 and W.O.S: 4, 7), crime (Scopus: 4, 12 and W.O.S: 0, 0), corruption (Scopus: 7, 9 and W.O.S: 7, 12), social network analysis (Scopus: 4, 8 and W.O.S: 3, 4), and networks (Scopus: 2, 5 and W.O.S: 3, 6). Taken together, the two keyword networks (Scopus and Web of Science) emphasize the research areas of public procurement and corruption. They also point to social network analysis as the discipline used for detecting fraud in tenders. The two networks complement each other: they share some keywords, which reinforces the importance of those keywords, and each adds others that are essential for the research, such as anti-corruption, risk management, collusion, and government.

3.4.2 Author Co-Authorship - Documents and Citations Within Multiple Articles
The study of authors’ and co-authors’ occurrence was carried out with the VOSviewer tool using complete counting and a minimum of one document per author. The search included 49 authors/co-authors in Scopus and 48 in Web of Science. Table 5 presents the 10 most-cited authors/co-authors in each repository.
Table 5. Highest citations from authors and co-authors

| Scopus author/co-author | Link strength | Documents | Citations | Web of Science author/co-author | Link strength | Documents | Citations |
|---|---|---|---|---|---|---|---|
| fazekas m | 4 | 3 | 40 | fazekas, mihaly | 4 | 3 | 34 |
| tóth i.j | 1 | 1 | 36 | toth, istvan janos | 1 | 1 | 29 |
| morselli c | 1 | 1 | 28 | morselli, carlo | 1 | 1 | 26 |
| reeves-latour m | 1 | 1 | 28 | reeves-latour, m | 1 | 1 | 26 |
| apa r | 1 | 1 | 27 | apa, roberta | 1 | 1 | 25 |
| sedita s.r | 1 | 1 | 27 | sedita, silvia rita | 1 | 1 | 25 |
| jheng y.-y | 2 | 1 | 22 | jheng, yi-ying | 2 | 1 | 19 |
| lin s.-j | 2 | 1 | 22 | lin, shi-jen | 2 | 1 | 19 |
| yu c.-h | 2 | 1 | 22 | yu, cheng-hsien | 2 | 1 | 19 |
| wachs j | 7 | 4 | 22 | wachs, johannes | 7 | 4 | 18 |
Combining the two repositories without intersections, the authors/co-authors that stand out the most for their link strength and number of citations are Fazekas, with three documents, and Wachs, with four documents (Table 5). Looking at the author and co-author network in Fig. 1, filtered by authors with at least four citations, we present for the Web of Science and Scopus repositories the five biggest communities formed, with 21 nodes and 42 links each. Here we observe that the largest and most interconnected nodes belong to authors Mihaly Fazekas and Johannes Wachs. They have two and three articles as authors, respectively, and have written one paper together.
Fig. 1. Author and co-authorship network view. a) Web of Science. b) Scopus.
3.4.3 Purpose and Methodology
All articles used connected networks for their evaluations. Regarding the use of labeled data versus exploratory research, seven articles applied labeled data (Lin et al. 2012; Grassi et al. 2019; Hosseini et al. 2019; Lei et al. 2017; Fazekas and Wachs 2020; Wachs and Kertesz 2019b; Davydenko et al. 2017). The remaining 11 papers used exploratory analysis (Fazekas and Toth 2016; Sedita and Apa 2015; Reeves-Latour and Morselli 2017; Lauchs et al. 2011; Luna-Pla and Carlock 2020; Wachs et al. 2019a; Wachs and Fazekas 2021; Carneiro et al. 2020; Zhu et al. 2020; Lin and Khomnotai 2016; Cheng et al. 2017).
Out of the 18 articles, 13 (Cheng et al. 2017; Davydenko et al. 2017; Fazekas and Toth 2016; Fazekas et al. 2020.1; Fazekas et al. 2020.2; Hosseini et al. 2019; Lei et al. 2017; Lin et al. 2016; Lin et al. 2012; Wachs et al. 2019.1; Wachs et al. 2019.2; Wachs et al. 2021; Zhu et al. 2020) used cluster analysis (community detection or clustering) to classify and analyze the nodes. Eight (Davydenko et al. 2017; Fazekas and Toth 2016; Grassi et al. 2019; Hosseini et al. 2019; Luna-Pla and Carlock 2020; Lauchs et al. 2011; Reeves-Latour and Morselli 2017; Sedita and Apa 2015) applied centrality measures, which usually reflect the prominence, status, prestige, or visibility of a node and often explain behaviors on the network (Marsden 2005). Two (Grassi et al. 2019; Wachs et al. 2019.1) used bipartite network projection to create a firm-firm network and capture the hidden relationships between bidders. One (Lin et al. 2016) used the concept of neighbor diversity, which is an effective resource to distinguish fraudsters from regular nodes. The application of Network Analysis techniques varies, not only in the tools used but also in the objectives.
The most cited article in the area (Fazekas and Toth 2016), for example, besides using community detection, centralities and measures, created a fraud risk index comprising intrinsic parameters of tenders and used it to identify a supposed state capture and centralization of the corrupt network after government turnover. On the other hand, Grassi et al. (2019) pointed out the betweenness centrality as the most standard measure to identify criminal agents and used the projection approach to capture strong brokerage positions in a mono-mode network and spot criminal leaders.
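As an illustration of the bipartite projection mentioned above, the following minimal sketch projects a list of (firm, tender) bids onto a weighted firm-firm network and computes degree centrality, the simplest of the centralities discussed. It does not reproduce the method of any reviewed article; the function names and the toy bids are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def project_co_bidding(bids):
    """Project (firm, tender) pairs onto a weighted firm-firm network.

    The weight of an edge is the number of tenders both firms bid on.
    """
    bidders = defaultdict(set)
    for firm, tender in bids:
        bidders[tender].add(firm)

    weights = Counter()
    for firms in bidders.values():
        # Every pair of firms in the same tender becomes (or reinforces) an edge.
        weights.update(combinations(sorted(firms), 2))
    return weights


def degree_centrality(weights):
    """Share of the other connected firms that each firm is linked to.

    Firms that never share a tender with another firm do not appear.
    """
    neighbours = defaultdict(set)
    for a, b in weights:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {}
    return {firm: len(nbrs) / (n - 1) for firm, nbrs in neighbours.items()}


# Hypothetical bids, given as (firm, tender) pairs.
toy_bids = [
    ("A", "t1"), ("B", "t1"),
    ("A", "t2"), ("B", "t2"), ("C", "t2"),
    ("C", "t3"), ("D", "t3"),
]
co_bidding = project_co_bidding(toy_bids)
centrality = degree_centrality(co_bidding)
```

In this toy example firms A and B share two tenders, so their edge has weight two, and firm C is linked to every other firm. Betweenness, which Grassi et al. (2019) use to identify brokers, would require a shortest-path computation that is omitted here.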
4 Discussion
The current research seeks to identify the most recent studies and their contributions regarding detecting fraud in public procurement using tools from network science. This section briefly compares the results obtained from the Scopus and Web of Science repositories and then analyzes how we answer our research question.

4.1 Scopus and Web of Science Repositories
Concerning the two repositories, we found that they overlap in approximately 90% of the selected articles, with each contributing the same number of articles. Of the intersecting articles, Scopus has more citations in 75% of cases, Web of Science in 6%, and both have the same number of citations in 19%. Scopus also generates more keywords and related words in titles and abstracts, besides having greater link strength and word occurrence.

4.2 Research Question RQ1
RQ1: How can network science help in the disclosure of fraud in public procurement?
The review of the articles allowed the identification of the Network Science techniques used to detect fraud in public procurement. We found that cluster analysis (community detection and clustering) and centrality measures are the most used tools in the 18 documents analyzed, with 13 and 9 occurrences, respectively.
Community detection is one of the most used tools for identifying the formation of clusters, especially cartels. Wachs et al. (2019a) used it to investigate cohesive and exclusive interaction in a large market, in which they found a niche with favorable conditions for the emergence of cartel collusion. However, the approach to detecting cartels was only suggestive and could not conclusively prove fraud without labeled data. Regarding clustering analysis, Cheng et al. (2017) found that closely linked nodes with relatively large weights indicate companies that behave cooperatively within their communities and are likely bidding fraudulently. Fazekas and Toth (2016) detected companies that contract extensively among themselves rather than with organizations outside their communities. Hosseini et al. (2019) analyzed communities in a contract network related to violations of procedural principles in concessions, and identified clusters in which none of the contracts had the capacity to deliver the awarded projects. In turn, Wachs et al. (2021) quantified the extent to which single bidding varies across different communities within a network and interpreted it as significant evidence that corruption is not randomly distributed in the public sector. They concluded that the more single bidders cluster in a market, the more likely it is that investigating their neighbors would reveal cases of fraud.
In other research, Wachs and Kertesz (2019b) found that communities with an excess of bonding social capital are at higher risk of corruption, while communities with more diverse external connectivity are less exposed to fraud. Zhu et al. (2020) divided a network into separate communities, classified overlaps as new communities, and identified collusion cases from the high degree of the overlapping nodes. Fazekas and Wachs (2020) analyzed the persistence of corrupt buyers in a low-competition cluster over the years following a change of government to determine the corrupt link between public and private entities.
The centrality measures were used in the articles to help identify the most active, most influential, and most central actors, as well as those suspected of spurious associations. Degree centrality stood out in some articles (Fazekas and Toth 2016; Sedita and Apa 2015; Lauchs et al. 2011; Luna-Pla and Carlock 2020; Hosseini et al. 2019; Davydenko et al. 2017), exposing the most connected actors in the network. However, while it helps to identify network hubs, criminal network interfaces are not necessarily hub-mediated, since a hub is highly visible on the network (Morselli 2008). Therefore, other measures and analyses are necessary to identify suspicious connections. Fazekas and Toth (2016), for instance, identified a non-linear and time-varying relationship between closeness centrality and a corruption risk index, making closeness a strong indicator of the risk of corruption after a government turnover. Davydenko et al. (2017), on the other hand, despite using closeness centrality, found that offenders were aware of the main indicators of suspicion and were striving to circumvent them by lowering their average values through manipulation of information. In a cluster analysis, they showed that the most significant indicators to reveal suspicious elements were based on the contract price, contract name, closeness values, and average values of eigenvectors, revealing, through these measures, that 13% of government contracts had violations.
Testing the efficiency of betweenness centrality in identifying brokers, Grassi et al. (2019) investigated a set of known leaders in a criminal network, which revealed a core group of about 20 heads with high betweenness values. On the other hand, they also found a small group of outlier leaders whose betweenness was low and therefore did not mark them as possible brokers. Luna-Pla and Carlock (2020) used network diameter, average path length, and average degree, but these metrics remained essentially unaltered and did not provide enough information to indicate the risk of corporate fraud; the clustering coefficient was equal to zero due to the bipartite nature of the network. These network metrics were inconclusive for establishing clear corporate corruption risk indicators, except for the number of multi-edge node pairs, which provided a good measure of the degeneracy of both companies and people.
Since Network Science tools are not always able to produce precise results alone, they are often used with other parameters such as single bidder, neighbor diversity, winner diversity, winning rate, or even indicators created from tenders’ parameters, as done in some studies (Fazekas and Toth 2016; Fazekas and Wachs 2020; Luna-Pla and Carlock 2020).
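The remark that the clustering coefficient vanishes in a bipartite network, while multi-edge node pairs remain informative, can be checked with a small sketch. A bipartite graph contains no triangles, so every local clustering coefficient is zero. The functions and the toy person-company edges below are hypothetical and are not taken from Luna-Pla and Carlock (2020).

```python
from collections import Counter, defaultdict
from itertools import combinations


def average_clustering(edges):
    """Average local clustering coefficient of an undirected graph.

    Parallel edges and self-loops are ignored; nodes with fewer than
    two neighbours contribute a coefficient of 0.
    """
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count neighbour pairs that are themselves connected (closed triangles).
        closed = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * closed / (k * (k - 1))
    return total / len(adj) if adj else 0.0


def multi_edge_pairs(edges):
    """Number of node pairs joined by more than one edge."""
    counts = Counter(frozenset(edge) for edge in edges)
    return sum(1 for c in counts.values() if c > 1)


# Hypothetical person-company edges; the repeated pair models two distinct ties.
toy_edges = [
    ("p1", "c1"), ("p1", "c1"),
    ("p1", "c2"), ("p2", "c1"), ("p2", "c2"),
]
triangle = [("a", "b"), ("b", "c"), ("a", "c")]

bipartite_clustering = average_clustering(toy_edges)
triangle_clustering = average_clustering(triangle)
repeated_pairs = multi_edge_pairs(toy_edges)
```

The triangle is included only as a contrast: its clustering coefficient is one, whereas the bipartite toy network yields zero regardless of how densely people and companies are connected.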
5 Conclusion and Future Studies
This research used the PRISMA methodology, an evidence-based collection of items intended to help authors report on a broad spectrum of systematic reviews and meta-analyses. The method allowed for an organized presentation of studies and techniques used to detect fraud in public procurement. To select scientific articles, we used two repositories, Scopus and Web of Science. We compared their results separately to identify the contribution and determine the significance of each one in the research. The 18 selected articles were chosen because they address the research question on the techniques applied to detect fraud in public procurement using network science. The most addressed technique was cluster analysis, which encompasses clustering and community detection. It was mainly used to explore locations of actors grouped with similar properties, helping to reveal hidden relations among them in the network. Another widely used technique was centrality, mainly degree centrality, closeness centrality, and betweenness centrality, helping to detect fraud. However, some articles reported that these measures were not enough to detect suspicious episodes, although they could assist the characterization and analysis of the network. Nevertheless, other articles created fraud indicators based on intrinsic bidding parameters and used cluster analysis to effectively identify fraud.
Regarding gaps, the literature review showed that the vast majority of fraud detection analyses in public procurement occur in bipartite networks. Besides, the studies were not able to reconcile an analysis of projected co-bidding networks with the use of fraud indicators in a cluster investigation using network science. Taking this gap into account, we suggest, as a guide for future studies within the extensive subject of fraud detection in public contracts, investigating a projected co-bidding network using specific fraud risk indicators based on procurement process parameters.

Funding. This research was funded by “Fundação para a Ciência e a Tecnologia” (Portugal), grant number DSAIPA/DS/0116/2019.
References
Carneiro, D., Veloso, P., Ventura, A., Palumbo, G., Costa, J.: Network analysis for fraud detection in Portuguese public procurement. In: Analide, C., Novais, P., Camacho, D., Yin, H. (eds.) IDEAL 2020. LNCS, vol. 12490, pp. 390–401. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62365-4_37
Cheng, T., Liu, T., Meng, L., et al.: The analysis of water project bid rigging behavior based on complex network. In: International Conference on Applied Mathematics, Modeling and Simulation (AMMS) (2017)
Costa, G.A., Machado, D.P., Martins, V.Q.: The efficiency of social control in municipal bidding: a study in social observatories. Sociedade Contabilidade e Gestão 14(4), 112 (2020)
Davydenko, V.I., Morozov, N.V., Burmistrov, M.I.: Adaptation of cluster analysis methods in respect to vector space of social network analysis indicators for revealing suspicious government contracts. In: IEEE 5th International Conference on Future Internet of Things and Cloud (FiCloud) (2017)
Fazekas, M., Tóth, I.J.: From corruption to state capture: a new analytical framework with empirical applications from Hungary. Polit. Res. Q. 69(2), 320–334 (2016)
Fazekas, M., Wachs, J.: Corruption and the network structure of public contracting markets across government change. Politics and Governance (2020)
Grassi, R., Calderoni, F., Bianchi, M., Torriero, A.: Betweenness to assess leaders in criminal networks: new evidence using the dual projection approach. Soc. Networks 56, 23–32 (2019)
Hosseini, M.R., Martek, I., Banihashemi, S., et al.: Distinguishing characteristics of corruption risks in Iranian construction projects: a weighted correlation network analysis. Science and Engineering Ethics (2019)
Lei, M., Yin, Z., Li, S., Li, H.: Detecting the collusive bidding behavior in below average bid auction. In: 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) (2017)
Lin, J., Khomnotai, L.: Improving fraudster detection in online auctions by using neighbor-driven attributes. Entropy, Vol. 18, Ed:1, N:e18010011 (2016)
Lin, S.J., Jheng, Y.-Y., Yu, C.H.: Combining ranking concept and social network analysis to detect collusive groups in online auctions. Expert Syst. With Applicat (2012)
Luna-Pla, I., Carlock, N.J.R.: Corruption and complexity: a scientific framework for the analysis of corruption networks. Appl. Network Sci. (2020)
Marsden, P.V.: Network Analysis. In: Encyclopedia of Social Measurement (2005)
Morselli, C.: Inside Criminal Networks. Springer, Studies of Organized Crime (2008)
Mufutau, G.O., Mojisola, O.V.: Detection and prevention of contract and procurement, fraud Catalyst to organization profitability. J. Bus. Manag. (2016)
Padhi, S.S., Mohapatra, P.K.J.: Detection of collusion in government procurement auctions. J. Purch. Supply Manag. 17, 207–221 (2011)
Page, M.J., McKenzie, J.E., Bossuyt, P.M., et al.: The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Int. J. Surg. 88 (2021). Number 105906
Reeves-Latour, M., Morselli, C.: Bid-rigging networks and state corporate crime in the construction industry. Soc. Networks 51, 158–170 (2017)
Rustiarini, N., Sutrisno, T., Nurkholis, N., Andayani, W.: Why people commit public procurement fraud? The fraud diamond view. J. Pub. Procur. 19(4), 345–362 (2019)
Sedita, S.R., Apa, R.: The impact of inter-organizational relationships on contractors’ success in winning public procurement projects: the case of the construction industry in the Veneto region. Int. J. Proj. Manag. (2015)
Silva Filho, J.B.: A eficiência do controle social nas licitações e contratos administrativos. Master's thesis - Universidade Nove de Julho, São Paulo (2017)
Van Eck, N.J., Waltman, L.: Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics 84(2), 523–38. Version 1.6.14 (2010)
Wachs, J., Fazekas, M., Kertész, J.: Corruption risk in contracting markets: a network science perspective. Internat. J. Data Sci. Analyt. (2021)
Wachs, J., Kertesz, J.: A network approach to cartel detection in public auction markets. Sci. Rep. (2019b)
Wachs, J., Yasseri, T., Lengyel, B., Kertesz, J.: Social capital predicts corruption risk in towns. Royal Society Open Science (2019a)
Wensink, W., Vet, M.J.: Identifying and Reducing Corruption in Public Procurement in the EU. European Commission, Bruxelles (2013)
Whiteman, R.: Fraud and corruption tracker. The Chartered Institute of Public Finance and Accountancy – CIPFA (2019)
World Bank Group: A fair adjustment: efficiency and equity of public spending in Brazil. Volume 1 - Overview (English). Washington, D.C. (2017)
Zhu, J., Wang, B., Li, L., et al.: Bidder network community division and collusion suspicion analysis in Chinese construction projects. Adv. Civil Eng. (2020)
Characterising Different Communities of Twitter Users: Migrants and Natives
Jisu Kim1,2(✉), Alina Sîrbu3, Giulio Rossetti4, and Fosca Giannotti4
1 Scuola Normale Superiore, Pisa, Italy [email protected]
2 Max Planck Institute for Demographic Research, Rostock, Germany
3 University of Pisa, Pisa, Italy [email protected]
4 National Research Council of Italy, Pisa, Italy {giulio.rossetti,fosca.giannotti}@isti.cnr.it

Abstract. Today, many users are actively using Twitter to express their opinions and to share information. Thanks to the availability of the data, researchers have studied behaviours and social networks of these users. International migration studies have also benefited from this social media platform to improve migration statistics. Although diverse types of social networks have been studied so far on Twitter, social networks of migrants and natives have not been studied before. This paper aims to fill this gap by studying characteristics and behaviours of migrants and natives on Twitter. To do so, we perform a general assessment of features including profiles and tweets, and an extensive analysis of the network. We find that migrants have more followers than friends. They have also tweeted more, even though both groups have similar account ages. More interestingly, the assortativity scores showed that users tend to connect based on nationality more than country of residence, and this is more the case for migrants than natives. Furthermore, both natives and migrants tend to connect mostly with natives.
Keywords: Twitter · Big data analysis · Communities · International migration · Social network
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 130–141, 2022. https://doi.org/10.1007/978-3-030-93409-5_12

1 Introduction
Twitter is one of the microblogging platforms that attracted many users. Unlike some of the other platforms, Twitter is widely used to communicate in real-time and share news among different users [9]. On Twitter, users follow other accounts that interest them to receive updates on their messages, called “tweets”. Tweets can include photos, GIFs, videos, hashtags and polls. Amongst them, hashtags are widely used to facilitate cross-referencing contents. The tweets can also be retweeted by other users who wish to spread the information among their networks. This involves sometimes adding new
information or expressing an opinion on the information stated. Despite the limit of 280 characters per tweet1, users are able to communicate effectively with others. But above all, Twitter has become a useful resource for research. Twitter data can be accessed freely through an application programming interface (API)2. On top of this, geo-tagged tweets are widely used to analyse real-world behaviours. One of the fields of research that makes use of geo-tagged tweets is migration studies. Typically, migration studies have relied on traditional data such as census, survey and register data. However, as alternative data sources for studying migration statistics have become available in recent years, many studies have developed new methodologies to complement traditional data sources (see for instance [6, 7, 10, 14, 16]). While these studies have successfully shown the advantages of alternative data sources, how migrants and natives use social media has not been fully understood. For instance, what do migrants and natives talk about? To whom do they connect? Do they have many followers or friends? Who are the most central users amongst them? These are the questions that we aim to explore in this work by analysing Twitter features and the Twitter network. In doing so, we expect to discover the interests of migrants and natives and evidence of social interaction. Here, we aim to study the characteristics and behaviours of two different communities on Twitter: migrants and natives. We plan to do so through a general assessment of features of individual users from profiles and tweets and an extensive network analysis to understand the structure of the different communities. For this, we identified 4,940 migrant users and 46,948 native users across 174 countries of origin and 186 countries of residence using the methodology developed by [7]. 
For each user, we have their profile information which includes account age, whether the account is a verified account, number of friends, followers and tweets. We also have information extracted from the public tweets which includes language, location (at country level) and hashtags. With these collected data, we explore how each of the communities utilises Twitter and their interests in both the world- and local-level news using the method developed by [8]. Furthermore, we also explore their social links by studying the properties of the mixed network between migrants and natives. We study centrality and assortativity of the nodes in the network. We discovered that migrants tend to have more followers than friends. They also tweet more and from various locations and languages. The assortativity scores show that users tend to connect based on nationality more than country of residence, and this is true more for migrants than natives. Furthermore, both natives and migrants tend to connect mostly with natives. The rest of the article is organised as follows: we begin with related works, followed by Sect. 3 on data and the identification strategy for labelling migrants and natives on Twitter. Section 4 focuses on statistics on different features of Twitter and Sect. 5 deals with analysis of the different networks. We then conclude the paper in Sect. 6.
1 https://developer.twitter.com/en/docs/counting-characters.
2 https://developer.twitter.com/en/docs/twitter-api.
2 Related Works
Many studies analyse different networks on microblogging platforms. Twitter is one of the platforms that has been studied extensively because, unlike Facebook for instance, it enables us to collect directed graphs. We can study various types of relationships defined by friendship (followers or friends3), conversation threads (tweets and retweets) or semantics (tweets and hashtags). Performing network analysis on these allows us to study properties, structures and dynamics of various types of social relationships. One of the first quantitative studies on the topological characteristics of Twitter and its role in information sharing is [9]. From this study onward, many have found distinguishing characteristics of Twitter’s social networks. According to the study, Twitter has a “non-power-law follower distribution, a short effective diameter, and low reciprocity”. The study showed that, unlike other microblogging platforms that serve mainly as social networking platforms, Twitter acts as a news media platform where users follow others to receive updates on others’ tweets. A further study of the power of Twitter in information sharing and the role of influencers is [3]. The authors focused on three different types of influence: indegree, retweets and mentions of tweets. They found that receiving many in-links is not sufficient evidence of a user’s influence; the content of the tweets created, including retweets, mentions and topics, matters equally. The same authors extended the work to observe information spreaders on Twitter based on certain properties of the users, which led to a natural division into three groups: mass media, grassroots (ordinary users) and evangelists (opinion leaders) [2]. Furthermore, by looking at the six major topics in 2009 and how these topics circulated, they found different roles played by each group. For example, mass media and evangelists play a major role in spreading new events despite their small presence. 
On the other hand, grassroots users act as gossip-like spreaders. The grassroots and evangelists are more involved in forming social relationships. More recent studies focused on the characteristics and properties of Twitter networks in various scenarios, e.g., political context, social movements, urban mobility and more (see for instance [12, 15]). For instance, [5] studied the network of followers on Twitter in the digital humanities community and showed that linguistic groups are the main drivers of the formation of diverse communities. Our work follows the same line as these works. Unlike previous works, however, we explore new types of communities that, to the best of our knowledge, have not yet been explored, i.e., migrants and natives.
3 Followers are users that follow a specific user and friends are users that a specific user follows. https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/overview.
3 Data and Labelling Strategy

3.1 Data
The dataset used in this work is similar to the one used in [8]. We begin with Twitter data collected by [4], from which we extract all geo-tagged tweets published from Italy between August 2015 and October 2015, resulting in a total of 34,160 individual users (that we call first layer users). We then searched for their friends, i.e. other accounts that first layer users are following, which added 258,455 users to the dataset (called second layer users). We further augmented our data by also scraping the friends of these 258,455 users. The size of the data grew to about 60 million users. To ensure a sufficient number of geo-tagged tweets, the 200 most recent tweets of all these users were also collected. To synthesise the dataset, we focus on the subset of users for whom we have the social network and who have published geo-located tweets. This results in a total of 200,354 users from the first and second layers, with some overlap between the two layers.

3.2 Labelling Migrants and Natives
The strategy for labelling migrants and natives originates from the work of [7]. It involves assigning a country of nationality Cn ðuÞ and a country of residence C r ðuÞ to each user u, for the year 2018. The definition of a migrant is “a person who has the residence different from the nationality,” i.e., C n ðuÞ 6¼ C r ðuÞ. The strategy to assign a user’s residence requires observing the number of days spent in different countries in 2018 through the time stamps of the tweets. In other words, the country of residence is the location where the user remains most of the time in 2018. To assign nationality, we analyse the tweet locations of the user and user’s friends. In this work, we took into account the fact that tweet language was not considered important in defining the nationality as found in the study of [7]. Thus, the language was not considered here as well. By comparing the labels of country of residence and the nationality, we determined whether the user was a migrant or a native in 2018. Some users could not be labelled since the procedure outlined in [7] only assigns labels when enough data is available. As a result, we identified nationalities of 197,464 users and the residence 57,299 users. Amongst them, the total number of users that have both the nationality and residence labels are 51,888. Most importantly, we were able to identify 4,940 migrant users and 46,948 natives from our Twitter dataset. In total, we have identified 163 countries of nationalities for natives. The most present countries are the United States of America, Italy, Great Britain and Spain in terms of nationality. This is due to several factors. First because Twitter’s main users are from the United States. Second, we have large number of Italian nationalities present due to the fact that we initially selected the users whose geo-tags were from Italy. Overall, we have identified 144 countries of nationalities and 169 countries of residences for the migrants. 
In terms of migration patterns, it is also interesting to note that in our data the U.S. and the U.K. have a significant number of both in-coming and out-going links, whereas France and Germany have mainly in-coming links.
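The labelling rule above can be illustrated with a minimal sketch. It is not the procedure of [7]: the function names, the `(date, country)` input format and the `min_days` threshold are assumptions made for illustration, and nationality is simply taken as given.

```python
from collections import Counter
from datetime import date


def country_of_residence(tweets, year=2018, min_days=1):
    """Return the country with the most distinct tweet days in `year`.

    `tweets` is an iterable of (date, country_code) pairs. `min_days` is a
    hypothetical "enough data" threshold; None is returned below it.
    """
    days_per_country = Counter()
    seen = set()
    for day, country in tweets:
        if day.year != year or (day, country) in seen:
            continue  # ignore other years and repeated tweets on the same day
        seen.add((day, country))
        days_per_country[country] += 1
    if not days_per_country:
        return None
    country, days = days_per_country.most_common(1)[0]
    return country if days >= min_days else None


def label_user(nationality, residence):
    """Migrant iff C_n(u) != C_r(u); unlabelled if either label is missing."""
    if nationality is None or residence is None:
        return None
    return "migrant" if nationality != residence else "native"


tweets = [
    (date(2018, 1, 1), "IT"),
    (date(2018, 1, 1), "IT"),   # same day, counted once
    (date(2018, 1, 2), "IT"),
    (date(2018, 3, 5), "FR"),
    (date(2017, 5, 5), "FR"),   # outside 2018, ignored
]
residence = country_of_residence(tweets)
print(residence, label_user("FR", residence))
```

Counting distinct days rather than tweets keeps a single very active day from dominating the residence estimate, which is one plausible reading of "number of days spent" in the text.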
Fig. 1. Distribution of DA & HA for migrants (in blue) and natives (in orange)
Here, we emphasise that through our labelling process we do not intend to reflect a global view of the world’s migration patterns, but simply what is demonstrated through our dataset. However, as also shown in [8], the predicted data correlate fairly well with official data when looking at countries separately. For instance, when comparing predicted data with the Italian emigration data of AIRE (see footnote 4), we observed a correlation coefficient of 0.831 for European countries and 0.56 for non-European countries. When compared with Eurostat data on European countries, the correlation coefficient was 0.762. This gives us the confidence to employ this dataset to analyse characteristics of different communities through Twitter.
4 Twitter Features
In this section we look at the way migrants and natives employ Twitter to connect with friends and to produce and consume information.
4.1 Home and Destination Attachment Index
A first analysis concentrates on the types of information that users share, from the point of view of the country where the topics are discussed. In particular, we compute two indices developed by [8]: Home Attachment (HA) and Destination Attachment (DA), which describe how much users concentrate on topics from their country of nationality and country of residence, respectively. We compute the two indices for both migrants and natives; for natives the residence and nationality are equal, and thus the two indices coincide. To compute HA and DA, we first assign nationalities to hashtags by considering the most frequent country of residence of the natives using each hashtag. A few hashtags are left unlabelled if their distribution across countries is heterogeneous (as measured by the entropy of the distribution). The HA is then computed for each user as the proportion of hashtags specific to the country of nationality. Similarly, the DA is the proportion of hashtags specific to the country of residence. Thus, the HA index measures how much a user is interested in what is happening in his/her country of nationality, and the DA index reflects how much a user is interested in what is happening in his/her country of residence.
4 Anagrafe degli italiani residenti all’estero (AIRE) is the Italian register of citizens residing abroad.
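The two-step computation of HA and DA described above can be sketched as follows. This is only an illustration under stated assumptions: the entropy threshold `max_entropy`, the use of base-2 entropy, and the choice of all of a user's hashtags as the denominator are hypothetical and are not taken from [8].

```python
import math
from collections import Counter, defaultdict


def label_hashtags(native_usage, max_entropy=1.0):
    """Assign a country to each hashtag from the natives who use it.

    `native_usage` holds (hashtag, residence_country) pairs, one per native
    user of the hashtag. Hashtags whose country distribution has an entropy
    above `max_entropy` (an assumed threshold) stay unlabelled.
    """
    counts = defaultdict(Counter)
    for tag, country in native_usage:
        counts[tag][country] += 1
    labels = {}
    for tag, by_country in counts.items():
        total = sum(by_country.values())
        entropy = -sum(
            (c / total) * math.log2(c / total) for c in by_country.values()
        )
        if entropy <= max_entropy:
            labels[tag] = by_country.most_common(1)[0][0]
    return labels


def attachment(user_hashtags, hashtag_country, country):
    """Proportion of a user's hashtags that are labelled with `country`."""
    if not user_hashtags:
        return 0.0
    hits = sum(1 for tag in user_hashtags if hashtag_country.get(tag) == country)
    return hits / len(user_hashtags)


usage = [
    ("#roma", "IT"), ("#roma", "IT"), ("#roma", "FR"),
    ("#paris", "FR"),
    ("#love", "IT"), ("#love", "FR"), ("#love", "US"), ("#love", "ES"),
]
labels = label_hashtags(usage)          # "#love" is too spread out to label
user_tags = ["#roma", "#roma", "#paris", "#love"]
ha = attachment(user_tags, labels, "IT")  # nationality: IT
da = attachment(user_tags, labels, "FR")  # residence: FR
print(labels, ha, da)
```

For a native, nationality and residence are the same country, so both calls to `attachment` return the same value, matching the remark that the two indices coincide.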
As shown in Fig. 1, the indices clearly behave differently for the two groups. Similar to [8], we observe that migrants have, on average, very low levels of DA and HA. For natives, the index distribution is wider and has an average of 0.447, which is higher than the average for migrants. This suggests that natives are more attached to the topics of their countries, while migrants are generally less involved in discussing such topics, both for the home and for the destination country. However, we observe that a few migrant users do have large HA and DA, showing different cultural integration patterns, as detailed in [8]. At the same time, some natives show low interest in their country’s topics, which could be due to an interest in world-level rather than local-level topics.
4.2 Profile Information
Can we find any distinctive characteristics of migrants and natives from the profiles of users? Here, we look at public information provided by the users themselves on their profiles. We examine the distribution of profile information and perform the Kolmogorov–Smirnov (KS) test to compare the distributions for migrants and natives. On the profile, users declare various pieces of information, such as the joined date, location, bio, birthday and more. We begin by looking at the age of the Twitter accounts, from the moment they were created until 2018, as shown in Fig. 2. We observe that the distributions for migrants and natives have a similar shape, indicating that neither group arrived on Twitter earlier or later than the other. The high p-value of the KS test (0.404) is consistent with this: we cannot reject the hypothesis that the two distributions are the same. The other criteria we study show some differences. First, we observe that natives have slightly more friends than migrants. On average, migrants follow about 1,160 friends and natives about 1,291. We can also see from Fig. 2 that the range of this number is much wider for natives, from 0 to a maximum of 436,299, whereas for migrants the range ends at 125,315. The KS test yields a p-value of 1.713e−23, indicating that the two distributions are different. Secondly, we observe that migrants have a larger number of followers. On average, migrants have 10,972 followers versus 7,022 followers for natives (KS p-value of 0.008). This tells us that, on average, more users are waiting for updates from migrant users. Interestingly, the number of tweets (statuses) that users have posted since the account was created is about 9% higher for migrants than for natives: average values of 9,836 for migrants and 9,016 for natives, with a p-value of 9.777e−06.
We also look at the number of accounts that are classified as verified. Verified accounts usually belong to well-known people such as celebrities, politicians, writers or directors. The proportion of verified accounts is higher among migrants than among natives, which partly explains the higher number of followers and tweets for this group. To be more specific, 5% of the accounts are verified among migrants and 3.7% among natives.
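For readers who want to reproduce comparisons of this kind, the two-sample KS test can be sketched in plain Python. This is a generic illustration, not the implementation used for the p-values reported above: the statistic is the largest gap between the two empirical distribution functions, and the p-value uses the standard asymptotic series with a common small-sample correction, which is only approximate.

```python
import math


def ks_two_sample(a, b):
    """Return (D, p) for the two-sample Kolmogorov-Smirnov test.

    D is the maximum distance between the empirical CDFs of `a` and `b`;
    p is an asymptotic approximation, not an exact value.
    """
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:   # advance both ECDFs past x
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    effective_n = n * m / (n + m)
    lam = (math.sqrt(effective_n) + 0.12 + 0.11 / math.sqrt(effective_n)) * d
    if lam < 0.1:                    # series converges poorly here; p is ~1
        return d, 1.0
    p = 2 * sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Toy data: two clearly separated samples and two overlapping ones.
d_far, p_far = ks_two_sample(range(50), range(100, 150))
d_half, p_half = ks_two_sample([1, 2, 3, 4], [3, 4, 5, 6])
print(d_far, p_far, d_half, p_half)
```

A small p-value is evidence that the two samples come from different distributions; a large one only means the test found no evidence of a difference, which is why a value such as 0.404 should be read as "consistent with similar distributions".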
Fig. 2. Left: Distributions of profile features: number of followers, tweets published (statuses), and friends and number of days since the account was created until 2018, respectively. Centre: Distribution of tweet locations and languages. Right: Distribution of tweet locations and languages of friends
4.3 Tweets
Tweets also provide useful information about user behaviour. We are interested in the locations (at country level) and languages a user employs on Twitter. Hence, we look at the number of languages and locations that appear in the users’ 200 most recent tweets, and again compute KS statistics to compare the distributions of migrants and natives. As shown in the centre panel of Fig. 2, migrants tweet in a wider variety of languages and locations. The distributions for migrants and natives differ, as the KS tests show low p-values: 2.36e−194 for location and 1.412e−38 for language. Since we possess network information, we also studied the tweet language and location information of a user’s friends. In the right panel of Fig. 2, the two distributions show smaller differences between natives and migrants than in the centre panel. However, the KS tests indicate that the distributions do differ, with p-values of 3.246e−05 for location and 0.005 for language. Although the differences are small, the friends of migrants tweet from more locations than those of natives, with an average of 29.6 for migrants and 27.4 for natives. For the number of languages of friends, the KS test again indicates different distributions, but the difference between the averages is very small: 30.22 for migrants and 30.43 for natives. These numbers suggest that migrants have travelled to more places and hence write in more languages than natives. The friends of migrants also tend to have travelled more. However, no large difference was observed between migrants and natives in the number of languages their friends write in.
Fig. 3. Left: Top 14 hashtags used by migrants and natives. Right: Degree distribution of the network.
Popular Hashtags. What were the most popular hashtags used by natives and migrants in 2018? In Fig. 3 we display the top 10 hashtags used by the two communities, together with the number of tweets using those hashtags, scaled to [0, 1]. We observe that natives and migrants share some common interests, but they also have differences. For instance, some of the hashtags common to natives and migrants are #tbt, #love and #art. Other hashtags, such as #travel and #repost, are in the top list, but their usage is much higher in one group than in the other. For instance, the hashtag #travel is used much more by migrants than by natives. This is interesting because the number of tweet locations of migrants also reflects their tendency to travel more than natives. After #travel, migrants also used hashtags such as #sunset, #photography, #summer, and hashtags for countries, which show their interest in travelling. On the other hand, natives are more focused on hashtags such as #job, #jobs, and #veteran.
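A ranking such as the one discussed above can be sketched in a few lines. The text does not say how the counts were scaled to [0, 1]; the sketch below assumes division by the largest count and counts each hashtag once per tweet, both of which are illustrative choices.

```python
from collections import Counter


def top_hashtags(tweets, k=10):
    """Return the k most used hashtags with counts scaled by the top count.

    `tweets` is an iterable of hashtag lists, one list per tweet. Each
    hashtag is counted at most once per tweet and compared case-insensitively.
    """
    counts = Counter()
    for tags in tweets:
        counts.update({tag.lower() for tag in tags})
    top = counts.most_common(k)
    if not top:
        return []
    peak = top[0][1]
    return [(tag, n / peak) for tag, n in top]


migrant_tweets = [
    ["#travel", "#sunset"],
    ["#Travel"],
    ["#love", "#travel", "#travel"],
    ["#love"],
]
ranking = top_hashtags(migrant_tweets, k=2)
print(ranking)
```

Running the same function separately on the tweets of migrants and of natives gives two rankings that can be compared side by side, as in the left panel of Fig. 3.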
5 Network Analysis
In this section, we perform social network analysis on the social graph of our users to examine the relationships between and within the different communities, i.e., migrants and natives. Initially, our network consisted of 45,348 nodes and 232,000 edges. However, we focus on the giant component of the network, which consists of 44,582 nodes and 231,372 edges. Each node represents either a migrant or a native; the edges are directed and represent friendship on Twitter (in other words, source nodes follow target nodes). Since we have migrant and native labels, our network allows us to study the relationship between migrants and natives.
5.1 Properties of the Network
In this section, we start by looking at density, reciprocity, and shortest path length for the network, and then study node centrality, including the degree distribution. The network is sparse: on average, each node is connected to 5.2 other nodes. The reciprocity coefficient is low and indicates that only 23.8% of the links are reciprocated. This is normal on Twitter, as most users follow celebrities, who in many cases do not follow back. Within the network, the average shortest path length is 2.42, which means that information needs on average between two and three hops to travel from one node to another.
We also compute 7 measures of centrality: all-, in- and out-degree (Fig. 3) plus Closeness, Betweenness, Pagerank and Eigenvector centrality. As shown in Fig. 3, the degree distribution follows a power-law distribution with alpha equal to 2.9. This means that a minority of the nodes are highly connected to the rest of the nodes. As for the rest of the centrality measures, we observe that most of the users have low centrality, while a small number of users show higher centrality values. This is true for all measures; however, for closeness, the number of users who show higher centrality is larger than for the other measures. This means that many users are well embedded in the core of the network and are in a good position to receive information.
We also compute the correlation between the different centrality measures, as shown in Fig. 4. First of all, we observe a positive relationship among all measures, which is expected, as it means that users who are central from one point of view are also central from another. The Betweenness and Eigenvector centrality measures correlate the most (r = 0.55). This suggests that users who serve as a bridge between two parts of the graph are also likely to be among the most influential users in the network. On the other hand, Betweenness and Closeness centrality have the lowest correlation, with r = 0.19. However, the scatterplot shows that those few users who have larger Betweenness also have a large Closeness. The low correlation arises because a large majority of users show almost null Betweenness, while Closeness is heterogeneous among this group. A similar observation can be made for the relation between Closeness on one side and Pagerank and Eigenvector centrality on the other: high Pagerank and Eigenvector centralities always correspond to high Closeness, whereas for users with low Pagerank and Eigenvector centrality the Closeness values vary.
Fig. 4. Correlation between different centrality measures for the network. We computed Closeness, Betweenness, Pagerank and Eigenvector measures, respectively.
When checking the labels (migrant or native) of the most central users, we see that in general these are mostly natives. To be more specific, we observe that 8 to 10 of the top 10 users are natives. In other words, for most nodes the majority of in-coming and out-going links involve natives’ accounts. This is somewhat expected, since in our network only 10% of users are migrants. However, we note that a migrant user is always in the top 3 for the Closeness, Pagerank and Eigenvector centrality measures. This tells us that this migrant user has a crucial influence over the network around itself, but also beyond its direct connections.
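The basic quantities used in this section (giant component, links per node, reciprocity) can be sketched without any graph library. The code below is illustrative only; it assumes the giant component is the largest weakly connected component and that reciprocity is the share of directed links whose reverse link also exists. As a sanity check on the reported figures, 231,372 edges over 44,582 nodes gives about 5.19 links per node, in line with the value of 5.2 quoted above.

```python
from collections import defaultdict, deque


def giant_component(edges):
    """Return the node set of the largest weakly connected component."""
    neighbours = defaultdict(set)
    for u, v in edges:              # ignore direction for connectivity
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:                # breadth-first search from `start`
            node = queue.popleft()
            for nb in neighbours[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def reciprocity(edges):
    """Share of directed links (u, v) for which (v, u) also exists."""
    edge_set = {(u, v) for u, v in edges if u != v}
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)


# Toy follower graph: (source, target) means "source follows target".
edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "d"), ("x", "y")]
core = giant_component(edges)
core_edges = [(u, v) for u, v in edges if u in core and v in core]
links_per_node = len(core_edges) / len(core)
print(sorted(core), links_per_node, reciprocity(core_edges))
```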
5.2 Assortativity Analysis
We now focus on measuring the assortativity of nodes by different attributes of individuals, i.e., the migrant/native label, country of residence and country of nationality. Assortativity tells us whether the network connections correlate in any way with the given node attributes [11]. In other words, it tells us whether the nodes in the network tend to connect with other similar nodes. It typically ranges between −1 and 1. A value of 1 means nodes always connect with nodes with the same attributes, i.e., full homophily, while −1 means nodes tend to connect with nodes with different attributes. In our case, this analysis allows us to infer whether and to what extent the network topology follows the nationality or residence of the users, or whether the migrant/native status is relevant when building online social links.
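As a concrete illustration of the global coefficient, the sketch below computes Newman's assortativity for a categorical node attribute on a directed edge list, using the mixing-matrix form r = (Σ e_ii − Σ a_i b_i) / (1 − Σ a_i b_i), where e_ii is the fraction of links joining two nodes of category i, and a_i and b_i are the fractions of links starting and ending in category i. The function name and the toy data are hypothetical; this is not the code used for the results reported here.

```python
from collections import Counter


def attribute_assortativity(edges, attribute):
    """Newman's assortativity coefficient for a categorical attribute.

    `edges` is a list of directed (source, target) pairs and `attribute`
    maps each node to its category (e.g. "migrant" or "native").
    """
    m = len(edges)
    same = 0
    out_count, in_count = Counter(), Counter()
    for u, v in edges:
        a, b = attribute[u], attribute[v]
        same += a == b
        out_count[a] += 1
        in_count[b] += 1
    trace = same / m
    expected = sum(out_count[c] * in_count[c] for c in out_count) / (m * m)
    if expected == 1.0:
        return float("nan")   # undefined when every link joins one category
    return (trace - expected) / (1.0 - expected)


label = {1: "native", 2: "native", 3: "migrant", 4: "migrant"}
within = [(1, 2), (2, 1), (3, 4), (4, 3)]     # links only inside each group
between = [(1, 3), (3, 1), (2, 4), (4, 2)]    # links only across groups
r_within = attribute_assortativity(within, label)
r_between = attribute_assortativity(between, label)
print(r_within, r_between)
```

The two toy networks reach the extremes of the scale: links only within groups give full homophily, links only across groups give the opposite.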
Fig. 5. Stacked histogram of conformity measures: From left, we have conformity measure by residence, by nationality and by migrant/native label. Please note that the histograms are stacked, therefore there is no overlap between the plot bars. Blue indicates migrants and orange indicates natives.
We begin with global assortativity measures, which give one assortativity score for the entire network. First, the degree assortativity coefficient of −0.054 shows no particular homophily from the point of view of node degree: high-degree nodes do not preferentially link with other high-degree nodes. However, when we measure the assortativity by different attributes, we observe different results. When looking at the coefficient by country of residence, the score of 0.54 shows a substantial level of homophily. The score is slightly higher when we examine the behaviour by country of nationality (0.6). These values tell us that nodes tend to follow other nodes that share the same country of residence and country of nationality, with a stronger effect for the latter. However, when looking at the coefficient by the migrant/native label, we observe no particular correlation (0.033). The global assortativity scores can be influenced by the size of the data and the imbalance in labels, which applies in our case, especially for the migrant/native labels. Therefore, we continue by examining assortativity at the local level, which allows us to overcome the possible issues at the global level. We thus compute the scores based on an extension of Newman’s assortativity introduced by [13], called conformity. In Fig. 5 we show the distribution of node-level conformity of migrants and natives for the three attributes (nationality, residence and migrant/native label). We observe different behaviour patterns for migrants and natives. Specifically, we see that migrants tend to display lower homophily than natives when looking at the conformity of nodes by country of residence. This tells us that migrant users tend to take the country of residence less into account when following other users. Instead, most natives tend to connect with users residing in the same country. When looking at nationality, this effect is less pronounced. While natives continue to display generally high homophily, with a small proportion of users with low values, migrants show a flatter distribution than for residence. Again, a large part of the migrants show low homophily; however, a considerable fraction of migrant users show higher nationality homophily, as opposed to what we saw for residence. This is in line with what we observed at the global level: there is a stronger tendency to follow nationality labels when creating social links. As for the conformity of nodes by migrant/native label, we observe that migrants and natives clearly have distinctive behaviours. While natives tend to form connections with other natives, migrants tend to connect with natives as well, resulting in negative conformity values for migrant users. The observed values could also be due to the fact that migrants are only about 10% of our users, so naturally many friends will be natives (from the residence, nationality or another country). This result is different from what we observed at the global level and suggests that the global assortativity score was influenced by the size of the data and the imbalance in labels.
6 Conclusion
We studied the characteristics of two different communities, migrants and natives, observed on Twitter. Analysing the profiles, tweets and network structure of these communities allowed us to discover interesting differences. We observed that migrants have more followers than friends. They also tweet more often and in a wider variety of locations and languages. This is also shown through the hashtags, where the most popular hashtags used among migrants reflect their interest in travel. Furthermore, we detected that Twitter users tend to be connected to other users that share the same nationality more than the same country of residence. This tendency was relatively stronger for migrants than for natives. Furthermore, both natives and migrants tend to connect mostly with natives. As mentioned previously, we do not intend to generalise the findings of this work, as only a small sample of individual Twitter data was used. However, we believe that by aggregating the individual-level data, we were able to extract information that is worth investigating further. To this extent, we simply intend to present what is demonstrated through our dataset. In spite of this drawback, we were able to observe interests, usage of Twitter and social interactions between migrants and natives thanks to the availability of the Twitter data. In the future, it would be interesting to further exploit some of the findings of this work. For instance, we can observe how central users in the network spread culture or information throughout the network, and how effective the spreading/communication channels initiated by these central users are. Additionally, based on the network composition we have observed, it is possible to investigate strong and weak ties in the network to study network support for migration settlement [1].
Acknowledgment. This work was supported by the European Commission through the Horizon2020 European projects “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics” (grant agreement no 871042) and “HumMingBird - Enhanced migration measures from a multidimensional perspective” (grant agreement no 870661).
References
1. Blumenstock, J.E., Chi, G., Tan, X.: Migration and the value of social networks. CEPR Discussion Paper No. DP13611 (2019)
2. Cha, M., Benevenuto, F., Haddadi, H., Gummadi, K.: The world of connections and information flow in twitter. IEEE Trans. Syst. Man, Cybern. A: Syst. Hum. 42(4), 991–998 (2012)
3. Cha, M., Haddadi, H., Benevenuto, F., Gummadi, P.K., et al.: Measuring user influence in twitter: the million follower fallacy. ICWSM 10(10–17), 30 (2010)
4. Coletto, M., et al.: Perception of social phenomena through the multidimensional analysis of online social networks. Online Soc. Netw. Media 1, 14–32 (2017)
5. Grandjean, M.: A social network analysis of Twitter: mapping the digital humanities community. Cogent Arts Human. 3(1), 1171458 (2016)
6. Hawelka, B., Sitko, I., Beinat, E., Sobolevsky, S., Kazakopoulos, P., Ratti, C.: Geo-located twitter as proxy for global mobility patterns. Cartogr. Geogr. Inf. Sci. 41(3), 260–271 (2014)
7. Kim, J., Sîrbu, A., Giannotti, F., Gabrielli, L.: Digital footprints of international migration on twitter. In: International Symposium on Intelligent Data Analysis, pp. 274–286. Springer (2020). https://doi.org/10.1007/978-3-030-44584-3_22
8. Kim, J., Sîrbu, A., Rossetti, G., Giannotti, F., Rapoport, H.: Home and destination attachment: study of cultural integration on twitter. arXiv preprint arXiv:2102.11398 (2021)
9. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600 (2010)
10. Mazzoli, M., et al.: Migrant mobility flows characterized with digital data. PLoS ONE 15(3), e0230264 (2020)
11. Newman, M.E.J.: Assortative mixing in networks. Phys. Rev. Lett. 89(20), 208701 (2002)
12. Radicioni, T., Pavan, E., Squartini, T., Saracco, F.: Analysing twitter semantic networks: the case of 2018 italian elections. arXiv preprint arXiv:2009.02960 (2020)
13. Rossetti, G., Citraro, S., Milli, L.: Conformity: a path-aware homophily measure for node-attributed networks. IEEE Intell. Syst. 36(1), 25–34 (2021)
14. Sîrbu, A., et al.: Human migration: the big data perspective. Int. J. Data Sci. Anal. 11(4), 341–360 (2020). https://doi.org/10.1007/s41060-020-00213-5
15. Xiong, Y., Cho, M., Boatwright, B.: Hashtag activism and message frames among social movement organizations: semantic network analysis and thematic analysis of Twitter during the #metoo movement. Pub. Relat. Rev. 45(1), 10–23 (2019)
16. Zagheni, E., Garimella, V.R.K., Weber, I., State, B.: Inferring international and internal migration patterns from twitter data. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 439–444 (2014)
Evolution of the World Stage of Global Science from a Scientific City Network Perspective
Hanjo D. Boekhout1(B), Eelke M. Heemskerk2, and Frank W. Takes1
1 Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
[email protected]
2 University of Amsterdam, Nieuwe Achtergracht 166, 1001 NB Amsterdam, The Netherlands
Abstract. This paper investigates the stability and evolution of the world stage of global science at the city level by analyzing changes in co-authorship network centrality rankings over time. Driven by the problem that there exists no consensus in the literature on how the spatial unit “city” should be defined, we first propose a new approach to delineate so-called scientific cities. On a high-quality Web of Science dataset of 21.5 million publications over the period 2008–2020, we study changes in centrality rankings of subsequent 3-year time-slices of scientific city co-authorship networks at various levels of impact. We find that, over the years, the world stage of global science has become more stable. Additionally, by means of a comparison with degree-respecting rewired networks, we reveal how new co-authorships between authors from previously unconnected cities more often connect ‘close’ cities in the network periphery.
Keywords: Scientific co-authorship networks · Scientific cities · City networks · Temporal networks · Rank correlation
1 Introduction
A prevalent way of studying the global science system is to produce rankings. This may involve rankings of universities based on, for example, publications, scientific impact, collaboration, open access and gender balance in order to “assess university performance on the global stage” [15,17]. It could involve ranking authors based on, for example, fractionally counted citations [3], the h-index [8], or PageRank in co-citation networks [10]. Or it may involve ranking geographical areas such as countries or cities based on, for example, scientific output [2,6] or domestic vs. international co-authorship [11]. In short, there are many ‘levels’ at which rankings are produced as part of the study of the science system. In this work, we consider rankings at the city level. However, there exists no consensus in the literature on how the spatial unit “city” should be defined [9]. Therefore, we propose a new approach to delineate scientific cities: agglomerations of cities within a small radius, based on geo-located addresses on scientific publications.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 142–154, 2022. https://doi.org/10.1007/978-3-030-93409-5_13
Related work on the science system often considers measures that are directly computable for a given entity, such as their scientific output, scientific impact, etc. [2,3,6,8,11,15]. These directly computable measures usually say little about what position the entity takes within the global science system. Instead, in this work we rank based on the position of the nodes in the co-authorship network underlying the science system, in particular using network centrality measures. We do so with the goal of establishing which cities take a ‘central’ role on the world stage of global science, similar to, e.g., related work by Ding et al. [10].
In this paper, we study changes in centrality rankings in co-authorship networks over time, where our nodes are cities rather than authors and edges denote co-authorship between these cities. By studying the change in centrality rankings, we shed light on the stability of the network over time. In particular, we aim to study the stability of the co-authorship networks over time at various levels of ‘prominence’, measured through publication impact in terms of citations received. To this end, we measure the change in network-based rankings that occurs for three co-authorship network variants covering all publications, the top 10% and the top 1% most highly cited publications. Furthermore, we aim to validate the significance of the observed changes in rankings by comparing the changes in these evolving ‘real-world’ networks to the changes that occur when the rewiring is performed in a random manner. However, because a sensible rewiring is non-trivial, we propose a new, suitable approach to generating rewired networks.
We extract evolving co-authorship networks of scientific cities from a large and high-quality dataset (21.5 million publications with complete author affiliation linkages and geolocation information for the period 2008–2020). With these networks we show that the world stage of global science has become more stable, and by extension the city co-authorship networks less prone to structural change, over the years. Additionally, we show that city networks follow the expected pattern of more often establishing new co-authorship relations with ‘close’ cities than ‘distant’ cities. Finally, we conclude that, compared to our null model, changes in the network more often occur in the periphery. In short, we do the following: (1) we propose a new approach to delineating scientific cities; (2) we study changes in various network centrality rankings over time at various levels of ‘prominence’ to study the ‘world stage of global science’; and (3) we propose a new rewiring null model and determine the significance of the observed changes in centrality rankings by comparing with this null model. The remainder of this paper is structured as follows. In Sect. 2 we present our basic network notation and define various network centrality measures and rank correlation measures used in our experiments. Then, in Sect. 3 we discuss how the co-authorship networks were extracted and how we generate randomly rewired networks. Next, the experimental setup, results and limitations are discussed in Sect. 4. Finally, in Sect. 5 we summarize and conclude.
2 Definitions, Measures and Background
In this section we provide basic network notation and terminology in Sect. 2.1, define centrality measures for ranking vertices in Sect. 2.2 and specify correlation measures to compare those rankings in Sect. 2.3.
2.1 Network Notation and Definitions
In this paper, we study city co-authorship networks, which model scientific cities as nodes and co-authorship on scientific publications by authors from different cities as edges (see Sects. 3.2–3.4 for more details on the specific data used as input). Because scientific co-authorship is an undirected relation, we model networks in this study using an undirected graph $G = (V, E, \omega)$, with $V$ the set of vertices or nodes, $E$ the set of edges $\{u, v\}$ with $u, v \in V$, and $\omega$ the weight function. We use $n = |V|$ and $m = |E|$. No self-loops and no parallel edges are assumed. For weighted graphs, edge weights are a function of the connected nodes, denoted $\omega(u, v)$, with $\omega(u, v) > 0$ iff $\{u, v\} \in E$ and $\omega(u, v) = 0$ iff $\{u, v\} \notin E$. For unweighted graphs, $\omega(u, v) = 1$ for all $\{u, v\} \in E$. We define a $\theta$-minimum-weight graph $G_{\geq\theta}(V, E')$ as the unweighted graph induced from a weighted graph $G$ where $\{u, v\} \in E'$ iff $\omega(u, v) \geq \theta$.
Let $u \leadsto v$ denote the existence of a path between nodes $u, v \in V$. We call $H = (V' \subseteq V, \{\{u, v\} : u, v \in V' \wedge \{u, v\} \in E\})$ a connected component when for all $u, v \in V'$ it holds that $u \leadsto v$, i.e., all nodes are reachable from every other node. The largest connected component in a graph, in terms of nodes, is referred to as the giant component. The distance between two nodes is denoted as $d_G(u, v)$ (with $u, v \in V$) and indicates the length of a shortest path, i.e., a path $u \leadsto v$ in graph $G$ for which the sum of the weights of its edges is minimal. We define the distance between a node and itself as zero, i.e., $d_G(u, u) = 0$. The number of shortest paths connecting $u, v \in V$ is denoted by $\sigma_{uv}$, with the number of shortest paths including node $w \in V \setminus \{u, v\}$ denoted by $\sigma_{uv}(w)$. The neighborhood $N_G(v)$ of a node $v \in V$ is defined as the set of nodes to which $v$ links, i.e., $N_G(v) = \{w \in V : \{v, w\} \in E\}$. The degree of a node equals the size of its neighborhood, i.e., $\deg_G(v) = |N_G(v)|$.
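The notation above maps directly onto simple data structures. The following sketch is only an illustration with made-up city names and weights: it derives the θ-minimum-weight graph from a weighted edge map, keeps the vertex set V unchanged, and computes unweighted distances with a breadth-first search.

```python
from collections import deque


def min_weight_graph(weights, theta):
    """Adjacency sets of the unweighted graph with edges of weight >= theta.

    `weights` maps frozenset({u, v}) to omega(u, v) > 0. Every node of the
    weighted graph is kept, even if it loses all of its edges.
    """
    adjacency = {}
    for pair, weight in weights.items():
        u, v = tuple(pair)          # no self-loops, so exactly two nodes
        adjacency.setdefault(u, set())
        adjacency.setdefault(v, set())
        if weight >= theta:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def distances_from(adjacency, source):
    """Shortest-path lengths d_G(source, v) for every reachable v."""
    dist = {source: 0}              # d_G(u, u) = 0 by definition
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


# Hypothetical co-authorship weights between four cities.
weights = {
    frozenset({"Leiden", "Amsterdam"}): 5,
    frozenset({"Amsterdam", "Paris"}): 2,
    frozenset({"Paris", "Rome"}): 1,
}
graph = min_weight_graph(weights, theta=2)
degree = {city: len(neighbours) for city, neighbours in graph.items()}
dist = distances_from(graph, "Leiden")
print(degree, dist)
```

With θ = 2 the weakest link is dropped, so one city becomes isolated and falls outside the giant component, which is why analyses on $G_{\geq\theta}$ are usually restricted to that component.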
Because we want to study the change in rankings over time, we consider a series of static time-slices, i.e., static networks covering only a few successive years of data. The extraction of these time-slices from our data is discussed in Sect. 3.4.
2.2 Ranking Measures
In this work we determine the rankings of nodes based on various centrality measures. Specifically, we consider degree, eigenvector, closeness and betweenness centrality. Below we define each of these diverse measures and provide the rationale of high (or low) rankings for cities, with respect to the role these nodes play within the structure of the scientific city co-authorship networks.
Degree Centrality. Degree centrality assumes that nodes with connections to more neighbors are more central. In city co-authorship networks, a high rank translates to co-authorships with many different cities. It is defined as
$$dc_G(u) = \frac{\deg_G(u)}{n - 1} \tag{1}$$
Eigenvector Centrality. Eigenvector centrality is based on the idea that an actor is more central if it is connected to many actors that are central themselves [13]. As such, it considers not only the number of adjacent vertices, but also their centrality values. It can be computed by iteratively setting the value $EV(u)$ of every node $u \in V$ to the average of the values of its neighbors, where the initial values of $EV(u)$ are proportional to the degrees of the nodes, normalizing after each step.
$$ec_G(u) = EV(u) \tag{2}$$
In scientific city co-authorship networks, a high rank indicates that the city forms co-authorships with cities that themselves co-author with many other cities.
Closeness Centrality. Closeness centrality is a measure of how close a vertex is to all other vertices in the graph. As we will be dealing only with the giant component of undirected networks in our experiments, we can employ the simplest version of this measure, as first introduced by Bavelas [1], defined as follows.
$$cc_G(u) = \frac{n - 1}{\sum_{v \in V} d_G(u, v)} \tag{3}$$
In other words, the closeness centrality of $u$ is the inverse of the average (shortest-path) distance from $u$ to any other vertex in the graph. In our city networks, highly ranked cities are the cities that require the fewest ‘intermediary cities’ for establishing co-authorships with every other city in the network, i.e., the world.
Betweenness Centrality. Betweenness centrality is a measure of the ratio of shortest paths a node lies on [4]. In other words, it measures the extent to which shortest paths pass through a specific node. It is defined as follows.
$$bc_G(u) = \sum_{s, t \in V \setminus \{u\}} \frac{\sigma_{st}(u)}{\sigma_{st}} \tag{4}$$
In city co-authorship networks, lying on a shortest path connecting two cities indicates that establishing a co-authorship between those cities may most easily be accomplished through an introduction or collaboration with your city. As such, highly ranked cities may form an important factor in brokering new coauthorships between ‘distant’ cities.
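As a small illustration of definitions (1) and (3), the following sketch computes degree and closeness centrality on a toy connected, unweighted, undirected graph. The adjacency-dictionary representation, the function names and the four-node path graph are assumptions made for this example and are not taken from the study.

```python
from collections import deque


def degree_centrality(adj):
    """Eq. (1): deg(u) / (n - 1) for every node u."""
    n = len(adj)
    return {u: len(nbrs) / (n - 1) for u, nbrs in adj.items()}


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def closeness_centrality(adj):
    """Eq. (3): (n - 1) / sum of shortest-path distances; assumes a connected graph."""
    n = len(adj)
    return {u: (n - 1) / sum(bfs_distances(adj, u).values()) for u in adj}


# Hypothetical toy network: a path A - B - C - D.
toy = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
dc = degree_centrality(toy)
cc = closeness_centrality(toy)
print(dc["B"], cc["A"], cc["B"])
```

On this path graph the inner nodes B and C obtain both the highest degree and the highest closeness, as one would expect from the definitions.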
2.3 Rank Comparison Measures
In order to systematically compare two rankings of nodes in evolving networks, a measure is required that can express their (in)equality in one normalized number. Two correlation-based measures suited to this task are the Spearman and Kendall rank correlations. The advantage of applying the Spearman rank correlation is that the exact difference in ranking between all pairs of nodes in both rankings is taken into account [14]. In contrast, Kendall’s Tau correlation considers only the extent to which the pairs of nodes are identically ordered.
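The difference between the two measures can be made concrete with a minimal sketch. It assumes rankings without ties, uses the textbook formulas (Spearman via squared rank differences, Kendall's Tau as the normalized difference between concordant and discordant pairs), and operates on hypothetical rankings of four cities; none of the names below come from the study.

```python
from itertools import combinations


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings over the same items."""
    n = len(rank_a)
    d2 = sum((rank_a[x] - rank_b[x]) ** 2 for x in rank_a)
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(rank_a, rank_b):
    """Kendall's Tau (tau-a): (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical rankings in two subsequent time-slices: A and B swap places.
before = {"A": 1, "B": 2, "C": 3, "D": 4}
after = {"A": 2, "B": 1, "C": 3, "D": 4}
rho = spearman(before, after)
tau = kendall_tau(before, after)
print(rho, tau)
```

A single swap of adjacent ranks flips one of the six pairs, which Kendall's Tau counts in full, whereas Spearman only registers two rank differences of size one.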
3 Materials and Methods
In this section we first discuss the bibliographic database from which we extract our co-authorship networks in Sect. 3.1. Then, Sect. 3.2 describes our new approach for the delineation of the scientific city agglomerations, i.e., our nodes. Section 3.3 discusses the publication sets and counting methods used to compute the edge weights. Next, we describe how we obtain our final co-authorship network time-slices in Sect. 3.4. Finally, in Sect. 3.5 we explain how we generated the random networks.

3.1 Bibliographic Database
Our analysis is based on Clarivate’s Web of Science database (WoS). Specifically, we use the in-house version of WoS at the Centre for Science and Technology Studies (CWTS) at Leiden University from April 2021. This version of WoS has been enriched with its own: citation matching; assignment of publications to universities and organizations in a consistent and accurate manner [15]; geocoding of the author addresses; and improved author disambiguation [5]. We consider publications published in 2008–2020 categorized as Article, Review, Letter or Proceedings Paper. Publications with missing author-affiliation linkages or missing both geolocation and organization information are excluded. This leaves 21.5 million publications (87.2% of total), covering 196 countries.

3.2 Delineation of Scientific City Agglomerations - Nodes
Csomós [9] discussed the various challenges of spatial scientometrics focusing on the city level. One of these challenges is that there exists no consensus in the literature on how the spatial unit “city” should be defined and how metropolitan areas should be delineated. Our approach to constructing a set of urban agglomerations (cities) most closely matches that described by Maisonobe et al. [12]. However, whereas their approach agglomerates to metropolitan areas the size of world cities, we instead agglomerate to smaller clusters of research localities, which we call scientific cities. Here, we rely on the observations of Bornmann and de Moya-Anegón [2] and Catini et al. [6] that “institutions are frequently spatially clustered in larger cities” and that “research institutions involved in scientific and technological production” are generally located close to each other and produce well-outlined research clusters within cities. As such, we segregate world cities with multiple research clusters such that, as long as research clusters are more than eight kilometers apart, each research cluster is considered its own scientific city. This approach allows us to create globally comparable geographical entities, similar to [12], whilst still delineating between distinct cities in notoriously difficult regions for agglomeration such as “de Randstad” in the Netherlands (Amsterdam, Leiden, The Hague, Delft etc.).

We found this eight kilometer radius to work well throughout most regions of the world. One clear exception, which can be considered a limitation of this approach, is that Chinese cities, such as Beijing, do not appear to allow for segregation into research clusters, as the addresses listed on such publications tend to be at the municipality level, which in the case of Beijing covers approximately 16,000 km² [9]. As such, these cities may come to have an advantage with respect to their scientific output over other world cities for which we are able to delineate multiple scientific cities.

To reduce the number of cities with very low scientific output in our set of scientific agglomerations, we merged cities with fewer than ten publications into the closest scientific agglomeration with more than a hundred publications (within 30 km). In the end we are left with 16,619 distinct cities, with co-authorships forming 2,084,123 distinct city pairs. These distinct cities and city pairs form respectively the (potential) nodes and edges of our networks.

3.3 Publication Sets and Counting Methods - Edge Weights
Recall that we aim to understand differences between all scientific publishing activity and high impact publishing activity. As such, edges and their weights are determined based on three different publication sets: 1. all, the full publication set; 2. hcp10, consisting only of the top 10% highly cited publications; and 3. hcp1, consisting only of the top 1% highly cited publications. The highly cited publication sets are determined by ranking publications within each respective publication year and Web of Science subject category [7]. The publications are ranked by the number of citations they received in the first three years after publication excluding self-citations, where ties are broken by the number of self-citations. By ranking separately for publication years we are able to determine edge weights for each respective year, thus allowing the creation of network time-slices. Additionally, we rank each WoS subject category separately, because different scientific fields have different citation practices and the WoS subject categories can be used as a proxy for scientific fields. If we were to rank irrespective of field, we would erroneously select too many publications from certain fields that receive on average more citations per publication than other fields, such as Biochemistry & Molecular Biology [16].

Deciding on the right counting method for a given purpose was another challenge highlighted by Csomós [9]. For example, when comparing the scientific output of cities it may be desirable to count a publication that involves multiple cities fractionally towards each city. Traditionally, fractional counting assigns equal size parts of the publication to each city. Here, we use the completeness of the author-affiliation linkages in our data to perform fractional counting based on the number of authors linked to each scientific city. Hereby, we aim to assign fractions representing the expected contribution of each city or city-pair. In this study we use city-pair fractional counting for determining the edge weights. Let $na_{i,j}$ indicate the number of authors on a publication $i$ linked to scientific city $j$ and let $C$ be the set of contributing cities. The fraction of publication $i$ assigned to city pair $a, b \in C$ is then determined by

$$\frac{na_{i,a} \cdot na_{i,b}}{\sum_{j,k \in C} na_{i,j} \cdot na_{i,k}}$$
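City-pair fractional counting for a single publication can be sketched as follows. The sketch assumes that the sum in the denominator runs over unordered pairs of distinct contributing cities, which the text does not state explicitly; the function name and the author counts are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def city_pair_fractions(authors_per_city):
    """Share of one publication assigned to each pair of contributing cities.

    Assumption: the normalizing sum runs over unordered pairs of distinct
    cities, so the shares of a multi-city publication add up to one.
    """
    pair_products = {
        (a, b): authors_per_city[a] * authors_per_city[b]
        for a, b in combinations(sorted(authors_per_city), 2)
    }
    total = sum(pair_products.values())
    return {pair: Fraction(prod, total) for pair, prod in pair_products.items()}


# Hypothetical publication with two authors in Leiden and one each in Delft and Amsterdam.
shares = city_pair_fractions({"Leiden": 2, "Delft": 1, "Amsterdam": 1})
for pair, share in shares.items():
    print(pair, share)
```

Under this reading, pairs involving the city with more authors receive a proportionally larger share, and a publication with a single contributing city yields no city pairs at all.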
3.4 Network Formation
We are now ready to extract the various co-authorship network time-slices. Due to variations in the time between conducting research and the publication of that research, there exist minute annual fluctuations in scientific activity. A common approach to account for these fluctuations is to compute a normalized or moving average over a span of three years [11]. Therefore, the 11 time-slices we extracted each cover three years, respectively 2008–2010, 2009–2011, . . ., 2018–2020. For each time-slice and publication set, edge weights are determined using city-pair fractional counting (see Sect. 3.3). Next, we retain only those edges with a summed weight of more than one per million total publications in that time-slice, i.e., we obtain the θ-minimum-weight graphs with $\theta = \frac{\#\text{publications}}{1{,}000{,}000}$. As such, we exclude edges representing city collaborations that we deem too weak, while accounting for the overall increase in the number of publications per year in general. Finally, we reduce the networks to their giant components. Some basic statistics of the resulting networks are given in Table 1.

Table 1. Basic network statistics (see Sects. 3.4 and 4.1 for details)

                    all                        hcp10                      hcp1
Time-slice   θ      n      m       avg deg     n      m       avg deg     n     m     avg deg
2008–2010    3.56   5,870  71,051  24.2        1,719  10,991  12.8        324   749   4.6
2009–2011    3.75   5,920  72,346  24.4        1,740  11,023  12.7        321   748   4.7
2010–2012    3.96   5,927  72,919  24.6        1,767  10,971  12.4        316   743   4.7
2011–2013    4.27   5,820  71,995  24.7        1,727  10,764  12.5        311   735   4.7
2012–2014    4.60   5,822  72,194  24.8        1,728  10,670  12.3        303   697   4.6
2013–2015    4.97   5,845  72,696  24.9        1,741  10,771  12.4        313   726   4.6
2014–2016    5.37   5,811  71,774  24.7        1,678  10,492  12.5        301   704   4.7
2015–2017    5.72   5,708  71,240  25.0        1,707  10,305  12.1        287   694   4.8
2016–2018    6.01   5,709  71,229  25.0        n/a    n/a     n/a         n/a   n/a   n/a
2017–2019    6.18   5,700  72,049  25.3        n/a    n/a     n/a         n/a   n/a   n/a
2018–2020    6.32   5,723  73,433  25.6        n/a    n/a     n/a         n/a   n/a   n/a
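The thresholding and giant-component steps of the network formation can be sketched as below. The edge-weight dictionary, the function names and the toy weights are assumptions for illustration; only the rule θ = #publications / 1,000,000 and the reduction to the giant component are taken from the text.

```python
from collections import deque


def theta_minimum_weight_edges(weighted_edges, n_publications):
    """Keep edges whose summed weight exceeds one per million publications."""
    theta = n_publications / 1_000_000
    kept = {pair for pair, weight in weighted_edges.items() if weight > theta}
    return kept, theta


def giant_component(edges):
    """Return the edges that lie in the largest connected component."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return {(u, v) for u, v in edges if u in best}


# Hypothetical summed city-pair weights for one time-slice with 3.56 million publications.
weights = {("A", "B"): 5.0, ("B", "C"): 4.2, ("C", "D"): 1.0, ("E", "F"): 9.0}
kept, theta = theta_minimum_weight_edges(weights, 3_560_000)
network = giant_component(kept)
print(theta, sorted(network))
```

In this toy case the weak edge C-D falls below the threshold, and the strong but isolated pair E-F is dropped when the network is reduced to its giant component.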
3.5 Evolving Degree Respecting Rewired Networks
In order to better understand the changes observed in the centrality rankings between subsequent time-slices, we want to compare these changes to those observed if the network rewiring was done randomly. The procedure of generating these networks (see Algorithm 1) involves rewiring a previous time-slice with an equal number of edge removals (lines 2–7) and edge additions (lines 8–20) as performed in the evolution of the real-world network to the subsequent time-slice. During this procedure we aim to retain the degree distribution of the real-world time-slices as closely as possible (lines 5 and 13–17). As such, a comparison with this null model highlights where in the city co-authorship networks (core, periphery, etc.) many real-world structural changes occur. We call these networks evolving degree respecting rewired networks (EDRR). Note that robustness checks for the constants used in Algorithm 1 will be performed in future work.

Algorithm 1: Algorithm for generating an EDRR network
    Input: Previous time-slice G_p = (V_p, E_p) and current time-slice G_c = (V_c, E_c)
    Output: EDRR network G_r
 1  G_r ← G_p
 2  for (u, v) ∈ E_p and (u, v) ∉ E_c do
 3      E_poss ← {}; r ← 0.1
 4      while E_poss = ∅ do
 5          E_poss ← {(s, t) : (s, t) ∈ G_r ∧ (1 − r) · deg_{G_p}(u) ≤ deg_{G_r}(s) ≤ (1 + r) · deg_{G_p}(u) ∧ (1 − r) · deg_{G_p}(v) ≤ deg_{G_r}(t) ≤ (1 + r) · deg_{G_p}(v)}
 6          r ← r · 2
 7      e_r ← random element from E_poss; G_r ← G_r \ e_r
 8  n_new ← [u^{deg_{G_c}(u)} for all u ∈ V_c, u ∉ V_p]
 9  for (u, v) ∈ E_c and (u, v) ∉ E_p do
10      E_poss ← {}; r ← 0.1
11      while E_poss = ∅ do
12          if deg_{G_p}(u) = 0 then
13              E_poss ← {(s, t) : (s, t) ∉ G_r ∧ s ∈ n_new ∧ (1 − r) · deg_{G_p}(v) ≤ deg_{G_r}(t) ≤ (1 + r) · deg_{G_p}(v)}
14          else if deg_{G_p}(v) = 0 then
15              E_poss ← {(s, t) : (s, t) ∉ G_r ∧ t ∈ n_new ∧ (1 − r) · deg_{G_p}(u) ≤ deg_{G_r}(s) ≤ (1 + r) · deg_{G_p}(u)}
16          else
17              E_poss ← {(s, t) : (s, t) ∉ G_r ∧ (1 − r) · deg_{G_p}(u) ≤ deg_{G_r}(s) ≤ (1 + r) · deg_{G_p}(u) ∧ (1 − r) · deg_{G_p}(v) ≤ deg_{G_r}(t) ≤ (1 + r) · deg_{G_p}(v)}
18          r ← r · 2
19      e_r = (u_r, v_r) ← random element from E_poss; G_r ← G_r ∪ e_r
20      n_new ← n_new \ {u_r, v_r}
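To make the degree-respecting idea more tangible, the following sketch implements only the removal phase of Algorithm 1 (lines 1–7); the addition phase is omitted. It is a simplified illustration rather than the authors' implementation: edges are treated as unordered pairs, a candidate edge is accepted if its endpoints match the degrees of the removed edge in either orientation, and all function names, the seed and the toy time-slices are assumptions.

```python
import random


def degrees(edges):
    """Degree of every node that appears in the edge set."""
    deg = {}
    for s, t in edges:
        deg[s] = deg.get(s, 0) + 1
        deg[t] = deg.get(t, 0) + 1
    return deg


def within(value, target, r):
    """True if value lies in the interval [(1 - r) * target, (1 + r) * target]."""
    return (1 - r) * target <= value <= (1 + r) * target


def edrr_removals(prev_edges, curr_edges, seed=0):
    """Removal phase only: replace each real removal by a degree-similar random one."""
    rng = random.Random(seed)
    prev = {tuple(sorted(e)) for e in prev_edges}
    curr = {tuple(sorted(e)) for e in curr_edges}
    deg_prev = degrees(prev)
    rewired = set(prev)
    for u, v in sorted(prev - curr):
        deg_rew = degrees(rewired)
        r, candidates = 0.1, []
        while not candidates:
            # Assumption: an undirected edge may match (u, v) in either orientation.
            candidates = [
                (s, t) for s, t in sorted(rewired)
                if (within(deg_rew[s], deg_prev[u], r) and within(deg_rew[t], deg_prev[v], r))
                or (within(deg_rew[s], deg_prev[v], r) and within(deg_rew[t], deg_prev[u], r))
            ]
            r *= 2  # widen the tolerance until at least one candidate exists
        rewired.remove(rng.choice(candidates))
    return rewired


# Hypothetical subsequent time-slices: edge B-C disappears, edge E-F appears.
prev_slice = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
curr_slice = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
rewired = edrr_removals(prev_slice, curr_slice, seed=1)
print(sorted(rewired))
```

Doubling the tolerance r guarantees that the loop terminates, since any remaining edge eventually satisfies the widened degree bounds.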
4 Results
In this section we discuss our experimental setup, results and limitations.

4.1 Experimental Setup
For each centrality measure (see Sect. 2.2) and impact level, we include in the rankings only those cities that occur in all time-slices. Additionally, for the full publication set (all) we consider for each time-slice only the top 2000 cities. This ensures that cities that do not play a ‘central’ role in the networks, i.e., that are not of direct importance to the world stage of global science, are excluded. As such, it mirrors the natural filtering that occurs for the hcp publication sets. By correlating centrality rankings of the more central cities, observed changes and differences are more relevant for understanding the world stage of global science. Because we use three years of citations after publication for determining the hcp publication sets, the last three time-slices (2016–2018, 2017–2019 and 2018–2020) are excluded from the analysis for hcp10 and hcp1. For this same reason, statistics on these time-slices are excluded from Table 1.

4.2 Centrality Changes over Time at Various Levels of Impact
Figure 1 shows the correlations between subsequent time-slices for each publication set and for the four centrality measures under consideration. For all four measures we see that the correlations for all are slowly but steadily rising. This tells us that over the years the full world stage of global science has become increasingly more stable, suggesting the city co-authorship network has become less prone to structural change. Most ‘stabilisation’ appears to have occurred between 2009 and 2015 (time-slices 08–10 and 14–16) and is most pronounced for betweenness centrality. Thus, annual changes in the city co-authorship network have had an increasingly diminished effect on shortest paths in the network. A pessimistic interpretation of this observation may be that fewer ‘meaningful’ bridging collaborations appear to be formed between ‘distant’ (clusters of) cities.

In Fig. 1 we observe lower correlations for publication sets representing relatively higher impact. A (partial) explanation for this can be found in the nature of the construction of the hcp networks as well as in their respective size and average degree (see Table 1). Because the same θ value is used for each publication set, it is significantly harder for a city co-authorship relation in hcp1 to be considered ‘meaningful’ than in all since there are a hundred times fewer total publications. As a result, it is to be expected that the ‘core’ of the city co-authorship network is more significantly affected on a yearly basis, which in turn affects the rankings. Furthermore, because the average degree for hcp1 networks is quite low, relatively weak co-authorships that connect ‘distant’ (clusters of) cities are more likely to (dis)appear from the hcp networks without alternative short paths connecting them, thereby significantly impacting the rankings for centrality measures such as closeness and betweenness. Indeed, we see especially large fluctuations in the trend of correlations for betweenness centrality for hcp1.
Fig. 1. Rank correlations of subsequent time-slices at varying levels of impact.
4.3 Real-World Rank Correlation vs. Null Model
When comparing with randomly generated networks, sufficient random networks are required to establish meaningful differences between the random and real-world networks. Therefore, we generated 100 EDRR networks, using the process described in Sect. 3.5, for each publication set and time-slice (except the first). For each EDRR network we computed the correlation between the rankings for that EDRR network and the real-world network of the previous time-slice. Figure 2 shows the real-world correlation alongside the mean and the error range defined by the standard deviation (sd-range) of the correlations for each set of EDRR networks. Because we expect confounding effects from our EDRR network generation procedure and it is a local measure, degree centrality is excluded.

For eigenvector centrality we observe that the real-world correlations often lie within the sd-range of the EDRR correlations for both the Spearman and Kendall’s Tau correlations. Although Kendall’s Tau correlation for all is almost consistently above random, the difference can hardly be called significant.

For closeness centrality we see that all publication sets have Kendall’s Tau correlations that are almost consistently above random, while the Spearman correlations are around, above and below random at times. However, the trends of the Spearman and Kendall’s Tau correlations have similar shapes. This implies that while the real-world networks observe fewer changes in the order of pairs of nodes than the EDRR networks, this difference is negated by the exact difference in the rankings. In other words, the real-world city networks observe many but relatively small changes in rank while the EDRR networks observe more substantial changes in rank, i.e., the EDRR networks more often remove/add edges connecting otherwise ‘distant’ clusters of cities while the real-world networks remove/add edges between cities that are otherwise already considered ‘close’. In short, the real-world city networks follow the expected pattern of more often establishing new co-authorship relations with ‘close’ cities than ‘distant’.

For betweenness centrality we observe a similar trend as for closeness centrality. Especially for hcp10 the difference between the real-world Kendall’s Tau correlation and random is far more significant than it was for closeness. As such, changes in the real-world networks appear to have far less influence on the betweenness centrality than in the EDRR networks. This may imply that more of the annual real-world city network rewiring occurs in the periphery. This inference is further supported by the fact that the differences are smaller for the full publication set for which most periphery nodes are likely already excluded from the analysis. As such, the real-world removal and addition of edges impacts the shortest paths between all pairs of cities less than random.
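The comparison of a real-world correlation against the sd-range of a set of EDRR correlations amounts to a few lines of arithmetic, sketched below. The use of the sample standard deviation, the function name and the correlation values are assumptions for illustration; they are not results from the study.

```python
from statistics import mean, stdev


def compare_to_null(real_corr, null_corrs):
    """Locate a real-world correlation relative to the null-model sd-range."""
    mu = mean(null_corrs)
    sd = stdev(null_corrs)  # assumption: sample standard deviation
    if real_corr > mu + sd:
        verdict = "above"
    elif real_corr < mu - sd:
        verdict = "below"
    else:
        verdict = "within"
    return mu, sd, verdict


# Hypothetical correlations of three EDRR networks with the previous time-slice.
null_corrs = [0.90, 0.92, 0.94]
mu, sd, verdict_high = compare_to_null(0.97, null_corrs)
_, _, verdict_mid = compare_to_null(0.93, null_corrs)
print(mu, sd, verdict_high, verdict_mid)
```

In practice the list of null correlations would hold one value per generated EDRR network for a given publication set, time-slice and centrality measure.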
Fig. 2. Rank correlations of subsequent time-slices comparing the real-world network results against the mean and sd-range of EDRR networks.
4.4 Limitations
An important limitation of this work is that we focus entirely on unweighted networks, generated from weighted networks using thresholds. When studying the co-authorship between cities, the ideal edge weight should represent the ease with which a co-authorship between cities is formed. Although we have a fractional publication output associated with each edge, this weight is likely a poor representation of the ease of co-authorship formation. Since there is no simple numerical computation of ‘the ease of forming a co-authorship’ based on existing scientometric data, we instead use θ-minimum-weight graphs to establish a minimum co-authored scientific output for a relation to be considered ‘meaningful’.
5 Conclusions
In this paper we investigated the stability and evolution of the world stage of global science at the city level by analyzing changes in network centrality rankings over time. First, we proposed an approach for delineating scientific cities and extracted 3-year time-slices of scientific city co-authorship networks from Web of Science at various levels of impact. Comparing correlations between centrality rankings of subsequent time-slices, we determined that the world stage of global science has become more stable over time. We proposed a new rewiring procedure to generate so-called EDRR networks in order to determine significant real-world rank correlations compared to a sensible null model. We found that closeness and betweenness centrality rankings were more stable for the real-world networks, implying that new co-authorships between authors from previously unconnected cities more often connect ‘close’ cities in the network periphery. Having established a systematic method of comparing centrality rankings over time, we want to find more substantive insights for specific cities in future work.
References
1. Bavelas, A.: Communication patterns in task-oriented groups. J. Acoust. Soc. Am. 22(6), 725–730 (1950)
2. Bornmann, L., de Moya-Anegón, F.: Spatial bibliometrics on the city level. J. Inf. Sci. 45(3), 416–425 (2019)
3. Bouyssou, D., Marchant, T.: Ranking authors using fractional counting of citations: an axiomatic approach. J. Informet. 10(1), 183–199 (2016)
4. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25(2), 163–177 (2001)
5. Caron, E., van Eck, N.J.: Large scale author name disambiguation using rule-based scoring and clustering. In: Proceedings of the 19th International Conference on Science and Technology Indicators, pp. 79–86. CWTS-Leiden University (2014)
6. Catini, R., Karamshuk, D., Penner, O., Riccaboni, M.: Identifying geographic clusters: a network analytic approach. Res. Policy 44(9), 1749–1762 (2015)
7. Web of science categories. https://images.webofknowledge.com/images/help/WOS/hp_subject_category_terms_tasca.html. Accessed 31 Aug 2021
8. Cronin, B., Meho, L.: Using the h-index to rank influential information scientists. J. Am. Soc. Inform. Sci. Technol. 57(9), 1275–1278 (2006)
9. Csomós, G.: On the challenges ahead of spatial scientometrics focusing on the city level. Aslib J. Inf. Manag. 72(1), 67–87 (2019)
10. Ding, Y., Yan, E., Frazho, A., Caverlee, J.: Pagerank for ranking authors in co-citation networks. J. Am. Soc. Inform. Sci. Technol. 60(11), 2229–2243 (2009)
11. Maisonobe, M., Eckert, D., Grossetti, M., Jégou, L., Milard, B.: The world network of scientific collaborations between cities: domestic or international dynamics? J. Informet. 10(4), 1025–1036 (2016)
12. Maisonobe, M., Jégou, L., Eckert, D.: Delineating urban agglomerations across the world: a dataset for studying the spatial distribution of academic research at city level. Cybergeo: Eur. J. Geogr. (2018). No. 871
13. Ruhnau, B.: Eigenvector-centrality-a node-centrality? Soc. Netw. 22(4), 357–365 (2000)
14. Takes, F.W., Heemskerk, E.M.: Centrality in the global network of corporate control. Soc. Netw. Anal. Min. 6(1), 1–18 (2016)
15. Waltman, L., et al.: The Leiden ranking 2011/2012. J. Am. Soc. Inform. Sci. Technol. 63(12), 2419–2432 (2012)
16. Waltman, L., van Eck, N.J., van Leeuwen, T.N., Visser, M.S., van Raan, A.F.: Towards a new crown indicator: an empirical analysis. Scientometrics 87(3), 467–481 (2011)
17. World university ranking. https://www.timeshighereducation.com/world-university-rankings. Accessed 6 Sept 2021
Propagation on Multi-relational Graphs for Node Regression
Eda Bayram
École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
Abstract. Recent years have witnessed a rise in real-world data captured with rich structural information that can be conveniently depicted by multi-relational graphs. While inference of continuous node features across a simple graph is rather under-studied by the current relational learning research, we go one step further and focus on the node regression problem on multi-relational graphs. We take inspiration from the well-known label propagation algorithm aiming at completing categorical features across a simple graph and propose a novel propagation framework for completing missing continuous features at the nodes of a multi-relational and directed graph. Our multi-relational propagation algorithm is composed of iterative neighborhood aggregations which originate from a relational local generative model. Our findings show the benefit of exploiting the multi-relational structure of the data in several node regression scenarios in different settings.

Keywords: Multi-relational data · Label propagation · Node regression

1 Introduction
Various disciplines are now able to capture different levels of interaction between entities of their interest, which promotes multiple types of relationships within data. Examples include social networks [8,17], biological networks [3,9], transportation networks [1,4], etc. Multi-relational graphs are convenient for representing such complex network-structured data. Recent years have witnessed a strong line of relational learning studies focusing on the inference of node-level and graph-level categorical features [6]. Most of these work on simple graphs, and there has been little interest in the regression of continuous node features across the graph. In particular, node regression on multi-relational graphs still remains unexplored. In this study, we present a multi-relational node regression framework. Given the multi-relational structure of the data and partially observed continuous features belonging to the data entities, we aim at completing the missing features. It is possible to encode the intrinsic structure of the data by a graph accommodating multiple types of directed edges between the graph’s nodes, which represent the data entities. Accordingly, we establish the main research question we address: How can we achieve node-value imputation on a multi-relational and directed graph?

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 155–167, 2022. https://doi.org/10.1007/978-3-030-93409-5_14

Fig. 1. A fragment of a multi-relational and directed social network

For this purpose, we propose an algorithm which propagates the observed set of node features towards the missing ones across a multi-relational and directed graph. We take inspiration from the well-known label propagation algorithm [20] aiming at completing categorical features across a simple, weighted graph. We see that simple neighborhood aggregations operated on a given relational structure form the basis of many iterative graph algorithms, including label propagation. Thus, we first break down the propagation framework by the neighborhood aggregations derived through a simple local generative model. Later, we extend this by incorporating a multi-relational neighborhood and suggest a relational local generative model. Then, we build our algorithm, which we call multi-relational propagation (MrP), by iterative neighborhood aggregation steps originating from this new model. We provide the derivation of the parameters of the proposed model, which can be estimated over the observed set of node features. Our method can be considered a more sophisticated version of the standard propagation algorithm, enabling regression of continuous node features over a multi-relational and directed graph. We compare our multi-relational propagation method against the standard propagation in several node regression scenarios. In each, our approach enhances the results considerably by integrating the multi-relational structure of the data into the regression framework.

Comparison to Existing Schemes. The node regression problem has been studied on simple graphs for signal inpainting [7,14] and node representation learning [10,12,13,18]. Many of these approaches implicitly employ a smoothness prior which promotes similar representations at the neighboring nodes of the graph [19].
The smoothness prior exploited in node representation learning studies broadly prescribes minimizing the Euclidean distance between features at the connected nodes. Throughout the paper, we refer to such a prior as ℓ2 sense smoothness. Despite its practicality, the ℓ2 sense smoothness prior suffers from several major limitations that might mislead regression on a multi-relational and directed graph. First, it treats all neighbors of a node equally while reasoning about the node’s state, even though the neighbors connected via different types of relations might play different roles in the inference task. For instance, Fig. 1 illustrates multiple types of relationships that might arise between people. Here, each relation type presumably relies on a different affinity rule or a different level of importance depending on the node regression task. Second, some relation types are inherently symmetric, and some others are asymmetric¹. Euclidean distance minimization broadly assumes that values at neighboring nodes are as close as possible, which may not always be the case for asymmetric relationships. We thus depart from the straightforward ℓ2 sense of smoothness and augment the prior with a relational local generative model.

Contributions. In this study, (i) we provide a breakdown of the propagation algorithm on simple graphs from the Bayesian perspective, (ii) we introduce a relational local generative model, which permits the neighborhood aggregation operation on a multi-relational, directed neighborhood, and (iii) we propose a novel propagation framework, MrP, which properly handles propagating observed continuous node features across a multi-relational directed graph to complete the missing ones.
2 Propagation on Simple Graphs
We denote a simple, undirected graph by $G(V, E)$ with set of nodes $V$ and set of edges $E$. Also, we denote $x_i \in \mathbb{R}$ as the continuous node feature² held by node $i$.

Local Generative Model. We recall the smoothness prior prescribing the neighboring node representations to be as close as possible in terms of the $\ell_2$-norm. Consequently, we write a simple local generative model which relates two neighboring nodes as $x_i = x_j + \epsilon$, where $(i, j) \in E$ and $\epsilon \sim \mathcal{N}(0, \sigma_{ij}^2)$.

First-Order Bayesian Estimate of Node’s Value. The local generative model can be used to obtain an approximation of the node’s state in terms of its local neighborhood. This can be achieved by maximizing the posterior probability of the node’s feature given those of its first-hop neighbors:

$$\arg\max_{x_i} \; p\big(x_i \mid \{x_j : (i,j) \in E\}\big) = \arg\max_{x_i} \; \frac{p\big(\{x_j : (i,j) \in E\} \mid x_i\big)\, p(x_i)}{p\big(\{x_j : (i,j) \in E\}\big)}, \tag{1}$$

where Bayes’ rule applies. Here, we make two assumptions. First, we assume that the prior distribution on the node features, $p(x_i)\ \forall i \in V$, is uniform. Second, we only consider the partial correlations between the central node, whose state is to be estimated, and its first-hop neighbors, while we neglect any partial correlation among the neighbor set (the conditional independence assumption). Accordingly, we reformulate the problem as

$$\arg\max_{x_i} \prod_{(i,j) \in E} p(x_j \mid x_i) = \arg\min_{x_i} \; -\sum_{(i,j) \in E} \log p(x_j \mid x_i), \tag{2}$$

and rewrite it as the minimization of the negative log-likelihood. Next, we plug in the local generative model and obtain the following problem:

$$\arg\min_{x_i} \sum_{(i,j) \in E} \frac{\lVert x_j - x_i \rVert_2^2}{\sigma_{ij}^2}. \tag{3}$$

Neighborhood Aggregation. The first-order Bayesian estimate boils down to minimizing the Euclidean distance between the node’s feature and those of its neighbors, i.e., the least squares problem in (3). Its solution is simply found by setting the gradient of the objective to zero:

$$\hat{x}_i = \frac{\sum_{(i,j) \in E} \omega_{ij}\, x_j}{\sum_{(i,j) \in E} \omega_{ij}}, \tag{4}$$

where $\omega_{ij} = 1/\sigma_{ij}^2$. As seen, it is a linear combination of the neighbors’ features. The first-order Bayesian estimation under the conditions considered above clarifies the neighborhood aggregation operation accomplished in one iteration of a propagation algorithm [20]. By estimating the node states across the whole graph iteratively, a propagation algorithm expands the scope of the approximation beyond the first order until a stopping criterion is satisfied. Hence, we summarize the pipeline for developing a propagation algorithm as given in Fig. 2.

Fig. 2. Overview of the pipeline for the development of a propagation algorithm

¹ In directed graphs, symmetric relationships emerge from bi-directed edges where the edge direction is valid in both directions such as sibling whereas in asymmetric relationships, the edge direction is valid in only one direction such as parent, child. See the edge directions in Fig. 1.
² Generalization to vectorial node representations is possible in principle, yet omitted here for the sake of simplicity.
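The aggregation in (4), applied iteratively, can be sketched as a small propagation routine. This is an illustrative sketch under several assumptions that are not stated in the text: observed nodes are kept fixed (clamped), missing values are initialized with the mean of the observed ones, updates are synchronous, and the stopping criterion is a maximum change below a tolerance. The function name, weights and toy graph are hypothetical.

```python
def propagate(neighbors, observed, n_iter=100, tol=1e-9):
    """Iterate the weighted neighborhood average of Eq. (4) over unobserved nodes.

    neighbors: node -> {neighbor: omega_ij}; observed: node -> known value.
    """
    init = sum(observed.values()) / len(observed)
    x = {i: observed.get(i, init) for i in neighbors}
    for _ in range(n_iter):
        new_x = dict(x)
        max_change = 0.0
        for i, nbrs in neighbors.items():
            if i in observed or not nbrs:
                continue  # observed nodes stay clamped; isolated nodes keep their value
            total_weight = sum(nbrs.values())
            estimate = sum(w * x[j] for j, w in nbrs.items()) / total_weight
            max_change = max(max_change, abs(estimate - x[i]))
            new_x[i] = estimate
        x = new_x
        if max_change < tol:
            break
    return x


# Hypothetical path graph A - B - C - D with unit weights; only A and D are observed.
graph = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
}
values = propagate(graph, {"A": 0.0, "D": 3.0})
print(values)
```

On this path graph the missing values converge towards the linear interpolation between the two observed end points, which is the fixed point of repeatedly averaging over neighbors.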
3 Multi-relational Model
We now introduce a multi-relational and directed graph as $G(\mathcal{V}, \mathcal{E}, \mathcal{P})$, where $\mathcal{V}$ is the set of nodes, $\mathcal{P}$ is the set of relation types, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{P} \times \mathcal{V}$ is the set of multi-relational edges. The function $r(i, j)$ returns the relation type $p \in \mathcal{P}$ that points from node $j$ to node $i$. If such a relation exists between them but points from node $i$ to node $j$, the function returns the reverse, $p^{-1}$.

Relational Local Generative Model. The simple local generative model must be diversified according to the set of relationships that exist on a multi-relational graph. To this end, we propose the following local generative model for a node's state given its multi-relational and directed neighbors:

$$x_i = \begin{cases} \eta_p x_j + \tau_p + \epsilon, & \forall r(i,j) = p, \text{ where } \epsilon \sim \mathcal{N}(0, \sigma_p^2), \\[4pt] \dfrac{x_j}{\eta_p} - \dfrac{\tau_p}{\eta_p} + \epsilon, & \forall r(i,j) = p^{-1}, \text{ where } \epsilon \sim \mathcal{N}\!\left(0, \dfrac{\sigma_p^2}{\eta_p^2}\right). \end{cases} \tag{5}$$
This model builds a linear relationship between neighboring nodes by introducing a relation-dependent scaling parameter η and a shift parameter τ. The latter case in (5) indicates the generative model yielded by the reverse relation, where the direction of the edge is reversed. Such a linear model accommodates both symmetric and asymmetric relationships, because it can capture any bias over a certain relation through the parameter τ and any change in scale through the parameter η. We note that the default values of these parameters are τ = 0 and η = 1, which reduce (5) to the simple local generative model.

First-Order Relational Bayesian Estimate. We now estimate the node's state from its first-hop neighbors connected via multiple types of relationships. We repeat the same assumptions as in (1), which casts the problem as maximizing the likelihood of the node's first-hop neighbors. Once the likelihood of the relational neighbors is expressed through the model in (5), the estimate can be found by minimizing the negative log-likelihood as in (2). Consequently, we obtain the following objective:

$$\underset{x_i}{\operatorname{argmin}} \sum_{p\in\mathcal{P}} \Bigg( \sum_{r(i,j)=p} \omega_p \big\lVert x_i - \eta_p x_j - \tau_p \big\rVert_2^2 + \sum_{r(i,j)=p^{-1}} \omega_p \eta_p^2 \Big\lVert x_i - \frac{x_j}{\eta_p} + \frac{\tau_p}{\eta_p} \Big\rVert_2^2 \Bigg), \tag{6}$$
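where we apply a change of parameter $\omega_p = 1/\sigma_p^2$.

Relational Neighborhood Aggregation. For an arbitrary node $i \in \mathcal{V}$, we denote the loss to be minimized as $L_i$. Such a loss leads to a least squares problem whose solution satisfies $\frac{\partial L_i}{\partial x_i}(\hat{x}_i) = 0$. Accordingly, the estimate can be found as

$$\hat{x}_i = \frac{\displaystyle\sum_{p\in\mathcal{P}} \Bigg( \sum_{r(i,j)=p} \omega_p \big(\eta_p x_j + \tau_p\big) + \sum_{r(i,j)=p^{-1}} \omega_p \eta_p^2 \Big( \frac{x_j}{\eta_p} - \frac{\tau_p}{\eta_p} \Big) \Bigg)}{\displaystyle\sum_{p\in\mathcal{P}} \Bigg( \sum_{r(i,j)=p} \omega_p + \sum_{r(i,j)=p^{-1}} \omega_p \eta_p^2 \Bigg)}. \tag{7}$$

3.1 Estimation of Relational Parameters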
The parameters of the local generative model associated with relation type $p \in \mathcal{P}$ are introduced as $\{\tau_p, \eta_p, \omega_p\}$. These parameters can be estimated over the set of node pairs connected to each other by relation $p$, i.e., $\{(x_i, x_j)\ \forall i,j \in \mathcal{V} \mid r(i,j) = p\}$. For this purpose, we carry out maximum likelihood estimation over the parameters:

$$\underset{\tau_p,\,\eta_p,\,\omega_p}{\operatorname{argmax}}\ p\Big( \big\{(x_i, x_j)\ \forall i,j \in \mathcal{V} \mid r(i,j) = p \big\} \,\Big|\, \tau_p, \eta_p, \omega_p \Big). \tag{8}$$

Then, we make an approximation over the node pairs that are connected by a given relation type by neglecting any conditional dependency that might exist among these node pairs. Hence, we can write the likelihood as a product over the node pairs:

$$\underset{\tau_p,\,\eta_p,\,\omega_p}{\operatorname{argmax}} \prod_{r(i,j)=p} p\big( (x_i, x_j) \,\big|\, \tau_p, \eta_p, \omega_p \big). \tag{9}$$
We proceed with the minimization of negative log-likelihood to solve (9). The reader might recognize that its solution is equivalent to the parameters of a linear regression model [15]. This is simply because we introduce linear generative models (5) for the relationships existing on the graph. Therefore, the parameters can be found as follows (μ = mean(x) is the mean of node values):

$$\begin{aligned} \eta_p &= \frac{\sum_{r(i,j)=p} (x_i - \mu)(x_j - \mu)}{\sum_{r(i,j)=p} (x_j - \mu)^2}, \\ \tau_p &= \operatorname{mean}\Big( \big\{ (x_i - \eta_p x_j)\ \forall i,j \in \mathcal{V} \mid r(i,j) = p \big\} \Big), \\ \omega_p &= 1 \Big/ \operatorname{mean}\Big( \big\{ (x_i - \eta_p x_j - \tau_p)^2\ \forall i,j \in \mathcal{V} \mid r(i,j) = p \big\} \Big). \end{aligned} \tag{10}$$
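The estimators in (10) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `estimate_relation_params`, the optional `eta` argument for keeping the scale fixed, the handling of a zero residual, and all data values are hypothetical.

```python
def estimate_relation_params(pairs, mu, eta=None):
    """Estimate (tau_p, eta_p, omega_p) from pairs (x_i, x_j) with r(i, j) = p.

    mu is the mean of the observed node values, as in (10). Passing a value
    for eta keeps the scale fixed (e.g. eta=1) and only fits tau and omega.
    """
    if eta is None:
        num = sum((xi - mu) * (xj - mu) for xi, xj in pairs)
        den = sum((xj - mu) ** 2 for _, xj in pairs)
        eta = num / den
    tau = sum(xi - eta * xj for xi, xj in pairs) / len(pairs)
    mse = sum((xi - eta * xj - tau) ** 2 for xi, xj in pairs) / len(pairs)
    # Assumption: a perfect fit is reported as infinite precision.
    omega = float("inf") if mse == 0 else 1.0 / mse
    return tau, eta, omega


# Made-up pairs that follow x_i = 2 * x_j + 1 exactly.
exact_pairs = [(1.0, 0.0), (3.0, 1.0), (5.0, 2.0)]
tau, eta, omega = estimate_relation_params(exact_pairs, mu=1.0)

# Made-up years of birth, with the scale fixed to eta = 1.
year_pairs = [(1950.0, 1980.0), (1940.0, 1975.0), (1960.0, 1985.0)]
tau_y, eta_y, omega_y = estimate_relation_params(year_pairs, mu=0.0, eta=1.0)
```

With the scale fixed to 1, τ is simply the mean difference along the edges of the relation and ω is the inverse of the mean squared deviation from it.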
Local Generative Model and Local Operation. We summarize the local generative model, the associated loss and the first-order estimate for the simple and multi-relational cases in Table 1. In the multi-relational case, the neighborhood aggregation is not directly a weighted average of the neighbors; instead, the neighbors are first transformed according to the type and the direction of their relation to the central node. The relational transformation is controlled by the parameters η and τ. For this reason, Table 1 uses the following shorthand for the transformations applied to the neighbors: $f(x) = x$ in the simple case (no actual transformation is applied) and $f_p(x) = \eta_p x + \tau_p$ in the relational case for type $p$. In addition, $\mathcal{P}^{-1} = \{p^{-1}, \forall p \in \mathcal{P}\}$ denotes the set of relation types where the edge direction is reversed. For the reversed relationships, the parameters follow from the second case of (5) as $\eta_{p^{-1}} = 1/\eta_p$, $\tau_{p^{-1}} = -\tau_p/\eta_p$, $\omega_{p^{-1}} = \eta_p^2 \omega_p$. After the transformations, the estimate is computed as a weighted average controlled by the parameter ω. It is worth noting that ω is the inverse of the error variance of the relational local generative model (5). Therefore, the estimate can be interpreted as the outcome of an aggregation in which the precision ranks the relational information.

Table 1. Local generative model and operation

| Graph | Local generative model | Loss | Local operation |
|---|---|---|---|
| Simple weighted graph | $x_i = x_j + \epsilon$, $\forall (i,j) \in \mathcal{E}$, $\epsilon \sim \mathcal{N}(0, 1/\omega_{ij})$ | $\sum_{(i,j)\in\mathcal{E}} \omega_{ij} (x_i - x_j)^2$ | $\dfrac{\sum_{(i,j)\in\mathcal{E}} \omega_{ij} f(x_j)}{\sum_{(i,j)\in\mathcal{E}} \omega_{ij}}$ |
| Multi-relational directed graph | $x_i = \eta_p x_j + \tau_p + \epsilon$, $\forall r(i,j) = p$, $\epsilon \sim \mathcal{N}(0, 1/\omega_p)$ | $\sum_{p\in\mathcal{P}\cup\mathcal{P}^{-1}} \sum_{r(i,j)=p} \omega_p (x_i - \eta_p x_j - \tau_p)^2$ | $\dfrac{\sum_{p\in\mathcal{P}\cup\mathcal{P}^{-1}} \sum_{r(i,j)=p} \omega_p f_p(x_j)}{\sum_{p\in\mathcal{P}\cup\mathcal{P}^{-1}} \sum_{r(i,j)=p} \omega_p}$ |

3.2 Multi-relational Propagation Algorithm
In Fig. 2, the propagation algorithm is depicted as an iterative neighborhood aggregation method where each iteration computes the solution of a first-order Bayesian estimation problem. Similarly, we propose a propagation algorithm that relies on the first-order relational Bayesian estimate introduced in (6). The algorithm operates iteratively, and the relational neighborhood aggregation (7) is carried out at every node of the graph simultaneously. Thus, we denote by $x^{(k)} \in \mathbb{R}^N$ the vector that collects the values over the set of nodes at iteration $k$, where $|\mathcal{V}| = N$. Next, we express the iterations in matrix-vector multiplication format.
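Before moving to matrix form, the per-node update in (7) can be sketched as below. The parameters of a reversed edge are read off the second case of (5), i.e. scale 1/η, shift −τ/η and precision η²ω. The function name `relational_estimate`, the tuple layout of the neighbors, the relation names and all numbers are hypothetical and serve only as an illustration.

```python
def relational_estimate(neighbors, params):
    """Estimate (7) for one node.

    neighbors: list of (x_j, relation, is_reverse); is_reverse=True means
               r(i, j) = p^{-1}, i.e. the edge points from i to j.
    params:    {relation: (tau, eta, omega)}
    """
    num = 0.0
    den = 0.0
    for x_j, relation, is_reverse in neighbors:
        tau, eta, omega = params[relation]
        if is_reverse:
            # Reverse relation, second case of (5).
            tau, eta, omega = -tau / eta, 1.0 / eta, eta ** 2 * omega
        num += omega * (eta * x_j + tau)   # transformed neighbor, weighted
        den += omega                       # normalization factor
    return num / den


# Made-up parameters and neighbors (years of birth).
params = {"parent": (-30.0, 1.0, 0.02), "sibling": (0.0, 1.0, 0.06)}
neighbors = [
    (1980.0, "parent", False),   # predicts 1980 - 30 = 1950
    (1925.0, "parent", True),    # reversed edge predicts 1925 + 30 = 1955
    (1953.0, "sibling", False),  # predicts 1953
]
estimate = relational_estimate(neighbors, params)
```

In this toy example the sibling edge has the highest precision, so the estimate lies closest to the value it predicts.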
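Iterations in Matrix Notation. We denote by $A_p$ the matrix encoding the adjacency pattern of relation type $p$. It is an $(N \times N)$ asymmetric matrix storing incoming edges on its rows and outgoing edges on its columns. One can compile the aggregations in (7) simultaneously over the entire graph using matrix notation. Then, the relational local operations at iteration $k$ can be expressed as

$$x^{(k)} = \Bigg( \sum_{p\in\mathcal{P}} \Big[ \omega_p \big( \eta_p A_p x^{(k-1)} + \tau_p A_p \mathbf{1} \big) + \omega_p \eta_p \big( A_p^\top x^{(k-1)} - \tau_p A_p^\top \mathbf{1} \big) \Big] \Bigg) \odot \Bigg( \sum_{p\in\mathcal{P}} \Big[ \omega_p A_p \mathbf{1} + \omega_p \eta_p^2 A_p^\top \mathbf{1} \Big] \Bigg)^{-1}, \tag{11}$$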
where $\mathbf{1}$ is the vector of ones and $\odot$ stands for element-wise multiplication. In addition, the inversion of the latter sum term is applied element-wise. This part, in particular, arranges the denominator in Equation (7) in vector format. Thus, it can be seen as the normalization factor over the neighborhood aggregation. For the purpose of simplification, we rewrite (11) as

$$x^{(k)} = \big( T x^{(k-1)} + S \mathbf{1} \big) \odot \big( H \mathbf{1} \big)^{-1}, \tag{12}$$

by introducing the auxiliary matrices

$$T = \sum_{p\in\mathcal{P}} \eta_p \omega_p \big( A_p + A_p^\top \big), \quad S = \sum_{p\in\mathcal{P}} \tau_p \omega_p \big( A_p - \eta_p A_p^\top \big), \quad H = \sum_{p\in\mathcal{P}} \omega_p \big( A_p + \eta_p^2 A_p^\top \big). \tag{13}$$

Algorithm. Given the iterations above, we now formalize the Multi-relational Propagation algorithm (MrP). MrP targets a node-level completion task where the multi-relational graph $G$ is given a priori and the nodes are partially labeled at $\mathcal{U} \subseteq \mathcal{V}$. To manage the propagation of continuous values from the labeled set of nodes towards the unlabeled ones, we introduce an indicator vector $u \in \mathbb{R}^N$, which encodes the labeled nodes. It is initialized as $u_i^{(0)} = 1$ if $i \in \mathcal{U}$, else 0. Then, the vector $x$ stores the node values throughout the iterations. It is initialized with the values over $\mathcal{U}$ and zero-padded at the unlabeled nodes, i.e., $x_i^{(0)} = 0$ if $i \in \mathcal{V} \setminus \mathcal{U}$. Similar to label propagation [20], our algorithm fundamentally consists of aggregation and normalization steps. In order to encompass multi-relational transformations during aggregation, we formulate an iteration of MrP by the steps of aggregation, shift and normalization. In addition, similar to the PageRank algorithm [5], we employ a damping factor $\xi \in [0, 1]$ to update a node's state by combining it with its value from the previous iteration. We provide pseudocode for MrP in Algorithm 1. The propagation parameters for each relation type, $\{\tau_p, \eta_p, \omega_p\}$, are estimated over the labeled set of nodes $\mathcal{U}$, as described in Sect. 3.1, and given to the algorithm as input together with the adjacency matrices encoding the multi-relational, directed graph. Steps 1–4 in Algorithm 1 are essentially responsible for the multi-relational neighborhood aggregation. At Step 5, the nodes' states are updated based on the information collected from the neighbors. If valid information is collected from the neighbors and the node is already labeled, then we employ the damping ratio ξ to update the node's state. This adjusts the trade-off between the neighborhood aggregation and the previous state of the node. We
distinguish whether an arbitrary node is currently labeled or not by the indicator vector $u^{(k)}$, which keeps track of the propagated nodes throughout the iterations and ensures that the normalization complies with the valid neighborhood information collected. Hence, at Step 6, we update it as well. Finally, at Step 7, we clamp the labeled set of nodes³ by leaving their values unchanged, simply because they store the governing information for completing the missing features. The algorithm terminates when all the nodes are propagated and the difference between two consecutive iterations is under a certain threshold. Accordingly, the number of iterations is related to the choice of the hyperparameter ξ and the stopping criterion.

MrP is implemented using the PyTorch-scatter package⁴, which efficiently computes neighborhood aggregation on a sparse relational structure. The aggregation steps require $2|\mathcal{E}|$ operations, and the normalization and update steps require $|\mathcal{V}|$ operations at each iteration. Therefore, MrP scales linearly with the number of edges in the graph, similar to the standard label propagation algorithm (LP). We finally note that by setting $\tau_p = 0$, $\eta_p = 1$, $\omega_p = 1$ $\forall p \in \mathcal{P}$, MrP reduces to LP⁵, as if we operate on a simple graph regardless of the relation types and directions.

Algorithm 1: MrP
Input: $\mathcal{U}$, $\{x_i \mid i \in \mathcal{U}\}$, $\{A_p, \tau_p, \eta_p, \omega_p\}_{\mathcal{P}}$
Output: $\{x_i \mid i \in \mathcal{V} \setminus \mathcal{U}\}$
Initialization: $u^{(0)}$, $x^{(0)}$, $T$, $S$, $H$
for k = 1, 2, ... do
    Step 1. Aggregate: $z = T x^{(k-1)}$
    Step 2. Shift: $z = z + S u^{(k-1)}$
    Step 3. Aggregate the normalization factors: $r = H u^{(k-1)}$
    Step 4. Normalize: $z = z \odot r^{\dagger}$   // † is for element-wise pseudo-inverse
    Step 5. Update values:
        $x_i^{(k)} = x_i^{(k-1)}$, if $r_i = 0$   // null info at neighbors
        $x_i^{(k)} = z_i$, if $r_i > 0$, $u_i^{(k-1)} = 0$   // null info at the node
        $x_i^{(k)} = (1 - \xi)\, x_i^{(k-1)} + \xi z_i$, e.w. ($r_i > 0$, $u_i^{(k-1)} = 1$)
    Step 6. Update propagated nodes: $u^{(k)} = u^{(k-1)}$, $u_i^{(k)} = 1$ if $r_i > 0$
    Step 7. Clamp the known values: $x_i^{(k)} = x_i$, $\forall i \in \mathcal{U}$
    break if all($u^{(k)}$) & all($|x^{(k)} - x^{(k-1)}| < \varepsilon$)
$x_i = x_i^{(k)}$, $\forall i \in \mathcal{V} \setminus \mathcal{U}$.

³ The clamping step also exists in the label propagation algorithm [20]. It reinjects the true labels at each iteration throughout the propagation instead of overwriting the labeled nodes with the aggregated neighborhood information.
⁴ Source code is available at https://github.com/bayrameda/MrAP. A special case of MrP is studied to propagate heterogeneous node features in [2] for numerical attribute completion in knowledge graphs.
⁵ The label propagation algorithm [20] was originally designed for completing categorical features across a simple, weighted graph. We adapt it to propagate continuous features and to be applicable to node regression through the default parameter set of MrP.
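The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal, unoptimized illustration and not the authors' PyTorch-scatter implementation: it assumes an edge-list representation in place of the matrices T, S and H, and the function name `mrp`, the `max_iter` safeguard, the absolute threshold `eps` and the toy graph are all hypothetical.

```python
def mrp(num_nodes, edges, params, known, xi=0.5, eps=1e-3, max_iter=1000):
    """Sketch of Algorithm 1 (MrP) on an edge list.

    edges:  list of (i, j, p), meaning r(i, j) = p (relation p points from j to i)
    params: {p: (tau, eta, omega)}
    known:  {node: value} for the labeled set U
    """
    # Each edge contributes one entry per direction. An entry
    # (target, source, t, s, h) holds the non-zero coefficients of T, S, H in (13).
    coeffs = []
    for i, j, p in edges:
        tau, eta, omega = params[p]
        coeffs.append((i, j, eta * omega, tau * omega, omega))                    # A_p
        coeffs.append((j, i, eta * omega, -tau * eta * omega, eta ** 2 * omega))  # A_p^T

    x = [known.get(n, 0.0) for n in range(num_nodes)]      # zero-padded values
    u = [1 if n in known else 0 for n in range(num_nodes)]  # propagated indicator

    for _ in range(max_iter):
        z = [0.0] * num_nodes
        r = [0.0] * num_nodes
        for target, source, t, s, h in coeffs:
            z[target] += t * x[source] + s * u[source]   # Steps 1-2
            r[target] += h * u[source]                   # Step 3
        x_new, u_new = list(x), list(u)
        for n in range(num_nodes):
            if r[n] == 0:            # no information at the neighbors
                continue
            z_n = z[n] / r[n]        # Step 4
            if u[n]:                 # Step 5, with damping
                x_new[n] = (1 - xi) * x[n] + xi * z_n
            else:                    # Step 5, first value for this node
                x_new[n] = z_n
            u_new[n] = 1             # Step 6
        for n, value in known.items():   # Step 7: clamp the known values
            x_new[n] = value
        converged = all(u_new) and all(abs(a - b) < eps for a, b in zip(x_new, x))
        x, u = x_new, u_new
        if converged:
            break
    return x


# Toy chain: node 0 is known; relation "p" says x_i = x_j + 2 when r(i, j) = p.
result = mrp(3, [(1, 0, "p"), (1, 2, "p")], {"p": (2.0, 1.0, 1.0)}, {0: 10.0})

# With the default parameters (tau=0, eta=1, omega=1) the update is a plain average.
lp_result = mrp(3, [(1, 0, "e"), (2, 1, "e")], {"e": (0.0, 1.0, 1.0)}, {0: 0.0, 2: 2.0})
```

Working on the edge list keeps the cost of each iteration proportional to the number of edges, in line with the complexity argument above.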
4 Experiments

We now present a proof of concept of the proposed multi-relational propagation method in several node regression scenarios. We first test MrP in estimating weather measurements on a multi-relational and directed graph that connects weather stations. Then, we evaluate the performance in predicting people's date of birth, where people are connected to each other on a social network composed of different types of relationships. In the experiments, the damping factor is set to ξ = 0.5, and the threshold for terminating the iterations is set to 0.001 of the range of the given values. As evaluation metrics, we use root mean square error (RMSE), mean absolute percentage error (MAPE) and a normalized RMSE (nRMSE) with respect to the range of ground-truth values. We calculate them over the estimation error on the unlabeled set of nodes. In the experiments, we leave the parameter η in MrP at its default value of 1, since we do not empirically observe a scale change over the relation types given by the datasets we work on. Then, we estimate the parameters τ and ω for each relation type based on the observed set of node values, as described in Sect. 3.1.

4.1 Multi-relational Estimation of Weather Measurements
We test our method on a meteorological dataset provided by MeteoSwiss, which compiles various types of weather measurements from 86 weather stations between the years 1981 and 2010⁶. In particular, we use yearly averages of the weather measurements in our experiments.

Fig. 3. Distribution of change in temperature measurements between weather stations that are related via altitude proximity in the ascending and descending directions.

⁶ https://github.com/bayrameda/MaskLearning/tree/master/MeteoSwiss.
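Table 2. Temperature and snowfall prediction performances

| Measurement | Method | RMSE | MAPE | nRMSE |
|---|---|---|---|---|
| Temperature | LP | 1.120 | 0.155 | 0.050 |
| Temperature | MrP | 1.040 | 0.147 | 0.045 |
| Snowfall | LP | 194.49 | 0.405 | 0.112 |
| Snowfall | MrP | 180.10 | 0.357 | 0.105 |

Table 3. Precipitation prediction performances

| Method | RMSE | MAPE | nRMSE |
|---|---|---|---|
| LP-altitude | 381.86 | 0.261 | 0.174 |
| LP-gps | 374.38 | 0.242 | 0.168 |
| MrP | 347.98 | 0.238 | 0.157 |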
Construction of Multi-relational Directed Graph. To begin with, we prepare a multi-relational graph representation G(V, E, P) of the weather stations, i.e., |V| = 86, where we relate them based on two types of relationships, i.e., |P| = 2. First, we connect them based on geographical proximity by inserting an edge between a pair of weather stations if the Euclidean distance between their GPS coordinates is below a threshold. The geographical proximity leads to a symmetric relationship, for which we acquire 372 edges. Second, we relate them based on altitude proximity with a similar logic, yet we anticipate an asymmetric relationship where the direction of an edge indicates an altitude ascent from one station to another. For both relation types, we adjust the threshold for building connections so that no node is left disconnected. Consequently, we acquire 1144 edges for the altitude relationship. In the experiments, we randomly sample the labeled set of nodes, U, from the entire node set, V, with a ratio of 80%. Then, we repeat the experiment in this setting 50 times in Monte Carlo fashion. The evaluation metrics are then averaged over the series of simulations.

Predicting Temperature and Snowfall on Directed Altitude Graph. We first conduct experiments on a simple scenario where we target predicting temperature and snowfall measurements using altitude relations. We compare the proposed method to the standard label propagation algorithm, LP, which overlooks asymmetric relational reasoning. Thus, we aim at evaluating the utility of the directional transformation of MrP during the neighborhood aggregation, which is mainly provided by the parameters τ and η. We visualize the distribution of measurement changes on the altitude edges in Fig. 3. Here, the parameter τ directly corresponds to the mean measurement difference computed along the directed altitude edges, since η = 1. We fit a radial basis function (RBF) to the distribution, since the residual error in the local generative model (5) is assumed to be normal. Then, the parameter ω is simply associated with the inverse of its variance. We see that the temperature differences in the ascending direction, i.e., (xi − xj) ∀r(i, j) = altitude ascend, have a mean in the negative region. This signifies an expected decrease in temperature values along an altitude ascent. As seen in Table 2, even in the case of a single relation type (altitude proximity), by incorporating the directionality MrP manages to improve on the predictions of the regression realized by label propagation, LP.

Predicting Precipitation on Directed, Multi-relational Graph. We test MrP in another scenario where we integrate both altitude and geographical
proximity relations to predict precipitation measurements at the weather stations. The prediction performance is compared to the regression by LP, which is carried out over the altitude relations and the GPS relations separately. Since MrP handles both relation types and the direction of the edges simultaneously, it achieves better performance than LP, as seen in Table 3.

4.2 Predicting People's Date of Birth in a Social Network
We also conduct experiments on a small subset of a relational database called Freebase [16]. We work on a graph G(V, E, P) composed of 830 people, i.e., |V| = 830, connected via 8 different types of relationships, i.e., |P| = 8. Here, the task is to predict people's date of birth when it is known only for a subset of people. A fragment of the multi-relational graph is illustrated in Fig. 1, where two asymmetric relationships exist: influenced by and parent. Table 4 summarizes the statistics for each relationship. For symmetric relationships, the date of birth difference is counted along both directions of an edge, which sets the mean to zero. Also, the distributions over certain relation types are visualized in Fig. 4. In the experiments, we randomly select the set of people whose date of birth is initially known, U, with a ratio of 50% in V. We again report the evaluation metrics averaged over a series of experiments repeated 50 times.

Table 4. Statistics for each relationship, columns respectively: number of edges, mean and variance of 'date of birth' difference over the associated relation type.

| Relation type | Edges | Mean | Variance |
|---|---|---|---|
| award nomination | 454 | 0 | 320.23 |
| friendship | 221 | 0 | 155.82 |
| influenced by | 528 | −36.25 | 1019.77 |
| sibling | 83 | 0 | 45.16 |
| parent | 98 | −32.90 | 62.90 |
| spouse | 262 | 0 | 87.60 |
| dated | 231 | 0 | 90.95 |
| awards won | 183 | 0 | 257.45 |

Fig. 4. Distribution of 'date of birth' difference (year) over different relationships. An RBF is fitted to each.

We compare the performance of MrP to the regression of date of birth values obtained with label propagation (LP). We run LP over the edges of each relation type separately and also over the union of those. The results are given in Table 5. Based on the results, the most successful relation types for predicting the date of birth using LP are influenced by and spouse. Nonetheless, when LP operates on the union of the edges provided by the different types of relationships, it performs better than on any single type. Moreover, MrP surpasses this result by enabling a relational neighborhood aggregation over different types of edges. Once again, we argue that its success is due to the fact that it takes asymmetric relationships into account, here encountered as influenced by and parent. In addition, it assigns different levels of importance to the predictions collected through different types of relationships, based on the uncertainty estimated over the observed data.

Table 5. Date of birth prediction performances

| Method | Relation type | RMSE | MAPE | nRMSE |
|---|---|---|---|---|
| LP | award nomination | 32.43 | 0.011 | 0.115 |
| LP | friendship | 31.92 | 0.011 | 0.113 |
| LP | influenced by | 30.29 | 0.012 | 0.108 |
| LP | sibling | 32.69 | 0.012 | 0.116 |
| LP | parent | 33.62 | 0.013 | 0.119 |
| LP | spouse | 31.45 | 0.011 | 0.112 |
| LP | dated | 31.70 | 0.011 | 0.113 |
| LP | awards won | 33.04 | 0.012 | 0.117 |
| LP | union | 24.22 | 0.008 | 0.086 |
| MrP | | 15.62 | 0.005 | 0.055 |
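The error measures reported in the tables of this section can be sketched as follows. This is only an illustration under stated assumptions: MAPE is returned as a fraction (which matches the magnitude of the reported values), nRMSE is normalized by the range of the ground-truth values passed in, and the function name `regression_metrics` and the numbers are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAPE, nRMSE) for ground-truth and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumption: no ground-truth value is zero, otherwise MAPE is undefined.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    # Assumption: the normalizing range is taken over y_true itself.
    nrmse = rmse / (max(y_true) - min(y_true))
    return rmse, mape, nrmse


# Made-up years of birth for three unlabeled nodes.
y_true = [1900.0, 1950.0, 2000.0]
y_pred = [1910.0, 1950.0, 1990.0]
rmse, mape, nrmse = regression_metrics(y_true, y_pred)
```

Because years of birth are large numbers, MAPE stays small even for errors of several years, which is consistent with the small MAPE values next to two-digit RMSE values in Table 5.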
5 Conclusion
In this study, we proposed MrP, a propagation algorithm that operates on multi-relational and directed graphs for the regression of continuous node features, and we showed that it outperforms the standard propagation algorithm on multi-relational data. The proposed approach can be generalized to node embedding learning and then to node classification tasks. Augmenting the computational graph of the propagation algorithm with multiple types of directed relationships provided by domain knowledge permits anisotropic operations on graphs, which are claimed to be promising for future directions in graph representation learning [11].
References

1. Aleta, A., Moreno, Y.: Multilayer networks in a nutshell. Annu. Rev. Condens. Matter Phys. 10, 45–62 (2019)
2. Bayram, E., García-Durán, A., West, R.: Node attribute completion in knowledge graphs with multi-relational propagation. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3590–3594. IEEE (2021)
3. Bentley, B., et al.: The multilayer connectome of Caenorhabditis elegans. PLoS Comput. Biol. 12(12), e1005283 (2016)
4. Boccaletti, S., et al.: The structure and dynamics of multilayer networks. Phys. Rep. 544(1), 1–122 (2014)
5. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1–7), 107–117 (1998)
6. Chami, I., Abu-El-Haija, S., Perozzi, B., Ré, C., Murphy, K.: Machine learning on graphs: a model and comprehensive taxonomy. CoRR abs/2005.03675 (2020). https://arxiv.org/abs/2005.03675
7. Chen, S., et al.: Signal inpainting on graphs via total variation minimization. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8267–8271. IEEE (2014)
8. Cozzo, E., de Arruda, G.F., Rodrigues, F.A., Moreno, Y.: Multilayer networks: metrics and spectral properties. In: Garas, A. (ed.) Interconnected Networks. UCS, pp. 17–35. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-23947-7_2
9. De, M.D.: Multilayer modeling and analysis of human brain networks. GigaScience 6(5), 1–8 (2017)
10. Deng, J., et al.: Edge-aware graph attention network for ratio of edge-user estimation in mobile networks. In: 2020 25th International Conference on Pattern Recognition (ICPR), pp. 9988–9995. IEEE (2021)
11. Dwivedi, V.P., Joshi, C.K., Laurent, T., Bengio, Y., Bresson, X.: Benchmarking graph neural networks. arXiv preprint arXiv:2003.00982 (2020)
12. Ivanov, S., Prokhorenkova, L.: Boost then convolve: gradient boosting meets graph neural networks. In: International Conference on Learning Representations (2021)
13. Opolka, F.L., Solomon, A., Cangea, C., Veličković, P., Liò, P., Hjelm, R.D.: Spatiotemporal deep graph infomax. arXiv preprint arXiv:1904.06316 (2019)
14. Perraudin, N., Vandergheynst, P.: Stationary signal processing on graphs. IEEE Trans. Signal Process. 65(13), 3462–3477 (2017)
15. Rencher, A.C., Christensen, W.: Methods of multivariate analysis, chap. 3, pp. 47–90. Wiley (2012)
16. Toutanova, K., Chen, D.: Observed versus latent features for knowledge base and text inference. In: Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66 (2015)
17. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications, vol. 8. Cambridge University Press, Cambridge (1994)
18. Wu, Y., Tang, Y., Yang, X., Zhang, W., Zhang, G.: Graph convolutional regression networks for quantitative precipitation estimation. IEEE Geosci. Remote. Sens. Lett. 18, 1124–1128 (2021)
19. Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B.: Learning with local and global consistency. In: Advances in Neural Information Processing Systems, pp. 321–328 (2004)
20. Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University (2002)
Realistic Commodity Flow Networks to Assess Vulnerability of Food Systems

Abhijin Adiga¹(✉), Nicholas Palmer¹, Sanchit Sinha¹,², Penina Waghalter³, Aniruddha Dave¹, Daniel Perez Lazarte¹, Thierry Brévault⁴,⁵, Andrea Apolloni⁶,⁷, Henning Mortveit¹,⁸, Young Yun Baek¹, and Madhav Marathe¹,²

¹ Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, USA
² Computer Science, University of Virginia, Charlottesville, USA
³ Yeshiva University, New York City, USA
⁴ BIOPASS, CIRAD-IRD-ISRA-UCAD, Dakar, Senegal
⁵ CIRAD, UPR AIDA, 34398 Montpellier, France
⁶ Université de Montpellier, CIRAD, Montpellier, France
⁷ CIRAD, UMR ASTRE, 34398 Montpellier, France
⁸ Department of Engineering Systems and Environment, University of Virginia, Charlottesville, USA
Abstract. As the complexity of our food systems increases, they also become susceptible to unanticipated natural and human-initiated events. Commodity trade networks are a critical component of our food systems in ensuring food availability. We develop a generic data-driven framework to construct realistic agricultural commodity trade networks. Our work is motivated by the need to study food flows in the context of biological invasions. These networks are derived by fusing gridded, administrative-level, and survey datasets on production, trade, and consumption. Further, they are periodic temporal networks reflecting seasonal variations in production and trade of the crop. We apply this approach to create networks of tomato flow for two regions – Senegal and Nepal. Using statistical methods and network analysis, we gain insights into spatiotemporal dynamics of production and trade. Our results suggest that agricultural systems are increasingly vulnerable to attacks through trade of commodities due to their vicinity to regions of high demand and seasonal variations in production and flows.
1 Introduction

1.1 Background and Motivation
With rapid population growth, shrinking farm acreage and intensive agriculture, society has come to critically depend on long distance flows of agricultural commodities [25]. This phenomenon has led to availability of a variety of commodities round the year. However, it has also made our food systems increasingly susceptible to threats such as invasive species [11], food contamination [14], extreme weather events [22], and even pandemics like COVID-19 [23]. For example, trade networks act as conduits, enabling the rapid dispersal of pests and pathogens through crops, livestock, packaging, propagating material, etc. In the US alone, the annual economic impact of biological invasions is estimated to be over $120B [20]. Therefore, modeling food systems in all their complexity and understanding their vulnerabilities is critical to ensure food security, biodiversity, health and economic stability.

As the complexity of the systems that are part of society and everyday life continues to grow, the need for more models with increasing levels of resolution and fidelity is continuously growing. Network models, be they simple, hierarchical, and/or multi-scale, have become ubiquitous all throughout science and applications in order to keep up with the demands for analysis, discovery, and support for policy formation. As is the case with many built infrastructures, food flows naturally yield to multi-scale network representations [6,15,19,21]. Depending on the problem being studied, nodes of the network represent locations such as operations, markets, cities, counties, states, or countries, connected by transportation infrastructure, and edges representing flow (commodity specific or aggregated).

This work focuses on the construction and analysis of realistic representations of production, trade, and consumption of agricultural crops that can be applied to epidemiological processes such as invasive species spread, food poisoning, and biological warfare. Epidemiological models are being increasingly applied in the context of invasive species spread [4,6,15]. Such models can inform policy makers on a variety of aspects such as forecasting, causality, invasion source detection, surveillance, interventions, and economic impact.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 168–179, 2022. https://doi.org/10.1007/978-3-030-93409-5_15

1.2 Challenges
Inferring or estimating commodity flow networks is a major challenge as there is hardly any data available on commodity-specific flows. Even when such data are available, their spatial and temporal resolutions are not adequate. For example, datasets on inter-country commodity-specific trade are available [8] at yearly resolution. The same holds for production, which is typically available at the state level for a country. Besides, the availability of data differs by country or study region, thus posing a hurdle to generalizing the construction framework. To cope with such challenges, simple models for production and commodity flow have been used. For example, production systems have been mostly weather driven, and host crop production can be modeled using approaches that range from simple regression models with environmental variables as input parameters to sophisticated mechanistic models. But increasingly, farmers are relying on protected cultivation methods, and are thus able to extend the cultivation period to off-seasons. Distance between production and consumption areas need not be a driver of flows, as trade depends on other factors such as crop type, environment, trade and transport infrastructure, and pricing. Hence, traditional spatial interaction models might not be able to characterize commodity flows. It is important to incorporate knowledge of growing periods, wholesale market locations, fine-resolution estimates of crop
production, imports, and exports. Fusing multi-type datasets that are misaligned in space and time, and validating the result, are major challenges.

1.3 Contributions
Framework. In this work, we develop a general framework to construct high-resolution temporal networks that capture the production, trade, and consumption of agricultural crops. The framework is developed in the context of invasive species spread through multiple pathways, but is generic enough to be applied to other problems such as those mentioned above. We use a number of datasets (at grid level, at administrative-unit level, and qualitative) including crop production, growing seasons, market locations, trade, imports and exports, spatial data capturing human activities, and human population, all fused together to obtain multi-scale networks with node and edge attributes.

Application. The number of biological invasions at global scale is steadily increasing. The spread of the South American tomato leafminer [2] over the last 15 years is an exemplar of such intercontinental biological invasion events. The pest has been responsible for devastating tomato crops globally. To study its spread, we apply our network generation framework to develop temporal attributed networks representing tomato production and trade for two different regions: Senegal (SN), representative of the spread in West Africa, and Nepal (NP), representative of the spread in South and Southeast Asia. The pest invaded Senegal and Nepal in the last decade, and the has been preparing for its impending invasion. The dynamics of tomato production widely differ in these three regions due to climate and trade infrastructure, making for an interesting comparison. Secondly, the datasets available for each region vary, leading to different construction approaches and assumptions.

Analysis and Summary of Results. We rigorously analyze the resulting networks using statistical methods and structural analysis. Using decision trees, we identify the factors that drive trade in the SN network and the NP network. We assess the potential for invasion of high-production areas in a number of ways. We investigate the relationship between production areas and characteristics of nearby localities. Analysis of trade flows in conjunction with production shows that production at source is a primary driver of flow. In many cases, areas with high amounts of imports not only have high population, but also have reasonably high production. Such areas are susceptible to invasions.

1.4 Related Works
Datasets on trade of agricultural commodities are seldom available. The United Nations maintains country-to-country trade data for several commodities: FAOSTAT [8] and ComTrade [5]. Ercsey-Ravasz et al. [6] construct an international food trade network using ComTrade by aggregating product codes corresponding to food and assess the vulnerabilities of the network to attacks. Nath et al. [18] and Suweis et al. [24] use production and trade data from FAO and demographics
information to assess the resilience and the reactivity of the coupled population food system. Multi-pathway models that include long-distance trade have been applied in the invasive species literature [4,15,19,26]. Due to lack of commodity flow data, spatial interaction models such as the gravity model are applied in many cases [4,15]. Nopsa et al. [19] analyze rail networks of grain transport in the context of mycotoxin spread. This work is motivated by the challenges faced in modeling the spread of the South American leafminer [10,15,26]. Here we give some example works from other domains that share features with the models introduced in this paper. Human contact networks for epidemiology have a natural multi-scale organization, being composed as a union of networks constructed over multiple types of locations such as workplaces, schools, stores, and places of worship. Models such as those of Mistry et al. [16] and Voigt et al. [27] directly model and calibrate such component networks using data on contact rates and contact duration to form the person-person contact network. Eubank et al. [7] construct the same type of network but use the approach of synthetic populations: by fusing demographic data, highly detailed geo-spatial data on residences, activity locations, and activity sequences derived from surveys, one can also derive a person-person contact graph. In Barrett et al. [1], a coupled network system involving a contact network, a transportation network, an electrical power network, and a communication network is constructed for analysis of the National Planning Scenario no. 1.
2 Network Construction Framework
Here, we briefly describe a multi-pathway spread model, of which the seasonal commodity flow network is a critical component. Spatial spread processes such as invasive species spread occur at multiple scales through different pathways that can be broadly categorized as natural spread (self-mediated, wind, water, etc.) and human-mediated spread (trade of host crops, transportation, etc.) [4,15]. Our multi-pathway model consists of two component graphs defined at different spatial scales. The first component is the self-mediated dispersal pathway $G_S(V, E)$ defined at the grid level. A grid is overlaid on the study region, and $V$ denotes the set of grid cells. Each cell $v$ is associated with an attribute vector $a(v, t)$, where $t$ corresponds to a time stamp. Cell attributes can include the administrative levels to which the cell belongs, quantity of host crop production and consumption, population, climatic variables, number and types of operations, etc. The spread occurs from one grid cell to its adjacent cells. The edge set $E$ is defined accordingly using a distance-based neighborhood criterion such as the Moore neighborhood or a Euclidean-distance-based neighborhood. A schematic for the construction of this graph is provided in Fig. 1 (gray blocks). Specific construction details are provided in the description of the SN and NP networks. Let $\{L_1, L_2, \ldots\}$ denote a collection of $n_L$ mutually disjoint subsets of $V$. Each $L_i$ corresponds to a locality $v_i$. Localities represent areas of high human activity that are relevant to human-mediated spread of the invasive
species. These include production and consumption areas. Each locality typically consists of spatially contiguous nodes representing, for example, a city or a district. The second component graph is the inter-locality graph, which captures the locality-to-locality trade of host crops and its influence on spread. This graph is denoted by $G_{LD}$. Let $V_{LD}$ denote the set of all localities. $G_{LD}$ is defined on $V_{LD}$, with each directed edge representing a link between two localities. In our case, the edge weight is directly proportional to the (estimated) flow of the host crop. A schematic for the construction of this graph is provided in Fig. 1 (blue blocks). Specific construction details are provided in the description of the SN and NP networks.
[Fig. 1 is a flowchart. Inputs: grid cell size; subdivision-level spatial data for the study region; grid-level distribution maps of host crops, human population, etc.; seasonal production, climate zones, and county-to-growing-season maps; production volume, GDP, etc. at some administrative level; weather data such as temperature and precipitation; commodity prices; urban and rural area spatial data; locality production and population thresholds; annual/seasonal trade flow; model parameters. Steps: grid representation of the study region (cells with subdivision attributes); spatial aggregation and disaggregation of attributes; temporal disaggregation (annual/seasonal to month level); locality construction (map cells to localities and aggregate attribute values); if trade information is present, spatiotemporal disaggregation of commodity flows (annual/seasonal flows to monthly flows), otherwise spatial interaction models (monthly flows from month-level attributes). Output: seasonal commodity flow network.]
Fig. 1. The pipeline for constructing multi-pathway networks. (Color figure online)
2.1 The Senegal Network (SN)
For each cell, the production was assigned as follows. The Spatial Production Allocation Model (MAPSPAM) [28] provides estimates of vegetable production at a finer grid resolution. This was mapped to the grid cells in our model. For cell $v$, let $m_v$ denote this value. Even though this quantity is not representative of tomato production in the cell, it is indicative of how suitable the cell is for tomato production. In particular, cells not suitable for any production can be identified using this quantity. For Senegal, seasonal production data at province level and trade data at city and market level were collected in 2017 through extensive surveys [17]. There are three major seasons: cold dry (November to February), hot dry (March to June), and rainy. First, the seasonal production was disaggregated to monthly production by distributing it uniformly over the corresponding months. Let the total tomato production in each department (administrative unit 2) $S$ be denoted by $P(S, t)$ for month $t$. For each cell $v \in S$, the estimated tomato production is given by $P(v, t) = P(S, t) \times m_v / \sum_{v' \in S} m_{v'}$. Here, we distribute the production to cells according to weights obtained from MAPSPAM data. In some states where the production is very low, it is possible
that $\forall v \in S$, $m_v = 0$, but $P(S, t) \neq 0$. To avoid such scenarios, we initially add a small positive constant to each $m_v$. For locality construction, we used trade data obtained from the survey as the reference. For most cities we found that mapping each city to the second administrative unit (referred to as department) was most suitable. However, some cities are very small compared to the departments they belong to. In such cases, a circular region centered on the coordinates of the city was chosen as the locality so that the population covered was comparable to the known population of the city. Like production, commodity flow for each season from locality $u$ to $v$ was divided uniformly over the months corresponding to that season. In Fig. 2, we have visualized the Senegal networks for the months of January and March against the backdrop of a heatmap of cell-level tomato production.
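The two disaggregation steps described for the Senegal network (uniform split of a seasonal total over its months, then weighting cells by their MAPSPAM value) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the value of the constant `eps`, and the sample numbers are all hypothetical.

```python
def seasonal_to_monthly(seasonal_total, months):
    """Uniformly distribute a seasonal total over the months of that season."""
    share = seasonal_total / len(months)
    return {month: share for month in months}


def disaggregate_production(dept_production, cell_weights, eps=1e-6):
    """Split a department's monthly production P(S, t) across its cells.

    dept_production: P(S, t) for one department S and one month t.
    cell_weights: dict cell id -> suitability weight m_v (non-negative).
    eps: small positive constant added to every weight so that the
         denominator is never zero (hypothetical value).
    """
    adjusted = {v: m + eps for v, m in cell_weights.items()}
    total = sum(adjusted.values())
    return {v: dept_production * m / total for v, m in adjusted.items()}


# Hypothetical numbers: 1200 units produced during the cold dry season.
monthly = seasonal_to_monthly(1200.0, ["Nov", "Dec", "Jan", "Feb"])
cells = disaggregate_production(monthly["Jan"], {"c1": 3.0, "c2": 1.0, "c3": 0.0})
```

Adding the constant to every weight, rather than only to zero weights, keeps the relative ordering of cells unchanged while guaranteeing that a department with nonzero production and all-zero weights still distributes its production (uniformly) over its cells.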
Fig. 2. Seasonal tomato flow networks along with production. The edge weights are proportional to logarithm of the flow volume. In the title, we have the network name followed by month. In each case, to limit the number of edges displayed, we have applied a threshold on the edge weight.
2.2 The Nepal Network (NP)
The production and trade data for this network were obtained from Venkatramanan et al. [26]. Here, grid and node attributes are created the same way as in the SN network. Annual tomato production data is provided at administrative level 3 (district). Using MAPSPAM and production data, spatial disaggregation of the production was performed as in the case of the Senegal network. For
temporal disaggregation, we used information about tomato growing seasons. Unlike in Senegal, the period of cultivation of vegetables depends on the region. Nepal has a unique geography; in a span of 100 km from south to north, the elevation increases from near sea level to several thousand meters above sea level (masl). Accordingly, the country is divided into three regions: Terai (67–300 masl), Mid Hills (700–3000 masl), and High Hills (>3000 masl). The tomato production season varies widely across these regions [26]. We used this information to uniformly disaggregate the annual production for each cell to monthly production. To construct localities, we used data on major vegetable wholesale markets. Since markets belonging to the same district are very close to each other, we used the districts as localities. For Nepal, trade data is available for only one market. Therefore, we applied a doubly-constrained gravity model [13,15,26]. The main assumption here is that the trade volume from locality $u$ to $v$ is driven by production at $u$, consumption at $v$, and the distance between them. For each locality $i$, let $O_i(t)$ and $I_i(t)$ denote the total outflow and total inflow for month $t$. The total outflow accounts for the amount of production in the locality, imports, and exports, i.e., $O_i(t) = \text{production}(i, t) + \text{import}(i, t) - \text{export}(i, t)$. The inflow $I_i(t)$ corresponds to estimated consumption, which is modeled as a function of population and GDP. The flow $F_{ij}$ from locality $v_i$ to locality $v_j$ is given by $F_{ij}(t) = a_i(t)\, b_j(t)\, O_i(t)^{\alpha_1}\, I_j(t)^{\alpha_2}\, f(d_{ij})$, where $d_{ij}$ is the travel time from $v_i$ to $v_j$, and $f(\cdot)$ is the distance deterrence function $f(d_{ij}) = d_{ij}^{-\beta} \exp(-d_{ij}/\kappa)$; $\alpha_1$, $\alpha_2$, $\beta$, and $\kappa$ are model parameters. The coefficients $a_i$ and $b_j$ are computed through an iterative process such that the total outflow and total inflow at each node agree with the input values [13]. For distance, we used the Google API [9] to compute travel time between pairs of localities. In Fig. 2, we have visualized the NP network for the months of January and September.
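The iterative computation of the balancing coefficients can be illustrated with a short sketch. It assumes that the iteration is the usual alternating row and column rescaling (iterative proportional fitting), that self-flows are excluded, and that total outflow equals total inflow; none of these details is stated in the text. The function names, parameter values, and the three-locality data are hypothetical.

```python
import math


def deterrence(d, beta, kappa):
    """Distance deterrence f(d) = d^(-beta) * exp(-d / kappa)."""
    return d ** (-beta) * math.exp(-d / kappa)


def gravity_flows(outflow, inflow, dist, alpha1=1.0, alpha2=1.0,
                  beta=1.0, kappa=100.0, iters=500):
    """Doubly-constrained gravity model for one month.

    Returns F[i][j] = a_i * b_j * O_i^alpha1 * I_j^alpha2 * f(d_ij), with
    a_i and b_j chosen so that row sums match outflow and column sums
    match inflow (requires sum(outflow) == sum(inflow)).
    """
    n = len(outflow)
    # Unbalanced term; the diagonal is set to zero (no self-flows, an assumption).
    base = [[0.0 if i == j else
             outflow[i] ** alpha1 * inflow[j] ** alpha2
             * deterrence(dist[i][j], beta, kappa)
             for j in range(n)] for i in range(n)]
    a = [1.0] * n
    b = [1.0] * n
    for _ in range(iters):
        for i in range(n):  # rescale rows to match total outflow
            row = sum(b[j] * base[i][j] for j in range(n))
            a[i] = outflow[i] / row if row > 0 else 0.0
        for j in range(n):  # rescale columns to match total inflow
            col = sum(a[i] * base[i][j] for i in range(n))
            b[j] = inflow[j] / col if col > 0 else 0.0
    return [[a[i] * b[j] * base[i][j] for j in range(n)] for i in range(n)]


# Hypothetical three-locality example (travel times in arbitrary units).
outflow = [50.0, 30.0, 20.0]
inflow = [20.0, 30.0, 50.0]
travel_time = [[0, 10, 20], [10, 0, 15], [20, 15, 0]]
flows = gravity_flows(outflow, inflow, travel_time)
```

A fixed number of sweeps is used for brevity; a practical implementation would more likely stop once the row and column sums agree with the targets to within a tolerance.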
3 Network Analysis
Dynamics of Production and Trade in the SN Network. The SN network spans 331 cells and comprises 14 localities. We recall that for the SN network, the trade data was available, while for NP, it was estimated using a gravity model. Using decision trees [3], we analyzed the relationship between the flow and node attributes (production and population) and the edge attribute (distance) between source and target. The results are in Fig. 3. Our analysis indicates that production at the source is a primary driver of outflows (plots in the first two rows). There is no evidence of a clear inverse relationship with distance. We notice some very long-distance flows from a high-production area to the capital city. This could also be attributed to the country's population distribution, which is concentrated on the west coast. Another important observation is that target locality production and target population are highly correlated (last row in the figure), indicating that much of tomato production occurs close to localities where there is demand.
[Fig. 3: histograms of flow weight with summary statistics (mean, median, max, SD) and correlation matrices over weight, from_pop, to_pop, from_production, to_production, and distance. Plot contents are not recoverable.]
… $|s|$ and if $s$ is a suffix of $s'$. The extension $s'$ of $s$ is said to be relevant [10] if
$$D_{KL}(p_{s'} \,\|\, p_s) > \frac{|s'|}{\log_2(1 + c(s'))} \qquad (2)$$
where DKL denotes the Kullback-Leibler divergence. Figure 1b shows the distributions p for the relevant extensions found in a toy example. The threshold used (right side of Eq. 2) makes it harder for longer and sparsely observed extensions
to be found relevant. The extraction of relevant extensions starts from the first-order sub-sequences. The condition in Eq. 2 is applied recursively to extensions of already detected extensions. An upper bound of DKL is used to stop the recursion. The construction of Von is therefore parameter-free.
Network Construction. Von is constructed so that a memory-less random walk performed on it is a good approximation of the input sequential data S. This network is denoted Von = (V, E, w), where V is the set of nodes. These nodes represent all relevant extensions s and all their prefixes (see Fig. 1c). This ensures that any node v representing a relevant extension of an item σ is reachable during a random walk. We denote by V(σ) the set of nodes representing item σ ∈ Ω and by Nrep(σ) = |V(σ)| the number of such representations. For each pair (s, σ), where s is a relevant extension and σ an item such that p(σ|s) > 0, a directed link s → s∗σ is added to the network. The node s∗σ represents the longest suffix of sσ such that s∗σ ∈ V. The weight of this link is p(σ|s).
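The relevance test of Eq. 2 can be illustrated with a small sketch. It assumes that the threshold on the right side of Eq. 2 is the length of the extension divided by log2(1 + c), where c is the number of times the extension was observed, and that distributions are stored as dictionaries from next item to probability. The function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """D_KL(p || q) in bits; q must be positive wherever p is positive."""
    return sum(pv * math.log2(pv / q[item]) for item, pv in p.items() if pv > 0)


def is_relevant(p_ext, p_base, order_ext, count_ext):
    """Return True when D_KL(p_ext || p_base) exceeds order_ext / log2(1 + count_ext).

    p_ext: next-item distribution after the extension.
    p_base: next-item distribution after the shorter sub-sequence.
    order_ext: length of the extension.
    count_ext: number of observations of the extension (at least 1).
    """
    threshold = order_ext / math.log2(1 + count_ext)
    return kl_divergence(p_ext, p_base) > threshold


# Toy example: after the first-order context, "a" and "b" are equally likely;
# after the second-order extension, only "a" is ever observed.
p_first = {"a": 0.5, "b": 0.5}
p_second = {"a": 1.0}
often_seen = is_relevant(p_second, p_first, order_ext=2, count_ext=7)
rarely_seen = is_relevant(p_second, p_first, order_ext=2, count_ext=1)
```

In this toy case the divergence is one bit. With seven observations the threshold is 2/3 and the extension is kept, while with a single observation the threshold is 2 and it is rejected, which shows how sparsely observed extensions are penalized.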
4 Application of PageRank to Variable-Order Networks
In this section, we first introduce the PageRank (PR) measure and its direct application to Von. We discuss the effect of the distribution of Nrep on the distribution of PR probabilities. In order to isolate this effect, we introduce a biased 1st onPR model. A bias-free model called the Unbiased Von PR model is then introduced.
Standard PageRank Model (1st onPR). The PR measure is an efficient eigenvector centrality measure in the context of directed networks. It was implemented in Google's search engine by its inventors Brin and Page [3]. The PR definition of a node's importance can be interpreted as follows: the more a node is pointed to by important nodes, the more important it is. PR is equivalent to the steady state of a random surfer (RS) following a memory-less Markov process. The RS can follow links of the network with probability α or teleport uniformly towards a node of the network with probability 1 − α (it also teleports from any sink node). These teleportations ensure that the RS cannot be stuck in a sub-region of the network and that the steady-state probability distribution is unique. The PR probability associated to node i is denoted P(i). As item i ∈ Ω is represented by a single node i in 1st on, the PR probability associated to item i, Π1(i), is equal to P(i). One can sort items by decreasing order of their PR probabilities. We denote by K1 the ranks associated to the Π1 values.
Variable-Order Network PageRank (VonPR). In the case of Vons, the memoryless Markov process actually simulates the variable-order model, as memory is encoded into the nodes. Therefore, [12] suggests that standard PR directly applied to Von will better reflect dependencies between items in the system than Π1. Since more than one node may represent an item in Von, [12] defined the PageRank
of an item as the probability for the RS to reach at least one of its representations (see Eq. 3).
$$\Pi(i) = \sum_{v \in V(i)} P(v) \qquad (3)$$
We denote by ΠVon and KVon the PR values and ranking issued from VonPR model. Since we use a random surfer, the more representations item i has, the higher is the probability to teleport to one of them. As Eq. 3 sums over representations of item i, this translates to a bias that is solely due to the teleportation mechanism. We can illustrate this effect with a simple example (see Fig. 2). The value of ΠVon (c) is always greater than or equal to 0.5 in the situation illustrated in Fig. 2b while it is always lower than or equal to 0.5 in Fig. 2a. Equality is achieved when α = 1 (i.e. when there is no teleportation). Although order 2 dependencies exist in 2b, it is hard to justify why item c should be “more central” in this case.
(a) Von without round trips (b) Von with round trips
Fig. 2. Example of Von models of trajectories where all flows go through an item c. In (a), when leaving c, a traveler goes uniformly to any of the satellites si . In (b), a traveler coming from si always goes back to si .
(Nrep)-biased PageRank model (Biased 1st onPR). In order to isolate the bias due to teleportations, we assume the transition probabilities associated to representations of item i are all equal to pi, i.e. the representations of i do not encode any different behaviour. This is equivalent to computing PR on 1st on using a preferential teleportation vector vB depending on Nrep, as expressed in Eq. 4.
$$v_B(j) = \frac{N_{rep}(j)}{\sum_{k \in \Omega} N_{rep}(k)} \qquad (4)$$
The item PR values associated to this model and its resulting ranking are denoted Π1B and K1B respectively. In the example above, Π1B computed on Fig. 2a is equal to ΠVon computed on Fig. 2b since the order-2 dependencies do not affect the centrality of c in this example. Unbiased Von PageRank Model. In order to remove the bias discussed above, a modification of the teleportation vector is also used. Although several corrections
are possible, the one chosen corresponds to the following random surfing process: teleportation is assumed to be the beginning of a new journey. It is therefore only possible to teleport uniformly to first-order nodes (see Eq. 5).
$$v_U(i) = \begin{cases} 1/|\Omega| & \text{if } |i| = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$
It is easy to show that each node is reachable during the Markov process and therefore that the RS steady state is still unique using this teleportation vector. The item PR values associated to this model and its resulting ranking are denoted $\Pi^U_{Von}$ and $K^U_{Von}$ respectively.
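The four models of this section differ only in the network used and in the teleportation vector, which the following sketch tries to make concrete on the toy example of Fig. 2. It is an illustration under stated assumptions, not the authors' code (their repository is linked in Sect. 5). Nodes are encoded as tuples whose last element is the represented item, the structure of the two toy networks is our reading of Fig. 2, and the network of Fig. 2b is assumed to keep a first-order node for c. All function names are hypothetical.

```python
def pagerank(links, teleport, alpha=0.85, tol=1e-12, max_iter=10000):
    """Power iteration with an arbitrary teleportation vector.

    links: dict node -> {successor: transition probability} (rows sum to 1).
    teleport: dict node -> teleportation probability (sums to 1).
    Nodes without out-links teleport with probability 1.
    """
    nodes = list(teleport)
    p = dict(teleport)
    for _ in range(max_iter):
        sink_mass = sum(p[u] for u in nodes if not links.get(u))
        new = {v: (1 - alpha + alpha * sink_mass) * teleport[v] for v in nodes}
        for u in nodes:
            for v, w in links.get(u, {}).items():
                new[v] += alpha * p[u] * w
        converged = sum(abs(new[v] - p[v]) for v in nodes) < tol
        p = new
        if converged:
            break
    return p


def item_pagerank(node_pr):
    """Sum node probabilities over all representations of each item (Eq. 3)."""
    totals = {}
    for node, prob in node_pr.items():
        totals[node[-1]] = totals.get(node[-1], 0.0) + prob
    return totals


def uniform_teleport(nodes):
    return {v: 1.0 / len(nodes) for v in nodes}


def nrep_teleport(first_order_nodes, von_nodes):
    """Teleport to an item in proportion to its number of representations (Eq. 4)."""
    nrep = {}
    for node in von_nodes:
        nrep[node[-1]] = nrep.get(node[-1], 0) + 1
    total = sum(nrep.values())
    return {v: nrep[v[-1]] / total for v in first_order_nodes}


def first_order_teleport(nodes):
    """Teleport uniformly to first-order nodes only (Eq. 5)."""
    first = [v for v in nodes if len(v) == 1]
    return {v: (1.0 / len(first) if len(v) == 1 else 0.0) for v in nodes}


sats = ["s1", "s2", "s3", "s4"]
# Fig. 2a: a traveler leaving c goes uniformly to any satellite.
links_a = {("c",): {(s,): 0.25 for s in sats}}
links_a.update({(s,): {("c",): 1.0} for s in sats})
# Fig. 2b: a traveler arriving at c from s_i always returns to s_i.
links_b = {("c",): {(s,): 0.25 for s in sats}}
for s in sats:
    links_b[(s,)] = {(s, "c"): 1.0}
    links_b[(s, "c")] = {(s,): 1.0}

pi_1 = item_pagerank(pagerank(links_a, uniform_teleport(list(links_a))))
pi_1b = item_pagerank(pagerank(links_a, nrep_teleport(list(links_a), list(links_b))))
pi_von = item_pagerank(pagerank(links_b, uniform_teleport(list(links_b))))
pi_unb = item_pagerank(pagerank(links_b, first_order_teleport(list(links_b))))
```

Under these assumptions, the value for c from `pi_von` lies above 0.5 and coincides with the one from `pi_1b`, while the value from `pi_unb` coincides with the first-order value `pi_1`, which lies below 0.5. This is consistent with the discussion of Fig. 2 above.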
5 Datasets and Experimental Settings
Datasets. The three datasets used correspond to spatial trajectories. They differ, however, in terms of length, number of sequences, number of items, etc. For each dataset and each sequence, we removed any repetition of items. The code used and the datasets are available at https://github.com/ccoquide/unbiased-vonpr/.
– Maritime: Sequences of ports visited by shipping vessels from April 1st to July 31st, 2009. Data are extracted from the Lloyd's Maritime Intelligence Unit. A variable-order network (Von) analysis of Maritime is presented in [12].
– Airports: US flight itineraries from the RITA TransStat 2014 database [2], during the 1st quarter of 2011. Each sequence is related to a passenger and describes the passenger's trip in terms of airport stops. In [11] and [9], fixed-order network (Fon) representations of the dataset are presented.
– Taxis: Taxi rides in Porto City from July 1st, 2013 to June 30th, 2014. A sequence reports the succession of positions (recorded every 15 s) during a ride. The original dataset [1] was part of the ECML/PKDD challenge of 2015. Each GPS location composing the sequences is mapped to the nearest police station, as suggested in [10].
Sequence and network statistics are reported in Table 1. We can observe that a large proportion of items have a large number of representations Nrep. The Nrep values are far from being uniformly distributed.
Table 1. Datasets and networks information.
Dataset   |S|    |Ω|  |V|   |E|    max(order)  Avg. Nrep  Q9(Nrep)  max(Nrep)
Maritime  4K     909  18K   47K    8           20         50        674
Airports  2751K  446  443K  1292K  6           995        1K        34K
Taxis     1514K  41   4K    15K    14          99         250       382
Experimental Settings. For a given dataset, we compute PR values according to the different models with α = 0.85, along with the corresponding rankings (see Sect. 4). Items having the same PR probabilities are given the same rank. In addition to the four models described in the previous section, we also report the following measures.
– Nrep ranking (Krep) is the ranking of items by decreasing order of Nrep. We quantify how Nrep-biased other PR models are by comparing them with this benchmark.
– Visit rank (KV) is the ranking based on the probability of each item to occur in the input sequences. The visit rank is used as “ground truth” in [11, Eq. 9] for validation of the author's selection of a fixed-order model. However, we argue that this characterization is limited. For example, in the extreme situation where sequences are composed of only two items and can be viewed as a list of directed arcs, KV would correspond to the ranking made from node degrees. More generally, using the item count as a centrality measure assumes an underlying symmetry in the system, i.e. every place is as much a destination as it is a departure.
6 Results
We show here that the bias effect is indeed important when looking at Π values or the resulting rankings. Moreover, this is still true when using alternative damping factor values.
Evolution of Π values with Nrep. We denote by η(Nrep) the probability that a random surfer (RS) visits any item having at least Nrep representations, i.e.
$$\eta(N_{rep}) = \sum_{j \in \Omega,\ N_{rep}(j) \ge N_{rep}} \Pi(j) \qquad (6)$$
The impact of Nrep on PR probabilities is quantified by the relative PR boost Δη/η' = (η − η')/η', where η' is related to the 1st onPR model. We show the evolution of Δη/η' with Nrep in Fig. 3 for each dataset. The VonPR (ΠVon) and Biased 1st onPR (Π1B) models have the highest relative PR boosts. For example, in the case of the Maritime dataset (see Fig. 3a), the relative PR boosts at $N_{rep} = N_{rep}^{max}$ equal 60% and 65% respectively for these models (compared to the 1st onPR probabilities Π1). Moreover, we see that the PR boosts relative to ΠVon fit well with the Π1B ones. The Unbiased VonPR probabilities ($\Pi^U_{Von}$) are impacted very differently. For the Airports and Taxis datasets, the shape of the boost distributions is similar to VonPR's but with lower boost values. Although relative PR boosts are the lowest for our model, the higher-order dependencies it encodes still lead to differences with 1st onPR. However, the fact that PR probabilities are boosted for highly represented items doesn't necessarily lead to biased PR rankings. Therefore, we investigate the changes in rankings when using the different models.
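The quantities η(Nrep) and the relative PR boost can be computed directly from item-level PR values, as in the sketch below. The function names and the three-item toy values are hypothetical and serve only to show the arithmetic.

```python
def eta(item_pr, nrep, threshold):
    """Probability mass of items with at least `threshold` representations (Eq. 6)."""
    return sum(p for item, p in item_pr.items() if nrep[item] >= threshold)


def relative_boost(item_pr, item_pr_first_order, nrep, threshold):
    """Relative PR boost (eta - eta') / eta', with eta' taken from the first-order model."""
    base = eta(item_pr_first_order, nrep, threshold)
    return (eta(item_pr, nrep, threshold) - base) / base


# Toy values (not taken from the paper).
nrep = {"a": 5, "b": 2, "c": 1}
pr_first_order = {"a": 0.4, "b": 0.3, "c": 0.3}
pr_model = {"a": 0.5, "b": 0.3, "c": 0.2}
boost_at_2 = relative_boost(pr_model, pr_first_order, nrep, threshold=2)
boost_at_1 = relative_boost(pr_model, pr_first_order, nrep, threshold=1)
```

With a threshold of 1 every item is counted, so both masses equal one and the boost vanishes. Raising the threshold restricts the sum to highly represented items, which is where a model biased by Nrep gains probability.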
(a) Maritime (b) Airports (c) Taxis
Fig. 3. Relative PageRank boost Δη/η' versus Nrep for three PR models, with α = 0.85. For a given number of representations Nrep, η(Nrep) is the probability to reach any item having at least Nrep representations. η' is related to the 1st onPR model.
Table 2. Spearman (rs) and Kendall (rτ) coefficients between PR rankings.
             K1           K1B          KVon         KUVon
             rs     rτ    rs     rτ    rs     rτ    rs     rτ
a) Maritime
K1           -      -     0.96   0.85  0.95   0.81  0.95   0.81
K1B          -      -     -      -     0.98   0.89  0.90   0.74
KV           0.94   0.81  0.99   0.92  0.97   0.85  0.89   0.71
b) Airports
K1           -      -     0.98   0.91  0.96   0.86  0.95   0.83
K1B          -      -     -      -     0.99   0.91  0.91   0.79
KV           0.98   0.91  0.998  0.96  0.99   0.92  0.92   0.80
c) Taxis
K1           -      -     0.62   0.48  0.44   0.34  0.77   0.61
K1B          -      -     -      -     0.94   0.82  0.88   0.70
KV           0.42   0.30  0.92   0.79  0.98   0.91  0.76   0.58
Rankings Comparison. We quantified similarities between pairs of PR rankings by using both Spearman and Kendall correlation coefficients. Table 2 displays similarities for each dataset. We observe high similarity between VonPR (KVon) and Biased 1st onPR (K1B) rankings. On the other hand, correlation coefficients between Unbiased VonPR rankings ($K^U_{Von}$) and K1B are lower. We also observe overall lower correlation coefficients with the Taxis dataset, which are probably due to the lower number of items. Note that the visit rank (KV) is closer to K1B, or to KVon for Taxis. If we were to use KV as a PR selection method, we would likely select our biased 1st onPR model, which does not include any higher-order dependencies. This highlights the fact that KV is not an efficient benchmark in the context of Von.
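The two rank correlation coefficients used in Table 2 can be computed as in the following sketch. It makes assumptions that the text does not specify: tied items receive their average rank for the Spearman coefficient, the Kendall coefficient is the tau-b variant, and neither ranking is constant. The function names and the four-item example are hypothetical.

```python
import math


def average_ranks(scores):
    """Rank items by decreasing score; tied items share their average rank."""
    ordered = sorted(scores, key=lambda k: -scores[k])
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(scores_a, scores_b):
    """Spearman coefficient: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(scores_a), average_ranks(scores_b)
    items = list(ra)
    mean_a = sum(ra[k] for k in items) / len(items)
    mean_b = sum(rb[k] for k in items) / len(items)
    cov = sum((ra[k] - mean_a) * (rb[k] - mean_b) for k in items)
    var_a = sum((ra[k] - mean_a) ** 2 for k in items)
    var_b = sum((rb[k] - mean_b) ** 2 for k in items)
    return cov / math.sqrt(var_a * var_b)


def kendall_tau_b(scores_a, scores_b):
    """Kendall tau-b from concordant and discordant pairs, corrected for ties."""
    items = list(scores_a)
    conc = disc = only_a_tied = only_b_tied = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            da = scores_a[items[i]] - scores_a[items[j]]
            db = scores_b[items[i]] - scores_b[items[j]]
            if da == 0 and db == 0:
                continue  # tied in both rankings: ignored by tau-b
            if da == 0:
                only_a_tied += 1
            elif db == 0:
                only_b_tied += 1
            elif da * db > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + only_a_tied) * (conc + disc + only_b_tied))
    return (conc - disc) / denom


# Hypothetical PR values for four items under two models (x and y swap places).
pr_model_a = {"w": 0.4, "x": 0.3, "y": 0.2, "z": 0.1}
pr_model_b = {"w": 0.4, "x": 0.2, "y": 0.3, "z": 0.1}
rs = spearman(pr_model_a, pr_model_b)
rtau = kendall_tau_b(pr_model_a, pr_model_b)
```

The pairwise loop is quadratic in the number of items, which is acceptable for item sets of the sizes in Table 1 (at most a few hundred items).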
The Nrep-bias also affects the Top 10 rankings, which are a popular usage of PR ranking. The Top 10s related to Maritime are displayed in Table 3. Ports with bold names are new entries when compared to the previous ranking. Although 80% of entries are common to all Top 10s, the differences come from reordering. Both KVon and K1B fit almost perfectly with the ten most represented ports (Krep). On the other hand, $K^U_{Von}$ may capture items with bad Krep, e.g. the port of Surabaya (Krep = 45 and $K^U_{Von}$ = 9).
Table 3. Maritime's Top 10s PageRank.
      K1                     K1B                  KVon                 KUVon
Rank  Port         Krep  KV  Port        Krep  KV  Port        Krep  KV  Port         Krep  KV
1     Singapore    2     2   Singapore   2     2   Hong Kong   1     1   Singapore    2     2
2     Hong Kong    1     1   Hong Kong   1     1   Singapore   2     2   Busan        4     4
3     Rotterdam    5     7   Shanghai    3     3   Shanghai    3     3   Hong Kong    1     1
4     Busan        4     4   Busan       4     4   Busan       4     4   Rotterdam    5     7
5     Shanghai     3     3   Rotterdam   5     7   Rotterdam   5     7   Shanghai     3     3
6     Hamburg      8     10  Port Klang  6     6   Port Klang  6     6   Hamburg      8     10
7     Port Klang   6     6   Kaohsiung   7     5   Kaohsiung   7     5   Antwerp      10    12
8     Antwerp      10    12  Hamburg     8     10  Hamburg     8     10  Bremerhaven  12    19
9     Bremerhaven  12    19  Antwerp     10    12  Antwerp     10    12  Surabaya     45    36
10    Kaohsiung    7     5   Jebel Ali   9     11  Jebel Ali   9     11  Port Klang   6     6
Since the number of items composing Taxis dataset (corresponding to subareas of Porto) is small enough, the PR scores of all items are given in Fig. 4. Both ΠVon and Π1B give bad ranks to peripherals. Only 1st onPR and Unbiased VonPR models give importance to peripheral neighborhoods. Finally, central regions have similar rankings whatever the model used.
(a) Π1 (b) Π1B (c) ΠVon (d) $\Pi^U_{Von}$
Fig. 4. Distribution of PageRank values for Porto's neighborhoods.
Dependence of the Nrep -bias with the Damping Factor α. Since in the literature an alternative value of the damping factor α could be used (as in [4]), we
investigated similarities between rankings regarding its choice. The evolution of the Spearman correlation coefficient with respect to changes in α is presented in Fig. 5 for α ∈ [0.5, 0.99]. Results related to the Kendall correlation coefficient are not reported since they are similar. For α ≤ 0.85, the observations made earlier are still valid. When teleportations are less frequent, different changes occur. Indeed, for Maritime and Airports, KVon becomes closer to K1 than $K^U_{Von}$ is to K1 (dashed lines). Overall, the pairs $K^U_{Von}$-KVon and K1B-K1 get closer as α tends to 1 due to the poor contribution of teleportations. For Taxis, we notice a switch at α ≈ 0.9 for K1B (solid lines). We think this is due to the low number of items in the Taxis dataset. In order to understand this behaviour, we need to further investigate other similar datasets.
(a) Maritime (b) Airports (c) Taxis
Fig. 5. Evolution of the Spearman correlation coefficient rs(α) with α, for pairs of rankings. Solid lines (dashed lines) are related to correlations with Biased 1st onPR (1st onPR). The vertical black dashed line represents α = 0.85.
7 Future Works
This study shows that the application of network measures to the new objects that are Vons is not trivial. We believe the adaptation of other analysis tools is an important challenge for the network science community. We are currently investigating the application of clustering algorithms such as Infomap [8] to Vons. This algorithm indeed uses PR in order to compare clustering qualities. However, [12] also suggests that such an algorithm can be directly applied to Vons with no modifications. The PR centrality measure has other applications. A recent method based on the Google matrix (the stochastic matrix which models the random surfer), called the reduced Google matrix, has shown its efficiency in inferring hidden links between a set of nodes of interest [6], for example in the study of Wikipedia networks [5]. Using user traces on websites rather than the usual hypertext click statistics, we will also study the generalization of this tool to Vons.
References
1. Porto taxis Trajectories Data. https://kaggle.com/crailtap/taxi-trajectory
2. RITA TransStat database. https://www.transtats.bts.gov/
3. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30(1), 107–117 (1998). https://www.sciencedirect.com/science/article/pii/S016975529800110X
4. Coquidé, C., Ermann, L., Lages, J., Shepelyansky, D.L.: Influence of petroleum and gas trade on EU economies from the reduced Google matrix analysis of UN COMTRADE data. Eur. Phys. J. B 92(8), 171 (2019). https://doi.org/10.1140/epjb/e2019-100132-6
5. Coquidé, C., Lages, J., Shepelyansky, D.L.: World influence and interactions of universities from Wikipedia networks. Eur. Phys. J. B 92(1), 3 (2019). https://doi.org/10.1140/epjb/e2018-90532-7
6. Frahm, K.M., Jaffrès-Runser, K., Shepelyansky, D.L.: Wikipedia mining of hidden links between political leaders. Eur. Phys. J. B 89(12), 269 (2016). https://doi.org/10.1140/epjb/e2016-70526-3
7. Gleich, D.F., Lim, L.H., Yu, Y.: Multilinear PageRank. SIAM J. Matrix Anal. Appl. 36(4), 1507–1541 (2015). https://epubs.siam.org/doi/10.1137/140985160
8. Rosvall, M., Axelsson, D., Bergstrom, C.T.: The map equation. Eur. Phys. J. Special Topics 178(1), 13–23 (2009). https://doi.org/10.1140/epjst/e2010-01179-1
9. Rosvall, M., Esquivel, A.V., Lancichinetti, A., West, J.D., Lambiotte, R.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5(1), 4630 (2014). https://www.nature.com/articles/ncomms5630
10. Saebi, M., Xu, J., Kaplan, L.M., Ribeiro, B., Chawla, N.V.: Efficient modeling of higher-order dependencies in networks: from algorithm to application for anomaly detection. EPJ Data Sci. 9(1), 1–22 (2020). https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-020-00233-y
11. Scholtes, I.: When is a network a network? Multi-order graphical model selection in pathways and temporal networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1037–1046, KDD 2017. Association for Computing Machinery, New York, NY, USA, August 2017. https://doi.org/10.1145/3097983.3098145
12. Xu, J., Wickramarathne, T.L., Chawla, N.V.: Representing higher-order dependencies in networks. Sci. Adv. 2(5), e1600028 (2016). https://advances.sciencemag.org/content/2/5/e1600028
13. Yin, H., Benson, A.R., Leskovec, J., Gleich, D.F.: Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564, KDD 2017. Association for Computing Machinery, New York, NY, USA, August 2017. https://doi.org/10.1145/3097983.3098069
14. Zhang, Z., Xu, W., Zhang, Z., Chen, G.: Opinion dynamics incorporating higher-order interactions. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 1430–1435, November 2020. https://doi.org/10.1109/ICDM50108.2020.00189
Fellow Travelers Phenomenon Present in Real-World Networks
Abdulhakeem O. Mohammed1(B), Feodor F. Dragan2, and Heather M. Guarnera3
1 Duhok Polytechnic University, Duhok, Kurdistan Region, Iraq [email protected]
2 Computer Science Department, Kent State University, Kent, USA [email protected]
3 Department of Mathematical and Computational Sciences, The College of Wooster, Wooster, USA [email protected]
Abstract. We investigate a metric parameter “Leanness” of graphs which is a formalization of a well-known Fellow Travelers Property present in some metric spaces. Given a graph G = (V, E), the leanness of G is the smallest λ such that, for every pair of vertices x, y ∈ V , all shortest (x, y)-paths stay within distance λ from each other. We show that this parameter is bounded for many structured graph classes and is small for many real-world networks. We present efficient algorithms to compute or estimate this parameter and evaluate the performance of our algorithms on a number of real-world networks.
1 Introduction
Fellow Travelers Property in metric spaces states that two geodesics γ and γ' with the same ends x and y always stay close to each other, i.e., two travelers moving along γ and γ' from x to y at the same speed always stay at distance at most λ from each other [21]. In graphs, geodesics are shortest paths and this property is formulated as follows: two travelers moving along shortest paths P(x, y) and P'(x, y) from vertex x to vertex y at the same speed always stay at distance at most λ from each other. Given a graph G = (V, E), the leanness λ(G) of G is the smallest λ such that, for every pair of vertices x, y ∈ V, all shortest (x, y)-paths stay within distance λ from each other (we give a precise definition of this parameter in Definition 1). In this paper, we investigate this metric parameter on real-world networks and on structured families of graphs. We show that this parameter is bounded for many structured graph classes and is small for many real-world networks. This gives a structural indication that, in these networks with small leanness, shortest paths between the same two points tend to stay close together.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 194–206, 2022. https://doi.org/10.1007/978-3-030-93409-5_17
Related Work. Understanding the geometric properties of complex networks is a key issue in network analysis and geometric graph theory. In earlier empirical
Fellow Travelers Phenomenon Present in Real-World Networks
195
and theoretical studies researchers have mainly focused on features such as small world phenomenon, power law degree distribution, navigability, and high clustering coefficients. Those nice features were observed in many real-world complex networks and graphs arising in Internet applications, in the biological and social sciences, and in chemistry and physics. Although those features are interesting and important, as noted in [27], the impact of intrinsic geometric and topological features of large-scale data networks on performance, reliability and security is of much greater importance. Recently, there has been a surge of empirical and theoretical work measuring and analyzing geometric characteristics of real-world networks [1,2,23,24,27, 29]. One important such property is negative curvature [27], causing the traffic between the vertices to pass through a relatively small core of the network - as if the shortest paths between them were curved inwards. It has been empirically observed [23,27], then formally proved [10], that such a phenomenon is related to the value of the Gromov hyperbolicity (sometimes called also the global negative curvature) of the graph. A graph G = (V, E) is said to be δ-hyperbolic [21] if for any four vertices u, v, x, y of V , the two larger of the three distance sums d(u, v) + d(x, y), d(u, x) + d(v, y), d(u, y) + d(v, x) differ by at most 2δ. The smallest value δ for which G is δ-hyperbolic is called the hyperbolicity of G and denoted by δ(G). Hyperbolicity measures the local deviation of a metric from a tree metric. It has been shown that a number of data networks, including Internet application networks, web networks, collaboration networks, social networks, and others, have small hyperbolicity. Furthermore, graphs and networks with small hyperbolicities have many algorithmic advantages. They allow efficient approximate solutions for a number of optimization problems (see [8–11,18,25]). 
For an n-vertex graph G, the definition of hyperbolicity implies a simple brute-force O(n^4) algorithm for computing δ(G). This running time is too slow for computing the hyperbolicity of large graphs that occur in applications. Relying on matrix multiplication results, one can improve the upper bound on the time complexity to O(n^{3.69}) [19], but this is still not very practical. The best known practical algorithm still has an O(n^4)-time worst-case bound but uses several clever tricks when compared to the brute-force algorithm [12] (see also [3]). There are also fast heuristics for estimating the hyperbolicity of a graph [12,24], and some efficient constant-factor approximation algorithms [6,7,17,19] exist.
The leanness λ(G) and the hyperbolicity δ(G) of any graph G are related to each other. For any graph G (in fact, for any geodesic metric space), λ(G) ≤ 2δ(G) holds [21]. Furthermore, in Helly graphs G (a discrete analog of hyperconvex metric spaces), λ(G) is almost 2δ(G), namely 2δ(G) − 1 ≤ λ(G) ≤ 2δ(G) [15]. For general graphs, however, λ(G) and 2δ(G) could be very far apart. For example, if G is an induced odd cycle of length k = 4ℓ + 1, then λ(G) = 0 and 2δ(G) = 2ℓ − 1. In light of this, two natural questions arise: what other graph classes enjoy the property of the leanness being very close to twice the hyperbolicity, and are the parameters λ(G) and 2δ(G) close in real-world networks? If so, fast computations or estimations of λ(G) could give very good estimations of δ(G).
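To make the brute-force O(n^4) bound concrete, the following is a minimal Python sketch of the four-point computation, assuming the graph is given as a dictionary of adjacency lists. The function names and the `cycle` helper are hypothetical and chosen only for illustration; this is not the optimized algorithm of [12].

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from `source` in a connected unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def hyperbolicity_bruteforce(adj):
    """delta(G) via the four-point condition on all quadruples, O(n^4) time."""
    d = {v: bfs_distances(adj, v) for v in adj}
    delta = 0.0
    # A quadruple with a repeated vertex always contributes 0,
    # so it suffices to enumerate quadruples of distinct vertices.
    for u, v, x, y in combinations(list(adj), 4):
        sums = sorted((d[u][v] + d[x][y], d[u][x] + d[v][y], d[u][y] + d[v][x]))
        delta = max(delta, (sums[2] - sums[1]) / 2)
    return delta


def cycle(k):
    """Adjacency lists of the induced cycle on vertices 0..k-1 (illustrative helper)."""
    return {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}


# Odd cycle of length k = 4*1 + 1: shortest paths are unique, yet 2*delta = 1.
print(hyperbolicity_bruteforce(cycle(5)))  # 0.5
```

On the 5-cycle the sketch returns 0.5, so 2δ(G) = 1 while all shortest paths are unique, which is the smallest instance of the odd-cycle example above.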
A. O. Mohammed et al.
Our Contribution. On the theoretical side, we demonstrate that the leanness is small or bounded for many structured graph classes. Furthermore, we propose a dynamic programming algorithm that computes the leanness λ(G) of an arbitrary graph G with n vertices and m edges in O(n^2 m) time and O(n^2) space. Computing the exact leanness of a large graph with this theoretical O(n^2 m)-time dynamic programming algorithm is prohibitively expensive. Therefore, on the practical side, we provide an efficient practical algorithm to compute the exact leanness of large real-world networks. It has an O(n^4)-time worst-case bound but uses several tricks to speed up the computation. It is very fast in practice, as our experimental results show. We also design a very simple heuristic to estimate the leanness of a given graph. Lastly, we evaluate the performance of our algorithms on 26 real-world networks. We also compute the exact hyperbolicity of those networks using the algorithm provided by Cohen et al. [12]. Notably, we found that λ = 2δ for all 26 networks investigated in this paper. In light of this, we claim that our exact practical algorithm can be used to compute (or sharply estimate from below) the hyperbolicity of real-world networks. Furthermore, our heuristic can also be used to estimate the hyperbolicity. Our experimental results show that our exact practical algorithm is faster than the algorithm of Cohen et al. [12] in computing the exact hyperbolicity and that our heuristic, when used to estimate the hyperbolicity, outperforms two known heuristics from [12,24].
2 Basic Notions and Notations
All graphs appearing here are connected, finite, unweighted, undirected, and contain neither self-loops nor multiple edges. The distance d_G(u, v) between vertices u and v is the length of a shortest path connecting u and v in G. For any two vertices u, v of G, I_G(u, v) = {z ∈ V : d_G(u, v) = d_G(u, z) + d_G(z, v)} is the (metric) interval between u and v, i.e., the set of all vertices that lie on shortest paths between u and v. The set S_k(x, y) = {z ∈ I_G(x, y) : d_G(x, z) = k} is called a slice of the interval I_G(x, y), where 0 ≤ k ≤ d_G(x, y). The diameter diam(G) of a graph G is the largest distance between a pair of vertices in G, i.e., diam(G) = max_{u,v∈V} d_G(u, v). The diameter diam_G(S) of a set S ⊆ V is defined as max_{u,v∈S} d_G(u, v). For a vertex v of G, N_G(v) = {u ∈ V : uv ∈ E} is called the open neighborhood of v. We omit the subscript G if no confusion arises.
Definition 1 (Interval Leanness λ(G)). An interval I_G(x, y) is said to be λ-lean¹ if d_G(a, b) ≤ λ for all a, b ∈ I_G(x, y) with d_G(x, a) = d_G(x, b). The smallest integer λ for which all intervals of G are λ-lean is called the interval leanness (or simply leanness) of G and denoted by λ(G) (see Fig. 1).
¹ This is also known (see, e.g., [6]) under the name λ-thin interval, but to differentiate the graph thinness parameter based on the thinness of geodesic triangles from the graph thinness based on the thinness of intervals, we use here the word lean.
Fig. 1. A λ-lean interval.
Definition 2 (Hyperbolicity δ(G) (Gromov 4-point condition)). Let δ(u, v, x, y) denote the hyperbolicity of a quadruple u, v, x, y ∈ V, defined as half the difference between the two largest of the three distance sums S_1 = d_G(x, y) + d_G(u, v), S_2 = d_G(x, v) + d_G(u, y), and S_3 = d_G(x, u) + d_G(v, y). The hyperbolicity of a graph G is equal to δ(G) = max_{u,v,x,y∈V} δ(u, v, x, y).
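Definition 1 can be evaluated directly by enumerating every interval, splitting it into slices, and taking the largest slice diameter. The sketch below does exactly that under the assumption that the graph is a dictionary of adjacency lists; the function names and the `cycle` and `grid` helpers are hypothetical and serve only as an illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in a connected unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def leanness_bruteforce(adj):
    """lambda(G) from Definition 1: the largest diameter of any slice S_k(x, y)."""
    d = {v: bfs_distances(adj, v) for v in adj}
    vertices = list(adj)
    lam = 0
    for x in vertices:
        for y in vertices:
            slices = {}  # k -> vertices z of I(x, y) with d(x, z) = k
            for z in vertices:
                if d[x][z] + d[z][y] == d[x][y]:
                    slices.setdefault(d[x][z], []).append(z)
            for members in slices.values():
                for a in members:
                    for b in members:
                        lam = max(lam, d[a][b])
    return lam


def cycle(k):
    """Induced cycle on vertices 0..k-1 (illustrative helper)."""
    return {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}


def grid(rows, cols):
    """rows x cols grid graph on vertices (r, c) (illustrative helper)."""
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c): [(r + dr, c + dc) for dr, dc in steps
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
            for r, c in cells}


print(leanness_bruteforce(cycle(5)), leanness_bruteforce(cycle(8)))  # 0 4
```

The odd cycle has unique shortest paths and hence leanness 0, whereas in the 8-cycle the middle slice between two antipodal vertices consists of two vertices at distance 4.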
3 Theoretical Results
Interval Leanness of Structured Classes of Graphs. Since λ(G) ≤ 2δ(G) holds for every graph, we may derive several bounds on the leanness λ(G) from the bounds on hyperbolicity known in the literature (see [1,4,6,8,16,20,21,32] and papers cited therein). Clearly, λ(G) = 0 if and only if G is a graph where shortest paths are unique. In particular, λ(G) = 0 when G is an odd induced cycle or when G is a block graph (i.e., every biconnected component is a clique). The leanness is bounded by a small constant on chordal graphs (≤ 1), on AT-free graphs, dually chordal graphs, HHD-free graphs and distance-hereditary graphs (≤ 2 in all), and on many other structured classes of graphs. The leanness of any graph is also upper bounded by twice the slimness and by the thinness, i.e., λ(G) ≤ min{2δ(G), τ(G), 2ς(G)}, where τ(G) and ς(G) are the thinness and the slimness of a graph G, respectively (two other negative curvature parameters of graphs that are within constant factors of the hyperbolicity; see, e.g., [6] for exact definitions). Furthermore, the leanness of a graph G satisfies the inequality λ(G) ≤ min{k(G)/2, 2tl(G), 2Δ_s(G)}, where k(G), tl(G) and Δ_s(G) are, respectively, the chordality, the tree-length and the cluster diameter of the layering partition of G with respect to a vertex s (other metric tree-likeness parameters of graphs; see, e.g., [1] for exact definitions). Surprisingly, we have found that the equality λ(G) = 2δ(G) holds for all real-world networks that are considered in this paper. We will discuss this in more detail in Sect. 4.
On Exact Computation of Interval Leanness. It is obvious from the definition that the leanness λ(G) can be computed in O(n^4) time and O(n^2) space for a graph G with n vertices. Here, we propose a dynamic programming algorithm that computes the leanness λ(G) of a given graph G with n vertices and m edges in O(n^2 m) time and O(n^2) space. We have
Theorem 1. For every graph G with n vertices and m edges, the interval leanness λ(G) of G can be computed in O(n^2 m) time and with O(n^2) space.
We highlight the main ideas of our algorithm next. Let λ_x(G) = max{d(w, w′) : w, w′ ∈ I(x, y), d(x, w) = d(x, w′), y ∈ V} be the leanness of G at a fixed vertex x (here, the maximum is taken over all y ∈ V and w, w′ ∈ I(x, y) with d(x, w) = d(x, w′)). Notice that λ(G) = max_{x∈V} λ_x(G). Additionally, for every ordered pair x, y and every vertex w ∈ I(x, y), let M_w(x, y) = max{d(w, w′) : w′ ∈ I(x, y) and d(x, w) = d(x, w′)}. Since λ(G) = max_{x∈V} λ_x(G), given an algorithm for computing λ_x(G) in O(T(n, m)) time, we can compute λ(G) in O(nT(n, m)) time by calling this algorithm n times. Two simple lemmas are crucial to our algorithm.
Lemma 1. For any x ∈ V, λ_x(G) = max{M_w(x, y) : y ∈ V, w ∈ I(x, y)}.
The algorithm for computing λ_x(G) works as follows. First, we compute the distance matrix of G in O(nm) time. Next, we compute M_w(x, y) for all y and all w ∈ I(x, y) in O(nm) time. Finally, we compute max{M_w(x, y) : y ∈ V, w ∈ I(x, y)} in O(n^2) time. By Lemma 1, the obtained value is exactly λ_x(G). Hence, we are just left with proving that we can compute M_w(x, y) for all y and all w ∈ I(x, y) in O(nm) time. To do that, we extend the definition of M_w(x, y) to all vertices w ∈ V. Set M_w(x, y) = max{d(w, w′) : w′ ∈ I(x, y) and d(x, w) = d(x, w′)} if d(x, w) ≤ d(x, y), and set M_w(x, y) = 0 if d(x, w) > d(x, y). Now, to compute M_w(x, y) for any fixed x, w ∈ V, we can use the following recursive formula [6]:
\[
M_w(x, y) =
\begin{cases}
0, & \text{if } d(x, y) < d(x, w), \\
d(w, y), & \text{if } d(x, y) = d(x, w), \\
\max\{M_w(x, y') : y' \in N(y) \cap I(x, y)\}, & \text{otherwise.}
\end{cases}
\]
Since the distance matrix of G is available, using a standard dynamic programming approach, the values M_w(x, y) for all y ∈ V can be computed in O(Σ_y deg(y)) = O(m) time. Hence,
Lemma 2. For any fixed x, w ∈ V, one can compute the values of M_w(x, y) for all y ∈ V in O(m) time.
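The recursion for M_w(x, y) translates into a short dynamic program. The following Python sketch is one possible reading of the scheme above, not the authors' implementation: for each fixed x and w it fills M_w(x, ·) in order of increasing d(x, y), and uses Lemma 1 to update the answer only when w ∈ I(x, y). The adjacency-list input format and all function names are assumptions made for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in a connected unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def leanness_dp(adj):
    """lambda(G) via the recursion for M_w(x, y); O(n^2 (n + m)) time overall."""
    d = {v: bfs_distances(adj, v) for v in adj}  # distance matrix, O(nm) time
    lam = 0
    for x in adj:
        # Process y by increasing d(x, y) so predecessors are always ready.
        order = sorted(adj, key=lambda v: d[x][v])
        for w in adj:
            M = {}  # M[y] stores M_w(x, y) for the current x and w
            for y in order:
                if d[x][y] < d[x][w]:
                    M[y] = 0
                elif d[x][y] == d[x][w]:
                    M[y] = d[w][y]
                else:
                    # y' ranges over N(y) ∩ I(x, y): neighbours one step closer to x.
                    M[y] = max(M[p] for p in adj[y] if d[x][p] == d[x][y] - 1)
                # Lemma 1: only pairs with w in I(x, y) contribute to lambda_x(G).
                if d[x][w] + d[w][y] == d[x][y]:
                    lam = max(lam, M[y])
    return lam


def cycle(k):
    """Induced cycle on vertices 0..k-1 (illustrative helper)."""
    return {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}


print(leanness_dp(cycle(4)), leanness_dp(cycle(5)), leanness_dp(cycle(8)))  # 2 0 4
```

For a fixed pair x, w the inner loop touches every edge a constant number of times, which matches the O(m) bound of Lemma 2; the sort adds only O(n log n) per vertex x.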
4 Interval Leanness in Real-World Networks
In this section, we investigate the leanness in real-world networks. We first provide two algorithms: an efficient practical algorithm to compute the exact leanness of a network and a heuristic to estimate its leanness. We show the performance of our algorithms on a collection of real-world networks and find that the interval leanness tends to be small (≤ 8 in all 26 networks considered). Our experimental results indicate that our heuristic can also be used to estimate the hyperbolicity of a real-world network (recall that δ(G) ≥ λ(G)/2). We found that our heuristic consistently runs faster than two well-known heuristics presented in [12,24] for estimating the hyperbolicity of a graph.
Practical Exact Algorithm. Computing the exact leanness of a large graph with our theoretical O(n^2 m)-time dynamic programming algorithm is prohibitively expensive. Therefore, we provide next an efficient practical algorithm to compute the exact leanness of a real-world network. Since the leanness of a graph G is more likely to be attained on an interval between vertices that are far apart, we first sort the O(n^2) vertex pairs by distance and process them in non-increasing order. The search is complete once a pair x, y is found such that the current computed value of leanness λ is greater than or equal to d(x, y), and the current computed value is reported as the exact leanness of the given graph. A formal description is given in Algorithm 1. Notice that the leanness of a graph G is bounded by the diameter of G, i.e., λ(G) ≤ diam(G). The following simple lemmas are also used to reduce the running time of our algorithm.
Lemma 3. Let x, y be a pair of vertices of a graph G and λ be an integer. If d(x, y) ≤ λ, then the leanness of interval I(x, y) is at most λ.
Lemma 4. Let x, y be a pair of vertices of a graph G. Then, for each u, v ∈ V with d(x, y) = d(x, u) + d(u, v) + d(v, y), the leanness of interval I(u, v) is upper bounded by the leanness of interval I(x, y).
Lemma 5. Let λ be the current computed value of leanness and let p, q be the next vertex pair to be checked by Algorithm 1 (see line 2). Let t_1 = λ/2, t_2 = d(p, q) − λ/2 and S[i] := S_i(p, q), i := 0, 1, 2, ..., d(p, q), be the slices of the interval I(p, q). Then, for the pair p, q, we can start the search with slice S[t_1] and stop when we reach slice S[t_2].
Algorithm 1 runs in O(n^4) time in the worst case. For a fixed vertex pair x, y, the interval I(x, y) can be computed in linear time and line 11 runs in O(n^2) time, since for any pair x, y, Σ_{i=0}^{d(x,y)} |S[i]| ≤ n.
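The pruning rules of Lemmas 3–5 can be sketched in Python as follows. This is an illustrative reading of Algorithm 1 rather than the authors' code: the adjacency-list input, the function names, and the `covered` set used to realize Lemma 4 are assumptions, and since the rounding of λ/2 is not recoverable from the text, the sketch rounds conservatively (slice i has diameter at most 2·min(i, d(x, y) − i), so starting at ⌊λ/2⌋ never skips a slice that could improve λ).

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from `source` in a connected unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def leanness_pruned(adj):
    """Exact lambda(G), processing vertex pairs from far to near with pruning."""
    d = {v: bfs_distances(adj, v) for v in adj}
    vertices = list(adj)
    pairs = sorted(combinations(vertices, 2),
                   key=lambda p: d[p[0]][p[1]], reverse=True)
    covered = set()  # pairs whose interval lies inside one already processed
    lam = 0
    for x, y in pairs:
        dxy = d[x][y]
        if dxy <= lam:
            break  # Lemma 3: every remaining interval is lam-lean
        if frozenset((x, y)) in covered:
            continue  # Lemma 4: dominated by an earlier, larger interval
        slices = [[] for _ in range(dxy + 1)]
        for w in vertices:
            if d[x][w] + d[w][y] == dxy:
                slices[d[x][w]].append(w)
                covered.add(frozenset((x, w)))
                covered.add(frozenset((w, y)))
        # Lemma 5 with conservative (floor) rounding, an assumption of this sketch.
        start, end = lam // 2, dxy - lam // 2
        for i in range(start, end + 1):
            for u, v in combinations(slices[i], 2):
                lam = max(lam, d[u][v])
    return lam


def cycle(k):
    """Induced cycle on vertices 0..k-1 (illustrative helper)."""
    return {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}


print(leanness_pruned(cycle(4)), leanness_pruned(cycle(5)))  # 2 0
```

On the 4-cycle the first antipodal pair already yields λ = 2, and the loop then stops at the next pair because its distance does not exceed the current value.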
Although there is no improvement in the worst-case time complexity of this algorithm compared to our first (O(n^2 m)-time) algorithm, it is very fast in practice, as our experimental results show.
Algorithm 1: Computes the exact leanness of a graph.
Input: A connected graph G = (V, E) with precomputed distance matrix. List Q of all vertex pairs {x, y} of G sorted in non-increasing order with respect to d(x, y).
Output: λ, the leanness of G.
 1  λ := 0
 2  foreach pair {x, y} ∈ Q in sorted order do
 3      if d(x, y) ≤ λ then
 4          return λ
 5      Create empty sets S[i], i := 0, 1, 2, ..., d(x, y)
 6      foreach w in V do
 7          if d(x, y) = d(x, w) + d(y, w) then
 8              insert w into S[d(x, w)], remove pairs {x, w}, {w, y} from Q
 9      end
10      Start := λ/2, End := d(x, y) − λ/2
11      for i := Start to End do
12          foreach u, v in S[i] do
13              if λ < d(u, v) then
14                  λ := d(u, v)
15          end
16      end
17  end
18  return λ.
Heuristic. Computing the exact leanness is computationally and memory intensive. The distance matrix alone requires O(n^2) space, which is prohibitive for large graphs. Therefore, we design a very simple heuristic to estimate the leanness of a given graph. Indeed, we will show later that our heuristic can also be used to estimate the hyperbolicity of a real-world network. We have observed that the leanness of a graph is more likely to be attained on the interval of a vertex pair with a large distance between them. Hence, we use 2-sweep breadth-first search (BFS) [13] to find, in linear time, a pair of vertices x, y that are far from each other. Next, we estimate (by sampling) the leanness of interval I(x, y) and report that as an estimation of the leanness of graph G. One iteration of the heuristic is as follows. Choose an arbitrary vertex z. Run BFS at z and let x be a vertex maximizing d(z, x), then run BFS at x and let y be a vertex maximizing d(x, y). Finally, run BFS at y; now all distances from x and y to every vertex in G are known. Consider the interval I(x, y). We compute the slices S_i(x, y), i = 0, 1, ..., d(x, y), in total linear time (for each w ∈ V, if d(x, y) = d(x, w) + d(y, w) and d(x, w) = i, then w ∈ S_i(x, y)). For each slice S_i(x, y), pick k vertices as a sample and then, for each vertex v in the sample, run BFS and compute the distance from v to every vertex in the same slice. Let λ̌ be the largest distance between a sample vertex and a vertex of the same slice. Starting with a new arbitrary vertex z, we repeat this process for t iterations, and return the largest computed value λ̌ as an estimation of the leanness of G. Let d be the maximum distance over all t of the x, y pairs that are considered by our heuristic. Notice that the number of slices to be checked by our heuristic for any fixed interval is at most d. Thus, the running time of our heuristic is O(tdk(n + m)). We can significantly speed up the heuristic in practice by using Lemma 5 to skip some slices. Algorithm 2 describes our heuristic formally.
Algorithm 2: Heuristic to estimate the leanness of a graph.
Input: A connected graph G = (V, E). Two integer variables t and k.
Output: λ̌, a lower bound on the leanness of G.
λ̌ := 0
for j := 1 to t do
    Choose an arbitrary vertex z.
    Let x be the last vertex visited by a BFS starting at z.
    Let y be the last vertex visited by a BFS starting at x.
    Run BFS at y to get the distances from y to every other vertex in G.
    Create empty sets S[i], i := 0, 1, 2, ..., d(x, y)
    foreach w in V do
        if d(x, y) = d(x, w) + d(y, w) then
            insert w into S[d(x, w)]
    end
    Start := λ̌/2, End := d(x, y) − λ̌/2
    for i := Start to End do
        P := pick from S[i] k random vertices
        foreach u in P do
            BFS(u)
            foreach v in S[i] do
                if λ̌ < d(u, v) then
                    λ̌ := d(u, v)
            end
        end
    end
end
return λ̌.
Experimental Results. We implemented our practical algorithm to compute the exact leanness and our heuristic to estimate the leanness. We evaluate the performance of our algorithms on 26 real-world networks. We also computed the exact hyperbolicity of those networks using the algorithm provided by Cohen et al. [12]; we refer to this algorithm as CCLE. Notably, we found that λ = 2δ for all 26 networks investigated in this paper. In light of this, our exact algorithm can also be used to compute the hyperbolicity of real-world networks. Additionally, we compare our heuristic, used to estimate the hyperbolicity of real-world networks, to two other heuristics that estimate hyperbolicity, denoted by author initials as CCLH [12] and KSN [24]. The running time of CCLH is O(kt(n + m)), parameterized by integer inputs k and t; it is described briefly as follows. Given a graph G = (V, E), find a pair of vertices x and y with the largest distance between them using 2-sweep BFS. Next, let K be a set of k random vertices from the set S_x ∩ S_y, where S_x (S_y) is the set of vertices at distance d(x, y)/2 from x (y, respectively). Then, for each vertex u ∈ K, run a BFS from u and for each vertex v visited during that BFS, compute δ(x, y, u, v). Finally, repeat all previous steps t times and return the largest computed value as an estimation of the hyperbolicity of G. On the other hand, the KSN heuristic picks k vertex quadruples in total, computes the hyperbolicity of each, then reports the largest computed value as an estimation of the hyperbolicity of graph G.
Datasets. We consider 26 real-world networks of a variety of sizes, structural properties, and domains, including seven collaboration networks, seven autonomous systems (AS) networks, four peer-to-peer networks, two biological networks, two web networks, and four social networks. See Table 1. Autonomous
systems networks studied here are collected by the Cooperative Association for Internet Data Analysis (CAIDA) [5] and the Distributed Internet Measurements and Simulations (DIMES) project [28]. Web networks are obtained from [22]. The Douban and Reactome networks are collected from the KONECT library [26], and Erdos, Geom, and Yeast are from the Pajek dataset [31]. All other networks are available as part of the Stanford Large Network Dataset Collection [30]. Each graph represents the largest connected component of the original dataset, as some datasets consist of one large connected component and many very small ones. Furthermore, one can decompose the graph into biconnected components and find the leanness of those components, since the interval leanness of a graph is clearly realized within one of its biconnected components. Since most of those graphs have one large biconnected component and many small ones, we run our algorithms on the largest biconnected component of each of those graphs.
Table 1. Graph datasets and their parameters: number of vertices n, number of edges m, diameter D. n_lbc, m_lbc, and D_lbc indicate the number of vertices, the number of edges, and the diameter of the largest biconnected component.
Network name      Type           n       m       D   n_lbc  m_lbc   D_lbc
CA-AstroPh        Collaboration  17903   196972  14  15929  193922  10
CA-CondMat        Collaboration  21363   91286   15  17234  84595   12
CA-GrQc           Collaboration  4158    13422   17  2651   10480   11
CA-HepPh          Collaboration  11204   117619  13  9025   114046  11
CA-HepTh          Collaboration  8638    24806   18  5898   20983   11
Erdos             Collaboration  6927    11850   4   2145   7067    4
GEOM              Collaboration  3621    9461    14  1901   6816    10
DIMES-AUG052009   AS             26655   84900   6   18344  76557   6
DIMES-AUG052010   AS             26496   93175   8   18840  85488   6
DIMES-AUG052011   AS             26163   97614   9   18439  89859   7
DIMES-AUG052012   AS             25357   74999   10  16907  66489   7
CAIDA-20130101    AS             43274   140532  11  27454  124672  10
CAIDA-20170201    AS             56667   239339  11  36296  218906  10
CAIDA-20170301    AS             57015   245287  12  36649  224860  9
Gnutella04        P2P            10876   39994   10  8379   37497   7
Gnutella09        P2P            8104    26008   10  5606   23510   8
Gnutella24        P2P            26498   65359   11  15519  54380   9
Gnutella31        P2P            62561   147878  11  33812  119127  9
Yeast             PPI            2224    6609    11  1465   5839    8
Reactome          PPI            5973    145778  24  5306   144537  13
EPA               Web            4253    8897    10  2163   6776    8
California        Web            5925    15770   13  3694   13445   10
Brightkite        Social         56739   212945  18  33187  188577  11
Facebook          Social         22470   170823  15  19153  167296  12
GitHub            Social         37700   289003  11  32416  283696  7
Douban            Social         154908  327162  9   51634  223878  8
Evaluation and Results. Our experimental results are summarized in Table 2. Let us first look at the values of hyperbolicity and leanness of the networks. Recall that shortest paths in negatively curved graphs stay close to each other, that is, λ ≤ 2δ. Surprisingly, we found that the leanness is exactly two times the hyperbolicity for all 26 real-world networks investigated in this paper, that is, λ = 2δ (see columns 2 and 4 in Table 2). Our experiments show that the hyperbolicity in real-world networks is governed by the interval leanness.
We next evaluate the performance of our heuristic to estimate leanness λ̌. We observed that our estimate is very close to the exact leanness λ and has a very small computation time. Of the 26 networks, our estimate matches the exact leanness for 9 networks (λ̌ = λ), it is one less than the exact leanness for 15 networks (λ̌ = λ − 1), and in only two networks, Reactome and Github, our estimate is two less than the exact leanness (λ̌ = λ − 2). Notice that the running time T_λ̌ on some graphs might be larger than for other graphs with close numbers of nodes and edges, since this depends on the distance between a pair x, y returned by our heuristic and the number of vertices in the slices of I(x, y).
We now compare the performance of our heuristic to the CCLH and KSN heuristics. We have tested both our heuristic and the CCLH heuristic on our dataset of real-world networks. An implementation of CCLH is available from [14]. All reported computations have been performed on a computer equipped with an Intel Xeon Gold 6240R CPU at 2.40 GHz and 50 GB of RAM. For the KSN heuristic, we report the values available for the 9 overlapping networks from [24]. Before analysing the performance of our heuristic and CCLH, we have to set a number of parameters for both heuristics to get the best bounds and also to make a fair comparison. As previously stated, the performance of both heuristics mainly depends on the number of BFS runs. For CCLH, we set k = 50 and t = 25. In other words, we choose 50 random vertices from the set S_x ∩ S_y for each pair x, y, and repeat that process for 25 iterations. So, the total number of BFS runs would be 1325 (= 25(3 + 50)). For our heuristic, we set k = 15 and t = 15. That is to say, we choose 15 random vertices for each slice of I(x, y) for each pair x, y, and repeat that process for 15 iterations. So, the total number of BFS runs for our heuristic is 225ℓ + 45 (= 15(3 + 15ℓ)), where ℓ is the number of slices of each pair x, y. For instance, for the graph CA-Astroph, our heuristic returned a pair x, y with d(x, y) = 9. So, the number of BFS runs for CA-Astroph would be around 2070. Fortunately, using Lemma 5, we can skip some slices for the next pairs. Our heuristic to estimate leanness also provides an estimate for hyperbolicity, given that λ ≥ λ̌ and δ ≥ λ/2 hold for any graph.
We observed that for seven graphs, our heuristic indeed returned a better estimation of the hyperbolicity compared to the value computed by CCLH. CCLH computes a better estimation for only two networks (CAIDA-20170301 and Reactome). The values coincide for the remaining 17 networks. Our heuristic
returned a better estimation of the hyperbolicity compared to the value computed by the KSN heuristic for eight of the nine networks that were evaluated with KSN, and the values coincide for the remaining network. See columns 7, 9, and 11 for the time performance of all heuristics. Above all, our heuristic is faster than CCLH and KSN.
Table 2. CCLE computes exact hyperbolicity δ_CCLE in T_δCCLE seconds. Our Algorithm 1 computes exact leanness λ in T_λ seconds. KSN estimates hyperbolicity δ̌_KSN in T_δ̌KSN seconds. CCLH estimates hyperbolicity δ̌_CCLH in T_δ̌CCLH seconds. Our Algorithm 2 estimates leanness λ̌ (and consequently estimates hyperbolicity) in T_λ̌ seconds.
                  Exact                            Estimate
Network name      δ_CCLE  T_δCCLE   λ  T_λ     δ̌_KSN [24]  T_δ̌KSN [24]  δ̌_CCLH  T_δ̌CCLH  λ̌  T_λ̌
CA-AstroPh        3.0     80.91     6  17.07   2.0          15960         2.5      3.14      5  1.46
CA-CondMat        3.5     589.70    7  59.90   2.5          9480          3.0      1.97      6  0.83
CA-GrQc           3.5     4.81      7  0.85    3.0          8580          3.0      0.21      6  0.06
CA-HepPh          3.0     229.03    6  15.69   2.0          46980         3.0      1.43      6  0.55
CA-HepTh          4.0     1.76      8  0.74    3.0          14760         3.5      0.52      7  0.18
Erdos             2.0     0.04      4  0.01    –            –             1.5      0.14      4  0.02
GEOM              3.0     0.49      6  0.13    –            –             2.5      0.15      5  0.05
DIMES-AUG052009   2.0     19.63     4  8.02    –            –             1.5      1.75      3  0.49
DIMES-AUG052010   2.0     70.87     4  16.39   –            –             1.5      1.95      3  0.57
DIMES-AUG052011   2.0     1358.29   4  9.11    –            –             1.0      1.87      3  0.53
DIMES-AUG042012   2.0     1529.12   4  44.39   –            –             1.5      1.51      3  0.47
CAIDA-20130101    2.5     1157.94   5  84.15   –            –             2.5      1.85      5  0.51
CAIDA-20170201    2.5     3567.73   5  175.13  –            –             1.5      2.16      5  0.82
CAIDA-20170301    2.5     4926.91   5  152.21  –            –             2.5      4.02      4  1.09
Gnutella04        3.0     0.02      6  0.18    2.5          10680         2.5      0.88      6  0.11
Gnutella09        3.0     0.02      6  0.09    2.5          8760          2.5      0.54      6  0.15
Gnutella24        3.0     334.37    6  31.47   2.5          5520          2.5      1.68      6  0.21
Gnutella31        3.5     11.23     7  15.99   2.5          12600         3.0      3.82      6  0.89
Yeast             2.5     0.22      5  0.12    –            –             2.5      0.11      5  0.03
Reactome          4.0     0.28      8  0.32    –            –             3.5      0.71      6  0.37
EPA               3.0     0.09      6  0.02    –            –             2.5      0.16      5  0.06
California        3.0     5.67      6  1.11    –            –             2.5      0.27      6  0.08
Brightkite        3.0     13618.60  6  214.03  –            –             2.5      4.38      5  1.87
Facebook          4.0     25.93     8  13.65   –            –             3.0      1.58      6  0.82
Github            2.5     5197.01   5  13.31   –            –             2.0      3.05      4  1.48
Douban            3.0     4.51      6  15.21   –            –             2.5      6.69      5  2.45
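The sampling heuristic of Algorithm 2 can be sketched in a few lines of Python. The version below is an illustration under stated assumptions, not the implementation used for the experiments: the graph is a dictionary of adjacency lists, the `seed` parameter and all function names are hypothetical, a slice with fewer than k vertices is sampled completely, and the Lemma 5 bounds are rounded conservatively with floor division.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from `source` in a connected unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def estimate_leanness(adj, t=15, k=15, seed=0):
    """Lower bound on lambda(G): t rounds of 2-sweep BFS plus slice sampling."""
    rng = random.Random(seed)
    vertices = list(adj)
    best = 0
    for _ in range(t):
        z = rng.choice(vertices)
        dz = bfs_distances(adj, z)
        x = max(dz, key=dz.get)          # a vertex farthest from z
        dx = bfs_distances(adj, x)
        y = max(dx, key=dx.get)          # a vertex farthest from x
        dy = bfs_distances(adj, y)
        dxy = dx[y]
        slices = [[] for _ in range(dxy + 1)]
        for w in vertices:
            if dx[w] + dy[w] == dxy:     # w lies in the interval I(x, y)
                slices[dx[w]].append(w)
        # Skip slices too close to x or y to beat the current estimate (Lemma 5).
        start, end = best // 2, dxy - best // 2
        for i in range(start, end + 1):
            members = slices[i]
            for u in rng.sample(members, min(k, len(members))):
                du = bfs_distances(adj, u)
                best = max(best, max(du[v] for v in members))
    return best


def cycle(k):
    """Induced cycle on vertices 0..k-1 (illustrative helper)."""
    return {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}


print(estimate_leanness(cycle(8), t=1, k=2))  # 4
```

On the 8-cycle a single 2-sweep always returns an antipodal pair, and with k = 2 every slice is sampled in full, so the estimate equals the exact leanness 4. In general the returned value is only a lower bound on λ(G), and hence half of it is a lower bound on δ(G).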
References
1. Abu-Ata, M., Dragan, F.F.: Metric tree-like structures in real-world networks: an empirical study. Networks 67(1), 49–68 (2016)
2. Adcock, A.B., Sullivan, B.D., Mahoney, M.W.: Tree-like structure in large social and information networks. In: 13th ICDM 2013, pp. 1–10. IEEE (2013)
3. Borassi, M., Coudert, D., Crescenzi, P., Marino, A.: On computing the hyperbolicity of real-world graphs. In: Bansal, N., Finocchi, I. (eds.) ESA 2015. LNCS, vol. 9294, pp. 215–226. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48350-3_19
4. Bridson, M., Haefliger, A.: Metric Spaces of Non-Positive Curvature. Grundlehren der mathematischen Wissenschaften, vol. 319. Springer, Heidelberg (1999). https://doi.org/10.1007/978-3-662-12494-9
5. Center for Applied Internet Data Analysis (CAIDA). CAIDA AS Relationships Dataset (2017). http://www.caida.org/data/active/as-relationships/
6. Chalopin, J., Chepoi, V., Dragan, F.F., Ducoffe, G., Mohammed, A., Vaxès, Y.: Fast approximation and exact computation of negative curvature parameters of graphs. Discret. Comput. Geom. 65(3), 856–892 (2021)
7. Chalopin, J., Chepoi, V., Papasoglu, P., Pecatte, T.: Cop and robber game and hyperbolicity. SIAM J. Discret. Math. 28(4), 1987–2007 (2014)
8. Chepoi, V., Dragan, F., Estellon, B., Habib, M., Vaxès, Y.: Diameters, centers, and approximating trees of δ-hyperbolic geodesic spaces and graphs. In: Proceedings of the 24th Annual Symposium on Computational Geometry, pp. 59–68. ACM (2008)
9. Chepoi, V., Dragan, F.F., Estellon, B., Habib, M., Vaxès, Y., Xiang, Y.: Additive spanners and distance and routing labeling schemes for hyperbolic graphs. Algorithmica 62(3–4), 713–732 (2012)
10. Chepoi, V., Dragan, F.F., Vaxès, Y.: Core congestion is inherent in hyperbolic networks. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 2264–2279. SIAM (2017)
11. Chepoi, V., Estellon, B.: Packing and covering δ-hyperbolic spaces by balls. In: Charikar, M., Jansen, K., Reingold, O., Rolim, J.D.P. (eds.) APPROX/RANDOM 2007. LNCS, vol. 4627, pp. 59–73. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74208-1_5
12. Cohen, N., Coudert, D., Lancin, A.: On computing the Gromov hyperbolicity. J. Exp. Algorithmics (JEA) 20, 1–6 (2015)
13. Corneil, D.G., Dragan, F.F., Köhler, E.: On the power of BFS to determine a graph’s diameter. Networks 42(4), 209–222 (2003)
14. Coudert, D.: Gromov hyperbolicity of graphs: C source code (2014). http://www-sop.inria.fr/members/David.Coudert/code/hyperbolicity.shtml
15. Dragan, F.F., Guarnera, H.M.: Obstructions to a small hyperbolicity in Helly graphs. Discret. Math. 342(2), 326–338 (2019)
16. Dragan, F.F., Mohammed, A.: Slimness of graphs. Discret. Math. Theor. Comput. Sci. 21(3) (2019)
17. Duan, R.: Approximation algorithms for the Gromov hyperbolicity of discrete metric spaces. In: LATIN, pp. 285–293 (2014)
18. Edwards, K., Kennedy, S., Saniee, I.: Fast approximation algorithms for p-centers in large δ-hyperbolic graphs. In: Bonato, A., Graham, F.C., Pralat, P. (eds.) WAW 2016. LNCS, vol. 10088, pp. 60–73. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49787-7_6
19. Fournier, H., Ismail, A., Vigneron, A.: Computing the Gromov hyperbolicity of a discrete metric space. eprint arXiv:1210.3323 (2012)
20. Ghys, E., de la Harpe, P. (eds.): Sur les groupes hyperboliques d’après M. Gromov. Progress in Mathematics, vol. 83 (1990)
21. Gromov, M.: Hyperbolic groups: essays in group theory. MSRI 8, 75–263 (1987)
22. Jon Kleinberg. Jon Kleinberg’s web page. http://www.cs.cornell.edu/courses/cs685/2002fa/
23. Jonckheere, E.A., Lou, M., Bonahon, F., Baryshnikov, Y.: Euclidean versus hyperbolic congestion in idealized versus experimental networks. Internet Math. 7(1), 1–27 (2011)
24. Kennedy, W.S., Saniee, I., Narayan, O.: On the hyperbolicity of large-scale networks and its estimation. In: 2016 IEEE International Conference on Big Data (Big Data), pp. 3344–3351. IEEE (2016)
25. Krauthgamer, R., Lee, J.R.: Algorithms on negatively curved spaces. In: 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pp. 119–132. IEEE (2006)
26. Kunegis, J.: Konect: the Koblenz network collection. In: WWW 2013 Companion, pp. 1343–1350. Association for Computing Machinery, New York (2013)
27. Narayan, O., Saniee, I.: Large-scale curvature of networks. Phys. Rev. E 84(6), 066108 (2011)
28. Shavitt, Y., Shir, E.: DIMES: let the internet measure itself. ACM SIGCOMM Comput. Commun. Rev. 35(5), 71–74 (2005)
29. Shavitt, Y., Tankel, T.: Hyperbolic embedding of internet graph for distance estimation and overlay construction. IEEE/ACM Trans. Netw. 16(1), 25–36 (2008)
30. Stanford Large Network Dataset Collection (SNAP). Stanford large network dataset. http://snap.stanford.edu/data/index.html
31. Vladimir Batagelj. Pajek datasets. http://vlado.fmf.uni-lj.si/pub/networks/data/
32. Wu, Y., Zhang, C.: Hyperbolicity and chordality of a graph. Electr. J. Comb. 18(1), Paper #P43 (2011)
The Fréchet Mean of Inhomogeneous Random Graphs
François G. Meyer
Applied Mathematics, University of Colorado Boulder, Boulder, CO 80305, USA
https://francoismeyer.github.io
Abstract. To characterize the “average” of a set of graphs, one can compute the sample Fréchet mean. We prove the following result: if we use the Hamming distance to compute distances between graphs, then the Fréchet mean of an ensemble of inhomogeneous random graphs is obtained by thresholding the expected adjacency matrix: an edge exists between the vertices i and j in the Fréchet mean graph if and only if the corresponding entry of the expected adjacency matrix is greater than 1/2. We prove that the result also holds for the sample Fréchet mean when the expected adjacency matrix is replaced with the sample mean adjacency matrix. This novel theoretical result has some significant practical consequences; for instance, the Fréchet mean of an ensemble of sparse inhomogeneous random graphs is the empty graph.
Keywords: Fréchet mean · Statistical network analysis

1 Introduction
The Fréchet mean graph has become a standard tool for the analysis of graph-valued data (e.g., [5,6,8,10,13,14]). In this work, we derive the expression for the population Fréchet mean for inhomogeneous Erdős-Rényi random graphs [2]. We prove that the sample Fréchet mean is consistent, and can be estimated using a simple thresholding rule. This novel theoretical result implies that the sample Fréchet mean computed from a training set of graphs, which display specific topological features of interest, will not inherit from the training set the desired topological structure.

We consider the set $\mathcal{G}$ formed by all undirected unweighted simple labeled graphs with vertex set $\{1, \ldots, n\}$. We denote by $\mathcal{S}$ the set of $n \times n$ adjacency matrices of graphs in $\mathcal{G}$,
\[ \mathcal{S} = \left\{ A \in \{0,1\}^{n \times n} ; \text{ where } a_{ij} = a_{ji}, \text{ and } a_{ii} = 0;\ 1 \le i < j \le n \right\}. \tag{1} \]
F. G. Meyer: Supported by the National Science Foundation (CCF/CIF1815971).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 207–219, 2022. https://doi.org/10.1007/978-3-030-93409-5_18
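We denote by $\mathcal{G}(n, P)$ the probability space formed by the inhomogeneous Erdős-Rényi random graphs [2], defined on $\{1, \ldots, n\}$, where a graph $G$ with adjacency matrix $A$ has probability
\[ P(A) = \prod_{1 \le i < j \le n} [p_{ij}]^{a_{ij}} [1 - p_{ij}]^{1 - a_{ij}}. \tag{2} \]

[…]

\[ \forall i, j \in [n], \quad \mu[P]_{ij} = \begin{cases} 1 & \text{if } p_{ij} > 1/2, \\ 0 & \text{otherwise.} \end{cases} \tag{6} \]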
Proof. The proof is given in Sect. 3.2.

2.2 The Sample Fréchet Mean Graph of a Graph Sample in G(n, P)
We now turn our attention to the sample Fréchet mean graph, which has recently been used for the statistical analysis of graph-valued data (e.g., [5,8,14,17]). The computation of the sample Fréchet mean graph using the Hamming distance is NP-hard [3]. For this reason, several alternatives have been proposed (e.g., [6,8]). Before presenting the second result, we take a short detour through the sample Fréchet median of the probability measure $P$ [9,11,16], minimiser of
\[ \widehat{m}_N[P] = \operatorname*{argmin}_{G \in \mathcal{G}} \frac{1}{N} \sum_{k=1}^{N} d_H(G, G^{(k)}), \tag{7} \]
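and which can be computed using the majority rule [1].

Lemma 1. The adjacency matrix $\widehat{m}_N[A]$ of $\widehat{m}_N[P]$ is given by
\[ \forall i, j \in [n], \quad \widehat{m}_N[A]_{ij} = \begin{cases} 1 & \text{if } \sum_{k=1}^{N} a_{ij}^{(k)} \ge N/2, \\ 0 & \text{otherwise.} \end{cases} \tag{8} \]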
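The majority rule of Lemma 1 can be illustrated with a minimal sketch. The function name `sample_frechet_median` and the toy graphs are hypothetical, and adjacency matrices are assumed to be given as nested Python lists; this is an illustration of the rule in Eq. (8), not the author's implementation.

```python
def sample_frechet_median(adjacency_matrices):
    """Majority rule of Eq. (8): keep edge {i, j} when it is present
    in at least half of the N sampled graphs."""
    n_graphs = len(adjacency_matrices)
    n = len(adjacency_matrices[0])
    median = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Number of sampled graphs in which the edge {i, j} is present.
            votes = sum(a[i][j] for a in adjacency_matrices)
            if 2 * votes >= n_graphs:  # same test as votes >= N/2
                median[i][j] = median[j][i] = 1
    return median


# Three toy graphs on 3 vertices (hypothetical data).
graphs = [
    [[0, 1, 1], [1, 0, 0], [1, 0, 0]],
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 0, 1], [0, 1, 0]],
]
median = sample_frechet_median(graphs)
```

In this toy sample only the edge {0, 1} appears in a majority of the graphs, so it is the only edge retained; edges that appear in fewer than half of the graphs are dropped, which is why sparse ensembles tend toward the empty graph.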
We now come back to the second main contribution, where we prove that the sample Fréchet mean graph of $N$ independent random graphs sampled from $\mathcal{G}(n, P)$ is asymptotically equal (for large sample size) to the sample Fréchet median graph, with high probability.

Theorem 2. $\forall \delta \in (0, 1)$, $\exists N_\delta$, $\forall N \ge N_\delta$, $\widehat{m}_N[A]$ and $\widehat{\mu}_N[A]$ are given by
\[ \forall i, j \in [n], \quad \widehat{\mu}_N[A]_{ij} = \widehat{m}_N[A]_{ij} = \begin{cases} 1 & \text{if } E[A]_{ij} = p_{ij} > 1/2, \\ 0 & \text{otherwise,} \end{cases} \tag{9} \]
with probability $1 - \delta$ over the realizations of the graphs $G^{(1)}, \ldots, G^{(N)}$ in $\mathcal{G}(n, P)$.
Proof. The proof is given in Sect. 3.5. The practical impact of Theorem 2 is given by the following corollary, which is an elementary consequence of Theorem 2 and Lemma 1.
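Corollary 1. $\forall \delta \in (0, 1)$, $\exists N_\delta$, $\forall N \ge N_\delta$, $\widehat{\mu}_N[A]$ is given by the majority rule,
\[ \forall i, j \in [n], \quad \widehat{\mu}_N[A]_{ij} = \begin{cases} 1 & \text{if } \sum_{k=1}^{N} a_{ij}^{(k)} > N/2, \\ 0 & \text{otherwise,} \end{cases} \tag{10} \]
with probability $1 - \delta$ over the realizations of the graphs $G^{(1)}, \ldots, G^{(N)}$ in $\mathcal{G}(n, P)$.

3 Proofs of the Main Results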
We give in the following the proofs of Theorems 1 and 2. In the process, we prove several technical lemmata.

3.1 The Population and Sample Fréchet Functions
Let $A$ and $B$ be two adjacency matrices in $\mathcal{S}$. We provide below an expression for the Hamming distance squared, $d_H^2(A, B)$, where the computation is split between the entries of $A$ along the edges of $B$, $E(B)$, and the entries of $A$ along the “nonedges” of $B$, $\overline{E}(B)$. We denote by $|E(B)|$ the number of edges in $B$.
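Before squaring, the split between edges and nonedges of B can be checked numerically. The sketch below assumes that the Hamming distance counts the vertex pairs i < j on which the two adjacency matrices disagree; the function names and the toy matrices are hypothetical.

```python
def hamming(a, b):
    """Number of vertex pairs i < j on which A and B disagree."""
    n = len(a)
    return sum(a[i][j] != b[i][j] for i in range(n) for j in range(i + 1, n))


def hamming_split(a, b):
    """Same distance, computed separately along the edges of B
    (where a missing edge in A costs 1) and along the nonedges of B
    (where a present edge in A costs 1)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    along_edges = sum(1 - a[i][j] for i, j in pairs if b[i][j] == 1)
    along_nonedges = sum(a[i][j] for i, j in pairs if b[i][j] == 0)
    return along_edges + along_nonedges


a_matrix = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
b_matrix = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
direct = hamming(a_matrix, b_matrix)
split = hamming_split(a_matrix, b_matrix)
```

Both computations agree on this example, which is the unsquared identity that the squared expression builds on.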
Lemma 2. Let $A$ and $B$ be two matrices in $\mathcal{S}$. Then,
\[ d_H^2(A, B) = \Big[ \sum_{1 \le i < j \le n} a_{ij} \Big]^2 + |E(B)|^2 + 2\,|E(B)| \, [\ldots] \]

[…]

[…] > 1.0. On the contrary, player $i$ selects one node as a new partner when $w_{i,j} \le 1.0$. At that time, $w_{i,j}$ is reset to $w_{i,j} \in U(0.0, 1.0)$.

Strategy update: Each player updates its strategy to the strategy of the neighbor who has the best reward among its neighbors ($s_i \leftarrow s_{\max_i}$).

2.3 Parameter Settings
As shown in Sakiyama (2021), hawks die out when the unusual state is often maintained (U is large). This is why we set U = 0.01. Moreover, a previous study revealed that the population dynamics show a phase transition around u = 0.70. Therefore, we set u = 0.70 in this paper in order to examine the system properties when the system exhibits critical behavior.

Table 1. Reward (payoff) matrix. The parameter b indicates the payoff sensitivity of hawk-hawk encounters and is set to 2.5.

        Dove    Hawk
Dove    1       0
Hawk    2       1 − b
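The payoff matrix of Table 1 and the imitation rule of the strategy update can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's simulation code: a player's reward is assumed to be the sum of its payoffs against all current partners, the update is assumed to be synchronous, ties are broken by neighbor order, and all names (`PAYOFF`, `total_reward`, `update_strategies`) are hypothetical. The partner-rewiring step is not modeled.

```python
B_SENSITIVITY = 2.5  # parameter b from Table 1

# PAYOFF[(own strategy, partner strategy)], with "D" = dove and "H" = hawk.
PAYOFF = {
    ("D", "D"): 1.0,
    ("D", "H"): 0.0,
    ("H", "D"): 2.0,
    ("H", "H"): 1.0 - B_SENSITIVITY,
}


def total_reward(player, strategies, neighbors):
    """Assumption: reward is the sum of payoffs over all current partners."""
    own = strategies[player]
    return sum(PAYOFF[(own, strategies[q])] for q in neighbors[player])


def update_strategies(strategies, neighbors):
    """Each player copies the strategy of its best-rewarded neighbor
    (synchronous update; ties broken by neighbor order, both assumptions)."""
    rewards = {p: total_reward(p, strategies, neighbors) for p in strategies}
    updated = {}
    for p in strategies:
        if not neighbors[p]:
            updated[p] = strategies[p]  # isolated players keep their strategy
            continue
        best = max(neighbors[p], key=lambda q: rewards[q])
        updated[p] = strategies[best]
    return updated


# A triangle with one hawk and two doves (hypothetical toy network).
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
strategies = {0: "H", 1: "D", 2: "D"}
rewards = {p: total_reward(p, strategies, neighbors) for p in strategies}
next_strategies = update_strategies(strategies, neighbors)
```

In this toy triangle the single hawk earns the highest reward, so both doves imitate it, while the hawk itself copies a dove because only neighbors are compared.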
T. Sakiyama
Fig. 1. Some examples of network patterns. Both data are obtained from the same trial. A. t = 200. B. t = 500.
Fig. 2. The relationship between the max degree kmax and the average fraction of hawks obtained from 100 trials.
Population Dynamics and Its Instability in a Hawk-Dove Game
3 Results

Figure 1 shows examples of the network structure at two time steps of one trial (t = 200, Fig. 1A; t = 500, Fig. 1B). As shown in this figure, some players become hub nodes as time goes on. However, the system seems to return to its initial structure, i.e., a random network, after a while. Thus, the network structure varies over time.

To evaluate the population dynamics and its relation to network structure, we investigate the relationship between the maximum degree of the system, kmax, and the fraction of hawks. Here, the maximum degree of the system is the degree of a hub node. According to Fig. 2, whose data are obtained from 100 trials, a phase transition occurs around kmax = 20. The fraction of hawks increases with kmax until kmax reaches that point and decreases with kmax beyond it. Thus, hawks become a minority when the network has some hub nodes.

Interestingly, we found that hawk players were occasionally dominant even if kmax was large. Figure 3 presents the frequency of the fraction of hawks for the cases that satisfy kmax > 100. The network thus develops into a scale-free network. At the same time, however, it occasionally becomes unstable while hawk players can be hub nodes. The system leaves these unstable conditions when hub nodes disappear or when hawk players switch to the opposing strategy. Thus, a scale-free network is sometimes maintained for a long period and at other times not, and it evolves unpredictably, as in critical behavior.
Fig. 3. Frequency of the fraction of hawks that satisfies kmax > 100. Plotted data are obtained from 100 trials.
4 Conclusion

In this paper, we investigated the population dynamics and its relation to critical behavior using an HD game on the network that we have developed. Hawks become a majority when the network does not have hub nodes, while they become a minority when the network has some hub nodes, i.e., when the network becomes a scale-free network. Furthermore, the population of hawks can occasionally be maintained even after the network structure has been replaced, provided that the system is unstable. This instability of the network may result in complex patterns of the system and allow hub players to stand out and disappear.
References
1. Perc, M.: Uncertainties facilitate aggressive behaviour in a spatial hawk–dove game. Int. J. Bifurcat. Chaos 17, 4223–4227 (2007)
2. Hauert, C., Doebeli, M.: Spatial structure often inhibits the evolution of cooperation in the snowdrift game. Nature 428, 643–646 (2004)
3. Killingback, T., Doebeli, M.: Spatial evolutionary game theory: Hawks and Doves revisited. Proc. R. Soc. Lond. B 263, 1135–1144 (1996)
4. Sakiyama, T., Arizono, I.: An adaptive replacement of the rule update triggers the cooperative evolution in the Hawk-Dove game. Chaos Solitons Fractals 121, 59–62 (2019)
5. Doebeli, M., Hauert, C.: Models of cooperation based on the Prisoner’s Dilemma and the Snowdrift game. Ecol. Lett. 8, 748–766 (2005)
6. Dunne, J.A., Williams, R.J., Martinez, N.D.: Food-web structure and network theory: the role of connectance and size. Proc. Natl. Acad. Sci. USA 99, 12917–12922 (2002)
7. Montoya, J.M., Solé, R.V.: Small world patterns in food webs. J. Theor. Biol. 214, 405–412 (2002)
8. Rhodes, M., Wardell-Johnson, G.W., Rhodes, M.P., Raymond, B.: Applying network analysis to the conservation of habitat trees in urban environments: a case study from Brisbane, Australia. Conserv. Biol. 20, 861–870 (2006)
9. Ohtsuki, H., Hauert, C., Lieberman, E., Nowak, M.A.: A simple rule for the evolution of cooperation on graphs and social networks. Nature 441, 502–505 (2006)
10. Filotas, E., Grant, M., Parrott, L., Rikvold, P.A.: The effect of positive interactions on community structure in a multi-species metacommunity model along an environmental gradient. Ecol. Model. 221(6), 885–894 (2010)
11. Sakiyama, T.: A power-law network in an evolutionary Hawk-Dove game. Chaos Solitons Fractals 146(110932), 1–6 (2021)
Context-Sensitive Mental Model Aggregation in a Second-Order Adaptive Network Model for Organisational Learning
Gülay Canbaloğlu (1,3) and Jan Treur (2,3)
1 Department of Computer Engineering, Koç University, Istanbul, Turkey
2 Social AI Group, Department of Computer Science, Vrije Universiteit, Amsterdam, The Netherlands
3 Center for Safety in Healthcare, Delft University of Technology, Delft, The Netherlands
Abstract. Organisational learning processes often exploit developed individual mental models in order to obtain shared mental models for the organisation by some form of unification or aggregation. The focus in this paper is on this aggregation process, which may depend on a number of contextual factors. It is shown how a second-order adaptive network model for organisational learning can be used to model this process of aggregation of individual mental models in a context-dependent manner.
1 Introduction

Organisational learning is an important but complex adaptive phenomenon within an organisation. It involves a cyclical interplay of different adaptation processes such as individual learning and development of mental models, formation of shared mental models for teams or for the organisation as a whole, and improving individual mental models or team mental models based on a shared mental model of the organisation; e.g., (Argyris and Schön 1978; Bogenrieder 2002; Crossan et al. 1999; Fischhof and Johnson 1997; Kim 1993; McShane and Glinow 2010; Stelmaszczyk 2016; Wiewiora et al. 2019). For example, Kim (1993, p. 44) puts forward that ‘Organizational learning is dependent on individuals improving their mental models; making those mental models explicit is crucial to developing new shared mental models’. One of the fundamental issues here is how exactly shared mental models are formed based on developed mental models and, in particular, how that depends on a specific context.

In the past years, it has been found out how self-modeling networks provide an adequate modeling approach to obtain computational models addressing mental models and how they are used for internal simulation, adapted by learning, revision or forgetting, and the control of all this; e.g., (Treur et al. 2022). In recent research, it has also been shown for a relatively simple scenario how this modeling perspective can be exploited to obtain computational models of organisational learning (Canbaloğlu et al. 2021). However, the important issue of how exactly shared mental models are formed in a context-dependent manner based on individual mental models has not been addressed there.

The current paper introduces a computational self-modeling network model for organisational learning with a main focus on this context-dependent formation process of shared mental models based on aggregation of individual mental models. In Sect. 2 some background knowledge for this will briefly be discussed. Section 3 briefly describes the modeling approach used, which is based on self-modeling networks. In Sect. 4, the computational self-modeling network model for organisational learning based on context-dependent aggregation will be introduced. This model will be illustrated by an example simulation scenario in Sect. 5. Finally, Sect. 6 is a discussion section.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 411–423, 2022. https://doi.org/10.1007/978-3-030-93409-5_35
2 Background Literature

In this section, some of the multidisciplinary literature about the concepts and processes that need to be addressed is briefly discussed. This provides a basis for the self-modeling network model that will be presented in Sect. 4 and for the scientific justification of the model.

For the history of the mental model area, Kenneth Craik is often mentioned as a central person. In his book, Craik (1943) describes a mental model as a small-scale model that is carried by an organism within its head and based on which the organism ‘is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise, utilize the knowledge of past events in dealing with the present and future, and in every way to react in a much fuller, safer, and more competent manner to the emergencies which face it.’ (Craik 1943, p. 61). Shih and Alessi (1993, p. 157) explain that ‘By a mental model we mean a person's understanding of the environment. It can represent different states of the problem and the causal relationships among states.’

In (Van et al. 2021), various types of mental models and the types of mental processes processing them are reviewed. Based on this analysis a three-level cognitive architecture has been introduced where:
• the base level models internal simulation of a mental model
• the middle level models the adaptation of the mental model (formation, learning, revising, and forgetting a mental model, for example)
• the upper level models the (metacognitive) control over these processes

By using the notion of self-modeling network (or reified network) from (Treur 2020a; Treur 2020b), this cognitive architecture has recently been formalized computationally and used in computer simulations for various applications of mental models; for an overview of this approach and its applications, see (Treur et al. 2022).

Mental models also play an important role when people work together in teams. When every team member has a different individual mental model of the task that is performed, this will stand in the way of good teamwork. Therefore, ideally these mental models should be aligned to such an extent that they become one shared mental model for all team members. Examples of computational models of a shared mental model and how imperfections in it work out can be found in (Van Ments et al. 2021a; Van Ments et al. 2021b).
Organisational learning is an area which has received much attention over time; see, for example, (Argyris and Schön 1978; Bogenrieder 2002; Crossan et al. 1999; Fischhof and Johnson 1997; Kim 1993; McShane and Glinow 2010; Stelmaszczyk 2016; Wiewiora et al. 2019). However, contributions to computational formalization of organisational learning are very rare. By Kim (1993), mental models are considered a vehicle for both individual learning and organizational learning. By learning and developing individual mental models, a basis for formation of shared mental models for the level of the organization is created, which provides a mechanism for organizational learning. The overall process consists of the following cyclical processes and interactions (see also (Kim 1993), Fig. 8):

(a) Individual level
  (1) Creating and maintaining individual mental models
  (2) Choosing for a specific context a suitable individual mental model as focus
  (3) Applying a chosen individual mental model for internal simulation
  (4) Improving individual mental models (individual mental model learning)
(b) From individual level to organization level
  (1) Deciding about creation of shared mental models
  (2) Creating shared mental models based on developed individual mental models
(c) Organization level
  (1) Creating and maintaining shared mental models
  (2) Associating to a specific context a suitable shared mental model as focus
  (3) Improving shared mental models (shared mental model refinement or revision)
(d) From organization level to individual level
  (1) Deciding about individuals to adopt shared mental models
  (2) Individuals adopting shared mental models by learning them
(e) From individual level to organization level
  (1) Deciding about improvement of shared mental models
  (2) Improving shared mental models based on further developed individual mental models

In terms of the three-level cognitive architecture described in (Van Ments and Treur 2021), applying a chosen individual mental model for internal mental simulation relates to the base level; learning, developing, improving, and forgetting the individual mental model relate to the middle level; and control of adaptation of a mental model relates to the upper level. Moreover, interactions from individual to organization level and vice versa involve changing (individual or shared) mental models and therefore relate to the middle level, while the deciding actions as a form of control relate to the upper level. This overview will provide useful input to the design of the computational network model for organizational learning and in particular the aggregation in it that will be introduced in Sect. 4.
3 The Self-modeling Network Modeling Approach Used

In this section, the network-oriented modeling approach used is briefly introduced. A temporal-causal network model is characterised by the following; here X and Y denote nodes of the network, also called states (Treur 2020b):
• Connectivity characteristics: Connections from a state X to a state Y and their weights $\omega_{X,Y}$
• Aggregation characteristics: For any state Y, some combination function $c_Y(..)$ defines the aggregation that is applied to the impacts $\omega_{X,Y} X(t)$ on Y from its incoming connections from states X
• Timing characteristics: Each state Y has a speed factor $\eta_Y$ defining how fast it changes for given causal impact.

The following canonical difference (or related differential) equations are used for simulation purposes; they incorporate these network characteristics $\omega_{X,Y}$, $c_Y(..)$, $\eta_Y$ in a standard numerical format:
\[ Y(t + \Delta t) = Y(t) + \eta_Y \left[ c_Y\big(\omega_{X_1,Y} X_1(t), \ldots, \omega_{X_k,Y} X_k(t)\big) - Y(t) \right] \Delta t \tag{1} \]
for any state Y and where $X_1$ to $X_k$ are the states from which Y gets its incoming connections. The available dedicated software environment described in (Treur 2020b, Ch. 9) includes a combination function library with currently around 50 useful basic combination functions. The above concepts make it possible to design network models and their dynamics in a declarative manner, based on mathematically defined functions and relations. The examples of combination functions that are applied in the model introduced here can be found in Table 1.

Combination functions as shown in Table 1 and available in the combination function library are called basic combination functions. For any network model some number m of them can be selected; they are represented in a standard format as $\mathrm{bcf}_1(..), \mathrm{bcf}_2(..), \ldots, \mathrm{bcf}_m(..)$. In principle, they use parameters $\pi_{1,i,Y}, \pi_{2,i,Y}$ such as the $\lambda$, $\sigma$, and $\tau$ in Table 1. Including these parameters, the standard format used for basic combination functions is (with $V_1, \ldots, V_k$ the single causal impacts): $\mathrm{bcf}_i(\pi_{1,i,Y}, \pi_{2,i,Y}, V_1, \ldots, V_k)$. For each state Y just one basic combination function can be selected, but a number of them can also be selected, which is what happens in the current paper; this will be interpreted as a weighted average of them according to the following format:
\[ c_Y(\pi_{1,1,Y}, \pi_{2,1,Y}, \ldots, \pi_{1,m,Y}, \pi_{2,m,Y}, V_1, \ldots, V_k) = \frac{\gamma_{1,Y}\, \mathrm{bcf}_1(\pi_{1,1,Y}, \pi_{2,1,Y}, V_1, \ldots, V_k) + \ldots + \gamma_{m,Y}\, \mathrm{bcf}_m(\pi_{1,m,Y}, \pi_{2,m,Y}, V_1, \ldots, V_k)}{\gamma_{1,Y} + \ldots + \gamma_{m,Y}} \tag{2} \]
with combination function weights $\gamma_{i,Y}$. Selecting only one of them for state Y, for example $\mathrm{bcf}_i(..)$, is done by putting weight $\gamma_{i,Y} = 1$ and the other weights 0. This is a convenient way to indicate combination functions for a specific network model. The function $c_Y(..)$ can just be indicated by the weight factors $\gamma_{i,Y}$ and the parameters $\pi_{i,j,Y}$.
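Equations (1) and (2) can be sketched as one Euler-style update step. This is a minimal illustration, not the dedicated software environment mentioned above: the function names `combined` and `euler_step` are hypothetical, and the parameters of each basic combination function are assumed to be already bound, so that each function takes only the list of impacts.

```python
def combined(bcfs, gammas, impacts):
    """Eq. (2): weighted average of the selected basic combination functions.
    Each entry of bcfs maps a list of impacts to a single value (assumption:
    its own parameters are already bound, e.g. through a closure)."""
    total_weight = sum(gammas)
    weighted = sum(g * f(impacts) for g, f in zip(gammas, bcfs))
    return weighted / total_weight


def euler_step(y, eta, bcfs, gammas, weights, sources, dt):
    """Eq. (1): Y(t + dt) = Y(t) + eta * [c_Y(impacts) - Y(t)] * dt."""
    impacts = [w * x for w, x in zip(weights, sources)]
    return y + eta * (combined(bcfs, gammas, impacts) - y) * dt


def mean(values):
    return sum(values) / len(values)


# Two incoming connections with weights 1.0 and 0.5 (hypothetical values).
weights, sources = [1.0, 0.5], [0.8, 0.4]

# Selecting only max: weight 1 for it and 0 for the other function.
only_max = euler_step(0.0, 0.5, [max, mean], [1.0, 0.0], weights, sources, 0.5)

# Equal weights: the average of max (0.8) and mean (0.5) drives the update.
blended = euler_step(0.0, 0.5, [max, mean], [1.0, 1.0], weights, sources, 0.5)
```

Setting one weight to 1 and the others to 0 recovers a single combination function, which mirrors the selection convention described in the text.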
Table 1. The combination functions used in the introduced self-modeling network model

Name | Notation | Formula | Parameters
Advanced logistic sum | $\mathrm{alogistic}_{\sigma,\tau}(V_1, \ldots, V_k)$ | $\left[ \frac{1}{1 + e^{-\sigma(V_1 + \ldots + V_k - \tau)}} - \frac{1}{1 + e^{\sigma\tau}} \right] (1 + e^{-\sigma\tau})$ | Steepness $\sigma > 0$; excitability threshold $\tau$
Steponce | $\mathrm{steponce}_{\alpha,\beta}(..)$ | 1 if time t is between $\alpha$ and $\beta$, else 0 | Start time $\alpha$; end time $\beta$
Hebbian learning | $\mathrm{hebb}_{\mu}(V_1, V_2, V_3)$ | $V_1 V_2 (1 - V_3) + \mu V_3$ | $V_1, V_2$ activation levels of the connected states; $V_3$ activation level of the self-model state for the connection weight; persistence factor $\mu$
Maximum composed with Hebbian learning | $\text{max-hebb}_{\mu}(V_1, \ldots, V_k)$ | $\max(\mathrm{hebb}_{\mu}(V_1, V_2, V_3), V_4, \ldots, V_k)$ | $V_1, V_2$ activation levels of the connected states; $V_3$ activation level of the self-model state for the connection weight; persistence factor $\mu$
Scaled maximum | $\mathrm{smax}_{\lambda}(V_1, \ldots, V_k)$ | $\max(V_1, \ldots, V_k)/\lambda$ | Scaling factor $\lambda$
Euclidean | $\mathrm{eucl}_{n,\lambda}(V_1, \ldots, V_k)$ | $\sqrt[n]{\frac{V_1^n + \ldots + V_k^n}{\lambda}}$ | Order $n$; scaling factor $\lambda$
Scaled geometric mean | $\mathrm{sgeomean}_{\lambda}(V_1, \ldots, V_k)$ | $\sqrt[k]{\frac{V_1 \cdots V_k}{\lambda}}$ | Scaling factor $\lambda$
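The formulas in Table 1 can be written out directly. The sketch below is an illustrative transcription rather than the library of the dedicated software environment: function signatures are hypothetical, activation levels are assumed to be non-negative, and "between" in steponce is assumed to include both end points.

```python
import math


def alogistic(values, sigma, tau):
    """Advanced logistic sum with steepness sigma and threshold tau."""
    total = sum(values)
    raw = 1.0 / (1.0 + math.exp(-sigma * (total - tau)))
    offset = 1.0 / (1.0 + math.exp(sigma * tau))
    return (raw - offset) * (1.0 + math.exp(-sigma * tau))


def steponce(t, alpha, beta):
    """1 while time t lies in [alpha, beta] (inclusive by assumption), else 0."""
    return 1.0 if alpha <= t <= beta else 0.0


def hebb(v1, v2, w, mu):
    """Hebbian learning: v1, v2 are the connected states, w the weight state."""
    return v1 * v2 * (1.0 - w) + mu * w


def max_hebb(v1, v2, w, others, mu):
    """Maximum of the Hebbian outcome and any further impacts V4, ..., Vk."""
    return max([hebb(v1, v2, w, mu)] + list(others))


def smax(values, lam):
    """Scaled maximum."""
    return max(values) / lam


def eucl(values, n, lam):
    """Euclidean combination of order n with scaling factor lam."""
    return (sum(v ** n for v in values) / lam) ** (1.0 / n)


def sgeomean(values, lam):
    """Scaled geometric mean."""
    return (math.prod(values) / lam) ** (1.0 / len(values))
```

The offset and rescaling terms in `alogistic` make the function return 0 for zero total impact and approach 1 for large impact, which is the usual motivation for using it instead of a plain logistic function.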
Realistic network models are usually adaptive: often not only their states but also some of their network characteristics change over time. By using a self-modeling network (also called a reified network), a network-oriented conceptualization can also be applied to adaptive networks to obtain a declarative description using mathematically defined functions and relations for them as well; see (Treur 2020a; Treur 2020b). This works through the addition of new states to the network (called self-model states) which represent (adaptive) network characteristics. In the graphical 3D-format as shown in Sect. 4, such additional states are depicted at a next level (called self-model level or reification level), where the original network is at the base level. As an example, the weight $\omega_{X,Y}$ of a connection from state X to state Y can be represented (at a next self-model level) by a self-model state named $W_{X,Y}$. Similarly, all other network characteristics from $\omega_{X,Y}$, $c_Y(..)$, $\eta_Y$ can be made adaptive by including self-model states for them. For example, an adaptive speed factor $\eta_Y$ can be represented by a self-model state named $H_Y$, and an adaptive combination function weight $\gamma_{i,Y}$ can be represented by a self-model state $C_{i,Y}$.

As the outcome of such a process of network reification is also a temporal-causal network model itself, as has been shown in (Treur 2020b, Ch. 10), this self-modeling network construction can easily be applied iteratively to obtain multiple orders of self-models at multiple (first-order, second-order, …) self-model levels. For example, a second-order self-model may include a second-order self-model state $H_{W_{X,Y}}$ representing the speed factor $\eta_{W_{X,Y}}$ for the dynamics of first-order self-model state $W_{X,Y}$, which in turn represents the adaptation of connection weight $\omega_{X,Y}$. Similarly, a persistence factor $\mu_{W_{X,Y}}$ of such a first-order self-model state $W_{X,Y}$ used for adaptation, e.g., based on Hebbian learning (Hebb 1949), can be represented by a second-order self-model state $M_{W_{X,Y}}$. In particular, for the aggregation process for the formation of a shared mental model, which is a main focus of the current paper, in Sect. 4 second-order self-model states $C_{i,W_{X,Y}}$ will be used that represent the ith combination function weight $\gamma_{i,W_{X,Y}}$ of the combination functions selected for a shared mental model connection weight $W_{X,Y}$ (where the latter is a first-order self-model state).
4 The Adaptive Network Model for Organisational Learning

The case study addressed to illustrate the introduced model was adapted from the more extensive case study of an intubation process from (Van Ments et al. 2021a; Van Ments et al. 2021b). Here only the part of the mental models that addresses four mental states is used; see Table 2.
Table 2. The mental model used for the simple case study

States for mental models of persons A, B and organization O | Short notation | Explanation
a_A, a_B, a_O | Prep_eq_N | Preparation of the intubation equipment by the nurse
b_A, b_B, b_O | Prep_d_N | Nurse prepares drugs for the patient
c_A, c_B, c_O | Pre_oy_D | Doctor executes pre oxygenation
d_A, d_B, d_O | Prep_team_D | Doctor prepares the team for intubation
In the case study addressed here, initially the mental models of the nurse (person A) and doctor (person B) are different and based on weak connections; they don’t use a stronger shared mental model as that does not exist yet. The organizational learning addressed to improve the situation covers:
1. Individual learning by A and B of their mental models through internal simulation, which results in stronger but still incomplete and different mental models (by Hebbian learning). Person A’s mental model has no connection from c_A to d_A and person B’s mental model has no connection from a_B to b_B.
2. Formation of a shared organization mental model based on the two individual mental models. A process of unification takes place.
3. Learning individual mental models from the shared mental model; e.g., a form of instructional learning.
4. Strengthening these individual mental models by individual learning through internal simulation, which results in stronger and now complete mental models (by Hebbian learning). Now person A’s mental model has a connection from c_A to d_A and person B’s mental model has a connection from a_B to b_B.

In this case study, person A and person B have knowledge on different tasks, and there is no shared mental model at first. Development of the organizational learning covers:
1. Individual learning processes of A and B for their separate mental models through internal simulation. By Hebbian learning (Hebb 1949), mental models become stronger but they are still incomplete. A has no knowledge for state d_A, and B has no knowledge for state a_B: they do not have connections to these states.
2. Shared mental model formation by aggregation of the different individual mental models.
3. Individuals’ adoption of the shared mental model, e.g., a form of instructional learning.
4. Strengthening of individual mental models by individual learning through internal simulation, strengthening knowledge for less known states of persons A and B (by Hebbian learning). Then, persons have stronger and now (more) complete mental models.
5. Improvements on the shared mental model by aggregation of the effects of the strengthened individual mental models.

A crucial element for the shared mental model formation is the aggregation process. Not all individual mental models will be considered to have equal value. Person A may be more knowledgeable than person B, for example. And when they are equally knowledgeable, can they be considered independent sources, or have they just learnt it from the same source? In the former case, aggregation of their knowledge leads to a stronger outcome than in the latter case.
Based on such considerations, a number of context factors have been included that affect the type of aggregation that is applied: they are used to control the process of aggregation leading to a shared mental model in such a way that it becomes context-sensitive. As in the network model aggregation is specified by combination functions (see Sect. 3) of the first-order self-model states $W_{X,Y}$ for the weights of the connections X → Y of the shared mental model, this means that these combination functions become adaptive (in a heuristic manner) in relation to the specified context factors. The influences of the context factors on the aggregation as indicated in Table 3 have been used to specify this context-sensitive control for the choice of combination function. For example, if A and B have similar knowledgeability, a form of average is supported (a Euclidean or geometric mean combination function), unless they are independent, in which case some form of amplification is supported (a logistic combination function). If they differ in knowledgeability, the maximal knowledge is chosen (a maximum combination function). These are meant as examples of heuristics to illustrate the idea and can easily be replaced by other heuristics.
Table 3. Examples of heuristics for context-sensitive control of mental model aggregation applied in the example scenario

Context: knowledgeable | Context: dependency | Context: preference for type of quantity | Combination function type
A and B both not knowledgeable | - | Additive | Euclidean
A and B both not knowledgeable | - | Multiplicative | Geometric mean
A and B both knowledgeable | A and B dependent | Additive | Euclidean
A and B both knowledgeable | A and B dependent | Multiplicative | Geometric mean
A and B both knowledgeable | A and B independent | - | Logistic
A knowledgeable, B not knowledgeable | - | - | Maximum
B knowledgeable, A not knowledgeable | - | - | Maximum
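The heuristics of Table 3 can be read as a small decision rule. The sketch below is a crisp simplification under stated assumptions: context factors are treated as booleans, whereas in the network model they are graded states whose influence is mediated by self-model states; the function name and the returned labels are hypothetical.

```python
def choose_combination_function(a_knowledgeable, b_knowledgeable,
                                independent, preference):
    """Return the type of combination function suggested by Table 3.

    preference is either "additive" or "multiplicative".
    """
    if a_knowledgeable != b_knowledgeable:
        # Only one person is knowledgeable: take the maximal knowledge.
        return "maximum"
    if a_knowledgeable and independent:
        # Two knowledgeable, independent sources: amplify.
        return "logistic"
    # Similar knowledgeability without independence: some form of average.
    return "euclidean" if preference == "additive" else "geometric mean"


choice_independent = choose_combination_function(True, True, True, "additive")
choice_dependent = choose_combination_function(True, True, False, "multiplicative")
choice_unequal = choose_combination_function(True, False, False, "additive")
```

Replacing the body of this function is the crisp analogue of swapping in other heuristics, which the text explicitly allows.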
The connectivity of the designed network model is depicted in Fig. 1. The base level of this model includes all the individual mental states, shared mental model states, and context states that are used to initiate different phases. Base level states can be considered as the core of the model. The first-order self-model level includes context states that play a role in the aggregation such as context states for knowledgeability level, dependence level and preference for additive or multiplicative aggregation. Derived context states (e.g., representing that none of A and B is knowledgeable) are also placed here to make combinations of context states clearer by specifying in a precise way what it is that affects aggregation. This level lastly includes W-states representing the weights of the base level connections of the mental models to make them adaptive. At this first-order adaptation level there are a number of (intralevel) connections that connect W-states from individual mental models to shared mental models and conversely. The first type of such connections (from left to right) are used for the formation of the shared mental model: they provide the impact from the W-states of the individual mental models on the W-states of the shared mental model. This is input for the aggregation process by which the shared mental model is formed; in (Crossan et al. 1999) this is called feed forward learning. The second type of connections (from right to left) model the influence that a shared mental model has on the individual mental models. This models, for example, instruction of the shared mental model to employees in order to get their individual mental models better; in (Crossan et al. 1999) this is called feedback learning.
Fig. 1. The connectivity of the second-order adaptive network model
The second-order self-model level includes WW-, MW- and HW-states to control the adaptations of the network model run at the first-order self-model level. These WW-states (also called higher-order W-states) specifying the weights of the connections between W-states of the organization and individual mental models are placed here to initiate the learning from the shared mental model by the individuals (by making these weights within the first-order self-model level nonzero), once a shared mental model is available. Note that these WW-states are becoming nonzero if (in phase 3) a control decision is made to indeed let individuals learn from the formed shared mental model, but they also have a learning mechanism so that they are maintained after that as well: persons will keep relating (and updating) their individual mental model to the shared mental model. This type of learning for WW-states can be considered a form of higher-order Hebbian learning. The HW-states are used for controlling adaptation speeds of connection weights and MW-states for controlling persistence of adaptation. To control the aggregation for the shared mental model connections there are second-order Ci,W-states in this level. Four different types of Ci,W-states are added to represent four different combination functions (see Table 1):
• C1,W for the logistic sum combination function alogistic
• C2,W for the scaled maximum combination function smax
• C3,W for the euclidean combination function eucl
• C4,W for the scaled geometric mean combination function sgeometric
So, there are four Ci,W-states for each of the three shared mental model connections. Thus, the model has 12 Ci,W-states at the second-order self-model level to model the aggregation process. These second-order self-model states and the functions they represent are used depending on the context (due to the connections from the context states to the Ci,W-states), and a weighted average is taken if more than one i has a nonzero Ci,W for a given W-state.
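The context-controlled aggregation described above can be sketched as follows. This is only an illustration: the four function bodies follow the forms commonly given for these combination functions in the self-modeling network literature (Treur 2020b), and the parameter values, the `FUNCTIONS` list, the `aggregate` helper and its behaviour when all weights are zero are assumptions made for this example, not the paper's specification.

```python
import math

def alogistic(values, sigma, tau):
    """Advanced logistic sum with steepness sigma and threshold tau."""
    total = sum(values)
    return ((1 / (1 + math.exp(-sigma * (total - tau)))
             - 1 / (1 + math.exp(sigma * tau))) * (1 + math.exp(-sigma * tau)))

def smax(values, lam):
    """Scaled maximum with scaling factor lam."""
    return max(values) / lam

def eucl(values, n, lam):
    """Euclidean combination function of order n with scaling factor lam."""
    return (sum(v ** n for v in values) / lam) ** (1 / n)

def sgeometric(values, lam):
    """Scaled geometric mean with scaling factor lam."""
    return (math.prod(values) / lam) ** (1 / len(values))

# Illustrative parameter choices (assumptions), in the order C1,W .. C4,W.
FUNCTIONS = [
    lambda v: alogistic(v, 5.0, 0.5),
    lambda v: smax(v, 1.0),
    lambda v: eucl(v, 1, 2.0),
    lambda v: sgeometric(v, 1.0),
]

def aggregate(values, c_weights, functions=FUNCTIONS):
    """Average of the function outputs, weighted by the Ci,W-state values."""
    total = sum(c_weights)
    if total == 0:
        return 0.0  # assumption: no active function means no impact
    return sum(c * f(values) for c, f in zip(c_weights, functions)) / total
```

For instance, `aggregate([0.8, 0.4], [0, 1, 1, 0])` averages the scaled maximum and the Euclidean function with equal weight, which mirrors the case where two Ci,W-states are nonzero for the same W-state.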
More details of the model and a full specification can be found as Linked Data at URL https://www.researchgate.net/publication/354176039.
5 Example Simulation Scenario

Recall once more from Sect. 3 that aggregation characteristics within a network model are specified by combination functions. In particular, this applies to the aggregation of individual mental models in order to get shared mental models out of them. In this scenario, different combination functions are used to observe different types of aggregation while organizational learning progresses by the unification of separate individual mental models. With a multi-phase approach, two individual mental models that are distinct in the beginning create the shared mental model of their organization over time, and there are effects of individuals and the organization on each other in different time intervals. Thus, it is possible to explore how aggregation occurs during an organizational learning process. To see the flow of these processes clearly, the scenario is structured in phases. In practice and also in the model, these processes can also overlap or take place entirely simultaneously. The five phases were designed as follows:

• Phase 1: Individual mental model usage and learning
  This relates to (a) in Sect. 2. Two different mental models for persons A and B belonging to an organization are constructed and become stronger in this phase. Hebbian learning takes place to improve their individual mental models by using them for internal simulations. Person A mainly has knowledge on the first part of the job, and person B has knowledge on the last part; thus A is the person who started the job and B is the one who finished it.
• Phase 2: Shared mental model formation
  This relates to (b) and (c) in Sect. 2. Unification and aggregation of individual mental models occur here. During this formation of the shared mental model, different combination functions are used for different cases in terms of knowledgeability, dependence and preference for additivity or multiplicativity. Organizational learning takes place with the determination of the values of the W-states for the organization’s general (non-personal) states for the job a_O to d_O. An incomplete and imperfect shared mental model is formed and maintained by the organization.
• Phase 3: Instructional learning of the shared mental model by the individuals
  This relates to (c) and (d) in Sect. 2. Learning from the organization’s shared mental model, which can be considered as learning from each other in an indirect manner, begins in this phase by the activation of the connections from the organization’s general W-states to the individual W-states. Persons receive the knowledge from the shared mental model as a form of instructional learning. There is no need for many mutual one-to-one connections between persons since there is a single shared mental model.
• Phase 4: Individual mental model usage and learning
  This relates to (d) in Sect. 2. Further improvements of the individual mental models of persons are observed with the help of Hebbian learning during usage of the mental model for internal simulation in this phase. Person A starts to learn about task d (state d_A) by using the knowledge from the shared mental model (obtained from person B), and similarly B learns about task a (state a_B), which they did not know in the beginning. Therefore, these ‘hollow’ states become meaningful for the individuals. The individuals take advantage of the organizational learning.
• Phase 5: Strengthening shared mental model with gained knowledge
  This relates to (e) in Sect. 2. People of the organization start to affect the shared mental model as they gain improved individual knowledge over time. The activation of the organization’s general states causes improvements of the shared mental model, and it becomes closer to the perfect complete shared mental model.

Figure 2 shows an overview of all states of the simulation. In Fig. 2, individual learning by using mental models for internal simulation (Hebbian learning) takes place in the first phase, between time 10 and 300. Only X4 (d_A) and X5 (a_B) remain at 0 because of the absence of knowledge. These ‘hollow’ states will increase in Phase 4 after learning during Phase 3 from the unified shared mental model developed in Phase 2. The W-states of the individuals representing their knowledge and learning slightly decrease starting from the end of Phase 1 at about 300 since the persistence factors’ self-model M-states do not have the perfect value 1, meaning that persons forget. Since the persistence factor of B is smaller than that of A, B’s W-states decrease more in the second phase: it can be deduced that B is a more forgetful person.

Context states for different combination functions determine the aggregation pattern of the shared mental model in Phase 2. For 4 different functions, there are 12 Ci,W-states in total for the organization’s three connections (Wa_O,b_O, Wb_O,c_O, and Wc_O,d_O) with different activation levels. Some of them are even above 1, but this does not cause a problem because the weighted average of them will be taken (according to formula (2) in Sect. 3). The shared mental model is formed in this phase based on the context-sensitive control of the aggregation used.

The W-states of the organization’s shared mental model have links back to the W-states of the individuals’ mental models to make individual learning (by instructional learning) from the shared mental model possible. In Phase 3, all the higher-order self-model W-states (X47 to X52, also called WW-states) for these connections from the shared mental model’s to the individuals’ first-order W-states become activated. This models the instructional learning: the persons are informed about the shared mental model. Forgetting also takes place for the connections from the W-states of the organization’s shared mental model to those of the individuals’ mental models. It means that a fast-starting learning process becomes stagnant over time.

By observing Phase 4, it can be seen that after time 650, all the W-states of the individuals make an upward jump because of further individual learning. In Phase 5, like in Phase 2, the W-states of the organization’s shared mental model increase (due to the individual mental models that were improved in Phase 4) and get closer to a perfect complete shared mental model. This improved shared mental model in principle also has an effect on individual mental models, as the higher-order W-states are also still activated here.
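The forgetting effect noted for Phase 1 (W-states that decline once the mental model is no longer used, and decline faster for the person with the smaller persistence factor) can be illustrated with a minimal sketch. It assumes the Hebbian combination function in the form often used in this modelling approach, hebb(V1, V2, W) = V1·V2·(1 − W) + μ·W, together with a simple Euler update; the speed factor, step size, initial weight and persistence values are illustrative assumptions and not the values of the paper's simulation.

```python
def hebbian_target(v1, v2, w, mu):
    """Assumed Hebbian combination function with persistence factor mu."""
    return v1 * v2 * (1.0 - w) + mu * w

def simulate_weight(mu, eta=0.5, dt=0.5, use_steps=200, idle_steps=200):
    """Euler simulation of one W-state: model in use first, then idle."""
    w = 0.1  # illustrative initial connection weight
    trace = []
    for step in range(use_steps + idle_steps):
        active = 1.0 if step < use_steps else 0.0  # both connected states on/off
        w += eta * (hebbian_target(active, active, w, mu) - w) * dt
        trace.append(w)
    return trace

# Hypothetical persistence factors: person A forgets less than person B.
trace_a = simulate_weight(mu=0.99)
trace_b = simulate_weight(mu=0.95)
```

In this sketch both weights grow while the model is in use and decay afterwards, with `trace_b` ending lower than `trace_a`, which is the qualitative pattern the text attributes to B being the more forgetful person.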
[Figure 2: line plot of the values of all states X1 to X79 (vertical axis from 0 to 1.3) over time (horizontal axis from 0 to 1000). Legend: X1–X4: a_A, b_A, c_A, d_A; X5–X8: a_B, b_B, c_B, d_B; X9–X12: a_O, b_O, c_O, d_O; X13–X16: con_ph1 to con_ph4; X17–X22: conknow context states for the W-states of A (X17–X19) and B (X20–X22); X23: condependencyA,B; X24: conadditive; X25: conmultiplicative; X26–X37: derived context states conAnotB, conBnotA, conAB and connotAnotB for the connections a->b (X26–X29), b->c (X30–X33) and c->d (X34–X37); X38–X40: W-states of A; X41–X43: W-states of B; X44–X46: W-states of O (Wa_O,b_O, Wb_O,c_O, Wc_O,d_O); X47–X52: WW-states WX44,X38, WX45,X39, WX46,X40, WX44,X41, WX45,X42, WX46,X43; X53–X61: HW-states of A, B and O; X62–X67: MW-states of A and B; X68–X79: C1,W to C4,W for Wa_O,b_O (X68–X71), Wb_O,c_O (X72–X75) and Wc_O,d_O (X76–X79).]
Fig. 2. Overview of the simulation scenario
6 Discussion

Organisational learning usually exploits developed individual mental models in order to form shared mental models for the organisation; e.g., (Kim 1993; Wiewiora et al. 2019). This happens by some form of aggregation. The current paper focuses on this aggregation process, which often depends on contextual factors. It was shown how a second-order adaptive self-modeling network model for organisation learning based on self-modeling network models described in (Treur 2020b) can model this process of aggregation of individual mental models in a context-dependent manner. Compared to (Canbaloğlu et al. 2021) the type of aggregation used for the process of shared mental model formation was explicitly addressed and made context-sensitive. Different forms of aggregation have been incorporated, for example, Euclidean and geometric mean weighted averages, maximum functions and logistic forms. The choice of aggregation was made adaptive in a context-sensitive manner so that for each context a different form of aggregation can be chosen automatically as part of the overall process.
References

Argyris, C., Schön, D.A.: Organizational Learning: A Theory of Action Perspective. Addison-Wesley, Reading, MA (1978)
Bogenrieder, I.: Social architecture as a prerequisite for organizational learning. Manag. Learn. 33(2), 197–216 (2002)
Canbaloğlu, G., Treur, J., Roelofsma, P.H.M.P.: Computational modeling of organisational learning by self-modeling networks. Cogn. Syst. Res. (2021)
Craik, K.J.W.: The Nature of Explanation. University Press, Cambridge, MA (1943)
Crossan, M.M., Lane, H.W., White, R.E.: An organizational learning framework: from intuition to institution. Acad. Manag. Rev. 24, 522–537 (1999)
Fischhof, B., Johnson, S.: Organisational Decision Making. Cambridge University Press, Cambridge (1997)
Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory. John Wiley and Sons, New York (1949)
Kim, D.H.: The link between individual and organisational learning. Sloan Manag. Rev. Fall 1993, pp. 37–50. In: Klein, D.A. (ed.), The Strategic Management of Intellectual Capital. Routledge-Butterworth-Heinemann, Oxford (1993)
McShane, S.L., von Glinow, M.A.: Organizational Behavior. McGraw-Hill, Boston (2010)
Shih, Y.F., Alessi, S.M.: Mental models and transfer of learning in computer programming. J. Res. Comput. Educ. 26(2), 154–175 (1993)
Stelmaszczyk, M.: Relationship between individual and organizational learning: mediating role of team learning. J. Econ. Manag. 26, 107–127 (2016). https://doi.org/10.22367/jem.2016.26.06
Treur, J.: Modeling higher-order adaptivity of a network by multilevel network reification. Network Sci. 8, S110–S144 (2020)
Treur, J.: Network-Oriented Modeling for Adaptive Networks: Designing Higher-Order Adaptive Biological, Mental and Social Network Models. Springer Nature, Cham (2020b). https://doi.org/10.1007/978-3-030-31445-3
Treur, J., Van Ments, L. (eds.): Mental Models and their Dynamics, Adaptation, and Control: a Self-Modeling Network Modeling Approach. Springer Nature, in press (2022)
Van Ments, L., Treur, J.: Reflections on dynamics, adaptation and control: a cognitive architecture for mental models. Cogn. Syst. Res. 70, 1–9 (2021)
van Ments, L., Treur, J., Klein, J., Roelofsma, P.: A computational network model for shared mental models in hospital operation rooms. In: Mahmud, M., Kaiser, M.S., Vassanelli, S., Dai, Q., Zhong, N. (eds.) BI 2021. LNCS (LNAI), vol. 12960, pp. 67–78. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-86993-9_7
van Ments, L., Treur, J., Klein, J., Roelofsma, P.: A second-order adaptive network model for shared mental models in hospital teamwork. In: Nguyen, N.T., Iliadis, L., Maglogiannis, I., Trawiński, B. (eds.) ICCCI 2021. LNCS (LNAI), vol. 12876, pp. 126–140. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88081-1_10
Wiewiora, A., Smidt, M., Chang, A.: The ‘how’ of multilevel learning dynamics: a systematic literature review exploring how mechanisms bridge learning between individuals, teams/projects and the organization. Eur. Manag. Rev. 16, 93–115 (2019)
A Leading Author Model for the Popularity Effect on Scientific Collaboration

Hohyun Jung1, Frederick Kin Hing Phoa2(B), and Mahsa Ashouri2

1 Department of Statistics, Sungshin Women’s University, Seoul 01133, South Korea
2 Institute of Statistical Science, Academia Sinica, Taipei City 11529, Taiwan
[email protected]

Abstract. In this paper, we focus on the popularity effect in the scientific collaboration process, whereby popular authors have an advantage in producing more publications. Standard network analysis has been used to analyze the scientific collaboration network. However, the standard network has limitations in explaining the scientific output by binary co-authorship relationships, since papers have various numbers of authors. We propose a leading author model to understand the popularity effect mechanism while avoiding the use of the standard network structure. An estimation algorithm is presented to analyze the size of the popularity effect. Moreover, we can find influential authors through the estimated genius levels of authors by considering the popularity effect. We apply the proposed model to real scientific collaboration data, and the results show positive popularity effects in all the collaborative systems. Furthermore, finding influential authors by considering the genius level is discussed.

Keywords: Scientific collaboration · Co-authorship network · Popularity effect · Dynamic process

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 424–437, 2022. https://doi.org/10.1007/978-3-030-93409-5_36

1 Introduction

Collaboration has emerged as an essential part of the scientific research process. A network representation has been employed to analyze scientific collaboration data. A scientific collaboration network is a social network in which nodes are researchers and co-authorships are represented as links. To the best of our knowledge, up until now, scientific collaboration networks have been analyzed based on network science from multiple perspectives, such as the collaborative feature by research field or region [1], publication habits [16], gender analysis [7], and the Matthew effect [17]. However, they only consider binary relationships, which do not fully represent the scientific collaboration process. This view emphasizes the importance of a more generalized structure that considers multi-node relationships for a proper explanation of papers with more than two authors. There are two structures
for handling the multi-node relationship: hypergraph and simplicial complex. A hypergraph is a generalized network structure in which a hyperedge can join any number of nodes. On the other hand, a simplicial complex is a mathematical object, defined as a family of non-empty subsets of nodes that is closed under taking non-empty subsets. In these structures, homogeneous interactions are considered among the nodes. For expressing scientific collaboration data, the generalized network structures are therefore more desirable and useful [14].

The main purpose of this research is to analyze the popularity effect in scientific collaboration. By popularity effect, we refer to the phenomenon that popular objects receive more popularity over time. This effect is common in real-world examples and is often referred to as the rich-get-richer phenomenon or the Matthew effect (Merton, 1968). For example, a company that has already occupied the market can attract more customers through promotions based on its reputation. Or, as a form of user behavior, popular YouTube channels tend to attract many subscribers [10]. We can also observe the popularity effect in community question answering websites such as StackExchange [10]. The popularity effect has also been studied in the discipline of network science through a large number of network models [9]. Pioneering work on the popularity effect was carried out by [2] with the preferential attachment mechanism, introducing the BA model. Many network scientists have developed variants of the BA model to explain the popularity effect along with degree correlation [6], aging of nodes [5], ability of nodes [3], etc. Although the popularity effect, or rich-get-richer phenomenon, is considered important in human-made communities, research on the popularity effect in scientific collaboration is noticeably lacking.

We expect a positive popularity effect in scientific collaboration networks, which could arise for three reasons. First, in academic conferences, popular speakers tend to receive more attention and, as a result, more opportunities to cooperate with others. Second, popular researchers can coach more students and researchers for scientific publications. Finally, journal reviewers may also favor manuscripts written by popular researchers. In this paper, we incorporate these aspects in a scientific collaboration model.

Several evolving generalized network models have been proposed to cover the popularity effect of the scientific collaboration network, in which all the authors of a paper are considered homogeneous in the hypergraph and simplicial complex models [20]. This means they do not consider the role differences among the authors. For example, authors may lead the paper, work on a specific part, suggest a key idea, or provide funding. In this paper, we consider the leading author, who conducts most of the work of the paper [18]. We assume that the leading author initiates the work and gathers popular researchers who have enough expertise in scientific research. This setting is more realistic since most projects are usually led by one researcher. It is also more relevant for explaining the popularity effect of the scientific collaboration process.
One of the important applications of network science is finding the influential nodes in the system. There has been a variety of research on finding influential authors [13]. Another contribution of this paper is to identify the authors’ genius levels, where a genius level refers to an author’s internal expertise level. Previous studies on scientific collaboration have determined influential or prolific authors by network centrality measures [19]. However, since the popularity effect might be incorporated into the scientific collaboration mechanism, the genius level needs to be measured in consideration of the popularity effect to avoid its overestimation for popular authors.

Our main contribution in this research is the analysis of the popularity effect in scientific collaboration data by the leading author modeling. To the best of our knowledge, our model is the first one investigating the popularity effect in scientific collaboration with a non-homogeneous author structure. We also investigate the authors’ genius levels with regard to the popularity effect. We suggest an algorithm that simultaneously estimates the authors’ genius levels and the papers’ leading authors, as well as the size of the popularity effect in the system. The EM algorithm and the Gibbs sampling method are employed for developing the algorithm. We illustrate our approach by investigating real scientific collaboration data collected from the Web of Science.

2 Model
Suppose we have time-series collaboration data at time $t = 0, 1, \dots, T$. Let $V_t$ be the set of authors at time $t$, $t = 0, 1, \dots, T$. We assume that an author $i$ enters and leaves the system at time $t_{0,i}$ and $t_{1,i}$, respectively, i.e., $i \in V_t$ if and only if $t = t_{0,i}, t_{0,i}+1, \dots, t_{1,i}$. Let $u_{i,t}$ be the popularity of author $i$ at time $t$. In this paper, we employ the average number of published papers per unit time (year) as the popularity measure, expressed by

$$u_{i,t} = \frac{1}{t - 1 + T_0}\,(\text{number of papers of author } i \text{ until time } t-1), \quad t = 1, \dots, T, \tag{1}$$

where $T_0$ is the duration of the initial data at $t = 0$. Furthermore, let $g_i$ be the genius level of author $i$ in the system. Here we assume $g_i$ follows the standard normal distribution. The probability density function of $g_i$ is expressed by

$$p(g_i) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} g_i^2\right), \quad -\infty < g_i < \infty. \tag{2}$$

We assume that every paper has a leading author who leads the work and gathers co-authors of the paper. Importantly, we here assume that a popular author has more chances to become a co-author of a paper, chosen by the leading author. We now describe the paper generation process as follows:

– For $t = 1, 2, \dots, T$:
  – For $i \in V_t$:
    – Draw the number of leading-authored papers of author $i$ at time $t$:

$$d_{i,t} \sim \mathrm{Poisson}\left(\exp(\theta_l + \theta_g g_i)\right). \tag{3}$$

    – For each paper led by author $i$, $d = 1, \dots, d_{i,t}$:
      – Choose co-authors of paper $q_{i,t,d}$ among $j \in V_t \setminus \{i\}$ with probability:

$$P(j \text{ becomes a coauthor of } q_{i,t,d}) = S(\theta_c + \gamma \theta_g g_j + \theta_u u_{j,t}), \tag{4}$$

where $S(x) = 1/(1 + e^{-x})$ is a sigmoid function.

Note that the genius level of an author affects the writing of both leading-authored and co-authored papers, whereas the popularity of an author affects only writing co-authored papers. The popularity coefficient $\theta_u$ represents the magnitude of the popularity effect. A positive $\theta_u$ indicates that the rich-get-richer effect is working in the collaborative system, whereas a negative $\theta_u$ shows the opposite direction, namely the rich-get-poorer.

The genius coefficient $\theta_g$ determines the size of the effect of authors’ genius levels. The authors with high genius levels will contribute more to the scientific collaborative system. The level of contribution can be different between leading-authored and co-authored papers, which is controlled by the positive-valued hyperparameter $\gamma$. If $\gamma$ is small, then the contribution is mainly measured by the leading-authored papers. We use $\gamma = 1$ throughout the paper. The parameter $\theta_l$ controls the total number of papers per unit time. The parameter $\theta_c$ is related to the number of authors of a paper. We denote $\theta = (\theta_l, \theta_c, \theta_g, \theta_u)$ as the parameter vector of the model.
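As a rough illustration of the generation process in Eqs. (1) to (4), the following sketch simulates papers for a fixed author set. Several simplifications are assumptions made only for this example: all authors are present at every time step (no entry or exit), the initial data contain no papers, $T_0 = 1$, and the parameter values passed to `simulate` are arbitrary. The Poisson sampler uses Knuth's multiplication method, chosen here only to keep the example free of external libraries.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate(n_authors, T, theta, gamma=1.0, T0=1, seed=0):
    """Return genius levels and a list of (time, leader, coauthors) papers."""
    theta_l, theta_c, theta_g, theta_u = theta
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(n_authors)]  # Eq. (2)
    counts = [0] * n_authors  # papers per author so far (assumed empty at t = 0)
    papers = []
    for t in range(1, T + 1):
        # Eq. (1): popularity uses papers published until time t - 1
        u = [counts[i] / (t - 1 + T0) for i in range(n_authors)]
        new_counts = counts[:]
        for i in range(n_authors):
            d = poisson(rng, math.exp(theta_l + theta_g * g[i]))  # Eq. (3)
            for _ in range(d):
                coauthors = [
                    j for j in range(n_authors)
                    if j != i and rng.random()
                    < sigmoid(theta_c + gamma * theta_g * g[j] + theta_u * u[j])  # Eq. (4)
                ]
                papers.append((t, i, coauthors))
                new_counts[i] += 1
                for j in coauthors:
                    new_counts[j] += 1
        counts = new_counts
    return g, papers

# Illustrative call with hypothetical parameter values (theta_l, theta_c, theta_g, theta_u).
g, papers = simulate(n_authors=20, T=5, theta=(-1.0, -3.0, 0.5, 0.2), seed=1)
```

With a positive `theta_u`, authors who accumulated papers early are selected as co-authors more often in later steps, which is the rich-get-richer mechanism the model is meant to capture.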
3 Algorithm

3.1 Preliminary
In this section, we provide an algorithm for the estimation of the model parameters, the authors’ genius levels, and the papers’ leading authors. They will be alternately updated by the EM algorithm along with the Gibbs sampling method. The hyperparameter $\gamma$ is assumed given. We denote by $c_t = \sum_{i \in V_t} d_{i,t}$ the total number of papers constructed at time $t$, and $m_{i,t}$ denotes the number of co-authored (non-leading) papers of author $i$ at time $t$. According to the generation process of co-authored papers, there are $c_t - d_{i,t}$ papers in which an author $i$ can participate with probability $S(\theta_c + \gamma \theta_g g_i + \theta_u u_{i,t})$. Therefore, we have

$$m_{i,t} \sim \mathrm{Binomial}\left(c_t - d_{i,t},\; S(\theta_c + \gamma \theta_g g_i + \theta_u u_{i,t})\right), \tag{5}$$

where the probability mass function is given by

$$p(m_{i,t} \mid c_t, d_{i,t}, g_i, u_{i,t}, \theta) = \binom{c_t - d_{i,t}}{m_{i,t}} \left(S(\theta_c + \gamma \theta_g g_i + \theta_u u_{i,t})\right)^{m_{i,t}} \left(1 - S(\theta_c + \gamma \theta_g g_i + \theta_u u_{i,t})\right)^{c_t - d_{i,t} - m_{i,t}} \tag{6}$$

for $m_{i,t} = 0, 1, \dots, c_t - d_{i,t}$. Since $S(x) = e^{x}/(1 + e^{x})$ and $1 - S(x) = 1/(1 + e^{x})$, this can be simplified to

$$p(m_{i,t} \mid c_t, d_{i,t}, g_i, u_{i,t}, \theta) = \binom{c_t - d_{i,t}}{m_{i,t}} \exp\left(m_{i,t}(\theta_c + \gamma \theta_g g_i + \theta_u u_{i,t})\right) \times \left(1 + \exp(\theta_c + \gamma \theta_g g_i + \theta_u u_{i,t})\right)^{-c_t + d_{i,t}}. \tag{7}$$

The probability mass function of $d_{i,t}$, a Poisson distribution with mean $\exp(\theta_l + \theta_g g_i)$, can be expressed by

$$p(d_{i,t} \mid g_i, \theta) = \frac{1}{d_{i,t}!} \exp\left(d_{i,t}(\theta_l + \theta_g g_i)\right) \exp\left(-\exp(\theta_l + \theta_g g_i)\right), \quad d_{i,t} = 0, 1, \dots. \tag{8}$$
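The step from Eq. (6) to Eq. (7) rests on the identity $S(\eta)^{m}(1 - S(\eta))^{n-m} = e^{m\eta}(1 + e^{\eta})^{-n}$ with $n = c_t - d_{i,t}$ and $\eta = \theta_c + \gamma\theta_g g_i + \theta_u u_{i,t}$. The short numerical check below is only a sanity test under illustrative values; the function names and the numbers used are assumptions for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binom_pmf_direct(m, n, eta):
    """Eq. (6): binomial pmf with success probability S(eta), n = c_t - d_{i,t}."""
    p = sigmoid(eta)
    return math.comb(n, m) * p ** m * (1.0 - p) ** (n - m)

def binom_pmf_simplified(m, n, eta):
    """Eq. (7): the same pmf written with exponentials only."""
    return math.comb(n, m) * math.exp(m * eta) * (1.0 + math.exp(eta)) ** (-n)

def poisson_pmf(d, theta_l, theta_g, g_i):
    """Eq. (8): Poisson pmf with mean exp(theta_l + theta_g * g_i)."""
    log_mean = theta_l + theta_g * g_i
    return math.exp(d * log_mean - math.exp(log_mean)) / math.factorial(d)

# Compare both forms for every admissible m (illustrative n and eta).
pairs = [(binom_pmf_direct(m, 5, -1.3), binom_pmf_simplified(m, 5, -1.3))
         for m in range(6)]
```

Each pair in `pairs` agrees up to floating-point error, and both the binomial and the Poisson probabilities sum to one over their supports.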
Note that $d_{i,t} + m_{i,t}$ is the total number of papers written by author $i$ at time $t$. For a paper $q_{i,t,d}$, which is the $d$-th leading paper of author $i$ at time $t$, we aim to identify the leading author $i$ among all the observed authors of the paper. We reindex the papers at time $t$, from $q_{i,t,d}$, $i \in V_t$, $d = 1, \dots, d_{i,t}$, to $q_{c,t}$, $c = 1, \dots, c_t$. Let $b_{c,t} \in V_t$ be the leading author of paper $q_{c,t}$.

For simplicity, we denote the set of variables by omitting the subscripts $t$, $i$, $c$, and so on. For example, $g = \{g_i : i \in V\}$, $u_i = \{u_{i,t} : t = \max(t_{0,i}, 1), \dots, t_{1,i}\}$, $u_t = \{u_{i,t} : i \in V_t\}$, $u = \{u_{i,t} : i \in V,\ t = \max(t_{0,i}, 1), \dots, t_{1,i}\}$, $b_t = \{b_{c,t} : c = 1, \dots, c_t\}$, $b = \{b_{c,t} : t = 1, \dots, T,\ c = 1, \dots, c_t\}$, and $c = \{c_t : t = 1, \dots, T\}$, where $V = \bigcup_{t=1}^{T} V_t$. We use $\max(t_{0,i}, 1)$ for $t_{0,i} = 0$ since the stochastic inference will be made on $t = 1, \dots, T$.

3.2 Likelihood Function
In this part, we explain the distribution functions of the model. We identify the three parts of the generative process: genius levels, leading-authored papers, and co-authored papers. First, the probability distribution function for the authors’ genius levels can be written by

$$p(g) = \prod_{i \in V} p(g_i). \tag{9}$$

Second, the probability distribution function of the number of leading-authored papers is expressed by

$$p(d \mid g, \theta) = \prod_{t=1}^{T} \prod_{i \in V_t} p(d_{i,t} \mid g_i, \theta). \tag{10}$$

Third, the remaining process is about the selection of co-authors of papers, and the probability distribution function is given by

$$p(m \mid d, g, u, \theta) = \prod_{t=1}^{T} \prod_{i \in V_t} p(m_{i,t} \mid c_t, d_{i,t}, g_i, u_{i,t}, \theta). \tag{11}$$
Combining the three processes, we obtain the total probability distribution function of our generative stochastic model, given by

$$p(g, d, m \mid u, \theta) = p(g)\, p(d \mid g, \theta)\, p(m \mid d, g, u, \theta), \tag{12}$$

using the chain rule of probability. We then have the complete data likelihood function of the model, given by $L(\theta \mid g, d, m, u) = p(g, d, m \mid u, \theta)$. The complete data log-likelihood function for the EM algorithm can be obtained by

$$l(\theta \mid g, d, m, u) = \ln p(g, d, m \mid u, \theta) = \ln p(g) + \ln p(d \mid g, \theta) + \ln p(m \mid d, g, u, \theta). \tag{13}$$

For simplicity, we rewrite the log-likelihood function by $l(\theta \mid g, b, q, u) = l(\theta \mid g, d, m, u)$ since we can determine $d$ and $m$ by the authors list $q$ and the leading author information $b$ of all the papers at $t = 1, \dots, T$. Now we define the expected value of the complete data log-likelihood function for the model as

$$Q\left(\theta \mid \theta^{(s)}\right) = E\left[\, l(\theta \mid G, B, q, u) \,\middle|\, q, u, \theta^{(s)} \right], \tag{14}$$

where the expectation is taken over the uppercase quantities $G$ and $B$ that correspond to the genius levels $g$ of authors and the choices of the leading author $b$, respectively. Here, $\theta^{(s)}$ is the parameter estimate at the $s$-th iteration of the EM algorithm. This function is to be maximized in the M-step of the EM algorithm.

3.3 Gibbs Sampling

The calculation of $Q(\theta \mid \theta^{(s)})$ is challenging since in general it cannot be expressed in a closed form. A sampling approach can be a remedy for this problem. Instead of calculating the analytic solution, we will sample $G$ and $B$ from the distribution $p(g, b \mid q, u, \theta^{(s)})$ to estimate the expected value over $G$ and $B$. We will alternately sample $G$ and $B$ from the target conditional distributions via the Gibbs sampling technique. The following conditional distributions of $G$ will be the target distributions for the sampling of authors’ genius levels, given by

$$p\left(g_i \mid g \setminus \{g_i\}, b, q, u, \theta^{(s)}\right) \propto p(g_i) \prod_{t=\max(t_{0,i},1)}^{t_{1,i}} p\left(d_{i,t} \mid g_i, \theta^{(s)}\right) p\left(m_{i,t} \mid c_t, d_{i,t}, g_i, u_{i,t}, \theta^{(s)}\right) \tag{15}$$

for $i \in V$. Note that the $g_j$, $j \neq i$, drop out of the conditioning, since the conditional distribution of $g_i$ does not depend on them. The derivation of the above relationship is straightforward from the Bayes Theorem.
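To make the target distribution of a single genius level concrete, the sketch below evaluates the logarithm of the unnormalized density in Eq. (15), dropping the terms that do not depend on $g_i$ (the factorial and the binomial coefficient). For drawing from this density, the sketch uses a random-walk Metropolis step as a simple stand-in; this is a swapped-in technique, not the adaptive rejection sampling used in the paper. The record layout `(d, m, c, u)`, the step size and the example values are assumptions for illustration.

```python
import math
import random

def log_conditional_g(g_i, records, theta, gamma=1.0):
    """Log of Eq. (15) up to a constant; records holds (d, m, c, u) per time step."""
    theta_l, theta_c, theta_g, theta_u = theta
    logp = -0.5 * g_i * g_i  # standard normal prior, Eq. (2)
    for d, m, c, u in records:
        log_mean = theta_l + theta_g * g_i
        logp += d * log_mean - math.exp(log_mean)  # Poisson part, Eq. (8)
        eta = theta_c + gamma * theta_g * g_i + theta_u * u
        logp += m * eta - (c - d) * math.log1p(math.exp(eta))  # binomial part, Eq. (7)
    return logp

def metropolis_step(g_i, records, theta, rng, step=0.5, gamma=1.0):
    """One random-walk Metropolis update (stand-in for adaptive rejection sampling)."""
    proposal = g_i + rng.gauss(0.0, step)
    log_ratio = (log_conditional_g(proposal, records, theta, gamma)
                 - log_conditional_g(g_i, records, theta, gamma))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return g_i

# Hypothetical author: 5 leading and 3 co-authored papers out of 20 in each of 3 years.
records = [(5, 3, 20, 1.0)] * 3
theta = (0.0, -2.0, 0.5, 0.1)
rng = random.Random(0)
g_i, draws = 0.0, []
for _ in range(500):
    g_i = metropolis_step(g_i, records, theta, rng)
    draws.append(g_i)
```

For this prolific hypothetical author the draws settle at clearly positive values, reflecting that many leading-authored papers are evidence of a high genius level when $\theta_g > 0$.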
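Next, we move on to the sampling of $B$, which accounts for choosing a leading author among all the paper authors. For each paper $q_{c,t}$, $t = 1, \dots, T$, $c = 1, \dots, c_t$, we compute the following by the Bayes Theorem:

$$\begin{aligned} p\left(b_{c,t} = i_k \mid g, b \setminus \{b_{c,t}\}, q, u, \theta^{(s)}\right) &= p\left(b_{c,t} = i_k \mid \text{authors of } q_{c,t} \text{ is } \{i_1, \dots, i_K\}, g, u, \theta^{(s)}\right) \\ &\propto p\left(b_{c,t} = i_k \mid g, u, \theta^{(s)}\right) p\left(\{i_1, \dots, i_K\} \setminus \{i_k\} \text{ are coauthors of } q_{c,t} \mid b_{c,t} = i_k, g, u, \theta^{(s)}\right), \end{aligned} \tag{16}$$

where $\{i_1, \dots, i_K\}$ is the set of authors of paper $q_{c,t}$. Our goal is to calculate the probability that $i_k$ is the leading author. It should be noted that $p(b_{c,t} = i_k \mid g, u, \theta^{(s)})$ is the probability for $i_k$ to become a leading author of any paper among all authors at time $t$. Since the number of leading-authored papers for author $i$ is determined by the Poisson distribution with mean parameter $\exp(\theta_l + \theta_g g_i)$, we have

$$p\left(b_{c,t} = i_k \mid g, u, \theta^{(s)}\right) = \frac{\exp\left(\theta_l^{(s)} + \theta_g^{(s)} g_{i_k}\right)}{\sum_{j \in V_t} \exp\left(\theta_l^{(s)} + \theta_g^{(s)} g_j\right)}. \tag{17}$$

For the second term, we obtain

$$\begin{aligned} & p\left(\{i_1, \dots, i_K\} \setminus \{i_k\} \text{ are coauthors of } q_{c,t} \mid b_{c,t} = i_k, g, u, \theta^{(s)}\right) \\ &\quad = \prod_{j \in \{i_1, \dots, i_K\} \setminus \{i_k\}} S\left(\theta_c^{(s)} + \gamma \theta_g^{(s)} g_j + \theta_u^{(s)} u_{j,t}\right) \prod_{j \in V_t \setminus \{i_1, \dots, i_K\}} \left(1 - S\left(\theta_c^{(s)} + \gamma \theta_g^{(s)} g_j + \theta_u^{(s)} u_{j,t}\right)\right). \end{aligned} \tag{18}$$

Clearly, $p(b_{c,t} = j \mid g, b \setminus \{b_{c,t}\}, q, u, \theta^{(s)}) = 0$ if $j \in V_t \setminus \{i_1, \dots, i_K\}$ since a leading author is one of the authors of the paper. By discarding terms unrelated to the authors $\{i_1, \dots, i_K\}$, we get the practical form of the conditional distribution,

$$p\left(b_{c,t} = i_k \mid g, b \setminus \{b_{c,t}\}, q, u, \theta^{(s)}\right) \propto \frac{\exp\left(\theta_l^{(s)} + \theta_g^{(s)} g_{i_k}\right)}{S\left(\theta_c^{(s)} + \gamma \theta_g^{(s)} g_{i_k} + \theta_u^{(s)} u_{i_k,t}\right)}, \quad k = 1, \dots, K. \tag{19}$$

Then we can choose $i_k$ with probability

$$p\left(b_{c,t} = i_k \mid g, b \setminus \{b_{c,t}\}, q, u, \theta^{(s)}\right) = \frac{\dfrac{\exp\left(\theta_l^{(s)} + \theta_g^{(s)} g_{i_k}\right)}{S\left(\theta_c^{(s)} + \gamma \theta_g^{(s)} g_{i_k} + \theta_u^{(s)} u_{i_k,t}\right)}}{\displaystyle\sum_{k'=1}^{K} \dfrac{\exp\left(\theta_l^{(s)} + \theta_g^{(s)} g_{i_{k'}}\right)}{S\left(\theta_c^{(s)} + \gamma \theta_g^{(s)} g_{i_{k'}} + \theta_u^{(s)} u_{i_{k'},t}\right)}} \tag{20}$$

for $k = 1, \dots, K$. We have the following Gibbs sampling algorithm.

– Initialize: $g_i^{(s,0)} = 0$, $i \in V$. $b_{c,t}^{(s,0)}$ is assigned randomly from the authors of paper $q_{c,t}$, $t = 1, \dots, T$, $c = 1, \dots, c_t$.
– For $h = 1, \dots, H$:
  – For $i \in V$:
    – Sample $g_i^{(s,h)}$ from $p(g_i \mid g \setminus \{g_i\}, b, q, u, \theta^{(s)})$ employing $b = b^{(s,h-1)}$.
  – For $t = 1, \dots, T$:
    – For $c = 1, \dots, c_t$:
      – Sample $b_{c,t}^{(s,h)}$ from $p(b_{c,t} = i_k \mid g, b \setminus \{b_{c,t}\}, q, u, \theta^{(s)})$ employing $g = g^{(s,h)}$ and $b_{c',t'} = b_{c',t'}^{(s,h)}$ if $(c', t')$ is visited and $b_{c',t'} = b_{c',t'}^{(s,h-1)}$ otherwise.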
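The leading-author update of the Gibbs sampler follows directly from Eq. (20): each author of a paper receives a weight equal to the Poisson mean of leading a paper divided by the probability of having been picked as a co-author, and the weights are normalized over the paper's authors. The sketch below shows this for a single paper; the dictionary-based data layout, the function names and the numerical values are assumptions for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leader_probabilities(authors, g, u_t, theta, gamma=1.0):
    """Eq. (20): probability that each author of one paper is its leading author."""
    theta_l, theta_c, theta_g, theta_u = theta
    weights = []
    for i in authors:
        lead = math.exp(theta_l + theta_g * g[i])                      # numerator of Eq. (19)
        picked = sigmoid(theta_c + gamma * theta_g * g[i] + theta_u * u_t[i])
        weights.append(lead / picked)
    total = sum(weights)
    return [w / total for w in weights]

def sample_leader(authors, g, u_t, theta, rng, gamma=1.0):
    """Draw one leading author for the paper according to Eq. (20)."""
    probs = leader_probabilities(authors, g, u_t, theta, gamma)
    return rng.choices(authors, weights=probs, k=1)[0]

# Hypothetical paper with two equally gifted authors; author 1 is far more popular.
authors = [0, 1]
g = {0: 0.0, 1: 0.0}
u_t = {0: 0.0, 1: 5.0}
theta = (-1.0, -3.0, 0.5, 0.3)
probs = leader_probabilities(authors, g, u_t, theta)
leader = sample_leader(authors, g, u_t, theta, random.Random(0))
```

In this example the less popular author is the more probable leader: with $\theta_u > 0$, the popular author's presence on the paper is already well explained by being chosen as a co-author.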
These steps provide the Gibbs samples $g^{(s,h)}$ and $b^{(s,h)}$, $h = 1, \dots, H$, of genius levels and the leading author assignments, corresponding to the current parameter estimate $\theta^{(s)}$. We ignore some number of samples at the beginning since the initial samples can affect the Gibbs samples. In this paper, we discard the first $H_0 = 10$ samples out of $H = 60$ samples. In addition, the adaptive rejection sampling [8] is used to get samples from the target distribution $p(g_i \mid g \setminus \{g_i\}, b, q, u, \theta^{(s)})$ of genius levels.

3.4 EM Algorithm
The E-step estimates the function $Q(\theta \mid \theta^{(s)})$ of $\theta$ by

$$\hat{Q}\left(\theta \mid \theta^{(s)}\right) = \frac{1}{H - H_0} \sum_{h=H_0+1}^{H} l\left(\theta \mid g^{(s,h)}, b^{(s,h)}, q, u\right), \tag{21}$$

where $g^{(s,h)}$, $b^{(s,h)}$ are obtained from the Gibbs sampling algorithm. The M-step estimates $\theta^{(s+1)}$ as a maximizer of the function $\hat{Q}(\theta \mid \theta^{(s)})$ with respect to $\theta$, written by

$$\theta^{(s+1)} = \arg\max_{\theta} \hat{Q}\left(\theta \mid \theta^{(s)}\right). \tag{22}$$

We use the constrained optimization by linear approximation (Bos, 2006). Finally, we have the EM algorithm, which repeats E- and M-steps until convergence.

– Initialize: $s = 0$.
– For $s = 1, 2, \cdots$:
  – Obtain Gibbs samples $g^{(s,h)}$ and $b^{(s,h)}$, $h = 1, \cdots, H$, by running the Gibbs sampling algorithm with $\theta^{(s)}$.
  – Calculate $\hat{Q}(\theta \mid \theta^{(s)})$.
  – Find $\theta^{(s+1)} = \arg\max_{\theta} \hat{Q}(\theta \mid \theta^{(s)})$.
  – If converged:
    – $\hat{\theta} = \theta^{(s+1)}$.
    – Obtain Gibbs samples $g^{(h)}$ and $b^{(h)}$, $h = 1, \cdots, H$, by running the Gibbs sampling algorithm with $\hat{\theta}$.
    – Break

The algorithm yields the converged parameter estimate $\hat{\theta}$ and the corresponding Gibbs samples $g^{(h)}$ and $b^{(h)}$, $h = 1, \cdots, H$.
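The Monte Carlo E-step of Eq. (21) is an average over the post-burn-in samples, which the following Python sketch illustrates on a toy problem. Everything here is an assumption made for illustration: the names `q_hat`, `argmax_on_grid` and `toy_loglik` are hypothetical, the toy log-likelihood is not the model's, and the one-dimensional grid search merely stands in for the constrained optimizer used in the paper.

```python
def q_hat(theta, samples, loglik, burn_in):
    """Average log-likelihood over samples h = burn_in + 1, ..., H (Eq. 21).

    samples[0] holds the draw for h = 1, so slicing drops the first burn_in.
    """
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("burn_in must be smaller than the number of samples")
    return sum(loglik(theta, z) for z in kept) / len(kept)


def argmax_on_grid(f, lo, hi, steps):
    """Crude 1-D maximizer on an evenly spaced grid (stand-in for the M-step)."""
    best_x, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        val = f(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x


def toy_loglik(theta, z):
    """Toy complete-data log-likelihood, maximized at theta = z."""
    return -(theta - z) ** 2


# Ten burn-in draws far from the target, then three retained draws.
toy_samples = [100.0] * 10 + [1.0, 2.0, 3.0]
theta_next = argmax_on_grid(
    lambda th: q_hat(th, toy_samples, toy_loglik, burn_in=10), 0.0, 4.0, 400
)
```

In this toy setting the maximizer is the mean of the retained draws, so the burn-in draws have no influence on `theta_next`.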
3.5 Inference
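By Louis’ method (Louis, 1982), the estimated observed information matrix of the estimated model parameter $\hat{\theta}$ can be written by

$$I(\hat{\theta}) = -\nabla^2 \hat{Q}(\hat{\theta} \mid \hat{\theta}) + \nabla \hat{Q}(\hat{\theta} \mid \hat{\theta}) \, \nabla \hat{Q}(\hat{\theta} \mid \hat{\theta})^{\top} - \frac{1}{H - H_0} \sum_{h=H_0+1}^{H} \nabla l\left(\hat{\theta} \mid g^{(h)}, b^{(h)}, q, u\right) \nabla l\left(\hat{\theta} \mid g^{(h)}, b^{(h)}, q, u\right)^{\top}, \tag{23}$$

where $\nabla = \left(\frac{\partial}{\partial \theta_l}, \frac{\partial}{\partial \theta_c}, \frac{\partial}{\partial \theta_g}, \frac{\partial}{\partial \theta_u}\right)^{\top}$ is a differential operator of the parameter vector. Then we can calculate the standard error vector of the parameter estimate by

$$\mathrm{s.e.}(\hat{\theta}) = \left(\sqrt{\left[I(\hat{\theta})^{-1}\right]_{xx}}\right)_{x=1,2,3,4}. \tag{24}$$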
The genius levels of authors can be estimated by the estimated posterior distributions, given by

$$\hat{g}_i = \frac{1}{H - H_0} \sum_{h=H_0+1}^{H} g_i^{(h)}, \quad i \in V. \tag{25}$$

Similarly, by using the estimated posterior distributions of $b_{c,t}$, we can choose the most probable leading author of a paper $q_{c,t}$ as the author who appears most often among the posterior Gibbs samples $\{b_{c,t}^{(h)}\}_{h=H_0+1,\cdots,H}$. We may express it by

$$\hat{b}_{c,t} = \arg\max_{i \in V_t} \frac{1}{H - H_0} \sum_{h=H_0+1}^{H} \mathbb{1}\left(b_{c,t}^{(h)} = i\right), \quad t = 1, \cdots, T, \; c = 1, \cdots, c_t, \tag{26}$$

where $\mathbb{1}(\text{statement})$ is an indicator function that returns one if the statement is true and zero otherwise.
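Equations (25) and (26) reduce to a mean and a mode over the retained Gibbs samples. A minimal Python sketch, assuming the samples for one author or one paper are stored in a list whose first element is the draw for $h = 1$ (the function names are hypothetical):

```python
from collections import Counter


def posterior_mean(samples, burn_in):
    """Eq. (25): average of the draws after discarding the first burn_in."""
    kept = samples[burn_in:]
    return sum(kept) / len(kept)


def most_frequent_leader(samples, burn_in):
    """Eq. (26): the author drawn most often after burn-in.

    Ties are broken in favor of the author seen first among the kept draws,
    which is a choice of this sketch and not something the text specifies.
    """
    kept = samples[burn_in:]
    return Counter(kept).most_common(1)[0][0]
```

With the settings reported above, `burn_in` would be 10 and each list would hold 60 draws.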
4 Real Data Analysis

4.1 Scientific Collaboration Data
Our scientific collaboration data come from Web of Science (WoS), which provides extensive citation data for academic papers, and cover the years 2007 to 2016. We consider two fields: Management and Statistics & Probability. The initial data consist of the first five years, from 2007 to 2011, and we analyze another five years of time-series data from 2012 to 2016. Then we have $T_0 = T = 5$, and a unit time (time interval) of 1 year. The numbers of authors are $|V|$ = 104,444 and 96,448 in the fields of Management and Statistics & Probability, respectively. Many of the authors appear only once in the system, i.e., $i \in V_t$ for only one time point $t$ and $i \notin V_{t'}$ for all $t' \in \{1, \cdots, T\} \setminus \{t\}$. Investigating their genius levels is less meaningful because so little information is available. Therefore, their genius level is regarded as 0 and is not estimated via the Gibbs sampling algorithm. The numbers of authors of interest, those who appear at two or more time points in the system, are 25,128 and 24,457 for the disciplines of Management and Statistics & Probability, respectively. The total numbers of papers at $t = 0, 1, \cdots, T$ are 87,253 (Management) and 86,095 (Statistics & Probability). The numbers of authors and papers are thus very similar between the two disciplines. We apply the proposed method to the data and examine the popularity effect and the authors’ genius levels.

4.2 Popularity Effect
Estimates of the parameters are given in Table 1. We compare the estimates of θu to check the popularity effect on the system. We have positive popularity coefficients in both fields, indicating that the rich-get-richer effect is at work. Since Management (θu = 0.2809) has a larger popularity coefficient than Statistics & Probability (θu = 0.1733), we can argue that the rich-get-richer effect is stronger in the Management field.

Table 1. Estimates of model parameters with the scientific collaboration data

Field/Param.   θl       s.e.(θl)  θc       s.e.(θc)  θg      s.e.(θg)  θu      s.e.(θu)
Management     −1.0532  0.0048    −9.8729  0.0052    0.2679  0.0044    0.2809  0.0082
Stat. & Prob.  −1.0508  0.0054    −9.7635  0.0049    0.3859  0.0059    0.1733  0.0031
Popular authors tend to have more opportunities to collaborate, and they also tend to make a good impression on reviewers. In the field of Management, researchers deal with practical problems that arise in companies and the business environment. Authors may suggest many different solutions to the problems in management, and an author’s subjectivity would be much appreciated. On the other hand, mathematical and statistical analyses are more common in the Statistics & Probability field, which yields more objectivity. Therefore, the rich-get-richer effect may play a greater role in the Management field, resulting in a larger popularity coefficient.

4.3 Influential Author Finding
In this subsection, we investigate the influential authors in the field of Management. We use the following six influence measures: (i) the estimated genius level $\hat{g}_i$ (GL), (ii) the total number of papers (NP), (iii) the estimated number of leading-authored papers (NLP), (iv) the estimated number of co-authored papers (NCP), (v) the ratio of NLP to NP (RLP), where only authors with 20 or more papers are considered (NP ≥ 20), and (vi) the popularity at time T (POP). The number of papers is calculated over $t = 1, \cdots, T$, excluding the initial data, and NLP is computed using the estimated leading author assignments $\hat{b}_{c,t}$, $t = 1, \cdots, T$, $c = 1, \cdots, c_t$. We have the following straightforward relationships: NP = NLP + NCP and RLP = NLP/NP. Table 2 presents the top 10 authors in the Management field for each measure. Table 3 shows various properties of the top 10 authors who have the highest genius levels. In Table 4 we present the Pearson correlation coefficients between the five influence measures. RLP is excluded since it is not defined for all authors. Not surprisingly, we have similar results for NP and POP (cor(NP, POP) = 0.8085), since the popularity is calculated based on the number of papers. Also, results for NCP appear approximately similar to the NP and POP results, which may be caused by the fact that there is only one leading author per paper. On the other hand, there is a large difference between GL and POP, with a relatively small correlation coefficient (cor(GL, POP) = 0.3512). No author appears in the top 10 of both GL and POP. For example, author 24282 ranks first for the three measures NP, NCP, and POP, but does not appear in the top 10 authors of GL. The genius level, which excludes the effect that popular authors tend to produce more papers based on their popularity, is significantly different from measures that simply sum up the number of papers. If we judge the expertise of authors only by the number of papers, popular authors could be overrated.
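The count-based measures follow directly from the estimated leading-author assignments. A minimal Python sketch, assuming each paper is given as a pair of its author list and its estimated leading author (the function name and data layout are hypothetical):

```python
from collections import Counter


def influence_counts(papers, min_papers=20):
    """Compute NP, NLP, NCP and RLP per author.

    papers: iterable of (authors, leader) pairs, where leader is one of authors.
    RLP is reported only for authors with NP >= min_papers, otherwise None.
    """
    n_papers = Counter()
    n_leading = Counter()
    for authors, leader in papers:
        n_papers.update(authors)      # each listed author gains one paper
        n_leading[leader] += 1        # only the leader gains a leading paper
    measures = {}
    for author, np_ in n_papers.items():
        nlp = n_leading[author]       # Counter yields 0 for non-leaders
        measures[author] = {
            "NP": np_,
            "NLP": nlp,
            "NCP": np_ - nlp,         # NP = NLP + NCP
            "RLP": nlp / np_ if np_ >= min_papers else None,
        }
    return measures
```

As a check against the reported values, author 13150 in Table 3 has NLP = 21 and NP = 43, which gives RLP = 21/43 ≈ 0.4884.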
Table 2. Influential authors in the discipline of Management. We present the top 10 authors for the six influence measures: GL, NP, NLP, NCP, RLP, and POP.

Rank  GL     NP     NLP    NCP     RLP    POP
1     13150  24282  36646  24282   67408  24282
2     29174  36646  67408  93716   10011  36646
3     59363  13150  59363  33617   36646  70428
4     26404  93716  13150  29174   59363  70645
5     39160  29174  10011  60198   70645  93716
6     60198  33617  24408  70428   92928  44045
7     43835  60198  39160  80828   39160  15861
8     2481   2481   70645  15861   13150  33617
9     80750  70428  48655  84596   9219   7808
10    4587   43835  84165  103645  85319  11749

Table 3. Top 10 authors with the highest genius levels in the discipline of Management.

Author  GL      NP  NLP  NCP  RLP     POP
13150   5.1616  43  21   22   0.4884  3.4444
29174   4.0061  35  1    34   0.0286  3.4444
59363   3.9838  33  25   8    0.7576  3.2222
26404   3.7922  31  3    28   0.0968  3.3333
39160   3.5888  32  16   16   0.5000  3.6667
60198   3.5415  34  1    33   0.0294  4.1111
43835   3.4706  33  13   20   0.3939  4.3333
2481    3.3504  34  7    27   0.2059  4.4444
80750   3.3428  26  4    22   0.1538  2.1111
4587    3.2610  22  2    20   0.0909  2.2222

The result and interpretation indicate that the genius level considering the popularity effect can
be more suitable for evaluating the author’s intrinsic ability. The genius level can be usefully applied to tasks such as choosing journal article reviewers and recommending researchers for collaboration, which require expertise in the research area. Nevertheless, we can see some relation between the genius level and the number of papers (cor(GL, NP) = 0.4731) in Table 3. The authors with high genius levels tend to produce many papers. Interestingly, the top 10 authors with high genius levels have diverse numbers of leading-authored papers, suggesting a minor connection between the number of leading-authored papers and the genius level (cor(GL, NLP) = 0.2277). This result may suggest that an author with a high genius level may not insist on being a leading author and may appreciate participating in the scientific collaboration as a co-author.

Table 4. The correlation coefficient matrix for the five influence measures, GL, NP, NLP, NCP, and POP in the discipline of Management.

Measure  GL      NP      NLP     NCP     POP
GL       1.0000  0.4731  0.2277  0.4548  0.3512
NP       0.4731  1.0000  0.6033  0.8922  0.8085
NLP      0.2277  0.6033  1.0000  0.1780  0.3820
NCP      0.4548  0.8922  0.1780  1.0000  0.7813
POP      0.3512  0.8085  0.3820  0.7813  1.0000

5 Concluding Remark
We propose the leading author model, which accounts for the popularity effect of the dynamic collaboration system and the genius levels of authors. We avoid using the standard network structure as it does not consider the multi-author relationships. Non-homogeneous authorship among authors of a paper is considered, as opposed to the hypergraph and simplicial complex structures. The presented statistical estimation algorithm turns out to be successful in estimating the genius levels of authors and the leading authors of papers as well as the size of the popularity effect. The presented real data analysis highlights the existence of the positive popularity effect in the two disciplines of the scientific collaboration system. The influence levels of the authors are studied in the field of Management, suggesting that the genius level can be useful in finding effective influential authors in the system. It is worth noting that the genius levels are estimated under the reasonable model assumptions. Our proposed model can be widely applicable to other collaborative systems for analyzing the popularity effect and the influence of system members. Finally, a useful future direction for our model could be a study of hyperparameter sensitivity, which is outside the scope of the paper. Also, the popularity measure can be improved by using the affiliation and quality of papers, depending on the data availability.

Acknowledgments. This work was supported by Academia Sinica (Taiwan) Thematic Project grant number AS-TP-109-M07, the Ministry of Science and Technology (Taiwan) grant numbers 107-2118-M-001-011-MY3 and 109-2321-B-001-013, and the National Research Foundation (Korea) grant number 2021R1G1A1094103.
References

1. Abbasi, A., Chung, K.S.K., Hossain, L.: Egocentric analysis of co-authorship network structure, position and performance. Inf. Process. Manag. 48(4), 671–679 (2012)
2. Barabasi, A.-L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
3. Bianconi, G., Barabasi, A.-L.: Competition and multiscaling in evolving networks. Europhys. Lett. 54(4), 436 (2001)
4. Bos, J.: Numerical optimization of the thickness distribution of three-dimensional structures with respect to their structural acoustic properties. Struct. Multidiscip. Optim. 32(1), 12–30 (2006). https://doi.org/10.1007/s00158-005-0560-y
5. Dorogovtsev, S.N., Mendes, J.F.F.: Evolution of networks with aging of sites. Phys. Rev. E 62(2), 1842 (2012)
6. Fotouhi, B., Rabbat, M.G.: Degree correlation in scale-free graphs. Eur. Phys. J. B 86(12), 510 (2013)
7. Ghiasi, G., Harsh, M., Schiffauerova, A.: Inequality and collaboration patterns in Canadian nanotechnology: implications for pro-poor and gender-inclusive policy. Scientometrics 115(2), 785–815 (2018). https://doi.org/10.1007/s11192-018-2701-2
8. Gilks, W.R., Wild, P.: Adaptive rejection sampling for Gibbs sampling. J. Roy. Stat. Soc.: Ser. C (Appl. Stat.) 41(2), 337–348 (1992)
9. Jeong, H., Neda, Z., Barabasi, A.-L.: Measuring preferential attachment in evolving networks. Europhys. Lett. 61(4), 567 (2003)
10. Jung, H., Lee, J.-G., Kim, S.-H.: On the analysis of fitness change: fitness-popularity dynamic network model with varying fitness. J. Stat. Mech: Theory Exp. 2020(4), 043407 (2020)
11. Jung, H., Lee, J.-G., Lee, N., Kim, S.-H.: PTEM: a popularity-based topical expertise model for community question answering. Ann. Appl. Stat. 14(3), 1304–1325 (2020)
12. Louis, T.A.: Finding the observed information matrix when using the EM algorithm. J. Roy. Stat. Soc.: Ser. B (Methodol.) 44(2), 226–233 (1982)
13. Lu, H., Feng, Y.: A measure of authors’ centrality in co-authorship networks based on the distribution of collaborative relationships. Scientometrics 81(2), 499–511 (2009)
14. Lung, R.I., Gasko, N., Suciu, M.A.: A hypergraph model for representing scientific output. Scientometrics 117(3), 1361–1379 (2018). https://doi.org/10.1007/s11192-018-2908-2
15. Merton, R.K.: The Matthew effect in science: the reward and communication systems of science are considered. Science 159(3810), 56–63 (1968)
16. Metz, T., Jackle, S.: Patterns of publishing in political science journals: an overview of our profession using bibliographic data and a co-authorship network. PS Polit. Sci. Polit. 50(1), 157–165 (2017)
17. Perc, M.: The Matthew effect in empirical data. J. R. Soc. Interface 11(98), 20140178 (2014)
18. Rode, S.M., Pennisi, P.R.C., Beaini, T.L., Curi, J.P., Cardoso, S.V., Paranhos, L.R.: Authorship, plagiarism, and copyright transfer in the scientific universe. Clinics 74, 1312 (2019)
19. Roy, S., Ravindran, B.: Measuring network centrality using hypergraphs. In: Proceedings of the Second ACM IKDD Conference on Data Sciences, pp. 59–68 (2015)
20. Wang, J.-W., Rong, L.-L., Deng, Q.-H., Zhang, J.-Y.: Evolving hypernetwork model. Eur. Phys. J. B 77(4), 493–498 (2010). https://doi.org/10.1140/epjb/e2010-00297-8
Factoring Small World Networks

Jerry Scripps
Grand Valley State University, Allendale, USA
Abstract. Small World networks, as defined by Watts and Strogatz, have a mixture of regular and random links. Inspired by Granovetter’s definition of a weak tie, a new metric is proposed that can be used to separate a network into regular and random sub-networks. It is shown that within certain constraints, a (modified) small world network can be factored with an accuracy of 100%. The metric is shown to uncover interesting insights and factored networks can be used in downstream applications such as community finding and link prediction.
1 Introduction
In his 1973 paper, Mark Granovetter [9] shows that weak ties can be very helpful in important life-changing events, such as looking for a new job or a new romantic relationship. Although he did not provide an objective method to identify weak ties, Granovetter gives us a heuristic: strong ties will have more common neighbors than weak ties. The proposed metric in this paper, vett, identifies strong and weak ties (links) based on this heuristic. The idea of strong and weak links can also be related to the Small World networks of Watts and Strogatz [14]. According to their work, many networks have the two properties of low average path length and high clustering coefficient. They suggested a growth model that starts with a regular network (with a high clustering coefficient) and adds random links (to achieve the low average path length). One can think of this as the union of regular and random networks (having the same nodes but different links). The strong and weak ties of Granovetter correspond to the regular and random links of Watts and Strogatz. Accordingly, one can expect Granovetter’s work to apply to Small World networks. Consider a social network of faculty, where two nodes in a strong link would probably have the same home department but two nodes in a weak link might be on the same university committee. The metric is called vett in recognition of Granovetter’s contribution. Making use of his heuristic and a modification of the Small World model, vett is designed to assign to links a value between 0 and 1, with 0.5 as a threshold between weak and strong links. It will be shown that under some circumstances, vett will identify the strong links with 100% precision. Under certain other circumstances, weak links can also be identified with 100% precision. The contributions of this paper include:

– a new metric, vett, that closely follows the heuristic definition in Granovetter’s paper
– guarantees that vett can identify both strong and weak links with 100% precision
– using vett to uncover interesting patterns in large networks and enrich the results of other network mining applications

After this introduction, there is a literature review. The definition of terms and the growth model follow that. Section 4 describes the metric and the algorithm to compute it, and gives mathematical support for its effectiveness. Supporting experiments follow, and concluding remarks follow that.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 438–450, 2022. https://doi.org/10.1007/978-3-030-93409-5_37
2 Related Work
Factoring a small world network into its regular and random subnetworks appears to be a novel idea. Numerous searches have not revealed any previous work. As the factoring is really a matter of assigning a type or tagging links to be strong or weak, the literature review will focus on link tagging. A natural way to tag links is to use known information about the nodes and their neighbors. Examples include gathering information from surveys [6], or transactions [10]. While this data is informative, unless it can be shown to be consistent with the reality of the actual relationships, using the graph data is intrinsically more consistent. Another way to tag links is by using network science techniques like forming communities [5]. However, since this does not rely specifically on a common neighbors approach, it will not be consistent with Granovetter’s definition. Both [11] and [2] rely on weighted graphs to predict tie strength, which is outside of the current work which focuses on unweighted graphs. Strong/weak links were defined in terms of the strong triadic closure rule [13]. This rule states that open triangles of two strong links cannot exist. This approach optimizes a different graph property than the present work; it will be shown in the experiments that the two approaches are not similar.
3 Definitions
A network $G = (V, E)$ is a system of nodes $V = \{v_1, \ldots, v_n\}$ which are connected to each other by links $E \subset V \times V$. An adjacency matrix $A = [a_{ij}]_{n \times n}$ is used to represent the links, where $a_{ij} = 1$ for every link $e_{ij} \in E$ between $v_i$ and $v_j$. This paper also uses a weight matrix $W = [w_{ij}]_{n \times n}$ that represents the link weights as calculated by vett. In this paper, $d_i$ represents the degree of $v_i$ and $c_{ij}$ represents the number of common neighbors of $v_i$ and $v_j$.

3.1 Small Worlds
The Small World model of Watts and Strogatz [14] (which will be referred to as SW1) defines a network as a union of a regular and a random network. The regular links provide tight clustering, as measured by the clustering coefficient, and the random links give it short traversal paths, as measured by the average geodesic path between all node pairs. SW1 networks generally reflect Granovetter’s vision of strong and weak links. The SW1 growth model does not enforce a high c (common neighbor) value for strong links, though. For example, consider the circular lattice where each node is connected to its 6 closest neighbors. Some of the links will have c = 4, while others will have c = 2. As the number of connections gets higher, the variability in c will also get larger.

Instead of having a regular lattice as the starting point for the strong ties, we define SW2 to create a number of communities with equal numbers of nodes and connect the nodes so that each one has a similar degree and each link has an approximate, minimum c. In SW2, as in SW1, random links (weak ties) are created between nodes in different communities based on a probability p. The random links are created by applying the probability p to each possible weak link (those between communities). Figure 1 shows the average path length and clustering coefficient plotted for networks of 20 communities of 20 nodes each, with each node having a degree of 15, and various values of p. Notice that the area that defines a Small World network (high clustering coefficient, low average path length) is similar to the original paper from [14], in the range 0.001 ≤ p ≤ 0.02.

Fig. 1. Example graph of clustering coefficient (dashed) plotted along with average geodesic path length (solid) in SW2 networks.

3.2 Thresholding Limitations
It is tempting to think that one could simply choose a threshold $\hat{c}$ and label any link with $c_{ij} > \hat{c}$ as strong and those with $c_{ij} \le \hat{c}$ as weak. In real data sets, though, this strategy is not optimal. Some parts of the network are denser than others. The chosen threshold might work for some parts of the network but not others. For a concrete example, consider the football data set [7], where each of the 115 nodes represents a US college football team and the links represent the games. Most of the teams belong to a conference (of about 6 to 13 teams). Teams mostly play other teams in their conference (strong links) but they also play a few teams from other conferences (weak links). Figure 2 contains a cross-tabulation of the links for the football network by common neighbors and link type (within or between conferences). This tabulation appears to be reasonable, with many of the (strong) within links having high values of c and the (weak) between links having low values of c. However, even with this nearly ideal SW2 data set, there is no threshold that would neatly separate the strong from the weak links. The best threshold, $\hat{c} = 2$, assigns 5 strong links as weak and 4 weak links as strong.

cn  within  between
0   0       83
1   0       46
2   5       20
3   8       3
4   64      1
5   108     0
6   115     0
7   94      0
8   10      0
9   0       0

Fig. 2. Common neighbors for strong and weak links in football

3.3 Problem Definition

The problem proposed by this paper is to factor a Small World network G into its two subnetworks, the regular ($G_s$) and the random ($G_w$). Specifically, the goal is to formulate a function f: $f(G) \to \{G_s, G_w\}$ where $G = (V, E)$, $G_s = (V, E_s)$ and $G_w = (V, E_w)$. $E_s$ is the set of links that are identified as strong and $E_w$ is the set of links identified as weak. The function f that is proposed is a metric $vett(G) \to W$ that assigns weights, $0 < w_{ij} < 1$, to the links such that:

– $e_{ij} \in E_s$ if $w_{ij} > 0.5$
– $e_{ij} \in E_w$ if $w_{ij} \le 0.5$
The name vett was chosen in recognition of Dr. Granovetter, and the metric is designed in the spirit of his definition using common neighbors. The metric vett considers all neighbors, similar to the way Jaccard does. Jaccard is a normalization of c: $J_{ij} = |N(v_i) \cap N(v_j)| / |N(v_i) \cup N(v_j)|$, where $N(v_i)$ is the neighborhood of $v_i$. Note that the numerator is just the number of common neighbors c, while the denominator is c plus the number of non-common neighbors of $v_i$ and $v_j$. Unlike c, Jaccard recognizes the differences in density. However, it is not adequate as a metric that identifies strong and weak links, because it does not actually take into consideration the strength of other links.
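The limitation of a single common-neighbor threshold, discussed in Sect. 3.2, can be checked directly from the counts in Fig. 2. The Python sketch below transcribes that cross-tabulation and counts the misassigned links for each candidate threshold; the function names are hypothetical.

```python
# Counts transcribed from Fig. 2: common neighbors -> (within, between).
FOOTBALL = {
    0: (0, 83), 1: (0, 46), 2: (5, 20), 3: (8, 3), 4: (64, 1),
    5: (108, 0), 6: (115, 0), 7: (94, 0), 8: (10, 0), 9: (0, 0),
}


def threshold_errors(table, c_hat):
    """Return (strong links labeled weak, weak links labeled strong).

    A link is labeled strong when its common-neighbor count exceeds c_hat.
    """
    strong_as_weak = sum(within for c, (within, _) in table.items() if c <= c_hat)
    weak_as_strong = sum(between for c, (_, between) in table.items() if c > c_hat)
    return strong_as_weak, weak_as_strong


def best_threshold(table):
    """Threshold with the fewest total misassigned links (smallest on ties)."""
    return min(sorted(table), key=lambda c_hat: sum(threshold_errors(table, c_hat)))
```

On these counts the best threshold is 2, with 5 strong links labeled weak and 4 weak links labeled strong, matching the figures quoted in the text.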
4 Detection Process

This section introduces a new metric, vett, that is designed to identify weak and strong ties. Following the definition, it is shown that the metric is guaranteed within certain bounds.

4.1 Detection Using Undirected Weights
The definition for vett, shown below in Eq. 1, is based on Jaccard except that it uses weights instead of just the counts of common and non-common neighbors. Using weights forced two decisions. First, in Jaccard the common neighbors of $v_i$ and $v_j$ are simply counted as the number of nodes that are connected to both $v_i$ and $v_j$. Using the weights in vett, there are two links for each common neighbor. There are many ways to combine the two links in a formula, such as using the average, adding the values, or multiplying them. In this paper the choice was made to use multiplication. The other decision was to use the weight of the link itself ($w_{ij}$), as opposed to not using it. Both of these decisions were made for two reasons: 1) they allowed for better guarantees of identification and 2) they performed better in experiments of predicting strong and weak links.

$$w_{ij} = \frac{w_{ij} + \sum w_{ik} w_{jk} a_{ij}}{w_{ij} + \sum \left(w_{ik} w_{jk} a_{ij} + w_{ik}(1 - a_{jk}) + w_{jk}(1 - a_{ik})\right)} \tag{1}$$

The symbol $\sum$ means $\sum_{k=1}^{n}$. Notice that both the numerator and denominator have the weight of the link itself plus the sum of the products of the links to the common neighbors: $w_{ij} + \sum w_{ik} w_{jk} a_{ij}$ ($a_{ij}$ forces $w = 0$ for non-links). The denominator also adds in the link weights to the non-common neighbors. Because weights cannot become negative and the denominator must always be at least as large as the numerator, $\forall i, j: 0 \le w_{ij} \le 1$. Links with no common neighbors will approach 0, while those with no links to non-common neighbors will approach 1. Calculating W is iterative, initializing all of the weights as $w_{ij} = a_{ij}$. The process stops when the weight values have converged. Weights for any node pair that does not have a link are always zero.

4.2 Guarantee of Detecting Strong Links in SW2
Recall that in SW2, nodes and links are placed into communities before the random links are added. To guarantee that vett will identify all strong links, it will be shown that $w_{ij} > 0.5$ when $e_{ij}$ is a strong link in an SW2 network under the following conditions: 1) the analysis is prior to adding the random links, 2) all nodes have the same degree, 3) all links have the same common neighbor value (c) and 4) c is sufficiently large. With these assumptions, the weights for links should all be equal. Substituting w for any $w_{ij}$ and using d and c for the universal degree and common neighbor values simplifies Eq. 1:

$$w = \frac{w + cw^2}{w + cw^2 + 2(d - c - 1)w}$$

Performing some algebra leads to $cw^2 + (2d - 3c - 1)w - 1 = 0$, which is a quadratic equation. Using the quadratic formula, two possible solutions for w can be found. For values of $w > 0.5$ (the indication of a strong tie):

$$\frac{-(2d - 3c - 1) \pm \sqrt{(2d - 3c - 1)^2 + 4c}}{2c} > 0.5 \tag{2}$$

The + root of Inequality 2, after some algebra, leads to:

$$c > \frac{2d - 5}{3} \tag{3}$$

The − (negative) root of Inequality 2, after some algebra, becomes a quadratic:

$$-8d^2 + 10dc + 4d - 13c^2 - 9c - 2 > 0$$

It can be shown that finding a solution for c using the quadratic formula leads to answers that are imaginary numbers. Therefore, Inequality 3 is the only solution for which c can be calculated given d. Note that this implies that for our metric to effectively identify strong ties, the common neighbors of the initial communities must be greater than 2/3 of the degree, for large values of d. The first assumption above, that there are no random links, will be relaxed in Sect. 4.4.

4.3 Guarantee of Detecting Weak Links in SW2
It will be shown in this section that it is very likely that $w_{ij} < 0.5$ when $e_{ij}$ is a random link in an SW2 network, for reasonably small values of p. The random links are placed between the communities after the communities are created as described above. The question to be answered here is: will vett be able to identify these weak links? Remembering that common neighbors is the driving force behind the metric, it will be shown that there is a low probability that c for a random link will become large enough for vett to misidentify the link as a strong tie.

There are two ways for a random link $e_{ij}$ to have a common neighbor, say node $v_k$. The first is for $e_{ik}$ to be a within-community (strong) link and for $e_{jk}$ to be another random link. This forms a triangle of a weak, another weak and a strong link (WWS). The other way is for all three links to be weak (WWW). WWS is considered first.

Begin with a simple example placing the first random link. Since this is the first one, there are no other links between communities, so c should be zero. With zero common neighbors, the value of vett will approach zero. To expand this to the general case (placing subsequent links), a way is needed to calculate the probability of the number of common neighbors for a random link. Considering a random link $e_{ij}$, to have a common neighbor using WWS it would have to have another random link connecting $v_i$ to one of $v_j$’s strong neighbors or a random link connecting $v_j$ to one of $v_i$’s strong neighbors. Using the binomial distribution, the probability of having exactly k common neighbors is:

$$p(c = k) = \binom{n^*}{k} p^k (1 - p)^{n^* - k}$$

where $n^* = 2(d - 1)$ and d is the degree of each node before adding the random links. Figure 3 shows the probability of the common neighbors metric being less than the threshold (Inequality 3). The chart is based on networks with n = 1000 nodes, varying the degree from 10 to 50 and using different values for the Small World probability of placing a weak link. Recall that with our definition of Small World networks, the area for the small world effect is between .001 and .01. Changes to n do not affect the shape of the curve.

Fig. 3. Probability of common neighbors metric being below SW2 threshold ($\frac{2d-5}{3}$) for triangles involving a strong link

WWW is similar but leads to a different outcome. Given a weak link $e_{ij}$, to calculate the probability for the single common neighbor $v_k$, it is necessary to multiply the probabilities of links $e_{ik}$ and $e_{jk}$, where $v_i$, $v_j$ and $v_k$ are all in different communities. This also is a binomial distribution with different parameters:

$$p(c = k) = \binom{n^*}{k} q^k (1 - q)^{n^* - k}$$

where $n^*$ is $n - 2d$ and $q = p^2$. The plot in Fig. 4 shows that with a low value of p, the probability is very high that the common neighbors will be less than the threshold. As p increases it becomes less likely. It is hardly noticeable for networks with n < 10,000 nodes. However, it becomes increasingly more pronounced as n gets larger, with n > 90,000. In this analysis, the number of communities is kept constant at 100 and the degree constant at 10. Changing the number of communities could make a slight change, while increasing the degree would also increase the threshold c, which should improve the likelihood as well. The point of this section is that optimum results are guaranteed for some large networks and most smaller ones. Experimental results in the next sections support this claim but go further to show that good results are possible for many Small World networks other than SW2.

Fig. 4. Probability of common neighbors metric being below SW2 threshold for triangles involving 3 weak links

4.4 Effect of Random Links on Strong Links in SW2
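As background for this section, the binomial expressions of Sect. 4.3 that underlie Figs. 3 and 4 can be evaluated numerically. The Python sketch below is only an illustration: the function names are hypothetical, "below the threshold" is interpreted as a common-neighbor count no larger than the value in Inequality 3, and the example arguments are not the settings used to produce the figures.

```python
import math


def binom_cdf(k_max, n, p):
    """P(X <= k_max) for X ~ Binomial(n, p)."""
    k_max = min(k_max, n)
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_max + 1)
    )


def sw2_threshold(d):
    """Common-neighbor threshold (2d - 5) / 3 quoted in Inequality 3."""
    return (2 * d - 5) / 3


def prob_wws_not_above(d, p):
    """WWS case: n* = 2(d - 1) trials, each succeeding with probability p."""
    return binom_cdf(math.floor(sw2_threshold(d)), 2 * (d - 1), p)


def prob_www_not_above(n, d, p):
    """WWW case: n* = n - 2d trials, each succeeding with probability q = p^2."""
    return binom_cdf(math.floor(sw2_threshold(d)), n - 2 * d, p * p)
```

For example, `prob_wws_not_above(10, 0.01)` and `prob_www_not_above(1000, 10, 0.01)` are both very close to 1, in line with the claim that random links rarely gather enough common neighbors when p is small.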
In the section above on detecting strong links, the analysis was based on a network of only strong links, before adding in random links. It is natural to wonder if the random links will influence the analysis. The answer is that if there are just a few weak links that are low weighted, they should not have a profound effect on the analysis. If these random links are indeed weak, that is, they have few if any common neighbors with the node in question, then they should have a low weight. However, as p increases, c for the random links will increase too. One can imagine a network where p is large enough to make strong and weak links indistinguishable from each other because they all have large values of c. Looking at Figs. 3 and 4, at the point where this happens p would become large enough for the network to no longer be considered a Small World network.

4.5 Complexity
Assuming that the network is represented using an edge list, vett needs to make I iterations before it converges. In each iteration, the weight for each node is recalculated by tracing the neighbors of each of its neighbors. Using an adjacency list, the complexity is $O(I \times n \times d^2)$.
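The neighbor-of-neighbor traversal that dominates this cost can be illustrated with a short sketch. It is not vett itself (the weight update is not reproduced here); it only counts common neighbors per link over adjacency sets, which is an assumed representation, to show where the $n \times d^2$ factor of one pass comes from.

```python
def common_neighbor_counts(adj):
    """Count common neighbors for every undirected link.

    adj maps each node to the set of its neighbors. The outer loop visits
    n nodes, the inner loop about d neighbors, and each set intersection
    touches about d entries, giving roughly n * d^2 work per pass.
    """
    counts = {}
    for u, neighbors in adj.items():
        for v in neighbors:
            if u < v:  # handle each undirected link once
                counts[(u, v)] = len(neighbors & adj[v])
    return counts


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_counts = common_neighbor_counts(toy)
```

Repeating such a pass for I iterations gives the stated overall bound.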
5 Experiments
To test the effectiveness of vett, it will be shown that by using it (a) patterns emerge naturally, exposing the regular and random networks, (b) identification of strong and weak links is effective, and (c) algorithms can make use of the hidden knowledge revealed by weak links.

5.1 Data Sets, Algorithms and Methods
Example data sets based on real networks were used to give a visual depiction of separating strong links from weak links. The data files are listed in Fig. 5. All are undirected. The first three are intentionally small so that the reader can see the effects in a two-dimensional plot (in the first experiments). Notice that all appear to be categorized as Small World networks, having a high clustering coefficient and a low path length. The last two data sets are used in the later experiments, to do the analysis of neighbor characteristics. The facebook set was assembled by the authors from Facebook data prior to 2007; it contains 2556 users with the friendship graph and selected attributes. The bright set is a processed version of BrightKite [3]. It was simplified by associating a single location with each user, selecting the location that was used (checked in) the most.

dataset     n        links     clust   Len
football    115      613       0.40    2.49
jazz        198      2 742     0.63    2.22
revere      254      9 706     0.94    1.69
facebook    2 556    54 482    0.21    2.59
bright      58 228   214 078   0.21    4.92

Fig. 5. List of datasets
To show the effectiveness for community finding and for aiding other community finding algorithms, synthetic networks were created. The synthetic networks were generated by the LFR benchmark [1]. In all experiments, for each setting, 10 networks were generated with the results averaged. Networks were generated using the parameters: the number of nodes (N), the average degree (k), the maximum degree (maxk) and the mix (mu). The mix is the percent of inter-community links. Networks were created by varying the number of nodes N.

For the tests using community finding algorithms, igraph [4] was used. Of the algorithms available, we chose fast-greedy, label propagation and modularity-multilevel optimization. In the tests comparing vett with the community finding algorithms, ten networks were generated according to the parameters. Then, separately, communities were found using the algorithms identified above, and links were identified as strong or weak using vett. Accuracy was calculated as the number of correct predictions (strong or weak) divided by the total number of links (Figs. 6 and 7).

Fig. 6. Football data set (panels: strong links, all links)

Fig. 7. Jazz data set

5.2 Exposing the Regular Network

This section simply displays the network plot of each of the data sets next to the network plot with the weak links removed. Without the weak links the regular network of strong links will emerge. Care was taken to group the nodes to reveal the groups within.

Fig. 8. Revere data set

The football set [7] is a good example of an SW2 Small World network. Note that all of the intra-conference games are labeled as strong links and all of the inter-conference games are labeled as weak links (see Fig. 6). In the Jazz set [8] the nodes represent 198 early twentieth century jazz bands. The links indicate that a musician played in both bands. This is a very dense data set, so the communities are not visually clear without removing the weak links.
The revere set [12] is a social network of early patriots of the US Revolutionary War. The 254 nodes represent the patriots that were at the meetings, and links are drawn between any two patriots that belonged to the same organization. Because so many patriots belonged to more than one organization,
the network is very dense. The plot on the right in Fig. 8 shows the undirected strong links, which form just seven cliques representing the seven organizations.

5.3 Effectively Identifying Strong and Weak Links
Figure 9 compares vett to the greedy algorithm from [13]. That algorithm identifies strong/weak links by minimizing violations of the strong triadic closure rule. Experiments were carried out using a network generator that creates cliques and then connects the nodes between the cliques with a probability of p. As p increases, the random links become denser, so greedy labels more as weak, leading to better accuracy. Within the region of Small World networks, vett significantly outperforms greedy. The results from the LFR benchmark were similar and so are not included.

Fig. 9. Comparison of vett with the greedy STC metric

In the next set of experiments vett is compared to the three community finding algorithms (fast, multi, label) described above. For the purpose of these experiments, identifying strong links and finding communities are compatible goals. Figure 10 shows networks where the parameter N varies from 100 to 2000 nodes (k = 10 and mu = 0.3). Once again, using vett results in very high accuracy. The accuracy starts to drop when N > 1700. This is more a result of increased density than of the sheer number of nodes.

5.4 Evidence for Effectiveness of Vett
Identifying strong and weak ties can reveal patterns to help describe nodes. The bright data is examined to reveal differences in the geographic distances of users' strong and weak contacts. Analysis of the data shows that 68% of the nodes are nearer to their close (strong) friends than to their weak friends. This is supported by the social science concept of propinquity, that friendships are more likely between geographically close pairs.

Fig. 10. vett compared to community finding algorithms.

Another example of different patterns associated with strong and weak links comes from the facebook (FB) set. In this set, there are a few known attributes but only the gender will be used. The data contains only male, female and unspecified genders; using binary gender assignment is a characteristic of the data.
Two users were chosen to contrast two different patterns. The first user, 1337, is male and has 70 friends. Of those, 34 identify as male, 32 identify as female and 4 had no gender. Most of his close (strong link) friends are male while most of his weak link friends are female (or not specified). The egocentric graph is seen in Fig. 11, with all of the links on the left side and just the strong links on the right. Node 1337 is in the approximate center. The nodes have a black center for males and a white center for females. Note that on the right, it appears that the strong links include only the 27 nodes of the big group. There should also be strong links between node 1337 and the group of 10 in the lower left. Those links were not identified as strong because for this ego graph, vett only used the data in the ego graph; using the whole data set identifies the links to those 10 as strong (Fig. 11).

Fig. 11. Facebook ego graphs - node 1337 (panels: all links, strong links)

The second user, 2143, also male, has the opposite pattern; that is, most (all) of his close friends are female, while his weak link friends are mostly male. The graphs can be seen in Fig. 12. It is tempting to try to explain the reason behind these two patterns. For example, it could be that close friends are other students in the same major and that node 1337 is in a male-dominated major while node 2143 is in a female-dominated major. Or it could just be the student's preferences for friends. With more information we would be able to draw more substantive conclusions.

Fig. 12. Facebook ego graphs - node 2143

The reader may be interested to know that the second pattern (more close friends of the opposite gender) appears much more often in the data than the first pattern. There are 2513 nodes that fit one of these two patterns (the rest of the nodes did not specify their gender). Of the males, 58% had more strong links with females than with males.
Similarly, 65% of the females had more male strong link friends than female ones. While it is not known (from these incomplete data) why there appears to be a preference for more close friends of the opposite gender, the important point is that the patterns are there (Fig. 12).
Fig. 13. Comparison of algorithms.
6 Conclusion
This paper introduces a new metric, vett, for identifying strong and weak links in networks. It is shown to have very high accuracy for specific networks in the spirit of the Watts and Strogatz small world model. In a modification of that model, the accuracy is at or near 100% when the common neighbors value for links within networks is c = (2d + 5)/3 and 0.001 < p < 0.2. Even for networks that are not within those ranges, experiments have shown that it can be very effective. The experiments show that vett can be used to find communities or as a preprocessing step to finding communities. Other uses seem possible and will be the subject of ongoing research.
References

1. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E 78, 046110 (2008)
2. Chen, J., Safro, I.: A measure of the connection strengths between graph vertices with applications. arXiv (2009)
3. Cho, E., Myers, S.A., Leskovec, J.: Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2011)
4. Csardi, G., Nepusz, T.: The igraph software package for complex network research. InterJ. Complex Syst. 1695, 1–9 (2006)
5. De Meo, P., Ferrara, E., Fiumara, G., Provetti, A.: On Facebook, most ties are weak. Commun. ACM 57(11), 78–84 (2014)
6. Eagle, N., Pentland, A.S., Lazer, D.: Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. 106(36), 15274–15278 (2009)
7. Girvan, M., Newman, M.E.J.: American college football. Proc. Natl. Acad. Sci. 99, 7821–7826 (2002)
8. Gleiser, P., Danon, L.: Community structure in jazz. Adv. Complex Syst. 6, 565 (2003). http://deim.urv.cat/~aarenas/data/welcome.htm
9. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83, 1420–1443 (1978)
10. Kahanda, I., Neville, J.: Using transactional information to predict link strength in online social networks. In: ICWSM, pp. 1–10 (2009)
11. Liu, Z., Li, H., Wang, C.: NEW: a generic learning model for tie strength prediction in networks. Neurocomputing 406, 282–292 (2020)
12. Paul Revere's Ride (1994). http://kieranhealy.org/blog/archives/2013/06/09/using-metadata-to-find-paul-revere/
13. Sintos, S., Tsaparas, P.: Using strong triadic closure to characterize ties in social networks. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2014)
14. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393, 440–442 (1998)
Limitations of Chung Lu Random Graph Generation

Christopher Brissette and George Slota

Rensselaer Polytechnic Institute, Troy, NY 12180, USA
{brissc,slotag}@rpi.edu

Abstract. Random graph models play a central role in network analysis. The Chung-Lu model, which connects nodes based on their expected degrees, is of particular interest. It is widely used to generate null-graph models with expected degree sequences. In addition, these attachment probabilities implicitly define network measures such as modularity. Despite its popularity, practical methods for generating instances of Chung-Lu model-based graphs do relatively poor jobs in terms of accurately realizing many degree sequences. We perform a theoretical analysis of the Chung-Lu random graph model in order to understand this discrepancy. We approximate the expected output of a generated Chung-Lu random graph model with a linear system and use properties of this system to predict distribution errors. We provide bounds on the maximum proportion of nodes with a given degree that can be reliably produced by the model for both general and non-increasing distributions. We additionally provide an explicit inverse of our linear system and, in cases where the inverse can provide a valid solution, we introduce a simple method for improving the accuracy of Chung-Lu graph generation. Our analysis serves as an analytic tool for determining the accuracy of Chung-Lu random graph generation as well as correcting errors under certain conditions.
Keywords: Graph theory · Random graphs · Graph generation

1 Introduction
Say we wish to generate a random simple graph $G = (V, E)$ with a degree distribution $y = \{N_1, N_2, \cdots, N_m\}$ where $N_k$ represents the number of nodes with degree k. Here, simple means that there are no self-loops or multi-edges. This is a problem that arises in many network science applications, most notably for the generation of null-models used for basic graph analytics [12]. Generating such simple networks exactly using the explicit configuration model or the erased configuration model is computationally expensive for even moderately large networks [7]. This is in part because the explicit configuration model is difficult to parallelize, and partly because correcting self-loops and multi-edges in the erased configuration model can be cumbersome. As such, we rely on probabilistic methods for large-scale graph generation that only match y in expectation.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 451–462, 2022. https://doi.org/10.1007/978-3-030-93409-5_38

The Chung-Lu random graph model [4] is one such widely-used probabilistic model. This model pre-assigns to each node $v_i \in V(G)$ a weight $w_i$ corresponding to the degree we wish for the node to have. It then connects all nodes $v_i, v_j$ pairwise with the probability $p_{ij} = \frac{w_i w_j}{\sum_k w_k}$. It is known that the degree of each node in the output graph will match its pre-assigned weight in expectation. There are a number of ways that generating such graphs can be done computationally. Some methods generate loops and multi-graphs, while others generate simple graphs. We focus on what is sometimes called the Bernoulli Method for generating Chung-Lu graphs [16], as it is amenable to the edge-skipping technique [11] that allows linear work complexity and near-constant parallel time for scalable implementations [1,8,14]. In this method, we implicitly consider all possible edges between pairs of distinct nodes and generate edge $(v_i, v_j)$ with $i \neq j$ according to the probability $p_{ij}$. This generates a simple graph with degree sequence $\tilde{y}$ where $E[\tilde{y}] = y$. While we focus our analysis on this specific variation, as it is the one most likely to be used in practice, other methods that generate multi-edges and/or loops have many of the same issues that we discuss in this paper.

The Chung-Lu model, though popular and theoretically sound under the tame condition that $w_i w_j < \sum_{k=1}^{m} w_k$ for all $v_i, v_j \in V$, can produce degree distributions drastically different from the desired expectation in practical settings. This has been widely noted and addressed in the literature [2,3,5,8,9,13,16]. To theoretically motivate why this is the case, consider generating a graph that has many nodes of degree two. While Chung-Lu guarantees that the expected degree of each of these nodes will be two, it makes no other guarantees regarding the probability mass function of these degrees. Indeed, in practice, nodes with expected degree two often have degree 0, 1, 3, and beyond.
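The degree-two example can be reproduced with a small sketch of the Bernoulli method. This is an assumption-laden illustration rather than the authors' implementation: it loops over all pairs naively instead of using edge skipping, clips $p_{ij}$ at 1, and the function names and seed are hypothetical.

```python
import random
from collections import Counter


def chung_lu_bernoulli(weights, seed=None):
    """Generate a simple graph: each pair i < j is linked with w_i * w_j / sum(w)."""
    rng = random.Random(seed)
    total = sum(weights)
    edges = []
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):  # i != j, each pair considered once
            p_ij = min(1.0, weights[i] * weights[j] / total)
            if rng.random() < p_ij:
                edges.append((i, j))
    return edges


def degree_histogram(num_nodes, edges):
    """Map each realized degree to the number of nodes that have it."""
    degree = [0] * num_nodes
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    return Counter(degree)


# Every node is asked to have expected degree two.
weights = [2] * 500
edges = chung_lu_bernoulli(weights, seed=1)
histogram = degree_histogram(len(weights), edges)
```

Inspecting `histogram` for such an input typically shows nodes spread over several degrees around two, even though the mean degree stays close to two.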
This variability is particularly challenging when Chung-Lu generation is utilized as a subroutine for more complex graph generation, such as when generating graphs that also match a clustering coefficient distribution (e.g., the BTER model) [10] or a community size distribution for community detection benchmarking [14,15]. In these and other instances, minimizing error in the degree distribution is critical, as this error can propagate through the rest of the generation stages. In Fig. 1 we see an example of the observed error when generating some graphs. As can be seen, the output of Chung-Lu in both cases underestimates the number of degree one nodes, and accrues additional error from other low degree families as well. This suggests that instead of strictly caring about the expected degree of each node in Chung-Lu generation, as is generally done, we should additionally consider deeper statistical properties of the model in application.

To better understand the output of Chung-Lu, consider grouping all nodes by expected degree. That is, take degree families $d_k = \{v_i \in V : w_i = k\}$ and consider connections between them. From the point of view of a single node $v_j \in V$ with expected degree $w_j$, the number of connections it has to each degree family $d_k$ is binomially distributed with mean $\frac{k w_j}{\sum_i w_i} |d_k|$. Therefore the degree distribution of the nodes in $d_{w_j}$ is the sum of m independent binomial random processes where m is the maximum expected degree of the graph. This allows us to go beyond simply guaranteeing the mean of each degree family, and instead predict the probability mass function for the degrees of each of these families, and by extension predict degree distribution errors.

Fig. 1. Distribution error of Chung-Lu. We consider the degree classes between one and nine for two different power law distributions. On the left is a power-law distribution with exponent β = 1.0 and on the right is a power-law distribution with exponent β = 2.0. In the top two plots, crosses represent the input distribution and x's represent the average distribution for 20 instances of Chung-Lu graphs given the power law distribution as input. We can see that the Chung-Lu generated graphs drastically under-represent degree one nodes. This is a phenomenon that commonly occurs in application and can greatly affect generation accuracy.

Since the degree distribution of each degree family $d_k$ in our graph is binomially distributed, we may apply a further approximation. Because the limiting case of the binomial distribution is the Poisson distribution, we approximate the number of connections between nodes in a given degree family with all other nodes as a sum of Poisson distributions, which is again Poisson. We note that oftentimes a desired degree distribution will be such that certain degree classes will not have the number of nodes required for this approximation to have guaranteed accuracy. In fact, power-law degree distributions will in general have degree families $d_k$ where $k \approx m$ such that $|d_k| \approx 1$. However, we also note that this means the node-wise error contributed by those families is relatively small, so we are willing to sacrifice some accuracy in exchange for a cleaner description. Say that $X_{ij}$ is the Poisson distribution representing the degree of each node in $d_i$ if $d_i$ only connected to nodes in $d_j$. Additionally, take the mean of this Poisson distribution to be $\gamma_{ij}$. Because the means of independent Poisson distributions are additive, we have the following linear system, describing the means of each distribution.

$$\begin{bmatrix} \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1m} \\ \gamma_{21} & \gamma_{22} & \cdots & \gamma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \gamma_{m1} & \gamma_{m2} & \cdots & \gamma_{mm} \end{bmatrix} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_m \end{bmatrix} \tag{1}$$
This matrix provides additional rationale for our Poisson approximation. Since we assumed the distributions were Poisson, we may now add means of Poisson distributions directly as opposed to computing with more complex independent binomial distributions. In the case of the Chung-Lu model, each $\gamma_{ij} = \frac{w_i w_j}{\sum_k w_k}$. This, perhaps as expected, gives the right hand means of $\mu_k = k$. This means that the degrees within each degree family $d_k$ should be approximately Poisson distributed about k. Before moving on we note that a similar analysis can be done for any connection probabilities. While we are focusing on Chung-Lu probabilities, this model also describes the output degree sequence for any set of chosen $p_{ij}$ between degree families, albeit with potential changes to the means $\mu_i$.

Given the description offered by Eq. 1 we now have the tools to estimate the output of Chung-Lu through a simple linear system. Consider an input degree distribution $y = [N_1, N_2, \cdots, N_m]^T$ as a vector in $\mathbb{R}^m$ with the number of nodes being $N = \sum_{i=1}^{m} N_i$. Additionally take poiss(k) to be the probability mass function of the Poisson distribution with mean k. We can calculate the expected output $\tilde{y}$ of this as follows.

$$Qy = \begin{bmatrix} | & | & & | \\ \mathrm{poiss}(1) & \mathrm{poiss}(2) & \cdots & \mathrm{poiss}(m) \\ | & | & & | \end{bmatrix} \begin{bmatrix} N_1 \\ N_2 \\ \vdots \\ N_m \end{bmatrix} = \tilde{y} \tag{2}$$

This construction works in the following way. Each column of our matrix represents the probability mass function of degrees within each degree family. Taking the inner product of a row r of this matrix with a vector of degree family sizes amounts to adding together the expected number of degree r nodes produced by each degree family under the Chung-Lu algorithm. So, by considering the action of the entire matrix, we are considering the action of Chung-Lu as a whole. Note here we are assuming poiss(k) is the full, discrete version of the Poisson distribution with mean k. This implies that the system in Eq. 2 maps $\mathbb{R}^m$ to an infinite dimensional space.
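The construction in Eq. 2 can be sketched numerically. Since the full matrix has infinitely many rows, the sketch below keeps only the rows for output degrees 0 through a chosen cutoff; that cutoff, the function names, and the example input $1000 \times k^{-2}$ are assumptions of this illustration.

```python
from math import exp, factorial


def poisson_pmf(mean, r):
    """Probability that a Poisson variable with the given mean equals r."""
    return mean**r * exp(-mean) / factorial(r)


def expected_output(y, max_degree):
    """Rows 0..max_degree of Q applied to y, where y[k-1] is the size of family k.

    Entry r of the result is the expected number of nodes that end up
    with degree r, summed over all degree families.
    """
    m = len(y)
    return [
        sum(poisson_pmf(k, r) * y[k - 1] for k in range(1, m + 1))
        for r in range(max_degree + 1)
    ]


# Hypothetical power-law input with exponent 2 over degrees 1..9.
y = [1000 * k**-2.0 for k in range(1, 10)]
predicted = expected_output(y, max_degree=9)
degree_one_ratio = predicted[1] / y[0]  # modeled share of requested degree-one nodes
```

Under this model the predicted number of degree-one nodes falls well short of the number requested, which is the effect the text attributes to Chung-Lu generation.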
An infinite dimensional output is not computationally useful. We therefore truncate the Poisson matrix Q to be square in $\mathbb{R}^{m \times m}$ by removing the first row corresponding to degree zero nodes, as well as everything below the mth row. We will denote this matrix by P. Our justification for this truncation is two-fold. One, we are inputting a degree distribution in $\mathbb{R}^m$, and we mainly care about error with regard to those output degrees between one and m. Two, making the matrix square allows us to invert the matrix, which will be useful for generating Chung-Lu graphs with more accurate degree sequences. Note that truncating Q to some dimension m amounts to ignoring nodes with degree zero as well as nodes with degree higher than m. If we wish to obtain error information for higher degrees as well, we can easily append zeros to the end of our input distribution and consider $P \in \mathbb{R}^{n \times n}$ where n > m and m is the maximum degree of our desired distribution. Then, for large enough n, our error is only ignoring nodes of degree zero. In a practical setting, these nodes would possibly be thrown out and ignored anyway. The rest of this paper discusses
properties of P, the limitations these properties suggest, and how the matrix can be used to improve the accuracy of Chung-Lu outputs in some cases.

1.1 Our Contributions
As noted, while Chung-Lu graph generation is a useful tool for many theoretical purposes and is used widely in fields such as social network analysis, it often does a poor job of approximating distributions at the ends. The specific issue considered in this paper is that Chung-Lu generated networks will often underrepresent low degree nodes. In Fig. 1, we can clearly see that actual Chung-Lu realizations may easily contain less than 60% of the desired number of degree one nodes. This can lead to a great deal of inaccuracy for distributions with particularly large numbers of low degree nodes, such as power-law distributions. In practice, this generally means that generated graphs will have many vertices of degree zero, so one way of resolving this issue is to connect these nodes to the graph in order to inflate the number of degree one nodes. Depending on the degree distribution, this can easily skew other degree classes without careful choice of where these nodes are connected. This may also require considerable computation. For this reason, it is far simpler for applications to throw away degree zero nodes. Other proposed methods might artificially inflate the input distribution in terms of degree one nodes so the output better matches the desired input [10], but this also presents similar challenges. For this reason, we suggest the matrix model referenced in the introduction. The standard input distribution for Chung-Lu is simply the desired output distribution y. We suggest a “shifted” Chung-Lu algorithm where, given a matrix model P for the output of the Chung-Lu algorithm, we take our desired output distribution y and solve for x = P−1 y. Then the input to a Chung-Lu graph generator is x as opposed to the desired output. This is particularly compelling since the matrix P−1 only depends on the maximum degree of our desired output distribution and once computed allows for drastic accuracy improvement at negligible algorithmic cost. 
While useful in certain special cases, we find that such an algorithm is not possible in general. We prove that the matrix P is invertible and show that many distributions do not have non-negative inverses. We investigate these cases and classify some instances in which an inverse distribution is guaranteed to have negative entries. Most interestingly, we provide tight bounds on the expected maximum number of nodes that may belong in each degree family for both non-increasing as well as general distributions. These bounds suggest that there exists a vast number of graphs that Chung-Lu generation is ill-equipped to generate.
2 Properties of the Matrix Model
From the introduction, we use the assumption that the degree distribution of each family is approximately Poisson distributed to form a matrix that will transform input distributions into approximate output distributions from the Chung-Lu model. Assume that our input distribution has degrees in $\mathbb{N}_m = \{1, \cdots, m\}$
and is represented by $x = [N_1, N_2, \cdots, N_m]^T$ where $N_k$ represents the number of nodes with expected degree k and $N = \sum_{k=1}^{m} N_k$. We can represent our matrix P in terms of the following factorization.

$$P = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & \frac{1}{2!} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{1}{m!} \end{bmatrix} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 2 & \cdots & m \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 2^{m-1} & \cdots & m^{m-1} \end{bmatrix} \begin{bmatrix} e^{-1} & 0 & \cdots & 0 \\ 0 & 2e^{-2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & me^{-m} \end{bmatrix} = AVB \tag{3}$$

When this factorization is multiplied out, we obtain exactly the P matrix discussed in the introduction. Note that realizing a Chung-Lu graph model amounts to computing Px for some pre-defined x. We instead look at the inverse problem of determining $x \in \mathbb{R}_+^m$ given $P \in \mathbb{R}^{m \times m}$ and desired output $y \in \mathbb{R}_+^m$. Here $\mathbb{R}_+^m$ is the element-wise positive region of $\mathbb{R}^m$. This amounts to solving the linear system $Px = y$. One may be tempted to simply invert this matrix using any number of computational methods, and this is reasonable for small m. However, given the factorization in Eq. 3, we have that $P = AVB$ with V a Vandermonde matrix. Due to the extremely poor conditioning of both A and V, using a computational method for inverting P is not advised. Fortunately A and B are diagonal, meaning they are easy to invert, so finding the inverse of P only requires finding an inverse to V. Again, we do not want to compute this using standard computational methods, since Vandermonde matrices are the textbook examples of nearly uninvertible matrices. Fortunately, our Vandermonde matrix has a special structure yielding a somewhat simple closed-form inverse given in [6]. It relates each entry in the matrix to associated binomial coefficients and Stirling numbers of the first kind. Explicitly, each entry is expressed as follows.

$$V^{-1}_{ij} = (-1)^{i+j} \sum_{k=\max(i,j)}^{n} \frac{1}{(k-1)!} \binom{k-1}{i-1} \left[ {k \atop j} \right] \tag{4}$$
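Equation 4 can be checked with exact rational arithmetic. The sketch below is an illustration under stated assumptions: it takes the upper summation limit n to be the matrix dimension m, reads the bracket as the unsigned Stirling number of the first kind, and uses hypothetical function names.

```python
from fractions import Fraction
from math import comb, factorial


def unsigned_stirling_first(m):
    """Table c[n][k] of unsigned Stirling numbers of the first kind, 0 <= k <= n <= m."""
    c = [[0] * (m + 1) for _ in range(m + 1)]
    c[0][0] = 1
    for n in range(1, m + 1):
        for k in range(1, n + 1):
            c[n][k] = c[n - 1][k - 1] + (n - 1) * c[n - 1][k]
    return c


def vandermonde(m):
    """V with entry (i, j) equal to j**(i - 1), for i, j = 1..m, as in Eq. 3."""
    return [[Fraction(j) ** (i - 1) for j in range(1, m + 1)] for i in range(1, m + 1)]


def vandermonde_inverse(m):
    """Closed-form inverse following Eq. 4, with n taken to be m."""
    c = unsigned_stirling_first(m)
    inverse = []
    for i in range(1, m + 1):
        row = []
        for j in range(1, m + 1):
            total = sum(
                Fraction(comb(k - 1, i - 1) * c[k][j], factorial(k - 1))
                for k in range(max(i, j), m + 1)
            )
            row.append((-1) ** (i + j) * total)
        inverse.append(row)
    return inverse


def matmul(a, b):
    """Plain matrix product for small square matrices of Fractions."""
    size = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(size)) for c in range(size)]
            for r in range(size)]
```

Because every entry is a `Fraction`, the product of `vandermonde(m)` and `vandermonde_inverse(m)` can be compared to the identity exactly, sidestepping the conditioning problems mentioned above.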
For distributions with entry-wise positive inverses we can now compute the input of Chung-Lu that will best approximate the desired output according to $P^{-1} y = x$. The actual implementation of this would look like the pseudocode given in Algorithm 1, where $\lfloor \cdot \rfloor$ represents element-wise rounding down to the nearest integer.

2.1 Not All Solutions are Positive
We now concern ourselves with cases where Algorithm 1 will fail. These cases occur exactly when $P^{-1} y$ has negative entries. To understand why this is the case, consider that $P^{-1} y$ represents a degree distribution. Negative entries in this vector are therefore meaningless as an input to the Chung-Lu algorithm. Matrix P has only positive real entries. This implies that for any element-wise positive vector x, Px is also positive. While this implies that any positive input will yield an approximately valid result, it does not exclude the possibility of vectors with negative entries also mapping into the positive region of $\mathbb{R}^m$ under the action of P. This means that we may not be able to use the output of $P^{-1} y = x$ as the input of Px, since x may contain negative elements. In Fig. 2, we can see what the action of P looks like on a sample of random vectors for $P \in \mathbb{R}^{4 \times 4}$. Notice how, as expected, it “squishes” the positive region into a small sliver.

Algorithm 1. ShiftedChungLu(y, d_max)
1: S ← ComputeStirlingMatrix(d_max)
2: A^{-1} ← ComputeDiagInverse(A)
3: B^{-1} ← ComputeDiagInverse(B)
4: V^{-1} ← ComputeVandermondeInverse(S, d_max)
5: x̃ ← ⌊B^{-1} V^{-1} A^{-1} y⌋
6: G ← GenerateChungLu(x̃)
7: return G
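The failure mode described here can be probed with a small sketch. It solves $Px = y$ through the factorization $P = AVB$, but it swaps in exact Gaussian elimination over fractions for the Vandermonde step instead of the closed-form inverse, purely to keep the example short. The function names and the `None` return for infeasible targets are assumptions, and the final graph-generation step is left out.

```python
from fractions import Fraction
from math import exp, factorial, floor


def solve_exact(matrix, rhs):
    """Gauss-Jordan elimination on Fractions; assumes the matrix is invertible."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [value / pivot_value for value in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def apply_model(x):
    """Forward model Px with P[i][k] = k**i * exp(-k) / i! for degrees 1..m."""
    m = len(x)
    return [sum(k**i * exp(-k) / factorial(i) * x[k - 1] for k in range(1, m + 1))
            for i in range(1, m + 1)]


def solve_input(y):
    """Real-valued x with Px = y, computed as B^{-1} V^{-1} A^{-1} y."""
    m = len(y)
    v = [[Fraction(j) ** (i - 1) for j in range(1, m + 1)] for i in range(1, m + 1)]
    a_inv_y = [factorial(i) * Fraction(y[i - 1]) for i in range(1, m + 1)]
    z = solve_exact(v, a_inv_y)  # V^{-1} A^{-1} y
    return [float(z[k - 1]) * exp(k) / k for k in range(1, m + 1)]  # apply B^{-1}


def shifted_input(y):
    """Floored input distribution, or None when any family size is negative."""
    x = solve_input(y)
    if any(value < 0 for value in x):
        return None
    return [floor(value) for value in x]
```

For instance, asking for an output in which every node has degree two, `shifted_input([0, 100])`, yields a negative size for the degree-one family and so returns `None`, whereas a target produced by `apply_model` from a positive input is recovered up to rounding.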
Fig. 2. Action of P on the positive hypercube. Here we can see plots of projections of random vectors under the action of P as a heat map. The sample consists of 100,000 random vectors with random integer entries selected to be within {0, · · · , 100} under the action of P ∈ R4×4 . The output vectors are then projected onto each canonical unit vector ej ∈ R4 and plotted pairwise. These vectors are referred to as Xi in the axis labels. Intuitively this shows all feasible output from a Poisson random graph model with node degrees limited to those in {1, 2, 3, 4}. We can see that all positive vectors remain inside the positive region as expected, and we also see how sharply limiting this is for finding positive solutions of P−1 y for y positive.
Given a number of nodes N we look to bound how many nodes of each degree are feasible. That is, if we have some degree distribution x with L1-norm $\|x\|_1 = N$, we wish to find lower and upper bounds, $l_i$ and $u_i$ respectively, on $|(Px)_i|$ such that $l_i \le |(Px)_i| \le u_i$. We want to do this for every degree family.
Take the projector $\rho_i = e_i^T e_i$ where $e_i$ is the ith canonical unit vector in $\mathbb{R}^m$. Then we know $|(Px)_i| = \|\rho_i P x\|_1$. This directly implies from the structure of P that we have,

$$N \min_k |P_{ik}| \le \|\rho_i P x\|_1 \le N P_{ii} \quad \forall\, x \in \mathbb{R}^m : \|x\|_1 = N \tag{5}$$
Under the necessary, but reasonable, assumption that N > m, Eq. 5 gives us a tight upper bound on the number of nodes we can reliably generate of a given degree based on only the number of nodes in our distribution. This bound is realized precisely when all of the nodes in our distribution have input degree i. We may be interested in what outputs a narrower space of input distributions can reliably generate. Consider bounding the number of nodes with given degrees in a special case. Namely we pick degree family sizes such that the following is true.

$$N_1 \ge N_2 \ge \cdots \ge N_m \tag{6}$$

That is, the sizes of the families are non-increasing with respect to input degree. This classifies a wide variety of networks ranging from those with identical family sizes to power-law distributions. We wish to upper bound the number of nodes we can generate in a given degree family j with a distribution following Property 6. This problem can be expressed in terms of finding coefficients satisfying Eq. 7. Here we may take coefficients with $\|a\|_1 = 1$ and then generalize by taking $x = Na = [N_1, N_2, \cdots, N_m]^T$.

$$\max_a \sum_{k=1}^{m} \frac{k^j e^{-k}}{j!} a_k \tag{7}$$
We can see the maximum occurs when $a_j$ has a maximal population. This means that, perhaps as expected, the way to achieve the maximum number of nodes with degree j is to maximize the number of input nodes with degree j. Since our function is non-increasing, this maximum occurs when $a_1 = a_2 = \cdots = a_j$ and $a_{j+1} = a_{j+2} = \cdots = a_m = 0$. This directly implies that we will get the most nodes of degree j when the following is true for $\|a\|_1 = 1$.

$$a_1 = a_2 = \cdots = a_j = \frac{1}{j} \tag{8}$$
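Therefore the maximum number of nodes we should expect in a given degree class can be approximated as follows.
\begin{align}
\frac{1}{j!\,j} \sum_{k=1}^{j} k^j e^{-k} &\approx \frac{1}{j!\,j} \int_1^j x^j e^{-x}\,dx \tag{9} \\
&\approx \frac{1}{j!\,j}\, \gamma(j+1, j) \tag{10}
\end{align}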
Equation 9 gives us both the exact upper bound and its continuous approximation. Equation 10 can be used as a quick approximation of this value in terms of the incomplete gamma function from 0 to $j$. This gives a far tighter bound than is provided by Eq. 5 when we have a non-increasing degree distribution. It should be noted that one may improve upon the accuracy of these bounds for even more restrictive families of distributions by including a lower bound as well as a tighter upper bound on the size of each degree family. We can glean useful information from these bounds. For instance, if one desires an output distribution where more than a tenth of the nodes have degree five, there are no non-increasing inputs for which we should expect that property in the output. In terms of the inverse matrix $P^{-1}$, inputting such a vector will yield negative family sizes in some indices. This is a severe limitation, since it is independent of the number of nodes.
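The bounds in Eqs. 9 and 10 are easy to evaluate numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the incomplete gamma function is evaluated through the closed form that holds when its first argument is a positive integer, $\gamma(n+1, x) = n!\,(1 - e^{-x}\sum_{k=0}^{n} x^k/k!)$. For $j = 5$ the exact sum on the left of Eq. 9 evaluates to roughly 0.094, which is below one tenth, in line with the example above.

```python
import math

def exact_bound(j):
    """Left-hand side of Eq. 9: (1 / (j! * j)) * sum_{k=1}^{j} k^j * e^{-k}."""
    total = sum(k**j * math.exp(-k) for k in range(1, j + 1))
    return total / (math.factorial(j) * j)

def gamma_bound(j):
    """Eq. 10: gamma(j + 1, j) / (j! * j).

    Uses the closed form for an integer first argument:
    gamma(n + 1, x) = n! * (1 - e^{-x} * sum_{k=0}^{n} x^k / k!).
    """
    tail = math.exp(-j) * sum(j**k / math.factorial(k) for k in range(j + 1))
    lower_gamma = math.factorial(j) * (1.0 - tail)
    return lower_gamma / (math.factorial(j) * j)

# Print both quantities for a few degree classes.
for j in (1, 2, 5, 10):
    print(j, round(exact_bound(j), 4), round(gamma_bound(j), 4))
```

Both functions return a fraction of $N$; multiplying by $N$ gives a node count.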
3 Results
We wish to determine how well $P$ models the output of the Chung-Lu algorithm for a given input distribution. In Fig. 3 we compare the naïve output distribution to the outputs of both Chung-Lu generation and our model taking that distribution as input. For this simple example we find that our model predicts the output node degree frequency remarkably well.
Fig. 3. Model distribution versus Chung-Lu outputs. We consider the degree classes between one and nine for two different power-law distributions. On the left is a power-law distribution with exponent $\beta = 1.0$ and on the right is a power-law distribution with exponent $\beta = 2.0$. In the top two plots, black crosses represent the naïve input $1000 \times k^{-\beta}$, red circles represent the distribution our model estimates will be the output of Chung-Lu generation, and blue x's represent the average distribution for 20 instances of Chung-Lu graphs given the black crosses as input. We can see that the Chung-Lu generated graphs match our model output remarkably closely.
Additionally we aim to determine how much proportional L1 accuracy is gained by using the vector $x = P^{-1} y$ as opposed to $y$ itself as an input to Chung-Lu. Specifically, we consider generating sets of graphs using the Chung-Lu algorithm with the naïve inputs $\{y_i\}$ and with the shifted inputs $\{x_i = P^{-1} y_i\}$. We plot the proportional L1 errors $\|y_i - \tilde{y}_i\|_1 / \|y_i\|_1$ and $\|y_i - \tilde{x}_i\|_1 / \|y_i\|_1$ in Fig. 4, where $\tilde{y}_i$ and $\tilde{x}_i$ are the output distributions of Chung-Lu for the naïve input and shifted input respectively. We choose our set $\{y_i\}$ such that these are guaranteed to be invertible distributions in the sense that $x \in \mathbb{R}_+^m$. For this we use the variable precision toolbox in MATLAB with the digits of precision set to 100. The results can be seen in Fig. 4. We find that our shifted input drastically decreases the proportional L1 error between the output of Chung-Lu and the desired output.
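The proportional L1 error and the forward model can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the entries of $P$ have the form $P_{jk} = k^j e^{-k}/j!$ for degrees $j, k = 1, \ldots, m$, which is how the summand in Eq. 7 reads, and the function names are hypothetical. It only evaluates the model prediction $Px$ for a naïve power-law input; it does not run Chung-Lu generation and does not attempt the inversion $P^{-1} y$, which the text reports as needing 100 digits of precision.

```python
import math

def poisson_model_matrix(m):
    """Assumed form of P: P[j-1][k-1] = k^j * e^{-k} / j! for j, k = 1..m."""
    return [[k**j * math.exp(-k) / math.factorial(j) for k in range(1, m + 1)]
            for j in range(1, m + 1)]

def apply(P, x):
    """Matrix-vector product P x using plain lists."""
    return [sum(p * v for p, v in zip(row, x)) for row in P]

def proportional_l1_error(target, output):
    """||target - output||_1 / ||target||_1."""
    num = sum(abs(t - o) for t, o in zip(target, output))
    return num / sum(abs(t) for t in target)

m = 9
naive = [1000 * k ** -2.0 for k in range(1, m + 1)]   # power law, beta = 2
predicted = apply(poisson_model_matrix(m), naive)      # model's expected output
print(proportional_l1_error(naive, predicted))
```

The printed value is the model's estimate of how far the output drifts from a naïve input, expressed in the same proportional L1 terms as Fig. 4.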
Fig. 4. Error of naïve Chung-Lu input versus shifted Chung-Lu input. We consider 100 input distributions $y_i$ such that $P^{-1} y_i = x_i$, where the distribution $x_i$ is the power-law distribution $1000 \times k^{-6i/100}$ with $k$ ranging between 1 and 40. For each of the 100 inputs, 30 graphs were generated and their degree distributions were averaged using the input $y_i$ for Chung-Lu. The proportional L1 error between this output and the desired output $y_i$ is shown as the solid blue line. Additionally, 30 graphs were generated and their degree distributions were averaged using the input $x_i$ for Chung-Lu. The proportional L1 error between this output and the desired output $y_i$ is shown as the dashed red line. We can see that the "shifted" input we get using our model drastically reduces error for the sample.
4 Conclusion
We have provided a simple method for estimating the output of Chung-Lu random graph generators with far lower proportional L1 error than that given by the traditional assumption that output distributions will resemble input distributions. Our method utilized a Poisson estimate for the number of nodes of given degrees, and we used this to define an invertible matrix $P$ that models the expected output from Chung-Lu generators. This allowed us to "solve the problem in reverse": take a desired output $y$ and solve for the Chung-Lu input $x$ that will result in $y$. We called this the shifted Chung-Lu input. We showed that $P$ predicts that many degree distributions simply are not feasible for Chung-Lu generators; however, we provide conditions for classifying a large portion of these distributions.

There are several avenues for further research. For instance, this work lends itself to analysis and improvement of graph generation. Methods which use naïve Chung-Lu generation as a subroutine may gain both accuracy and insight into possible distribution errors through the kind of analysis done in this paper. Further work may also be done on how altering connection probabilities between degree classes may be used to fine-tune the matrix $P$ in order to produce graphs which are inadvisable for naïve Chung-Lu generation.
Surprising Behavior of the Average Degree for a Node's Neighbors in Growth Networks

Sergei Sidorov(✉), Sergei Mironov, and Sergei Tyshkevich

Saratov State University, Saratov, Russian Federation
[email protected]

Abstract. We study the variation of the stochastic process that describes the temporal behavior of the average degree of the neighbors for a fixed node in Barabási-Albert networks. It was previously known that the expected value of this random quantity grows logarithmically with the number of iterations. In this paper, we use the mean-field approach to derive difference stochastic equations, as well as their corresponding approximate differential equations, in order to find the dynamics of its variation in time. The noteworthy fact proved in this paper is that the variation of this process is bounded by a constant. This behavior is fundamentally different from the dynamics of variation in most known stochastic processes (e.g., the Wiener process), in which the variation tends to infinity over time.

Keywords: Network analysis · Complex networks · Preferential attachment model · Network evolution · Node degree · Mean-field method · Average nearest neighbor degree
This work was supported by the Ministry of Science and Higher Education of the Russian Federation in the framework of the basic part of the scientific research state task, project FSRR-2020-0006.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 463–474, 2022. https://doi.org/10.1007/978-3-030-93409-5_39

1 Introduction

The number of nodes in many real complex networks increases over time due to the addition of new ones, and newborn network elements are connected to existing nodes. In the course of their growth, complex networks use various mechanisms for selecting the nodes to which new vertices join. Perhaps one of the most studied is the preferential attachment mechanism, according to which the probability that a network vertex will be connected to a newborn node is proportional to its degree. The paper examines the dynamics of local characteristics for networks generated using the Barabási-Albert (BA) model. The model [2] describes the process of network development based on the growth and preferential attachment mechanisms. The model description can be found in Sect. 2. It was shown in [2] that in the networks generated using the BA model, the dynamics of the expected value of the degree of network node $v_i$ at iteration $t$ (denoted by $d_i(t)$) follows a power law with exponent $\frac{1}{2}$:
\[ E(d_i(t)) = m \left(\frac{t}{i}\right)^{1/2}. \tag{1} \]
Numerous variations and extensions of this model have been proposed in the works [1,3–8,10–15,18]. Let $s_i(t)$ denote the total degree of all neighbors of the vertex $v_i$ at iteration $t$. In addition to $d_i(t)$ and $s_i(t)$, another important local characteristic of node $v_i$ is the average degree of its neighbors, defined as follows:
\[ \alpha_i(t) := \frac{s_i(t)}{d_i(t)}. \]
This quantity has recently been the object of interest in a number of studies [9,19]. In particular, the cumulative distribution of $\alpha_i(t)$ (in the form of the average nearest neighbor degree of nodes with a given degree [19]) has been studied in paper [9], in which it is shown that Barabási-Albert networks are uncorrelated for nodes with high degrees. The value $\alpha_i(t)$ of a fixed node $v_i$ at each iteration $t$ is a random variable. Since its behavior at the moment $t + 1$ depends only on the values of $d_i(t)$ and $s_i(t)$ at the moment $t$, i.e. on the state of the network at the moment $t$, we can consider $\alpha_i(t)$ as a non-stationary Markov stochastic process. Its expected value at iteration $t$,
\[ E(\alpha_i(t)) \sim \frac{m}{2} \log t + a, \quad a = a(i, m) = \text{const}, \]
is found in papers [9,16]. However, when studying a random variable, it is important to know not only its average value, but also how its values are scattered around its expectation. In other words, it is of interest to study the variation of a random variable. In this paper, we find the variation of $\alpha_i(t)$ and show that its value tends to a constant depending on $m$, the model parameter (the number of edges added per iteration), and on $i$, the iteration at which the node $v_i$ appeared. This means that the variation of the process $\alpha_i(t)$ decreases relative to the expected value, i.e. its coefficient of variation tends to zero. This behavior is fundamentally different from the dynamics of the vertex degree, which is also a non-stationary Markov process whose expected value in time is described by a power law with an exponent of $\frac{1}{2}$ (see Eq. (1)), but whose standard deviation has the same order of magnitude as its expected value. In particular, it was shown in [17] that the variation of $d_i(t)$ is equal to $\mathrm{Var}(d_i(t)) = m\left(\frac{t}{i} - \left(\frac{t}{i}\right)^{1/2}\right)$, the asymmetry coefficient $\gamma_1(d_i(t))$ tends to $2m^{-1/2}$ as $t \to \infty$, and the kurtosis gradually decreases to $\frac{3(m+2)}{m}$ as $t$ tends to infinity (see [17]).

The main result of the paper is proven in Theorem 1 (Sect. 3.1) with the use of the mean-field approach. Section 3.2 presents empirical results devoted to checking the theoretical prediction and to approximating concrete values of the claimed bound depending on the parameter $m$ and the node index $i$.
2 Barabási-Albert Model

2.1 Notations
Let $G_t = \{V_t, E_t\}$ be a network with $V_t = \{v_1, \ldots, v_t\}$ as the set of its nodes and $E_t$ as the set of its edges. Denote by $d_i(t)$ the degree of node $v_i$ of network $G_t$. Let $m \in \mathbb{N}$ be a fixed integer. The parameter $m$ is the number of edges added at each iteration of the algorithm. We start the construction of the BA network with a complete graph consisting of $m$ nodes to ensure that we have enough nodes to create $m$ new edges in any of the following iterations. Based on the Barabási-Albert model, network $G_{t+1}$ is iteratively obtained from $G_t$ (at iteration $t > m$) as follows:
– at iteration $t = m$, $G_m = \{V_m, E_m\}$ is the complete graph with $|V_m| = m$ and $|E_m| = m(m-1)/2$;
– one node $v_{t+1}$ is added to the network at iteration $t + 1$, i.e. $V_{t+1} = V_t \cup \{v_{t+1}\}$;
– $m$ edges that connect node $v_{t+1}$ with $m$ existing nodes are attached; each of these edges appears as the result of the realization of the discrete random variable $\xi^{t+1}$ that takes the value $i$ with probability $P(\xi^{t+1} = i) = \frac{d_i(t)}{2mt}$. If $\xi^{t+1} = i$, then edge $(v_{t+1}, v_i)$ is attached to the network. At each iteration, $m$ such independent experiments are carried out.

Remark 1. The probability that an edge from the new node $v_{t+1}$ that appears at iteration $t + 1$ is attached to node $v_i$ is
\[ P(\xi_i^{t+1} = 1) = m\,\frac{d_i(t)}{2mt} = \frac{d_i(t)}{2t}. \tag{2} \]
The aim of this work is to further analyze the behavior of the stochastic process $\alpha_i(t)$ in time. Namely, in this article we focus on estimating its variance.

2.2 The Evolution of the Barabási-Albert Networks
It follows from the definition of the Barabási-Albert network that
– If $\xi^{t+1} = i$, then $d_i(t+1) = d_i(t) + 1$ and $s_i(t+1) = s_i(t) + m$, as the result of joining node $v_i$ with the newborn node $v_{t+1}$ of degree $m$.
– If $\xi^{t+1} = j$ and $(v_j, v_i) \in E_t$, i.e. the new node $v_{t+1}$ joins a neighbor $v_j$ of node $v_i$, then $d_i(t+1) = d_i(t)$ and $s_i(t+1) = s_i(t) + 1$.

Let $\xi_i^{t+1} = 1$ if node $v_{t+1}$ links to node $v_i$ at iteration $t + 1$, and $\xi_i^{t+1} = 0$ otherwise, i.e.
\[ \xi_i^{t+1} = \begin{cases} 1, & (v_{t+1}, v_i) \in E_{t+1}, \\ 0, & \text{otherwise.} \end{cases} \]
Let $\eta_i^{t+1} = 1$ if node $v_{t+1}$ links to one of the neighbors of node $v_i$ at iteration $t + 1$, and $\eta_i^{t+1} = 0$ otherwise, i.e.
\[ \eta_i^{t+1} = \begin{cases} 1, & (v_{t+1}, v_j) \in E_{t+1} \text{ and } (v_j, v_i) \in E_t, \\ 0, & \text{otherwise.} \end{cases} \]
Then the conditional expectations of $\xi_i^{t+1}$ and $\eta_i^{t+1}$ at moment $t + 1$ are equal to
\[ E(\xi_i^{t+1} \mid G_t) = \frac{d_i(t)}{2t}, \qquad E(\eta_i^{t+1} \mid G_t) = \frac{s_i(t)}{2t}. \tag{3} \]
The variance is the second central moment of the random variable, i.e. $\mu_2(\xi) = \mathrm{Var}(\xi)$.
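The two update rules above translate directly into a simulation that tracks $d_i(t)$ and $s_i(t)$ for one node while the network grows. The code below is a small-scale sketch under stated assumptions, not the authors' experimental code: the names `grow_and_track` and `alpha_samples` are hypothetical, each of the $m$ edges of a new node is drawn independently with probability proportional to current degree (so multi-edges can occur and are counted with multiplicity), and the sampling is normalised by the exact degree sum $2mt - m(m+1)$ rather than the approximation $2mt$ used in the text.

```python
import random
import statistics
from collections import Counter

def grow_and_track(i, m, T, rng):
    """Grow a BA-type multigraph to T nodes and track node i (1-indexed).

    Returns (deg, nbr, s_i): the degree list, a Counter with the number of
    edges between node i and each neighbor, and the total neighbor degree
    s_i maintained incrementally with the rules of Sect. 2.2.
    """
    assert m >= 2 and m <= T and 1 <= i <= T
    deg = [0] * (T + 1)                       # deg[v] for v = 1..T
    endpoints = []                            # node v is listed deg[v] times
    for v in range(1, m + 1):                 # complete graph on m nodes at t = m
        deg[v] = m - 1
        endpoints.extend([v] * (m - 1))
    nbr, s_i = Counter(), 0
    if i <= m:                                # tracked node is in the seed graph
        nbr = Counter({v: 1 for v in range(1, m + 1) if v != i})
        s_i = (m - 1) * (m - 1)
    for u in range(m + 1, T + 1):             # node u plays the role of v_{t+1}
        # m independent degree-proportional draws from the state at time t
        targets = [rng.choice(endpoints) for _ in range(m)]
        for j in targets:
            deg[j] += 1
            if j == i:                        # xi event: new neighbor of degree m
                nbr[u] += 1
                s_i += m
            elif j in nbr:                    # eta event: a neighbor gains an edge
                s_i += nbr[j]
        deg[u] = m
        endpoints.extend(targets)
        endpoints.extend([u] * m)
        if u == i:                            # tracked node is born at iteration i
            nbr = Counter(targets)
            s_i = sum(c * deg[j] for j, c in nbr.items())
    return deg, nbr, s_i

def alpha_samples(i, m, T, runs, seed=0):
    """alpha_i(T) = s_i(T) / d_i(T) over independent runs."""
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        deg, _, s_i = grow_and_track(i, m, T, rng)
        samples.append(s_i / deg[i])
    return samples

samples = alpha_samples(i=10, m=3, T=2000, runs=50)
print(statistics.mean(samples), statistics.pstdev(samples))
```

The run sizes here are far smaller than the 160,000-node graphs reported in Sect. 3.2, so the printed mean and standard deviation are only indicative.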
3 The Dynamics of $\alpha_i(t)$

3.1 The Main Result
In this subsection, using the mean-field approach, we prove that the variation of $\alpha_i(t)$ tends to a constant. First we present two auxiliary lemmas.

Lemma 1.
\[ E\left(\frac{s_i(t)}{d_i^2(t)}\right) \sim \frac{1}{2} \left(\frac{i}{t}\right)^{1/2} (\log t + b), \]
where $b$ is a constant.

The proof of Lemma 1 can be found in Appendix A.

Lemma 2.
\[ E\left(\alpha_i^2(t)\right) \sim \frac{m^2}{4} \log^2 t + a m \log t + d - \frac{3}{4} m \left(\frac{i}{t}\right)^{1/2} \log^2 t + O\left(\frac{\log t}{t^{1/2}}\right), \]
where $d$ and $a$ are constants.

The proof of Lemma 2 can be found in Appendix B.

Theorem 1. The variation of $\alpha_i(t)$ at iteration $t$ is
\[ \mathrm{Var}(\alpha_i(t)) \sim d - a^2 - \frac{m}{4} \left(\frac{i}{t}\right)^{1/2} \log^2 t + O\left(\frac{\log t}{t^{1/2}}\right). \]
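Proof. It follows from Lemmas 1 and 2, together with the expression for $E(\alpha_i(t))$ given by Lemma 3 in Appendix B, that
\begin{align}
\mathrm{Var}(\alpha_i(t)) &= E(\alpha_i^2(t)) - \left(E(\alpha_i(t))\right)^2 \notag \\
&\sim \frac{m^2}{4} \log^2 t + a m \log t + d - \frac{3}{4} m \left(\frac{i}{t}\right)^{1/2} \log^2 t + O\left(\frac{\log t}{t^{1/2}}\right) \notag \\
&\quad - \left(\frac{m}{2} \log t + a - \frac{1}{2} \left(\frac{i}{t}\right)^{1/2} (\log t + b)\right)^2 \notag \\
&= d - a^2 - \frac{m}{4} \left(\frac{i}{t}\right)^{1/2} \log^2 t + O\left(\frac{\log t}{t^{1/2}}\right). \tag{4}
\end{align}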
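The cancellation behind Eq. (4) can be written out explicitly. The following worked step is a sketch that keeps only the terms that survive at the stated order; it uses the leading behaviour $E(\alpha_i(t)) = \frac{m}{2}\log t + a - \frac{1}{2}(i/t)^{1/2}\log t + O(t^{-1/2})$, so every contribution from the constant $b$ is absorbed into the remainder.

```latex
\begin{align*}
\left(E(\alpha_i(t))\right)^2
 &= \left(\frac{m}{2}\log t + a\right)^2
    - 2\left(\frac{m}{2}\log t + a\right)\frac{1}{2}\left(\frac{i}{t}\right)^{1/2}\log t
    + O\!\left(\frac{\log t}{t^{1/2}}\right) \\
 &= \frac{m^2}{4}\log^2 t + a m \log t + a^2
    - \frac{m}{2}\left(\frac{i}{t}\right)^{1/2}\log^2 t
    + O\!\left(\frac{\log t}{t^{1/2}}\right), \\
E(\alpha_i^2(t)) - \left(E(\alpha_i(t))\right)^2
 &\sim d - a^2 + \left(-\frac{3}{4}m + \frac{m}{2}\right)
    \left(\frac{i}{t}\right)^{1/2}\log^2 t
    + O\!\left(\frac{\log t}{t^{1/2}}\right).
\end{align*}
```

The $\log^2 t$ and $\log t$ terms cancel exactly, which is why the variation stays bounded while the expectation grows; the remaining coefficient is $-\frac{3}{4}m + \frac{m}{2} = -\frac{m}{4}$.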
3.2 Empirical Results
In this subsection we illustrate the behavior of the standard deviation of $\alpha_i(t)$ over time by simulating Barabási-Albert random graphs. As was shown in the previous section, this value converges to a certain constant, which depends on the number $m$ of edges added at each iteration of the algorithm and on the index $i$ of the node. However, the theoretical results obtained in the previous section do not give an accurate estimate of this constant. Therefore, one of the goals of the experiments is to estimate this constant for some specific values of $m$ and various node indices $i$.

We conducted several experiments on the construction of BA graphs and the measurement of their characteristics. During each experiment, a graph containing 160,000 nodes was constructed using the Barabási-Albert algorithm for a fixed value of $m$. We constructed one thousand graphs for each of the values $m = 3$, $m = 5$ and $m = 10$. The construction of each graph began at the moment $t = m$ with a graph consisting of $m$ nodes. For each graph, during its growth, the values of $\alpha_i(t)$ were calculated for nodes $i = 10, 50, 100$ at time points $t$ = 1000, 5000, 10000, 20000, 40000, 80000, 100000, 120000, 140000, 160000. Separately, for each value of $m$ and $t$, we calculated the average value and the standard deviation of $\alpha_i(t)$, i.e. $\sigma(\alpha_i(t)) = \sqrt{\mathrm{Var}(\alpha_i(t))}$. Figures 1, 2 and 3 show the results of calculating the average values of $\alpha_i(t)$ and the values of its standard deviation.
[Figure: panel (a) plots $E(\alpha_i(t))$ against $t$ and panel (b) plots StdDev($\alpha_i(t)$) against $t$, each with curves for $\alpha_{10}(t)$, $\alpha_{50}(t)$ and $\alpha_{100}(t)$.]

Fig. 1. Dynamics of empirical values for (a) $E(\alpha_i(t))$ and (b) $\sigma(\alpha_i(t))$ in networks based on the BA model for selected nodes $i = 10, 50, 100$ as $t$ iterates up to 160,000. The network is modeled with $m = 3$.
As predicted, the average value of αi (t) increases proportionally to the logarithm of the number of iterations. The results of calculating the values of the standard deviation of αi (t) are shown in part (b) of Figs. 1, 2 and 3.
[Figure: panel (a) plots $E(\alpha_i(t))$ against $t$ and panel (b) plots StdDev($\alpha_i(t)$) against $t$, each with curves for $\alpha_{10}(t)$, $\alpha_{50}(t)$ and $\alpha_{100}(t)$.]

Fig. 2. Dynamics of empirical values for (a) $E(\alpha_i(t))$ and (b) $\sigma(\alpha_i(t))$ in networks based on the BA model for selected nodes $i = 10, 50, 100$ as $t$ iterates up to 160,000. The network is modeled with $m = 5$.
[Figure: panel (a) plots $E(\alpha_i(t))$ against $t$ and panel (b) plots StdDev($\alpha_i(t)$) against $t$, each with curves for $\alpha_{10}(t)$, $\alpha_{50}(t)$ and $\alpha_{100}(t)$.]

Fig. 3. Dynamics of empirical values for (a) $E(\alpha_i(t))$ and (b) $\sigma(\alpha_i(t))$ in networks based on the BA model for selected nodes $i = 10, 50, 100$ as $t$ iterates up to 160,000. The network is modeled with $m = 10$.
As can be seen from the figures, the value of the standard deviation for each combination of $m$ and $i$ grows rapidly at first, after which it saturates and its growth slows down significantly. After that, the value of the standard deviation fluctuates around a limiting value. Moreover, the experimental results show that the smaller the node index $i$, the smaller the corresponding variation of $\alpha_i(t)$. Conversely, nodes appearing in later iterations have a larger variation of $\alpha_i(t)$. It can also be noted that with an increase in the BA model parameter $m$, the corresponding values of the variation of $\alpha_i(t)$ decrease.
4 Conclusion
Studying the behavior of various local characteristics for growing complex networks built on the basis of the Barabási-Albert model is an important task. This is because growth and preferential attachment mechanisms model well the evolution of many real-world networks. In addition, many models use these mechanisms or their modifications in conjunction with other ones to bring the behavior of the generated networks close to the behavior of real ones in one sense or another. This article studies the average degree of the neighbors of a node in the Barabási-Albert network. The value of this local characteristic changes over time because new nodes can join this vertex or its neighbors at each iteration. The process is a non-stationary Markov process, and one can find its characteristics, including its expected values and variations at each iteration. It was shown earlier [9,16] that the evolution of its expected value is described by the logarithmic function of the number of iterations. In this paper, we use the mean-field approach to derive difference stochastic equations, as well as their corresponding approximate differential equations, in order to find the dynamics of its variation in time. Surprisingly, while the expected value of $\alpha_i(t)$ increases to infinity, its variation is bounded by a constant. This is quite different from, e.g., random walk type processes, in which the expectation is bounded while the variation tends to infinity. In this regard, it would be interesting in future work to check for the presence of this phenomenon during the evolution of real networks (in particular, citation and collaboration networks). It would also be important to study how widespread this behavior is in networks generated by models different from the Barabási-Albert model.
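A The Proof of Lemma 1

Proof. Denote
\[ \beta_i(t) := \frac{s_i(t)}{d_i^2(t)}. \]
We have
\[ \Delta\beta_i(t+1) := \beta_i(t+1) - \beta_i(t) = \left(\frac{s_i(t) + m}{(d_i(t) + 1)^2} - \frac{s_i(t)}{d_i^2(t)}\right) \xi_i^{t+1} + \left(\frac{s_i(t) + 1}{d_i^2(t)} - \frac{s_i(t)}{d_i^2(t)}\right) \eta_i^{t+1}. \tag{5} \]
Since
\[ E(\xi_i^{t+1}) = \frac{d_i(t)}{2t}, \qquad E(\eta_i^{t+1}) = \frac{s_i(t)}{2t}, \]
we get the difference equation
\begin{align}
E\left(\Delta\beta_i(t+1) \mid G_t\right) &= \left(\frac{s_i(t) + m}{(d_i(t) + 1)^2} - \frac{s_i(t)}{d_i^2(t)}\right) \frac{d_i(t)}{2t} + \left(\frac{s_i(t) + 1}{d_i^2(t)} - \frac{s_i(t)}{d_i^2(t)}\right) \frac{s_i(t)}{2t} \notag \\
&= \frac{m d_i(t)}{2t(d_i(t) + 1)^2} - \frac{s_i(t)}{t(d_i(t) + 1)^2} - \frac{s_i(t)}{2t d_i(t)(d_i(t) + 1)^2} + \frac{s_i(t)}{2t d_i^2(t)} \notag \\
&= -\frac{s_i(t)}{2t d_i^2(t)} + \frac{m}{2t d_i(t)} + \frac{3 s_i(t)}{2t d_i^3(t)} - \frac{m}{t d_i^2(t)} + O\left(\frac{s_i(t)}{t d_i^4(t)}\right). \tag{6}
\end{align}
Using $E\left(\frac{1}{d_i(t)}\right) \sim \frac{i^{1/2}}{m t^{1/2}}$, we get the following approximate first order differential equation:
\[ \frac{df(t)}{dt} = -\frac{f(t)}{2t} + \frac{i^{1/2}}{2 t^{3/2}}, \]
the solution of which is $f(t) = \frac{1}{2} \left(\frac{i}{t}\right)^{1/2} (\log t + b)$, where $b$ is a constant.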
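That $f(t) = \frac{1}{2}(i/t)^{1/2}(\log t + b)$ solves $f'(t) = -f(t)/(2t) + i^{1/2}/(2t^{3/2})$ can be checked numerically. The snippet below is a minimal sketch with hypothetical function names: it compares a central finite difference of $f$ with the right-hand side of the equation for arbitrary illustrative values of $i$ and $b$.

```python
import math

def f(t, i, b):
    """Candidate solution from Lemma 1: 0.5 * sqrt(i / t) * (log t + b)."""
    return 0.5 * math.sqrt(i / t) * (math.log(t) + b)

def rhs(t, i, b):
    """Right-hand side of the ODE: -f(t) / (2t) + sqrt(i) / (2 * t^(3/2))."""
    return -f(t, i, b) / (2 * t) + math.sqrt(i) / (2 * t ** 1.5)

def derivative(t, i, b, h=1e-3):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h, i, b) - f(t - h, i, b)) / (2 * h)

# The two columns should agree up to discretisation error.
for t in (50.0, 500.0, 5000.0):
    print(t, derivative(t, 10, 1.0), rhs(t, 10, 1.0))
```

The same check can be repeated for the other first order equations in Appendix B by swapping in the corresponding candidate solution and right-hand side.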
B The Proof of Lemma 2

To prove Lemma 2 we need two auxiliary Lemmas 3 and 4. The next lemma complements the results of papers [9,16].

Lemma 3.
\[ E(\alpha_i(t)) \sim \frac{m}{2} \log t + a - \frac{1}{2} \left(\frac{i}{t}\right)^{1/2} (\log t + b), \]
where $a$ is a constant.

Proof. We have
\[ \Delta\alpha_i(t+1) := \alpha_i(t+1) - \alpha_i(t) = \left(\frac{s_i(t) + m}{d_i(t) + 1} - \frac{s_i(t)}{d_i(t)}\right) \xi_i^{t+1} + \left(\frac{s_i(t) + 1}{d_i(t)} - \frac{s_i(t)}{d_i(t)}\right) \eta_i^{t+1}. \tag{7} \]
Since
\[ E(\xi_i^{t+1}) = \frac{d_i(t)}{2t}, \qquad E(\eta_i^{t+1}) = \frac{s_i(t)}{2t}, \]
we get the difference equation
\begin{align}
E\left(\Delta\alpha_i(t+1) \mid G_t\right) &= \left(\frac{s_i(t) + m}{d_i(t) + 1} - \frac{s_i(t)}{d_i(t)}\right) \frac{d_i(t)}{2t} + \left(\frac{s_i(t) + 1}{d_i(t)} - \frac{s_i(t)}{d_i(t)}\right) \frac{s_i(t)}{2t} \notag \\
&= \frac{m}{2t} - \frac{m}{2t(d_i(t) + 1)} + \frac{s_i(t)}{2t d_i(t)(d_i(t) + 1)} \notag \\
&= \frac{m}{2t} + \frac{s_i(t)}{2t d_i^2(t)} - \frac{m}{2t d_i(t)} - \frac{s_i(t)}{2t d_i^3(t)} + \frac{m}{2t d_i^2(t)} + O\left(\frac{s_i(t)}{t d_i^4(t)}\right). \tag{8}
\end{align}
Using Lemma 1, we get the following approximate first order differential equation:
\[ \frac{df(t)}{dt} = \frac{m}{2t} + \frac{i^{1/2}}{4 t^{3/2}} \log t + \frac{b\, i^{1/2}}{4 t^{3/2}} - \frac{i^{1/2}}{2 t^{3/2}}, \]
the solution of which is $f(t) = \frac{m}{2} \log t - \frac{1}{2} \left(\frac{i}{t}\right)^{1/2} (\log t + b) + a$, where $a$ is a constant.

Lemma 4.
\[ E\left(\frac{s_i^2(t)}{d_i^3(t)}\right) \sim m \left(\frac{i}{t}\right)^{1/2} \left(\frac{1}{4} \log^2 t + \frac{b}{2} \log t + c\right), \]
where $c$ is a constant.

Proof. We have
\[ \Delta_i(t+1) := \frac{s_i^2(t+1)}{d_i^3(t+1)} - \frac{s_i^2(t)}{d_i^3(t)} = \left(\frac{(s_i(t) + m)^2}{(d_i(t) + 1)^3} - \frac{s_i^2(t)}{d_i^3(t)}\right) \xi_i^{t+1} + \left(\frac{(s_i(t) + 1)^2}{d_i^3(t)} - \frac{s_i^2(t)}{d_i^3(t)}\right) \eta_i^{t+1}. \tag{9} \]
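Since
\[ E(\xi_i^{t+1}) = \frac{d_i(t)}{2t}, \qquad E(\eta_i^{t+1}) = \frac{s_i(t)}{2t}, \]
we get the difference equation
\begin{align}
E\left(\Delta_i(t+1) \mid G_t\right) &= \left(\frac{(s_i(t) + m)^2}{(d_i(t) + 1)^3} - \frac{s_i^2(t)}{d_i^3(t)}\right) \frac{d_i(t)}{2t} + \left(\frac{(s_i(t) + 1)^2}{d_i^3(t)} - \frac{s_i^2(t)}{d_i^3(t)}\right) \frac{s_i(t)}{2t} \notag \\
&= \frac{m s_i(t) d_i(t)}{t(d_i(t) + 1)^3} + \frac{m^2 d_i(t)}{2t(d_i(t) + 1)^3} - \frac{3 s_i^2(t)}{2t(d_i(t) + 1)^3} - \frac{3 s_i^2(t)}{2t d_i(t)(d_i(t) + 1)^3} \notag \\
&\quad - \frac{s_i^2(t)}{2t d_i^2(t)(d_i(t) + 1)^3} + \frac{s_i^2(t)}{t d_i^3(t)} + \frac{s_i(t)}{2t d_i^3(t)} \notag \\
&= -\frac{s_i^2(t)}{2t d_i^3(t)} + \frac{m s_i(t)}{t d_i^2(t)} + O\left(\frac{s_i^2(t)}{t d_i^4(t)}\right). \tag{10}
\end{align}
Using Lemma 1, we get the following approximate first order differential equation:
\[ \frac{df(t)}{dt} = -\frac{f(t)}{2t} + \frac{m\, i^{1/2}}{2 t^{3/2}} \log t + \frac{b\, m\, i^{1/2}}{2 t^{3/2}}, \]
the solution of which is $f(t) = m \left(\frac{i}{t}\right)^{1/2} \left(\frac{1}{4} \log^2 t + \frac{b}{2} \log t + c\right)$, where $c$ is a constant.

Proof (of Lemma 2). We have
\[ \Delta\alpha_i^2(t+1) := \alpha_i^2(t+1) - \alpha_i^2(t) = \left(\frac{(s_i(t) + m)^2}{(d_i(t) + 1)^2} - \frac{s_i^2(t)}{d_i^2(t)}\right) \xi_i^{t+1} + \left(\frac{(s_i(t) + 1)^2}{d_i^2(t)} - \frac{s_i^2(t)}{d_i^2(t)}\right) \eta_i^{t+1}. \tag{11} \]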
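Since
\[ E(\xi_i^{t+1}) = \frac{d_i(t)}{2t}, \qquad E(\eta_i^{t+1}) = \frac{s_i(t)}{2t}, \]
we get the difference equation
\begin{align}
E\left(\Delta\alpha_i^2(t+1) \mid G_t\right) &= \left(\frac{(s_i(t) + m)^2}{(d_i(t) + 1)^2} - \frac{s_i^2(t)}{d_i^2(t)}\right) \frac{d_i(t)}{2t} + \left(\frac{(s_i(t) + 1)^2}{d_i^2(t)} - \frac{s_i^2(t)}{d_i^2(t)}\right) \frac{s_i(t)}{2t} \notag \\
&= \frac{m s_i(t) d_i(t)}{t(d_i(t) + 1)^2} + \frac{m^2 d_i(t)}{2t(d_i(t) + 1)^2} - \frac{s_i^2(t)}{t(d_i(t) + 1)^2} - \frac{s_i^2(t)}{2t d_i(t)(d_i(t) + 1)^2} + \frac{s_i^2(t)}{t d_i^2(t)} + \frac{s_i(t)}{2t d_i^2(t)} \notag \\
&= \frac{m s_i(t)}{t d_i(t)} + \frac{3 s_i^2(t)}{2t d_i^3(t)} - \left(2m - \frac{1}{2}\right) \frac{s_i(t)}{t d_i^2(t)} + \frac{m^2}{2t d_i(t)} + O\left(\frac{s_i^2(t)}{t d_i^4(t)}\right). \tag{12}
\end{align}
Using Lemmas 1, 3 and 4, we get the following approximate first order differential equation:
\begin{align}
\frac{df(t)}{dt} &= \frac{m}{t} \left(\frac{m}{2} \log t + a - \frac{1}{2} \left(\frac{i}{t}\right)^{1/2} (\log t + b)\right) + \frac{3m}{2t} \left(\frac{i}{t}\right)^{1/2} \left(\frac{1}{4} \log^2 t + \frac{b}{2} \log t + c\right) \notag \\
&\quad - \left(2m - \frac{1}{2}\right) \frac{1}{2t} \left(\frac{i}{t}\right)^{1/2} (\log t + b) + \frac{m\, i^{1/2}}{2 t^{3/2}}, \tag{13}
\end{align}
the solution of which is
\[ f(t) = \frac{m^2}{4} \log^2 t + a m \log t + d - \frac{3}{4} m \left(\frac{i}{t}\right)^{1/2} \log^2 t + O\left(\frac{\log t}{t^{1/2}}\right), \]
where $d$ is a constant.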
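The passage from Eq. (13) to its solution rests on integrating terms of the form $t^{-1}\log^k t$ and $t^{-3/2}\log^k t$. As a sketch of the only steps that affect the leading terms stated in Lemma 2 (all other contributions are of order $\log t / t^{1/2}$ or smaller and fall into the remainder):

```latex
\begin{align*}
\int \frac{m}{t}\left(\frac{m}{2}\log t + a\right) dt
  &= \frac{m^2}{4}\log^2 t + a m \log t + \text{const}, \\
\int t^{-3/2}\log^2 t \, dt
  &= -2\,t^{-1/2}\left(\log^2 t + 4\log t + 8\right) + \text{const}, \\
\int \frac{3m}{2t}\left(\frac{i}{t}\right)^{1/2}\frac{1}{4}\log^2 t \, dt
  &= \frac{3m}{8}\, i^{1/2}\int t^{-3/2}\log^2 t \, dt
   = -\frac{3}{4}\, m \left(\frac{i}{t}\right)^{1/2}\log^2 t
     + O\!\left(\frac{\log t}{t^{1/2}}\right).
\end{align*}
```

The second identity can be confirmed by differentiating its right-hand side; it is the source of the coefficient $-\frac{3}{4}m$ in Lemma 2.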
References

1. de Almeida, M., Mendes, G., Madras Viswanathan, G., da Silva, L.R.: Scale-free homophilic network. Eur. Phys. J. B 86(38), 1–6 (2013)
2. Barabási, A., Albert, R., Jeong, H.: Mean-field theory for scale-free random networks. Phys. A 272(1), 173–187 (1999)
3. Bertotti, M., Modanese, G.: The configuration model for Barabasi-Albert networks. Appl. Netw. Sci. 4(1), 1–13 (2019)
4. Bianconi, G., Barabási, A.-L.: Competition and multiscaling in evolving networks. Europhys. Lett. 54(4), 436–442 (2001)
5. Cinardi, N., Rapisarda, A., Tsallis, C.: A generalised model for asymptotically-scale-free geographical networks. J. Stat. Mech.: Theory Exp. 2020(4), 043404 (2020)
6. Dorogovtsev, S.N., Mendes, J.F.F., Samukhin, A.N.: Structure of growing networks with preferential linking. Phys. Rev. Lett. 85(21), 4633 (2000)
7. Krapivsky, P.L., Redner, S.: Organization of growing random networks. Phys. Rev. E 63(6), 066123 (2001)
8. Krapivsky, P.L., Redner, S., Leyvraz, F.: Connectivity of growing random networks. Phys. Rev. Lett. 85(21), 4629 (2000)
9. Mironov, S., Sidorov, S., Malinskii, I.: Degree-degree correlation in networks with preferential attachment based growth. In: Teixeira, A.S., Pacheco, D., Oliveira, M., Barbosa, H., Gonçalves, B., Menezes, R. (eds.) CompleNet-Live 2021. SPC, pp. 51–58. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-81854-8_5
10. Pachon, A., Sacerdote, L., Yang, S.: Scale-free behavior of networks with the copresence of preferential and uniform attachment rules. Phys. D 371, 1–12 (2018)
11. Pal, S., Makowski, A.: Asymptotic degree distributions in large (homogeneous) random networks: a little theory and a counterexample. IEEE Trans. Netw. Sci. Eng. 7(3), 1531–1544 (2020)
12. Piva, G.G., Ribeiro, F.L., Mata, A.S.: Networks with growth and preferential attachment: modelling and applications. J. Complex Netw. 9(1), cnab008 (2021)
13. Rak, R., Rak, E.: The fractional preferential attachment scale-free network model. Entropy 22(5), 509 (2020)
14. Shang, K.K., Yang, B., Moore, J., Ji, Q., Small, M.: Growing networks with communities: a distributive link model. Chaos 30(4), 041101 (2020)
15. Sidorov, S., Mironov, S.: Growth network models with random number of attached links. Phys. A: Stat. Mech. Appl. 576, 126041 (2021)
16. Sidorov, S., Mironov, S., Malinskii, I., Kadomtsev, D.: Local degree asymmetry for preferential attachment model. Stud. Comput. Intell. 944, 450–461 (2021)
17. Sidorov, S., Mironov, S., Agafonova, N., Kadomtsev, D.: Temporal behavior of local characteristics in complex networks with preferential attachment-based growth. Symmetry 13(9), 1567 (2021)
18. Tsiotas, D.: Detecting differences in the topology of scale-free networks grown under time-dynamic topological fitness. Sci. Rep. 10(1), 1–16 (2020)
19. Yao, D., van der Hoorn, P., Litvak, N.: Average nearest neighbor degrees in scale-free networks. Internet Math. 2018, 1–38 (2018)
Towards Building a Digital Twin of Complex System Using Causal Modelling

Luka Jakovljevic1,2(✉), Dimitre Kostadinov1, Armen Aghasaryan1, and Themis Palpanas2

1 Nokia Bell Labs France, Nozay, France
[email protected]
2 University of Paris, Paris, France

Abstract. Complex systems, such as communication networks, generate thousands of new data points about the system state every minute. Even if faults are rare events, they can easily propagate, which makes it challenging to distinguish root causes of errors from effects among the thousands of highly correlated alerts appearing simultaneously in high volumes of data. In this context, the need for automated Root Cause Analysis (RCA) tools emerges, along with the creation of a causal model of the real system, which can be regarded as a digital twin. The advantage of such a model is twofold: (i) it assists in reasoning on the system state, given partial system observations; and (ii) it allows generating labelled synthetic data, in order to benchmark causal discovery techniques or create previously unseen faulty scenarios (counterfactual reasoning). The problem addressed in this paper is the creation of a causal model which can mimic the behavior of the real system by encoding the appearance, propagation and persistence of faults through time. The model extends Structural Causal Models (SCMs) with the use of logical noisy-OR gates to incorporate the time dimension and represent propagation behaviors. Finally, the soundness of the approach is experimentally verified by generating synthetic alert logs and discovering both the structure and parameters of the underlying causal model.

Keywords: Digital twin · Causality · Alerts · Complex network modelling

1 Introduction
Complex systems such as modern telecommunication networks, or distributed embedded systems, need to be continuously monitored to allow identification of failure situations. The monitoring systems generate huge volumes of alert logs and notifications, where it is extremely challenging to distinguish real fault symptoms from noisy alerts. Given the large number of highly correlated alerts that appear simultaneously in faulty situations, one still needs to model causal relations between alerts, in order to identify root causes of faults.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 475–486, 2022. https://doi.org/10.1007/978-3-030-93409-5_40
Experts in charge of monitoring and troubleshooting the system usually have knowledge about the system architecture, the possible types of faults and the connections (dependencies) between components (e.g., topology). What is lacking is the model that can encode the system behavior and assist experts in understanding and reasoning on the system state. For example, it can be used to answer questions such as: how likely (and when) alert B will be observed, given that alert A is currently present?, how two simultaneous alerts B and C, which were never observed together, will propagate?, what is the most likely cause of the observed series of alarms?, etc. The existing tools for modelling variables with causal relations [11,14,21,23] are not suitable for these purposes, since they cannot adequately represent system behavior. On the other hand, the digital twin is emerging as a paradigm which can represent relations between system components and allow simulations or reasoning in unprecedented situations [1,7]. A health monitoring model described in [12] uses Dynamic Bayesian Network (DBN) to model relations between variables. Similarly, system graph can be inferred from correlated patterns in the data [5]. To the best of our knowledge, all these tools and frameworks are only able to describe correlations or causal relations between variables, without the ability to encode probabilities of appearance, propagation and persistence of binary events such as faults, events and notifications. In this paper, we propose a framework for building a digital twin that models faulty system behavior based on causal relations between observable alerts. An alert can be represented as a binary time series, where active state indicates a fault on the subsequent system component. The framework takes as input historical data i.e. past observed alerts, and builds a model of the system behavior in two steps. 
First, it extracts the structure of the causal relations between alerts (a Directed Acyclic Graph, DAG), and then learns the parameters of the dependencies which drive the system behavior (e.g., frequency and duration of alert appearance, time lag needed to observe an effect given the presence of its cause(s), etc.). Second, it builds a causal model of the system while tackling the rarity of faults. Multi-causal dependencies follow the noisy-OR logic [18], which enables estimating the effect of multiple causes even if they have never been observed together. The noisy-OR model is commonly used to simplify the expression and computation of the parameters of the dependency between multiple independent causes and one common effect. It has been widely used in network diagnosis and fault localization problems [2,8,13,25]. The resulting digital twin can answer the above-mentioned questions and has several possible usages: generation of labelled synthetic data for algorithm optimisation and tuning, reasoning on the system state given partial observations, or simulation of unobserved fault scenarios.

The contribution of this paper is twofold. First, it describes the framework for building a digital twin from observed historical data; second, it shows how Structural Causal Models (SCMs) [6,19] can be used to encode the system behavior, including alert appearance, propagation and persistence. The rest of this paper is organized as follows: prerequisites regarding noisy-OR and SCMs are given in Sect. 2. Section 3 describes our framework for building
a causal model of a digital twin which models faulty system behaviour. The experiments that simulate the creation of a digital twin, by inferring both the causal relations between variables and the SCM parameters, are discussed in Sect. 4, followed by the conclusion and future work.
2 Prerequisites
The approach presented in this paper is based on Structural Causal Models and uses the noisy-OR gate to represent the impact of multiple causes on the same effect. This section briefly introduces the theory behind those two notions.

2.1 Structural Causal Models
A common way to represent causal relationships between variables is to use Structural Causal Models (SCMs), also referred to as Structural Equation Models (SEMs) [3,10,20] in the case of linear relationships between the variables. Graphically, SCMs can be seen as Directed Acyclic Graphs (DAGs) [24] in which a set of endogenous variables $V$ and a set of exogenous variables $U$ are connected by a set of functions $F$. This set of equations determines the values of the variables in $V$ based on the values of the variables in $U$. The equations correspond to causal assumptions and can be seen as assignments, rather than mathematical equations. Intuitively, a DAG represents a flow of information, where the variables $U$ are the inputs of the system, while the variables $V$ are the nodes where that information is processed. Exogenous variables $U$ correspond to unobserved influences in the model which can be treated as noise factors. In the simplest case with two variables, an SCM can be described as shown in Fig. 1a, where variable $Y$ (effect) is a child of variable $X$ (cause). $N_X$ and $N_Y$ are statistically independent noises, i.e., $N_X \perp\!\!\!\perp N_Y$. Variable $X$ depends only on a noise term $N_X$ ($X := N_X$), while variable $Y$ depends on the value of the parent variable $X$ and its own noise $N_Y$ ($Y := f(X, N_Y)$).
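As an illustration of the two-variable case, the following minimal Python sketch samples from an SCM of the form $X := N_X$, $Y := f(X, N_Y)$. The choice of binary noises, the probabilities `p_x` and `p_noise_y`, and the use of a logical OR for $f$ are assumptions made only for this example; the text above does not fix them.

```python
import random

def sample_scm(n_samples, p_x=0.3, p_noise_y=0.1, seed=0):
    """Draw samples from a toy SCM  X := N_X,  Y := f(X, N_Y).

    Assumptions for illustration only: both noises are Bernoulli
    variables and f is a logical OR.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        n_x = int(rng.random() < p_x)        # exogenous noise of X
        n_y = int(rng.random() < p_noise_y)  # exogenous noise of Y, drawn independently
        x = n_x                              # assignment X := N_X
        y = x | n_y                          # assignment Y := f(X, N_Y)
        samples.append((x, y))
    return samples

samples = sample_scm(10_000)
```

Because the equations are assignments, the values are computed in the causal order (first $X$, then $Y$); with this particular $f$, $Y$ is active whenever $X$ is.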
(a) Structural Causal Model (SCM) representing X → Y.
(b) Collider DAG (also called immorality or v-structure).
Fig. 1. Example of an SCM and a DAG.
2.2 Noisy-OR
The basic idea behind the noisy-OR gate is to compactly represent the joint influence of multiple independent causes, each of which can produce the same effect.
According to the leaky noisy-OR definition [9], the following equation describes the transfer of influence from multiple parent Boolean variables $x_1, \dots, x_n$ to the child Boolean variable $y$, which becomes active with probability:

$$P^{nor}(y = 1 \mid x_1, \dots, x_n) = 1 - (1 - \lambda) \prod_{i=1}^{n} (1 - \nu_i)^{x_i} \tag{1}$$
For each $i = 1, \dots, n$, the number $\nu_i$ (associated with $x_i$), which takes a value between 0 and 1, is called the weight. The number $\lambda$, which also takes a value between 0 and 1, is called the leak factor. For the leak factor $\lambda = 0$, the leaky noisy-OR function reduces to the original (standard) noisy-OR. In the following sections we will refer to the leaky noisy-OR simply as the noisy-OR, as is done in the majority of the causality and Bayesian Network (BN) literature. As a comparison, causal probabilistic networks, also known as Bayesian Networks, would require $2^n$ probability parameters to express all states using Conditional Probability Tables (CPTs), where $n$ is the number of parent variables. The main advantage of a noisy-OR gate is that it provides a practical compact representation of CPTs, by describing conditional probabilities using only $n + 1$ parameters (one parameter per parent variable plus one for the noise, i.e., the leak).
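To make the compactness argument concrete, the sketch below evaluates Eq. 1 and expands it into a full CPT. The weights and the leak value are arbitrary numbers chosen for illustration, and the function name `leaky_noisy_or` is hypothetical.

```python
from itertools import product

def leaky_noisy_or(x, nu, leak):
    """Return P(y = 1 | x_1..x_n) according to Eq. 1.

    x    : tuple of 0/1 parent states
    nu   : list of weights nu_i, one per parent
    leak : leak factor lambda
    """
    prob_off = 1.0 - leak               # y stays off only if the leak does not fire ...
    for x_i, nu_i in zip(x, nu):
        if x_i:                         # ... and every active parent fails to trigger it
            prob_off *= 1.0 - nu_i
    return 1.0 - prob_off

# Illustrative values (assumptions): n = 3 parents, hence n + 1 = 4 parameters.
nu = [0.9, 0.5, 0.2]
leak = 0.01

# The full CPT has 2^n = 8 rows, all derived from the 4 parameters above.
cpt = {x: leaky_noisy_or(x, nu, leak) for x in product((0, 1), repeat=len(nu))}
```

With all parents inactive the probability equals the leak factor, and with `leak = 0` the function reduces to the standard noisy-OR.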
3 Causal Modelling for a Digital Twin
This section presents our framework for building a digital twin that mimics the faulty behavior of a real system. First, we present the steps needed to build a digital twin just by observing system alerts. Second, we define the variables and rules needed to model fault appearance, propagation and persistence in time. Then, we describe how to build a causal model that encodes relations between system components. Lastly, we define the steps needed to infer the model structure and parameters from observational data.

3.1 Digital Twin Architecture
Many complex systems are composed of multiple interdependent components and use alerts to communicate issues/faults in the system, where faults can propagate from one component to another. For example, a fault on a base station can be caused by a fault on the power equipment connected to that base station (Fig. 2 - System Layers). Similarly, a lack of computing resources on a server can cause faulty behavior on a base station. We distinguish three reasons for an alert occurrence:
– the alert becomes active independently of other alerts, i.e., it indicates a new fault in the component (case 1);
– the alert becomes active due to the activity of some other alert, i.e., it indicates the propagation of a fault from another component (case 2);
– the alert remains active after its occurrence at the previous time instant, i.e., the fault persists (case 3).
Fig. 2. Creation of a digital twin: modelling the behaviour of observable system states enables mimicking fault appearance, propagation and persistence in time.
In order to model the system behavior and represent the three above-mentioned cases, we introduce a framework for building a digital twin (Fig. 2). It takes as input historical data coming from the observable system layer to discover the system's model. This requires structure extraction and parameter learning. Structure extraction consists in learning causal relations between alerts. It can be done using existing causal discovery techniques [15,17,22,23], which can, in addition, discover the time lag of each causal dependency (time series DAG). The parameters can then be learnt using techniques that infer the conditional probability distribution of noisy-OR gates once the structure is known [16]. Once all parameters are inferred, the causal model for a digital twin can be built, as described in more detail in the following subsections.

The main advantage of having a digital twin of the real system lies in the ability to mimic system behavior. This allows (i) simulating previously unseen faulty scenarios and generating labelled data for algorithm optimization, and (ii) identifying root causes of faults by explaining observed alert logs using real-time observations. Lastly, knowing the dynamics of fault propagation between components enables predictive maintenance tasks: once an alert appears on a component, knowing when and which component will be affected next can be used to predict the appearance of alerts. In the sequel, we consider that the active state of an alert corresponds to the binary time series taking the value one.

3.2 Defining Variables and Rules
In order to define the causal model used for modelling system behavior, this section starts by describing the rules used to express causal dependencies between variables, as well as the corresponding parameters. The model uses two types of endogenous SCM variables: gate variables and observed variables, where the latter correspond to system alerts. Given that the framework represents dependencies through time, all states of a given observed variable which are involved in causal relations are represented by separate endogenous variables, indexed by the time instant they represent. For example, there can be two variables $X_t$ and $X_{t-1}$ representing the same observed variable $X$ at two different moments, $t$ and $t-1$ respectively. Gate variables are used to manage the probabilistic
causal dependencies between observed variables. Each gate variable is indexed by the cause variable and the effect variable that it links. When representing fault propagation, the cause and effect variables have different names (and possibly different time indexes if the cause takes time to propagate). On the other hand, when representing fault persistence, the cause and the effect variables correspond to the same time series variable, just in different time states (the cause should have an older time index than the effect).

Each endogenous variable has an exogenous variable associated with it, whose semantics depends on whether the endogenous variable is a gate or not. For observed (non-gate) variables, the corresponding exogenous variable controls the prior probability of turning the endogenous variable to 1, i.e., the probability of a new fault, while for gate variables it controls the propagation probability from the cause to the effect.

Once the graphical model of causal dependencies is defined, it is translated into equations based on the following rules:

– Rule 1: for each exogenous variable $E_i$, create an equation which assigns a value to it, using a random generator with a given probability distribution. For example, a discrete random generator $\mathrm{disc}$ with parameter $\lambda_i$, denoted $\mathrm{disc}(1 - \lambda_i, \lambda_i)$, will assign to $E_i$ the value 1 with probability $\lambda_i$ and the value 0 with probability $1 - \lambda_i$:

$$E_i = \mathrm{disc}(1 - \lambda_i, \lambda_i) \tag{2}$$

– Rule 2: for each observed endogenous variable $X_t^i$ which is not causally dependent on any other variable (i.e., is a root variable), create an equation which assigns to it the value of its exogenous variable $E_{X_t^i}$:

$$X_t^i := E_{X_t^i} \tag{3}$$

– Rule 3: for each endogenous gate variable $Gate_{XY}$, create an equation which assigns a value to it using a logical AND operator over its exogenous variable $E_{Gate_{XY}}$ and the value of the input cause variable $X$:

$$Gate_{XY} := \mathrm{AND}[X, E_{Gate_{XY}}] \tag{4}$$

– Rule 4: for each non-root, non-gate endogenous variable $Y_t^i$, create an equation which assigns a value to it using a logical OR operator over the values of all input gate variables $\{Gate_{PA_{Y_t^i} Y_t^i}\}$ and its exogenous variable $E_{Y_t^i}$:

$$Y_t^i := \mathrm{OR}[E_{Y_t^i}, \{Gate_{PA_{Y_t^i} Y_t^i}\}] \tag{5}$$

where $PA_{Y_t^i}$ corresponds to all parent nodes, i.e., causes, of variable $Y_t^i$.

3.3 Noisy-OR SCM Models
The digital twin takes as input a causal model, which allows modelling the three cases described in Subsect. 3.1 in the following way. Probability of a new fault
(case 1) is represented by the parameters of the exogenous variables associated with non-gate endogenous variables, using Rule 1, which translates this probability into faults. These exogenous variables are then assigned to the non-gate endogenous variables using Rule 2 or Rule 4, depending on whether the variable is a root or not, respectively. If a variable is not a root node, the model includes propagation of a fault from parent components (case 2), incorporating also Rule 3. The SCM that models these two cases using all four rules is presented in Fig. 3. This model corresponds to the noisy-OR definition (Eq. 1) for the collider structure in Fig. 1b, where parents $X_1$ and $X_2$ independently cause child $Y$.
Fig. 3. Noisy-OR SCM.
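The correspondence between the four rules and Eq. 1 can be checked numerically. The sketch below is an illustrative Python rendering of the collider of Fig. 1b, not the authors' implementation: the helper names and all parameter values are assumptions. It samples the SCM and compares the empirical frequency of $Y = 1$, given that both parents are active, with the noisy-OR probability.

```python
import random

def disc(rng, p):
    """Rule 1: exogenous variable taking value 1 with probability p, else 0."""
    return int(rng.random() < p)

def sample_collider(rng, lam_x1, lam_x2, lam_y, nu_1, nu_2):
    """Draw one sample (X1, X2, Y) from a noisy-OR SCM on a collider X1 -> Y <- X2."""
    x1 = disc(rng, lam_x1)                    # Rule 2: root variable X1 := E_X1
    x2 = disc(rng, lam_x2)                    # Rule 2: root variable X2 := E_X2
    gate_1 = x1 & disc(rng, nu_1)             # Rule 3: Gate_X1Y := AND[X1, E_Gate_X1Y]
    gate_2 = x2 & disc(rng, nu_2)             # Rule 3: Gate_X2Y := AND[X2, E_Gate_X2Y]
    y = disc(rng, lam_y) | gate_1 | gate_2    # Rule 4: Y := OR[E_Y, gates]
    return x1, x2, y

# Illustrative parameter values (assumptions, not taken from the paper).
params = dict(lam_x1=0.5, lam_x2=0.5, lam_y=0.05, nu_1=0.8, nu_2=0.6)
rng = random.Random(0)
draws = [sample_collider(rng, **params) for _ in range(200_000)]

# Empirical P(Y = 1 | X1 = 1, X2 = 1) versus the value given by Eq. 1.
both_active = [y for x1, x2, y in draws if x1 and x2]
empirical = sum(both_active) / len(both_active)
expected = 1 - (1 - params["lam_y"]) * (1 - params["nu_1"]) * (1 - params["nu_2"])
```

In this reading, the parameter of the exogenous variable of $Y$ plays the role of the leak factor $\lambda$, and the parameters of the gate exogenous variables play the role of the weights $\nu_i$.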
Persistence of a fault on the same component (case 3) is realised by introducing an (unobserved) endogenous gate variable between two time instants of the same variable ($Gate_{X_{t-1} X_t}$ and $Gate_{Y_{t-1} Y_t}$ in Fig. 4). Likewise, introducing a gate variable between two time instants of different time series ($Gate_{X_{t-1} Y_t}$) creates a time lag for the propagation of a fault from another component. The SCM in Fig. 4 combines the previously described effects, therefore modelling all three cases from Subsect. 3.1. Alert $Y_t$ can become active due to the activity of alert $X_t$, regarded as its parent in the causal graph. At the same time, both alerts have a probability of entering and maintaining their active state, independently of each other. The methodology and all equations can easily be generalized to multiple children and parents ($X_t^i$; $i = 1, \dots, n$) with propagation intervals of arbitrary length $t - lag$.

3.4 Model Discovery from Observable Data
Building a digital twin requires inferring the structure and parameters from historical data, which will serve as input for a causal model (Fig. 2 - Model Discovery). Once the structure (DAG and lags) is properly identified using causal discovery techniques, the parameters of the SCM need to be estimated. This consists in learning the parameters of the probability distributions used to assign values to $n$ random variables (alerts, i.e., nodes in a DAG), which is similar to learning the parameters of a Bayesian Network with a known structure [16]. The order in which
Fig. 4. Example of an SCM for modelling 2 binary time series ($X_t$ causing $Y_t$) where the propagation time lag between variables is 1 ($lag = 1$).
parameters are learnt is important. First, for each variable $X_t^i$ ($i = 1, \dots, n$), the new fault appearance probability $\lambda_{X_t^i}$ is computed from all observations in which all parent variables of $X_t^i$ are inactive, including its own past. Then, the alert persistence probability $\nu_{X_{t-lag}^i X_t^i}$ is inferred from samples where all parents of $X_t^i$ are inactive, except the sample representing its own past (i.e., $X_{t-lag}^i$). Finally, the alert propagation probabilities $\nu_{X_{t-lag}^i X_t^j}$ are estimated by considering the cases in which only one of the parents of $X_t^j$ is active at a time ($j \neq i$). In this last case, due to alert persistence, the sample representing the past of $X_t^j$ may also be active, but as its own propagation (persistence) probability has already been estimated, there is only one unknown parameter to be learnt.
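The learning order described above can be sketched for the two-series model of Fig. 4 (X causing Y with lag 1, both alerts persistent). The code below is a simplified frequency-based illustration under stated assumptions, not the authors' estimator: the parameter values are arbitrary, the function names are hypothetical, and the propagation probability is estimated only from samples in which the past of $Y$ is inactive (the simplest sub-case of the last step).

```python
import random

def simulate(T, lam_x, nu_xx, lam_y, nu_yy, nu_xy, seed=0):
    """Simulate two binary alert series following the SCM of Fig. 4 (lag = 1)."""
    rng = random.Random(seed)

    def draw(p):
        return int(rng.random() < p)    # Rule 1: Bernoulli exogenous variable

    xs, ys = [0], [0]
    for _ in range(1, T):
        x_prev, y_prev = xs[-1], ys[-1]
        # X_t := OR[E_X, Gate_{X_{t-1} X_t}]  (new fault or persistence)
        xs.append(draw(lam_x) | (x_prev & draw(nu_xx)))
        # Y_t := OR[E_Y, Gate_{Y_{t-1} Y_t}, Gate_{X_{t-1} Y_t}]
        ys.append(draw(lam_y) | (y_prev & draw(nu_yy)) | (x_prev & draw(nu_xy)))
    return xs, ys

def rate(ys, keep):
    """Fraction of time steps with Y_t = 1 among the steps t selected by keep(t)."""
    selected = [ys[t] for t in range(1, len(ys)) if keep(t)]
    return sum(selected) / len(selected)

def learn_y_parameters(xs, ys):
    """Estimate (lambda_Y, nu_YY, nu_XY) in the order described in the text."""
    # 1) New-fault probability: all parents of Y_t inactive, including its own past.
    lam = rate(ys, lambda t: xs[t - 1] == 0 and ys[t - 1] == 0)
    # 2) Persistence: only the past of Y is active; invert 1 - (1 - lam)(1 - nu).
    p_persist = rate(ys, lambda t: xs[t - 1] == 0 and ys[t - 1] == 1)
    nu_yy = 1 - (1 - p_persist) / (1 - lam)
    # 3) Propagation: only the parent X is active (simplification: past of Y inactive).
    p_prop = rate(ys, lambda t: xs[t - 1] == 1 and ys[t - 1] == 0)
    nu_xy = 1 - (1 - p_prop) / (1 - lam)
    return lam, nu_yy, nu_xy

xs, ys = simulate(200_000, lam_x=0.05, nu_xx=0.5, lam_y=0.05, nu_yy=0.3, nu_xy=0.8)
lam_hat, nu_yy_hat, nu_xy_hat = learn_y_parameters(xs, ys)
```

Each step has a single unknown because the parameters estimated earlier are plugged into the noisy-OR expression, which is why the order matters.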
4 Experiments
The goal of this section is to validate the framework's capability to infer the digital twin of a system, using just the observational data coming from the system alert logs. Synthetic datasets used in the tests are generated using the framework presented in Sect. 3. The experiment is performed in two phases: first, we test the capability of different causal discovery techniques to infer the causal relations between alerts (DAG) from observational data. Second, we test the ability to correctly learn the noisy-OR SCM parameters, which correspond to alert characteristics, once the structure is properly identified.

4.1 Setup
The data generation process of the proposed framework is implemented in Python. The implementation code, dataset samples and the notebooks for running the simulations are publicly available at https://github.com/nokia/causal-digital-twin. Each test is repeated 10 times, the time series length is 1000 points, and evaluation results are shown as the average and standard deviation across the 10 runs. Tests are executed on a VM with a CPU @ 2.2 GHz and 16 GB of RAM.
Datasets. The data generation process consists of 3 steps:
1) Defining the structure of the underlying model, represented as a DAG. Directed Acyclic Graphs with $n$ nodes are randomly generated and the mean edge degree is fixed to $d = 3$ (taking into account both in and out edges in the directed graph). Graphs of different sizes are generated with $n \in \{5, 10, 15, 20, 50\}$. DAGs are generated according to the ER (Erdős–Rényi) model using a probability $p$ to assign new edges, where $p$ is computed as $p = \frac{d}{n-1}$.
2) Parametrizing the model, which is done by assigning the probabilities $\lambda_{X_t^i}$ and $\nu_{X_{t-lag}^i X_t^i}$ to graph nodes, while $\nu_{X_{t-lag}^i X_t^j}$ and $lag$ are assigned to graph edges, where $i, j = 1, \dots, n$. Given that some causal discovery techniques, such as PCMCI, do not apply to instantaneous propagation, a lag equal to zero is not used in the data generation process. DAG attributes are used to instantiate the exogenous and endogenous variables of the SCM, where nodes in the DAG correspond to variables in the SCM, while edges correspond to causal relations in the SCM.
3) Generating a dataset with $n$ variables, each representing a binary time series. Edges in the DAG represent ground truth causal relations between variables.

Causal Discovery Techniques. Four state-of-the-art multivariate causal discovery techniques are used for evaluation across the tests, as representatives of different methods for learning causal relations, as listed in Table 1.

Table 1. Causal discovery techniques.

Algorithm        Short description              Parameters used
PCMCI [23]       Constraint-based               α = 5%, τ_max = 3, CMIsymb
PCMCI+ [22]      Constraint-based               α = 5%, τ_max = 3, CMIsymb
TCDF [15]        Convolutional neural networks  K = 4, L = 0
DYNOTEARS [17]   Score-based                    τ_W = 5%, lag_max = 3
Evaluation Measures. The accuracy of the discovered causal relations is measured by treating the discovery of a directed edge in a DAG as a binary classification problem, since all techniques used in the experiments output directed edges. In [4], the precision and recall (true positive rate) are defined as TP/(TP + FP) and TP/(TP + FN), respectively, where True Positives (TP) are correctly identified links in the underlying DAG; False Positives (FP) and False Negatives (FN) are defined similarly. The F1 score is defined as 2 · Precision · Recall/(Precision + Recall). The tested causal discovery algorithms can additionally output a specific time lag along with each detected edge. If a technique outputs wrong or multiple time lags for the same causal relation, it is not penalized, i.e., only an incorrectly detected edge (causal relation between variables) is penalized.
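The evaluation measures above can be computed from two edge lists as in the following sketch. The function name `edge_scores` and the edge format `(cause, effect, lag)` are assumptions made for this example; lags are dropped before the comparison so that wrong or repeated lags for the same relation are not penalized.

```python
def edge_scores(true_edges, found_edges):
    """Return (precision, recall, F1) for directed edges, ignoring time lags."""
    # Keep only (cause, effect); a set removes duplicates with different lags.
    truth = {(cause, effect) for cause, effect, *_ in true_edges}
    found = {(cause, effect) for cause, effect, *_ in found_edges}
    tp = len(truth & found)              # correctly identified links
    fp = len(found - truth)              # reported links absent from the ground truth
    fn = len(truth - found)              # ground-truth links that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: A -> B is found twice (with two lags) and C -> B is reversed.
truth = [("A", "B", 1), ("B", "C", 2), ("A", "C", 1)]
found = [("A", "B", 3), ("A", "B", 1), ("C", "B", 1)]
precision, recall, f1 = edge_scores(truth, found)
```

In the toy example the reversed edge counts as one false positive, and its ground-truth counterpart B → C counts as a false negative, since edge direction matters.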
4.2 Results
The parameter ranges in the synthetic datasets are chosen to mimic the behavior of alerts in real systems, where faults are rare events. In addition, faults propagate across interconnected components, while alerts turn off once the system recovers from faults. The probability ranges of the SCM parameters used in the experiments are listed in Table 2, where an uncertainty of 5% is introduced in order to obtain a more probabilistic scenario.

Table 2. Parameter ranges for the tests in Fig. 5.

Parameter                        Value range   Description
$\lambda_{X_t^i}$                (0–0.05]      Probability of a new fault on $X_t^i$
$\nu_{X_{t-lag}^i X_t^i}$        (0–0.05]      Probability that fault persists on $X_t^i$
$\nu_{X_{t-lag}^i X_t^j}$        [0.95–1.0]    Probability of a fault propagating from $X_t^i$ to $X_t^j$
$lag$                            [1, 2, 3]     Causal dependency time lag
First, the ability of causal discovery techniques to detect the underlying causal graph from observational datasets is tested. Figure 5 presents the F1, precision and recall scores of the different algorithms for various graph sizes. PCMCI+ and PCMCI are able to correctly identify the DAG and maintain a high F1 score as the graph size grows. TCDF has high precision for small graph structures, although this technique suffers from lower accuracy as the number of variables increases. DYNOTEARS has stable precision across datasets of different sizes, yet suffers from low recall, since it only detects around 20% of the edges from the ground truth set. Second, given a correctly identified graph structure, our approach using Maximum Likelihood Estimation (MLE) is able to learn the SCM parameters asymptotically as the number of time samples increases (RMSE of ~1% for a time series length of 100k).
Fig. 5. Accuracy and execution time for different graph sizes (d = 3 and SCM parameters are listed in Table 2).
5 Conclusions
This work proposes a general-purpose framework for modelling faulty behaviors in complex systems. The model relies on noisy-OR logic components combined into a causal DAG structure by using Structural Causal Models (SCMs). The explicit representation of fault propagation rules characterizes the impact of multiple causes on the same node, while the introduction of lagged variables in the SCM incorporates the time dimension. Our approach enables modelling the appearance, propagation and persistence of faults, which is suitable for representing multivariate binary time series such as alerts, events, or notifications. Synthetic datasets are generated from this model, showing that we are able to recover both the causal relations and the parameters of the SCM from observational data alone, which allows the creation of digital twins for real systems. Our framework can be broadened to allow modelling categorical and continuous time series by using generalizations of the noisy-OR gate. In future work, we will study extensions that enable changes in the causal graph structure and function parameters, which would allow reasoning on interventions, counterfactuals and distribution shifts. In particular, structure changes and distribution shifts can model occurrences of severe fault modes preceded by normal situations with recurrent minor alarms. Last but not least, we are planning evaluation campaigns of our framework on alarm data collected from deployed communication networks.
References
1. van der Aalst, W.M., Hinz, O., Weinhardt, C.: Resilient digital twins (2021)
2. Agnieszka, O., Marek, J.D., Hanna, W.: Learning Bayesian network parameters from small data sets: application of noisy-or gates. Int. J. Approx. Reason. 27(2), 165–182 (2001)
3. Aldrich, J.: Autonomy. Oxford Econ. Pap. 41(1), 15–34 (1989)
4. Andrews, B., Ramsey, J., Cooper, G.F.: Learning high-dimensional directed acyclic graphs with mixed data-types. PMLR 104, 4–21 (2019)
5. Banerjee, A., Dalal, R., Mittal, S., Joshi, K.P.: Generating digital twin models using knowledge graphs for industrial production lines. UMBC Information Systems Department (2017)
6. Bollen, K.A.: Structural equation models with observed variables. Struct. Eqn. Latent Variables 80–150 (1989)
7. Boschert, S., Heinrich, C., Rosen, R.: Next generation digital twin. In: Proceedings of TMCE, vol. 2018, pp. 7–11. Las Palmas de Gran Canaria, Spain (2018)
8. Fallet-Fidry, G., Weber, P., Simon, C., Iung, B., Duval, C.: Evidential network-based extension of leaky noisy-or structure for supporting risks analyses. IFAC Proc. Vol. 45(20), 672–677 (2012)
9. Fenton, N.E., Noguchi, T., Neil, M.: An extension to the noisy-or function to resolve the "explaining away" deficiency for practical Bayesian network problems. IEEE Trans. Knowl. Data Eng. 31, 2441–2445 (2019)
10. Hoover, K.D.: Causality in economics and econometrics. In: The New Palgrave Dictionary of Economics, pp. 1–13. Palgrave Macmillan, UK (2017)
11. Lawrence, A.R., Kaiser, M., Sampaio, R., Sipos, M.: Data generating process to evaluate causal discovery techniques for time series data. In: Causalens NIPS 2020 workshop (2020)
12. Li, C., Mahadevan, S., Ling, Y., Wang, L., Choze, S.: A dynamic Bayesian network approach for digital twin. In: 19th AIAA Non-Deterministic Approaches Conference, p. 1566 (2017)
13. Liang, R., Liu, F., Liu, J.: A belief network reasoning framework for fault localization in communication networks. Sensors 20, 6950 (2020)
14. Mirylenka, K., Cormode, G., Palpanas, T., Srivastava, D.: Conditional heavy hitters: detecting interesting correlations in data streams. VLDB J. 24(3), 395–414 (2015). https://doi.org/10.1007/s00778-015-0382-5
15. Nauta, M., Bucur, D., Seifert, C.: Causal discovery with attention-based convolutional neural networks. Mach. Learn. Knowl. Extract. 1(1), 312–340 (2019). https://www.mdpi.com/2504-4990/1/1/19
16. Oniśko, A., Druzdzel, M.J., Wasyluk, H.: Learning Bayesian network parameters from small data sets: application of noisy-or gates. Int. J. Approx. Reason. 27(2), 165–182 (2001)
17. Pamfil, R., et al.: DYNOTEARS: structure learning from time-series data. In: International Conference on Artificial Intelligence and Statistics (2020)
18. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., Burlington (1988)
19. Pearl, J.: Causality. Cambridge University Press, Cambridge (2009)
20. Pearl, J., Glymour, M., Jewell, N.P.: Causal Inference in Statistics: A Primer. Wiley, Hoboken (2016)
21. Ramsey, J., Malinsky, D., Bui, K.V.: algcomparison: comparing the performance of graphical structure learning algorithms with TETRAD. arXiv preprint arXiv:1607.08110 (2016)
22. Runge, J.: Discovering contemporaneous and lagged causal relations in autocorrelated nonlinear time series datasets. In: Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI). PMLR (2020)
23. Runge, J., Nowack, P., Kretschmer, M., Flaxman, S., Sejdinovic, D.: PCMCI detecting causal associations in large nonlinear time series datasets. Sci. Adv. 5, 11 (2019)
24. Thulasiraman, K., Swamy, M.N.S.: Graphs: Theory and Algorithms. Wiley, Hoboken (2011)
25. Zhou, K., Martin, A., Pan, Q.: The belief noisy-or model applied to network reliability analysis. ArXiv abs/1606.01116 (2016)
Constructing Weighted Networks of Earthquakes with Multiple-parent Nodes Based on Correlation-Metric

Yuki Yamagishi1,2(B), Kazumi Saito2, Kazuro Hirahara2, and Naonori Ueda2

1 Faculty of Informatics, Shizuoka Institute of Science and Technology, Fukuroi, Japan [email protected]
2 Center for Advanced Intelligence Project, RIKEN, Kyoto, Japan {yuki.yamagishi.ks,kazumi.saito,kazuro.hirahara,naonori.ueda}@riken.jp
Abstract. In this paper, we address the problem of constructing weighted networks of earthquakes with multiple-parent nodes, where pairs of earthquakes with strong interaction are connected. To this end, by extending a representative conventional method based on the correlation-metric that produces an unweighted network with single-parent nodes, we develop a method for constructing a network with multiple-parent nodes and assigning a weight to each link by a link-weighting scheme called logarithmic-inverse-distance. In our experimental evaluation, we use an earthquake catalog that covers the whole of Japan, and select 24 major earthquakes which caused significant damage or casualties in Japan. In comparison to four different link-weighting schemes, i.e., uniform, magnitude, inverse-distance, and normalized-inverse-distance, we evaluate the effectiveness of the networks constructed by our proposed method in terms of the ranking accuracy based on the most basic centrality, i.e., the weighted degree measure. As a consequence, we show that our proposed method works well, and then discuss the reasons why weighted networks with multiple-parent nodes can improve the ranking accuracy.

Keywords: Weighted networks · Correlation-metric · Multiple parent nodes

1 Introduction
In seismology, there is a pressing need for understanding the relationships between earthquakes in an extensive catalog. In particular, earthquake declustering [15], which is tasked with classifying each earthquake as a foreshock, mainshock, or aftershock, plays an essential role in many critical applications such as forecasting future earthquakes, modeling seismic activities, and so forth. Namely, the key task of earthquake declustering can be formalized as a problem of identifying the pairs of strongly interacting earthquakes.

In our proposed approach, we attempt to solve this problem by constructing large-scale weighted complex networks with multiple parents, where each node (vertex), link (edge), and weight correspond to an earthquake (event), the interaction between earthquakes, and the strength of this interaction, respectively. Here, it seems worth putting some effort into attempting to find empirical regularities and develop explanatory accounts of basic properties in these complex networks. To this end, we can exploit a wide variety of techniques developed in the field of large-scale complex networks, such as centrality analysis, community extraction, and so forth. Such attempts would be valuable for understanding some structures and trends, and could lead to the discovery of new knowledge and insights underlying these interactions.

In this paper, as a representative conventional method, we focus on the correlation-metric method that produces an unweighted single-parent network. Then, based on this conventional method, we develop an extended method for constructing a network with multiple-parent nodes and assigning link weights based on a logarithmic-distance scheme. In our experimental evaluation, we used an earthquake catalog that covers the whole of Japan, and selected 24 major earthquakes which caused significant damage or casualties in Japan. Then, in comparison with different link-weighting schemes, we evaluate the effectiveness of the networks constructed by our proposed method in terms of the ranking accuracy based on the most basic centrality measure. If the proposed centrality measure can ensure that the major earthquakes are ranked higher, then it can contribute to the development of seismology, for example, by estimating the hazard level of each earthquake in a given catalog based on the measure.

An outline of this paper is given below. Section 2 describes related conventional algorithms for earthquake declustering. Section 3 presents our method for constructing a weighted network with multiple-parent nodes. In Sect. 4, we report our experimental results using an existing Japanese earthquake catalog and discuss notable characteristics of our proposed method. Finally, Sect. 5 gives concluding remarks and future problems.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 487–498, 2022. https://doi.org/10.1007/978-3-030-93409-5_41
2 Related Work
A series of studies on seismicity declustering by Zaliapin et al. [23–27] revealed the effectiveness of the approach based on nearest neighbor (single-parent) earthquake selection and the correlation-metric strategy described below. On the other hand, the studies by Yamagishi et al. [20–22], based on single-parent earthquake selection and the Mean-Shift algorithm, have experimentally demonstrated some limitations of this approach. We consider that one direction to overcome these limitations is the enhancement from unweighted single-parent networks to weighted multiple-parent ones. In what follows, we review the related work on seismicity declustering and on the k-nearest neighbors (kNN) approach, which exploits networks with k-parent nodes.

2.1 Link-Based Declustering Approach
Seismicity declustering has been the subject of intensive study over the years [15], where the main idea of this approach is to separate a given catalog into subsets based on a specific relationship (such as foreshocks, mainshocks, and aftershocks) between earthquakes. Most of these declustering algorithms are based on a deterministic spatio-temporal window approach [5,8] or on a stochastic model [7,28], which are generally well suited for large earthquakes characterized by evident aftershock series. One of the representative methods, which is the basis of several stochastic declustering methods, is the Epidemic Type Aftershock Sequence (ETAS) model [10,11], which uses a likelihood analysis with space, time, and magnitude.

As another declustering analysis with links, Baiesi and Paczuski [2] proposed a simple spatio-temporal metric called the correlation-metric to correlate earthquakes with each other, and Zaliapin et al. [27] further defined the rescaled distance and time. This alternative approach directly considers a tree of earthquakes in which each earthquake may be a parent (foreshock) of several later earthquakes, but can be a child (aftershock) of only one earlier earthquake. Technically, the parent is found as the nearest neighbor using the proximity function of the spatio-temporal metric based on the Gutenberg-Richter law [6], which is the relationship between the magnitude of aftershocks and the frequency of the aftershocks, and the modified Omori's law [12,17,18], which is the relationship between the time elapsed after the mainshock and the occurrence rate of aftershocks. Furthermore, the correlation-metric is a promising method because its fundamental functional form is similar to that of the ETAS model. Following this metric, Zaliapin and Ben-Zion [24,25] determined the statistical properties of earthquake clusters characterizing bursts and swarms, finding a relationship between the predominant cluster and the heat flow in seismic regions.

2.2 k-Nearest Neighbors Approach
Yamagishi et al. [20,22] performed earthquake clustering by link disconnection based on average magnitudes, using link-based declustering algorithms. In that clustering experiment, under the condition that a parent node is selected from the earthquakes that occurred before the child node, the single-parent earthquake selection, which is the same concept as the nearest neighbor graph [13], was guaranteed to yield the minimum spanning tree (or minimum weight spanning tree) [9,14]. These proximity graphs have well-known extensions: the k-nearest neighbors (kNN) graph [1,3] is a natural extension of the nearest neighbor graph, which can be regarded as the special case of the kNN graph with k = 1, and the relative neighborhood graph [16] and the Gabriel graph [4] are natural extensions of the minimum spanning tree. However, the relative neighborhood graph and the Gabriel graph require $O(N^3)$ time, whereas the kNN graph is relatively simple in concept and implementation and is fast, although it basically requires $O(N^2)$ time and space. In this kNN approach, a similarity or distance metric is used to find the k ≥ 1 nearest neighbors (parents) of each child node. Thus, the extension to a weighted network [19] using the similarity or distance metric is natural, and we can expect it to perform well.
3 Proposed Method
Let $D = \{(x_i, t_i, m_i) \mid 1 \le i \le N\}$ be a set of observed earthquakes, where $x_i$, $t_i$, and $m_i$ stand for the location vector, time, and magnitude of the observed earthquake $i$, respectively. Here, we treat every earthquake (event) as a single point in a spatio-temporal space, as done by the representative declustering methods in [15]. We also assume that these earthquakes are ordered from oldest to most recent, i.e., $t_i < t_j$ if $i < j$. Below we describe our proposed algorithm, which consists of two steps: network construction and link-weight assignment.
Step 1. Construct a network from the observed earthquakes in $D$ by selecting multiple-parent nodes for each earthquake.
Step 2. Assign a weight to each link over the constructed network by using a link-weighting scheme.
In what follows, we describe the details of these steps.
3.1 Network Construction
Among several seismicity declustering algorithms, we focus on the correlation-metric proposed by Baiesi and Paczuski [2]. Namely, for each pair of earthquakes $i$ and $j$ such that $i < j$, the spatio-temporal metric $n(i, j)$ is defined as
$$n(i, j) = (t_j - t_i)\, \|x_i - x_j\|^{d_f}\, 10^{-b m_i}, \qquad (1)$$
where $d_f$ is the fractal dimension, set to $d_f = 1.6$, $b$ is the parameter of the Gutenberg-Richter law [6], set to $b = 0.95$, and the spatial and temporal distances are measured in kilometers and seconds, respectively. Then, an earthquake $j$ is regarded as the aftershock (child node) of $i(j)$ if the metric $n(i, j)$ is minimized, i.e., $i(j) = \arg\min_{1 \le i < j} n(i, j)$. Thus, based on the correlation-metric, we can [...] $> 1$. This argument can naturally explain why the centrality values of the earthquake with id = 18 increase quite rapidly when $k = 2$. Here, we can apply similar arguments to the other earthquakes with id = 4, 14, 16, 20, and 22, which are referred to as 2004 Kii Peninsula 1, 2011 Tohoku 1, 2011 Iwate offshore, 2011 Sanriku offshore, 2011 Miyagi, and 2016 Kumamoto 1, in this order. We consider that these arguments partly support the usefulness of these networks with multiple-parent nodes.
Here, let $J_k^{\mathrm{INC}}(i)$ be the set of incremental child nodes defined by $J_k^{\mathrm{INC}}(i) = J_k(i) \setminus J_{k-1}(i)$, where $J_0(i) = \emptyset$. Figure 4 shows a visual evaluation of the incremental child nodes $J_k^{\mathrm{INC}}(i)$ in each network $G(k)$ for the above-mentioned earthquake with id = 18. The color of each marker indicates the time difference $t_j - t_i$ from the earthquake with id = 18, and each marker size is proportional to the magnitude $m_i$ of each earthquake. In the case of the single-parent network $G(1)$, the incremental child nodes $J_1^{\mathrm{INC}}(i)$ in Fig. 4a are concentrated around the location of the earthquake with id = 18. In contrast, in the case of the multiple-parent network $G(k)$ with $k > 1$, the incremental child nodes $J_k^{\mathrm{INC}}(i)$ in Figs. 4b, 4c, and 4d spread out around the earthquake with id = 18. Moreover, Fig. 5 compares the incremental child nodes $J_k^{\mathrm{INC}}(i)$ for each $k$ in terms of the spatio-temporal distance from the earthquakes with id = 18 and 16, which show the greatest increase in the centrality values under the UNI scheme and the LID scheme when $k = 2$.
Here, each marker size setting is the same as in Fig. 4, and the horizontal and vertical axes represent the spatial distances $\|x_i - x_j\|$ and the temporal distances $t_j - t_i$ from the earthquakes with id = 18 and 16, respectively. From Figs. 5a and 5b, the incremental child nodes $J_1^{\mathrm{INC}}(i)$ in the case of the single-parent network $G(1)$ are relatively limited in number and in the spatio-temporal range they cover, whereas the incremental child nodes $J_k^{\mathrm{INC}}(i)$ in the case of the multiple-parent network $G(k)$ with $k > 1$ are numerous and cover a wide variety of coordinates.
Fig. 4. Visual evaluations of incremental child nodes $J_k^{\mathrm{INC}}(i)$ of id = 18. Panels: (a) k = 1, (b) k = 2, (c) k = 3, (d) k = 4; each panel is a map with longitude and latitude axes.
Fig. 5. Comparisons of the incremental child nodes $J_k^{\mathrm{INC}}(i)$. Panels: (a) id = 18, (b) id = 16.
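As an illustration, the metric of Eq. (1) and the incremental child sets $J_k^{\mathrm{INC}}(i)$ can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' implementation: the function names, the tuple layout of an event, and the use of planar coordinates in kilometers are assumptions made here, and the brute-force search takes $O(N^2)$ metric evaluations.

```python
import math

D_F = 1.6   # fractal dimension d_f, as set in the text
B = 0.95    # Gutenberg-Richter parameter b, as set in the text

def metric(ev_i, ev_j):
    """Correlation-metric n(i, j) of Eq. (1) for an earlier event i and a later event j.

    Each event is a tuple (x_km, y_km, t_sec, magnitude); planar coordinates
    are an assumption of this sketch.
    """
    xi, yi, ti, mi = ev_i
    xj, yj, tj, _ = ev_j
    dist = math.hypot(xi - xj, yi - yj)
    return (tj - ti) * dist ** D_F * 10 ** (-B * mi)

def knn_parents(events, k):
    """For each event j (events sorted by time), list its k nearest earlier events."""
    parents = []
    for j in range(len(events)):
        ranked = sorted(range(j), key=lambda i: metric(events[i], events[j]))
        parents.append(ranked[:k])
    return parents

def child_sets(events, k):
    """J_k(i): the children j that have i among their k nearest parents."""
    children = [set() for _ in events]
    for j, ps in enumerate(knn_parents(events, k)):
        for i in ps:
            children[i].add(j)
    return children

def incremental_children(events, k):
    """J_k^INC(i) = J_k(i) minus J_{k-1}(i), with J_0(i) empty."""
    current = child_sets(events, k)
    previous = child_sets(events, k - 1)
    return [c - p for c, p in zip(current, previous)]
```

With k = 1 this reduces to the single-parent selection $i(j)$; larger k adds further parents per child, which is why the incremental sets of a strong earlier event can grow when k increases.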
Fig. 6. Samples of precision recall curves. Panels: (a) k = 1, (b) k = 3.
Figure 6 shows some samples of precision-recall curves, where we compare the curves of the LID and UNI schemes when $k = 1$ in Fig. 6a and when $k = 3$ in Fig. 6b. From these experimental results, in terms of the performance on the constructed networks, we can confirm that the LID scheme works better than the UNI scheme. On the other hand, since the UNI scheme works better for a small range of rank $h$, this might indicate that we need to explore a more sophisticated link-weighting scheme. Nevertheless, we believe that our approach to constructing weighted networks with multiple-parent nodes is promising.
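For readers who want to reproduce curves of this kind, the following is a generic sketch of precision and recall at each rank $h$. It assumes that nodes are ranked by a centrality score and that a fixed set of reference nodes (for example, a list of selected major earthquakes) counts as correct; the function name and the data layout are hypothetical and not taken from the paper.

```python
def precision_recall_curve(scores, relevant):
    """Return [(precision, recall)] at ranks h = 1..len(scores).

    scores:   dict mapping node id -> centrality value (higher ranks first)
    relevant: non-empty set of node ids counted as correct answers
    """
    ranking = sorted(scores, key=scores.get, reverse=True)
    hits = 0
    curve = []
    for h, node in enumerate(ranking, start=1):
        if node in relevant:
            hits += 1
        curve.append((hits / h, hits / len(relevant)))
    return curve
```

Two weighting schemes can then be compared by computing one curve per scheme from the corresponding centrality scores over the same reference set.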
5 Conclusion
In this paper, we addressed the problem of constructing weighted networks of earthquakes with multiple-parent nodes, in which pairs of earthquakes with strong interaction are connected. To this end, we focused on the correlation-metric method, which produces an unweighted single-parent network, and then developed an extended method for constructing a network with multiple-parent nodes and assigning link weights based on a logarithmic-distance scheme. In our experimental evaluation, we used an earthquake catalog that covers the whole of Japan and selected 24 major earthquakes that caused significant damage or casualties in Japan. In comparison with four different weighting schemes, i.e., uniform, magnitude, inverse-distance, and normalized-inverse-distance, we showed that our proposed method works well in terms of the ranking accuracy based on the most basic weighted degree centrality, and then discussed the reasons why weighted networks with multiple-parent nodes can improve the ranking accuracy. As a future task, we plan to evaluate our proposed method by using a variety of datasets and centrality measures.
Acknowledgements. This work was supported by JSPS Grant-in-Aid for Scientific Research (C) (No. 18K11441).
References
1. Altman, N.S.: An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 46(3), 175–185 (1992)
2. Baiesi, M., Paczuski, M.: Scale-free networks of earthquakes and aftershocks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 69(6), 066106 (8 pages) (2004)
3. Eppstein, D., Paterson, M.S., Yao, F.F.: On nearest-neighbor graphs. Discret. Comput. Geom. 17, 263–282 (1997)
4. Gabriel, K.R., Sokal, R.R.: A new statistical approach to geographic variation analysis. Syst. Biol. 18(3), 259–278 (1969)
5. Gardner, J.K., Knopoff, L.: Is the sequence of earthquakes in southern California, with aftershocks removed, Poissonian? Bull. Seismol. Soc. Am. 64(5), 1363–1367 (1974)
6. Gutenberg, B., Richter, C.: Seismicity of the Earth and Associated Phenomena, p. 310, 2nd edn. Princeton University Press, Princeton, New Jersey (1954)
7. Kagan, Y., Jackson, D.: Long-term earthquake clustering. Geophys. J. Int. 104, 117–133 (1991)
8. Knopoff, L., Gardner, J.K.: Higher seismic activity during local night on the raw worldwide earthquake catalogue. Geophys. J. Int. 28, 311–313 (1972)
9. Kruskal, J.B.: On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7(1), 48–50 (1956)
10. Ogata, Y.: Statistical models for earthquake occurrences and residual analysis for point processes. J. Am. Stat. Assoc. 83(401), 9–27 (1988)
11. Ogata, Y.: Space-time point-process models for earthquake occurrences. Ann. Inst. Stat. Math. 50, 379–402 (1998)
12. Omori, F.: On the aftershocks of earthquakes. J. Coll. Sci. 7, 111–200 (1894)
13. Preparata, F.P., Shamos, M.I.: Computational Geometry - An Introduction. Texts and Monographs in Computer Science, Springer, Heidelberg (1985)
14. Prim, R.C.: Shortest connection networks and some generalizations. Bell Syst. Tech. J. 36(6), 1389–1401 (1957)
15. van Stiphout, T., Zhuang, J., Marsan, D.: Seismicity declustering. Community Online Resource for Statistical Seismicity Analysis (2012)
16. Toussaint, G.T.: The relative neighbourhood graph of a finite planar set. Pattern Recognit. 12(4), 261–268 (1980)
17. Utsu, T.: A statistical study on the occurrence of aftershocks. Geophys. Mag. 30, 521–605 (1961)
18. Utsu, T., Ogata, Y., Matsu’ura, R.: The centenary of the Omori formula for a decay law of aftershock activity. J. Phys. Earth 43(1), 1–33 (1995)
19. Wasserman, S., Faust, K.: Social Network Analysis: Methods and Applications. Structural Analysis in the Social Sciences, Cambridge University Press, Cambridge, UK (1994)
20. Yamagishi, Y., Saito, K., Hirahara, K., Ueda, N.: Spatio-temporal clustering of earthquakes based on average magnitudes. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) Complex Networks & Their Applications IX. COMPLEX NETWORKS 2020. Studies in Computational Intelligence, vol. 943, pp. 627–637. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-65347-7_52
21. Yamagishi, Y., Saito, K., Hirahara, K., Ueda, N.: Magnitude-weighted mean-shift clustering with leave-one-out bandwidth estimation. In: Proceedings of the 18th Pacific Rim International Conference on Artificial Intelligence (2021)
22. Yamagishi, Y., Saito, K., Hirahara, K., Ueda, N.: Spatio-temporal clustering of earthquakes based on distribution of magnitudes. Appl. Netw. Sci. 6(1), 71 (2021)
23. Zaliapin, I., Ben-Zion, Y.: A global classification and characterization of earthquake clusters. Geophys. J. Int. 207(1), 608–634 (2016)
24. Zaliapin, I., Ben-Zion, Y.: Earthquake clusters in Southern California I: identification and stability. J. Geophys. Res. Solid Earth 118(6), 2847–2864 (2013)
25. Zaliapin, I., Ben-Zion, Y.: Earthquake clusters in Southern California II: classification and relation to physical properties of the crust. J. Geophys. Res. Solid Earth 118(6), 2865–2877 (2013)
26. Zaliapin, I., Ben-Zion, Y.: Earthquake declustering using the nearest-neighbor approach in space-time-magnitude domain. J. Geophys. Res. Solid Earth 125(4), e2018JB017120 (2020)
27. Zaliapin, I., Gabrielov, A., Keilis-Borok, V., Wong, H.: Clustering analysis of seismicity and aftershock identification. Phys. Rev. Lett. 101(1), 1–4 (2008)
28. Zhuang, J., Ogata, Y., Vere-Jones, D.: Stochastic declustering of space-time earthquake occurrences. J. Am. Stat. Assoc. 97(458), 369–380 (2002)
Motif Discovery in Complex Networks
Motif-Role Extraction in Uncertain Graph Based on Efficient Ensembles
Soshi Naito and Takayasu Fushimi
School of Computer Science, Tokyo University of Technology, Hachioji 192-0982, Japan
{c011833449,fushimity}@edu.teu.ac.jp
Abstract. In this paper, we formulate a new problem of extracting motif-based node roles from a graph with edge uncertainty, in which we first count roles for each node in each sampled graph, then calculate similarities between nodes in terms of role frequency, and finally divide all nodes into clusters according to the similarity of their roles. To achieve good accuracy of role extraction for uncertain graphs, a huge amount of graph sampling, role counting, similarity calculation, or clustering is needed. To efficiently extract node roles from a large-scale uncertain graph, we propose four ensemble methods and compare them in terms of the similarity of their results and their efficiency. From experiments using four different types of real networks with probabilities assigned to their edges, we confirmed that the graph-ensemble method is much more efficient in terms of execution time than the other ensemble methods equipped with the state-of-the-art motif counting algorithm.
1 Introduction
In the field of network science, motif counting is an important task for understanding the features of graphs, and it has been studied extensively over the years [1–3,11,13], starting with the work by Milo et al. [8]. Studies that extend the concept of motifs, which represent the characteristics of graphs, by classifying node positions in directed three-node motifs based on structural equivalence, and that then define and analyze the roles of nodes, are also being actively conducted [7,10]. In Fig. 1, subgraphs and nodes are numbered based on 13 motifs and 30 roles. For example, a node that frequently appears as Role13 can be said to have a role specialized in transmitting information, and a node with Role24 has a role of transmitting received information to others. In this way, extracting motif-based roles for each node would be applicable to extracting influencers and supermediators, which are important in viral marketing.
Fig. 1. 30 roles based on directed three-node motif.
As another trend in network science in recent years, research on uncertain graphs (a.k.a. probabilistic graphs) is active. An uncertain graph is a graph in which the uncertainty of edge existence is expressed as a probability, and it can reflect real-world phenomena such as blockage in a road network, follow/unfollow actions among users in a social network, and so on. To examine the frequency of motifs or roles in such an uncertain graph, it is necessary to compute the expected frequency over all possible graphs. For an uncertain graph with $L$ uncertain edges, the number of possible graphs is $2^L$, so enumerating all of them is difficult even for a small graph; thus, approximation by sampling is generally employed. As with motif counting, role counts on uncertain graphs can be used in a wide range of fields, including marketing, city planning, and protein analysis. A motif-counting method for uncertain graphs called LINC was proposed by Ma et al. [6]; it is based on efficient sampling that focuses on the structural similarity among sample graphs. LINC is a state-of-the-art method that enables extremely efficient sampling for uncertain graphs with low uncertainty and can obtain the average frequency faster than the naive sampling method by examining only the edges whose presence/absence state changes. On the flip side, it has the problem that it is slower than the naive sampling method for uncertain graphs with high uncertainty.
In this paper, we formulate the motif-based role extraction problem for uncertain graphs and propose four ensemble-based role extraction methods: the graph ensemble, role-vector ensemble, similarity ensemble, and cluster ensemble methods. In addition, we conducted comparison experiments among the graph ensemble method, the role-vector ensemble method equipped with LINC, and the similarity ensemble method equipped with LINC in terms of the similarity of results and efficiency, and confirmed that the graph ensemble method is the fastest. This paper is organized as follows. After explaining related work in Sect. 2, we give our problem setting and proposed method in Sect. 3 and Sect. 4. In Sect. 5, we report and discuss experimental results. Finally, in Sect. 6, we summarize this paper and address future work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 501–513, 2022. https://doi.org/10.1007/978-3-030-93409-5_42
2 Related Work
Uncertain graphs have been studied in a wide range of contexts, including reliability, querying, and mining. Extending existing graph analysis methods, including motif counting, to uncertain graphs is an important task. Motif counting techniques have been studied for many years, starting with the pioneering work of Milo et al. [8], and various algorithms have been developed for different purposes [1–3,11,13]. Wernicke proposed a hash-based algorithm called ESU. It avoided the need to store all subgraphs in a hash table and improved the efficiency of motif counting by not counting the same subgraph twice [13]. Itzhack et al. proposed an efficient algorithm that traverses a breadth-first search tree with the target node as the root. It represents the existence of links in a subgraph as a bit string and can efficiently identify motif patterns without checking the isomorphism of each subgraph [3]. In this study, we adopt the algorithm of Itzhack et al. for motif counting from sample graphs. Grochow and Kellis proposed a very efficient algorithm for searching for a single motif [2]. This algorithm constructs a partial mapping from a particular graph to a target motif. In addition, it introduces a technique called symmetry breaking to avoid counting the same motif multiple times, which greatly improves the execution time. Ahmed et al. proposed a parallel algorithm for three- and four-node motifs that does not enumerate all motif instances but counts some motifs, such as cliques and cycles, and uses the transition relations between motifs to compute all other motifs analytically [1]. Pinar et al. proposed a divide-and-conquer algorithm that identifies the substructure of each found subgraph and divides it into smaller ones [11]. Although it is a very efficient method, it cannot be applied to directed networks. McDonnell et al. defined motif-based roles for nodes by focusing on structural equivalence in directed 3-node motifs and proposed a transformation matrix from the motif-frequency vector to the role-frequency vector [7]. The node roles targeted in this paper are the same as those defined by McDonnell et al., but we count them based on the above-mentioned algorithm proposed by Itzhack et al., whereas McDonnell et al. transformed motif frequencies into role frequencies.
Motif counting for uncertain graphs has not yet been thoroughly studied. The following are some of the major studies. Tran et al. proposed a method to compute an unbiased estimator of the number of motifs from noisy and incomplete data, but the method assumes that all edges have uniform joint probabilities and is not applicable to non-uniform probabilities [12]. Ma et al. proposed two sampling-based algorithms to obtain basic statistics such as the mean, variance, and probability distribution of motif counts [6]. The first is a simple sampling method, called PGS, which samples a large number of possible graphs from an uncertain graph and counts instances of a single motif in each sample graph. In addition, the paper discusses the number of samples sufficient to accurately estimate the average number of motifs based on Hoeffding’s inequality. The second, more efficient method, called LINC, uses the structural similarity between sample graphs to update the frequency of motifs by examining only the edge differences between consecutive samples. LINC outputs the same results as PGS but runs much faster than PGS when exactly the same samples are used. In this work, we regard the LINC algorithm as a state-of-the-art technique and propose an ensemble algorithm that is more efficient than ensemble algorithms equipped with a LINC-based role-counting routine.
3 Problem Framework
3.1 Motif Based Role Extraction
First of all, we formalize the problem of node role extraction based on motif counting for a deterministic graph $G = (V, E)$, where $V$ and $E$ denote the sets of nodes and edges, respectively. In a deterministic graph $G$, the existence probability of every edge is 1. Following the work [7], we define node roles based on structural equivalence in motifs. As mentioned earlier, the directed 3-node motifs have 30 kinds of roles, as shown in Fig. 1. Our problem framework consists of three steps: first, counting motif-based roles and constructing a role vector for each node; second, calculating the pair-wise similarity for all node pairs; and third, dividing all the nodes into clusters, each of which consists of nodes with similar role vectors (see Fig. 2). In the first step, we count $R$ roles for each node $u$ and construct an $R$-dimensional vector $r_u$ whose $j$-th element is the frequency with which node $u$ appears as role $j$. In the case of directed three-node motifs, the number of roles is $R = 30$, as shown in Fig. 1. The matrix in which the role vectors of all $N$ nodes are arranged is written as $\mathbf{R} = [r_1, \ldots, r_N]^T$. Here $A^T$ denotes the transpose of a vector or matrix $A$. In the second step, we calculate the cosine similarity $c(u, v) = \frac{r_u^T r_v}{\|r_u\| \|r_v\|}$ for all node pairs $(u, v) \in V \times V$. The matrix in which the cosine similarities of all $N \times N$ node pairs are arranged is written as $\mathbf{C} = [c(u, v)]_{u \in V, v \in V}$. In the third step, we divide all the nodes into $K$ clusters based on a greedy algorithm for the k-medoids problem [9]. As a result, we obtain the matrix $\mathbf{H} = [h_{u,k}]_{u \in V, k = 1:K}$, in which the affiliations of the $N$ nodes to the $K$ clusters are arranged, where $h_{u,k} = 1$ if node $u$ belongs to cluster $k$, and $h_{u,k} = 0$ otherwise. The result is a set of node clusters with similar role vectors. This series of processes is defined as role extraction.
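The second and third steps can be sketched as follows, taking the role vectors as given. This is a minimal sketch under stated assumptions: the greedy rule used here (repeatedly adding the medoid that most increases the sum, over nodes, of the similarity to the closest chosen medoid) is one common greedy formulation of k-medoids and is assumed rather than taken from [9]; the function names are hypothetical; role counts are assumed non-negative, so cosine similarities are at least 0.

```python
import math

def cosine_matrix(R):
    """C[u][v] = cosine similarity between role vectors R[u] and R[v]."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in R]
    n = len(R)
    C = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if norms[u] > 0 and norms[v] > 0:
                dot = sum(a * b for a, b in zip(R[u], R[v]))
                C[u][v] = dot / (norms[u] * norms[v])
    return C

def greedy_k_medoids(C, K):
    """Pick K medoids greedily (K <= number of nodes) and assign each node to its most similar medoid."""
    n = len(C)
    best = [0.0] * n          # similarity of each node to its closest chosen medoid
    medoids = []
    for _ in range(K):
        best_gain, best_m = -1.0, None
        for m in range(n):
            if m in medoids:
                continue
            gain = sum(max(C[u][m] - best[u], 0.0) for u in range(n))
            if gain > best_gain:      # strict comparison: ties keep the smallest index
                best_gain, best_m = gain, m
        medoids.append(best_m)
        best = [max(best[u], C[u][best_m]) for u in range(n)]
    labels = [max(range(K), key=lambda k: C[u][medoids[k]]) for u in range(n)]
    return medoids, labels
```

Because the selection is deterministic given the similarity matrix, the same input always yields the same clusters, which matches the remark in Sect. 5.4 that the result does not depend on initialization.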
Fig. 2. Process of motif-based role extraction.
3.2 Role Extraction from Uncertain Graph
Next, we formalize the role extraction problem for an uncertain graph $\mathcal{G} = (G, p)$, where $G = (V, E)$ is the backbone graph and $p : E \to (0, 1]$ is the edge existence probability. To solve the role extraction problem exactly for an uncertain graph $\mathcal{G}$, it is necessary to extract the roles for all possible graphs $\mathbb{G} = \{G_i = (V, E_i);\ E_i \subseteq E\}$ by the above procedure and to ensemble all the results with consideration of their occurrence probabilities. Following related research on uncertain graphs, the occurrence probability of a possible graph $G_i$ is calculated based on independent Bernoulli trials for all the edges:
$$\Pr[G_i] = \prod_{e \in E_i} p(e) \prod_{e \in E \setminus E_i} (1 - p(e)),$$
and the following ensembled value calculation is required for role extraction:
$$\mathbf{H} = \mathop{\Phi}_{G \in \mathbb{G}} (\mathbf{H}_G; \Pr[G]).$$
Here, $\Phi$ stands for an ensemble operator. In this case, the clustering results $\mathbf{H}_G$ of each deterministic graph $G$ are ensembled with their occurrence probabilities $\Pr[G]$ as weights. Given the number of edges $L = |E|$, obtaining the exact ensemble result requires an astronomically large number of possible graphs, $|\mathbb{G}| = 2^L$, which is infeasible to process even for small graphs; thus, approximation by sampling is generally employed.
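The occurrence probability and the sampling approximation can be illustrated with a short sketch. The names `graph_probability` and `sample_graphs`, and the representation of the uncertain graph as a dictionary from edges to probabilities, are assumptions made for this example only.

```python
import random

def graph_probability(present_edges, prob):
    """Pr[G_i]: product of p(e) over present edges and of 1 - p(e) over absent edges."""
    pr = 1.0
    for edge, p in prob.items():
        pr *= p if edge in present_edges else 1.0 - p
    return pr

def sample_graphs(prob, num_samples, seed=0):
    """Draw possible graphs by one independent Bernoulli trial per edge."""
    rng = random.Random(seed)
    return [
        {edge for edge, p in prob.items() if rng.random() < p}
        for _ in range(num_samples)
    ]
```

For a toy graph with two uncertain edges, the four possible graphs have probabilities that sum to 1, whereas for realistic $L$ only the sampled graphs are ever materialized.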
4 Methodology
In this paper, we propose four ensemble approaches based on sampling for extracting node roles from a given uncertain graph. Figure 3 depicts our proposed ensemble methods, namely graph ensemble, role-vector ensemble, similarity ensemble, and cluster ensemble, using $S$ graphs $\{G_1, \ldots, G_S\}$ sampled from an uncertain graph $\mathcal{G}$, where each sampled graph $G_s = (V, E_s)$, $E_s \subseteq E$, is a deterministic graph.
Fig. 3. Four ensemble approaches.
4.1 Graph Ensemble
The graph-ensemble method ensembles the sample graphs $\{G_1, \ldots, G_S\}$ to find the ensembled graph $\bar{G}$:
$$\mathop{\Phi}_{G \in \mathbb{G}} (G; \Pr[G]) \approx \mathop{\Phi}_{s=1}^{S} (G_s; 1/S) = \bar{G}.$$
Here, $\bar{G} = (V, \bar{E}, w)$ is a weighted graph with the edge weight $w(e) = \sum_{s=1}^{S} \delta(e \in E_s)/S$, which represents the fraction of the $S$ samples in which $e$ appears, where $\delta(\mathrm{cond})$ is a Boolean function that returns 1 if cond is true and 0 otherwise. Then, the method examines how many times each node plays each role in the ensemble graph $\bar{G}$ with consideration of the probability $w(e)$ (see footnote 1), constructs the role vectors $\bar{\mathbf{R}}$, calculates the cosine similarities $\bar{\mathbf{C}}$, assigns a cluster to each node, and outputs $\bar{\mathbf{H}}$. This method ensembles $S$ graphs with $L$ edges and counts the roles of all $N$ nodes from the ensembled graph only once. So, in the case of roles based on $k$-node motifs, the dominant time complexity is $O(SL + N \bar{d}^{(k-1)})$, where $\bar{d}$ is the average degree.
4.2 Role-Vector Ensemble
The role-vector-ensemble (vector-ensemble for short) method averages the role vectors $\{\mathbf{R}_1, \ldots, \mathbf{R}_S\}$, each of which is obtained from a sampled graph $G_s$, to find the ensembled vectors $\bar{\mathbf{R}}$:
$$\mathbf{R} = \mathop{\Phi}_{G \in \mathbb{G}} (\mathbf{R}_G; \Pr[G]) \approx \frac{1}{S} \sum_{s=1}^{S} \mathbf{R}_s = \bar{\mathbf{R}}.$$
Then, the method calculates the cosine similarities $\bar{\mathbf{C}}$, assigns a cluster to each node, and outputs $\bar{\mathbf{H}}$. Executing the role count naively on $S$ sample graphs takes a lot of computation time. Therefore, we take advantage of the fact that the structures of two sample graphs $G_s$ and $G_{s'}$ ($1 \le s \ne s' \le S$) are similar to each other, i.e., the difference between the two edge sets, $|(E_s \setminus E_{s'}) \cup (E_{s'} \setminus E_s)|$, is small, and efficiently compute the role frequencies by extending the LINC algorithm [6], which counts motifs by focusing only on the edges that differ between sample graphs. The LINC algorithm examines the states of the $L$ edges and updates the motif frequencies associated with the edges whose states change; the expected number of such edges is $2L(p - p^2)$, where $p$ is the average probability of edge existence. Our vector-ensemble method based on the LINC algorithm counts the roles of all the nodes from the backbone graph only once and computes the averaged role vectors from the $S$ sampled graphs. Then, it ensembles $S$ role vectors, each of size $N \times R$. So the dominant time complexity is $O(S(NR + L(p - p^2)\bar{m}))$, where $\bar{m}$ is the average number of motif instances associated with each edge.
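The expected number of changed edges, $2L(p - p^2)$, follows from a one-line argument under the assumption of a uniform probability $p$ and independent samples: an edge differs between two samples when it is present in exactly one of them, which happens with probability $2p(1 - p)$. The sketch below checks this numerically; the function names are hypothetical and the simulation is only an illustration, not part of LINC.

```python
import random

def expected_changed_edges(L, p):
    """Expected number of edges whose state differs between two independent samples."""
    return 2 * L * (p - p * p)

def simulate_changed_edges(L, p, trials, seed=0):
    """Average number of state changes between consecutive independent samples."""
    rng = random.Random(seed)
    prev = [rng.random() < p for _ in range(L)]
    total = 0
    for _ in range(trials):
        cur = [rng.random() < p for _ in range(L)]
        total += sum(a != b for a, b in zip(prev, cur))
        prev = cur
    return total / trials
```

The quantity is largest at $p = 0.5$ and shrinks as $p$ approaches 1, which is consistent with the statement that difference-based counting pays off for graphs with low uncertainty.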
1 A more detailed algorithm will be presented in forthcoming papers.
4.3 Similarity Ensemble
The similarity-ensemble method averages the similarity matrices $\{\mathbf{C}_1, \ldots, \mathbf{C}_S\}$, each of which is calculated from the role vectors $\mathbf{R}_s$, to find the ensembled matrix $\bar{\mathbf{C}}$:
$$\mathbf{C} = \mathop{\Phi}_{G \in \mathbb{G}} (\mathbf{C}_G; \Pr[G]) \approx \frac{1}{S} \sum_{s=1}^{S} \mathbf{C}_s = \bar{\mathbf{C}}.$$
Then, the method assigns a cluster to each node and outputs $\bar{\mathbf{H}}$. Similar to the vector-ensemble method, we accelerate the role-count routine by using the LINC algorithm. This method counts the roles of all the nodes from the backbone graph only once, computes $S$ role vectors for the $S$ sampled graphs, and calculates $S$ similarity matrices, each of size $N \times N$. Then, it ensembles these similarity matrices. So the dominant time complexity is $O(SN^2)$.
4.4 Cluster Ensemble
The cluster-ensemble method ensembles the clustering results, each of which is represented as a cluster-affiliation matrix in $\{\mathbf{H}_1, \ldots, \mathbf{H}_S\}$:
$$\mathbf{H} = \mathop{\Phi}_{G \in \mathbb{G}} (\mathbf{H}_G; \Pr[G]) \approx \mathop{\Phi}_{s=1}^{S} (\mathbf{H}_s; 1/S) = \bar{\mathbf{H}}.$$
A method for ensembling clustering results has not yet been established, so we do not discuss it here.
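To make the difference between the first three ensemble approaches concrete, the following toy sketch applies them to the same set of sampled edge sets. It is not the authors' implementation: the motif-role counter is replaced by a deliberately simple stand-in, the pair (out-degree, in-degree), and all function names are hypothetical. Only the place where the averaging happens (edges, vectors, or similarities) mirrors the text.

```python
import math

def edge_weights(samples):
    """Graph ensemble: w(e) = fraction of the S samples that contain edge e."""
    S = len(samples)
    w = {}
    for edges in samples:
        for e in edges:
            w[e] = w.get(e, 0.0) + 1.0 / S
    return w

def degree_vectors(nodes, weighted_edges):
    """Toy stand-in for role counting: (weighted out-degree, weighted in-degree)."""
    vec = {v: [0.0, 0.0] for v in nodes}
    for (u, v), w in weighted_edges.items():
        vec[u][0] += w
        vec[v][1] += w
    return vec

def cosine(a, b):
    na, nb = math.hypot(*a), math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0

def vector_ensemble(nodes, samples):
    """Average the per-sample vectors over the S samples."""
    S = len(samples)
    avg = {v: [0.0, 0.0] for v in nodes}
    for edges in samples:
        vec = degree_vectors(nodes, {e: 1.0 for e in edges})
        for v in nodes:
            avg[v][0] += vec[v][0] / S
            avg[v][1] += vec[v][1] / S
    return avg

def similarity_ensemble(nodes, samples):
    """Average the per-sample cosine similarities over the S samples."""
    S = len(samples)
    avg = {(u, v): 0.0 for u in nodes for v in nodes}
    for edges in samples:
        vec = degree_vectors(nodes, {e: 1.0 for e in edges})
        for u in nodes:
            for v in nodes:
                avg[(u, v)] += cosine(vec[u], vec[v]) / S
    return avg
```

With this linear stand-in, the graph ensemble and the vector ensemble give identical vectors, while the similarity ensemble generally differs because the cosine is not linear. For real motif roles, which depend on combinations of several edges, the graph and vector ensembles need not coincide either, which is why the paper compares them empirically.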
5 Experimental Evaluations
5.1 Dataset and Settings
In our experimental evaluations, role extraction based on the directed 3-node motifs is performed on the following four directed graphs observed in the real world to confirm the effectiveness of the proposed method. The graph sizes are shown in Table 1. For these graphs, we set a uniform edge existence probability $p(e) = p \in [0.5, 0.9]$. The number of samples is set to $S \in \{10^1, 10^2, 10^3, 10^4\}$, and the number of clusters is set to $K = 10$. We experimented with varying the number of samples and clusters to see how the similarity of the results and the execution time change with them. For the sake of space, only the values for $K = 10$ are presented in this paper (see footnote 2).
Table 1. Basic statistics of datasets.
Dataset                                              #nodes N   #edges M   avg cos
Celegans (neurons and synapses in C. elegans) [5]         131        764      0.56
Gnutella (peer-to-peer file sharing) [5]               10,876     39,994      0.61
Blog (trackback among weblogs)                         12,047     53,315      0.32
Enron (email among employees) [5]                      19,603    210,950      0.76
Fig. 4. Cosine-similarities between role vectors estimated by each ensemble method. Panels: (a) Celegans, (b) Gnutella, (c) Blog, (d) Enron; vertical axis: cosine similarity; horizontal axis: number of samples ($10^1$ to $10^4$).
5.2 Similarity Evaluation of Estimated Role Vectors
First, we evaluate the similarity between the estimated role vectors by the average of the cosine similarities between the role vectors of each node, i.e., for node $v$, $\cos(v) = \frac{\langle r_v^{(gra)}, r_v^{(vec)} \rangle}{\|r_v^{(gra)}\| \|r_v^{(vec)}\|}$, where $r_v^{(gra)}$ and $r_v^{(vec)}$ are the role vectors of node $v$ estimated by the graph-ensemble and vector-ensemble methods, respectively. Figure 4 shows the average cosine similarity $\frac{1}{N} \sum_{v \in V} \cos(v)$ with respect to the number of samples $S$ used for estimation. From these figures, it can be seen that the cosine similarity is significantly higher than the average cosine similarity between all node pairs shown in the rightmost column of Table 1; that is, very similar results are obtained by the graph-ensemble and vector-ensemble methods.
2 For the small graph, Celegans, we set $K = 5$.
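The evaluation measure of this subsection, the node-wise cosine similarity averaged over all nodes, can be written compactly. The sketch below assumes that the two estimates are given as lists of role vectors in the same node order; the function name is hypothetical, and nodes with an all-zero vector in either estimate contribute 0, which is a choice made here and not stated in the paper.

```python
import math

def mean_nodewise_cosine(R_gra, R_vec):
    """Average over nodes v of cos(v) between two role-vector estimates."""
    total = 0.0
    for a, b in zip(R_gra, R_vec):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        if na > 0 and nb > 0:
            total += sum(x * y for x, y in zip(a, b)) / (na * nb)
    return total / len(R_gra)
```

Note that this compares the same node across two methods, whereas the "avg cos" column of Table 1 compares different nodes within one graph, so the latter serves as a baseline level.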
Fig. 5. Rank-correlations between pair-wise similarities estimated by each ensemble method. Left: graph-ensemble vs vector-ensemble; Center: graph-ensemble vs similarity-ensemble; Right: vector-ensemble vs similarity-ensemble. Panels: (a) Celegans, (b) Gnutella, (c) Blog, (d) Enron; vertical axis: Spearman rank correlation coefficient; horizontal axis: number of samples.
5.3 Similarity Evaluation of Estimated Similarity Matrices
Next, we evaluate the similarity between the estimated similarity matrices by Spearman’s rank correlation coefficient between the lists that consist of the $N(N-1)/2$ pair-wise similarities, i.e., we calculate Spearman’s correlations among $c^{(gra)}$, $c^{(vec)}$, and $c^{(sim)}$ estimated by the graph-ensemble, vector-ensemble, and similarity-ensemble methods, respectively. Figure 5 depicts the correlation coefficients with respect to the number of samples $S$ used for estimation. From these figures, it can be seen that, except for vector-ensemble vs similarity-ensemble, the correlation coefficient tends to increase as the number of samples and the edge existence probability increase. Especially for $p = 0.9$ and $p = 0.8$ in graph-ensemble vs vector-ensemble, the rank correlation is extremely high, so the two methods can be expected to output similar clustering results.
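The comparison described above can be sketched in plain Python as follows. The helper names are hypothetical; ties receive average ranks, which is the usual convention but an assumption with respect to the paper, and the coefficient is undefined (division by zero) if one of the lists is constant.

```python
import math

def upper_triangle(C):
    """List the N(N-1)/2 pair-wise similarities above the diagonal of C."""
    n = len(C)
    return [C[u][v] for u in range(n) for v in range(u + 1, n)]

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the rank lists."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)
```

A typical call would be `spearman(upper_triangle(C_gra), upper_triangle(C_vec))`, where the two matrices are the similarity matrices produced by two ensemble methods for the same node order.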
Fig. 6. Normalized mutual information between k-medoids clusters estimated by each ensemble method, for (a) Celegans (K = 5), (b) Gnutella (K = 10), (c) Blog (K = 10), and (d) Enron (K = 10). Horizontal axis: number of samples; vertical axis: normalized mutual information. Left: graph-ensemble vs vector-ensemble; Center: graph-ensemble vs similarity-ensemble; Right: vector-ensemble vs similarity-ensemble.
5.4
Similarity Evaluation of Estimated Clusters
Next, we evaluate the similarity between the estimated clusters, which are the final result of our role extraction framework. Recall that our framework employs a greedy algorithm for k-medoids clustering, so for the same input similarity matrix the same result is obtained, independent of any initialization. For evaluation, we calculate the Normalized Mutual Information (NMI), which measures the similarity between clusterings [4]. Figure 6 shows the NMI with respect to the number of samples S used for estimation. These figures show that, although the results are unstable for Celegans, which is a small graph, for the other graphs the similarity increases with the number of samples and with the edge-existence probability.
5.5
Efficiency Evaluation
Next, we evaluate the efficiency of each ensemble method in terms of running time. Figure 7 presents the running time from reading an uncertain graph to outputting the clustering results for each method, where the horizontal axis is the number of samples used for estimation and the vertical axis is the running time on a logarithmic scale. From these figures, we can confirm that 1) the graph-ensemble method runs much faster than the vector-ensemble and similarity-ensemble methods; 2) the graph-ensemble method is only slightly affected by the number of samples and the edge-existence probability; 3) the vector-ensemble method is somewhat affected by the probability and affected by the number of samples; and 4) the running time of the similarity-ensemble method grows linearly with the number of samples. From the similarity evaluations and the efficiency evaluation, we conclude that the graph-ensemble method outputs results very similar to those of the other ensemble methods, which are equipped with the state-of-the-art technique LINC, in a far shorter time.
Fig. 7. Running time (sec.) of the graph-ensemble, vector-ensemble, and similarity-ensemble methods for (a) Celegans, (b) Gnutella, (c) Blog, and (d) Enron.
5.6
Characteristic Evaluation of Extracted Roles
Finally, we evaluate the clustering results of the role vectors. Due to space limitations, we only show the results for Celegans with edge-existence probability p = 0.9 obtained by the graph-ensemble method. Figure 8 shows the role vectors averaged over each cluster and the visualization result, where the color of each line in the line chart and that of each node correspond to the cluster. From Fig. 8, we make the following observations. 1) The cluster shown in red contains many instances of Role1, 4, and 16, which are derived from Motif1, 2, and 4 (see Fig. 1). The frequencies of these motifs are very high in the Celegans graph; that is, the red cluster can be interpreted as the average cluster. 2) The clusters shown in green and yellow contain many roles with an outgoing edge, such as Role4, 5, and 13. 3) The clusters shown in blue and magenta contain many roles with an incoming edge, such as Role1, 10, and 16. From these results, we can see that Celegans intrinsically has three major roles.
Fig. 8. Clustering and visualization results (Celegans, K = 5). Panels: average frequency in each cluster; visualization results.
6
Conclusion
In this paper, we formalized a new problem, i.e., role extraction from uncertain graphs, which is based on counting roles for each node in each sampled graph. To solve this problem, we proposed four ensemble methods and compared three of them. In our experiments, we evaluated these methods on four real networks of different sizes under uniform edge-existence probabilities. The results indicated that the proposed graph-ensemble method is much more efficient in terms of execution time than the other ensemble methods equipped with the state-of-the-art motif counting algorithm. For future work, we plan to obtain the exact expected value of the role count and evaluate our ensemble methods in terms of error; we will also use Hoeffding's inequality to determine an appropriate sample size. In addition, we would like to investigate the experimental results more deeply, including the cases shown in Fig. 5 where the similarity-ensemble method differs significantly from the others for certain probabilities.
Acknowledgments. This material is based upon work supported by JSPS Grant-in-Aid for Scientific Research (C) (No. 20K11940) and Early-Career Scientists (No. 19K20417).
References
1. Ahmed, N.K., Neville, J., Rossi, R.A., Duffield, N.: Efficient graphlet counting for large networks. In: 2015 IEEE International Conference on Data Mining, pp. 1–10 (2015)
2. Grochow, J.A., Kellis, M.: Network motif discovery using subgraph enumeration and symmetry-breaking. In: Speed, T., Huang, H. (eds.) RECOMB 2007. LNCS, vol. 4453, pp. 92–106. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-71681-5_7
3. Itzhack, R., Mogilevski, Y., Louzoun, Y.: An optimal algorithm for counting network motifs. Physica A: Stat. Mech. Appl. 381, 482–490 (2007). https://www.sciencedirect.com/science/article/pii/S0378437107002257
4. Kvålseth, T.O.: On normalized mutual information: measure derivations and properties. Entropy 19(11) (2017). https://www.mdpi.com/1099-4300/19/11/631
5. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection, June 2014. http://snap.stanford.edu/data
6. Ma, C., Cheng, R., Lakshmanan, L.V.S., Grubenmann, T., Fang, Y., Li, X.: LINC: a motif counting algorithm for uncertain graphs. Proc. VLDB Endow. 13(2), 155–168 (2019). https://doi.org/10.14778/3364324.3364330
7. McDonnell, M.D., Yaveroglu, O.N., Schmerl, B.A., Iannella, N., Ward, L.M.: Motif-role-fingerprints: the building-blocks of motifs, clustering-coefficients and transitivities in directed networks. Plos One 9(12), 1–25 (2014). https://doi.org/10.1371/journal.pone.0114503
8. Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science (New York, N.Y.) 298(5594), 824–827 (2002)
9. Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions. Math. Program. 14, 265–294 (1978)
10. Ohnishi, T., Takayasu, H., Takayasu, M.: Network motifs in an inter-firm network. J. Econ. Interact. Coord. 5(2), 171–180 (2010)
11. Pinar, A., Seshadhri, C., Vishal, V.: Escape: efficiently counting all 5-vertex subgraphs. In: Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Republic and Canton of Geneva, CHE, pp. 1431–1440 (2017). https://doi.org/10.1145/3038912.3052597
12. Tran, N., Choi, K.P., Zhang, L.: Counting motifs in the human interactome. Nature Commun. 4, 2241 (2013)
13. Wernicke, S.: A faster algorithm for detecting network motifs. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS, vol. 3692, pp. 165–177. Springer, Heidelberg (2005). https://doi.org/10.1007/11557067_14
Analysing Ego-Networks via Typed-Edge Graphlets: A Case Study of Chronic Pain Patients
Mingshan Jia1(B), Maité Van Alboom2, Liesbet Goubert2, Piet Bracke2, Bogdan Gabrys1, and Katarzyna Musial1
1 University of Technology Sydney, Ultimo, NSW 2007, Australia
[email protected]
2 Ghent University, Ghent, Belgium
Abstract. Graphlets, being the fundamental building blocks, are essential for understanding and analysing complex networks. The original notion of graphlets, however, is unable to encode edge attributes in many types of networks, especially in egocentric social networks. In this paper, we introduce a framework to embed edge type information in graphlets and generate a Typed-Edge Graphlets Degree Vector (TyE-GDV). By applying the proposed method to a case study of chronic pain patients, we find not only that a patient's social network structure can inform their perceived pain grade, but also that particular types of social relationships, such as friends, colleagues and healthcare workers, are more important in understanding the effect of chronic pain. Further, we demonstrate that including TyE-GDV as additional features leads to significant improvement in a typical machine learning task.
Keywords: Edge-labelled graphs · Heterogeneous networks · Attributed graphs · Graphlets · Egocentric networks · Chronic pain study
1
Introduction
Underlying the formation of complex networks, topological structure has always been a primary focus in network science. Among numerous analytical approaches, graphlets [1] have gained considerable ground in a variety of domains. In biology, it has been shown that proteins performing similar biological functions have similar local structures, as depicted by the graphlet degree vector [2]. In social science, egocentric graphlets are used to represent the patterns of people's social interactions [3]. More broadly, the notion of graphlets has been introduced in computer vision to capture the spatial structure of superpixels [4], and in neuroscience to identify structural and functional abnormalities [5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 514–526, 2022. https://doi.org/10.1007/978-3-030-93409-5_43
However, the original graphlets concept is unable to capture the richer information in networks that contain different types and characteristics of nodes or edges. Specifically, there are situations in which we are more interested in edge-labelled networks. For example, in a routing network where edges represent communication links, the label of each edge indicates the cost of traffic over that edge and is used to calculate the routing strategy. In an egocentric social network, the different types of social relationships between the ego and the alters are essential in analysing the ego's behaviour and characteristics. Some studies have extended graphlets to attributed networks (also called heterogeneous networks). Still, they either deal only with different types of nodes [6] or categorise each graphlet into a number of “colored-graphlets” according to the exhaustive combinations of different node types and/or edge types [7,8].
In this work, we introduce an approach to embedding edge type information in graphlets, named Typed-Edge Graphlets Degree Vector, or TyE-GDV for short. We employ both the classic graphlets degree vector [2] (GDV) and the proposed TyE-GDV to represent and analyse 303 egocentric social networks of chronic pain patients. The real-life data were collected from three chronic pain leagues in Belgium. Each patient selects up to ten connections and each edge is labelled with one social relationship type. After dividing the patients into four groups according to their self-perceived pain grades, we find that patients with higher grades of pain have more star-like structures (3-star graphlets) in their social networks, while patients in lower pain grade groups form more 3-cliques, tailed-triangles, 4-chordal-cycles and 4-cliques. With the additional edge type information provided by TyE-GDV, we further discover that the more numerous 3-star graphlets of higher pain grade patients are mainly formed of friends or healthcare workers, and that in 3-cliques and 4-cliques, friends and colleagues appear more frequently among patients with lower pain grades. We further apply TyE-GDV to a node classification task.
The dataset contains demographic attributes, detailed information about chronic pain (duration, diagnosis, pain intensity, etc.), and other related data such as the physical functioning score, depression score and social isolation score. We show that the edge-type encoded graphlet features captured by TyE-GDV are more distinctive than the classic non-typed graphlet features given by GDV in telling apart patients of different pain grades.
The remainder of this paper is organised as follows. Preliminary knowledge is provided in Sect. 2. Our proposed approach is introduced in Sect. 3. Experiments, results and analysis are presented in Sect. 4. Finally, we conclude and discuss future directions in Sect. 5.
2
Background and Preliminaries
In this section, we introduce the concepts of graphlets and of graphlets in the context of egocentric networks.
2.1
Graphlets
Graphlets are small non-isomorphic induced subgraphs of a network [1]. Non-isomorphic means that two subgraphs need to be structurally different, and induced means that all edges between the nodes of a subgraph must be included.
M. Jia et al.
Fig. 1. 9 graphlets and 15 orbits of 2 to 4 nodes.
Among graphlets of 2 to 5 nodes, there are 30 different graphlets in total. When the asymmetry of node positions is taken into consideration, there are 73 different local structures, which are also called automorphism orbits [2]. Simply put, orbits are all the unique positions of a subgraph. For any given node, the vector of the frequencies of all 73 orbits is then defined as the Graphlet Degree Vector (GDV). The GDV or normalised GDV is often used as a node feature to measure the similarities or differences among nodes. We summarise the graphlets of 2 to 4 nodes together with their orbits in Fig. 1. Take G6 as an example: the node at orbit-11 touches orbit-0 three times, orbit-2 twice, orbit-3 once and orbit-11 itself once. Thus, its GDV has 3 at the 0th coordinate, 2 at the 2nd coordinate, 1 at the 3rd and 11th coordinates, and 0 at the remaining coordinates.
2.2
Egocentric Graphlets
In social network analysis, egocentric networks are of particular interest when we care more about the immediate environment around each individual than about the entire world [9]. We may want to learn why some people behave the way they do, or why some people develop certain health problems. Since the notion of graphlets is defined at node level, it is naturally suited to egocentric networks, with two modifications. First, graphlets that do not meet the requirement of being an egocentric network are excluded. For example, among graphlets of up to 4 nodes (Fig. 1), G3 and G5 are eliminated because no node in them, serving as an ego, can reach all other nodes within one hop. Second, there is no need to distinguish different orbits in egocentric graphlets because only one orbit can act as an ego. Therefore, there are in total 7 egocentric graphlets of 2 to 4 nodes: 2-clique, 2-path, 3-clique, 3-star, tailed-triangle, 4-chordal-cycle and 4-clique (Fig. 2).
3
Typed-Edge Graphlet Degree Vector
This section describes the framework for generating edge-type embedded graphlet degree vector.
Fig. 2. 7 egocentric graphlets of 2 to 4 nodes. Ego node is painted in black.
The original concept of graphlets captures rich connectivity patterns in homogeneous networks. However, many real-world networks are more complex in that they contain different types of nodes and edges, which makes them heterogeneous networks. In particular, edge type information is crucial because it indicates the specific relationship between nodes. For example, in the dataset of this study, each chronic pain patient describes their egocentric social network, including up to ten actors, and each edge is labelled with 1 of 13 types of social relationships. In order to analyse edge-labelled networks at a finer granularity, we propose to embed edge-type information in graphlets. The original graphlet degree vector counts the occurrences of each type of graphlet, and as a result, a one-dimensional vector is created. Here, we propose to construct a two-dimensional vector by counting each type of edge touched by each type of graphlet. To begin with, we give the formal definition of an edge-labelled network.
Definition 1. An edge-labelled network G is a triple $\langle V, E, T_e \rangle$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes, $E = \{e_{ij}\} \subset V \times V$ is the set of edges, where $e_{ij}$ indicates an edge between nodes $v_i$ and $v_j$, and $T_e$ is the set of edge types, where $\tau_{e_{ij}}$ denotes the type of edge $e_{ij}$.
The first step of the framework is graph preprocessing, in which the set of edge types is mapped to integers ranging from 0 to $|T_e| - 1$. For instance, the 13 types of social relationships in the targeted dataset are denoted from 0 to 12 ($\tau_e \in [0, 12]$). Likewise, the set of graphlet types $T_g$ is mapped to integers ranging from 0 to $|T_g| - 1$. In this work, we consider all possible egocentric graphlets of up to 4 nodes (Fig. 2). Therefore, the seven types of graphlets are coded from 0 to 6 ($\tau_g \in [0, 6]$). Algorithm 1 shows the approach for generating a two-dimensional vector of size $|T_g| \times |T_e|$, i.e., the Typed-Edge Graphlet Degree Vector (TyE-GDV), for any nodes of interest. Specifically, after initialisation, for each node in a given node set V and for each of the seven egocentric graphlet types, the vector is updated through the Update function (Algorithm 2). $C(N_i, 2)$ and $C(N_i, 3)$ denote all possible 2-combinations and 3-combinations of the set of neighbours $N_i$ of node i. Due to the preprocessing step, $\tau_g$ and $\tau_e$ are conveniently used as indices when updating the vector. For example, if a type ‘2’ graphlet (3-clique) is detected and its three edges are of type ‘0’, ‘1’ and ‘2’, the vector elements at coordinates (2, 0), (2, 1) and (2, 2) each increase by 1. In the end, a dictionary with nodes as keys and their corresponding TyE-GDVs as values is returned.
Algorithm 1: Typed-Edge Graphlet Degree Vector.
input : preprocessed graph G = ⟨V, E, T_e⟩, set of graphlet types T_g, node set V (the nodes of interest).
output: dictionary dic of vectors for all nodes ∈ V.
initialise: dic = {};
foreach i ∈ V do
    initialise a 2d-vector vec of size |T_g| × |T_e| with zeros;
    foreach u ∈ N_i do
        Update(vec, g_0, e_iu);                                       /* 2-clique */
    foreach u, v ∈ C(N_i, 2) do
        if v ∉ N_u then
            Update(vec, g_1, [e_iu, e_iv]);                           /* 2-path */
        else
            Update(vec, g_2, [e_iu, e_iv, e_uv]);                     /* 3-clique */
    foreach u, v, w ∈ C(N_i, 3) do
        if u ∉ N_v ∧ u ∉ N_w ∧ v ∉ N_w then
            Update(vec, g_3, [e_iu, e_iv, e_iw]);                     /* 3-star */
        else if v ∈ N_u ∧ w ∉ N_u ∧ w ∉ N_v then
            Update(vec, g_4, [e_iu, e_iv, e_iw, e_uv]);               /* tailed-triangle */
        else if w ∈ N_u ∧ v ∉ N_u ∧ v ∉ N_w then
            Update(vec, g_4, [e_iu, e_iv, e_iw, e_uw]);               /* tailed-triangle */
        else if w ∈ N_v ∧ u ∉ N_v ∧ u ∉ N_w then
            Update(vec, g_4, [e_iu, e_iv, e_iw, e_vw]);               /* tailed-triangle */
        else if u ∈ (N_v ∩ N_w) ∧ w ∉ N_v then
            Update(vec, g_5, [e_iu, e_iv, e_iw, e_uv, e_uw]);         /* 4-chordal-cycle */
        else if v ∈ (N_u ∩ N_w) ∧ w ∉ N_u then
            Update(vec, g_5, [e_iu, e_iv, e_iw, e_uv, e_vw]);         /* 4-chordal-cycle */
        else if w ∈ (N_u ∩ N_v) ∧ v ∉ N_u then
            Update(vec, g_5, [e_iu, e_iv, e_iw, e_uw, e_vw]);         /* 4-chordal-cycle */
        else
            Update(vec, g_6, [e_iu, e_iv, e_iw, e_uw, e_vw, e_uv]);   /* 4-clique */
    dic[i] = vec;
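Algorithm 1 and its Update function can be sketched in Python as follows. This is a compact restatement under stated assumptions, not the authors' released code: the graph is assumed to be given as an adjacency dictionary `adj` plus an `edge_type` dictionary keyed by `frozenset({u, v})`, and the function name `tye_gdv` is hypothetical. Instead of spelling out every case of the 3-combination loop, the sketch selects the 4-node graphlet by the number of edges present among the three alters (0, 1, 2 or 3), which yields the same classification as the case analysis in Algorithm 1.

```python
from itertools import combinations

# Graphlet indices follow the paper's coding of the seven egocentric graphlets.
G_2CLIQUE, G_2PATH, G_3CLIQUE, G_3STAR, G_TAILED_TRI, G_CHORDAL, G_4CLIQUE = range(7)


def tye_gdv(adj, edge_type, nodes, n_edge_types):
    """Return {node: 7 x n_edge_types count matrix} for the given ego nodes.

    adj: dict node -> set of neighbours (undirected graph).
    edge_type: dict frozenset({u, v}) -> integer type in [0, n_edge_types).
    """

    def update(vec, g, edges):
        # Update step: bump the counter of every edge type touched by graphlet g.
        for a, b in edges:
            vec[g][edge_type[frozenset((a, b))]] += 1

    result = {}
    for i in nodes:
        vec = [[0] * n_edge_types for _ in range(7)]
        nbrs = sorted(adj[i])
        for u in nbrs:
            update(vec, G_2CLIQUE, [(i, u)])
        for u, v in combinations(nbrs, 2):
            if v in adj[u]:
                update(vec, G_3CLIQUE, [(i, u), (i, v), (u, v)])
            else:
                update(vec, G_2PATH, [(i, u), (i, v)])
        for u, v, w in combinations(nbrs, 3):
            ego_edges = [(i, u), (i, v), (i, w)]
            alter_edges = [(a, b) for a, b in ((u, v), (u, w), (v, w)) if b in adj[a]]
            # 0, 1, 2 or 3 edges among the alters select the 4-node graphlet.
            g = (G_3STAR, G_TAILED_TRI, G_CHORDAL, G_4CLIQUE)[len(alter_edges)]
            update(vec, g, ego_edges + alter_edges)
        result[i] = vec
    return result


# Toy ego network (hypothetical): ego "e" with alters a, b, c; a and b know each other.
adj = {"e": {"a", "b", "c"}, "a": {"e", "b"}, "b": {"e", "a"}, "c": {"e"}}
edge_type = {
    frozenset(("e", "a")): 0,
    frozenset(("e", "b")): 1,
    frozenset(("e", "c")): 2,
    frozenset(("a", "b")): 0,
}
vec_e = tye_gdv(adj, edge_type, ["e"], 3)["e"]
```

In the toy network the ego takes part in one 3-clique (with a and b), two 2-paths and one tailed-triangle, so only the rows for those graphlets and the 2-clique row are non-zero.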
4
Experiments and Analysis
In this section, we apply the proposed method to analyse egocentric social networks of chronic pain patients. Our code is available at https://github.com/MingshanJia/explore-local-structure.
4.1
Dataset
The dataset was collected from chronic pain patients of the Flemish Pain League, the League for Rheumatoid Arthritis and the League for Fibromyalgia [10]. Each patient used the graphical tool GENSI [11] to generate their egocentric social network containing up to 10 alters. The types of relationship between the ego and the alters are explicitly given (all 13 types of social relationships are listed in Table 1). Participants were also asked to fill out a sociodemographic/pain questionnaire. After excluding inconsistent and incomplete entries, 303 patients' egocentric social networks and their sociodemographic/pain characteristics constitute the final dataset. The average age of all patients is 53.5 ± 12 years (248 females and 55 males).

Algorithm 2: Update Vector.
Function Update
    input : 2d-vector vec, type of graphlet τ_g, edge list L_e.
    foreach e ∈ L_e do
        τ_e = GetType(e);
        /* τ_g and τ_e are used as indices in vec. */
        vec[τ_g][τ_e] increase by 1;

Table 1. Edge type and total number of occurrences of each type in all networks.
Relationship               Type    Total number of occurrences
Partner                    T-1     222
Father/Mother              T-2     209
Brother/Sister             T-3     293
Children/Grandchildren     T-4     493
Friend                     T-5     506
Family-in-law              T-6     207
Other family               T-7     142
Neighbour                  T-8     69
Colleague                  T-9     57
Healthcare worker          T-10    233
Member of organisations    T-11    74
Acquaintance               T-12    15
Other                      T-13    17
Figure 3 gives some basic information about these egocentric networks, including the ego nodes' degree distribution and their edge-type distribution. The edge-type distribution is calculated by summing, over all ego nodes, the number of edges of each type, which is also shown in the third column of Table 1. From the degree distribution (Fig. 3a), we see that most patients (62%) have 10 social contacts in their networks. However, we do not expect degree to be a discriminative feature in the following analysis because 10 alters is the upper limit in the dataset. The edge-type distribution (Fig. 3b) shows that “friend” and “children” are the most frequent types appearing in these networks. In contrast, edges of types “neighbour”, “colleague” and “member of organisations” are underrepresented; “acquaintance” and “other” are almost negligible, simply because a person asked to name 10 contacts will name their strongest contacts, leaving no space for “acquaintance” or “other” relationships.
Fig. 3. Degree distribution and edge type distribution of all patients.
Furthermore, pain grades are calculated by means of the Graded Chronic Pain Scale (GCPS), which assesses both pain intensity and pain disability [12]. Patients are then classified into 5 grades based on their average intensity and disability scores: grade-0 no pain; grade-1 low intensity and low disability; grade-2 high intensity and low disability; grade-3 moderate disability regardless of pain intensity; and grade-4 high disability regardless of pain intensity. Because all participants are chronic pain patients, their GCPS grades range from grade-1 to grade-4. Specifically, we have 21 patients of grade-1, 33 patients of grade-2, 67 patients of grade-3 and 182 patients of grade-4. In this work, we aim to explore whether graphlets and typed-edge graphlets are beneficial for recognising the GCPS grades of chronic pain patients.
4.2
Analysing Pain Grades via GDV and TyE-GDV
Previous studies have revealed that social interactions play an important role in the perception of pain [13]. For example, a strong association was found between perceived social support and pain interference [14], and improvements in social isolation lead to significant improvements in patients' emotional and physical functioning [15]. Usually, the social context of a patient is measured by means of the Patient Reported Outcome Measurement Information System (PROMIS®) [16] or the Social Support Satisfaction Scale (ESSS) [17]. These measurements, however, are not based on patients' actual social networks and therefore cannot provide insights into the impact of network structures or specific types of interactions. To address this issue, we apply the classic graphlets and the proposed typed-edge graphlets to analyse patients' social networks.
First, we calculate the average Graphlet Degree Vectors of patients from each GCPS grade. A parallel coordinates plot shows the average degrees of all seven egocentric graphlets at each grade (Fig. 4). We see that patients with higher-grade pain (grade 3 and grade 4) have more star-like structures (3-star graphlets) in their social networks, and patients with lower pain grades (grade 1 and grade 2) form more 3-cliques, tailed-triangles, 4-chordal-cycles and 4-cliques. A poorly connected star-like structure indicates a more isolated social environment, and a better connected structure such as the 3-clique or the 4-clique could be a sign of better social support. These findings are consistent with the previously mentioned studies [13–15] and provide further evidence that a patient's social network could inform the perceived pain grade. In addition, we find that the number of connections (2-cliques) does not help distinguish pain grades. This may result from the limited number of contacts in the dataset. Nevertheless, another work also found that the size of a patient's egocentric social network is not significantly related to changes in pain [18]. This also explains why more complicated network structures should be considered in the analysis of patients' social networks.
Fig. 4. Parallel coordinates plot of average GDV of different GCPS grades. Each coordinate represents the average number of graphlets belonging to that type.
Further, in order to investigate the relationship between the types of interactions and the pain grades, we employ the Typed-Edge Graphlet Degree Vector and focus on two particular graphlets, i.e., the poorly connected 3-star graphlet and the well connected 4-clique graphlet. These two graphlets are chosen because they present distinct differences between patients of lower pain grades and patients of higher pain grades. For each of the graphlets, we calculate the average TyE-GDV of patients from every pain grade and generate a parallel coordinates plot (Fig. 5). We find that in the 3-star graphlet (Fig. 5a), higher-grade pain patients have significantly more edges of type ‘5’ (friend) and type ‘10’ (healthcare worker) than lower-grade pain patients. In other words, friends and healthcare workers are not well connected in the networks of higher-grade pain patients. This suggests the potential for interventions that increase the social involvement of a patient's friends and healthcare workers to improve the management of chronic pain.
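The per-grade averaging that underlies these parallel coordinates plots can be sketched as below. The helper name `mean_vector_by_group`, the patient identifiers and all counts are hypothetical and serve only to show the computation; the same function applies to a GDV or to one graphlet's row of a TyE-GDV.

```python
from collections import defaultdict


def mean_vector_by_group(vectors, groups):
    """Average equal-length count vectors within each group label."""
    sums = {}
    counts = defaultdict(int)
    for node, vec in vectors.items():
        g = groups[node]
        if g not in sums:
            sums[g] = [0.0] * len(vec)
        sums[g] = [s + x for s, x in zip(sums[g], vec)]
        counts[g] += 1
    return {g: [s / counts[g] for s in total] for g, total in sums.items()}


# Hypothetical three-coordinate vectors for three patients and their GCPS grades.
vectors = {"p1": [10, 45, 0], "p2": [10, 30, 6], "p3": [8, 10, 12]}
grades = {"p1": 4, "p2": 4, "p3": 1}
means = mean_vector_by_group(vectors, grades)
```

Each resulting list corresponds to one line of a parallel coordinates plot, with one coordinate per graphlet type (or per edge type).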
Fig. 5. Parallel coordinates plot of average TyE-GDV of different GCPS grades for two graphlets. Each coordinate represents the average number of edges belonging to that type.
Then, from the average TyE-GDV of the 4-clique graphlet (Fig. 5b), we observe that lower-grade pain patients have more edges of type ‘5’ (friend) than higher-grade pain patients (5.2 compared to 3.2). That is to say, friends appear more often in these tightly connected groups among patients with lower-grade pain. The importance of the friend relationship is revealed in both the 3-star and 4-clique graphlets. As pointed out by other studies [19,20], patients with severe chronic pain may be at risk of deterioration in their friendships and are in need of supportive behaviours from friends. Another marked contrast between the higher-grade and lower-grade pain patients is in edge type ‘9’ (colleague). Colleagues hardly appear (0.24 on average) in these closely connected structures among the former group, whereas more than one colleague (1.1 on average) emerges among the latter group. This may reflect the adverse effects of severe chronic pain on patients' professional activities [21]. To give an intuitive understanding of the structural differences, we give two actual examples from the dataset as the social network prototypes of pain grade-1 and pain grade-4, respectively (Fig. 6).
Fig. 6. Prototypes of GCPS grade-1 and GCPS grade-4.
This experiment shows that the extra information brought by TyE-GDV provides more insights into the relationship between patients' social link types and their pain grades. It therefore has implications for how therapeutic interventions could be improved by increasing particular types of social connections.
4.3
Predicting Pain Grades via GDV and TyE-GDV
As a new approach capturing both topological structures and edge attributes, TyE-GDV provides additional information for ego node analyses and inferences. We further demonstrate its use in a node classification task, where a significant improvement in test set performance is observed. Node classification is one of the most popular and widely adopted tasks in network science [22], where each node is assigned a ground truth category. Here, we aim to predict the GCPS grade of chronic pain patients. In order to test the utility of TyE-GDV as extra features, we feed three sets of features into a random forest classifier. The first set, the baseline, includes patients' demographic attributes, pain-related descriptions and physical/psychological well-being indicators. Since the baseline contains no structural information, we refer to it as raw features. The second set includes the raw features plus the classic GDV. The third set includes the raw features plus the proposed TyE-GDV. As the dataset is not large and the distribution of the four grades is not balanced (see Sect. 4.1), we adopt stratified 5-fold cross-validation [23] to evaluate the classification performance with the different sets of features. Also, because decision tree-based models are inherently stochastic, we repeat the above step 500 times and report the mean metric score.

Table 2. Prediction results in average macro-F1 score (± standard deviation), average gain over raw features, and total running time of 500 repetitions.
Features               Macro F1 (Mean ± Std)    Gain over raw feat. (Mean)    Time (Sum)
Stratified             0.248 ± 0.024            -                             3
Raw feat.              0.578 ± 0.005            -                             116
Raw feat. + GDV        0.597 ± 0.008            3.3%                          138
Raw feat. + TyE-GDV    0.619 ± 0.004            7.1%                          252

We report the average macro-F1 scores of the three models in Table 2. The macro-F1 score is chosen because this is a multi-class classification problem and the distribution of the four classes is unbalanced. A naive classifier (Stratified), which generates predictions by respecting the class distribution in the training set, is also included in the table. We observe a significant 7.1% improvement after adding TyE-GDV to the raw features. In comparison, adding GDV leads to an improvement of about 3.3%. As expected, however, the running time with TyE-GDV also increases because of the higher feature dimension (the total running time of 500 repetitions is shown in the last column of Table 2). This experiment shows that the structural information captured by GDV, and especially the edge attribute information captured by TyE-GDV, are useful as additional features for predicting a patient's pain grade.
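For readers unfamiliar with the metric, the following minimal sketch computes the macro-F1 score (the unweighted mean of per-class F1 scores) and a relative gain over a baseline. The function names and the toy labels are hypothetical. The gain is assumed here to be the relative difference of the mean scores, an assumption that reproduces the 3.3% and 7.1% values reported in Table 2; the paper does not spell out the formula.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class never seen or predicted scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def relative_gain(score, baseline):
    """Relative improvement of score over baseline (assumed definition)."""
    return (score - baseline) / baseline


# Toy GCPS grades (hypothetical labels and predictions).
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4]
toy_score = macro_f1(y_true, y_pred)

# Mean macro-F1 scores taken from Table 2.
gain_gdv = relative_gain(0.597, 0.578)
gain_tye = relative_gain(0.619, 0.578)
```

Because every class contributes equally to the macro average, the small grade-1 and grade-2 groups weigh as much as the large grade-4 group, which is the reason given above for preferring this metric on unbalanced data.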
5
Conclusion
In this paper, we proposed to embed edge type information in graphlets, and we introduced the framework for calculating the Typed-Edge Graphlets Degree Vector for ego nodes. After applying GDV and TyE-GDV to the chronic pain patients dataset, we found that 1) a patient's social network structure could inform their perceived pain grade; and 2) particular types of social relationships, such as friends, colleagues and healthcare workers, could bear more importance in understanding the effect of chronic pain and therefore lead to more effective therapeutic interventions. We also showed that including GDV or TyE-GDV as additional features improves performance on a typical machine learning task that predicts patients' pain grades. Future studies will extend TyE-GDV by incorporating all orbits of graphlets, applying it to sociocentric networks, or further considering the dynamics of time-varying networks.
Acknowledgement. The authors thank Vivek Velivela, Mohamad Barbar and YuXuan Qiu for their helpful comments. This work was supported by the Australian Research Council, Grant No. DP190101087: “Dynamics and Control of Complex Social Networks”. The data collection was supported by two grants from the Fund for Scientific Research-Flanders, Belgium (Grant No. G020118N awarded to Liesbet Goubert and Piet Bracke and Grant No. 11K0421N awarded to Maité Van Alboom).
References
1. Pržulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20, 3508–3515 (2004)
2. Milenković, T., Pržulj, N.: Uncovering biological network function via graphlet degree signatures. Cancer Inform. (2008)
3. Teso, S., Staiano, J., Lepri, B., Passerini, A., Pianesi, F.: Ego-centric graphlets for personality and affective states recognition. In: SocialCom. IEEE (2013)
4. Zhang, L., Song, M., Liu, Z., Liu, X., Bu, J., Chen, C.: Probabilistic graphlet cut: exploiting spatial structure cue for weakly supervised image segmentation. In: CVPR (2013)
5. Ataei, S., Attar, N., Aliakbary, S., Bakouie, F.: Graph theoretical approach for screening autism on brain complex networks. SN Appl. Sci. (2019)
6. Rossi, R.A., et al.: Heterogeneous graphlets. TKDD 15, 1–43 (2020)
7. Ribeiro, P., Silva, F.: Discovering colored network motifs. In: Contucci, P., Menezes, R., Omicini, A., Poncela-Casasnovas, J. (eds.) Complex Networks V. SCI, vol. 549, pp. 107–118. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-05401-8_11
8. Gu, S., Johnson, J., Faisal, F.E., Milenković, T.: From homogeneous to heterogeneous network alignment via colored graphlets. Sci. Rep. 8, 1–16 (2018)
9. Perry, B.L., Pescosolido, B.A., Borgatti, S.P.: Egocentric Network Analysis: Foundations, Methods, and Models. Cambridge University Press, Cambridge (2018)
10. Van Alboom, M., et al.: Well-being and perceived stigma in individuals with rheumatoid arthritis and fibromyalgia: a daily diary study. Clin. J. Pain 37, 349–358 (2021)
11. Stark, T.H., Krosnick, J.A.: GENSI: a new graphical tool to collect ego-centered network data. Soc. Netw. 48, 36–45 (2017)
12. Von Korff, M., Ormel, J., Keefe, F.J., Dworkin, S.F.: Grading the severity of chronic pain. Pain 50, 133–149 (1992)
13. Karayannis, N.V., Baumann, I., Sturgeon, J.A., Melloh, M., Mackey, S.C.: The impact of social isolation on pain interference: a longitudinal study. Ann. Behav. Med. 53, 65–74 (2019)
14. Ferreira-Valente, M.A., Pais-Ribeiro, J.L., Jensen, M.P.: Associations between psychosocial factors and pain intensity, physical functioning, and psychological functioning in patients with chronic pain: a cross-cultural comparison. Clin. J. Pain 30, 713–723 (2014)
15. Bannon, S., Greenberg, J., Mace, R.A., Locascio, J.J., Vranceanu, A.-M.: The role of social isolation in physical and emotional outcomes among patients with chronic pain. Gen. Hosp. Psychiatry 69, 50–54 (2021)
16. Hahn, E.A., et al.: Measuring social health in the patient-reported outcomes measurement information system (PROMIS): item bank development and testing. Qual. Life Res. 19, 1035–1044 (2010). https://doi.org/10.1007/s11136-010-9654-0
17. Ribeiro, J.L.P.: Escala de satisfação com o suporte social (ESSS) (1999)
18. Evers, A.W., Kraaimaat, F.W., Geenen, R., Jacobs, J.W., Bijlsma, J.W.: Pain coping and social support as predictors of long-term functional disability and pain in early rheumatoid arthritis. Behav. Res. Ther. 41, 1295–1310 (2003)
19. Forgeron, P.A., et al.: Social information processing in adolescents with chronic pain: my friends don’t really understand me. Pain 152, 2773–2780 (2011)
20. Yang, Y., Grol-Prokopczyk, H.: Chronic pain and friendship among middle-aged and older US adults. J. Gerontol. Ser. B 76, 2131–2142 (2020)
21. Harris, S., Morley, S., Barton, S.B.: Role loss and emotional adjustment in chronic pain. Pain 105, 363–370 (2003)
22. Bhagat, S., Cormode, G., Muthukrishnan, S.: Node classification in social networks. In: Aggarwal, C. (ed.) Social Network Data Analytics, pp. 115–148. Springer, Boston (2011). https://doi.org/10.1007/978-1-4419-8462-3_5
23. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Analyzing Escalations in Militarized Interstate Disputes Using Motifs in Temporal Networks
Hung N. Do and Kevin S. Xu
Electrical Engineering and Computer Science Department, University of Toledo, Toledo, OH 43606, USA
Abstract. We present a temporal network analysis of militarized interstate dispute (MID) data from 1992 to 2014. MIDs progress through a series of incidents, each ranging from threats to uses of military force by one state to another. We model these incidents as a temporal conflict network, where nodes denote states and directed edges denote incidents. We analyze temporal motifs or subgraphs in the conflict network to uncover the patterns by which different states engage in and escalate conflicts with each other. We find that different types of temporal motifs appear in the network depending on the time scale being considered (days to months) and the year of the conflict. The most frequent 3-edge temporal motifs at a 1-week time scale correspond to different variants of two states acting against a third state, potentially escalating the conflict. Temporal motifs with reciprocation, where a state acts in response to a previous incident, tend to occur only over longer time scales (e.g. months). We also find that both the network’s degree and temporal motif distributions are extremely heavy tailed, with a small number of states being involved in many conflicts.

Keywords: Temporal motifs · Dynamic networks · Militarized incidents · International conflicts · Conflict networks · Conflict escalation · Motif distribution
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 527–538, 2022. https://doi.org/10.1007/978-3-030-93409-5_44

1 Introduction

Militarized interstate disputes (MIDs) are conflicts between (sovereign) states that are not full-scale wars [1]. Each dispute can be broken down into a series of smaller incidents, which provide us with additional information about the progression of the dispute. By analyzing past data on such incidents, we may discover some insights about how they escalate and de-escalate over time.

Incidents in MIDs can be modeled as a conflict network [2]. Each node in the network is a state, and each temporal edge is an incident, such as a threat, display, or use of force one state directs toward another. We include the temporal dimension in this network to analyze how disputes and international relations change over time.

A variety of analysis methods have been developed for temporal networks [3], including centrality measures [4], temporal community structures [5], and generative models [6, 7]. In this paper, we use temporal motifs [8] to extract information from
MIDs. In a temporal or dynamic network, temporal motifs are defined as sequences of edges that complete within a time interval. Figure 1 shows an example of temporal motifs in a conflict network. The main reason to use temporal motifs to analyze incidents in MIDs is to identify patterns of escalations of disputes at different time scales.

We present an analysis of MID data from 1992 to 2014 using temporal motifs on conflict networks constructed from incidents. Our main findings are as follows:
• A variety of temporal motifs appear, depending on the time scale we consider. We observe primarily non-reciprocal motifs over short times (e.g. 1 week), indicating that target states generally do not quickly escalate a conflict in response to an incident. Reciprocal motifs are more frequent over longer times (e.g. several months).
• The number of temporal motifs observed in a year is only moderately correlated with the number of incidents. Since temporal motifs denote rapid escalations of conflict, this indicates that lots of conflicts do not escalate quickly over time.
• Both the distribution of the number of incidents and the number of temporal motifs over the states are extremely heavy tailed, although they are better explained by alternative distributions, such as a stretched exponential, rather than a power law.
2 Materials and Methods

Data Description. We use the dataset MID 5.01 [1] compiled by the Correlates of War project. We use the incident-level data (MIDIP), which provides the date of each incident in a dispute from 1992 to 2014. Each incident represents a threat, display, or use of force one state directs toward another. We construct a temporal network from the incidents using states as nodes and directed timestamped edges from the state that takes the action (side A) to the state that is the target (side B) using the start date of each incident. The dataset contains 156 states; 4,482 incidents; and 5,136 edges. (The number of edges is higher than the number of incidents because some incidents involve more than 2 states.) The MID data also contain short narrative descriptions of the incidents that are used to code the incident-level data. The time resolution in the dataset is at the level of 1 day. Some of the incidents that happen on the same day may possibly happen one after another; for such actions, the dataset provides the temporal ordering of the incidents, but not the exact time. We assign each incident a time so that all incidents on that day are equally spaced; for example, a day with 1 incident is assigned the time of 12:00 UTC, while a day with 2 incidents is assigned the times of 8:00 and 16:00 UTC.

Temporal Motifs. There are many definitions of temporal motifs present in the literature [3]. We conduct our analysis using the Python package DyNetworkX [9] to enumerate temporal motifs according to the definition from [8]. For a subgraph in the network to match a temporal motif, it needs to have the correct ordering of edges, and the time difference between the first and last edges needs to be within the completion time d (e.g. 10 days in Fig. 1). However, there is no need for the edges to be
Fig. 1. Given a temporal network (top) and a temporal motif of interest (bottom left), we find one such instance of the motif (bottom right). The other crossed out one is not an instance despite matching the correct order of edges because the completion time exceeds the time limit of 10 days from first edge to last edge.
consecutive: there may be another edge that occurs between edges in the temporal motif. Also, since we impute the exact time during the day of an incident, our temporal motif counts are an estimate of the actual counts that would be obtained given the actual incident timestamps.

We first calculate all possible 2 or 3-node, 3-edge temporal motifs, which are shown in Fig. 2(a), with maximum completion time d of 7 days. We choose these small motifs primarily for ease of interpretation. By using this short time period, we focus on rapid escalations between countries rather than long-term changes. After that, we calculate these motifs with different completion time intervals of [0, 3], (3, 7], (7, 30], and (30, 120] days to analyze the escalations at different levels of intensity. In addition to counts of different motifs, we also analyze the frequency of different temporal motifs by state and by role (red, green, or blue node) in the motif.

While many incidents involve a single state on side A and a single state on side B, some incidents have multiple states on a side. For example, a group of 4 allied countries (side A) may decide to take joint action targeting another state (side B). This would be represented by 4 temporal edges (from the 4 different side A states to the side B state) at the exact same time. Having such edges at the same time destroys the notion of order for temporal motifs, which do not account for simultaneous edges. To avoid losing these simultaneous edges, which would in turn affect our motif counts, we added a small Gaussian noise (mean of 0, standard deviation of 1 s) to each of the timestamps. This creates a unique ordering of the edges, so that they now appear in temporal motifs; however, the ordering is artificial and dependent on the noise values. We average results over 10 different networks generated with different random noise values to mitigate the artificial ordering.
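The timestamp handling described above (equal spacing of same-day incidents, then a small Gaussian jitter to break ties) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `spread_within_day` and `jitter`, the tuple layout, and the placeholder state labels "A", "B", "C" are all hypothetical.

```python
import random
from collections import defaultdict
from datetime import date, datetime, time, timedelta, timezone

def spread_within_day(incidents):
    """Assign equally spaced UTC times to incidents that share a date.

    `incidents` is a list of (day, side_a_states, side_b_state) tuples
    (hypothetical layout), already in the within-day order given by the data.
    The k-th of n incidents on a day (k = 1..n) is placed at fraction
    k / (n + 1) of the 24-hour day; every side A state of a joint incident
    gets an edge with the same timestamp.
    """
    by_day = defaultdict(list)
    for day, sources, target in incidents:
        by_day[day].append((sources, target))
    edges = []
    for day in sorted(by_day):
        midnight = datetime.combine(day, time(0), tzinfo=timezone.utc)
        n = len(by_day[day])
        for k, (sources, target) in enumerate(by_day[day], start=1):
            stamp = midnight + timedelta(hours=24 * k / (n + 1))
            for source in sources:
                edges.append((stamp, source, target))
    return edges

def jitter(edges, sigma_seconds=1.0, seed=None):
    """Add zero-mean Gaussian noise to every timestamp and re-sort by time."""
    rng = random.Random(seed)
    noisy = [(t + timedelta(seconds=rng.gauss(0.0, sigma_seconds)), s, d)
             for t, s, d in edges]
    return sorted(noisy, key=lambda edge: edge[0])

# Placeholder incidents: a joint action by A and B, a second incident on the
# same day, and a single incident on the next day.
example = spread_within_day([
    (date(2000, 1, 1), ["A", "B"], "C"),
    (date(2000, 1, 1), ["C"], "A"),
    (date(2000, 1, 2), ["A"], "C"),
])
noisy_example = jitter(example, seed=0)
```

Because the ordering of simultaneous edges depends on the noise draw, repeating `jitter` with different seeds and averaging the resulting counts mirrors the averaging over several noisy networks described in the text.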
Degree and Motif Distributions. The degree distribution is one of the most fundamental properties of a network. The degree distributions of many types of networks are heavy tailed and often claimed to follow a power law; however, recent findings suggest that alternative heavy-tailed distributions may be a better fit [10]. We consider the degree distribution of the temporal network, which is the distribution of the number of incidents that a particular state has been involved in (either on side A or B). This is also the weighted degree distribution of a static network aggregated over time. (The unweighted degree distribution of the aggregated network is less interesting because the maximum degree is limited by the 156 states.)

We next consider the distribution of participations in temporal motifs, which we call the motif distribution, in a manner analogous to a degree distribution. The motif distribution is a distribution over the number of times a state is involved in any temporal motif in any of the three roles.

We analyze degree and motif distributions using the Python package powerlaw [11], which fits both power law distributions and other heavy-tailed distributions. We use the likelihood ratio test proposed by Clauset et al. [12] for direct comparison of two candidate distributions. We compare the fit of a power law distribution with the exponential, log-normal, stretched exponential (Weibull), and truncated power law alternative distributions for both the degree and motif distributions.
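To make the likelihood ratio comparison concrete, the following sketch compares a continuous power law with a shifted exponential using closed-form maximum likelihood estimates and the normalized log-likelihood ratio. It is a simplified standalone illustration, not the `powerlaw` package's implementation: it assumes continuous data, a known lower cutoff `xmin`, and hypothetical function names.

```python
import math

def powerlaw_loglik(xs, xmin):
    """MLE exponent and per-point log-likelihoods of a continuous power law,
    p(x) = (alpha - 1) / xmin * (x / xmin) ** (-alpha) for x >= xmin."""
    n = len(xs)
    alpha = 1.0 + n / sum(math.log(x / xmin) for x in xs)
    logs = [math.log((alpha - 1.0) / xmin) - alpha * math.log(x / xmin) for x in xs]
    return alpha, logs

def exponential_loglik(xs, xmin):
    """MLE rate and per-point log-likelihoods of a shifted exponential,
    p(x) = lam * exp(-lam * (x - xmin)) for x >= xmin."""
    lam = 1.0 / (sum(xs) / len(xs) - xmin)
    logs = [math.log(lam) - lam * (x - xmin) for x in xs]
    return lam, logs

def likelihood_ratio(loglik_1, loglik_2):
    """Return (R, p). R > 0 favours model 1 and R < 0 favours model 2;
    p is the significance of the sign of R (small p: the sign is reliable).
    Assumes the per-point differences are not all identical."""
    n = len(loglik_1)
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    R = sum(diffs)
    mean = R / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    p = math.erfc(abs(R) / math.sqrt(2.0 * n * variance))
    return R, p

# Synthetic, evenly spaced quantiles of an exponential distribution (not MID data).
sample = [1.0 - 2.0 * math.log(1.0 - (i + 0.5) / 200) for i in range(200)]
alpha_hat, ll_power = powerlaw_loglik(sample, xmin=1.0)
lam_hat, ll_exp = exponential_loglik(sample, xmin=1.0)
R, p = likelihood_ratio(ll_power, ll_exp)  # negative R: the exponential fits better here
```

In this convention a positive R with a small p would indicate that the power law is preferred, which is the direction of comparison summarized by a "Preferred to PL?" column.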
3 Results

We first present the observed frequencies for all temporal motifs with a maximum completion time of 7 days over the entire data trace. We then examine several motifs that appear frequently (Sect. 3.1), motif frequencies for different completion times (Sect. 3.2), and distributions of motifs over time and states (Sect. 3.3). Code to reproduce all results in this paper can be found at the following GitHub repository: https://github.com/IdeasLabUT/Temporal_Motifs_MIDs.

Figure 2(b) shows the count of all 2 or 3-node and 3-edge temporal network motifs with a maximum completion time of 7 days from the constructed network. The average total motif count is about 33,000 with standard deviation of 18 (due to the Gaussian noise added to the timestamps). The dominant motif counts we found are, in decreasing order:
1. M1,1, M1,6, and M6,6: 2 states initiate 3 incidents in total with the same target state.
2. M6,1: 1 state initiates 3 incidents in a row with the same target state.
3. M4,1, M4,3, and M6,3: 1 state initiates 3 incidents in a row with 2 other target states.

The triangle motifs do not frequently appear, likely because it is unusual to have cyclical relationships between countries in the context of international conflicts. Moreover, the prominent motifs we listed out above are not reciprocated. They all show one side initiating incidents to another state without getting any immediate retaliation. Examples of reciprocated motifs are M1,2, M1,5, and M2,6, all of which appear to have very low counts compared to the prominent non-reciprocated ones. We discuss reciprocated motifs further in Sect. 3.2 when we vary the motif completion time.
Fig. 2. (a) All possible temporal motifs with 2 or 3 nodes and 3 edges (figure credit: [8]). Green and grey shaded boxes denote 2-node and triangle motifs, respectively. We denote the green, red, and blue nodes as roles 1, 2, and 3, respectively. (b) Temporal motif counts with a maximum completion time of 7 days for each of the 2 or 3-node, 3-edge motifs. Counts are averaged over 10 networks with Gaussian noise added to each timestamp for each temporal motif type.
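As a rough illustration of how a motif count with a maximum completion time is obtained, the sketch below counts only the simplest pattern, the repeated-edge motif that the text labels M6,1 (one state directing three incidents at the same target). It is not the DyNetworkX algorithm; the function name and the `(time, source, target)` tuple layout are assumptions, and timestamps are assumed distinct, as they are after jittering.

```python
from bisect import bisect_right
from collections import defaultdict

def count_repeated_edge_motifs(edges, delta):
    """Count, for every ordered pair (u, v), the triples of u->v edges whose
    first and last timestamps are at most `delta` apart.

    Edges in a triple need not be consecutive: other edges may occur in between.
    """
    times = defaultdict(list)
    for t, u, v in edges:
        times[(u, v)].append(t)
    counts = {}
    for pair, stamps in times.items():
        stamps.sort()
        total = 0
        for i, first in enumerate(stamps):
            # Later edges of the same pair that still fall inside the window.
            later = bisect_right(stamps, first + delta) - (i + 1)
            # Any 2 of them complete a 3-edge instance starting at `first`.
            total += later * (later - 1) // 2
        if total:
            counts[pair] = total
    return counts

# Times in days; placeholder states "A" and "B".
toy_edges = [(0, "A", "B"), (1, "A", "B"), (2, "A", "B"),
             (6, "A", "B"), (20, "A", "B"), (3, "B", "A")]
toy_counts = count_repeated_edge_motifs(toy_edges, delta=7)
```

For the toy input, the four A to B edges at days 0, 1, 2, and 6 yield four triples within 7 days, while the edge at day 20 and the single B to A edge contribute none.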
3.1 Motifs of Interest
We looked further into the prominent motifs to determine which states are most involved in each motif and in which roles in the motifs. For all motifs in Fig. 2(a), role 1 means the green node, role 2 means the red node, and role 3 means the blue node. States are denoted by their 3-letter codes from the Correlates of War project [13].

Motifs M1,1, M1,6, and M6,6 (Many to One). We can see from Fig. 3 that Yugoslavia¹ is the most frequent participant in these motifs. It has high counts in role 2 of M1,1, M1,6, and M6,6, which shows that it was on side B of many incidents initiated by other states. The most frequent participant in roles 1 and 3 is the USA, which was involved in two main disputes that created these types of motifs: a dispute with Yugoslavia, where Germany, Turkey, the Netherlands, and Greece among others were also on side A with the USA; and a dispute with Iraq, where the United Kingdom and France among others were also on side A with the USA.

Motif M6,1 (One to One 3 Times). From Fig. 4 we can see that the main participants in motif M6,1 are Israel dominating role 1, and Lebanon dominating role 2. The edges that Israel directs toward Lebanon account for about 11% of the total incidents from 1992 to 2014. On the other hand, the proportion of edges in the opposite direction is only 0.5%, which implies that Lebanon rarely retaliates against Israel. This can also be
¹ Yugoslavia and Serbia may be conflated in this data set. There is no COW country code for Serbia, despite it being mentioned in some of the narratives.
Fig. 3. Motifs M1,1, M1,6, and M6,6 (2 states initiate incidents towards 1 target state)
Fig. 4. Role of each country in Motif M6,1 (1 state initiates 3 incidents towards a target state)
illustrated by the low count of M5,1, M5,2, and M6,2, which represent balanced reciprocity between the 2 countries. Most incidents appear to be Israeli attacks on Hezbollah guerillas in southern Lebanon, with the Lebanese not getting involved as frequently. The next most frequent participants in motif M6,1 are the USA in role 1 and Iraq in role 2. Many incidents involved a show of force building up to the USA’s invasion of Iraq. Other states also participated in this dispute, which shows up in other motifs such as the many-to-one motifs discussed previously.

Motifs M4,1, M4,3, and M6,3 (One to Many). From Fig. 5, the most frequent participant in these motifs is again Yugoslavia, and this time, in role 1, which shows that it initiated incidents towards many other target states. Unlike the case of Israel and Lebanon for motif M6,1, only about 3% of the total edges in the network have Yugoslavia on either side of the one-to-many and many-to-one motifs. Even though Yugoslavia isn’t involved in as many total incidents as Israel or Lebanon, when it is involved, it tends to escalate fast and bring in many other countries.

By digging further into when the one-to-many and many-to-one motifs occur, we find that most of them are only in 1999 and 2000, which is shown in Fig. 6. From the narratives for these actions, we find that they are the result of conflicts between Yugoslavia and Kosovo. These conflicts had interventions from the North Atlantic Treaty Organization (NATO), which increases the motif counts for Yugoslavia substantially. We note that the incidents involving NATO tend to be joint threats from its member countries. As we discussed earlier, these joint threats are coded as having the same timestamps, which the temporal motif counting algorithm ignores. With the
Fig. 5. Motifs M4,1, M4,3, and M6,3 (1 state initiates incidents towards 2 target states)
noise we added to each timestamp, these incidents now have random orderings, so they are included in our motif counts. Moreover, these motifs all lack reciprocity, which shows that there was not much immediate retaliation from the state on side B. If there were, other motifs such as M1,2 or M4,2 should also occur frequently. The fact that Yugoslavia was the center of these 2 groups of motifs shows that it took a while for Yugoslavia to retaliate (side A) in 2000 after being on side B in 1999, as we can see in Fig. 6.

3.2 Temporal Motif Distributions at Different Completion Times
The smaller the completion time, the more rapid we find the escalation to be. Figure 7 shows the temporal motif distribution at different completion time intervals. We can notice that the one-to-many motifs M4,1, M4,3, and M6,3 fade away in relative frequency as the completion time gets longer. This shows that this behavior can only occur when a state escalates quickly, and single-handedly engaging in conflicts with many other states over a long time period is not an ideal tactic. However, the opposite is possible, which is also illustrated in Fig. 7, as the many-to-one motifs M1,1, M1,6, and M6,6 remain dominant as the completion time increases.

Another interesting observation is that reciprocated motifs, particularly M1,2, M1,5, and M2,6, begin to appear more frequently as we increase the completion time. For completion time between 30 and 120 days, M1,2 and M1,5 appear even more frequently than the one-to-many motifs, which indicates that reciprocation by a target (side B) state to an action from a side A state tends to happen more slowly than continued escalation from the side A state. Finally, we observe that motif M6,1, which denotes one state against another 3 times, increases significantly in frequency for longer completion time. Indeed, it becomes the most frequent motif for completion time between 30 and 120 days.

3.3 Motif Distributions Over Time and States
In the context of MIDs, temporal motifs represent escalations whereas edges represent separate conflict incidents. They represent different things, which can reveal more insights when we analyze them together. An analysis over the raw number of edges (incidents) can help us get the context that those motifs are in.
Fig. 6. Temporal motifs in 1999 and 2000, with completion window d = 7 days
Fig. 7. Temporal motifs for different completion time intervals d
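The interval-based counts behind Fig. 7 amount to binning each motif instance by its completion time. A minimal sketch of that binning is given below, assuming each instance is available as a hypothetical `(motif_name, t_first, t_last)` tuple with times in days; the interval boundaries are the ones stated in the text, while the function names are illustrative only.

```python
from bisect import bisect_left
from collections import Counter

# Right-closed interval boundaries in days: [0, 3], (3, 7], (7, 30], (30, 120].
BOUNDS = [3, 7, 30, 120]
LABELS = ["[0, 3]", "(3, 7]", "(7, 30]", "(30, 120]"]

def interval_label(completion_days):
    """Return the interval containing a non-negative completion time,
    or None if it exceeds the largest boundary."""
    index = bisect_left(BOUNDS, completion_days)
    return LABELS[index] if index < len(LABELS) else None

def histogram_by_interval(instances):
    """Count motif instances per (motif name, completion-time interval)."""
    counts = Counter()
    for name, t_first, t_last in instances:
        label = interval_label(t_last - t_first)
        if label is not None:
            counts[(name, label)] += 1
    return counts

# Placeholder instances, not taken from the MID data.
toy_instances = [("M6,1", 0, 2), ("M6,1", 0, 45), ("M1,2", 10, 50), ("M1,2", 0, 200)]
toy_histogram = histogram_by_interval(toy_instances)
```

Using `bisect_left` keeps each upper boundary inside its own interval, so a completion time of exactly 7 days falls in (3, 7] rather than (7, 30].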
Distributions Over Time. Figure 8 shows the distribution of temporal motifs and raw edges over the timeline of the dataset. We can notice that they have some correlation with each other, with correlation coefficient of 0.44. This is somewhat expected
Fig. 8. (Top) Distribution of all 3-edge temporal motifs (with a completion time of 7 days) vs. raw edges (incidents) over years. (Bottom) Distributions of prominent motifs over years.
because more edges tend to lead to more motifs. However, in many years, that is not the case as one can see in Fig. 8. From 1994 to 1998 and 2011 to 2014, even though there were a lot of incidents, the number of temporal motifs (at a 1-week time scale) remains very low. This implies that those incidents happened over a long period and don’t constitute any type of rapid escalations.

Moreover, we can observe that there are spikes in the trend of temporal motifs, particularly the many-to-one motifs in 1999 and the one-to-many motifs in 2000. If we exclude those two years, the many-to-one motifs are still among the most frequent, while the one-to-many motifs are not. Indeed, the most prominent motif type can be different in each year and different from the aggregated results over the entire data trace shown in Fig. 2(b). On the other hand, the trend of raw edges over time is more gradual. This can be explained by how one escalation tends to lead to another in a short time period then fades away altogether. Both trends show that interstate conflicts were less intense from 2002 to 2011 when there was a low number of incidents and almost no escalations.

Distributions Over States. We find that both the degree and temporal motif distributions have heavy tails, with a few states dominating the counts, as shown in Fig. 9. The exponential distribution is a particularly bad fit for both the degree and motif distributions due to its lack of heavy tails. However, after fitting different heavy-tailed distributions to the data, we find that neither the degree nor motif distributions are best explained by a true power law distribution. They are better fit by either a log-normal, stretched exponential, or truncated power law distribution as shown by the statistics in Table 1.
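The per-state counts underlying such a degree distribution, and the empirical complementary cumulative distribution usually drawn on log-log axes, can be computed as in the sketch below. It is an illustration under simplifying assumptions: degrees are counted per edge rather than per incident, and the function names and `(time, source, target)` layout are hypothetical.

```python
from collections import Counter

def weighted_degrees(edges):
    """Number of edges each state takes part in, on either side."""
    degree = Counter()
    for _, source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree

def empirical_ccdf(values):
    """Return [(x, P(X >= x))] for each distinct observed value x, ascending."""
    n = len(values)
    frequency = Counter(values)
    remaining = n
    points = []
    for x in sorted(frequency):
        points.append((x, remaining / n))
        remaining -= frequency[x]
    return points

# Placeholder edges, not taken from the MID data.
toy_edges = [(0, "A", "B"), (1, "A", "B"), (2, "A", "C"), (3, "D", "A")]
toy_degree = weighted_degrees(toy_edges)
toy_ccdf = empirical_ccdf(list(toy_degree.values()))
```

A heavy tail shows up in such a curve as a slow decay at large x, where a few states account for most of the counts.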
Fig. 9. Degree and temporal motif distributions (solid lines) in the network on a log-log scale. Both distributions are quite heavy tailed, as indicated by the poor fit of the exponential distribution compared to a power law or stretched exponential (dashed lines).

Table 1. Summary of likelihood ratio tests for degree and motif distributions. While both the degree and motif distributions are heavy tailed, they are better explained by alternative heavy-tailed distributions rather than a power law.

Distribution      Degree distribution                                Motif distribution
                  Preferred to PL?   p-value   Parameter estimate    Preferred to PL?   p-value   Parameter estimate
Power law (PL)    –                  –         α̂ = 1.44              –                  –         α̂ = 1.29
Exponential
Log-normal        Yes
λ̂ = 0.013

wa, where the stability score increases monotonically. High absolute values are reached in the last row, compared with the corresponding value for aggregation without filtering reported in the first row. Values around 70% for HS and 80–90% for CNS are reached when only the first decile of the most frequent links is kept (the corresponding percentage of contacts is much larger than 10%: around 80% for HS and 70% for CNS, see Sect. 5.1). Several observations can be made. First, at higher scales and especially for the
A. Chiappori and R. Cazabet
HS dataset, Stability still remains low in absolute terms, far from the image one could have of a “movie-like”, progressive evolution of the network. Second, rather heavy filtering is necessary to reach high values: how faithful does the resulting network remain to the original when only 10% (or even less) of the links are kept? In the next section we show what the Fidelity score can tell us about this issue.

5.4 Fidelity: FPs vs FNs
The total Distance between the filtered-aggregated network A and the original network O is computed through Eq. 3. Here, we divide the total Distance into false positives (FPs, contacts in A but not in O) and false negatives (FNs, contacts in O but not in A). We display two selected graphs for Distance, computed for the HS dataset (Fig. 3), while interactive 3D plots for both datasets are available online¹.
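The split into FPs and FNs can be sketched with plain set operations. Since Eq. 3 is not reproduced here, the example assumes that contacts are `(time_step, u, v)` tuples at the original resolution and that an aggregated link counts as a contact at every time step of its window; both the representation and the function names are assumptions rather than the paper's definitions.

```python
def aggregate_and_expand(contacts, window):
    """Aggregate contacts into non-overlapping windows of `window` time steps,
    then expand each aggregated link to every time step of its window."""
    expanded = set()
    for t, u, v in contacts:
        start = (t // window) * window
        for step in range(start, start + window):
            expanded.add((step, u, v))
    return expanded

def false_positives_negatives(original, aggregated):
    """FPs are contacts present only in the aggregated network,
    FNs are contacts present only in the original network."""
    fps = len(aggregated - original)
    fns = len(original - aggregated)
    return fps, fns

# Placeholder contacts with integer time steps.
original = {(0, "a", "b"), (1, "a", "b"), (5, "c", "d")}
aggregated = aggregate_and_expand(original, window=4)
fps, fns = false_positives_negatives(original, aggregated)

# Dropping the c-d link (as a filter might) trades FPs for one FN.
filtered = {contact for contact in aggregated if contact[1:] != ("c", "d")}
fps_filtered, fns_filtered = false_positives_negatives(original, filtered)
```

Without filtering, aggregation can only add contacts, so all errors are FPs; removing links lowers the FP count while introducing FNs, which is the trade-off examined in this section.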
Fig. 3. FPs, FNs and Distance between the original network and the filtered time-window aggregated network: (a) fixed threshold; (b) fixed window size. Data are for the configuration wf = 1 h, wa < wf, for the HS dataset.
¹ https://doi.org/10.6084/m9.figshare.16538940.v1.

In this configuration, with fixed wf and wa < wf, FNs are constant at a fixed threshold (the filtering window is the same for all aggregation windows), while FPs increase monotonically with wa (Fig. 3a). With the HS dataset, the first type of error (FPs) prevails until the threshold goes beyond 90%; then the number of FNs exceeds that of FPs, starting from the smallest window sizes. This means that performing a time-window aggregation without filtering, as is usually done in the literature, introduces plenty of FPs, whereas removing some of the links can reduce them to a quarter, at the cost of introducing a relatively small number
Evaluation of Snapshot Graphs
of FNs. The same results are found with the CNS datasets, but the minimum in Distance occurs at smaller values of the percentage threshold (around 67%, with wa on the order of a few minutes). We think that this difference comes from the distribution of link weights: for the CNS dataset it is broader, with the subset of the most frequent links being relatively less dominant.

The most interesting feature of the curves is perhaps that, at high threshold, the Distance between the original and aggregated networks presents a minimum or a plateau (Fig. 3b). This behavior comes from the opposite monotonicity of FPs and FNs. The behavior of the stochastic filtering baseline is very different in this high-filtering limit: there is no saturation (plateau or minimum) but a steep decrease towards zero. There are two main possible interpretations. One is that, for social interaction datasets such as the ones we worked with, the system behavior can truly be well characterized by looking only at the most frequent links and neglecting most of the others. In this case, temporal networks could be simplified in our rather straightforward manner. The second possibility is that the snapshot graphs obtained at high filtering actually describe only that specific subset of more frequent links, while rarer links are also crucial to characterize the network as a whole.
6 Conclusions

6.1 Discussion
We found that simple aggregation (without filtering) with non-overlapping windows returns small values of Stability, and even smaller ones as the aggregation window increases. Yet this is precisely the most common configuration for time-window aggregation in the literature, where the resulting snapshot network is then used to analyze the temporal network’s evolution. We stress that, proceeding in this way, authors can end up working with time series that are substantially unstable, with most of the links changing from one snapshot to the next. Based on our observations, we propose a list of recommendations to practitioners and researchers who wish to aggregate a temporal network into snapshot graphs:
1. Try several window lengths; do not choose one a priori, as it could lead to particularly unstable networks;
2. Clearly state whether or not the method or analysis presented requires adjacent snapshots to be stable. If it does, provide the Stability score;
3. Use a threshold to filter out edges appearing only a few times in each snapshot, to limit the influence of noise;
4. If the chosen aggregation window is only a few times the resolution scale, try to use a larger scale for the filtering;
5. If the contribution requires interpreting the network, also provide an analysis in terms of the Fidelity of the network to the original data.
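Recommendations 1 and 3 can be tried out with a few lines of code. The sketch below aggregates contacts into non-overlapping windows, keeps only links seen at least `min_count` times per window, and reports the mean Jaccard overlap of consecutive snapshots. The Jaccard overlap is used purely as a stand-in for how much adjacent snapshots share; it is not claimed to be the Stability score defined in the paper, and the count-based threshold, function names, and data layout are all assumptions.

```python
from collections import Counter, defaultdict

def build_snapshots(contacts, window, min_count=1):
    """Aggregate (t, u, v) contacts into non-overlapping windows and keep the
    undirected links observed at least `min_count` times in each window."""
    per_window = defaultdict(Counter)
    for t, u, v in contacts:
        per_window[t // window][frozenset((u, v))] += 1
    return {index: {link for link, count in links.items() if count >= min_count}
            for index, links in per_window.items()}

def mean_consecutive_jaccard(snapshots):
    """Average Jaccard overlap between each pair of adjacent windows.
    Windows without any link count as empty snapshots; pairs where both
    snapshots are empty are skipped."""
    if not snapshots:
        return float("nan")
    scores = []
    for index in range(min(snapshots), max(snapshots)):
        current = snapshots.get(index, set())
        following = snapshots.get(index + 1, set())
        union = current | following
        if union:
            scores.append(len(current & following) / len(union))
    return sum(scores) / len(scores) if scores else float("nan")

# Placeholder contacts with integer time steps.
toy_contacts = [(0, "a", "b"), (1, "a", "b"), (2, "a", "c"),
                (4, "a", "b"), (5, "a", "b"), (6, "b", "c")]
unfiltered = mean_consecutive_jaccard(build_snapshots(toy_contacts, window=4))
filtered = mean_consecutive_jaccard(build_snapshots(toy_contacts, window=4, min_count=2))
```

In the toy data, the links seen only once per window differ between the two snapshots, so removing them raises the overlap from one third to one; looping over several values of `window` gives a quick view of how sensitive the result is to the window length.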
As far as we know, we are the first to consider and prove the advantage of filtering with a larger window than the aggregation window.

6.2 Alternatives and Future Perspectives
We only presented results with non-overlapping windows, but overlapping ones (sliding windows) can also be considered. They trivially increase Stability, but there are at least two drawbacks: the snapshots are no longer independent, and the computational cost increases (especially if the distance between two subsequent snapshots is only the resolution scale dt). While we were able to validate an increase in Stability with sliding windows, reaching values only a few percentage points below 100% even without filtering, we were not able to show any remarkable improvement in Fidelity. We believe that this could be due to the data structure: since both datasets follow a well-defined timetable (lectures, breaks), non-overlapping windowing already gives a faithful representation of the original graph.

As is done in [7], some of the methods for automatic aggregation mentioned throughout Sect. 2 could be chosen to test the aggregated networks that they produce on the basis of our two scores. Results obtained with our framework could then be compared with alternative approaches [7,11]. Our scores or link weights could also be modified to include the time dependence of single contacts more explicitly. For instance, the latter could decay in time, like the weights defined in [9].

Acknowledgments. This work is supported by BITUNAM Project ANR-18-CE23-0004.
References
1. Barabasi, A.-L.: The origin of bursts and heavy tails in human dynamics. Nature 435(7039), 207–211 (2005)
2. Cazabet, R.: Data compression to choose a proper dynamic network representation. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) COMPLEX NETWORKS 2020. SCI, vol. 943, pp. 522–532. Springer, Heidelberg (2020). https://doi.org/10.1007/978-3-030-65347-7_43
3. Cazabet, R., Boudebza, S., Rossetti, G.: Evaluating community detection algorithms for progressively evolving graphs. J. Complex Netw. 8(6), cnaa027 (2020)
4. Coscia, M., Neffke, F.M.H.: Network backboning with noisy data. In: 2017 IEEE 33rd International Conference on Data Engineering (ICDE), pp. 425–436. IEEE (2017)
5. Darst, R.K., Granell, C., Arenas, A., Gómez, S., Saramäki, J., Fortunato, S.: Detection of timescales in evolving complex systems. Sci. Rep. 6(1), 1–8 (2016)
6. De Domenico, M., Nicosia, V., Arenas, A., Latora, V.: Structural reducibility of multilayer networks. Nat. Commun. 6(1), 1–9 (2015)
7. Fish, B., Caceres, R.S.: A supervised approach to time scale detection in dynamic networks. arXiv preprint arXiv:1702.07752 (2017)
8. Fournet, J., Barrat, A.: Contact patterns among high school students. PLoS One 9(9), e107878 (2014)
9. Holme, P.: Epidemiologically optimal static networks from temporal network data. PLoS Comput. Biol. 9(7), e1003142 (2013)
10. Krings, G., Karsai, M., Bernhardsson, S., Blondel, V.D., Saramäki, J.: Effects of time window size and placement on the structure of an aggregated communication network. EPJ Data Sci. 1(1), 1–16 (2012)
11. Léo, Y., Crespelle, C., Fleury, E.: Non-altering time scales for aggregation of dynamic networks into series of graphs. Comput. Netw. 148, 108–119 (2019)
12. Mastrandrea, R., Fournet, J., Barrat, A.: Contact patterns in a high school: a comparison between data collected using wearable sensors, contact diaries and friendship surveys. PLoS One 10(9), e0136497 (2015)
13. Masuda, N., Holme, P.: Detecting sequences of system states in temporal networks. Sci. Rep. 9(1), 1–11 (2019)
14. Petri, G., Expert, P.: Temporal stability of network partitions. Phys. Rev. E 90(2), 022813 (2014)
15. Ribeiro, B., Perra, N., Baronchelli, A.: Quantifying the effect of temporal resolution on time-varying networks. Sci. Rep. 3(1), 1–5 (2013)
16. Sapiezynski, P., Stopczynski, A., Lassen, D.D., Lehmann, S.: Interaction data from the Copenhagen networks study. Sci. Data 6(1), 1–10 (2019)
17. Soundarajan, S., et al.: Generating graph snapshots from streaming edge data. In: Proceedings of the 25th International Conference Companion on World Wide Web, pp. 109–110 (2016)
18. Starnini, M., Lepri, B., Baronchelli, A., Barrat, A., Cattuto, C., Pastor-Satorras, R.: Robust modeling of human contact networks across different scales and proximity-sensing techniques. In: Ciampaglia, G.L., Mashhadi, A., Yasseri, T. (eds.) SocInfo 2017. LNCS, vol. 10539, pp. 536–551. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67217-5_32
19. Stopczynski, A., Sapiezynski, P., Lehmann, S., et al.: Temporal fidelity in dynamic social networks. Eur. Phys. J. B 88(10), 1–6 (2015)
20. Sulo, R., Berger-Wolf, T., Grossman, R.: Meaningful selection of temporal resolution for dynamic networks. In: Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pp. 127–136 (2010)
21. Sun, J., Faloutsos, C., Papadimitriou, S., Yu, P.S.: Graphscope: parameter-free mining of large time-evolving graphs. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 687–696 (2007)
22. Torres, L., Blevins, A.S., Bassett, D., Eliassi-Rad, T.: The why, how, and when of representations for complex systems. SIAM Rev. 63(3), 435–485 (2021)
23. Uddin, S., Choudhury, N., Farhad, S.M., Towfiqur Rahman, Md.: The optimal window size for analysing longitudinal networks. Sci. Rep. 7(1), 1–15 (2017)
Convergence Properties of Optimal Transport-Based Temporal Networks
Diego Baptista and Caterina De Bacco
Max Planck Institute for Intelligent Systems, Cyber Valley, Tuebingen 72076, Germany
[email protected]
Abstract. We study network properties of networks evolving in time based on optimal transport principles. These evolve from a structure covering uniformly a continuous space towards an optimal design in terms of optimal transport theory. At convergence, the networks should optimize the way resources are transported through them. As the network structure evolves in time towards optimality, its topological properties also change with it. The question is how these change as we reach optimality. We study the behavior of various network properties on a number of network sequences evolving towards optimal design and find that the transport cost function converges earlier than network properties and that these monotonically decrease. This suggests a mechanism for designing optimal networks by compressing dense structures. We find a similar behavior in networks extracted from real images of the networks designed by the body shape of a slime mold evolving in time.

Keywords: Optimal transport theory · Graph theory · Network structure · Network design

1 Introduction
Optimal Transport (OT) theory studies optimal ways of transporting resources in space [1,2]. The solutions are optimal paths that connect sources to sinks (or origins to destinations) and the amount of flow traveling through them. In a general setting one may start from a continuous space in 2D, arbitrarily set sources and sinks, and then look for such optimal paths without any predefined underlying topology. Empirically, in many settings, these paths resemble network-like structures that embed optimality, in that traffic flowing along them minimizes a transport cost function. Among the various ways to compute these solutions [3], a promising and computationally efficient recent approach is that of Facca et al. [4–6], which is based on solving a set of equations (the so-called Dynamical Monge-Kantorovich (DMK) equations). This starts from an initial guess of the optimal paths, which are then updated in time until reaching a steady state configuration. At each time step, one can automatically extract a principled network from a network-like structure using the algorithm proposed in Baptista et al. [7]. This in turn allows observing a sequence of network structures that evolves in time towards optimality, as the dynamical equations are iterated.

While we know that the transport cost function is decreasing along this trajectory, we do not know how network properties on these structures evolve. For instance, in terms of the total number of edges or nodes, one may intuitively expect a monotonically decreasing behavior, from a topology covering uniformly the whole space, towards a compressed one only covering a subset of it efficiently.

Analyzing the properties of networks that provide optimal transport efficiency is relevant in many contexts and has been explored in several works [8–11]. However, these studies usually consider pre-existing underlying topologies that need to be optimized. Moreover, they focus on network properties at convergence. Here instead we consider the situation where a network can be designed in a continuous 2D space, i.e. with no pre-defined underlying topology, and monitor the whole evolution of network properties, in particular away from convergence. While this question has been explored in certain biological networks [12–15], a systematic investigation of this intuition is still missing.

In this work, we address this problem by considering several optimization settings, extracting their optimal networks, and then measuring core network properties on them. We find that network sequences show convergence patterns similar to those exhibited by their continuous counterparts. However, topological features of optimal networks tend to develop slightly more slowly than the total cost function is minimized. We also find that, in some cases, this delay in convergence presented by the networks might give better representations than those extracted at other cost-based convergence times.

Finally, we analyze real data of the P. polycephalum slime mold evolving its network-like body shape in time as it explores the space foraging. We use networks extracted from images generated in wet-lab experiments [16], and analyze their topological features. We find matching patterns between synthetic graphs and this family of real networks. Understanding how network topology evolves towards optimality may shed light on broader questions about optimal network properties and how to obtain them.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 578–592, 2022. https://doi.org/10.1007/978-3-030-93409-5_48
2 The Model
The Dynamical Monge-Kantorovich Set of Equations. We now present the main ideas of how to extract sequences of networks that converge towards an optimal configuration, according to optimal transport theory. We start by introducing the dynamical system of equations regulating this, as proposed by Facca et al. [4–6]. We assume that the problem is set on a continuous 2-dimensional space Ω ⊂ ℝ², i.e. there is no pre-defined underlying network structure. Instead, one can explore the whole space to design an optimal network topology, determined by a set of nodes and edges, and the amount of flow passing through each edge. Sources and sinks of a certain mass (e.g. passengers in a transportation network, water in a water distribution network) are displaced on it. We denote these with a “forcing” function f(x) = f⁺(x) − f⁻(x) ∈ ℝ, describing the flow-generating sources f⁺(x) and sinks f⁻(x) (also known as source and target distributions, respectively). It is assumed that ∫_Ω f(x) dx = 0 to ensure mass balance.

Fig. 1. Temporal networks. On the left, the total length l(G) (i.e. sum of the edge lengths), as a function of time t; the networks inside the insets correspond to different time steps. On the right, optimal transport density μ*; triangles are a discretization of [0, 1]². In both plots, red and green circles correspond to the support of f⁺ and f⁻, i.e. sources and sinks, respectively. This sequence is obtained for β = 1.5.

We suppose that the flow is governed by a transient Fick-Poiseuille type flux q = −μ∇u, where μ, u and q are called conductivity (or transport density), transport potential and flux, respectively. Intuitively, the conductivity can be seen as proportional to the size of the edges where mass can flow, and the potential as the pressure on nodes, which determines the flux passing through them. The set of Dynamical Monge-Kantorovich (DMK) equations is given by:

$$ -\nabla \cdot \big(\mu(t,x)\,\nabla u(t,x)\big) = f^{+}(x) - f^{-}(x), \tag{1} $$
$$ \frac{\partial \mu(t,x)}{\partial t} = \big[\mu(t,x)\,\nabla u(t,x)\big]^{\beta} - \mu(t,x), \tag{2} $$
$$ \mu(0,x) = \mu_0(x) > 0, \tag{3} $$
where ∇ = ∇_x. Equation (1) states the spatial balance of the Fick-Poiseuille flux and is complemented by no-flow Neumann boundary conditions; Eq. (2) enforces the dynamics of this system; and Eq. (3) is the initial configuration, which can be thought of as an initial guess of the solution. The parameter β (traffic rate) tunes between various optimization settings: for β < 1 we have congested transportation, where traffic is minimized; for β > 1 we have branched transportation, where traffic is encouraged to consolidate along fewer edges; and β = 1 is shortest path-like. In this work we only consider the branched transportation regime 1 < β < 2, as this is the only one where meaningful network structures can be extracted [7]. Solutions (μ*, u*) of Eqs. (1)–(3) minimize the transportation cost function L(μ, u) [4–6], defined as:

$$ L(\mu,u) := E(\mu,u) + M(\mu,u), \tag{4} $$
$$ E(\mu,u) := \frac{1}{2}\int_{\Omega} \mu\,|\nabla u|^{2}\,dx, \qquad M(\mu,u) := \frac{1}{2}\int_{\Omega} \frac{\mu^{(2-\beta)/\beta}}{2-\beta}\,dx. \tag{5} $$
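To make the roles of Eqs. (1)–(5) concrete, the following is a minimal sketch on a three-node graph rather than on the continuous domain Ω. It is not the solver used in this work: the graph discretization, the forward-Euler time step `dt`, the reading of the bracket in Eq. (2) as the magnitude of the flux on each edge, and all function names are assumptions made for illustration only.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def potentials(edges, lengths, mu, forcing):
    """Graph analog of Eq. (1): weighted Laplacian times u equals f."""
    n = len(forcing)
    lap = [[0.0] * n for _ in range(n)]
    for k, (i, j) in enumerate(edges):
        c = mu[k] / lengths[k]
        lap[i][i] += c
        lap[j][j] += c
        lap[i][j] -= c
        lap[j][i] -= c
    # The Laplacian is singular, so fix u = 0 at node 0 and solve for the rest.
    reduced = [row[1:] for row in lap[1:]]
    return [0.0] + solve_linear(reduced, forcing[1:])


def dmk_on_graph(edges, lengths, forcing, beta=1.5, dt=0.1, steps=300):
    """Forward-Euler iteration of a graph analog of Eqs. (2)-(3)."""
    mu = [1.0] * len(edges)  # uniform initial guess, mu_0 = 1
    for _ in range(steps):
        u = potentials(edges, lengths, mu, forcing)
        for k, (i, j) in enumerate(edges):
            flux = mu[k] * abs(u[i] - u[j]) / lengths[k]  # assumed flux magnitude
            mu[k] += dt * (flux ** beta - mu[k])
    return mu, potentials(edges, lengths, mu, forcing)


def transport_cost(edges, lengths, mu, u, beta):
    """Graph analog of Eqs. (4)-(5): L = E + M."""
    grads = [(u[i] - u[j]) / lengths[k] for k, (i, j) in enumerate(edges)]
    energy = 0.5 * sum(m * g * g * w for m, g, w in zip(mu, grads, lengths))
    infra = 0.5 * sum(w * m ** ((2 - beta) / beta) / (2 - beta)
                      for m, w in zip(mu, lengths))
    return energy + infra


# Source at node 0, sink at node 2; a direct edge competes with a detour via node 1.
edges = [(0, 2), (0, 1), (1, 2)]
lengths = [1.0, 1.0, 1.0]
forcing = [1.0, 0.0, -1.0]  # sums to zero (mass balance)
mu, u = dmk_on_graph(edges, lengths, forcing)
cost = transport_cost(edges, lengths, mu, u, beta=1.5)
print([round(m, 4) for m in mu], round(cost, 4))
```

In this toy setting the conductivity concentrates on the direct edge while the two detour edges decay towards zero, which is the kind of consolidation expected in the branched regime β > 1.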
L can be thought of as a combination of E, the total energy dissipated during the transport (or network operating cost), and M, the cost to build the network infrastructure (or infrastructural cost).

2.1 Network Sequences
The conductivity μ at convergence regulates where the mass should travel for optimal transportation. Since it is a function on a 2-dimensional space, it can be turned into a principled network G(μ) (a set of nodes, edges, and weights on them) by using the method proposed in [7], which in turn determines the design of the optimal network. While the authors of that work considered only values at convergence, this method is still valid at any time step, in particular at time steps before convergence. This then leads to a sequence of networks evolving in time as the DMK equations are iterated. Figure 1 shows three networks built using this method at different time steps. The leftmost inset is the densest representation that one can build from the shown discretization of the space (a triangulation), as in the plot on the right side: all the nodes are connected to all of their closest neighbors. This is what happens at initial time steps, where the network is built from mass uniformly displaced in space, as per the uniform initial condition. On the other hand, the rightmost network is built from a μ at convergence, consolidated on a more branched structure.

Formally, let μ(t, x) be a transport density (or conductivity) function of both time and space obtained as a solution of the DMK model. We denote it as the sequence {μ_t}_{t=0}^T, for some index T (usually taken to be that of the convergent state). Every μ_t is the t-th update of our initial guess μ_0, computed by following the rules described in Eqs. (1)–(3). This determines a sequence of networks {G(μ_t)}_{t=0}^T extracted from {μ_t}_{t=0}^T with [7]. Figure 1 shows three different snapshots of one of the studied sequences.

Convergence Criteria. Numerical convergence of the DMK equations (1)–(3) can be arbitrarily defined. Typically, this is done by fixing a threshold τ, and stopping the algorithm when the cost does not change by more than τ between successive time steps.
However, when this threshold is too small (τ = 10^-12 in our experiments), the cost or the network structure may settle to a constant value well before algorithmic convergence. Thus, to meaningfully establish when network optimality is reached, we consider as convergence time the first time step when the transport cost, or a given network property, reaches a value that is smaller than or equal to a certain factor p times the value reached by the same quantity at algorithmic convergence (in the experiments here we use p = 1.05). We refer to tL and tP as the convergence times in terms of the cost function and of a network property, respectively.

Network Properties. We analyze the following main network properties for the different networks in the sequences and for different sequences. Denote with G one of the studied networks belonging to some sequence {G(μ_t)}_{t=0}^T. We study the following properties, which are relevant to the design of networks for optimal transport of resources:

– |N|, total number of nodes;
– |E|, total number of edges;
– total length l(G) = Σ_e l(e), i.e. the sum of the lengths of every edge. Here l(e) is the Euclidean distance between the end nodes of e;
– average degree, the mean number of neighbors per node;
– bif(G), the number of bifurcations; a bifurcation is a node with degree greater than 2;
– leav(G), the number of leaves; a leaf is a node with degree equal to 1.
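The properties listed above can be computed directly from node coordinates and an edge list. The sketch below is only illustrative: the function name `network_properties`, the input format (a dictionary of 2D positions and a list of undirected edges) and the toy graph are assumptions, not part of the extraction pipeline of [7].

```python
import math
from collections import defaultdict


def network_properties(positions, edges):
    """Compute |N|, |E|, l(G), average degree, bif(G) and leav(G).

    positions: dict mapping each node to its (x, y) coordinates.
    edges: iterable of undirected (u, v) pairs, each edge listed once.
    """
    edges = list(edges)
    degree = defaultdict(int)
    total_length = 0.0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        # l(e) is the Euclidean distance between the end nodes of e.
        total_length += math.dist(positions[u], positions[v])
    n_nodes = len(positions)
    return {
        "nodes": n_nodes,
        "edges": len(edges),
        "total_length": total_length,
        "average_degree": 2 * len(edges) / n_nodes,
        "bifurcations": sum(1 for n in positions if degree[n] > 2),
        "leaves": sum(1 for n in positions if degree[n] == 1),
    }


# Toy tree: node 0 branches into three paths, one of which has an extra node.
positions = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (-1, 0), 4: (-2, 0)}
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
props = network_properties(positions, edges)
print(props)
```

Applying such a function to every G(μ_t) in a sequence gives one time series per property, which is the kind of curve discussed in the results.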
3 Results on Synthetic Data
To study the behavior of network structures towards optimality, we perform an extensive empirical analysis as follows. We generate synthetic data considering a set of optimal transport problems, determined by the configuration of sources and sinks. In fact, the final solutions strongly depend on how these are displaced in space. We consider here a scenario where we have one source and many sinks, which is a relevant situation in many applications. For instance, in biology, this would be the case for a slime mold placed on a point in space (the source) and looking for multiple sources of food (the sinks). Formally, consider a set of points S = {s0, s1, ..., sM} in the space Ω = [0, 1]², and a positive number r > 0. We define the distributions f⁺ and f⁻ as

$$ f^{+}(x) \propto 1_{R_0}(x), \qquad f^{-}(x) \propto \sum_{i>0} 1_{R_i}(x), $$
where 1_{R_i}(x) := 1 if x ∈ R_i, and 1_{R_i}(x) := 0 otherwise; R_i is the circle of center s_i and radius r (the value of r is automatically selected by the solver based on the discretization of the space); and the proportionality is such that f⁺ and f⁻ are both probability distributions. The transportation cost is that of Eq. (4).

Data Generation. We generate 100 transportation problems by fixing the location of the source s0 = (0, 0) (i.e. the support of f⁺ at (0, 0)), and sampling M = 15 points s1, s2, ..., sM uniformly at random from a regular grid (see supplementary information). By choosing points from the vertices of a grid, we ensure that the different sinks are sufficiently far from each other, so they are not redundant. We start from a uniform, and thus non-informative, initial guess for the solution, μ_0(x) = 1, ∀x. We fix the maximum number of iterations to be 300. We say that the sequence {μ_t}_{t=0}^T converges to a certain function μ* at iteration T if either |μ_T − μ_{T−1}| < τ, for a tolerance τ ∈ (0, 1], or T = 300. For the experiments reported in this manuscript, the tolerance τ is set to 10^-12. We consider different values of β ∈ [1.1, 1.9], thus exploring various cost functions within the branched transportation regime. Decreasing β from 2 to 1 results in traffic being more penalized, or consolidation of paths into fewer highways being less encouraged. In total, we obtain 900 network sequences, each of them containing between 50 and 80 networks.
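The data-generation procedure described above can be sketched as follows. The grid resolution (`grid_side`), the evaluation lattice, the radius `r = 0.06`, the random seed and the function names are all hypothetical choices for illustration; in the experiments r is selected by the solver and the actual candidate grid is the one shown in the supplementary information.

```python
import math
import random


def sample_problem(n_sinks=15, grid_side=6, seed=0):
    """Fix the source at (0, 0) and draw distinct sinks from a regular grid."""
    rng = random.Random(seed)
    step = 1.0 / (grid_side - 1)
    candidates = [(i * step, j * step)
                  for i in range(grid_side) for j in range(grid_side)
                  if (i, j) != (0, 0)]  # keep the source location free
    return (0.0, 0.0), rng.sample(candidates, n_sinks)


def forcing(points, source, sinks, r):
    """Evaluate f = f+ - f- at the given points.

    f+ and f- are indicator functions of discs of radius r, each
    normalized to sum to one over the points (a discrete stand-in
    for being probability distributions).
    """
    plus = [1.0 if math.dist(p, source) <= r else 0.0 for p in points]
    minus = [float(sum(math.dist(p, s) <= r for s in sinks)) for p in points]
    z_plus, z_minus = sum(plus), sum(minus)
    return [a / z_plus - b / z_minus for a, b in zip(plus, minus)]


# Evaluate the forcing on a 21 x 21 lattice covering [0, 1]^2.
h = 0.05
lattice = [(i * h, j * h) for i in range(21) for j in range(21)]
source, sinks = sample_problem()
f = forcing(lattice, source, sinks, r=0.06)
print(len(sinks), abs(sum(f)) < 1e-9)
```

Normalizing f⁺ and f⁻ separately makes the discrete forcing sum to zero, which mirrors the mass-balance condition required by the model.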
Fig. 2. Total length and Lyapunov cost. Mean (markers) and standard deviations (shades around the markers) of the total length (top plots) and of the Lyapunov cost, energy dissipation E and structural cost M (bottom plots), as functions of time t. Means and standard deviations are computed on the set described in Sect. 3. From left to right: β = 1.2, 1.5 and 1.8. Red and blue lines denote tP and tL .
Convergence: Transport Cost vs Network Properties. Figure 2 shows a comparison between network properties and the cost function minimized by the dynamics. We observe that tP > tL in all the cases, i.e. convergence in the cost function is reached earlier than convergence of the topological property. Similar behaviors are seen for other values of β ∈ [1.1, 1.9] and for other network properties (see supplementary information). For smaller values of β, convergence in transport cost is reached faster, when the individual network properties are still significantly different from their value at convergence; see Fig. 3 for an example in terms of total path length at β = 1.2. In this case, while the cost function does not change much after tL, the network properties do. This may be because the solutions for β close to 1 have non-zero μ on many edges, but most of them have low values. Indeed, we find that the most important edges, measured by the magnitude of μ on them, are those corresponding to the topology found at a later time, when the network properties also converge, as shown in Fig. 3 (bottom). This indicates that the dynamics first considers many edges, and distributes the fluxes optimally along fewer main roads. At the end, close to convergence, it focuses instead on removing redundant edges, i.e. those that have little flux traveling through them. Finally, we notice that tL is smaller for β close to 1. This reflects the fact that in this case it is easier to find a solution to the optimization problem, whereas for increasing β the configuration space becomes rough, with many local optima [4–6].
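The convergence times tL and tP compared above follow from the criterion of Sect. 2.1: the first time step at which a quantity is within a factor p of its value at algorithmic convergence. A minimal sketch is given below; the function name and the two toy series are made up for illustration and are not data from the experiments, and the criterion as written applies to quantities that decrease towards their final value.

```python
def convergence_time(values, p=1.05):
    """Return the first index t with values[t] <= p * values[-1].

    values: time series of the transport cost or of a network property,
    whose last entry is the value at algorithmic convergence.
    """
    target = p * values[-1]
    for t, v in enumerate(values):
        if v <= target:
            return t
    return len(values) - 1  # fallback; not reached for p >= 1 and values[-1] >= 0


# Toy series: the cost settles earlier than the total length.
cost = [10.0, 4.0, 2.05, 2.0, 2.0, 2.0]
total_length = [30.0, 20.0, 12.0, 6.0, 4.1, 4.0]
t_L = convergence_time(cost)
t_P = convergence_time(total_length)
print(t_L, t_P)
```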
Fig. 3. Network topologies for different convergence criteria. Mean (markers) and standard deviations (shades around the markers) of total length l(G) as a function of time. The red, black and blue vertical lines (and networks in the insets) correspond to tP, the average of tP and tL, and tL, respectively. Networks without (top row) and with (bottom row) edge weights proportional to their μ are plotted at those three time steps. Hence, the networks on top highlight the topological structure, while those on the bottom highlight the flux passing through edges.
Convergence Behavior of Network Properties. Figure 4 shows how the various network properties change depending on the traffic rate. The plots show their mean values as functions of time, for fixed values of β. Notice that quantities like the total length, the average degree, the number of bifurcations, the number of edges and the number of nodes decrease in time, signaling that sequences reach steady minimum states. These are reached at different times, depending on β, with convergence reached faster for lower β. Moreover, mean values of these properties converge to decreasing values as β increases. This is explained by a cost function increasingly encouraging consolidation of paths on fewer edges. Finally, the magnitude of the gap between the mean values of each property for different β depends on the individual property. For instance, the average degree changes more noticeably between two consecutive values of β than the total path length does; the latter shows a big gap between the value at β = 1.1 and those at all subsequent β > 1.1, which are instead similar to each other. This also shows that certain properties better reveal the distinction between different optimal traffic regimes. The number of leaves behaves more distinctly. In fact, it exhibits two different patterns: either it remains constantly equal to 0 (β = 1.1) or it increases, and with different rates, as time gets larger (β > 1.1). This number increases with β, as in this regime paths consolidate into fewer edges, thus leaving more opportunities for leaves.

Fig. 4. Evolution of network properties. Mean (markers) and standard deviations (shades around the markers) of total length l(G) (upper left), average degree (upper center), number of leaves leav(G) (upper right), number of bifurcations bif(G) (lower left), number of edges |E| (lower center) and number of nodes |N| (lower right), computed for different values of β as a function of time. Notice that |E| keeps changing for t > 70 but the scale makes it hard to perceive.

To help intuition of the different optimal designs for various β, we plot the extracted networks at convergence in Fig. 5. The positions of source and sinks are the same in all cases. The network obtained for the higher value β = 1.8 contains fewer edges and nodes than the other cases. On average, these networks have bif(G) ≈ 13, leav(G) ≈ 7, l(G) ≈ 4 and an average degree ≈ 2. These values reveal various topological features of the converged networks of this traffic regime that make it more distinct than others. For instance, having approximately 7 leaves implies that the dynamics builds networks with approximately half as many leaves as the number of sinks (M = 15) in this transportation problem, while on the other hand, we can see that bif(G) ≈ M, i.e., the number of bifurcations is almost as large as the number of sinks.

3.1 Results on Real Networks of P. polycephalum
In this section, we compare the properties of the sequences {G(μ_t)}_{t=0}^T to those extracted from real images of the slime mold P. polycephalum. This organism has been shown to perform an optimization strategy similar to that modeled by the DMK equations of Sect. 2, while foraging for food on a 2D surface [17–19].
Fig. 5. Example of optimal networks for various cost functions. Networks extracted from the solutions of the same transportation problem for various values of β. Green and red circles denote source and sinks.
Fig. 6. Network evolution of P. polycephalum. On top: P. polycephalum images and networks extracted from them. Bottom left: a zoomed-in part of the graph shown inside the red rectangle on top. Bottom right: total length as a function of time. The red shade highlights a tentative consolidation phase towards optimality.
We extract these networks from images using the method proposed by [20]. This pipeline takes as input a network-like image and uses the color intensities of its different pixels to build a graph, by connecting adjacent meaningful nodes. We choose 4 image sequences from the Slime Mold Graph Repository [16]. Every sequence is obtained by taking pictures of a P. polycephalum placed in a rectangular Petri dish and following its evolution in time. Images are taken every 120 s from a fixed position. We study the evolution of the total length for every sequence. We show in Fig. 6 the total length of the temporal network extracted from one of the mentioned image sequences (namely, image set motion12 ; see supplementary information), together with different network snapshots. As we can see from the lower rightmost plot, the evolution of the total length of the extracted networks resembles that of the synthetic network sequences analyzed above. This suggests that the DMK-generated sequences resemble the behavior of this real system in this time frame. This could mean that the DMK dynamics realistically represents a consolidation phase towards optimality of real slime molds [16]. Similar results are obtained for other sequences (see supplementary information).
4 Conclusions
We studied the properties of sequences of networks converging to optimal structures. Our results show that network sequences obtained from the solution of diverse transportation problems often minimize network properties at slower rates compared to the transport cost function. This suggests an interesting behavior of the DMK dynamics: first, it focuses on distributing paths into main roads while keeping many network edges. Then, once the main roads are chosen, it removes redundant ones where the traffic would be low. Measuring convergence of network properties would then reveal a more compressed network cleaned from redundant paths. The insights obtained in this work may further improve our understanding of the mechanisms governing the design of optimal transport networks.

We studied here a particular set of transportation problems, with one source and multiple sinks. In this case, all the main network properties studied here show similar decaying behavior. However, this analysis can be replicated for more complex settings, like multiple sources and multiple sinks [21] or multilayer networks [22], as in urban transportation networks. Potentially, this may unveil patterns of the evolution of the topological properties different from those studied in this work.

Results on real networks suggest that the networks generated by the DMK dynamics (inspired by P. polycephalum) exhibit realistic features. Strongly monotonic phases are not only typical of the mentioned slime molds but also a pattern in the artificially generated data. Alternative realistic behaviors may be seen by considering a modified version of the model described in Eq. (1), obtained by adding non-stationary forcing terms. This may highlight a behavior different from the one observed in a consolidation phase, where a network converges to an optimal design and then does not change further. This is an interesting direction for future work.

Acknowledgements. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Diego Baptista.
Supplementary Information (SI)

1 Synthetic Data
Details of the Studied Transport Problems. As mentioned in the main manuscript, we consider a set of points S = {s0, s1, ..., sM} in the space Ω = [0, 1]², and a positive number r > 0, and we use this to define the distributions f⁺ and f⁻ as

$$ f^{+}(x) \propto 1_{R_0}(x), \qquad f^{-}(x) \propto \sum_{i>0} 1_{R_i}(x), $$
where Ri is the circle of center si and radius r. The points s1 , ..., sM , the support of the sink, are sampled uniformly at random from a regular grid. The used grid and different realizations of the sampling are shown in Fig. 7.
Fig. 7. Support of f⁻. The nodes of the grid constitute the set of candidates from which the support of f⁻
Fig. 8. Total length and Lyapunov cost. Top row: from left to right we see β = 1.1, 1.3 and 1.4. Bottom row: from left to right we see β = 1.6, 1.7 and 1.9. Mean and standard deviation of the total length l(G) as function of time t; Bottom plot: Mean and standard deviation of the Lyapunov cost L, energy dissipation E and structural cost M of transport densities. Red and blue lines denote tP and tL for p = 1.05.
Total Length and Lyapunov Cost. This section shows a figure analogous to Fig. 2 of the main manuscript, for other values of β. As mentioned there, the properties show decreasing behaviors for which it is always true that tP > tL (see Fig. 8).
Fig. 9. Other network properties and Lyapunov cost. From left to right: β = 1.2, 1.5 and 1.8. From top to bottom: Mean and standard deviation of the average degree, number of nodes |N |, number of bifurcations bif (G), and the Lyapunov cost L, energy dissipation E and structural cost M. Red and blue lines denote tP and tL for p = 1.05.
Network Properties and Lyapunov Cost. This section shows a figure analogous to Fig. 2 of the main manuscript, for the other network properties (see Fig. 9).
2 P. polycephalum Networks
Data Information. In this section, we give further details about the real data used. As mentioned in the main manuscript, the images are taken from the Slime Mold Graph Repository [16]. The number of studied sequences {G_i}_i^T equals 4. The length T of every sequence depends on the number of images provided in the repository, since different experiments need more or fewer shots. An experiment, as explained in the repository’s documentation, consists of placing a slime mold inside a Petri dish with a thin sheet of agar and no sources of food. The idea, as explained by the creators, is to let the slime mold fully explore the Petri dish. Since the slime mold is initially lined up along one of the short sides of the dish, the authors stop capturing images once the plasmodium is about to reach the other short side.

Fig. 10. Network properties for P. polycephalum sequences. From top to bottom: motion12, motion24, motion40 and motion79. Subfigures show the evolution of the properties |E|, average degree and |N| for every sequence as a function of time.

Network Extraction. The studied network sequences are extracted from the image sets motion12, motion24, motion40 and motion79, which are stored in the repository. Each image set contains between 60 and 150 images; thus, the obtained sequences have diverse lengths. Every network is extracted using the Img2net algorithm described in [20]. The main parameters of this algorithm are N_runs, t2, t3 and new_size. N_runs controls how many times the algorithm is run; t2 and t3 are the minimum and maximum grayscale values that a pixel must have to be considered as a node; new_size is the size to which the input image is downsampled before extracting the network from it. For all the experiments reported in this manuscript, these parameters are set to N_runs = 1, t2 = 0.25, t3 = 1 and new_size = 180.

More Network Properties. Other network properties are computed for the real systems referenced in this manuscript. Decreasing behaviors similar to the one shown for the total length in the main manuscript are found for these properties; see Figs. 10 and 11.
Fig. 11. P. polycephalum total length evolution. From top to bottom: motion24, motion40 and motion79. Plots are separated in couples. For every couple, the plots on top show both P. polycephalum images and networks extracted from them. The network at the lower leftmost plot is a subsection of the graph shown inside the red rectangle on top. The plot at the bottom shows the total length as a function of time. The red shade in this plot highlights a tentative consolidation phase towards optimality.
References
1. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-540-71050-9
2. Santambrogio, F.: Optimal transport for applied mathematicians. Birkhäuser NY 55, 58–63 (2015)
3. Peyré, G., Cuturi, M., et al.: Computational optimal transport: with applications to data science. Found. Trends® Mach. Learn. 11(5–6), 355–607 (2019)
4. Facca, E., Cardin, F., Putti, M.: Towards a stationary Monge-Kantorovich dynamics: the physarum polycephalum experience. SIAM J. Appl. Math. 78(2), 651–676 (2018)
5. Facca, E., Daneri, S., Cardin, F., Putti, M.: Numerical solution of Monge-Kantorovich equations via a dynamic formulation. J. Sci. Comput. 82, 1–26 (2020)
6. Facca, E., Cardin, F.: Branching structures emerging from a continuous optimal transport model. J. Comput. Phys. (2020, Submitted)
7. Baptista, D., Leite, D., Facca, E., Putti, M., De Bacco, C.: Network extraction by routing optimization. Sci. Rep. 10(1), 1–13 (2020)
8. Corson, F.: Fluctuations and redundancy in optimal transport networks. Phys. Rev. Lett. 104(4), 048703 (2010)
9. Bohn, S., Magnasco, M.O.: Structure, scaling, and phase transition in the optimal transport network. Phys. Rev. Lett. 98(8), 088702 (2007)
10. Durand, M.: Architecture of optimal transport networks. Phys. Rev. E 73(1), 016116 (2006)
11. Katifori, E., Szöllősi, G.J., Magnasco, M.O.: Damage and fluctuations induce loops in optimal transport networks. Phys. Rev. Lett. 104(4), 048704 (2010)
12. Baumgarten, W., Hauser, M.J.: Functional organization of the vascular network of physarum polycephalum. Phys. Biol. 10(2), 026003 (2013)
13. Baumgarten, W., Ueda, T., Hauser, M.J.: Plasmodial vein networks of the slime mold physarum polycephalum form regular graphs. Phys. Rev. E 82(4), 046113 (2010)
14. Dirnberger, M., Mehlhorn, K.: Characterizing networks formed by p. polycephalum. J. Phys. D: Appl. Phys. 50(22), 224002 (2017)
15. Westendorf, C., Gruber, C., Grube, M.: Quantitative comparison of plasmodial networks of different slime molds. In: Proceedings of the 9th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), pp. 611–612 (2016)
16. Dirnberger, M., Mehlhorn, K., Mehlhorn, T.: Introducing the slime mold graph repository. J. Phys. D Appl. Phys. 50(26), 264001 (2017)
17. Nakagaki, T., Yamada, H., Tóth, Á.: Maze-solving by an amoeboid organism. Nature 407(6803), 470 (2000)
18. Tero, A., Kobayashi, R., Nakagaki, T.: A mathematical model for adaptive transport network in path finding by true slime mold. J. Theor. Biol. 244(4), 553–564 (2007)
19. Tero, A., et al.: Rules for biologically inspired adaptive network design. Science 327(5964), 439–442 (2010)
20. Baptista, D., De Bacco, C.: Principled network extraction from images. R. Soc. Open Sci. 8, 210025 (2021)
21. Lonardi, A., Facca, E., Putti, M., De Bacco, C.: Optimal transport for multicommodity routing on networks. arXiv preprint arXiv:2010.14377 (2020)
22. Ibrahim, A.A., Lonardi, A., Bacco, C.D.: Optimal transport in multilayer networks for traffic flow optimization. Algorithms 14(7), 189 (2021)
A Hybrid Adjacency and Time-Based Data Structure for Analysis of Temporal Networks

Tanner Hilsabeck, Makan Arastuie, and Kevin S. Xu(B)

Electrical Engineering and Computer Science Department, University of Toledo, Toledo, OH 43606, USA
{Tanner.Hilsabeck,Makan.Arastuie}@rockets.utoledo.edu, [email protected]
Abstract. Dynamic or temporal networks enable representation of time-varying edges between nodes. Conventional adjacency-based data structures used for storing networks, such as adjacency lists, were designed without incorporating time. When used to store temporal networks, such structures can quickly retrieve all edges between two sets of nodes, which we call a node-based slice, but cannot quickly retrieve all edges that occur within a given time interval, which we call a time-based slice. We propose a hybrid data structure for storing temporal networks with timestamped edges, including instantaneous edges such as messages on social media and edges with duration such as phone calls. Our hybrid structure stores edges in both an adjacency dictionary, enabling rapid node-based slices, and an interval tree, enabling rapid time-based slices. We evaluate our hybrid data structure on many real temporal network data sets and find that it achieves much faster slice times than adjacency-based baseline structures with only a modest increase in creation time and memory usage.

Keywords: Dynamic graph structure · Dynamic network · Interval tree · Adjacency dictionary · Timestamped network · Relational events
1 Introduction
Relational data is often modeled as a network, with nodes representing objects or entities and edges representing relationships between them. Dynamic or temporal networks, as opposed to static networks, allow nodes and edges to vary over time. Temporal networks have been the focus of many research efforts in recent years [13–15,21]. Many advancements have been made in temporal network analysis, including development of centrality metrics [27], identification of temporal motifs [29], and generative models [1,18].

While research on the analysis of temporal networks has advanced greatly, the data structures have seemingly lagged behind. A common approach to storing temporal networks is to adopt a static network structure, such as an adjacency list or dictionary, and save timestamps of edges as an attribute, e.g. using the NetworkX Python package [10,11].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 593–604, 2022. https://doi.org/10.1007/978-3-030-93409-5_49

(a) Sample Interval Graph

(b) Adjacency Dictionary

Outer Key   Inner Key   Edge Times
A           B           [2,5), [6,8)
B           A           [2,5), [6,8)
B           C           [2,4)
C           B           [2,4)
C           D           [5,6)
D           C           [5,6)

(c) Interval Tree

Fig. 1. Illustration of proposed hybrid data structure on a sample network.

In this paper, we design an efficient data structure for temporal networks to enable rapid slices of three types:
– Node-based slices: Given two node sets S and T, return all edges at any time between nodes in S and nodes in T. Node sets may range from a single node to all nodes in the network. Retrieving all temporal edges that contain node u is an example of a node-based slice.
– Time-based slices: Given a time interval [t1, t2), return all edges between any two nodes that occur in [t1, t2). Creating an instantaneous snapshot of a network at a specified time t is an example of a time-based slice.
– Compound slices: Given two node sets S and T as well as a time interval [t1, t2), return all edges that meet both criteria. This can be done by first conducting a node-based or a time-based slice. Creating a partial snapshot of a network containing only a subset of nodes is an example of a compound slice.

While adjacency list or dictionary structures are excellent for node-based slices, they must iterate over all pairs of nodes to perform a time-based slice. We find a conflict between node- and time-based slices for data structures. That is, choosing a data structure that enables rapid node-based slices, such as an adjacency dictionary, results in slow time-based slices, while choosing one that enables rapid time-based slices, such as a binary search tree, results in slow node-based slices. This is a limitation of using a single data structure.

Our main contributions in this paper are as follows:
– We propose a hybrid data structure that stores temporal networks using both an adjacency dictionary and an interval tree, as shown in Fig. 1.
– We develop a predictive approach for optimizing compound slices by predicting whether first conducting a node- or time-based slice would be faster given some basic network properties.
– We demonstrate that our proposed hybrid data structure achieves much faster slice times than existing structures on a variety of temporal network data sets with only a modest increase in creation time and memory usage.
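The three slice types named in this section can be stated compactly as filters over a flat list of interval edges. The following is only a brute-force reference sketch of the semantics: it scans every edge, which is exactly the cost a dedicated structure is meant to avoid. The tuple layout (u, v, t1, t2), the function names, and the treatment of edges as undirected are assumptions made for illustration; the edges are those of the sample network in Fig. 1.

```python
def overlaps(t1, t2, s1, s2):
    """True if the half-open edge interval [t1, t2) overlaps the query [s1, s2)."""
    return t1 < s2 and s1 < t2


def node_slice(edges, S, T):
    """All edges, at any time, with one endpoint in S and the other in T."""
    return [e for e in edges
            if (e[0] in S and e[1] in T) or (e[0] in T and e[1] in S)]


def time_slice(edges, s1, s2):
    """All edges, between any two nodes, active at some point in [s1, s2)."""
    return [e for e in edges if overlaps(e[2], e[3], s1, s2)]


def compound_slice(edges, S, T, s1, s2):
    """Edges meeting both criteria (here: node filter first, then time filter)."""
    return time_slice(node_slice(edges, S, T), s1, s2)


# Edges of the sample network in Fig. 1 as (u, v, t1, t2) tuples.
edges = [("A", "B", 2, 5), ("A", "B", 6, 8), ("B", "C", 2, 4), ("C", "D", 5, 6)]
all_nodes = {"A", "B", "C", "D"}

print(node_slice(edges, {"B"}, all_nodes))            # the three edges touching B
print(time_slice(edges, 5, 6))                        # [('C', 'D', 5, 6)]
print(compound_slice(edges, {"B"}, all_nodes, 3, 6))  # [('A', 'B', 2, 5), ('B', 'C', 2, 4)]
```

Note that this strict half-open test does not match zero-length (instantaneous) edges; handling those would need a separate convention, which this sketch leaves out.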
2 Background and Related Work

2.1 Temporal Network Representations
Temporal networks are typically represented in one of three ways [3,13]:
– Snapshot graph: a sequence of static graphs, in which an edge exists between nodes u and v if there is an edge active during the time interval [t1, t2).
– Interval graph¹: a sequence of tuples (u, v, t1, t2) denoting edges between node u and node v during the time interval [t1, t2).
– Impulse graph: a sequence of tuples (u, v, t) denoting edges between node u and node v at the instantaneous time t. This representation is also called a contact sequence [13] or a link stream [3,22].

Snapshot graphs are useful for their ability to quickly restore access to all available static network analysis techniques within each snapshot. Snapshots are usually taken at regular time intervals (e.g. every hour) so that finer-grained temporal information is lost within snapshots. We consider a varying length snapshot representation by creating a snapshot upon each change in the temporal network. This concept can be expanded by distinguishing a node, n, per point in time, n_{t_x}, and drawing connections between n_{t_x} and n_{t_{x+1}} [20,36]. Nodes and edges within temporal networks can also possess both presence and latency functions. Presence indicates the active duration of an object, while latency represents the temporal cost of traversals [2].

2.2 Data Structures for Networks
The two main structures for storing a static graph are the adjacency matrix and the adjacency list. For a network of n nodes, an adjacency matrix requires O(n^2) space and is thus generally used only for small networks. Adjacency lists are typically used instead in many network analysis libraries such as SNAP [25]. Adjacency lists can be further improved in average time complexity of most operations (at the cost of a constant factor increase in memory) by using hash tables rather than lists. This is sometimes called an adjacency dictionary and is the standard data structure in the popular Python package NetworkX [10,11]. Static graph structures can be used to store temporal networks by saving time information in edge attributes. Such structures prioritize retrieving edges via node-based slices and require iterating over all pairs of nodes to conduct a time-based slice, which is slow.

¹ Not to be confused with the other use of interval graph as a graph constructed from overlapping intervals on R [9].
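The trade-off described in Sect. 2.2 can be made concrete with a small sketch of an adjacency dictionary that keeps edge times as values. This is a plain nested-dict illustration, not NetworkX's internal layout; the function names and the returned pair counter are hypothetical and exist only to show how much of the structure each slice has to touch.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Nested dict: adj[u][v] -> list of (t1, t2) edge times (undirected)."""
    adj = defaultdict(lambda: defaultdict(list))
    for u, v, t1, t2 in edges:
        adj[u][v].append((t1, t2))
        adj[v][u].append((t1, t2))
    return adj


def node_slice(adj, u, v):
    """Two hash lookups; unrelated node pairs are never visited."""
    return list(adj.get(u, {}).get(v, []))


def time_slice(adj, s1, s2):
    """Every stored node pair has to be visited to test its edge times."""
    hits, pairs_visited = set(), 0
    for u, nbrs in adj.items():
        for v, times in nbrs.items():
            pairs_visited += 1
            for t1, t2 in times:
                if t1 < s2 and s1 < t2:  # half-open interval overlap
                    # Canonical endpoint order so (u, v) and (v, u) collapse.
                    hits.add((min(u, v), max(u, v), t1, t2))
    return sorted(hits), pairs_visited


edges = [("A", "B", 2, 5), ("A", "B", 6, 8), ("B", "C", 2, 4), ("C", "D", 5, 6)]
adj = build_adjacency(edges)
print(node_slice(adj, "A", "B"))  # [(2, 5), (6, 8)]
print(time_slice(adj, 5, 6))      # ([('C', 'D', 5, 6)], 6)
```

Here the time-based slice returns a single edge but still visits all six stored (outer key, inner key) pairs, which is the behaviour the text describes as slow on larger networks.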
2.3 Related Work
Hybrid Data Structures. Hybrid data structures, which combine different kinds of data structures into one, have a long history in the data structures literature for tasks including searching and sorting [5,19,28]. Such hybrid structures have also recently been proposed for graph data structures, including the use of separate read- and write-optimized structures [33] and a compile-time optimization framework that benchmarks a variety of data structures on a portion of a data set before choosing one [32].

Temporal Network Data Structures. Most prior work on temporal network data structures has focused on the streaming setting, where the main objective is to design data structures to enable rapid updates to graphs as edges arrive over time in a high-performance computing setting where millions of edges may be changing per second [8]. These types of data structures for massive streaming networks are typically optimized for rapid edge insertions. Their objectives differ significantly from those of the "off-line" analysis of dynamic network data that we consider, where a key objective is to rapidly slice the history of the graph, e.g. what edges were present at a specific time. Indeed, it has been found that such high-performance streaming graph structures may be even worse than simple baselines such as adjacency dictionaries for common network analysis tasks including community detection [33].

While the focus of this paper is on time-efficient data structures for temporal networks, there has also been prior work on space-efficient structures. A fourth-order tensor model proposed by Wehmuth et al. [36], which can be expressed by an equivalent square matrix with an index for each time event and elements consisting of a traditional adjacency matrix, is capable of storing dynamic graphs with a memory complexity that scales linearly with the number of edges in a network. Cazabet [3] considers encoding temporal networks for data compression using the three temporal network representations discussed in Sect. 2.1. We note that it is possible to use both time- and space-efficient structures as part of a complete workflow by storing data using the more space-efficient format, while loading it into memory to be analyzed using the more time-efficient format.
3 Proposed Hybrid Data Structure
Our proposed data structure to store temporal networks is a hybrid data structure consisting of an adjacency dictionary and an interval tree, as shown in Fig. 1. Our main objective in designing the hybrid data structure for temporal networks is to rapidly retrieve edges that meet specified criteria, which we call slicing. In the off-line analysis setting that we target, finding edges is the main operation dominating computation time of network analysis tasks [33], so rapid slices are more important than rapid insertions or deletions. Although we utilize two data structures, memory location pointers are used to avoid duplicating the edge data; the additional memory is therefore limited to what is needed to store the two structures themselves, and memory usage increases only modestly.
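The overall design can be sketched as two indexes that hold references to the same edge objects. The code below is a minimal illustration under stated assumptions and is not the IntervalGraph implementation: the class and method names are hypothetical, edges are treated as undirected, and a plain unbalanced binary search tree augmented with the maximum interval end per subtree stands in for the AVL-based interval tree, so the logarithmic search bound does not hold for adversarial insertion orders.

```python
class Edge:
    __slots__ = ("u", "v", "t1", "t2")

    def __init__(self, u, v, t1, t2):
        self.u, self.v, self.t1, self.t2 = u, v, t1, t2


class TreeNode:
    def __init__(self, t1, t2):
        self.t1, self.t2 = t1, t2
        self.edges = []       # references to Edge objects sharing this interval
        self.max_end = t2     # largest interval end found in this subtree
        self.left = self.right = None


class HybridGraph:
    def __init__(self):
        self.adj = {}         # adj[u][v] -> list of Edge references
        self.root = None      # interval tree keyed by (t1, t2)

    def add_edge(self, u, v, t1, t2):
        edge = Edge(u, v, t1, t2)
        # A set avoids storing a self-loop twice under the same key pair.
        for a, b in {(u, v), (v, u)}:
            self.adj.setdefault(a, {}).setdefault(b, []).append(edge)
        self.root = self._insert(self.root, edge)
        return edge

    def _insert(self, node, edge):
        if node is None:
            node = TreeNode(edge.t1, edge.t2)
            node.edges.append(edge)
            return node
        key, node_key = (edge.t1, edge.t2), (node.t1, node.t2)
        if key == node_key:
            node.edges.append(edge)
        elif key < node_key:
            node.left = self._insert(node.left, edge)
        else:
            node.right = self._insert(node.right, edge)
        node.max_end = max(node.max_end, edge.t2)
        return node

    def node_slice(self, u, v):
        """Edges between u and v at any time, via two hash lookups."""
        return list(self.adj.get(u, {}).get(v, []))

    def time_slice(self, s1, s2):
        """Edges whose interval [t1, t2) overlaps the query [s1, s2)."""
        out = []
        self._query(self.root, s1, s2, out)
        return out

    def _query(self, node, s1, s2, out):
        # Prune: nothing in this subtree ends after the query starts.
        if node is None or node.max_end <= s1:
            return
        self._query(node.left, s1, s2, out)
        if node.t1 < s2 and s1 < node.t2:
            out.extend(node.edges)
        # Keys to the right start no earlier than this node.
        if node.t1 < s2:
            self._query(node.right, s1, s2, out)


g = HybridGraph()
for e in [("A", "B", 2, 5), ("A", "B", 6, 8), ("B", "C", 2, 4), ("C", "D", 5, 6)]:
    g.add_edge(*e)

hit = g.time_slice(5, 6)
print([(e.u, e.v, e.t1, e.t2) for e in hit])  # [('C', 'D', 5, 6)]
print(g.node_slice("D", "C")[0] is hit[0])    # True: both indexes share one object
```

The final line shows the point about pointers: the adjacency dictionary and the tree return the same object, so the extra memory is for the index nodes and lists rather than for copies of the edges.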
3.1 Interval Tree: Time-Based Slices
The first novel component of our hybrid data structure is an interval tree to store edges using the edge time duration [t1, t2) as the key. For instantaneous edges, we use the trivial interval [t, t]. Interval trees can be implemented as an extension of a variety of popular trees, including red-black trees and AVL trees [4]. For our purpose, we select the AVL tree as the base representation of our interval tree, in the hope that its stricter balancing maximizes performance during slices. The size of the interval tree is equal to the number of unique intervals and impulses in a data set.

We use the interval tree structure to perform time-based slices, which retrieve all edges between any two nodes with times [t1, t2) that overlap a given search interval [s1, s2). As the tree is traversed, each edge time found to overlap the search interval yields all edges stored under it. Given a temporal network with m unique edge times, the interval tree has space complexity O(m) and search time complexity of O(log m + k), where k denotes the number of edges that meet the search criteria [23].

3.2 Adjacency Dictionary: Node-Based Slices
The second part of our hybrid structure is an adjacency dictionary, an adjacency list-like structure implemented using nested hash tables rather than lists, similar to the NetworkX Python package [10,11]. The outer table stores the keys associated with an edge's first node, and the inner table stores keys representing an edge's second node. The inner table's values hold a list of all edge times containing the corresponding node pair. For directed networks with edges from u to v, two separate nested hash tables are created: the first with outer keys u and inner keys v, the second with outer keys v and inner keys u.

We use the adjacency dictionary to perform node-based slices, which retrieve all edges at any time between two node sets S and T. Either of the sets could range from a single node to the set of all nodes. For example, if S denotes a single node while T denotes the set of all nodes, then the node-based slice enumerates all edge times between S and its neighboring nodes. Since the nested dictionary contains a list of all edge times, the space complexity of this structure is O(m), and the search time complexity is O(k).

3.3 Compound Slices
A compound slice retrieves all edges between two node sets S and T with times [t1, t2) that overlap a given search interval [s1, s2). It combines the criteria of the time-based and node-based slices. A compound slice can be performed in two ways. The first is to perform a node-based slice using the adjacency dictionary, returning all edges between node sets S and T, and then filter the edges based on the search interval. The second is to perform a time-based slice using the interval tree, returning all edges overlapping the search interval [s1, s2) between any two nodes, and then filter the edges based on the node sets. Depending on the node sets and search interval, one approach for compound slicing may be faster than the other. Therefore, when tasked with a compound slice, an ideal hybrid structure should attempt to predict the correct sub-structure to use in order to achieve optimal time efficiency.

Table 1. Data sets used for evaluation. Edges refer to temporal edges. Edge durations shown are the mean over all pairs of nodes with at least one edge.

Data set                Nodes     Edges       Resolution   Directed?   Edge duration
Enron [30,31]           184       125,235     1 s          Yes         0 (Impulses)
Bike share [34]         793       9,882,954   1 min        Yes         21.1 min
Reality Mining [6,7]    6,809     52,050      1 s          Yes         176.1 s
Infectious [16]         10,972    198,198     20 s         No          41.97 s
Wikipedia links [26]    43,509    160,797     1 s          Yes         2.63 years
Facebook wall [35]      43,953    852,833     1 s          Yes         0 (Impulses)
Ask Ubuntu [24,29]      159,316   964,437     1 s          Yes         0 (Impulses)

We train a logistic regression model using compound slices with a varying number of nodes and length of interval. From these compound slices, we compute four features. The first two features, percentOfNodes and percentOfInterval, correspond to the number of nodes and length of interval, respectively, specified by the slice. The next feature is sumOfDegrees, representing the number of temporal edges returned by a node-first slice. Lastly, a lifespan is calculated for each node by normalizing the time between a node's first and last appearance with respect to the network's trace length.
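As a rough illustration of the four features, the sketch below computes them for a single compound-slice query over a flat list of (u, v, t1, t2) edges. Several details are assumptions rather than the authors' definitions: the function name and the dictionary keys other than the three named features are hypothetical, sumOfDegrees is taken as the sum of temporal degrees of the queried nodes (so an edge with both endpoints in the query is counted twice), and the per-node lifespans are averaged over the queried nodes because the text does not say how they are aggregated. In the described approach these values would be the inputs to the logistic regression classifier, which is not shown here.

```python
def slice_features(edges, query_nodes, s1, s2):
    """Features for one compound-slice query over (u, v, t1, t2) edges."""
    all_nodes = {n for u, v, _, _ in edges for n in (u, v)}
    trace_start = min(t1 for _, _, t1, _ in edges)
    trace_end = max(t2 for _, _, _, t2 in edges)
    trace_length = trace_end - trace_start

    # Fraction of the network's nodes and of its trace covered by the query.
    percent_of_nodes = len(query_nodes) / len(all_nodes)
    percent_of_interval = (s2 - s1) / trace_length

    # Temporal degree summed over the queried nodes (assumed definition).
    sum_of_degrees = sum((u in query_nodes) + (v in query_nodes)
                         for u, v, _, _ in edges)

    # First and last appearance of every node in the trace.
    first, last = {}, {}
    for u, v, t1, t2 in edges:
        for n in (u, v):
            first[n] = min(first.get(n, t1), t1)
            last[n] = max(last.get(n, t2), t2)

    # Lifespan normalized by trace length, averaged over queried nodes
    # (the averaging is an assumption made for this sketch).
    lifespans = [(last[n] - first[n]) / trace_length
                 for n in query_nodes if n in first]
    mean_lifespan = sum(lifespans) / len(lifespans) if lifespans else 0.0

    return {
        "percentOfNodes": percent_of_nodes,
        "percentOfInterval": percent_of_interval,
        "sumOfDegrees": sum_of_degrees,
        "lifespan": mean_lifespan,
    }


edges = [("A", "B", 2, 5), ("A", "B", 6, 8), ("B", "C", 2, 4), ("C", "D", 5, 6)]
features = slice_features(edges, {"B"}, 3, 6)
print(features)
```

For the sample network of Fig. 1 and the query node B over [3, 6), the trace spans [2, 8), so the query covers a quarter of the nodes and half of the trace, B has three temporal edges, and B is present for the whole trace.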
4 Experiments

4.1 Data Sets
We evaluate the proposed structure using the real temporal network data sets shown in Table 1. These data sets span a wide range in terms of size, time resolution, and duration of events, ranging from networks with very few nodes but many short temporal edges (London bike share), to networks with many nodes and extremely long duration temporal edges (Wikipedia links).

4.2 Comparison Baselines
Four other data structures serve as baselines for comparison with our proposed hybrid structure. The first structure is a MultiGraph in NetworkX [10,11], with intervals stored as edge attributes, representing the de facto standard for network structures in Python. This structure is representative of performance using only an adjacency dictionary. The second structure, SnapshotGraph, is the variable window snapshot technique described in Sect. 2.1. Snapshots are stored in a SortedDictionary from the Python package Sorted Containers [17]. The third structure, AdjTree, is an adjacency dictionary with internal elements consisting of an interval tree for each node pair. This baseline represents a simplified single structure approach (rather than the hybrid that we propose). The last baseline, TVG, is the fourth-order tensor model by Wehmuth et al. [36] described in Sect. 2.3. In order to assist with slicing, the matrix representation has been adapted into dictionary equivalents. As implemented, the structure consists of a SortedDictionary storing t1 keys, with values pointing to SortedDictionaries containing t2. The second dictionary points to a standard adjacency dictionary.

4.3 Basic Operations
We are primarily interested in the off-line analysis setting where an entire network is first loaded into memory and then different analysis tasks are performed by slicing the data structure. We are interested in two metrics to evaluate the effectiveness of our proposed hybrid data structure: the times required to compute a time-based slice and a compound slice. Since our adjacency dictionary structure is almost identical to a typical adjacency dictionary, e.g. in NetworkX [10,11], we do not evaluate node-based slices. Of secondary interest are the creation (load) time from a text file and memory usage, both of which we expect to be slightly higher than the comparison baselines due to maintaining two data structures. Unless otherwise specified, each measurement for a structure and data set combination is repeated 100 times and averaged in order to reduce variance in CPU clock rate between measurements².

Compound Slice Time. The training data for the logistic regression model is obtained by performing 5,000 iterations of randomly selected nodes and interval length, varying independently from (0, 50]% of the network's nodes and trace length, respectively. Once the features have been calculated, both the adjacency dictionary and the interval tree within the hybrid structure are sliced and the times recorded. Iterations are randomly divided according to a 5% train, 95% test split in order to determine the model's suitability. The extremely low percentage of training samples is selected to mimic a realistic setting with minimal training. An individual model is trained for each data set.

4.4 Case Study
In an ideal world, a data analyst would spend the majority of his or her time analyzing data. However, in reality, an increasingly large portion of time is spent creating and slicing the data before analysis can even begin. In this case study, we will evaluate the computation time of a sample data analysis workflow using IntervalGraph and NetworkX on the London Bikeshare data set. To begin the analysis, a one-time upfront computation cost must be paid in order to create the additional data structures of IntervalGraph and NetworkX. Analysts often wish to determine how network metrics change over time, which requires frequent slicing of the data set. In this example, we wish to calculate the daily betweenness centrality across all nodes, so 365 slices are required. Slice time represents the total time required to retrieve all edges for all slices. Only once the slicing of the data structure occurs can the analysis begin. While the analysis task performed in this case study is betweenness centrality via the NetworkX package, it should be noted that the exact analysis task has no impact on performance as the slicing process returns an identical list of edges.

² All experiments were run on a workstation with 2 Intel Xeon 2.3 GHz CPUs, totaling 36 cores, and 384 GB of RAM on Python version 3.8.3. Code to reproduce experiments is available at https://github.com/hilsabeckt/hybridtempstruct.

Fig. 2. Time-based slice times for 1% and 5% time slices across all data sets. The proposed hybrid interval tree structure has significantly lower slice times on most data sets.
5 Results

5.1 Basic Operations
Time-Based Slices. Figure 2 compares the time to return all edges within 1% and 5% time slices of the network duration. On such small time slices, especially at 1%, our proposed interval tree-based structure, IntervalGraph, is far superior to the other structures on almost all of the data sets. The exception is the Wikipedia data, which is quite different from the other data sets in that the mean edge duration is about 2.6 years or 27% of the length of the total data trace. This is extremely large compared to the Reality Mining data, which is a more typical data set, where the mean edge duration is about 3 min or 0.002% of the trace length.

Compound Slices. Recall that there are two ways to perform a compound slice: a node-based compound slice using the adjacency dictionary and a time-based compound slice using the interval tree. In Fig. 3, we compare the compound slice times using 3 strategies: always using a node-based slice, always using a time-based slice, and using our prediction of which slice is faster. We compare these to the minimum and maximum times that could be achieved, i.e. always selecting the faster or slower approach respectively (which are not known in practice). Upon analysis, we find that the node-only strategy is faster than the tree-only strategy across all tested data sets. However, for some individual slices, time-based slicing is faster, which is why the node-only time is not necessarily the minimum time. Our proposed predictive compound slice approach is faster than the other 2 strategies on 3 data sets: Enron, Infectious, and Reality Mining. On the remaining data sets, our predictive approach is only slightly slower than always choosing node-based slices. Accuracy of our predictions varied from 68% to 93%, with the best performance on the Reality Mining data set and the worst on the Facebook wall posts data set. This can be seen qualitatively in Fig. 3 as the difference between minimum and prediction time.

Fig. 3. Compound slice times for different approaches on IntervalGraph. Our proposed predictive slicing approach performs better than using only node- or tree-based slices.

Creation Time and Memory Usage. With its lack of sorted edges with respect to time, NetworkX has a creation time 10–100× faster than the second fastest data structure, SnapshotGraph. SnapshotGraph struggles to efficiently store edges that extend across a large number of snapshots, resulting in a memory usage of over 49 GB on the Wikipedia data set! The remaining three structures (IntervalGraph, AdjTree, and TVG) tend to have creation times within ±10% of each other, depending on the data set. This trend continues when examining memory usage of each structure, where these three structures continue to be within ±10% of each other on most data sets. However, the difference in memory usage between NetworkX and these three structures shrinks to a factor of 2–3×.
Fig. 4. Computation time per stage on the London Bikeshare data set.

5.2 Case Study
Computation times per stage for our proposed hybrid IntervalGraph structure and NetworkX are shown in Fig. 4. IntervalGraph's creation time is much longer than that of the NetworkX implementation due to its tree sub-structure. However, it is important to remember that this cost must only be paid once, as the object may be loaded from permanent storage once the temporal edges are sorted. For small analysis tasks requiring little slicing, IntervalGraph's ability to more efficiently retrieve temporal edges may not outweigh this large upfront cost. With the 365 slices performed in this case study, IntervalGraph is almost 3 times faster than NetworkX after creation and slicing! This speedup translates to a 25% reduction in computation time over the entire workflow, including the analysis time. Depending on the size of the network, we find that IntervalGraph becomes more efficient at completing the overall workflow in anywhere from 5 to 100 slices. We believe this number of slices is low enough to make IntervalGraph more efficient than NetworkX in most use cases, especially during the exploratory analysis stage where a wide variety of snapshot lengths may be sliced.
6 Conclusion
Temporal networks have the unique capability of capturing the spread of information throughout a network with respect to time. Analysis of temporal aspects of a network using a dynamic structure can lead to deeper insights that are lost in translation when these networks are flattened into static graphs. In the interest of increasing our understanding of temporal networks, we propose a hybrid structure that is able to efficiently slice temporal edges using a dimension inaccessible by currently available structures. Due to its hybrid nature, the proposed structure is still able to benefit from algorithms and techniques developed for static graphs. The proposed structure achieves a synergistic relationship between its sub-structures by successfully predicting efficient slicing across multiple dimensions. While these contributions come at the expense of increased memory usage, the increase is not significant enough to limit viability. By proposing this new structure, we hope to spark research interests in techniques associated with temporal networks. We have implemented our proposed hybrid structure in the IntervalGraph class of the DyNetworkX Python package [12] for analyzing dynamic or temporal network data.
Acknowledgement. This material is based upon work supported by the National Science Foundation grants IIS-1755824, DMS-1830412, and IIS-2047955.
References
1. Arastuie, M., Paul, S., Xu, K.S.: CHIP: a Hawkes process model for continuous-time networks with scalable and consistent estimation. In: Advances in Neural Information Processing Systems, vol. 33, pp. 16983–16996 (2020)
2. Casteigts, A., Flocchini, P., Santoro, N., Quattrociocchi, W.: Time-varying graphs and dynamic networks. Int. J. Parallel Emergent Distrib. Syst. 27(5), 387–408 (2012)
3. Cazabet, R.: Data compression to choose a proper dynamic network representation. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) COMPLEX NETWORKS 2020. SCI, vol. 943, pp. 522–532. Springer, Heidelberg (2020). https://doi.org/10.1007/978-3-030-65347-7_43
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms. MIT Press, Cambridge (2009)
5. Dietz, P.F.: Maintaining order in a linked list. In: Proceedings of the Fourteenth Annual ACM Symposium on Theory of Computing, pp. 122–127 (1982)
6. Eagle, N., Pentland, A.S.: Reality mining: sensing complex social systems. Pers. Ubiquit. Comput. 10(4), 255–268 (2006)
7. Eagle, N., Pentland, A.S., Lazer, D.: Inferring friendship network structure by using mobile phone data. Proc. Natl. Acad. Sci. 106(36), 15274–15278 (2009)
8. Ediger, D., McColl, R., Riedy, J., Bader, D.A.: Stinger: high performance data structure for streaming graphs. In: Proceedings of the IEEE Conference on High Performance Extreme Computing, pp. 1–5. IEEE (2012)
9. Fulkerson, D., Gross, O.: Incidence matrices and interval graphs. Pac. J. Math. 15(3), 835–855 (1965)
10. Hagberg, A., et al.: NetworkX (2013). http://networkx.github.io
11. Hagberg, A., Swart, P., Schult, D.: Exploring network structure, dynamics, and function using NetworkX. Technical report. LA-UR-08-5495, Los Alamos National Laboratory (2008)
12. Hilsabeck, T., Arastuie, M., Do, H.N., Sloma, M., Xu, K.S.: IdeasLabUT/dynetworkx: Python package for importing and analyzing discrete- and continuous-time dynamic networks (2020). https://github.com/IdeasLabUT/dynetworkx
13. Holme, P., Saramäki, J.: Temporal networks. Phys. Rep. 519(3), 97–125 (2012)
14. Holme, P., Saramäki, J.: Temporal Networks. Springer, Heidelberg (2013)
15. Holme, P., Saramäki, J.: Temporal Network Theory. Springer, Heidelberg (2019)
16. Isella, L., Stehlé, J., Barrat, A., Cattuto, C., Pinton, J.F., Van den Broeck, W.: What's in a crowd? Analysis of face-to-face behavioral networks. J. Theor. Biol. 271(1), 166–180 (2011)
17. Jenks, G.: Python sorted containers. J. Open Source Softw. 4(38), 1330 (2019)
18. Junuthula, R., Haghdan, M., Xu, K.S., Devabhaktuni, V.: The block point process model for continuous-time event-based dynamic networks. In: The World Wide Web Conference, pp. 829–839 (2019)
19. Korda, M., Raman, R.: An experimental evaluation of hybrid data structures for searching. In: Vitter, J.S., Zaroliagis, C.D. (eds.) WAE 1999. LNCS, vol. 1668, pp. 213–227. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48318-7_18
20. Kostakos, V.: Temporal graphs. Physica A 388(6), 1007–1023 (2009)
21. Lambiotte, R., Masuda, N.: A Guide to Temporal Networks, vol. 4. World Scientific (2016)
22. Latapy, M., Viard, T., Magnien, C.: Stream graphs and link streams for the modeling of interactions over time. Soc. Netw. Anal. Min. 8(1), 1–29 (2018)
23. Lee, D.: Interval, segment, range, and priority search trees. In: Multidimensional and Spatial Structures, p. 1 (2005)
24. Leskovec, J., Krevl, A.: SNAP datasets: Stanford large network dataset collection (2014)
25. Leskovec, J., Sosič, R.: SNAP: a general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1 (2016)
26. Ligtenberg, W., Pei, Y.: Introduction to a temporal graph benchmark. arXiv preprint arXiv:1703.02852 (2017)
27. Nicosia, V., Tang, J., Mascolo, C., Musolesi, M., Russo, G., Latora, V.: Graph metrics for temporal networks. In: Holme, P., Saramäki, J. (eds.) Temporal Networks, pp. 15–40. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-36461-7_2
28. Overmars, M.H.: The Design of Dynamic Data Structures, vol. 156. Springer, Heidelberg (1987)
29. Paranjape, A., Benson, A.R., Leskovec, J.: Motifs in temporal networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 601–610 (2017)
30. Priebe, C.E., Conroy, J.M., Marchette, D.J., Park, Y.: Scan statistics on Enron graphs. Comput. Math. Organ. Theory 11, 229–247 (2005)
31. Priebe, C.E., Conroy, J.M., Marchette, D.J., Park, Y.: Scan statistics on Enron graphs (2009). http://cis.jhu.edu/~parky/Enron/enron.html
32. Schiller, B., Castrillon, J., Strufe, T.: Efficient data structures for dynamic graph analysis. In: Proceedings of the 11th International Conference on Signal-Image Technology & Internet-Based Systems, pp. 497–504. IEEE (2015)
33. Thankachan, R.V., Swenson, B.P., Fairbanks, J.P.: Performance effects of dynamic graph data structures in community detection algorithms. In: Proceedings of the IEEE High Performance extreme Computing Conference, pp. 1–7. IEEE (2018)
34. Transport for London: cycling.data.tfl.gov.uk (2021). https://cycling.data.tfl.gov.uk/
35. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in Facebook. In: Proceedings of the 2nd ACM Workshop on Online Social Networks, pp. 37–42 (2009)
36. Wehmuth, K., Ziviani, A., Fleury, E.: A unifying model for representing time-varying graphs. In: Proceedings of the IEEE International Conference on Data Science and Advanced Analytics, pp. 1–10. IEEE (2015)
Modeling Human Behavior
Markov Modulated Process to Model Human Mobility
Brian Chang, Liufei Yang, Mattia Sensi, Massimo A. Achterberg, Fenghua Wang, Marco Rinaldi, and Piet Van Mieghem
Delft University of Technology, 2628 CD Delft, The Netherlands
Abstract. We introduce a Markov Modulated Process (MMP) to describe human mobility. We represent the mobility process as a time-varying graph, where a link specifies a connection between two nodes (humans) at any discrete time step. Each state of the Markov chain encodes a certain modification to the original graph. We show that our MMP model successfully captures the main features of a random mobility simulator, in which nodes move in a square region. We apply our MMP model to human mobility, measured in a library.

Keywords: Markov modulated process · Human mobility · Time-varying networks · Modeling · Markov chains

1 Introduction
Over the last century, scientists have modeled the spread of infectious diseases in a population with various approaches and assumptions. Any epidemic outbreak consists of two processes: the viral process describes the transmission of the virus between two hosts, and the mobility process specifies when two hosts are in contact for a sufficiently long time at a sufficiently short distance to enable the viral transmission. Since measurements of human mobility were seldom available in earlier times, human mobility has largely been ignored or absorbed into the viral process through time-varying infection and curing probabilities or by imposing restrictions on the contact graph. Simple epidemic models, which do not take mobility into account and consider contact networks that are fixed over time, cannot explain the prolonged duration of an epidemic such as COVID-19 [21].

Recently, various approaches have been proposed to model human mobility [1,5,13,14,16,17,23] and specific aspects such as community and motif formation [15,26] and the movement behavior of each individual [7,18]. Mobile devices are increasingly used to measure movement [8–10,17,27]. The impact of human mobility on the spread of epidemics is studied in [2,11,12,19,24,27].

In this paper, we propose a novel approach to the modeling of Human Mobility Processes (HMP)¹ based on a Markov Modulated Process (MMP) [6], [22, Sec. 5.2]. Each state of the Markov chain in the MMP encodes an action, which creates a contact graph. For example, state i may generate an Erdős-Rényi graph $G_{p_i}(N)$ with link density $p_i$, state j may add j links, state k may remove a motif (e.g. a triangle), and so on. We evaluate the MMP performance first assuming that nodes move randomly around in a square region and are in contact if their distance is smaller than a certain threshold. We then reconstruct the time-varying contact network using a MMP model.

The paper is structured as follows: Sect. 2 describes a random mobility simulator. Section 3 briefly introduces Markov Modulated Processes (MMPs) and Sect. 3.2 compares our MMP model to the mobility simulator. We apply our model to real-world data in Sect. 4. Finally, we conclude in Sect. 5.

¹ This paper is an abridgement of the master theses of B. Chang [4] and L. Yang [25].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 607–618, 2022. https://doi.org/10.1007/978-3-030-93409-5_50
2 Random Mobility Simulator

Movements over time are first simulated by a very simple random model, merely to explain and evaluate the idea of a MMP in Sect. 3. Later, in Sect. 4, human mobility is measured in a library. We simulate² a mobility process in which nodes are constrained to move within a square region of size Z × Z units. We consider $N = n^2$ nodes. Initially, the N nodes are spaced evenly at a distance of Z/(n + 1) in a square lattice with n rows and n columns. The distance between the edges of the lattice and the border of the region is also Z/(n + 1). Figures 1a and 1b show a visualization of the initial placement and a snapshot of a typical realization.

The position of each node is represented in Cartesian coordinates (x, y), with the bottom-left corner of the square as origin (0, 0). The positions of the nodes change at discrete time steps. At time k = 0, each node is initialized with a random direction or angle, which is a random integer, uniformly chosen in the range [0, 359]. For all remaining time steps, we fix a value θ ∈ (0, 180) and each node can change its direction by an integer in the range [−θ, θ] with uniform probability. Each node then moves forward in a straight line along its current direction. The move distance is uniformly distributed in the range [0, 2v].

We constrain the movement of the nodes to the square region of size Z × Z. If a node attempts to move out of bounds, its x and y coordinates are both clamped to the range [0, Z], bringing the node back into the region. Then, the direction of the node is reset to point towards the center of the region, i.e. the point (Z/2, Z/2). Finally, a random integer offset in the range [−φ, φ] is applied to the new direction of the node.

At each discrete time step k ∈ {0, 1, . . . , T − 1}, every node moves according to the procedure described above. After the nodes have moved, at k + 1, we create a network based on the current positions of the nodes. Two nodes are connected when the distance between them is d units or less; here d = 1.5. The temporal contact network G(N, k) is represented by the time-varying N × N adjacency matrix A(k) at discrete time k. Each element $a_{ij}(k) = 1$ if node i and node j are in contact at time k; otherwise, $a_{ij}(k) = 0$.

In the mobility process simulations, we choose N = 25 nodes, Z = 10 for the size of the region and, further, θ = 20, v = 1, φ = 30. We perform 1000 realizations of the mobility process. Each realization runs for T = 1000 time steps and generates the adjacency matrices A(1), A(2), . . . , A(1000) of the contact graphs, starting from an empty initial adjacency matrix A(0) = O at time k = 0, representing the evenly-spaced square lattice³. Figure 2 illustrates the number of links at each time step for one realization of the mobility process. For the 1000 realizations of 1000 time steps each, Fig. 3 plots the distribution of the number of links in the graph at each time step and the number of links added or removed between two time steps.

² The Python code of our mobility simulator is available on GitHub: https://github.com/twente/mmp-mobility-model.

Fig. 1. Visualization of our random mobility simulator (a) at time k = 0 (initial setup of the mobility simulator) and (b) at another time k > 0 (snapshot of one realization). If the grey circles around two nodes overlap, a link (shown in orange) between the two is created. The grey circles have radius d/2 = 0.75.
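The procedure of this section can be sketched in a few lines of Python. This is a minimal illustration written from the description above, not the authors' published simulator: the function names are hypothetical, the move distance is assumed to be a continuous uniform draw on [0, 2v], and the direction towards the center is assumed to be rounded to an integer number of degrees before the offset is applied.

```python
import math
import random

def initial_positions(n, Z):
    """Place n*n nodes on a square lattice with spacing Z/(n+1)."""
    s = Z / (n + 1)
    return [((c + 1) * s, (r + 1) * s) for r in range(n) for c in range(n)]

def step(positions, directions, Z, theta, v, phi, rng):
    """Advance every node by one time step (directions in integer degrees)."""
    new_pos, new_dir = [], []
    for (x, y), ang in zip(positions, directions):
        ang = (ang + rng.randint(-theta, theta)) % 360
        dist = rng.uniform(0.0, 2.0 * v)  # assumption: continuous distance
        x += dist * math.cos(math.radians(ang))
        y += dist * math.sin(math.radians(ang))
        if not (0.0 <= x <= Z and 0.0 <= y <= Z):
            # Clamp to the region, then point towards the center
            # and apply a random integer offset in [-phi, phi].
            x = min(max(x, 0.0), Z)
            y = min(max(y, 0.0), Z)
            to_center = math.degrees(math.atan2(Z / 2 - y, Z / 2 - x))
            ang = (round(to_center) + rng.randint(-phi, phi)) % 360
        new_pos.append((x, y))
        new_dir.append(ang)
    return new_pos, new_dir

def contact_links(positions, d):
    """Return the set of links (i, j), i < j, with Euclidean distance <= d."""
    links = set()
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) <= d:
                links.add((i, j))
    return links

def simulate(n=5, Z=10.0, theta=20, v=1.0, phi=30, d=1.5, T=1000, seed=0):
    """Return the link sets of A(0), A(1), ..., A(T) for one realization."""
    rng = random.Random(seed)
    pos = initial_positions(n, Z)
    ang = [rng.randint(0, 359) for _ in pos]
    history = [contact_links(pos, d)]  # A(0) is empty for these parameters
    for _ in range(T):
        pos, ang = step(pos, ang, Z, theta, v, phi, rng)
        history.append(contact_links(pos, d))
    return history
```

Storing each A(k) as a set of links rather than a dense N × N matrix keeps the sketch short; the number of links at time k is then simply the size of the set.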
3 The Markov Modulated Process (MMP)

We consider a temporal graph G(N, k), where the links in the graph change over time, controlled by a discrete-time Markov process. In each state of the Markov chain, an action is executed to modulate the graph. The evolution of the Markov process is controlled by the M × M probability transition matrix P, whose elements $p_{ij}$ represent the probability of transitioning from state i to state j. The goal is to use a discrete-time Markov process to modulate a time-varying graph of 25 nodes, such that the series of MMP-generated adjacency matrices has properties which closely resemble those of the series of adjacency matrices produced by the mobility process.

³ The spacing in the lattice is 10/6 ≈ 1.667, which exceeds d = 1.5; therefore, there are no links in the initial graph at k = 0.

Fig. 2. Number of links at each time step for one realization of the mobility process with N = 25 nodes.

Fig. 3. Histograms of link statistics for the simulated mobility process with 25 nodes across 1000 realizations of 1000 time steps. (a) Total number of links, average 19.430. (b) Number of added links, average 10.369. (c) Number of removed links, average 10.350.

Fig. 4. Histograms of link statistics for the reduced MMP model with 25 nodes across 1000 realizations of 1000 time steps. (a) Total number of links, average 19.433. (b) Number of added links, average 10.369. (c) Number of removed links, average 10.350.
3.1 A First Approach: the Combinatorial MMP Model

Between two consecutive time steps of the mobility process, links in the contact graph are both added and removed. In order for the average number of links over time in the MMP model to match the mobility process, a feedback mechanism is needed to control the number of links. Without feedback, the MMP model does not track the number of links, which will consequently drift like a random walk over time [25, Sec. 3.5]. Therefore, the states of the MMP model should encode the number of links in the graph at each discrete time step.

In the first approach, each state of the Markov process represents the number of links in the graph as well as a modulating action f which will be applied to the temporal graph. The action f = (+a, −b) consists of randomly adding a links and removing b links. The set of possible actions is bounded by the number of links which currently exist in the graph, because we cannot remove more links than currently exist nor add more links than can be accommodated by the complete graph. Given that i links currently exist in the graph, the complete action set $J_i$ is the set of all possible actions, given by

$$J_i := \{ f = (+a, -b) \mid 0 \le a \le L_{\max} - i,\ 0 \le b \le i \}, \tag{1}$$

where $L_{\max} = \binom{N}{2} = N(N-1)/2$ is the number of links in the complete graph with N nodes. Given i links in the graph, there are thus $(i + 1)(L_{\max} - i + 1)$ possible actions, each of which will be encoded by a state in the MMP model.

As an example, consider N = 3 nodes. Since $L_{\max} = 3$, the complete action sets for each possible number of links are:

J0 = {(+0, −0), (+1, −0), (+2, −0), (+3, −0)},
J1 = {(+0, −0), (+0, −1), (+1, −0), (+1, −1), (+2, −0), (+2, −1)},
J2 = {(+0, −0), (+0, −1), (+0, −2), (+1, −0), (+1, −1), (+1, −2)},
J3 = {(+0, −0), (+0, −1), (+0, −2), (+0, −3)}.

For each of the complete action sets, every action is assigned to a unique state. Hence, the same action can appear multiple times, but for a different number of links.
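The complete action sets in (1) and the resulting number of states can be enumerated mechanically. The following sketch is illustrative only; the function names are hypothetical and an action (+a, −b) is represented as the tuple `(a, b)`.

```python
from math import comb

def complete_action_set(i, l_max):
    """All actions (+a, -b) allowed when i links exist, following (1)."""
    return [(a, b) for a in range(l_max - i + 1) for b in range(i + 1)]

def combinatorial_states(num_nodes):
    """Enumerate the states {L, f} of the combinatorial MMP model."""
    l_max = comb(num_nodes, 2)
    return [(i, f) for i in range(l_max + 1)
            for f in complete_action_set(i, l_max)]

def num_states(num_nodes):
    """Count the states without enumerating them."""
    l_max = comb(num_nodes, 2)
    return sum((i + 1) * (l_max - i + 1) for i in range(l_max + 1))

print(complete_action_set(1, 3))     # the six actions of J1 for N = 3
print(len(combinatorial_states(3)))  # 20 states for N = 3
print(num_states(25))                # 4590551 states for N = 25
```

For N = 3 the enumeration yields the 20 states of the example, while for the N = 25 nodes used in the simulations the count already exceeds 4.5 million.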
The Markov states for the example of N = 3, numbered arbitrarily in the format $S_n$: {L, f}, where L is the number of links currently in the graph, are:

S1: {0, (+0, −0)}    S2: {0, (+1, −0)}    S3: {0, (+2, −0)}    S4: {0, (+3, −0)}
S5: {1, (+0, −0)}    S6: {1, (+0, −1)}    S7: {1, (+1, −0)}    S8: {1, (+1, −1)}    S9: {1, (+2, −0)}    S10: {1, (+2, −1)}
S11: {2, (+0, −0)}   S12: {2, (+0, −1)}   S13: {2, (+0, −2)}   S14: {2, (+1, −0)}   S15: {2, (+1, −1)}   S16: {2, (+1, −2)}
S17: {3, (+0, −0)}   S18: {3, (+0, −1)}   S19: {3, (+0, −2)}   S20: {3, (+0, −3)}

With the states of the MMP model now defined, the next step is to construct the probability transition matrix P from the mobility process. At each time step k < T of the mobility process, the number of links in the adjacency matrix A(k + 1) is compared with A(k) to determine how many links to add and remove (i.e. the action). The number of links and the action allow us to determine the Markov state at time k. The transition probability,

$$p_{ij} = \frac{n_{ij}}{\sum_{m \in X} n_{im}}, \tag{2}$$

equals the number $n_{ij}$ of observed transitions from state i to state j divided by the total number of observed transitions from state i to any state m, where X is the set of all states.

Because each state $S_n$: {L, f} of the combinatorial MMP model encodes both the number of links in the graph as well as an action, the total number M of states is found by summing the number of possible actions over all possible numbers of links which can exist in the graph, $M = \sum_{i=0}^{L_{\max}} (i + 1)(L_{\max} - i + 1)$, which is $M = O(N^6)$ as shown in [25, Sec. 4.2]. As the size N of the graph increases, the computational complexity of the combinatorial MMP model quickly explodes and becomes infeasible.

3.2 Reduced MMP Model for Human Mobility
The computational intractability has led us to simplify the combinatorial MMP model in Sect. 3.1 to a reduced MMP model, in which each state in {0, 1, . . . , $L_{\max}$} represents the number of links in the graph and not a specific action. The number of states required is therefore $M = L_{\max} + 1 = O(N^2)$, since one additional state is required for the null graph with 0 links. A Markov state is assigned to each time step k ≤ T of the mobility process and the state corresponds to the number of links in the adjacency matrix A(k). The transition probabilities of the probability transition matrix P are then calculated by (2).

The actions of the reduced MMP model are no longer embedded in the states, and instead we define a transition action set $J_{ij}$ for each state transition from state i to j. Each transition action set $J_{ij}$ will be a subset of the corresponding complete action set $J_i$ that was described in (1). The transition action set $J_{ij}$ is defined as the set of possible actions, given that i links currently exist in the graph, that will result in j links after the action is applied, and is expressed as

$$J_{ij} := \{ f = (+a, -b) \mid j - i = a - b,\ 0 \le a \le L_{\max} - i,\ 0 \le b \le i \} = \{ f \in J_i \mid j - i = a - b \}. \tag{3}$$
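For the reduced model, the estimate (2) only needs the sequence of link counts. A minimal sketch follows, assuming a single realization given as a list of link counts; the function name and the example series are hypothetical and are not data from the paper. For several realizations, the transition counts would be accumulated over all of them before normalizing.

```python
from collections import Counter, defaultdict

def estimate_transition_matrix(link_counts):
    """Row-normalized transition frequencies between consecutive link counts, as in (2)."""
    counts = defaultdict(Counter)
    for i, j in zip(link_counts, link_counts[1:]):
        counts[i][j] += 1
    return {
        i: {j: n_ij / sum(row.values()) for j, n_ij in row.items()}
        for i, row in counts.items()
    }

# Hypothetical link-count series, used only to illustrate the estimate.
series = [0, 1, 1, 2, 1, 1, 0, 1]
P = estimate_transition_matrix(series)
print(P[1])  # probabilities of moving from 1 link to 1, 2 or 0 links
```

The matrix is stored sparsely as nested dictionaries, so transitions that were never observed are simply absent, which corresponds to probability zero.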
Revisiting the example of N = 3 nodes, the transition action sets are given by:

J00: {(+0, −0)}    J01: {(+1, −0)}              J02: {(+2, −0)}              J03: {(+3, −0)}
J10: {(+0, −1)}    J11: {(+0, −0), (+1, −1)}    J12: {(+1, −0), (+2, −1)}    J13: {(+2, −0)}
J20: {(+0, −2)}    J21: {(+0, −1), (+1, −2)}    J22: {(+0, −0), (+1, −1)}    J23: {(+1, −0)}
J30: {(+0, −3)}    J31: {(+0, −2)}              J32: {(+0, −1)}              J33: {(+0, −0)}
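The transition action sets for N = 3 can be reproduced by filtering the actions allowed with i links on the condition j − i = a − b from (3). The sketch below is illustrative; the function name is hypothetical and an action (+a, −b) is represented as the tuple `(a, b)`.

```python
def transition_action_set(i, j, l_max):
    """Actions (+a, -b) that take a graph from i links to j links, following (3)."""
    return [
        (a, b)
        for a in range(l_max - i + 1)
        for b in range(i + 1)
        if a - b == j - i
    ]

l_max = 3  # N = 3 nodes
for i in range(l_max + 1):
    # One row of the table per current number of links i.
    print([transition_action_set(i, j, l_max) for j in range(l_max + 1)])
```

Summed over all pairs (i, j), the sets contain 20 actions in total, one for each state of the combinatorial model.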
When we transition from state i to state j in the reduced MMP model, we select an action f from the corresponding transition action set $J_{ij}$ with probability $\Pr[J_{ij}(f)] = n(f)_{ij} / n_{ij}$, where $n(f)_{ij}$ is the number of times action f is observed during transitions from i links to j links in the mobility process and $n_{ij}$ is the total number of observed transitions from i to j links. The probability distribution of the transition action set is independent of the transition probabilities of the Markov process. The probability of an action f ∈ $J_{ij}$, given that there are currently i links in the graph, is given by $p_{ij} \Pr[J_{ij}(f)]$, which is the probability of transitioning from i to j links multiplied by the probability of choosing action f from $J_{ij}$. The probability of each action depends only on the current number of links in the graph: by the Markov property, the next state depends only on the current state (i.e. the current number of links) and the probability distribution of each action set is fixed.

The reduced MMP model can be directly derived from the combinatorial MMP model as shown in [25, Sec. 4.3]. Both MMP models will produce [25] a modulated graph with the same number of links on average and the long-term rate at which each action is applied is the same for both models. The dependence between consecutive actions is lost in the reduced MMP model: the combinatorial MMP model embeds the actions in the Markov states, and hence the probability of the next action depends on the previous action. In the reduced MMP model, the next action depends only on the current number of links. If, in the mobility process being modeled, there is a negligible amount of dependence between consecutive actions, then the combinatorial MMP model is actually trying to model a dependence that does not exist. In that case, which applies to our simulated mobility process [25, Sec. 5.5], no information is lost in the reduced MMP model.

The link statistics of the reduced MMP model in Fig. 4 are extremely close to those of the simulated mobility process in Fig. 3. In Fig. 5, we plot for the two processes the K-step link retention probability, which is the probability [20, p. 182] that a link still exists at time k + K, given that the link existed at time k. For the MMP models, the link retention probability decreases exponentially [20, p. 182] in K. Figure 4 indicates that the probability that a link is not removed is approximately 1 − (10.350/19.433) ≈ 0.47. Since the removed links are chosen randomly, we expect an exponential decay of the K-step retention probability. Compared to the mobility process, however, the link retention probability of the MMP models is lower for larger steps. Although Figs. 3 and 4 are almost indistinguishable, the link retention probabilities of the MMP model and the mobility process start to deviate after 3 time steps. This prompts further research into the algorithm which selects the links to be added and removed at each time step.
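One step of the reduced MMP model, together with a rough estimate of the K-step retention probability, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, `P` and `action_probs` are assumed to be dictionaries holding the estimated transition probabilities and the action distributions of each $J_{ij}$, added links are assumed to be drawn from the links that were absent before the action, and the retention estimate ignores that a removed link may be re-created later.

```python
import random
from itertools import combinations

def apply_action(links, action, num_nodes, rng):
    """Remove b existing links and add a absent links, chosen uniformly at random."""
    a, b = action
    all_links = set(combinations(range(num_nodes), 2))
    absent = sorted(all_links - links)  # assumption: absent before the action
    removed = set(rng.sample(sorted(links), b))
    added = set(rng.sample(absent, a))
    return (links - removed) | added

def mmp_step(links, P, action_probs, num_nodes, rng):
    """One step of a reduced MMP: draw the next link count, then an action."""
    i = len(links)
    targets, weights = zip(*P[i].items())
    j = rng.choices(targets, weights=weights)[0]
    actions, probs = zip(*action_probs[(i, j)].items())
    action = rng.choices(actions, weights=probs)[0]
    return apply_action(links, action, num_nodes, rng)

def retention_estimate(avg_removed, avg_links, K):
    """Geometric approximation of the K-step link retention probability."""
    return (1.0 - avg_removed / avg_links) ** K

# With the averages reported in Fig. 4, one step gives about 0.47.
print([round(retention_estimate(10.350, 19.433, K), 3) for K in range(1, 6)])
```

Because the links to add and remove are picked uniformly at random, this sketch carries no memory of which nodes were close to each other, which is consistent with the faster decay of the retention probability observed for the MMP model.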
4 Real-world Application of MMP to HMP in a Library
The mobility data was recorded in the TU Delft library during a single day from 08:00 to 18:00. In the tracking experiment, each of the 37 participants wears a Bluetooth tracking device and there are fixed Bluetooth beacons inside the library. Each Bluetooth tracking device will record contact events with beacons, and each data record includes the time and duration of the contact, the distance from the beacon and information to uniquely identify the device and beacon.
Fig. 5. Comparison of the K-step link retention probability between the mobility process and the reduced MMP model. The two curves remain close up to step 3, after which they start to diverge.
4.1 Building the Reduced MMP Model

The tracking devices record the contacts between people every 30 s. The opening time of the library at 08:00 is set at k = 0 and the closing time at 18:00 at time k = 1200. The adjacency matrix A(k) represents the contact graph at discrete time k ∈ [0, 1200]. We plot the number of links at each time step in Fig. 6. We observe that there are almost no contacts, except for the two spikes in the number of contacts, from 11:04 to 12:54 and from 15:10 to 16:20. Since our MMP model is based on the number of links in the graph, we focus on modeling the first spike in contacts from 11:04 to 12:54, corresponding to time step k = 369 to k = 587; this time window is highlighted by a yellow stripe in Fig. 6. We compute the total number of links at each time step (average is 2.84) and the number of links added (average is 0.11) and removed (average is 0.11) between two time steps. We use the adjacency matrices to generate the probability transition matrix and build the action sets. The mobility process runs for a little over 200 time steps, so we choose to simulate the MMP model for 200 time steps; the number of realizations is kept at 1000 as previously.

4.2 Results of the Reduced MMP Model
The maximum number of links in a graph of 37 nodes is $\binom{37}{2} = 666$ and the reduced MMP model needs M = 667 states. However, Fig. 6 shows that a maximum of 9 links was observed. For M = 10 states, there are 100 possible state pairs (i, j), but only 27 distinct transitions were observed; so we consider 27 action sets. In each realization, the Markov process is initialized in state 0 with an empty graph. After performing 1000 simulations of 200 time steps, we calculate the total number of links at each time step (average 2.607) and the number of links added (average 0.120) and removed (average 0.106) between two time steps. In the MMP model, the average number of links observed at each time step is about 8.3% lower compared to the library mobility process. The average number of links added is about 8.9% higher, and the average number of removed links is about 3.7% lower. The error of the MMP model for the library data is relatively large compared to the random mobility simulator. We attribute this large difference to the fact that the number of links in the graph is small and that the model is run for fewer time steps.

One immediate observation is that the real mobility process in Fig. 6 exhibits a pronounced increase in the number of links followed by a decrease, but this trend is not captured by the MMP model. Figure 6 reveals that the modulated graph reaches 9 links at t = 87 and remains there until t = 97. At t = 98, the MMP then transitions to 8 links and remains there until t = 130. To better understand this behavior, we investigate the probability transition matrix of the MMP model in Fig. 7. Row i of the transition matrix P in Fig. 7 corresponds to state i, which encodes i links in the graph. Hence, the entry (i, j) indicates the probability of having j links in the graph after the next action. Row 9 in Fig. 7 indicates an 83% probability of staying at 9 links and a 17% probability of transitioning to 8 links. These probabilities are derived by observing the number of links in the mobility process in Fig. 6. From time k = 97 to k = 102, the graph has 9 links. This means that we observe 5 transitions from 9 links to 9 links. At k = 103, the graph transitions to 8 links and for the remainder of the mobility process, we never observe 9 links in the graph again. We observed no transition from 9 links to a number of links different from 8 and 9. Therefore, the probabilities in the probability transition matrix are 1/6 ≈ 0.17 and 5/6 ≈ 0.83, respectively.

Fig. 6. Number of links at each time step for the mobility process in the library (blue circles) and for a realization of our reduced MMP model (orange crosses). The time steps of our chosen window have been re-indexed to start from 0. Inset: entire data from the recorded day; the yellow strip represents the time window which is modeled by the MMP model.
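The two entries of row 9 can be checked with a short calculation. The sketch below rebuilds only the link counts around the peak as described in the text (9 links from k = 97 to k = 102, then 8 links at k = 103); the variable names are hypothetical. The last line uses the standard result that the expected number of consecutive steps spent in a state with self-transition probability p is 1/(1 − p).

```python
from fractions import Fraction

# Link counts from k = 97 to k = 103, as described in the text.
window = [9, 9, 9, 9, 9, 9, 8]

# All observed transitions that start from 9 links.
from_nine = [(i, j) for i, j in zip(window, window[1:]) if i == 9]

stay = Fraction(sum(1 for _, j in from_nine if j == 9), len(from_nine))
leave = Fraction(sum(1 for _, j in from_nine if j == 8), len(from_nine))
print(stay, leave)  # 5/6 1/6

# Expected number of consecutive steps spent at 9 links in the MMP model.
expected_sojourn = 1 / (1 - stay)
print(expected_sojourn)  # 6
```

An expected stay of about 6 steps at 9 links helps explain why a single MMP realization can remain at the peak for a stretch of time steps once it gets there.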
Fig. 7. Probability transition matrix P of the MMP model based on the real mobility process.
Due to the limited amount of data, the MMP model is sensitive to the idiosyncrasies of this particular dataset and the model overfits the data. Consider how state 9 can be reached: from the probability transition matrix, we observe that the only way to reach state 9 is from state 7 and the corresponding transition probability is 1. In terms of the original mobility process, this is interpreted as follows: 9 contacts between people can only occur if there were 7 contacts 30 s earlier (one time step). Furthermore, if there are currently 7 contacts between people, there is a 100% probability that after 30 s, there will be 9 contacts. The MMP model only incorporates transitions that have been observed in the mobility process. Unobserved transitions in the real mobility process have zero probability in the MMP model.
5 Conclusions and Outlook

We present the idea of a Markov Modulated Process (MMP) for constructing temporal contact networks for human mobility. The states of our reduced MMP model encode the number of links in the graph. We illustrate the accuracy of the MMP model for both a random mobility simulator and for real-world movements in a library.

Our implementation of a MMP can certainly be improved. A first concern is the overfitting of the current MMP model. Only observed transitions in the mobility process may occur in the MMP model. Any other transition is impossible, even though such a transition may be possible in reality. The MMP model can be improved by data augmentation of the mobility process, especially regarding rare events [3].

Another challenge is to encode spatial correlations in the MMP model. The number of added and removed links in the graph is based on the MMP model, but which link should be removed is currently random. Thus, spatial correlation between the nodes is ignored in the current implementation and we propose to adjust the reduced MMP model to incorporate the spatial correlations.

A third drawback of the reduced MMP model is that we implicitly assume that the number of links in the network is roughly constant. Practical situations for modeling population flow, such as airports, schools, libraries and other transfer locations, typically show activity spikes around certain events, such as lunch time. The reduced MMP formulation performs well in situations where activity stays roughly constant, such as long-term interactions of individuals with friends and family, but does not capture activity spikes.

We believe that Markov Modulated Processes constitute a powerful framework for modeling human mobility and may ultimately improve the models for investigating the spread of infectious diseases.

Acknowledgements. The authors thank Dr. ir. Sascha Hoogendoorn-Lanser for sharing the library data, which has been collected as part of an ongoing project led by her at TU Delft.
References

1. Barbosa, H., et al.: Human mobility: models and applications. Phys. Rep. 734, 1–74 (2018)
2. Barmak, D.H., Dorso, C.O., Otero, M.: Modelling dengue epidemic spreading with human mobility. Phys. A 447, 129–140 (2016)
3. Causer, L., Carmen Bañuls, M., Garrahan, J.P.: Optimal sampling of dynamical large deviations via matrix product states. Phys. Rev. E 103, 062144 (2021). https://doi.org/10.1103/PhysRevE.103.062144
4. Chang, B.: Modeling the spread of epidemics. MSc thesis, Delft University of Technology (2021). http://resolver.tudelft.nl/uuid:72206da3-4652-4a10-90bd-2fdd2a1e98f6
5. Feng, J., Yang, Z., Xu, F., Yu, H., Wang, M., Li, Y.: Learning to simulate human mobility. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data, pp. 3426–3433 (2020)
6. Fischer, W., Meier-Hellstern, K.: The Markov-modulated Poisson process (MMPP) cookbook. Perform. Eval. 18(2), 149–171 (1993)
7. Flores, M.A.R., Papadopoulos, F.: Similarity forces and recurrent components in human face-to-face interaction networks. Phys. Rev. Lett. 121(25), 258301 (2018)
8. Hossmann, T., Spyropoulos, T., Legendre, F.: A complex network analysis of human mobility. In: 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 876–881. IEEE (2011)
9. Huang, Z., et al.: Modeling real-time human mobility based on mobile phone and transportation data fusion. Transp. Res. Part C: Emerg. Technol. 96, 251–269 (2018)
10. Karamshuk, D., Boldrini, C., Conti, M., Passarella, A.: Human mobility models for opportunistic networks. IEEE Commun. Mag. 49(12), 157–165 (2011)
11. Mari, L., et al.: Modelling cholera epidemics: the role of waterways, human mobility and sanitation. J. R. Soc. Interface 9(67), 376–388 (2012)
12. Meloni, S., Perra, N., Arenas, A., Gómez, S., Moreno, Y., Vespignani, A.: Modeling human mobility responses to the large-scale spreading of infectious diseases. Sci. Rep. 1(1), 1–7 (2011)
13. Nguyen, A.D., Sénac, P., Ramiro, V., Diaz, M.: STEPS - an approach for human mobility modeling. In: Domingo-Pascual, J., Manzoni, P., Palazzo, S., Pont, A., Scoglio, C. (eds.) NETWORKING 2011. LNCS, vol. 6640, pp. 254–265. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20757-0_20
14. Pappalardo, L., Rinzivillo, S., Simini, F.: Human mobility modelling: exploration and preferential return meet the gravity model. Procedia Comput. Sci. 83, 934–939 (2016)
15. Schneider, C.M., Belik, V., Couronné, T., Smoreda, Z., González, M.C.: Unravelling daily human mobility motifs. J. R. Soc. Interface 10(84), 20130246 (2013)
16. Solmaz, G., Turgut, D.: A survey of human mobility models. IEEE Access 7, 125711–125731 (2019)
17. Song, C., Koren, T., Wang, P., Barabási, A.L.: Modelling the scaling properties of human mobility. Nat. Phys. 6(10), 818–823 (2010)
18. Starnini, M., Baronchelli, A., Pastor-Satorras, R.: Modeling human dynamics of face-to-face interaction networks. Phys. Rev. Lett. 110(16), 168701 (2013)
19. Tizzoni, M., et al.: On the use of human mobility proxies for modeling epidemics. PLoS Comput. Biol. 10(7), e1003716 (2014)
20. Van Mieghem, P.: Performance Analysis of Complex Networks and Systems. Cambridge University Press, Cambridge (2014)
21. Van Mieghem, P., Achterberg, M.A., Liu, Q.: Power-law decay in epidemics is likely due to interactions with the time-variant contact graph. Delft University of Technology, report 20201201 (2020). https://nas.ewi.tudelft.nl/people/Piet/TUDelftReports.html
22. Van Mieghem, P., Steyaert, B., Petit, G.H.: Performance of cell loss priority management schemes in a single server queue. Int. J. Commun. Syst. 10(4), 161–180 (1997)
23. Wang, J., Kong, X., Xia, F., Sun, L.: Urban human mobility: data-driven modeling and prediction. ACM SIGKDD Explor. Newsl. 21(1), 1–19 (2019)
24. Wesolowski, A., et al.: Quantifying the impact of human mobility on malaria. Science 338(6104), 267–270 (2012)
25. Yang, L.: Developing a Markov-modulated process model for mobility processes. MSc thesis, Delft University of Technology (2021). http://resolver.tudelft.nl/uuid:d37a389d-4145-4ebd-95fc-3dbe351158ad
26. Yang, S., Yang, X., Zhang, C., Spyrou, E.: Using social network theory for modeling human mobility. IEEE Netw. 24(5), 6–13 (2010)
27. Zhou, Y., Xu, R., Hu, D., Yue, Y., Li, Q., Xia, J.: Effects of human mobility restrictions on the spread of COVID-19 in Shenzhen, China: a modelling study using mobile phone data. Lancet Digit. Health 2(8), e417–e424 (2020)
An Adaptive Mental Network Model for Reactions to Social Pain
Katarina Miletic¹, Oleksandra Mykhailova², and Jan Treur³
¹ Department of Psychology, University of Bologna, Bologna, Italy
² Cognitive Science, University of Warsaw, Warszawa, Poland
³ Department of Computer Science, Social AI Group, Vrije Universiteit Amsterdam, Amsterdam, Netherlands
Abstract. Reactions to social pain and the behavioral strategies adopted as a consequence of it are a complex and adaptive phenomenon that leads to major consequences for a person's social functioning and overall well-being. As such behavioral strategies change over time and are difficult to explain by simple linear correlations, a model such as the one presented here is useful for providing a more nuanced understanding of the phenomenon and the way it changes dynamically over time, potentially leading to richer theoretical predictions and new directions for research.

Keywords: Social pain · Adaptive mental network
1 Introduction

Human beings are inherently social creatures and adapting to life in a social environment is an inherent part of both survival and thriving of an individual. A significant part of human behavior is shaped by the innate need to connect to others, to belong to a group and to be accepted by its members; our brains are hardwired to interpret social exclusion or rejection similarly to experiences of physical pain, reutilizing the same neuronal networks originally developed to help learn to avoid physical pain to shape social behavior in a way that facilitates belonging and social acceptance (Eisenberger 2012). However, large numbers of individuals in modern society suffer from social isolation and difficulties finding significant connections: around a third of people in industrialized countries are suffering from these conditions, and 1 in 12 are experiencing a severe state of loneliness. Because constant activation of the neural pain network leads to many imbalances in the body such as a chronically suppressed immune system (Chester et al. 2012), for such individuals the levels of social pain may be severe enough to lead not only to hurt feelings and higher probabilities of psychological problems (Mushtaq et al. 2014) but also to concrete health risks such as significantly higher risks of heart problems such as coronary disease or a stroke (Valtorta et al. 2016).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 619–631, 2022. https://doi.org/10.1007/978-3-030-93409-5_51
While for many their state of chronic lack of social connection is out of their immediate control, many others demonstrate self-defeating or excessively avoidant, anxious, or mistrustful behavior strategies that actively prevent them from forming meaningful social connections even when the external circumstances allow for these connections to be formed (Hulsman et al. 2021). The tendency to adopt such maladaptive strategies is a complex phenomenon that depends on many factors, such as the individual’s personal temperament, their history of socialization experiences and the nature of their current environment. The main goal of this paper is to make an attempt at modeling the dynamics of the development and change of responses to social threat and social pain, with a particular emphasis on how the harshness and unpredictability of the individual’s environment shape their reactions with time. A mental network model with adaptations for social learning processes was designed for this purpose. In Sect. 2 the theoretical background of the model will be provided, while the description of the model can be found in Sect. 3. In Sect. 4 results of the simulation will be provided, which will then be discussed in Sect. 5.
2 Background Literature

Approach and avoidance motivation are at the core of human behavior regulation; any behavior of a living, thinking being is fundamentally based on the tendency to approach potentially useful stimuli and to avoid potentially unpleasant ones. Translated into terms relevant to social behavior, at every point in time a person is seeking to obtain desired social end states such as intimacy and acceptance, while simultaneously avoiding undesirable end states such as rejection and pain. While the balance of these tendencies is affected by the presence of the appropriate stimuli in the environment, it is also influenced by the individual’s innate tendencies, by their history of previous positive and negative experiences, and by their current level of need satisfaction (Gable et al. 2008). Expectations of threat and reward, while both functionally and empirically connected to approach and avoidance motivation, are separate constructs that pertain mostly to the individual’s assessment of how likely their social environment, or a concrete stimulus, is to either cause social pain or lead to a pleasant social connection. Threat and reward expectations are still influenced by approach and avoidance motivation, as a certain type of motivation will make us more sensitive to stimuli connected to it (e.g. a person highly motivated to avoid social pain will be more sensitive to even weak socially painful cues in order to avoid them effectively), but they also depend on the particular environmental conditions as well as on the subject’s perception of those conditions. The capacity to estimate an environmental cue as threatening or potentially rewarding is subject to learning throughout the lifetime. Intuitively, human beings aim to develop a behavioral strategy that leads to the largest possible net gain in social rewards while avoiding social pain as much as possible.
However, not all environments are equally rewarding or threatening, and variation in these parameters, especially in early life history, influences threat and reward expectations and approach and avoidance motivation in such a way as to form a coherent strategy that benefits the organism the most (Chester et al. 2012). These strategies sometimes persist even when environmental conditions change, leading to dysfunction and negative outcomes for the individual. Examples of such strategies are the previously mentioned higher sensitivity to potentially threatening or rewarding cues based on the prevalence of avoidance or approach motivation (Gable et al. 2008), or the Optimal Calibration Hypothesis (Chester et al. 2012), according to which chronic social pain leads to a downregulation of the social pain network whereas unpredictable and occasional social pain leads to its heightened sensitivity. Some studies have found that social rejection actually leads to higher approach motivation, but only when the potential for social reward is estimated as high by the subject. Therefore, if the subject is rejected but expects a higher chance of obtaining a social reward by employing prosocial behavior, they will approach others in order to compensate for the previous rejection; if the expectation of social reward is not certain enough, they will not risk further rejection and will withdraw instead. This has been confirmed empirically by studies in which people rejected during a simulated collaborative game tended to show more prosocial behavior, but only if they were securely attached, an attachment style whose main characteristic is the expectation that others will be consistently available and accepting. On the other hand, people who anticipate higher levels of rejection will not only develop a tendency for higher avoidance motivation, but will also downregulate their expectations of potential reward in order to reduce risky approach behavior that may lead to further social pain (MacDonald et al. 2011).
3 Modeling Adaptive Networks

Starting from the background introduced in Sect. 2, an adaptive mental network model was created using the modeling approach described in Treur (2016) and Treur (2020). This kind of (temporal-causal) network model uses states indicated by X and Y and is characterized by:

(a) Connectivity characteristics
• connection weights from a state X to a state Y, denoted by $\omega_{X,Y}$
(b) Aggregation characteristics
• a combination function for each state Y, denoted by $c_Y(..)$. This describes how to combine the causal impacts of other states on state Y
(c) Timing characteristics
• a speed factor for each state Y, denoted by $\eta_Y$. This determines how fast or slowly the impact changes the state

Table 1 indicates how, within a dedicated software environment, a numerical representation is obtained from these characteristics in a canonical manner. The last row of this table shows the difference equation, which is used for simulations.
Table 1. Numerical representation of temporal-causal network; from (Treur 2020), Ch. 2

| Concept | Representation | Explanation |
|---|---|---|
| State values over time t | $Y(t)$ | At each time point t each state Y in the model has a real number value in [0, 1] |
| Single causal impact | $\mathrm{impact}_{X,Y}(t) = \omega_{X,Y}\,X(t)$ | At t state X with connection to state Y has an impact on Y, using connection weight $\omega_{X,Y}$ |
| Aggregating multiple impacts | $\mathrm{aggimpact}_Y(t) = c_Y(\mathrm{impact}_{X_1,Y}(t), \dots, \mathrm{impact}_{X_k,Y}(t)) = c_Y(\omega_{X_1,Y}X_1(t), \dots, \omega_{X_k,Y}X_k(t))$ | The aggregated causal impact of multiple states $X_i$ on Y at t is determined by applying a combination function $c_Y(V_1, \dots, V_k)$ to the k single causal impacts |
| Timing of the causal effect | $Y(t+\Delta t) = Y(t) + \eta_Y\,[\mathrm{aggimpact}_Y(t) - Y(t)]\,\Delta t = Y(t) + \eta_Y\,[c_Y(\omega_{X_1,Y}X_1(t), \dots, \omega_{X_k,Y}X_k(t)) - Y(t)]\,\Delta t$ | The causal impact on Y is exerted over time gradually, using speed factor $\eta_Y$; here the $X_i$ are all states from which state Y has incoming connections |
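The difference equation in the last row of Table 1 can be sketched in a few lines of Python. This is only an illustrative sketch and not the dedicated MATLAB environment used by the authors: the function name `euler_step`, the dictionary layout, and the toy values in the usage lines are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical data layout (not from the paper):
#   values[Y]   -> current value Y(t) in [0, 1]
#   incoming[Y] -> list of (X, weight) pairs, i.e. connection weights omega_{X,Y}
#   combine[Y]  -> combination function c_Y applied to the list of single impacts
#   speed[Y]    -> speed factor eta_Y


def euler_step(
    values: Dict[str, float],
    incoming: Dict[str, List[Tuple[str, float]]],
    combine: Dict[str, Callable[[List[float]], float]],
    speed: Dict[str, float],
    dt: float,
) -> Dict[str, float]:
    """Apply Y(t+dt) = Y(t) + eta_Y * (aggimpact_Y(t) - Y(t)) * dt to every state."""
    new_values = {}
    for state, current in values.items():
        sources = incoming.get(state, [])
        if not sources:
            # States without incoming connections keep their value in this sketch.
            new_values[state] = current
            continue
        # Single causal impacts: omega_{X,Y} * X(t)
        impacts = [weight * values[source] for source, weight in sources]
        aggregated = combine[state](impacts)
        new_values[state] = current + speed[state] * (aggregated - current) * dt
    return new_values


# Toy usage: one constant source state X driving Y with weight 0.8.
toy_values = {"X": 1.0, "Y": 0.0}
toy_incoming = {"Y": [("X", 0.8)]}
toy_combine = {"Y": sum}
toy_speed = {"Y": 0.5}
toy_next = euler_step(toy_values, toy_incoming, toy_combine, toy_speed, dt=0.5)
```

All states are updated from the values at time t, so the order in which states are visited does not affect the result.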
The equations from Table 1 are hidden in the dedicated software environment; see (Treur 2020), Ch. 9. Within this software environment, currently around 50 useful basic combination functions are included in a combination function library; see Table 2 for the ones used in this paper. The ones selected for a model are assigned to states Y by specifying combination function weights $\gamma_{i,Y}$ and their parameters $\pi_{i,j,Y}$.

Table 2. Combination functions used

| Function | Notation | Formula | Parameters |
|---|---|---|---|
| Advanced logistic sum | $\mathrm{alogistic}_{\sigma,\tau}(V_1, \dots, V_k)$ | $\left[\frac{1}{1+e^{-\sigma(V_1+\dots+V_k-\tau)}} - \frac{1}{1+e^{\sigma\tau}}\right](1+e^{-\sigma\tau})$ | Steepness $\sigma > 0$; excitability threshold $\tau$ |
| Stepmod | $\mathrm{stepmod}_{\rho,\delta}(V)$ | 0 if $t \bmod \rho < \delta$, else 1 | $\rho$ repetition interval length; $\delta$ step time |
| Stepmodopp | $\mathrm{stepmodopp}_{\rho,\delta}(V)$ | 1 if $t \bmod \rho < \delta$, else 0 | $\rho$ repetition interval length; $\delta$ step time |
| Hebbian learning | $\mathrm{hebb}_{\mu}(V_1, V_2, W)$ | $V_1 V_2 (1 - W) + \mu W$ | $V_1, V_2$ activation levels of the connected states; $W$ activation level of the self-model state for the connection weight; $\mu$ persistence factor |
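As a rough illustration, the four combination functions of Table 2 could be written in Python as follows. This is a sketch under stated assumptions rather than the library code of the dedicated environment: the function signatures are invented for the example, time t is passed explicitly to the two step functions, and the strict comparison `<` in them is an assumption, since the comparison operator is not legible in the extracted table.

```python
import math
from typing import Sequence


def alogistic(impacts: Sequence[float], sigma: float, tau: float) -> float:
    """Advanced logistic sum with steepness sigma > 0 and excitability threshold tau."""
    total = sum(impacts)
    raw = 1.0 / (1.0 + math.exp(-sigma * (total - tau)))
    offset = 1.0 / (1.0 + math.exp(sigma * tau))
    # Subtracting the offset makes the result exactly 0 for zero total impact;
    # the last factor rescales so that the upper limit is 1.
    return (raw - offset) * (1.0 + math.exp(-sigma * tau))


def stepmod(t: float, rho: float, delta: float) -> float:
    """0 during the first delta time units of each interval of length rho, else 1."""
    return 0.0 if (t % rho) < delta else 1.0  # strict '<' is an assumption


def stepmodopp(t: float, rho: float, delta: float) -> float:
    """Opposite of stepmod: 1 during the first delta time units, else 0."""
    return 1.0 if (t % rho) < delta else 0.0  # strict '<' is an assumption


def hebb(v1: float, v2: float, w: float, mu: float) -> float:
    """Hebbian learning with persistence factor mu: V1*V2*(1 - W) + mu*W."""
    return v1 * v2 * (1.0 - w) + mu * w
```

For instance, `hebb(1.0, 1.0, 0.5, 0.9)` evaluates to 0.5 + 0.45 = 0.95, so a weight of 0.5 is pulled upward when both connected states are fully active.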
Using the approach to network adaptation from (Treur 2020), adaptive network models are obtained based on the notion of a self-modeling (or reified) network. Any network characteristic can be made adaptive by adding a (self-model) state to the network that represents the value of this characteristic. This will be applied here to obtain self-model states $W_{X,Y}$ in the network that represent the value of connection weight $\omega_{X,Y}$. For learning within a mental network model, the Hebbian learning combination function $\mathrm{hebb}_{\mu}$ (see Table 2) can be (and in the current paper will be) used for such self-model states $W_{X,Y}$.
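The role of a self-model state can be made concrete with a small sketch: one base connection X → Y whose weight is not a constant but the current value of a self-model state W, which itself is updated with the Hebbian combination function. Everything specific here is an assumption made for illustration (the function names, a constant source state X, an identity combination function for Y, and the parameter values used in the usage lines); the paper's own settings are in its Appendix.

```python
from typing import List, Tuple


def hebb(v1: float, v2: float, w: float, mu: float) -> float:
    """Hebbian learning combination function: V1*V2*(1 - W) + mu*W."""
    return v1 * v2 * (1.0 - w) + mu * w


def simulate_hebbian(
    steps: int,
    dt: float,
    x_value: float,
    eta_y: float,
    eta_w: float,
    mu: float,
    w0: float = 0.1,
) -> List[Tuple[float, float]]:
    """Simulate Y and the self-model state W for a constant source state X.

    W plays the role of the connection weight omega_{X,Y}; both Y and W
    follow the generic difference equation of the temporal-causal format.
    """
    x, y, w = x_value, 0.0, w0
    history = []
    for _ in range(steps):
        new_y = y + eta_y * (w * x - y) * dt              # weight taken from W
        new_w = w + eta_w * (hebb(x, y, w, mu) - w) * dt  # Hebbian adaptation
        y, w = new_y, new_w
        history.append((y, w))
    return history


# With an active source the weight grows; without activation and mu < 1 it decays.
learning = simulate_hebbian(200, 0.5, x_value=1.0, eta_y=0.5, eta_w=0.5, mu=1.0)
extinction = simulate_hebbian(200, 0.5, x_value=0.0, eta_y=0.5, eta_w=0.5, mu=0.9)
```

The second run shows the general property that a persistence factor below 1 lets an unused connection weaken over time.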
4 An Adaptive Network Model for Reactions to Social Pain

The current adaptive network model was designed on the basis of the foundations described in Sect. 2, relying mostly on the work done by MacDonald et al. (2011), Gable and Berkman (2013), and Chester et al. (2012). A graphical representation of the connectivity of the network model can be found in Fig. 1, while a full list of its states can be found in Table 3.
Fig. 1. Full graphic representation of the connectivity of the adaptive network model. Negative/suppressing connections are represented by dotted lines. The base level is represented in the pink, lower plane; the self-model level is represented in the blue, upper plane. Light grey (downward) arrows represent influences of self-model states.
The states X1 and X2 are world states meant to represent a negative and a positive social stimulus, respectively. The starting assumption of the model is that the subject is initially incapable of anticipating the outcome of an interaction with either stimulus, and therefore cannot know whether to expect a social threat or a social reward. Within the model, the perception of one of the stimuli leads to either a positive or a negative evaluation of the stimulus (X21 and X20, respectively); because the subject does not at first know which stimulus is the harmful one and which is the beneficial one, the model starts off with near-equal weights for the connections from X3 and X4 to X5 and X6. Self-model states based on Hebbian learning (X12–X15) render all four of these connection weights susceptible to modification by learning, so that with enough exposure to both stimuli the subject can learn to differentiate between a stimulus that is likely to be rejecting and one that is likely to be rewarding.
States X5 and X6 represent social threat and social reward expectations at each given moment and are influenced both by the valuation of a stimulus expected to be threatening or rewarding (X20 and X21) and by one’s motivation orientation. States X7 and X8 (avoidance and approach motivation, respectively) are influenced both by previous experiences of social pain and reward and by perceptions of potential threats and rewards present in the environment. The intertwining connections between these states are based on empirical findings; e.g. avoidance motivation suppressing reward expectations is based on MacDonald et al. (2011). Finally, X9 is a state that represents approach behavior; the model is set up using the scaled minimum function in such a way that the simultaneous activation of X9 and the appropriate stimulus leads to the experience of either social pain (X10) or social reward (X11).

Table 3. Overview of the states in the model

| State | Name | Explanation |
|---|---|---|
| X1 | thrstim | Socially threatening stimulus |
| X2 | rewstim | Socially rewarding stimulus |
| X3 | srsthrstim | Perception of socially threatening stimulus |
| X4 | srsrewstim | Perception of socially rewarding stimulus |
| X5 | expthrstim | Expectations of social threat |
| X6 | exprewstim | Expectations of social reward |
| X7 | AvMot | Avoidance motivation |
| X8 | ApMot | Approach motivation |
| X9 | ApBeh | Approach behaviour |
| X10 | SocPain | Social pain |
| X11 | SocPleCon | Social pleasure/connection |
| X20 | NegEval | Negative evaluation of stimulus |
| X21 | PosEval | Positive evaluation of stimulus |
| X12 | $W_{X3,X5}$ | Self-model state for the weight of the connection X3 → X5 |
| X13 | $W_{X3,X6}$ | Self-model state for the weight of the connection X3 → X6 |
| X14 | $W_{X4,X6}$ | Self-model state for the weight of the connection X4 → X6 |
| X15 | $W_{X4,X5}$ | Self-model state for the weight of the connection X4 → X5 |
| X16 | $W_{X1,X10}$ | Self-model state for the weight of the connection X1 → X10. Those with stronger social avoidance goals react more strongly to negative social events (Gable et al. 2008). A higher perception of potential threat leads to higher sensitivity to potential rejection |
| X17 | $W_{X6,X9}$ | Self-model state for the weight of the connection X6 → X9. Social reward perceptions are more likely to influence social behaviour when approach goals are strong (MacDonald et al. 2011) |
| X18 | $W_{X10,X8}$ | Self-model state for the weight of the connection X10 → X8. Perceptions of social reward moderate the connection between social pain and social approach motivation (MacDonald et al. 2011) |
| X19 | $W_{X2,X11}$ | Self-model state for the weight of the connection X2 → X11. Heightened approach motivation leads to greater sensitivity to positive social behaviour |
In addition to the Hebbian learning self-model states that modify the relationship between X3 and X4 and X20 and X21, other adaptive parameters have been modelled to reflect empirically proven relationships between the variables. For example, the fact that a stronger avoidance motivation leads to a stronger reaction to social pain is modelled by the state X16, representing the connection weight between the socially threatening stimulus X1 and the experience of social pain X10. Individual differences in psychological functioning can be modelled by changing the characteristics and initial values for the network model; for example, the modelled individual could be made more sensitive to social pain or social pleasure by modifying the appropriate characteristics in the model such as excitability thresholds.
5 Simulation Results

To verify the behavior of the model, a simulation scenario was run in the dedicated software environment in MATLAB; for the outcomes, see Figs. 2, 3, 4, 5, 6, 7, and 8. Because the aim is to initially test the behavior using a relatively simple scenario, the two world states X1 and X2 were designed to alternate at more or less regular intervals, using the stepmod function and its opposite. The tables with all the network characteristics used for this simulation can be found in the Appendix available as Linked Data at https://www.researchgate.net/publication/354157186. The simulation was run for 800 time units, with $\Delta t = 0.5$. Figure 2 shows the outcomes for the full model; because of its complexity, the subsequent figures show only the relevant parts of the model.
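The alternating stimulus pattern of this scenario can be sketched as follows. Only the run length (800 time units) and the step size (0.5) come from the text; the repetition interval `rho`, the step time `delta`, the assignment of stepmod to X1 and of its opposite to X2, and the function name are hypothetical choices for illustration, since the actual values are given only in the Appendix.

```python
from typing import List, Tuple


def stimulus_schedule(
    end_time: float = 800.0,
    dt: float = 0.5,
    rho: float = 40.0,    # hypothetical repetition interval length
    delta: float = 20.0,  # hypothetical step time
) -> List[Tuple[float, float, float]]:
    """Return (t, X1, X2) triples in which the two world states alternate."""
    steps = int(end_time / dt)
    schedule = []
    for k in range(steps + 1):
        t = k * dt
        first_part = (t % rho) < delta
        x1 = 0.0 if first_part else 1.0  # stepmod: threatening stimulus (assumed)
        x2 = 1.0 if first_part else 0.0  # stepmodopp: rewarding stimulus (assumed)
        schedule.append((t, x1, x2))
    return schedule


schedule = stimulus_schedule()
```

With these assumed parameters exactly one of the two stimuli is present at every time point, which matches the simple scenario described above; overlapping stimuli would need a different schedule.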
Fig. 2. Displaying the full model for the example simulation
The subject starts out with a behavioral strategy that is based on indiscriminate and enthusiastic approach to all stimuli; after several exposures to painful stimuli their strategy changes to a more cautious one in which only the stimulus that is predicted to be highly rewarding is approached while all other stimuli are avoided.
Fig. 3. Displaying only World states X1 and X2, Approach behavior (state X9) and Social pain (state X10)
Though this strategy is beneficial for the purpose of avoiding social pain, the Hebbian learning states X12–X15 show that, as far as long-term learning is concerned, a strategy of this type is less than optimal: constantly avoiding exposure to all stimuli except those that are sure to be positive leads to a decline in the capacity to recognize a stimulus as potentially threatening. The approach and avoidance motivation states show that this change in behavioral strategy is due to a change in the attractor state for avoidance motivation: at one point avoidance motivation stops peaking when the subject is exposed to social pain and then returning to its initial value of zero, and instead becomes consistently high, which leads to a stronger tendency towards avoidance of stimuli. However, due to previous positive experiences with stimulus X2, approach motivation and expectations of reward from stimulus X2 also remain consistently high, and the subject is able to safely approach X2 due to these high expectations of reward.
Fig. 4. Hebbian learning self-model states (X12–X15) of associations between world states X1 and X2 and positive and negative evaluations (states X20 and X21, respectively).
Here we can see that after the shift in behavior strategy, levels of perceived threat remain consistently higher than they were before, being influenced by the high levels of avoidance motivation. In fact, higher levels of perceived threat due to exposure to social pain are what initially triggers the change in behavior strategy: because both higher levels of perceived threat and higher avoidance motivation lead to higher sensitivity to social pain (as operationalized by self-model state X16), the amount of social pain experienced increases with each exposure to the stimulus until the subject is motivated to avoid the stimulus entirely. Several other scenarios have been modelled for exploration, and with slight changes to some of the network characteristics, or a change in the pattern or frequency of the two world states, different behavior patterns could be modelled, such as a variation in which avoidance motivation becomes consistently higher than approach motivation after several exposures to the painful stimulus, leading to a total cessation of approach behavior despite the high expectations for reward of X2.
6 Discussion

As discussed in this paper, reactions to social pain and the behavioral strategies adopted as a consequence of it are a complex and adaptive phenomenon with major consequences for a person’s social functioning and overall well-being. Because such behavioral strategies change over time and are difficult to explain by simple linear correlations, a model such as this one is useful for providing a more nuanced understanding of the phenomenon and the way that it changes dynamically over time, potentially leading to richer theoretical predictions and new directions for research.
Fig. 5. Threat and reward expectations (states X5, X6) and Avoidance and approach motivation (states X7, X8)
Fig. 6. Expectations of threat (state X5), Avoidance motivation (state X7), Social pain (state X10) and Negative evaluation of stimulus (state X20).
Fig. 7. All self-model states of the model
By changing the frequencies of the appearance of the world state stimuli X1 and X2, different types of environments can be modelled, as can changes in these environments, so the model is also useful for giving insight into the consequences of various kinds of environment. By changing some of the network characteristics, individual differences in personality can be modelled. These two options can be combined to derive practical implications from the model, e.g. predicting the consequences of exposure to various types of harsh environment, predicting differential sensitivity of certain subsets of the population (those carrying certain characteristics) to certain types of environment, or predicting response to treatment and the intensity of treatment necessary to offset the deleterious effects of previous exposure to high levels of social pain (some may benefit from simply being shifted to a less hostile environment, whereas others may require specific and potentially intense treatment).
Fig. 8. Alternative model setting that leads to complete cessation of approach behaviour - the only change was in the connection weights leading from states X5 and X10 to state X7 from 0.4 to 0.5.
The model as it is assumes that all socially threatening stimuli are the same, neglecting to distinguish between ignoring, outright rejection and social aggression, which could be a theoretically salient distinction (MacDonald et al. 2011). It also assumes that the only kinds of behavior available to the subject are to approach or to avoid, ignoring the existence of aggressive (approaching but antisocial) behavior. Therefore, the model in its current state is meant to represent a simplified version of the reaction to socially painful stimuli, but could be expanded in the future to also consider aggression, both as a type of socially painful stimulus and as a type of reaction to it. Even in its current, reductive state, however, the model illustrates the processes that shape the reactions to painful and beneficial social stimuli. Furthermore, the adaptive network model as it has been used in the simulations also assumes that social stimuli are either unambiguously positive or unambiguously negative, omitting the existence of ambivalent or simultaneously painful and rewarding stimuli. Should such a stimulus be simulated (for example, by having the values for X1 and X2 overlap at value 1), it is possible that the model would not behave as theoretically expected, seeing as theory predicts several control mechanisms (Evans and Britton 2020) developed to resolve the conflict between approach and avoidance motivation in such cases. Due to lack of time, these control functions were not included in the model, but they could present a fruitful direction for expanding the model in order to make it more lifelike and capable of simulating a larger variety of situations.
Finally, the model assumes that the processes represented by it function the same way at all time points; it is, however, empirically known that there are sensitive periods in human development where these processes are more plastic than at other points, and that this plasticity decreases over time (Chester et al. 2012). A second-order adaptation level could be included to simulate this by varying the learning speed of some of the (first-order) adaptation states over time, to simulate differential susceptibility to change over the lifespan. If a developmental approach were to be taken, however, the fact that infants are incapable of actively avoiding socially threatening stimuli should also be taken into consideration and simulated within the model.
References

Cacioppo, J.T., Cacioppo, S.: The growing problem of loneliness. The Lancet 391(10119), 426 (2018). https://doi.org/10.1016/S0140-6736(18)30142-9
Chester, D.S., Pond, R.S., Jr., Richman, S.B., DeWall, C.N.: The optimal calibration hypothesis: How life history modulates the brain’s social pain network. Front. Evolution. Neurosci. 4, 10 (2012). https://doi.org/10.3389/fnevo.2012.00010
Eisenberger, N.I.: The pain of social disconnection: examining the shared neural underpinnings of physical and social pain. Nat. Rev. Neurosci. 13, 421–434 (2012)
Evans, T.C., Britton, J.C.: Social avoidance behaviour modulates automatic avoidance actions to social reward-threat conflict. Cogn. Emot. 34(8), 1711–1720 (2020)
Gable, S.L., Berkman, E.T.: Making connections and avoiding loneliness: approach and avoidance social motives and goals. In: Handbook of Approach and Avoidance Motivation, vol. 203 (2013)
Gable, S.L., Strachman, A.: Approaching social rewards and avoiding social punishments: appetitive and aversive social motivation. In: Shah, J.Y., Gardner, W.L. (eds.) Handbook of Motivation Science, pp. 561–575. The Guilford Press, New York (2008)
Hulsman, A.M., et al.: Individual differences in costly fearful avoidance and the relation to psychophysiology. Behav. Res. Therapy 137, 103788 (2021)
MacDonald, G., Borsook, T.K., Spielmann, S.S.: Defensive avoidance of social pain via perceptions of social threat and reward. In: MacDonald, G., Jensen-Campbell, L.A. (eds.) Social Pain: Neuropsychological and Health Implications of Loss and Exclusion, pp. 141–160. American Psychological Association, Washington, D.C. (2011). https://doi.org/10.1037/12351-006
Mushtaq, R., Shoib, S., Shah, T., Mushtaq, S.: Relationship between loneliness, psychiatric disorders and physical health? A review on the psychological aspects of loneliness. J. Clin. Diagn. Res. 8(9), WE01 (2014)
Treur, J.: Network-Oriented Modeling: Addressing Complexity of Cognitive, Affective and Social Interactions. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45213-5
Treur, J.: The ins and outs of network-oriented modeling: from biological networks and mental networks to social networks and beyond. In: Nguyen, N.T., Kowalczyk, R., Hernes, M. (eds.) Transactions on Computational Collective Intelligence XXXII. LNCS, vol. 11370, pp. 120–139. Springer, Heidelberg (2018). https://doi.org/10.1007/978-3-662-58611-2_2
Treur, J.: Network-Oriented Modeling for Adaptive Networks: Designing Higher-Order Adaptive Biological, Mental and Social Network Models. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-31445-3
Valtorta, N.K., Kanaan, M., Gilbody, S., Ronzi, S., Hanratty, B.: Loneliness and social isolation as risk factors for coronary heart disease and stroke: systematic review and meta-analysis of longitudinal observational studies. Heart 102(13), 1009–1016 (2016)
Impact of Monetary Rewards on Users’ Behavior in Social Media

Yutaro Usui^1(✉), Fujio Toriumi^2, and Toshiharu Sugawara^1

1 Department of Computer Science and Communications Engineering, Waseda University, Shinjuku-ku, Tokyo 169-8050, Japan
[email protected], [email protected]
2 Department of Systems Innovation, The University of Tokyo, Tokyo 113-8654, Japan
[email protected]
Abstract. This paper investigates the impact of monetary rewards on behavioral strategies and the quality of posts in consumer generated media (CGM). In recent years, some CGM platforms have introduced monetary rewards as an incentive to encourage users to post articles. However, the impact of monetary rewards on users has not been sufficiently clarified. Therefore, to investigate the impact of monetary rewards, we extend the SNS-norms game, which models SNSs based on evolutionary game theory, by incorporating a model of monetary rewards, the users’ preferences for them, and their efforts for article quality. The results of the experiments on several types of networks indicate that monetary rewards promote the posting of articles but significantly reduce article quality. In particular, when the value of the monetary reward is small, it significantly reduces the utilities of all users owing to a decrease in quality. We also found that individual users’ preferences for monetary rewards led to clear differences in their behavior.

Keywords: Social media · Consumer generated media · Monetary reward · Social network service · Public goods game

1 Introduction
Many consumer generated media (CGM), or more generally social media, have been developed around the world and have become influential communication media. They are used for a variety of purposes, including the establishment of online social relationships and communities, and information sharing and exchange within the communities [7]. Generally, CGM is supported by a vast amount of content/articles provided by users. It is costly for users to post articles, yet the main motivation for users to provide content is the psychological reward, that is, the satisfaction of the desire for self-expression and a sense of belonging to society [9]. Additionally, some CGM have a mechanism that gives users monetary rewards, or points that are almost equivalent to monetary rewards, for posting articles or comments to promote activity; some users stay active to obtain one or both of the psychological and monetary rewards. In such diverse situations, it is important to elucidate the reasons why users continue to provide content for the growth of CGM and social media, and to clarify the conditions and mechanisms that make such growth possible.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 632–643, 2022. https://doi.org/10.1007/978-3-030-93409-5_52

Many studies have attempted to understand the reasons and mechanisms by which users contribute content to social networking services (SNSs) and social media. Natalie et al. [3] analyzed the users’ motivation for posting by applying a text mining technique to posting data in an SNS. Zhao et al. [15] interviewed SNS users to identify their purposes for using SNSs and the impact on physical face-to-face communication. Some studies have used evolutionary game theoretic approaches to analyze the impact of various mechanisms of SNSs on users. Toriumi et al. [13] modeled the activity in an SNS using Axelrod’s public goods games [2] and showed that the existence of meta-comments plays a significant role in the prosperity of SNSs. Hirahara et al. [6] proposed the SNS-norms game, which incorporates the characteristics of SNSs that cannot be represented by the public goods game, and showed that low-cost responses such as the “Like” button strongly affect users’ activities.

Recently, some CGM/social media have introduced monetary rewards or point awards for articles and comments, temporarily or permanently, to attract users. For example, on Rakuten Recipe (https://global.rakuten.com/corp/), an online recipe sharing site operated by the Rakuten Group in Japan, users can post cooking recipes and browse those that have been posted. When users cook meals using the recipes, they can post reports/reviews on the recipes as comments. When users post recipes or comments, they are rewarded with Rakuten points that can be used in the group’s online markets, which makes them a kind of monetary reward.
Although such monetary rewards could be powerful incentives, their actual influence on the users’ activities and their impact on competition with other CGM are not fully known. Moreover, previous studies [6,13] based on evolutionary games mainly incorporate only psychological rewards in their models, and do not model monetary rewards to users. Thus, we attempt to analyze the impact of monetary incentives on user behavior based on an evolutionary game. More specifically, we extend the SNS-norms game for CGM by adding a parameter indicating the article quality, as well as two types of rewards corresponding to psychological and monetary rewards. Simultaneously, we extend the user model (agent) by modeling their preferences for rewards and the average quality of the posted articles. The extended SNS-norms game is then played between agents on networks represented by a complete graph and on networks based on the connecting nearest neighbor (CNN) model [14]. Subsequently, we investigate the dominant strategies that the agents learn through the interaction and the effect of monetary rewards on the agents’ behaviors.
2
Related Work
Many studies have investigated the impact of social media on people [1,5,11,12]. For example, Elison et al. [5] examined the relationship between Facebook usage
634
Y. Usui et al.
and the formation of social capital from a survey of users (undergraduate students) and the regression analysis using these data. Their results suggested that the use of Facebook was related with measures of psychological well-being and users who experienced lower life satisfaction and lower self-esteem may gain more benefits from Facebook. Adelaniaea al. [1] used a predictive model to test whether community feedback, such as replies and comments, would affect users’ posts. The results showed that feedback increased the rate of users’ continuous posting. Shahbaznezhad et al. [12] investigated the impact of content on the users’ engagement on social media by analyzing posts and responses on two social media platforms and found that these impacts are largely dependent on the type of platform and the modality of the contents. Ostic et al. [11] conducted a survey among students to determine the impact of social media use on psychological well-being, with a particular concentration on social capital, social isolation, and smartphone addiction. Their analysis showed that social media use has a positive effect on psychological well-being by fostering social capital, whereas smartphone addiction and social isolation have a significantly negative effect on the psychological well-being. These studies focus on the interaction and psychological aspects of social media through empirical analysis, and do not analyze the dominant behavior based on rationality. They also did not discuss the effect of monetary rewards on the users’ psychological states. Several studies have investigated the implementation of monetary rewards on social media and their effects on the user’s behavior. Chen et al. [4] empirically investigated the impact of financial incentives on the number and quality of content posted on social media in financial markets. They then found that monetary incentives increase the motivation to provide content, but do not improve the quality of the content. L´ opez et al. 
[8] investigated electronic word-of-mouth (e-WoM) and analyzed the types of incentives for opinion leaders to spread information via e-WoM. Their results showed that opinion leaders responded differently to monetary and non-monetary rewards. However, these studies were limited to empirical surveys of specific services and did not indicate whether their results were applicable to other social media. In contrast, our study aims at understanding the impact of monetary incentives in a more general manner. For this purpose, we extend the abstract model of SNS, the SNS-norms game, to adapt it to CGM by incorporating the concepts of abstracted monetary incentives and the quality of content. Subsequently, we experimentally show the effect of the reward on the content and the behavioral strategies of CGM users.
3 Proposed Method
3.1 SNS-Norms Game with Monetary Reward and Article Quality
The SNS-norms game [6] models three types of user behavior in SNS: article posts, comments on posted articles, and meta-comments (comments on comments). These behaviors come at a cost, but the users can receive psychological rewards from the articles, the comments, and meta-comments. Therefore, a user
Impact of Monetary Rewards on Users’ Behavior
Fig. 1. Flow of SNS-norms game with monetary reward and article quality.
gains utility from the interaction of such behaviors, where utility is the psychological reward obtained as a result of the behaviors of other users minus the cost of the user’s own actions, such as posts and comments. The SNS-norms game runs on a network of agents represented by the graph G = (A, E), where A = {1, ..., n} is a set of n agents and E is a set of undirected edges between agents, representing the links (or friend relationships) between agents. We propose the SNS-norms game with monetary reward and article quality, which adds two parameters to the SNS-norms game to represent the article quality and the monetary reward, in addition to the psychological reward that is already modeled in the SNS-norms game. We often refer to the proposed game simply as the extended SNS-norms game. We assume that whenever agent i ∈ A posts an article, it receives a monetary reward π (> 0). Furthermore, parameter Q_i (> 0) is introduced to represent the quality of an article posted by agent i. We assume that if Q_i is large, i may obtain a relatively large number of responses to its articles, but its chance of posting an article decreases and the cost of posting increases. From the correlation of these parameters, we can observe the impact of monetary rewards on the quality of posted articles. Considering the aforementioned cooking recipe site as a baseline CGM model, we divide the set of agents A into two subsets: the set of contributor agents A_p, which post articles, and the set of browser agents A_np, which do not post, where A = A_p ∪ A_np. This is because in this kind of cooking recipe social media, the users are classified into the group whose members post recipes (these users also cook using other users’ recipes) and the group whose members only cook using the recipes and report (comment) on them.
Agent i (∈ A_p) has four parameters with values ranging from 0 to 1: posting rate B_i, comment and meta-comment rate L_i, article quality Q_i, and monetary preference M_i. Here, we assume that Q_i has a lower bound Q_min > 0; thus, 0 < Q_min ≤ Q_i ≤ 1. Agent j (∈ A_np) has only one parameter, the comment rate L_j. The parameter values of B_i, L_i, Q_i, and L_j change dynamically through learning to gain more utility, whereas the
Y. Usui et al.
monetary preference M_i is randomly determined initially for each agent and does not change in the simulation round. We then define A_{p,α} = {i ∈ A_p | M_i < 0.5} and A_{p,β} = {i ∈ A_p | M_i ≥ 0.5}, where A_{p,α} and A_{p,β} are the sets of agents preferring psychological reward and agents preferring monetary reward, respectively. Figure 1 shows the flow of one game round of the extended SNS-norms game. In the first stage of a game round, any agent i ∈ A_p has a chance to post an article with probability P_i^0 = (B_i / Q_i) × Q_min. This probability means that agents that stick to high-quality articles have relatively low posting rates because of the elaboration process. Unlike the SNS-norms game, agent i that posted the article pays the posting cost c_i^0 (> 0) and gains the monetary reward π. If i does not post the article, i’s turn in this game round ends. Then, agent j ∈ N_i browses the post of i with probability P_{j,i}^1 = Q_i / s_j and obtains a psychological reward r_i^0 (> 0), where N_i (⊂ A) is the set of agents adjacent to i and s_j is the number of articles posted by N_j in the current game round (if s_j = 0, we set P_{j,i}^1 = 0). Thus, probability P_{j,i}^1 indicates that an article with higher quality is more likely to be browsed. The game round then enters the second stage. Agent j that has browsed the article gives i the psychological reward r_i^1 (> 0), i.e., posts a comment on the article to i, with probability P_{j,i}^2 = L_j × Q_i and pays the cost c_i^1 (> 0). In the third stage, i returns a meta-comment to j with probability P_i^3 = L_i × Q_i only when j gives i the comment, which also reflects the article quality. Here, i pays cost c_i^2 (> 0) and gives j the psychological reward r_i^2 (> 0). This is where i’s turn in the current game round ends. It should be noted that the first stage of each game round proceeds step by step in a concurrent manner to calculate s_j, i.e., after all contributor agents in A_p have posted or have decided not to post, agents in A select and browse some articles with probability P_{j,i}^1.
The cost c_i^0 of posting by the contributor agent i (∈ A_p) and the psychological reward r_i^0 obtained by browsing the article posted by i are assumed to be proportional to the quality Q_i of that article (Formula (1)). We set the values for the costs c_i^0, c_i^1, and c_i^2 and the psychological rewards r_i^0, r_i^1, and r_i^2 that occur when posting, browsing, commenting, and meta-commenting by referring to Okada et al. [10].
\[
\begin{aligned}
c_i^0 &= c_{\mathrm{ref}} \times Q_i, & c_i^1 &= c_i^0 \times \delta, & c_i^2 &= c_i^1 \times \delta,\\
r_i^0 &= c_i^0 \times \mu, & r_i^1 &= c_i^1 \times \mu, & r_i^2 &= c_i^2 \times \mu.
\end{aligned}
\tag{1}
\]
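As a quick numerical check of Formula (1) and the stage probabilities described above, the following minimal Python sketch computes the costs, rewards, and probabilities for one article. The function and class names are illustrative assumptions and are not taken from the authors' implementation; the default values δ = 0.5, μ = 8.0, c_ref = 1.0, and Q_min = 1/8 are the ones reported in the experimental settings of this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageValues:
    """Costs (c^0, c^1, c^2) and psychological rewards (r^0, r^1, r^2) for one article."""
    costs: tuple
    rewards: tuple


def stage_costs_rewards(q_i, c_ref=1.0, delta=0.5, mu=8.0):
    """Formula (1): each stage cost is delta times the previous one; reward = mu * cost."""
    c0 = c_ref * q_i
    c1 = c0 * delta
    c2 = c1 * delta
    return StageValues(costs=(c0, c1, c2), rewards=(c0 * mu, c1 * mu, c2 * mu))


def posting_probability(b_i, q_i, q_min=1 / 8):
    """P_i^0 = (B_i / Q_i) * Q_min."""
    return (b_i / q_i) * q_min


def browsing_probability(q_i, s_j):
    """P_{j,i}^1 = Q_i / s_j, set to 0 when no neighbor of j posted (s_j = 0)."""
    return 0.0 if s_j == 0 else q_i / s_j


def comment_probability(l_j, q_i):
    """P_{j,i}^2 = L_j * Q_i."""
    return l_j * q_i


def meta_comment_probability(l_i, q_i):
    """P_i^3 = L_i * Q_i."""
    return l_i * q_i


if __name__ == "__main__":
    values = stage_costs_rewards(q_i=0.5)
    print(values.costs)    # (0.5, 0.25, 0.125)
    print(values.rewards)  # (4.0, 2.0, 1.0)
    print(posting_probability(b_i=1.0, q_i=1.0))    # 0.125
    print(posting_probability(b_i=1.0, q_i=0.125))  # 1.0
```

The last two lines illustrate the trade-off built into P_i^0: with the same posting rate B_i = 1, an agent at the highest quality posts with probability 1/8, whereas an agent at the lowest quality posts every round.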
Note that parameter δ, which represents the ratio of the cost between consecutive stages, and parameter μ, which represents the ratio between the cost and the reward value (reward = cost × μ), are applied sequentially starting from the reference value c_ref. The utility u_i of agent i obtained for a round of the game is calculated by
\[ u_i = (1 - M_i) \times R_i + M_i \times K_i - C_i. \tag{2} \]
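Eq. (2) can be evaluated directly once the per-round totals are known. The sketch below is a minimal illustration, assuming that R_i, K_i, and C_i are the per-round sums of psychological rewards, monetary rewards, and costs; the function name `round_utility` and the example numbers are hypothetical.

```python
def round_utility(m_i, psych_rewards, monetary_rewards, costs):
    """Eq. (2): u_i = (1 - M_i) * R_i + M_i * K_i - C_i.

    Each argument after m_i is an iterable of the individual amounts
    received or paid by the agent during one game round.
    """
    r_i = sum(psych_rewards)
    k_i = sum(monetary_rewards)
    c_i = sum(costs)
    return (1.0 - m_i) * r_i + m_i * k_i - c_i


if __name__ == "__main__":
    # Hypothetical contributor with M_i = 0.25 and Q_i = 0.5 (so c^0 = 0.5,
    # c^2 = 0.125, r^1 = 2.0 for delta = 0.5, mu = 8.0): it posts once with
    # pi = 2.0, receives two comments, and returns one meta-comment.
    print(round_utility(0.25, [2.0, 2.0], [2.0], [0.5, 0.125]))  # 2.875

    # Hypothetical browser (M_j = 0): browses the article (r^0 = 4.0),
    # comments once (cost c^1 = 0.25), and receives a meta-comment (r^2 = 1.0).
    print(round_utility(0.0, [4.0, 1.0], [], [0.25]))  # 4.75
```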
It should be noted that C_i is the sum of the costs paid by i, R_i is the sum of the psychological rewards of i, and K_i is the sum of the monetary rewards; therefore, for example, C_i = c_i^0 + γ_i^c × c_i^1 + γ_i^{mc} × c_i^2, where γ_i^c and γ_i^{mc} are the numbers of comments and meta-comments that i posted during the current round. Note that because the agents in A_np receive no monetary reward, we set M_j = 0 for j ∈ A_np, which makes their utility identical to the utility defined in the SNS-norms game.
3.2 Evolutionary Process
Let a generation consist of four game rounds. At the end of each generation, all agents apply the genetic algorithm to learn parameters B_i, L_i, and Q_i for i ∈ A_p and L_i for i ∈ A_np, using U_i, the sum of the utilities in the generation calculated by Eq. (2), as the fitness value. For this purpose, each parameter is encoded as a 3-bit number that can express integer values from 0 to 7. We then map these values to the fractions 0/7, 1/7, ..., 7/7 for B_i and L_i, and to 1/8, 2/8, ..., 8/8 for Q_i, by setting Q_min to 1/8. Thus, each agent has a 9-bit gene. The process of evolution consists of three phases: parent selection, crossover, and mutation. In the parent selection phase, agent i chooses two agents as parents for the child agent that will be at the same position in the network G in the next generation. The parents are chosen from the same type of agents, A_{p,α}, A_{p,β}, or A_np, using roulette selection. Therefore, if i ∈ A_{p,α}, for example, j ∈ A_{p,α} is chosen as its parent with the probability
\[ \Pi_j = \frac{(U_j - U_{\min})^2 + \varepsilon}{\sum_{k \in A_{p,\alpha}} (U_k - U_{\min})^2 + \varepsilon}, \]
where U_min = min_{k ∈ A_{p,α}} U_k and ε is a small positive number to prevent division by zero. We set ε = 0.0001 in our experiments. In the crossover phase, uniform crossover is applied, i.e., the value of one of the parent genes is adopted as the next gene for each bit. Finally, in the mutation phase, each bit of the new gene generated by the crossover is reversed with a small probability m_r (≪ 1). The agents with the new genes play the game in the next generation on the same network, and this is repeated until the G-th generation.
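The three phases above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the bit order (B, L, Q, most significant bit first), sampling the two parents with replacement, and adding ε to every term of the denominator (so that the selection probabilities sum to one) are all assumptions, and the function names are hypothetical. The lists passed to `next_generation` are assumed to contain only agents of one type (A_{p,α}, A_{p,β}, or A_np).

```python
import random


def decode_gene(bits):
    """Decode a 9-bit gene into (B, L, Q): B, L in {0/7..7/7}, Q in {1/8..8/8}."""
    assert len(bits) == 9
    b, l, q = (int("".join(map(str, bits[k:k + 3])), 2) for k in (0, 3, 6))
    return b / 7, l / 7, (q + 1) / 8


def roulette_weights(utilities, eps=1e-4):
    """Selection probabilities proportional to (U_j - U_min)^2 + eps."""
    u_min = min(utilities)
    raw = [(u - u_min) ** 2 + eps for u in utilities]
    total = sum(raw)
    return [w / total for w in raw]


def uniform_crossover(parent_a, parent_b, rng):
    """For each bit position, copy the bit of one randomly chosen parent."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def mutate(bits, m_r, rng):
    """Flip each bit independently with probability m_r."""
    return [1 - bit if rng.random() < m_r else bit for bit in bits]


def next_generation(genes, utilities, m_r=0.01, rng=None):
    """Produce one child gene per agent position from agents of the same type."""
    rng = rng or random.Random()
    weights = roulette_weights(utilities)
    children = []
    for _ in genes:
        parent_a, parent_b = rng.choices(genes, weights=weights, k=2)
        child = uniform_crossover(parent_a, parent_b, rng)
        children.append(mutate(child, m_r, rng))
    return children


if __name__ == "__main__":
    print(decode_gene([1, 1, 1, 0, 0, 0, 0, 0, 0]))  # (1.0, 0.0, 0.125)
    rng = random.Random(0)
    genes = [[rng.randint(0, 1) for _ in range(9)] for _ in range(4)]
    children = next_generation(genes, [0.5, 2.0, 3.5, 1.0], rng=rng)
    print(len(children), len(children[0]))  # 4 9
```

Because the weights use squared differences from U_min, the agent with the lowest fitness in its group is selected only through the ε term, which keeps selection pressure high while avoiding a zero denominator when all utilities are equal.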
4 Experiments
4.1 Experimental Settings
We conducted the experiments to explore the changes in behavioral strategies and utilities of the contributor and browser agents in networks of friendships, as well as the impact of the posters’ concerns for article quality when a monetary reward for article posting is introduced in CGM. The impact on the behavioral
Table 1. Network characteristics.

| Description and parameter | Complete graph | CNN-model network |
|---|---|---|
| Number of agents, n = \|A\| | 80 | 400 |
| Number of agents preferring psychological reward, \|A_{p,α}\| | 20 | 100 |
| Number of agents preferring monetary reward, \|A_{p,β}\| | 20 | 100 |
| Number of browser agents, \|A_np\| | 40 | 200 |
| Transition probability from potential edges to real edges, u | − | 0.9 |
| Average degree | 79 | 20.3 |
| Cluster coefficient | 1 | 0.376 |
Table 2. Values of experimental parameters.

| Description | Parameter | Value |
|---|---|---|
| Generation length | G | 1000 |
| Mutation probability | m_r | 0.01 |
| Cost ratio between game stages | δ | 0.5 |
| Ratio of cost to reward value | μ | 8.0 |
| Reference value for cost and reward | c_ref | 1.0 |
strategy is determined from changes in the average values of the posting rate B_i, comment rate L_i, and article quality Q_i for all agents. We also investigate the influence of different network structures among agents on the results. Therefore, we conducted experiments assuming interactions on the complete graph (Exp. 1) and on networks generated by the CNN model [14] (Exp. 2). The number of nodes (i.e., agents) in the complete graph was set to n = 80, whereas the number of nodes in the CNN-model network was set to n = 400. Other parameter values and the characteristics of the generated networks are listed in Table 1. The cardinalities of A_{p,α}, A_{p,β}, and A_np are also listed in Table 1. The parameter values in our experiments are listed in Table 2. Note that δ and μ were set to 0.5 and 8.0, respectively, in accordance with Okada et al. [10]. The reported results are the averages of 100 experimental trials using different random seeds. In the graphs shown below, the red, green, black, gray, and blue lines represent the averages for all agents A, contributor agents A_p, browser agents A_np, contributor agents who prefer psychological rewards A_{p,α}, and contributor agents who prefer monetary rewards A_{p,β}, respectively.
4.2 Experimental Result – Complete Graph
The results of the first experiment (Exp. 1) of the agent’s behavioral strategy in the complete graph are shown in Fig. 2, where Fig. 2a plots the averages of the
Fig. 2. Utility and monetary reward in complete graph. Panels: (a) All agents; (b) Contributor agents.
evolved utilities for A, A_p, and A_np against the monetary reward π, and Fig. 2b plots the averages of the evolved utilities of A_{p,α} and A_{p,β}. Remarkably, Fig. 2a reveals that the utility of all types of agents tended to decrease as π increased from 0 to 2.2. After that, when π increased from 2.2 to 10, the utility of the contributor agents A_p began to trend upward, whereas that of the browser agents A_np decreased further. The average for all agents increased slightly, but this tendency might depend on the ratio of |A_p| to |A_np|. To determine the cause of the decline in utility in the range of π ≤ 2.2, we plotted the relationship between the monetary rewards and agents’ behavioral parameters in Fig. 3. It should be noted that all agents have the comment rate L_i, whereas the posting rate B_i, the article quality Q_i, and the probability of article post P_i^0 are parameters that only the contributor agents have. We omit the subscripts of these parameters and write B, L, Q, and P^0 to denote their average values. The change in these parameters seems to occur owing to the users’ attitudes toward the quality of the articles they post. Figure 3a shows that the article posting rate B increased, albeit only slightly, as the monetary reward increased. When π = 0.0, the posting rate B was between 0.8 and 0.9, with higher values for agents in A_{p,α} that prefer psychological rewards. However, at approximately π = 1.0, the value of B of the agents in A_{p,β} increased. Then, in the range of π ≥ 5.0, the value of B of all agents was close to 1.0. It can be inferred that the introduction of monetary rewards promotes article posting, but the effect is not large and the types of users that benefit from it are different. In contrast, we can observe from Fig. 3b that the quality of articles decreased significantly as the monetary reward increased. The article quality of the agents in A_{p,β} dropped rapidly and remained at approximately 0.14 when π ≥ 2.2.
As π increased from 0 to 8.0, the article quality of the agents in A_{p,α} declined slowly and then stayed at approximately 0.14, as in A_{p,β}, when π was larger than 8.0. Figure 3c shows that there is no significant change in the comment rate L; however, on closer inspection, we can observe that as the monetary reward increased, the L of the contributor agents increased, whereas that of the
Fig. 3. Behavioral parameter values in complete graph. Panels: (a) Posting Rate (B); (b) Quality of Article (Q); (c) (Meta-)Comment Rate (L); (d) Probability of Article Post (P^0).
browser agents decreased slightly. This difference in the trend across the agent types kept the average comment rate L for all agents at approximately 0.45. As a result of changes in these parameters, the probability P^0 of article posts in the game varied as in Fig. 3d. As the value of the monetary reward increases, the contributor agents that prefer the monetary reward reduce the quality of their articles more rapidly than the monetary reward increases and instead adopt the strategy of posting articles more frequently. Additionally, we can observe that the agents that prefer psychological rewards try to maintain the quality of their articles and do not increase the number of posts. However, further increases in monetary rewards led to a decline in quality. From these results, we deduced that the decrease in the overall utility caused by giving monetary rewards π was mainly due to the decrease in the article quality Q. A significant decrease in the quality Q of the posted articles resulted in a significant decrease in the utility gained by the browser agents. Particularly, when 0 < π ≤ 2.2 (see Fig. 2), that is, when there is a monetary reward but its value is small, it leads to a drop in utility. As the decline in the article quality Q began to subside (π ≥ 2.2), the contributor agents in A_p increased their utilities to the extent that the monetary reward they obtained increased. In contrast, the comment rate L of the browser agents in A_np decreased, the probability of the comments and meta-comments, P^1, which considers the effect of Q, dropped significantly, and their utility did not recover. This lowered the activity of the browser agents.
4.3 Experimental Results – CNN-Model Network
Figure 4 plots the relationship between the average value of the utility and the monetary reward π. We found that the experimental results on the CNN-model networks were similar to those on the complete graph, i.e., the average utility dropped considerably when the monetary reward was given at a small value but gradually increased as the monetary reward was set to higher values. According to Fig. 5, which shows the evolved parameter values for the users’ behavior on the CNN-model networks, the agents posted more articles, but their quality decreased, similar to the complete graph. There are also differences between the two experiments: the average utility of the contributor agents in A_p in the CNN-model network was minimized when the monetary reward was quite small, i.e., π = 1.2, whereas in the complete graph it was minimized when π = 2.2. In the complete graph, the average utility did not change much when π was between 0 and 1.0, indicating that the agents in the CNN-model network were more sensitive to the monetary rewards. Comparing Fig. 2 and Fig. 4, the average utility on the CNN-model networks when π = 0 (when the monetary reward was not implemented) was considerably smaller than that on the complete graph. Therefore, on the complete graph (Fig. 2a), the utility of the contributor agents in particular could not exceed that when π = 0, even when the monetary rewards were increased. However, the monetary rewards increase the utility only for contributor agents in A_{p,β} who prefer monetary rewards (Fig. 2b). Meanwhile, the results on CNN-model networks (Fig. 4a) show that the utility of the contributor agents tends to be larger when the monetary reward is π ≥ 4.0. Particularly, the utility of agents in A_{p,β} exceeds that when π = 0, even for small π values (Fig. 4b). Suppose that there are two CGM platforms with and without monetary rewards.
Then, the platforms are likely to be chosen differently depending on the user preferences in the CNN-model networks, whereas in the complete graph, all users may remain in the media without monetary rewards. However, for browser agents who only browse and comment on articles,
Fig. 4. Utility and monetary reward in CNN-model networks. Panels: (a) All agents; (b) Contributor agents.
Fig. 5. Behavioral parameter values in CNN-model networks. Panels: (a) Posting rate (B); (b) Quality of Article (Q); (c) Comment Rate (L); (d) Probability of Article Post (P^0).
monetary rewards are irrelevant, and they tend to concentrate on CGM without monetary rewards owing to the higher quality of articles.
5 Conclusion
We proposed an extension of the SNS-norms game, a game that models a CGM, by introducing parameters expressing the monetary rewards and article quality. We then analyzed the optimal behavior for the users given the monetary rewards in CGM/social media using evolutionary computation. These experiments suggested that monetary rewards can be an incentive for posting in terms of the number of posts. However, if the design of the monetary rewards is insufficiently considered, the contributor agents will focus on obtaining monetary rewards and neglect the quality of the articles they post, which reduces the utility of society as a whole. This suggested that large monetary rewards are necessary to increase the utility of the society, but the quality of the articles remains low. We also conducted our experiments on the CNN-model networks, and the results showed the same trend regardless of the network structure. However, users on the CNN-model network were more sensitive to the effect of monetary rewards. In the future, we plan to investigate the effect of rewards that reflect quality, such as monetary rewards that vary according to the number of views. We also plan to model other types of CGM to investigate users’ activities.
References
1. Adelani, D.I., Kobayashi, R., Weber, I., Grabowicz, P.A.: Estimating community feedback effect on topic choice in social media with predictive modeling. EPJ Data Sci. 9(1), 1–23 (2020). https://doi.org/10.1140/epjds/s13688-020-00243-w
2. Axelrod, R.: An evolutionary approach to norms. Am. Polit. Sci. Rev. 80, 1095–1111 (1986)
3. Berry, N., Lobban, F., Belousov, M., Emsley, R., Nenadic, G., Bucci, S.: #WhyWeTweetMH: understanding why people use twitter to discuss mental health problems. J. Med. Internet Res. 19(4), e107 (2017). https://doi.org/10.2196/jmir.6173
4. Chen, H., Hu, Y.J., Huang, S.: Monetary incentive and stock opinions on social media. J. Manag. Inf. Syst. 36(2), 391–417 (2019). https://doi.org/10.1080/07421222.2019.1598686
5. Ellison, N.B., Steinfield, C., Lampe, C.: The benefits of facebook “friends:” social capital and college students’ use of online social network sites. J. Comput.-mediated Commun. 12(4), 1143–1168 (2007)
6. Hirahara, Y., Toriumi, F., Sugawara, T.: Evolution of cooperation in SNS-norms game on complex networks and real social networks. In: Aiello, L.M., McFarland, D. (eds.) SocInfo 2014. LNCS, vol. 8851, pp. 112–120. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13734-6_8
7. Kapoor, K.K., Tamilmani, K., Rana, N.P., Patil, P., Dwivedi, Y.K., Nerur, S.: Advances in social media research: past, present and future. Inf. Syst. Front. 20(3), 531–558 (2018). https://doi.org/10.1007/s10796-017-9810-y
8. López, M., Sicilia, M., Verlegh, P.W.: How to motivate opinion leaders to spread e-WoM on social media: monetary vs non-monetary incentives. J. Res. Interact. Mark. (2021). https://doi.org/10.1108/JRIM-03-2020-0059
9. Nadkarni, A., Hofmann, S.G.: Why do people use facebook? Pers. Individ. Differ. 52(3), 243–249 (2012). https://www.sciencedirect.com/science/article/pii/S0191886911005149
10. Okada, I., Yamamoto, H., Toriumi, F., Sasaki, T.: The effect of incentives and meta-incentives on the evolution of cooperation. PLoS Comput. Biol. 11(5), e1004232 (2015)
11. Ostic, D., et al.: Effects of social media use on psychological well-being: a mediated model. Front. Psychol. 12, 2381 (2021)
12. Shahbaznezhad, H., Dolan, R., Rashidirad, M.: The role of social media content format and platform in users’ engagement behavior. J. Interact. Mark. 53, 47–65 (2021)
13. Toriumi, F., Yamamoto, H., Okada, I.: Why do people use social media? Agent-based simulation and population dynamics analysis of the evolution of cooperation in social media. In: 2012 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, vol. 2, pp. 43–50. IEEE (2012)
14. Vázquez, A.: Growing network with local rules: preferential attachment, clustering hierarchy, and degree correlations. Phys. Rev. E 67, 056104 (2003). https://doi.org/10.1103/PhysRevE.67.056104
15. Zhao, D., Rosson, M.B.: How and why people twitter: the role that micro-blogging plays in informal communication at work. In: Proceedings of the ACM International Conference on Supporting Group Work, GROUP ’09, pp. 243–252. ACM, New York (2009). https://doi.org/10.1145/1531674.1531710
Versatile Uncertainty Quantification of Contrastive Behaviors for Modeling Networked Anagram Games
Zhihao Hu1, Xinwei Deng1, and Chris J. Kuhlman2(B)
1 Virginia Tech, Blacksburg, VA 24061, USA {huzhihao,xdeng}@vt.edu
2 University of Virginia, Charlottesville, VA 22904, USA [email protected]
Abstract. In a networked anagram game, each team member is given a set of letters and members collectively form as many words as possible. They can share letters through a communication network in assisting their neighbors in forming words. There is variability in behaviors of players, e.g., there can be large differences in numbers of letter requests, of replies to letter requests, and of words formed among players. Therefore, it is of great importance to understand uncertainty and variability in player behaviors. In this work, we propose versatile uncertainty quantification (VUQ) of behaviors for modeling the networked anagram game. Specifically, the proposed methods focus on building contrastive models of game player behaviors that quantify player actions in terms of worst, average, and best performance. Moreover, we construct agent-based models and perform agent-based simulations using these VUQ methods to evaluate the model building methodology and understand the impact of uncertainty. We believe that this approach is applicable to other networked games.
Keywords: Networked anagram games · Uncertainty quantification · Contrastive performance · Model explainability
1 Introduction
1.1 Background and Motivation
Anagram games are a class of games where players are given a collection of letters and their goal is to identify the single word, or as many words as possible, that can be formed with these letters. Almost always, there is a time limit imposed on the game. Common anagram games include Scrabble and Boggle. In the literature, individual anagram games have been used to determine how players attribute their success or failure. It was found that players who performed well attributed their success to skill and those who performed poorly attributed their failure to bad luck [18]. Clearly, there can be heterogeneous and contrastive behaviors among players of an anagram game.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 644–656, 2022. https://doi.org/10.1007/978-3-030-93409-5_53
Versatile Uncertainty Quantification
Our interest is networked group anagram games (NGAGs) [5,15], where players are arranged in a network configuration. Players can share letters: a player requests letters from its neighbors, and each neighbor decides whether or not to reply with the requested letters. The team’s goal is to form as many words as possible. Figure 1 shows an illustrative game among five players, with initial letter assignments, and a sequence of player actions over a portion of the 5-min game (time is in seconds). In our experiments, each letter has infinite multiplicity: if player vi shares a letter g with neighbor vj, then vi retains a copy of the letter that it shares with vj. Note that the game configuration, and our analyses, account for agents (game players) with different degrees.
Fig. 1. (Left) Networked group anagram game (NGAG) configuration where each player (human subject) has three initial letters (in blue) and two neighbors. This configuration is a circle-5 graph, Circ5 . (Middle) Illustrative actions of players in the game. Players can use a letter any number of times, as evidenced by player 2 forming meet with letters m, e, and t. (Right) Experimental data on number of words formed by players in time; there is variability among player behaviors.
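The game mechanics described above (a circle network, letter requests answered by neighbors, and letters with infinite multiplicity) can be sketched in a few lines. This is only an illustration: the class and function names and the initial letter assignments are hypothetical, and the sketch does not check words against a dictionary, which a real game would do.

```python
def circle_graph(n):
    """Adjacency of a circle graph: player v neighbors v-1 and v+1 (mod n)."""
    return {v: {(v - 1) % n, (v + 1) % n} for v in range(n)}


class Player:
    def __init__(self, letters):
        self.letters = set(letters)  # letters in hand; each can be reused any number of times
        self.words = []

    def can_form(self, word):
        """A word is formable if every distinct letter in it is in hand."""
        return set(word) <= self.letters

    def form_word(self, word):
        """Record a new word if it is formable and not already formed."""
        if self.can_form(word) and word not in self.words:
            self.words.append(word)
            return True
        return False


def share_letter(players, graph, giver, receiver, letter):
    """Reply to a letter request: the receiver gains the letter, the giver keeps its copy."""
    if receiver in graph[giver] and letter in players[giver].letters:
        players[receiver].letters.add(letter)
        return True
    return False


if __name__ == "__main__":
    graph = circle_graph(5)
    # Hypothetical initial assignments of three letters per player.
    hands = ["abc", "emx", "tuv", "dgh", "iko"]
    players = {v: Player(hand) for v, hand in enumerate(hands)}

    print(players[1].can_form("meet"))              # False: no 't' yet
    print(share_letter(players, graph, 2, 1, "t"))  # True: 2 and 1 are neighbors
    print(players[1].form_word("meet"))             # True: 'e' is used twice
    print("t" in players[2].letters)                # True: the giver keeps its copy
```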
It is seen that the behaviors of players consist of multiple actions, i.e., requesting letters, replying to a request, forming words, or being idle (thinking). Player performance is affected by these interactions. For example, the more letters a person has, the more words that she can presumably form. There are various uncertainties in terms of players’ behaviors and the numbers of words formed in the team effort. Moreover, the heterogeneous characteristics of players can involve contrastive behaviors, such as some players rarely requesting letters from their neighbors, while others request several. Therefore, it is of great importance to understand the uncertainty of the NGAG, and to have a flexible framework for quantifying the uncertainty of contrastive behaviors of players in the game. In this work, our goals are to: (i) build explainable models of game player behaviors that quantify contrastive behaviors in terms of worst, average, and best performance based on game data; (ii) construct agent-based models (ABMs) and perform agent-based simulations (ABSs) using these models; and (iii) evaluate the model building methodology and understand the impact of uncertainty for these models.
Z. Hu et al.
Our exemplar is the NGAG, but the proposed approach can be used in other networked games (e.g., [4,6,11,12]) and with observational data (e.g., [21]) where human behavior data are collected. Such behaviors are notorious for having significant uncertainty across players [8,19,20]. Consequently, uncertainty quantification (UQ) methods are essential in complicated games like ours, where: (i) players have several types of actions that they can take (e.g., form word, request letter), (ii) players can take these actions many times throughout a game, and (iii) we seek to combine behaviors of a collection of game players in order to build one model of behavior (because there is insufficient data to generate a model from one player’s game data). One use of such models is analogous to the earlier work cited above. A human subject could be embedded in a networked anagram game where other players are bots. Skill levels of the bots are controlled (e.g., as all high performers or all worst performers, with the models developed in this study). The goal is to understand how players attribute their success or failure, in a group setting, for different pre-determined bot play performance values and network configurations. These types of questions—in group settings—are in the realm of social psychology [3]. This is analogous to individual anagram games wherein experimenters found that they could control solution times by varying letter order [13]. Novelty of Our Work. The novelty of this work unfolds as follows. First, different from the previous work in [10], our current method is not restricted to clustering of players to differentiate the heterogeneous behaviors of players. The key idea of the proposed method is leveraging the uncertainty of model parameters to quantify the uncertainty of players’ behaviors. Specifically, we propose a novel approach of mapping model parameters to the probabilities of players’ actions, to better represent the uncertainty of behaviors in the game. 
Second, we propose a versatile uncertainty quantification (VUQ) framework to enable the quantification of contrastive behaviors in terms of worst, average, and best performance to better understand player behaviors based on game data. Different from the previous work in [10], the proposed framework takes advantage of the (1 − α) confidence set of model parameters to enable the quantification of contrastive behaviors with appealing visualization. Third, we integrate the VUQ framework into an ABS framework.
1.2 Our Contributions and Their Implications
Our first contribution is a VUQ approach to UQ. The proposed VUQ method can be used to understand characteristics of a game, including contrastive behaviors, bot effects in the network, and demographic differences of players. We use a multinomial logistic regression model to characterize the probabilities πij of a player taking a particular action aj at time (t + 1) based on a state vector and the player’s action ai at t. Uncertainty is embedded in a parameter matrix B (i) , as described in Equation (1). We employ a sampling technique that uses contours of (1 − α) × 100% confidence regions to construct B (i) in terms of πij . We thus quantify uncertainty with B (i) via πij . Preliminaries (i.e., previous work that
serves as the point of departure for our methods here) are provided in Sect. 3 and our methods are presented in Sect. 4. The second contribution is the models that result from the VUQ methods and their use in an agent-based modeling and simulation (ABMS) platform. From the data for a particular collection of players, we determine worst, average, and best player models. These models are integrated into an ABMS platform so that NGAGs can be simulated for any number of players with different levels of performance, any specified communication network, and different numbers of initial letter assignments per player. Our ABMS work and illustrative results are in Sect. 6. We demonstrate, for example, that the number of words formed by players in a game increases by 25% in going from worst to best behavior. This is an example of contrastive behavior: a contrast (difference) in results produced by differences in models. Our third contribution is to illustrate important implications of the preceding two contributions. The proposed VUQ significantly enhances model transparency and model explainability [14], and elaborates the impact of data quality (sufficiency versus scarcity). By mapping model parameter uncertainty to the uncertainty of players’ behaviors in terms of possible game actions, our method provides a useful technique to make UQ more transparent in terms of the players’ behaviors. The use of a (1 − α) × 100% confidence set provides a clear and simple tool to visualize the effects of the uncertainty by sampling multiple model parameters from the confidence set. This is similar in spirit to other sampling techniques used to generate graphs [7]. Moreover, the comprehensive quantification of uncertainty of contrastive behaviors surprisingly uncovers the impact of data scarcity and data sufficiency in the modeling and uncertainty quantification of data. We provide concrete examples in Sect. 5.
2 Related Work
Modeling of Group Anagram Games. An ABM was constructed from NGAG experiments in [5,15]. The model computed player behaviors in time. The model also accounted for the number of neighbors that a player (agent) had in the anagram interaction network. In [10], behavior models were made more parsimonious by clustering players based on experimental game data and their degrees k in the game network. Each model was based on the average behavior within a cluster. This current work differs from the above works in that we are analyzing each cluster to produce models for worst, average, and best performance behaviors per cluster. Hence, in this work, an agent’s assigned model is based on its degree in the network, cluster number, and performance specification. Uncertainty Quantification Methods and Analyses. Experimental uncertainty and parameter uncertainty are two common sources of uncertainty. Alam et al. [1] use design of experiments (DoE) to quantify experimental uncertainty and analyze sensitivity. Regression models and Bayesian approaches are commonly used to quantify parameter uncertainty. For example, Arendt et al. [2]
quantify uncertainty using a Gaussian process and the variance of the posterior distribution. Simulation-based modeling and analyses are also used for uncertainty quantification [16,17]. In this work, we aim to quantify uncertainty in terms of both model parameters and players’ behaviors.
3 Previous Models
Our models use a network configuration to capture communication among game players. A player’s neighbors are the players at distance 1 in the NGAG (see Fig. 1). In our previous work [10], a clustering-based UQ method is used for building ABMs of human behavior in the NGAG. The UQ approach in [10] focuses on the following aspects.

First, players are partitioned based on their activity in a game by creating two variables, $x_{engagement}$ and $x_{word}$. Here $x_{engagement}$ is the sum of the number of requests and number of replies of a player, and $x_{word}$ is the number of words a player forms in a game. We conducted hypothesis testing, and the results showed that we can categorize players into two groups: those players with two neighbors (group g = 1) and those players with three or more neighbors (group g = 2).

Second, players are further partitioned within each group. We used the k-means clustering method [9] to form four clusters based on the Bayesian information criterion. Specifically, we cluster player behaviors in terms of $x_{engagement}$ and $x_{word}$, so that players in a single cluster have similar numbers of actions in a NGAG.

Third, player behaviors in a game are modeled and four variables are introduced in Equation (1) below: size $Z_B(t)$ of the buffer of letter requests that player v has yet to reply to at time t; number $Z_L(t)$ of letters that v has available to use (i.e., in hand) at t to form words; number $Z_W(t)$ of valid words that v has formed up to t; and number $Z_C(t)$ of consecutive time steps that v has taken the same action. Let $z = (1, Z_B(t), Z_L(t), Z_W(t), Z_C(t))^{\top}$, a $5 \times 1$ vector. Let action $a_1$ represent thinking or idle, $a_2$ represent letter reply, $a_3$ represent requesting a letter, and $a_4$ represent forming a word. The multinomial logistic regression is used to model $\pi_{ij}$ (the probability of a player taking action $a_j$ at time t + 1, given that the player took action $a_i$ at time t) as

\[ \pi_{ij} = \frac{\exp\left(z^{\top} \beta_j^{(i)}\right)}{\sum_{l=1}^{4} \exp\left(z^{\top} \beta_l^{(i)}\right)}, \quad j = 1, 2, 3, 4, \tag{1} \]

where $\beta_j^{(i)} = (\beta_{j1}^{(i)}, \ldots, \beta_{j,5}^{(i)})^{\top}$ are the corresponding regression coefficients. For a given action $a_i$ at time t, the parameters can be expressed as a matrix $B^{(i)} = (\beta_{j,h}^{(i)})_{4 \times 5}$ for i = 1, . . . , 4. Note that the estimation of $B^{(i)}$ generates a corresponding transition probability matrix which quantifies the activity levels of players in the game. In previous work and this current work, we use all NGAG experimental data for parameter estimation.

As a result, one can infer the activity level of a player in a cluster based on its engagement and words [10]. However, there are two limitations of such an approach. First, if we cluster players using a large number of variables, then it
would be impossible to tell the activity level of a cluster from a high-dimensional plot. Thus, the approach requires a more flexible method to infer which model parameters correspond to contrastive behaviors (i.e., worst, average, and best). Second, there are different levels of variability within clusters of players in going from worst to best behavior. The variability of some clusters may be small while that of other clusters may be large. Hence, it is important to quantify the within-cluster uncertainties and integrate them with the ABMs.
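As a minimal numerical sketch of Equation (1), the Python snippet below computes the four transition probabilities from one 4 × 5 parameter matrix and one state vector z. The coefficient values and the state vector are hypothetical placeholders chosen for illustration; they are not estimates from the NGAG data, and the function name `transition_probs` is our own.

```python
import math

ACTIONS = ("idle", "reply", "request", "word")  # a1..a4 in the text


def transition_probs(B, z):
    """Return [pi_i1, ..., pi_i4] for a 4 x 5 matrix B^(i) and a length-5 state z."""
    scores = [sum(b * x for b, x in zip(row, z)) for row in B]
    top = max(scores)  # subtract the maximum score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical coefficients for initial state "idle" (illustration only).
B_idle = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [-2.0, 0.8, 0.0, 0.0, -0.1],
    [-1.5, 0.0, -0.3, 0.0, -0.1],
    [-2.5, 0.0, 0.5, 0.1, -0.1],
]
z = [1.0, 2.0, 4.0, 1.0, 3.0]  # (1, Z_B, Z_L, Z_W, Z_C), hypothetical values
probs = transition_probs(B_idle, z)
for name, p in zip(ACTIONS, probs):
    print(f"to-{name}: {p:.3f}")
```

Evaluating this function for each of the four initial states gives one row of the transition probability matrix per state.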
4 The Proposed VUQ Method
In a group anagram game, it is important to identify which players are more active and which players are less active, i.e., contrastive behaviors of players. As shown in our previous work [10], players have different behaviors in different clusters. Also, it is essential to quantify uncertainties of players within clusters. To quantify the uncertainty within a cluster, one possible method is to start from the parameter matrix $B^{(i)}$. Since $B^{(i)}$ is estimated from the multinomial logistic regression, it has an asymptotic normal distribution. Thus, we can draw random samples from the asymptotic normal distribution, and these random samples can represent the variability of that cluster. Moreover, because we are interested in player models with contrastive behaviors, we draw random samples on the contour of the $(1 - \alpha) \times 100\%$ confidence region. However, the sampled $B^{(i)}$ matrices do not have a clear interpretation to quantify the corresponding activity levels (e.g., worst, average, best).

While it is difficult to identify the activity level of an agent from the $B^{(i)}$ matrix, it is easy to identify the activity level from the probability vector $\pi = (\pi_{i1}, \ldots, \pi_{i4})$. It is known that an agent is more active if the to-idle probability ($\pi_{i1}$) is small and less active if the to-idle probability is large. To obtain the probability vector, we need the z vector. Thus, we use NGAG data (i.e., training data) to produce representative z vectors. By using these z vectors, we can compute a set of probability vectors and calculate the mean of to-idle probabilities. Then the mean to-idle probability is used to quantify the activity level of an agent via the $B^{(i)}$ matrix.

The following steps summarize the proposed method of uncertainty quantification within a cluster. We first transform the estimated $\hat{B}^{(i)}$ matrix to a vector $\hat{\beta}^{(i)}$, then use the asymptotic normal distribution of the parameter estimators $\hat{\beta}$ based on the asymptotic property of maximum likelihood estimators. The superscript i denotes the initial state. Let $\hat{B} = \hat{B}^{(i)}$, $\hat{\beta} = \hat{\beta}^{(i)} = (\hat{\beta}_2^{(i)\top}, \hat{\beta}_3^{(i)\top}, \hat{\beta}_4^{(i)\top})^{\top}$, then $\hat{\beta}$ follows a multivariate normal distribution, $\hat{\beta} \sim MN(\hat{\mu}, \hat{\Sigma})$.

1. Step 1: Draw R random samples ($\beta^r$ or $B^r$, where r = 1, . . . , R) on the $(1 - \alpha) \times 100\%$ confidence contour of the estimated $\hat{\beta}$ matrix. The $(1 - \alpha) \times 100\%$ confidence region $S_{\hat{\beta}}$ is defined as

\[ Pr(\beta \in S_{\hat{\beta}}) = (1 - \alpha) \times 100\%, \tag{2} \]

\[ (\beta - \hat{\beta})^{\top} \hat{\Sigma}^{-1} (\beta - \hat{\beta}) = \chi^2_d(1 - \alpha), \tag{3} \]
where $\hat{\Sigma}$ is the estimated covariance matrix of $\hat{\beta}$, and $\chi^2_d(1 - \alpha)$ is the $(1 - \alpha)$ quantile of the Chi-squared distribution with d degrees of freedom.

2. Step 2: For each $\beta^r$ drawn in step 1, apply the training data to Eq. 1 to produce n probability vectors, $\hat{\pi}^{r,l}$, where l = 1, . . . , n and n is the size of the training data. Then calculate the mean probability, $\bar{\pi}^r = \frac{1}{n} \sum_{l=1}^{n} \hat{\pi}^{r,l} = (\bar{\pi}_1^r, \bar{\pi}_2^r, \bar{\pi}_3^r, \bar{\pi}_4^r)^{\top}$. The mean of the to-idle probability is denoted as $\bar{\pi}_1^r = \frac{1}{n} \sum_{l=1}^{n} \hat{\pi}_1^{r,l}$. Then we get a set of mean to-idle probabilities, $\bar{\pi}_1^r$, r = 1, . . . , R.

3. Step 3: The mean to-idle probabilities represent the variability within the cluster. The $\beta^r$ vector or $B^r$ matrix with low mean to-idle probability is more active, and the $\beta^r$ or $B^r$ matrix with high mean to-idle probability is less active. The $B^r$ matrix with the maximum $\bar{\pi}_1^r$ is selected as the worst matrix, and the $B^r$ matrix with the minimum $\bar{\pi}_1^r$ is selected as the best matrix.

One advantage of this proposed method is that we can quantitatively compare the activity levels of two clusters using the mean to-idle probability. Previously, we relied on $x_{engagement}$ and $x_{word}$. The second advantage is that this method can be generalized in two aspects. First, currently we quantify the activity level of a cluster. However, we can easily quantify the activity level of a player using the same method. Thus, we can compare activity levels among different players/agents. Second, we use the mean to-idle probability as the criterion for activity level. We can easily use other criteria based on our needs or goals. For example, if we are interested in the activity level of forming words, we can use the mean to-word probability ($\pi_{i4}$) as the criterion.

Note that there can be data scarcity in some clusters, in which case the distribution of $B^r$ would have very large variance. For example, if there is only one to-request transition in the game data, then the estimated parameter for to-request in $B^r$ can be extremely large. Then the probability of the to-request transition could become close to 1 or 0. If the initial state is request and the probability of to-request is close to 1, then the model will fall into a request-to-request “infinite” loop. This potential issue of data scarcity is avoided as follows. First, if the minimum $\bar{\pi}^r(\text{idle})$ is very small ($\bar{\pi}^r(\text{idle}) < 0.01$), then we choose the $B^r$ in which $\bar{\pi}^r(\text{idle})$ is at the 10th percentile, instead of the minimum. Second, the worst and best B matrices are replaced by the mean B matrix if any of these criteria are met: (i) all of the numbers of to-reply, to-request, and to-word transitions are less than 5, or (ii) extremely large values appear in B.
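Steps 1 and 3 can be sketched in Python as follows. The text does not say how points on the confidence contour are generated, so this sketch assumes one standard construction: draw a uniformly random direction u on the unit sphere and set β = β̂ + sqrt(χ²_d(1 − α)) · L u, where L is the Cholesky factor of Σ̂, which satisfies Eq. (3) exactly. The two-parameter toy model, the covariance values, and the helper names (`cholesky`, `sample_on_contour`, `select_worst_best`, `mean_to_idle`) are all hypothetical, and the data-scarcity safeguards described above are not included.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def sample_on_contour(beta_hat, cov, chi2_quantile, rng):
    """One beta with (beta - beta_hat)^T cov^{-1} (beta - beta_hat) = chi2_quantile."""
    d = len(beta_hat)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    u = [x / norm for x in g]  # uniformly distributed direction on the unit sphere
    L = cholesky(cov)
    radius = math.sqrt(chi2_quantile)
    return [beta_hat[i] + radius * sum(L[i][k] * u[k] for k in range(i + 1))
            for i in range(d)]


def select_worst_best(samples, mean_to_idle):
    """Step 3: worst has the largest mean to-idle probability, best the smallest."""
    scored = [(mean_to_idle(b), b) for b in samples]
    worst = max(scored, key=lambda t: t[0])[1]
    best = min(scored, key=lambda t: t[0])[1]
    return worst, best


# Toy setting with d = 2 parameters (hypothetical numbers, not NGAG estimates).
beta_hat = [0.5, -0.2]
cov = [[0.04, 0.01], [0.01, 0.09]]
alpha = 0.05
q = -2.0 * math.log(alpha)  # exact chi-squared (1 - alpha) quantile when d = 2
rng = random.Random(0)
samples = [sample_on_contour(beta_hat, cov, q, rng) for _ in range(200)]

xs = [0.0, 1.0, 2.0, 3.0]  # toy stand-in for the training-data states


def mean_to_idle(beta):
    """Toy two-action model: mean probability of going idle over the toy data."""
    return sum(1.0 / (1.0 + math.exp(beta[0] + beta[1] * x)) for x in xs) / len(xs)


worst, best = select_worst_best(samples, mean_to_idle)
```

In the setting of the text, `mean_to_idle` would instead average the first component of Eq. (1) over the training z vectors (Step 2), and the quantile would be taken for the actual dimension d of the stacked coefficient vector.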
5 Model Evaluation
In this section, the variabilities within clusters will be investigated and presented in terms of mean transition probabilities. For each group, cluster, and initial state, 1000 random matrices $B^r$, r = 1, . . . , 1000, are drawn from the 95% confidence contour $S_{\hat{\beta}}$. Then the mean transition probabilities are calculated using the training data in which initial states are the same as those of the $B^r$ matrices. The histograms of mean transition probabilities are presented in Figs. 2 and 3.
Fig. 2. Histograms of mean to-word probabilities of random B matrices for the four clusters in group 1. The initial state is idle and the B matrices samples are drawn on the contour of the 95% confidence region. The plots from left to right are for cluster 1, 2, 3, and 4.
Fig. 3. The top four histograms are for group 1, cluster 3 where the initial state is idle and the bottom four histograms are for group 1, cluster 2 where the initial state is request. The B matrices samples are drawn on the contour of the 95% confidence region. The plots in the first column are to-idle, the plots in the second column are to-reply, the plots in the third column are to-request, and the plots in the fourth column are to-word.
Figure 2 reports the histograms of mean to-word probabilities of random B matrices for clusters in group 1. From the histograms in Fig. 2, it is clear that there are within-cluster variabilities in terms of forming words. This further confirms the need for VUQ to quantify the contrastive behaviors of players. Figure 3 shows the histograms of mean transition probabilities, where the top panel is for group 1, cluster 3 with initial state idle and the bottom panel is for group 1, cluster 2 with initial state request. The top four histograms show small variability because sufficient data are available. However, the data can be insufficient in some other cases, as reported in the seventh row of Table 1, where there are very few data points for the player actions of reply, request, and word. As shown in the bottom four plots of Fig. 3, the variability becomes very large and even ranges from 0 to 1, which is not realistic. Therefore, the mean B matrices are used in this case for both the worst and best behaviors. Such a limitation of the proposed method is due to data scarcity in the numbers of some actions, where the variability of the model parameters becomes unrealistically large. These issues illustrate the importance of model transparency. A summary of actions for some clusters is shown in Table 1. The majority of clusters have sufficient data, while some encounter data scarcity in some actions.
Table 1. Summary of actions in selected group, cluster, and initial state triples. The first column is the group ID, with value 1 or 2. The second column is the cluster ID, which ranges from 1 to 4. The third column is the initial state, which ranges from 1 to 4 (idle, reply, request, word). The next four columns are the number of actions by players in games. For example, the number in the idle column is the number of to-idle actions. The last column shows the data sufficiency.

group  cluster  initial state  idle   reply  request  word  data sufficiency
1      1        1              8311   28     70       230   sufficient data
1      3        1              17399  259    366      802   sufficient data
1      4        1              8593   110    185      993   sufficient data
2      2        1              17310  282    413      902   sufficient data
2      3        4              801    16     0        44    sufficient data
2      4        1              1572   33     71       344   sufficient data
1      2        3              306    2      3        0     data scarcity
2      1        4              199    1      0        0     data scarcity

6 Simulations and Results

6.1 Simulation Parameters and Process
We use the models from Sect. 4 to develop ABMs for players in the NGAG. We confine this work to the networked game configuration of Fig. 4, which is a circle graph on six players. All players have behaviors in group 1 because all players have degree k = 2. The symmetry of the setup enables us to assess variability in simulation results. Table 2 contains the parameters in simulations. We limit our simulation conditions owing to space limitations; the simulation system can handle agents of any group and cluster. Also, we use a graph structure that is similar to the network configurations in the NGAG.

Fig. 4. Graph of six anagram game players, each with two neighbors (Circ6).

Table 2. Summary of the parameters and their values used in the simulations.

Parameter                         Description
Network                           Circ6 (each of six players has degree 2)
Num. of initial letters n         Four per player
Number of groups                  One. Group g = 1 is for agents with degree ≤2
Number of clusters                There are four clusters within group g = 1
Number of different performances  For each cluster, there are three models of game player performance: worst, average, and best. These are the contrastive behaviors
A simulation consists of 50 instances. Each instance is a computation from time t = 0 to t = 300 s, consistent with conditions in experiments. That is, each
instance is a simulation of one NGAG. From experimental NGAG data, players do not take successive actions in less than one second. Thus, we set one time increment in a simulation to one second. All players are assigned n = 4 letters and model parameters based on group number (always group 1 in these experiments), cluster number, and performance level. Players request letters from their neighbors, reply to letter requests, form words, and think (idle). A simulation outputs all player actions at all times, similar to the data shown in Fig. 1. Average values, medians, and error bars in the boxplots below are produced from all player data, at each time t ∈ [0, 300], over all 50 instances. Since players are paid in the game in direct proportion to the number of words that they form, increasing numbers of actions (particularly in forming words) mean increasing performance.

6.2 Simulation Results
Figure 5 contains data for group 1, cluster 3, and worst performance. The first plot provides average word history curves for each of the six players. The next two plots of the figure show time histories for all actions for players 0 and 1, respectively. These actions are replies received (replRec), replies sent (replSent), requests received (reqRec), requests sent (reqSent), and words formed (words). Requests sent and replies received are the lesser curves because they are both bounded by n = 4 letters for each player. Requests received and replies sent are greater curves because their numbers are bounded above by k · n = 2 · 4 = 8.
(a) word counts all agents
(b) behavior agent 0
(c) behavior agent 1
Fig. 5. Results of anagram simulations with six players forming a Circ6 graph. All players have behaviors assigned based on group 1, cluster 3, and worst performance. (a) word count histories for all six agents, (b) action histories for agent 0, and (c) action histories for agent 1.
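The simulation setup of Sect. 6.1 (six agents on Circ6, 50 instances of 300 one-second steps) can be illustrated with a heavily simplified Python sketch. It is not the ABMS platform used here: each agent is reduced to a Markov chain over the four actions with a fixed, hypothetical transition matrix `P`, whereas the actual models recompute the probabilities from Eq. (1) at every step and track letter requests and replies. The function names are our own.

```python
import random

ACTIONS = ("idle", "reply", "request", "word")

# Hypothetical matrix: P[i][j] = probability of action j at t + 1 given action i at t.
P = [
    [0.90, 0.03, 0.03, 0.04],
    [0.70, 0.10, 0.10, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.80, 0.05, 0.05, 0.10],
]


def ring_neighbors(n_players):
    """Circ_n: player v is adjacent to players v - 1 and v + 1 (mod n)."""
    return {v: [(v - 1) % n_players, (v + 1) % n_players] for v in range(n_players)}


def simulate_instance(n_players, n_steps, rng):
    """Return the number of word actions per player after n_steps time steps."""
    state = [0] * n_players  # every agent starts idle
    words = [0] * n_players
    for _ in range(n_steps):
        for v in range(n_players):
            state[v] = rng.choices(range(4), weights=P[state[v]])[0]
            if ACTIONS[state[v]] == "word":
                words[v] += 1
    return words


rng = random.Random(1)
neighbors = ring_neighbors(6)
n_letters = 4
# Upper bound on requests received (and replies sent) per player: k * n.
max_requests_received = len(neighbors[0]) * n_letters

runs = [simulate_instance(6, 300, rng) for _ in range(50)]
mean_words = [sum(run[v] for run in runs) / len(runs) for v in range(6)]
print(mean_words)
```

Replacing `P` by per-state probabilities computed from worst, average, or best parameter matrices is what would produce contrastive word-count curves of the kind shown in the figures.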
Figure 6 contains word count histories for all six players, for cluster 3, and moving left to right, for worst, average, and best performance, respectively. (Figures 5a and 6a are the same plot.) It is clear that the numbers of words formed by players increase by 25% in going from worst to best behavior models. Figure 7 contains boxplots for each of the four clusters for group 1. Each box is for the total number of actions (which is the sum of the total numbers of formed words, requested letters, and replies to letter requests), on a per-player basis, across all six players in a simulation. Thus, each box is comprised of 300 data points (= 6 players · 50 simulation instances). For each of the first three clusters, the counts of actions increase from worst (W) to average (A) to best (B) performance models. The numbers of actions, for a given performance value, also increase across the first three clusters, consistent with the experimental data. The fourth cluster is interesting and different. The worst, average, and best performance models do not generate monotonic results. The behavior appears to saturate with cluster 4, the cluster that produces the greatest numbers of player actions.

(a) worst performance (b) average behavior (c) best behavior

Fig. 6. Results of anagram simulations with six players forming a Circ6 graph. All players have behaviors assigned based on group 1, cluster 3. Plots are word count histories for all players in a simulation for: (a) worst, (b) average, and (c) best behavior.
(a) Cluster 1
(b) Cluster 2
(c) Cluster 3
(d) Cluster 4
Fig. 7. Results across all clusters and all performance values, where boxplots are given for performance types worst (W), average (A), and best (B) behavior for each cluster. Boxes are per-player numbers of total actions in a simulated game, and represent the sum of form words, request letters, and reply to letter requests. The clusters are: (a) 1, (b) 2, (c) 3, and (d) 4. Labels on x-axis are “C”, cluster number, and performance type.
7 Conclusion
We provide motivation and novelty of our work, along with contributions, in Sect. 1. A key aspect of the models that enables explainability of results is that we map model parameters to player actions in a game. For our game, this is not straightforward and hence may serve as a template of how this may be done in other game settings. We believe that our approach can be used with other human subject game data, including complicated experiments like ours with different action types and the ability of players to repeat action types over time. Our current method uses asymptotic distributions to infer the uncertainty of model parameters, which may not be appropriate for some problems. Alternatively, a Bayesian approach for quantifying uncertainty can be useful.

Acknowledgment. We thank the anonymous reviewers for their helpful feedback. This work has been partially supported by NSF CRISP 2.0 (CMMI Grant 1916670) and NSF CISE Expeditions (CCF-1918770).
References

1. Alam, M., Deng, X., et al.: Sensitivity analysis of an enteric immunity simulator (ENISI)-based model of immune responses to Helicobacter pylori infection. PLoS ONE 10(9), e0136139 (2015)
2. Arendt, P.D., Apley, D.W., Chen, W.: Quantification of model uncertainty: calibration, model discrepancy, and identifiability. J. Mech. Design 134(10), 1–12 (2012)
3. Aronson, E., Aronson, J.: The Social Animal, 12th edn. Worth Publishers, New York (2018)
4. Broere, J., Buskens, V., Stoof, H., et al.: An experimental study of network effects on coordination in asymmetric games. Sci. Rep. 9, 1–9 (2019)
5. Cedeno, V., Hu, Z., et al.: Networked experiments and modeling for producing collective identity in a group of human subjects using an iterative abduction framework. Soc. Netw. Anal. Min. (SNAM) 10, 11 (2020). https://doi.org/10.1007/s13278-019-0620-8
6. Charness, G., Feri, F., et al.: Experimental games on networks: underpinnings of behavior and equilibrium selection. Econometrica 82(5), 1615–1670 (2014)
7. Duvivier, L., Cazabet, R., Robardet, C.: Edge based stochastic block model statistical inference. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) COMPLEX NETWORKS 2020. SCI, vol. 944, pp. 462–473. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-65351-4_37
8. Gerstein, D.R., Luce, R.D., et al.: The behavioral and social sciences: achievements and opportunities. Technical report, National Research Council (1988)
9. Hartigan, J.A., Wong, M.A.: Algorithm AS 136: a k-means clustering algorithm. J. R. Stat. Soc. Ser. C 28, 100–108 (1979)
10. Hu, Z., Deng, X., Kuhlman, C.J.: An uncertainty quantification approach for agent-based modeling of human behavior in networked anagram games. In: WSC (2021)
11. Judd, S., Kearns, M., Vorobeychik, Y.: Behavioral dynamics and influence in networked coloring and consensus. PNAS 107, 14978–14982 (2010)
12. Kearns, M., Judd, S., Tan, J., Wortman, J.: Behavioral experiments on biased voting in networks. PNAS 106, 1347–1352 (2009)
13. Mayzner, M.S., Tresselt, M.E.: Anagram solution times: a function of letter order and word frequency. J. Exp. Psychol. 56(4), 376 (1958)
14. Pearl, J., Mackenzie, D.: The Book of Why: The New Science of Cause and Effect. Basic Books, New York (2018)
15. Ren, Y., Cedeno-Mieles, V., et al.: Generative modeling of human behavior and social interactions using abductive analysis. In: ASONAM, pp. 413–420 (2018)
16. Riley, M.E.: Evidence-based quantification of uncertainties induced via simulation-based modeling. Reliab. Eng. Syst. Saf. 133, 79–86 (2015)
17. Shields, M.D., Au, S.K., Sudret, B.: Advances in simulation-based uncertainty quantification and reliability analysis. ASCE-ASME J. Risk Uncertainty Eng. Syst. Part A-Civ. Eng. 5(4), 02019003-1–02019003-2 (2019)
18. Stones, C.R.: Self-determination and attribution of responsibility: another look. Psychol. Rep. 53, 391–394 (1983)
19. Usui, T., Macleod, M.R., McCann, S.K., Senior, A.M., Nakagawa, S.: Meta-analysis of variation suggests that embracing variability improves both replicability and generalizability in preclinical research. PLoS Biol. 19, e3001009 (2021)
20. Wuebben, P.L.: Experimental design, measurement, and human subjects: a neglected problem of control. Sociometry 31(1), 89–101 (1968)
21. Yan, Y., Toriumi, F., Sugawara, T.: Influence of retweeting on the behaviors of social networking service users. In: Benito, R.M., Cherifi, C., Cherifi, H., Moro, E., Rocha, L.M., Sales-Pardo, M. (eds.) Complex Networks. SCI, vol. 943, pp. 671–682. Springer, Heidelberg (2020). https://doi.org/10.1007/978-3-030-65347-7_56
Neural-Guided, Bidirectional Program Search for Abstraction and Reasoning

Simon Alford1(B), Anshula Gandhi1, Akshay Rangamani1, Andrzej Banburski1, Tony Wang1, Sylee Dandekar2, John Chin1, Tomaso Poggio1, and Peter Chin3

1 Massachusetts Institute of Technology, Cambridge, MA 02139, USA
2 Raytheon BBN Technologies, Cambridge, MA 02138, USA
3 Boston University, Boston, MA 02215, USA
Abstract. One of the challenges facing artificial intelligence research today is designing systems capable of utilizing systematic reasoning to generalize to new tasks. The Abstraction and Reasoning Corpus (ARC) measures such a capability through a set of visual reasoning tasks. In this paper we report incremental progress on ARC and lay the foundations for two approaches to abstraction and reasoning not based in brute-force search. We first apply an existing program synthesis system called DreamCoder to create symbolic abstractions out of tasks solved so far, and show how it enables solving of progressively more challenging ARC tasks. Second, we design a reasoning algorithm motivated by the way humans approach ARC. Our algorithm constructs a search graph and reasons over this graph structure to discover task solutions. More specifically, we extend existing execution-guided program synthesis approaches with deductive reasoning based on function inverse semantics to enable a neural-guided bidirectional search algorithm. We demonstrate the effectiveness of the algorithm on three domains: ARC, 24-Game tasks, and a ‘double-and-add’ arithmetic puzzle.
Keywords: Abstraction · Reasoning · Program synthesis · Neural networks

1 Introduction
The growth and tremendous success of deep learning has catapulted us past many benchmarks of artificial intelligence. Reaching human and superhuman performance in object recognition, language generation and translation, and complex games such as Go and Starcraft has pushed the boundaries of what humans can do and machines cannot [7,12,13,16,19,22]. To continue to make progress, we must identify and work towards reducing the gaps between human and machine intelligence.

The Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in 2019, captures an important aspect of human intelligence that our current systems lack: the ability to systematically and flexibly generalize to new domains [6]. Chollet argues that intelligence must be measured not as skill in a particular task, but as skill-acquisition efficiency. General intelligent systems must also have developer-aware generalization, i.e. be able to solve problems the developer of the system has not encountered before or anticipated.

ARC consists of training, evaluation, and private test sets of 400, 400, and 200 tasks. Each task consists of 2–4 training examples and one or more test examples. Each training example is an input/output pair of grids. To solve a task, an agent must determine the relationship between input and output grids in the training examples, and use this to produce the correct output grid for each of the test examples, for which the agent is only given the input grid. Each task is thus a few-shot learning problem, for which the solution is symbolic and rule-based (Fig. 1).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 657–668, 2022. https://doi.org/10.1007/978-3-030-93409-5_54
Fig. 1. An example ARC task with three training examples and one test example. The solution might be described as “find the most common object in the input grid”.
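The task structure described above can be made concrete with a small Python sketch. The dictionary layout (keys `train`, `test`, `input`, `output`) is an assumption modeled on the publicly released ARC JSON files, and the toy task and the `flip_horizontal` candidate program are invented for illustration; they are not taken from the corpus.

```python
def solves_task(program, task):
    """True if program maps every training input grid to its training output grid."""
    return all(program(pair["input"]) == pair["output"] for pair in task["train"])


def flip_horizontal(grid):
    """Candidate program: mirror each row left to right."""
    return [row[::-1] for row in grid]


# Hypothetical toy task: two training pairs, one test input without its output.
toy_task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[0, 3, 3]], "output": [[3, 3, 0]]},
    ],
    "test": [{"input": [[4, 0, 0], [0, 5, 0]]}],
}

predictions = []
if solves_task(flip_horizontal, toy_task):
    # Only a program consistent with all training pairs is applied to the test inputs.
    predictions = [flip_horizontal(item["input"]) for item in toy_task["test"]]
print(predictions)
```

Program synthesis for ARC then amounts to searching for a `program` that passes `solves_task`, which is the framing used in the rest of the paper.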
The tasks are unique and constructed by hand so as to prevent the reverse engineering of any synthetic generation process. They are designed to depend on a set of human Core Knowledge inbuilt priors such as objectness, simple arithmetic abilities, symmetry, and goal-directedness. The evaluation and private test sets are designed such that a solution tailored to the training set is unlikely to transfer to the evaluation or test sets. Chollet hosted a Kaggle-competition for ARC and the winning solution, a hard-coded brute force approach, achieved only ∼20% performance on the private test set [2]. In this paper we report incremental progress on ARC and lay the foundation for several approaches to abstraction and reasoning not based in brute-force search. We approach ARC as a program synthesis benchmark, solving tasks by writing programs that convert input grids to output grids. In Sect. 2 we outline an approach to abstraction by applying DreamCoder [11]. We show this approach enables learning new concepts that aid in generalization as well as the solving of progressively more challenging tasks. In Sect. 3 we describe a novel program synthesis approach motivated by the way humans approach ARC that captures the reasoning required to search for ARC task solutions. Our algorithm constructs a search graph and reasons over this graph structure to discover task solutions. More specifically, we extend existing execution-guided program synthesis approaches [10,25] with deductive reasoning based on function inverse semantics [18] to enable a neural-guided bidirectional search algorithm.
We evaluate our approach on three domains: ARC tasks, ‘24 Game’ problems, and a simple ‘double-and-add’ challenge. These experiments show the benefits of bidirectional search over baselines and the potential for further progress on ARC. In Sect. 4 we discuss related work, progress on ARC, and future directions.
2 Abstraction Using DreamCoder
We frame the problem as a search problem over the space of programs expressible in some domain specific language (DSL). One way a learning agent can achieve developer-aware generalization (in the sense of [6]) is to identify frequently occurring patterns of computation and form abstractions from them. These abstractions enable searching for more complex programs more quickly. In this section we use DreamCoder [11], a recent tool for program synthesis, to form abstractions. We first show how DreamCoder’s compression algorithm enables learning generalizations of concepts seen in training. Second, we run DreamCoder on ARC to show how forming new abstractions enables the agent to solve progressively more challenging tasks.

2.1 Warmup: Forming Abstractions
To show how DreamCoder can form more abstract concepts from existing ones, we supply our agent with six synthetic tasks (meant to be similar to ARC tasks): drawing a line in three different directions, and moving an object in three different directions. See Fig. 2 for a visualization of these tasks. We solve these tasks with four primitives: rotate clockwise and counterclockwise, draw a line down, and move an object down. The programs synthesized are the following:

(lambda (rotate_cw (draw_line_down (rotate_ccw $0))))                          // draw line left
(lambda (rotate_cw (move_down (rotate_ccw $0))))                               // move object left
(lambda (rotate_ccw (draw_line_down (rotate_cw $0))))                          // draw line right
(lambda (rotate_ccw (move_down (rotate_cw $0))))                               // move object right
(lambda (rotate_cw (rotate_cw (draw_line_down (rotate_cw (rotate_cw $0))))))   // draw line up
(lambda (rotate_cw (rotate_cw (move_down (rotate_cw (rotate_cw $0))))))        // move object up
After running the compression algorithm, the agent creates the following new abstractions:

(lambda (lambda (rotate_cw ($0 (rotate_ccw $1)))))                          // apply action left
(lambda (lambda (rotate_ccw ($0 (rotate_cw $1)))))                          // apply action right
(lambda (lambda (rotate_cw (rotate_cw ($0 (rotate_cw (rotate_cw $1)))))))   // apply action up
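The three learned abstractions are higher-order functions: each takes a "downward" action and returns the same action in another direction. The Python sketch below mirrors them on list-of-lists grids. The grid encoding and the `move_down` primitive (shift every row down by one, padding with zeros) are simplified stand-ins of our own, not the primitives used in the experiments.

```python
def rotate_cw(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid):
    """Rotate a grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def move_down(grid):
    """Hypothetical primitive: shift all content down one row, filling with 0."""
    width = len(grid[0])
    return [[0] * width] + [list(row) for row in grid[:-1]]


def apply_left(action):
    """(lambda (lambda (rotate_cw ($0 (rotate_ccw $1)))))"""
    return lambda grid: rotate_cw(action(rotate_ccw(grid)))


def apply_right(action):
    """(lambda (lambda (rotate_ccw ($0 (rotate_cw $1)))))"""
    return lambda grid: rotate_ccw(action(rotate_cw(grid)))


def apply_up(action):
    """(lambda (lambda (rotate_cw (rotate_cw ($0 (rotate_cw (rotate_cw $1)))))))"""
    return lambda grid: rotate_cw(rotate_cw(action(rotate_cw(rotate_cw(grid)))))


move_left = apply_left(move_down)
move_right = apply_right(move_down)
move_up = apply_up(move_down)

grid = [[0, 0, 1],
        [0, 0, 0],
        [0, 0, 0]]
print(move_left(grid))  # the 1 moves one column to the left
```

Any new primitive that acts downward can be passed to the same three wrappers, which is the sense in which the abstractions are more general than the six original solutions.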
Importantly, the abstractions formed are more general than the original primitives given. This can help enable systematic generalization on further tasks.

2.2 Enabling Generalization on ARC Symmetry Tasks
(a) An example “draw line left” task (b) An example “move object left” task

Fig. 2. Sample tasks involving applying an action left.

In a second experiment, we demonstrate how compression-based learning enables developer-aware generalization on ARC. We provide DreamCoder with a set of five grid-manipulation operations: flipping vertically with vertical_flip, rotating clockwise with rotate_cw, overlaying two grids with overlay, stacking two grids vertically with vertical_stack, and getting the left half of a grid with left_half. We then train our agent on a subset of 36 ARC tasks involving symmetry over five iterations of enumeration and compression. During each iteration, our agent attempts to solve all 36 tasks by enumerating possible programs for each task. It then runs compression to create new abstractions. During the next iteration, the agent repeats its search equipped with the new abstractions. In this experiment, our agent initially solves 16 tasks. After one iteration, it solves 17 in the same amount of time. After another, it solves 19 tasks, and after the final iteration, it solves 22 tasks. Table 1 shows some of the new abstractions learned by DreamCoder’s compression algorithm such as flipping horizontally, and stacking grids horizontally. The program solutions for the final tasks solved, shown in Fig. 3, could not be feasibly discovered without the use of abstractions to reduce the search time.

2.3 Discussion
It is useful to compare the learning done in our approach to that done by neural networks. Neural networks can also learn new concepts from training examples, but their internal representation lacks the structure that would allow them to apply learned concepts compositionally to other tasks. In contrast, functions learned via compression, represented as programs, can naturally be composed and extended to solve harder tasks, while reusing concepts between tasks. This constitutes a learning paradigm which we view as essential to human-like reasoning.

Table 1. Useful actions learned in the process of solving symmetry tasks. Pound signs represent abstractions. Abstractions may rely on others for construction; e.g. to stack grids horizontally, we reflect each input diagonally, stack vertically, and reflect the vertical stack diagonally.
Action                     Code
Mirror across diagonal     #(lambda (rotate_cw (vertical_flip $0)))
Rotate 180°                #(lambda (rotate_cw (rotate_cw $0)))
Flip horizontally          #(lambda (rotate_cw (rotate_cw (vflip $0))))
Rotate counterclockwise    #(lambda (rotate_cw (#(lambda (rotate_cw (rotate_cw $0))) $0)))
Stack grids horizontally   #(lambda (lambda (#(lambda (rotate_cw (vertical_flip $0))) (stack_vertically (#(lambda (rotate_cw (#(lambda (vertical_flip $0)) $0))) $1) (#(lambda (rotate_cw (vertical_flip $0))) $0)))))
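The compositions in Table 1 can be checked by hand with a small Python sketch. This is an illustration under stated assumptions rather than the DSL itself: grids are lists of rows, `vertical_flip` is assumed to reverse the order of the rows, and all function names other than the three primitives are labels chosen here for the learned abstractions.

```python
# Primitives (assumed semantics: grids are lists of rows).
def rotate_cw(grid):
    return [list(row) for row in zip(*grid[::-1])]


def vertical_flip(grid):
    return [list(row) for row in grid[::-1]]


def vertical_stack(top, bottom):
    return [list(row) for row in top] + [list(row) for row in bottom]


# Learned abstractions from Table 1, written as ordinary compositions.
def mirror_diagonal(grid):
    return rotate_cw(vertical_flip(grid))


def rotate_180(grid):
    return rotate_cw(rotate_cw(grid))


def flip_horizontal(grid):
    return rotate_cw(rotate_cw(vertical_flip(grid)))


def rotate_ccw(grid):
    return rotate_cw(rotate_180(grid))


def horizontal_stack(left, right):
    # Reflect each input diagonally, stack vertically, reflect the result.
    stacked = vertical_stack(mirror_diagonal(left), mirror_diagonal(right))
    return mirror_diagonal(stacked)


g = [[1, 2, 3],
     [4, 5, 6]]
transposed = mirror_diagonal(g)
side_by_side = horizontal_stack([[1, 2], [3, 4]], [[5], [6]])
```

Under these assumptions the diagonal mirror is a transpose, which is why conjugating `vertical_stack` with it yields a horizontal stack.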
Neural-Guided, Bidirectional Program Search
Fig. 3. One of the four-way mirroring tasks and the program discovered that solves it written in terms of the original primitives. The program was discovered only after four iterations of enumeration and compression.
There is a caveat to the approach shown here: abstraction learning as shown relies on a simple enumerative search. DreamCoder uses a form of neural-guided program synthesis, predicting a distribution over functions to search over, but this guidance is too weak to scale to the complexity of ARC tasks. In the next section, we show the type of reasoning required for ARC and design an approach to exhibit this reasoning.
3 Bidirectional, Neural-Guided Program Search
In Subsect. 3.1 we first motivate and describe our bidirectional, neural-guided search algorithm. Then in Subsect. 3.2 we present experiments and results using this approach.
3.1 Algorithm Description
In this section we describe our reasoning approach for ARC. We first give a motivating example of human reasoning on ARC, explain how to approximate it with execution-guided synthesis, then incorporate inverse semantics to create a bidirectional, neural-guided search algorithm.

Motivating Example. Solving ARC tasks fundamentally consists of a search for valid solutions. To make this search tractable, our agent needs the ability to reason towards solutions. ARC tasks feature rich visual cues that guide us towards solutions. Without enabling our agent to take full advantage of these cues, the search over possible programs becomes impossibly large. Discovering the solution to an ARC task often involves several discrete steps of reasoning. How can we design a search procedure that looks for ARC solutions in the same manner as humans? As a motivating example, let us consider solving task 303 in Fig. 4. The reasoning steps to come to a solution might look something like this:
Fig. 4. Task 303.
1. Notice that the output grid consists of copies of the 3 × 3 input grid, arranged in a particular pattern within a 9 × 9 grid.
2. New question: Where should we place the input grid copies?
3. Notice that the placements match the arrangement of a different color’s pixels for each grid. For example, in the first example, the diagonal of grids in the output matches the green pixels in the input.
4. New question: What color should we arrange our grid copies along?
5. Solution: The color matched is the most common color in the grid.

Notice the way discovering a solution involves combining sequential insights and problem reductions. Systematizing a form of reasoning for ARC that emulates this reasoning will be based on a combination of execution-guided program synthesis and inverse semantics.

Extending Execution-Guided Synthesis. Execution-guided program synthesis [5,10] is a form of program synthesis where one executes partial programs to produce intermediate outputs, which are used to guide the construction of the full program. Intermediate evaluations provide the opportunity for step-by-step reasoning: instead of coming to the answer at once, one can construct it piece by piece. Humans could be said to make use of the same thing: for instance, it is much easier to write out the result of a multiplication digit by digit than to conduct the full calculation in one’s head. The form of execution-guided synthesis we apply to ARC is most similar to the ‘REPL’ approach of [10]. An example applying the technique to ARC is shown in Fig. 5. Existing execution-based synthesis approaches are limited to bottom-up enumeration: the leaves of the program are constructed (and evaluated) first. In contrast, the steps for solving task 303 involve proposing a function that is used to produce the output grid, and deducing the inputs required to correctly produce the output as new intermediate targets before discovering the complete program.
This form of deductive reasoning involves evaluating functions in reverse. It is best exemplified in the FlashMeta system [18], which leverages the inverse semantics of operators to deduce one or more inputs of a function given the output target and one or more inputs. We incorporate this type of reasoning into an extension of execution-guided program synthesis.
Fig. 5. Solving ARC task 138 from the evaluation set with execution-guided synthesis. Conditioned on the input and output grids, the agent chooses to flip the input horizontally in step one. This action is executed to produce intermediate value i1. Next, the agent chooses to horizontally stack the intermediate value with the input grid, producing another value i2. Last, the agent horizontally stacks this value i2 with itself, correctly producing the output grid for each example and solving the task.
Deductive Reasoning via Inverse Semantics. For our purposes, we can consider two cases. The simplest case is when the function is invertible. In this case, we can evaluate the inverse to produce two new targets for the search, as shown in Fig. 6. In the second case, the function is conditionally invertible: given the output and one or more inputs to a function, one can deduce the remaining inputs needed to produce the output via this function. Many functions are conditionally invertible; perhaps the most familiar family is arithmetic operators: if we know 1 + x = 5, we can deduce that x = 4. An example relevant to ARC is shown in Fig. 6. Using conditional inverses, it is possible to formalize the reasoning described for task 303.
Fig. 6. Left: the function block is directly invertible: given the output, we can deduce the inputs. Right: the horizontal stack function is conditionally invertible: given the output and one input, we can deduce the other input.
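The two cases can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: grids are assumed to be non-empty lists of rows, and all function names are hypothetical. A direct inverse needs only the output, while a conditional inverse needs the output plus one known input and may fail when the known input is inconsistent with the output.

```python
def rotate_cw(grid):
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw(grid):
    return [list(row) for row in zip(*grid)][::-1]


def horizontal_stack(left, right):
    return [list(l) + list(r) for l, r in zip(left, right)]


def inverse_rotate_cw(output):
    """Direct inverse: the only input that rotate_cw maps to `output`."""
    return rotate_ccw(output)


def cond_inverse_hstack_given_left(output, left):
    """Conditional inverse: recover `right` from `output` and a known `left`.

    Returns None when `left` is not a proper left part of `output`.
    """
    width = len(left[0])
    if len(left) != len(output) or width >= len(output[0]):
        return None
    if any(list(o[:width]) != list(l) for o, l in zip(output, left)):
        return None
    return [list(o[width:]) for o in output]


def cond_inverse_add(total, known):
    """Conditional inverse of addition: solve known + x = total for x."""
    return total - known


output = [[1, 2, 3],
          [4, 5, 6]]
right = cond_inverse_hstack_given_left(output, [[1], [4]])
x = cond_inverse_add(5, 1)  # the arithmetic example from the text
```

Returning `None` for an inconsistent input is one possible design choice; it makes explicit that a conditional inverse is a partial function.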
Bidirectional, Neural-Guided Program Search. To extend execution-guided synthesis to a bidirectional algorithm with inverse and conditional inverse functions, we extend the environment of [10], approaching the synthesis task via reinforcement learning. The setup takes place in a Markov Decision Process (MDP). The current state is a graph of nodes. Each node represents either an input value, the output value, or an intermediate value resulting from the execution of an operation in the forward or backward direction. A node is grounded if there is a valid program to create that node from the operations applied so far. In general, grounded nodes correspond to the forward direction of search (bottom-up program enumeration), while ungrounded nodes correspond to the backward direction (top-down program enumeration). An operation is a function from the grammar together with a designation of how it is applied: forward, as an inverse, or as a conditional inverse (and, if as a conditional inverse, which input arguments it is conditioned on). There are thus three types of operations: forward operations, inverse operations, and conditional inverse operations. A forward operation applies the function to a set of grounded inputs to produce a new grounded node. An inverse operation takes an ungrounded output and produces a new ungrounded target node such that grounding the target node will cause the output node to be grounded as well. A conditional inverse operation takes an ungrounded output and one or more grounded input nodes, and produces a new ungrounded target node such that grounding the target node will cause the output node to be grounded as well. All inverse and conditional inverse operations have a corresponding forward operation. Solving a given task thus consists of an episode in the MDP. Actions in the MDP correspond to a choice of operation and a choice of arguments for that operation.
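The bookkeeping of grounded and ungrounded nodes can be sketched with a toy state over integers. This is a simplified illustration under several assumptions, not the environment of [10]: nodes are identified with their values, only binary arithmetic operators are included, and an invalid operation raises `ValueError` where the MDP would assign a penalty. The class and method names are hypothetical.

```python
import operator

FORWARD = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def _inv_mul(out, known):
    # Solve known * x = out over the integers, if possible.
    if known != 0 and out % known == 0:
        return out // known
    return None


# Conditional inverses: for `known op x = out`, recover x (or None).
COND_INVERSE = {
    "+": lambda out, known: out - known if out >= known else None,
    "-": lambda out, known: known - out if known >= out else None,
    "*": _inv_mul,
}


class SearchState:
    def __init__(self, inputs, target):
        self.grounded = set(inputs)   # values with a known program
        self.ungrounded = {target}    # targets still waiting for a program
        self.target = target
        self.links = []               # (new_target, parent_target) pairs

    def forward(self, op, a, b):
        if a not in self.grounded or b not in self.grounded:
            raise ValueError("forward operations need grounded arguments")
        value = FORWARD[op](a, b)
        self.grounded.add(value)
        self._propagate()
        return value

    def conditional_inverse(self, op, out, known):
        if out not in self.ungrounded or known not in self.grounded:
            raise ValueError("invalid operation")
        new_target = COND_INVERSE[op](out, known)
        if new_target is None:
            raise ValueError("invalid operation")
        self.ungrounded.add(new_target)
        self.links.append((new_target, out))
        self._propagate()
        return new_target

    def _propagate(self):
        # Grounding a target grounds the output it was derived from.
        changed = True
        while changed:
            changed = False
            for child, parent in self.links:
                if child in self.grounded and parent not in self.grounded:
                    self.grounded.add(parent)
                    changed = True
        self.ungrounded -= self.grounded

    def solved(self):
        return self.target in self.grounded


state = SearchState([2, 5], 24)
state.conditional_inverse("*", 24, 2)   # new ungrounded target: 12
state.forward("+", 5, 5)                # grounds 10
state.forward("+", 10, 2)               # grounds 12, which grounds 24
```

In this episode the backward step and the forward steps meet at the value 12, which is the point at which the output node becomes grounded.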
Each action applies a function in either the forward or backward direction. Intuitively, this executes a bidirectional search that tries to connect the grounded nodes on one side with the ungrounded output node on the other. We give reward R for solving the task and a penalty of −1 for choosing an action corresponding to an invalid operation. Like [5,10,25], we train with a combination of supervised training on randomly generated programs and fine-tuning with the reinforcement learning algorithm Reinforce. To generate random bidirectional programs for supervised training, we first create a random program, then construct an execution trace for it by probabilistically inverting function applications, starting from the root. The network architecture is the same as in [10], with a task-dependent network for embedding the nodes of the bidirectional graph, a DeepSet network [24] to encode the graph into a single embedding and choose a function to apply, and a pointer network [23] for choosing function arguments.
3.2 Experiments
We evaluate our bidirectional algorithm in three settings: solving ARC symmetry tasks, solving arithmetic puzzles from the ‘24 Game’ family, and solving ‘double-and-add’ puzzles. We compare bidirectional synthesis with
a forward-only baseline which, like existing approaches, only allows operations to be applied in the forward direction.

ARC Symmetry Tasks. As a proof of concept, we evaluate the bidirectional algorithm on a set of 18 ARC symmetry tasks, a subset of those used in Sect. 2. We use a DSL of six operations: stacking two grids horizontally or vertically, rotating clockwise or counterclockwise, and flipping a grid horizontally or vertically. The rotation and flip functions are directly invertible, while the stacking operations are conditionally invertible. We use a convolutional neural network to embed grid example sets. We train on a set of randomly generated programs evaluated on random input grids from the ARC training set, and fine-tune with Reinforce before sampling rollouts for thirty minutes on all tasks at once. The agent is able to solve 14 of 18 tasks, including one of the “four-way mirror” tasks. In this experiment, the bidirectional agent performed on par with the forward-only baseline.

24 Game. Next, we compare the performance of bidirectional search with the forward-only baseline by tasking our agent with solving “24 Game” problems. A 24 Game task consists of four input numbers, each from one through nine. To solve the task, one must use each number once in an expression that evaluates to twenty-four using +, −, ×, ÷. For example, given 8, 1, 3, and 2, a solution is 24 = (2 − 1) × 3 × 8. To solve these tasks bidirectionally, we can use the conditional inverse of each arithmetic operator in addition to forward arithmetic operations.¹ First we conduct supervised pretraining on all depths at once. The pretraining programs may target any number, not just 24, with a maximum allowed integer of 100 and no negative or non-integral numbers. We then fine-tune on different depths with Reinforce for 10,000 epochs of batch size 1000. We measure performance by the percentage of episodes solved in the last 1,000 epochs of training. Results are shown in Table 2.
Bidirectional synthesis outperforms the forward-only baseline across all depths. This supports our thesis, but is suspicious, as we should expect identical accuracy on depth-one tasks, where only a single action is needed. Accuracy remains fairly high as depth increases because depth does not necessarily imply program length: as many as 40% of depth-four tasks remain solvable in fewer than four actions.

Table 2. Percent of tasks solved for 24 Game, measured by percent of episodes solved in the last 1000 epochs of RL fine-tuning. Forward-only denotes only using forward operations. Bidirectional includes conditional-inverse operations. Average over three runs with stdev shown.
Depth           1              2             3             4
Forward-only    87.22 ± 0.64   84.29 ± 1.6   75.88 ± 3.6   67.04 ± 1.0
Bidirectional   95.2 ± 0.66    92.9 ± 2.1    87.7 ± 1.1    85.3 ± 1.9

¹ We relax the rule that each input is used exactly once.

Double-and-add. Last, we include results on a ‘double-and-add’ task to better show the advantage of bidirectional search. Given a target number, one must reach it starting from the number two by repeatedly adding one or doubling the number. For example, 7 = 1 + 2 ∗ (1 + 2). This task, akin to the method for exponentiation by repeated squaring, is much more easily solved in a top-down fashion: the choice of adding one or doubling boils down to whether the target is even or odd. Here we have two forward operations, each of which is directly invertible. On a training set of five thousand numbers sampled between one and five million, and a held-out set of five hundred numbers, our bidirectional model achieves 100% evaluation accuracy after a single epoch of supervised training. In contrast, the forward-only model fails to solve the held-out tasks, due to the difficulty of “seeing” the solution from the source; see Fig. 7.
Fig. 7. Percent of tasks solved for bidirectional and forward-only agents trained on the double-and-add task. The bidirectional agent achieves 100% accuracy after a single epoch of training. After fifty epochs of training, forward-only converges without solving the held-out tasks.
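The top-down rule for the double-and-add task (halve when the target is even, subtract one when it is odd) can be written directly as a short Python sketch. This is the hand-written parity rule the text alludes to, not the learned policy; the function names are hypothetical.

```python
def double_and_add_top_down(target, start=2):
    """Work backwards from `target` to `start`, inverting one step at a time."""
    if target < start:
        raise ValueError("target is not reachable from the start value")
    steps = []
    value = target
    while value != start:
        if value % 2 == 0 and value // 2 >= start:
            steps.append("double")   # inverse of doubling: halve
            value //= 2
        else:
            steps.append("add_one")  # inverse of adding one: subtract one
            value -= 1
    steps.reverse()                  # report steps in forward order
    return steps


def run_forward(steps, start=2):
    """Replay a list of steps from the start value."""
    value = start
    for step in steps:
        value = value * 2 if step == "double" else value + 1
    return value


plan = double_and_add_top_down(7)    # mirrors 7 = 1 + 2 * (1 + 2)
```

Working backwards, each decision depends only on the current target, whereas a forward search from two has no such local signal, which is consistent with the gap reported in Fig. 7.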
4 Discussion
Related Work. Our work builds on and is inspired by a long line of progress in neural program synthesis [3,4,8,17,21], execution-guided synthesis [5,10,26], and deep reinforcement learning for search [15,19]. Bidirectional, neural-guided program search is made possible primarily by the inverse semantics of FlashMeta [18]. The concepts of bidirectional programming, inverse semantics, and program inversion have been present throughout the history of program synthesis [9,14,20], but the way in which inverse evaluation is used here is most similar to FlashMeta.

ARC. To date, there are no prominent learning-based approaches to ARC that have proven more successful than the Kaggle-winning brute-force approach [2]. Other Kaggle approaches include genetic programming and cellular automata, but all essentially rely on brute force search over a DSL of operations combined
with ARC-specific tricks, without any substantial learning [1]. The few-shot nature and large search space of ARC make it a very challenging benchmark, and progress in scaling program synthesis algorithms is likely needed to enable further progress. We hope the progress reported here inspires and enables further work on ARC.

Future Work. The next step of our work is to combine the two approaches into a unified approach. This can be done by using the bidirectional search algorithm to solve tasks, then creating new operations out of abstractions based on the tasks solved. To fill out the learning approach, we can consider including the ability to synthesize inverse and conditional inverse operations for newly learned abstractions, perhaps as its own synthesis problem. Our approach remains to be scaled up to a full DSL capable of solving ARC. Incorporating more sophisticated inverse semantics and type-directed search are important components of the full bidirectional approach.
References
1. Abstraction and reasoning challenge - Kaggle (2020). https://www.kaggle.com/c/abstraction-and-reasoning-challenge/leaderboard
2. top-quarks/ARC-solution (2020). https://github.com/top-quarks/ARC-solution. Accessed 05 Oct 2020
3. Balog, M., Gaunt, A.L., Brockschmidt, M., Nowozin, S., Tarlow, D.: DeepCoder: learning to write programs (2016)
4. Cai, J., Shin, R., Song, D.: Making neural programming architectures generalize via recursion (2017)
5. Chen, X., Liu, C., Song, D.: Execution-guided neural program synthesis. In: International Conference on Learning Representations (2018)
6. Chollet, F.: On the measure of intelligence (2019)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
8. Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A.R., Kohli, P.: RobustFill: neural program learning under noisy I/O (2017)
9. Dijkstra, E.W.: Program inversion. In: Dijkstra, E.W. (ed.) Selected Writings on Computing: A Personal Perspective. Texts and Monographs in Computer Science, pp. 351–354. Springer, New York (1982). https://doi.org/10.1007/978-1-4612-5695-3_63
10. Ellis, K., Nye, M., Pu, Y., Sosa, F., Tenenbaum, J., Solar-Lezama, A.: Write, execute, assess: program synthesis with a REPL (2019)
11. Ellis, K., et al.: DreamCoder: growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning (2020)
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition (2015)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
14. Lubin, J., Collins, N., Omar, C., Chugh, R.: Program sketching with live bidirectional evaluation. In: Proceedings of the ACM on Programming Languages, vol. 4, no. ICFP, pp. 1–29 (2020). https://doi.org/10.1145/3408991
15. McAleer, S., Agostinelli, F., Shmakov, A., Baldi, P.: Solving the Rubik’s cube with approximate policy iteration. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019. OpenReview.net (2019). https://openreview.net/forum?id=Hyfn2jCcKm
16. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)
17. Nye, M., Hewitt, L., Tenenbaum, J., Solar-Lezama, A.: Learning to infer program sketches (2019)
18. Polozov, O., Gulwani, S.: FlashMeta: a framework for inductive program synthesis. In: Aldrich, J., Eugster, P. (eds.) OOPSLA, pp. 107–126. ACM (2015). http://dblp.uni-trier.de/db/conf/oopsla/oopsla2015.html#PolozovG15
19. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)
20. Srivastava, S., Gulwani, S., Chaudhuri, S., Foster, J.S.: Path-based inductive synthesis for program inversion. SIGPLAN Not. 46(6), 492–503 (2011). https://doi.org/10.1145/1993316.1993557
21. Valkov, L., Chaudhari, D., Srivastava, A., Sutton, C., Chaudhuri, S.: HOUDINI: lifelong learning as program synthesis (2018)
22. Vinyals, O., et al.: Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), 350–354 (2019)
23. Vinyals, O., Fortunato, M., Jaitly, N.: Pointer networks (2017)
24. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R., Smola, A.: Deep sets (2018)
25. Zhou, C., Li, C.L., Poczos, B.: Unsupervised program synthesis for images using tree-structured LSTM (2020)
26. Zohar, A., Wolf, L.: Automatic program synthesis of long programs with a learned garbage collector (2019)
Success at High Peaks: A Multiscale Approach Combining Individual and Expedition-Wide Factors
Sanjukta Krishnagopal
Gatsby Computational Neuroscience Unit, University College, London W1T 4JG, UK
[email protected]

Abstract. This work presents a network-based, data-driven study of the combination of factors that contribute to success in mountaineering. It simultaneously examines the effects of individual factors such as age, gender, experience etc., as well as expedition-wide factors such as number of camps, ratio of sherpas to paying climbers etc. Specifically, it combines the two perspectives through a multiscale network model, i.e., a network of networks, with a network of climbers within each expedition at the finer scale and an expedition similarity network at the coarser scale. The latter is represented as a multiplex network where layers encode different factors. The analysis reveals that the chances of failing to summit due to fatigue, altitude or logistical problems drop drastically when climbers are with people they have climbed with before, especially for experienced climbers. Additionally, centrality indicates that the individual traits of youth and oxygen use while ascending are the strongest drivers of success. Further, the learning of network projections enables computation of correlations between intra-expedition networks and corresponding expedition success rates. Of the expedition-wide factors, the expedition size and total time layers are found to be strongly correlated with success rate. Lastly, community detection on the expedition-similarity network reveals distinct communities, amongst which a difference in success rates naturally emerges.

Keywords: Mountaineering data · Multiscale networks · Multiplex networks · Social network analysis · Group dynamics · Everest expeditions
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 669–680, 2022. https://doi.org/10.1007/978-3-030-93409-5_55

Introduction
Extreme mountaineering is an increasingly popular activity that requires not only physical fitness and skills, but also mental fortitude and psychological control. The Himalayas, one of the most impressive mountain ranges, present several opportunities, including the famous Mount Everest itself, for extreme mountaineering. Extreme or high altitude mountaineering is not what one might consider safe, and personal or expedition-related factors such as effective use of proper equipment, climber experience, mental strength and self-reliance are all measures [28] to increase safety and chances of success. Certain aspects of extreme mountaineering are well-known to be individualistic, especially as one
gets closer to the death zone (8000 m altitude). However, with increasing commercialization of extreme mountaineering, social and psychological factors play a subtle but crucial role in survival. Indeed, the mass fatality on Everest in 1996 received tremendous attention for the social and logistical shortcomings of the expeditions [32]. Understanding factors, both individual and expedition-wide, e.g. effective use of proper equipment, climber experience, mental state etc. [28], is crucial to maximizing safety and chances of success. Data-driven analysis of the factors that predict success has been accelerated by the availability of the large and detailed Himalayan dataset [26]. Indeed, several works have studied the effects of age and sex [11], experience [9], commercialization [36] etc. on success, and highlight the importance of age as a dominant determining factor. Additionally, [35] shows that women are more risk-averse than men, and that sherpas have lower risk at high altitude than paying climbers. Success depends on both physiological state [10,31] and psychological and sociological state [7], which influence the evaluation of risks and hazards. In [4], the psychological motivators behind why people climb are outlined. These motivators differ between paying climbers and sherpas, introducing questions regarding the ethics of hiring sherpas [22]. Group dynamics also play a major role in anxiety and problem solving [33]. Thus, despite opinions that climbing is an individual activity, there is mounting evidence highlighting the importance of social and psychological factors. The psychology is largely driven by relationships between climbers; for instance, climbers that frequently climb together may develop better group dynamics, and consequently lower failure rates. However, there is limited investigation of the effects of climbing with repeat partners.
This work studies the likelihood of failure due to factors such as logistical failings, fatigue, altitude-related sickness etc. when climbing with repeat partners. Network science is becoming increasingly popular for studying data consisting of a multitude of complex interacting factors. Network approaches have been used successfully in predictive medicine [13], climate prediction [30], predictions in group sports [19], disease spreading [37] etc. A mountaineering expedition consists of individuals, which naturally lend themselves to a network structure. Relationships between individuals and expedition features such as oxygen use, age, sex, experience etc. can be modeled through a bipartite network. Such networks are often projected into other spaces for further analysis [14,16]. To the best of our knowledge, this work is the first network-based analysis of mountaineering data, incorporating factors at multiple scales. The natural question emerges: which of these features, which can be represented by nodes, are central to maximizing chances of success? An active area of research investigates the importance of nodes [20] through centrality-based measures [25,29] that serve as reliable indicators of ‘important’ factors. While several studies have focused on individual traits that affect success and death, it is natural to expect expedition-wide factors (e.g. ratio of sherpas to paying climbers, number of days to summit, number of camps, intra-expedition social relationships etc.) to play a role in success. However, there
is limited work that studies the effect of such expedition-wide factors. This work considers both expedition-wide factors and personal features, and the interaction between the two. Multilayer networks [12] are an ideal tool to model multiple types of interactions, where each layer models relationships between expeditions through a particular factor. Multilayer networks have been used successfully to model neuronal activity [34], in sports [3], in biomedicine [6] etc. Multilayer networks without inter-layer connections between different nodes are also known as multiplex networks [17]. In order to model the different factors that influence expedition similarities, a multiplex network model is used. However, the network of feature relations within an expedition is in itself one of the factors correlated with success; hence one layer of the multilayer network encodes similarities in intra-expedition networks, where connectivity between expeditions is determined by their graph similarity [5]. This lends a multiscale structure to the network model. The term multiscale can be used to refer to different levels of thresholding in the graph across different scales as in [23], or to hierarchical networks as in [15,27]. The notion of multiscale used here is derived from the latter, where the nodes of the expedition similarity network are in fact networks themselves. Multiscale networks are natural when modeling relationships on different scales, for instance in brain modeling [1], the stock market [23], ecology [18] etc. Motivations, abilities and psychologies vary amongst individuals, influencing people’s perceptions [24] and strategies that may contribute to success. Hence, there must exist multiple strategies that consist of different combinations of dominant factors, and a climber may be interested in the strategy that is best suited to them.
Community detection [21] on the expedition similarity networks naturally partitions expeditions into groups that show high within-group similarity, where each group defines a strategy. Community detection is a useful tool in network analysis, and is an active area of research extended to multilayer networks [8], multiscale networks [27] etc. This work identifies three groups, with one in particular correlated with high success rate, providing insight into the combination of factors that allow for safe and successful climbs.
1 Data
The data for this work was obtained from the open access Himalayan Database [26], which is a compilation of records for all expeditions that have climbed in the Nepal Himalayan range. The dataset covers all expeditions from 1905 through 2021, and has records of 468 peaks, over 10,500 expedition records and over 78,400 records of climbers, where each record of any type is associated with an ID. We use the following information from the Expedition records:
– Peak climbed (height).
– Days from basecamp to summit.
– Number of camps above basecamp.
– Total number of paying members and hired personnel.
– Result: (1) Success (main peak/foresummit/claimed), (2) No summit.
The success rate of an expedition is calculated as the fraction of members that succeeded. Each expedition comprises several individual climbers, yielding a natural multiscale structure. We use the following data about each climber:
– Demographics: Age, Sex, Nationality.
– Oxygen use: ascending or descending.
– Previous experience above 8000 m (calculated).
– Result:
  1. Success
  2. Altitude related failure: Acute Mountain Sickness (AMS) symptoms, breathing problems, frostbite, snowblindness or coldness.
  3. Logistical or Planning failure: Lack of supplies, support or equipment problems, O2 system failure, too late in day or too slow, insufficient time left for expedition.
  4. Fatigue related failure: exhaustion, fatigue, weakness or lack of motivation.
  5. Accident related failure: death or injury to self or others.
2 The Effect of Climbing with Repeat Partners
Climbers often tend to climb with friends or regular climbing partners. The security of regular climbing partners may improve confidence and limit failure, but may also lead to a misleading sense of comfort. Here, we compare the average rates of success and of various types of failure when climbing with repeat partners versus new partners. Figure 1 shows the fraction of failures when climbing with friends/repeat partners over the climber average. These failures are divided into altitude related, fatigue related, logistical and planning failures, and accident/illness. The effect of total experience is controlled for by plotting against the total number of climbs on the x-axis, starting at 15 climbs, so that beginner climbers are not considered. As seen in Fig. 1, repeat partners have virtually no effect on the chance of success except for very experienced climbers (36–40 logged climbs), which may be attributed to increasing climb difficulty or to their partners being less experienced, both of which are more likely for very experienced climbers. In contrast, the chance of failure is significantly lower when climbing with repeat partners for every type of failure. In particular, the chance of failure due to fatigue-related issues decreases the most when climbing with repeat partners, followed by failure due to logistical or planning issues. This may be expected, since climbing partners that often climb together are typically better at communication, planning, and knowing each other’s physical limitations. Note that only climbers with over 15 logged climbs are considered, indicating that complete lack of experience is not a cause of failure. Additionally, the most experienced climbers (that have logged 36–40 climbs) have nearly no failure due to fatigue or logistics, as one may expect. Similarly, failure due to altitude-related and cold-related issues also drastically reduces when climbing with repeat partners.
Additionally, failure due to accident shows an increasing trend with increasing experience, which may be attributed to the fact that more experienced climbers tend to tackle more dangerous mountains.
Success at High Peaks
Fig. 1. The fraction of several categories (success and various types of failures) averaged over climbers when climbing with a group with at least one repeat partner (someone they have done a logged Himalaya expedition with before) over the individual average. The y-axis denotes the ratio of success and various failures when climbing with repeat partners over their personal average over all climbs (considering climbs with both repeat and new partners).
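The ratio plotted in Fig. 1 can be illustrated with a minimal sketch. The record layout `(climber, with_repeat_partner, result)`, the function name `repeat_partner_ratio`, and the toy data are assumptions made for illustration; they are not taken from the Himalayan Database schema or from the author's code. Climbers for whom the ratio is undefined (no climbs with a repeat partner, or an overall rate of zero) are skipped, which is also an assumption.

```python
from collections import defaultdict


def repeat_partner_ratio(climbs, outcome):
    """Average, over climbers, of the rate of `outcome` on climbs with a
    repeat partner divided by that climber's rate over all climbs."""
    per_climber = defaultdict(list)
    for climber, with_repeat, result in climbs:
        per_climber[climber].append((with_repeat, result == outcome))

    ratios = []
    for records in per_climber.values():
        overall = sum(hit for _, hit in records) / len(records)
        repeat_hits = [hit for with_repeat, hit in records if with_repeat]
        if overall == 0 or not repeat_hits:
            continue  # ratio undefined for this climber (assumed handling)
        ratios.append((sum(repeat_hits) / len(repeat_hits)) / overall)
    return sum(ratios) / len(ratios) if ratios else float("nan")


# Hypothetical toy records: (climber id, climbed with a repeat partner?, result)
climbs = [
    ("a", True, "success"), ("a", False, "fatigue"),
    ("a", True, "success"), ("a", False, "success"),
    ("b", True, "fatigue"), ("b", False, "fatigue"),
]
success_ratio = repeat_partner_ratio(climbs, "success")
fatigue_ratio = repeat_partner_ratio(climbs, "fatigue")
```

A value above 1 means the outcome is more frequent with repeat partners than in the climber's overall record; a value below 1 means it is less frequent.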
3 Intra-expedition Features Determining Success
Here, the focus shifts from studying individual climbers to analyzing a group of climbers within an expedition. To do so, only the tallest peak, Mount Everest, is considered. Expeditions with fewer than 12 climbers are excluded, as are expeditions that resulted in death. To generate the intra-expedition network, we start with a bipartite network P between climbers and features, where a climber is connected to the features that they possess. The ‘features’ selected as the nodes are: age, sex, oxygen while ascending, oxygen while descending, sherpa identity and previous experience above 8000 m, making a total of f = 6 features. A climber is connected to sex if they are male, and age is binarized into above and below median age (40). We then generate the intra-expedition network, with adjacency matrix A of size f × f, by projecting the bipartite network into feature space as follows: $A = P^{T} P$. The edge weight between two nodes (features) is given by the number of people that are connected to both features. Information about the expedition is then encoded in the structure of this network. To explore such effects, measures such as centrality capture important properties that provide insight into the importance of different features [20]. For instance, if the group were composed mostly of high-age individuals, the node centrality of the age node would be relatively high. Here, the eigenvector centrality [29] determines how central each feature is in a given graph. Studying the differences in feature centrality between groups of successful summit vs no-summit provides important insight into features that may be important for summit success.
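The projection and centrality computation described above can be sketched as follows. This is an illustrative sketch, not the author's implementation: the feature labels, the toy matrix `P`, and the choice to zero the diagonal of A (removing self-loops before computing centrality) are assumptions. The centrality is taken as the eigenvector of the largest eigenvalue of the symmetric matrix A, normalized to sum to 1.

```python
import numpy as np

# Hypothetical labels for the f = 6 features, in column order.
FEATURES = ["age", "sex", "o2_ascent", "o2_descent", "sherpa", "exp_8000m"]


def feature_projection(P):
    """Project a climber-by-feature incidence matrix onto feature space."""
    P = np.asarray(P, dtype=float)
    A = P.T @ P               # A[a, b] = number of climbers with both a and b
    np.fill_diagonal(A, 0.0)  # assumption: ignore self-loops
    return A


def eigenvector_centrality(A):
    """Leading eigenvector of a symmetric non-negative matrix, summing to 1."""
    eigenvalues, eigenvectors = np.linalg.eigh(A)
    leading = np.abs(eigenvectors[:, np.argmax(eigenvalues)])
    return leading / leading.sum()


# Toy expedition with four climbers (rows) and six binary features (columns).
P = np.array([
    [1, 1, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 0, 0],
])
A = feature_projection(P)
centrality = eigenvector_centrality(A)
```

In this toy expedition every climber uses oxygen while ascending, so that feature is linked to every other feature with the largest possible weight and receives the highest centrality.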
S. Krishnagopal
Fig. 2. Mean eigenvector centrality (a) as a function of expedition features for Everest expeditions greater than 12 members plotted for groups of successful vs unsuccessful climbers ordered by increasing difference between success and no-success centralities. Error bars show standard error on the centrality. (b, c) Aggregation of the feature graph showing relative edge weights in summit success, and no-summit groups respectively.
As seen in Fig. 2 (a), the least central feature in determining success on summit was the use of oxygen while descending, which is expected since descent features have no effect on summit prospects, except for indicating that oxygen was available on descent, meaning there wasn’t excessive use during ascent. It is worth noting that most fatalities on Everest happen during the descent. The next features that were slightly more central in successful summits were previous experience above 8000 m (for reference, Everest is at 8849 m), followed by use of O2 while ascending. Surprisingly, summit centrality for sex (indicating male) was relatively low compared to no-summit centrality, indicating that being male had low importance in the chances of success at summit. Lastly, the largest differences in summit vs no summit were from identity (sherpas were much more likely to succeed), and age ( μl , and 0 otherwise.
(1)
where vil is the value of factor l in expedition i. For the intra-expedition feature layer, the edge weight between expeditions is given by the Graph Edit distance [5] between their intra-expedition graphs, normalized to a max value of 1.
5 Determining Layer Importance Through Correlation with Success
Different factors encoded as layers may have varying importance in determining the success of an expedition. An expedition’s success rate is the fraction of climbers that succeed at summiting. The importance of a regular (non-multiscale) layer can be inferred from the correlation between the values taken on the nodes (expeditions) in that layer and the corresponding success rates. However, for the intra-expedition layer, this involves computing the correlations between intra-expedition graphs and success rates (scalars).
Fig. 4. Pearson’s correlation coefficient between layer (factor) values and expedition success rate. The exact values across x-axis layers are −0.45, −0.36, −0.12, 0.57, 0.84. The corresponding p-values are $5.5 \times 10^{-10}$, $1.15 \times 10^{-6}$, 0.1, $5.7 \times 10^{-16}$, $8.9 \times 10^{-47}$.
Devising measures for comparisons between graph space and scalar space is an important problem in network science. This can be done by dimensionality reduction through projection of graph space to a scalar space. Since each factor in the graph is independent, we perform linear regression on the unique entries of the adjacency matrix of the graph to obtain a linear fit that best maps the intra-expedition graphs to the success rates. The corresponding coefficients are denoted by c. Note that linear regression generates a single set of coefficients that best map from graph space to scalar space, i.e., output a best-fit scalar for each graph. In principle, one can use higher-order methods or neural networks to learn this mapping; however, since the features of the graph are expected to be linearly independent, a linear mapping is sufficient in this case. In other words, one can project the graphs as follows: the intra-expedition graph of the j-th expedition, whose adjacency matrix is $A_j$, is represented by the scalar $\eta_j$ given by $\eta_j = A_j \cdot c$, where c are the coefficients obtained through regression over all graphs. One can then identify the importance of the intra-expedition layer through the (Pearson’s) correlation coefficient between η and the success rates. Figure 4 shows the Pearson’s correlation coefficient between the layers and the success rate. A higher correlation implies higher influence of the layer in determining success. Despite sherpas having a high chance of personal success as seen in the intra-expedition analysis, the ratio of number of paying members to number of hired personnel on the team has a relatively smaller effect on expedition success compared to the other factors considered in the multilayer approach. Both number of camps above basecamp and days to summit/high point had a negative correlation with success, as one might expect, with the latter having a larger effect. Also surprisingly, the expedition size is found to be relatively important in determining success (with a correlation coefficient of >0.5). Lastly, the most important factor was the intra-expedition feature graph layer, which is strongly correlated with success, indicating that non-linear effects and outliers to the regression fit are relatively few. All p-values are extremely low, indicating that the correlation is statistically significant, except for the ratio of number of members to hired personnel.
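The graph-to-scalar projection described in this section can be sketched with ordinary least squares. The following is a sketch under stated assumptions rather than the author's code: "unique entries" is taken to mean the strictly upper triangle of each symmetric adjacency matrix, an intercept column is added, and the graphs and targets are synthetic (the targets are arbitrary real numbers, not real success rates). The function names are hypothetical.

```python
import numpy as np


def unique_entries(A):
    """Strictly upper-triangular entries of a symmetric adjacency matrix."""
    rows, cols = np.triu_indices(A.shape[0], k=1)
    return A[rows, cols]


def project_graphs(adjacency_matrices, targets):
    """Fit coefficients c by least squares and return (c, eta, pearson_r)."""
    X = np.array([unique_entries(A) for A in adjacency_matrices], dtype=float)
    X = np.column_stack([X, np.ones(len(X))])  # assumption: include intercept
    y = np.asarray(targets, dtype=float)
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    eta = X @ c                                # one scalar per graph
    r = np.corrcoef(eta, y)[0, 1]
    return c, eta, r


# Synthetic data: 40 graphs on 6 features, targets linear in the edge weights.
rng = np.random.default_rng(0)
graphs = []
for _ in range(40):
    M = np.triu(rng.integers(0, 13, size=(6, 6)), k=1)
    graphs.append(M + M.T)
true_w = rng.normal(size=15)
targets = np.array([unique_entries(A) @ true_w for A in graphs])
targets = targets + 0.01 * rng.normal(size=40)

c, eta, r = project_graphs(graphs, targets)
```

Because the correlation is computed on the same data used for the fit, it measures in-sample agreement; a held-out split would be needed to assess predictive value.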
6 Community Detection to Identify Patterns of Success
Success at high peaks is a combination of several features. Through analysis of the multiscale graph, one can identify communities of expeditions that have similar factors and features. One may wonder if the data would naturally cluster into communities with different success rates, which can be associated with differences in the combination of factors. The layers of the multiscale multiplex graph E are aggregated to generate an expedition similarity graph S given by $S = \sum_{l} E^{l} / |l|$, where $|l|$ is the total number of layers (5 in this case). Here, each layer has the same weight, but one may choose to assign weights to them in other ways, for instance weighted by layer importance. Louvain community detection [2] is then applied to S and identifies three communities. Note that the number of communities is not pre-determined but selected by the algorithm to maximize modularity [21]. Figure 5 (left) shows the differences in expedition-wide factors in the three communities (ordered by average success rate on the x-axis). As seen from the figure, the three emergent communities naturally separated to reveal different success rates (the first and the second community had similar success rates at 0.28 and 0.32, whereas the third was significantly higher at 0.68). Firstly, the most dominant difference is found to be expedition size, which is significantly higher in the third ‘successful’ community (with the highest success rate at 0.68), indicating that larger groups allow a larger fraction of climbers to succeed. One may hypothesize that this is because the experienced climbers do not have to shoulder the responsibility of the less experienced climbers, which may slow them down. Additionally, all communities were largely similar in the number of camps above base camp.
However, the ratio of number of members to personnel was relatively higher in both the community with the highest success rate and the community with the lowest success rate, indicating that it isn’t a determining factor, supported by results from Sect. 4. Lastly, as the success rate of the communities increases, their days to summit decrease, which is also expected. At high altitudes, the body goes into shock from prolonged exposure, so one might expect a faster expedition to face fewer challenges in this regard, and hence be more successful. Figure 5 (right) plots the centralities of the intra-expedition features for the three communities (ordered by their success rate on the x-axis). The successful community was the youngest and had the highest centrality for oxygen use both while ascending and while descending. Despite this community having slightly less average experience than the first two groups, age and oxygen use were the leading indicative features for success, which is in agreement with results from Sect. 3. The main differences between the first and second ‘low-success’ communities are that the first community had relatively low experience >8000 m, whereas the second is a relatively older, but more experienced population. Hence, this provides insight into which strategy/choices are conducive to climbing Everest given fixed traits of an individual such as age/experience etc.

Fig. 5. (Left) The average values of the various expedition-wide factors represented in layers shown for the three communities. (Right) Centralities of the intra-expedition graph features for the three communities. The three communities are represented by their success fraction across the x-axis.
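The layer aggregation used in this section, and the modularity objective that Louvain detection heuristically maximizes, can be sketched as follows. This sketch does not implement the Louvain algorithm itself; it only evaluates the standard weighted modularity of a given partition. The function names, the optional `weights` argument, and the toy layers are assumptions for illustration.

```python
import numpy as np


def aggregate_layers(layers, weights=None):
    """Combine layer adjacency matrices E^l into one similarity matrix S.
    With no weights this is the plain average over layers."""
    E = np.asarray(layers, dtype=float)        # shape (n_layers, n, n)
    if weights is None:
        return E.mean(axis=0)
    w = np.asarray(weights, dtype=float)
    return np.tensordot(w / w.sum(), E, axes=1)


def modularity(S, labels):
    """Weighted modularity of a partition of the symmetric matrix S."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    strength = S.sum(axis=1)
    two_m = S.sum()
    same = labels[:, None] == labels[None, :]
    return float(((S - np.outer(strength, strength) / two_m) * same).sum() / two_m)


# Toy example: six expeditions, two layers.
block = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)  # two tight groups
uniform = 0.2 * (np.ones((6, 6)) - np.eye(6))            # weak links everywhere
S = aggregate_layers([block, uniform])

q_two_groups = modularity(S, [0, 0, 0, 1, 1, 1])
q_one_group = modularity(S, [0, 0, 0, 0, 0, 0])
```

Passing `weights` proportional to layer importance would give the weighted variant mentioned in the text; a partition that matches the block structure scores higher than the trivial single-community partition, whose modularity is zero.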
7 Discussion and Future Work
This work presents the first network-based analysis of mountaineering, studying the intra-expedition and expedition-wide factors that contribute to success. First, it considers a climber-centric perspective and shows that the chances of summit failure (due to fatigue, logistical failure etc.) drastically reduce when climbing with repeat partners, especially for more experienced climbers. Then, it studies the importance of intra-expedition features by projecting a bipartite climber-feature network to show that the largest difference in centralities amongst successful and unsuccessful groups is found in the ‘age’ node, indicating that it is the strongest driver of success. Further, it introduces a multiscale multiplex network to model similarities between expeditions, where one or more layers may be multiscale whereas others are not. Such a multiscale approach can model a variety of systems, and the tools used here to navigate simultaneous modeling of different types of layers and project networks to a scalar space through regression are applicable in a variety of scenarios. Lastly, community detection on the expedition-similarity graph reveals three distinct communities where a difference in success rates naturally emerges amongst the communities. The dominant characteristics that support a successful outcome for a large fraction of the expedition are high expedition size, low age, and oxygen use. Future work may include study of additional factors, analysis of death factors, and multilayer or multiscale approaches to modularity optimization and community detection. Code can be found at https://github.com/chimeraki/Multiscale network mountaineering.
References
1. Betzel, R.F., Bassett, D.S.: Multi-scale brain networks. Neuroimage 160, 73–83 (2017)
2. Blondel, V.D., Guillaume, J.-L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
3. Buldú, J.M., et al.: Using network science to analyse football passing networks: dynamics, space, time, and the multilayer nature of the game. Front. Psychol. 9, 1900 (2018)
4. Ewert, A.: Why people climb: the relationship of participant motives and experience level to mountaineering. J. Leis. Res. 17(3), 241–250 (1985)
5. Gao, X., Xiao, B., Tao, D., Li, X.: A survey of graph edit distance. Pattern Anal. Appl. 13(1), 113–129 (2010). https://doi.org/10.1007/s10044-008-0141-y
6. Hammoud, Z., Kramer, F.: Multilayer networks: aspects, implementations, and application in biomedicine. Big Data Anal. 5(1), 1–18 (2020). https://doi.org/10.1186/s41044-020-00046-0
7. Helms, M.: Factors affecting evaluations of risks and hazards in mountaineering. J. Exp. Educ. 7(3), 22–24 (1984)
8. Huang, X., Chen, D., Ren, T., Wang, D.: A survey of community detection methods in multilayer networks. Data Min. Knowl. Disc. 35(1), 1–45 (2020). https://doi.org/10.1007/s10618-020-00716-6
9. Huey, R.B., Carroll, C., Salisbury, R., Wang, J.-L.: Mountaineers on Mount Everest: effects of age, sex, experience, and crowding on rates of success and death. PLoS ONE 15(8), e0236919 (2020)
10. Huey, R.B., Eguskitza, X.: Limits to human performance: elevated risks on high mountains. J. Exp. Biol. 204(18), 3115–3119 (2001)
11. Huey, R.B., Salisbury, R., Wang, J.-L., Mao, M.: Effects of age and gender on success and death of mountaineers on Mount Everest. Biol. Lett. 3(5), 498–500 (2007)
12. Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J.P., Moreno, Y., Porter, M.A.: Multilayer networks. J. Complex Netw. 2(3), 203–271 (2014)
13. Krishnagopal, S.: Multi-layer trajectory clustering: a network algorithm for disease subtyping. Biomed. Phys. Eng. Exp. 6(6), 065003 (2020)
14. Krishnagopal, S., Coelln, R.V., Shulman, L.M., Girvan, M.: Identifying and predicting Parkinson’s disease subtypes through trajectory clustering via bipartite networks. PLoS ONE 15(6), e0233296 (2020)
15. Krishnagopal, S., Lehnert, J., Poel, W., Zakharova, A., Schöll, E.: Synchronization patterns: from network motifs to hierarchical networks. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 375(2088), 20160216 (2017)
16. Larremore, D.B., Clauset, A., Jacobs, A.Z.: Efficiently inferring community structure in bipartite networks. Phys. Rev. E 90(1), 012805 (2014)
17. Lee, K.-M., Min, B., Goh, K.-I.: Towards real-world complexity: an introduction to multiplex networks. Eur. Phys. J. B 88(2), 1–20 (2015). https://doi.org/10.1140/epjb/e2015-50742-1
18. Lenormand, M., et al.: Multiscale socio-ecological networks in the age of information. PLoS ONE 13(11), e0206672 (2018)
19. Lusher, D., Robins, G., Kremer, P.: The application of social network analysis to team sports. Meas. Phys. Educ. Exerc. Sci. 14(4), 211–224 (2010)
20. Mo, H., Deng, Y.: Identifying node importance based on evidence theory in complex networks. Physica A 529, 121538 (2019)
21. Newman, M.E.: Modularity and community structure in networks. Proc. Natl. Acad. Sci. 103(23), 8577–8582 (2006)
22. Nyaupane, G., Musa, G., Higham, J., Thompson-Carr, A.: Mountaineering on Mt Everest: evolution, economy, ecology and ethics. In: Mountaineering Tourism, p. 265. Routledge, New York (2015)
23. Pereira, E.J.D.A.L., Ferreira, P.J.S., da Silva, M.F., Miranda, J.G.V., Pereira, H.B.B.: Multiscale network for 20 stock markets using DCCA. Physica A Stat. Mech. Appl. 529, 121542 (2019)
24. Pomfret, G.: Mountaineering adventure tourists: a conceptual framework for research. Tour. Manag. 27(1), 113–123 (2006)
25. Saito, K., Kimura, M., Ohara, K., Motoda, H.: Super mediator-a new centrality measure of node importance for information diffusion over social network. Inf. Sci. 329, 985–1000 (2016)
26. Salisbury, R.: The Himalayan Database: The Expedition Archives of Elizabeth Hawley. Golden: American Alpine Club (2004)
27. Sarkar, S., Henderson, J.A., Robinson, P.A.: Spectral characterization of hierarchical network modularity and limits of modularity detection. PLoS ONE 8(1), e54383 (2013)
28. Schussman, L., Lutz, L., Shaw, R., Bohnn, C.: The epidemiology of mountaineering and rock climbing accidents. J. Wilder. Med. 1(4), 235–248 (1990)
29. Solá, L., Romance, M., Criado, R., Flores, J., García del Amo, A., Boccaletti, S.: Eigenvector centrality of nodes in multiplex networks. Chaos Interdisc. J. Nonlinear Sci. 23(3), 033131 (2013)
30. Steinhaeuser, K., Chawla, N.V., Ganguly, A.R.: Complex networks as a unified framework for descriptive analysis and predictive modeling in climate science. Stat. Anal. Data Min.: ASA Data Sci. J. 4(5), 497–511 (2011)
31. Szymczak, R.K., Marosz, M., Grzywacz, T., Sawicka, M., Naczyk, M.: Death zone weather extremes mountaineers have experienced in successful ascents. Front. Physiol. 12, 998 (2021)
32. Tempest, S., Starkey, K., Ennew, C.: In the death zone: a study of limits in the 1996 Mount Everest disaster. Human Relat. 60(7), 1039–1064 (2007)
33. Tougne, J., Paty, B., Meynard, D., Martin, J.-M., Letellier, T., Rosnet, E.: Group problem solving and anxiety during a simulated mountaineering ascent. Environ. Behav. 40(1), 3–23 (2008)
34. Vaiana, M., Muldoon, S.F.: Multilayer brain networks. J. Nonlinear Sci. 30(5), 2147–2169 (2020). https://doi.org/10.1007/s00332-017-9436-8
35. Weinbruch, S., Nordby, K.-C.: Fatalities in high altitude mountaineering: a review of quantitative risk estimates. High Altitude Med. Biol. 14(4), 346–359 (2013)
36. Westhoff, J.L., Koepsell, T.D., Littell, C.T.: Effects of experience and commercialisation on survival in Himalayan mountaineering: retrospective cohort study. BMJ 344 (2012)
37. Piontti, A.P., Perra, N., Rossi, L., Samay, N., Vespignani, A.: Charting the Next Pandemic: Modeling Infectious Disease Spreading in the Data Science Age. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-319-93290-3
Data-Driven Modeling of Evacuation Decision-Making in Extreme Weather Events
Matthew Hancock1, Nafisa Halim2, Chris J. Kuhlman1(B), Achla Marathe1, Pallab Mozumder3, S. S. Ravi1, and Anil Vullikanti1
1 University of Virginia, Charlottesville, VA 22904, USA {mgh3x,cjk8gx,achla,ssr6nh,vaskumar}@virginia.edu
2 Boston University, Boston, MA 02118, USA [email protected]
3 Florida International University, Miami, FL 33199, USA [email protected]
Abstract. Data from surveys administered after Hurricane Sandy provide a wealth of information that can be used to develop models of evacuation decision-making. We use a model based on survey data for predicting whether or not a family will evacuate. The model uses 26 features for each household including its neighborhood characteristics. We augment a 1.7 million node household-level synthetic social network of Miami, Florida with public data for the requisite model features so that our population is consistent with the survey-based model. Results show that household features that drive hurricane evacuations dominate the effects of specifying large numbers of families as “early evacuators” in a contagion process, and also dominate effects of peer influence to evacuate. There is a strong network-based evacuation suppression effect from the fear of looting. We also study spatial factors affecting evacuation rates as well as policy interventions to encourage evacuation.
Keywords: Hurricane survey data · Survey-based modeling · Evacuation decision-making · Social networks · Agent-based simulation
1 Introduction
1.1 Background and Motivation
Many factors affect the decision of whether to evacuate in the face of an oncoming hurricane. These include past evacuation/hurricane experience; risk perceptions (household and human safety, storm threat, concern for looting); storm characteristics such as wind speed, rainfall, and flooding; receiving an evacuation notice; traffic gridlock; presence of children, elderly, and infirm family members; pets; the household’s education level; property protection and insurance; economic factors (household income, availability of resources); work duties; race; and having somewhere to stay [2,4,5,12,13,16].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 681–692, 2022. https://doi.org/10.1007/978-3-030-93409-5_56
No modeling work on evacuation decision-making during hurricanes takes all of these factors into account. Most papers include only a handful of factors, e.g., [15,17]. Some use conventional threshold models [10,17], such as Granovetter’s [3,6]. A few have used synthetic data (i.e., digital twin data [1]) to represent the population over which evacuation decisions are made [17]; most use stylized networks to some extent [8,10,12,17]. Furthermore, most populations are relatively small, with at most on the order of 10,000 families [10,15]; an exception is [17] with 35,064 families. There is very limited data on actual behaviors, and complex factors are at play during disaster events. Therefore, the combination of data from surveys with agent-based models provides a systematic approach for understanding evacuation behavior. In this paper, we take the first steps towards this goal by (i) using a statistical evacuation decision model with 26 features, including household and social network features; and (ii) using synthetic population data to augment a 1.7 million family-based representation of a Miami, FL social contact network. However, there are numerous modeling challenges in this process (e.g., the survey data are for the overall event, and not for daily decisions), and a better understanding of the phase space of the associated dynamical system (e.g., sensitivity analysis) can help in improving such models. In another paper in this conference [9], we undertake such a study using a stylized behavioral model, but a realistic contact network. Thus, these two works are complementary.

1.2 Our Contributions
First, we augment a synthetic population and social contact network of Miami, FL, developed in [9], where nodes are families and edges are communications between pairs of families. Specifically, we augment the 1.7 million families with additional properties from the American Community Survey (ACS) such as whether they have flood insurance, internet access, and household members that are elderly or disabled. These 26 features are required because our model of hurricane evacuation, originally presented in [12], uses these parameter values to compute a family’s daily probability of evacuation. The probability depends on both household characteristics and neighborhood (peer) effects.

Second, we perform agent-based simulations of hurricane evacuation decision-making for Miami, FL. These simulations include baseline behaviors and effects of model parameters and seeding conditions. To operationalize survey data showing that there are neighborhood effects in the social contact network on a family’s evacuation decision, we introduce two thresholds $c_u$ and $c_d$ that control the fraction of neighbors evacuating at which peer influence for evacuating and for looting, respectively, become important. These two phenomena have opposing effects: small $c_u$ values enhance evacuation from peer influence and small $c_d$ values suppress evacuation due to looting concerns. The probability of evacuation model includes two dominant contributions: (i) those from household characteristics (a term denoted $g_{hh}$ below), such as education level of the head of household, whether elderly people are family members, and whether a family has home insurance, and (ii) those from network neighbor effects (denoted $g_{net}$ below). The model is detailed in Sect. 2. These factors interact. For example, an interesting result that comes from simulations is that the household term $g_{hh}$ dominates the effect of seeding of randomly selected families as early evacuators and also dominates the effect of $c_u$. This is explained in Sect. 3 below. Geographically, we find that the evacuation rates across Miami are all non-zero, and vary spatially across the city, but that these variations are not extreme.

Third, we also conduct simulation-based intervention studies to address people’s concerns over looting. We model police allaying these concerns by visiting residential areas to tell residents that law enforcement will monitor their homes while they are evacuated. We study this effect for different patterns of police visitations to different geographic regions and for different levels of effectiveness of these interactions. We find that geographic visitation patterns can increase evacuation fractions from 0.24 to 0.42 of families, a 75% increase. This is purely a network effect. Changing the effectiveness of visits can increase the fraction of evacuating families from 0.35 to 0.42.
2 Models and Results
2.1 Network Model
We perform simulations on a human social contact network of Miami, FL. We build this network using the procedures in [1]. Briefly, a collection of synthetic humans is generated that match distributions of age and gender in Miami, FL. These individuals are grouped into households (a household may contain one person). Households are assigned home locations with (lat, long), i.e., latitude and longitude, coordinates. Each person in each household is assigned a set of activities such as work and school. Each activity has a start/end time and an associated geolocation where it takes place. In this way, people can be co-located (i.e., at the same location with overlapping visit times). Two people (that are nodes in the human contact network) who are co-located have an edge between them in the network. See [1] for further details. Because families choose to evacuate (or not), rather than individuals, we convert the individual-based social contact network into a family-based social network G(V, E), with node set V and edge set E, as follows. Since we are concerned with communication that influences evacuation decisions, we consider only those persons between the ages of 18 and 70, inclusive, as decision-makers or having the ability to influence decision-makers. Nodes $v_i \in V$ are families. Suppose person $h_i$ is a member of family $v_i$ and person $h_j$ is a member of family $v_j$. If $h_i$ and $h_j$ are co-located, then there is an edge between the respective families, i.e., $e_{ij} = \{v_i, v_j\} \in E$ of G. The graph is a simple graph, so there is at most one edge between two families. As part of this current work, we augment the network nodes (families in Miami) with attributes from the American Community Survey (ACS) to include the properties required for the evacuation model, as described in Sect. 2.2 and Table 1 below, so that simulations (Sect. 3) can use these properties with the model.
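The conversion from the individual-based contact network to the family-based network can be sketched in a few lines. The function name `family_network`, the dictionary-based inputs, and the toy data are hypothetical and serve only to illustrate the rule stated above: an edge between two distinct families exists when at least one pair of their members, both aged 18 to 70 inclusive, is co-located.

```python
def family_network(person_edges, family_of, age, min_age=18, max_age=70):
    """Return the set of undirected family-family edges implied by
    co-location edges between eligible (decision-making age) persons."""
    def is_decision_maker(person):
        return min_age <= age[person] <= max_age

    edges = set()
    for h_i, h_j in person_edges:
        if not (is_decision_maker(h_i) and is_decision_maker(h_j)):
            continue
        v_i, v_j = family_of[h_i], family_of[h_j]
        if v_i != v_j:                         # simple graph: no self-loops
            edges.add(frozenset((v_i, v_j)))   # set removes duplicate edges
    return edges


# Hypothetical toy input: five persons in three families.
family_of = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
age = {"p1": 40, "p2": 12, "p3": 35, "p4": 71, "p5": 18}
person_edges = [
    ("p1", "p3"), ("p3", "p1"),  # same contact listed twice
    ("p2", "p3"),                # p2 is under 18, so ignored
    ("p1", "p2"),                # same family, so no edge
    ("p4", "p5"),                # p4 is over 70, so ignored
    ("p3", "p5"),
]
family_edges = family_network(person_edges, family_of, age)
```

Storing each edge as a `frozenset` makes the pair unordered, so the result is a simple undirected graph with at most one edge per pair of families.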
The resulting family-based network has 1,702,038 nodes and 42,789,880 edges. The average degree is 50.3 and the maximum degree is 760. The average clustering coefficient is 0.045 and the graph diameter is nine.

2.2 Family Behavior Model
Each family $v_i$ in a network is either in state $s_i = 0$, the not-evacuating state, or state $s_i = 1$, the evacuating state. Once a family decides to evacuate, they stick with that decision. If a family is in state 0, then a model is needed to quantify under what conditions it transitions to state 1. We quantify this transition of state, 0 → 1, for a particular family $v_i$ using a state transition evacuation probability $p_{i,evac}$, as described next. Our behavioral model of hurricane evacuation was developed from survey data gathered for 1,212 respondents who experienced Hurricane Sandy in 2012 [7]. To build the model, variables that correlate with families’ evacuation decisions were identified using a Binomial Logit model; the resulting variables are provided in Table 1. A logistic regression was performed to construct the probability $p_{i,evac}$ of family $v_i$ evacuating, as a function of these variables, given by

$$p_{i,evac} = 1 / \left(1 + \left[1 / \exp(-0.835045 + g_{hh} + g_{net})\right]\right) \qquad (1)$$

with

$$g_{hh} = \sum_{i=1}^{n_i} c_i^{hh} \rho_i \quad \text{and} \quad g_{net} = \sum_{j=1}^{n_n} c_j^{net} \rho_j, \qquad (2)$$

where $g_{hh}$ represents the household-related (i.e., within-node) term whose variables $\rho_i$ and coefficients $c_i^{hh}$ are given on the left in Table 1 and $g_{net}$ represents the network (i.e., peer-effect) term whose variables $\rho_j$ and coefficients $c_j^{net}$ are given on the right in Table 1. For example, one summand of $g_{hh}$ has $c_i^{hh} = -0.165$ for $\rho_i = k_{hh}$. Since male is the reference, $k_{hh} = 0$ if the head of household is male and $k_{hh} = 1$ if the head of household is female. To estimate network effects for the term $g_{net}$, additional statistical analyses were conducted to infer the parameters given on the right side of Table 1. For all families in Miami, an evacuation vector $\eta_{i,evac} = (0, \eta_{si}, \eta_i, \eta_{vi})$ and a looting vector $\ell_{i,loot} = (0, \ell_{si}, \ell_i, \ell_{vi})$ were determined by logistic regression, using a subset of independent variables on the left in Table 1. Details are omitted here for lack of space; see [12] for details. Figure 1 contains representative plots of probability values $p_{i,evac}$ from the survey model of Eq. 1, for different conditions. These data are illustrative, to give a sense of the probability magnitudes and their changes across conditions. For example, in Fig. 1c, if looting is not important, then all of $\ell_{si}$, $\ell_i$, and $\ell_{vi}$ are zero, but if looting is somewhat important, then $\ell_{si} = 1$ and the other two variables are zero for $g_{net}$ in Eq. 2. From the plot, when the fear of looting is somewhat important, $p_{i,evac} = 0.0778$, a decrease from 0.1379, when looting is not a concern (i.e., bar “all = 0”).
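The looting example above can be checked numerically against Eq. 1. The sketch below is illustrative only: the function names are hypothetical, and the household term `g_hh` is not taken from any real household but is back-solved so that the baseline probability equals the quoted 0.1379. Adding the "somewhat important" looting coefficient of −0.640 from Table 1 to $g_{net}$ should then reproduce the quoted 0.0778.

```python
import math

INTERCEPT = -0.835045  # constant term in Eq. 1


def p_evac(g_hh, g_net=0.0):
    """Event-level evacuation probability from Eq. 1 (a logistic function)."""
    return 1.0 / (1.0 + 1.0 / math.exp(INTERCEPT + g_hh + g_net))


def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical household term chosen to match the baseline bar in Fig. 1c.
g_hh = logit(0.1379) - INTERCEPT

baseline = p_evac(g_hh)                          # looting not a concern
somewhat_important = p_evac(g_hh, g_net=-0.640)  # looting somewhat important

print(round(baseline, 4), round(somewhat_important, 4))
```

The two printed values agree with the probabilities quoted for Fig. 1c to four decimal places, which supports reading Eq. 1 as a standard logistic function of the sum of the intercept, $g_{hh}$ and $g_{net}$.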
Table 1. Logistic regression results: dependent variable $p_{i,evac}$. The variables $\rho_i$ and $\rho_j$ in Eq. 2 are given in the tables on the left and right, respectively. Similarly, $c_i^{hh}$ (left) and $c_j^{net}$ (right) are the coefficients in Eq. 2. p-values for parameters are given in [12]; variables significant at the 0.05 level are shown in italics.

Parameters and coefficients for household terms $g_{hh}$.

| Independent variable | Coeff. |
|---|---|
| Age (in years), $a_{hoh}$ | −0.00017 |
| Female (Ref: Male), $k_{hh}$ | −0.165 |
| Race (Ref: Black) | |
| White, $i_{rw}$ | −0.301 |
| Hispanic, $i_{rh}$ | 0.436 |
| Other, $i_{ro}$ | −0.423 |
| Mixed, $i_{mr}$ | −1.163 |
| Education (Ref: High school or less) | |
| Some college, $e_{sc}$ | 0.353 |
| Bachelor or higher, $e_{alb}$ | 0.397 |
| Employment status, $h_{mw}$ | 0.073 |
| Household size, $i_{hs}$ | −0.231 |
| No. of HH members who are disabled, $i_{md}$ | 0.066 |
| No. of HH members who are elderly, $i_{me}$ | 0.279 |
| Household is owned, $i_{io}$ | −0.386 |
| Living in a mobile home, $i_{mh}$ | −0.0718 |
| HH has access to the internet, $i_{ia}$ | −1.446 |
| HH Income, $i_{hi}$ | 0.015 |
| No. of vehicles owned by HH, $r_c$ | 0.056 |
| Age of house, $a_{hhs}$ | −0.0025 |
| HH has home insurance, $i_{fi}$ | 1.853 |

Parameters and coefficients for network terms $g_{net}$.

| Independent variable | Coeff. |
|---|---|
| Evacuation decision made by neighbors $\eta_{i,evac}$ (Ref: not important) | |
| Somewhat important, $\eta_{si}$ | 0.125 |
| Important, $\eta_i$ | 0.523 |
| Very important, $\eta_{vi}$ | 0.478 |
| Concerns about crime such as looting $\ell_{i,loot}$ (Ref: not important) | |
| Somewhat important, $\ell_{si}$ | −0.640 |
| Important, $\ell_i$ | −1.284 |
| Very important, $\ell_{vi}$ | −1.263 |
| Interaction (neighbor and looting), $\beta_{el}$ | 0.053 |
2.3 Agent-Based Model for Simulation
To produce a temporal agent-based model (ABM) for agent-based simulation (ABS) of evacuation behavior, Eq. 1 requires two modifications. First, $p_{i,\mathrm{evac}}$ from survey data is a single probability over the entire hurricane event. For ABS, we seek a daily probability to simulate temporal decision making by families in Miami, FL. The daily probability $p^{\mathrm{daily}}_{i,\mathrm{evac}}$ uses the geometric mean given by

$$p^{\mathrm{daily}}_{i,\mathrm{evac}} = 1 - (1 - p_{i,\mathrm{evac}})^{1/t_{\max}},$$

where $t_{\max} = 10$ days because we simulate the evacuation behavior ten days before (i.e., leading up to) hurricane arrival, as shown in Sect. 3. With a constant daily probability, the probability of evacuating on at least one of the $t_{\max}$ days then equals $p_{i,\mathrm{evac}}$. Second, the peer (network) effects of evacuating and looting in the right of Table 1 require further model constructs. This is because the neighbor-evacuation vector $\eta_{i,\mathrm{evac}}$ and the looting-concern vector (subscript $i,\mathrm{loot}$ in Table 1), which are also derived from the survey data, cannot be operationalized directly. For example, if for some family neighbor influence is "important," then the question naturally arises of how to quantify this effect (i.e., discriminate this effect) if the family has two, eight, or 12 neighboring families evacuating. To address this ambiguity, we introduce two new parameters cu and cd, which are thresholds, with meanings as follows. If for family vi the fraction of neighbors evacuated is ≥ cu, then evacuation effects are activated, meaning that the appropriate term from the vector $\eta_{i,\mathrm{evac}}$ is included in Eq. 2 for gnet; otherwise the "not important" variable is used. Similarly, if for family vi the fraction of neighbors evacuated is ≥ cd, then looting effects are activated, meaning that the appropriate term from the looting-concern vector is included in Eq. 2 for gnet; otherwise the "not important" variable is used. Parameters cu and cd are studied in the simulations.

Fig. 1. Probabilities of evacuation, over the entire duration of a hurricane event, from the model of Eq. 1 for (a) education (esc), (b) race, (c) looting (l), (d) home ownership (iio), (e) internet access (iia), and (f) home insurance (if i) in Table 1.
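The two constructs of this section can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the looting coefficients are the three values listed in Table 1, and the reference category "not important" is assumed to contribute a coefficient of 0.

```python
# Looting-concern coefficients as listed in Table 1. The reference
# category "not important" is ASSUMED to contribute 0.
LOOT_COEFF = {
    "somewhat important": -0.640,
    "important": -1.284,
    "very important": -1.263,
}


def daily_probability(p_event, t_max=10):
    """Convert an event-level evacuation probability into a daily one.

    Implements p_daily = 1 - (1 - p_event) ** (1 / t_max).
    """
    if not 0.0 <= p_event <= 1.0:
        raise ValueError("p_event must lie in [0, 1]")
    return 1.0 - (1.0 - p_event) ** (1.0 / t_max)


def peer_term(importance, frac_neighbors_evacuated, threshold, coeffs):
    """Return the peer-effect coefficient for one family (hypothetical helper).

    The term for the family's stated importance level is activated only
    when the fraction of evacuated neighbors reaches the threshold
    (c_u or c_d); otherwise the reference value 0 is used.
    """
    if importance in coeffs and frac_neighbors_evacuated >= threshold:
        return coeffs[importance]
    return 0.0
```

For example, `peer_term("important", 0.3, 0.2, LOOT_COEFF)` activates the looting term because 0.3 ≥ cd = 0.2, while the same family with only 10% of its neighbors evacuated falls back to the reference value.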
3 Simulations and Results

3.1 Simulation Description and Parameters
A simulation instance consists of a set of seed nodes vj that are in state sj(t) = 1 at time t = 0. Time progresses forward in integer time steps (each representing one day), and at each time, each node (family) vi in state si(t) = 0 computes $p^{\mathrm{daily}}_{i,\mathrm{evac}}$ per Sect. 2 and performs a Bernoulli trial, in parallel, to determine its next state, i.e., si(t + 1). If si(t) = 1, then si(t + 1) = 1 for all t. Simulations are run in the interval t ∈ [0..9] to produce si(1) through si(10) for all 1 ≤ i ≤ n. A simulation consists of a group of simulation instances or replicates; here, we run 100 replicates, each having a different seed node set but otherwise the replicates are identical. All results are reported based on the mean and standard deviation of the 100 replicates at each t. Simulation parameters are listed in Table 2.
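The update rule above can be sketched as follows. This is an illustrative outline under stated assumptions: `run_instance` and `replicate_mean_std` are hypothetical names, and `daily_prob` stands in for whatever routine evaluates the daily evacuation probability of every family from the current states (the actual model also uses family features and neighbor states).

```python
import numpy as np


def run_instance(n, seeds, daily_prob, t_max=10, rng=None):
    """Run one simulation instance; return a (t_max + 1, n) array of states."""
    rng = np.random.default_rng() if rng is None else rng
    state = np.zeros(n, dtype=np.int8)
    state[np.asarray(list(seeds), dtype=np.intp)] = 1  # seed nodes start evacuated
    history = [state.copy()]
    for _ in range(t_max):
        p = np.asarray(daily_prob(state), dtype=float)  # one probability per family
        evacuates = rng.random(n) < p                   # synchronous Bernoulli trials
        # State 1 is absorbing: a family that has evacuated stays evacuated.
        state = np.maximum(state, evacuates.astype(np.int8))
        history.append(state.copy())
    return np.stack(history)


def replicate_mean_std(n, n_seeds, daily_prob, replicates=100, seed=0):
    """Mean and standard deviation of the evacuated fraction at each t."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(replicates):
        # Each replicate draws a different seed set, uniformly at random.
        seeds = rng.choice(n, size=n_seeds, replace=False)
        fractions.append(run_instance(n, seeds, daily_prob, rng=rng).mean(axis=1))
    fractions = np.array(fractions)
    return fractions.mean(axis=0), fractions.std(axis=0)
```

Because evacuation is absorbing, the evacuated fraction returned for each replicate is non-decreasing in t, which matches the cumulative curves reported in Sect. 3.2.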
Table 2. Summary of the parameters and their values used in the simulations.

Parameter | Description
Network | Miami, FL
Number of seed nodes, ns | Values are 0 and 10 to $10^5$, by powers of 10. Seed nodes are chosen uniformly at random
Family characteristics | Vary by family in family social contact network. See Table 1
Peer effect values cu, cd | Each varies from 0 to 1, in increments of 0.2
Subregions of Miami | Miami is discretized into 24 equi-sized blocks for intervention studies
3.2 Simulation Results

Cumulative Evacuation Time History Results. Figure 2 provides time histories for the fraction of families evacuating (Frac. Evac) as a function of time, for the ten days leading up to hurricane arrival (hurricane impact is on day 10). The results show that the evacuation fraction grows nonlinearly in time.

Effect of Seeding. Each plot in Fig. 2 has numbers ns of seed nodes ranging from 0 to $10^5$ families. For $n_s \le 10^4$, the effect of seeding is insignificant. A pronounced effect of ns is only realized when $n_s = 10^5$, which is approximately 6% of nodes. This is because our model is not a pure social influence model, akin to those of Granovetter and others [3,6,14] that rely on contagion spreading from seeded nodes. In our model, families can transition to the evacuating state of their own accord, without social influence, owing to family features (see left of Table 1). This is not to say that social influence is not a factor, as we address below.

Effect of Peer Influence Thresholds cu and cd. The four plots in Fig. 2 show results for different combinations of (cu, cd), each taking values of 0 and 1. These values are applied uniformly to all families. Figure 2a is the reference case where cu = cd = 0. These conditions mean that families account for peer effects in both evacuating and in concern for looting, for those families where peer evacuation and peer looting effects are somewhat important, important, or very important in the right of Table 1. That is, influence for each vi to evacuate exists for all fractions η1 of neighbors evacuating that are η1 ≥ cu = 0. Similarly, influence for each vi to remain behind (i.e., not evacuate) exists for all fractions η1 of neighbors evacuating that are η1 ≥ cd = 0. In Fig. 2b, cd is increased to 1.0. This means that looting does not become a concern for each family until all of its neighbors (i.e., a fraction of neighbors equal to cd = 1) are evacuating. Since looting is not a concern, more families evacuate in Fig. 2b than in Fig. 2a.
Figure 2c is an initially surprising case. Based on the previous reasoning, one might conclude that fewer families evacuate than in the reference case (Fig. 2a) because the influence to evacuate is essentially non-existent when cu = 1. However, families generate their own driving force to evacuate through the ghh term in Eq. 2, so reference evacuation rates are maintained. Figure 2d is consistent with the reasoning for the other three cases. The larger cd = 1 means that fear of looting is suppressed, irrespective of what a family's neighbors choose to do, and hence evacuation rates increase.

Fig. 2. Simulation results of the fraction of evacuating families in Miami, FL (Frac. Evac.) as a function of time leading up to the hurricane arrival. Time zero is the start of the simulation, which is ten days prior to hurricane landfall; day 10 is the arrival of the hurricane. In the plots, cu and cd values are either 0 or 1.0: (a) cu = 0, cd = 0; (b) cu = 0, cd = 1.0; (c) cu = 1.0, cd = 0; (d) cu = 1.0, cd = 1.0.

Spatial Evacuation Rates. Figure 3 shows three heatmaps. In all maps, there are 98 cells in the horizontal direction and 200 cells in the vertical direction, producing 19,600 grid cells over Miami. (Only about 1/3 of these cells contain landmass in Miami, owing to the spatial extent of the city.) Figure 3a shows population spatial density. Since all families have home geo-locations, each family is mapped to one grid cell. Families are counted in each cell, and the logarithm (base 10) is applied to these counts, to make density variations more distinctive. Figures 3b and 3c show the probabilities of evacuation at the end of days 6 and 10. They are generated as follows. Each simulation is composed of 100 simulation instances. For each family, we determine the fraction of these 100 instances in which it evacuates. The families within each grid cell are collected, and these fractions are averaged to obtain an average evacuation probability for that cell. These averages are plotted. Results indicate that while there is spatial variation in evacuation rates, these variations are not large.
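The per-cell aggregation described for the heatmaps can be sketched as below. This is an assumed implementation, not the authors' code: `cell_statistics` is a hypothetical name, families are assumed to be already mapped to integer cell indices, and empty cells are reported as NaN.

```python
import numpy as np


def cell_statistics(ix, iy, evac_fraction, nx=98, ny=200):
    """Aggregate per-family values onto an ny-by-nx grid.

    ix, iy        : integer cell indices of each family's home location
    evac_fraction : per-family fraction of simulation instances in which
                    the family has evacuated by the day of interest
    Returns (log10 family count per cell, mean evacuation fraction per cell);
    cells without families are NaN in both arrays.
    """
    ix = np.asarray(ix, dtype=np.intp)
    iy = np.asarray(iy, dtype=np.intp)
    evac_fraction = np.asarray(evac_fraction, dtype=float)

    flat = iy * nx + ix                      # row-major cell index
    counts = np.bincount(flat, minlength=nx * ny)
    sums = np.bincount(flat, weights=evac_fraction, minlength=nx * ny)

    occupied = counts > 0
    log_density = np.full(nx * ny, np.nan)
    log_density[occupied] = np.log10(counts[occupied])
    mean_evac = np.full(nx * ny, np.nan)
    mean_evac[occupied] = sums[occupied] / counts[occupied]
    return log_density.reshape(ny, nx), mean_evac.reshape(ny, nx)
```

The per-family input would be obtained by averaging each family's state at the end of day 6 (or day 10) over the 100 simulation instances.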
Fig. 3. Heatmaps for Miami, FL. The gradation is 98 × 200 cells in the horizontal and vertical directions, for a total of 19,600 grid cells. (a) Population density per cell (log base 10 scale). (b) Evacuation rates at the end of day 6. (c) Evacuation rates at the end of day 10. For (b) and (c), the simulation inputs are cu = cd = 0.2 and ns = 500 families.
Policy-Based Interventions. A simulation-based intervention is executed as follows. The map of Miami is overlaid with a 6 × 4 grid of equal-sized blocks so that there are 24 grid cells or blocks. The police are sent to each block, in turn, to alleviate citizens' concerns over looting (e.g., by telling families of regular patrols of their residential areas by police). This is modeled as an increase in cd, i.e., families' concerns over looting only materialize when a larger fraction of their neighbors evacuate. The police blanket the city in each of four different ways:
(i) group 1: start at northwest-most block and traverse west to east across the first of the six rows, then go south to the next row of blocks and travel west to east again, and so on for each row.
(ii) group 2: start at southwest-most block and traverse west to east across the first of the six rows, then go north to the next row of blocks and travel west to east again, and so on for each row.
(iii) group 3: start at northwest-most block and traverse north to south down the first of the four columns, then go east to the top of the next column of blocks and travel south again, and so on for each column.
(iv) group 4: start at northeast-most block and traverse north to south down the first of the four columns, then go west to the top of the next column of blocks and travel south again, and so on for each column.
Figure 4 shows the fraction of Miami families visited per block by the police in visiting the total of 24 blocks. Note that Figs. 4a and 4b are essentially mirror images and that Figs. 4c and 4d are essentially mirror images. The order of visitation of high population density regions clearly changes with group number.
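The four traversal orders can be written down compactly. The sketch below assumes an indexing convention that is not in the source: blocks are (row, col) pairs with row 0 the northernmost of the six rows and col 0 the westernmost of the four columns. The function names and the `families_per_block` mapping are likewise hypothetical.

```python
ROWS, COLS = 6, 4  # 6 x 4 grid of blocks over Miami


def traversal(group):
    """Return the order in which police visit the 24 blocks for a group."""
    rows, cols = range(ROWS), range(COLS)
    if group == 1:   # start NW, sweep each row west to east, rows north to south
        return [(r, c) for r in rows for c in cols]
    if group == 2:   # start SW, sweep each row west to east, rows south to north
        return [(r, c) for r in reversed(rows) for c in cols]
    if group == 3:   # start NW, sweep each column north to south, columns west to east
        return [(r, c) for c in cols for r in rows]
    if group == 4:   # start NE, sweep each column north to south, columns east to west
        return [(r, c) for c in reversed(cols) for r in rows]
    raise ValueError("group must be 1, 2, 3, or 4")


def cumulative_fraction_visited(order, families_per_block):
    """Cumulative fraction of families reached after each block visit."""
    total = sum(families_per_block.values())
    running, out = 0, []
    for block in order:
        running += families_per_block.get(block, 0)
        out.append(running / total)
    return out
```

Feeding per-block family counts into `cumulative_fraction_visited` for each group gives curves of the kind used to compare how early the dense blocks are reached.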
Fig. 4. Fractions of households in each of 24 equi-sized zones within the bounding box of Miami, FL. The different curves represent different traversals of the blocks by police in assuaging people’s fears of looting. Households that have been reassured by police have cd increased to 0.2 (or 0.4), from the baseline condition of 0; increasing cd dampens a family’s concern over looting. Police traversals: (a) group 1, (b) group 2, (c) group 3, and (d) group 4.
Figure 5 shows the effect of the police allaying people's concerns over looting. Each plot shows curves for the final fraction of families evacuating (i.e., at day 10), for each of the four traversal groups. The plots from left to right have increasing values of cd, from 0.2 to 0.4. First, evacuation rates increase as cd increases, as expected. Second, there are two large steps for the curves in Fig. 5 for groups 3 and 4, corresponding to the two broader peaks in the family density plots of Figs. 4c and 4d. But the green curves rise faster than the orange curves because the large population blocks are visited earlier in traversal group 4. Third, by comparison, the curves for traversal groups 1 and 2 are less steep (i.e., are more spread out) because the higher density zones in Figs. 4a and 4b are more spread out. Nonetheless, the stair-stepped nature of the curves is still apparent. Fourth, the curves in Fig. 5 for groups 1 and 2 are closer because the family density plots are more similar. The point of this case study is to demonstrate that we can quantitatively evaluate the effects of different visitation strategies. Since the order of blocks visited on the x-axis of these plots is a proxy for time, this case study shows that the group 4 visitation strategy results in more people evacuating sooner. This is one example of how counterfactual analyses may be simulated to assist policy makers in their planning.

Fig. 5. Final fractions of the Miami, FL population evacuating as a function of the cumulative number of blocks visited by police to reassure families that they will monitor property to dissuade looting. Police visit the blocks in the orders dictated by the groups in Fig. 4. The visits result in families' cd values increasing from 0 to: (a) cd = 0.2 and (b) cd = 0.4. In both plots, cu = 0 and ns = 100 seeds.

3.3 Policy Implications of Results
We examine policy implications from the standpoint of encouraging more evacuations to better safeguard human life. We highlight two issues. First, in Fig. 1, home insurance is an important factor in evacuations, which is also seen with the large positive coefficient at the bottom left in Table 1. This suggests, not surprisingly, that financial issues are important to families. Hence, governments might offer vouchers to offset expenses of evacuating or consider providing incentives to home owners for better insurance coverage. Second, allaying citizens’ fears about looting, for example through greater police patrolling before, during, and after hurricanes, or through crowd-sourced citizen watches, might increase evacuations. Our experiments illustrate issues and parameters that are important and relevant for designing interventions.
4 Conclusions
We motivated our problem in Sect. 1.1, and our contributions are summarized in Sect. 1.2. Selected policy implications are in Sect. 3.3. This study also illustrates how survey data can be used to model scenarios that are beyond the conditions of a particular hurricane. A limitation of our work is that we only address human contact networks, and do not include the effects of social media or virtual connections. This effect is hard to predict without computations: on the one hand, spreading should be faster because there are more types of pathways (face-to-face and virtual); on the other hand, this model uses relative thresholds, so the increased node degrees will inhibit contagion transmission. Also, we do not include storm-specific variables, such as hurricane path, wind speed, and storm surge, which may produce spatially heterogeneous evacuation rates. Future work also includes model validation. Based on the parameters and process we study, we believe these results are also applicable to other disaster events such as evacuations caused by wildfires and chemical spills [11].

Acknowledgments. We thank the anonymous reviewers for their helpful feedback. We thank our colleagues at NSSAC and Research Computing at The University of Virginia for providing computational resources and technical support. This work has been partially supported by University of Virginia Strategic Investment Fund award number SIF160, NSF Grant OAC-1916805 (CINES), NSF CRISP 2.0 (CMMI Grant 1916670 and CMMI Grant 1832693), NSF CMMI-1745207, and NSF Award 2122135.
References

1. Barrett, C.L., Beckman, R.J., et al.: Generation and analysis of large synthetic social contact networks. In: Winter Simulation Conference, pp. 1003–1014 (2009)
2. Burnside, R.: Leaving the big easy: an examination of the hurricane evacuation behavior of New Orleans residents before hurricane Katrina. J. Public Manag. Soc. Policy 12, 49–61 (2006)
3. Centola, D., Macy, M.: Complex contagions and the weakness of long ties. Am. J. Sociol. 113(3), 702–734 (2007)
4. Cole, T.W., Fellows, K.L.: Risk communication failure: a case study of New Orleans and Hurricane Katrina. South Commun. J. 73(3), 211–228 (2008)
5. Faucon, C.: The suspension theory: Hurricane Katrina looting, property rights, and personhood. La. Law Rev. 70(4), 1303–1338 (2010)
6. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978)
7. Halim, N., Jiang, F., et al.: Household evacuation planning and preparation for future hurricanes: role of utility service disruptions. Trans. Res. Rec. (2021)
8. Halim, N., et al.: Two-mode threshold graph dynamical systems for modeling evacuation decision-making during disaster events. In: Complex Networks (2020)
9. Hancock, M., Halim, N., et al.: Effect of peer influence and looting concerns on evacuation behavior during natural disasters. In: Complex Networks (2021)
10. Hasan, S., Ukkusuri, S.V.: A threshold model of social contagion process for evacuation decision making. Transp. Res. Part B 45, 1590–1605 (2011)
11. Kim, T., Cho, G.H.: Influence of evacuation policy on clearance time under large-scale chemical accident: an agent-based modeling. Int. J. Environ. Res. Public Health 17, 1–18 (2020)
12. Kuhlman, C., Marathe, A., Vullikanti, A., Halim, N., Mozumder, P.: Increasing evacuation during disaster events. In: AAMAS, pp. 654–662 (2020)
13. Riad, J.K., Norris, F.H., Ruback, R.B.: Predicting evacuation in two major disasters: risk perception, social influence, and access to resources. J. Appl. Soc. Psychol. 29(5), 918–934 (1999)
14. Schelling, T.C.: The Strategy of Conflict. Harvard University Press, Cambridge (1960)
15. Widener, M.J., Horner, M.W., et al.: Simulating the effects of social networks on a population's hurricane evacuation participation. J. Geogr. Syst. 15, 193–209 (2013). https://doi.org/10.1007/s10109-012-0170-3
16. Wong, S., Shaheen, S., Walker, J.: Understanding evacuee behavior: a case study of hurricane Irma. https://escholarship.org/uc/item/9370z127
17. Yang, Y., Mao, L., Metcalf, S.S.: Diffusion of hurricane evacuation behavior through a home-workplace social network: a spatially explicit agent-based simulation model. Comput. Environ. Urban Syst. 74, 13–22 (2019)
Effects of Population Structure on the Evolution of Linguistic Convention

Kaloyan Danovski(✉) and Markus Brede

School of Electronics and Computer Science, University of Southampton, Southampton, UK
{kd1u18,mb1a10}@soton.ac.uk
Abstract. We define a model for the evolution of linguistic convention in a population of agents embedded on a network, and consider the effects of topology on the population-level language dynamics. Individuals are subject to evolutionary forces that over time result in the adoption of a shared language throughout the population. The differences in convergence time to a common language and that language's communicative efficiency under different underlying social structures and population sizes are examined. We find that shorter average path lengths contribute to a faster convergence and that the final payoff of languages is unaffected by the underlying topology. Compared to models for the emergence of linguistic convention based on self-organization, we find similarities in the effects of average path lengths, but differences in the role of degree heterogeneity.

Keywords: Complex networks · Evolutionary game theory · Language games · Language evolution · Semiotic dynamics

1 Introduction
Computational methods have proven vital for enhancing linguists' ability to formulate and test hypotheses on the evolution of language. They help circumvent the lack of empirical evidence on how systems of communication emerge and change over time [11] and are useful when dealing with the complex and non-linear dynamics of human language [23]. One view, whose roots can be traced back to Lewis's philosophical studies [15] and Wittgenstein's language games [27], is that of language as a system of shared conventions that can foster the coordination necessary for effective communication. From this perspective, the focus of attention falls on studying the processes that result in the emergence and stability of such systems without a central authority imposing universal rules that necessarily lead to conventionalization. Furthermore, we are interested in how these processes are able to account for the development of efficient systems of communication – ones that are capable of supporting a large number of unambiguous conventions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 693–704, 2022. https://doi.org/10.1007/978-3-030-93409-5_57
In these so-called semiotic dynamics models [20], conventions are represented by associations between objects and signals that are shared by a population of linguistic agents. Individuals are endowed with some capacity to learn or interact, and the temporal change in the quality or outcome of communication on a population-level can be described, either analytically or numerically. While the cognitive and communicative capacity of individual agents is often limited, the resulting population-level dynamics are usually complex and non-linear in nature. In many ways this view of language evolution coincides with the study of the emergence of consensus [6], and hence uses similar methods from evolutionary game theory [24] and statistical physics [7] to analyze the system’s emergent behaviour. One approach is to study language conventionalization through the evolutionary dynamics of a population of linguistic learners. A seminal model here is the evolutionary language game (ELG) model proposed by Nowak et al. [18]. In this model, agents can reproduce, passing on their language and learning strategy, until a common language emerges as an equilibrium state. Good communicators have a selective advantage, which results in more efficient languages. A different approach is the one taken by the Naming Game (NG) model [5], originally proposed by Steels [22], which studies the dynamics resulting from simple communication between agents that lead to the emergence of self-organized conventions. The minimal version of the NG describes a population of individuals attempting to agree on one name (out of many) for an object through local, pairwise interactions. While this language representation can be considered as a simplified version of the one used in the ELG [20], there are qualitative differences between the diffusion processes observed in the two models. 
Most notably, the dynamics in the NG proceed on much shorter time scales [3], as they describe a process of intra-generational (horizontal) coordination, in contrast to the cross-generational (oblique) transmission of the ELG. Extensions of the basic ELG model have studied the effects of noise in information transmission [17] and learning biases [21], for example. However, few have examined how the underlying social structure affects the system's language dynamics. In contrast, studies of the NG often adopt this approach, showing that the topology of the population's social network can significantly affect the speed of convergence to a common word and the cognitive (memory) demand on agents [4,10]. In particular, the topologies most often studied in the Naming Game are regular lattices and rings, small-world, random and scale-free networks (see previous refs.). One attempt was made by Di Chio and Di Chio [8] to study the effects of spatial structure on the ELG by embedding agents on a 2D lattice, resulting in clustering of isolated language groups, but the effects of topology in general were not explored. The model examined in the present study is based on the ELG, with similar reproduction and learning dynamics, but the effects of different underlying social structures are studied. We therefore aim to explore if and how evolutionary dynamics for linguistic conventionalization are affected by different properties of the network topology, such as degree heterogeneity and small-world effects.
In Sect. 2, we define the model and outline the network topologies explored. Section 3 presents and discusses the results of our simulations. Finally, Sect. 4 summarizes and concludes our study, presenting avenues for future work.
2 Model

2.1 Model Definition
Below, we study a version of the ELG [18], which we modify to allow consideration of the effects of population structure on evolutionary outcomes. In more detail, our model considers a population of N agents that exist in a world containing n objects and m signals. Each agent has a language L representing an association between those objects and signals. Formally, L is defined by two matrices. The active matrix P is an n × m matrix whose entries $p_{ij}$ represent the probability of producing signal j to refer to object i. Conversely, the passive matrix Q is an m × n matrix whose entries $q_{ji}$ represent the probability of inferring object i when perceiving signal j. The rows of both P and Q represent discrete distributions, and thus sum to 1.

Consider agents $I_1$ and $I_2$, with languages $L_1$ and $L_2$, respectively. We would like to know how well these two agents can communicate with one another (or, equivalently, what the probability of successful communication between them is). In this example, communication occurs when agent $I_1$ produces a signal j in reference to an object i, and agent $I_2$ infers some object $\hat{i}$ from signal j. The communication is successful when $i = \hat{i}$, and its probability can be expressed as $p^{(1)}_{ij} q^{(2)}_{ji}$. Since multiple signals could be associated with each object, we can express the probability of successful communication between agents $I_1$ and $I_2$ (as speaker and listener, respectively) for object i as $\sum_{j=1}^{m} p^{(1)}_{ij} q^{(2)}_{ji}$. If we look at this probability over all objects, we can obtain the total probability of successful communication between $I_1$ and $I_2$, also called the ability of $I_1$ to convey information to $I_2$, expressed as $\sum_{i=1}^{n} \sum_{j=1}^{m} p^{(1)}_{ij} q^{(2)}_{ji}$. If we also account for $I_2$'s ability to convey information to $I_1$, we can obtain the symmetric payoff of communication $F(L_1, L_2)$ between the languages of the two agents:

$$F(L_1, L_2) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{m} \left( p^{(1)}_{ij} q^{(2)}_{ji} + p^{(2)}_{ij} q^{(1)}_{ji} \right). \qquad (1)$$
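As a quick check of Eq. 1, the symmetric payoff can be computed with NumPy. This is a minimal sketch with a hypothetical function name; it uses the identity that the double sum over $p_{ij} q_{ji}$ equals the trace of the matrix product PQ.

```python
import numpy as np


def payoff(P1, Q1, P2, Q2):
    """Symmetric communication payoff F(L1, L2) of Eq. 1.

    P1, P2 are n x m active matrices; Q1, Q2 are m x n passive matrices.
    sum_i sum_j p_ij * q_ji is the trace of P @ Q.
    """
    P1, Q1, P2, Q2 = (np.asarray(M, dtype=float) for M in (P1, Q1, P2, Q2))
    return 0.5 * (np.trace(P1 @ Q2) + np.trace(P2 @ Q1))
```

Two agents sharing a one-to-one language with n = m = 3 (P and Q both the identity matrix) obtain a payoff of 3, whereas two agents whose one-to-one mappings disagree on every object obtain a payoff of 0.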
Agents are embedded onto a network, such that a node represents a single agent, and a link represents the possibility of communication between two agents. The total payoff of agent I is defined as:

$$F_I = \frac{1}{|M_I|} \sum_{J \in M_I} F(L_I, L_J), \qquad (2)$$
where $M_I$ is the set of I's neighbours. Normalizing by the number of neighbours $|M_I|$ results in a payoff that is not biased towards agents with a higher degree (such as hubs in scale-free networks). This becomes important when one considers that the payoff of agents is directly tied to their evolutionary fitness. The total payoff is nevertheless heavily affected by the most common languages in an agent's neighborhood. In this way, bias towards popular agents from linguistic factors is eliminated, while evolutionary (frequency-dependent) bias towards popular languages is preserved. The type of bias towards popular agents that we are interested in emerges from the topological properties of the underlying network. For example, more connected agents have a greater influence on both individual- and population-level payoffs. See Ref. [14] for an example of the language dynamics for a model with bias towards popular agents from linguistic factors.

Language Learning. Agents obtain a language once, when they are 'created', by observing and imitating that of other agents. During this learning step, the agent constructs an n × m association matrix A, whose entries $a_{ij}$ represent the number of times the agent has observed signal j being produced to refer to object i. The matrix is populated by sampling k responses for each object from K of its parent's neighbors, chosen (with replacement) with a probability proportional to their fitness. The agent's P and Q matrices are then derived from A by normalizing its rows and columns as follows:

$$p_{ij} = a_{ij} \Big/ \sum_{l=1}^{m} a_{il}, \qquad q_{ji} = a_{ij} \Big/ \sum_{l=1}^{n} a_{lj}. \qquad (3)$$
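The learning step can be sketched as follows. This is an illustrative reading of the description above rather than the authors' code: the function names are hypothetical, and the handling of signals that were never observed (a zero column of A, for which Eq. 3 is undefined) is an assumption; here the corresponding row of Q is simply left at zero.

```python
import numpy as np


def learn_association(teacher_Ps, fitness, K=4, k=1, rng=None):
    """Build the n x m association matrix A by sampling teachers' P matrices.

    teacher_Ps : list of n x m active matrices of the parent's neighbors
    fitness    : their fitness values (teachers are drawn with probability
                 proportional to fitness, with replacement)
    """
    rng = np.random.default_rng() if rng is None else rng
    fitness = np.asarray(fitness, dtype=float)
    n, m = np.asarray(teacher_Ps[0]).shape
    A = np.zeros((n, m), dtype=int)
    teachers = rng.choice(len(teacher_Ps), size=K, replace=True,
                          p=fitness / fitness.sum())
    for t in teachers:
        for i in range(n):
            # Observe k signals produced by teacher t for object i.
            signals = rng.choice(m, size=k, p=np.asarray(teacher_Ps[t])[i])
            np.add.at(A[i], signals, 1)
    return A


def derive_language(A):
    """Derive P (row-normalized A) and Q (column-normalized A, transposed)."""
    A = np.asarray(A, dtype=float)
    row = A.sum(axis=1, keepdims=True)
    col = A.sum(axis=0, keepdims=True)
    P = np.divide(A, row, out=np.zeros_like(A), where=row > 0)
    # ASSUMPTION: a signal never observed keeps an all-zero row in Q.
    Q = np.divide(A, col, out=np.zeros_like(A), where=col > 0).T
    return P, Q
```

When every teacher already uses the same binary P, the sampled A is a multiple of that P and the learner reproduces the language exactly, which is the fixed point the dynamics converge to.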
Through this learning process, an agent constructs their A by sampling other agents' P, which in turn is used to construct their own P, which is later used by another agent to construct their A, and so on. Since the sampling is probabilistic, with a finite number of samples k, over time this results in agents with binary P matrices. The languages of agents in the first generation are initialized by generating a random, non-binary A, whose entries are uniformly sampled integers in the range [1, 9]. The convergence dynamics resulting from this learning process are illustrated in Fig. 1.

Optimal Languages. The maximum payoff between two agents speaking an identical language L is achieved when P is a binary matrix with at least a single 1 in every column (if n ≥ m) or in every row (if n ≤ m) and $Q = P^{T}$. In other words, when P and Q are as close to one-to-one mappings as possible. Formally, the maximum payoff is expressed as:

$$F_{\max} = \min\{n, m\}. \qquad (4)$$

When n = m, a one-to-one mapping is possible, in which case P and Q are permutation matrices whose rows and columns both sum to 1. A language that can achieve the maximum payoff when communicating with itself will be referred to as an optimal language.
Fig. 1. Illustration of the convergence dynamics towards a common language. Each node represents a single agent, and is colored (a) based on their payoff, with a lighter color implying higher payoff, and (b) based on their languages, with each color representing a distinct language. In the initial generation, all agents have different, randomly generated languages (b1) that are not well-suited for collective communication (a1). As the simulation progresses, some languages are adopted by multiple agents (b2), and all languages become more alike, yielding higher payoffs (a2). By the end, all agents adopt the same language (b3), and the payoff of communication is the maximum possible given that language (a3). (Colors between (a) and (b) are not related.)
Population Dynamics. The model's dynamics are studied through an agent-based simulation, in which a single reproduction step takes place at every time step. Reproduction is asexual – a parent is picked randomly from the population with a fitness-proportional probability. A new agent is 'created' and samples its language from its parent's neighbours (as described previously). The new agent replaces another member of the population at a random node in the parent's neighbourhood (with uniform probability), and the process is repeated. This update strategy is known as birth-death [19]. A number of other strategies are discussed in Ref. [1].

2.2 Network Topologies
The network topologies we have studied include regular lattices, ring (and small-world) graphs, random networks, and scale-free networks. Lattices were explored for their regular structure and usefulness for modelling real-world spatial topologies. All lattices explored are square regular grids with periodic boundary conditions (a toroidal topology). Two types of lattices are mentioned, depending on the length of their dimensions: odd-sized and even-sized. For example, an even-sized lattice might have 64 nodes arranged on an 8 × 8 grid, while odd-sized lattices might be defined by a 7 × 7 or 9 × 9 grid. This might seem like an unnecessary distinction, but it has some interesting consequences, discussed in the following section. Ring graphs were explored because of their longer average path length and higher transitivity compared to other networks studied here. Small-world networks are obtained according to the Watts-Strogatz model [26]: starting from a regular ring graph, each edge from a node to one of its neighbors has a probability p of being rewired to another random node. We examine the effect of different values of p later on. Small-world, random (Erdős-Rényi model [12]), and scale-free (Barabási-Albert model [2]) networks were used to explore the effect of degree heterogeneity and short average path lengths – properties common in real-world social networks [16,25].
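These topologies can be generated with NetworkX, the library the simulations are reported to use. The sketch below is illustrative only: `build_topologies` and `mean_degree` are hypothetical helpers, the parameter values are chosen here to give a mean degree of about 4, and the fixed-edge-count generator `gnm_random_graph` is used as one way of obtaining an Erdős-Rényi-type random graph with that mean degree.

```python
import networkx as nx


def build_topologies(side=8, p_rewire=0.1, seed=0):
    """Return example networks with N = side * side nodes and mean degree ~4."""
    n = side * side
    return {
        # Square lattice with periodic boundaries (torus): every node has degree 4.
        "lattice": nx.grid_2d_graph(side, side, periodic=True),
        # Regular ring: Watts-Strogatz with rewiring probability 0.
        "ring": nx.watts_strogatz_graph(n, 4, 0.0, seed=seed),
        # Small-world: each edge rewired with probability p_rewire.
        "small_world": nx.watts_strogatz_graph(n, 4, p_rewire, seed=seed),
        # Random graph with exactly 2n edges, i.e. mean degree 4.
        "random": nx.gnm_random_graph(n, 2 * n, seed=seed),
        # Scale-free: preferential attachment with 2 edges per new node.
        "scale_free": nx.barabasi_albert_graph(n, 2, seed=seed),
    }


def mean_degree(G):
    """Average degree 2|E| / |V| of an undirected graph."""
    return 2 * G.number_of_edges() / G.number_of_nodes()
```

For connected graphs, `nx.average_shortest_path_length(G)` gives the average path length discussed in the text; random and rewired graphs are not guaranteed to be connected, so that call may need to be restricted to the largest connected component.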
3
Results
Below, we carry out Monte Carlo simulations in order to examine the evolutionary dynamics of the model. Figure 2 plots the outcomes of multiple simulation runs along with their average. It can be seen that, starting from random languages, the evolutionary dynamics tend to improve language quality until a stationary state is reached. To proceed, we are interested in measuring convergence times and final payoffs for different network structures and population sizes. The average payoff of the population for a single simulation run is defined as $F_P = \frac{1}{N} \sum_{I=1}^{N} F_I$, where $F_I$ is the total payoff of agent I as defined in Eq. 2. The final payoff $F_{\mathrm{conv}}$ is defined as $F_P$ when the population converges on a common language (or at a finite time step limit $t_{\max}$). The convergence time $t_{\mathrm{conv}}$ is defined as the number of time steps before $F_P$ reaches a value within some threshold h of the final payoff (and remains within that threshold until the simulation reaches $t_{\max}$).

Fig. 2. Example of evolutionary dynamics for a single realization of the Monte Carlo simulation, averaged over 30 runs. The average payoff $F_P$ is shown for both individual runs (blue lines) and simulation average (orange line). This example is for N = 400 on a scale-free network, with $F_{\max} = 5$ and $t_{\max} = 2 \times 10^6$.

The use of time step limits and small population sizes is a result of long computation times. A simulation with N = 400 and a time step limit $t_{\max} = 2 \times 10^6$ takes approx. 16 hours, assuming all simulation runs can be parallelized.¹ In general, the runtime of simulations scales linearly with $N t_{\max}$.

For all simulations shown here, we use the following setup. Each agent samples K = 4 agents from its neighbourhood (with replacement) and performs k = 1 observations for each object. Languages are of size n = m = 5 (and consequently $F_{\max} = 5$). Our testing suggests that these parameters do not affect the qualitative differences in convergence between network topologies reported below. Unless otherwise stated, all networks examined have a similar average degree of 4. The convergence time threshold is h = 0.05. Results for a single configuration are averaged over 30 simulation runs.
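The convergence measurements defined in this section can be extracted from a recorded payoff trajectory as sketched below. The function names are hypothetical, the final payoff is taken to be the last recorded value of the population-average payoff, and h is interpreted as an absolute tolerance; both choices are assumptions made for illustration.

```python
import numpy as np


def population_payoff(agent_payoffs):
    """Average payoff F_P of the population at one time step."""
    return float(np.mean(agent_payoffs))


def convergence_time(fp_trajectory, h=0.05):
    """First time step from which F_P stays within h of the final payoff.

    fp_trajectory : sequence of F_P values, one per recorded time step
    Returns (t_conv, F_conv), with F_conv taken as the last recorded value.
    """
    fp = np.asarray(fp_trajectory, dtype=float)
    f_conv = fp[-1]
    outside = np.abs(fp - f_conv) > h
    if not outside.any():
        return 0, float(f_conv)
    # One step after the last excursion outside the tolerance band.
    return int(np.flatnonzero(outside)[-1]) + 1, float(f_conv)
```

Searching for the last excursion rather than the first entry into the band enforces the "remains within that threshold" condition, so a trajectory that briefly touches the final payoff and then drops again is not counted as converged at the earlier step.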
Fig. 3. Differences in mean convergence time to a common language tconv (left) and final payoffs after convergence Fconv (right) on different network topologies. Bars indicate standard error of results. Convergence times are roughly correlated with average shortest path lengths and there are no significant differences in final payoffs (except for even-sized lattices). Results are for N = 500.
3.1 Influence of Population Structure
Figure 3 (left) shows the convergence time obtained for different networks for simulations with a population size N = 500. We observe that spatially embedded topologies (lattice and ring) exhibit slower convergence compared to heterogeneous networks (even-sized lattices are an exception, since they exhibit a different convergence pattern, as discussed later). This difference could be attributed to longer average paths on ring and lattice structures, although the higher clustering on ring graphs could also negatively affect convergence speed. The convergence of small-world networks (generated according to the Watts-Strogatz model) exhibits a tconv inversely proportional to the average shortest path length, as shown in Fig. 4. This agrees with findings that the small-world property allows for fast convergence in the NG [5].

Footnote 1: Simulations were developed in Python (3.8.2), using the NetworkX (2.5.1) and NumPy (1.21.0) libraries. Runs were parallelized on the IRIDIS 5 compute cluster. Figures were generated using Matplotlib (3.4.2).
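To make the link between rewiring and average shortest path length concrete, the following sketch builds a Watts-Strogatz graph and measures its average path length by breadth-first search. It is a standard-library illustration only, not the authors' code (which used NetworkX); the function names are hypothetical, and the average is taken over reachable pairs so that a disconnected realization does not raise an error.

```python
import random
from collections import deque


def watts_strogatz(n, k, p, seed=None):
    """Ring of n nodes, each linked to its k nearest neighbours (k even,
    k < n), where every lattice edge is rewired with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for j in range(1, k // 2 + 1):
            u = (v + j) % n
            adj[v].add(u)
            adj[u].add(v)
    for j in range(1, k // 2 + 1):
        for v in range(n):
            u = (v + j) % n
            if rng.random() < p and len(adj[v]) < n - 1:
                # New endpoint: no self-loops, no duplicate edges.
                w = rng.randrange(n)
                while w == v or w in adj[v]:
                    w = rng.randrange(n)
                adj[v].discard(u)
                adj[u].discard(v)
                adj[v].add(w)
                adj[w].add(v)
    return adj


def average_path_length(adj):
    """Mean shortest-path length over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.1, 1.0):
        graph = watts_strogatz(400, 4, p, seed=1)
        print(p, round(average_path_length(graph), 2))
```

With k = 4 the graph keeps the average degree of 4 used throughout the simulations, so only the path length and clustering change as p grows.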
Although the results presented seem to suggest that scale-free networks exhibit a faster convergence than random networks, considering standard errors of the simulation results, it is not feasible to draw any definite conclusions on the differences between these topologies. It is worth noting that, contrary to what our results might suggest, studies of the NG have found random networks to support a faster convergence [4]. Additionally, unlike what can be observed in the NG [9], there is no difference in the convergence patterns of low- and high-degree nodes for our model (data not shown).
Fig. 4. Differences in convergence time tconv for small-world networks, generated using the Watts-Strogatz model [26]. The convergence times of ring graphs and random networks are given, showing that small-world graphs approach the behavior of random networks as p increases, as expected. Results are for N = 400.
Fig. 5. Scaling of convergence time tconv with population size N on different networks. Sharper increases in tconv correspond to larger average path lengths, although high clustering could also have an effect on ring graphs. Bars indicate standard errors.
Figure 3 (right) shows final payoffs Fconv for a population of size N = 500. We find that results are very close for different networks, with random networks exhibiting marginally higher payoffs compared to the rest. We also note in Fig. 3 that lattices with even-sized dimensions stand out. The difference in behavior between even-sized and odd-sized lattices arises because the dynamics on even-sized lattices tend to converge to what we call gridlock – an equilibrium state where at least one language emerges in a checkered spatial pattern, as shown in Fig. 6. A lattice in gridlock exhibits constant payoffs and no changes in languages or their spatial distribution. Gridlocks are the only observed case of an equilibrium state that can support multiple different languages. For this reason, such patterns are self-defeating: they result in lower payoffs for all members of the population compared to scenarios in which a common language has evolved.
Fig. 6. Demonstration of gridlock pattern on 2D regular lattices. In this case, there are two languages in a checkered pattern on the lattice, but gridlock can also be observed with only one language distributed in a pattern, and multiple different languages in between. Adding a single edge between any two nodes disturbs the pattern and leads to a convergence similar to that of odd-sized lattices. A lattice with static boundaries is shown here for visualization purposes – periodic boundaries were used in simulations.
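One structural reading of the checkered pattern, offered here as an assumption and not as a claim made by the authors, is that it requires a lattice on which a checkerboard two-colouring is consistent, i.e., no two neighbours share a colour. The sketch below checks this for a given lattice size and boundary type; the function names are hypothetical.

```python
def lattice_edges(rows, cols, periodic):
    """Yield each edge ((r, c), (r2, c2)) of a 2D square lattice once."""
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                r2, c2 = r + dr, c + dc
                if periodic:
                    # Wrap around the lattice boundary.
                    r2, c2 = r2 % rows, c2 % cols
                elif r2 >= rows or c2 >= cols:
                    # Static boundary: no edge leaves the lattice.
                    continue
                yield (r, c), (r2, c2)


def supports_checkerboard(rows, cols, periodic):
    """True if colouring node (r, c) by (r + c) % 2 never assigns the
    same colour to two neighbouring nodes."""
    return all(
        (r + c) % 2 != (r2 + c2) % 2
        for (r, c), (r2, c2) in lattice_edges(rows, cols, periodic)
    )
```

Under this reading, a periodic lattice with an odd side length cannot hold a perfect checkerboard because the wrap-around edges join two nodes of the same colour, whereas a lattice with static boundaries always can.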
Gridlocks can be observed on even-sized lattices with periodic boundaries and both even- and odd-sized lattices with static boundaries. Importantly, the gridlock pattern disappears when adding only a single edge between any two nodes on the lattice. Because of this instability, it is likely that this property is a result of very specific conditions of the network structure and update rule, and as such won’t be relevant in real-world scenarios. For brevity, any mention of lattices henceforth implies odd-sized lattices.

3.2 Influence of Population Size
Next, we are interested in the scaling of convergence times with population size on different types of network structures. In Fig. 5 we can see that ring graphs scale the worst with an increase in population size, followed by regular lattices. Heterogeneous networks perform better than regular structures in this regard, which could be attributed to an average shortest path length that scales logarithmically with N. In contrast to convergence times, we also note that FP does not exhibit any significant change with the population size N or between different topologies (data not shown). This suggests that population structure does not have a large effect on the development of efficient, unambiguous communication systems. Changes in other factors, such as the linguistic parameters and reproduction rules, might be better at promoting convergence towards an optimal language. It has already been shown that the learning strategy of the agents and the possibility for mistakes in learning [18], as well as a bias towards one-to-one mappings [21], can positively affect final payoffs in the mean-field case (equivalent to a fully connected network).
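The ordering of topologies by path length can be illustrated with textbook large-N approximations of the average shortest path length. These formulas are standard estimates and are not taken from the paper or fitted to its results; the function name and topology labels are hypothetical.

```python
import math


def approx_path_length(topology, n, k=4):
    """Rough large-N estimate of the average shortest path length.

    ring:    n / (2k)          (ring lattice with k neighbours per node)
    lattice: sqrt(n) / 2       (2D periodic square lattice, degree 4)
    random:  ln(n) / ln(k)     (random graph with average degree k)
    """
    if topology == "ring":
        return n / (2 * k)
    if topology == "lattice":
        return math.sqrt(n) / 2
    if topology == "random":
        return math.log(n) / math.log(k)
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, [round(approx_path_length(t, n), 2)
                  for t in ("ring", "lattice", "random")])
```

The linear, square-root and logarithmic growth of these estimates mirrors the ordering of scaling behaviour described above for ring graphs, lattices and heterogeneous networks.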
4 Conclusions
We have defined a model for the evolution of linguistic convention in which a population of language-endowed agents is subjected to evolutionary dynamics that result in the emergence of a shared communication system consisting of object-signal associations. We have examined how these dynamics change based on the underlying social structure of the population, represented by complex networks. We have found that heterogeneous networks, namely random and scale-free networks, achieve the best results in terms of the convergence time of simulations. Our conjecture that this faster convergence is a result of lower average shortest path lengths is supported by experiments that show a clear dependence of convergence times on the rewiring probability on small-world networks. Further, we have shown that the convergence time of heterogeneous networks also scales best with an increase in the population size of the network, while regular ring graphs exhibit by far the poorest scaling in this respect. We did not observe any effects of degree heterogeneity, in contrast to the NG where power-law degree distributions result in mildly slower convergence to consensus [4]. Although we find no significant differences in the final payoffs of languages, regardless of network type and population size, it would be premature to discount the role of social structure in promoting the development of efficient communication systems altogether, since it also affects the population dynamics through its interplay with the reproductive mechanics. This paper has only considered a single reproduction update strategy (birth-death), and this choice is not trivial: the roles that agents take in the reproduction process are non-symmetric, much like the roles in the standard Naming Game [10], and therefore have different effects, especially on heterogeneous networks, where agents have differing influences on the population based on their position on the network.
Additionally, since we know that linguistic factors can influence the optimality of common languages in the mean-field case, it is natural to ask whether that holds for populations embedded on complex topologies as well. The results presented here can be used by studies that employ similar models for language evolution or consensus formation to compare the impact of microscopic behaviors on the effective convergence in different networked contexts. On the other hand, comparing these results to those of models that describe the evolution of different linguistic factors and/or across different time scales can bring us closer to a more thorough understanding of the impact that network structures have on the dynamics of social systems. In the field of language evolution, such cross-model comparisons might be the only way to draw reasonable conclusions about the universal effects of social factors [13]. Lastly, we believe that considering the complexity inherent in the model described here, stemming from the number of different linguistic and evolutionary parameters available, its dynamics have been understudied, especially as they relate to different social structures. While we have focused on the effects of degree heterogeneity and path lengths, future work on this model could explore the effect of other topological properties, such as clustering, degree mixing, and community structure.
Acknowledgments. The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work. MB acknowledges support from the Alan Turing Institute (EPSRC grant EP/N510129/1, https://www.turing.ac.uk/) and the Royal Society (grant IES\R2\192206, https://royalsociety.org/).
References
1. Allen, B., Nowak, M.: Games on graphs. EMS Surv. Math. Sci. 1(1), 113–151 (2014). https://doi.org/10.4171/EMSS/3
2. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999). https://doi.org/10.1126/science.286.5439.509
3. Baronchelli, A., Dall’Asta, L., Barrat, A., Loreto, V.: Strategies for fast convergence in semiotic dynamics. In: Artificial Life X: Proceedings of the Tenth International Conference on the Simulation and Synthesis of Living Systems, pp. 480–485. MIT Press, Cambridge, Mass (2006)
4. Baronchelli, A., Dall’Asta, L., Barrat, A., Loreto, V.: The role of topology on the dynamics of the Naming Game. Eur. Phys. J. Spec. Top. 143(1), 233–235 (2007). https://doi.org/10.1140/epjst/e2007-00092-0
5. Baronchelli, A.: A gentle introduction to the minimal Naming Game. Belg. J. Linguist. 30, 171–192 (2016). https://doi.org/10.1075/bjl.30.08bar
6. Baronchelli, A.: The emergence of consensus: a primer. Roy. Soc. Open Sci. 5(2), 172189 (2018). https://doi.org/10.1098/rsos.172189
7. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Mod. Phys. 81(2), 591–646 (2009). https://doi.org/10.1103/RevModPhys.81.591
8. Chio, C.D., Chio, P.D.: Evolution of language with spatial topology. Interact. Stud. Soc. Behav. Commun. Biol. Artif. Syst. 10(1), 31–50 (2009). https://doi.org/10.1075/is.10.1.03dic
9. Dall’Asta, L., Baronchelli, A.: Microscopic activity patterns in the naming game. J. Phys. Math. Gen. 39(48), 14851–14867 (2006). https://doi.org/10.1088/0305-4470/39/48/002
10. Dall’Asta, L., Baronchelli, A., Barrat, A., Loreto, V.: Nonequilibrium dynamics of language games on complex networks. Phys. Rev. E 74(3), 036105 (2006). https://doi.org/10.1103/PhysRevE.74.036105
11. Dediu, D., de Boer, B.: Language evolution needs its own journal. J. Lang. Evol. 1(1), 1–6 (2016). https://doi.org/10.1093/jole/lzv001
12. Erdős, P., Rényi, A.: On random graphs I. Publicationes Math. Debrecen 6, 290–297 (1959)
13. Gong, T., Shuai, L.: Exploring the effect of power law social popularity on language evolution. Artif. Life 20(3), 385–408 (2014). https://direct.mit.edu/artl/article/20/3/385-408/2781
14. Kalampokis, A., Kosmidis, K., Argyrakis, P.: Evolution of vocabulary on scale-free and random networks. Phys. A: Stat. Mech. Appl. 379(2), 665–671 (2007). https://linkinghub.elsevier.com/retrieve/pii/S0378437107000362
15. Lewis, D.K.: Convention: A Philosophical Study. Harvard University Press, Cambridge (1969)
16. Newman, M.E.J.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003). https://doi.org/10.1137/S003614450342480
17. Nowak, M.A., Krakauer, D.C.: The evolution of language. Proc. Natl. Acad. Sci. 96(14), 8028–8033 (1999)
18. Nowak, M.A., Plotkin, J.B., Krakauer, D.C.: The evolutionary language game. J. Theoret. Biol. 200(2), 147–162 (1999). https://linkinghub.elsevier.com/retrieve/pii/S0022519399909815
19. Ohtsuki, H., Hauert, C., Lieberman, E., Nowak, M.A.: A simple rule for the evolution of cooperation on graphs and social networks. Nature 441(7092), 502–505 (2006). http://www.nature.com/articles/nature04605
20. Patriarca, M., Heinsalu, E., Leonard, J.L.: Languages in Space and Time: Models and Methods from Complex Systems Theory, 1st edn. Cambridge University Press, Cambridge (2020). https://www.cambridge.org/core/product/identifier/9781108671927/type/book
21. Smith, K.: The evolution of vocabulary. J. Theoret. Biol. 228(1), 127–142 (2004). https://linkinghub.elsevier.com/retrieve/pii/S0022519303004636
22. Steels, L.: A self-organizing spatial vocabulary. Artif. Life 2(3), 319–332 (1995). https://direct.mit.edu/artl/article/2/3/319-332/2251
23. Steels, L.: The synthetic modeling of language origins. Evol. Commun. 1(1), 1–34 (1997). https://doi.org/10.1075/eoc.1.1.02ste
24. Szabó, G., Fáth, G.: Evolutionary games on graphs. Phys. Rep. 446(4–6), 97–216 (2007). https://linkinghub.elsevier.com/retrieve/pii/S0370157307001810
25. Vega-Redondo, F.: Complex Social Networks. Cambridge University Press, Cambridge (2007). http://ebooks.cambridge.org/ref/id/CBO9780511804052
26. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998). http://www.nature.com/articles/30918
27. Wittgenstein, L.: Philosophical Investigations. Blackwell, Oxford, UK (1958)
Quoting is not Citing: Disentangling Affiliation and Interaction on Twitter
Camille Roth, Jonathan St-Onge, and Katrin Herms
Computational Social Science Team, Centre Marc Bloch (CNRS/Humboldt Universität), Friedrichstr. 191, 10117 Berlin, Germany
{roth,jonathan.st-onge,katrin.herms}@cmb.hu-berlin.de
Abstract. Interaction networks are generally much less homophilic than affiliation networks, accommodating many more cross-cutting links. By statistically assigning a political valence to users from their follower ties, and by further contrasting interaction and affiliation on Twitter (quotes and retweets) within specific discursive events, namely quote trees, we describe a variety of cross-cutting patterns which significantly nuance the traditional “echo chamber” narrative.
Keywords: Discussion trees · Cross-cutting interaction · Twitter
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 705–717, 2022. https://doi.org/10.1007/978-3-030-93409-5_58

The socio-semantic assortativity of online networks is now a classical result: at the macro level, social clusters are often semantically homogeneous, exhibiting for instance similar political leanings [1,16]; at the micro level of users, links form more frequently between semantically similar dyads [7,24]. These observations nonetheless depend heavily on topics [4,10] and on link types: in particular, affiliation links generally configure networks where homophily is much stronger than with interaction links [23]. On Twitter, this dichotomy separates subscriptions (followers) and (dry) retweets from mentions and replies, whereby the latter are more cross-cutting than the former [8,20]. By focusing on quote cascades on Twitter, i.e., rather short-lived discursive events featuring both link types in the same instance (namely, quotes and retweets), we aim to examine the simultaneous manifestations of the affiliation/interaction dichotomy, which is normally studied in a separate or aggregate manner.

Tweet cascades, or retweet trees, have long been studied from a diffusion perspective. Such trees are heterogeneous structurally [18] and generatively, for instance alternating broad and deep propagation dynamics [14]; their formation speed and their range depend on content type, such as true vs. false news [26]. Quotes, or tweets with comments, appeared more recently (2015), even though they are reminiscent of the original conversational use of retweets [5] before retweets became a proper tool on the platform. Research on quotes is still relatively sparse but confirms they are instrumental in (possibly antagonistic) conversation rather than propagation [12,15].

A related strand of research has questioned whether online public spaces stimulate the development of like-minded groups or foster the exposure to diverse
content [3,19]. For one, beyond a commonly observed right-/left-wing biclustering, aggregate Twitter networks exhibit a mix of supportive and oppositional relationships [25] and a certain asymmetry whereby mainstream content receives much more attention from so-called “counter-publics” than the other way around [17], all of which hints at a diversity of attitudes towards cross-cutting content and interactions. As we shall see, quote cascades on Twitter gather ephemeral publics that are generally local, in terms of time and of participants. By examining the structure of cross-cutting participation in quote trees, we also aim to contribute to the study of how local online arenas of a certain political orientation attract participants affiliated with diverse political orientations. In this regard, a series of recent results go against the grain of the traditional “echo chamber” narrative: users appear to engage heavily with content affiliated with an opposite camp, such as commenting on YouTube videos of some opposing channel [27] or posting messages on a Reddit thread of some opposing “subreddit” [21]; more precisely, there exists a continuum of roles where users are diversely embedded in bipartisan networks, i.e., are at the interface between users of opposing political affiliations, or not [11].

In a nutshell, we aim to describe the local and largely ephemeral quote tree structure with regard to the political valence both of the original content and of the users who further participate in trees in various ways; the valence is itself computed from a network observed on a much wider temporal and topological scale, thus serving as a basemap. This enables us to distinguish a variety of cross-cutting interaction patterns and roles.
Empirical Data

Perimeter and Collection. Over the whole year of 2020, we collected all publications by French-speaking Twitter users belonging to a perimeter based on the 2019 European Parliament elections. We had previously collected all tweets containing at least one hashtag among {#EU2019, #ElectionsEuropeennes2019, #Europeennes2019, #EP2019, #Européennes2019, #electionsue19, #CetteFoisJeVote} between one month before and one month after the vote (April 26–June 28, 2019), focusing on users active in French (i.e., publishing at least 15% of tweets in that language). We further required users to have published at least 5 tweets over this period (minimum activity) and be above the median number of 195 followers (minimum visibility), which reduced the number of users from 39,938 to 15,919, of which 14,102 were still active in January 2020 and 13,074 in December 2020, reflecting a relatively low attrition rate given the initial focus on the 2019 elections. Casual manual examination of this perimeter indicates that there are very few bots and that most well-known news sources or political figures have been included, thus suggesting that it represents a meaningful part of the politics-related French online Twitter space.

Tree Size and Depth. We then build all non-trivial quote trees stemming from an initial tweet, or root tweet, published in 2020. More precisely, we consider recursive cascades of quotes, restricted by construction to quotes from perimeter users,
Fig. 1. Left: in the inset, the number of tree roots per user follows a heterogeneous distribution; violin plots show the distribution of tree size (average or maximum) for users having generated a certain number of trees (few, more, many, very many). Right: coverage of the dataset attained by focusing on trees up to a certain size (solid black line) or average depth (dotted gray line), in terms of the proportion of covered nodes (i.e., quotes; represented as dots) or trees (represented as trihedrons).
while excluding quotes where a user quotes themselves, and comprising at least one quote. The dataset features 1.13 million trees generated by 12,462 unique users, i.e., about 90 trees per active user, following a usual heterogeneous distribution (inset of Fig. 1-left). Top users are unsurprisingly accounts of media and political figures generating in excess of 10k trees over the whole year, i.e., dozens a day. Besides, trees of more prolific users are generally larger on average and among the largest ones (violin plots in Fig. 1). Tree size also follows a heterogeneous law whereby 75% of all trees are of size 2 or 3, 90% of size 6 or less, and only 1% are larger than 30 nodes, as shown in Fig. 1-right. By definition, larger trees gather more quotes and thus represent a larger portion of the dataset in relative terms. To keep the focus on quotes and avoid an over-representation of relatively trivial and very small trees in the subsequent computations, we rather consider the coverage of the dataset in terms of tree nodes. This leads us to define thresholds of small, medium or large trees by considering respectively a coverage of 75% of all nodes (trees containing up to 17 nodes), 90% or less (up to 71 nodes), and the last decile (remaining trees up to a maximum of 1786 nodes).

The average depth of trees, denoted as d and computed as the average distance from the root tweet over all nodes, is generally small, with more than 90% of trees with a d of 1 (Fig. 1-right), indicating the absence of secondary quotes, or quotes of quotes. Less than 2% of trees feature a d > 1.5 (majority of secondary quotes) and less than 3% of nodes belong to such trees. On the whole, depth is a relatively rare phenomenon, as shown by the exponentially decreasing number of chains reaching a certain depth over all trees (solid line in Fig. 2). Furthermore, deeper chains correspond to ping-pongs between two individuals (A-B-A-B...)
rather than iterative quoting between distinct users (A-B-C-D...): to show this, we plot the number of distinct quoters in a chain, as a function of its maximal depth, focusing on terminal subchains of a given length w. In
Fig. 2. Left: number of unique quoters in subchains of size 3, which equals two when there is redundancy (A-B-A) and three otherwise (A-B-C), since we ignore self-quotes (A-A-*). Right: number of unique quoters in subchains of size 5.
other words, we look at the composition of the w last quoters of a chain of some depth. Histograms in Fig. 2 rely on w = 3 and w = 5 (results are similar for other window sizes w) and show a strongly increasing proportion of chains of only two distinct individuals (out of 3 or 5 possibilities) when going deeper in the tree. Such ping-pongs may correspond to a dialogical framing behavior where two users conflate the quote and reply functions. In any case, they represent a tiny portion of the data.

Political Valence of Users. We define the likely political position of users by estimating their so-called “Ideal Point” (IP), a technique first introduced to infer a unidimensional political valence of lawmakers from the set of bills they support [22] and more recently applied to Twitter users based on the set of accounts they follow [2]. This method relies on the manual attribution of a fixed valence to a small subset of bootstrap users, or “elites”, from which positions are computed for the whole dataset along affiliation links. We use here the set constructed by [6] comprising 2,013 elites of the French political realm. We then collected the follower set for all users of our dataset (as of January 2021). We were eventually able to compute the IP value of 9,815 users who follow at least 10 elites, which provided enough information for the IP estimation. We observe in Fig. 3 that IPs may roughly be broken down into three ranges, each gathering a third of the density: markedly negative values where IP < −1/3; somewhat central values around 0, IP ∈ [−1/3, 1/3]; and markedly positive values, IP > 1/3. For users whose political affiliation is explicitly known, who are also well represented in our dataset (which further confirms our good coverage of the political space), these three ranges match what is usually considered as left-wing, center and right-wing, respectively. For instance, all users who are explicitly members of PS (Parti Socialiste, left-wing) have an IP below 0 with an average around –1; while all members of LR (Les Républicains, right-wing) have an IP above 0, of average around +1. Without entering into a debate concerning the relevance of political labels based on unidimensional values, we deem IPs to be a sufficient proxy to characterize the relative political positions of users generating and participating in quote trees.
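The three IP ranges described above can be expressed as a small helper. This is a minimal sketch: the thresholds are read here as ±1/3, the labels follow the left-wing, center and right-wing correspondence mentioned in the text, and the function names are hypothetical.

```python
from collections import Counter


def ip_range(ip, bound=1 / 3):
    """Map an Ideal Point value to 'left', 'center' or 'right'.

    Values in [-bound, bound] are treated as central (assumed reading
    of the thresholds given in the text).
    """
    if ip < -bound:
        return "left"
    if ip > bound:
        return "right"
    return "center"


def range_shares(ips):
    """Fraction of IP values falling in each of the three ranges."""
    counts = Counter(ip_range(ip) for ip in ips)
    return {name: counts[name] / len(ips)
            for name in ("left", "center", "right")}
```

Applied to the IP values of all users, `range_shares` would be expected to return roughly a third for each range if the breakdown described in the text holds.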
Fig. 3. Distributions of IP values for corpus users and for tree root users, kernel density estimations (left) and cumulative distributions (right).
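The tree statistics used in the Empirical Data section (average depth d and the number of distinct quoters among the last w nodes of a chain) can be sketched as follows. The data layout is hypothetical: a tree is a mapping from each tweet to the tweet it quotes, with the root mapped to None. The sketch assumes that d is averaged over quotes only, excluding the root, and that the root author does not count as a quoter; both are our readings of the text.

```python
def depth(node, parent):
    """Distance from a node to the root (the root has parent None)."""
    d = 0
    while parent[node] is not None:
        node = parent[node]
        d += 1
    return d


def average_depth(parent):
    """Average distance to the root over all quotes (root excluded)."""
    depths = [depth(n, parent) for n in parent if parent[n] is not None]
    return sum(depths) / len(depths)


def distinct_terminal_quoters(leaf, parent, author, w=3):
    """Number of distinct authors among the last w quotes of the chain
    that ends at `leaf` (the root tweet is not a quote and is skipped)."""
    authors = []
    node = leaf
    while parent[node] is not None and len(authors) < w:
        authors.append(author[node])
        node = parent[node]
    return len(set(authors))


if __name__ == "__main__":
    # Hypothetical tree: A posts t0; B and A then alternate in a chain,
    # and C quotes the root directly.
    parent = {"t0": None, "t1": "t0", "t2": "t1", "t3": "t2", "t4": "t0"}
    author = {"t0": "A", "t1": "B", "t2": "A", "t3": "B", "t4": "C"}
    print(average_depth(parent))
    print(distinct_terminal_quoters("t3", parent, author, w=3))
```

A return value of 2 for a window of w = 3 corresponds to the A-B-A redundancy described in the caption of Fig. 2, while 3 corresponds to a chain of distinct users.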
Tree Structure and Quoting Behavior

To describe quote trees in relation to the political valence of their root author, we now exclusively focus on the 699k trees whose root tweet user has a known IP, denoted ρ. This makes up about two thirds of all trees. Relative to user IPs, the distribution of ρ over trees favors central and, to a lesser extent, right-tilted values (essentially close to +1). More precisely, half of tree roots stem from the third of users with a central IP, while about 20% stem from users with a markedly negative IP (left), and 30% with a markedly positive IP (right, with the same peak around +1). Upon casual examination, the 30 top accounts generating the most trees, which are thus also larger, belong mainly to mainstream media organizations and, to a lesser extent, centrist political figures.

General Features. We first consider the relationship between size, average depth d, and root IP ρ. Results are summarized in Fig. 4. The left panel shows the distribution of ρ for the three tree size ranges. The largest trees are more often generated by central IP users. The right panels show heat maps for each IP range and three interesting areas, in decreasing order of density: (1) both shallow and small- and medium-sized trees, by far the most frequent ones over the whole spectrum, (2) medium- to large-sized and moderately deep trees, which seem to be more often generated by central IP users, (3) small yet deep trees, whose root more often comes from IP-positive users when focusing on the 1% deepest trees (indicative of a narrow reach with a strong tendency to long chains).

First-Order Layer. Based on the above, we contend that focusing on the two first layers (i.e., primary and secondary quotes) captures most of the content framing behavior. We examine the average IP of the first layer of quotes, denoted as Q, with respect to the root user’s IP value ρ.
We also consider R, the average IP of so-called “dry” retweets of the root tweet i.e., without quoting, which we deem a proxy of its political position: retweets indeed correspond to the audience of users who plainly forward with no further framing. On the whole, we observe on Fig. 5 that R generally follows ρ. Average IP values of quoting users, by contrast, tend to diverge from both R and ρ when ρ is not central, all
Fig. 4. Left: Root user IP ρ as a function of tree size range. Right panels: heatmap of trees having a certain average depth d and size, for each of the three IP ranges.
the more for large trees as indicated on the three small panels. In other words, tweets at the root of large trees generate quotes, or framing instances, from the whole political spectrum, irrespective of the position of their retweeting audience; while smaller trees exhibit a narrower spectrum of quoting reactions, closer to ρ and thus both the root and R. Note that the standard deviations of R and Q, not shown here, are relatively constant across these spectrums—around 0.65—indicating some amount of variability around each average. Figure 6 characterizes further this divergence between quotes and retweets. We compare Q − R with: the root IP ρ (i.e., the difference between the two curves of the previous figure), the average IP of retweeters R, as well as the offset between retweets and the root IP, R − ρ. This last quantity indicates how far the retweeting population of a root tweet is from the (constant) IP of the root user. We observe first that the divergence Q − R goes, on average again, in a direction opposed to the IP value considered on the x-axis: be it ρ, R or R − ρ. For instance, divergences are increasingly negative for positive root, retweet and offset IP; and the other way around. They also remain under the y = −x curve: the magnitude of this “backlash” is thus smaller than the initial shift from the center (IP = 0). Put simply, if a root tweet is tweeted or retweeted by the left, on average, it is still going to be quoted, on average, by the left, but less so. Second, magnitudes of Q − R are larger when compared with R, and even more with R − ρ, than with ρ: they are stronger when
Fig. 5. Average IP of retweeters R or quoters Q of a root tweet from a user of IP ρ. Right panels: breakdown by tree size.
Fig. 6. Discrepancy between the average IP value of quotes vs. retweets. Right panels: breakdown by IP category of the root.
a root tweet attracts retweets from non-central users as well as from users off the “baseline” IP of the root. The small middle panel in Fig. 6 is illustrative in this regard: it focuses on trees produced by central users (−1/3 ≤ ρ ≤ 1/3) whose discrepancy Q − R is in aggregate close to 0 (as per Fig. 5). Yet, even for these root tweets from central users, Q − R grows as R or R − ρ diverge from 0. To summarize, first-layer quotes diverge more from retweets in larger trees and when root tweet users are non-central, and even more so when average retweeters are non-central or unusually off the root IP. Here again, standard deviations stay around 0.65 for all curves, indicating nonetheless a varied constellation of situations. Assuming that retweets are, on average, rather concentrated around the same IP as the root tweet user, these observations configure an instantaneous, low-level dynamics at the level of individual trees where quotes are all the more off the “baseline” of the retweeting population as this population is off the root user value. This makes it possible to hypothesize that quotes feature local counter-publics of users who come from a distinct set of IP positions to intervene to frame the original content. We thus turn to users.

User-Centric Patterns. Several of these tree-centric observations hold from a user-centric perspective. Figure 7 confirms that users retweet roots roughly along their own IP on average, albeit less so for extreme users (which may be partly explained by artefactual reasons where, e.g., users of IP > 2 do not have much content to retweet on their right). Quotes are however more diverse and, as a result, the divergence Q − R is not flat and is higher in absolute value for non-central users. The bottom left heat map illustrates the former point, the bottom central heat map the latter one.
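The quantities Q, R, Q − R and R − ρ used in this comparison reduce to simple averages over the IPs of first-layer quoters and dry retweeters. A minimal sketch for a single tree follows; the function name and input layout are hypothetical, and users without a known IP are assumed to be passed as None and skipped.

```python
from statistics import mean


def quote_retweet_divergence(root_ip, quoter_ips, retweeter_ips):
    """Return Q, R, Q - R and R - rho for one tree.

    quoter_ips:    IPs of first-layer quoters (None if unknown).
    retweeter_ips: IPs of dry retweeters of the root (None if unknown).
    """
    q = mean(ip for ip in quoter_ips if ip is not None)
    r = mean(ip for ip in retweeter_ips if ip is not None)
    return {"Q": q, "R": r, "Q-R": q - r, "R-rho": r - root_ip}
```

For example, a root with ρ = 1.0 whose retweeters average 1.0 and whose quoters average 0.3 gives Q − R = −0.7, a divergence opposed in sign to ρ and smaller in magnitude, as in the pattern described above.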
Moreover, the second heat map underlines a higher spread of Q − R for non-central users, some of them exhibiting an average divergence close to 0 (quoting on the same material they would retweet), others exhibiting a high average divergence. This hints not only at the existence of various roles, but also at the higher spread of these roles further from the center. Violin plots in Fig. 7-right support this interpretation: the top (respectively bottom) quartile for users with a negative (respectively positive) IP is above (respectively below) 0. Albeit beyond the scope of this paper, it would be most interesting to examine
Fig. 7. User-centric discrepancy of quote vs. retweet behavior.
qualitatively who these users are, both from the content they publish and from interviews, to contrast their interest in participating in an online public space.

Quotes of Quotes: Toward the Deeper Layer. While primary quotes tend to go against the polarity of the initial root tweet (all the more for non-central roots), in relative terms and all other things being equal, it is unclear whether these dynamics persist deeper in the tree and, in particular, whether secondary quotes are made by users whose IP is more aligned with that of primary quoters or not (toward, or away from, the root tweet user). To shed light on this issue, we simply compare the discrepancy between a primary quoter’s and the root’s IPs, D1 − ρ, with the discrepancy between that primary quoter and their secondary quoters, D2 − D1. We observe in Fig. 8 that secondary quotes tend to turn the tide, i.e., they stem from users yet again closer to the root, all the more when the primary quoter’s IP diverges from the root. In other words, if a quoter is more to the left than the root, a secondary quoter is going to be more to the right than the quoter, in a sort of back-and-forth movement. The amplitude of this movement is, however, smaller at the second level. It is as if the shift of secondary quotes were damped: the IP value of the second quoter is, on average, less far from the first quoter than the first quoter is from the root. Interestingly, the direction of this movement is non-monotonic for the largest trees, where second quoters are further in the same direction as the first quoters for small discrepancies (D1 − ρ), while this trend is reversed for larger discrepancies (rightmost panel of Fig. 8).
Put differently, for root tweets generating the largest numbers of quotes, there appears to be two types of quotes: those originating from quoters close to the root IP and which further attract second quoters roughly of the same polarity, and those originating from quoters that are farther and attracting second quoters in the opposite direction.
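The comparison between D1 − ρ and D2 − D1 can be illustrated with a minimal sketch. The function name, the data layout (a list of pairs holding a primary quoter's IP and the IPs of their secondary quoters) and the toy values are all hypothetical and chosen for illustration only; they are not the data or the code used in the study.

```python
from statistics import mean


def secondary_shift(root_ip, primary_quotes):
    """Return one (D1 - rho, mean(D2) - D1) point per primary quote.

    root_ip        -- ideological position (IP) of the root tweet user, rho
    primary_quotes -- list of (D1, [D2, ...]) pairs: the IP of a primary quoter
                      and the IPs of the users who quoted that quote
    Primary quotes without any secondary quote are skipped.
    """
    points = []
    for d1, secondary_ips in primary_quotes:
        if not secondary_ips:
            continue
        points.append((d1 - root_ip, mean(secondary_ips) - d1))
    return points


# Hypothetical toy tree (values are illustrative assumptions).
toy_root_ip = 0.5
toy_primary_quotes = [
    (-0.5, [0.0, 0.25]),  # quoter to the left of the root
    (1.0, [0.75]),        # quoter to the right of the root
    (0.25, []),           # no secondary quotes: ignored
]

points = secondary_shift(toy_root_ip, toy_primary_quotes)
for x, y in points:
    print(f"D1 - rho = {x:+.3f}   mean(D2) - D1 = {y:+.3f}")
```

In this toy tree the two coordinates have opposite signs for both primary quoters, which is the back-and-forth pattern described above; averaging such points within bins of D1 − ρ would give curves of the kind plotted in Fig. 8.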
Fig. 8. Comparison of the discrepancy between a primary quoter’s IP D1 and the root IP ρ (x-axis) and the average discrepancy between IPs of secondary quoters and that of their immediate parent D1 (y-axis). Panels: breakdown by tree size ranges.
Qualitative Homogeneity of Some Framing Practices. We finally qualitatively illustrate one of our findings on the behavior of Q − R by studying in more detail a handful of trees related to the above-mentioned example in the small middle panel of Fig. 6, i.e., with a central ρ close to 0. We focus on large trees, to ensure a meaningful qualitative analysis, and on keywords related to the main political measures to curb the Covid-19 crisis in France (admittedly one of the most debated issues in 2020), to ensure comparability among trees. To exemplify each region of this graph, we arbitrarily select three trees whose R is respectively negative, close to zero, and positive. They respectively deal with (1) lockdown lifting (R = −.63, Q = −.48), (2) mask mandates (R = .11, Q = −.22), and (3) vaccination (R = .80, Q = .30); see Fig. 9. Overall, as expected, we observe participation from the whole spectrum. We specifically compare quotes from cross-cutting users with those from non-cross-cutting users, i.e., users who intervene on roots that are primarily retweeted by users from the opposite vs. the same side.

As said before, quotes are framing operations and we naturally use the notion of “frames” to qualitatively detail their nature. A frame is defined as a rhetorical device to recontextualize root tweet issues through the lens of a certain perspective, including normative judgments [9]. We build frame categories using an inductive hand-coding approach typical of grounded theory [13], looking for semantic similarities among quotes of a given tree. We then group similar claims and detect 7 frame categories for each tree, which are also quite recurrent across trees, plus an eighth category, “other”, which gathers rare and isolated frames. Most frames aim at criticizing officials and their abilities to curb the crisis.
Categories differ essentially in the form of that criticism: ranging from concrete expectations (frame A), via expressed mistrust related to incompetent communication (frame B), to allegations of selfishness and malice (frame C), plain protest (frame E) or even insults and mockery (frame F). Table 1 shows the breakdown of frame categories for each of the three trees and each user political valence/color. We found that all colors are generally present in all frames. Some categories are used predominantly by a specific color: for instance, frame B (incompetence)
Fig. 9. Structure of 3 illustrative trees. Nodes are colored in blue when users have an IP < −1/3, black for central IP values ∈ [−1/3, 1/3], red for IP > 1/3, and gray for unknown IP.

Table 1. Number of quotes featuring a given frame category, frame categories for all three trees and counts per tree, broken down by user valence (“<”: IP < −1/3, “∼”: IP ∈ [−1/3, 1/3], “>”: IP > 1/3). Percentages indicate the proportions of quotes of a given color that mention a given frame. Note: quotes using multiple frames appear several times.

Frame category (nuances in parentheses for subtopics specific to tree [1], [2] or [3]):
A Call for responsibility (to [1] ensure a secure school opening, [2] improve crisis management, [3] test the vaccine first)
B Allegation of incompetent political communication
C Allegation of malice: “officials act against us, the people”
D Argumentation (around fear and risk [1 & 3] or solidarity and utility [2])
E Protest (against sending children to school [1], wearing a mask [2], getting vaccinated [3])
F Insults and mockery against leaders
G Emotional exclamations
H Other

Columns, for each of Tree 1, Tree 2 and Tree 3: total, then count and % for “<”, “∼” and “>” users. [Cell values are not recoverable from the extracted layout.]
is on average more often used by blue (43%) than red (24%) or black (19%) quotes; as is, to a lesser extent, frame A (call for responsibility). By contrast, frame F (insults/mockery) tends to appear more in red (22%) than in blue (14%) and black (8%) quotes. Interestingly, some frames are balanced, such as frames C (us/them) and D (argumentation). Keeping in mind the small size of this preliminary exploration, and even though there are remarkable variations in the use of some frames by some colors, we nonetheless hypothesize that quote frames might obey a vertical dichotomy between “us, the people” and “them, the officials” as much as a moderate horizontal dichotomy between political camps: it is as if cross-cutting interventions fulfilled a relatively similar rhetorical goal.
Concluding Remarks

We differentiated affiliation and interaction links on Twitter by focusing on a specific object featuring both link types: quote trees. We showed under which conditions these ephemeral discursive events may attract a diverse public eager to frame the initial information contained in the root tweet and coming from a more or less wide spectrum of estimated political valences. In particular, assuming that retweets reflect the “baseline” audience valence of a given root tweet, we observed that the public of quoters diverges all the more from the baseline when the root tweet has a non-central valence and attracts a larger audience. Moreover, this back-and-forth movement persists in secondary quotes, albeit in an attenuated and non-monotonic manner. At first sight, these phenomena go against the “echo chamber” narrative, at least for larger “chambers” and trees. Coming back to users, we nuanced this finding by exhibiting distinct user attitudes: while some users (especially non-central ones) quote root tweets of a valence distinct from that of the tweets they normally retweet, some users do not, which is reminiscent of a behavior more akin to echo chambers. A casual yet in-depth qualitative exploration of just three trees further showed that both cross-cutting and non-cross-cutting users nevertheless appear to partly rely on a small set of mildly shared frames. Put simply, cross-cutting interventions do not necessarily use cross-cutting frames. While shedding light on the formation and composition of counter-publics in reaction to content published in online social networks, our results hint at further research that would focus on specific regions of the figures presented in this paper, and qualify in more detail the position, behavior and claims of the corresponding users.

Acknowledgments. We are grateful to Telmo Menezes and Katharina Tittel for helping to define the user perimeter and subsequently collect Twitter data. This work was supported by the “Socsemics” Consolidator grant from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No. 772743).
References

1. Adamic, L.A., Glance, N.: The political blogosphere and the 2004 US election: divided they blog. In: LinkKDD 2005, pp. 36–43. ACM (2005)
2. Barberá, P.: Birds of the same feather tweet together: Bayesian ideal point estimation using twitter data. Polit. Anal. 23(1), 76–91 (2015)
3. Barberá, P.: Social media, echo chambers, and political polarization. Soc. Media Democracy: State Field, Prospects Reform, 34 (2020)
4. Barberá, P., Jost, J.T., Nagler, J., Tucker, J.A., Bonneau, R.: Tweeting from left to right: is online political communication more than an echo chamber? Psychol. Sci. 26(10), 1531–1542 (2015)
5. Boyd, D., Golder, S., Lotan, G.: Tweet, tweet, retweet: conversational aspects of retweeting on twitter. In: 43rd HICSS, pp. 1–10. IEEE (2010)
6. Briatte, F., Gallic, E.: Recovering the French party space from Twitter data. In: Sciences Po Quanti Workshop. Paris, France (May 2015)
7. Cinelli, M., Morales, G.D.F., Galeazzi, A., Quattrociocchi, W., Starnini, M.: The echo chamber effect on social media. PNAS 118(9), e2023301118 (2021)
8. Conover, M., Ratkiewicz, J., Francisco, M., Gonçalves, B., Flammini, A., Menczer, F.: Political polarization on twitter. In: AAAI 5th ICWSM, pp. 89–96 (2011)
9. Entman, R.M.: Framing: toward clarification of a fractured paradigm. J. Commun. 43(4), 51–58 (1993)
10. Garimella, K., De Francisci Morales, G., Gionis, A., Mathioudakis, M.: Quantifying controversy in social media. In: Proceedings of the 9th WSDM, pp. 33–42. ACM (2016)
11. Garimella, K., Gionis, A., Morales, G.D.F., Mathioudakis, M.: Political discourse on social media: echo chambers, gatekeepers, and the price of bipartisanship. In: Proceedings of the WWW 2018 Intl Conf World Wide Web, pp. 913–922. ACM (2018)
12. Garimella, K., Weber, I., De Choudhury, M.: Quote RTs on twitter: usage of the new feature for political discourse. In: 8th ACM Web Science, pp. 200–204 (2016)
13. Glaser, B.G., Strauss, A.L.: The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine, Chicago (1967)
14. Goel, S., Anderson, A., Hofman, J., Watts, D.J.: The structural virality of online diffusion. Manage. Sci. 62(1), 180–196 (2016)
15. Guerra, P., Nalon, R., Assunção, R., Meira Jr, W.: Antagonism also flows through retweets: the impact of out-of-context quotes in opinion polarization analysis. In: Proceedings of the AAAI 11th ICWSM (2017)
16. Himelboim, I., McCreery, S., Smith, M.: Birds of a feather tweet together: integrating network and content analyses to examine cross-ideology exposure on twitter. J. Comput.-Mediat. Commun. 18(2), 154–174 (2013)
17. Kaiser, J., Puschmann, C.: Alliance of antagonism: counterpublics and polarization in online climate change communication. Comm. Public 2, 371–387 (2017)
18. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: 19th International Conference World Wide Web WWW 2010, pp. 591–600. ACM (2010)
19. Lev-On, A., Manin, B.: Happy accidents: deliberation and online exposure to opposing views. In: Davies, T., Gangadharan, S.P. (eds.) Online Deliberation: Design, Research, and Practice, Chap. 7, pp. 105–122. University of Chicago Press (2009)
20. Lietz, H., Wagner, C., Bleier, A., Strohmaier, M.: When politicians talk: assessing online conversational practices of political parties on twitter. In: Proceedings of the ICWSM 8th International Conference Weblogs and Social Media, pp. 285–294. AAAI (2014)
21. Morales, G.D.F., Monti, C., Starnini, M.: No echo in the chambers of political interactions on reddit. Sci. Rep. 11(1), 1–12 (2021)
22. Poole, K.T.: Spatial Models of Parliamentary Voting. Cambridge (2005)
23. Roth, C., Cointet, J.P.: Social and semantic coevolution in knowledge networks. Soc. Netw. 32(1), 16–29 (2010)
24. Schifanella, R., Barrat, A., Cattuto, C., Markines, B., Menczer, F.: Folks in folksonomies: social link prediction from shared metadata. In: Proceedings of the WSDM Web Search Data Mining, pp. 271–280. ACM (2010)
25. Vaccari, C., Valeriani, A., Barberá, P., Bonneau, R., Jost, J., Nagler, J., Tucker, J.: Of echo chambers and contrarian clubs. Soc. Media Soc. 2(3), 1–24 (2016)
26. Vosoughi, S., Roy, D., Aral, S.: The spread of true and false news online. Science 359(6380), 1146–1151 (2018)
27. Wu, S., Resnick, P.: Cross-partisan discussions on YouTube. In: Proceedings of the AAAI ICWSM 15th International Conference Weblogs and Social Media, pp. 808–819 (2021)
Network in Finance and Economics
The COVID-19 Pandemic and Export Disruptions in the United States

John Schoeneman and Marten Brienen

Oklahoma State University, Stillwater, OK 74078, USA
{john.schoeneman,marten.brienen}@okstate.edu
Abstract. We use social network analysis to model the trade networks that connect each of the United States to the rest of the world in an effort to capture trade shocks and supply chain disruptions resulting from the COVID-19 pandemic and, more specifically, to capture how such disruptions propagate through those networks. The results show that disruptions will noticeably move along industry connections, spreading in specific patterns. Our results are also consistent with past work that shows that non-pharmaceutical policy interventions have had limited impact on trade flows.
Keywords: Trade · Supply chains · Social network analysis

1 Introduction
COVID-19 has caused both significant demand and supply shocks in international trade. The latter have conceivably been caused by policy interventions that required temporarily shutting down or slowing production as well as labor shortages caused by illness, while the former have been attributable to increased demand for some goods and decreased demand for others. Moreover, shifts in consumption patterns, such as where goods are consumed, have resulted in distribution challenges, especially for foodstuffs. In this research project we propose using social network analysis to model the trade networks that connect each of the United States to the rest of the world in an effort to capture trade shocks and supply chain disruption resulting from the COVID-19 pandemic and, more specifically, to capture how such disruptions propagate through those networks. We postulate that the high levels of interconnectedness in global trade make it likely that trade shocks and disruptions of supply chains will propagate primarily along industry-level trade networks. Modeling those networks along with trade shocks and supply chain disruption as we propose here would allow us to show not only the structure of trade networks, but also how disruptions and shocks travel along them. While we have chosen to focus on the United States due to the severity of its COVID-19 outbreak in the time sample and the availability of high-quality data, we do expect that many of the findings will be generalizable to complex economies, due to the homogenization of global trade structures.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 721–731, 2022. https://doi.org/10.1007/978-3-030-93409-5_59
2 Theory

2.1 Industry Linkages and Disruption Propagation
Prior to the COVID-19 pandemic, most discussion of supply chain disruption focused on natural disasters, geopolitical events, changes in technology, cyberattacks, and transportation failures as threats to supply chain stability. The literature divides the causes into quadrants based on controllability and whether they are internal or external to the firm experiencing the disruption [1]. Agriculture and foodstuffs hold a prominent place in the literature due to their vulnerability to uncontrollable external disruptions, but the COVID-19 pandemic has shown that most, if not all, industries are at risk of a global disruptive event. This coincides with a general trend among firms to underestimate levels of risk to their supply chain, often leaving them unprepared to respond to disruptions as they occur [25]. Moreover, while the literature indicates that both the costs associated with such disruptions and their frequency have increased globally, the underlying assumption has long remained that they tend to be rooted locally. This has meant that the mitigation strategies developed do not account for global disruptions [21]. An important outlier has been the work of Nassim Taleb, whose assessment of vulnerabilities in global supply chains that hinge on Just-In-Time manufacturing caused him to advocate for fail-safes and backup systems [20]. In this paper, we not only look at all industries, but also control for the interaction of disruptions at the global and local levels to fill the aforementioned gap in our understanding of supply chain vulnerabilities.

There are a number of extant measures of the robustness of supply chains worldwide. For example, Euromonitor International publishes a supply chain sensitivity index compiled on the basis of measures of sustainability, supply chain complexity, geographic dependence, and transportation network [10]. Sharma, Srinastava, Jindal, and Gupta’s comprehensive assessment of supply chain sensitivity, combining 26 factors, found that having a critical part supplier, location of supplier, length of supply chain lead times, the fixing process owners, and misaligned incentives were the most critical factors in supply chain robustness [17]. While both identify important aspects of supply chain vulnerability, they fail to fully account for the manner in which supply chain risk compounds as disruption spreads through industry connections. Past literature has provided theoretical grounding for this, depending largely on qualitative case studies to map out how disruptions propagate through industry-based supply chain triads of suppliers, manufacturers, and consumers [16]. Zhu et al., mapping industrial linkages using the World Input-Output Database, found that, on a global scale, asymmetrical industrial linkages could see local shocks causing serious disruptions along the supply chain [2]. In this study, we extend the theoretical framework, albeit in a simplified operationalization, using quantitative analysis for a large national market and its global connections. We hypothesize that industry connections will be a significant vehicle for the spread of disruptions between US states.
2.2 Effectiveness of Policy Interventions
Due to the exceptional nature of pandemics on the scale of COVID-19, limited analysis exists of the economic disruption they cause and of the impact of policy measures intended to mitigate it. The last comparable global pandemic in terms of severity and the number of economies affected was the Spanish Flu of 1918. In the limited literature available to us, policy assessments have found that public health interventions such as economic support and lockdowns did not have adverse economic effects and that areas that implemented them recovered more quickly [3]. The emerging literature assessing the efficacy of lockdowns for COVID-19 shows that stay-at-home orders did not impact trade, whereas workplace closures did negatively impact trade [6]. This suggests that such policy measures have a limited impact on trade flows. This previous study by Hayakawa and Mukunoki examined country-level variation and focused on stay-at-home orders and workplace closures. We build on this work by looking at domestic propagation of disruption, while including potential policy confounders such as economic support and network confounders such as cluster effects.
3 Data and Research Design
The monthly U.S. state-level commodity import and export data used in our analysis were collected by the US Census using the U.S. Customs’ Automated Commercial System [23]. For our analysis, we use data from March 2020 to December 2020. We use this cutoff both due to availability at the time of writing and to avoid having to account for changes in the federal and local responses to the pandemic as a result of the 2020 election. The import and export data are reported in total unadjusted value, in US dollars. All 50 states and the District of Columbia are included as nodes in the final networks.

To construct the dependent edge-level variable used in our models, we first constructed a bipartite graph with states as the first mode and exports at the four-digit commodity code level of the Harmonized System (HS-4) as the second mode. The edges in the bipartite graph are a measure of export disruptions, comparing the export value of the current month to a three-month window centered on the same month of the previous year. If the value of the current month was less than 75% of the minimum value in the window for the previous year, it was coded as a one for a disruption. We then collapse the bipartite graph into a monopartite graph of US states in which the edges are counts of the number of disruptions a state shares with other states at the same HS-4 commodity level. We collapse the data primarily for methodological reasons,¹ but since our goal is to measure the spread of trade disruption through industry ties, this step does not lose information that we are interested in. For robustness, we repeat this process using a 50% minimum value threshold.

¹ It is common in the network analysis literature to collapse bipartite graphs due to failed convergence in bipartite inferential models and for additional model features not available in bipartite models. Past work has shown that collapsing into a monopartite projection still preserves important information about the network [15].
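The disruption coding and the collapse into a state-by-state network can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names, the array layout (states in rows, HS-4 commodities in columns) and the toy trade values are hypothetical.

```python
import numpy as np


def disruption_flags(current, previous_window, threshold=0.75):
    """Code a disruption (1) when the current month's export value falls below
    `threshold` times the minimum value in the previous-year window.

    current         -- array of shape (states, commodities)
    previous_window -- array of shape (3, states, commodities): the three months
                       centered on the same month of the previous year
    """
    baseline = previous_window.min(axis=0)
    return (current < threshold * baseline).astype(int)


def shared_disruptions(flags):
    """Collapse state x commodity flags into a state x state matrix whose
    entries count the commodities in which both states are disrupted."""
    shared = flags @ flags.T
    np.fill_diagonal(shared, 0)  # a state does not share disruptions with itself
    return shared


# Hypothetical toy data: 3 states, 2 commodities.
previous_window = np.array([
    [[100, 80], [50, 40], [200, 10]],
    [[120, 90], [60, 40], [210, 12]],
    [[110, 85], [55, 45], [190, 11]],
])
current = np.array([[70, 85], [45, 20], [100, 5]])

flags = disruption_flags(current, previous_window, threshold=0.75)
shared = shared_disruptions(flags)
print(flags)
print(shared)
```

Calling `disruption_flags` with `threshold=0.5` gives the stricter coding used for the robustness check; in this toy example no state-commodity pair falls below that stricter threshold.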
3.1 Covariates
In addition to using US Census data for our export disruption dependent variable, we also use the import data to control for import disruptions of inputs for the export industries. The variable is constructed as a weighted count using the 2014 World Input-Output Database (WIOD) [22]. Import disruptions were first constructed in the same manner as export disruptions and then assigned weights for each HS-4 commodity. The weights were assigned using concordance tables to convert HS-4 codes to International Standard Industrial Classification (ISIC) codes and then calculating the commodity’s input value as a percentage of the total output value for an industry. Since the weights were percentages based on values in the WIOD and applied to counts of disruption, not trade values, no transformation was necessary to match real USD values. Last, they were collapsed to match the monopartite network.

To measure the impact of COVID-19 and COVID-19-related policies, we include hospitalizations per capita, an Economic Support Index, and a Lockdown Index. Hospitalizations per capita were calculated using monthly maximum hospitalization data from The Covid Tracking Project, divided by 2019 state population estimates from the US Census Bureau. The Economic Support Index and the Lockdown Index are taken from the Oxford COVID-19 Government Response Tracker (OxCGRT) [4]. The Economic Support Index includes measures that lessen the economic impact of COVID-19, including and weighting state-level variation in measures such as income support and debt relief. The Lockdown Index focuses on measures intended to control people’s behavior, including measures such as mask mandates, school and gym closings, and restrictions on gathering size and indoor dining.
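One plausible reading of the weighted import disruption covariate is sketched below. It assumes, for simplicity, a single input-share weight per commodity, whereas the text derives weights per industry from the WIOD; the function name, the weights and the disruption flags are hypothetical.

```python
import numpy as np


def weighted_shared_import_disruptions(import_flags, input_share):
    """Weighted count of import disruptions shared by each pair of states.

    import_flags -- 0/1 array of shape (states, commodities)
    input_share  -- assumed weight per commodity: its input value as a share
                    of industry output (illustrative values, not WIOD data)
    """
    weighted = (import_flags * input_share) @ import_flags.T
    np.fill_diagonal(weighted, 0.0)
    return weighted


# Hypothetical toy data: 3 states, 2 commodities.
import_flags = np.array([[1, 0], [1, 1], [0, 1]])
input_share = np.array([0.5, 0.25])

weighted_shared = weighted_shared_import_disruptions(import_flags, input_share)
print(weighted_shared)
```

Each entry adds up the weights of the commodities whose imports are disrupted in both states, so a shared disruption in a heavily used input counts for more than one in a marginal input.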
3.2 Model and Specification: The Count ERGM
Existing models of network effects in supply chain risk management have relied on complex models based in game theory [24], firm-level cluster analysis [5], Bayesian network modeling that defined edges as causes of disruption [13], and myriad others [7]. To our knowledge, we are the first to use the count-valued Exponential Random Graph Model (ERGM) [8] to model the spread of export disruptions. This model has two key advantages for the purposes of our study. First, it allows us to model network structure without assuming the independence of observations, an assumption made by the majority of generalized linear models (GLMs). For example, we include transitivity, also known as the clustering coefficient, to model the linkages between shared disruptions. Second, this model allows us to control for deviation from the specified reference distribution, including larger variance and zero inflation. These are both critical, as we know that economic disruptions in one state will impact economic conditions in other states and that the distribution of a dependent variable rarely follows a specific distribution perfectly.

The count ERGM, like all ERGMs, does not model unit-level effects as GLMs do; rather, it models the entire network as the outcome, using an iterative estimation method (MC-MLE) in which, given starting values for the parameter estimates, a Markov Chain Monte Carlo method is used to sample networks in order to approximate a probability distribution [18]. This iterative process continues until the parameter estimates and probability distribution converge. Because the ERGM family of models allows the researcher to specify both network effects and covariate effects in the model, the estimates of both end up being more accurate [12]. While other statistical modeling approaches could be used to account for network dependence while estimating covariate effects (e.g., latent space methods [11], stochastic block modeling with covariates [19], and quadratic assignment procedure [14]), these alternative methods do not permit precise estimation and testing of specific network effects. Given that part of our research objective is to test for transitivity effects, we have adopted an ERGM-based approach, using the implementation made available in the ergm.count [9] package in the R statistical software.

Under the count ERGM, the probability of the observed n × n network adjacency matrix y is:

$$\Pr_{\theta;h;g}(Y = y) = \frac{h(y)\exp(\theta \cdot g(y))}{\kappa_{h,g}(\theta)}, \qquad (1)$$
where g(y) is the vector of network statistics used to specify the model, θ is the vector of parameters that describes how those statistic values relate to the probability of observing the network, h(y) is a reference function defined on the support of y and selected to affect the shape of the baseline distribution of dyadic data (e.g., Poisson reference measure), and κ_{h,g}(θ) is the normalizing constant.

Our main models include a number of base-level convergence-related parameters, network parameters, and covariate parameters. Base-level parameters include the sum of edge values, analogous to the intercept in a GLM, as well as the sum of square root values to control for dispersion in edge values. For network effects we include a transitive weight term, specified as:

$$\text{Transitive Weights}: \quad g(y) = \sum_{(i,j) \in \mathbb{Y}} \min\left( y_{i,j}, \max_{k \in N} \min(y_{i,k}, y_{k,j}) \right).$$

This term accounts for the degree to which a large value on edge (i, j) co-occurs with pairs of large edge values with which (i, j) forms a transitive triad, i.e., weighted, undirected two-paths going from node i to k to j. Note that, because the network is undirected, cyclical and transitive triads are indistinguishable. Exogenous covariates are included by measuring the degree to which large covariate values co-occur with large edge values. Our only dyadic measure is that of shared, weighted import disruptions and is defined as:

$$\text{Dyadic Covariate}: \quad g(y, x) = \sum_{(i,j)} y_{i,j}\, x_{i,j}.$$

Lastly, we specify statistics that account for node-level (i.e., state-level) measures of COVID-19 intensity and policy measures. These parameters take the product of the node’s covariate value and the sum of the edge values in which the node is involved, defined as:

$$\text{Node Covariate}: \quad g(y, x) = \sum_{i} x_{i} \sum_{j} y_{i,j}.$$
We estimate a separate model for each month for all industries and then use a time-pooled version when estimating models by industry. This allows us to see changes in general trends across time as well as changes in the impact of variables over time throughout the pandemic, as well as average effects for each industry.
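To make the model terms of this section concrete, the sketch below evaluates the edge sum, transitive weights, dyadic covariate and node covariate statistics on a small undirected count network, and combines them into the unnormalized log-weight log h(y) + θ · g(y) under a Poisson reference measure. It is an illustration only: the toy network, the covariates and the values of θ are hypothetical, the square-root dispersion term is left out for brevity, and actual estimation (as in ergm.count) requires MCMC because the normalizing constant is intractable.

```python
import math
from itertools import combinations

import numpy as np


def edge_sum(y):
    """Sum of edge values over undirected dyads (i < j)."""
    return sum(y[i, j] for i, j in combinations(range(len(y)), 2))


def transitive_weights(y):
    """Sum over dyads of min(y_ij, max_k min(y_ik, y_kj))."""
    n = len(y)
    total = 0
    for i, j in combinations(range(n), 2):
        two_paths = [min(y[i, k], y[k, j]) for k in range(n) if k not in (i, j)]
        if two_paths:
            total += min(y[i, j], max(two_paths))
    return total


def dyadic_covariate(y, x):
    """Sum over dyads of y_ij * x_ij."""
    return sum(y[i, j] * x[i, j] for i, j in combinations(range(len(y)), 2))


def node_covariate(y, x_node):
    """Sum over nodes of x_i times the total edge value incident to i."""
    return sum(x_node[i] * y[i].sum() for i in range(len(y)))


def log_unnormalized_weight(theta, stats, y):
    """log h(y) + theta . g(y), with Poisson reference h(y) = prod 1 / y_ij!."""
    log_h = -sum(math.lgamma(y[i, j] + 1) for i, j in combinations(range(len(y)), 2))
    return log_h + float(np.dot(theta, stats))


# Hypothetical 3-state network of shared disruption counts and covariates.
y = np.array([[0, 2, 1], [2, 0, 3], [1, 3, 0]])
x_dyad = np.array([[0, 1, 0], [1, 0, 2], [0, 2, 0]])
x_node = np.array([1, 0, 2])

stats = [
    int(edge_sum(y)),
    int(transitive_weights(y)),
    int(dyadic_covariate(y, x_dyad)),
    int(node_covariate(y, x_node)),
]
theta = [0.5, -0.25, 0.1, 0.0]  # illustrative parameter values, not estimates
log_weight = log_unnormalized_weight(theta, stats, y)
print(stats, log_weight)
```

Dividing exp(log_weight) by the sum of the same quantity over all possible count networks would give the probability in Eq. (1); that sum is what the MC-MLE procedure approximates by sampling.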
4 Results
There are several important findings from our results for the overall model, shown in Fig. 1. We remind readers that the disruptions here are not overall disruptions, but shared disruptions across states, which means that interpretations concern the spread of disruptions, not overall disruption. However, there is some overlap, as an increase in shared disruptions coincides with an increase in the likelihood of overall disruptions. First, we see in Panels a and b that disruptions and the overall dispersion of shared disruptions peak in April and gradually decline over time. The second major finding is that non-pharmaceutical policy interventions (Panels f and g) had almost no impact on shared export disruptions, regardless of disruption intensity. While there are a number of sound reasons to implement lockdowns and economic support for economies in a pandemic, our results indicate that trade need not be considered as a factor in considering these measures to contain the spread of disease. Third, the transitivity coefficient (Panel c) is positive and significant for all months considered, with a small drop in the early part of the pandemic before rising again. On average, it is roughly double in effect size for more intense disruptions. This result is a strong indicator of spread through industry connections, as the edges are defined as shared disruptions in the same commodity, supporting our hypothesis that export disruptions will spread through industry ties across states. Fourth, import input disruptions (Panel d) are also positive and significant, with effects growing in size for more intense disruptions. This is also indicative of the importance of supply chains and the spread of disruptions through global trade networks. Lastly, while hospitalizations (Panel e) are on average positively correlated with shared export disruptions, the relationship is volatile, even being negative at the beginning of the
Pandemic Export Disruptions
Fig. 1. Coefficient estimates of terms in Poisson ERGMs. Bars span 95% confidence intervals. For some models, the confidence intervals are too narrow to be visible given the large range of the coefficient estimates. Circles are for models of disruptions of 75% and triangles are for 50%.
J. Schoeneman and M. Brienen
Fig. 2. Coefficient estimates of terms in time-pooled Poisson ERGMs. Bars span 95% confidence intervals. For some models, the confidence intervals are too narrow to be visible given the large range of the coefficient estimates. Circles are for models of disruptions of 75% and triangles are for 50%.
the pandemic and in late Fall, indicating that immediate pandemic intensity is not the primary driver of economic disruption. Variation of the coefficients across industries also leads to several interesting findings (Fig. 2). In Panel a we see that transitivity is mostly the same across industries. The exceptions are that industry seven (Raw Hides, Skins, Leather, and Furs) is an outlier for transitivity and that the clustering coefficient trends slightly larger for less processed (lower numbered) industries. This trend is more pronounced for input import disruptions (Panel b). This finding is interesting, as one might expect more highly processed industries to have more inputs and thus be more sensitive to disruptions in imports and to cross-industry changes. Furthermore, while there are some similarities, these findings challenge the rankings of supply chain sensitivity in Euromonitor’s Global Supply Chain index, which ranks the least and most processed industries as the most sensitive and the moderately processed as the least sensitive [10]. The hospitalization coefficients (Panel c) are just as volatile across industries as they are across time, warranting deeper investigation. Lastly, we confirm that even when broken down by industry, policy variables have little to no impact on the spread of export disruption (Panels d and e).
5 Conclusion
The global pandemic that has gripped the world since early 2020 has exacted an incalculable toll in human lives, while crippling economies for much of that year. Given the impact of the pandemic as well as of policy responses intended to limit the cost in human lives, trade disruptions were to be expected throughout supply chains. Indeed, beyond policy responses, panic-buying and other behavioral oddities caused severe disruptions in very specific supply chains very early on. Given that the previous global pandemic of 1918 took place in an environment of much lesser economic complexity, studies examining that event could not accurately predict the manner in which modern economies and industries would be affected. Modern supply chains are, after all, significantly more spread out globally. Indeed, across the globe, new debates have emerged with regard to the perceived need to ‘re-home’ certain key industries, as Just-In-Time supply chains dependent on imports from across the globe have proven to be vulnerable to disruptions in trade over which individual governments have no control. This pandemic, then, has presented us with a rather unique global challenge, as well as a rather unique opportunity to look at the robustness, or lack thereof, of global supply chains in our modern globalized economic environment. More than just a study of the impact of the current global pandemic on global supply chains, our study was intended to fill a gap in the extant and emerging literature, which has not used network-level analysis of the manner in which trade shocks and disruptions move across networks. It was our hypothesis that disruptions will noticeably move along industry connections, spreading in specific patterns, and our model appears to support this hypothesis. We believe
that this is an important finding that has application beyond the context of a global pandemic.
Notes and Comments. All data used are from publicly available sources. For replication code, please email the authors.
References
1. Agrawal, N., Pingle, S.: Mitigate supply chain vulnerability to build supply chain resilience using organisational analytical capability: a theoretical framework. Int. J. Logistics Econ. Globalisation 8(3), 272–284 (2020)
2. Cerina, F., Zhu, Z., Chessa, A., Riccaboni, M.: World input-output network. PloS one 10(7), e0134025 (2015)
3. Correia, S., Luck, S., Verner, E.: Pandemics depress the economy, public health interventions do not: evidence from the 1918 flu. SSRN (2020)
4. Hale, T., et al.: A global panel database of pandemic policies (Oxford COVID-19 government response tracker). Nat. Hum. Behav. 5(4), 529–538 (2021)
5. Hallikas, J., Puumalainen, K., Vesterinen, T., Virolainen, V.-M.: Risk-based classification of supplier relationships. J. Purch. Supply Manag. 11(2–3), 72–82 (2005)
6. Hayakawa, K., Mukunoki, H.: Impacts of lockdown policies on international trade. Asian Econ. Pap. 20(2), 123–141 (2021)
7. Hosseini, S., Ivanov, D., Dolgui, A.: Review of quantitative methods for supply chain resilience analysis. Transp. Res. Part E: Logistics Transp. Rev. 125, 285–307 (2019)
8. Krivitsky, P.N.: Exponential-family random graph models for valued networks. Electron. J. Stat. 6, 1100 (2012)
9. Krivitsky, P.N.: ergm.count: fit, simulate and diagnose exponential-family models for networks with count edges. The Statnet Project (2016). http://www.statnet.org. R package version 3.2.2
10. Julian, L.: Supply chain sensitivity index: which manufacturing industries are most vulnerable to disruption? July 2020
11. Matias, C., Robin, S.: Modeling heterogeneity in random graphs through latent space models: a selective review. ESAIM: Proc. Surv. 47, 55–74 (2014)
12. Metz, F., Leifeld, P., Ingold, K.: Interdependent policy instrument preferences: a two-mode network approach. J. Public Policy, 1–28 (2018)
13. Ojha, R., Ghadge, A., Tiwari, M.K., Bititci, U.S.: Bayesian network modelling for supply chain risk propagation. Int. J. Prod. Res. 56(17), 5795–5819 (2018)
14. Robins, G., Lewis, J.M., Wang, P.: Statistical network analysis for analyzing policy networks. Policy Stud. J. 40(3), 375–401 (2012)
15. Saracco, F., Straka, M.J., Di Clemente, R., Gabrielli, A., Caldarelli, G., Squartini, T.: Inferring monopartite projections of bipartite networks: an entropy-based approach. New J. Phys. 19(5), 053022 (2017)
16. Scheibe, K.P., Blackhurst, J.: Supply chain disruption propagation: a systemic risk and normal accident theory perspective. Int. J. Prod. Res. 56(1–2), 43–59 (2018)
17. Sharma, S.K., Srivastava, P.R., Kumar, A., Jindal, A., Gupta, S.: Supply chain vulnerability assessment for manufacturing industry. Ann. Oper. Res. 1–31 (2021). https://doi.org/10.1007/s10479-021-04155-4
18. Snijders, T.A.B.: Markov Chain Monte Carlo estimation of exponential random graph models. J. Soc. Struct. 3(2), 1–40 (2002)
19. Sweet, T.M.: Incorporating covariates into stochastic blockmodels. J. Educ. Behav. Stat. 40(6), 635–664 (2015)
20. Taleb, N.N.: Antifragile: Things that Gain from Disorder, vol. 3. Random House Incorporated (2012)
21. Tang, C.S.: Robust strategies for mitigating supply chain disruptions. Int. J. Logistics: Res. Appl. 9(1), 33–45 (2006)
22. Timmer, M.P., Dietzenbacher, E., Los, B., Stehrer, R., De Vries, G.J.: An illustrated user guide to the world input-output database: the case of global automotive production. Rev. Int. Econ. 23(3), 575–605 (2015)
23. US Census Bureau. US import and export merchandise trade statistics. Economic Indicators Division USA Trade Online, March 2021
24. Wu, T., Blackhurst, J., O’grady, P.: Methodology for supply chain disruption analysis. Int. J. Prod. Res. 45(7), 1665–1682 (2007)
25. Zsidisin, G.A., Panelli, A., Upton, R.: Purchasing organization involvement in risk assessments, contingency plans, and risk management: an exploratory study. Supply Chain Manage. Int. J. (2000)
Default Prediction Using Network Based Features
Lorena Poenaru-Olaru1(B), Judith Redi2, Arthur Hovanesyan3, and Huijuan Wang4
1 Distributed Systems, Delft University of Technology, Delft, The Netherlands [email protected]
2 Data Science, Miro, Amsterdam, The Netherlands
3 Data Science, Exact, Delft, The Netherlands
4 Multimedia Computing, Delft University of Technology, Delft, The Netherlands
Abstract. Small and medium enterprises (SMEs) are crucial for the economy and have a higher exposure rate to default than large corporates. In this work, we address the problem of predicting the default of an SME. Default prediction models typically only consider the previous financial situation of each analysed company. Thus, they do not take into account the interactions between companies, which could be insightful as SMEs live in a supply chain ecosystem in which they constantly do business with each other. We therefore present a novel method to improve traditional default prediction models by incorporating information about the insolvency situation of customers and suppliers of a given SME, using a graph-based representation of SME supply chains. We analyze its performance and illustrate how this proposed solution outperforms the traditional default prediction approaches.
Keywords: Default prediction · Transactional network features · Network-based models · Network centrality
1 Introduction
Small and medium enterprises play a key role in the economy. In the Dutch economy, for example, they not only generate 61.8% of the overall value of the country but also maintain a high employment rate¹ (64.2% of the total employment). Furthermore, according to Chong et al. [1], they act as important suppliers for large corporates, ensuring in this way the country’s product exports and, thereby, its economic growth. Despite being considered the backbone of the economy, SMEs suffer from a higher exposure rate to default than large corporates. The primary cause of this is their considerable vulnerability to economic change [2]. Predicting beforehand that an SME will default in the future could be beneficial in preventing this event, as certain measures could be taken earlier.
¹ Small Business Act for Europe (SBA) Fact Sheet - Netherlands.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 732–743, 2022. https://doi.org/10.1007/978-3-030-93409-5_60
A plethora of learning models have been proposed in the literature for default prediction, and improving their accuracy is a direction that many authors focus on. The default prediction problem is often referred to as credit scoring prediction in the literature. For instance, Sang and Nam et al. [3] investigate the effect of parallel random forest on credit scoring prediction models, while Dayu et al. [4] look into extreme learning machine classifiers. However, most of these approaches rely only on the financial situation of each SME and on improving the prediction technique instead of looking into other features. In this paper, we refer to this type of model as the traditional default prediction model, which considers SMEs as isolated entities instead of treating them as part of a supply chain ecosystem. Recently, Misheva et al. [5] constructed a synthetic network based on the similarities between the financial features of SMEs. It has been shown that traditional default prediction models could be improved by the addition of graph features, such as node degree and closeness centrality. Beyond financial features of SMEs, we aim to explore whether the consideration of interconnections between SMEs in the supply chain ecosystem could further improve the default prediction. In this paper, we first construct a real-world transactional network composed of around 228,000 Dutch SMEs. It is a temporal (time-evolving), undirected and unweighted network, which is measured annually. Two nodes are connected by a link in a year if they have a monetary transaction in that year. We analyze the transactional network to identify evidence that the default of an SME is also related to the position of that SME in the transactional network. Furthermore, we propose a novel yearly default prediction model which incorporates both financial and network-based features.
The traditional model that only contains financial features is considered as our baseline. In the hybrid model that we propose, we systematically consider diverse nodal network features, including both network centrality features, beyond the node degree and closeness considered in [5], and graph embedding features. Hence, nodal properties derived from the network topology and the embedding space are taken into account. We further perform an in-depth analysis of existing methods for handling the issue of training on highly imbalanced classes, since the phenomenon of high class imbalance in default prediction may significantly affect the performance of machine learning models. This paper is structured as follows: In Sect. 2 we present the intuition behind our idea, as well as some network analysis to motivate why the network features could be relevant in default prediction. Section 3 introduces our prediction method, while Sect. 4 presents the performance analysis of the proposed method as well as the interpretation of the obtained results. Section 5 contains our conclusions and proposals for future work.
2 Temporal Transactional Network
The transactional network is an abstraction of SME interactions in which nodes represent SMEs and an edge between nodes in a given year indicates that the
SMEs are in a business relationship (customer and supplier). The network evolves over time in the sense that new SMEs join the network and edges may appear or disappear as business relationships are created or broken, respectively. This type of network is referred to as a temporal network, and temporal networks have proven successful in studying epidemic spreading, for instance [6]. The temporal transaction network can be regarded as 9 static network snapshots, measured yearly from 2011 until 2019. In each year, a node can be in either of two states, defaulted or non-defaulted. The defaulted state means that the SME has serious financial issues reported in a specific year or is bankrupt, while the non-defaulted state is assigned to financially healthy SMEs. We will also refer to the transactional network as a business graph. Intuitively, the interactions between SMEs could be relevant for their financial situation. Assume that the SME supplier A has two customers, SMEs B and C, in a given period. The default of either B or C may result in their inability to pay their debts to the supplier A and, therefore, in a degradation of A’s financial stability. However, if SME A collaborated with far more than two SMEs, the default of one of its counterparts would not have as severe an effect as in the first case. Therefore, the interactions between SMEs could be relevant indicators of their financial situation. This is further supported by the following basic network analysis. Defaulted Sub-network Extraction. This technique was previously used by Yoshiyuki [7] to understand the phenomenon of bankruptcy and check whether it could be modelled as an epidemic spreading process. We used the last snapshot of the network, in 2019, including both defaulted and non-defaulted nodes, from which we extracted the sub-network which contains only defaulted SMEs (nodes) and the links between them.
We focused solely on the connected components of this sub-network, in which every defaulted node can reach any other defaulted node via a path. The existence of connected components supports the possibility that defaulted nodes could contribute to the default of their neighbours. We found 3 such connected components of 7, 8 and 24 nodes, respectively. We further checked the year when the default of each node started. As an example, the component of 7 nodes is shown in Fig. 1. These defaulted nodes are close to each other in the network such that they form a connected component. Moreover, these nodes started to default at a similar time. Distance Between Defaulted Nodes. We further employ a statistical method to evaluate whether the defaulted nodes are close to each other in the network. In this sense, we examine whether the distance between defaulted nodes is smaller than the distance between non-defaulted nodes. We compare the average shortest path between defaulted nodes, E_D[H], with the one between non-defaulted nodes, E_ND[H], on the transactional network snapshot of 2019. Given the large number n of nodes and the complexity of computing the shortest path between two nodes, we made use of sampling and statistical tests to obtain the approximate E_ND[H]. We initially randomly chose 735 non-defaulted nodes (the same as the number of defaulted ones) and we calculated the average shortest path between these nodes on the transactional network in 2019. We repeated this procedure 20 times and,
Fig. 1. A representation of 7 interconnected defaulted SMEs extracted from our transactional network with respect to the year of their default.
in the end, we took the mean of the average shortest path over the 20 iterations as E_ND[H]. Indeed, the average shortest path between defaulted nodes is smaller: E_D[H] = 3.26 < E_ND[H] = 3.49. We further employ a paired difference statistical test, the Wilcoxon signed-rank test, and validate that defaulted nodes are closer to each other in the network. Hence, the position of an SME in the network is relevant for its default.
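The two analyses of this section can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' code: the network is assumed to be given as an adjacency dictionary, single-node components are assumed to be ignored, unreachable pairs are skipped, and all function names are hypothetical.

```python
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def defaulted_components(adj, defaulted):
    """Components (two or more nodes) of the sub-network induced by defaulted nodes."""
    induced = {n: [m for m in adj[n] if m in defaulted] for n in defaulted}
    seen, components = set(), []
    for node in induced:
        if node not in seen:
            component = set(bfs_distances(induced, node))
            seen |= component
            if len(component) > 1:  # assumption: isolated defaulted nodes are ignored
                components.append(component)
    return components


def mean_pairwise_distance(adj, nodes):
    """Average shortest path, measured in the full network, over reachable pairs of `nodes`."""
    nodes = set(nodes)
    total = pairs = 0
    for source in nodes:
        dist = bfs_distances(adj, source)
        for target in nodes:
            if target != source and target in dist:
                total += dist[target]
                pairs += 1
    return total / pairs if pairs else float("nan")


def sampled_healthy_distance(adj, defaulted, repeats=20, seed=0):
    """Mean over `repeats` random non-defaulted samples as large as the defaulted set."""
    rng = random.Random(seed)
    healthy = [n for n in adj if n not in defaulted]
    size = min(len(defaulted), len(healthy))
    runs = [mean_pairwise_distance(adj, rng.sample(healthy, size)) for _ in range(repeats)]
    return sum(runs) / repeats
```

Comparing `mean_pairwise_distance(adj, defaulted)` with `sampled_healthy_distance(adj, defaulted)` mirrors the comparison of E_D[H] with E_ND[H]; the significance test is not part of this sketch.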
3 Prediction Method
The objective is to predict whether an SME will default or not in a given year t + 1 based on its financial and network characteristics in the previous year t. When designing the model we pair the features calculated at the end of year t with the labels of year t + 1. In the following two subsections we motivate our choices in terms of both financial features and network features.
3.1 Financial Features
One important step of model construction was determining appropriate features. In the default prediction literature, financial features, which indicate the financial situation of an SME at a particular moment, have been widely studied. For the financial baseline model we consider the following financial coefficients as features: Cash, Current Assets, EBIT, EBITDA, Equity Book Value, Interest Expenses, Retained Earnings, Revenue, Short Term Debt, Total Assets, Total Liabilities and Working Capital. Given that we worked with a real-world data-set, we encountered missing data in terms of financial coefficients. Missing financial coefficients can be due to the inactivity of a company. Thus, we set financial coefficients for which the data is missing to 0. This also explains why we have not considered ratios of financial coefficients as features, as in other works [8].
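As a minimal sketch of how year-t features might be paired with year-(t + 1) labels while zero-filling missing coefficients, consider the snippet below. The data layout (dictionaries keyed by SME identifier and year), the column names and the function name are assumptions made for illustration only.

```python
# Hypothetical subset of the twelve financial coefficients listed above.
FINANCIAL_COLUMNS = ["cash", "ebit", "total_assets"]


def build_samples(features, labels):
    """Pair features of year t with the default label of year t + 1.

    features: {(sme_id, year): {column: value or None}}
    labels:   {(sme_id, year): 0 or 1}
    """
    samples = []
    for (sme_id, year), row in sorted(features.items()):
        target = labels.get((sme_id, year + 1))
        if target is None:
            continue  # no label for year t + 1, so the row cannot be used
        # Missing coefficients are set to 0, as described in the text.
        vector = [row.get(col) or 0.0 for col in FINANCIAL_COLUMNS]
        samples.append((sme_id, year, vector, target))
    return samples
```

Zero-filling keeps every row usable, but it would make ratios of coefficients undefined whenever the denominator is missing, which is consistent with the choice above to avoid ratio features.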
3.2 Network Features
In order to incorporate graph information, we extracted network features of the nodes, which we then combined with the financial ones to understand whether they improve the accuracy of traditional models. Through our analysis we observed that network features taken alone are not informative enough to predict default, thus we opted for the combination. We consider the following representative nodal network properties [9], also called centrality metrics:
– Node Degree, which is the number of links incident to the node.
– Clustering Coefficient, which is the probability that two neighbours of a node are connected. It measures the probability that two collaborators of an SME also collaborate.
– Eigenvector Centrality, which is the component of the principal eigenvector corresponding to the node. The principal eigenvector is the eigenvector corresponding to the largest eigenvalue of the adjacency matrix of the network. A node tends to have a large eigenvector centrality if it is connected to many well connected nodes.
Centrality metrics like closeness and betweenness [10] are not considered because of their high computational complexity, which stems from the shortest path computation. Li et al. have investigated the correlation between the network centrality metrics via both theoretical analysis and experiments in real-world networks [11]. They found that metrics with a high computational complexity, like the betweenness, tend to be correlated with metrics with a low computational complexity in diverse types of networks. This suggests that the information in closeness and betweenness could be captured by the three centrality metrics that we consider and the graph embedding features that we will introduce. Another type of network-derived features are graph embeddings, also known as network embeddings. Network embedding aims to represent a network by assigning coordinates to nodes in a low-dimensional vector space [6,12].
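The three centrality metrics listed above can be illustrated with a small, dependency-free sketch. It assumes an undirected network stored as a dictionary that maps each node to the set of its neighbours; the function names are hypothetical, and the eigenvector centrality uses a shifted power iteration (on A + I, which has the same principal eigenvector as A) rather than whatever solver the authors used.

```python
import math


def degree(adj, node):
    """Number of links incident to `node`."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked."""
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k) if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))


def eigenvector_centrality(adj, iterations=1000, tol=1e-12):
    """Principal eigenvector of the adjacency matrix via power iteration on A + I."""
    x = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # One multiplication by (A + I): own value plus the values of all neighbours.
        nxt = {n: x[n] + sum(x[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in nxt.values()))
        nxt = {n: v / norm for n, v in nxt.items()}
        if max(abs(nxt[n] - x[n]) for n in adj) < tol:
            return nxt
        x = nxt
    return x
```

On a triangle with one pendant node attached, the node carrying the pendant receives the highest eigenvector centrality and the pendant the lowest, matching the intuition that being connected to well connected nodes raises the score.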
The embedding vectors of the nodes will also be considered as network features. We use node2vec, a random-walk-based network embedding, to derive the embedding vectors of the nodes [13]. Specifically, the following configuration was considered: p = 1, q = 1, number of walks = 10, walk length = 80. We experimented with different dimensions of the embedding vectors and chose the best performing model. The optimal dimension of the graph embedding in our case was 4. We further created multiple hybrid models which contain both financial and network-based features. In this sense, we extended the financial coefficients model with the network features in order to understand whether the classification accuracy could be improved by incorporating network information.
3.3 High Class Imbalance Problem
A ubiquitous issue that arises when predicting default is the problem of having the 2 classes, defaulted and non-defaulted, extremely imbalanced. This is the
result of having annually a significantly higher number of non-defaulted SMEs than defaulted SMEs. This usually has a severe effect on the classifier’s ability to distinguish between the two classes, because the classifier does not see enough samples of the minority class to be able to extrapolate. For instance, if the training set is composed of data from 2011 until 2018, then the shares of defaulted and non-defaulted samples would be 0.06% and 99.94%, respectively. In order to overcome this problem, we employed 2 data-driven methods, undersampling and oversampling, and one algorithm-driven method, weighting, all presented in Leevy et al.’s survey [14]. We thereby undersampled the majority class to lower the number of non-defaulted SMEs and oversampled the minority class to increase the number of defaulted SMEs in the training set. The undersampling was done such that the 2.5% default rate of SMEs in the Netherlands was preserved. Thus, every year we should have around 2.5% defaulted SMEs and around 97.5% non-defaulted SMEs in the training set. We applied stratification to perform undersampling, which ensures the inclusion of SMEs from different categories. Our chosen categories were sector and company size. In terms of oversampling, we used SMOTE [15], which creates synthetic samples of the minority class by interpolating between similar samples. The weighting method assigns a higher weight to the minority class to penalise the model when it tends to classify everything as non-defaulted in order to preserve accuracy. Besides the previously mentioned techniques, we also considered combinations of them, such as undersampling + SMOTE and undersampling + weighting the minority class, as they are commonly used in the class imbalance literature.
3.4 Classifiers
As for the machine learning algorithms, we consider diverse tree-based classification models, as they have proven effective in prediction problems where the classes are strongly imbalanced, such as anomaly detection, default prediction and fraud detection [16,17]. Specifically, we employed multiple tree-based classifiers, ranging from simple classifiers, such as Decision Tree and Random Forest, to boosting algorithms, such as AdaBoost, XGBoost and LightGBM. In Sect. 4, we first evaluate all classifiers and class imbalance methods using the baseline model, where only financial features are taken into account. The best combination of classifier and class imbalance method is then identified and used to compare our hybrid model, which incorporates both network and financial features, with the baseline.
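To make the resampling ideas of Sect. 3.3 concrete before they are evaluated, the sketch below shows undersampling to a target default rate followed by a simplified SMOTE-style interpolation. It is an illustration under assumptions, not the reference SMOTE implementation [15]: stratification by sector and company size is omitted, rows are plain lists of floats, and the function names are hypothetical.

```python
import math
import random


def undersample_majority(majority, n_minority, target_rate=0.025, rng=None):
    """Keep just enough majority rows for the minority share to equal `target_rate`."""
    rng = rng or random.Random(0)
    keep = min(len(majority), round(n_minority * (1 - target_rate) / target_rate))
    return rng.sample(majority, keep)


def smote_like(minority, n_new, k=5, rng=None):
    """Create `n_new` synthetic rows between a minority row and one of its k nearest peers."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority rows to `base` by Euclidean distance (requires >= 2 rows).
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from `base` to `other`
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Because every synthetic row is a convex combination of two real minority rows, it always lies within the range spanned by the observed minority samples.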
4 Performance Analysis
In this section, we will design the experiments to evaluate our methods.
4.1 Experimental Setup
First, we give a comprehensive picture of our experimental setup. Our data-set records the financial and network information of Dutch SMEs from 2011 until 2019. We split it into a training set (samples from 2011 to 2018), which was used to learn the behaviour of defaulted and non-defaulted SMEs, and a testing set (2019), which was employed to evaluate the model’s performance. This choice is motivated by our objective to evaluate the model’s performance close to the current time and by the fact that most of the defaulted samples are reported in 2019. In Sect. 4.4, different choices of the training and test sets will be considered to explore the robustness of the model against fluctuations of the economy.
4.2 Preliminary Selection of Classifier and Class Imbalance Method
We evaluated the performance of diverse combinations of the aforementioned classifiers and class imbalance methods using the baseline model, where only financial features are considered. We measured the performance in terms of the Area Under the ROC Curve (ROC AUC) score, where the ROC curve is obtained by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR). The metric is suitable for class-imbalanced prediction problems since it shows how well the classifier distinguishes between the 2 classes. Its output ranges from 0 to 1, where 1 corresponds to perfect prediction [18].
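For readers who want to see the two evaluation metrics spelled out, the following dependency-free sketch computes the TPR from hard predictions and the ROC AUC from scores, using the equivalent rank formulation (the probability that a random defaulted sample is scored above a random non-defaulted one). The function names are hypothetical and the quadratic pairwise loop is chosen for clarity, not speed.

```python
def true_positive_rate(y_true, y_pred):
    """Share of actual positives (label 1) that were predicted as positive."""
    positives = sum(1 for t in y_true if t == 1)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return hits / positives


def roc_auc(y_true, scores):
    """Area under the ROC curve; tied positive/negative pairs count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that assigns the same score to every sample obtains 0.5 under this definition, which is the usual reference point for an uninformative model.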
Fig. 2. The prediction quality (ROC AUC) when using different classifiers and the following 6 class imbalance methods: default, in which we do not apply any method to handle class imbalance; W, in which we assign a higher weight to the minority class; OS, in which we oversample the minority class using SMOTE; US, in which we undersample the majority class; and the 2 combinations US + W and US + OS.
Figure 2 shows the results of all the employed classifiers and class imbalance methods in terms of ROC AUC. We observe that the class imbalance method composed of an undersampling technique combined with an oversampling technique (SMOTE) outperforms the other 4 for each classifier. This could be possibly explained by the fact that some non-defaulted samples could be redundant.
Thus, removing the redundancy increases the probability that the classifier sees more relevant samples, and its ability to distinguish between the defaulted and non-defaulted classes increases. Another important observation is that the XGBoost classifier achieves the highest result. Therefore, in our further experiments, we consider only the XGBoost classifier and the undersampling + SMOTE method to overcome the problem of high class imbalance.
4.3 Comparison Between Models
Furthermore, we evaluate whether the hybrid models that include both financial and network features outperform the baseline model that incorporates financial features alone. Besides ROC AUC, we also considered the TPR. Since misclassifying a defaulted SME could result in high financial losses, we need the TPR to understand the percentage of correctly identified defaulted SMEs. Different combinations of graph features will be considered in the hybrid model. We use the following abbreviations to denote features:
– Baseline financial features - B;
– Eigenvector Centrality - EC;
– Clustering Coefficient - CC;
– Node Degree - ND;
– Graph Embedding - GE.
We began by adding each graph feature to the baseline features and observing whether it improves the ROC AUC or the TPR. Table 1 shows that only the hybrid model that considers the eigenvector centrality outperformed the baseline. Furthermore, we can also observe that by adding the eigenvector centrality to the baseline, the model was able to detect around 1% more defaulted SMEs than the initial one. Although this improvement does not seem significant, the model B + EC is more suitable than the baseline, in the sense that correctly classifying as many defaulted SMEs as possible could prevent losses.

Table 1. Prediction quality of the baseline and of the hybrid models that include the baseline financial features and one graph feature. The highest ROC AUC and TPR are marked with an asterisk.

Model     ROC AUC   TPR
B         0.881     0.815
B + EC    0.884*    0.827*
B + CC    0.872     0.794
B + ND    0.878     0.809
B + GE    0.875     0.803
Furthermore, we evaluate hybrid models that include not only EC but also other graph features. The objective is to understand whether considering more network features could further improve the performance. All possible combinations with other network features beyond EC are evaluated in Table 2. We find that some combinations of network features lead to an even higher improvement in the model’s accuracy. By adding the node degree (ND) or graph embedding (GE) to the B + EC model, the ROC AUC improves further by around 1% and the TPR increases further by around 2%. This improvement is statistically significant according to McNemar’s statistical test. The addition of network features to the traditional financial models could indeed improve the default prediction.

Table 2. Prediction quality of the baseline compared with the complex hybrid models that consider the financial features and the eigenvector centrality and/or other network features. The highest ROC AUC and TPR are marked with an asterisk.

Model                    ROC AUC   TPR
B                        0.881     0.815
B + EC                   0.884     0.827
B + EC + CC              0.863     0.782
B + EC + ND              0.894*    0.848*
B + EC + GE              0.892     0.842
B + EC + CC + ND         0.878     0.815
B + EC + CC + GE         0.864     0.782
B + EC + ND + GE         0.874     0.806
B + EC + ND + GE + CC    0.861     0.776

4.4 Robustness of the Optimal Model
Within this subsection we explore how robust the best hybrid model (B + EC + ND) is. In the previous experiments, we trained the model on data from 2011 until 2018 and tested it on 2019. A robust model is supposed to perform well when tested on different years, and thus to be robust against fluctuations of the economy. To assess this, we include the samples of 2019 in the training set and hold out the samples from each of the other years, one by one, to use them as testing sets. We depict our results in Table 3. From Table 3 we can observe that the lowest obtained TPR was 0.8, which means that our model succeeded in correctly identifying at least 80% of the defaulted SMEs each year. Hence, the B + EC + ND model is relatively robust to changes in the economy, and thus reliable when deployed into production. The tests from 2014 onwards are more representative in the sense that more defaults occur from 2014 onwards than before 2014. With more defaulted samples in a test year, a model has more opportunities to misclassify them than in a year with only a few.
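The hold-one-year-out protocol described above can be sketched as a small generator. The sample layout (a dictionary with a `year` key) and the function name are assumptions for illustration; model fitting and scoring are deliberately left out.

```python
def leave_one_year_out(samples, years):
    """Yield (test_year, train, test), holding out all rows of one year at a time."""
    for test_year in years:
        train = [s for s in samples if s["year"] != test_year]
        test = [s for s in samples if s["year"] == test_year]
        yield test_year, train, test
```

Note that, as in the protocol described above, the training folds for early test years contain samples from later years, so this procedure probes robustness across economic conditions rather than strict out-of-time forecasting.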
Table 3. Evaluation of the B + EC + ND model when tested on each possible year.

Year   ROC AUC   TPR
2011   0.998     1
2012   0.997     1
2013   0.997     1
2014   0.914     0.833
2015   0.898     0.8
2016   0.928     0.862
2017   0.932     0.870
2018   0.906     0.819
2019   0.894     0.848

4.5 Interpretation of Results
From our previous experiments, we observed that only three hybrid models achieved better performance than the baseline (B), namely B+EC, B+EC+ND and B+EC+GE. In the remainder of this subsection, we analyze the meaning of the chosen network features from a business perspective.

The B+EC model is composed of financial features and one particular network feature, the eigenvector centrality. In complex network theory, the eigenvector centrality indicates whether a node has many highly connected neighbors. In our case, we observed that an SME with a high eigenvector centrality has a lower likelihood of default. In other words, a company that is surrounded by many highly connected neighbors is less likely to default. This explains why considering the eigenvector centrality can improve the prediction.

The B+EC+ND model adds two network features to the traditional financial features, namely the node degree and the eigenvector centrality. The node degree is the number of connections a node has. Our findings reveal that adding the degree on top of the eigenvector centrality can further improve the performance of the model.

The B+EC+GE model includes financial features, the eigenvector centrality and the embedding vector of an SME. The embedding vector of a node is supposed to capture information about the network that is different from, but possibly correlated with, centrality metrics such as the degree and the eigenvector centrality. Nodes that play a similar role, e.g. being the hub of a local community, may have similar embeddings even though they are not close to each other in the network topology. Hence, graph embeddings may carry valuable information regarding the company status. We observed that the hybrid model B+GE, which includes the graph embedding features and the financial features, performs even worse than the baseline.
However, the model B+EC+GE, which combines the eigenvector centrality, embedding features and financial features, performs better than both the baseline and the B+EC model. In summary, adding network features does not necessarily improve the traditional default prediction model. By adding the appropriate network features, e.g. the embedding combined with the eigenvector centrality of a node, the prediction can be clearly improved.
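To make the difference between the node degree and the eigenvector centrality concrete, the following sketch computes both on a small toy graph. It is an illustration only: the graph, the node names and the power-iteration routine are assumptions made here, not the authors' implementation, and the routine assumes a connected, undirected, unweighted graph.

```python
import math


def degree(adj):
    """Number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}


def eigenvector_centrality(adj, iterations=200, tol=1e-10):
    """Power iteration on A + I for a connected undirected graph.

    A + I has the same eigenvectors as the adjacency matrix A; the shift
    only prevents oscillation on bipartite graphs such as trees.
    """
    x = {node: 1.0 for node in adj}
    for _ in range(iterations):
        new = {node: x[node] + sum(x[nb] for nb in adj[node]) for node in adj}
        norm = math.sqrt(sum(value * value for value in new.values()))
        new = {node: value / norm for node, value in new.items()}
        if max(abs(new[node] - x[node]) for node in adj) < tol:
            return new
        x = new
    return x


# Toy graph: A is a hub linked to B, C and D; E hangs off D.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A"},
    "C": {"A"},
    "D": {"A", "E"},
    "E": {"D"},
}
deg = degree(adj)
ec = eigenvector_centrality(adj)
```

Nodes B and E both have degree 1, yet B scores higher on eigenvector centrality because its only neighbor is the hub A. This is the sense in which the two features can carry different information about the same node.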
5 Conclusions and Future Work
In this paper, we have developed a method to improve default prediction models based on financial features by incorporating network-based features extracted from a real-world transactional network of Dutch SMEs. This method entails the construction of the transactional network and the systematic inclusion of diverse network features, including centrality metrics in the network topology domain and embedding vectors of nodes in the embedding space, beyond the choice of the classifier and of the method to overcome the class imbalance problem. We observed and demonstrated that our hybrid model performs better than the baseline financial model, especially in terms of identifying as many defaults as possible, when the network features have been appropriately chosen. The combination of the node degree and the eigenvector centrality enhances the traditional default prediction model the most. Moreover, through our evaluation over the years, we demonstrated that the hybrid model that achieved the highest performance is robust to economic changes. Additionally, we provided an interpretation of the network features in a business context in order to explain why they improve the baseline.

In terms of future work, we believe our hybrid model, including its design and performance analysis, can be further explored in the following directions. As a start, we have considered an undirected and unweighted transactional network. The volume and the direction (the customer-supplier relationship) of the monetary transactions between SMEs can be relevant for default prediction. Hence, the weighted and directed transactional network can be further investigated. To illustrate our method, we have selected the classifier and class imbalance method that performed best in the baseline model to further evaluate the hybrid model. Other combinations of classifier and class imbalance method, as well as further fine-tuned hyperparameters of the network embedding, could be used to evaluate the hybrid model.
Regarding the choice of the network features, more combinations of network features could be considered, especially those with a low computational complexity.

Disclaimer. The information made available by Exact for this research is provided for use of this research only and under strict confidentiality. We would, therefore, like to thank Exact for providing us with resources to pursue this project.
Can You Always Reap What You Sow? Network and Functional Data Analysis of Venture Capital Investments in Health-Tech Companies

Christian Esposito (1,2), Marco Gortan (3), Lorenzo Testa (1,2, corresponding author), Francesca Chiaromonte (1,4), Giorgio Fagiolo (1), Andrea Mina (1,5), and Giulio Rossetti (6)

1 Institute of Economics and EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy, [email protected]
2 Department of Computer Science, University of Pisa, Pisa, Italy
3 Bocconi University, Milan, Italy
4 Department of Statistics and Huck Institutes of the Life Sciences, Penn State University, University Park, USA
5 Centre for Business Research, University of Cambridge, Cambridge, UK
6 KDD Lab. ISTI-CNR, Pisa, Italy

Abstract. “Success” of firms in venture capital markets is hard to define, and its determinants are still poorly understood. We build a bipartite network of investors and firms in the healthcare sector, describing its structure and its communities. Then, we characterize “success” by introducing progressively more refined definitions, and we find a positive association between such definitions and the centrality of a company. In particular, we are able to cluster funding trajectories of firms into two groups capturing different “success” regimes and to link the probability of belonging to one or the other to their network features (in particular their centrality and the one of their investors). We further investigate this positive association by introducing scalar as well as functional “success” outcomes, confirming our findings and their robustness.

Keywords: Network analysis · Functional data analysis · Success analysis · Venture capital investments

1 Introduction
Many phenomena may be described through networks, including investment interactions between bidders and firms in venture capital (VC) markets [1] and professional relationships among firms [2]. Risk capital is an essential resource for the formation and growth of entrepreneurial ventures, and venture capital firms are often linked together in a network by their joint investments in portfolio companies [3]. Through connections in such a network, they exchange resources and investment opportunities with one another. Many studies show the impact of network dynamics on investments, raising efficiency [4] and providing precious information when there is a great level of information asymmetry [5]. Also, differentiating connection types and avoiding tight cliques appear to help the success of an investor by providing more diverse information and reducing confirmation bias [3].

CB Insights [6] provides records of all transactions in venture capital markets from 1948. Since data until 2000 are partial and discontinuous, we focus on the period 2000–2020, in order to minimize the impact of missing data on our analysis. Additionally, since different sectors may be characterized by different investment dynamics [7], we focus on the healthcare sector, which is of great importance and has been shown to be less sensitive to market oscillations [8]. This stability is also shared by returns of life science VC, where investments have a lower failure rate but are at the same time less likely to generate “black-swan” returns [9], offering more consistency but a lower likelihood of achieving billion-dollar valuations. While the number of exits through an IPO or through a trade sale can be seen as a proxy for the success of an investor [10], there are different definitions of “success” for startups, although a common factor seems to be the growth rate of the company [11].

Our work aims to understand whether network features may affect “success” of investments in healthcare firms. In order to investigate this, we introduce progressively more nuanced definitions of “success”, and analyze them with increasingly sophisticated statistical tools.

The paper is organized as follows. Section 2 introduces and characterizes a network of investors and firms, describing its structure and salient properties, including the communities emerging from its topology. Then, Sect. 3 focuses on the definition and analysis of “successful” firms. We first characterize “success” by looking at the funding trajectories of each firm, clustering these trajectories into two broad groups capturing a high and a low funding regime. The binary cluster membership labels provide a first, rough definition of “success”. We run a logistic regression in order to explain “success” defined in this fashion with statistics computed on the network itself. We then move to more complex characterizations of “success”: the total amount of money raised (a scalar) and the funding trajectory itself (a functional outcome). We run regressions also on these outcomes, to validate and refine our previous results. Finally, we discuss main findings and provide some concluding remarks in Sect. 4.

C. Esposito, M. Gortan and L. Testa: these authors contributed equally.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 744–755, 2022. https://doi.org/10.1007/978-3-030-93409-5_61
2 Network Characterization
The 83258 agents in the healthcare sector are divided into two broad categories: 32796 bidders, or investors, and 50462 firms. Companies open investment calls in order to collect funds; investors answer such calls and finance firms. Each deal, i.e. each transaction from an investor to a company, is recorded in the CB Insights database. These market dynamics can be described by a bipartite network, which is built on the notion of dichotomous heterogeneity among its nodes. In our case, each node is either a firm or an investor. An undirected link exists between two nodes of different kinds when a bidder has invested in a firm. Given the possibility for an investor to finance the same firm more than once, the bipartite network is also a multi-graph. By knowing the dates on which investments are made, we can produce yearly snapshots of the bipartite network. A company (investor) is included in the snapshot of a certain year only when it receives (makes) an investment in that year. By projecting the bipartite network onto investors and firms, we produce two projected graphs, which are used to compute all the node statistics described in Table 1.

Table 1. Statistics computed on the projected graphs of investors and firms. Before running regressions in Sect. 3, left-skewed variables are normalized through log-transformation.

| Variable | Network meaning |
|---|---|
| Degree centrality | Influence |
| Betweenness centrality [12] | Role within flow of information |
| Eigenvector centrality [13] | Influence |
| VoteRank [14] | Best spreading ability |
| PageRank [15] | Influence |
| Closeness centrality [16] | Spreading power (short average distance from all other nodes) |
| Subgraph centrality [17] | Participation in subgraphs across the network |
| Average neighbor degree [18] | Affinity between neighbor nodes |
| Current flow betweenness centrality [19] | Role within flow of information |

As the bipartite network is a multi-graph, defining projections on a subset of nodes requires an additional assumption. Specifically, we project the bipartite graph onto firms by linking them in a cumulative fashion: we iteratively add to each yearly projected snapshot a link between two companies in which a bidder has invested during that year. Concerning the projection of the bipartite network onto investors, we link two bidders whenever they invest in the same company in the same financing round.

Roughly 75% of the companies in the network projected onto firms are North American and European (around 55% belong to the US market), while the remaining 25% is mostly composed of Asian companies. Around 60% of the companies operate within the sub-sectors of medical devices, medical facilities and biotechnology; the pharmaceutical sub-sector alone accounts for 20% of the network. As of August 2021, roughly 80% of the companies in the network are either active or acquired, with the remaining portion being inactive or having completed an IPO.
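The two projection rules described above can be sketched as follows. This is one possible reading of the text rather than the authors' code: the deal tuple layout `(year, round_id, investor, firm)` and all names are hypothetical, and the firm projection assumes that a link is created only when the same bidder funds both companies within the same calendar year.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical deal records: (year, round_id, investor, firm).
deals = [
    (2010, "r1", "inv_a", "firm_x"),
    (2010, "r2", "inv_a", "firm_y"),
    (2011, "r3", "inv_a", "firm_z"),
    (2011, "r3", "inv_b", "firm_z"),
    (2011, "r4", "inv_b", "firm_x"),
]


def firm_projection_by_year(deals):
    """Cumulative yearly firm-firm edge sets.

    In each year, two firms are linked if the same investor funded both
    during that year; links from earlier years are carried forward.
    """
    funded = defaultdict(lambda: defaultdict(set))  # year -> investor -> firms
    for year, _round_id, investor, firm in deals:
        funded[year][investor].add(firm)
    snapshots, edges = {}, set()
    for year in sorted(funded):
        for firms in funded[year].values():
            edges.update(frozenset(pair) for pair in combinations(sorted(firms), 2))
        snapshots[year] = set(edges)  # copy, so later years do not alter it
    return snapshots


def investor_projection(deals):
    """Link two investors when they join the same financing round of a firm."""
    by_round = defaultdict(set)  # (firm, round_id) -> investors
    for _year, round_id, investor, firm in deals:
        by_round[(firm, round_id)].add(investor)
    edges = set()
    for investors in by_round.values():
        edges.update(frozenset(pair) for pair in combinations(sorted(investors), 2))
    return edges


snapshots = firm_projection_by_year(deals)
investor_edges = investor_projection(deals)
```

Storing edges as sets collapses repeated deals between the same pair, which is one simple way to turn the multi-graph into a simple projected graph; keeping edge multiplicities as weights would be an equally plausible choice.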
We witness turnover of the active companies through the years, but this is expected: a company’s status is evaluated as of 2021, and it is more likely to observe a dead company among those that received investments in 1999 than in 2018. Indeed, both death and IPO represent the final stage of the evolution of a company, so those that received funding in earlier years are more likely to have already reached their final stage. Finally, we do not observe marked changes in terms of graph sub-sectoral composition: the relative share of each sub-sector is rather stable through the years, with the exception of an increase in the shares of the internet software and mobile software sub-sectors (from 1% in 1999 to 8% in 2019 and from 0% in 1999 to 5% in 2019, respectively).

2.1 Communities
By employing the Louvain method [20], we identify meso-scale structures for each yearly snapshot of the network projected onto firms. For each year, we rank communities by their size, from the largest to singletons. We then compare the largest communities across years, by looking at their sub-sector, status and geographical composition. While the specific nodes in the biggest communities may vary throughout the years, we notice a relative stability in their features.

The largest communities (which contain between 13% and 20% of the nodes) reflect the status composition of the general network, downplaying unsuccessful companies and giving higher relative weight to IPO ones, with the only variation being the balance between acquired and active companies across years (i.e. active companies are relatively over-represented in more recent largest communities than in older ones). Considering geographical information, the largest communities comprise mainly US companies, with an under-representation of other continents. This trait is quite consistent through the years, with the exception of two years (2013–2014). With respect to sub-sectors, the largest communities mainly contain medical device and biotechnology companies, and they are quite consistent through the years in terms of sub-sectoral composition.

The second largest communities (containing between 10% and 14% of the nodes in the network) have a less consistent sub-sectoral composition through the years, although it is worth highlighting that they comprise companies operating within software and technology. Geographically, these communities still consist mostly of US-based companies, although 5 years out of 20 show a remarkable (roughly 80%) presence of European companies. Finally, status composition is balanced between active and acquired until the later years, when active companies predominate within the second largest communities. IPOs are not present, while dead startups account for a small share (between 5% and 20%).

Finally, the third largest communities (containing between 7% and 12% of the nodes) present a clear change within the period considered: in the first ten years, they mostly comprise failed or acquired European companies within the fields of biotechnology and drug development, while, in the second decade, they comprise active US companies within the fields of medical devices and medical facilities.
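The Louvain method searches for a partition that maximizes modularity. The sketch below does not implement Louvain itself; it only evaluates the modularity of a given partition and the relative size of each community, which are the two quantities used in the comparison above. The toy graph, the function names and the restriction to simple, undirected, unweighted graphs are assumptions made for illustration.

```python
from collections import Counter


def modularity(edges, community_of):
    """Modularity Q of a partition of a simple undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = Counter()
    degree_sum = Counter()
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (d / (2 * m)) ** 2 for c, d in degree_sum.items())


def community_shares(community_of):
    """Communities ranked by size, with their share of all nodes."""
    sizes = Counter(community_of.values())
    total = sum(sizes.values())
    return [(c, size / total) for c, size in sizes.most_common()]


# Toy graph: two triangles joined by a single edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
shares = community_shares(two_groups)
```

Splitting the toy graph into its two triangles gives a higher modularity than keeping all nodes together, which is the kind of improvement a Louvain-style search would accept.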
3 Success Analysis
Given the bipartite network and its projections, we now turn to the analysis of success and of its main drivers. Because of the elusiveness of the definition of “success”, we proceed in stages, considering progressively more refined outcomes and comparing our findings. Moreover, since many of the records available in the CB Insights data set are incomplete, and our aim is to capture the temporal dynamics leading a firm to succeed, we further restrict attention to those companies for which full information is available on birth year, healthcare market sub-sector and investment history for the first 10 years from founding. Although this filtering may introduce some biases, it still leaves us with a sizeable set of 3663 firms belonging to 22 different sub-sectors.

Notably, we also restrict our focus in terms of potential predictors, because our collection of network features exhibits strong multicollinearities. By building a feature dendrogram (Pearson correlation distance, complete linkage) and by evaluating the correlation matrix, we reduce the initial set to four representatives. In particular, we select two features related to the investors’ projection (the maximum among the degree centralities of the investors in a company and the maximum among their current flow betweenness centralities, both computed in the company’s birth year) and two features computed on the firms’ projection (a company’s eigenvector and closeness centralities, computed in the year in which the company received its first funding).

Fig. 1. Money raised cumulatively as a function of time, shown for 319 firms in the pharmaceuticals and drugs sub-sector. Funding trajectories are constructed over a period of 10 years since birth, and aligned using birth years as registration landmarks.

Each firm has its own funding history: after its birth, it collects funds over the years, building a trajectory of the amount of money it is able to attract. We treat these trajectories as a specific kind of structured data, by exploiting tools from a field of statistics called Functional Data Analysis (FDA) [21], which studies observations that come in the form of functions taking shape over a continuous domain. In particular, we focus on the cumulative function of the money raised
over time by each company. As an example, Fig. 1 shows 319 such cumulative functions, for the firms belonging to the pharmaceuticals and drugs sub-sector. Trajectories are aligned, so that their domain (“time”) starts at each company’s birth (regardless of the calendar year it corresponds to). By construction, these functions exhibit two characterizing properties: first, they are monotonically non-decreasing; second, they are step functions, with jumps indicating investment events.

Our first definition of success is based on separating these trajectories into two regimes characterized by high (successful) vs. low investment patterns: the first runs at high levels, indicating successful patterns, and the second at low levels. Because of heterogeneity among healthcare sub-sectors, we accomplish this by running a functional k-means clustering algorithm [22,23] with k = 2, separately on firms belonging to each sub-sector. As an example, companies belonging to the sub-sector of pharmaceuticals and drugs are clustered in Fig. 2. Throughout all sub-sectors, the algorithm clusters 89 firms in the high-regime group and 3574 in the low-regime one. This binary definition of “success” turns out to be rather conservative; very few firms are labeled as belonging to the high investment regime.

Fig. 2. k-means clustering (k = 2) of the funding trajectories of firms belonging to the pharmaceuticals and drugs sub-sector. The green and red dashed lines represent firms in the high (“successful”) and low regimes, respectively. Bold curves represent cluster centroids. To aid their visualization, centroids are shown again in the right panel with individual trajectories in gray.

Consider the logistic regression

$$\log \frac{P(y_i = 1)}{1 - P(y_i = 1)} = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}, \qquad i = 1, \dots, n \tag{1}$$

where $n$ is the number of observations; $y_i$, $i = 1, \dots, n$, are the binary responses indicating membership to the high ($y_i = 1$) or low ($y_i = 0$) regime clusters; $\beta_0$
is an intercept; and $x_{ij}$, $i = 1, \dots, n$ and $j = 1, \dots, p$ ($p = 4$), are the previously selected scalar covariates. If we fit this regression on our unbalanced data, results are bound to be unsatisfactory and driven by the most abundant class. Running such a fit, one obtains an explained deviance of only 0.10. To mitigate the effects of unbalanced data [24], we randomly subsample the most abundant class (the low-regime firms) so as to enforce balance between the two classes, and then run the logistic regression in Eq. 1. We repeat this procedure 1000 times, recording estimated coefficients, associated p-values and explained deviances. The average of the latter across the 1000 replications is substantially higher than on the unbalanced fit, reaching 0.18 (some fits produce an explained deviance as high as 0.45). Moreover, we can investigate significance and stability of the coefficient estimates through their distribution across the repetitions. Figure 3 shows scatter-plots of these quantities, suggesting that the two variables related to the firms’ centrality have a modest yet stable, positive impact on the probability of belonging to the high-regime cluster. This is not the case for the variables related to the investors’ centrality.

Fig. 3. Scatter-plots of logistic regression coefficient estimates (horizontal) and significance (vertical; −log(p-value)). Each point represents one of 1000 fits run on data balanced by subsampling the most abundant class. Orange solid lines mark averages across the fits, and orange dashed lines ±1 standard deviation about them. Green solid lines mark 0 on the horizontal axes. Blue lines mark significance values associated with a p-value of 0.1.

This first evidence of a positive relationship between the success of a firm and its centrality, or importance (in a network sense), is promising. However, the binary definition of “success” we employed is very rough, and the imbalance in the data forced us to run the analysis relying on reduced sample sizes (89 + 89 =
178 observations in each repeated run). Thus, we next consider a scalar proxy for “success”, which may provide a different and potentially richer perspective. Specifically, we consider the cumulative end point of a firm’s funding trajectory, i.e. the total value of the investment received through its temporal domain. For this scalar response, we run a best subset selection [25] considering all the network features in our initial set, not just the 4 selected to mitigate multicollinearity prior to the logistic regression exercise. Notably, despite the substantial change in the definition of “success”, results are in line with those from the logistic regression. Indeed, the first selected variable, when the predictor subset is forced to contain only one feature, is the eigenvector centrality of firms. When the predictor subset size is allowed to reach 4, the features selected are the eigenvector centrality, the closeness and the VoteRank of the firm, and the maximum current flow betweenness centrality among its investors (computed in the firm’s birth year). Thus, the only difference compared to our previous choice is the selection of the firms’ VoteRank centrality instead of the maximum among the investors’ degree centralities. We compare the two alternative selections of four features as predictors of the scalar “success” response by fitting two linear models of the form:

$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, n \tag{2}$$
where $n$ is the number of observations; $y_i$, $i = 1, \dots, n$, are the scalar responses (aggregate amount of money raised); $\beta_0$ is an intercept; $x_{ij}$, $i = 1, \dots, n$ and $j = 1, \dots, p$ ($p = 4$), are the scalar covariates belonging to one or the other subset; and $\epsilon_i$, $i = 1, \dots, n$, are i.i.d. Gaussian model errors. As shown in Table 2, the maximum degree centrality among a firm’s investors is not statistically significant. Surprisingly, the maximum among investors’ current flow betweenness centralities is significantly negative, but its magnitude is close to 0. In contrast, the firms’ closeness and eigenvector centralities are positive, statistically significant and sizeable. This is in line with what we expected, since it is reasonable to think that knowledge may indirectly flow from other startups through common investors, increasing the expected aggregate money raised. Finally, the firms’ VoteRank centrality appears to have a negative, statistically significant impact on the aggregate money raised. This should not be surprising, given that the higher the VoteRank centrality is, the less influential the node will be. The variance explained by the two models is similar and still relatively low ($R^2 \approx 0.13$), which may be simply due to the fact that network characteristics are only one among the many factors involved in a firm’s success [26]. Nevertheless, the results obtained here through the scalar “success” outcome are consistent with those obtained through the binary one and logistic regression.

Our scalar outcome (aggregate money raised) has its own drawbacks. In particular, it implicitly assumes that the right time to evaluate success and investigate its dependence on network features is, cumulatively, at the end of the period considered (10 years). Note that this translates into a 10-year gap between the measurement of network features and financial success.
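The relation between a firm's funding trajectory and the scalar outcome used above can be sketched as follows. This is an illustrative assumption about the data layout, not the authors' preprocessing: deals are taken to be `(calendar_year, amount)` pairs, the trajectory is evaluated on a yearly grid from age 0 to age 10, and the function name is hypothetical.

```python
def cumulative_trajectory(birth_year, deals, horizon=10):
    """Cumulative money raised at ages 0..horizon (years since birth).

    `deals` is a list of (calendar_year, amount) pairs; deals outside the
    window [birth_year, birth_year + horizon] are ignored.
    """
    raised_at_age = [0.0] * (horizon + 1)
    for year, amount in deals:
        age = year - birth_year  # alignment: time starts at the firm's birth
        if 0 <= age <= horizon:
            raised_at_age[age] += amount
    trajectory, total = [], 0.0
    for amount in raised_at_age:
        total += amount
        trajectory.append(total)
    return trajectory


# Hypothetical firm born in 2005 with three deals (arbitrary units).
traj = cumulative_trajectory(2005, [(2006, 2.0), (2009, 5.0), (2013, 10.0)])
scalar_outcome = traj[-1]  # end point of the trajectory: total money raised
```

The resulting list is a non-decreasing step function with jumps at investment events; its last value is the aggregate money raised, i.e. the scalar response of Eq. 2, while the full list is the kind of object used as a functional response.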
Table 2. Linear regressions of aggregate money raised on two sets of predictors. All variables are scaled and some are log-transformed (as indicated parenthetically). Dependent variable: aggregate money raised (log). Standard errors in parentheses.

| | (1) | (2) |
|---|---|---|
| newman max | −0.065** (0.030) | −0.072* (0.041) |
| voterank (log) | −0.140*** (0.033) | |
| degcen max (log) | | 0.050 (0.040) |
| closeness | 0.126*** (0.037) | 0.130*** (0.030) |
| eigenvector (log) | 0.214*** (0.034) | 0.255*** (0.028) |
| Constant | 0.113*** (0.030) | 0.062** (0.025) |
| Observations | 1,118 | 1,364 |
| R² | 0.136 | 0.127 |
| Adjusted R² | 0.133 | 0.125 |
| Residual std. error | 0.992 (df = 1113) | 0.923 (df = 1359) |
| F statistic | 43.951*** (df = 4; 1113) | 49.458*** (df = 4; 1359) |

Note: * p < 0.1; ** p < 0.05; *** p < 0.01
Although this issue could be approached relying on additional economic assumptions, we tackle it by refining the target outcome and considering the full funding trajectories instead of just their end point. This requires the use of a more sophisticated regression framework from FDA, namely function-on-scalar regression [27]. In particular, we regress the funding trajectories on the same two sets of covariates considered in the scalar case above. The equation used for function-on-scalar regression is:

$$Y_i(t) = \beta_0(t) + \sum_{j=1}^{p} \beta_j(t) x_{ij} + \epsilon_i(t), \qquad i = 1, \dots, n \tag{3}$$
where $n$ is the number of observations; $Y_i(t)$, $i = 1, \dots, n$, are the aligned funding trajectories; $\beta_0(t)$ is a functional intercept; $x_{ij}$, $i = 1, \dots, n$ and $j = 1, \dots, p$ ($p = 4$), are the scalar covariates belonging to one or the other set; and $\epsilon_i(t)$, $i = 1, \dots, n$, are i.i.d. Gaussian model errors. The regression coefficient of a scalar covariate in this model, $\beta_j(t)$, is itself a curve describing the time-varying relationship between the covariate and the functional response along its domain. Together with the functional coefficients, we also estimate their standard errors, which we use to build confidence bands around the estimated functional coefficients [28]. Coefficient curve estimates for the covariate set including the maximum investors’ degree centrality are shown in Fig. 4 (results are very similar with the other set of covariates). The impacts of an increase in the maximum among the degree centralities and in the maximum among the current flow betweenness centralities of the investors in a firm are not statistically significant. Conversely, eigenvector and closeness centralities of firms have positive and significant impacts. The impact of the eigenvector centrality seems to be increasing during the first five years, reaching a “plateau” in the second half of the domain. These findings reinforce those obtained with the binary and scalar outcomes previously considered, confirming a role for firms’ centrality in shaping their success.

Fig. 4. Function-on-scalar regression, coefficient curve estimates. (a) intercept function (this can be interpreted as the sheer effect of time on the response); (b) maximum degree centrality among investors (company’s birth year); (c) maximum across investors’ current flow betweenness centrality (company’s birth year); (d) company’s eigenvector centrality; (e) company’s closeness centrality. Dotted lines represent confidence bands. All the covariates are standardized.
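To give an intuition for what a coefficient curve in Eq. 3 is, the sketch below fits a simplified version of the model with a single scalar covariate by ordinary least squares separately at each grid point. This pointwise fit is an assumption chosen for brevity; it is not the estimator of [27], it applies no smoothing, and it produces no confidence bands. All names and the toy data are hypothetical.

```python
def pointwise_fos_fit(x, curves):
    """Pointwise least squares for Y_i(t) = b0(t) + b1(t) * x_i + error.

    `x` holds one scalar covariate per firm and `curves[i][k]` is the value
    of firm i's trajectory at the k-th grid point. Returns (b0, b1), two
    lists with one estimate per grid point.
    """
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b0, b1 = [], []
    for k in range(len(curves[0])):
        y = [curve[k] for curve in curves]
        y_mean = sum(y) / n
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
        slope = sxy / sxx
        b1.append(slope)
        b0.append(y_mean - slope * x_mean)
    return b0, b1


# Toy data generated without noise from b0(t) = t and b1(t) = 2 t.
grid = [0, 1, 2, 3]
x = [-1.0, 0.0, 1.0, 2.0]
curves = [[t + 2 * t * xi for t in grid] for xi in x]
b0, b1 = pointwise_fos_fit(x, curves)
```

On this noise-free toy example the fit recovers the generating curves, so `b1` grows with `t`: the covariate matters more later in the domain, which is the kind of time-varying effect read off the panels of Fig. 4.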
4 Discussion
This paper exploits techniques from the fields of network and functional data analysis. We build a network of investors and firms in the healthcare sector and characterize its largest communities. Next, we progressively shape the concept of a firm’s “success” using various definitions, and associate it to different network
features. Our findings show a persistent positive relationship between the importance of a firm (measured by its centrality in the network) and various (binary, scalar and functional) definitions of “success”. In particular, we cluster funding trajectories into a high (“successful”) and a low regime, and find significant associations between the cluster memberships and firms’ centrality measures. Then, we switch from this binary outcome to a scalar and then a functional one, which allows us to confirm and enrich the previous findings. Among centralities computed on the two network projections, our results suggest a preeminent role for those computed in the companies’ projection. In particular, both a firm’s high closeness centrality, indicating short shortest-path distances to other firms, and its eigenvector centrality, which may account for a firm’s reputation, seem to be related to the propensity to concentrate capital.
Our analysis can be expanded in several ways. First, we limit our study to the healthcare sector, while it may be interesting to investigate other fields, or more healthcare firms based on the availability of more complete records. It would also be interesting to account for external data (e.g. country, sub-sector, etc.) in two ways. On the one hand, this information would be useful to compute more informative statistics on the network topology. On the other hand, it may be used in our regression to control for these factors. Moreover, meso-scale communities may be analyzed in terms of their longitudinal evolution, so as to characterize “successful” clusters of firms from a topological point of view.
Acknowledgments. F.C., C.E., G.F., A.M. and L.T. acknowledge support from the Sant’Anna School of Advanced Studies. F.C. and L.T. acknowledge support from Penn State University. G.R. acknowledges support from the scheme “INFRAIA-01-20182019: Research and Innovation action”, Grant Agreement n. 871042 “SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics”.
References
1. Liang, Y.E., Yuan, S.-T.D.: Predicting investor funding behavior using crunchbase social network features. Internet Research (2016)
2. Bonaventura, M., Ciotti, V., Panzarasa, P., et al.: Predicting success in the worldwide start-up network. Sci. Rep. 10, 345 (2020)
3. Bygrave, W.D.: The structure of the investment networks of venture capital firms. J. Bus. Ventur. 3(2), 137–157 (1988)
4. Wetzel, W.E., Jr.: The informal venture capital market: aspects of scale and market efficiency. J. Bus. Ventur. 2(4), 299–313 (1987)
5. Fiet, J.O.: Reliance upon informants in the venture capital industry. J. Bus. Ventur. 10(3), 195–223 (1995)
6. CB Insights. https://www.cbinsights.com/
7. Dushnitsky, G., Lenox, M.J.: When does corporate venture capital investment create firm value? J. Bus. Ventur. 21(6), 753–772 (2006)
8. Pisano, G.P.: Science Business: The Promise, the Reality, and the Future of Biotech. Harvard Business Press, Boston (2006)
9. Booth, B.L., Salehizadeh, B.: In defense of life sciences venture investing. Nat. Biotechnol. 29(7), 579–583 (2011)
10. Hege, U., Palomino, F., Schwienbacher, A., et al.: Determinants of venture capital performance: Europe and the United States. Working paper, HEC School of Management (2003)
11. Santisteban, J., Mauricio, D.: Systematic literature review of critical success factors of information technology startups. Acad. Entrep. J. 23(2), 1–23 (2017)
12. Hannan, M.T., Freeman, J.: The population ecology of organizations. Am. J. Sociol. 82(5), 929–964 (1977)
13. Bonacich, P.: Power and centrality: a family of measures. Am. J. Sociol. 92(5), 1170–1182 (1987)
14. Zhang, J.-X., et al.: Identifying a set of influential spreaders in complex networks. Sci. Rep. 6, 27823 (2016)
15. Page, L., et al.: The PageRank citation ranking: bringing order to the web. Stanford InfoLab (1999)
16. Freeman, L.C.: Centrality in social networks conceptual clarification. Soc. Netw. 1(3), 215–239 (1978)
17. Estrada, E., Rodriguez-Velazquez, J.A.: Subgraph centrality in complex networks. Phys. Rev. E 71(5), 056103 (2005)
18. Barrat, A., et al.: The architecture of complex weighted networks. Proc. Natl. Acad. Sci. 101(11), 3747–3752 (2004)
19. Newman, M.E.J.: A measure of betweenness centrality based on random walks. Soc. Netw. 27(1), 39–54 (2005)
20. Blondel, V.D., et al.: Fast unfolding of communities in large networks. J. Stat. Mech Theory Exp. 2008(10), P10008 (2008)
21. Ramsey, J.O., Silverman, B.W.: Functional Data Analysis. Springer Series in Statistics, Springer, New York (2005). https://doi.org/10.1007/b98888
22. Jacques, J., Preda, C.: Functional data clustering: a survey. Adv. Data Anal. Classif. 8(3), 231–255 (2013). https://doi.org/10.1007/s11634-013-0158-y
23. Hartigan, J.A., Wong, M.A.: A K-means clustering algorithm. Appl. Stat. 28, 100–108 (1979)
24. He, H., Garcia, E.A.: Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21(9), 1263–1284 (2009)
25. Friedman, J., Hastie, T., Tibshirani, R.: The Elements of Statistical Learning. Springer Series in Statistics, Springer, New York (2005)
26. Dosi, G., Marengo, L.: Some elements of an evolutionary theory of organizational competences. In: Evolutionary Concepts in Contemporary Economics, pp. 157–178 (1994)
27. Kokoszka, P., Reimherr, M.: Introduction to Functional Data Analysis. Chapman and Hall/CRC (2017). https://doi.org/10.1201/9781315117416
28. Goldsmith, J., et al.: Refund: regression with functional data. R package version 0.1-16 (2016)
Asymmetric Diffusion in a Complex Network: The Presence of Women on Boards
Ricardo Gimeno1, Ruth Mateos de Cabo2(B), Pilar Grau3, and Patricia Gabaldon4
1 Banco de España, 28014 Madrid, Spain [email protected]
2 Universidad CEU San Pablo, Madrid, Spain [email protected]
3 Universidad Rey Juan Carlos, Madrid, Spain [email protected]
4 IE University, Madrid, Spain [email protected]
Abstract. Diffusion processes are well-known linear dynamical systems. However, a different kind of dynamical solution emerges when the speed of diffusion depends on the sign of the gradient of the variable that is diffused through the graph. In this case, we move into a nonlinear dynamical system whose solutions depend on the differential speed of diffusion, the topological structure of the graph and the initial values of the diffused variable. We show an example in a real complex network: we construct a network of US Boards of Directors, where nodes are boards and links are interlocking directorates. We show that the proportion of women on each board follows a diffusion process, where changes depend on the gradient of this proportion with respect to adjacent boards. Furthermore, we show that the diffusion is asymmetric, with diffusion slower (faster) when the board has a lower (higher) proportion of women than neighbor boards.
Keywords: Diffusion process · Boards of Directors · Women on boards
1 Introduction
A complex network can capture the relationships and influences between different agents in a population. The network topology (i.e., the pattern of connections), is critical in the diffusion process of information, agreements, and even contagious illnesses. Studies on network diffusion characterize the network through statistical properties such as degree distribution and degree correlation using mean field analysis, as in [3] or [9]. Others use the adjacency matrix, which is static and finite-sized, treating diffusion as a Markov process in which the diffusion rate is assumed to be dependent on the number of active neighbors (e.g., [7,17]).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 756–767, 2022. https://doi.org/10.1007/978-3-030-93409-5_62
Our analysis, however, considers a new type of diffusion where the speed of diffusion depends on the level of the gradient variable. We show that this slight change in the dynamics produces a nonlinear dynamical process whose final solution depends on the differential speed of diffusion, the topology of the graph and the distribution of the initial conditions among nodes. In this way, we introduce the idea that in network diffusion it is not only the structure that matters, but also the way in which agents influence each other [14]. In order to show that this is not only a theoretical possibility but a real phenomenon, we show how US companies copy each other in the proportion of women on their Boards of Directors. Companies do not want to be accused either of bigotry and discrimination or of not respecting market freedom and the meritocratic perspective in hiring directors. We construct the network using companies as nodes and directors serving on more than one board as edges. We show that these directors carry information, and that companies react by accommodating to the proportion of women on the boards of their neighbors. We also show that the pressure from having too few women on a board (WoB) is not the same as the pressure from having too many. As a consequence, the speed of diffusion changes with the direction of the gap, producing a pressure to keep the proportion of women on boards at low levels.
2 Theoretical Framework
2.1 Dynamical Processes on an Undirected Network
Let’s consider a graph G = (V, E) whose nodes (V) represent agents in a complex system (e.g., companies) and whose edges (E) represent interactions between such agents (e.g., directors holding positions on different boards). In such a multi-agent system, “consensus” means an agreement regarding a certain quantity of interest. Let’s consider, then, that there is a level of a variable of interest (ui) for each node i at a given time t. At every edge (i, j) ∈ E, there will be a gradient produced by the different levels of that variable of interest at nodes i and j, [uj(0) − ui(0)]. If there is a search for a consensus in the network, as time evolves, there will be a “movement” from the nodes with a high value of u to the nodes with a low value of u. We can write the evolution of the concentration at a given node i as a heat diffusion process,
$$\frac{du_i(t)}{dt} = \dot{u}_i(t) = \sum_{j \sim i} \left[ u_j(t) - u_i(t) \right], \quad i = 1, 2, \ldots, n,$$
where j ∼ i means all nodes j that are neighbors of i in the graph, and where the initial conditions are
$$u_i(0) = u_0, \quad u_0 \in \mathbb{R}.$$
For instance, for an imaginary network such as the one in Fig. 1, we can write such equations for every node:
Fig. 1. Example of an undirected network. This simple graph with five nodes is used as an example throughout the paper.
$$\begin{aligned}
\dot{u}_A(t) &= [u_B(t) - u_A(t)] + [u_C(t) - u_A(t)] \\
\dot{u}_B(t) &= [u_A(t) - u_B(t)] + [u_C(t) - u_B(t)] + [u_D(t) - u_B(t)] + [u_E(t) - u_B(t)] \\
\dot{u}_C(t) &= [u_A(t) - u_C(t)] + [u_B(t) - u_C(t)] \\
\dot{u}_D(t) &= [u_B(t) - u_D(t)] \\
\dot{u}_E(t) &= [u_B(t) - u_E(t)].
\end{aligned}$$
In a generalization of these expressions, the collective dynamics of the group of agents can be represented by the following equation,
$$u_i(t+1) = u_i(t) + \varepsilon \sum_{j \sim i} a_{ij} \left( u_j(t) - u_i(t) \right), \tag{1}$$
where ui(t) is the value of a quantitative measure on node i (e.g., the proportion of WoB), ε > 0 is the step-size (i.e., the speed of correction of imbalances), and aij is the element i, j of the adjacency matrix of the graph (i.e., equal to 1 if both nodes are connected, and 0 otherwise). Equation 1 can be arranged as a matrix equation, as in our example of Fig. 1,
$$\begin{bmatrix} \dot{u}_A(t) \\ \dot{u}_B(t) \\ \dot{u}_C(t) \\ \dot{u}_D(t) \\ \dot{u}_E(t) \end{bmatrix} = - \begin{bmatrix} 2 & -1 & -1 & 0 & 0 \\ -1 & 4 & -1 & -1 & -1 \\ -1 & -1 & 2 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u_A(t) \\ u_B(t) \\ u_C(t) \\ u_D(t) \\ u_E(t) \end{bmatrix},$$
which can be written as
$$\dot{u}(t) = -L u(t), \quad u(0) = u_0, \tag{2}$$
where L is the Laplacian matrix, which can be defined as L = K − A; K is the degree matrix, a diagonal matrix of the degrees of each node in the graph; and A is the adjacency matrix of the graph [5]:
$$K = \begin{bmatrix} 2 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad A = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix}.$$
The dynamical process defined by Eq. 2 is a linear process with a known solution: the average value of the initial conditions, ū(0) [4], as shown in Fig. 2.
Fig. 2. Solution of the linear diffusion process (Eq. 2) for the example proposed in Fig. 1 and initial values u0 = [.1 .05 .15 .20 .25]. (Plot omitted; axes: u against time, one curve for each of the nodes A to E.)
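The convergence to the average can be checked numerically with a minimal sketch that iterates the matrix form of Eq. 1 on the five-node example. The step size `eps = 0.1`, the number of steps and the function name `diffuse` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Adjacency matrix of the five-node example (node order A, B, C, D, E)
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
K = np.diag(A.sum(axis=1))   # degree matrix
L = K - A                    # graph Laplacian

u0 = np.array([0.10, 0.05, 0.15, 0.20, 0.25])

def diffuse(u, L, eps=0.1, steps=500):
    """Iterate u(t+1) = u(t) - eps * L u(t), the matrix form of Eq. 1.

    eps and steps are assumed values; eps must stay below 2 / lambda_max(L)
    for the iteration to be stable.
    """
    u = u.copy()
    for _ in range(steps):
        u = u - eps * (L @ u)
    return u

u_final = diffuse(u0, L)
```

Because the columns of L sum to zero, each step preserves the sum of u, so every node ends at the mean of the initial values, 0.15 in this example.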
2.2 Asymmetric Agreement
Let’s now suppose that the speed of diffusion is different depending on the sign of the differential. Starting from the example used in Sect. 2.1, let’s suppose that the initial values are
$$u_0 = \begin{pmatrix} .10 \\ .05 \\ .15 \\ .20 \\ .25 \end{pmatrix}, \qquad A = \begin{pmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{pmatrix}. \tag{3}$$
Instead of a single adjacency matrix, we have two directional ones, which we represent with the adjacency matrices Ah→l, with ones if the individual in the row has a lower value of the variable of interest than the individual in the column (Fig. 3), and Al→h, with ones if the individual in the row has a higher value of the variable of interest than the individual in the column (Fig. 4).
Fig. 3. Directed network for the influence of high value nodes on the low value nodes, and its Adjacency matrix
Fig. 4. Directed network for the influence of low value nodes on the high value nodes, and its Adjacency matrix
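The addition of Ah→l and Al→h produces the original adjacency matrix:
$$A = A_{h \to l} + A_{l \to h}: \quad \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \end{bmatrix}$$
In the case of ui = uj, we can add the edge to either Ah→l or Al→h without any differences in the dynamics of the process or the final agreement solution. These two directional graphs have nodes with different degrees; thus, two directional degree matrices arise (Kh→l and Kl→h), and the sum of both is equal to the undirected degree matrix:
$$K = K_{h \to l} + K_{l \to h}: \quad \begin{bmatrix} 2 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$
In the same vein, we have two different Laplacian matrices (Lh→l = Kh→l − Ah→l and Ll→h = Kl→h − Al→h), where:
$$L = L_{h \to l} + L_{l \to h}: \quad \begin{bmatrix} 2 & -1 & -1 & 0 & 0 \\ -1 & 4 & -1 & -1 & -1 \\ -1 & -1 & 2 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & -1 & 0 & 0 \\ -1 & 4 & -1 & -1 & -1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 1 & -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ -1 & -1 & 2 & 0 & 0 \\ 0 & -1 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 & 1 \end{bmatrix}$$
The agreement in this framework is driven by
$$\dot{u}(t) = -\left( L_{l \to h}(t) + \gamma L_{h \to l}(t) \right) u(t). \tag{4}$$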
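The decomposition above can be reproduced with a minimal sketch that splits the adjacency matrix according to the current values of u and forms the two directional Laplacians of Eq. 4. The function names `directional_laplacians` and `step`, the tie-breaking rule and the Euler step size `dt` are assumptions made for illustration.

```python
import numpy as np

def directional_laplacians(A, u):
    """Split an undirected graph into the two directed layers used in Eq. 4.

    A_hl[i, j] = 1 if i and j are linked and u[i] <  u[j] (row lower than column);
    A_lh[i, j] = 1 if i and j are linked and u[i] >= u[j] (row higher than column).
    Ties are assigned to A_lh (an arbitrary choice; the gradient is zero anyway).
    """
    lower_than = u[:, None] < u[None, :]
    A_hl = A * lower_than
    A_lh = A * ~lower_than
    L_hl = np.diag(A_hl.sum(axis=1)) - A_hl
    L_lh = np.diag(A_lh.sum(axis=1)) - A_lh
    return L_lh, L_hl

def step(u, A, gamma, dt=0.01):
    """One explicit Euler step of Eq. 4; the layers are rebuilt from the current u."""
    L_lh, L_hl = directional_laplacians(A, u)
    return u - dt * (L_lh + gamma * L_hl) @ u

# Five-node example of Fig. 1 (node order A, B, C, D, E) with the values of Eq. 3
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
u0 = np.array([0.10, 0.05, 0.15, 0.20, 0.25])

L = np.diag(A.sum(axis=1)) - A
L_lh, L_hl = directional_laplacians(A, u0)
```

For these initial values, `L_hl` and `L_lh` coincide with the two matrices shown above and add up to `L`. With `gamma = 1`, `step` reduces to the linear update of Eq. 2.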
In this new scenario we have the influence of low nodes on high value nodes (Ll→h(t)), the influence of the high nodes on low value ones (Lh→l(t)), and the asymmetry coefficient (γ). We can consider γ as the symmetry parameter, ranging from the highest level of symmetry (γ = 1) to the lowest level of symmetry (γ = 0 or γ = ∞). Furthermore, an important difference from the linear diffusion process is that the Laplacian matrices of the directional graphs may now change over time if the relative values of ui(t) and uj(t) change.
2.3 Asymptotics
Previous research has shown how the topology of the graph might slow down the diffusion process in a network [10,11], although the solution would be the same one: the average value of the initial conditions (ū(0)). But the situation we are showing here is different, because the asymmetry in the speed of diffusion produces a change in the final asymptotic solution. There are three trivial cases in the dynamics represented by Eq. 4. If γ = 1, we are back to the original bidirectional consensus model, where the solution will be the average value of the initial conditions, ū(0). By contrast, if γ = 0, then we are in a case where the node with the minimum value will drag all the other nodes in its direction (limt→∞ u(t) = min u(0)). This special case coincides with the zealot problem [13], where a zealot voter is able to drag the whole network to its political views. Similarly, if γ → ∞ the solution will be equal to the maximum value among the initial conditions (limt→∞ u(t) = max u(0)). Outside these cases, for either 0 < γ < 1 or γ > 1, we are in a network where the solution will be either (min u(0) < limt→∞ u(t) < ū(0)) or (ū(0) < limt→∞ u(t) < max u(0)), respectively (see Fig. 5). Interestingly, now the asymptotic solution will also be different depending on the individual initial values of each node (see Fig. 6). In the linear case (Eq. 2), where the mean was the asymptotic value, it was indifferent if the higher value was in node B or node D, because the only relevant metric was the average ū(0). By contrast, in the nonlinear diffusion process (Eq. 4), there will be different outcomes if the node B (the one with the highest degree) has the highest value of u or the lowest. In fact, and perhaps counter-intuitively, the values of the nodes
Fig. 5. Sensitivity of the solution to γ values. Solutions for the nonlinear diffusion process in Eq. 4 of the graph represented by Fig. 1, with initial values u0 = [.1 .05 .15 .20 .25], for different values of γ. (Plot omitted; vertical axis: limt→∞ u(t).)
with the highest degrees are not the most relevant for the final solution, but the least relevant ones. Since they are exposed to more influences than low degree nodes, they change faster, and any divergence they might have is corrected early in the dynamics, while the more isolated nodes, adapting slowly to the consensus value, are able to exert influence later in the dynamic evolution of u. The other factor that influences the solution of the nonlinear process (Eq. 4) is the topology of the graph. Changes in the degrees of the nodes, through the same mechanism that explains the sensitivity to the initial conditions, will produce different solutions (see Fig. 7).
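The asymptotic cases discussed in this section can be explored numerically with a minimal sketch. It integrates Eq. 4 by explicit Euler steps for a batch of initial vectors: each node is pulled up by higher neighbors with weight γ and pulled down by lower neighbors with weight 1. The function name `asymmetric_limit`, the step size `dt`, the number of steps and the tested values of γ are illustrative assumptions.

```python
import itertools
import numpy as np

# Five-node example of Fig. 1 (node order A, B, C, D, E)
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
u0 = np.array([0.10, 0.05, 0.15, 0.20, 0.25])

def asymmetric_limit(U0, A, gamma, dt=0.05, steps=2000):
    """Euler-integrate Eq. 4 for a batch of initial vectors (one per row of U0).

    dt and steps are assumed values; dt * gamma * (max degree) should not
    exceed 1 so that every update stays a convex combination of neighbors.
    Returns the final states, one row per initial vector.
    """
    U = np.atleast_2d(np.asarray(U0, dtype=float)).copy()
    for _ in range(steps):
        diff = U[:, None, :] - U[:, :, None]          # diff[p, i, j] = u_j - u_i
        weight = A * np.where(diff > 0, gamma, 1.0)   # direction-dependent speed
        U = U + dt * (weight * diff).sum(axis=2)
    return U

# Consensus value reached from u0 for several values of gamma
lim = {g: asymmetric_limit(u0, A, g)[0].mean() for g in (0.0, 0.5, 1.0, 2.0)}

# Same values of u0 assigned to the nodes in every possible order
perms = np.array(list(itertools.permutations(u0)))
perm_limits_one = asymmetric_limit(perms, A, 1.0).mean(axis=1)
perm_limits_half = asymmetric_limit(perms, A, 0.5).mean(axis=1)
```

In this sketch `lim[0.0]` approaches the minimum of u0 and `lim[1.0]` its mean, while the limits for γ = 0.5 and γ = 2 fall below and above the mean, respectively. The spread of `perm_limits_half` across permutations illustrates the sensitivity to where each initial value sits, which vanishes in the symmetric case `perm_limits_one`.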
3 A Real Network: Boards of Directors and the Proportion of Women on Boards
A real example of the asymmetric diffusion process proposed (Sect. 2.2) is the graph of US listed companies, or more precisely, their Boards of Directors.
3.1 Network of Company Boards
The Board of Directors is the main corporate decision-making body, receiving its power from shareholders that count on it to defend their interests. Some members are company executives with inside knowledge of the company, and the rest are independent directors, whose only relationship with the company is the part-time job of attending board meetings and ensuring that shareholders’ interests are protected. The part-time commitment of independent directors allows them to serve on multiple boards. Therefore, we can construct a network where each company is a node, and two nodes are connected by an edge if they share at least one director. The network topology has been explored in the past (e.g., [1,8]), although never in the context of a dynamic diffusion process. Using the Boardex database, we have recovered, for 9,855 US listed companies, all their directors from 2005 to 2015. Then, we have built a separate network
Fig. 6. Sensitivity of the solution to the initial values. Solutions for the nonlinear diffusion process (Eq. 4) for the graph represented by Fig. 1, different values of γ and all possible combinations of the initial values u0 = [.1 .05 .15 .20 .25]. (Plot omitted; vertical axis: limt→∞ u(t).)
Fig. 7. Sensitivity of the solution to the topology of the graph. Solutions for the nonlinear diffusion process (Eq. 4) for the graph represented by Fig. 1, initial values u0 = [.1 .05 .15 .20 .25]; for different values of γ and all possible combinations of a graph of 5 nodes connected in a single cluster.
for each year (the Board composition is approved in the Annual Meeting of Shareholders), taking into account the directors that sit on multiple boards (e.g., in 2015, there were 29,060 cases of pairs of companies sharing at least one director). The consensus variable that produces a gradient and is potentially diffused through the graph is the proportion of women on each board (W). This variable has received increasing attention, both in research (e.g., [6,12,15]) and among investors, stakeholders and policy makers [2,16]. Given this attention, W is a feature that the media, institutional investors, and politicians closely follow, and legitimacy pressure might lead companies to try to emulate neighboring boards. Of a total of 689,522 director positions (an average of 43 thousand per year), 9.6% are held by women.
3.2 Empirical Symmetric Model
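The empirical models require, for each year, the adjacency matrix of the interlock network and the proportion of women on each board, as described in Sect. 3.1. A minimal sketch of how these could be assembled from seat-level records follows. The record layout `(company, director, is_woman)`, the function name `build_board_network` and the toy data are hypothetical and do not reflect the Boardex schema.

```python
import numpy as np

def build_board_network(seats):
    """Build the interlock network for one year.

    seats: iterable of (company, director, is_woman) tuples, one per board seat
           (hypothetical layout; a director's gender flag is assumed consistent).
    Returns (companies, A, W): sorted company labels, the 0/1 adjacency matrix
    (1 if two boards share at least one director) and the share of women per board.
    """
    seats = list(seats)
    companies = sorted({c for c, _, _ in seats})
    directors = sorted({d for _, d, _ in seats})
    c_idx = {c: i for i, c in enumerate(companies)}
    d_idx = {d: i for i, d in enumerate(directors)}

    B = np.zeros((len(companies), len(directors)))   # board-by-director incidence
    is_w = np.zeros(len(directors))                  # 1.0 for women directors
    for company, director, is_woman in seats:
        B[c_idx[company], d_idx[director]] = 1.0
        is_w[d_idx[director]] = float(bool(is_woman))

    shared = B @ B.T                  # number of shared directors per pair of boards
    A = (shared > 0).astype(float)
    np.fill_diagonal(A, 0.0)          # no self-loops
    W = (B @ is_w) / B.sum(axis=1)    # proportion of women on each board
    return companies, A, W

# Toy example: boards X and Y share director d2; board Z is isolated
seats = [
    ("X", "d1", False), ("X", "d2", True),
    ("Y", "d2", True), ("Y", "d3", False), ("Y", "d4", False),
    ("Z", "d5", False),
]
companies, A, W = build_board_network(seats)
```

Running the function once per year would give the yearly matrices with elements aij(t) and the vectors Wi(t) used in the estimations below.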
Once we have Wi for each company, and have identified the companies j that share directors with i, we can estimate the discrete version of Eq. 2,
$$\Delta W_i(t+1) = \mu - \varepsilon \sum_{j \neq i} a_{ij}(t) \, \frac{1}{k_i} \left( W_i(t) - W_j(t) \right) + \nu_{it}, \tag{5}$$
where ΔWi(t + 1) is the change in the proportion of women (W) in company i between year t and year t + 1. Parameter ε is the speed of diffusion of W in the graph. If ε is equal to zero, then there is no diffusion process. If ε is positive, then we have a diffusion where each company tries to bring its proportion Wi closer to those of its neighborhood. Coefficients aij(t) are the elements of the adjacency matrix for the graph in year t. The independent variable (Wi(t) − Wj(t)) is the gradient in W between i and j in the graph. A difference with the theoretical model is that we cannot assume that the network is a closed system, since every year new directors enter (i.e., new hiring) and exit (i.e., either for retirement or decease) the network. Therefore, in line with a heat diffusion, we add a constant μ to the model. Additionally, ki is the degree of node i. Finally, νit is a random noise that takes into account all other factors, not related to the diffusion process, that change the proportion Wi(t). Since we observe Wi(t), we can estimate Eq. 5 for each year; the recovered values of ε are shown in Fig. 8 along with their 95% confidence intervals. Parameters are similar for all years, always clearly above zero. This implies that there is a diffusion of W, and that this is quite stable over the studied period. The constant μ is also positive and significant, implying an annual increase in the proportion of WoB of between 0.2% and 1%.
3.3 Empirical Asymmetric Model
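In the case of an asymmetry in the diffusion (Sect. 2.2), with different speeds of rise and fall of W, the discrete version of Eq. 4 is as follows,
$$\Delta W_i(t+1) = \mu - \varepsilon_{l \to h} \sum_{j \sim i} a_{(l \to h)ij}(t) \, \frac{1}{k_i} \left( W_i(t) - W_j(t) \right) - \varepsilon_{h \to l} \sum_{j \sim i} a_{(h \to l)ij}(t) \, \frac{1}{k_i} \left( W_i(t) - W_j(t) \right) + \nu_{it}, \tag{6}$$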
where a(l→h)ij is the element i, j of the adjacency matrix Al→h, while a(h→l)ij is the same element for the adjacency matrix Ah→l, and νit is the random noise, as in Eq. 5. Finally, the parameters εl→h and εh→l are the speeds of diffusion for the directional graphs represented by Al→h and Ah→l, respectively. In this case, εl→h is the rate of decrease of W for the nodes with higher values because of the influence of the neighbors with lower W; and εh→l is the rate of increase of W for the nodes with lower values because of the influence of the neighbors with higher ones. The recovered values of εl→h and εh→l in Eq. 6 for each year are shown in Fig. 9 along with their 95% confidence intervals. The values obtained for εl→h are systematically higher (around double) than those for εh→l, implying that the diffusion of W along the US Boards is not symmetric, being faster when
Fig. 8. Parameter estimates of Eq. 5, for the diffusion of W, on US Boards, with 95% confidence intervals. Left for μ and right for ε estimates. (Plots omitted; horizontal axis: year.)
Fig. 9. Parameter estimates of Eq. 6 of the diffusion of W, on US Boards, with 95% confidence intervals. Left for μ and right for εl→h and εh→l estimates. (Plots omitted; horizontal axis: year.)
the company has a higher value of W than its neighbors and receives pressure to reduce W, and slower when the company has a lower value of W than its neighbors and the pressure is to increase the value of W. As a consequence, although the open nature of the network produces a natural increase in the proportion of W (μ > 0), the asymmetry of the ε parameters (εl→h > εh→l) produces a pressure in the graph to keep the proportion of women on boards at low levels (something known in the literature as the Old Boys’ Club [6,12]).
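The estimation of Eqs. 5 and 6 can be illustrated with a minimal sketch that builds the degree-normalized gradient regressors and fits them by ordinary least squares. The estimator is not named in the text, so OLS is an assumption here. The data are synthetic: the random graph, the distribution of W, the noise level and the "true" parameter values are all hypothetical and chosen only so that the recovered coefficients can be checked.

```python
import numpy as np

def gradient_regressors(A, W):
    """Degree-normalized gradients entering Eq. 6; their sum is the Eq. 5 regressor."""
    diff = W[:, None] - W[None, :]                 # diff[i, j] = W_i - W_j
    k = np.maximum(A.sum(axis=1), 1.0)             # avoid 0/0 for isolated boards
    g_lh = (A * np.where(diff > 0, diff, 0.0)).sum(axis=1) / k   # neighbors below i
    g_hl = (A * np.where(diff < 0, diff, 0.0)).sum(axis=1) / k   # neighbors above i
    return g_lh, g_hl

def estimate_asymmetric(A, W, dW):
    """OLS estimates (mu, eps_lh, eps_hl) of Eq. 6 for one cross-section."""
    g_lh, g_hl = gradient_regressors(A, W)
    X = np.column_stack([np.ones_like(W), -g_lh, -g_hl])
    coef, *_ = np.linalg.lstsq(X, dW, rcond=None)
    return coef

# Synthetic cross-section (hypothetical values, not the paper's data)
rng = np.random.default_rng(1)
n = 1000
upper = np.triu(rng.random((n, n)) < 0.01, k=1)    # random undirected graph
A = (upper | upper.T).astype(float)
W = rng.uniform(0.0, 0.4, n)

mu_true, eps_lh_true, eps_hl_true = 0.005, 0.10, 0.05
g_lh, g_hl = gradient_regressors(A, W)
dW = (mu_true - eps_lh_true * g_lh - eps_hl_true * g_hl
      + 0.002 * rng.standard_normal(n))

mu_hat, eps_lh_hat, eps_hl_hat = estimate_asymmetric(A, W, dW)
```

Regressing on `g_lh + g_hl` as a single variable gives the symmetric model of Eq. 5. Comparing `eps_lh_hat` with `eps_hl_hat` mirrors the asymmetry comparison reported in Fig. 9.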
4 Conclusions
In this paper, we have shown how the linear diffusion in graphs/networks can become nonlinear if there is asymmetry in the speed of diffusion. We have also shown how this case can be represented as the aggregation of the diffusion through a double layer of directional graphs. The asymptotic solution of the problem is now more complex with dependence on the relative differences in the speed of diffusion, the initial conditions, and the topology of the graph. Other agreement problems previously studied (e.g., the zealot problem) can be considered particular cases of this more general framework.
We have also shown a real case of this type of diffusion: the proportion of women directors on boards of US listed firms. We have seen that this asymmetry happens, with a lower speed of diffusion when the proportion of women on the board of a company is lower than in its neighbors than when this proportion is higher. As a consequence of this asymmetry, the asymptotic value is lower than in the case of a symmetric diffusion process, producing a chronic scarcity of women on boards. This research has multiple applications: asymmetries in the diffusion of any feature through a network should be explored in the many other cases where graph diffusion is used, from epidemiology to Twitter discussions, since asymmetries not taken into account can produce unexpected outcomes.
Acknowledgments. This research has received financial support from the Spanish Government (Project I+D+i PID2020-114183RB-I00, funded by AEI/FEDER, UE). We are also thankful to Rosa Benito, Javier Borondo, Ernesto Estrada, Juan Carlos Losada and Miguel Rebollo, as well as the participants in the Mediterranean Summer School on Complex Networks and the International Conference on Nonlinear Mathematics and Physics for their comments and suggestions.
References
1. Mateos de Cabo, R., Grau, P., Gimeno, R., Gabaldon, P.: Shades of power: network links with gender quotas and corporate governance codes. Br. J. Manag. (2021, forthcoming)
2. Mateos de Cabo, R., Terjesen, S., Escot, L., Gimeno, R.: Do ‘soft law’ board gender quotas work? Evidence from a natural experiment. Eur. Manag. J. 37(5), 611–624 (2019)
3. Danon, L., et al.: Networks and the epidemiology of infectious disease. Interdisc. Perspect. Infect. Diseases 2011, 1–28 (2011)
4. Estrada, E.: Graph and network theory. In: Digital Encyclopedia of Applied Physics, pp. 1–48 (2003)
5. Estrada, E.: Path Laplacian matrices: introduction and application to the analysis of consensus in networks. Linear Algebra Appl. 436(9), 3373–3391 (2012)
6. Gabaldon, P., De Anca, C., Mateos de Cabo, R., Gimeno, R.: Searching for women on boards: an analysis from the supply and demand perspective. Corp. Gov. Int. Rev. 24(3), 371–385 (2016)
7. Ganesh, A., Massoulié, L., Towsley, D.: The effect of network topology on the spread of epidemics. In: Proceedings IEEE 24th Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 2, pp. 1455–1466. IEEE (2005)
8. Grau, P., Mateos de Cabo, R., Gimeno, R., Olmedo, E., Gabaldon, P.: Networks of boards of directors: is the ‘golden skirts’ only an illusion? Nonlinear Dyn. Psychol. Life Sci. 24(2), 215–231 (2020)
9. House, T., Keeling, M.J.: Insights from unifying modern approximations to infections on networks. J. R. Soc. Interface 8(54), 67–73 (2011)
10. Iribarren, J.L., Moro, E.: Impact of human activity patterns on the dynamics of information diffusion. Phys. Rev. Lett. 103(3), 038702 (2009)
11. Karsai, M., et al.: Small but slow world: how network topology and burstiness slow down spreading. Phys. Rev. E 83(2), 025102 (2011)
12. Mateos De Cabo, R., Gimeno, R., Nieto, M.: Gender diversity on European banks’ boards of directors. J. Bus. Ethics 109(2), 145–162 (2012)
13. Mobilia, M.: Does a single zealot affect an infinite group of voters? Phys. Rev. Lett. 91(2), 028701 (2003)
14. Shalizi, C.R., Thomas, A.C.: Homophily and contagion are generically confounded in observational social network studies. Sociol. Methods Res. 40(2), 211–239 (2011)
15. Terjesen, S., Aguilera, R.V., Lorenz, R.: Legislating a woman’s seat on the board: institutional factors driving gender quotas for boards of directors. J. Bus. Ethics 128(2), 233–251 (2015)
16. Terjesen, S., Sealy, R.: Board gender quotas: exploring ethical tensions from a multi-theoretical perspective. Bus. Ethics Q. 26(1), 23–65 (2016)
17. Zhang, J., Moura, J.M.: Accounting for topology in spreading contagion in noncomplete networks. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2681–2684. IEEE (2012)
Marginalisation and Misperception: Perceiving Gender and Racial Wage Gaps in Ego Networks
Daniel M. Mayerhoffer1(B) and Jan Schulz2
1 Institute for Political Science, University of Bamberg, Bamberg, Germany [email protected]
2 Department of Economics, University of Bamberg, Bamberg, Germany
https://www.uni-bamberg.de/en/poltheorie/staff/daniel-mayerhoffer/
Abstract. We introduce an agent-based model of localised perceptions of the gender and racial wage gap in Random Geometric Graph type networks that result from economic homophily independent of gender/race. Thereby, agents estimate inequality using a composite signal consisting of local information from their personal neighbourhood and the actual global wage gap. This can replicate the underestimation of the gender or racial wage gap that empirical studies find and the well-documented fact that the underprivileged perceive the wage gap to be higher on average with less bias. Calibration by a recent Israeli sample suggests that women place much more weight on the (correct) global signal than men, in line with the hypothesis that people who are adversely affected by a wage gap listen more carefully to global information about the issue. Hence, (educational) interventions about the global state of gender and racial inequality are promising, especially if they target the privileged.
Keywords: Wage gap · Inequality · Perceptions · Networks · Local knowledge · Diversity · Ego networks · Information treatments
1 Introduction
Perceptions matter. There is now widespread agreement that it is perceived inequities and not necessarily actual ones driving redistributive preferences and policy decisions [5,12]. Apart from biased perceptions of empirical wealth and income distributions, the literature increasingly also documents misperceptions of gender and racial inequality [15,17]. These results have important policy implications: If the general public overestimates the achieved progress in both racial and gender equality, the issues might never surface in public debate [4] or individual salary negotiations [22]. Misperceptions about the wage gap might thus contribute to its persistence documented in [1] - and addressing this persistence would require addressing its biased perception. Despite their relevance for both labour market outcomes and public policy, there is, to the best of our knowledge, no formal model of belief formation in this regard yet. We propose a simple network model based on homophilic attachment originally employed for perceptions of overall inequality by [23] to test several candidate explanations. Most notably, we test the hypothesis by [15] that skewed local information sets due to segregated social networks might cause misperceptions. By contrast, misperceptions might also be due to biased information processing of a correct global signal. By calibrating our model with data from a recent sample in [17], we find that both the local and the global channel matters but that their relative importance differs between the privileged and underprivileged.
Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation 430621735).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 768–779, 2022. https://doi.org/10.1007/978-3-030-93409-5_63
2 Related Literature
This paper considers perceptions of the wage gap between the privileged and underprivileged as emerging from endogenously evolving reference groups, modelled as ego networks. The literature on the gender wage gap suggests its overall underestimation [17], whereby women have much higher and more accurate perceptions on average than men [17]. Individuals form their perceptions from lived experience in (potentially) gender-diverse situations [2] but are also generally aware of the global wage gap, e.g., through national media [9]. The findings on perceptions of racial wage inequality in the US also suggest its general underestimation with black employees who are adversely affected by the wage gap generally perceiving higher wage gaps and being more accurate in its assessment [14]. [15] find that the richest white people tend to underestimate the racial wage gap most strongly and provide some evidence that this might be related to the diversity in social networks that is particularly low for the richest part of the population. Apart from this local channel, the efficacy of informational treatments reporting aggregates also suggests the relevance of global averages on individual perceptions [11]. To validate our model, we thus consider two stylised facts: (i) The general population appears to underestimate the extent of gender or racial inequality and (ii) this underestimation is mainly driven by the privileged’s bias. As generating mechanisms, both local knowledge in the form of lived experience and global signals, through e.g. the national media, appear to be of importance.
3 Model
This section gives a content-oriented presentation; we provide a technical description following the ODD protocol and the NetLogo implementation of the simulation model at github.com/mayerhoffer/Inequality-Perception. Our model extends [23] who employ the linking procedure to study perceptions of inequality in general. The model consists of three distinct phases that take place once per simulation run and in sequential order: (1) Agent initialisation and group-specific
income allocation, (2) network generation through agents’ homophilic linkage, and (3) individual wage gap perception and network evaluation. There are 1,000 agents in the model belonging to either the privileged group (e.g., men) or the underprivileged group (e.g., women). Irrespective of their group, each agent draws their income from an exponential distribution with a mean of λ = 1, but the underprivileged agents’ income is then downscaled by the factor (1 − g), where the wage gap g ∈ (0, 1) is identical for all agents. Thus, the overall income distribution irrespective of privilege also follows an exponential law. This model input normalises the empirically observed (pre-tax or market) wage distributions in various industrialised countries for the vast majority of individuals for the whole population [26] as well as for gender or racial groups considered separately [24]. Each agent draws five others and forms an undirected and unweighted link with each of them. A link can represent any social relationship and simply indicates mutual knowledge of each other’s income. Consequently, each agent has at least five link-neighbours (i.e., social contacts) but may have more. The number five is empirically validated as the size of the closest layer of intense contacts [16]. Moreover, sensitivity analyses indicate that a larger number of links has little impact. When drawing link neighbours, an agent does not care about the potential drawee’s group membership. However, agents do care about the potential drawee’s income Y, i.e., link-formation is homophilic in income. Namely, agent j’s weight in agent i’s draw is denoted by $\Omega_{ij}$ and determined as follows:
$$\Omega_{ij} = \frac{1}{\exp\left[\rho\,|Y_j - Y_i|\right]} \tag{1}$$
ρ = 0 represents a random graph, and for an increasing value of ρ, an agent becomes ever more likely to pick link-neighbours with incomes closer to their own. The link function’s exponential character ensures that those with large income differences become unlikely picks even at low to moderate homophily strengths. Figure 1 illustrates the linkage probabilities implied by the weighted draw based on the exponentially distributed income levels. For an extensive analysis of this linkage behaviour, see [23].
Fig. 1. Theoretical Probability Density Functions (PDFs) of a node with a given income rank for linkage with another node for the whole support of 1,000 income ranks. Plots reproduced with permission from [23].
The resulting network is a member of the family of Random Geometric Graphs [6], which [25] showed to reproduce core features of many social networks efficiently. Specifically, we combine the notions of homophily [3] with pre-setting node degrees [19,20]. However, concerning our application, we simplify both approaches by pre-determination of only the global minimum degree, like in Preferential-Attachment networks, and consequently defining relative weights rather than absolute probabilities.
Agents perceive the wage gap given their individual knowledge consisting of a local component defined by an agent’s position in the network and a global component, which is simply the true wage gap. The composite perception $p_i$ of individual i is a linear combination of the local perception $l_i$ and the (correct) global wage gap g, given by
$$p_i = (1 - w_i) \cdot l_i + w_i \cdot g. \tag{2}$$
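The initialisation and link-formation steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' NetLogo implementation: the function names, the seed, the toy population size and the choice to let agents draw sequentially and without replacement are all hypothetical.

```python
import math
import random


def init_incomes(n_privileged, n_underprivileged, gap, rng):
    """Draw exponential incomes with mean 1; scale the underprivileged by (1 - gap)."""
    incomes = [rng.expovariate(1.0) for _ in range(n_privileged)]
    incomes += [(1.0 - gap) * rng.expovariate(1.0) for _ in range(n_underprivileged)]
    privileged = [True] * n_privileged + [False] * n_underprivileged
    return incomes, privileged


def link_weight(y_i, y_j, rho):
    """Relative weight of Eq. (1): 1 / exp(rho * |Y_j - Y_i|)."""
    return math.exp(-rho * abs(y_j - y_i))


def build_network(incomes, rho, draws, rng):
    """Let every agent draw `draws` distinct others; links are undirected."""
    n = len(incomes)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        weights = [link_weight(incomes[i], incomes[j], rho) for j in candidates]
        for _ in range(draws):
            # Weighted draw without replacement (an assumption of this sketch).
            k = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
            j = candidates.pop(k)
            weights.pop(k)
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


rng = random.Random(42)  # hypothetical seed, for reproducibility only
incomes, privileged = init_incomes(50, 50, gap=0.3, rng=rng)
network = build_network(incomes, rho=2.0, draws=5, rng=rng)
print(min(len(nb) for nb in network))  # at least 5 by construction
```

Because links are undirected, an agent keeps its own five draws and may additionally be drawn by others, which is why five is only the minimum degree. Setting `rho=0.0` makes all weights equal and reduces the draw to uniform random linking.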
The only free parameter is thus $w_i \in \mathbb{R}_0^+$, the weight i puts on the global signal. Whenever the composite perception $p_i$ increases relative to $l_i$, the weight on the global signal increases, as then local knowledge is less relevant for perceptions. Our dataset includes g and the mean perceptions for the underprivileged and privileged groups, i.e., $\bar{p}_U$ and $\bar{p}_P$. Together with the simulation means for the local signal, $\bar{l}_U$ and $\bar{l}_P$ for both privilege classes, we are thus able to estimate the implied mean population weights $\bar{w}_U$ and $\bar{w}_P$.
An agent i’s local perception $l_i$ of the wage gap is based on their perception set $\Theta_i$ in their neighbourhood. Namely, it is the percentage difference in mean wages of the perceived set of underprivileged U (of size $N_{U,i}$) and privileged P (of size $N_{P,i}$) link neighbours.¹ For either $N_{U,i} = 0$ or $N_{P,i} = 0$, we set $p_i = 0$, as in this case, agent i is unable to observe any wage gap. Otherwise, they calculate:
$$l_i = \frac{\bar{Y}_P^i - \bar{Y}_U^i}{\frac{1}{2}\left(\bar{Y}_P^i + \bar{Y}_U^i\right)}, \tag{3}$$
with $\bar{Y}_P^i = (1/N_{P,i}) \cdot \sum_{j \in \{\Theta_i \cap P\}} Y_j$ and $\bar{Y}_U^i = (1/N_{U,i}) \cdot \sum_{j \in \{\Theta_i \cap U\}} Y_j$, for both $N_{U,i} \neq 0$ as well as $N_{P,i} \neq 0$.
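The local signal of Eq. (3) and the composite rule of Eq. (2) can be illustrated as follows. This is a hedged sketch with hypothetical function names and toy incomes. The text sets the perception to zero when one group is missing from the neighbourhood; the sketch applies that rule to the local signal, which is an interpretation rather than something the text states.

```python
def local_perception(neighbour_incomes, neighbour_privileged):
    """Symmetric percentage difference of mean incomes among link-neighbours, Eq. (3)."""
    priv = [y for y, p in zip(neighbour_incomes, neighbour_privileged) if p]
    under = [y for y, p in zip(neighbour_incomes, neighbour_privileged) if not p]
    if not priv or not under:
        return 0.0  # no wage gap observable (interpretation, see lead-in)
    mean_p = sum(priv) / len(priv)
    mean_u = sum(under) / len(under)
    # Incomes are strictly positive in the model, so the denominator is non-zero.
    return (mean_p - mean_u) / (0.5 * (mean_p + mean_u))


def composite_perception(local, gap, weight):
    """Linear combination of local signal and true global gap, Eq. (2)."""
    return (1.0 - weight) * local + weight * gap


# Toy neighbourhood: two privileged and two underprivileged contacts.
l_i = local_perception([1.2, 0.8, 0.6, 0.8], [True, True, False, False])
p_i = composite_perception(l_i, gap=0.3, weight=0.5)
print(l_i, p_i)
```

With `weight=0.0` the agent relies on the neighbourhood alone, and with `weight=1.0` the perception equals the true gap, which matches the two limiting cases of Eq. (2).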
4 Results
We initialise our model with N = 1,000 agents, equally split into a ‘privileged’ and an ‘underprivileged’ class with 500 members each. Both are initialised with an identical exponential distribution with rate parameter λ = 1. All incomes in the underprivileged class are then downscaled by (1 − g), in line with the empirical results by [24]. For our calibration exercise in Subsect. 4.3, we use the empirical estimates by [17], again with a total population N = 1,000. Our results are averaged over 100 Monte Carlo (MC) runs.
¹ We opt for this particular approximation because it is symmetric [27] and bounded in [−2, 2] and thus more robust against outliers [7]. Our results are not materially sensitive to the specific symmetric approximation of the growth rate and very similar to the more common approximation by log-differences.
4.1 Mean Local Perceptions
We first look at the mean local perceptions, i.e., w = 0, according to privilege and for varying g and ρ in the empirically relevant ranges. Recall that the literature suggests that (i) the wage gap is underestimated regardless of privilege but that (ii) the underprivileged have much more accurate perceptions than the privileged. Figure 2 summarises this first battery of simulation results.
Fig. 2. Violin plots of agents’ perceived local wage gap li relative to the true wage gap g for various g and ρ. Gridline at unity thus suggests correct perceptions.
In line with stylised fact (i), our simulated population overall vastly underestimates the wage gap, regardless of privilege status. We find that all violin plots are significantly below the gridline at unity that would indicate correct perceptions.² Of course, this is entirely unsurprising. The homophilic graph formation lets all agents estimate the wage gap over a set of neighbours with similar income. It is thus improbable to observe incomes that strongly deviate from the agent’s income in question, including incomes from a different privilege class. This unidirectional downwards bias on perceptions affects the two classes differently, though. This follows from the U-shaped selectivity patterns in relation to the whole income distribution we show in Fig. 1, where selectivity is locally maximal at the highest and lowest income ranks. A wage gap g > 0 with ρ > 0 thus exhibits two counteracting effects. Firstly, it moves the incomes of the richest underprivileged agents closer to the global median income, decreasing their selectivity and thus generally increasing their perception of the wage gap. Secondly, it also moves the income of the poorest underprivileged agents away from the incomes of the poorest privileged agents. Thus, the neighbourhoods of the underprivileged poor are less diverse, and their estimates of the wage gap are biased downwards. The direction of the total effect depends on both g and ρ, as Fig. 2 shows. In general, increasing ρ increases the relative strength of the first partial effect, as the richest become much more selective here in relative terms. Apart from the case of a very low wage gap g = 0.1, we find that stylised fact (ii) can only be replicated for rather large homophily levels ρ ≥ 2. Only for g = 0.1 can stylised fact (ii) be quantitatively replicated also for low homophily levels. Here, g is sufficiently small not to segregate the top within the global income distribution too much, thus attenuating the effect of top-income selectivity. While we find that the two stylised facts (i) and (ii) can be qualitatively replicated for high homophily strengths ρ ≥ 2, our quantitative estimation results indicate that the downwards bias is much too strong.
In the sample by [17] with a g close to the case of g = 0.3, men underestimate the gender wage gap by about 36% and women only about 22%. This is, of course, far from the at least about 90% downwards bias that is necessary in our model setup for underprivileged agents to be closer to reality than the privileged and thus replicate stylised fact (ii). Our result might indicate that, in line with the composite rule we propose in Eq. (2), perceptions of the gender wage gap are formed by averaging over global and local signals. We calibrate our model with the data in [17] to spell out the implications of the composite specification in greater detail.
² In an interesting complementary perspective, [13] demonstrate that the actual (gender) inequities we take as given can also emerge purely from interpersonal comparisons based on similarity, much like our local signal for perceptions.
4.2 Individual Local Perceptions
The rather clear average relationship between income, diversity and perceived wage gaps masks important heterogeneity on the more granular individual level. Generally, diversity is positively correlated with income for the underprivileged class: The wage gap pushes the poorest underprivileged away from the poorest privileged agents, therefore decreasing diversity; at the same time, it pushes the richest underprivileged close to the global median income, where selectivity is lowest and thus diversity highest. By contrast, the relationship is generally negative for the privileged class: The richest privileged agents are also the richest overall, with few underprivileged agents in their neighbourhood, and thus exhibit relatively low diversity in their perception sets. The poorest privileged agents are pushed towards the median income with low selectivity and rather high diversity levels. These mechanisms become even stronger for a higher wage gap g, which intensifies the disproportion between in-group and global income positions. Furthermore, the homophily ρ also catalyses the sketched mechanisms. Indeed, we find a generally positive relationship between income Y and diversity D for the underprivileged agents, while the relationship is negative for the privileged agents. However, these relationships display notable non-linearities and even non-monotonicities, as Figs. 3 and 4 illustrate. Moreover, the effects of a rising g for a given ρ and vice versa on diversity are also non-monotonic.³ Notice also that our model mechanism features the endogenous emergence of gender or racial homophily in many cases (D < 0.5) from income homophily alone, a pattern that is empirically well-established [18]. Finally, chance notably impacts the diversity of most agents’ neighbourhoods, too. A vertical section through a heatmap represents an individual agent’s (defined by their income) neighbourhood diversities, and associated wage gap perceptions, in the MC runs for a given combination of g and ρ.
Fig. 3. Heatmap of local wage gap perceptions l (as a ratio to the correct wage gap g) for varying privilege, diversity D (rate of links to the other privilege class) and income Y for ρ = 3 and varying actual wage gaps g ∈ {0.1; 0.2; 0.3; 0.4}. Upper panels for the underprivileged agents, lower panels for the privileged ones.
Fig. 4. Heatmap of local wage gap perceptions l (as a ratio to the correct wage gap g) for varying privilege, diversity D (rate of links to the other privilege class) and income Y for g = 0.2 and varying homophily strengths ρ ∈ {1; 2; 3; 4}. Upper panels for the underprivileged agents, lower panels for the privileged ones.
The differences in individual neighbourhood diversity in turn also impact the agent’s locally perceived wage gap l. Hereby, both figures highlight that the impact of Y and D (and hence also g and ρ) on l is far from straightforward. Three aspects are most striking. Firstly, for a given ρ, a higher wage gap does not mean that all agents perceive a higher wage gap, since the perception depends on their link-neighbours; in some cases, a higher global wage gap actually causes some local differences between privileged and underprivileged agents to shrink. Secondly, for a given g (and thus also a fixed income and income rank distribution), a rising ρ expectedly lowers local wage gap perceptions overall, but some agents’ l behaves non-monotonically, too. Thirdly, individual perceptions for a given combination of g and ρ vary across MC runs, testifying to the impact of chance. Consequently, unlike general inequality perceptions in [23], individual local wage gap estimates cannot be clearly associated with income rank, nor with the individual being underprivileged or privileged. Instead, these individual perceptions seem highly situational, i.e., depending on features of the specific network that the pre-set model parameters do not fully determine. The cause for these observations is that changes in the relative position of an agent (through the wage gap) might move an income into or out of the most likely set to be chosen by another agent. The most probable perception set of an agent i is the one in which they occupy the middle position, as shown in the model section and proved in [23]. When the wage gap changes the relative ranks of only two agents with differing privilege classes, this might cause large changes in the perceived wage gap between both. The ‘ruggedness’ of the perception landscapes increases in homophily levels, where the most likely set becomes disproportionately more likely compared to all other possible perception sets. Notice also that the variability of estimates varies strongly with income, with indeed the richest and poorest being the least variable (since selectivity is highest there). The sensitivity of these associations to initial conditions is disconcerting for applied empirical work if the empirical target system indeed resembles our model.
³ We selected empirically probable values of g and ρ for presentation, but the heatmaps for other parameter combinations display similar complexity and non-monotonicity. Indeed, the association between income, diversity and the wage gap is even more discontinuous and non-monotonic, as the rather rugged landscapes for perceptions in all panels suggest. All heatmaps are available upon request.
The complexity of the landscape indicates that population means mask important heterogeneity within this population, and there exists no simple, meaningful aggregate representation. Moreover, we find that the relationships between income, diversity and perceptions are not only dependent on the actual wage gap, the strength of homophily and the privilege class but also vary along the income distribution with sudden changes in perceptions and variability. Therefore, it appears impossible to fit any monotonic reduced-form function to these relationships. By construction of the model, all global behavioural equations are rather straightforward and easily comprehensible, yet empirical work might fail to detect and recover them in the presence of these local non-monotonicities.
4.3 Composite Signal and Empirical Calibration
The simulation results in Subsect. 4.1 indicate that the effect from homophilic segregation on local perceptions is too strong to replicate the empirically observed effect sizes in [17]. Therefore, we allow for a composite signal aggregated from global and local information, that is, we allow for w > 0. We calibrate our baseline model in Eq. (2) with the empirically observed parameters to deduce the weight that the privileged (male) and the underprivileged (female) class each put on the global signal on average. The empirical gender wage gap for the Israeli sample is g = 0.317, with a female labor force participation rate of 0.472, implying an underprivileged population size of $N_U = 472$ and a privileged population size of $N_P = 528$. The target mean perception of the underprivileged is $\bar{p}_U = 0.292$ and of the privileged $\bar{p}_P = 0.203$. The estimated implied mean weights $\bar{w}_U$ and $\bar{w}_P$ for both the underprivileged and privileged class and different ρ are shown in Fig. 5. We only show estimates for ρ ≥ 1.5, since for values below, the implied weight $\bar{w}_P$ becomes negative without any clear empirical counterpart or interpretation: For low homophily strengths, both the global and the local signal are above the empirical mean perceptions for men. The estimation results by [23] indeed also suggest that the implied homophily strength for empirical perception networks is ρ ≥ 1.5. Within this empirically relevant range, we find that the underprivileged group places a much higher weight on the global signal than the privileged. This gap between the underprivileged and privileged classes narrows with higher homophily, when the local perceptions of the privileged decrease, but continues to be sizeable for all considered ρ. Notice that this also implies that the composite perceptions of the underprivileged are much less volatile, as they put more weight on the (certain) global signal rather than the (noisy) local one.
Fig. 5. Implied weights for the global signal for both privilege classes and the empirical calibration.
Our calibration exercise thus suggests two major results: Firstly, local perception formation on skewed information sets is insufficient to generate the observed empirical effect sizes. Only composite signals that combine a (correct) global signal with the local perception can reconcile the model with the empirical evidence of wage gap perceptions [17]. By contrast, the local signal is sufficient for perceptions of global inequality in general [23]. Remarkably, this finding suggests that the concept of a ‘wage gap’ can be much more easily conveyed within educational campaigns than the arguably much more complex and ambiguous concept of ‘inequality’, in line with empirical evidence for significant effects that concept complexity and the associated cognitive costs might exhibit on individual behaviour [21]. Secondly, we find that the underprivileged place much higher weight on the global signal than the privileged with much less noise. This is consistent with the empirical finding that the adversely affected part of the population is more interested in global information about the issue [28].
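One way to read the calibration step is as an inversion of Eq. (2) at the level of group means: solving $\bar{p} = (1 - \bar{w})\,\bar{l} + \bar{w}\,g$ for the weight gives $\bar{w} = (\bar{p} - \bar{l})/(g - \bar{l})$. The sketch below applies this with the empirical targets quoted above. The local means are hypothetical placeholders, since the simulated values are only reported graphically, and the function name is an assumption of this example. Using the same placeholder for both groups is a simplification, because the simulated local means differ by privilege class.

```python
def implied_weight(mean_perception, mean_local, gap):
    """Solve p = (1 - w) * l + w * g for w at the level of group means."""
    if gap == mean_local:
        raise ValueError("weight is not identified when local and global signals coincide")
    return (mean_perception - mean_local) / (gap - mean_local)


GAP = 0.317       # empirical gender wage gap quoted in the text
P_UNDER = 0.292   # target mean perception, underprivileged
P_PRIV = 0.203    # target mean perception, privileged

# Hypothetical local means; they stand in for simulation output.
for local_mean in (0.05, 0.15, 0.25):
    w_under = implied_weight(P_UNDER, local_mean, GAP)
    w_priv = implied_weight(P_PRIV, local_mean, GAP)
    print(local_mean, round(w_under, 3), round(w_priv, 3))
```

When the placeholder local mean exceeds the privileged group's target perception (the last case), the implied weight for that group turns negative. This mirrors the situation the text describes for low homophily strengths, where both signals lie above the empirical mean perception of men.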
5 Discussion
The agent-based simulation model presented in this paper suggests that the empirically observed mild underestimation of gender and racial wage gaps results from individuals processing information obtained by comparing their link-neighbours’ incomes as well as a global signal reflecting the actual wage gap. However, underprivileged agents put considerably higher weights on the global signal than privileged ones. Therefore, those adversely affected by a wage gap may perceive it more severely, i.e., closer to its actual severity, simply because they listen more carefully to global information. For policy-makers and the public debate, this finding suggests that underprivileged groups are a reliable source, and their account of the wage gap is not entirely based on subjective experience or even self-interest but on a higher emphasis on what is objectively the case. Furthering such emphasis on the globally true value, e.g. through education, seems to be especially promising in the context of a wage gap that is easy to comprehend and should be targeted primarily at the privileged, who put much lower weight on objective evidence.
One should not understand our model as an identification of actual mechanisms in specific cases. But since these mechanisms can be true in the real world, our findings present an epistemically possible how-possibly explanation [10], or a candidate for how the stylised empirical facts came about [8]. Still, there may be alternative, more adequate candidate explanations that we are not aware of yet. A particularly promising avenue for extending the model would be to go beyond the unidimensional income homophily and to include homophily along other dimensions like race or gender in a multilayer framework; at the moment, our parsimonious model features such homophily only as a desirable ‘byproduct’ of the interplay of income homophily and the wage gap. Thus, our results call for additional empirical work to test for the existence of our model mechanisms in the real world. Nevertheless, the striking non-monotonicities imply a need for great caution when employing standard statistical measures of association. Namely, standard Pearson correlations are not meaningful as the observed relationships are highly case-dependent and non-linear. Moreover, even common measures of non-linear associations like Kendall’s τ or Spearman’s ρ might not be applicable since all considered local relationships are non-monotonic and dependent on initial conditions such as the precise random draws for the network generation and initialisation of the wage distribution. Finally, individual perceptions depend on the specific network topology. In turn, this topology emerges from the interplay of the actual wage gap, homophily, and chance. Hence, reliable predictions are impossible at an aggregate level and instead require investigation of individuals in their respective ego networks.
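The caution about standard association measures can be illustrated with a deliberately stylised example. The data below are not model output: they are a hypothetical, perfectly deterministic U-shaped relationship, chosen only to show that both a Pearson and a Spearman coefficient can be close to zero even though one variable fully determines the other. The helper names are assumptions of this sketch, and Kendall's τ is omitted for brevity.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average ranks, so that tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = [i / 50 for i in range(-50, 51)]  # symmetric grid on [-1, 1]
ys = [x * x for x in xs]               # deterministic but non-monotonic
print(pearson(xs, ys), spearman(xs, ys))
```

Both coefficients vanish up to floating-point error here, because the rising and falling branches cancel. A rank-based measure therefore does not rescue the analysis once the relationship stops being monotonic.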
References
1. Akchurin, M., Lee, C.-S.: Pathways to empowerment: repertoires of women’s activism and gender earnings equality. Am. Sociol. Rev. 78(4), 679–701 (2013)
2. Auspurg, K., Hinz, T., Sauer, C.: Why should women get less? Evidence on the gender pay gap from multifactorial survey experiments. Am. Sociol. Rev. 82(1), 179–210 (2017)
3. Boguñá, M., Pastor-Satorras, R., Díaz-Guilera, A., Arenas, A.: Models of social networks based on social distance attachment. Phys. Rev. E 70(5), 056122 (2004)
4. Ceccardi, T.: How do beliefs about the gender wage gap affect the demand for public policy? Differences 2021, 14 (2021)
5. Choi, G.: Revisiting the redistribution hypothesis with perceived inequality and redistributive preferences. Eur. J. Polit. Econ. 58, 220–244 (2019)
6. Dall, J., Christensen, M.: Random geometric graphs. Phys. Rev. E 66(1), 016121 (2002)
7. Decker, R., Haltiwanger, J., Jarmin, R., Miranda, J.: The role of entrepreneurship in US job creation and economic dynamism. J. Econ. Perspect. 28(3), 3–24 (2014)
8. Epstein, J.M.: Agent-based computational models and generative social science. Complexity 4(5), 41–60 (1999)
9. Furnham, A., Wilson, E.: Gender differences in estimated salaries: a UK study. J. Socio-Econ. 40(5), 623–630 (2011)
10. Grüne-Yanoff, T., Verreault-Julien, P.: How-possibly explanations in economics: anything goes? J. Econ. Methodol. 28(1), 114–123 (2021)
11. Haaland, I., Roth, C.: Beliefs about racial discrimination and support for pro-black policies. Rev. Econ. Stat., 1–38 (2021). In print
12. Hauser, O.P., Norton, M.I.: (Mis)perceptions of inequality. Curr. Opin. Psychol. 18, 21–25 (2017)
13. Huet, S., Gargiulo, F., Pratto, F.: Can gender inequality be created without intergroup discrimination? PLoS One 15(8), e0236840 (2020)
14. Kraus, M.W., Onyeador, I.N., Daumeyer, N.M., Rucker, J.M., Richeson, J.A.: The misperception of racial economic inequality. Perspect. Psychol. Sci. 14(6), 899–921 (2019)
15. Kraus, M.W., Rucker, J.M., Richeson, J.A.: Americans misperceive racial economic equality. Proc. Natl. Acad. Sci. 114(39), 10324–10331 (2017)
16. Mac Carron, P., Kaski, K., Dunbar, R.: Calling Dunbar’s numbers. Soc. Netw. 47, 151–155 (2016)
17. Malul, M.: (Mis)perceptions about the gender gap in the labor market. Forum Soc. Econ. 1–9 (2021). In print
18. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: homophily in social networks. Ann. Rev. Sociol. 27(1), 415–444 (2001)
19. Newman, M.E.: Random graphs with clustering. Phys. Rev. Lett. 103(5), 058701 (2009)
20. Newman, M.E., Strogatz, S.H., Watts, D.J.: Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64(2), 026118 (2001)
21. Oprea, R.: What makes a rule complex? Am. Econ. Rev. 110(12), 3913–3951 (2020)
22. Pfeifer, C., Stephan, G.: Why women do not ask: gender differences in fairness perceptions of own wages and subsequent wage growth. Camb. J. Econ. 43(2), 295–310 (2019)
23. Schulz, J., Mayerhoffer, D., Gebhard, A.: A network-based explanation of perceived inequality. Bamberger Beiträge zur Modernen Politischen Theorie, vol. 2 (2021)
24. Shaikh, A., Papanikolaou, N., Wiener, N.: Race, gender and the econophysics of income distribution in the USA. Phys. A: Stat. Mech. Appl. 415, 54–60 (2014)
25. Talaga, S., Nowak, A.: Homophily as a process generating social networks: insights from social distance attachment model. J. Artif. Soc. Soc. Simul. 23(2), 6 (2020)
26. Tao, Y., et al.: Exponential structure of income inequality: evidence from 67 countries. J. Econ. Interact. Coord. 14(2), 345–376 (2019). https://doi.org/10.1007/s11403-017-0211-6
27. Törnqvist, L., Vartia, P., Vartia, Y.O.: How should relative changes be measured? Am. Stat. 39(1), 43–46 (1985)
28. Wu, K.: Invisibility of social privilege to those who have it. Acad. Manag. Proc. 2021(1), 10776 (2021)
A Networked Global Economy: The Role of Social Capital in Economic Growth
Jaime Oliver Huidobro1,2(B), Alberto Antonioni1, Francesca Lipari1, and Ignacio Tamarit2
1 Carlos III University, Madrid, Spain [email protected]
2 Clarity AI Europe S.L., Madrid, Spain
Abstract. Understanding the drivers of economic growth is one of the fundamental questions in Economics. While the role of the factors of production (capital and labor) is well understood, the mechanisms that underpin Total Factor Productivity (TFP) are not fully determined. A number of heterogeneous studies point to the creation and transmission of knowledge, factor supply, and economic integration as key aspects; yet a need for a systematic and unifying framework still exists. Both capital and labor are embedded into a complex network structure through global supply chains and international migration, and it has been shown that the structure of global trade plays an important role in economic growth. Additionally, recent research has established a link between types of social capital and different network centralities. In this paper we explore the role of these measures of social capital as drivers of the TFP. By leveraging the EORA Multi Regional Input Output and the UN International Migration databases we build the complex network representation for capital and labor respectively. We compile a panel data set covering 155 economies and 26 years. Our results indicate that social capital in the factors of production network significantly drives economic output through TFP.
Keywords: Economic growth · Total factor productivity · Social capital · Economic complexity · Complex networks · Panel data
1 Introduction
Understanding growth is one of the fundamental questions in Economics. Since the seminal work of Solow [1], economic output has been understood as a monotonically increasing function of the factors of production (land, labor and capital) and an additional term called Total Factor Productivity (TFP). This term was introduced to account for additional unknown factors, and was initially connected to technology and human capital [2]. Despite being the key determinant of the long-run growth rate (per worker) [3], the drivers of TFP remain unclear.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 780–791, 2022. https://doi.org/10.1007/978-3-030-93409-5_64
Diverse studies have further investigated the fundamental drivers of TFP. Knowledge creation through innovation plays a key role [4], but also knowledge transfers occurring through Foreign Direct Investments (FDI) [5–9], trade (under the condition of having the necessary human capital to absorb it) [10], and reception of skilled migrants [11]. Skilled emigration also drives TFP through knowledge transfers to the original community [12,13]. Moreover, it has been identified that friendly economic environment and policies lead to economic prosperity for companies, so political and economic freedom have a positive effect on TFP [14]. Finally, the literature also identifies that financial openness leads to TFP growth [15]. Many of these factors rely on the fact that the two main factors of production—labor and capital—flow across the globe through global supply chains and international migration networks respectively. On the trade side, traditional economics reveals that export diversification of products leads to growth [16], and especially for developing countries [17,18]. On the migration side, studies have found that the macroeconomic and fiscal consequences of international migration are positive for OECD countries [19], and the information contained in bilateral migration stocks suggests that migration diversity has a positive impact on real GDP [11]. Also, it has been found that when international asylum seekers become permanent residents, their macroeconomic impacts are positive [20]. Nonetheless, classical methods use local, first-neighbour metrics (usually Herfindahl-Hirschman Index or similar). Therefore they are not able to exploit the information at higher-order neighbours contained in the full network structure. These highly complex datasets contain reverse causality, non-linear and variable interaction effects that call for advanced modelling techniques. 
The global financial and migratory flows can be interpreted as having a complex network structure where nodes are countries and links are flows of labor and capital, and this requires sophisticated tools to be fully understood [21]. At the macro level, it has been shown that rich countries display more intense trade links and are more clustered [22]. In this trade network, node-statistic distributions and their correlation structure have remained surprisingly stable in the last 20 years [23]. At the micro level, there is evidence that node centrality on the Japanese inter-firm trading network significantly correlates with firm size and growth [24]. Also, the country-level migration stock network has been found to have a small-world structure [25,26], and another study found a network homophily effect that could be explained in terms of cultural similarities [27].

In another line of work, recent advances in complex network theory link network centrality measures to social capital types [28]. This concept has mainly been tested on social networks, linking social capital to information diffusion [29], innovation [30] and even personal economic prosperity [31]. Social capital studies point out that individual traders in Africa with more contacts have higher output and growth [32].

The purpose of this work is to unify these different strands of literature under one framework. We proxy different types of social capital with two centrality measures: incoming (outgoing) information capital with hubs (authorities)
score, and favor capital with favor centrality. In order to proxy the relationship of social capital with TFP, we estimate a model based on an augmented Cobb-Douglas production function [33]. In this way, we give social capital a role in growth theory. To test this model, we build the network representations for the networks of the factors of production. On the one hand, we build two representations of the capital flows network leveraging the EORA World Multi-Regional Input Output database [34], one for capital and another for goods and services. On the other hand, we build the labor flow network using the UN’s International Migration Database. This results in a panel data set covering 155 economies from 1990 to 2016, including seven different social capital measures for each country.

The rest of this work is structured as follows: in Sect. 2 we describe the proposed framework, the model and the data, in Sect. 3 we lay out the obtained results, and in the last section we summarize our conclusions.
2 Materials and Methods

2.1 Network Centralities as Proxy for Types of Social Capital
Recent advances interpret social capital as a topological property of networks that can be proxied with different centrality measures [28]. Our focus is on two different social capital types. The first one is information capital, a proxy for the ability to acquire valuable information and/or to spread it to others. The second one is favor capital, which is defined as having neighbours that are supported by a neighbour in common.

Information capital ($I$) is related to diffusion centrality [35], which converges to eigenvector centrality in infinite iterations [36]. As our networks are directed, we leverage the HITS algorithm to proxy inwards ($I^{in}$) and outwards ($I^{out}$) information capital with the hubs and authorities centralities respectively [37]. On the other hand, the favor capital of node $i$ in an unweighted network $g$ has been previously proxied with favor centrality as follows [28]:

$$F_i(g) = \left| \{\, j \in N_i(g) : [g^2]_{ij} > 0 \,\} \right|, \qquad (1)$$

where $N_i(g)$ is the set of $i$'s neighbours. Notice that the condition $[g^2]_{ij} > 0$ restricts the set to neighbours of $i$ that are connected to at least one other neighbour of $i$. We extend this definition to a weighted network in the following way:

$$F_i(g) = \sum_{j} [g^2]_{ij}. \qquad (2)$$
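To make Eqs. (1) and (2) concrete, the following minimal sketch evaluates both on a small network. The 4-node adjacency matrix is a hypothetical example (it is not the network of Fig. 1), the function names are illustrative, and neighbours are assumed to be the nodes $j$ with $g_{ij} > 0$.

```python
# Sketch of Eqs. (1) and (2) on a hypothetical 4-node network.
# Assumption: the neighbours of i are the nodes j with g[i][j] > 0.

def mat_square(g):
    """Return g^2 for a square matrix given as a list of lists."""
    n = len(g)
    return [[sum(g[i][k] * g[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def favor_unweighted(g):
    """Eq. (1): count the neighbours j of i for which [g^2]_ij > 0."""
    g2 = mat_square(g)
    n = len(g)
    return [sum(1 for j in range(n) if g[i][j] > 0 and g2[i][j] > 0)
            for i in range(n)]


def favor_weighted(g):
    """Eq. (2): sum over j of [g^2]_ij."""
    return [sum(row) for row in mat_square(g)]


# Hypothetical example: a triangle 0-1-2 plus a pendant node 3 attached to 0.
toy = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

unweighted = favor_unweighted(toy)
weighted = favor_weighted(toy)
print(unweighted)
print(weighted)
```

In this example the pendant node has no neighbour supported by a common neighbour, so its unweighted favor centrality is zero, while the three triangle nodes each have two supported neighbours.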
In Fig. 1 we show the proposed social capital measures over a toy model network with link weights equal to one. We observe that node 3 (node 1) has the highest inwards (outwards) information capital, because it is the target (source) of many links. On the other hand, we see that nodes 2 and 3 have the lowest favor capital, since they do not have any neighbours with neighbours in common.
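The hub and authority scores behind the information capital indicators can be sketched with a plain power iteration of the HITS update rules. This is an illustrative sketch, not the authors' implementation: the edge list is a hypothetical toy graph (not the one in Fig. 1), the scores are normalised to sum to one, and which of the two scores is labelled inwards or outwards depends on the edge-direction convention adopted.

```python
# Minimal HITS power iteration on a hypothetical directed toy graph.
# Assumption: edges are (source, target) pairs with unit weight.

def hits(edges, iterations=100):
    """Return (hub, authority) score dictionaries, each summing to one."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of nodes pointing to me.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update: sum of the authority scores of nodes I point to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        auth_total = sum(auth.values()) or 1.0
        hub_total = sum(new_hub.values()) or 1.0
        auth = {node: value / auth_total for node, value in auth.items()}
        hub = {node: value / hub_total for node, value in new_hub.items()}
    return hub, auth


# Hypothetical toy graph: node 1 is the source of many links,
# node 3 is the target of many links.
toy_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (4, 3)]
hub, auth = hits(toy_edges)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
print(top_hub, top_authority)
```

In this toy graph the node that is the source of most links obtains the largest hub score, and the node that is the target of most links obtains the largest authority score.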
Fig. 1. Toy model of the three social capital indicators: (a) inwards information capital, (b) outwards information capital, (c) favor capital. Nodes with higher values are colored darker. In this example all link weights are equal to one.
2.2 Link Between Social Capital Types and TFP Factors
As we pointed out in the introduction, foreign sources of knowledge and technology are linked to TFP and in turn to growth. Knowledge from abroad may flow through a variety of channels. On the one hand, knowledge on how to efficiently use the factors of production is key for productivity. In that way, knowledge transfers among countries help develop technology and therefore drive TFP. The first channel for knowledge transfers is Foreign Direct Investments (FDI), which help knowledge spill over from industrialised to developing countries [5–9]. The second channel is through imports of sophisticated goods and services with high technological content [10]. The third channel is through international migration [11,13], which works both through the reception of skilled migrants in developed economies and through migrants’ attachment to their original countries. On the other hand, availability of human capital is key to absorbing knowledge shocks, so access to foreign labor can be key to TFP growth.

We propose a direct association between these factors and different types of social capital. We link knowledge transfers due to Foreign Direct Investments (FDI) with information capital on the monetary network, knowledge transfers associated with importing sophisticated goods and services with information capital on the goods and services network, and knowledge transfers associated with migration with information capital on the migration network. On the other hand, another factor playing a relevant role is the availability of a human capital supply. We link this factor to favor capital in the migration network, understanding it as membership in country partnerships that allow the free movement of people. The proposed links between the types of social capital and drivers of TFP are summarised in Table 1.

2.3 Social Capital and Economic Growth
Macroeconomic theory generally describes a country’s output through the aggregate production function [1,2], for which one widely used functional form is the Cobb-Douglas function [33]:

$$Q = A \cdot K^{\alpha} \cdot L^{\beta}, \qquad (3)$$
Table 1. Relationship between different TFP growth factors and the different types of social capital in the different networks.

Contribution to total factor productivity | Social capital type | Network
Knowledge transfer through FDI | Information capital | Financial
Knowledge transfer through trade | Information capital | Goods and services
Knowledge transfer through migration | Information capital | Migration
Human capital supply | Favor capital | Migration
where $Q$ represents total production, $A$ stands for the Total Factor Productivity, $K$ is capital and $L$ is labor. We propose an augmented Cobb-Douglas production function including the human and financial social capitals:

$$Q = \bar{A} \cdot K^{\alpha} \cdot L^{\beta} \cdot S(K)^{\kappa} \cdot S(L)^{\lambda}, \qquad (4)$$

where $S(x)$ stands for the social capital of the factor of production $x$. Notice that the key difference with respect to Eq. 3 is that we factor out the social capital contributions from the TFP as follows:

$$A = \bar{A} \cdot S(K)^{\kappa} \cdot S(L)^{\lambda}. \qquad (5)$$

2.4 Global Network Data
The two main factors of production, capital and labor, can be interpreted as having a complex network structure. In general, we interpret global transnational interactions (both financial and migratory) as a network ($G$), with $n$ countries indexed by $i \in \{1, ..., n\}$. This graph is described by the adjacency matrix $g \in [0, 1]^{n \times n}$, where $g_{ij} > 0$ represents the weight of the interaction between $i$ and $j$. Since these are directed graphs, $g$ is not symmetrical for any of them.

There is a growing body of literature interpreting the global financial flows as a complex network [38,39]. Although interpreted in a different way, the adjacency matrix of the financial network has been thoroughly studied in the field of Input-Output economics [40] under the name of technical coefficient matrix, and thus there are many open data sources providing this information. In particular, we used EORA’s World Multi-Regional Input Output database [34] to proxy the amount of trade between pairs of countries. On the one hand, we extracted the adjacency matrix of the financial network ($G_F$), where link weights represent the percentage of a country’s economic output (measured in dollars) that is paid to any other country in exchange for goods and services exported. On the other hand, we built the goods and services network ($G_G$) by weighting the links with the
proportion of the total production of goods and services that a country exports to any other country.

We build the adjacency matrix of the migration network ($G_M$) by leveraging the UN’s International Migration Database [41]. This database contains information on the yearly number of people migrating from one country to another. Using skilled migrants data would be the best approach; however, at the time of writing we have no access to such a dataset. Thus, we defined the weights of $G_M$ as the migrant stock living in a given host country, relative to the working population of the home country.
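The weighting rule for the migration network can be illustrated with a short sketch. The country codes, stock figures and working-population figures below are invented for illustration, the function name is hypothetical, and the edge direction (home country to host country) is an assumption, since the text does not state it explicitly.

```python
# Sketch of the migration-network weights described above.
# All numbers and country labels are hypothetical.

def migration_weights(stock, working_population):
    """Weight of edge (home, host): migrant stock living in the host
    country divided by the working population of the home country."""
    return {
        (home, host): migrants / working_population[home]
        for (home, host), migrants in stock.items()
        if migrants > 0  # pairs without migrants produce no edge
    }


# Hypothetical bilateral migrant stocks, keyed by (home, host).
stock = {
    ("A", "B"): 20_000,
    ("A", "C"): 5_000,
    ("B", "A"): 1_000,
    ("C", "B"): 0,
}
# Hypothetical working populations of the home countries.
working_population = {"A": 1_000_000, "B": 500_000, "C": 250_000}

weights = migration_weights(stock, working_population)
for edge, weight in sorted(weights.items()):
    print(edge, weight)
```

Normalising by the home country's working population makes weights comparable across countries of different size, which is the stated motivation for using a relative rather than an absolute stock.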
3 Results

3.1 Panel Data Set
We combine the social capital indicators described in Sect. 2.4 with some extra economic information: economic output is modeled with GDP (in current US dollars) provided by the World Bank, capital is modeled as Gross Fixed Capital Formation (in current US dollars) provided by the World Bank, and labor as total working population (in millions) provided by the OECD. The result is a panel data set covering 155 economies from 1990 to 2016. In Fig. 2 we show the distributions of the different variables as well as their pairwise Spearman correlations and $R^2$ coefficients of a linear regression model with intercept.

3.2 Social Capital Contribution to Economic Output
We model the relationship of social capital with GDP of country $i$ at time $t$ in a linear fashion by taking logs in Eq. 4:

$$\log(GDP_{it}) = A + \alpha \cdot \log(K_{it}) + \beta \cdot \log(L_{it}) + \sum_{n \in N} \left[ \mu_n \cdot \log(I^{in}_{itn}) + \nu_n \cdot \log(I^{out}_{itn}) \right] + \xi \cdot \log(F_{itM}), \qquad (6)$$

where $A$ is the intercept, $K_{it}$ is the gross capital formation, $L_{it}$ is the total working population, $N$ is the set of networks $\{G_F, G_G, G_M\}$, $F_{itn}$ is the favor capital (only $F_{itM}$, for the migration network, enters Eq. 6), and $I^{in}_{itn}$ and $I^{out}_{itn}$ are the in and out information capitals respectively.

We first estimate the model coefficients through Ordinary Least Squares (OLS). To account for unobserved entity and time effects, we leverage a Fixed Effects (FE) estimator including both country and year effects. We performed a Hausman test in order to test the consistency of the Random Effects estimates, which we rejected at the 1% significance level. Additionally, we use heteroskedasticity and autocorrelation consistent (HAC) errors in our estimation. Results are shown in Table 2. Notice that the adjusted $R^2$ of the models including the social capital indicators rises with respect to the base models, so that the new model is capturing a stronger signal. This is consistent with the clear univariate relationships between the social capital indicators and $\log(GDP)$ (Fig. 2). Also, results in
Fig. 2. Pairwise distribution matrix for economic output ($\log(GDP)$), capital ($\log(K)$), labor ($\log(L)$) and the developed social capital indicators: inwards/outwards information capital ($I^{in}$/$I^{out}$) and favour capital ($F$) for the financial, goods and services, and migration networks. Each observation corresponds to one country and year. Spearman correlations ($\rho$) are shown in the lower triangular matrix, while the $R^2$ of a linear regression model with intercept is shown in the upper triangular matrix.
Table 2 indicate that most of the significant effects of the social capital variables are positive. However, we observe that some of the introduced variables have unexpected negative effects. This result could be due to different issues in the model specification. First, network centrality measures generally tend to correlate [42]. This is confirmed by the correlations in Fig. 2, but also by high Variance Inflation Factors (see footnote 1): the minimum is VIF = 15.8 for inwards information capital in the migratory network, and the maximum is VIF = 2915.7 for the outwards information capital in the financial network. Therefore, we can expect the regression model to suffer from a multicollinearity problem. Second, we don’t capture either

Footnote 1: The variance inflation factor (VIF) quantifies the severity of multicollinearity in an ordinary least squares regression analysis. To calculate the VIF of every feature, we regress it against all other features and compute $VIF_i = 1/(1 - R_i^2)$.
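The VIF procedure described in the footnote can be sketched in a few lines of dependency-free Python. This is an illustrative sketch only: the three feature columns are invented numbers (two of them deliberately almost collinear), the function names are hypothetical, and the least-squares fit uses the normal equations, which is adequate for a tiny example but not for ill-conditioned real data.

```python
# Sketch of the VIF computation: regress each feature on all the others
# (with an intercept) and compute 1 / (1 - R^2). Data are hypothetical.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(y, columns):
    """R^2 of an OLS regression of y on the given columns plus an intercept."""
    design = [[1.0] + [col[k] for col in columns] for k in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yk for row, yk in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))
    ss_tot = sum((yk - mean_y) ** 2 for yk in y)
    return 1.0 - ss_res / ss_tot


def vif(features):
    """Map each feature name to its variance inflation factor."""
    result = {}
    for name, values in features.items():
        others = [v for other, v in features.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(values, others))
    return result


# Hypothetical features: x1 and x2 are almost collinear, x3 is not.
features = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.1, 5.8],
    "x3": [2.0, -1.0, 0.0, 3.0, -2.0, 1.0],
}
vifs = vif(features)
for name, value in vifs.items():
    print(name, round(value, 2))
```

The two nearly collinear columns receive large VIFs, whereas the third column stays close to one, which mirrors the pattern the text reports for correlated centrality measures.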
Table 2. Regression results for the model specification in Eq. 6. The p-value notation is ***, ** and * for significance at the 1%, 5% and 10% levels respectively, and standard errors are shown in parentheses. For each model we show the number of observations N, R2, adjusted R2 and F-statistic.

Model | OLS base | OLS extended | FE base | FE extended
N | 3397 | 3397 | 3397 | 3397
R2 | 0.839 | 0.885 | 0.156 | 0.390
Adj R2 | 0.839 | 0.884 | 0.108 | 0.353
F | 8819.1 | 2884.3 | 297.35 | 227.48
A | −4.7402*** (0.1642) | −0.2960 (1.2994) | 11.388*** (1.2792) | 15.724*** (2.3497)
α | 0.9264*** (0.0169) | 0.8616*** (0.0229) | 0.3516*** (0.0285) | 0.3028*** (0.0209)
β | 0.0535*** (0.0184) | −0.0045 (0.0202) | −0.1470 (0.1005) | −0.3292*** (0.1211)
νF | | −1.2268*** (0.1979) | | −0.8433*** (0.2491)
μF | | 1.1802*** (0.1835) | | 0.6904*** (0.1488)
νG | | −0.2107* (0.1185) | | 0.1111* (0.0626)
μG | | 0.0474 (0.1206) | | −0.0597 (0.0543)
νM | | 0.0922*** (0.0116) | | 0.0203 (0.0365)
μM | | 0.0385*** (0.0058) | | 0.0019 (0.0096)
ξ | | −0.0582*** (0.0219) | | 0.0482** (0.0240)
non-linear or interaction terms. These could be of special relevance given the complex nature of the data at hand. And last, our model specification could be prone to simultaneity bias due to a reverse causality channel: higher social capital enhances productivity, but higher GDP could attract trade and migration and therefore lead to higher social capital.
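As a rough illustration of the two-way Fixed Effects estimator used in this section, the sketch below applies the within transformation (removing country and year means) to a balanced toy panel with a single regressor and then fits the slope by least squares. It is a simplified sketch under strong assumptions: the panel is synthetic and noise-free, there is only one regressor, and HAC standard errors and the Hausman test are not covered. Function and variable names are illustrative.

```python
# Sketch of a two-way (entity and time) within estimator on a balanced panel.
# The panel below is synthetic; 0.3 is the slope used to generate it.

def two_way_demean(panel):
    """Subtract entity and period means and add back the grand mean.

    `panel` maps (entity, period) to a value and must be balanced.
    """
    entities = sorted({entity for entity, _ in panel})
    periods = sorted({period for _, period in panel})
    grand = sum(panel.values()) / len(panel)
    entity_mean = {e: sum(panel[e, t] for t in periods) / len(periods)
                   for e in entities}
    period_mean = {t: sum(panel[e, t] for e in entities) / len(entities)
                   for t in periods}
    return {(e, t): value - entity_mean[e] - period_mean[t] + grand
            for (e, t), value in panel.items()}


def within_slope(y, x):
    """OLS slope of demeaned y on demeaned x (single regressor)."""
    y_d, x_d = two_way_demean(y), two_way_demean(x)
    numerator = sum(x_d[key] * y_d[key] for key in x_d)
    denominator = sum(x_d[key] ** 2 for key in x_d)
    return numerator / denominator


# Synthetic balanced panel: y = entity effect + time effect + 0.3 * x.
entity_effect = {"A": 1.0, "B": -2.0, "C": 0.5}
time_effect = {0: 0.0, 1: 0.4, 2: -0.3, 3: 1.1}
x = {(e, t): float((i + 1) * (t + 1))
     for i, e in enumerate(entity_effect) for t in time_effect}
y = {(e, t): entity_effect[e] + time_effect[t] + 0.3 * x[e, t]
     for (e, t) in x}

slope = within_slope(y, x)
print(round(slope, 6))
```

Because the country and year effects are removed by the demeaning step, the slope is identified only from variation within countries over time, which is why the FE coefficients in Table 2 can differ markedly from the pooled OLS ones.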
4 Conclusions
In this work we interpret capital and labour as two factors of production traveling across the globe via the mobility networks of trade and migration. Leveraging
recent advances in the intersection of social capital and network theory, we proxy types of social capital with different network centrality measures. This provides an intuitive way of interpreting the topological importance of each country in the different factors of production networks. We then identify different channels through which social capital may affect Total Factor Productivity, and therefore GDP. On the one hand, information capitals in the financial, goods and services, and migration networks are linked respectively to knowledge transfers through FDI, goods and services, and migration. On the other hand, favor capital on the migration network is linked to human capital supply. Then, the contributions of the multiple factors are linearly modeled by means of an extended Cobb-Douglas production function (Eq. 4).

To test our model, we build two representations of the trade network, one for money and the other for goods and services, based on EORA’s World Multi-Regional Input Output database, and one representation of the migration network based on the UN’s International Migration Database. We compiled a panel dataset with seven different social capital indicators for 155 countries across 26 years. Overall, we find significant positive relationships between social capital and economic performance. To combine all the effects, we estimate the extended Cobb-Douglas model coefficients using both OLS and a Fixed Effects estimator. In both cases the model fit is enhanced by the inclusion of the indicators. This yields significant coefficients for some of the indicators. We observe positive Spearman correlations of $\log(GDP)$, $\log(K)$ and $\log(L)$ with the new variables, and most of the regression coefficients are positive and significant. We identify three possible issues in the estimation: multicollinearity, non-linear and interaction effects, and reverse causality bias. Although out of the scope of this work, these issues could be tackled in future research.
A common solution to multicollinearity is to apply dimensionality reduction techniques such as Principal Component Analysis (PCA) [43]. Non-linear and interaction effects could be captured by using more sophisticated machine learning techniques such as gradient boosting trees or neural networks. These could be applied in combination with regularization techniques that would also limit the impact of multicollinearity. And last, a possible way to remove the simultaneity bias and capture a causal effect could be to estimate a gravity model [44] of trade and migration as an instrumental variable approach.

This work provides two different types of contribution. First, the presented indicators are very rich signals for policy-making, despite the issues related to estimation. Social capital is a latent variable which is difficult to quantify, yet it contributes to productivity and growth. We provide different indicators for information social capital such as knowledge and migration hubs, which identify knowledge exporters. Moreover, considering social capital in its favor exchange function we quantify the level of integration and openness of countries in the global economic and migratory flows. Second, this work contributes to enlarging the discussion at the intersection of complex systems, economic and network
theory, as they are all needed to understand the patterns of mobility and the factors of production.
References

1. Solow, R.M.: Technical change and the aggregate production function. Rev. Econ. Stat. 39(3), 312–320 (1957)
2. Romer, P.M.: Endogenous technological change. J. Polit. Econ. 98(5, Part 2), S71–S102 (1990)
3. Ramsey, F.P.: A mathematical theory of saving. Econ. J. 38(152), 543–559 (1928)
4. Maradana, R.P., Pradhan, R.P., Dash, S., Gaurav, K., Jayakumar, M., Chatterjee, D.: Does innovation promote economic growth? Evidence from European countries. J. Innov. Entrep. 6(1), 1 (2017)
5. Griffith, R., Redding, S., Simpson, H.: Technological catch-up and geographic proximity. J. Region. Sci. 49(4), 689–720 (2009). https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1467-9787.2009.00630.x
6. Wooster, R.B., Diebel, D.S.: Productivity spillovers from foreign direct investment in developing countries: a meta-regression analysis. Rev. Dev. Econ. 14(3), 640–655 (2010)
7. Bell, M., Marin, A.: Where do foreign direct investment-related technology spillovers come from in emerging economies? An exploration in Argentina in the 1990s. Eur. J. Dev. Res. 16(3), 653–686 (2004)
8. Marin, A., Bell, M.: Technology spillovers from Foreign Direct Investment (FDI): the active role of MNC subsidiaries in Argentina in the 1990s. J. Dev. Stud. 42(4), 678–697 (2006)
9. Goto, A., Odagiri, H.: Building technological capabilities with or without inward direct investment: the case of Japan. Competitiveness, FDI and Technological Activity in East Asia, Cheltenham: Edward Elgar, pp. 83–102 (2003)
10. Mayer, J.: Technology diffusion, human capital and economic growth in developing countries. UNCTAD Discussion Paper 154, United Nations Conference on Trade and Development (2001)
11. Bove, V., Elia, L.: Migration, diversity, and economic growth. World Dev. 89, 227–239 (2017)
12. Siar, S.V.: Skilled migration, knowledge transfer and development: the case of the highly skilled Filipino migrants in New Zealand and Australia. J. Curr. Southeast Asian Affairs 30(3), 61–94 (2011)
13. Brunow, S., Nijkamp, P., Poot, J.: The impact of international migration on economic growth in the global economy. In: Handbook of the Economics of International Migration, vol. 1, pp. 1027–1075. Elsevier (2015)
14. Ulubasoglu, M.A., Doucouliagos, C.: Institutions and economic growth: a systems approach. In: Econometric Society 2004, Australasian Meetings Paper No, vol. 63 (2004)
15. Bekaert, G., Harvey, C.R., Lundblad, C.T.: Financial openness and productivity. SSRN Scholarly Paper ID 1358574, Social Science Research Network, Rochester, NY, May 2010
16. Al-Marhubi, F.: Export diversification and growth: an empirical investigation. Appl. Econ. Lett. 7(9), 559–562 (2000). https://doi.org/10.1080/13504850050059005
17. Herzer, D., Nowak-Lehmann, F.: What does export diversification do for growth? An econometric analysis. Appl. Econ. 38(15), 1825–1838 (2006). https://doi.org/10.1080/00036840500426983
18. Cadot, O., Carrère, C., Strauss-Kahn, V.: Export diversification: what’s behind the hump? Rev. Econ. Stat. 93, 590–605 (2011)
19. d’Albis, H., Boubtane, E., Coulibaly, D.: Immigration and public finances in OECD countries. J. Econ. Dyn. Control 99, 116–151 (2019)
20. d’Albis, H., Boubtane, E., Coulibaly, D.: Macroeconomic evidence suggests that asylum seekers are not a “burden” for Western European countries. Sci. Adv. 4(6), eaaq0883 (2018)
21. Schweitzer, F., Fagiolo, G., Sornette, D., Vega-Redondo, F., Vespignani, A., White, D.R.: Economic networks: the new challenges. Science 325, 422–425 (2009)
22. Fagiolo, G., Reyes, J., Schiavo, S.: The evolution of the world trade web: a weighted-network analysis. J. Evol. Econ. 20(4), 479–514 (2010)
23. Fagiolo, G., Reyes, J., Schiavo, S.: World-trade web: topological properties, dynamics, and evolution. Phys. Rev. E 79(3), 036115 (2009)
24. Todorova, Z.: Firm returns and network centrality. Risk Gov. Control Financ. Mark. Inst. 9(3), 74–82 (2019)
25. Davis, K.F., D’Odorico, P., Laio, F., Ridolfi, L.: Global spatio-temporal patterns in human migration: a complex network perspective. PLoS ONE 8(1), e53723 (2013)
26. Fagiolo, G., Mastrorillo, M.: International migration network: topology and modeling. Phys. Rev. E 88(1), 012812 (2013)
27. Windzio, M.: The network of global migration 1990–2013: using ERGMs to test theories of migration between countries. Soc. Netw. 53, 20–29 (2018)
28. Jackson, M.O.: A typology of social capital and associated network measures. SSRN Electron. J. (2017)
29. Sandim, H., Azevedo, D., da Silva, A.P.C., Moro, M.: The role of social capital in information diffusion over twitter: a study case over Brazilian posts. In: BiDuPosters@VLDB (2018)
30. Semih Akçomak, İ., ter Weel, B.: Social capital, innovation and growth: evidence from Europe. Eur. Econ. Rev. 53(5), 544–567 (2009)
31. Norbutas, L., Corten, R.: Network structure and economic prosperity in municipalities: a large-scale test of social capital theory using social media data. Soc. Netw. 52, 120–134 (2018)
32. Fafchamps, M., Minten, B.: Social capital and agricultural trade. Am. J. Agr. Econ. 83(3), 680–685 (2001)
33. Cobb, C.W., Douglas, P.H.: A theory of production. Am. Econ. Rev. 18(1), 139–165 (1928)
34. Mapping the Structure of the World Economy | Environmental Science & Technology
35. Banerjee, A., Chandrasekhar, A.G., Duflo, E., Jackson, M.O.: The diffusion of microfinance. Science 341(6144), 1–49 (2013)
36. Banerjee, A.V., Chandrasekhar, A.G., Duflo, E., Jackson, M.O.: Using gossips to spread information: theory and evidence from two randomized controlled trials. SSRN Scholarly Paper ID 2425379, Social Science Research Network, Rochester, NY, May 2017
37. Kleinberg, J.M., Newman, M., Barabási, A.-L., Watts, D.J.: Authoritative Sources in a Hyperlinked Environment. Princeton University Press, Princeton (2011)
38. Rungi, A., Fattorini, L., Huremovic, K.: Measuring the input rank in global supply networks. arXiv:2001.08003 [econ, q-fin], January 2020
39. Cerina, F., Zhu, Z., Chessa, A., Riccaboni, M.: World input-output network. PLoS ONE 10(7), e0134025 (2015)
40. Leontief, W.: Input-Output Economics. Oxford University Press, March 1986. Google-Books-ID: HMnQCwAAQBAJ
41. United Nations Population Division, Department of Economic and Social Affairs. UN migration database
42. Valente, T.W., Coronges, K., Lakon, C., Costenbader, E.: How correlated are network centrality measures? Connections (Toronto, Ont.) 28(1), 16–26 (2008)
43. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Philos. Mag. Ser. 6 2(11), 559–572 (1901). https://doi.org/10.1080/14786440109462720
44. Isard, W.: Location theory and trade theory: short-run analysis. Q. J. Econ. 68, 305–320 (1954)
The Role of Smart Contracts in the Transaction Networks of Four Key DeFi-Collateral Ethereum-Based Tokens

Francesco Maria De Collibus1(B), Alberto Partida2, and Matija Piškorec1,3

1 Blockchain and Distributed Ledger Technologies Group, University of Zurich, Zurich, Switzerland {francesco.decollibus,matija.piskorec}@business.uzh.ch
2 International Doctoral School, Rey Juan Carlos University, Madrid, Spain [email protected]
3 Rudjer Boskovic Institute, Zagreb, Croatia
Abstract. We analyse the transaction networks of four representative ERC-20 tokens that run on top of the public blockchain Ethereum and can be used as collateral in DeFi: Ampleforth (AMP), Basic Attention Token (BAT), Dai (DAI) and Uniswap (UNI). We use complex network analysis to characterize structural properties of their transaction networks. We compute their preferential attachment and we investigate how critical code-controlled nodes (smart contracts, SC) executed on the blockchain are in comparison to human-owned nodes (externally owned accounts, EOA), which are controlled by end users with public and private keys or by off-blockchain code. Our findings contribute to characterising these new financial networks. We use three network dismantling strategies on the transaction networks to analyze the criticality of smart contract and known exchange nodes as opposed to EOA nodes. We conclude that smart contract and known exchange nodes play a structural role in holding up these networks, which are theoretically designed to be distributed but in reality tend towards centralisation around hubs. This sheds new light on the structural role that smart contracts and exchanges play in Ethereum and, more specifically, in Decentralized Finance (DeFi) networks, and casts a shadow on how decentralised these networks really are. From the information security viewpoint, our findings highlight the need to protect the availability and integrity of these hubs.

Keywords: Blockchain · Ethereum · Cryptocurrencies · Cryptoassets · Decentralised finance · DeFi · Ampleforth · Basic Attention Token · Dai · Uniswap · Preferential attachment · Network dismantling
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 792–804, 2022. https://doi.org/10.1007/978-3-030-93409-5_65
1 Introduction

1.1 Blockchain and Ethereum
Blockchain. Public blockchain technology answers the requirement to register a series of events in an open, decentralised, available and immutable platform. Although the word blockchain does not appear in it, the seminal paper about Bitcoin by Satoshi Nakamoto in 2008 [1] discloses the concept behind it. “The longest chain wins” and, consequently, “the largest devoted computing power wins” summarises the functioning of Bitcoin.

Ethereum. Five years later, Vitalik Buterin invented and co-founded Ethereum. This public blockchain evolves Nakamoto’s original blockchain concept into a Turing-complete platform able to run decentralised applications by introducing smart contracts, i.e., code that runs on top of the blockchain [2]. The possibility to script any logic in a blockchain gave birth to a multitude of tokens, both fungible and non-fungible (NFTs). While Bitcoin bases its transactions on the unspent transaction outputs (UTXO) model with scripts for locking and unlocking the outputs, Ethereum uses an account-based model with balances associated with each address, which also allows the implementation of smart contracts [2].

ERC-20 Tokens. Ethereum Request for Comments 20 (ERC-20) is the Ethereum standard for fungible non-native tokens [3]. Fungible refers to tokens of the same currency that are identical and interchangeable. ERC-20 provides an application programming interface (API) to transact with these tokens. It defines methods such as transfer(), balanceOf() and approve(). The four tokens of our study are ERC-20 Ethereum tokens.

Network Participants: Humans and Code. In the Ethereum network transactions occur between addresses, each of which has an associated balance. While in Bitcoin transfers can occur from “1 to 1”, “n to 1”, “1 to n” and “n to n” sender and destination addresses due to its use of the UTXO model, in Ethereum all transactions happen “1 to 1” between a sender address and a destination address due to its use of the account model. Each account has an address derived from a public key and belongs to one of two types: (i) externally owned accounts (EOA), which are controlled by users or, alternatively, by code running outside of the blockchain, and (ii) smart contracts. Smart contracts expose functions that can be invoked by EOAs or other contracts, with the distinction that smart contracts cannot initiate transactions themselves: only EOAs can initiate a chain of smart contract executions [2,4].

1.2 Transaction Networks in Blockchain Systems
Nodes and Edges: Addresses and Transactions. Complex network analysis studies systems composed of a large number of nodes connected to each other via edges [5]. A long list of authors has studied public
blockchains’ networks via network science [6–10]. We use network analysis to understand the structure of the transaction networks of the four mentioned ERC-20 tokens. The nodes represent the addresses that intervene in these networks and the edges represent the value transfers between them.

Network Properties. Key network science properties that characterise a complex network are, among others, degree [11], density, and largest connected component. In addition to these, in this paper we analyze two more: preferential attachment and network dismantling. Preferential attachment relates to the way the network grows. Linear preferential attachment leads to a scale-free network that displays a power law behaviour. Network dismantling, the opposite of network percolation, provides insights on how the network endures the elimination of highly connected nodes [12–14]. As Ethereum is a public blockchain, we extract all transaction-relevant data related to AMP, BAT, DAI and UNI from a fully synced Ethereum node. We build the four corresponding transaction networks and calculate their key network properties with a special focus on the smart contract addresses.

1.3 Four Tokens Used as DeFi Collateral
DeFi changes the paradigm in finance. It shifts financial activities such as lending and borrowing from a traditionally centralised approach to a blockchain-based distributed approach. The logic required to run these financial processes is implemented in smart contracts running predominantly on the Ethereum platform. We study the transaction networks of four types of Ethereum-based ERC-20 tokens that can be used as collateral in DeFi: a utility token (BAT), an algorithmic stablecoin (AMP), a multi-currency pegged algorithmic stablecoin (DAI) and a governance token (UNI).

Ampleforth (AMP): An algorithmic stablecoin pegged to the USD that achieves stability by adapting its supply to price changes without a centralised collateral. The protocol receives exchange-rate information from trusted oracles on USD prices and accordingly changes the number of tokens each user holds [15]. AMP was launched in June 2019 and has a market capitalisation of over USD 2B as of September 2021, which places it in the top 100 cryptocurrencies [16].

Basic Attention Token (BAT): A utility token aiming to improve efficiency in digital advertising via its integration with the Brave browser. Users are awarded BATs for paying attention to ads. BAT allows users to maintain control over the quantity and type of the ads they consume, while advertisers can achieve better user targeting and reduced fraud rates [17]. BAT had an initial coin offering (ICO) in May 2017 and as of September 2021 it has a market capitalisation of roughly USD 1.1B, which places it in the top 100 cryptocurrencies [16].
Dai (DAI): A multi-currency pegged algorithmic stablecoin token [18] launched in 2017 which, like AMP, uses smart contracts on the Ethereum network to keep its value as close as possible to the US dollar. Users can deposit ETH as collateral and obtain a loan in DAI, and the stability of DAI is achieved by controlling the type of accepted collateral, the collateralisation ratio and interest rates. In November 2019 DAI transitioned from a single-collateral model (ETH) to a multi-collateral model (ETH, BAT and USDC among other tokens), which is the one we analyse in this paper. As of September 2021 DAI has a market capitalisation of USD 6.5B [16].

Uniswap (UNI): A decentralised finance protocol [19] to exchange ERC-20 tokens on the Ethereum network. Unlike traditional exchanges, it does not have a central limit order book but rather liquidity pools: pairs of tokens provided by users (liquidity providers) which other users can then buy and sell. The UNI governance token was launched in September 2020 [19]. It is currently ranked among the top 11 cryptocurrencies by market capitalisation, which amounts to almost USD 14B as of September 2021 [16].
2 Data Description

Table 1. Summary of the datasets curated for the four tokens used in this study. We extract all transactions (Tx) from the ETH blockchain. The last two columns show the Ethereum blocks containing the transactions used in this study for each token and their time span. Although the DAI token launched in 2017, we collect transaction data only since its move towards a multi-collateral model in 2019.

Token  Tx       Nodes    Edges    Blocks              Time span
AMP    755827   83050    201456   7953823–12500000    14/6/2019–25/5/2021
BAT    3046615  1105958  1702429  3788601–12500000    29/5/2017–25/5/2021
DAI    8422158  1042638  2523076  8928674–12500000    13/11/2019–25/5/2021
UNI    2079132  701054   1271933  10861674–12500000   14/9/2020–25/5/2021
We construct an aggregated transaction network $G^S(t)$, represented as a single directed graph encompassing the full available history, for each of the four Ethereum tokens. The nodes of the network represent addresses participating in transfers. Every edge of the network represents all the transfers that happen between the two involved addresses. We analyse more than 700k transactions (Tx) in AMP, 3M Tx in BAT, 8M Tx in DAI and 2M Tx in UNI, as displayed in Table 1.

$$G^S(t) = \left(V^S(t), E^S(t)\right) \quad \text{for symbol } S \in \{\mathrm{AMP}, \mathrm{BAT}, \mathrm{DAI}, \mathrm{UNI}\}$$

The set of nodes $V^S(t)$ corresponds to the addresses that have been included in at least one transaction of symbol $S$ up to time $t$. The set of edges $E^S(t)$ consists of unweighted, directed edges between all pairs of addresses with at least one transfer between them. In an edge $(j_1, j_2)$, node $j_1$ is the sender and $j_2$ is the recipient (Tables 2 and 3).
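The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names and the toy addresses are hypothetical, and transfers are assumed to be available as (sender, recipient) pairs already extracted from the node.

```python
from collections import defaultdict


def build_transaction_network(transfers):
    """Aggregate (sender, recipient) transfers into an unweighted directed graph."""
    nodes, edges = set(), set()
    for sender, recipient in transfers:
        nodes.update((sender, recipient))
        # Repeated transfers between the same ordered pair collapse into one edge.
        edges.add((sender, recipient))
    return nodes, edges


def degrees(edges):
    """Return in-degree and out-degree dictionaries for a set of directed edges."""
    k_in, k_out = defaultdict(int), defaultdict(int)
    for sender, recipient in edges:
        k_out[sender] += 1
        k_in[recipient] += 1
    return k_in, k_out


# Toy data with hypothetical addresses; the second transfer repeats the first.
transfers = [("0xA", "0xB"), ("0xA", "0xB"), ("0xB", "0xC"),
             ("0xC", "0xA"), ("0xD", "0xA")]
nodes, edges = build_transaction_network(transfers)
k_in, k_out = degrees(edges)
```

Using a set for the edges is what makes the graph unweighted: the number of transfers between two addresses is deliberately discarded.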
Table 2. Spearman correlation $\rho_s$ between in-degree $k_{in}$ and out-degree $k_{out}$ for each token. The relation is stronger for AMP and DAI than it is for BAT and UNI. Observing a highly irregular pattern for low in-degree nodes, we suspect that the correlation for nodes with higher degree could be stronger. Computing the Spearman correlation for $k_{in} > 100$ confirms this.

Token  ρs(kin, kout)  p-value  ρs(kin, kout) where kin > 100  p-value
AMP    0.5201         0        0.6772                         1.2470 × 10^-7
BAT    0.1523         0        0.4119                         2.6450 × 10^-13
DAI    0.4842         0        0.4874                         5.112 × 10^-48
UNI    0.2710         0        0.5094                         1.3512 × 10^-15
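For readers who want to reproduce the kind of rank correlation reported in Table 2, the sketch below computes Spearman's coefficient as the Pearson correlation of average ranks. It is a plain-Python illustration under assumed inputs (two equal-length degree lists); it does not compute p-values, and the example degrees are invented.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation between two degree sequences."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented degrees; restrict to nodes with k_in > 100 as in the last columns.
k_in = [1, 3, 150, 400, 120, 2]
k_out = [5, 1, 90, 700, 60, 2]
high = [(a, b) for a, b in zip(k_in, k_out) if a > 100]
rho_high = spearman([a for a, _ in high], [b for _, b in high])
```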
Table 3. Scale-free networks are characterised by a power law degree distribution $p_k \sim k^{-\gamma}$. In the definition of Barabási the exponent should satisfy $2 \le \gamma \le 3$, as in [20, 21]. This condition holds in only a few cases: for BAT and DAI in $k_{out}$. According to [22], most of our cases fall under the weak and weakest conditions for scale-free networks, where the degree distribution mostly follows a power law. $x_{min}$ is the minimum $x$ value where the fit starts.

Token  k     xmin  γ       Best fit
AMP    kin   5.0   3.8017  Power law
AMP    kout  3.0   4.0106  Power law
BAT    kin   44.0  1.7745  Truncated power law
BAT    kout  5.0   2.9365  Power law
DAI    kin   57.0  1.8617  Truncated power law
DAI    kout  7.0   2.7211  Power law
UNI    kin   51.0  1.7872  Truncated power law
UNI    kout  29.0  1.6668  Truncated power law

2.1 Preferential Attachment
Preferential attachment is the network growth mechanism where the probability of forming a new link is proportional to the degree of the target node. In mathematical terms, we describe the probability $\pi$ of forming a new link to an existing node $j$ with in-degree $k_{in,j}$ or from an existing node $j$ with out-degree $k_{out,j}$ in the following way:

$$\pi(k_{in,j}) = \frac{(k_{in,j})^{\alpha_{in}}}{\sum_{j} (k_{in,j})^{\alpha_{in}}}, \tag{1}$$

$$\pi(k_{out,j}) = \frac{(k_{out,j})^{\alpha_{out}}}{\sum_{j} (k_{out,j})^{\alpha_{out}}}, \tag{2}$$
where αin > 0. If αin = 1 the preferential attachment is linear. If αin < 1 it is sub-linear, and when αin > 1 it is super-linear. When the probability of forming the new link is linear, then preferential attachment leads to a scale-free network. When the attachment is super-linear, very few nodes (hubs) tend to connect to all nodes of the network. These hubs are of crucial importance in the network.
We further extend this model to the out-degree $k_{out,j}$ of an existing node $j$, in order to model the accrual and consolidation of out-degree in preferential attachment for directed networks. When a new directed edge is added to the network, we assume that the source node $j$ is selected with a probability that is a function (solely) of its out-degree $k^*_{out}$, i.e. $\pi(k^*_{out})$, just as $\pi(k^*_{in})$ denotes the probability that a new link is created to any node with in-degree $k^*$ in the original model. Since this probability is time-dependent, we use the rank function $R(\alpha; k^*_{in}, t)$, computed for each link addition to a node with in-degree $k^*$ at each time $t$. Specifically:

$$R(\alpha; k^*, t) = \frac{\sum_{k=0}^{k^*-1} n(k, t)\, k^{\alpha}}{\sum_{k} n(k, t)\, k^{\alpha}}. \tag{3}$$

Here $n(k, t)$ denotes the number of nodes with degree $k$ at time $t$. Thus, the sum in the numerator runs over all nodes whose in-degree is lower than $k^*_{in}$ (or, in the case of out-degree, whose out-degree is lower than $k^*_{out}$), while the sum in the denominator runs over all nodes. When a new edge is created, if the target or the source node is drawn with the probability of Eq. 1 or Eq. 2 for a given exponent $\alpha_o^{in}$ or $\alpha_o^{out}$, then replacing $\alpha$ by that exponent in Eq. 3 yields values of $R$ that are uniformly distributed on $[0, 1]$.

To obtain the value of $\alpha_o$, we measure the corresponding K-S (Kolmogorov–Smirnov) goodness of fit, i.e., the difference between the empirical cumulative distribution function (ECDF) calculated with different exponents $\alpha$ and the theoretical linear CDF of the uniform distribution. The value $\alpha_o$ that minimises the distance to the uniform distribution is the best fit for the exponent, which determines the kind of preferential attachment in our transaction network. We sample 10% of all the edges while building the network and calculate the K-S error between the empirical distribution and a theoretical one, in this case a pure power law, for a range of $\alpha \in [0, 2.5]$ to find an error-minimising $\alpha$.

We observe, consistently in all four tokens, that the minimum error is achieved for $\alpha$ around 1.0 for the out-degree and around 1.1 for the in-degree. A value of $\alpha > 1$ for the in-degree indicates a super-linear preferential attachment in the network, i.e., a small number of nodes attract most of the connections in the network and will eventually form super-hubs. This is another indication of the rising centralisation in the network, caused by the presence of key smart contract and exchange nodes. For all the tokens we have super-linear preferential attachment: AMP has $\alpha_{in} = 1.05$ (error 0.143) and $\alpha_{out} = 1.02$ (error 0.174); BAT has $\alpha_{in} = 1.15$ (error 0.198) and $\alpha_{out} = 1.1$ (error 0.226); DAI has $\alpha_{in} = 1.1$ (error 0.099) and $\alpha_{out} = 1.05$ (error 0.126); UNI has $\alpha_{in} = 1.05$ (error 0.227) and $\alpha_{out} = 1.02$ (error 0.257). The evolution in time with non-cumulative time windows can be seen in Fig. 2.
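The fitting procedure above (rank function, then K-S distance to the uniform distribution, then a scan over exponents) can be sketched as follows for the in-degree case. This is an illustrative sketch under several assumptions that are not taken from the paper: edges arrive as an ordered list of (source, target) pairs, edges whose target has in-degree 0 are skipped, no 10% sampling is applied, and all function names are hypothetical.

```python
from collections import Counter


def rank_values(edge_stream, alpha):
    """R(alpha; k*, t) for each edge whose target already has in-degree >= 1."""
    k_in = Counter()  # current in-degree of every node
    n_k = Counter()   # n(k, t): number of nodes with in-degree k
    ranks = []
    for _source, target in edge_stream:
        k_star = k_in[target]
        if k_star >= 1:
            total = sum(n * k ** alpha for k, n in n_k.items())
            below = sum(n * k ** alpha for k, n in n_k.items() if k < k_star)
            ranks.append(below / total)
            n_k[k_star] -= 1
        # Update the degree bookkeeping after the edge has been added.
        k_in[target] = k_star + 1
        n_k[k_star + 1] += 1
    return ranks


def ks_distance_to_uniform(sample):
    """Kolmogorov-Smirnov distance between a sample ECDF and the U[0, 1] CDF."""
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))


def best_alpha(edge_stream, alphas):
    """Return the exponent with the smallest K-S error, plus all errors."""
    errors = {a: ks_distance_to_uniform(rank_values(edge_stream, a))
              for a in alphas}
    return min(errors, key=errors.get), errors


# Tiny invented edge stream; real networks need far more events for a stable fit.
stream = [("a", "h"), ("b", "h"), ("c", "x"), ("d", "h"), ("e", "x"), ("f", "h")]
alpha_grid = [i * 0.05 for i in range(51)]  # 0.0, 0.05, ..., 2.5
alpha_o, errors = best_alpha(stream, alpha_grid)
```

Because $R$ takes discrete values, its ECDF can only approximate the uniform CDF, so the minimum K-S error stays above zero even for data generated with a known exponent.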
3 Methods and Implementation
Figure 1 plots network density as a function of network size for all four tokens. Density scales inversely with network size, $d \propto N^{-1}$. This shows that the number of edges grows linearly with the size of the network. New nodes add a limited number of edges. Transactions mostly reuse already existing edges: an
indication of preferential attachment. This also indicates an increasing centralisation in the network as smart contract nodes and exchanges act effectively as hubs in the network through which most transactions are executed.
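The step from the observed density scaling to linear edge growth can be written out explicitly. The derivation below assumes the standard density of a directed graph without self-loops and introduces $c$ as the average number of edges per node; both are assumptions made for illustration, since the text does not state which density definition is used.

```latex
% Density of a directed graph with N nodes and E edges (no self-loops):
d = \frac{E}{N(N-1)}
% If the number of edges grows linearly with the number of nodes, E \approx cN:
d \approx \frac{cN}{N(N-1)} = \frac{c}{N-1} \;\propto\; N^{-1} \quad (N \gg 1)
% Conversely, observing d \propto N^{-1} implies E = d\,N(N-1) \propto N.
```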
Fig. 1. Evolution of network density as a function of network size. As the network size grows the network does not densify; rather, the density scales as $d \propto N^{-1}$, which is evidence of a preferential attachment process.
Figure 2 shows the evolution of the best fit $\alpha$ for preferential attachment over time in all four tokens for in-degree ($\alpha_{in}$) and out-degree ($\alpha_{out}$). We see that $\alpha_{in}$ stays consistently around 1.1, and $\alpha_{out}$ around 1.0, for the entire period of network evolution studied. This confirms a slightly super-linear preferential attachment in the network from its start. This comes as no surprise, as the tokens are managed by the programmable logic of the smart contract nodes and traded via the exchanges, and these are present in the network from its start.

3.1 Network Dismantling
Network dismantling refers to the general problem of finding the minimal number of nodes whose removal dismantles a network [14] into isolated subcomponents. It belongs to the class of nondeterministic polynomial hard (NP-hard) problems, which essentially implies that no efficient algorithm is currently known that finds the optimal dismantling solution for large-scale networks. However, there are approximate methods which work well enough in practice even for large networks [12,13]. In this paper we are not interested in finding the most efficient dismantling strategy but rather in estimating the influence that the different types of nodes have on dismantling. Our aim is to assess their role in the structural integrity of the network. In our case, we are interested in the difference between two types of nodes. The first type corresponds to the addresses of smart contracts and known exchanges, a list of which was extracted from public sources such as [23], and which are controlled by the logic of the code. The second type corresponds to the addresses of the externally owned accounts (EOA), which are controlled by the actual users possessing the corresponding cryptographic keys.
Fig. 2. Evolution of best fit α for all four tokens for in-degree αin (top panel) and out-degree αout (bottom panel) for preferential attachment over time, with disjoint and non cumulative time windows.
Our dismantling strategy consists of repeatedly removing the node of the appropriate type with the highest in-degree $k_{in}$, one at a time, and then recalculating the in-degrees of all remaining nodes before repeating the procedure. As a measure of dismantling we use the size ratio of the Largest Strongly Connected Component (LSCC), i.e. the largest maximal set of graph nodes such that for every pair of nodes $a$ and $b$ there is a directed path from $a$ to $b$ and a directed path from $b$ to $a$. In our analysis we perform dismantling for up to 200 nodes of each type and for all four tokens separately, as shown in Fig. 3. We observe that for all four tokens the removal of only nodes corresponding to the addresses of contracts and known exchanges causes faster dismantling than the removal of only nodes corresponding to the addresses of EOAs: the LSCC collapses after removing just a handful of nodes. This indicates a large structural centralisation. Nodes corresponding to the addresses of smart contracts and known exchanges effectively act as hubs in the network. Unlike the nodes that correspond to addresses of EOAs, they have a crucial structural role because they are involved in the majority of the transactions. In the information security realm, intentional risk managers should protect these nodes the most [24]. We also performed additional dismantling for up to 10k nodes for each of the tokens, but this did not show qualitatively different results, so in Fig. 3 we only show results for up to 200 nodes.
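The procedure just described can be sketched in plain Python. This is a toy-scale illustration rather than the authors' implementation: the function names, the `removable` set used to select the node type, the alphabetical tie-breaking, and the normalisation by the initial LSCC size are all assumptions. Recomputing the LSCC from scratch after every removal is simple but would be slow on networks with millions of edges.

```python
def largest_scc_size(nodes, edges):
    """Size of the largest strongly connected component (iterative Kosaraju)."""
    succ = {v: [] for v in nodes}
    pred = {v: [] for v in nodes}
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    # First pass: record DFS finishing order on the forward graph.
    seen, order = set(), []
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Second pass: collect components on the reversed graph.
    seen.clear()
    best = 0
    for root in reversed(order):
        if root in seen:
            continue
        seen.add(root)
        size, stack = 0, [root]
        while stack:
            node = stack.pop()
            size += 1
            for nxt in pred[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def dismantle(nodes, edges, removable, max_removals=200):
    """Remove the highest in-degree removable node step by step; track LSCC ratio."""
    nodes, edges = set(nodes), set(edges)
    initial = largest_scc_size(nodes, edges)
    curve = [1.0]
    for _step in range(max_removals):
        k_in = {v: 0 for v in nodes}
        for _u, v in edges:
            k_in[v] += 1  # in-degrees are recomputed after every removal
        candidates = sorted(v for v in nodes if v in removable)
        if not candidates:
            break
        target = max(candidates, key=lambda v: k_in[v])
        nodes.discard(target)
        edges = {(u, v) for u, v in edges if u != target and v != target}
        curve.append(largest_scc_size(nodes, edges) / initial)
    return curve


# Toy network: one contract "C" exchanging tokens with four user accounts.
users = ["u1", "u2", "u3", "u4"]
toy_nodes = set(users) | {"C"}
toy_edges = {(u, "C") for u in users} | {("C", u) for u in users}
contract_curve = dismantle(toy_nodes, toy_edges, removable={"C"}, max_removals=1)
eoa_curve = dismantle(toy_nodes, toy_edges, removable=set(users), max_removals=1)
```

In the toy network, removing the single contract node isolates every user, whereas removing one user barely changes the LSCC, which mirrors the qualitative contrast reported for the four tokens.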
Fig. 3. Dismantling of the largest strongly connected component of each token with three different strategies: removing first the highest in-degree nodes that are smart contract and known exchange addresses, removing first the highest in-degree EOA addresses, or a strategy combining the two.
3.2 Assortativity
The assortativity coefficient $r$ measures the general tendency of nodes of a certain degree $k_i$ to attach to other nodes with similar degree. Its range is $-1 \le r \le 1$. A positive value indicates assortative mixing: a high correlation between the degrees of neighbouring nodes, which usually form communities. A value close to zero suggests non-assortative mixing: very low degree correlation, typical of core-periphery structures found in broadcasting. Finally, a negative value reveals disassortative mixing: a negative correlation, found in structures optimised for maximum distributed information transmission. Equation 4 presents the standard definition of the assortativity coefficient $r$ [25], where $a_i = \sum_j e_{ij}$, $b_j = \sum_i e_{ij}$, and $e_{ij}$ is the fraction of edges from nodes of degree $k_i$ to nodes of degree $k_j$. Due to the high computational demand of computing the assortativity coefficient in our transaction networks, we instead compute the scalar assortativity $r_s$ defined in Equation 5 [25], which is particularly useful when the degree changes over time [26].
Fig. 4. Scalar assortativity while dismantling the network.
$$r = \frac{\sum_i e_{ii} - \sum_i a_i b_i}{\sigma_a \sigma_b} \tag{4}$$

$$r_s = \frac{\sum_{ij} k_i k_j \left(e_{ij} - a_i b_j\right)}{\sigma_a \sigma_b} \tag{5}$$
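Equation 5 is equivalent to a Pearson correlation, taken over all edges, between a degree at the source end and a degree at the target end. The sketch below illustrates this on a toy edge list. It assumes, without confirmation from the text, that the source end is described by its out-degree and the target end by its in-degree; the function name and the example edges are hypothetical.

```python
def scalar_assortativity(edges):
    """Pearson correlation between source out-degree and target in-degree."""
    k_in, k_out = {}, {}
    for u, v in edges:
        k_out[u] = k_out.get(u, 0) + 1
        k_in[v] = k_in.get(v, 0) + 1
    xs = [k_out[u] for u, _v in edges]  # degree at the source end of each edge
    ys = [k_in[v] for _u, v in edges]   # degree at the target end of each edge
    n = len(edges)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either end has constant degree
    return cov / (sx * sy)


# Toy hub-dominated network: low-degree nodes mostly point at the hub "h".
toy_edges = [("a", "h"), ("b", "h"), ("c", "h"), ("h", "a"), ("a", "b")]
r_s = scalar_assortativity(toy_edges)
```

The toy value is negative because edges mostly join low-degree sources to a high-degree target, the same mechanism the text invokes to explain the slightly disassortative token networks.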
In Fig. 4 we show the scalar assortativity of the networks as we remove more and more nodes during dismantling, separately for the two types of nodes, i.e., nodes corresponding to the addresses of smart contracts and known exchanges (blue line) and nodes corresponding to EOA addresses (orange line). The initial scalar assortativity of the networks is slightly negative but close to 0 (from −0.06 for BAT and UNI to −0.20 for AMP), which is not surprising considering the centralisation in the network: most of the small in-degree nodes are connected to the large central hubs, with very few connections between them. Removal of nodes corresponding to EOA addresses during dismantling has no discernible effect on the scalar assortativity, while for contracts and known exchanges the assortativity tends to increase towards zero, making the networks less centralised and almost non-assortative. This is probably because the first nodes to be removed during dismantling are the highly connected hubs. By removing these nodes first, the assortativity in the network rises because many connections from low-degree to high-degree nodes, which contribute to the disassortativity of the network, are removed as well.
4 Discussion
Decentralised finance (DeFi), based on public blockchain technology, holds over USD 80 B assets in September 2021. It aims to disrupt the traditional financial system by providing an alternative way to access financial services. It relies on automation to execute financial transactions on top of a decentralised public blockchain with no central governance. However, decentralisation in the underlying protocol does not necessarily imply decentralisation in the application space on top of it. Smart contracts providing DeFi services act as a central point for the protocol logic. We observe this centralisation in the transaction networks of DeFi-collateral tokens, where nodes corresponding to the addresses of smart contracts and known exchanges (controlled by the logic of code) exhibit different structural roles as opposed to the nodes corresponding to Externally Owned Accounts (EOA) addresses which should be controlled directly by users. The four types of DeFi-collateral Ethereum-based tokens we study span multiple use cases in the DeFi sector: an algorithmic stablecoin (Ampleforth, AMP), a utility token used in digital marketing (Basic Attention Token, BAT), a multicurrency pegged stablecoin (Dai, DAI) and a governance token used in the UNI decentralised exchange (Uniswap, UNI). We analyse the transaction networks of these four tokens up to mid 2021 to evaluate the structural roles in the network of two types of nodes: those representing addresses driven by code and those human-driven. 
Our analysis shows an increasing centralisation of their transaction networks, with nodes corresponding to the addresses of smart contracts and known exchanges acting as hubs. We find a decreasing density in the network as new nodes are added, which scales inversely with the number of nodes, as well as a slightly super-linear preferential attachment coefficient ($\alpha_{in} > 1.0$). The latter implies that few nodes are gaining most of the connections from the newly incoming nodes, a form of “winner takes all” effect commonly observed in social systems as well [27]. Those nodes should be protected the most from the information security viewpoint in terms of their availability and integrity. Network dismantling confirms that these highly connected nodes indeed correspond to the addresses of smart contracts and known exchanges and not to the EOAs, which are controlled by the actual users. Our network dismantling strategy removes, one by one, the nodes of each of the two types with the highest in-degree $k_{in}$ and measures the effect on the Largest Strongly Connected Component (LSCC). Our results show that the removal of nodes corresponding to the addresses of smart contracts and known exchanges causes a much faster dismantling than the removal of nodes corresponding to EOAs. This confirms their structural role in the transaction network as hubs that mediate most of the transactions in the network. Our analysis is restricted to only four representative tokens on Ethereum, the largest public blockchain for smart contracts. These results hint at a potentially inconvenient fact for the DeFi sector, which claims to offer decentralisation and inclusiveness in its financial services. Most decentralised applications (dApps) run on smart contracts which effectively centralise application logic. Exchanges which process most of the transactions contribute to centralisation, regardless
of whether exchanges are centralised (in which case transactions are off-chain) or decentralised (powered by smart contracts, in which case transactions are on-chain). The underlying situation mimics the advent of online social networks in the mid 2000s. Although they run on nominally decentralised Internet protocols, they effectively centralised information flow within their application ecosystems over time. It seems that the DeFi sector is on a similar centralisation trajectory; however, the long-term consequences of this are as yet unknown.

Future work can focus on popular tokens such as the heavily used USD-pegged asset-collateralised stablecoins USDT and USDC, and on tokens that offer cross-chain compatibility or second-layer solutions (like AAVE or Polygon, respectively). Additionally, we suggest performing a similar analysis on newer smart contract blockchains such as Polkadot, Solana and Tezos. Finally, a novel research path would be to understand how second-layer blockchain solutions, which address scalability challenges in base-layer blockchain protocols, and cross-chain compatibility protocols, which share information between different blockchains, influence decentralisation.
References

1. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. Nakamotoinstitute.org, October 2008. https://bitcoin.org/bitcoin.eps. Accessed 09 Sept 2021
2. Buterin, V.: ETH whitepaper. https://ethereum.org/en/whitepaper/. Accessed 09 Sept 2021
3. ERC-20 Specification. https://ethereum.org/en/developers/docs/standards/tokens/erc-20/. Accessed 10 Sept 2021
4. Antonopoulos, A., Wood, G.: Mastering Ethereum, building smart contracts and dapps. O’Reilly Media (2019)
5. Newman, M.E.J.: The structure and function of complex networks. SIAM Rev. 45, 167–257 (2003). https://doi.org/10.1137/S003614450342480
6. Bovet, A., et al.: The evolving liaisons between the transaction networks of Bitcoin and its price dynamics. https://arxiv.org/eps/1907.03577.eps
7. Kondor, D., Pósfai, M., Csabai, I., Vattay, G.: Do the rich get richer? An empirical analysis of the bitcoin transaction network. https://doi.org/10.1371/journal.pone.0086197
8. Vallarano, N., Tessone, C.J., Squartini, T.: Bitcoin transaction networks: an overview of recent results. http://dx.doi.org/10.3389/fphy.2020.00286
9. Liang, J., Li, L., Zeng, D.: Evolutionary dynamics of cryptocurrency transaction networks: an empirical study. https://doi.org/10.1371/journal.pone.0202202
10. Somin, S., Gordon, G., Pentland, A., Shmueli, E., Altshuler, Y.: ERC20 transactions over ethereum blockchain: network analysis and predictions. https://arxiv.org/abs/2004.08201
11. da Fontoura Costa, L., et al.: Analyzing and modeling real-world phenomena with complex networks: a survey of applications. Adv. Phys. 60(3), 329–412 (2011). https://doi.org/10.1080/00018732.2011.572452
12. Ren, X.-L., Gleinig, N., Helbing, D., Antulov-Fantulin, N.: Generalized network dismantling. Proc. Natl. Acad. Sci. 116(14), 6554–6559 (2019). https://doi.org/10.1073/pnas.1806108116
13. Braunstein, A., Dall’Asta, L., Semerjian, G., Zdeborová, L.: Network dismantling. Proc. Natl. Acad. Sci. 113(44), 12368–12373 (2016). https://doi.org/10.1073/pnas.1605083113
14. Janson, S., Thomason, A.: Dismantling sparse random graphs. Comb. Probab. Comput. 17(02), 259–264 (2008). https://doi.org/10.1017/s0963548307008802
15. Ampleforth whitepaper. https://bit.ly/3lkxsAP. Accessed 11 Sept 2021
16. Coinmarketcap. Cryptocurrencies market capitalisation in real time. https://coinmarketcap.com/all/views/all/. Accessed 11 Sept 2021
17. BAT whitepaper. https://basicattentiontoken.org/static-assets/documents/BasicAttentionTokenWhitePaper-4.eps. Accessed 11 Sept 2021
18. DAI whitepaper. https://makerdao.com/en/whitepaper/. Accessed 11 Sept 2021
19. Uniswap whitepaper. https://uniswap.org/whitepaper.eps. Accessed 11 Sept 2021
20. Barabási, A.: Network Science, 05 September 2014. Creative Commons: CC BY-NC-SA 2.0. http://barabasi.com/book/network-science. Accessed 29 Dec 2021
21. Alstott, J., Bullmore, E., Plenz, D.: Powerlaw: a python package for analysis of heavy-tailed distributions. PLoS ONE 9(1), e85777 (2014). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0085777. Accessed 02 Dec 2021
22. Broido, A.D., Clauset, A.: Scale-free networks are rare. Nat. Commun. 10, 1017 (2019). https://doi.org/10.1038/s41467-019-08746-5
23. Etherscan info on exchanges. https://etherscan.io/accounts/label/exchange. Accessed 12 Sept 2021
24. Chapela, V., Criado, R., Moral, M., Romance, R.: Intentional Risk Management through Complex Networks Analysis. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-319-26423-3
25. Newman, M.E.J.: Mixing patterns in networks. https://doi.org/10.1103/PhysRevE.67.026126
26. Noldus, R., Van Mieghem, P.: Assortativity in complex networks. J. Complex Netw. 3(4), 507–542 (2015). https://doi.org/10.1093/comnet/cnv005
27. Salganik, M.J., Dodds, P.S., Watts, D.J.: Experimental study of inequality and unpredictability in an artificial cultural market. Science 311(5762), 854–856 (2006). https://doi.org/10.1126/science.1121066
Resilience, Synchronization and Control
Synchronization of Complex Networks Subject to Impulses with Average Characteristics

Bangxin Jiang and Jianquan Lu(B)

Department of Systems Science, School of Mathematics, Southeast University, Nanjing 210096, China
Abstract. In this study, synchronization of complex networks subject to impulses with average characteristics is investigated. Specifically, the ideas of average impulse interval and average impulse delay are applied to analyze the effect of delayed impulses on synchronization of complex networks. Further, the new concept of average impulse exponential gain is proposed to globally describe multiple impulses whose magnitude is allowed to be time-varying. Interestingly, it is shown that the delay in impulses can have a synchronizing effect on complex networks with such multiple delayed impulses. Finally, a numerical example is presented to illustrate the validity of the derived results.

Keywords: Synchronization · Complex networks · Multiple impulses · Average impulse exponential gain

1 Introduction
As we know, the investigation of complex networks has gained much attention in both the natural and social sciences [1]. In particular, synchronization problems of complex networks have been widely studied because of their contribution to image processing, image retrieval and so on [2,3]. To date, many meaningful studies on synchronization of complex networks have been reported in [4,5]. In addition, impulse effects, which can lead to abrupt changes of the system state at some points [6,7], potentially exist in some complex networks, such as biological networks and ecological networks [5,8]. Recently, more and more investigations on synchronization of complex networks subject to delayed impulses have been reported (see [9–11]). In fact, delay is unavoidable in the sampling and implementation of the impulse signals [12,13]. Although there are some synchronization studies for complex networks subject to delayed impulses, an upper bound on the delays in impulses needs to be imposed [9,10]. Additionally, the magnitude of the impulses is often time-varying; however, there is little work on synchronization of complex networks subject to such impulses, in which the impulsive gain of each impulse is flexible [14]. Although a concept called average impulsive gain was proposed in [14], the concept can be improved further and the delay effects in impulses have not been considered.

Based on the aforementioned survey and analysis, the concept of average impulse exponential gain (AIEG) is proposed to describe the magnitude of impulses from a global view. By applying the concepts of average impulse interval (AII), average impulse delay (AID) and AIEG, a flexible synchronization criterion is obtained. The key novelties of the current study are listed below. (I) The idea of AIEG is introduced for the first time, and then three types of average characteristics for impulses are introduced. Subsequently, a flexible synchronization criterion for complex networks subject to impulses with average characteristics is established. (II) It is shown that the obtained results are more robust than previous results, since the current study allows the impulse intervals, the delays in impulses and the magnitude of impulses to be flexible simultaneously. (III) Interestingly, it is verified that the delay in impulses may synchronize some originally non-synchronized complex networks in the presence of impulses with AIEG.

The rest of the current study is arranged as follows. Some relevant preliminaries are introduced in Sect. 2. The main results are given in Sect. 3. In Sect. 4, a numerical example is presented. The summary of this study is given in Sect. 5.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 807–816, 2022. https://doi.org/10.1007/978-3-030-93409-5_66
2 Preliminaries
Consider complex networks that consist of $N$ nodes, where each node represents a network of dimension $n$. Specifically, we consider the model of the $i$-th node as follows:

$$\dot{x}_i(t) = C x_i(t) + \Upsilon \psi(x_i(t)), \tag{1}$$

in which $x_i(t) = [x_{i1}(t), \cdots, x_{in}(t)]^T$ stands for the system state of the $i$-th complex network; $C, \Upsilon \in \mathbb{R}^{n \times n}$ and $\psi(x_i(t)) = [\psi_1(x_{i1}(t)), \cdots, \psi_n(x_{in}(t))]^T$. Suppose that these $N$ complex networks are coupled with each other and multiple delayed impulses are further taken into account. In this way, we study the complex network with multiple delayed impulses as follows:

$$\begin{cases} \dot{x}_i(t) = C x_i(t) + \Upsilon \psi(x_i(t)) + p \sum_{j=1}^{N} l_{ij} \Pi x_j(t), & t \ge t_0 \ge 0,\ t \neq t_k,\ k \in \mathbb{Z}_+, \\ x_j(t_k) - x_i(t_k) = e^{d_k}\left[x_j(t_k^- - \nu_k) - x_i(t_k^- - \nu_k)\right], & \text{for } (i, j) \text{ satisfying } l_{ij} > 0, \end{cases} \tag{2}$$

in which $x_i(t) = [x_{i1}(t), \cdots, x_{in}(t)]^T$ stands for the system state of the $i$-th node; $\Pi$ stands for the inner coupling positive definite matrix; $p > 0$ stands for the
coupling strength; $l_{ij}$ is defined as follows: if there exists a path from node $j$ to another node $i$ ($i \neq j$), then $l_{ij} > 0$; otherwise, $l_{ij} = 0$. Then, $L = (l_{ij})_{N \times N}$ stands for the Laplacian matrix, which reveals the network’s topology structure [15].

Definition 1 ([14]). The average impulse interval $\tau_a$ of sequence $\{t_k\}$, $k \in \mathbb{Z}_+$, is defined by

$$\tau_a = \lim_{t \to +\infty} \frac{t - t_0}{R(t, t_0)}, \tag{3}$$

in which $R(t, t_0)$ stands for the number of impulses of sequence $\{t_k\}$, $k \in \mathbb{Z}_+$, in $(t_0, t]$.

Definition 2 ([4]). The average impulse delay $\bar{\nu}$ of sequence $\{\nu_k\}$, $k \in \mathbb{Z}_+$, is defined by

$$\bar{\nu} = \lim_{t \to +\infty} \frac{\nu_1 + \nu_2 + \cdots + \nu_{R(t, t_0)}}{R(t, t_0)}, \tag{4}$$

in which $R(t, t_0)$ is the same as that in Definition 1.

Inspired by the above two concepts, we propose the concept of average impulse exponential gain (AIEG) to characterize the magnitude of impulses from a global view.

Definition 3. The average impulse exponential gain $e^{\bar{d}}$ of sequence $\{d_k\}$, $k \in \mathbb{Z}_+$, is defined by

$$e^{\bar{d}} = \exp\left(\lim_{t \to +\infty} \frac{d_1 + d_2 + \cdots + d_{R(t, t_0)}}{R(t, t_0)}\right), \tag{5}$$

in which $R(t, t_0)$ is the same as that in Definition 1.

Considering the AII, AID and AIEG methods, let $\mathcal{L}_0$, $\mathcal{L}_\nu$ and $\mathcal{L}_d$ stand for the sets of impulse instant sequences, impulse delay sequences and impulse exponential gain sequences, which satisfy condition (3), condition (4) and condition (5), respectively. Denote by $\mathcal{H}[\{t_k\}, \{\nu_k\}, \{d_k\}]$ the class composed of $\{t_k\} \in \mathcal{L}_0$, $\{\nu_k\} \in \mathcal{L}_\nu$ and $\{d_k\} \in \mathcal{L}_d$. Complex network (2) is said to be globally exponentially synchronized (GES) if there are $M > 0$, $\sigma > 0$ as well as $t^* > 0$ satisfying

$$\|x_i(t) - x_j(t)\| \le M e^{-\sigma t}, \quad t > t^*, \tag{6}$$

for arbitrary initial conditions, where $i, j = 1, 2, \cdots, N$. This definition depends on the selection of the impulse instant sequence $\{t_k\}$, the impulse delay sequence $\{\nu_k\}$ and the impulse exponential gain sequence $\{d_k\}$. It is therefore of interest to describe GES uniformly over the sequences $\{t_k\}$, $\{\nu_k\}$ and $\{d_k\}$. Accordingly, complex network (2) is said to be globally uniformly exponentially synchronized (GUES) over the class $\mathcal{H}[\{t_k\}, \{\nu_k\}, \{d_k\}]$ if there are $M > 0$, $\sigma > 0$ and $t^* > 0$ satisfying estimation (6) for arbitrary $\{t_k\} \in \mathcal{L}_0$, $\{\nu_k\} \in \mathcal{L}_\nu$ and $\{d_k\} \in \mathcal{L}_d$. For all the notation, assumptions (e.g., about the function $\psi$ and the Laplacian matrix $L$) and necessary lemmas, one can refer to our previous results on synchronization of complex networks subject to delayed impulses whose magnitude is constant (see [4]).
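Definitions 1 to 3 are limits, but over a finite observation window they can be approximated by simple averages. The sketch below shows this under stated assumptions: the averages are evaluated at the last recorded impulse instant, the function names are hypothetical, and the sample sequences are invented for illustration.

```python
import math


def average_impulse_interval(t0, impulse_times):
    """Finite-horizon estimate of tau_a = (t - t0) / R(t, t0) at the last impulse."""
    return (impulse_times[-1] - t0) / len(impulse_times)


def average_impulse_delay(delays):
    """Finite-horizon estimate of nu_bar: the mean of the impulse delays nu_k."""
    return sum(delays) / len(delays)


def average_impulse_exponential_gain(exponents):
    """Return (d_bar, e^{d_bar}) for the impulse exponents d_k."""
    d_bar = sum(exponents) / len(exponents)
    return d_bar, math.exp(d_bar)


# Invented sequences: five impulses with time-varying delays and gains.
t_k = [0.1, 0.3, 0.4, 0.6, 0.8]
nu_k = [0.05, 0.02, 0.08, 0.01, 0.04]
d_k = [-0.5, 0.2, -0.9, -0.4, -0.4]

tau_a = average_impulse_interval(0.0, t_k)
nu_bar = average_impulse_delay(nu_k)
d_bar, aieg = average_impulse_exponential_gain(d_k)
```

Note that one individual gain, $e^{0.2}$, is larger than 1, while the average exponent is negative; allowing such occasional expanding impulses is the flexibility the averaged quantity is meant to capture.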
3 Main Results
Denote $x(t) = [x_1^T(t), \cdots, x_N^T(t)]^T$ and $\Psi(x(t)) = [\psi^T(x_1(t)), \cdots, \psi^T(x_N(t))]^T$. Next, we transform the complex networks with impulses into Kronecker-product form as follows:

$$\begin{cases} \dot{x}(t) = (I_N \otimes C)x(t) + (I_N \otimes \Upsilon)\Psi(x(t)) + p(L \otimes \Pi)x(t), & t \ge t_0 \ge 0,\ t \neq t_k, \\ x_j(t_k) - x_i(t_k) = e^{d_k}\left[x_j(t_k^- - \nu_k) - x_i(t_k^- - \nu_k)\right], & \text{for } (i, j) \text{ satisfying } l_{ij} > 0, \end{cases} \tag{7}$$

where $k \in \mathbb{Z}_+$. Referring to [4], $\phi = (\phi_1, \cdots, \phi_N)^T \in \mathbb{R}^N$ is designed to be the left eigenvector of matrix $L$ corresponding to eigenvalue 0 such that $\sum_{i=1}^{N} \phi_i = 1$. Denote $\Phi = \mathrm{diag}\{\phi_1, \cdots, \phi_N\}$. Let $\Delta \triangleq (\delta_{ij})_{N \times N} = \Phi - \phi\phi^T$, and one may see that $\Delta$ is a zero-row-sum matrix possessing negative off-diagonal elements. Therefore, one can infer that $\lambda_{\max}(\Delta) > 0$. Moreover, $\Delta L = (\Phi - \phi\phi^T)L = \Phi L - \phi(\phi^T L) = \Phi L$. Next, the synchronization is studied for complex networks subject to impulses with average characteristics. Denote $c = -\lambda_{\max}(C^T + C + \Upsilon\Upsilon^T + H^T H - p\Pi)$ with ˜ $= -\lambda_2(L)/\lambda_{\max}(\Delta)$ for convenience. Then, $c$ is said to be the rate coefficient of network (7).

Theorem 1. Consider complex network (7) with rate coefficient $c < 0$. Suppose that the class $\mathcal{H}[\{t_k\}, \{\nu_k\}, \{d_k\}]$ fulfills $\tau_a < \infty$, $\bar{\nu} < \infty$, $\nu_k < t_k - t_{k-1}$ and $\bar{d} < 0$. Then network (7) is GUES over the class $\mathcal{H}[\{t_k\}, \{\nu_k\}, \{d_k\}]$ provided that the following condition is fulfilled:

$$c(\bar{\nu} - \tau_a) + 2\bar{d} < 0. \tag{8}$$
Proof. Design the function $V(t) = x^T(t)(\Delta \otimes I_n)x(t)$. Then, along the dynamics of complex network (7), we obtain the corresponding derivative:

$$\dot{V}(t) = 2x^T(t)(\Delta \otimes I_n)\dot{x}(t) = 2x^T(t)(\Delta \otimes C)x(t) + 2x^T(t)(\Delta \otimes \Upsilon)\Psi(x(t)) + 2p\,x^T(t)(\Delta L \otimes \Pi)x(t),$$

for all $t \in [t_{k-1}, t_k)$, $k \in \mathbb{Z}_+$. Actually, it leads to

$$V(t) = \sum_{i=1}^{N}\sum_{j=1, j \ne i}^{N} -\frac{1}{2}\delta_{ij}(x_i - x_j)^T(x_i - x_j). \tag{9}$$
Next, since $\Delta L = \Phi L$ and $(\Delta \otimes I_n)I(t) = 0$, we can infer that

$$\dot{V}(t) = -\sum_{i=1}^{N}\sum_{j=1, j \ne i}^{N} \delta_{ij}\Big[(x_i - x_j)^T\big(C - \tfrac{1}{2}p\varrho\Pi\big)(x_i - x_j) + (x_i - x_j)^T\Upsilon(\psi(x_i) - \psi(x_j))\Big] + p\,x^T(t)\big[(\Phi L + L^T\Phi) \otimes \Pi + \varrho\Delta \otimes \Pi\big]x(t), \tag{10}$$

for arbitrary $t \in [t_{k-1}, t_k)$. Applying matrix decomposition theory [16] and Lemma 2 in [4], we can select a unitary matrix $U$ for matrix $\hat{B}$ satisfying $\hat{B} = U\Lambda U^T$, where $\Lambda = \mathrm{diag}\{\lambda_1(\hat{B}), \lambda_2(\hat{B}), \cdots, \lambda_N(\hat{B})\}$ with $\lambda_1(\hat{B}) = 0$, and $U = [u_1, \cdots, u_N]$ with $u_1 = (\frac{1}{\sqrt{N}}, \cdots, \frac{1}{\sqrt{N}})^T$. Via Assumption 1 and Lemma 1 in [4], it leads to

$$2(x_i - x_j)^T\Upsilon(\psi(x_i) - \psi(x_j)) \le (x_i - x_j)^T\Upsilon\Upsilon^T(x_i - x_j) + (\psi(x_i) - \psi(x_j))^T(\psi(x_i) - \psi(x_j)) \le (x_i - x_j)^T(\Upsilon\Upsilon^T + H^T H)(x_i - x_j). \tag{11}$$
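Denote $z(t) = (U^T \otimes I_n)x(t)$. One can observe that $x(t) = (U \otimes I_n)z(t)$, and then derive

$$\begin{aligned} x^T(t)[(\Phi L + L^T\Phi) \otimes \Pi]x(t) &= z^T(t)(U^T \otimes I_n)(\hat{B} \otimes \Pi)(U \otimes I_n)z(t) \\ &= \sum_{i=1}^{N} \lambda_i(\hat{B})\,z_i^T(t)\Pi z_i(t) \\ &= \sum_{i=2}^{N} \lambda_i(\hat{B})\,z_i^T(t)\Pi z_i(t) \\ &\le \lambda_2(\hat{B})\sum_{i=2}^{N} z_i^T(t)\Pi z_i(t), \end{aligned} \tag{12}$$

in which $z(t) = [z_1^T(t), \cdots, z_N^T(t)]^T$, $z_i(t) \in \mathbb{R}^n$, $i = 1, 2, \cdots, N$. Moreover, $\Delta \cdot u_1 = (0, 0, \cdots, 0)^T \triangleq 0_N \in \mathbb{R}^N$. Therefore, it leads to $U^T\Delta U = [0, 0_{N-1}^T; 0_{N-1}, \tilde{U}^T\Delta\tilde{U}]$, where $\tilde{U} = [u_2, u_3, \cdots, u_N]$ such that $\tilde{U}^T\tilde{U} = I_{N-1}$. Since $\lambda_{\max}(\Delta) > 0$, it follows that

$$\begin{aligned} x^T(t)[\Delta \otimes \Pi]x(t) &= z^T(t)(U^T\Delta U \otimes \Pi)z(t) \\ &= \tilde{z}^T(t)(\tilde{U}^T\Delta\tilde{U} \otimes \Pi)\tilde{z}(t) \\ &\le \lambda_{\max}(\Delta)\,\tilde{z}^T(t)(\tilde{U}^T\tilde{U} \otimes \Pi)\tilde{z}(t) \\ &\le \lambda_{\max}(\Delta)\sum_{i=2}^{N} z_i^T(t)\Pi z_i(t), \end{aligned} \tag{13}$$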
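in which $\tilde{z}(t) = [z_2^T(t), z_3^T(t), \cdots, z_N^T(t)]^T$. By using (12), (13) and $\varrho = -\lambda_2(\hat{B})/\lambda_{\max}(\Delta)$, we have

$$p\,x^T(t)\big[(\Phi L + L^T\Phi) \otimes \Pi + \varrho\Delta \otimes \Pi\big]x(t) \le p\big[\lambda_2(\hat{B}) + \varrho\lambda_{\max}(\Delta)\big]\sum_{i=2}^{N} z_i^T(t)\Pi z_i(t) = 0. \tag{14}$$

By (10), together with (11) and (14), it leads to

$$\begin{aligned} \dot{V}(t) &\le -\sum_{i=1}^{N}\sum_{j=1, j \ne i}^{N} \frac{1}{2}\delta_{ij}(x_i - x_j)^T(2C + \Upsilon\Upsilon^T + H^T H - p\varrho\Pi)(x_i - x_j) \\ &\le \lambda_{\max}(C^T + C + \Upsilon\Upsilon^T + H^T H - p\varrho\Pi) \times \sum_{i=1}^{N}\sum_{j=1, j \ne i}^{N} -\frac{1}{2}\delta_{ij}(x_i - x_j)^T(x_i - x_j) \\ &= -cV(t), \end{aligned} \tag{15}$$

for arbitrary $t \in [t_{k-1}, t_k)$, $k \in \mathbb{Z}_+$, where $c = -\lambda_{\max}(C^T + C + \Upsilon\Upsilon^T + H^T H - p\varrho\Pi)$. Thus, for $t \in [t_{k-1}, t_k)$, $k \in \mathbb{Z}_+$, one can conclude that

$$V(t) \le e^{-c(t - t_{k-1})}V(t_{k-1}). \tag{16}$$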
At impulse instants, it yields that $x_j(t_k) - x_i(t_k) = e^{d_k}[x_j(t_k^- - \nu_k) - x_i(t_k^- - \nu_k)]$ for every pair $(i, j)$ such that $l_{ij} > 0$. Because $L$ is irreducible, one can select suffixes $s_1, s_2, \cdots, s_m$ satisfying $l_{i s_m} > 0$, $l_{s_m s_{m-1}} > 0$, $\cdots$, and $l_{s_1 j} > 0$ for an arbitrary pair of suffixes $i$ and $j$ ($i \ne j$). Thus, it follows that

$$\begin{aligned} x_j(t_k) - x_i(t_k) &= [x_j(t_k) - x_{s_1}(t_k)] + [x_{s_1}(t_k) - x_{s_2}(t_k)] + \cdots + [x_{s_m}(t_k) - x_i(t_k)] \\ &= e^{d_k}[x_j(t_k^- - \nu_k) - x_{s_1}(t_k^- - \nu_k)] + e^{d_k}[x_{s_1}(t_k^- - \nu_k) - x_{s_2}(t_k^- - \nu_k)] + \cdots \\ &\quad + e^{d_k}[x_{s_m}(t_k^- - \nu_k) - x_i(t_k^- - \nu_k)] \\ &= e^{d_k}[x_j(t_k^- - \nu_k) - x_i(t_k^- - \nu_k)], \end{aligned} \tag{17}$$

for any suffixes $i$ and $j$. Thus, if $t = t_k$, $k \in \mathbb{Z}_+$, it follows that

$$\begin{aligned} V(t_k) &= \sum_{i=1}^{N}\sum_{j=1, j \ne i}^{N} -\frac{1}{2}\delta_{ij}[x_i(t_k) - x_j(t_k)]^T[x_i(t_k) - x_j(t_k)] \\ &= \frac{e^{2d_k}}{2}\sum_{i=1}^{N}\sum_{j=1, j \ne i}^{N} -\delta_{ij}[x_i(t_k^- - \nu_k) - x_j(t_k^- - \nu_k)]^T \cdot [x_i(t_k^- - \nu_k) - x_j(t_k^- - \nu_k)] \\ &= e^{2d_k}V(t_k^- - \nu_k). \end{aligned} \tag{18}$$
Next, the following discussion is derived from (16) and (18). For every $m \in \mathbb{Z}_+$ and $t \in [t_{m-1}, t_m)$, we infer that

$$V(t) \le \exp\Big(2\sum_{j=0}^{m-1} d_j\Big)\exp\Big(c\sum_{j=0}^{m-1} \nu_j\Big)e^{-c(t - t_0)}V_0, \tag{19}$$

where $\nu_0 \triangleq 0$, $d_0 \triangleq 0$ and $V_0 = V(t_0)$. To begin with, by (16), we have

$$V(t) \le e^{-c(t - t_0)}V_0, \quad t \in [t_0, t_1). \tag{20}$$

One may observe that (19) is satisfied for $m = 1$. Then, suppose that (19) is satisfied for $m = l$, $l \ge 1$, that is,

$$V(t) \le \exp\Big(2\sum_{j=0}^{l-1} d_j\Big)\exp\Big(c\sum_{j=0}^{l-1} \nu_j\Big)e^{-c(t - t_0)}V_0, \tag{21}$$

for $t \in [t_{l-1}, t_l)$. Next, (19) is proved for $m = l + 1$. Because $\nu_k < t_k - t_{k-1}$, $k \in \mathbb{Z}_+$, we have $t_l - \nu_l \in [t_{l-1}, t_l)$. By using (18) and (21), one can infer that

$$\begin{aligned} V(t_l) &\le e^{2d_l}V(t_l - \nu_l) \\ &\le e^{2d_l} \cdot \exp\Big(2\sum_{j=0}^{l-1} d_j\Big)\exp\Big(c\sum_{j=0}^{l-1} \nu_j\Big)e^{-c(t_l - \nu_l - t_0)}V_0 \\ &= \exp\Big(2\sum_{j=0}^{l} d_j\Big)\exp\Big(c\sum_{j=0}^{l} \nu_j\Big)e^{-c(t_l - t_0)}V_0. \end{aligned} \tag{22}$$

Since $V(t) \le V(t_l)e^{-c(t - t_l)}$ for $t \in [t_l, t_{l+1})$, it yields that

$$\begin{aligned} V(t) &\le \exp\Big(2\sum_{j=0}^{l} d_j\Big)\exp\Big(c\sum_{j=0}^{l} \nu_j\Big)e^{-c(t_l - t_0)}V_0 \cdot e^{-c(t - t_l)} \\ &= \exp\Big(2\sum_{j=0}^{l} d_j\Big)\exp\Big(c\sum_{j=0}^{l} \nu_j\Big)e^{-c(t - t_0)}V_0, \end{aligned} \tag{23}$$
which implies that (19) is satisfied for m = l + 1. So, via mathematical induction method, we conclude that (19) is fulfilled for all m ∈ Z+ . In addition, from {tk } ∈ L0 , {νk } ∈ Lν and {dk } ∈ Ld we derive that
$$\begin{aligned} V(t) &\le \exp\Big(2\sum_{j=0}^{R(t,t_0)} d_j\Big)\exp\Big(c\sum_{j=0}^{R(t,t_0)} \nu_j\Big)e^{-c(t - t_0)}V_0 \\ &= \exp\Big(2\sum_{j=0}^{R(t,t_0)} d_j + c\sum_{j=0}^{R(t,t_0)} \nu_j\Big)e^{-c(t - t_0)}V_0 \\ &= \exp\left[\left(\frac{2\sum_{j=0}^{R(t,t_0)} d_j + c\sum_{j=0}^{R(t,t_0)} \nu_j}{R(t,t_0)} \Big/ \frac{t - t_0}{R(t,t_0)}\right)(t - t_0)\right]e^{-c(t - t_0)}V_0 \end{aligned} \tag{24}$$

for arbitrary $t \ge 0$. Recalling the definitions of $\tau_a$, $\bar{\nu}$ and $\bar{d}$, one can conclude that

$$\lim_{t \to +\infty}\left[\frac{2\sum_{j=0}^{R(t,t_0)} d_j + c\sum_{j=0}^{R(t,t_0)} \nu_j}{R(t,t_0)} \Big/ \frac{t - t_0}{R(t,t_0)}\right] = \frac{2\bar{d} + c\bar{\nu}}{\tau_a}. \tag{25}$$

Therefore, for any $0 < \varepsilon < c - \frac{2\bar{d} + c\bar{\nu}}{\tau_a}$, there is a $t^* > 0$ (large enough) satisfying

$$V(t) \le \exp\left[\Big(\frac{2\bar{d} + c\bar{\nu}}{\tau_a} + \varepsilon - c\Big)(t - t_0)\right]V_0, \tag{26}$$

for arbitrary $t > t^*$, which implies that

$$\frac{1}{2}\phi_i\phi_j\|x_i(t) - x_j(t)\|^2 \le \frac{1}{2}\sum_{i=1, j=1}^{N} \phi_i\phi_j[x_i(t) - x_j(t)]^T[x_i(t) - x_j(t)] = V(t), \tag{27}$$

where $\delta_{ij} = -\phi_i\phi_j$ for $i \ne j$ is utilized. Further, it follows from condition (8) that $\big(\frac{2\bar{d} + c\bar{\nu}}{\tau_a} + \varepsilon - c\big) < 0$. Hence, the complex network (7) is GUES over the class $H[\{t_k\},\{\nu_k\},\{d_k\}]$. This concludes the proof.
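As a brief consistency check on the last step of the proof (a restatement, assuming $0 < \tau_a < \infty$ as in Theorem 1 and $\phi_i\phi_j > 0$, which is an assumption taken over from the setting of [4]), dividing condition (8) by $\tau_a$ shows that it is equivalent to a negative exponent in (26) for sufficiently small $\varepsilon$:

$$c(\bar{\nu} - \tau_a) + 2\bar{d} < 0 \iff \frac{2\bar{d} + c\bar{\nu}}{\tau_a} - c < 0.$$

Combining (26) with (27) then gives, for all sufficiently large $t$,

$$\|x_i(t) - x_j(t)\| \le \sqrt{\frac{2V_0}{\phi_i\phi_j}}\; e^{-\sigma(t - t_0)}, \qquad \sigma = \frac{1}{2}\Big(c - \frac{2\bar{d} + c\bar{\nu}}{\tau_a} - \varepsilon\Big) > 0,$$

which has the form of estimate (6).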
4 Example

In this section, a numerical example is presented to illustrate the effectiveness of the obtained results. Each node is considered to be a chaotic system as follows [17]:

$$\dot{x}(t) = Cx(t) + \Upsilon\psi(x(t)), \tag{28}$$

in which $x(t) = [x_1(t), x_2(t), x_3(t)]^T \in \mathbb{R}^3$ stands for the system state, matrix $C = \mathrm{diag}\{-\frac{6}{5}, -\frac{6}{5}, -\frac{6}{5}\}$,

$$\Upsilon = \begin{pmatrix} 1.16 & -1.5 & 1.5 \\ -1.5 & 1.16 & -2.0 \\ -1.2 & 2.0 & 1.16 \end{pmatrix}, \tag{29}$$

and function $\psi(x(t)) = [\tanh(x_1), \tanh(x_2), \tanh(x_3)]$. Hence, the Lipschitz constants are $h_1 = h_2 = h_3 = 1$. System (28) exhibits chaotic behavior under the initial condition $x_1(t_0) = 0.2$, $x_2(t_0) = 0.5$ and $x_3(t_0) = 0.6$.

Consider a small-world network (Newman-Watts type) with multiple delayed impulses [18]. Let $N = 200$, $k = 2$. The probability of adding an edge is equal to 0.01, which implies that a small-world network is derived. One may calculate that $\lambda_{\max}(\Delta) = 0.00817$, $\lambda_2(\hat{B}) = -0.0129$ and $\varrho = -\lambda_2(\hat{B})/\lambda_{\max}(\Delta) = 1.58$. Consider $p = 5$ and $\Pi = I_3$, which leads to $c = -3.675$. Let the impulse exponential gain sequence be $d_k = -0.1 + 0.2 \cdot (-1)^{(k+1)}$, the impulse instant sequence be $t_k = 0.3k$ and the impulse delay sequence be $\nu_k = 0.26 + \frac{1}{10k^2}$, $k \in \mathbb{Z}_+$. We can compute that $c(\bar{\nu} - \tau_a) + 2\bar{d} = -0.05 < 0$. Actually, one can check that all the conditions in Theorem 1 are fulfilled. Thus, complex network (7) is GUES over the class $H[\{t_k\},\{\nu_k\},\{d_k\}]$. In contrast, if there is no delay in the impulses, the complex network (7) turns out to be nonsynchronized. Different from the desynchronizing factor in previous results [19–21], the current study reveals that the delay in impulses can bring a synchronizing impact to the synchronization of complex networks subject to impulses with average characteristics.
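The arithmetic behind condition (8) in this example can be retraced with a short Python sketch. The function names are illustrative, and the limit values are assumptions read off the sequences above: $\tau_a = 0.3$ (constant spacing of $t_k = 0.3k$, assuming Definition 1 reduces to the spacing in this case), $\bar{\nu} = 0.26$ and $\bar{d} = -0.1$; the value $c = -3.675$ is taken from the text rather than recomputed.

```python
def condition_8(c: float, nu_bar: float, tau_a: float, d_bar: float) -> float:
    """Left-hand side of condition (8): c * (nu_bar - tau_a) + 2 * d_bar."""
    return c * (nu_bar - tau_a) + 2.0 * d_bar


def bound_rate(c: float, nu_bar: float, tau_a: float, d_bar: float) -> float:
    """Exponent of the bound (26) per unit time, in the limit epsilon -> 0."""
    return (2.0 * d_bar + c * nu_bar) / tau_a - c


c = -3.675      # rate coefficient reported in the example
tau_a = 0.3     # assumed average impulsive interval for t_k = 0.3 k
nu_bar = 0.26   # assumed average impulsive delay
d_bar = -0.1    # assumed average impulse exponential gain

with_delay = condition_8(c, nu_bar, tau_a, d_bar)
without_delay = condition_8(c, 0.0, tau_a, d_bar)

print("condition (8) with delay:   ", with_delay)
print("condition (8) without delay:", without_delay)
print("exponent of bound (26):     ", bound_rate(c, nu_bar, tau_a, d_bar))
```

With the delay the left-hand side is about -0.053, consistent with the reported -0.05; with $\bar{\nu} = 0$ it becomes positive, so the sufficient condition of Theorem 1 is no longer met. Because $c < 0$, the term $c\bar{\nu}$ is negative, which is how the delay contributes to synchronization in this criterion.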
5 Conclusion

In this study, the synchronization problem has been studied for complex networks subject to impulses with average characteristics. To be specific, the AIEG method has been proposed to characterize the magnitude of impulses globally. Based on the AII, AID and AIEG characteristics of the impulses, a flexible criterion for synchronization has been obtained. An interesting topic for future work is to investigate synchronization of impulsive dynamical networks with delay.
References

1. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U.: Complex networks: structure and dynamics. Phys. Rep. 424(4–5), 175–308 (2006)
2. Arenas, A., Díaz-Guilera, A., Kurths, J., Moreno, Y., Zhou, C.: Synchronization in complex networks. Phys. Rep. 469(3), 93–153 (2008)
3. Lu, W., Chen, T.: New approach to synchronization analysis of linearly coupled ordinary differential systems. Phys. D 213(2), 214–230 (2006)
4. Jiang, B., Lu, J., Lou, J., Qiu, J.: Synchronization in an array of coupled neural networks with delayed impulses: average impulsive delay method. Neural Netw. 121, 452–460 (2020)
5. Wang, Y., Lu, J., Liang, J., Cao, J., Perc, M.: Pinning synchronization of nonlinear coupled Lur'e networks under hybrid impulses. IEEE Trans. Circuits Syst. II Express Briefs 66(3), 432–436 (2019)
6. Lakshmikantham, V., Simeonov, P.S.: Theory of Impulsive Differential Equations, vol. 6. World Scientific, Singapore (1989)
7. Benchohra, M., Henderson, J., Ntouyas, S.: Impulsive Differential Equations and Inclusions, vol. 2. Hindawi Publishing Corporation, New York (2006)
8. Yang, S., Li, C., Huang, T.: Synchronization of coupled memristive chaotic circuits via state-dependent impulsive control. Nonlinear Dyn. 88(1), 115–129 (2016). https://doi.org/10.1007/s11071-016-3233-z
9. Li, X., Song, S.: Stabilization of delay systems: delay-dependent impulsive control. IEEE Trans. Autom. Control 62(1), 406–411 (2016)
10. Chen, W.H., Zheng, W.X.: Input-to-state stability and integral input-to-state stability of nonlinear impulsive systems with delays. Automatica 45(6), 1481–1488 (2009)
11. Jiang, B., Lou, J., Lu, J., Kaibo, S.: Synchronization of chaotic neural networks: average-delay impulsive control. IEEE Trans. Neural Netw. Learn. Syst
12. Li, X., Song, S., Wu, J.: Exponential stability of nonlinear systems with delayed impulses and applications. IEEE Trans. Autom. Control 64(10), 4024–4034 (2019)
13. Jiang, B., Lu, J., Liu, Y.: Exponential stability of delayed systems with average-delay impulses. SIAM J. Control. Optim. 58(6), 3763–3784 (2020)
14. Wang, N., Li, X., Lu, J., Alsaadi, F.E.: Unified synchronization criteria in an array of coupled neural networks with hybrid impulses. Neural Netw. 101, 25–32 (2018)
15. Chung, F.R., Graham, F.C.: Spectral Graph Theory, no. 92. American Mathematical Society (1997)
16. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1990)
17. Lu, J., Ho, D.W.C., Cao, J.: A unified synchronization criterion for impulsive dynamical networks. Automatica 46(7), 1215–1221 (2010)
18. Newman, M.E., Watts, D.J.: Scaling and percolation in the small-world network model. Phys. Rev. E 60(6), 7332 (1999)
19. Yang, H., Wang, X., Zhong, S., Shu, L.: Synchronization of nonlinear complex dynamical systems via delayed impulsive distributed control. Appl. Math. Comput. 320, 75–85 (2018)
20. Chen, W.-H., Wei, D., Lu, X.: Global exponential synchronization of nonlinear time-delay Lur'e systems via delayed impulsive control. Commun. Nonlinear Sci. Numer. Simul. 19(9), 3298–3312 (2014)
21. Liu, X., Zhang, K.: Synchronization of linear dynamical networks on time scales: pinning control via delayed impulses. Automatica 72, 147–152 (2016)
Retrieval of Redundant Hyperlinks After Attack Based on Hyperbolic Geometry of Web Complex Networks

Mahdi Moshiri and Farshad Safaei

Faculty of Computer Science and Engineering, Shahid Beheshti University, Tehran, Iran
{m_moshiri,f_safaei}@sbu.ac.ir

Abstract. The Internet and the Web can be described as huge networks of connected computers, connected web pages, or connected users. Analyzing link retrieval methods on the Internet and the Web as examples of complex networks is of particular importance. The recovery of complex networks is an important issue that has been extensively studied in various fields. Much work has been done to measure and improve the stability of complex networks during attacks. Recently, many studies have focused on network recovery strategies after an attack. Predicting the appropriate redundant links so that the network can be recovered at the lowest cost and in the shortest time after attacks or interruptions is critical in a disaster. Real-world networks such as the World Wide Web are no exception: many attacks are made on hyperlinks between web pages, and the issue of predicting redundant hyperlinks on the World Wide Web is also very important. In this paper, different kinds of attack strategies are provided and some retrieval strategies based on link prediction methods are proposed to recover the hyperlinks after failure or attack. Besides that, a new link prediction method based on the hyperbolic geometry of the complex network is proposed to retrieve redundant hyperlinks, and the numerical simulation reveals its superiority over the state-of-the-art algorithms in recovering the attacked hyperlinks, especially in the case of attacks based on the edge betweenness strategy.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 817–830, 2022. https://doi.org/10.1007/978-3-030-93409-5_67

1 Introduction

Over the past decades, many studies have focused on network recovery in various contexts, and many different kinds of recovery strategies have been developed considering failure characteristics and network types. Matisziw et al. [1] suggested a multiobjective recovery strategy. Chaoqi et al. [2] proposed a repair model to study the impact of network structure and analyzed the energy of the network during recovery. Hu et al. [3] suggested an optimal recovery strategy for geographical networks after attacks. Yu and Yang [4] selected appropriate components of the failed network to repair it and reach a stable condition with limited recovery resources. Besides that, there are many studies in the area of network resilience. Resilience is the power of the network to return to a stable state as soon as possible [5]. The successful implementation of recovery strategies has a direct impact on the resilience of complex networks. Based on this, many researchers have focused on studying the concept of resilience in complex networks. Majdandzic et al. [6] developed a model to demonstrate automatic recovery in a networked system. Also, Afrin et al. [7] have recently given a thorough survey of attack and recovery strategies and future approaches to this problem. The recovery strategies mentioned above are quite diverse and can be applied to different kinds of networks, including web networks, under various circumstances. For this reason, it is necessary to study the retrieval methods and develop a more general retrieval strategy that can handle more types of networks.

In recent years, the link prediction problem [8] has seen great progress in the area of network science. Link prediction methods try to predict missing, spurious, and future links in different kinds of complex networks [9–13]. This problem is generally solved based on unsupervised similarity measures, supervised methods (machine-learning based methods) [12], maximum likelihood methods [14], stochastic block models [15, 16], and probabilistic models [17, 18]. Recently, many new applications have been defined for link prediction methods, such as community detection, network reconstruction, and recommendation systems [19–21]. The robustness of link prediction methods under attack was recently examined by Wang et al. [22].

Different tools have been applied to enhance the performance of link prediction methods. One of these tools is the hyperbolic geometry of complex networks. Krioukov et al. [23] and Papadopoulos et al. [24] introduced the mapping of networks to their underlying hyperbolic space, studied the impact of the two factors of popularity and similarity on network growth, and suggested that new connections are made between nodes with a popularity and similarity trade-off. After that, Papadopoulos et al. [25] proposed the HyperMap method to map a network to its hyperbolic geometry and used the hyperbolic distance to solve the link prediction problem. Different from these works, Muscoloni et al. [26, 27] recently introduced a model (N-PSO) to predict the missing links based on the community structure of the networks. Samei and Jalili [28] proposed an enhanced method based on the hyperbolic distance to solve missing and spurious link prediction in multiplex networks.

The objective of this article is to propose a method to instantly recover the attacked or failed hyperlinks in the web network after random or targeted attacks via the unsupervised similarity measures used in solving the link prediction problem. The similarity measures are defined based on some existing measures such as Common Neighbors (CN), Resource Allocation (RA), hyperbolic distance, Cannistraci-Resource-Allocation (CRA), CH2-L2, and a proposed measure based on the underlying hyperbolic geometry of complex networks.

This paper is organized as follows. In Sect. 2, the problem is defined and the attack strategies, retrieval strategies and evaluation metrics are presented. In Sect. 3 the datasets are presented, and in Sect. 4 the experimental results and analysis are provided. Finally, Sect. 5 is the conclusion.
2 Methods and Evaluation Metrics

2.1 Problem Definition
Complex web networks are always threatened with malfunctions or disruptions due to technical failures or intentional attacks, which can eventually lead to total system failure. A graph whose nodes are the web pages and whose edges are the hyperlinks between those pages can be used to study this behavior under several kinds of disturbance. One of these is the disruption of, or attack on, hyperlinks between web pages, which will cause web pages to crash. For a given simple web network represented as $G = (V, E)$, $V$ denotes the set of all nodes (web pages) and $E$ denotes the set of all hyperlinks. After an attack or failure, a percentage of the hyperlinks in $E$, called $E^{A}$, is removed from the web network, and the goal is to add a set $E^{R}$ of redundant hyperlinks to the network that restores the web efficiency to its original state before the attack. In this section, attack strategies, retrieval strategies, and evaluation metrics are presented.

2.2 Attack Strategies
Real web networks suffer from various kinds of failures or attacks. Here we consider representative attack strategies [22]. Attack strategies for complex networks are generally based on network characteristics such as degree centrality, betweenness centrality, eigenvector centrality, closeness centrality, entropy, and so on [29–33]. Mozaffari et al. [34] have also proposed similar attack strategies for improving the robustness of scale-free networks. Here both a random hyperlink attack and several targeted hyperlink attack strategies are considered; 15% of the hyperlinks are removed based on these attack strategies. The attack strategies used in this article are based on the article by Moshiri et al. [35].

2.3 Retrieval Strategies
To make a network resilient against random and targeted attacks, many different strategies have been suggested and applied to complex network structures [7]. The various retrieval strategies can be categorized according to resilience considerations, failure characteristics, network properties, and retrieval priorities. Some retrieval strategies can belong to more than one category. The method proposed in this article can be categorized as a retrieval strategy based on retrieval priorities, which is an important factor in the instant retrieval of different networks such as web and blog networks. In this category, retrieval strategies follow priority-based rules to repair failed or attacked hyperlinks or elements. The selection rules consider different factors such as betweenness, distance, load, or crucial attacked elements in the network. In link prediction methods, the goal is to identify the hyperlinks that can instantly replace the removed hyperlinks (lost via attack or failure), so that the network performance under different criteria can be restored to a pre-attack state as soon as possible.

2.4 Redundant Hyperlinks Identification
After a targeted attack or accidental failure, the ordinary way to recover the network is by choosing random links to replace the removed ones. The most effective way to recover damaged links instantly is to check network efficiency at every step of adding redundant links and check the performance of the recovered network. In this paper, the main approach is based on network retrieval and predicting redundant links. This goal is achieved via the proposed methods, which are based on link prediction similarity measures, to improve network efficiency during the recovery state. Given the structure of the network after an intentional attack or random failure, the proposed link prediction-based methods are supposed to predict the appropriate redundant links to replace the original ones in the situation of a real attack. So a list of reserved links is considered that can be added to the network after failure or attack based on the strategies defined above. In other words, it is first assumed that the network has been attacked according to the attack strategies, and then the appropriate links are sought that can raise the network efficiency to the same condition as before the attack. The existing link prediction methods and a proposed method based on the local information of removed links are used to choose the appropriate list of reserved links:

• Common Neighbors (CN)

$$s^{CN}_{ij} = |\Gamma_i \cap \Gamma_j| \tag{1}$$

where $\Gamma_i$ is the set of neighbors of node $i$ and $|\cdot|$ indicates the number of nodes in a set.

• Resource Allocation (RA)

$$s^{RA}_{ij} = \sum_{k \in \Gamma_i \cap \Gamma_j} \frac{1}{|\Gamma_k|} \tag{2}$$

• Preferential Attachment (PA)

$$s^{PA}_{ij} = |\Gamma_i| \cdot |\Gamma_j| \tag{3}$$

• Cannistraci-Resource-Allocation (CRA)

$$s^{CRA}_{ij} = \sum_{k \in \Gamma_i \cap \Gamma_j} \frac{|iLCL(k)|}{|\Gamma_k|} \tag{4}$$

where $iLCL(k)$ is the set of internal local community links defined in [36].

• CH2-L2

$$s^{CH2\text{-}L2}_{ij} = \sum_{k \in L2} \frac{1 + di_k}{1 + de_k} \tag{5}$$

where $k$ is the intermediate node on a path of length two ($L2$) between $i$ and $j$, $di_k$ is the respective internal node degree and $de_k$ is the respective external node degree. More information about this measure is provided in [36].

• Hyperbolic distance (HP): This measure uses the hyperbolic distance of links based on the HyperMap method, which is based on Maximum Likelihood Estimation and finds the radial and angular coordinates for all nodes that maximize the likelihood

$$L = \prod_{1 \le i < j \le N} p(x_{ij})^{a_{ij}}\,[1 - p(x_{ij})]^{1 - a_{ij}} \tag{6}$$

where $x_{ij}$ is defined as the hyperbolic distance between the pair $i, j$:

$$x_{ij} = \operatorname{arccosh}(\cosh r_i \cosh r_j - \sinh r_i \sinh r_j \cos\Delta\theta_{ij}) \approx r_i + r_j + 2\ln\sin(\Delta\theta_{ij}/2) \approx r_i + r_j + 2\ln(\Delta\theta_{ij}/2) \tag{7}$$

where

$$\Delta\theta_{ij} = \pi - |\pi - |\theta_i - \theta_j|| \tag{8}$$

and $p(x_{ij})$ is the Fermi-Dirac connection probability:

$$p(x_{ij}) = \frac{1}{1 + e^{\frac{1}{2T}(x_{ij} - R)}} \tag{9}$$

where $R \sim \ln N$ [37]. The pseudo-code of the proposed method is shown in Fig. 1.
Fig. 1. The pseudo-code of the proposed method
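The local similarity indices of Eqs. (1)–(3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the network is assumed to be stored as a dictionary mapping each node to its neighbor set, and the function names and the toy graph are hypothetical.

```python
def common_neighbors(adj, i, j):
    """Eq. (1): number of neighbors shared by nodes i and j."""
    return len(adj[i] & adj[j])


def resource_allocation(adj, i, j):
    """Eq. (2): sum of 1/degree over the common neighbors of i and j."""
    return sum(1.0 / len(adj[k]) for k in adj[i] & adj[j])


def preferential_attachment(adj, i, j):
    """Eq. (3): product of the degrees of i and j."""
    return len(adj[i]) * len(adj[j])


# Hypothetical toy network of five pages (undirected hyperlinks).
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Scores for the non-adjacent pair (1, 4).
print(common_neighbors(adj, 1, 4))
print(resource_allocation(adj, 1, 4))
print(preferential_attachment(adj, 1, 4))
```

All three scores use only the neighborhoods of the two endpoints, which is what makes them cheap enough for ranking many candidate hyperlinks after an attack.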
Figure 2 shows an example of the proposed method. When the hyperlink between the red nodes (web pages 3 and 6) is damaged, all neighbors (other web pages) of the red nodes are candidates for making a new hyperlink with one of the red nodes, and the one with the highest computed score is the winner.
Fig. 2. Example of the proposed method. Nodes 1 to 9 are web pages or blogs that are connected via hyperlinks.
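The selection step illustrated in Fig. 2 can be sketched as follows, using the hyperbolic distance of Eqs. (7)–(8) and the connection probability of Eq. (9) as the score. This is a hedged sketch, not the pseudo-code of Fig. 1: the function names, the candidate rule (neighbors of either endpoint that are not yet linked to the other endpoint), the use of Eq. (9) as the score, and the toy coordinates are all assumptions made for illustration.

```python
import math


def hyperbolic_distance(r_i, theta_i, r_j, theta_j):
    """Hyperbolic distance x_ij of Eq. (7) with the angular gap of Eq. (8)."""
    d_theta = math.pi - abs(math.pi - abs(theta_i - theta_j))
    arg = (math.cosh(r_i) * math.cosh(r_j)
           - math.sinh(r_i) * math.sinh(r_j) * math.cos(d_theta))
    return math.acosh(max(arg, 1.0))  # clamp guards against rounding below 1


def connection_probability(x_ij, big_r, temperature):
    """Fermi-Dirac connection probability of Eq. (9)."""
    return 1.0 / (1.0 + math.exp((x_ij - big_r) / (2.0 * temperature)))


def best_replacement(adj, coords, u, v, big_r, temperature):
    """Choose a new hyperlink after the hyperlink (u, v) was removed.

    adj maps each node to its neighbor set in the attacked network and
    coords maps each node to its (r, theta) hyperbolic coordinates.
    """
    neighbors = (adj[u] | adj[v]) - {u, v}
    candidates = sorted((a, w) for a in (u, v) for w in neighbors
                        if w not in adj[a])
    if not candidates:
        return None

    def score(pair):
        a, w = pair
        x = hyperbolic_distance(*coords[a], *coords[w])
        return connection_probability(x, big_r, temperature)

    return max(candidates, key=score)


# Hypothetical attacked network: the hyperlink (3, 6) has been removed.
adj = {3: {1}, 6: {7}, 1: {3}, 7: {6}}
coords = {3: (1.0, 0.0), 6: (1.0, 0.5), 1: (2.0, 0.1), 7: (2.0, 3.0)}
best = best_replacement(adj, coords, 3, 6, big_r=math.log(4), temperature=0.5)
print(best)
```

In this toy case page 1 lies at a small angular distance from page 6, so the pair (6, 1) receives the highest connection probability and is selected in preference to (3, 7).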
2.5 Evaluation Metrics

To evaluate the proposed methods, two metrics are used, defined as follows:

• Efficiency:

$$E = \frac{EF^{R}}{EF^{A}} \tag{11}$$

where $EF^{A}$ is the efficiency of the web network before the attack and $EF^{R}$ is the efficiency of the web network after retrieval. The efficiency of the web network is defined as

$$EF = \frac{1}{|N|(|N| - 1)}\sum_{i \ne j \in N} \frac{1}{d_{ij}} \tag{12}$$

where $d_{ij}$ is the shortest-path length between nodes $i$ and $j$. The greater $E$ is, the better the method performs.

• Retrieval power (RP):

$$RP = \frac{|E^{Recovery}|}{|E^{Attacked}|} \tag{13}$$

where $|E^{Attacked}|$ is the number of removed hyperlinks after the attack and $|E^{Recovery}|$ is the number of redundant hyperlinks that should be added to the web network to restore the network to its original efficiency before the attack or failure. The smaller RP is, the better the method performs. As can be seen in the Results section, in some methods $|E^{Recovery}|$ is much less than $|E^{Attacked}|$, which shows their good performance. In some other methods $|E^{Recovery}|$ is much more than $|E^{Attacked}|$, which shows their poor performance. For the cases in which $RP > 1$ the performance of the method is not acceptable; in this case, the measure is not computed and is shown with "-". In cases where $RP = 0$, retrieval is not achieved by the relevant algorithm.
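Both metrics can be computed with standard-library Python. The sketch below is an illustration under stated assumptions, not the authors' code: the network is an undirected, unweighted adjacency dictionary, shortest paths are found by breadth-first search, unreachable pairs are assumed to contribute zero to Eq. (12), and the function names are hypothetical.

```python
from collections import deque


def global_efficiency(adj):
    """Eq. (12): mean of 1/d_ij over ordered pairs of distinct nodes."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))


def efficiency_ratio(adj_retrieved, adj_original):
    """Eq. (11): efficiency after retrieval relative to before the attack."""
    return global_efficiency(adj_retrieved) / global_efficiency(adj_original)


def retrieval_power(num_recovery, num_attacked):
    """Eq. (13): added hyperlinks relative to removed hyperlinks."""
    return num_recovery / num_attacked


triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
path = {1: {2}, 2: {1, 3}, 3: {2}}
print(global_efficiency(triangle), global_efficiency(path))
print(efficiency_ratio(path, triangle))
```

A ratio of 1 in Eq. (11) means the retrieved network is as efficient as the original one, and the number of hyperlinks added up to that point gives the numerator of RP.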
3 Datasets

To evaluate the proposed method and compare it with other existing methods, the information of four real web networks is used. To approximate the underlying hyperbolic geometry of the complex networks via the HyperMap method, two parameters $\gamma$ and $T$ are required. Parameter $\gamma$ is the power-law degree distribution exponent, which is approximated via the method introduced by Clauset et al. [38], and $T$ is the temperature, which is estimated using the Nonuniform Popularity Similarity Optimization (N-PSO) model [27]. Table 1 provides an explanation of these web or blog networks.

Table 1. Summary of the datasets

| Name | #Nodes | Node type | #Edges | Edge type | Ref. | Description |
|---|---|---|---|---|---|---|
| AIDS blogs (2005) | 146 | Blog | 187 | Hyperlink | [39] | A network of hyperlinks among blogs related to AIDS, patients, and their support networks, collected by Gopal over a three-day period in August 2005 |
| moreno_blogs | 1224 | Blog | 19025 | Hyperlink | [40, 41] | This directed network contains front-page hyperlinks between blogs in the context of the 2004 US election |
| web-polblogs | 643 | Blog | 2300 | Hyperlink | [42] | The graph data sets are donated by a number of different authors and organizations and in many cases have provided the citation information that should be used upon request |
| Amazon pages (2012) | 2880 | Web page | 5037 | Hyperlink | [43] | A small sample of web pages from Amazon.com and its sister companies. The manner in which this network was sampled is unclear, and the direction of the hyperlink has been discarded |
4 Experimental Results and Analysis

The experimental results of the proposed methods on four real web networks are presented in this section. For each network, 15% of the hyperlinks are first removed via the different attack strategies explained before; then the predicted set of redundant hyperlinks is added gradually, and the evaluation metrics are used to investigate the performance of the retrieved network.

4.1 Comparison of Retrieval Methods
Figure 3 shows the efficiency of the different web networks versus the percentage of hyperlinks added to the attacked network in the range of [0, 0.15]. As shown, the proposed retrieval methods increase the efficiency of all networks, and the proposed retrieval method (HP) performs significantly better in the case of attacks based on edge betweenness. When the attack treats edge betweenness as the important factor of a hyperlink, removing those hyperlinks results in crucial damage to the network efficiency and may cause the network to become disconnected. Adding redundant hyperlinks that are locally near the damaged hyperlink therefore improves the efficiency of the attacked network significantly, so the proposed method can have a considerable impact on restoring the web network structure as soon as possible. Table 2 shows the retrieval power of the different recovery methods while restoring the attacked network. The number in each cell shows the percentage of hyperlinks that should be added to the attacked network to restore the efficiency of the network to the state before the attack. In most cases except the proposed method (HP), adding 15% of redundant hyperlinks to the damaged network cannot achieve the original efficiency of the network. In the case of HP, however, adding a smaller or equal percentage of redundant hyperlinks to the damaged network restores the original efficiency of the attacked network. Since an instant and cheap recovery strategy is of high importance in an attack situation, this is an important advantage of the proposed method.
Fig. 3. Efficiency of the proposed methods under different kinds of attacks in four real web/blog networks: (a) AIDS blogs (2005), (b) moreno_blogs, (c) web-polblogs and (d) Amazon_pages 2012

Table 2. Retrieval power of the proposed retrieval strategy under different kinds of attack

| Network | Attack strategy | HP | RA | PA | CN | CRA | CH2_L2 |
|---|---|---|---|---|---|---|---|
| AIDS blogs (2005) | Random failure | 1.0 | 0 | – | 0 | – | – |
| AIDS blogs (2005) | Edge betweenness | 1.0 | 0 | 0.3 | 0 | – | – |
| AIDS blogs (2005) | Preferential attachment | 1.0 | 0 | 0.26 | – | 0.81 | – |
| AIDS blogs (2005) | Similarity | 0.96 | 0.26 | 0.04 | 0.52 | 0.56 | 0.3 |
| moreno_blogs | Random failure | 0.97 | 0 | – | 0 | – | 0 |
| moreno_blogs | Edge betweenness | 1.0 | – | 0.37 | 0 | – | – |
| moreno_blogs | Preferential attachment | 1.0 | 0.39 | 0.1 | – | – | 0.72 |
| moreno_blogs | Similarity | 1.0 | 0.23 | 0.09 | 0.52 | 0.75 | 0.31 |
| web-polblogs | Random failure | 1.0 | – | 0.31 | 0 | 0 | – |
| web-polblogs | Edge betweenness | 0.96 | 0 | 0 | 0 | 0 | 0 |
| web-polblogs | Preferential attachment | 1.0 | 0.26 | 0.01 | 0.7 | – | 0.45 |
| web-polblogs | Similarity | 1.0 | 0.07 | 0.01 | 0.13 | 0.13 | 0.11 |
| Amazon_pages | Random failure | 1.0 | 0 | 0 | 0 | 0 | 0 |
| Amazon_pages | Edge betweenness | 1.0 | 0 | 0 | 0 | – | 0 |
| Amazon_pages | Preferential attachment | 1.0 | 0 | – | 0 | – | – |
| Amazon_pages | Similarity | 0.3 | 0 | 0 | 0 | – | 0 |
4.2 Comparison of the Network Structure Under Different Retrieval Strategy
To evaluate the efficiency of the proposed retrieval method for different sizes of attacked hyperlinks, the fraction of damaged hyperlinks is changed in the range of [0.1, 0.5]. As an example, Fig. 4 shows the network structure of AIDS blogs (2005) before and after the EBA attack, as well as after blog network retrieval using different retrieval strategies. As can be seen, the proposed HP retrieval method preserves the network structure and yields a structure closer to the original blog network before the attack. In the other methods, node rupture is observed after network retrieval.
Fig. 4. A comparison of the structure of AIDS blogs (2005) (N = 146) when hyperlinks have been removed by EBA attack in different retrieval strategy (Visualization generated using python pyvis lib [44]).
5 Conclusion

Link prediction methods have been used widely in solving different kinds of problems. In this paper, some existing and proposed methods are suggested to retrieve the web network structure after random or targeted attacks. Firstly, some attack strategies are defined, and based on them some retrieval strategies based on link prediction similarity measures are proposed. The main proposed method is based on the hyperbolic geometry and local information of the attacked network. The underlying hyperbolic geometry of complex networks uses two important parameters, the similarity and popularity of nodes (web pages or blogs), which both play an important role in solving the link prediction problem. On the other hand, using the local information of the damaged hyperlinks helps the method to choose appropriate hyperlinks to replace the removed ones as soon as possible. The results show that the proposed method outperforms the other existing link prediction methods in most cases and can achieve the efficiency of the network before the attack even when adding fewer hyperlinks than the removed ones. Also, the network structure after the attacks is largely preserved by the proposed retrieval method based on hyperbolic geometry and is closer to the original network before the attacks.
References 1. Matisziw, T.C., Murray, A.T., Grubesic, T.H.: Strategic network restoration. Netw. Spat. Econ. 10(3), 345–361 (2010) 2. Chaoqi, F., et al.: Complex networks under dynamic repair model. Physica A 490, 323–330 (2018) 3. Hu, F., et al.: Recovery of infrastructure networks after localised attacks. Sci. Rep. 6(1), 1– 10 (2016) 4. Yu, H., Yang, C.: Partial network recovery to maximize traffic demand. IEEE Commun. Lett. 15(12), 1388–1390 (2011) 5. Yodo, N., Wang, P.: Engineering resilience quantification and system design implications: a literature survey. J. Mech. Des. 138, 11 (2016) 6. Majdandzic, A., et al.: Spontaneous recovery in dynamical networks. Nat. Phys. 10(1), 34– 38 (2014) 7. Afrin, T., Yodo, N.: A concise survey of advancements in recovery strategies for resilient complex networks. J. Complex Netw. 7(3), 393–420 (2019) 8. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inform. Sci. Technol. 58(7), 1019–1031 (2007) 9. Clauset, A., Moore, C., Newman, M.E.: Hierarchical structure and the prediction of missing links in networks. Nature 453(7191), 98–101 (2008) 10. Fu, C., et al.: Link weight prediction using supervised learning methods and its application to yelp layered network. IEEE Trans. Knowl. Data Eng. 30(8), 1507–1518 (2018) 11. Lü, L., et al.: Toward link predictability of complex networks. Proc. Natl. Acad. Sci. 112(8), 2325–2330 (2015) 12. Lü, L., Zhou, T.: Link prediction in complex networks: a survey. Physica A 390(6), 1150– 1170 (2011) 13. Samei, Z., Jalili, M.: Discovering spurious links in multiplex networks based on interlayer relevance. J. Complex Netw. 7(5), 641–658 (2019)
14. Sales-Pardo, M., et al.: Extracting the hierarchical organization of complex systems. Proc. Natl. Acad. Sci. 104(39), 15224–15229 (2007)
15. Airoldi, E.M., et al.: Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014 (2008)
16. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)
17. Heckerman, D., Meek, C., Koller, D.: Probabilistic entity-relationship models, PRMs, and plate models. In: Introduction to Statistical Relational Learning, pp. 201–238 (2007)
18. Neville, J.: Statistical models and analysis techniques for learning in relational data (2006)
19. Herrgård, M.J., et al.: A consensus yeast metabolic network reconstruction obtained from a community approach to systems biology. Nat. Biotechnol. 26(10), 1155–1160 (2008)
20. Linden, G., Smith, B., Com, J.Y.A.: Industry report: Amazon.com recommendations: item-to-item collaborative filtering. IEEE Distrib. Syst. Onl. Citeseer (2003)
21. Radicchi, F., et al.: Defining and identifying communities in networks. Proc. Natl. Acad. Sci. 101(9), 2658–2663 (2004)
22. Wang, K., Li, L., Pu, C.: Robustness of link prediction under network attacks (2018). https://arxiv.org/abs/1811.04528
23. Krioukov, D., et al.: Hyperbolic geometry of complex networks. Phys. Rev. E 82(3), 036106 (2010)
24. Papadopoulos, F., et al.: Popularity versus similarity in growing networks. Nature 489(7417), 537–540 (2012)
25. Papadopoulos, F., Psomas, C., Krioukov, D.: Network mapping by replaying hyperbolic growth. IEEE/ACM Trans. Netw. 23(1), 198–211 (2014)
26. Alessandro, M., Vittorio, C.C.: Leveraging the nonuniform PSO network model as a benchmark for performance evaluation in community detection and link prediction. New J. Phys. 20(6), 063022 (2018)
27. Muscoloni, A., Cannistraci, C.V.: A nonuniform popularity-similarity optimization (nPSO) model to efficiently generate realistic complex networks with communities. New J. Phys. 20(5), 052002 (2018)
28. Samei, Z., Jalili, M.: Application of hyperbolic geometry in link prediction of multiplex networks. Sci. Rep. 9(1), 1–11 (2019)
29. Albert, R., Jeong, H., Barabási, A.-L.: Error and attack tolerance of complex networks. Nature 406(6794), 378–382 (2000)
30. Cohen, R., et al.: Breakdown of the internet under intentional attack. Phys. Rev. Lett. 86(16), 3682 (2001)
31. Crucitti, P., et al.: Error and attack tolerance of complex networks. Physica A 340(1–3), 388–394 (2004)
32. Allesina, S., Pascual, M.: Googling food webs: can an eigenvector measure species’ importance for coextinctions? PLoS Comput. Biol. 5(9), e1000494 (2009)
33. Iyer, S., et al.: Attack robustness and centrality of complex networks. PLoS ONE 8(4), e59613 (2013)
34. Mozafari, M., Khansari, M.: Improving the robustness of scale-free networks by maintaining community structure. J. Complex Netw. 7(6), 838–864 (2019)
35. Moshiri, M., Safaei, F., Samei, Z.: A novel recovery strategy based on link prediction and hyperbolic geometry of complex networks. J. Complex Netw. 9(4), cnab007 (2021)
36. Muscoloni, A., Abdelhamid, I., Cannistraci, C.V.: Local-community network automata modelling based on length-three-paths for prediction of complex network structures in protein interactomes, food webs and more. bioRxiv 346916 (2018)
37. Kleineberg, K.-K., et al.: Hidden geometric correlations in real multiplex networks. Nat. Phys. 12(11), 1076–1081 (2016)
38. Clauset, A., Shalizi, C.R., Newman, M.E.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)
39. Gopal, S.: The evolving social geography of blogs. In: Miller, H.J. (ed.) Societies and Cities in the Age of Instant Access, pp. 275–293. Springer, Dordrecht (2007). https://doi.org/10.1007/1-4020-5427-0_18
40. Kunegis, J.: Konect: the koblenz network collection. In: Proceedings of the 22nd International Conference on World Wide Web (2013)
41. Adamic, L.A., Glance, N.: The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery (2005)
42. https://networkrepository.com/web-polblogs.php
43. Šubelj, L., Bajec, M.: Ubiquitousness of link-density and link-pattern communities in real-world networks. Eur. Phys. J. B 85(1), 1–11 (2012)
44. https://pyvis.readthedocs.io/en/latest/
Deep Reinforcement Learning for FlipIt Security Game
Laura Greige1(B) and Peter Chin1,2,3
1 Boston University, Boston, MA, USA
[email protected], [email protected]
2 Center for Brains, Minds and Machines, MIT, Cambridge, MA, USA
3 CMSA, Harvard University, Cambridge, MA, USA
Abstract. Reinforcement learning has shown much success in games such as chess, backgammon and Go [21, 22, 24]. However, in most of these games, agents have full knowledge of the environment at all times. In this paper, we describe a deep learning model in which agents successfully adapt to different classes of opponents and learn the optimal counter-strategy using reinforcement learning in a game under partial observability. We apply our model to FlipIt [25], a two-player security game in which both players, the attacker and the defender, compete for ownership of a shared resource and only receive information on the current state of the game upon making a move. Our model is a deep neural network combined with Q-learning and is trained to maximize the defender’s time of ownership of the resource. Despite the noisy information, our model successfully learns a cost-effective counter-strategy that outperforms its opponent’s strategies, showing the advantages of deep reinforcement learning in game-theoretic scenarios. We also extend FlipIt to a game with a larger action space by introducing a new lower-cost move, and we generalize the model to n-player FlipIt.

Keywords: FlipIt, Game theory, Cybersecurity games, Deep Q-learning
1 Introduction
Game theory has been commonly used for modeling and solving security problems. When payoff matrices are known by all parties, one can solve the game by calculating its Nash equilibria and playing one of the corresponding mixed strategies to maximize one’s gain (or, symmetrically, minimize one’s loss). However, the assumption that the payoff is fully known by all players involved is often too strong to effectively model the situations that arise in practice. It is therefore useful to consider the case of incomplete information and to apply reinforcement learning methods, which are better suited to these settings. In particular, we examine the two-player game FlipIt [25], in which an attacker and a defender compete over a shared resource and agents deal with incomplete observability.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 831–843, 2022. https://doi.org/10.1007/978-3-030-93409-5_68
The principal motivation for the game FlipIt is the rise of Advanced Persistent Threats (APTs) [4,19]. APTs are stealthy and persistent computer hacking processes that can compromise the security of a system or network and remain undetected for an extended period of time. Such threats include intellectual property theft, host takeover and compromised security keys, caused by network infiltration, typically of large enterprise or governmental networks. For host takeover, the goal of the attacker is to compromise the device, while the goal of the defender is to keep the device clean through software reinstallation or other defensive precautions. We would like to learn how often the defender should clean the machines and when the attacker will launch its next attack. Hence, the problem can be formulated as finding a cost-effective schedule through reinforcement learning. All these applications can be modeled by the two-player game FlipIt, in which players, attackers and defenders, vie for control of a shared resource. The resource could be a computing device or a password, for example, depending on which APT is being modeled.
In our work, we train our model to estimate the opponent’s strategy and to learn the best response to that strategy. Since these estimates depend heavily on the information the model receives throughout the game, the challenge comes from the incomplete and imperfect information received on the state of the game. The goal is for the adaptive agents to adjust their strategies based on their observations; good FlipIt strategies will help players implement their optimal cost-effective schedule.
In the next sections, we present previous related studies and describe the game framework and its variants. We then describe our approach to the problem of learning under partial observability and our model architecture. We demonstrate successful counter-strategies developed by adaptive agents against basic renewal strategies and compare their performance in the original version of FlipIt to one with a larger action space after introducing a lower-cost move. In the last sections, we generalize our model to multiplayer FlipIt and discuss the next steps of our project.
2 Related Work
Although game theory models have been widely applied to solve cybersecurity problems [1,6,9], studies have mainly focused on one-shot attacks of known types. FlipIt, introduced by van Dijk et al. [25], is the first model that characterizes the persistent and stealthy properties of APTs. In their paper, they analyze multiple instances of the game with non-adaptive strategies and show the dominance of certain distributions against stealthy opponents. They also show that the Greedy strategy is dominant over different distributions, such as periodic and exponential distributions, but is not necessarily optimal. Different variants and extensions of the game have also been analyzed; these include games with additional “insider” players trading information to the attacker for monetary gains [5,7], games with multiple resources [10] and games with different move types [18]. In all these variants, only non-adaptive strategies have been considered, which limits the analysis of the game framework. Laszka et al. [11,12] proposed a study of adaptive strategies in FlipIt, but in a variant of the game where the defender’s moves are non-stealthy and non-instantaneous. Oakley et al. [16] were the first to design adaptive strategies using temporal difference reinforcement learning in 2-player FlipIt.
Machine Learning (ML) has been commonly used in different cybersecurity problems such as fraud and malware detection [2,13], data-privacy protection [28] and cyber-physical attacks [3]. It has allowed the improvement of attacking strategies that can overcome defensive ones and, vice versa, the development of better and more robust defending strategies to prevent or minimize the impact of these attacks. Reinforcement Learning (RL) is a particular branch of ML in which an agent interacts with an environment and learns from its own past experience through exploration and exploitation, with little or no prior knowledge of the environment. RL and the development of deep learning have led to the introduction of Deep Q-Networks (DQNs) to solve larger and more complex games. DQNs were first introduced by Mnih et al. [14] and have since been commonly used for solving games such as backgammon, the game of Go and Atari [15,20,22]. They combine deep learning and Q-learning [23,27] and are trained to learn the best action to perform in a particular state in terms of producing the maximum future cumulative reward. Hence, with their ability to model autonomous agents capable of making optimal sequential decisions, DQNs are well suited to an adversarial environment such as FlipIt. Our paper extends the research on stealthy security games by introducing adaptive DQN-based strategies that allow agents to learn a cost-effective schedule for defensive precautions in FlipIt and its variants, all in real time.
3 Game Environment
3.1 Framework
FlipIt is an infinitely repeated game where the same one-shot stage game is played repeatedly over a number of discrete time periods. At each period of a game, players decide what action to take depending on their respective strategies. They take control of the resource by moving, or by what is called “flipping”. Flipping is the only move option available and each player can flip at any time throughout the game. We assume that the defender is the rightful owner of the resource and, as such, ties are broken by assigning ownership to the defender. Each player pays a certain move cost for each flip and is rewarded for time in possession of the resource. For our purposes we have used the same reward and flip cost for all players (attackers and defenders, adaptive and non-adaptive), but our environment can easily be generalized to experiment with different rewards and costs for each player. Moreover, an interesting aspect of FlipIt is that, contrary to games like backgammon and Go, agents do not take turns moving. A move can be made at any time throughout the game, and therefore a player’s score depends heavily on its opponent’s moves. The final payoff corresponds to the sum of the player’s payoffs from each round. Finally, players have incomplete information about the game, as they only find out about its current state once they flip. In particular, adaptive agents only receive feedback from the environment upon flipping, which corresponds to their opponent’s last move (LM).
Unless stated otherwise, we assume in the remainder of the paper that the defender is the initial owner of the resource, as is usually the case with security keys and other devices. The defender is an LM player using a DQN-based strategy against an attacker that follows one of the renewal strategies we describe in the following sections.
3.2 Markov Decision Process
Our environment is defined as a Markov Decision Process (MDP). At each iteration, agents select an action from the set of possible actions A. In FlipIt, the action space is restricted to two actions: to flip and not to flip. As previously mentioned, agents do not always have a correct perception of the current state of the game. In Fig. 1 we describe a case where an LM agent P1 plays with only partial observability against a periodic agent P2. When P1 flips at iteration 6, the only feedback it receives from the environment concerns its opponent’s flip at iteration 4. Hence, no information is given regarding the opponent’s previous flip at iteration 2, and P1 is left with an incorrect assumption about the time it controlled the resource. Suppose P1 claims ownership of the resource at iteration 6. Then P1’s benefit would equal the reward for the time it was in control of the resource, represented by τP1 in Fig. 1, minus the operational cost of flipping.
Fig. 1. An example of incomplete or imperfect observability in FlipIt by P1
State Space. We consider a discrete state space where each state indicates the current state of the game, i.e. whether the defender is the current owner of the resource and the times elapsed since each agent’s last known moves. The current owner of the resource can be inferred from the current state of the game. Agents only learn the current state of the game once they flip, which leads to imperfect information in their observations, as previously explained.
Action Space. In the original version of FlipIt, the only move option available is to flip. An adaptive agent therefore has two possible actions: to flip or not to flip. In this paper, we extend the game framework to a larger action space and introduce a new move called check. This move allows an agent to check the current state of the game and obtain information regarding its opponents’ last known moves while paying a lower operational cost than that of flipping. Just like flipping, checking can be done at any time throughout the game. The action spaces are then denoted by $A_d = \{\text{void}, \text{flip}, \text{check}\}$ for the defender and $A_a = \{\text{void}, \text{flip}\}$ for the attacker.
State Transitions. Since an agent can move at any time throughout the game, both agents may flip simultaneously; ties are broken by automatically assigning ownership to the defender. At each iteration, the state of the game is updated as follows. If the defender flipped, the current owner is assigned to the defender. If the defender did not flip and its opponent flipped, the current owner is assigned to the opponent. If neither agent flips, the current owner is left unchanged. The transition to the next state depends only on the current state and the actions taken, such that the state transition function $T$ is defined by $T : S \times A_d \times A_a \to \Delta(S)$.
Reward System. We define the immediate reward at each iteration based on the action taken as well as the owner of the resource at the previous iteration.
i) Operational Costs and Payoff. Let $r_t$ be the immediate reward received by an agent at time step $t$. We have

$$
r_t =
\begin{cases}
0 & \text{if no play} \\
-C_c & \text{if } a_t = \text{check} \\
\tau \cdot r - C_f & \text{if } a_t = \text{flip}
\end{cases}
$$

where $r$ is the payoff given for owning the resource for one time step, $C_c$ is the operational cost of checking and $C_f$ the operational cost of flipping. $\tau$ is the time elapsed between the agent’s last flip and the time step at which it last owned the resource before its current flip, as described in Fig. 1.
ii) Discount Factor. Let $\gamma$ be the discount factor. The discount factor determines the importance of future rewards, and in our environment a correct action at some time step $t$ is not necessarily rewarded immediately. In fact, because the flip cost is higher than the reward for one time step of ownership, an agent is penalized for flipping at the correct moment but is rewarded in future time steps. This is why we set our discount factor $\gamma$ to be as large as possible, giving more importance to future rewards and pushing our agent to aim for long-term high rewards instead of short-term ones.
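The state transition and reward rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `next_owner` and `immediate_reward` and the integer action codes are hypothetical, while the default values (reward 1, flip cost 4, check cost 1) are the ones quoted in the experimental sections.

```python
# Hypothetical action codes for illustration; the paper only names the actions.
VOID, FLIP, CHECK = 0, 1, 2


def next_owner(defender_owns, defender_action, attacker_action):
    """Apply the ownership rule: the defender wins ties on simultaneous flips."""
    if defender_action == FLIP:
        return True            # defender flipped (possibly at the same time)
    if attacker_action == FLIP:
        return False           # only the attacker flipped
    return defender_owns       # nobody flipped: owner unchanged


def immediate_reward(action, tau, r=1.0, c_flip=4.0, c_check=1.0):
    """Immediate reward r_t for one agent.

    tau is the ownership time credited to the agent on a flip, as in Fig. 1.
    """
    if action == FLIP:
        return tau * r - c_flip
    if action == CHECK:
        return -c_check
    return 0.0                 # no play
```

For example, a flip that credits the agent with 10 time steps of ownership yields 10 · 1 − 4 = 6, whereas a flip crediting fewer than 4 time steps is penalized, which is why the agent must learn to space its flips.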
4 Model Architecture
Q-learning is a reinforcement learning algorithm in which an agent or a group of agents tries to learn the optimal policy from past experiences and interactions with an environment. These experiences are sequences of states, actions and rewards. In its simplest form, Q-learning is a table of values for each state (row) and action (column) possible in the environment. Given a current state, the algorithm estimates the value in each table cell, corresponding to how good it is to take this action in this particular state. These estimates are refined at each iteration, and the process continues until the agent arrives at a terminal state in the environment. This becomes quite inefficient when the number of states is large or unknown, as in FlipIt. For these situations, larger and more complex implementations of Q-learning have been introduced, in particular Deep Q-Networks (DQNs).
Deep Q-Networks were first introduced by Mnih et al. (2013) and have since been commonly used for solving games. DQNs are trained to learn the best action to perform in a particular state in terms of producing the maximum future cumulative reward, and they map state-action pairs to rewards. Our objective is to train our agent such that its policy converges to the theoretical optimal policy that maximizes the future discounted rewards. In other words, given a state $s$ we want to find the optimal policy $\pi^*$ that selects action $a$ such that $a = \arg\max_a Q_{\pi^*}(s, a)$, where $Q_{\pi^*}(s, a)$ is the Q-value that corresponds to the overall expected reward given the state-action pair $(s, a)$. It is defined by

$$
Q_{\pi^*}(s, a) = \mathbb{E}_{\pi}\left[\, r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{T-t} r_T \;\middle|\; s_t = s,\ a_t = a \,\right] \tag{1}
$$

where $T$ is the length of the game. Q-values are updated for each state and action using the following Bellman equation,

$$
Q_n(s, a) = Q(s, a) + \alpha \left[\, R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right] \tag{2}
$$

where $Q_n(s, a)$ and $Q(s, a)$ are the new and current Q-values for the state-action pair $(s, a)$, $R(s, a)$ is the reward received for taking action $a$ at state $s$, $\max_{a'} Q(s', a')$ is the maximum expected future reward given the new state $s'$ and all possible actions from state $s'$, $\alpha$ is the learning rate and $\gamma$ the discount factor.
Our model architecture consists of 3 fully connected layers with a rectified linear unit (ReLU) activation function at each layer. It is trained with Q-learning using the PyTorch framework [17] and optimized using the Adam optimizer [8]. We use an experience replay [26] memory to store the history of state transitions and rewards (i.e. experiences) and sample mini-batches from it to calculate the Q-values and update our model. The state of the game given as input to the neural network corresponds to the agent’s current knowledge of the game, i.e. the time passed since its last move and the time passed since its opponent’s last known move. The output corresponds to the Q-values calculated for each action. The learning rate is set to 0.001 and the discount factor to 0.99.
We value exploration over exploitation and use an $\epsilon$-Greedy algorithm such that at each time step a random action is selected with probability $\epsilon$ and the action corresponding to the highest Q-value is selected with probability $1 - \epsilon$. $\epsilon$ is initially set to 0.6 and is gradually reduced at each time step as the agent becomes more confident at estimating Q-values. We choose 0.6 as it yields the best outcome regardless of the attacker’s strategy. In particular, we find that, although the agent eventually converges to its maximal benefit, higher exploration values can negatively impact learning while lower exploration values can cause slower convergence (Fig. 2).
Fig. 2. Learning over time averaged over 10 FlipIt simulations against 3 renewal strategies, as described in the next section.
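To make the update rule of Eq. (2) and the ε-Greedy selection concrete, the following is a minimal tabular sketch in Python. It is not the DQN used in the paper: a dictionary stands in for the neural network, and the names `q_update`, `epsilon_greedy` and `decay`, as well as the decay rate and floor, are assumptions made for illustration (the text only states that ε starts at 0.6 and is gradually reduced).

```python
import random
from collections import defaultdict

ALPHA = 0.001   # learning rate quoted in the text
GAMMA = 0.99    # discount factor quoted in the text


def q_update(q, s, a, reward, s_next, actions, alpha=ALPHA, gamma=GAMMA):
    """One application of Eq. (2) on a table q keyed by (state, action)."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])
    return q[(s, a)]


def epsilon_greedy(q, s, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda b: q[(s, b)])


def decay(epsilon, rate=0.999, floor=0.01):
    """Assumed multiplicative schedule; the paper does not specify one."""
    return max(floor, epsilon * rate)


# Usage: q = defaultdict(float); epsilon = 0.6; then alternate
# epsilon_greedy -> environment step -> q_update -> decay.
```

In the DQN setting the table lookup `q[(s, a)]` is replaced by a forward pass of the network, and the bracketed term of Eq. (2) becomes the regression target computed on mini-batches drawn from the replay memory.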
5 Experimental Results
In what follows, we assume that opponent move rates are such that the expected interval between two consecutive flips is larger than the flip move cost. We trained our neural network to learn the best counter-strategy to its opponent’s, with the reward for being in control of the resource set to 1 and the cost of flipping set to 4. The flip cost is purposely set higher than the reward in order to discourage the defender from flipping at each iteration. The following findings apply for any cost value that is greater than the reward.
5.1 Renewal Strategies
A renewal process is a process which selects a renewal time from a probability distribution and repeats at each renewal. For our purpose, a renewal is a flip and our renewal process is 1-dimensional with only a time dimension. There are at least two properties we desire from a good strategy. First, we expect the strategy to have some degree of unpredictability. A predictable strategy will be susceptible to exploitation, in that a malignant or duplicitous opponent can strategically select flips according to the predictable flips of the agent. Second, we expect a good strategy to space its flips efficiently. Intuitively, we can see that near simultaneous flips will waste valuable resources without providing proportionate rewards. In this paper, we examine three basic renewal strategies for the attacker, periodic (Pδ ), periodic with a random phase (Pδ ) and exponential (Eλ ), as they present different degrees of predictability and spacing efficiency.
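As an illustration of the three renewal strategies named above, the sketch below generates attacker flip times on a discrete time axis. The function names, the convention that a plain periodic agent first flips at time δ, the uniform integer phase, and the rounding of exponential intervals to at least one time step are all assumptions made for this example, not details taken from the paper.

```python
import random


def periodic_flips(delta, horizon, first=None):
    """Flip every `delta` steps, starting at `first` (default: delta)."""
    start = delta if first is None else first
    return list(range(start, horizon, delta))


def periodic_random_phase_flips(delta, horizon, rng):
    """Periodic strategy whose first flip is drawn uniformly from 1..delta."""
    return periodic_flips(delta, horizon, first=rng.randint(1, delta))


def exponential_flips(lam, horizon, rng):
    """Intervals drawn from an exponential law with rate lam, rounded to
    whole time steps (at least 1) to fit a discrete game."""
    times, t = [], 0
    while True:
        t += max(1, round(rng.expovariate(lam)))
        if t >= horizon:
            return times
        times.append(t)


rng = random.Random(0)                              # seeded for repeatability
attacker_p = periodic_flips(10, 50)                 # [10, 20, 30, 40]
attacker_pr = periodic_random_phase_flips(10, 400, rng)
attacker_e = exponential_flips(0.1, 400, rng)
```

The three generators differ in exactly the two properties discussed above: the periodic schedule is perfectly spaced but fully predictable, the random phase hides only the first flip, and the exponential schedule is unpredictable at the price of occasionally wasteful, closely spaced flips.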
In general, the optimal strategy against any periodic strategy can be found and maximal benefits can be calculated. Since the defender has priority when both players flip simultaneously, the optimal strategy is to play the same periodic strategy as its opponent, as this maximizes its time of ownership of the resource and reaches the maximal benefit. Considering that the cost of flipping is set to 4 and each game is played over 400 iterations, the theoretical maximal benefit for an adaptive agent playing against a periodic agent with a period δ = 10 would be equal to 200. We pit an LM adaptive agent against a periodic agent P10 and show in Fig. 3a that, with each game episode, the defender’s final score does in fact converge to its theoretical maximal benefit.
Fig. 3. FlipIt simulations against renewal strategies over different move rates.
In Figs. 3b, 3c and 3d, we plot the defender’s average final scores after convergence against periodic, periodic with a random phase and exponential strategies, as a function of the opponent’s strategy parameter. All scores are averaged over 10 runs. In all 3 cases, the defender converges towards its maximal benefit and drives its opponents to negative ones, penalizing them at each action decision. It outperforms all renewal strategies mentioned, regardless of the strategy parameters, and learns the corresponding optimal counter-strategy even against exponential strategies, where the spacing between two consecutive flips is random. A more in-depth look into the strategies developed shows that the adaptive agent playing against Pδ and Pδ learns a strategy where the distribution of wait intervals concentrates on δ, whereas the one playing against Eλ learns a strategy with a wider spread, spacing its flips efficiently throughout the game. We find that the defender’s final score decreases as the attacker move rate increases. A higher move rate means the attacker flips more often, which causes the defender to also flip more frequently to counter the attacker, thus decreasing the overall reward. Moreover, higher move rates shorten the intervals between two consecutive flips, which increases the risk of flipping at an incorrect iteration and penalizing the defender. In fact, in the general case, the worst-case scenario (flipping one iteration before each of the opponent’s flips) is only one shift away from the optimal strategy (flipping at the same time as the opponent). Despite the decreasing final scores, the defender learns to efficiently counter its opponents, thus maximizing its time of ownership of the resource.
5.2 Larger Action-Spaced FlipIt Extension
We compare the defender’s performance in the original version of FlipIt with one where the adaptive agent’s action space is extended to Ad = {void, flip, check}. We set the operational cost for checking the current state of the game to 1 as a way to compensate for the benefit of owning the resource at time step t and we run the same experiments against basic renewal strategies in 2-player FlipIt.
Fig. 4. Learning over time in larger action-spaced FlipIt.
In Fig. 4, the top figures show the DQN’s final episode scores against renewal strategies, while the bottom ones show the defender’s final scores after convergence against renewal strategies as a function of their move rates, averaged over 10 runs. Here again, the defender reaches maximal benefit against all renewal strategies. Overall, the adaptive agent’s learning process is slightly slower than in the original version of the game, but it eventually converges to its maximal benefit. In the current setup of FlipIt, the addition of check might not seem useful, as it only causes a slower convergence to the defender’s maximal benefit. However, when resources are limited and players are only allowed to spend a certain amount on moves, check can be key in developing cost-effective strategies, allowing adaptive agents to receive additional feedback from the environment without paying a large operational cost.
5.3 Multiplayer FlipIt
Finally, we extend the game to n players, where multiple attackers compete over the shared resource. We assume that one of the attackers is LM (ALM) and attempts to adapt its strategy to its opponents’, whereas the rest of the players adopt one of the renewal strategies discussed in this paper. As the state of the game corresponds to the agent’s knowledge of the game (i.e. its opponents’ last known moves), the state size increases as n increases. We begin by testing our model by pitting the adaptive agent against a combination of two opponent agents. Players’ final scores, averaged over 10 runs, are plotted in Fig. 5, where darker colors correspond to higher final scores. We show our results with only 3 players for a clearer visualisation of our findings. However, all simulations can be extended to (n > 3)-player FlipIt.
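The growth of the state with n can be illustrated with a small Python sketch. The layout below (time since the agent’s own last move, followed by the time since each opponent’s last known move) is an assumption that extends the 2-player network input described in the model architecture section; the function name `observation` is hypothetical.

```python
def observation(t, own_last_move, opponents_last_known_moves):
    """Return the adaptive agent's view of the game at time step t.

    The vector holds the time since the agent's own last move, followed by
    the time since each opponent's last known move, so an n-player game
    yields a vector of length n.
    """
    return [t - own_last_move] + [t - m for m in opponents_last_known_moves]


# 2-player game: one opponent, state of size 2.
obs_2p = observation(12, own_last_move=10, opponents_last_known_moves=[4])
# 3-player game: two opponents, state of size 3.
obs_3p = observation(12, own_last_move=10, opponents_last_known_moves=[4, 7])
```

Under this layout only the input layer of the network has to be resized when opponents are added; the set of actions, and hence the output layer, stays the same.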
Fig. 5. 3-player FlipIt simulations
Consider the case where an adaptive agent plays against two periodic agents with the same move rate σ. As a reminder, we assume that ALM is the rightful owner of the resource and therefore has priority when assigning a new owner to the resource in the case of simultaneous flips. This scenario is therefore equivalent to playing against one periodic agent with move rate σ, and we obtain the same results as in 2-player FlipIt. Now assume both opponents have different move rates. Hypothetically, this scenario would be equivalent to playing against a single agent whose strategy is a combination of both periodic strategies. An in-depth look at the strategy learned by ALM shows that the agent learns both strategy periods and spaces its flips accordingly. When playing against a periodic agent Pδ and an exponential agent Eλ, ALM develops a strategy in which consecutive flips are efficiently spaced throughout the game, allowing the adaptive agent to converge towards its maximal benefit. As is the case in 2-player FlipIt, the higher the move rates, the smaller the intervals between two consecutive flips, which drives ALM to flip more frequently and lowers the overall final scores. Nonetheless, ALM reaches its maximal benefit, regardless of its opponents and their move rates.
5.4 Future Work
The ultimate goal in FlipIt is to control the resource for the maximum possible time while also maximizing one’s score. Without the second constraint, the best strategy would be to flip at each iteration to ensure control of the resource throughout the game. In this paper, we have seen that the defender is able to learn a strategy that maximizes its benefit when playing against strategies whose expected interval between two actions is greater than the flip move cost. However, against highly active opponents (with an expected interval between two moves smaller than the flip cost), the defender learns that a no-play strategy is best, as it maximizes its overall score, and the opponent’s excessive flips force the defender to “drop out” of the game. This is an interesting behavior, one we would like to further exploit and integrate in a game where probabilistic moves or changes of strategy throughout the game would be possible. Moreover, there are many ways to expand FlipIt to model real-world situations, and we intend to pursue this project to analyze the use of reinforcement learning in different variants of the game. A few of our interests include games with probabilistic moves and team-based multiplayer games. Adding an upper bound on the budget and limiting the number of flips allowed per player would force players to flip more efficiently throughout the game, and here the addition of check could be key to developing new adaptive strategies. Finally, instead of having individual players competing against each other as shown previously, we would like to analyze a team-based variant of FlipIt where players on the same team can either coordinate an attack on the device in question, or cooperate in order to defend the resource and prevent any intrusions.
6 Conclusion
Cyber and real-world security threats often present incomplete or imperfect state information. We believe our framework is well equipped to handle the noisy information and to learn an efficient counter-strategy against different classes of
opponents for the game FlipIt, even under partial observability, regardless of the number of opponents. Such strategies can be applied to optimally schedule key changes in network security, amongst many other potential applications. Furthermore, we extended the game to a FlipIt variant with a larger action space, where the additional lower-cost action check was introduced, allowing agents to obtain useful feedback regarding the current state of the game and to plan future moves accordingly. Finally, we made our source code publicly available for reproducibility purposes and to encourage researchers to further investigate adaptive strategies in FlipIt and its variants.
Accelerating Opponent Strategy Inference for Voting Dynamics on Complex Networks

Zhongqi Cai, Enrico Gerding, and Markus Brede

School of Electronics and Computer Science, University of Southampton, Southampton, UK
Abstract. In this paper, we study the problem of opponent strategy inference from observations of information diffusion in voting dynamics on complex networks. We demonstrate that, by deploying resources of an active controller, it is possible to influence the information dynamics in such a way that opponent strategies can be more easily uncovered. To this end, we use the framework of maximum likelihood estimation and the Fisher information to construct confidence intervals for opponent strategy estimates. We then design heuristics for optimally deploying resources with the aim of minimizing the variance of estimates. In the first part of the paper, we focus on inferring an opponent strategy at a single node. Here, we derive optimal resource allocations, finding that, for a low controller budget, resources should be focused on the inferred node and, for a large budget, on the inferred node's neighbours. In the second part, we extend the setting to inferring opponent strategies over the entire network. We find that opponents are harder to detect the more heterogeneous the network is, even with optimal targeting.

Keywords: Network inference · Voting dynamics · Complex networks · Network control

1 Introduction
Issues of uncovering structure and reconstructing the dynamical behaviour of complex networked systems from observational data have received increasing attention [5]. Applications of inferring network structure and essential parameters of dynamical processes from observations encoded as time series are many, including using expression data to uncover gene regulatory networks [10], identifying brain functional connectivity networks from neuroimaging data [16], and predicting social connections from information flow [8]. Normally, it is desirable to obtain high-precision estimates from fewer observations, especially when observation is costly or time-limited [9]. Inspired by this, in this paper, we are interested in the problem of accelerating the convergence of inference of complex networked systems by influencing dynamical processes in strategic ways.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 844–856, 2022. https://doi.org/10.1007/978-3-030-93409-5_69
Inferring network structure or parameters of dynamical processes from observations often needs to exploit domain-specific knowledge [5]. Here, we study the problem of accelerating inference in the field of opinion dynamics based on the well-known framework of competitive influence maximization (CIM) [2], where several external controllers compete to maximize their influence by strategically allocating resources to agents in the network. Typically, the CIM is explored under the assumption that the external controllers have no prior knowledge about the opponents’ strategies [13]. However, Romero-Moreno et al. [20] show that knowing the opponent strategy enables better performance in terms of influence maximization by avoiding wasting resources on the agents targeted by the opponents. Moreover, many applications of CIM have time limits [3]. Therefore, speeding up the inference of opponents’ strategies from dynamics can allow the design of more efficient algorithms for CIM. Due to the prominence of the voter model in the field of opinion dynamics and its conceptual simplicity [18], in this paper, we study the problem of opponent strategy inference based on the voting dynamics. Here, agents flip their opinions with probabilities in proportion to the number of neighbours who hold the opposite opinions. Following [3,15,20], we model the influence exertion by external controllers through unidirectional links with agents in the network where the intensity of influence is denoted as the link weight. The most closely related work to our problem of opponent strategy inference is the topic of network structure inference, as the influence from the external controllers can also be modelled as weighted connections that form part of the network. 
However, a majority of works in this area are based on progressive models (e.g., the independent cascade model [19]), in which an agent's opinion remains unchanged once it is activated, and they are inappropriate for fast-changing opinions which can switch back and forth (e.g., opinions on political issues). There are a few exceptions that explore network structure inference based on non-progressive models [1,7,12,22], e.g., the susceptible-infected-susceptible (SIS) model and the voter model, where agents can repeatedly flip their opinions. Most relevant to our modelling approach, [7,12,22] reconstruct network structure from observations of binary-state dynamics. In more detail, [12,22] infer network topologies by assuming the links between agents are binary variables (i.e., a link either exists or not). Therefore, their models fail to consider the effects of interaction intensity between agents, which is characterized by the link weights. Additionally, Chen et al. [7] lift the binary restriction of network inference and extend it to a continuous space by developing a data-driven framework to estimate link weights. However, none of these works studies the network inference problem from the perspective of manipulating the opinion diffusion process to accelerate the convergence of estimation, which is essential if one wants to obtain an estimate with an accuracy guarantee within a short and limited observation time.

To deal with the challenges of accelerating inference, in this paper, we combine the problem of network control with opponent strategy inference. Here, we explore the scenario of external controllers interacting with the intrinsic dynamics of opinion diffusion to elicit more information for inference based on the voting dynamics. More specifically, we assume one active controller competes against an opponent with an unknown strategy, and our focus is on designing optimal resource allocation strategies for the active controller with the aim of speeding up the convergence of the estimates of the opponent's resource allocations. By doing so, we make the following contributions: (i) We are the first to investigate network information inference from the perspective of accelerating the estimation of the opponent's strategy by optimally deploying the active controller's resource allocations. (ii) By modelling the diffusion process as a non-homogeneous Markov chain and solving it via maximum likelihood estimation, we obtain estimators of the opponent's budget allocation. As the variance of a maximum likelihood estimator of a single unknown parameter is approximately the negative of the reciprocal of the Fisher information, we use it to quantify the accuracy of our estimates. (iii) We propose several heuristics for accelerating the inference of the opponent's strategy based on minimizing the Fisher information and verify the value of our heuristics by carrying out numerical experiments.

Our main findings are: (i) In the scenario of inferring the opponent strategy at a single node, we observe two regimes when varying the amount of resources (also referred to as budget) available to the active controller. For a small budget, a simple heuristic for minimizing the variance of estimates is to target the inferred node only. However, for a large budget, the best strategy is to target only the neighbours of the inferred node, all equally. (ii) In the scenario of inferring opponent strategies over entire networks, strategic allocation is more important if more budget is available to the active controller.
Moreover, the estimates of the opponent's budget allocations are more accurate for nodes with smaller degrees and for smaller inferred values of opponent influence. (iii) The estimates obtained by the optimal control for less heterogeneous networks are more accurate than for highly heterogeneous networks.

The remainder of this paper is organized as follows. In Sect. 2, we present the model for opponent strategy inference. In Sect. 3, we show the heuristics for accelerating the inference and corresponding results. In Sect. 4, we summarize our main findings and present the future directions of our research.
2 Model Description
In line with the majority of works in opinion dynamics [6], we represent social networks as positively weighted and undirected graphs $G(V, E)$ where agents are represented by vertices $v_i \in V$ ($i = 1, \ldots, N$) and edge $w_{ij} \in E$ denotes the influencing strength between agent $i$ and agent $j$. We further assume each of the $N$ agents in the network can be in either state $s_i(t) = 0$ or $s_i(t) = 1$ ($i = 1, \ldots, N$) at time $t$. On top of these internal agents (i.e., $v_1, \ldots, v_N$), following the framework of [3,20], we consider two external controllers called controller A and controller B who have static opinions $s_A = 1$ and $s_B = 0$. Both controllers build unidirectional influencing links $a_i(t) \ge 0$ or $b_i(t) \ge 0$ to agent $i$ at discrete time step $t$ subject to the budget constraints $\sum_{i=1}^{N} a_i(t) \le b_A$ or $\sum_{i=1}^{N} b_i(t) \le b_B$. Here, $b_A$ and $b_B$ are the amounts of resources accessible to controllers A and B.
In the following, we consider voting dynamics with parallel updates of the entire population in discrete time. Therefore, the stochastic matrix $P_i(t)$ describing the state transitions of agent $i$ ($i = 1, \ldots, N$) at time $t$ is given by

$$
P_i(t) = \begin{bmatrix}
\Pr\big(s_i(t+1) = 0 \mid \{s_j(t)\}_{j \in Nei(i)}\big) = \dfrac{b_i(t) + \sum_j (1 - s_j(t))\, w_{ji}}{\sum_{j=1}^{N} w_{ji} + a_i(t) + b_i(t)} \\[2ex]
\Pr\big(s_i(t+1) = 1 \mid \{s_j(t)\}_{j \in Nei(i)}\big) = \dfrac{a_i(t) + \sum_j s_j(t)\, w_{ji}}{\sum_{j=1}^{N} w_{ji} + a_i(t) + b_i(t)}
\end{bmatrix} \tag{1}
$$
where $\Pr(s_i(t+1) = 0 \mid \{s_j(t)\}_{j \in Nei(i)})$ is the probability of node $i$ moving from state 0 or 1 to state 0 in one time step depending on the states of its neighbours, i.e., $\{j \mid w_{ij} \neq 0\}$ (denoted as $Nei(i)$). Note that the transition probability is independent of the current state of the updated agent and is affected only by its neighbours' states and the external controllers. Below, we assume the states of nodes at different time steps are recorded in a data matrix $S = [s_i(t)]_{N \times T}$, where $T$ is the length of the time series. Each row of $S$ represents the binary state changes of a node, and the update process is modelled by a non-homogeneous Markov chain [4] where the Markov property is retained but the transition probabilities depend on time. By doing so, the opponent-strategy reconstruction problem is transformed into estimating the unknown parameters $b_i(t)$ ($i = 1, \ldots, N$) from given non-homogeneous Markov chains of observed state changes of agents. Correspondingly, the logarithm of the likelihood of having observed the time series from 0 to $T$ for node $i$ is

$$
\log L_i(T) = \sum_{t \in [0, T-1]} \left[ s_i(t+1) \log\left( \frac{a_i(t) + \sum_j w_{ji} s_j(t)}{a_i(t) + b_i(t) + k_i} \right) + (1 - s_i(t+1)) \log\left( \frac{b_i(t) + \sum_j w_{ji} (1 - s_j(t))}{a_i(t) + b_i(t) + k_i} \right) \right] \tag{2}
$$

where $k_i$ is the degree of node $i$. The first term of Eq. (2) is non-zero when node $i$ is in state 1 in the next step. Otherwise, when node $i$ is in state 0 in the next step, as $s_i(t+1) = 0$, the first term becomes zero while the second term is non-zero. In this paper, we are interested in the scenario where controller B keeps its budget allocations unchanged from time 0 (i.e., $b_i(t) = b_i(0)$, $i = 1, \ldots, N$), and we infer the values of $b_i(0)$ (referred to as $b_i$ in the following) from observed data for state transitions encoded in the state matrix $S = [s_i(t)]_{N \times T}$.
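As a concrete reading of Eqs. (1) and (2), the following minimal Python sketch evaluates the transition probability and the log-likelihood of one node for a candidate value of $b_i$. The function names, the toy data and the use of plain lists are illustrative assumptions, not part of the original work; `sum(weights)` stands in for $k_i$.

```python
import math


def prob_state_one(a_i, b_i, neighbour_states, weights):
    """Eq. (1): probability that node i holds opinion 1 after the next update."""
    pull_to_one = a_i + sum(w * s for w, s in zip(weights, neighbour_states))
    total_pull = a_i + b_i + sum(weights)  # sum(weights) plays the role of k_i
    return pull_to_one / total_pull


def log_likelihood(b_i, a_series, neighbour_series, next_states, weights):
    """Eq. (2): log-likelihood of node i's observed next states for a candidate b_i."""
    total = 0.0
    for a_t, states_t, s_next in zip(a_series, neighbour_series, next_states):
        p_one = prob_state_one(a_t, b_i, states_t, weights)
        total += math.log(p_one) if s_next == 1 else math.log(1.0 - p_one)
    return total


# Hypothetical toy data: two unit-weight neighbours, constant a_i = 2, four updates.
weights = [1.0, 1.0]
a_series = [2.0] * 4
neighbour_series = [[1, 0]] * 4
next_states = [1, 0, 1, 0]
for candidate in (0.5, 2.0, 8.0):
    print(candidate, log_likelihood(candidate, a_series, neighbour_series, next_states, weights))
```

The sketch assumes every observed transition has non-zero probability under the candidate $b_i$; otherwise `math.log` raises an error.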
Specifically, by applying maximum likelihood estimation (MLE [22]) and setting the score function $\partial L_i(T) / \partial b_i$ equal to 0, we obtain a point estimate $\hat{b}_i$ for the control allocation of the opponent at node $i$ after $T$ observations. To determine whether the above estimate is consistent with the true value, the Fisher information is commonly used to construct confidence intervals for maximum likelihood estimators [21]. Following [14], the Fisher information is defined as the expectation of the second-order derivative of the log-likelihood function (written $L_i(T)$ in the derivatives below for brevity)

$$
I(b_i, T) = E\left[ \frac{\partial^2}{\partial b_i^2} L_i(T) \right]. \tag{3}
$$
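To illustrate how the point estimate $\hat{b}_i$ can be obtained in practice, the sketch below differentiates Eq. (2) with respect to $b_i$ and locates the root of the score by bisection. This is only one possible root-finding choice and is not stated in the original; the names `score`, `mle_b` and `pull_series` (holding $\sum_j w_{ji} s_j(t)$ per step) are hypothetical, and the bisection assumes the score changes sign exactly once inside the bracket.

```python
def score(b, a_series, pull_series, k, next_states):
    """Derivative of the log-likelihood in Eq. (2) with respect to b_i.

    pull_series[t] is sum_j w_ji * s_j(t); k is sum_j w_ji (the degree k_i).
    """
    total = 0.0
    for a_t, m_t, s_next in zip(a_series, pull_series, next_states):
        total -= 1.0 / (a_t + b + k)          # common denominator term
        if s_next == 0:
            total += 1.0 / (b + k - m_t)      # numerator term, present only for s_i(t+1) = 0
    return total


def mle_b(a_series, pull_series, k, next_states, lo=1e-9, hi=1e3, iters=200):
    """Bisection on the score; assumes a single sign change inside [lo, hi]."""
    args = (a_series, pull_series, k, next_states)
    if score(lo, *args) <= 0.0:
        return lo   # likelihood is already decreasing at the lower bound
    if score(hi, *args) >= 0.0:
        return hi   # no root inside the bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid, *args) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical data: a_i = 2, k_i = 2, one neighbour in state 1 at every step,
# and node i observed in state 0 in half of the four updates.
b_hat = mle_b([2.0] * 4, [1.0] * 4, 2.0, [1, 0, 1, 0])
print(b_hat)
```

In this toy case the score is $2/(b+1) - 4/(b+4)$, so the estimate should come out close to 2, the value at which the model's probability of state 0 matches the observed frequency of one half.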
Specifically, the second-order derivative of $L_i(t+1)$ deduced from $L_i(t)$ ($0 \le t \le T-1$) is

$$
\frac{\partial^2 L_i(t+1)}{\partial b_i^2} =
\begin{cases}
\dfrac{\partial^2 L_i(t)}{\partial b_i^2} + \Psi_i(t), & s_i(t+1) = 1 \\[2ex]
\dfrac{\partial^2 L_i(t)}{\partial b_i^2} + \Upsilon_i(t), & s_i(t+1) = 0
\end{cases} \tag{4}
$$

in which $\Upsilon_i(t) = (a_i(t) + k_i + b_i)^{-2} - \big(k_i - \sum_j w_{ji} s_j(t) + b_i\big)^{-2}$ and $\Psi_i(t) = (a_i(t) + k_i + b_i)^{-2}$. For large enough sample sizes, the variance of a maximum likelihood estimator of a single unknown parameter is approximately the negative reciprocal of the Fisher information [14]. Therefore, we have

$$
\hat{\sigma}^2(b_i, T) = -I(b_i, T)^{-1} = -\big[ I(b_i, T-1) + \Psi_i(T-1) - \beta_i(T-1) \big]^{-1} \tag{5}
$$

where $\beta_i(t) = \big[ \big(k_i - \sum_j w_{ji} s_j(t) + b_i\big) (a_i(t) + k_i + b_i) \big]^{-1}$. Here, instead of only passively observing the dynamics, our focus is on attempting to accelerate the inference by interfering with the system's dynamics via controller A. We do this by designing the best strategy for controller A with the aim of speeding up the inference of the opponent's budget allocations. More specifically, we consider the above problem from the perspective of how to optimally deploy the budget allocations of controller A to nodes in the network at each update so as to maximally decrease the variance of the estimator in the following steps (see the recursive expression on the right-hand side of Eq. (5)). The objective functions are defined in detail in Sects. 3.1 and 3.2 for the different optimization scenarios.
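The recursion in Eq. (5) can be traced numerically. The sketch below accumulates the expected increment $\Psi_i(t) - \beta_i(t)$ over an observed history and returns the approximate variance of the estimate; it starts the Fisher information at zero, which is an assumption of this sketch, and the function names are hypothetical.

```python
def expected_information_increment(a_t, b, k, m_t):
    """Psi_i(t) - beta_i(t) from Eq. (5); m_t is sum_j w_ji * s_j(t)."""
    psi = (a_t + k + b) ** -2
    beta = 1.0 / ((k - m_t + b) * (a_t + k + b))
    return psi - beta


def estimator_variance(b_hat, a_series, pull_series, k):
    """Approximate variance of the MLE: negative reciprocal of the accumulated information.

    Assumes the accumulated information is non-zero, i.e. at least one step
    had a_i(t) + sum_j w_ji s_j(t) > 0.
    """
    fisher = 0.0
    for a_t, m_t in zip(a_series, pull_series):
        fisher += expected_information_increment(a_t, b_hat, k, m_t)
    return -1.0 / fisher


# Hypothetical setting: a_i = 2, k_i = 2, one neighbour in state 1, b_hat = 2, four updates.
variance = estimator_variance(2.0, [2.0] * 4, [1.0] * 4, 2.0)
half_width = 1.96 * variance ** 0.5   # half-width of an approximate 95% confidence interval
print(variance, half_width)
```

Since $\Psi_i(t) - \beta_i(t)$ simplifies to $-\big(a_i(t) + \sum_j w_{ji} s_j(t)\big) / \big[(a_i(t) + k_i + b_i)^2 (k_i - \sum_j w_{ji} s_j(t) + b_i)\big]$, every increment is non-positive, so the returned variance is positive and shrinks as observations accumulate.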
3 Results
Below, we present results for variance minimization to accelerate the inference of opponent strategies. In Sect. 3.1, we start by exploring the problem at a single node. Our focus here is on exploring optimization and heuristics for maximally speeding up the convergence of the estimate at a single node by optimizing the budget allocations of controller A to the node and its neighbours. In Sect. 3.2, we extend the above setting from single-node inference to inferring opponent strategies over an entire network.

3.1 Inferring the Opponent Strategy at a Single Node
In order to first gain insights into how the budget allocations influence the inference process, we start our analysis with a simple scenario where we only focus on accelerating the convergence of the estimate of a single node. As the transition probability of the inferred node is determined not only by the control gains from controllers but also by the sum of neighbouring states (see Eq. (1)), our heuristic for minimizing the variance of a single estimator is based not only on optimizing the budget allocation to the inferred node but also on optimizing the allocations to its neighbours. Note that, even though the node's state at time $t+2$ is directly influenced by the budget allocation $a_i(t+1)$ and the sum of neighbouring states at time $t+1$, to influence the neighbouring nodes' states at time $t+1$, we have to optimize the budget allocations for the neighbours at time $t$. Therefore, we have

$$
\{a_i^*(t+1), \underbrace{a_j^*(t), \ldots, a_n^*(t)}_{\text{neighbours of node } i}\} = \arg\min \sigma^2(\hat{b}_i, t+2) \equiv \arg\min \Big[ \Pr(s_i(t+2) = 1)\, \Psi_i(t+1) + \Pr(s_i(t+2) = 0)\, \Upsilon_i(t+1) + I(\hat{b}_i, t+1) \Big] \tag{6}
$$

in which we aim to minimize the variance of the estimator $\hat{b}_i$ at step $t+2$ by optimizing the transition probability one step ahead. Here we substitute the true value of $b_i$ with the estimator $\hat{b}_i$ to calculate $\sigma^2(\hat{b}_i, t+2)$, and $\Psi_i(t+1)$ and $\Upsilon_i(t+1)$ are defined as in Eq. (4). In addition, consistent with Eq. (1),

$$
\Pr(s_i(t+2) = 1) = \frac{a_i(t+1) + \sum_j w_{ji} s_j(t+1)}{a_i(t+1) + \hat{b}_i + k_i}
$$

is the probability to be in state $s_i(t+2) = 1$ and

$$
\Pr(s_i(t+2) = 0) = \frac{\hat{b}_i + \sum_j w_{ji} (1 - s_j(t+1))}{a_i(t+1) + \hat{b}_i + k_i}
$$

is the corresponding probability for $s_i(t+2) = 0$. However, as $\sum_j w_{ji} s_j(t+1)$ is influenced by $a_j^*(t), \ldots, a_n^*(t)$ and is unknown, we have to enumerate all of the possible values of $\sum_j w_{ji} s_j(t+1)$ in order to calculate $\Upsilon_i(t+1)$. In more detail, by the law of total probability, we have

$$
\begin{aligned}
\Pr(s_i(t+2) = 1) &= \sum_{m = 0, \ldots, k_i} \Pr\Big(s_i(t+2) = 1 \,\Big|\, \sum_j w_{ji} s_j(t+1) = m\Big) \Pr\Big(\sum_j w_{ji} s_j(t+1) = m\Big) \\
&= \sum_{m = 0, \ldots, k_i} \frac{a_i(t+1) + m}{a_i(t+1) + b_i + k_i} \Pr\Big(\sum_j w_{ji} s_j(t+1) = m\Big), \\
\Pr(s_i(t+2) = 0) &= 1 - \Pr(s_i(t+2) = 1)
\end{aligned} \tag{7}
$$

where

$$
\Pr\Big(\sum_j w_{ji} s_j(t+1) = m\Big) = \sum_{\rho = 1}^{l} \prod_{j \in c_\rho} \Pr(s_j(t+1) = 1) \prod_{j \in (Nei(i) \setminus c_\rho)} \Pr(s_j(t+1) = 0). \tag{8}
$$

Here, $l$ is the number of combinations leading to $\sum_j w_{ij} s_j(t+1) = m$, the entities in $C = \{c_1, \ldots, c_l\}$ (i.e., $c_\rho$ with $1 \le \rho \le l$) represent all possible combinations of the elements in the neighbourhood of node $i$ (denoted as $Nei(i)$) taken $m$ at a time, $Nei(i) \setminus c_\rho$ returns the set of elements in $Nei(i)$ that are not in $c_\rho$, and $\Pr(s_j(t+1) = 1) = \frac{a_j(t) + \sum_i w_{ij} s_i(t)}{a_j(t) + b_j + k_j}$ represents the probability that node $j$ is in state 1 at time $t+1$. Inserting Eq. (7) and Eq. (8) into Eq. (6) yields the full expression.

In the following, we explore the optimal strategy of controller A in the context of optimizing budget allocations to one node and its neighbourhood under varying budget constraints with the aim of minimizing the variance of the estimate at the central node. For this purpose, we use interior-point optimization [17] to update the budget allocations obtained from Eq. (6) at each step. Given the total length of the time series $T$, the time complexity of the optimization defined in Eq. (6) is $O(n^3 T)$, where $n$ is the input size, i.e., the number of nodes being optimized.
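The enumeration behind Eqs. (7) and (8) can be sketched as follows for the special case of unit link weights, where $\sum_j w_{ji} s_j(t+1)$ is simply the number of neighbours in state 1. The function names are hypothetical, the sketch assumes $\hat{b}_i > 0$, and it evaluates only the expected increment inside Eq. (6); the interior-point search over allocations used in the text is not reproduced here.

```python
from itertools import combinations


def prob_pull_equals(p_one, m):
    """Eq. (8) for unit weights: probability that exactly m neighbours are in state 1.

    p_one[j] is Pr(s_j(t+1) = 1) for neighbour j.
    """
    indices = range(len(p_one))
    total = 0.0
    for chosen in combinations(indices, m):   # each c_rho: a set of m neighbours in state 1
        chosen = set(chosen)
        term = 1.0
        for j in indices:
            term *= p_one[j] if j in chosen else 1.0 - p_one[j]
        total += term
    return total


def expected_increment(a_next, b_hat, p_one):
    """Bracketed term of Eq. (6) without I(b_hat, t+1), averaged over m as in Eq. (7)."""
    k = len(p_one)  # unit weights, so k_i equals the number of neighbours
    result = 0.0
    for m in range(k + 1):
        p_state_one = (a_next + m) / (a_next + b_hat + k)
        psi = (a_next + k + b_hat) ** -2
        upsilon = psi - (k - m + b_hat) ** -2
        result += prob_pull_equals(p_one, m) * (
            p_state_one * psi + (1.0 - p_state_one) * upsilon
        )
    return result


# Hypothetical example: two neighbours, each in state 1 with probability 0.5.
print(expected_increment(2.0, 2.0, [0.5, 0.5]))
```

The enumeration visits every subset of the neighbourhood, so its cost grows exponentially with the degree of the inferred node; this is acceptable for a sketch with small degrees but would need a convolution-style recursion for large neighbourhoods.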
Fig. 1. Figure (a) shows the dependence of the normalized budget allocation $\tilde{a}_j = a_j (k_i + 1) / b_A$ on the total budget for all nodes after the first 1000 updates, calculated by Eq. (6). The black triangles are the budget allocations for each neighbouring node, where the differences are characterized by error bars. Figure (b) shows the dependence of the variance of the MLE of the central node on varying total budgets at update 1000 based on four budget allocation strategies: only targeting the central node (red squares), equally targeting neighbours only (red circles), the optimization described in Eq. (6) (black triangles), and equally targeting (blue triangles). The results are based on 20 realizations of random regular networks with 1000 nodes and average degree $\langle k \rangle = 10$. Controller B targets all nodes equally with budget 5, and except for the inferred node and its neighbours, controller A targets all the other nodes with budget 5. Error bars indicate 95% confidence intervals.
To proceed, Fig. 1(a) presents optimal allocations given to the central node and its neighbours on random regular graphs for varying amounts of budget. Here, we group all of the neighbouring nodes into the same class and plot the average allocation given to each of them. Additionally, we have normalized the budget allocation for all nodes according to $\tilde{a}_j = a_j (k_i + 1) / b_A$, where $b_A$ is the total budget and $k_i$ is the degree of the central node. By doing so, we have transformed the absolute values of optimized budget allocations to relative proportions comparable with the equally targeting case (see the red dashes) for varying budget constraints. Specifically, we observe a crossing point which divides a regime of allocating more budget to the central node for a small total budget and a regime in which more budget is allocated to the neighbouring nodes for a large total budget. Inspired by the patterns of budget allocations for extremely small and large budgets in Fig. 1, we further show the dependency of the variance of the MLE of the central node on different amounts of available budget based on four different methods in panel (b) of Fig. 1. In more detail, we compare the variance calculated by allocating all of the resources to the central node (the red squares), equally targeting neighbours but leaving the central node unaffected (the red circles), equally targeting all nodes (the blue triangles), and optimizing one node and its neighbourhood (the black triangles). With a careful inspection of Fig. 1(b), we find that the optimized allocations obtained from Eq. (6) always have the best performance in minimizing the variance regardless of the total budget available.
However, for small and large budgets, we can find simple heuristics to replace the complicated optimization algorithm without sacrificing performance. In more detail, for a small budget, the variance calculated by only targeting the central node is close to the optimal variance calculated by the optimization algorithm. Therefore, a simple heuristic strategy for controller A when the total budget is small is to target only the central node. Moreover, for a large budget, the variance calculated by only targeting the neighbouring nodes is close to the optimal variance calculated by the optimization algorithm. In this scenario, a simple strategy is to equally target all of the neighbouring nodes only.

3.2 Inferring Opponent Strategies over Entire Networks
In this section, we further generalize the setting of the previous section from minimizing the variance of the estimate of a single node to minimizing the sum of the variances of the estimators for all nodes. Following Eq. (5), we have

$$
\{\underbrace{a_1^*(t), \ldots, a_N^*(t)}_{N \text{ nodes in the network}}\} = \arg\min \sum_{i=1}^{N} \sigma^2(\hat{b}_i, t+1) = \arg\min \sum_{i=1}^{N} -\big[ I(\hat{b}_i, t) + \Psi_i(t) - \beta_i(t) \big]^{-1} \tag{9}
$$

where $\Psi_i(t) = (a_i(t) + k_i + b_i)^{-2}$, $\beta_i(t) = \big[ \big(k_i - \sum_j w_{ji} s_j(t) + b_i\big) (a_i(t) + k_i + b_i) \big]^{-1}$, and $a_1(t) + \cdots + a_N(t) \le b_A$, i.e., the sum of budget allocations should satisfy the budget constraint. Here, as we optimize the budget allocations one step ahead to minimize the sum of variances in the next step, the above heuristic is named one-step-ahead optimization. To test the effectiveness of the one-step-ahead optimization, we present in Fig. 2 the relative sum of variance $\sigma^2_{opt} / \sigma^2_{equal}$, i.e., the sum of variance achievable by the optimization scheme (denoted as $\sigma^2_{opt}$) against the sum of variance achieved by the equally targeting strategy ($\sigma^2_{equal}$), under varying relative budgets $b_A / b_B$. Specifically, the figure illustrates that if controller A has a huge budget advantage (i.e., $b_A \gg b_B$), then the one-step-ahead optimization will make a considerable improvement in minimizing the sum of variances of the estimators compared with the equally targeting strategy. Otherwise, less than 10% improvement will be gained by the optimization, which means the strategic allocation is more important if more resources are available to the controller. We also see that improvements by optimization accumulate along updates, i.e., the longer the observation time $T$, the more benefit can be gained by optimization (see Fig. 2).

Below, we also investigate the effects of network topology on a controller's ability to uncover opponent strategies. More specifically, in Fig. 3 we show results of experiments in which the achievable total variances of estimates are compared for networks with regular and heterogeneous degree distributions. To proceed, in Figs. 3(a)–(c), we present the dependence of the allocations optimized by the one-step-ahead optimization on the percentage of nodes being targeted when the average budget allocations per node by controller B are 1, 5 and 10. Furthermore, in Figs.
3(d)–(f), we show the corresponding average variance per node, which is calculated by adding up all the variances and dividing by the number of nodes being targeted. Regardless of the budget availability of the opponent, we observe the following three patterns in Figs. 3(a)–(f): (i) With an increasing number of nodes being targeted, more resources are allocated per node in the one-step-ahead optimization strategy and the estimates of budget allocations by the opponent are increasingly inaccurate. (ii) There is a crossing point which divides two regimes. If only a small number of nodes are targeted, then the targeted nodes in a more heterogeneous network will be allocated more budget on average than those in a less heterogeneous network. The opposite holds when more nodes are targeted. (iii) In the same control setting, estimates are always more accurate for a less heterogeneous network than for a more heterogeneous network. In addition, by comparing Figs. 3(a)–(c), we find that curves for different types of networks are fairly close when only a small portion of nodes are targeted, as well as when the opponent targets nodes with a low budget (i.e., 1 per node on average). However, the differences between the optimized average allocations on the targeted nodes become significant when the opponent targets more nodes with a larger budget. Moreover, we observe that with increasing budget availability of the opponent, the differences in the variance of estimates between different types of networks become larger.

Fig. 2. Relative sum of variance $\sigma^2_{opt} / \sigma^2_{equal}$ achievable by the one-step-ahead optimization against the equally targeting strategy for varying relative budgets $b_A / b_B$. The red, blue and black lines give the relative sum of variance after 1000, 500 and 200 observations. Results are based on 20 realizations of heterogeneous networks with degree exponent $\lambda = 1.6$, $N = 1000$, and average degree $\langle k \rangle = 6$. Controller B targets all nodes with randomly and uniformly distributed budget, and the average budget allocation per node is 10. Error bars indicate 95% confidence intervals.

Fig. 3. Figures (a–c) and (d–f) show the dependence of the optimized allocation for the one-step-ahead optimization and the corresponding normalized sum of variance of estimates on the percentage of nodes being targeted under budget constraints $b_B = N$ (a, d), $b_B = 5N$ (b, e), and $b_B = 10N$ (c, f). The y axes of figures (d–f) are calculated as the sum of the variances of the targeted nodes divided by the number of nodes targeted. The legend “$\lambda = 1.6\,(3)$” corresponds to the power-law distribution $P(k) \propto k^{-1.6}$ ($k^{-3}$). Results are based on 20 realizations of networks with $N = 1000$ and average degree $\langle k \rangle = 6$. Error bars indicate 95% confidence intervals.
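The one-step-ahead objective of Eq. (9) can be evaluated for any candidate allocation of controller A. The sketch below does only that; it does not implement the optimizer used for the reported results, and the three-node snapshot, the variable names and the two candidate allocations are hypothetical.

```python
def sum_of_variances(alloc, b_hat, degrees, pulls, fisher):
    """Objective of Eq. (9): predicted total variance after one more update.

    alloc[i]   candidate a_i(t)
    b_hat[i]   current estimate of the opponent's allocation at node i (assumed > 0)
    degrees[i] k_i
    pulls[i]   sum_j w_ji * s_j(t)
    fisher[i]  accumulated I(b_hat_i, t), negative under the sign convention of Eq. (3)
    """
    total = 0.0
    for a_i, b_i, k_i, m_i, info_i in zip(alloc, b_hat, degrees, pulls, fisher):
        psi = (a_i + k_i + b_i) ** -2
        beta = 1.0 / ((k_i - m_i + b_i) * (a_i + k_i + b_i))
        total += -1.0 / (info_i + psi - beta)
    return total


# Hypothetical three-node snapshot; compare equal targeting with a skewed split
# of the same total budget b_A = 6.
b_hat = [2.0, 2.0, 2.0]
degrees = [2.0, 2.0, 2.0]
pulls = [1.0, 0.0, 2.0]
fisher = [-0.1, -0.1, -0.1]
equal = [2.0, 2.0, 2.0]
skewed = [4.0, 1.0, 1.0]
print(sum_of_variances(equal, b_hat, degrees, pulls, fisher))
print(sum_of_variances(skewed, b_hat, degrees, pulls, fisher))
```

Any constrained optimizer could then search over `alloc` subject to the budget constraint; the equal allocation serves as the baseline that the relative sum of variance in Fig. 2 is measured against.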
[Figure 4: panel (a) plots the sum of variance against the budget allocation per node for heterogeneous (hete) and random regular (reg) networks with bB/N = 1, 5, 10; panel (b) plots the variance against degree, and panel (c) the variance against the budget allocation by controller B, each for a = 10, a = 20 (optimally equally targeting), and a = 40.]
Fig. 4. Panel (a) shows the dependence of the sum of variance of estimators after 1000 updates on the budget allocation per node in the equally targeting scenario when the opponent targets nodes with 1, 5 and 10 per node on average (denoted as bB/N). The circles and squares stand for results for heterogeneous networks and random regular networks with N = 1000 and average degree ⟨k⟩ = 6. Panels (b) and (c) show, for heterogeneous networks, the dependence of the variance of estimators achieved by equally targeting each node with allocation 10, 20, and 40 after 1000 updates on the nodes’ degree and on the budget allocations by the opponent, respectively. Note that a = 20 is the optimal budget allocation for equal targeting obtained from panel (a) (the minimum point) for bB/N = 10. In panel (c), we group the x-axis values into bins of width 1 whose lower limits are inclusive, e.g., [0, 1). Results are based on 20 realizations, and controller B targets all nodes with randomly and uniformly distributed budget. Error bars indicate 95% confidence intervals.
However, the one-step-ahead optimization has O(N^3 T) time complexity for a network of size N, so a simplified algorithm is needed to make the approach scalable to large networks. Inspired by the results of Fig. 1(b),
which indicates that allocating too many resources to the inferred node hinders the inference (see the upper right corner of Fig. 1(b)), and by the limited improvements of the one-step-ahead optimization when bA ≤ 5bB, we propose a simple heuristic called the optimally-equally-targeting strategy (OETS), in which we find an optimal and equal allocation for all nodes. By doing so, we reduce the degrees of freedom of the one-step-ahead optimization from N to 1 and the time complexity from O(N^3 T) to O(T), without sacrificing much of the performance. Specifically, for the OETS, we have

$$a^{*} = \arg\min \sum_{i=1}^{N} \sigma^{2}(\hat{b}_i, T) = \arg\min\, -\sum_{i=1}^{N} \left[ \sum_{t=0}^{T-1} \big( \Psi_i(t) - \beta_i(t) \big) \right]^{-1}, \qquad 0 \le a \le b_A/N \qquad (10)$$
where a∗ is the optimal budget allocation for all nodes that achieves a minimum sum of variance after T observations, bA is the budget constraint for controller A, $\Psi_i(t) = (a + k_i + b_i)^{-1}$, and $\beta_i(t) = \big(k_i - \sum_j w_{ji} s_j(t) + b_i\big)(a + k_i + b_i)^{-2}$. To give some indication of how the optimally-equally-targeting strategy performs for different budget availability and network structures, in Fig. 4(a) we present the dependence of the sum of variance of estimators on the equal budget allocation per node by the active controller when the opponent targets each node with 1, 5 and 10 on average. We note that the dependence has a convex shape with a minimum. Moreover, by comparing the curves for different types of networks in Fig. 4(a), we observe a result similar to that of Fig. 3: it is always easier to predict opponents on random regular networks than on heterogeneous networks. We further note that the curves for different types of networks are fairly close when the opponent targets nodes with a large budget, which means that the type of network matters less in the OETS if the opponent has large budget availability. As the main difference between networks of different types is the degree distribution, to explore how the degree of nodes plays a role in the OETS, we present the dependence of the variance of estimators on the nodes’ degree in Fig. 4(b). Clearly, we observe a positive relationship between the variance and the nodes’ degree. This result further explains the degree-based heuristic for link weight prediction in [7], in which solutions obtained from lower-degree nodes are preferred. Moreover, a careful inspection of Fig. 4(b) reveals two regimes. For low-degree nodes, a large allocation (e.g., a = 40) results in worse performance in predicting the budget allocations. However, for the hub nodes, a larger allocation is preferable for improving the accuracy of the prediction. Furthermore, by comparing the patterns of the dependence for budget allocations a = 10, 20, 40 in Fig. 4(b), we find that the OETS results from a trade-off.
On the one hand, heterogeneous networks have more low-degree nodes, so a relatively high a should be avoided. On the other hand, as hub nodes normally have much higher variance than low-degree nodes, a low a is inefficient in minimizing the sum of variance. Another leading factor in the strategy inference is the budget allocation by the opponent. Therefore, in Fig. 4(c), we present the dependence of the variance on the opponent’s budget allocations. Note that, as the budget allocations by the
opponent are randomly and uniformly distributed, for ease of observation, we group values into bins of width 1, i.e., {[0, 1), [1, 2), · · · } in Fig. 4(c). As in Fig. 4(b), the variance of estimates rises monotonically as the opponent’s budget allocation increases. However, the curves for different budget allocations a are fairly close, and a larger a does not result in a lower variance for nodes to which the opponent allocates more resources.
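The binning used for Fig. 4(c) can be sketched as follows. This is a minimal illustration only: the function name `bin_means` and the sample numbers are hypothetical and are not taken from the paper's experiments. It groups per-node variances by the opponent's allocation into bins of width 1 with inclusive lower limits and averages within each bin.

```python
import math
from collections import defaultdict


def bin_means(allocations, variances, width=1.0):
    """Average variance per allocation bin [k*width, (k+1)*width)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for b, v in zip(allocations, variances):
        k = math.floor(b / width)  # lower limit inclusive, upper limit exclusive
        sums[k] += v
        counts[k] += 1
    return {(k * width, (k + 1) * width): sums[k] / counts[k] for k in sorted(sums)}


# Hypothetical per-node data: opponent allocation b_i and variance of its estimate.
allocations = [0.0, 0.4, 1.0, 1.9, 2.0]
variances = [1.0, 3.0, 2.0, 4.0, 5.0]
result = bin_means(allocations, variances)
```

With this convention an allocation of exactly 1.0 falls into the bin [1, 2), not [0, 1).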
4 Conclusion
In this paper, we study the problem of reconstructing an opponent’s strategy from binary-state dynamics based on the voter model. Unlike most works in the field of network inference [5], in which the inference is based on a given dataset, our work focuses on optimally interacting with the dynamical processes of complex networked systems with the aim of accelerating the convergence of estimates. By applying MLE and using Fisher information as a criterion, our analysis makes clear that the accuracy of inference can be improved by strategically interfering with the network dynamics. Specifically, we have shown that the optimal strategy for inferring the opponent's strategy at a single node depends on the budget availability: for a low budget, all resources should be put on the inferred node only, whereas for a large budget, only the neighbouring nodes should be targeted. Moreover, we have found that knowledge of the nodes’ degrees, as well as estimates of the budget allocations by the opponent, could be exploited in designing an optimal interference strategy, as there are strong positive correlations between these characteristics and the variance of estimates. This finding is in agreement with the result of [7] that link prediction is more accurate for lower-degree nodes than for high-degree nodes. Our findings have been limited to the voter model, but the methodology can also be used to improve the performance of estimates in other binary-state complex networked systems where the time series of the dynamics are first-order Markov chains, e.g., Ising [11] or SIS models. Moreover, an interesting line of future research is to combine opponent strategy inference with the influence maximization problem. In more detail, one can explore the influence-maximizing strategy based on the results of opponent strategy inference, where the uncertainties of estimates are encoded as confidence intervals.
References
1. Barbillon, P., Schwaller, L., Robin, S., Flachs, A., Stone, G.D.: Epidemiologic network inference. Stat. Comput. 30(1), 61–75 (2020)
2. Bharathi, S., Kempe, D., Salek, M.: Competitive influence maximization in social networks. In: Deng, X., Graham, F.C. (eds.) WINE 2007. LNCS, vol. 4858, pp. 306–311. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-77105-0_31
3. Brede, M., Restocchi, V., Stein, S.: Effects of time horizons on influence maximization in the voter dynamics. J. Complex Netw. 7(3), 445–468 (2019)
4. Brémaud, P.: Non-homogeneous Markov chains. In: Brémaud, P. (ed.) Markov Chains. Texts in Applied Mathematics, vol. 31, pp. 399–422. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45982-6_12
5. Brugere, I., Gallagher, B., Berger-Wolf, T.Y.: Network structure inference, a survey: motivations, methods, and applications. ACM Comput. Surv. (CSUR) 51(2), 1–39 (2018)
6. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Mod. Phys. 81(2), 591 (2009)
7. Chen, Y.Z., Lai, Y.C.: Sparse dynamical Boltzmann machine for reconstructing complex networks with binary dynamics. Phys. Rev. E 97(3), 032317 (2018)
8. Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influence. ACM Trans. Knowl. Discov. Data (TKDD) 5(4), 1–37 (2012)
9. Guo, C., Luk, W.: Accelerating maximum likelihood estimation for Hawkes point processes. In: 2013 23rd International Conference on Field programmable Logic and Applications, pp. 1–6. IEEE (2013)
10. Kaderali, L., Radde, N.: Inferring gene regulatory networks from expression data. In: Kelemen, A., Abraham, A., Chen, Y. (eds.) Computational Intelligence in Bioinformatics. Studies in Computational Intelligence, vol. 94, pp. 33–74. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-76803-6_2
11. Krapivsky, P.L., Redner, S., Ben-Naim, E.: A Kinetic View of Statistical Physics. Cambridge University Press, Cambridge (2010)
12. Li, J., Shen, Z., Wang, W.X., Grebogi, C., Lai, Y.C.: Universal data-based method for reconstructing complex networks with binary-state dynamics. Phys. Rev. E 95(3), 032303 (2017)
13. Li, Y., Fan, J., Wang, Y., Tan, K.L.: Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng. 30(10), 1852–1872 (2018)
14. Ly, A., Marsman, M., Verhagen, J., Grasman, R.P., Wagenmakers, E.J.: A tutorial on Fisher information. J. Math. Psychol. 80, 40–55 (2017)
15. Masuda, N.: Opinion control in complex networks. New J. Phys. 17(3), 1–12 (2015)
16. Papalexakis, E.E., Fyshe, A., Sidiropoulos, N.D., Talukdar, P.P., Mitchell, T.M., Faloutsos, C.: Good-enough brain model: challenges, algorithms and discoveries in multi-subject experiments. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 95–104 (2014)
17. Potra, F.A., Wright, S.J.: Interior-point methods. J. Comput. Appl. Math. 124(1), 281–302 (2000). https://www.sciencedirect.com/science/article/pii/S0377042700004337. Numerical Analysis 2000. Vol. IV: Optimization and Nonlinear Equations
18. Redner, S.: Reality-inspired voter models: a mini-review. C. R. Phys. 20(4), 275–292 (2019)
19. Rodriguez, M.G., Schölkopf, B.: Submodular inference of diffusion networks from multiple trees. arXiv preprint arXiv:1205.1671 (2012)
20. Romero Moreno, G., Chakraborty, S., Brede, M.: Shadowing and shielding: effective heuristics for continuous influence maximisation in the voting dynamics. PLOS One 16(6), 1–21 (2021). https://doi.org/10.1371/journal.pone.0252515
21. Yuan, X., Spall, J.C.: Confidence intervals with expected and observed Fisher information in the scalar case. In: 2020 American Control Conference (ACC), pp. 2599–2604. IEEE (2020)
22. Zhang, H.F., Xu, F., Bao, Z.K., Ma, C.: Reconstructing of networks with binary-state dynamics via generalized statistical inference. IEEE Trans. Circuits Syst. I Regul. Pap. 66(4), 1608–1619 (2018)
Need for a Realistic Measure of Attack Severity in Centrality Based Node Attack Strategies
Jisha Mariyam John and Divya Sindhu Lekha
Indian Institute of Information Technology Kottayam, Kottayam, Kerala, India
{jishamariyam.phd201010,divyaslekha}@iiitkottayam.ac.in
Abstract. Complex networks are robust to random failures, but not always to targeted attacks. The resilience of complex networks to different node-targeted attacks has been studied extensively in the literature. Many node attack strategies have also been proposed and their efficiency compared. However, these proposals use different measures of efficiency, so it is not easy to compare them and choose the one most suitable for the system under examination. Here, we review the main results from the literature on centrality based node attack strategies. Our focus is only on works on undirected and unweighted networks. We want to highlight the necessity of a more realistic measure of attack efficiency.
Keywords: Complex networks · Vulnerability · Targeted attacks · Robustness · Network functioning · Attack efficiency · Efficiency measures

1 Introduction
A complex network consists of nodes and the edges (links) connecting them. Each node represents an entity in the real world, and links between nodes represent interactions between these entities. Complex networks embodied as interactions between entities can be observed in many real-world instances, such as social networks, scientific networks, transportation networks, and biological networks. In a social network, nodes represent individuals, and links represent cooperation between them. In a co-authorship network, nodes represent scientists, and the edges represent the number of co-authored papers. In biological networks, nodes represent proteins, genes, species, or neurons, and edges represent the synergies between them. These instances indicate that most systems in the real world can be adequately expressed using complex networks.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 857–866, 2022. https://doi.org/10.1007/978-3-030-93409-5_70

Unexpected errors and failures can occur in real-world networks, affecting the overall functioning or reachability of the entire network. Network failures are attributed to either random errors or intentional attacks on the structural components of the networks (nodes or links). The ability of a network to withstand or overcome these failures defines its robustness. The study of robustness in complex networks is vital for many fields, and studies on network vulnerability have developed rapidly in recent decades, with a wide range of applications in different domains. For biological networks, network vulnerability can help to study diseases and mutations and how to recover from them. Network robustness can help to evaluate the strength of infrastructure networks such as the internet or power grids. In transportation networks, vulnerability is assessed based on targeted attacks on stations, thereby improving the planning of the transportation network.

The approach to these problems started with a study of the topological properties of the network. A study on synthetic networks found that random networks show the same behaviour under random errors and attacks, whereas scale-free networks exhibit a high level of robustness to random errors and a high level of vulnerability to intentional attacks [2]. Further investigations of the resilience of networks to random failures were based on percolation theory [5,19]. If the damage to one or a few components of the network is capable of spreading over the entire network, it leads to a cascade-based attack [14]. Heterogeneous real-world networks are highly liable to such cascading failures [14], which can be a real threat to the security of highly heterogeneous systems such as the internet and power grids. Later, a strategy for defending against such attacks in heterogeneous networks was proposed in [15]. Following this, [8] presented a model for cascading failures and showed that an attack on a single node is sufficient to corrupt the entire network. These results attest that scale-free networks are robust to “random errors” but vulnerable to “targeted attacks”.
In parallel to these theoretical results, several methods have been proposed in the last decade for identifying the sequence of target nodes that maximizes the damage to network connectivity. Centrality based attacks are strategies in which the target node is decided based on its centrality measure. Scientists have investigated different centrality measures and their importance in network attacks. In the following section, we give a brief idea of some essential strategies. Centrality based attacks are studied in many papers [2,4,11–13,16,17,20]. We attempted a comparison of the results in these papers, as they use variations of different centralities such as degree, betweenness, closeness, and eigenvector centrality. However, we could not proceed with the analysis, because different papers measure the efficiency of attacks in different ways, such as diameter [2], largest connected component [2,4,11,12,16,17,20] and geodesic path length [11,13,17]. We therefore narrowed our analysis down to three papers, [4,12] and [13]. Based on the analysis, we observed that a uniform and realistic measure of the effectiveness of centrality based attacks is needed.
2 Centrality Based Attack Strategies
In this section, we briefly introduce the centrality based attacks. Nodes are removed based on their centrality measures. There are two general strategies:
Initial attacks: Nodes are ranked based on their centrality and are attacked in the order of their rank.
Recalculated attacks: Node centralities are recalculated after each attack. In general, these attacks are more harmful than removals based on the initial network. The severity of recalculated attacks is due to the topological changes in the network as significant nodes are removed [11].

2.1 Random (Rand) Attacks
This strategy removes nodes from the network uniformly at random. The study of the impact of eliminating nodes uniformly at random is closely related to the classical percolation process [1]. Results in the literature on node removal indicate that many real-world networks are robust to random failure.

2.2 Degree Centrality (DC) Attacks
The degree k of a node is a simple local network metric that gives a notion of the node's significance in the network based on its connectivity. In the DC strategy, nodes are removed in decreasing order of their degree k. In the case of ties (i.e., nodes with the same degree), the node to be deleted is selected randomly [4]. This strategy was introduced in [2] to show the vulnerability of networks to targeted attacks. Earlier studies of susceptibility to intentional attack are also based on this strategy [6]. Recalculated DC attacks are more severe than their initial counterparts [11].

Variations of DC Strategy: Some variants of DC attack strategies are listed below [4].
1. Second-degree neighbors (Sec): In this strategy, the number of second neighbors of each node is considered for ranking the target nodes.
2. First + Second neighbors (F+S): Nodes are removed based on the sum of the first and second neighbors of each node.
3. Combined first and second degree (Comb): Nodes are removed based on the number of first neighbors of each node. In the case of ties, the node to be deleted is chosen according to its second degree. Consequently, at the beginning of the attack, when at least two nodes have the same degree, removing the node with the highest second degree causes damage more quickly than removing one of those nodes at random. However, after a certain fraction of removals, this strategy becomes less efficient than the first-degree strategy, as the second-degree tie-break removes less significant nodes due to the structural changes in the network [4].

2.3 Betweenness Centrality (BC)
The betweenness of a node quantifies its significance by counting the number of shortest paths between any pair of nodes passing through the node. While
the degree is a measure that depends only on the local structure, the betweenness measure depends on the global structure of the entire network. One observation is that the former concentrates on reducing the number of edges, whereas the latter focuses on destroying as many geodesic paths as possible. Another essential feature is that a node acting as a “bridge” connecting other nodes may have high betweenness even if it is connected to only a small number of other nodes. Therefore, a node with high betweenness centrality plays a significant role in controlling the flow of information through the network [12]. The strategy based on recalculated betweenness is more efficient in disrupting the primary communication paths in a network [11]. The computational cost of recalculated betweenness is quadratic in the number of nodes for sparse networks and cubic for dense networks. As a result, this strategy is practical only for small networks [20].

Variations of BC Strategy: Variants of BC strategies were discussed in [16,20].
1. Approximate betweenness: The computational cost of recalculated betweenness motivates the use of approximate betweenness. This strategy reduces the computational cost by considering only a subset of all possible node pairs in the network (only log n paths are considered) [9]. Recalculated variants of approximate betweenness can be used to account for the changes in the network as nodes are removed, and they provide a good trade-off between efficiency and computational cost [20].
2. Conditional betweenness (Cond. Bet): This strategy accounts for nodes having high betweenness only if they are inside the giant component [16]. Conditional betweenness outperforms recalculated betweenness, as the latter becomes less efficient towards the end of the removal process.

2.4 Combination of Degree and Betweenness Centrality
A strategy based on both degree and betweenness (initial degree and betweenness, IDB, and recalculated degree and betweenness, RDB) was proposed in [17], which found that both RDB and IDB are more harmful in the Watts-Strogatz model due to the unique structure of that network.

2.5 Eigenvector Centrality (EC) Attacks
Eigenvector centrality can be viewed as a refinement of degree centrality and is based on the notion that a node is important if it is connected to other nodes which are themselves important. A node can attain a high eigenvector centrality either by connecting to a large number of nodes or by connecting to a small number of significant nodes.
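The notion above can be illustrated with a short power-iteration sketch. It is not taken from any of the surveyed papers; the function name, the adjacency-dictionary representation and the small star graph are assumptions made for illustration. The iteration multiplies by A + I rather than A, a common shift that leaves the eigenvectors unchanged while avoiding oscillation on bipartite graphs.

```python
import math


def eigenvector_centrality(adj, max_iter=200, tol=1e-10):
    """Power iteration on an undirected graph given as {node: set(neighbors)}.

    Scores are normalized so that the largest score equals 1.
    """
    x = {v: 1.0 for v in adj}
    for _ in range(max_iter):
        # One step of x <- (A + I) x: own score plus the scores of all neighbors.
        new = {v: x[v] + sum(x[u] for u in adj[v]) for v in adj}
        norm = max(new.values())
        new = {v: val / norm for v, val in new.items()}
        if max(abs(new[v] - x[v]) for v in adj) < tol:
            return new
        x = new
    return x


# Hypothetical example: a star with one hub and three leaves.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = eigenvector_centrality(star)
```

In the star, the hub obtains score 1 and each leaf 1/√3, so a leaf with a single link still scores well above zero because its only neighbor is important. An EC attack would remove nodes in decreasing order of these scores.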
2.6 Closeness Centrality (CC) Attacks
Closeness centrality is based on the average geodesic path length from a node to the other nodes and indicates how close the node is to the others. Attacks based on it are less effective, as shown in [12]. However, [13] showed the relevance of CC attacks when the severity is measured by the average geodesic distance. Profile closeness (PC) [13] is a slight variation of closeness centrality. It helps to identify secondary targets when the primarily important nodes are protected with high-security measures (e.g., in a terrorist network).
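The difference between the initial and recalculated strategies introduced at the start of this section can be made concrete with degree centrality. The following is a minimal sketch under stated assumptions: the helper names, the adjacency-dictionary representation and the toy graph are hypothetical, and ties are broken at random as described for the DC strategy.

```python
import random


def build_graph(edges):
    """Undirected, unweighted graph as {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_attack(adj, n_remove, recalculated=False, seed=0):
    """Return the order in which a degree-based attack removes nodes."""
    rng = random.Random(seed)
    g = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    if not recalculated:
        # Initial attack: rank once on the intact network, random tie-break.
        ranking = sorted(g, key=lambda v: (-len(g[v]), rng.random()))
        return ranking[:n_remove]
    order = []
    for _ in range(min(n_remove, len(g))):
        # Recalculated attack: degrees are re-evaluated after every removal.
        top = max(len(nb) for nb in g.values())
        v = rng.choice(sorted(u for u in g if len(g[u]) == top))
        order.append(v)
        for u in g.pop(v):
            g[u].discard(v)
    return order


# Toy graph: hubs A (degree 6) and B (degree 5) share neighbor C (degree 4);
# D (degree 3) is not adjacent to either hub.
edges = ([("A", "C"), ("B", "C"), ("C", "c1"), ("C", "c2"), ("a1", "d1")]
         + [("A", f"a{i}") for i in range(1, 6)]
         + [("B", f"b{i}") for i in range(1, 5)]
         + [("D", f"d{i}") for i in range(1, 4)])
graph = build_graph(edges)
initial_order = degree_attack(graph, 3)
recalculated_order = degree_attack(graph, 3, recalculated=True)
```

Here the initial ranking removes A, B and C, whereas the recalculated attack removes A, B and then D, because C has already lost two of its four links by the third step. Any other centrality could be substituted for the degree in the ranking step.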
3 Efficiency Measures
How do we evaluate the impact of these different attack strategies on a network? The efficiency of a network depends on several topological factors, such as its connectedness. The impact of an attack on a network’s efficiency gives a clear picture of the damage induced. The most commonly used measures in the literature are based on the size of the largest connected component and on the geodesic path length, and these are highlighted in this paper. Other efficiency measures include diameter [13], clustering coefficient [7,18] and total connectedness [10].

3.1 Size of Largest Connected Component (LCC)
The size of the largest connected component (LCC) is defined as the number of nodes in the giant component of the network [3]. For model networks, the efficiency of an attack strategy in degrading the LCC depends on the topology of the network that is attacked [4]. Through this measure, an attack on the network can be mapped to the standard percolation process.

3.2 Average Geodesic Path Length
The geodesic path length describes the interconnectedness of a network and is defined as the length of the shortest path between two nodes. That is, it characterizes the ability of the nodes to communicate with each other [2]. The average geodesic path length, sometimes termed the characteristic path length, is the average over all pairs of vertices [11]. As the number of removed vertices or edges increases, the network breaks into disconnected subgraphs. The average geodesic path length becomes infinite for such a disconnected graph, which leads to the notion of the average inverse geodesic path length, which has a finite value even for a disconnected graph [11]. Unlike for the average geodesic path length, a larger value of the average inverse geodesic path length indicates better functioning of the network.
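Both measures of this section can be computed with breadth-first search on an unweighted graph. The sketch below is illustrative only: the function names are hypothetical, and the inverse geodesic length is assumed to be normalized by the N(N−1) ordered pairs of distinct nodes, with unreachable pairs contributing 1/∞ = 0.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def lcc_size(adj):
    """Number of nodes in the largest connected component."""
    seen, best = set(), 0
    for v in adj:
        if v not in seen:
            component = bfs_distances(adj, v)
            seen.update(component)
            best = max(best, len(component))
    return best


def avg_inverse_geodesic(adj):
    """Mean of 1/d(v, w) over ordered pairs v != w; unreachable pairs add 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for v in adj:
        total += sum(1.0 / d for d in bfs_distances(adj, v).values() if d > 0)
    return total / (n * (n - 1))


# Hypothetical example: the path 0-1-2 plus an isolated node 3.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
size = lcc_size(adj)
inv_len = avg_inverse_geodesic(adj)
```

For this graph the LCC has 3 nodes and the average inverse geodesic length is 5/12, a finite value even though the graph is disconnected. Evaluating both functions after each node removal gives the severity curves that the surveyed papers compare.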
4 A Comparative Study of Centrality Based Attack Strategies
Implementing a recalculated strategy happens to be more time consuming but more efficient. Also, recalculated attacks show only slight variations in the performance of different centrality based strategies [12]. However, attack strategies
based on the initial network are helpful in various applications, such as planning mass vaccination campaigns. When studying vulnerability, it is always desirable to compare the behaviour of synthetic and real-world networks. The theoretical network models primarily studied are the Erdős-Rényi model (random network), the Watts-Strogatz model (small-world network), and the Barabási-Albert model (scale-free network). All the literature surveyed here focuses on both synthetic networks and real-world networks from different domains. Network functioning is measured by means of the LCC, geodesic path length, efficiency, diameter, etc., to describe the efficiency of attack strategies. We faced practical challenges in comparing the attack efficiencies due to the lack of uniform, global measures. This is mainly because networks from various domains react differently to different measures and attack strategies, which makes it difficult to draw conclusions about the efficiency of different attack strategies. Due to the wide variety of analyses of attack strategies found in the literature, our study mainly focused on three papers, [4,12] and [13]. They used severity measures based on the LCC [4,12] and the average geodesic path length [13] to quantify the efficiency of attack strategies. Among them, [4] analyzed the severity of attacks for each fraction of removal. For ease of analysis, we used colourmaps to visualize the comparison of the different attack strategies. Figures 1 and 3 show the ranking patterns based on LCC-based severity measures for different attack strategies in synthetic and real-world networks, respectively. Ranking patterns based on the geodesic path length for synthetic and real-world networks are shown in Figs. 2 and 4, respectively.
N and M represent the number of nodes and edges, and q indicates the fraction of nodes removed (we specify the range of q reported in [4]; outside that range, the attack strategies behave differently). In this paper, we highlight the most commonly used strategies, degree and betweenness. For synthetic networks, the degree strategy (DC), which is a purely local centrality measure, is more efficient than the non-local centrality measures under simultaneous targeted attacks. For sequential attacks against synthetic networks, betweenness (BC) is the most effective at degrading the network structure. All these results hold when the LCC is used as the measure, as shown in Fig. 1. When we consider the geodesic path length as the measure of synthetic network degradation, closeness-based centrality strategies (PC, CC) are relevant under both sequential and simultaneous attacks (Fig. 2). Also, at first glance, it is clear that DC slightly outperforms BC in both simultaneous and sequential strategies (Fig. 2). The results are significantly different for real-world networks. Under the LCC-based severity measure, removing nodes according to their betweenness centrality is the most efficient strategy against real-world networks, as shown in Fig. 3. This is because the critical nodes are not strongly linked nodes, or the highly linked nodes are not in the main core of the network. In the presence of structural properties, such as low degree vertices that act as bridges connecting
Fig. 1. Comparison of different attack strategies on synthetic networks based on LCC
Fig. 2. Comparison of different attack strategies on synthetic networks based on geodesic path length
different components of the network, betweenness centrality will be the most efficient strategy. In the absence of any particular structural properties, the most vulnerable vertices are those with the highest degree [12]. While using geodesic distance-based severity measures, DC has better performance than BC as in the case of synthetic networks (Fig. 4).
Fig. 3. Comparison of different attack strategies on real world networks based on LCC
Fig. 4. Comparison of different attack strategies on real world networks based on geodesic path length
However, there are some exceptions to these commonly occurring results, namely the dolphin social network, the Gnutella P2P network (0.3 < q < 0.4), the immunoglobulin interaction network (0.3 < q < 0.5), the network science collaboration network, and the US popular airport network. These exceptions point to the possible presence of particular structural properties in these networks. In other words, the efficiency of an attack strategy varies with the severity measure used as well as with the structural properties of the network. This underlines the importance of developing realistic standards of robustness for complex networks. It would be appropriate to have an evaluation toolbox containing various efficiency measures instead of a single hybrid measure. Nevertheless, our survey revealed the need for a realistic benchmark for the effective comparison of strategies. A basic intuition here is to have a unique metric that gives relative importance to (1) the attack strategy followed, and (2) the topological properties which are impacted significantly by the attacks.
5 Conclusions
We did a literature survey for comparing the degree of hazards induced by different centrality-based node attack strategies on various networks. However, we found it challenging to do an extensive comparison study due to the lack of uniform standard metrics for attack efficiency. We found that the efficiency of different strategies varies owing to the following factors.
◦ mode of target analysis - initial/recalculated
◦ type of networks - synthetic/real-world networks
◦ network topology
◦ different severity measurements - LCC/geodesic path length
◦ different network domains.
Therefore, a formal and standard methodology for simulating attacks and comparing attack efficiency deserves further investigation.

End Note: Apart from the above-stated features, a desirable metric of attack efficiency also depends on the context in which an attack is performed. For example, consider an obnoxious network, such as a terrorist network, being targeted for destruction by an agency. If the agency’s objective is to cripple the network structure, then a suitable measure of attack efficiency is the LCC. On the other hand, if the aim is to induce communication delay in the system and thereby create chaos among the group, then an apt metric is the average geodesic path length. Therefore, the choice of efficiency measure can also depend on the context of the resilience problem under investigation.

Acknowledgement. This work was funded by the IIT Palakkad Technology IHub Foundation Doctoral Fellowship IPTIF/HRD/DF/019. We also acknowledge the three anonymous reviewers for giving us constructive feedback.
References
1. Aharony, A., Stauffer, D.: Introduction to Percolation Theory. Taylor and Francis (1992). https://doi.org/10.1201/9781315274386
2. Albert, R., Jeong, H., Barabási, A.L.: Error and attack tolerance of complex networks. Nature 406, 378–382 (2000). https://doi.org/10.1038/35019019
3. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002). https://doi.org/10.1103/RevModPhys.74.47
4. Bellingeri, M., Cassi, D., Vincenzi, S.: Efficiency of attack strategies on complex model and real-world networks. Phys. A Stat. Mech. Appl. 414 (2014). https://doi.org/10.1016/j.physa.2014.06.079
5. Callaway, D.S., Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Network robustness and fragility: percolation on random graphs. Phys. Rev. Lett. 85, 5468–5471 (2000). https://doi.org/10.1103/PhysRevLett.85.4626
6. Cohen, R., Erez, K., Ben-Avraham, D., Havlin, S.: Breakdown of the internet under intentional attack. Phys. Rev. Lett. 86, 3682–3685 (2001). https://doi.org/10.1103/PhysRevLett.86.3682
7. Crucitti, P., Latora, V., Marchiori, M.: Efficiency of scale-free networks: error and attack tolerance. Phys. A 320, 622–642 (2003). https://doi.org/10.1016/S0378-4371(02)01545-5
8. Crucitti, P., Latora, V., Marchiori, M.: Model for cascading failures in complex networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 69, 045104 (2004). https://doi.org/10.1103/PhysRevE.69.045104
9. Geisberger, R., Sanders, P., Schultes, D.: Better approximation of betweenness centrality. In: Proceedings of the Meeting on Algorithm Engineering and Experiments, pp. 90–100 (2008). http://dl.acm.org/citation.cfm?id=2791204.2791213
10. He, S., Li, S., Ma, H.: Effect of edge removal on topological and functional robustness of complex networks. Phys. A 388, 2243–2253 (2009). https://doi.org/10.1016/j.physa.2009.02.007
11. Holme, P., Kim, B.J., Yoon, C.N., Han, S.K.: Attack vulnerability of complex networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 65(5 Pt 2), 056109 (2002). https://doi.org/10.1103/PhysRevE.65.056109
12. Iyer, S., Killingback, T., Sundaram, B., Wang, Z.: Attack robustness and centrality of complex networks. PLoS ONE 8 (2013). https://doi.org/10.1371/journal.pone.0059613
13. Lekha, D.S., Balakrishnan, K.: Central attacks in complex networks: a revisit with new fallback strategy. Phys. A Stat. Mech. Appl. 549 (2020). https://doi.org/10.1016/j.physa.2020.124347
14. Motter, A.E., Lai, Y.C.: Cascade-based attacks on complex networks. Phys. Rev. E Stat. Nonlinear Soft Matter Phys. 66(6 Pt 2), 065102 (2002). https://doi.org/10.1103/PhysRevE.66.065102
15. Motter, A.E.: Cascade control and defense in complex networks. Phys. Rev. Lett. 93, 098701 (2004). https://doi.org/10.1103/PhysRevLett.93.098701
16. Nguyen, Q., Pham, H.D., Cassi, D., Bellingeri, M.: Conditional attack strategy for real-world complex networks. Phys. A Stat. Mech. Appl. 530 (2019). https://doi.org/10.1016/j.physa.2019.121561
17. Nie, T., Guo, Z., Zhao, K., Lu, Z.: New attack strategies for complex networks. Phys. A Stat. Mech. Appl. 424 (2015). https://doi.org/10.1016/j.physa.2015.01.004
18. Nie, T., Guo, Z., Zhao, K., Lu, Z.: The dynamic correlation between degree and betweenness of complex network under attack. Phys. A 457, 129–137 (2016). https://doi.org/10.1016/j.physa.2016.03.075
19. Reuven, C., Erez, K., Ben-Avraham, D., Havlin, S.: Resilience of the internet to random breakdowns. Phys. Rev. Lett. 85, 4626–4628 (2000). https://doi.org/10.1103/PhysRevLett.85.4626
20. Wandelt, S., Sun, X., Feng, D., Zanin, M.: A comparative analysis of approaches to network-dismantling. Sci. Rep. 8, 13513 (2018). https://doi.org/10.1038/s41598-018-31902-8
Mixed Integer Programming and LP Rounding for Opinion Maximization on Directed Acyclic Graphs

Po-An Chen, Ya-Wen Cheng, and Yao-Wei Tseng

National Yang Ming Chiao Tung University, 1001 University Road, Hsinchu, Taiwan
[email protected]
Abstract. Gionis et al. have already proposed a greedy algorithm and some heuristics for the opinion maximization problem. Unlike their approach, we adopt mathematical programming to solve the opinion maximization problem on specific classes of networks. We find that on directed acyclic graphs, opinion influence between nodes will not cycle, but would spread outwards from influencers. Based on such an insight, we model the problem as a mixed integer programming (MIP) problem and relax the MIP to a linear program (LP). With MIP, we obtain optimal solutions for the opinion maximization problem and derive approximation solutions with LP randomized rounding algorithms. We conduct experiments for one LP randomized rounding algorithm and give an analysis of the approximation ratio for the other LP randomized rounding algorithm.

Keywords: Opinion maximization · Directed acyclic graphs · Mixed integer programming · LP randomized rounding

1 Introduction
DeGroot [4] first proposed a continuous opinion dynamics model for the opinion formation process. The opinion formation problem proposed by Friedkin and Johnsen [5] and the opinion formation game proposed by Bindel et al. [2] are more realistic. They assume that each person in a social network has two types of opinion: an internal opinion and an expressed opinion. The internal opinions of nodes are constant and not subject to external influences, whereas their expressed opinions may be influenced by neighbors with different viewpoints. The update of a node's expressed opinion can be seen as individual cost minimization, where the cost depends on the node's internal opinion and the expressed opinions of its neighbors in the social network. Through repeated updates, the expressed opinion vector of the entire social network gradually converges to an equilibrium. The objective is the social cost, defined as the sum of the individual costs of all participants in the social network.

Gionis et al. [6] considered a social network in which each person has a real-valued opinion, normalized between zero and one, regarding a specific issue. They proposed the opinion maximization problem, which selects no more than k individuals as key opinion leaders and fixes their expressed opinions to 1. The optimal subset of nodes maximizes the sum of expressed opinions in the social network at equilibrium. They proved that the problem is NP-hard. They adopted the opinion formation model mentioned above and computed the equilibrium of expressed opinions by absorbing random walks. They proposed a greedy algorithm and some heuristics to approximate the problem. They also proved the submodularity of the objective function, so the greedy algorithm comes with a bound on the approximation ratio.

In this paper, we model and tackle the opinion maximization problem, originally proposed by Gionis et al. [6], by mixed integer programs and linear programs (e.g., see [9]) with randomized rounding, specifically for directed acyclic graphs. The opinion value of each node is continuous. Our goal is to maximize the total opinion in the social network by selecting k nodes and fixing each of their opinions to the extremely positive opinion. Thus, it is important to select the most effective k individuals. Gionis et al. have already proposed a greedy algorithm and some heuristics for the opinion maximization problem. Unlike their approach, we adopt mathematical programming to solve the opinion maximization problem on specific classes of networks. We observe that on directed acyclic graphs, opinion influence between nodes does not cycle but spreads outwards from influencers. Based on this insight, we model the problem as a mixed integer programming (MIP) problem and relax the MIP to a linear program (LP). With the MIP, we obtain optimal solutions for the opinion maximization problem (although not in polynomial time), and we derive approximation solutions with LP randomized rounding algorithms. We conduct experiments for LP randomized rounding Algorithm 1, and bound the approximation ratio with high probability for LP randomized rounding Algorithm 2. We would like to emphasize that networks with such a structure are not hard to find and are even prevalent in the real world.¹

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 867–878, 2022. https://doi.org/10.1007/978-3-030-93409-5_71

1.1 Related Work
DeGroot [4] first proposed a continuous opinion dynamics model for the opinion formation process. He described the process by which a group of individuals in a network reaches consensus, with each node updating its opinion to the average of its neighbors' opinions. Building on DeGroot's work, Bindel et al. [2] extended Friedkin and Johnsen's [5] work. In their model, the individual cost comes from the difference between a node's internal opinion and its external (i.e., expressed) opinion, along with the differences between its external opinion and those of its neighbors. The cost function defined this way no longer induces consensus. Later, Chen et al. [3] studied opinion formation games and gave bounds on the price of anarchy for a more general class of directed graphs. In another line of work regarding directed acyclic graphs, group pinning consensus was discussed and established with sufficient conditions, through pinning control protocols under fixed and randomly switching topologies with acyclic partition [10].

Our work is based on the model by Gionis et al. [6]. Using the sum of expressed opinions as the objective, opinion maximization seeks a k-subset of nodes whose expressed opinions are fixed to 1 so as to maximize the objective. Greedy algorithms have been designed to approximate the optimum with the help of the submodularity of such a social cost [1,6].

¹ Although our main results are motivated by and established for directed acyclic graphs, what if a given graph is not acyclic but is very close to an acyclic graph? Our conjecture is that the approach can be extended to such a graph, since we have not yet observed anything that would immediately prevent us from obtaining an approximation solution, but the feasibility in a probabilistic sense and the approximation ratio guarantees may change a lot.
2 Preliminaries

2.1 Opinion Formation Games
The opinion formation problem was proposed by Friedkin and Johnsen [5] and extended by Bindel et al. [2]. We consider a weighted graph $G = (V, E)$, where $V$ is the set of $n$ nodes representing individuals in the social network and $E$ is the set of $m$ edges representing connections between two people. $N(i)$ denotes the social neighborhood of person $i$. The edge weight $w_{ij} \ge 0$ represents the influence spreading from person $j$ to person $i$. We assume that node $i$ has a persistent internal opinion $s_i$ and an expressed opinion $z_i$, which gets updated. Both $s_i$ and $z_i$ are continuous; we model them as values in the interval $[0, 1]$. The individual cost function is defined as

\[ c_i(z) = (s_i - z_i)^2 + \sum_{j \in N(i)} w_{ij} (z_i - z_j)^2, \]

where $z$ is the vector of expressed opinions of all nodes in the network. This cost model shows that the expressed opinion $z_i$ of individual $i$ results from an interaction between their own internal opinion $s_i$ and the expressed opinions $z_j$ of their neighbors. Due to social opinion influence, the goal of every node $i$ is to minimize its cost $c_i(z)$. Internal opinions are assumed to be fixed, so minimizing the cost (setting $\partial c_i / \partial z_i = 0$) means that the expressed opinion $z_i$ is the weighted average of the internal opinion $s_i$ and the expressed opinions of the neighbors $j \in N(i)$. That is, each node updates its expressed opinion as

\[ z_i = \frac{s_i + \sum_{j \in N(i)} w_{ij} z_j}{1 + \sum_{j \in N(i)} w_{ij}}. \]

As each node $i$ updates its expressed opinion, the expressed opinion vector converges to a unique² Nash equilibrium of the game.

² The uniqueness follows from the convexity of cost functions.
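The update rule above can be iterated directly. The following is a minimal sketch, assuming all nodes update simultaneously in each sweep; the function name `best_response_equilibrium`, the dictionary layout of the weights, and the stopping tolerance are illustrative choices and are not taken from the paper.

```python
def best_response_equilibrium(s, w, tol=1e-12, max_iter=10_000):
    """Iterate z_i = (s_i + sum_j w_ij z_j) / (1 + sum_j w_ij) until it stops changing.

    s: list of internal opinions in [0, 1].
    w: dict mapping node i to {j: w_ij} over its neighbourhood N(i) (hypothetical layout).
    """
    z = list(s)  # start from the internal opinions
    for _ in range(max_iter):
        new_z = []
        for i, s_i in enumerate(s):
            nbrs = w.get(i, {})
            numerator = s_i + sum(w_ij * z[j] for j, w_ij in nbrs.items())
            denominator = 1.0 + sum(nbrs.values())
            new_z.append(numerator / denominator)
        # stop once no expressed opinion moved by more than the tolerance
        if max(abs(a - b) for a, b in zip(z, new_z)) < tol:
            return new_z
        z = new_z
    return z


# Two people who listen to each other with unit weight.
z = best_response_equilibrium([0.0, 1.0], {0: {1: 1.0}, 1: {0: 1.0}})
```

In this two-node example the fixed point solves z_0 = z_1 / 2 and z_1 = (1 + z_0) / 2, so the iteration should approach z_0 = 1/3 and z_1 = 2/3.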
2.2 Opinion Maximization
Following the work of Gionis et al. [6], the goal of opinion maximization, which they called a campaign problem, is to make the sum of opinions in a network as large as possible. Given an expressed opinion vector $z = (z_i)_i$, define the total opinion in the entire network $g(z)$ as

\[ g(z) = \sum_{i=0}^{n} z_i. \]

The goal of the opinion maximization problem is to maximize $g(z)$. As mentioned in the introduction, we assume that the influence spread relies on selecting a set $T$ of exactly $k$ nodes. The expressed opinions $z_i$ of all nodes in $T$ are fixed to 1, and the Nash equilibrium vector $z$ is obtained under this condition; we write $g(z \mid T)$ for the resulting sum of expressed opinions in the social network, and we select $T$ to maximize it. We emphasize that our inputs are the weighted graph $G = (V, E)$, the internal opinions $s_i$ of all nodes $i$, and the number $k$ of nodes whose external opinions we set to 1. The external opinions $z_i$ are not part of the input; they are obtained through the opinion formation model. Note that Gionis et al. [6] have proved that $g(z \mid T)$ is monotone and submodular, so a greedy algorithm has been proposed with an approximation guarantee of $(1 - 1/e)$.
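To make the greedy baseline concrete, here is a minimal sketch of marginal-gain selection on g(z | T). It assumes unit edge weights and approximates the equilibrium by a fixed number of averaging sweeps; the names `equilibrium_sum` and `greedy` and the sweep count are illustrative, and this is not the implementation of Gionis et al. [6].

```python
def equilibrium_sum(s, nbrs, pinned, sweeps=200):
    """Approximate g(z | T): sum of expressed opinions with nodes in `pinned` fixed to 1.

    nbrs[i] lists the nodes that influence i; all weights are assumed to be 1.
    """
    z = [1.0 if i in pinned else s[i] for i in range(len(s))]
    for _ in range(sweeps):
        z = [
            1.0 if i in pinned
            else (s[i] + sum(z[j] for j in nbrs[i])) / (1 + len(nbrs[i]))
            for i in range(len(s))
        ]
    return sum(z)


def greedy(s, nbrs, k):
    """Pick k nodes one at a time, each time adding the node with the largest g(z | T)."""
    chosen = set()
    for _ in range(k):
        best = max(
            (i for i in range(len(s)) if i not in chosen),
            key=lambda i: equilibrium_sum(s, nbrs, chosen | {i}),
        )
        chosen.add(best)
    return chosen


# Star with node 0 influencing nodes 1 and 2.
s = [0.2, 0.5, 0.9]
nbrs = {0: [], 1: [0], 2: [0]}
first_pick = greedy(s, nbrs, 1)
```

On this small star, pinning node 0 lifts both of its followers, so the first greedy pick should be node 0. Each greedy step recomputes the equilibrium for every candidate, which is why the method becomes slow on larger graphs.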
3 Mixed Integer Programming and LP Rounding for Directed Acyclic Graphs

Here, we model the problem of opinion maximization as a mixed integer program (MIP) and give two linear programming (LP) randomized rounding algorithms for directed trees (i.e., a special case of directed acyclic graphs) and directed acyclic graphs. The MIP yields an exact algorithm, and the LP randomized rounding algorithms are approximation algorithms. For LP randomized rounding Algorithm 2, we give an analysis to bound the approximation ratio.

3.1 Mixed Integer Linear Programs
On directed trees and directed acyclic graphs, the influence between nodes is directional; in other words, the external opinion of each node is not updated repeatedly and cyclically in the process of opinion formation. This is the key reason that we can model the opinion maximization problem as a MIP and an LP. In a directed tree, every node except the root has out-degree 1, which means that every node except the root has exactly one neighbor (its parent). We can generalize the idea to directed acyclic graphs quite easily. The difference is that a directed acyclic graph may have several nodes with out-degree 0; we denote their number by b.
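Because influence does not cycle, the equilibrium on a directed acyclic graph can be computed in a single pass that visits every node after all nodes influencing it. The sketch below illustrates this under the assumption of unit weights; the function name `dag_equilibrium` and the `parents` dictionary are hypothetical, and the standard-library `graphlib` module (Python 3.9+) supplies the topological order.

```python
from graphlib import TopologicalSorter


def dag_equilibrium(s, parents, selected=frozenset()):
    """One-pass equilibrium on a DAG with unit weights.

    parents[i] lists the nodes j that influence node i (the neighbourhood N(i)).
    Nodes in `selected` have their expressed opinion fixed to 1.
    """
    # static_order() yields each node only after all of its listed predecessors
    order = TopologicalSorter({i: parents[i] for i in range(len(s))}).static_order()
    z = {}
    for i in order:
        if i in selected:
            z[i] = 1.0
        else:
            z[i] = (s[i] + sum(z[j] for j in parents[i])) / (len(parents[i]) + 1)
    return [z[i] for i in range(len(s))]


# Chain 2 -> 1 -> 0: node 0 is the root, node 1 listens to 0, node 2 listens to 1.
z_free = dag_equilibrium([0.2, 0.5, 0.9], {0: [], 1: [0], 2: [1]})
z_pinned = dag_equilibrium([0.2, 0.5, 0.9], {0: [], 1: [0], 2: [1]}, selected={1})
```

No iteration to convergence is needed here: every value is final the moment it is written, which is the property that lets the constraints be stated as plain equalities.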
Directed Trees. Each node $v_i$ with its internal opinion $s_i$ has a unique parent node $v_j$, and $w_{ij} = 1$ for all $i, j$.

\[
\begin{aligned}
\max \quad & \sum_i z_i \\
\text{s.t.} \quad & \sum_i y_i \le k, \\
& z_0 = y_0 + s_0 - s_0 y_0, \\
& z_i = y_i + \frac{s_i + z_j}{2} - \frac{s_i y_i}{2} - \frac{x_i}{2} && \forall i > 0, \\
& x_i \le z_j && \forall i > 0, \\
& z_i \in [0, 1], \; y_i \in \{0, 1\}, \; x_i \in [0, 1] && \forall i.
\end{aligned}
\]

Note that if $y_i = 1$, meaning that node $v_i$ is selected, then the expressed opinion of node $v_i$ is $z_i = 1 + \frac{z_j - x_i}{2} \ge 1$, where $x_i$ is a nonnegative fractional auxiliary variable for setting $z_i$; since $z_i \in [0, 1]$, $z_i = 1$. If $y_i = 0$, meaning that node $v_i$ is not selected, $z_i = \frac{s_i + z_j - x_i}{2}$; since the objective is $\sum_i z_i$, $x_i = 0$ helps the most and thus $z_i = \frac{s_i + z_j}{2}$.

Directed Acyclic Graphs. Each node $v_i$ with its internal opinion $s_i$ has multiple parent nodes $v_{j_1}, \ldots, v_{j_{|N(i)|}}$, and $w_{ij} = 1$ for all $i, j$. Nodes $v_0, \ldots, v_{b-1}$ are the nodes with no outgoing edges, and $[b] = \{0, \ldots, b-1\}$.

\[
\begin{aligned}
\max \quad & \sum_i z_i \\
\text{s.t.} \quad & \sum_i y_i \le k, \\
& z_i = y_i + s_i - s_i y_i && \forall i \in [b], \\
& z_i = y_i + \frac{s_i + \sum_{j \in N(i)} z_j}{|N(i)| + 1} - \frac{s_i y_i}{|N(i)| + 1} - \frac{\sum_{j \in N(i)} x_{i,j}}{|N(i)| + 1} && \forall i \ge b, \\
& x_{i,j} \le z_j && \forall i, j, \\
& z_i \in [0, 1], \; y_i \in \{0, 1\} \;\; \forall i, \quad x_{i,j} \in [0, 1] && \forall i, j.
\end{aligned}
\]

Note that if $y_i = 1$, meaning that node $v_i$ is selected, then the expressed opinion of node $v_i$ is $z_i = 1 + \frac{\sum_{j \in N(i)} (z_j - x_{i,j})}{|N(i)| + 1} \ge 1$, where each $x_{i,j}$ is a nonnegative fractional auxiliary variable for setting $z_i$; since $z_i \in [0, 1]$, $z_i = 1$. If $y_i = 0$, meaning that node $v_i$ is not selected, $z_i = \frac{s_i + \sum_{j \in N(i)} (z_j - x_{i,j})}{|N(i)| + 1}$; since the objective is $\sum_i z_i$, $x_{i,j} = 0$ for all $j \in N(i)$ helps the most and thus $z_i = \frac{s_i + \sum_{j \in N(i)} z_j}{|N(i)| + 1}$.
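The role of the auxiliary variable in the tree constraint can be checked numerically. The sketch below is only an illustration: it fixes a parent value z_j and a choice y_i, scans x_i over a grid in [0, z_j], and keeps the largest z_i that stays inside [0, 1]. The function name `max_feasible_z`, the grid resolution, and the small numerical tolerance are assumptions made for this example.

```python
def max_feasible_z(s_i, z_j, y_i, steps=1000):
    """Largest feasible z_i under the tree constraint for fixed y_i and parent value z_j."""
    best = None
    for t in range(steps + 1):
        x_i = z_j * t / steps  # x_i ranges over [0, z_j]
        z_i = y_i + (s_i + z_j) / 2 - s_i * y_i / 2 - x_i / 2
        if 0.0 <= z_i <= 1.0 + 1e-12:  # keep only values inside [0, 1], up to rounding
            best = z_i if best is None else max(best, z_i)
    return best


selected_value = max_feasible_z(0.3, 0.8, 1)    # node selected: forced to 1
unselected_value = max_feasible_z(0.3, 0.8, 0)  # node not selected: (s_i + z_j) / 2
```

With s_i = 0.3 and z_j = 0.8, the selected case is feasible only at x_i = z_j and gives z_i = 1, while the unselected case is maximized at x_i = 0 and gives 0.55, matching the two cases described in the note above.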
3.2 LP Randomized Rounding Algorithms
Randomized rounding is a method that relaxes an integer program (IP) and converts the fractional solution obtained from the relaxation into an approximation solution. We gave mixed integer linear programs (MIPs) for the problem of opinion maximization in the previous section. In this section, we build on these mixed integer programs. Letting $y_i \in [0, 1]$ be fractional for each node $i$ relaxes the mixed integer program to a linear program. With the solution obtained from linear programming, we use randomized rounding to convert some values of variables to integers in $\{0, 1\}$ and obtain an approximation solution. We give two LP randomized rounding algorithms below.

Each node $v_i$ with its internal opinion $s_i$ has a unique parent node $v_j$, and $w_{ij} = 1$ for all $i, j$. The LP relaxation of the MIP for a directed tree is as follows.

\[
\begin{aligned}
\max \quad & \sum_i z_i \\
\text{s.t.} \quad & \sum_i y_i \le k, \\
& z_0 = y_0 + s_0 - s_0 y_0, \\
& z_i = y_i + \frac{s_i + z_j}{2} - \frac{s_i y_i}{2} - \frac{x_i}{2} && \forall i > 0, \\
& x_i \le z_j && \forall i > 0, \\
& z_i \in [0, 1], \; y_i \in [0, 1], \; x_i \in [0, 1] && \forall i.
\end{aligned}
\]

Each node $v_i$ with its internal opinion $s_i$ has multiple parent nodes $v_{j_1}, \ldots, v_{j_{|N(i)|}}$, and $w_{ij} = 1$ for all $i, j$. Nodes $v_0, \ldots, v_{b-1}$ are the nodes with no outgoing edges, and $[b] = \{0, \ldots, b-1\}$. The LP relaxation of the MIP for a directed acyclic graph is as follows.

\[
\begin{aligned}
\max \quad & \sum_i z_i \\
\text{s.t.} \quad & \sum_i y_i \le k, \\
& z_i = y_i + s_i - s_i y_i && \forall i \in [b], \\
& z_i = y_i + \frac{s_i + \sum_{j \in N(i)} z_j}{|N(i)| + 1} - \frac{s_i y_i}{|N(i)| + 1} - \frac{\sum_{j \in N(i)} x_{i,j}}{|N(i)| + 1} && \forall i \ge b, \\
& x_{i,j} \le z_j && \forall i, j, \\
& z_i \in [0, 1], \; y_i \in [0, 1] \;\; \forall i, \quad x_{i,j} \in [0, 1] && \forall i, j.
\end{aligned}
\]
Algorithm 1. LP Randomized Rounding Algorithm 1
Solve the LP above to get the optimal fractional solution $\{\bar{x}_i\}_i, \{\bar{y}_i\}_i, \{\bar{z}_i\}_i$ and the optimal objective value $OPT_f$.
Initially set the selected set $S := \emptyset$ and set $V' := V$.
repeat the following process k times:
    Select a node $v_i$ to set $z_i = y_i = 1$ with distribution $\{\bar{y}_i / \sum_{i'} \bar{y}_{i'}\}_i$ over the nodes in $V'$.
    Add the selected node to $S$.
    Remove the selected node from $V'$.
end
Efficiently compute Nash equilibria according to $S$ ([6]).
LP Randomized Rounding Algorithm 1. The concept of LP randomized rounding Algorithm 1 is to select each node to add to the set $S$ according to the distribution $\{\bar{y}_i / \sum_{i'} \bar{y}_{i'}\}_i$ and to set its expressed opinion $z_i = 1$. We select one node in each round until k nodes are selected. Regarding the performance of LP randomized rounding Algorithm 1, we show how close it is to the optimal solution through experiments in the next section.
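The selection loop of Algorithm 1 can be sketched as sampling without replacement, with weights taken from the fractional LP solution. The sketch assumes the values ȳ_i are already available as a list `y_bar` and that the distribution is renormalized over the remaining nodes in each round; the function name `sample_k_nodes` and the uniform fallback when no fractional mass is left are illustrative assumptions, not part of the paper.

```python
import random


def sample_k_nodes(y_bar, k, rng=None):
    """Draw k distinct nodes; each draw picks node i with probability proportional to y_bar[i]."""
    rng = rng or random.Random()
    remaining = list(range(len(y_bar)))
    chosen = []
    for _ in range(k):
        weights = [y_bar[i] for i in remaining]
        if sum(weights) <= 0:  # no fractional mass left: fall back to a uniform draw
            weights = [1.0] * len(remaining)
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# With an integral LP solution the sample simply recovers the LP's choice.
picked = sample_k_nodes([1.0, 0.0, 0.0, 1.0, 0.0], 2, random.Random(0))
```

After sampling, the expressed opinions of the chosen nodes are fixed to 1 and the equilibrium is recomputed, which on a directed acyclic graph takes a single topological pass.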
Algorithm 2. LP Randomized Rounding Algorithm 2 (for Directed Trees)
Solve the LP above to get the optimal fractional solution $\{\bar{x}_i\}_i, \{\bar{y}_i\}_i, \{\bar{z}_i\}_i$ and the optimal objective value $OPT_f$.
repeat the following process for $c \log n$ rounds:
    for $v_i \in V$ do
        Round $z_i$ to 1 with probability $\bar{y}_i$ and set it to $\frac{s_i + \bar{z}_j}{2}$ with probability $1 - \bar{y}_i$.
        Round $y_i$ to 1 if $z_i = 1$ and to 0 otherwise.
    end for
end
for $v_i \in V$ do
    Set $z_i$ to the value at the same arbitrary round.
end for
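The rounding loop of Algorithm 2 can be sketched as follows. This is one possible reading, stated under explicit assumptions: the fractional values `y_bar` and `z_bar` come from an LP solver that is not shown, the root (which has no parent) keeps its internal opinion when it is not rounded up, and the round that is returned is the first one whose selection satisfies the budget k. The names `round_once` and `algorithm2` and the default constant c are illustrative.

```python
import math
import random


def round_once(s, parent, y_bar, z_bar, rng):
    """One rounding pass: z_i = 1 with probability y_bar[i], else (s_i + z_bar[parent]) / 2."""
    y, z = [], []
    for i in range(len(s)):
        if rng.random() < y_bar[i]:
            y.append(1)
            z.append(1.0)
        else:
            y.append(0)
            # assumption: the root has no parent and keeps its internal opinion
            z.append(s[i] if parent[i] is None else (s[i] + z_bar[parent[i]]) / 2)
    return y, z


def algorithm2(s, parent, y_bar, z_bar, k, c=3, rng=None):
    """Repeat the rounding about c * log(n) times; return the first round with sum(y) <= k."""
    rng = rng or random.Random()
    rounds = max(1, math.ceil(c * math.log(len(s))))
    for _ in range(rounds):
        y, z = round_once(s, parent, y_bar, z_bar, rng)
        if sum(y) <= k:
            return y, z
    return None  # every round violated the budget


result = algorithm2([0.4, 0.6], [None, 0], [0.0, 0.0], [0.4, 0.5], k=1, rng=random.Random(0))
```

Returning `None` corresponds to the bad event in which the budget constraint is violated in every round.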
LP Randomized Rounding Algorithm 2. For the analysis, we upper bound the probability of violating the constraints, so that we can ensure that LP randomized rounding Algorithm 2 finds, with high probability, a feasible solution with a good approximation ratio. Note that although we present the algorithm and analysis for directed trees here, the algorithm and analysis can be extended to work for directed acyclic graphs.

Theorem 1. Algorithm 2 outputs a feasible solution and has an objective value of at least $(1 - \delta)(OPT - \epsilon)$ with high probability, for some constant $0 < \delta \le 1$ and a constant $\epsilon \ge 0$ defined by the optimal fractional solution.

Proof. In one of the $c \cdot \log n$ rounds, by Markov's inequality and since $\sum_i \bar{y}_i \le k$,

\[ \Pr\Big[\sum_i y_i \ge k + 1\Big] \le \frac{E[\sum_i y_i]}{k+1} = \frac{\sum_i E[y_i]}{k+1} = \frac{\sum_i \bar{y}_i}{k+1} \le \frac{1}{1 + 1/k}. \tag{1} \]

Over the $c \cdot \log n$ repetitions of the process, the probability that $\sum_i y_i \ge k + 1$ in every round is at most, for some proper constant $c > 0$,

\[ \Big(\frac{1}{1 + 1/k}\Big)^{c \cdot \log n} \le \frac{1}{cn}. \tag{2} \]

The expected objective is

\[
\begin{aligned}
\sum_i E[z_i] &= \sum_i \Big( \bar{y}_i \cdot 1 + (1 - \bar{y}_i) \frac{s_i + \bar{z}_j}{2} \Big) \\
&\ge \sum_i \Big( \bar{y}_i + \frac{s_i + \bar{z}_j}{2} - \frac{s_i \bar{y}_i}{2} - \frac{\bar{z}_j}{2} \Big) \\
&= \sum_i \Big( \bar{y}_i + \frac{s_i + \bar{z}_j}{2} - \frac{s_i \bar{y}_i}{2} - \frac{\bar{x}_i}{2} - \frac{\bar{z}_j - \bar{x}_i}{2} \Big) \\
&= \sum_i \bar{z}_i - \epsilon = OPT_f - \epsilon \\
&\ge OPT - \epsilon \ge k - \epsilon
\end{aligned}
\]

for $\epsilon = \sum_i \frac{\bar{z}_j - \bar{x}_i}{2}$.

By the Chernoff bound, for some constant $0 < \delta \le 1$,

\[ \Pr\Big[\sum_i z_i \le (1 - \delta) E\Big[\sum_i z_i\Big]\Big] \le e^{-\frac{E[\sum_i z_i] \delta^2}{2}} \le \frac{1}{e^{(k - \epsilon)\delta^2 / 2}}. \tag{3} \]

Combining these bad events, namely $\sum_i y_i \ge k + 1$ in every round or $\sum_i z_i \le (1 - \delta) E[\sum_i z_i]$, we have that the objective value of the algorithm's feasible solution is at least $(1 - \delta)(OPT - \epsilon)$ with probability at least $1 - \frac{1}{e^{(k - \epsilon)\delta^2 / 2}} - \frac{1}{cn}$.

Remark 1. Both algorithms run in polynomial time.
4 Numerical Results for LP Randomized Rounding 1

In this section, we show the results of our experiments. Since we have proved in the previous section that LP randomized rounding Algorithm 2 obtains a feasible approximation solution with high probability, we only conduct experiments on LP randomized rounding Algorithm 1 in the LP randomized rounding part. As mentioned earlier, our goal is to maximize the overall expressed opinion g(z|T) in the network. To analyze the algorithms in practice, we implemented them in Python, using the SNAP API [8] for the network structure and the Gurobi API [7] for the mixed integer programming and linear programming implementation. Gurobi has several mechanisms that speed up the solving process of the optimizer, but we want to use a pure branch-and-bound strategy in order to compare it with other algorithms more fairly. The Gurobi optimizer first "pre-solves" the problem, and the pre-solve mechanism removes some constraints and variable bounds. The heuristic algorithms provided by Gurobi affect the results, too. Moreover, the Gurobi MIP solver runs in parallel with multiple threads, and cutting-plane strategies impact the result of the MIP solver. Consequently, we modify some parameters so that we solve the problem with 1 thread and no pre-solve, and rule out the influence of heuristics and cutting-plane strategies.

We generate directed full trees with different numbers of layers and different in-degrees for all parents in the tree, and also randomly generate directed acyclic graphs of different scales, 4 graphs in total. The small tree has 3 layers and in-degree 3 for all parents (13 nodes and 12 edges). The large tree has 5 layers and in-degree 5 for all parents (781 nodes and 780 edges). The small directed acyclic graph has 15 nodes and 19 edges. The large directed acyclic graph has 400 nodes and 1875 edges. We randomly generate an internal opinion s_i for every node i of these graphs and set every edge weight w_ij = 1. We experiment with the approximation ratio and computation time of the algorithms above, and compare their efficiency and effectiveness.
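The full trees described above are easy to regenerate and their sizes easy to verify. The sketch below builds a parent array for a full tree in which every edge points from a child to its parent; the function name `full_tree` and the array representation are illustrative and do not reflect the SNAP-based code used in the experiments.

```python
def full_tree(layers, in_degree):
    """Return parent[i] for a full tree; node 0 is the root and has no parent."""
    parent = [None]
    frontier = [0]
    for _layer in range(layers - 1):
        next_frontier = []
        for p in frontier:
            for _child in range(in_degree):
                parent.append(p)  # new node whose single out-edge points to p
                next_frontier.append(len(parent) - 1)
        frontier = next_frontier
    return parent


small = full_tree(3, 3)  # 1 + 3 + 9 nodes
large = full_tree(5, 5)  # 1 + 5 + 25 + 125 + 625 nodes
```

The node counts follow the geometric sums 1 + 3 + 9 = 13 and 1 + 5 + 25 + 125 + 625 = 781, and each tree has one edge fewer than it has nodes, in agreement with the sizes reported for the small and large trees.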
We present the sum of expressed opinions g(z|T) and the computation time for each algorithm in the following tables, together with the ratio of each objective value to the optimal solution.

4.1 Directed Trees
We adopt the small directed tree (3 layers with in-degree 3 for all parents) with the randomly generated internal opinions. We set k = 5 and set the external opinion of the nodes we choose to z_i = 1. Table 1 shows the experimental results.

Table 1. LP randomized rounding comparison for the small tree with k = 5
Algorithm                g(z|T)              Computation time  Ratio
Brute force              11.894395930949575  0.5631651 s       1.00
Greedy algorithm         11.894395930949575  0.0258441 s       1.00
LP randomized rounding   10.935825325887178  0.0747382 s       0.9194
The LP randomized rounding is an approximation algorithm, and we achieve an approximation ratio of over 90% of the optimal solution. Although the greedy algorithm is also an approximation algorithm, it happens to obtain the optimal solution on this network structure. We then use a larger directed tree (5 layers with in-degree 5 for all parents). We set k = 50 and set the external opinion of the nodes we choose to z_i = 1. Table 2 shows the experimental results.

Table 2. LP randomized rounding comparison for the large tree with k = 50
Algorithm                g(z|T)              Computation time  Ratio
Greedy algorithm         558.2954466648072   642.985675 s      0.9989
Mixed integer program    558.8729695901077   0.2067385 s       1.00
LP randomized rounding   520.6121731840166   0.0993077 s       0.9315
This is a larger directed tree network. The LP randomized rounding is an approximation algorithm, and we achieve an approximation ratio of over 90% of the optimal solution (which we obtain from the mixed integer program). The greedy algorithm is also an approximation algorithm and comes very close to the optimal solution, but linear program randomized rounding obtains a good approximation solution much faster.

4.2 Directed Acyclic Graphs
We adopt the randomly generated small directed acyclic graph (15 nodes and 19 edges) with the randomly generated internal opinions. We set k = 5 and set the external opinion of the nodes we choose to z_i = 1. Table 3 shows the experimental results.

Table 3. LP randomized rounding comparison for the small DAG with k = 5
Algorithm                g(z|T)              Computation time  Ratio
Brute force              13.20415451606796   2.3857962 s       1.00
Greedy algorithm         13.20415451606796   0.0522161 s       1.00
LP randomized rounding   12.60421544883294   0.0653146 s       0.9546
The LP randomized rounding algorithm achieves an approximation ratio of over 95% of the optimal solution, which we obtain by brute force. Although the greedy algorithm is an approximation algorithm, too, it happens to obtain the exact solution on this network structure. We then adopt a randomly generated larger directed acyclic graph (400 nodes and 1875 edges). We set k = 50 and set the external opinion of the nodes we choose to z_i = 1. Table 4 shows the experimental results.

Table 4. LP randomized rounding comparison for the large DAG with k = 50
Algorithm                g(z|T)              Computation time  Ratio
Greedy algorithm         365.3755578392006   329.2380125 s     0.9991
Mixed integer program    365.7031671582477   142.1758167 s     1.00
LP randomized rounding   302.59276338381625  0.3941836 s       0.8274
On this larger directed acyclic graph, the LP randomized rounding algorithm achieves an approximation ratio of over 80% of the optimal solution, which we obtain from the mixed integer program. Although the greedy algorithm approximates the objective function well, linear program randomized rounding obtains an approximation solution much faster.
5 Conclusions and Future Work

We propose a mixed integer programming (MIP) formulation and two different linear programming randomized rounding (LP randomized rounding) algorithms to solve the problem of opinion maximization proposed by Gionis et al. [6] for directed trees and directed acyclic graphs. With mixed integer programming, we obtain the optimal solution in a relatively short time. Especially for directed trees, the computation time is very short. With the two LP randomized rounding algorithms, we obtain approximate solutions. The experiments show that the result of the first LP randomized rounding algorithm has quite a high ratio compared with the optimal solution. We also proved that the second LP randomized rounding algorithm obtains an approximate solution with high probability.

As for future work, first of all, due to the limitation of computer performance, we did not experiment with large-scale networks. We can try to experiment with larger graphs in the future. In terms of graph selection, experiments can also be conducted on real directed acyclic graph networks instead of randomly generated graphs. Taken together, to test the applicability of our approach to much larger and more realistic graphs or networks, we expect to use large empirical directed acyclic graph data sets to conduct computational experiments. Moreover, the problem of opinion maximization may be modeled and tackled by alternative mixed integer program formulations or even other optimization methods, like delayed constraint generation algorithms, which add cuts in every iteration in two-stage (stochastic) programming. Wu and Küçükyavuz [11] have used delayed constraint generation algorithms on the influence maximization problem and obtained optimal solutions. Besides applying different kinds of methods to this problem, it is worth exploring the applicability of our approach to some more general directed graphs to obtain optimal or approximation solutions.
References
1. Ahmadinejad, A., Mahini, H.: How effectively can we form opinions? In: Proceedings of International World Wide Web Conference (2014)
2. Bindel, D., Kleinberg, J., Oren, S.: How bad is forming your own opinion? Games Econ. Behav. 92, 248–265 (2015)
3. Chen, P.-A., Chen, Y.-L., Lu, C.-J.: Bounds on the price of anarchy for a more general class of directed graphs in opinion formation games. Oper. Res. Lett. 44(6), 808–811 (2016)
4. DeGroot, M.H.: Reaching a consensus. J. Am. Stat. Assoc. 69(345), 118–121 (1974)
5. Friedkin, N.E., Johnsen, E.C.: Social influence and opinions. J. Math. Sociol. 15(3–4), 193–206 (1990)
6. Gionis, A., Terzi, E., Tsaparas, P.: Opinion maximization in social networks. In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 387–395. SIAM (2013)
7. Gurobi Optimization, LLC: Gurobi optimizer reference manual (2021)
8. Leskovec, J., Sosič, R.: SNAP: a general-purpose network analysis and graph-mining library. ACM Trans. Intell. Syst. Technol. (TIST) 8(1), 1 (2016)
9. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, Dordrecht (2004)
10. Shang, Y.: Group pinning consensus under fixed and randomly switching topologies with acyclic partition. Netw. Heterog. Media 9(3), 553–573 (2014)
11. Wu, H.-H., Küçükyavuz, S.: A two-stage stochastic programming approach for influence maximization in social networks. Comput. Optim. Appl. 69(3), 563–595 (2017). https://doi.org/10.1007/s10589-017-9958-x
Correction to: Complex Networks & Their Applications X Rosa Maria Benito, Chantal Cherifi, Hocine Cherifi, Esteban Moro, Luis M. Rocha, and Marta Sales-Pardo
Correction to: R. M. Benito et al. (eds.): Complex Networks & Their Applications X, SCI 1072, https://doi.org/10.1007/978-3-030-93409-5
In the original version of the book, the following belated correction has been incorporated: The volume number has been changed from 1015 to 1072 in the Frontmatter, Backmatter and Chapter opening pages. The book and the chapter have been updated with the change.
The updated original version of the book can be found at https://doi.org/10.1007/978-3-030-93409-5 © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, p. C1, 2022. https://doi.org/10.1007/978-3-030-93409-5_72
Author Index
A Abraham, Linda, 50 Achterberg, Massimo A., 607 Adiga, Abhijin, 168 Aghasaryan, Armen, 475 Alford, Simon, 657 Antonioni, Alberto, 780 Antonov, Andrey, 376 Apolloni, Andrea, 168 Arastuie, Makan, 593 Archambault, Daniel, 297 Ashouri, Mahsa, 424 B Bacao, Fernando, 116 Bachar, Liav, 247 Baek, Young Yun, 168 Banburski, Andrzej, 657 Baptista, Diego, 578 Barla, Annalisa, 82 Bar-Noy, Amotz, 3 Bayram, Eda, 155 Boekhout, Hanjo D., 142 Boldi, Paolo, 234 Bonato, Anthony, 50 Bracke, Piet, 514 Brede, Markus, 693, 844 Brévault, Thierry, 168 Brienen, Marten, 721 Brissette, Christopher, 451 C Cai, Zhongqi, 844 Canbaloğlu, Gülay, 411
Carchiolo, Vincenza, 321 Casadio, Maura, 82 Cazabet, Rémy, 566 Chang, Brian, 607 Chebotarev, Pavel, 328 Chen, Po-An, 867 Cheng, Ya-Wen, 867 Cherifi, Hocine, 342 Chiappori, Alessandro, 566 Chiaromonte, Francesca, 744 Chin, John, 657 Chin, Peter, 657, 831 Chunaev, Petr, 376 Clark, Ruaridh A., 27 Coffman, Ian, 73 Coquidé, Célestin, 183 Courtain, Sylvain, 220 Cunningham, Eoghan, 104 D Dandekar, Sylee, 657 Danovski, Kaloyan, 693 Dave, Aniruddha, 168 De Bacco, Caterina, 578 de Cabo, Ruth Mateos, 756 De Collibus, Francesco Maria, 792 de Freitas Pereira, Everson José, 16 Deng, Xinwei, 644 Do, Hung N., 527 Dondi, Riccardo, 553 Dragan, Feodor F., 194
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2021, SCI 1072, pp. 879–881, 2022. https://doi.org/10.1007/978-3-030-93409-5
E Elyashar, Aviad, 247 Esposito, Christian, 744 Evmenova, Elizaveta, 260, 376 F Fagiolo, Giorgio, 744 Furia, Flavio, 234 Fushimi, Takayasu, 501 G Gabaldon, Patricia, 756 Gabrys, Bogdan, 514 Gancio, Juan, 309 Gandhi, Anshula, 657 Gao, Yuan, 62 Garbarino, Davide, 82 Gerding, Enrico, 844 Giannotti, Fosca, 130 Gimeno, Ricardo, 756 Gortan, Marco, 744 Goubert, Liesbet, 514 Grassia, Marco, 321, 369 Grau, Pilar, 756 Greene, Derek, 104, 297 Greige, Laura, 831 Gromov, Dmitry, 260 Guarnera, Heather M., 194 H Halim, Nafisa, 681 Hancock, Matthew, 681 Heemskerk, Eelke M., 142 Herms, Katrin, 705 Hilsabeck, Tanner, 593 Hirahara, Kazuro, 487 Hosseinzadeh, Mohammad Mehdi, 553 Hovanesyan, Arthur, 732 Howald, Blake, 73 Hozhabrierdi, Pegah, 354 Hu, Zhihao, 644 Huidobro, Jaime Oliver, 780 I Ivashkin, Vladimir, 328 J Jaber, Ali, 342 Jakovljevic, Luka, 475 Jia, Mingshan, 514 Jiang, Bangxin, 807 John, Jisha Mariyam, 857 Jung, Hohyun, 424
K
Kim, Jisu, 130
Kostadinov, Dimitre, 475
Krishnagopal, Sanjukta, 669
Kuhlman, Chris J., 644, 681
Kumar, Purushottam, 388
Kundu, Suman, 271

L
Lazarte, Daniel Perez, 168
Lekha, Divya Sindhu, 857
Li, Gen, 16, 39
Lipari, Francesca, 780
Liu, Weiguang, 16, 39
Lu, Jianquan, 807
Lyra, Marcos S., 116

M
Macdonald, Malcolm, 27
Malgeri, Michele, 321
Mangioni, Giuseppe, 321, 369
Marathe, Achla, 681
Marathe, Madhav, 168
Martin, Dustin, 73
Maulana, Ardian, 94
Mayerhoffer, Daniel M., 768
McGrath, Ciara N., 27
Meyer, François G., 207
Mieghem, Piet Van, 607
Miletic, Katarina, 619
Mina, Andrea, 744
Mirkin, Boris, 285
Mironov, Sergei, 463
Mohammed, Abdulhakeem O., 194
Moretti, Paolo, 82
Moro, Matteo, 82
Mortveit, Henning, 168
Moshiri, Mahdi, 817
Mozumder, Pallab, 681
Musial, Katarzyna, 514
Mykhailova, Oleksandra, 619

N
Naito, Soshi, 501
Nazareth, Alexander, 50
Novick, Yitzchak, 3

O
Odone, Francesca, 82

P
Palmer, Nicholas, 168
Palpanas, Themis, 475
Parjanya, Rohith, 271
Partida, Alberto, 792
Phoa, Frederick Kin Hing, 424
Pinheiro, Flávio L., 116
Piškorec, Matija, 792
Poenaru-Olaru, Lorena, 732
Poggio, Tomaso, 657
Puzis, Rami, 247

Q
Queiros, Julie, 183
Queyroi, François, 183

R
Rajeh, Stephany, 342
Rangamani, Akshay, 657
Ravi, S. S., 681
Redi, Judith, 732
Rinaldi, Marco, 607
Rossetti, Giulio, 130, 744
Roth, Camille, 705
Rubido, Nicolás, 309

S
Sadler, Sophie, 297
Saerens, Marco, 220
Safaei, Farshad, 817
Saito, Kazumi, 487
Sakiyama, Tomoko, 405
Santini, Guillaume, 539
Schoeneman, John, 721
Schulz, Jan, 768
Scripps, Jerry, 438
Sensi, Mattia, 607
Shalileh, Soroosh, 285
Sharma, Dolly, 388
Sidorov, Sergei, 463
Sinha, Sanchit, 168
Sîrbu, Alina, 130
Situngkir, Hokky, 94
Slota, George, 451
Smyth, Barry, 104
Soldano, Henry, 539
Soundarajan, Sucheta, 354
Stavinova, Elizaveta, 376
St-Onge, Jonathan, 705
Sugawara, Toshiharu, 632
Suroso, Rendra, 94

T
Tacchino, Chiara, 82
Takes, Frank W., 142
Tamarit, Ignacio, 780
Testa, Lorenzo, 744
Toriumi, Fujio, 632
Treur, Jan, 411, 619
Tseng, Yao-Wei, 867
Tyshkevich, Sergei, 463

U
Ueda, Naonori, 487
Usui, Yutaro, 632

V
Van Alboom, Maité, 514
Vigna, Sebastiano, 234
Vullikanti, Anil, 681

W
Waghalter, Penina, 168
Wang, Fenghua, 607
Wang, Huijuan, 732
Wang, Tony, 657

X
Xu, Kevin S., 527, 593

Y
Yamagishi, Yuki, 487
Yan, Jianglong, 16, 39
Yang, Liufei, 607
Yassin, Ali, 342

Z
Zevio, Stella, 539
Zhao, Liang, 16, 39
Zheng, Qiusheng, 16, 39
Zhu, Yu-tao, 16, 39
Zhu, Zhen, 62