English Pages 743 [729] Year 2021
Studies in Computational Intelligence 944
Rosa M. Benito · Chantal Cherifi · Hocine Cherifi · Esteban Moro · Luis Mateus Rocha · Marta Sales-Pardo Editors
Complex Networks & Their Applications IX Volume 2, Proceedings of the Ninth International Conference on Complex Networks and Their Applications COMPLEX NETWORKS 2020
Studies in Computational Intelligence Volume 944
Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/7092
Editors
Rosa M. Benito, Grupo de Sistemas Complejos, Universidad Politécnica de Madrid, Madrid, Madrid, Spain
Chantal Cherifi, IUT Lumière, University of Lyon, Bron Cedex, France
Hocine Cherifi, LIB, UFR Sciences et Techniques, Université de Bourgogne, Dijon, France
Esteban Moro, Grupo Interdisciplinar de Sistemas Complejos, Departamento de Matematicas, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
Luis Mateus Rocha, Center for Social and Biomedical Complexity, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
Marta Sales-Pardo, Department of Chemical Engineering, Universitat Rovira i Virgili, Tarragona, Tarragona, Spain
ISSN 1860-949X
ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-030-65350-7
ISBN 978-3-030-65351-4 (eBook)
https://doi.org/10.1007/978-3-030-65351-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This 2020 edition of the International Conference on Complex Networks and their Applications is the ninth of a series that began in 2011. Over the years, this adventure has made the conference one of the major international events in network science. Network science brings together a scientific community from various fields such as finance and economy, medicine and neuroscience, biology and earth sciences, sociology and politics, computer science and physics. The variety of scientific topics ranges from network theory, network models, network geometry, community structure, network analysis and measures, link analysis and ranking, resilience and control, machine learning and networks, dynamics on/of networks, and diffusion and epidemics to visualization. It is also worth mentioning some recent applications with high added value for current social concerns, such as social and urban networks, human behavior, urban systems and mobility, or quantifying success.

The conference brings together researchers who study the world through the lens of networks. Catalyzing the efforts of this scientific community, it drives network science to generate cross-fertilization between fundamental issues and innovative applications, review the current state of the field and promote future research directions.

Every year, researchers from all over the world gather in our host venue. This year’s edition was initially to be hosted in Spain by Universidad Politécnica de Madrid. Unfortunately, the COVID-19 global health crisis forced us to organize the conference as a fully online event. Nevertheless, this edition attracted numerous authors, with 323 submissions from 51 countries. The papers selected for the volumes of proceedings clearly reflect the multiple aspects of complex network issues as well as the high quality of the contributions.

All the submissions were peer-reviewed by three independent reviewers from our strong international program committee. This ensured high-quality contributions as well as compliance with the conference topics. After the review process, 112 papers were selected to be included in the proceedings.
Undoubtedly, the success of this edition relied on the authors who produced high-quality papers, as well as on the impressive list of keynote speakers who delivered fascinating plenary lectures:

– Leman Akoglu (Carnegie Mellon University, USA): “Graph-Based Anomaly Detection: Problems, Algorithms and Applications”
– Stefano Boccaletti (Florence University, Italy): “Synchronization in Complex Networks, Hypergraphs and Simplicial Complexes”
– Fosca Giannotti (KDD Lab, Pisa, Italy): “Explainable Machine Learning for Trustworthy AI”
– János Kertész (Central European University, Hungary): “Possibilities and Limitations of using mobile phone data in exploring human behavior”
– Vito Latora (Queen Mary, University of London, UK): “Simplicial model of social contagion”
– Alex “Sandy” Pentland (MIT Media Lab, USA): “Human and Optimal Networked Decision Making in Long-Tailed and Non-stationary Environments”
– Nataša Pržulj (Barcelona Supercomputing Center, Spain): “Untangling biological complexity: From omics network data to new biomedical knowledge and Data-Integrated Medicine”

The topics addressed in the keynote talks allowed a broad coverage of the issues encountered in complex networks and their applications to complex systems.

For the traditional tutorial sessions prior to the conference, our two invited speakers delivered insightful talks. David Garcia (Complexity Science Hub Vienna, Austria) gave a lecture entitled “Analyzing complex social phenomena through social media data,” and Mikko Kivela (Aalto University, Finland) delivered a talk on “Multilayer Networks.”

Each edition of the conference represents a challenge that cannot be met without the deep involvement of many people, institutions and sponsors. First of all, we sincerely thank our advisory board members, Jon Crowcroft (University of Cambridge, UK), Raissa D’Souza (University of California, Davis, USA), Eugene Stanley (Boston University, USA) and Ben Y. Zhao (University of Chicago, USA), for inspiring the essence of the conference.

We record our thanks to our fellow members of the organizing committee. José Fernando Mendes (University of Aveiro, Portugal), Jesús Gomez Gardeñes (University of Zaragoza, Spain) and Huijuan Wang (TU Delft, Netherlands) chaired the lightning sessions. Manuel Marques Pita (Universidade Lusófona, Portugal), José Javier Ramasco (IFISC, Spain) and Taha Yasseri (University of Oxford, UK) managed the poster sessions. Luca Maria Aiello (Nokia-Bell Labs, UK) and Leto Peel (Université Catholique de Louvain, Belgium) were our tutorial chairs. Finally, Sabrina Gaito (University of Milan, Italy) and Javier Galeano (Universidad Politécnica de Madrid, Spain) were our satellite chairs.
We extend our thanks to Benjamin Renoust (Osaka University, Japan), Michael Schaub (MIT, USA), Andreia Sofia Teixeira (Indiana University, USA) and Xiangjie Kong (Dalian University of Technology, China), the publicity chairs, for advertising the conference in America, Asia and Europe, thereby encouraging participation. We would also like to acknowledge Regino Criado (Universidad Rey Juan Carlos, Spain) as well as Roberto Interdonato (CIRAD - UMR TETIS, Montpellier, France), our sponsor chairs. Our deep thanks go to Matteo Zignani (University of Milan, Italy), publication chair, for the tremendous work he has done in managing the submission system and the proceedings publication process. Thanks to Stephany Rajeh (University of Burgundy, France), Web chair, for maintaining the website.

We would also like to record our appreciation for the work of the local committee chair, Juan Carlos Losada (Universidad Politécnica de Madrid, Spain), and all the local committee members, David Camacho (UPM, Spain), Fabio Revuelta (UPM, Spain), Juan Manuel Pastor (UPM, Spain), Francisco Prieto (UPM, Spain), Leticia Perez Sienes (UPM, Spain), Jacobo Aguirre (CSIC, Spain) and Julia Martinez-Atienza (UPM, Spain), for their work in managing online sessions. They contributed greatly to the success of this edition.

We are also indebted to our partners, Alessandro Fellegara and Alessandro Egro from Tribe Communication, for their passion and patience in designing the visual identity of the conference. We would like to express our gratitude to our partner journals involved in the sponsoring of keynote talks: Applied Network Science, EPJ Data Science, Social Network Analysis and Mining, and Entropy.

More generally, we are thankful to all those who have helped contribute to the success of this meeting. Sincere thanks go to the contributors; the success of the technical program would not have been possible without their creativity. Finally, we would like to express our most sincere thanks to the program committee members for their huge efforts in producing high-quality reviews in a very limited time.

These volumes present the most advanced contributions of the international community to the research issues surrounding the fascinating world of complex networks. Their breadth, quality and novelty signal how profound a role complex networks play in our understanding of our world. We hope that you will enjoy reading the papers as much as we enjoyed organizing the conference and putting this collection of papers together.

Rosa M. Benito
Hocine Cherifi
Chantal Cherifi
Esteban Moro
Luis Mateus Rocha
Marta Sales-Pardo
Organization

Organization and Committees

General Chairs
Rosa M. Benito, Universidad Politécnica de Madrid, Spain
Hocine Cherifi, University of Burgundy, France
Esteban Moro, Universidad Carlos III, Spain

Advisory Board
Jon Crowcroft, University of Cambridge, UK
Raissa D’Souza, Univ. of California, Davis, USA
Eugene Stanley, Boston University, USA
Ben Y. Zhao, University of Chicago, USA

Program Chairs
Chantal Cherifi, University of Lyon, France
Luis M. Rocha, Indiana University, USA
Marta Sales-Pardo, Universitat Rovira i Virgili, Spain

Satellite Chairs
Sabrina Gaito, University of Milan, Italy
Javier Galeano, Universidad Politécnica de Madrid, Spain

Lightning Chairs
José Fernando Mendes, University of Aveiro, Portugal
Jesús Gomez Gardeñes, University of Zaragoza, Spain
Huijuan Wang, TU Delft, The Netherlands
Poster Chairs
Manuel Marques-Pita, University Lusófona, Portugal
José Javier Ramasco, IFISC, Spain
Taha Yasseri, University of Oxford, UK

Publicity Chairs
Benjamin Renoust, Osaka University, Japan
Andreia Sofia Teixeira, University of Lisbon, Portugal
Michael Schaub, MIT, USA
Xiangjie Kong, Dalian University of Technology, China

Tutorial Chairs
Luca Maria Aiello, Nokia-Bell Labs, UK
Leto Peel, UC Louvain, Belgium

Sponsor Chairs
Roberto Interdonato, CIRAD - UMR TETIS, France
Regino Criado, Universidad Rey Juan Carlos, Spain

Local Committee Chair
Juan Carlos Losada, Universidad Politécnica de Madrid, Spain

Local Committee
Jacobo Aguirre, CSIC, Spain
David Camacho, UPM, Spain
Julia Martinez-Atienza, UPM, Spain
Juan Manuel Pastor, UPM, Spain
Leticia Perez Sienes, UPM, Spain
Francisco Prieto, UPM, Spain
Fabio Revuelta, UPM, Spain

Publication Chair
Matteo Zignani, University of Milan, Italy
Web Chair
Stephany Rajeh, University of Burgundy, France
Program Committee
Jacobo Aguirre, Centro Nacional de Biotecnología, Spain
Amreen Ahmad, Jamia Millia Islamia, India
Masaki Aida, Tokyo Metropolitan University, Japan
Luca Maria Aiello, Nokia-Bell Labs, UK
Marco Aiello, University of Stuttgart, Germany
Esra Akbas, Oklahoma State University, USA
Mehmet Aktas, University of Central Oklahoma, USA
Tatsuya Akutsu, Kyoto University, Japan
Reka Albert, The Pennsylvania State University, USA
Aleksandra Aloric, Institute of Physics Belgrade, Serbia
Claudio Altafini, Linköping University, Sweden
Benjamin Althouse, New Mexico State University, USA
Lucila G. Alvarez-Zuzek, IFIMAR-UNMdP, Argentina
Luiz G. A. Alves, Northwestern University, USA
Enrico Amico, Swiss Federal Institute of Technology in Lausanne, Switzerland
Hamed Amini, Georgia State University, USA
Chuankai An, Dartmouth College, USA
Marco Tulio Angulo, National Autonomous University of Mexico (UNAM), Mexico
Demetris Antoniades, RISE - Research Center, Cyprus
Alberto Antonioni, Carlos III University of Madrid, Spain
Nino Antulov-Fantulin, ETH Zurich, Switzerland
Nuno Araujo, Universidade de Lisboa, Portugal
Elsa Arcaute, University College London, UK
Laura Arditti, Polytechnic of Turin, Italy
Samin Aref, Max Planck Institute for Demographic Research, Germany
Panos Argyrakis, Aristotle University of Thessaloniki, Greece
Malbor Asllani, University of Limerick, Ireland
Tomaso Aste, University College London, UK
Martin Atzmueller, Tilburg University, The Netherlands
Konstantin Avrachenkov, Inria, France
Jean-Francois Baffier, National Institute of Informatics, Japan
Giacomo Baggio, University of Padova, Italy
Rodolfo Baggio, Bocconi University, Italy
Franco Bagnoli, University of Florence, Italy
Annalisa Barla, Università di Genova, Italy
Paolo Barucca, University College London, UK
Anastasia Baryshnikova, Calico Life Sciences, USA
Nikita Basov, St. Petersburg State University, Russia
Gareth Baxter, University of Aveiro, Portugal
Marya Bazzi, University of Oxford, UK
Mariano Beguerisse Diaz, University of Oxford, UK
Andras A. Benczur, Hungarian Academy of Sciences, Hungary
Rosa M. Benito, Universidad Politécnica de Madrid, Spain
Luis Bettencourt, University of Chicago, USA
Ginestra Bianconi, Queen Mary University of London, UK
Ofer Biham, The Hebrew University of Jerusalem, Israel
Livio Bioglio, University of Turin, Italy
Hanjo Boekhout, Leiden University, The Netherlands
Johan Bollen, Indiana University Bloomington, USA
Christian Bongiorno, Università degli Studi di Palermo, Italy
Anton Borg, Blekinge Institute of Technology, Sweden
Stefan Bornholdt, Universität Bremen, Germany
Federico Botta, The University of Warwick, UK
Alexandre Bovet, Université Catholique de Louvain-la-Neuve, Belgium
Dan Braha, NECSI, USA
Ulrik Brandes, ETH Zürich, Switzerland
Markus Brede, University of Southampton, UK
Marco Bressan, Sapienza University of Rome, Italy
Piotr Bródka, Wroclaw University of Science and Technology, Poland
Javier M. Buldu, Universidad Rey Juan Carlos, Spain
Raffaella Burioni, Università di Parma, Italy
Fabio Caccioli, University College London, UK
Rajmonda Caceres, Massachusetts Institute of Technology, USA
Carmela Calabrese, University of Naples Federic, Italy
Paolo Campana, University of Cambridge, UK
M. Abdullah Canbaz, Indiana University Kokomo, USA
Carlo Vittorio Cannistraci, TU Dresden, Germany
Vincenza Carchiolo, Universita di Catania, Italy
Giona Casiraghi, ETH Zurich, Switzerland
Douglas Castilho, Federal University of Minas Gerais, Brazil
Costanza Catalano, Gran Sasso Science Institute, Belgium
Remy Cazabet, Lyon University, France
David Chavalarias, CNRS, CAMS/ISC-PIF, France
Kwang-Cheng Chen, University of South Florida, USA
Po-An Chen, National Chiao Tung University, Taiwan
Xihui Chen, University of Luxembourg, Luxembourg
Xueqi Cheng, Institute of Computing Technology, China
Chantal Cherifi, Lyon 2 University, France
Hocine Cherifi, University of Burgundy, France
Peter Chin, Boston University, USA
Matteo Chinazzi, Northeastern University, USA
Matteo Cinelli, University of Rome “Tor Vergata,” Italy
Richard Clegg, Queen Mary University of London, UK
Reuven Cohen, Bar-Ilan University, Israel
Alessio Conte, University of Pisa, Italy
Marco Coraggio, University of Naples Federico II, Italy
Michele Coscia, IT University of Copenhagen, Denmark
Clementine Cottineau, CNRS, Centre Maurice Halbwachs, France
Regino Criado, Universidad Rey Juan Carlos, Spain
Mihai Cucuringu, University of Oxford and The Alan Turing Institute, USA
Marcelo Cunha, IFBA, Brazil
Giulio Valentino Dalla Riva, University of Canterbury, New Zealand
Kareem Darwish, Qatar Computing Research Institute, Qatar
Bhaskar Dasgupta, University of Illinois, Chicago, USA
Joern Davidsen, University of Calgary, Canada
Toby Davies, University College London, UK
Pasquale De Meo, Vrije Universiteit Amsterdam, Italy
Fabrizio De Vico Fallani, Inria - ICM, France
Charo I. del Genio, Coventry University, UK
Pietro Delellis, University of Naples Federico II, Italy
Jean-Charles Delvenne, University of Louvain, Belgium
Yong Deng, Xi’an Jiaotong University, China
Bruce Desmarais, The Pennsylvania State University, USA
Patrick Desrosiers, Université Laval, Canada
Riccardo Di Clemente, University of Exeter, UK
Matías Di Muro, Universidad Nacional de Mar del Plata-CONICET, Argentina
Jana Diesner, University of Illinois at Urbana-Champaign, USA
Shichang Ding, University of Goettingen, Germany
Linda Douw, Amsterdam UMC, The Netherlands
Johan Dubbeldam, University of Technology, The Netherlands
Jordi Duch, Universitat Rovira i Virgili, Spain
Kathrin Eismann, University of Bamberg, Germany
Mohammed El Hassouni, Mohammed V University in Rabat, Morocco
Andrew Elliott, University of Oxford, UK
Michael T. M. Emmerich, Leiden University, The Netherlands
Frank Emmert-Streib, Tampere University of Technology, Finland
Gunes Ercal, SIUE, USA
Alexandre Evsukoff, COPPE/UFRJ, Brazil
Mauro Faccin, Université Catholique de Louvain, Belgium
Sofia Fernandes, Laboratory of Artificial Intelligence and Decision Support, Portugal
Guilherme Ferraz de Arruda, ISI Foundation, Italy
Daniel Figueiredo, COPPE/UFRJ, Brazil
Jorge Finke, Pontificia Universidad Javeriana, Colombia
Marco Fiore, IMDEA Networks Institute, Spain
Alessandro Flammini, Indiana University Bloomington, USA
Manuel Foerster, Bielefeld University, Germany
Barbara Franci, Delft University of Technology, The Netherlands
Diego Função, University of São Paulo, Brazil
Angelo Furno, Univ. Lyon, University Gustave Eiffel, France
Sabrina Gaito, University of Milan, Italy
Lazaros Gallos, Rutgers University, USA
José Manuel Galán, Universidad de Burgos, Spain
Joao Gama, University of Porto, Portugal
Yerali Gandica, Université Catholique de Louvain, Belgium
Jianxi Gao, Rensselaer Polytechnic Institute, USA
David Garcia, Medical University of Vienna and Complexity Science Hub, Austria
Federica Garin, Inria, France
Michael Gastner, Yale-NUS College, Singapore
Alexander Gates, Northeastern University, USA
Vincent Gauthier, Telecom SudParis/Institut Mines Telecom, France
Raji Ghawi, Technical University of Munich, Germany
Tommaso Gili, IMT School for Advanced Studies, Italy
Silvia Giordano, SUPSi, Switzerland
Rosalba Giugno, University of Verona, Italy
David Gleich, Purdue University, USA
Antonia Godoy, Rovira i Virgily University, Spain
Kwang-Il Goh, Korea University, South Korea
Jaime Gomez, Universidad Politécnica de Madrid, Spain
Jesus Gomez-Gardenes, Universidad de Zaragoza, Spain
Antonio Gonzalez, Universidad Autónoma de Madrid, Spain
Bruno Gonçalves, New York University, USA
Joana Gonçalves-Sá, Nova School of Business and Economics, Portugal
Przemyslaw Grabowicz, University of Massachusetts, Amherst, USA
Carlos Gracia-Lázaro, BIFI, Spain
Justin Gross, UMass Amherst, USA
Jelena Grujic, Vrije Universiteit Brussel, Belgium
Jean-Loup Guillaume, L3i - Université de la Rochelle, France
Mehmet Gunes, Stevens Institute of Technology, USA
Sergio Gámez, Universitat Rovira i Virgili, Spain
Meesoon Ha, Chosun University, South Korea
Jürgen Hackl, University of Liverpool, Switzerland
Edwin Hancock, University of York, UK
Chris Hankin, Imperial College London, UK
Jin-Kao Hao, University of Angers, France
Heather Harrington, University of Oxford, UK
Yukio Hayashi, Japan Advanced Institute of Science and Technology, Japan
Mark Heimann, University of Michigan, USA
Torsten Heinrich, University of Oxford, Germany
Denis Helic, Graz University of Technology, Austria
Chittaranjan Hens, Indian Institute of Chemical Biology, India
Laura Hernandez, Université de Cergy-Pontoise, France
Samuel Heroy, University of Oxford, UK
Takayuki Hiraoka, Aalto University, Finland
Philipp Hoevel, University College Cor, Ireland
Petter Holme, Tokyo Institute of Technology, Japan
Seok-Hee Hong, University of Sydney, Australia
Ulrich Hoppe, University Duisburg-Essen, Germany
Yanqing Hu, Sun Yat-sen University, China
Flavio Iannelli, Humboldt University, Germany
Yuichi Ikeda, Kyoto University, Japan
Roberto Interdonato, CIRAD - UMR TETIS, France
Giulia Iori, City, University of London, UK
Antonio Iovanella, University of Rome Tor Vergata, Italy
Gerardo Iñiguez, Central European University, Hungary
Sarika Jalan, IIT Indore, India
Mahdi Jalili, RMIT University, Australia
Jaroslaw Jankowski, West Pomeranian University of Technology, Poland
Marco Alberto Javarone, Coventry University, UK
Hawoong Jeong, Korea Advanced Institute of Science and Technology, South Korea
Tao Jia, Southwest University, China
Chunheng Jiang, Rensselaer Polytechnic Institute, USA
Ming Jiang, University of Illinois at Urbana-Champaign, USA
Di Jin, Tianjin University, China
Di Jin, University of Michigan, USA
Ivan Jokić, Delft University of Technology, The Netherlands
Bertrand Jouve, CNRS, France
Jason Jung, Chung-Ang University, South Korea
Marko Jusup, Tokyo Institute of Technology, Japan
Arkadiusz Jędrzejewski, Wrocław University of Science and Technology, Poland
Byungnam Kahng, Seoul National University, South Korea
Rushed Kanawati, Université Paris 13, France
Rowland Kao, University of Edinburgh, UK
Márton Karsai, ENS de Lyon, France
Eytan Katzav, The Hebrew University of Jerusalem, Israel
Mehmet Kaya, Firat University, Turkey
Domokos Kelen, Institute for Computer Science and Control, Hungary
Dror Kenett, Johns Hopkins University, USA
Yoed Kenett, University of Pennsylvania, USA
Janos Kertesz, Central European University, Hungary
Mohammad Khansari, University of Tehran, Iran
Hamamache Kheddouci, Université Claude Bernard, France
Hyoungshick Kim, Sungkyunkwan University, South Korea
Jinseok Kim, University of Michigan, USA
Maksim Kitsak, Northeastern University, USA
Mikko Kivela, Aalto University, Finland
Konstantin Klemm, IFISC (CSIC-UIB), Spain
Peter Klimek, Medical University of Vienna, Austria
Dániel Kondor, SMART, Singapore
Xiangjie Kong, Dalian University of Technology, China
Ismo Koponen, University of Helsinki, Finland
Onerva Korhonen, Université de Lille, Finland
Jan Kralj, Jozef Stefan Institute, Slovenia
Reimer Kuehn, King’s College London, UK
Prosenjit Kundu, National Institute of Technology Durgapur, India
Ryszard Kutner, University of Warsaw, Poland
Haewoon Kwak, Qatar Computing Research Institute, Qatar
Richard La, University of Maryland, USA
Hemank Lamba, Carnegie Mellon University, USA
Renaud Lambiotte, University of Oxford, UK
Aniello Lampo, UOC, Spain
Christine Largeron, Université de Lyon, France
Jennifer Larson, New York University, USA
Anna T. Lawniczak, University of Guelph, Ontario, Canada
Eric Leclercq, University of Burgundy, France
Deok-Sun Lee, Inha University, South Korea
Sune Lehmann, Technical University of Denmark, Denmark
Balazs Lengyel, Hungarian Academy of Sciences, Hungary
Juergen Lerner, University of Konstanz, Germany
Fabrizio Lillo, University of Bologna, Italy
Ji Liu, Stony Brook University, USA
Yang-Yu Liu, Harvard University, USA
Giacomo Livan, University College London, UK
Lorenzo Livi, University of Manitoba, Canada
Alessandro Longheu, University of Catania, Italy
Laura Lotero, Universidad Pontificia Bolivariana, Colombia
Meilian Lu, Beijing University of Posts and Telecom, China
John C. S. Lui, The Chinese University of Hong Kong, Hong Kong
Leonardo Maccari, University of Venice, Italy
Matteo Magnani, Uppsala University, Sweden
Cécile Mailler, UVSQ, France
Nishant Malik, Rochester Institute of Technology, USA
Fragkiskos Malliaros, University of Paris-Saclay, France
Noel Malod-Dognin, University College London, UK
Giuseppe Mangioni, University of Catania, Italy
Ed. Manley, University of Leeds, UK
Rosario Nunzio Mantegna, Palermo University, Italy
Madhav Marathe, University of Virginia, USA
Manuel Sebastian Mariani, University of Zurich, Switzerland
Radek Marik, Czech Technical University, Czechia
Andrea Marino, University of Florence, Italy
Antonio Marques, Universidad Rey Juan Carlos, Spain
Manuel Marques-Pita, Universidade Lusofona, Portugal
Christoph Martin, Leuphana University of Lüneburg, Germany
Cristina Masoller, Universitat Politècnica de Catalunya, Spain
Emanuele Massaro, Ecole Polytechnique Fédérale de Lausanne, Switzerland
Rossana Mastrandrea, IMT Institute of Advanced Studies, Italy
John Matta, SIUE, USA
Arya McCarthy, Johns Hopkins University, USA
Fintan Mcgee, Gabriel Lippmann Public Research Centre, Ireland
Matúš Medo, University of Electronic Science and Technology of China, China
Jörg Menche, CeMM of the Austrian Academy of Sciences, Austria
Jose Fernando Mendes, University of Aveiro, Portugal
Ronaldo Menezes, University of Exeter, UK
Humphrey Mensah, Syracuse University, USA
Anke Meyer-Baese, FSU, USA
Radosław Michalski, Wrocław University of Science and Technology, Poland
Tijana Milenkovic, University of Notre Dame, USA
Letizia Milli, University of Pisa, Italy
Andreea Minca, Cornell University, USA
Shubhanshu Mishra, University of Illinois at Urbana-Champaign, USA
Bivas Mitra, Indian Institute of Technology Kharagpur, India
Marija Mitrovic, Institute of Physics Belgrade, Serbia
Andrzej Mizera, University of Luxembourg, Luxembourg
Osnat Mokryn, University of Haifa, Israel
Roland Molontay, Budapest University of Technology and Economics, Hungary
Raul Mondragon, Queen Mary University of London, UK
Misael Mongiovì, Consiglio Nazionale delle Ricerche, Italy
Andres Moreira, Universidad Tecnica Federico Santa Maria, Chile
Paolo Moretti, Friedrich-Alexander-University Erlangen-Nürnberg, Germany
Esteban Moro, Universidad Carlos III de Madrid, Spain
Greg Morrison, University of Houston, USA
Sotiris Moschoyiannis, University of Surrey, UK
Elisha Moses, The Weizmann Institute of Science, Israel
Igor Mozetič, Jozef Stefan Institute, Slovenia
Animesh Mukherjee, Indian Institute of Technology, India
Masayuki Murata, Osaka University, Japan
Tsuyoshi Murata, Tokyo Institute of Technology, Japan
Alessandro Muscoloni, TU Dresden, Germany
Matthieu Nadini, New York University, Italy
Zachary Neal, Michigan State University, USA
Muaz Niazi, COMSATS Institute of IT, Pakistan
Rolf Niedermeier, TU Berlin, Germany
Peter Niemeyer, Leuphana Universität Lüneburg, Germany
Jordi Nin, Universitat Ramon Llull, Spain
Rogier Noldus, Ericsson, The Netherlands
El Faouzi Nour-Eddin, IFSTTAR, France
Neave O’Clery, University College London, UK
Masaki Ogura, Nara Institute of Science and Technology, Japan
Marcos Oliveira, Leibniz Institute for the Social Sciences, USA
Andrea Omicini, Università degli Studi di Bologna, Italy
Luis Ospina-Forero, University of Manchester, UK
Gergely Palla, Statistical and Biological Physics Research Group of HAS, Hungary
Pietro Panzarasa, Queen Mary University of London, UK
Fragkiskos Papadopoulos, Cyprus University of Technology, Cyprus
Symeon Papadopoulos, Information Technologies Institute, Greece
Michela Papandrea, SUPSI, Switzerland
Francesca Parise, MIT, USA
Han Woo Park, YeungNam University, South Korea
Juyong Park, KAIST, South Korea
Fabio Pasqualetti, UC Riverside, USA
Leto Peel, Universite Catholique de Louvain, Belgium
Tiago Peixoto, Central European University and ISI Foundation, Germany
Matjaz Perc, University of Maribor, Slovenia
Hernane Pereira, UEFS and SENAI CIMATEC, Brazil
Lilia Perfeito, Nova SBE, Portugal
Chiara Perillo, University of Zurich, Switzerland
Giovanni Petri, ISI Foundation, Italy
Jürgen Pfeffer, Technical University of Munich, Germany
Carlo Piccardi, Politecnico di Milano, Italy
Flavio Pinheiro, Universidade NOVA de Lisboa, USA
Clara Pizzuti, CNR-ICAR, Italy
Chiara Poletto, Sorbonne University, France
Maurizio Porfiri, New York University Tandon School of Engineering, USA
Pawel Pralat, Ryerson University, Canada
Victor Preciado, University of Pennsylvania, USA
Natasa Przulj, University College London, Spain
Oriol Pujol, University of Barcelona, Spain
Rami Puzis, Ben Gurion University of the Negev, Israel
Christian Quadri, University of Milan, Italy
Marco Quaggiotto, ISI Foundation, Italy
Filippo Radicchi, Northwestern University, USA
Tomasz Raducha, Faculty of Physics University of Warsaw, Poland
Jose J. Ramasco, IFISC (CSIC-UIB), Spain
Felix Reed-Tsochas, University of Oxford, UK
Gesine Reinert, University of Oxford, UK
Benjamin Renoust, Osaka University, Japan
Daniel Rhoads, Universitat Oberta de Catalunya, Spain
Pedro Ribeiro, University of Porto, Portugal
Massimo Riccaboni, IMT Institute for Advanced Studies, Italy
Laura Ricci, University of Pisa, Italy
Alessandro Rizzo, Politecnico di Torino, Italy
Celine Robardet, INSA Lyon, France
Luis E. C. Rocha, Ghent University, Belgium
Luis M. Rocha, Indiana University Bloomington, USA
Francisco Rodrigues, University of São Paulo, Brazil
Fernando Rosas, Imperial College London, UK
Giulio Rossetti, KDD Lab ISTI-CNR, Italy
Camille Roth, CNRS, Germany
Celine Rozenblat, University of Lausanne Institut de Géographie, Switzerland
Giancarlo Ruffo, Università di Torino, Italy
Meead Saberi, UNSW, Australia
Ali Safari, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
Marta Sales-Pardo, Universitat Rovira i Virgili, Spain
Arnaud Sallaberry, Université Paul Valéry Montpellier 3, France
Iraj Saniee, Bell Labs, Alcatel-Lucent, USA
Francisco C. Santos, Universidade de Lisboa, Portugal
Jari Saramäki, Aalto University, Finland
Koya Sato, University of Tsukuba, Japan
Hiroki Sayama, Binghamton University, USA
Antonio Scala, Italian National Research Council, Italy
Michael Schaub, University of Oxford, UK
Maximilian Schich, The University of Texas at Dallas, USA
Frank Schweitzer, ETH Zurich, Switzerland
Santiago Segarra, Rice University, USA
Irene Sendiña-Nadal, Rey Juan Carlos University, Spain
M. Ángeles Serrano, Universitat de Barcelona, Spain
Saray Shai, Wesleyan University, USA
Aneesh Sharma, Google, USA
Rajesh Sharma, University of Tartu, Estonia
Julian Sienkiewicz, Warsaw University of Technology, Poland
Anurag Singh, NIT Delhi, India
Lisa Singh, Georgetown University, USA
Rishabh Singhal, Dayalbagh Educational Institute, India
Sudeshna Sinha, Indian Institute of Science Education and Research, India
Per Sebastian Skardal, Trinity College, USA
Oskar Skibski, University of Warsaw, Poland
Michael Small, The University of Western Australia, Australia
Keith Smith, The University of Edinburgh, UK
Igor Smolyarenko, Brunel University, UK
Zbigniew Smoreda, Orange Labs, France
Tom Snijders, University of Groningen, The Netherlands
Annalisa Socievole, National Research Council of Italy, Italy
Igor M. Sokolov, Humboldt-University of Berlin, Germany
Albert Sole, Universitat Rovira i Virgili, Spain
Sucheta Soundarajan, Syracuse University, USA
Jaya Sreevalsan-Nair, IIIT Bangalore, India
Massimo Stella, Institute for Complex Systems Simulation, UK
Arkadiusz Stopczynski, Technical University of Denmark, Denmark
Blair D. Sullivan, University of Utah, USA
Xiaoqian Sun, Beihang University, China
Xiaoqian Sun, Chinese Academy of Sciences, China
Pål Sundsøy, NBIM, Norway
Samir Suweis, University of Padua, Italy
Boleslaw Szymanski, Rensselaer Polytechnic Institute, USA
Bosiljka Tadic, Jozef Stefan Institute, Slovenia
Andrea Tagarelli, DIMES, University of Calabria, Italy
Kazuhiro Takemoto, Kyushu Institute of Technology, Japan
Frank Takes, Leiden University and University of Amsterdam, The Netherlands
Fabien Tarissan, ENS Paris-Saclay (ISP), France
Dane Taylor, University at Buffalo, SUNY, USA
Claudio Juan Tessone, Universität Zürich, Switzerland
François Théberge, Tutte Institute for Mathematics and Computing, Canada
Olivier Togni, Burgundy University, France
Ljiljana Trajkovic, Simon Fraser University, Canada
Jan Treur, Vrije Universiteit Amsterdam, The Netherlands
Milena Tsvetkova, London School of Economics and Political Science, UK
Liubov Tupikina, Ecole Polytechnique, France
Janos Török, Budapest University of Technology and Economics, Hungary
Stephen Uzzo, New York Hall of Science, USA
Lucas D. Valdez, FAMAF-UNC, Argentina
Pim van der Hoorn, Eindhoven University of Technology, The Netherlands
Piet Van Mieghem, Delft University of Technology, The Netherlands
Michalis Vazirgiannis, AUEB, Greece
Balazs Vedres, University of Oxford, UK
Wouter Vermeer, Northwestern University, USA
Christian Lyngby Vestergaard, CNRS and Institut Pasteur, France
Anil Kumar Vullikanti, University of Virginia, USA
Johannes Wachs, Central European University, Hungary
Huijuan Wang, Delft University of Technology, The Netherlands
Lei Wang, Beihang University, China
Ingmar Weber, Qatar Computing Research Institute, Qatar
Guanghui Wen, Southeast University, China
Gordon Wilfong, Bell Labs, USA
Mateusz Wilinski, Scuola Normale Superiore di Pisa, Italy
Richard Wilson, University of York, UK
Dirk Witthaut, Forschungszentrum Jülich, Germany
Bin Wu, Beijing University of Posts and Telecommunications, China
Jinshan Wu, Beijing Normal University, China
Feng Xia, Federation University Australia, Australia
Haoxiang Xia, Dalian University of Technology, China
Xiaoke Xu, Dalian Minzu University, China
Gitanjali Yadav, University of Cambridge, UK
Gang Yan, Tongji University, China
Xiaoran Yan, Indiana University Bloomington, USA
Taha Yasseri, University of Oxford, UK
Ying Ye, Nanjing University, China
Qingpeng Zhang, City University of Hong Kong, USA
Zi-Ke Zhang, Hangzhou Normal University, China
Junfei Zhao, Columbia University, USA
Matteo Zignani, University of Milan, Italy
Eugenio Zimeo, University of Sannio, Italy
Lorenzo Zino, University of Groningen, The Netherlands
Antonio Zippo, Consiglio Nazionale delle Ricerche, Italy
Fabiana Zollo, Ca’ Foscari University of Venice, Italy
Arkaitz Zubiaga, Queen Mary University of London, UK
Claudia Zucca, University of Glasgow, UK
Contents

Machine Learning and Networks

Structural Node Embedding in Signed Social Networks: Finding Online Misbehavior at Multiple Scales
Mark Heimann, Goran Murić, and Emilio Ferrara

On the Impact of Communities on Semi-supervised Classification Using Graph Neural Networks
Hussain Hussain, Tomislav Duricic, Elisabeth Lex, Roman Kern, and Denis Helic

Detecting Geographical Competitive Structure for POI Visit Dynamics
Teru Fujii, Masahito Kumano, João Gama, and Masahiro Kimura

Consensus Embeddings for Networks with Multiple Versions
Mengzhen Li and Mehmet Koyutürk

Graph Convolutional Network with Time-Based Mini-Batch for Information Diffusion Prediction
Hajime Miyazawa and Tsuyoshi Murata

A Sentiment Enhanced Deep Collaborative Filtering Recommender System
Ahlem Drif, Sami Guembour, and Hocine Cherifi

Experimental Evaluation of Train and Test Split Strategies in Link Prediction
Gerrit Jan de Bruin, Cor J. Veenman, H. Jaap van den Herik, and Frank W. Takes

Enriching Graph Representations of Text: Application to Medical Text Classification
Alexios Mandalios, Alexandros Chortaras, Giorgos Stamou, and Michalis Vazirgiannis

SaraBotTagger - A Light Tool to Identify Bots in Twitter
Carlos Magno Geraldo Barbosa, Lucas Gabriel da Silva Félix, Antônio Pedro Santos Alves, Carolina Ribeiro Xavier, and Vinícius da Fonseca Vieira

Graph Auto-Encoders for Learning Edge Representations
Virgile Rennard, Giannis Nikolentzos, and Michalis Vazirgiannis

Incorporating Domain Knowledge into Health Recommender Systems Using Hyperbolic Embeddings
Joel Peito and Qiwei Han

Image Classification Using Graph-Based Representations and Graph Neural Networks
Giannis Nikolentzos, Michalis Thomas, Adín Ramírez Rivera, and Michalis Vazirgiannis

Graph-Based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles
M. Tarik Altuncu, Sophia N. Yaliraki, and Mauricio Barahona

Learning Parameters for Balanced Index Influence Maximization
Manqing Ma, Gyorgy Korniss, and Boleslaw K. Szymanski

Mobility Networks

Mobility Networks for Predicting Gentrification
Oliver Gardiner and Xiaowen Dong

Connecting the Dots: Integrating Point Location Data into Spatial Network Analyses
Shuto Araki and Aaron Bramson

Topological Analysis of Synthetic Models for Air Transportation Multilayer Networks
Marzena Fügenschuh, Ralucca Gera, and Andrea Tagarelli

Quick Sub-optimal Augmentation of Large Scale Multi-modal Transport Networks
Elise Henry, Mathieu Petit, Angelo Furno, and Nour-Eddin El Faouzi

Order Estimation of Markov-Chain Processes in Complex Mobility Network Embedded in Vehicle Traces
Keigo Yamamoto, Shigeyuki Miyagi, and Osamu Sakai

Modeling Human Behavior

A Second-Order Adaptive Network Model for Learner-Controlled Mental Model Learning Processes
Rajesh Bhalwankar and Jan Treur

Self-modeling Networks Using Adaptive Internal Mental Models for Cognitive Analysis and Support Processes
Jan Treur

Evolution of Spatial Political Community Structures in Sweden 1985–2018
Jérôme Michaud, Ilkka H. Mäkinen, Emil Frisk, and Attila Szilva

Graph Comparison and Artificial Models for Simulating Real Criminal Networks
Lucia Cavallaro, Annamaria Ficara, Francesco Curreri, Giacomo Fiumara, Pasquale De Meo, Ovidiu Bagdasar, and Antonio Liotta

Extending DeGroot Opinion Formation for Signed Graphs and Minimizing Polarization
Inzamam Rahaman and Patrick Hosein

Market Designs and Social Interactions. How Trust and Reputation Influence Market Outcome?
Sylvain Mignot and Annick Vignes

Assessing How Team Task Influences Team Assembly Through Network Analysis
Emily Kaven, Ilana Kaven, Diego Gómez-Zará, Leslie DeChurch, and Noshir Contractor

Resident’s Alzheimer Disease and Social Networks Within a Nursing Home
Mehrdad Agha Mohammad Ali Kermani, Samane Abbasi Sani, and Hanie Zand

Forming Diverse Teams Based on Members’ Social Networks: A Genetic Algorithm Approach
Archan Das, Diego Gómez-Zará, and Noshir Contractor

Biological Networks

Deep Reinforcement Learning for Control of Probabilistic Boolean Networks
Georgios Papagiannis and Sotiris Moschoyiannis

A Methodology for Evaluating the Extensibility of Boolean Networks’ Structure and Function
Rémi Segretain, Sergiu Ivanov, Laurent Trilling, and Nicolas Glade

NETME: On-the-Fly Knowledge Network Construction from Biomedical Literature
Alessandro Muscolino, Antonio Di Maria, Salvatore Alaimo, Stefano Borzì, Paolo Ferragina, Alfredo Ferro, and Alfredo Pulvirenti

Statistics of Growing Chemical Network Originating from One Molecule Species and Activated by Low-Temperature Plasma
Yasutaka Mizui, Shigeyuki Miyagi, and Osamu Sakai

Joint Modeling of Histone Modifications in 3D Genome Shape Through Hi-C Interaction Graph
Emre Sefer

Network Models

Fast Multipole Networks
Steve Huntsman

A Random Growth Model with Any Real or Theoretical Degree Distribution
Frédéric Giroire, Stéphane Pérennes, and Thibaud Trolliet

Local Degree Asymmetry for Preferential Attachment Model
Sergei Sidorov, Sergei Mironov, Igor Malinskii, and Dmitry Kadomtsev

Edge Based Stochastic Block Model Statistical Inference
Louis Duvivier, Rémy Cazabet, and Céline Robardet

Modeling and Evaluating Hierarchical Network: An Application to Personnel Flow Network
Jueyi Liu, Yuze Sui, Ling Zhu, and Xueguang Zhou

GrowHON: A Scalable Algorithm for Growing Higher-order Networks of Sequences
Steven J. Krieg, Peter M. Kogge, and Nitesh V. Chawla

Analysis of a Finite Mixture of Truncated Zeta Distributions for Degree Distribution
Hohyun Jung and Frederick Kin Hing Phoa

De-evolution of Preferential Attachment Trees
Chen Avin and Yuri Lotker

An Algorithmic Information Distortion in Multidimensional Networks
Felipe S. Abrahão, Klaus Wehmuth, Hector Zenil, and Artur Ziviani

Hot-Get-Richer Network Growth Model
Faisal Nsour and Hiroki Sayama

Networks in Finance and Economics

Measuring the Nestedness of Global Production System Based on Bipartite Network
Jun Guan, Jiaqi Ren, and Lizhi Xing

Extracting the Backbone of Global Value Chain from High-Dimensional Inter-Country Input-Output Network
Lizhi Xing and Yu Han

Analysis of Tainted Transactions in the Bitcoin Blockchain Transaction Network
María Óskarsdóttir, Jacky Mallett, Arnþór Logi Arnarson, and Alexander Snær Stefánsson

Structural Network Measures

Top-k Connected Overlapping Densest Subgraphs in Dual Networks
Riccardo Dondi, Pietro Hiram Guzzi, and Mohammad Mehdi Hosseinzadeh

Interest Clustering Coefficient: A New Metric for Directed Networks Like Twitter
Thibaud Trolliet, Nathann Cohen, Frédéric Giroire, Luc Hogie, and Stéphane Pérennes

Applying Fairness Constraints on Graph Node Ranks Under Personalization Bias
Emmanouil Krasanakis, Symeon Papadopoulos, and Ioannis Kompatsiaris

Which Group Do You Belong To? Sentiment-Based PageRank to Measure Formal and Informal Influence of Nodes in Networks
Lan Jiang, Ly Dinh, Rezvaneh Rezapour, and Jana Diesner

Temporal Networks

Path Homology and Temporal Networks
Samir Chowdhury, Steve Huntsman, and Matvey Yutin

Computing Temporal Twins in Time Logarithmic in History Length
Binh-Minh Bui-Xuan, Hugo Hourcade, and Cédric Miachon

A Dynamic Algorithm for Linear Algebraically Computing Nonbacktracking Walk Centrality
Eisha Nathan

TemporalRI: A Subgraph Isomorphism Algorithm for Temporal Networks
Giorgio Locicero, Giovanni Micale, Alfredo Pulvirenti, and Alfredo Ferro

StreamFaSE: An Online Algorithm for Subgraph Counting in Dynamic Networks
Henrique Branquinho, Luciano Grácio, and Pedro Ribeiro

Temporal Bibliometry Networks of SARS, MERS and COVID19 Reveal Dynamics of the Pandemic
Ramya Gupta, Abhishek Prasad, Suresh Babu, and Gitanjali Yadav

Author Index
Machine Learning and Networks
Structural Node Embedding in Signed Social Networks: Finding Online Misbehavior at Multiple Scales

Mark Heimann1(B), Goran Murić2, and Emilio Ferrara2

1 Lawrence Livermore National Laboratory, Livermore, CA, USA [email protected]
2 Information Sciences Institute, Marina Del Rey, CA, USA {gmuric,ferrarae}@isi.edu
Abstract. Entities in networks may interact positively as well as negatively with each other, which may be modeled by a signed network containing both positive and negative edges between nodes. Understanding how entities behave and not just with whom they interact positively or negatively leads us to the new problem of structural role mining in signed networks. We solve this problem by developing structural node embedding methods that build on sociological theory and technical advances developed specifically for signed networks. With our methods, we can not only perform node-level role analysis, but also solve another new problem of characterizing entire signed networks to make network-level predictions. We motivate our work with an application to social media analysis, where we show that our methods are more insightful and effective at detecting user-level and session-level malicious online behavior from the network structure than previous approaches based on feature engineering.
1 Introduction

Networks are a natural model for many forms of data, in which entities exhibit complex patterns of interaction. In many applications, entities may form negative as well as positive interactions with each other. For example, users on a social network may form friendships and engage in prosocial behavior with each other, but they may also form animosities and engage in antisocial behavior, such as trolling [17] or cyberaggression [10]. Signed networks, in which edges between nodes may be positive or negative, can naturally model these interactions of varying polarity.

Existing work in signed network analysis often tries to characterize with whom each node interacts. For example, the common task of edge sign prediction [1] is to determine whether two nodes would have a positive or negative interaction; node embedding objectives for signed networks [2, 22] encourage each node to have similar latent feature representations to other nodes with whom it interacts positively, and dissimilar representations to those with whom it interacts negatively. Instead, we focus on the orthogonal and new problem of characterizing how a node forms positive or negative relationships, namely its structural role in the signed network. Existing methods for role analysis in networks [19] are designed for unsigned networks, so we introduce structural node embedding methods that build on sociological theories and technical advances designed specifically for signed networks.

Not only can we perform node-level structural role analysis, but we can also characterize network-level behavior. For signed networks, this is another new problem, as methods for graph comparison and classification focus on unsigned networks. To solve it, we contribute baseline statistical signatures, graph kernels, and graph features derived from our signed structural node embeddings. The latter describe a signed network’s distribution of structural roles, a very natural and powerful way to characterize a network.

We motivate our methodology with an application to social media analysis. Several types of social media users are revealed by their patterns of negative behavior on platforms designed to encourage positive interactions. For instance, trolls on social media may try to stir up controversy or cause annoyance to others [17], while cyberbullies leave hurtful or aggressive comments for other social media users with the intent to shame, mock, and/or intimidate [11]. Further understanding the nature of antisocial online behavior can lead to more effective preventative measures and improve the social media experience. Our node-level and graph-level methods respectively allow us to characterize social behavior at the level of individual users and larger media sessions. While recent work derived several insights about users’ social roles from the network structure alone using hand-engineered network statistics [15], we show that our embedding-based methods can improve on network feature engineering.

Our contributions are thus as follows:

1. New Node-level Problem and Methods: We propose structural node embedding methods for the new problem of structural role analysis in signed networks. We propose a simple scheme for leveraging existing (unsigned) methods directly, and also new embedding methods that can learn from multi-sign higher-order interactions.
2. New Graph-level Problem and Methods: We propose techniques for the new problem of signed network classification: baselines using hand-engineered features and more expressive methods that leverage our signed structural node embeddings.
3. Controlled Synthetic Test Cases for Signed Networks: We study the behavior of our techniques in synthetic networks, where we can tailor the signed structural roles of nodes. Our methods are more discriminative than the simpler feature engineering approaches that have recently been used to characterize similar social roles.
4. Applications to Social Media Analysis: We use social media datasets with explicit user-specified edge signs and implicit signs that must be inferred, which can serve as benchmarks for our new problems. In node-level and graph-level analysis of social roles, our methods yield quantitative improvements over feature engineering.

For reproducibility, our code is available at https://github.com/mark heimann/Signed-Network-Roles.

Work done while author was an intern at the Information Sciences Institute.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 3–14, 2021. https://doi.org/10.1007/978-3-030-65351-4_1
2 Related Work

The majority of works in graph mining focus on unsigned graphs. We first review relevant literature on unsigned graphs, noting that methods designed for signed networks usually outperform unsigned methods naïvely applied to signed networks [2]. With this in mind, we also review relevant literature on signed networks.
2.1 Unsigned Network Methods

We first review node embedding methods for unsigned networks, which enable node-level graph mining; we then discuss graph-level analysis.

Node Embedding. Node embedding learns features for nodes in a network via an objective that encourages similar nodes to have similar features. Often, similarity is defined based on node proximity. Unsupervised methods may model higher-order proximities using random walks, matrix factorization, or deep neural networks; we refer the reader to a comprehensive survey on (proximity-preserving) node embedding [6] for more details. In a semi-supervised setting, graph neural networks [16] have grown in popularity. A complementary line of work embeds nodes based on their structural roles: nodes will be embedded close to other nodes with similar local structure, regardless of proximity. Such structural embedding methods are surveyed and contrasted conceptually [20] and empirically [14] with proximity-preserving embedding methods.

Graph Classification. While the above methods are often used for node-level analysis, many applications also require network-level predictions. Graph classification seeks to predict the class to which an entire network belongs using supervised machine learning. Three major families of techniques include kernels operating on graph similarity functions [21], unsupervised graph feature learning [8], and end-to-end feature learning with deep neural networks on graphs [5]. Given node embeddings that are comparable across networks (in particular, embeddings capturing structural roles), the gap between node-level and graph-level features can be bridged by modeling the distribution of node embeddings in a sparse graph feature vector [8].

2.2 Signed Network Methods

Many of the works involved in analyzing signed social networks have their underpinnings in sociological theory. One of the most influential theories on signed network mining is balance theory [7], which specifies some general rules for “balanced” and “unbalanced” configurations of edge signs. It predicts that balanced configurations are more stable and thus more likely to occur than unbalanced configurations, and generalizes intuition often captured in expressions such as “my enemy’s enemy is my friend”.

One of the principal tasks in signed network mining is edge sign prediction. This may be done by directly trying to minimize imbalance in the network, by constructing hand-engineered signed features for supervised prediction, or by matrix factorization [1]. Recent works have extended network embedding to signed networks based on shallow architectures and random walks [23] or deep architectures [22], applying them to node-level and edge-level tasks within a single network (to the best of our knowledge, multi-network tasks on signed social networks are largely unexplored). Existing approaches do not model structural roles, but rather preserve signed proximity in the network, trying to learn similar embeddings for positively connected nodes and different embeddings for negatively connected nodes.
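As a small illustration of the balance-theory rule referenced in this section, the sketch below checks whether a triangle of signed edges is balanced. It assumes the classical (strong) structural balance criterion for undirected triangles, under which a triangle is balanced exactly when the product of its three edge signs is positive; the function name is hypothetical and is not taken from the paper's code.

```python
def triangle_is_balanced(sign_ab, sign_bc, sign_ca):
    """Return True if a signed triangle is balanced.

    Assumption: strong structural balance, i.e. the product of the
    three edge signs (each +1 or -1) must be positive.
    """
    return sign_ab * sign_bc * sign_ca > 0


# "My enemy's enemy is my friend": two negative edges and one positive edge.
enemy_of_enemy = triangle_is_balanced(-1, -1, +1)

# Two friends who dislike each other's friend: one negative edge.
friend_conflict = triangle_is_balanced(+1, +1, -1)
```

Under this criterion the first configuration is balanced and the second is not, which matches the intuition quoted above.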
3 Preliminaries

Let $G = (V, E)$ be a directed graph with vertex set $V$ ($|V| = n$) and edge set $E \subseteq V \times V$. In this work, we consider signed networks: a sign function $\varphi : E \to \{1, -1\}$ dictates the sign of each edge as being positive or negative. $G$ has single-sign subgraphs $G^+$ induced by the positive edges of $G$ and $G^-$ induced by the negative edges. $G$ has an adjacency matrix $A$, whose nonzero entries are $\pm 1$ and whose $i$-th row is $A_i$.

The goal of node embedding is to learn low-dimensional features for each node that capture higher-order network structure. Nodes are embedded into $d$-dimensional space, where $d$ is a small constant, such that their geometry in vector space preserves some sort of node similarity in the original network (in our case, structural role similarity). $Y \in \mathbb{R}^{n \times d}$ denotes a graph’s matrix of node embeddings.
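To make the notation concrete, the following minimal sketch builds the signed adjacency matrix $A$ of a small directed signed network and splits it into the adjacency matrices of $G^+$ and $G^-$. The four-node edge list and all variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical directed signed network: (source, target, sign) triples.
edges = [(0, 1, 1), (1, 2, -1), (2, 0, 1), (3, 0, -1)]
n = 4

# Signed adjacency matrix A: nonzero entries are +1 or -1.
A = np.zeros((n, n), dtype=int)
for u, v, sign in edges:
    A[u, v] = sign

# Adjacency matrices of the single-sign subgraphs G+ and G-.
A_pos = (A > 0).astype(int)
A_neg = (A < 0).astype(int)

# Positive and negative out-degrees (row sums) and in-degrees (column sums).
pos_out, neg_out = A_pos.sum(axis=1), A_neg.sum(axis=1)
pos_in, neg_in = A_pos.sum(axis=0), A_neg.sum(axis=0)
```

The signed matrix is recovered as `A_pos - A_neg`, so the two single-sign subgraphs together carry the same information as $A$.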
4 Node-Level Techniques

In this section, we extend structural node embedding [4, 9] to signed networks. We first discuss how existing unsigned network embedding methods can be applied directly on signed networks (§4.1) and the strengths and weaknesses of doing so. To overcome limitations of this approach, we adapt two unsigned structural embedding methods to the signed network domain (§4.2 & 4.3) using principled techniques for modeling signed network structure more expressively.

4.1 sec-Embedding: Concatenation of Single-Sign Embeddings
To apply unsigned node embedding methods directly to signed networks, we can use them to model separate structural roles of nodes based only on positive or negative edges. Formally, given an embedding method, we apply it to $G^+$ to produce an embedding $Y^+$, as well as to $G^-$ to produce an embedding $Y^-$. The final embedding is the concatenation $Y = Y^+ \,\|\, Y^-$. We call this heuristic single-signed embedding concatenation, which we denote by appending the prefix sec- before an embedding name (e.g. sec-xNetMF).

The sec- technique is simple, allowing unsigned structural embedding methods to be applied to signed networks without methodological modification; moreover, it may offer benefits of interpretability and generalizability by disentangling the effect of positive and negative edges on the structural roles. However, it ignores the complex structure of higher-order mixed-sign interactions present in signed networks, which have been characterized by rich sociological theory [7] that has informed signed network mining [1]. Discovery of some meaningful structural role similarities and differences, as we show concretely in §6.1, requires methodological innovation to capture structural roles based on higher-order mixed-sign structure.

4.2 sNCE: Embedding Nodes Based on Signed Neighborhood Connectivities
The first structural node embedding method we adapt to signed networks is xNetMF [9], which embeds unsigned, undirected graphs from an implicit decomposition of a pairwise structural node similarity matrix. Nodes’ structural similarity depends on the distribution of structural features (degree) in their $k$-hop neighborhoods; EMBER [13] generalizes this to handle edge directions by modeling separate neighborhoods based on incoming and outgoing edges. However, gracefully modeling edge signs poses two research challenges:

Cx 1) For each node, how do we model higher-order neighbors’ signed relationships to that node? (We need to model positive and negative neighbors separately, and discern whether indirect neighbors are indirectly positive or negative.)
Cx 2) How do we model nodes’ signed connectivity in a neighborhood? (Degree alone, the usual connectivity measure [9, 18], does not incorporate edge signs.)

We call our proposed signed structural embedding method sNCE (signed Neighborhood Connectivity Embedding), as we apply the fundamental idea of xNetMF, embedding nodes based on structural similarity derived from the connectivity statistics in their local neighborhoods, while respecting best practices on signed networks.

Defining Neighborhoods. To address Cx 1, we turn to sociological theories of balance [7]: we partition neighborhoods into balanced and unbalanced neighborhoods [2]. Formally, for a node $u$, let $N_u^{k\to}$ be the $k$-hop out-neighborhood of $u$: the nodes that can be reached from $u$ in a directed path of length exactly $k$. For immediate (one-hop) neighborhoods, balanced and unbalanced neighborhoods depend on the edge sign between the node and the neighbor:

$$B_u^{1\to} = \{v \in N_u^{1\to} : G(u, v) > 0\} \quad \text{and} \quad U_u^{1\to} = \{v \in N_u^{1\to} : G(u, v) < 0\}.$$

For $k > 1$, we recursively define balanced higher-order neighborhoods [2]. The balanced $k$-hop neighborhood of $u$ consists of all positive neighbors of balanced $(k-1)$-hop neighbors of $u$ (“friends of friends”), as well as negative neighbors of unbalanced $(k-1)$-hop neighbors of $u$ (“enemies of enemies”):

$$B_u^{k\to} = \{v : v' \in B_u^{(k-1)\to} : G(v', v) > 0\} \cup \{v : v' \in U_u^{(k-1)\to} : G(v', v) < 0\}.$$

The unbalanced $k$-hop neighborhood of $u$ consists of all negative neighbors of balanced $(k-1)$-hop neighbors of $u$ (“enemies of friends”), as well as positive neighbors of unbalanced $(k-1)$-hop neighbors of $u$ (“friends of enemies”):

$$U_u^{k\to} = \{v : v' \in U_u^{(k-1)\to} : G(v', v) > 0\} \cup \{v : v' \in B_u^{(k-1)\to} : G(v', v) < 0\}.$$

Balanced and unbalanced in-neighborhoods are defined analogously using $N_u^{k\leftarrow}$, the $k$-hop in-neighborhood of $u$, or the nodes that can reach $u$ via a directed path of length $k$.

Characterizing Neighborhoods. After splitting the $k$-hop neighborhoods of each node into balanced and unbalanced in- and out-neighborhoods following [2], we characterize the original node’s structural role while respecting Cx 2 by examining several signed structural connectivity measures in these neighborhoods. Let $\mathcal{F}$ be the set of connectivity measures; $|\mathcal{F}| = 4$ as it consists of positive and negative node in- and out-degree. We consider each connectivity measure’s distribution in the set $\mathcal{N}$ of neighborhoods of $u$; $|\mathcal{N}| = 4$ as it consists of balanced and unbalanced in- and out-neighborhoods. Then, for $f \in \mathcal{F}$, the distribution of $f$ in node $u$’s (balanced/unbalanced, in/out) $k$-hop neighborhood $N_u^k$ can be represented in a logarithmically binned [9] histogram $h_f(N_u^k)$. We combine hop distances, discounting further ones [9, 13]: $h_f(N_u) = \sum_{k=1}^{K} \delta^k h_f(N_u^k)$, using maximum distance $K = 2$ [9, 13] and discount factor $\delta = 0.9$ by default. Nodes’ structural similarity can be computed by comparing these histograms:

$$\mathrm{sim}(u, v) = \exp\Big(-\sum_{f \in \mathcal{F}} \sum_{N \in \mathcal{N}} \lVert h_f(N_u) - h_f(N_v) \rVert\Big) \qquad (1)$$
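The neighborhood recursion and the similarity of Eq. 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is given as a hypothetical dictionary mapping each node to its outgoing `(neighbor, sign)` pairs, the function names are invented for this example, the norm in Eq. 1 is taken to be Euclidean, and a node reachable through both kinds of path is simply allowed to appear in both the balanced and the unbalanced set.

```python
import numpy as np


def signed_out_neighborhoods(out_edges, u, K):
    """Balanced and unbalanced k-hop out-neighborhoods of u for k = 1..K.

    out_edges: dict mapping node -> list of (neighbor, sign), sign in {+1, -1}.
    Returns (B, U), where B[k-1] and U[k-1] are sets of nodes.
    """
    # One-hop case: split direct out-neighbors by edge sign.
    B = [{v for v, s in out_edges.get(u, []) if s > 0}]
    U = [{v for v, s in out_edges.get(u, []) if s < 0}]
    for _ in range(2, K + 1):
        next_b, next_u = set(), set()
        # Friends of balanced neighbors stay balanced; enemies become unbalanced.
        for w in B[-1]:
            for v, s in out_edges.get(w, []):
                (next_b if s > 0 else next_u).add(v)
        # Friends of unbalanced neighbors are unbalanced; enemies become balanced.
        for w in U[-1]:
            for v, s in out_edges.get(w, []):
                (next_u if s > 0 else next_b).add(v)
        B.append(next_b)
        U.append(next_u)
    return B, U


def structural_similarity(hists_u, hists_v):
    """Eq. 1 over matching lists of histograms (Euclidean norm assumed)."""
    total = sum(np.linalg.norm(hu - hv) for hu, hv in zip(hists_u, hists_v))
    return float(np.exp(-total))


# Hypothetical toy graph: u has one friend (a) and one enemy (b).
toy = {
    "u": [("a", 1), ("b", -1)],
    "a": [("c", 1), ("d", -1)],
    "b": [("e", -1), ("f", 1)],
}
B, U = signed_out_neighborhoods(toy, "u", K=2)
```

In the toy graph, `c` (friend of a friend) and `e` (enemy of an enemy) fall in the balanced two-hop neighborhood, while `d` and `f` fall in the unbalanced one. In-neighborhoods would follow the same recursion on reversed edges.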
Embedding. To learn structural node embeddings, we want nodes to have similar features if they are structurally similar according to Eq. 1. Following the scalable implicit matrix factorization approach of [9], we derive embeddings from a low-rank decomposition of a pairwise node structural similarity matrix. To compute $d$-dimensional embeddings, we select $d$ landmark nodes uniformly at random [9] and compute the $n \times d$ similarity matrix $C$ of all nodes to these landmarks using Eq. 1. We then form the $d \times d$ submatrix $W$ of landmark-to-landmark similarities. With the pseudoinverse $W^\dagger$ of $W$ and its SVD $W^\dagger = U \Sigma V^\top$, we form the embeddings: $Y = C U \Sigma^{1/2}$.

4.3 sRDE: Embedding Nodes Based on Distributions of Signed Relevance Scores
GraphWave [4] computes a matrix representing pairwise node relevance scores; each node’s structural embedding models the distribution of its relevance to other nodes. While for unsigned networks the relevance scores can be derived from heat diffusion, it is not clear how this diffusion process would respect edge signs (we found that using it led to poor performance). Our research challenges include Cg 1: how can we compute appropriate signed relevance scores, and Cg 2: how do we appropriately model the score distributions? We call our proposed signed structural embedding method sRDE (signed Relevance Distribution Embedding), as we apply the fundamental idea of GraphWave, embedding distributions of relevance scores, while respecting best practices on signed networks.

Computing Node Relevance. To address Cg 1, we compute node relevance using signed random walk with restart (RWR) [3], which has a closed-form matrix expression:

$$R = (1 - c)(I - cS)^{-1} \qquad (2)$$

where $c \in [0, 1]$ is a scalar; $D$ is the signed degree matrix, a diagonal matrix where $D_{ii} = \sum_j |A_{ij}|$; and $S = D^{-1} A$ is the signed random walk transition matrix. (In the future, iterative methods may be used to scale the computation of $R$.)

Embedding. To address Cg 2, for a node $u$, we form the embedding $Y_u$ by computing a histogram over its relevance scores to all other nodes, given by the $u$-th row $R_u$ of the signed RWR matrix. Using $d$ evenly spaced bins, we represent each node as a $d$-dimensional vector. We find this to be a simpler and empirically more effective alternative to sampling from the empirical characteristic function computed from $R_u$, as proposed to learn structural embeddings from the (unsigned) heat kernel matrix [4].
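The two sRDE steps can be sketched directly from Eq. 2 and the histogram description. The code below is a hedged illustration rather than the released implementation: the function names and the default $c = 0.5$ are hypothetical, the bin range is assumed to span the global minimum and maximum of $R$, each row's histogram includes the node's relevance to itself, rows of isolated nodes are left as zeros, and $c < 1$ is assumed so that $I - cS$ is invertible.

```python
import numpy as np


def signed_rwr(A, c=0.5):
    """Closed form of Eq. 2: R = (1 - c) (I - c S)^{-1}, with S = D^{-1} A."""
    A = np.asarray(A, dtype=float)
    degree = np.abs(A).sum(axis=1)      # D_ii = sum_j |A_ij|
    degree[degree == 0] = 1.0           # assumption: isolated nodes keep zero rows
    S = A / degree[:, None]             # signed random walk transition matrix
    n = A.shape[0]
    return (1 - c) * np.linalg.inv(np.eye(n) - c * S)


def srde_embed(A, d=8, c=0.5):
    """One d-bin histogram of relevance scores per node (rows of R)."""
    R = signed_rwr(A, c)
    # Assumption: d evenly spaced bins shared by all nodes, spanning all scores.
    bins = np.linspace(R.min(), R.max(), d + 1)
    return np.stack([np.histogram(row, bins=bins)[0] for row in R])


# Hypothetical 3-node network: node 0 has one friend (1) and one enemy (2).
A_toy = np.array([[0, 1, -1],
                  [1, 0, 0],
                  [-1, 0, 0]])
Y_toy = srde_embed(A_toy, d=4)
```

Sharing one set of bin edges across nodes keeps the resulting vectors comparable; per-network normalization would be a separate design decision that the text above does not specify.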
5 Graph-Level Techniques
The techniques in § 4 produce a single feature vector for each node, which may be used for node-level analysis. However, we may also want to analyze entire networks. Existing methods for graph comparison focus on unsigned networks, so we extend statistical and kernel-based methods to signed networks (§ 5.1). These approaches use hand-engineered features, which may be less expressive than node embeddings. Thus, we leverage a recent unsupervised graph feature learning technique [8] to directly turn our node features from § 4 into more expressive graph features for entire (signed) networks.
5.1 Signed Network Statistical Signatures and Kernels
As baselines, we propose methods for comparing graphs based on hand-engineered signed network statistics, using two different graph comparison methods: graph statistical signatures and graph kernels.
Statistical Signatures. We can construct feature vectors based on hand-engineered statistics of the graph. One simple approach is Signed Maximum Degree (SMD): we form a four-dimensional feature vector consisting of the maximum positive and negative in- and out-degrees of any node in the graph. Such a feature vector captures some simple structural properties, but discards information about the degrees of most of the nodes. Thus, we also consider the Signed Degree Distribution (SDD): we form and concatenate histograms for the distribution of positive and negative in- and out-degrees in the network. Each histogram has one bucket for each possible value of the degree statistic, up to the maximum value of that statistic for any node in the entire dataset.
Graph Kernel. We can compute a kernel on graphs based on the distribution of motifs in the graph [21]. Each graph G has a feature vector φ(G) counting the number of times each unique graphlet, or graph structure consisting of k nodes, appears in the graph. (Usually k is a small number, with 3 being a popular choice in the literature.) The graphlet kernel between two graphs is then given as the inner product of their feature vectors, $k(G_1, G_2) = \langle \phi(G_1), \phi(G_2) \rangle$. Our signed graphlet kernel (SGK) counts the configurations of unique 3-node graphlets, taking edge signs into account. That is, $\phi(G)_i$ = # of times the i-th signed graphlet appears in G. Using 1 to denote a positive edge, −1 to denote a negative edge, and 0 to denote no edge, we consider all unique combinations of 0s and ±1s. φ(G) is thus a vector with ten elements corresponding to counts of each of these graphlets, which we normalize to sum to one.
5.2 Signed RGM: Distributions of Signed Structural Node Embeddings
Given any set of node embeddings for a graph, the RGM feature map [8] represents a graph as a histogram of the distribution of its node embeddings in vector space.
When these embeddings reflect structural roles, RGM models the distribution of structural roles in a network. With it, we turn any of our node embeddings from § 4 into graph features with a clear interpretation, which may be used for graph-level learning. We follow all steps of the RGM procedure [8]: we normalize the embeddings and bin them using a partition of $[0, 1]^d$ given by a d-dimensional grid with cell widths μ and offsets δ sampled independently along each dimension: μ ∼ Gamma(2, 1/γ) and δ ∼ unif(0, μ). The c-th entry of the histogram counts the number of node embeddings that fall into the c-th cell of the grid; these histograms form a sparse feature map for the graph. The parameter γ controls the resolution of the histograms, similar to an RBF or Laplacian kernel. To capture multiresolution structure, we concatenate histograms chosen by γ ∈ [1, 2, 4, 8], weighted by γ1 to place greater emphasis on matches found in tighter histograms; we also use two iterations of label expansion, starting with uniform node labels, to topologically group nodes prior to binning.
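The binning step described above can be illustrated with a short NumPy sketch. It is a simplified illustration under stated assumptions, not the RGM reference code: the min-max normalization, the reading of Gamma(2, 1/γ) as shape 2 and scale 1/γ, and the function names are assumptions, and the per-resolution weighting and label expansion are deliberately left out.

```python
from collections import Counter

import numpy as np


def grid_histogram(Y, gamma, rng):
    """Count node embeddings per cell of a randomly sized, randomly shifted grid."""
    Y = np.asarray(Y, dtype=float)
    # Assumption: min-max normalization of each dimension into [0, 1].
    span = Y.max(axis=0) - Y.min(axis=0)
    span[span == 0] = 1.0
    Z = (Y - Y.min(axis=0)) / span
    dim = Z.shape[1]
    # Assumption: Gamma(2, 1/gamma) is read as shape 2 and scale 1/gamma.
    mu = rng.gamma(shape=2.0, scale=1.0 / gamma, size=dim)   # cell width per dimension
    delta = rng.uniform(0.0, mu)                             # offset per dimension
    cells = np.floor((Z - delta) / mu).astype(int)           # cell index per dimension
    return Counter(map(tuple, cells))                        # sparse histogram


def multiresolution_features(Y, gammas=(1, 2, 4, 8), seed=0):
    """Concatenate sparse histograms over several resolutions (unweighted here)."""
    rng = np.random.default_rng(seed)
    features = {}
    for gamma in gammas:
        for cell, count in grid_histogram(Y, gamma, rng).items():
            features[(gamma, cell)] = count
    return features


# Hypothetical 2-dimensional embeddings of four nodes; the first two are identical.
Y = np.array([[0.1, 0.2], [0.1, 0.2], [0.9, 0.8], [0.5, 0.5]])
features = multiresolution_features(Y)
```

When comparing several graphs, the same grids (here, the same seed) would presumably have to be reused for every graph so that matching cells correspond to matching regions of the embedding space.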
6 Experiments
We first use controlled synthetic experiments to illustrate the theoretical expressivity of various embedding methods, before using our node- and graph-level techniques to conduct real-world social media analysis.
[Figure 1 panels: (a) Graph G1 with role differences; (b) Graph G2 with role similarities; (c) G1: sNCE finds differences; (d) G1: sRDE finds differences; (e) G1: Degree features do not find differences; (f) G1: sec-xNetMF does not find differences; (g) G2: sNCE finds differences but not similarities; (h) G2: sRDE finds differences and similarities; (i) G2: Degree features do not find similarities or some differences; (j) G2: sec-xNetMF does not find similarities or some differences]
Fig. 1. Synthetic graphs: Signed structural node embeddings can distinguish structurally different nodes (in G1 ) and recognize structurally similar nodes (in G2 ) using higher-order connections or information from multi-sign paths. Simpler approaches based on degree features or combining single-sign embeddings cannot do this.
6.1 Role Discovery in Synthetic Networks
To understand what our embeddings can learn in a controlled context, we generate signed networks with planted structural roles, a form of analysis often used for unsigned structural embeddings [14]. Our graphs contain disconnected components, but structural roles do not rely on node proximity and disconnected nodes can be compared [9]. We learn 4-dimensional node embeddings due to the small size of the graphs, and visualize the nodes’ embedding similarity in two dimensions using PCA. For the graph drawn in Fig. 1a, the red and yellow nodes have similar but not identical structural roles, when higher order connections are considered (in fact, the yellow node has no 2-hop neighbors, but the red node does). Degree statistics cannot capture
higher order information, so these nodes are given identical structural roles (Fig. 1e) and overlap in the plot; hence, the red node is not visible. Network embedding can model higher order information, but embedding positive and negative components separately loses information from mixed-sign connections. Concatenating unsigned node embeddings (§ 4.1) cannot distinguish between these nodes either (Fig. 1f), since the only higher-order neighborhoods have mixed signs. However, signed structural embedding methods can give these roles different embeddings. For the graph shown in Fig. 1b, we highlight three nodes with ‘warm’ colors red, orange, and yellow, as they have similar structural roles analogous to some patterns of online (mis)behavior. A user behaving like the red or orange nodes, sending negative edges to nodes without additional negative edges, might be actively antagonizing ordinary users, while a user behaving like the yellow node might be goading an antagonizer on (sending positive edges to nodes that send negative edges): both propagate largely negative influence throughout the network [15]. Signed structural embeddings such as sNCE and sRDE capture this, embedding the two nodes similarly in the vector space. (Indeed, the entire goal of sRDE embeddings is to characterize the signed propagation patterns from each node). However, without higher-order, multi-sign connections, we cannot distinguish the behavior of goading on a bully (like the yellow node) from supporting a normal user (like the node marked in light blue). Thus, the yellow node is invisible, as it overlaps with the light blue node in Figs. 1i and j which plot the features learned by concatenating single-signed embeddings or using hand-engineered statistics. sNCE (Fig. 1g) can model these differences, but does not recognize the similarity of the yellow, red, and orange nodes. On the other hand, sRDE (Fig. 1h) successfully clusters these together. 
6.2 Finding Misbehavior in Social Media
One reason the problems of signed structural node embedding and network classification may be new is the lack of benchmark datasets. These formulations are a natural fit for social media analysis, which we perform for our experimental evaluation. We hope our work will also inspire further methodological development as well as the introduction of new benchmarks for these problems. Note: a complete solution for identifying online misbehavior would likely use information beyond the network structure itself, such as text or media content [11]. Our primary goal here is to learn from the network structure alone, which has been shown to inform our understanding of social roles [15]. We verify that we capture richer signed network role information than existing graph-based methods.
Social Media Data. We consider two social media datasets: Slashdot Zoo [17] and Cyberbullying [10, 11] on Instagram. We represent each as a network where nodes are users and edges represent pairwise interactions between users. Both datasets contain a subset of users who engage in some sort of online “misbehavior”: trolls in Slashdot Zoo, and cyberbullies in Instagram. Intuitively, such socially deviant behavior should manifest itself in a distinctive structural role that we would like to capture in topological feature representations for each user.
In Slashdot Zoo, the edge sign function ϕ is given explicitly by the users themselves, who denote other users as “friends” or “foes” (modeled by positive and negative outgoing edges, respectively). In Cyberbullying, ϕ must be inferred implicitly. The network is defined by users commenting on each other’s media sessions. Comments are assumed to be directed at the user who posted the picture or video to initiate the session, unless the commenter “mentions” other users using an @ symbol before a username (if so, we form a directed edge from mentioner to mentionee). Thus, ϕ represents benign or hostile comment intent. Recent preliminary analysis of this dataset [15] found that a strong predictor of a comment’s cyberaggression was its score from the VADER model [12] for sentiment analysis in social media, which ranges from −1 (most negative) to 1 (most positive). To avoid misclassification of mildly negative but not truly aggressive comments, we assign an edge sign of 1 for a VADER score above −0.5 and −1 otherwise; we verify this guideline’s effectiveness by manual inspection of several comments.
User-Level (Node-Level) Analysis. For the Slashdot dataset, 96 users are marked as trolls by the ground-truth Slashdot account “No More Trolls”. We randomly select an equal number of non-trolls and distinguish the two with logistic regression and 10-fold cross-validation, trained on various node features:
– Degree Features. We concatenate the positive and negative in- and out-degrees of each node to form a four-dimensional feature vector. This is a form of hand engineering using a fundamental structural feature [14] while modeling edge signs.
– SGCN. We use the Signed Graph Convolutional Network [2], which performs feature propagation to learn node representations while taking into account balance theory. Such an approach learns community-based node features [20], namely embedding positively-oriented nodes closer than negatively-oriented nodes. This serves as a contrast to our role-based embedding methods.
– Single-sign variations of xNetMF [9]: xNetMF+ ignores negative edges and only embeds G+, while xNetMF− ignores positive edges and only embeds G−.
We also use signed structural node embeddings via sNCE (§ 4.3) and sec-xNetMF (§ 4.1): concatenating xNetMF+ and xNetMF− features. (sRDE’s memory requirements are excessive on this larger graph, a limitation shared by its unsigned counterpart GraphWave [13].) All embeddings use the standard dimension d = 128 [13].
Table 1. Classifying troll users in Slashdot Zoo. Structural embeddings using both positive and negative edges (in this case disentangling each sign type’s effect on the structural role) lead to the greatest accuracy.
Method        Accuracy
Degree        0.57
SGCN          0.46
xNetMF+       0.59
xNetMF−       0.61
sec-xNetMF    0.64
sNCE          0.51
From the results in Table 1, we see that hand-engineered features (Degree) and features that try to preserve node proximity (SGCN) are the least accurate for the task, which motivates our use of structural node embeddings to characterize troll behavior. Using negative edges alone to determine structural roles leads to slightly better results than using positive edges alone (this makes sense for the task of identifying a negative behavior), but we see that using both positive and negative edges in signed structural embeddings gives the best performance. However, it seems most useful to model the structural roles of users separately in a positive
and a negative context, as is evidenced by the worse performance of sNCE and the superiority of sec-xNetMF. Our synthetic experiments (§ 6.1) show that sNCE can detect subtle role differences that sec-xNetMF cannot; however, the differing results on this real dataset may reveal the double-edged nature of this expressivity (e.g. overfitting). Still, we next show that the signed structural embeddings’ roles effectively characterize the network itself.
Session-Level (Graph-Level) Analysis. For graph classification, we evaluate the performance of an SVM with 5-fold cross-validation to predict graphs’ labels using the following kernels or features:
– Hand-engineered statistics: We consider the statistical signatures SMD and SDD, along with the kernel SGK discussed in § 5.1.
– Methods using node embeddings: We use RGM as discussed in § 5.2 with 16-dimensional signed node embeddings to capture the distribution of structural roles in the network, using sNCE and sRDE respectively.
For this task, we extract the signed who-comments-on-whom networks of 200 Instagram media sessions (§ 3). Each session has one of six ground-truth labels corresponding to the level of cyberbullying it contains [11], which we predict from the network structure. In Table 2, we see that the most powerful predictors are RGM using our structural embeddings. Intuitively, this suggests that the distribution of structural roles as captured by embeddings most informatively characterizes the network, more so than statistical signatures or graph kernels designed from hand-engineered features.
Table 2. Classifying the cyberaggression levels occurring in Instagram media sessions. Methods based on signed structural node embedding outperform baselines based on feature engineering.
Method      Accuracy
SMD         0.24
SDD         0.24
SGK         0.23
RGM-sNCE    0.32
RGM-sRDE    0.33
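The hand-engineered degree baselines used in this section (the node-level Degree features and the graph-level SMD and SDD signatures of § 5.1) all derive from the same four signed degree counts. The sketch below shows one way to compute them; the matrix convention, the function names, the single shared `max_degree` cap for SDD, and the toy graph are assumptions for illustration only.

```python
import numpy as np


def signed_degree_features(A):
    """Per-node [pos_out, neg_out, pos_in, neg_in] degrees.

    Assumption: A[i, j] in {+1, -1, 0} is the sign of the directed edge i -> j.
    """
    A = np.asarray(A)
    pos, neg = (A > 0).astype(int), (A < 0).astype(int)
    return np.stack([pos.sum(axis=1), neg.sum(axis=1),
                     pos.sum(axis=0), neg.sum(axis=0)], axis=1)


def smd(A):
    """Signed Maximum Degree: the maximum of each of the four statistics."""
    return signed_degree_features(A).max(axis=0)


def sdd(A, max_degree):
    """Signed Degree Distribution: four concatenated degree histograms.

    Assumption: one cap max_degree (at least the largest degree in the dataset)
    is shared by all four statistics to keep the sketch short.
    """
    F = signed_degree_features(A)
    return np.concatenate(
        [np.bincount(F[:, k], minlength=max_degree + 1) for k in range(4)])


# Hypothetical 3-node signed directed graph.
A = np.array([[0, 1, -1],
              [1, 0, 0],
              [0, -1, 0]])
F = signed_degree_features(A)
signature_smd = smd(A)
signature_sdd = sdd(A, max_degree=1)
```

The rows of `F` would serve as inputs to a node-level classifier, while `signature_smd` and `signature_sdd` give one fixed-length vector per graph.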
7 Conclusion
We have taken a new approach to signed social network mining using node embedding, characterizing nodes based on the structural roles that they play in the signed network. Our methods enable node-level and graph-level analysis that allow us to gain more insights into the social roles of social media users than was previously possible. As the problems we formulated are new, few benchmark datasets or baseline methods exist and we hope that our work will attract more interest to these important problems. Future work may incorporate metadata beyond the network topological structure alone.
Acknowledgments. The authors are grateful to the Defense Advanced Research Projects Agency (DARPA), contract W911NF-17-C-0094, for their support.
References
1. Chiang, K.-Y., Hsieh, C.-J., Natarajan, N., Dhillon, I.S., Tewari, A.: Prediction and clustering in signed networks: a local to global perspective. JMLR 15(1), 1177–1213 (2014)
2. Derr, T., Ma, Y., Tang, J.: Signed graph convolutional networks. In: ICDM (2018)
3. Derr, T., Wang, C., Wang, S., Tang, J.: Signed node relevance measurements. arXiv preprint arXiv:1710.07236 (2017)
4. Donnat, C., Zitnik, M., Hallac, D., Leskovec, J.: Learning structural node embeddings via diffusion wavelets. In: KDD (2018)
5. Errica, F., Podda, M., Bacciu, D., Micheli, A.: A fair comparison of graph neural networks for graph classification. In: ICLR (2020)
6. Goyal, P., Ferrara, E.: Graph embedding techniques, applications, and performance: a survey. Knowl. Based Syst. 151, 78–94 (2018)
7. Heider, F.: The Psychology of Interpersonal Relations. Psychology Press, Hove (2013)
8. Heimann, M., Safavi, T., Koutra, D.: Distribution of node embeddings as multi-resolution features for graphs. In: ICDM (2019)
9. Heimann, M., Shen, H., Safavi, T., Koutra, D.: REGAL: representation learning-based graph alignment. In: CIKM (2018)
10. Hosseinmardi, H., Mattson, S.A., Rafiq, R.I., Han, R., Qin, L., Mishra, S.: Analyzing labeled cyberbullying incidents on the Instagram social network. In: SocInfo (2015)
11. Hosseinmardi, H., Rafiq, R.I., Han, R., Qin, L., Mishra, S.: Prediction of cyberbullying incidents in a media-based social network. In: ASONAM (2016)
12. Hutto, C.J., Gilbert, E.: VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: ICWSM (2014)
13. Jin, D., Heimann, M., Safavi, T., Wang, M., Lee, W., Snider, L., Koutra, D.: Smart roles: inferring professional roles in email networks. In: KDD (2019)
14. Jin, J., Heimann, M., Jin, D., Koutra, D.: Understanding and evaluating structural node embeddings. In: KDD MLG Workshop (2020)
15. Kao, H.-T., Yan, S., Huang, D., Bartley, N., Hosseinmardi, H., Ferrara, E.: Understanding cyberbullying on Instagram and Ask.fm via social role detection. In: WWW Companion (2019)
16. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
17. Kunegis, J., Lommatzsch, A., Bauckhage, C.: The Slashdot Zoo: mining a social network with negative edges. In: WWW (2009)
18. Ribeiro, L.F.R., Saverese, P.H.P., Figueiredo, D.R.: Struc2vec: learning node representations from structural identity. In: KDD (2017)
19. Rossi, R.A., Ahmed, N.K.: Role discovery in networks. TKDE 27(4), 1112–1131 (2014)
20. Rossi, R.A., Jin, D., Kim, S., Ahmed, N.K., Koutra, D., Lee, J.B.: On proximity and structural role-based embeddings in networks: misconceptions, methods, and applications. TKDD 14, 1–13 (2020)
21. Shervashidze, N., Vishwanathan, S.V.N., Petri, T., Mehlhorn, K., Borgwardt, K.: Efficient graphlet kernels for large graph comparison. In: AISTATS (2009)
22. Wang, S., Tang, J., Aggarwal, C., Chang, Y., Liu, H.: Signed network embedding in social media. In: SDM (2017)
23. Yuan, S., Wu, X., Xiang, Y.: SNE: signed network embedding. In: PAKDD (2017)
On the Impact of Communities on Semi-supervised Classification Using Graph Neural Networks
Hussain Hussain1(B), Tomislav Duricic1,2, Elisabeth Lex1,2, Roman Kern1,2, and Denis Helic2
1 Know-Center GmbH, Graz, Austria
{hussain,tduricic,elisabeth.lex,rkern}@tugraz.com
2 Graz University of Technology, Graz, Austria
[email protected]
Abstract. Graph Neural Networks (GNNs) are effective in many applications. Still, there is a limited understanding of the effect of common graph structures on the learning process of GNNs. In this work, we systematically study the impact of community structure on the performance of GNNs in semi-supervised node classification on graphs. Following an ablation study on six datasets, we measure the performance of GNNs on the original graphs, and the change in performance in the presence and the absence of community structure. Our results suggest that communities typically have a major impact on the learning process and classification performance. For example, in cases where the majority of nodes from one community share a single classification label, breaking up community structure results in a significant performance drop. On the other hand, for cases where labels show low correlation with communities, we find that the graph structure is rather irrelevant to the learning process, and a feature-only baseline becomes hard to beat. With our work, we provide deeper insights into the abilities and limitations of GNNs, including a set of general guidelines for model selection based on the graph structure.
Keywords: Graph neural networks · Community structure · Semi-supervised learning
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 15–26, 2021. https://doi.org/10.1007/978-3-030-65351-4_2
Many real-world systems are modeled as complex networks, which are defined as graphs with complex structural features that cannot be observed in random graphs [12]. The existence of such features governs different processes and interactions between nodes in the graph. In particular, community structures are often found in empirical real-world complex networks. These structures have a major impact on information propagation across graphs [1] as they provide barriers for propagation [9]. The process of information propagation forms the basis
for many studied applications on graphs. Among these applications, graph-based semi-supervised learning [13,27] is widely studied and of particular interest. Graph-based semi-supervised learning aims to exploit graph structure in order to learn with very few labeled examples [15]. In recent years, state-of-the-art methods for this task have predominantly been graph neural networks (GNNs) [22].
Problem. Despite the outstanding results of GNNs on this and similar tasks, there is still a limited understanding of their abilities and constraints, which hinders further progress in the field [16,26]. To the best of our knowledge, there is a lack of research on how common graph structures, such as communities, impact the learning process of GNNs.
The Present Work. We are particularly interested in the influence of community structure on the performance of GNNs in semi-supervised node classification. From a practical perspective, we set out to provide a set of guidelines on the applicability of GNNs based on the relationships between communities and target labels. To that end, we design an evaluation strategy based on an ablation study on multiple graph datasets with varying characteristics in order to study the behaviour of GNNs. Using this evaluation strategy, we compare the performance of GNNs on six public graph datasets to a simple feature-based baseline (i.e., logistic regression) which ignores the graph structure. To gain a deeper understanding of the role of communities, we compare the change in performance of GNNs after (i) eliminating community structures while keeping the degree distribution, (ii) keeping the community structure while using a binomial degree distribution, and (iii) eliminating both. Finally, we link the evaluation results on the ablation models to the relationship between communities and labels in each dataset. To achieve this, we compute the uncertainty coefficient [19] of labels with respect to communities, before and after applying community perturbations.
Findings. Our results show that GNNs are able to successfully exploit the graph structure and outperform the feature-based baseline only when communities correlate with the node labels, which is known in the literature as the cluster assumption [4]. If this assumption fails, GNNs propagate noisy features across the graph and are unable to outperform the feature-based baseline.
Contributions. With our work, we highlight the limitations that community structures can impose on GNNs. We additionally show that the proposed uncertainty coefficient measure helps predict the applicability of GNNs to semi-supervised learning. We argue that this measure can set a guideline to decide whether to use GNNs on a certain semi-supervised learning task, given the relationship between communities and labels.
2 Background
Graph Neural Networks. Let G = (V, E) be a graph with a set of nodes V and a set of edges E. Let each node u ∈ V have a feature vector $x_u \in \mathbb{R}^d$, where d is the feature vector dimension, and a label $y_u \in \mathcal{L}$, where $\mathcal{L}$ is the set of labels.
GNNs are multi-layer machine learning models which operate on graphs. They follow a message passing and aggregation scheme where nodes aggregate the messages that are received from their neighbors and update their representation on this basis. In a GNN with K hidden layers (with the input layer denoted as layer 0), each node u has a vector representation $h_u^{(k)}$ of dimension $d_k$ at a certain layer k ≤ K. The transformation from a layer k to the next layer k + 1 is performed by updating the representation of each node u as follows:
$$a_u^{(k)} := \mathrm{AGGREGATE}_{v \in N(u)}\big(h_v^{(k)}\big), \qquad h_u^{(k+1)} := \mathrm{COMBINE}\big(h_u^{(k)}, a_u^{(k)}\big), \qquad (1)$$
where N(u) is the set of neighbors of u. The AGGREGATE function takes an unordered set of vectors of dimension $d_k$ as an input, and returns a single vector of the same dimension $d_k$, e.g., element-wise mean or max. The COMBINE function combines the representation of u in layer k with the aggregated representation of its neighbors, e.g., a concatenation followed by ReLU of a linear transformation, $\mathrm{COMBINE}(h, a) = \mathrm{ReLU}(W \cdot [h, a])$. We set the representation of u in the input layer to the input features: $h_u^{(0)} := x_u$. In classification problems, the dimension of the last layer $d_K$ equals the number of labels in the graph $|\mathcal{L}|$.
Semi-supervised Learning on Graphs. Semi-supervised learning aims to exploit unlabeled data in order to generate predictions given few labeled data. In graphs, this means exploiting the unlabeled nodes as well as the network structure to improve the predictions. Many semi-supervised classification methods on graphs assume that connected nodes are more likely to share their label [15], which is usually referred to as the cluster assumption [4]. Based on this assumption, approaches to solve this task usually aim to propagate node information along the edges. Earlier related approaches [21,27] focused on propagating label information from labeled nodes to their neighbors. In many applications, however, graph nodes can also be associated with feature vectors, which can be utilized by GNNs. GNNs achieved a significant improvement over the state of the art since they can effectively harness the unlabeled data, i.e., graph structure and node features.
Cluster Assumption. GNNs operate by propagating node feature vectors along the edges, hence exploiting both the graph structure and feature vectors. The GNN update rule in Eq. 1 can be seen as a form of (Laplacian) feature smoothing [15] as it combines the feature vector of a node with the feature vectors of its neighbors. Feature smoothing results in neighboring nodes having similar vector representations.
Therefore, with the cluster assumption in mind, feature smoothing potentially causes nodes with similar labels to also obtain similar vector representations. However, when the cluster assumption does not hold, i.e., connected nodes are less likely to share their label, the propagation in Eq. 1 can cause nodes with different labels to have similar vector representations. It is widely accepted that classifiers achieve better accuracy when similar vector representations tend to have similar labels.
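The update rule in Eq. 1 and its smoothing effect can be made concrete with a small NumPy sketch. It assumes a mean AGGREGATE and the example COMBINE given in the text, ReLU(W · [h, a]); the function name, the weight matrix, and the three-node path graph are hypothetical choices made only for illustration.

```python
import numpy as np


def gnn_layer(H, neighbors, W):
    """One application of Eq. 1 with mean AGGREGATE and COMBINE = ReLU(W [h, a]).

    H: (n, d_k) node representations at layer k.
    neighbors: list of neighbor-index lists, one per node.
    W: (d_{k+1}, 2 * d_k) weight matrix.
    """
    H = np.asarray(H, dtype=float)
    agg = np.zeros_like(H)
    for u, nbrs in enumerate(neighbors):
        if nbrs:                                # assumption: no neighbors -> zero message
            agg[u] = H[nbrs].mean(axis=0)       # a_u^{(k)}
    Z = np.concatenate([H, agg], axis=1)        # [h_u^{(k)}, a_u^{(k)}]
    return np.maximum(Z @ W.T, 0.0)             # ReLU of the linear transformation


# Hypothetical path graph 0 - 1 - 2 with two-dimensional input features.
neighbors = [[1], [0, 2], [1]]
H0 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
W = np.hstack([np.eye(2), np.eye(2)])           # makes COMBINE(h, a) = ReLU(h + a)
H1 = gnn_layer(H0, neighbors, W)
```

With this particular `W`, every node ends up with the same representation after a single layer, even though node 1 started with different features from its neighbors. If node 1 also carried a different label, the classifier would no longer be able to separate it, which is the failure mode described above when the cluster assumption does not hold.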
Communities and the Cluster Assumption. Communities are densely connected subgraphs, and they are common in empirical graphs including social, citation or web graphs. The existence of communities directly affects information propagation in graphs [5]. As communities are densely connected, the feature smoothing performed by the update rule in Eq. 1 tends to make nodes within the same community have similar vector representations. This dense connectivity also causes the cluster assumption to generalize to the community level, which means that nodes within the same community tend to share the same label. As a result, when the cluster assumption holds, GNNs cause nodes with the same label to have similar vector representations, simplifying the classification task on the resulting vector representations. Li et al. [15] hypothesize that this alignment of communities and node labels may be the main reason why GNNs achieve state-of-the-art performance on the classification task. In this paper we aim to experimentally test this hypothesis on six datasets from different domains.
In the other case, which is typically ignored in the literature, the cluster assumption does not hold, and a community could possibly have a variety of labels. The feature propagation between nodes of the same community would therefore result in feature smoothing for nodes with different labels. This eventually makes the classification task harder since representation similarity does not imply label similarity in this case. In summary, in this paper we set out to quantify the label-community correlation and how this correlation is related to the performance of GNNs on the semi-supervised classification task on graphs.
3 Methods and Experiments
We start by quantifying how much information a node’s community reveals about its label. For a labeled graph, let L be a random variable taking values in the set of labels $\mathcal{L}$, i.e., L(u) is the label of node u ∈ V. Assuming the graph is partitioned into a set of disjoint communities $\mathcal{C}$, we define another random variable C taking values in $\mathcal{C}$, i.e., C(u) is the community of node u ∈ V. To measure the fraction of uncertainty about L that is removed by knowing C, we use the uncertainty coefficient [19] of L given C. This coefficient can be written as $U(L \mid C) = I(L; C) / H(L) \in [0, 1]$, where H(L) is the entropy of L, and I(L; C) is the mutual information between L and C. When the uncertainty coefficient equals 1, all nodes within each community share the same label, and thus knowing the node’s community means that we also know the node’s label. On the other hand, when the uncertainty coefficient is 0, the label distribution is identical in all communities, so knowing the community of a node does not contribute to knowing its label. In general, the more uncertainty about the labels is eliminated by knowing the communities (i.e., the closer U(L|C) is to 1), the more likely it is that the cluster assumption holds, and thus that GNNs can exploit the graph structure, and vice versa.
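As a worked illustration of the coefficient defined above, the following NumPy sketch estimates U(L|C) from the empirical joint distribution of labels and communities. The function name, the choice to raise an error when H(L) = 0, and the two toy label assignments are assumptions made for this example.

```python
import numpy as np


def uncertainty_coefficient(labels, communities):
    """Estimate U(L|C) = I(L; C) / H(L) from per-node labels and communities."""
    _, l_idx = np.unique(labels, return_inverse=True)
    _, c_idx = np.unique(communities, return_inverse=True)
    joint = np.zeros((l_idx.max() + 1, c_idx.max() + 1))
    np.add.at(joint, (l_idx, c_idx), 1.0)       # co-occurrence counts
    joint /= joint.sum()                        # empirical joint distribution
    p_l, p_c = joint.sum(axis=1), joint.sum(axis=0)
    h_l = -np.sum(p_l * np.log(p_l))            # entropy H(L)
    if h_l == 0:
        # Assumption: the coefficient is treated as undefined for a single label.
        raise ValueError("H(L) = 0: all nodes share one label")
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p_l, p_c)[nz]))
    return mi / h_l


# Hypothetical graph with two communities of two nodes each.
communities = ["a", "a", "b", "b"]
aligned = uncertainty_coefficient([0, 0, 1, 1], communities)      # labels follow communities
unrelated = uncertainty_coefficient([0, 1, 0, 1], communities)    # same label mix everywhere
```

In the first assignment each community is label-pure, so the coefficient is at its maximum; in the second, both communities have the same label distribution, so knowing the community removes no uncertainty.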
3.1 Ablation Study
After establishing the intuitions behind the role of communities, we aim to show their impact experimentally. To achieve this, we evaluate five popular state-of-the-art GNN models on six empirical datasets. Subsequently, we re-evaluate these GNN models on the same datasets after applying ablation to their structures. In particular, we evaluate the accuracy of GNNs on the original datasets and compare this performance to the ones on the following ablation models:
• SBM networks. Here we aim to rebuild the graph while preserving the communities. Thus, we first perform community detection with the widely used Louvain method [2], which maximizes the modularity score. Secondly, we build a stochastic block matrix encoding the original density of edges within and between the detected communities. Finally, we use this matrix to construct a graph with the stochastic block model (SBM) [10]. This graph preserves a node’s community, features and label but results in a binomial degree distribution [11].
• CM networks. In this ablation model, we apply graph rewiring using the so-called configuration model (CM) [18]. With this rewiring, each node keeps its degree, but its neighbors can become any of the nodes in the graph. This effectively destroys the community structure, while keeping the node’s degree, features and label.
• Random networks. We use Erdős-Rényi graphs [7] to eliminate both communities and degree distribution. The resulting graph has no community structure and features a binomial degree distribution. This ablation model can only spread noisy feature information across the graph.
Last but not least, for each of the original datasets and each of the ablation models we compute the community-label correlation by means of the uncertainty coefficient. To that end, we use the joint distribution of labels L and communities C (extracted by the Louvain method) to compute this coefficient for each graph.
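The three ablation models listed above can be sketched with NumPy alone. The code assumes an undirected graph given as a symmetric 0/1 adjacency matrix and a community assignment that has already been computed (for instance by a Louvain implementation, which is not shown). All function names are hypothetical, and the configuration model is realized here by plain stub matching, which may produce self-loops and repeated edges; the rewiring procedure actually used in the study may differ.

```python
import numpy as np


def block_densities(A, comm):
    """Edge density within and between communities (the stochastic block matrix)."""
    k = comm.max() + 1
    P = np.zeros((k, k))
    for r in range(k):
        for s in range(r, k):
            block = A[np.ix_(comm == r, comm == s)]
            n_r, n_s = block.shape
            if r == s:
                pairs, edges = n_r * (n_r - 1) / 2, block.sum() / 2
            else:
                pairs, edges = n_r * n_s, block.sum()
            P[r, s] = P[s, r] = edges / pairs if pairs > 0 else 0.0
    return P


def sample_sbm(P, comm, rng):
    """SBM ablation: keep communities, draw each edge independently."""
    n = len(comm)
    probs = P[comm[:, None], comm[None, :]]
    upper = np.triu(rng.random((n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)


def sample_er(A, rng):
    """Random ablation: Erdos-Renyi graph with the original edge density."""
    n = A.shape[0]
    p = A.sum() / (n * (n - 1))
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return (upper | upper.T).astype(int)


def sample_cm(A, rng):
    """CM ablation: shuffle edge stubs so that every node keeps its degree.

    Returns an edge list; self-loops and multi-edges are possible.
    """
    deg = A.sum(axis=1).astype(int)
    stubs = np.repeat(np.arange(len(deg)), deg)
    rng.shuffle(stubs)
    return stubs.reshape(-1, 2)


# Hypothetical graph: two triangles (communities 0 and 1) joined by one edge.
pairs = np.array([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
A = np.zeros((6, 6), dtype=int)
A[pairs[:, 0], pairs[:, 1]] = 1
A = A + A.T
comm = np.array([0, 0, 0, 1, 1, 1])
rng = np.random.default_rng(0)
P = block_densities(A, comm)
A_sbm = sample_sbm(P, comm, rng)
A_er = sample_er(A, rng)
edges_cm = sample_cm(A, rng)
```

Node features and labels are untouched by all three samplers, so a model can be retrained on each ablated structure with the same inputs and targets.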
In order to highlight the correlation between this coefficient and the applicability of GNNs, we show the computed coefficients on the given datasets along with the GNN performance.

3.2 Experiments
In our experiments¹, we study five GNN architectures which are widely used for semi-supervised classification on graphs: (a) Graph Convolutional Networks (GCN) [13], (b) Graph Sample and Aggregate (SAGE) [8], (c) Graph Attention Networks (GAT) [24], (d) Simple Graph Convolutions (SGC) [25], and (e) Approximate Personalized Propagation of Neural Predictions (APPNP) [14]. We additionally compare these approaches to a simple feature-only baseline,

¹ The implementation and technical details can be found at https://github.com/sqrhussain/structure-in-gnn.
H. Hussain et al.
Table 1. Dataset statistics after preprocessing (similar to Shchur et al. [22]). The label rate represents the fraction of nodes in the training set. The edge density is the number of existing undirected edges divided by the maximum possible number of undirected edges (ignoring self-loops). For the Twitter dataset, we clean the node feature vectors as in [23], i.e., we remove graph-based and some textual features.

Dataset         Labels  Features  Nodes   Edges   Edge density  Label rate
CORA-ML [21]    7       1,433     2,485   5,209   0.0017        0.0563
CiteSeer [21]   6       3,703     2,110   3,705   0.0017        0.0569
PubMed [17]     3       500       19,717  44,335  0.0002        0.0030
CORA-Full [3]   67      8,710     18,703  64,259  0.0004        0.0716
Twitter [20]    2       215       2,134   7,040   0.0031        0.0187
WebKB [6]       5       1,703     859     1,516   0.0041        0.0582
i.e., a logistic regression model, which ignores the graph structure and only uses the node features. The comparison to this baseline can indicate whether a GNN model is actually useful for the task on the respective datasets.

Datasets. To provide a better understanding of the roles of the studied structures, we aim for a diverse selection of datasets with respect to (a) domain, e.g., citation, social and web graphs, (b) structure, e.g., directed acyclic vs. cyclic, and (c) correlations between communities and labels, i.e., whether nodes of the same community tend to share the same label. Having this in mind, we use the datasets summarized in Table 1.

The label of a node in a citation graph (i.e., CORA-ML, CiteSeer, PubMed and CORA-Full) represents the topic of the paper. Citations are expected to be denser among papers with similar topics than between papers of different topics. For example, a publication about natural language processing would be more likely to cite other natural language processing papers than human-computer interaction papers. Therefore, one could intuitively expect that papers within the same graph community tend to share the same label.

The Twitter graph consists of users, where the edges represent retweets and node labels indicate whether a user is hateful or not. Here, one cannot easily assume the presence or the absence of a correlation between communities and labels (hateful or normal). In other words, we do not know whether hateful users typically form communities, as this highly depends on the discussion topics.

For the WebKB dataset, nodes are web pages, edges are links, and labels indicate the type of the web page, i.e., course, faculty, project, staff or student. In this case one cannot intuitively assume that nodes within a graph community share a label. For example, a web page of a staff member could be more likely to link to projects on which this staff member is working than to other staff members' web pages. Based on these intuitions, we consider these graphs sufficiently diverse concerning community impact on label prediction.
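The edge density and label rate columns of Table 1 follow directly from the definitions in its caption. The short sketch below recomputes them for two rows; the function names are illustrative, and the number of training examples per class (20 in general, 10 for WebKB) is taken from the evaluation setup of this section.

```python
def edge_density(n_nodes, n_edges):
    """Existing undirected edges divided by the maximum possible number (no self-loops)."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


def label_rate(n_nodes, n_classes, train_per_class):
    """Fraction of nodes that end up in the training set."""
    return n_classes * train_per_class / n_nodes


# CORA-ML row of Table 1: 2,485 nodes, 5,209 edges, 7 classes, 20 training nodes per class.
print(round(edge_density(2485, 5209), 4))  # 0.0017
print(round(label_rate(2485, 7, 20), 4))   # 0.0563

# WebKB row: 859 nodes, 5 classes, 10 training nodes per class.
print(round(label_rate(859, 5, 10), 4))    # 0.0582
```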
Evaluation Setup. While the original graphs are directed, we treat them as undirected by ignoring the edge direction (which is in line with previous research [8,13,22]). All of these graphs are preprocessed in a similar manner as in Shchur et al. [22], i.e., by removing nodes with rare labels and selecting the largest weakly connected component, except in WebKB, where we take four connected components representing four universities. Following the train/validation/test split strategy of [22], each random split consists of 50 labeled examples per class (20 for training and 30 for validation), and the rest are considered test examples. This applies to all of our datasets except WebKB, where we use 10 training examples and 15 validation examples per class due to the smaller number of nodes.

To evaluate the GNN models on the original graphs, we follow the evaluation setup of [22], with 100 random splits and 20 random model initializations for each split. The same process is carried out to evaluate the feature-only baseline (logistic regression) model. The evaluation is slightly different for the ablation studies (SBM, CM and random graphs), since they include an additional level of randomization. For a given dataset, we generate SBM, CM and random graphs with 10 different random seeds, which results in 10 SBM graphs, 10 CM graphs and 10 random graphs. The evaluation on each of these generated graphs is carried out through 50 different random splits and 10 different random initializations. As a result, the reported accuracy for a GNN architecture on the original graph summarizes 2,000 trainings of the GNN (100 random splits × 20 random initializations). Meanwhile, the reported accuracy of a GNN architecture on one of the ablation models (SBM, CM or random graphs) summarizes 5,000 trainings, i.e., 10 randomly generated graphs × 50 random splits × 10 random initializations.

Uncertainty Coefficient Calculation. To calculate U(L|C) for a set of nodes, the labels of these nodes must be available. Therefore, for each dataset, we compute U(L|C) using the labeled nodes from the training and validation sets.

3.3 Results
We summarize the evaluation results of the GNN models with respect to accuracy for the original graph and its corresponding ablation models in Fig. 1.

Could GNNs Outperform the Simple Baseline? Comparing the performance of GNNs on the original graphs with the feature-only baseline, all GNN models clearly outperform the baseline on the citation datasets. For the Twitter dataset, however, GAT could not outperform the baseline, while GCN and GraphSAGE outperformed it only by a small margin. This is more prominent for WebKB, where none of the GNNs outperforms the baseline on the original graph. This suggests that for the two latter datasets, the graph structure is either irrelevant to the learning process or is even hindering it. To test the statistical significance for each approach on each dataset against the corresponding baseline, we compute the non-parametric Mann-Whitney U test for unpaired data with the significance level α = 0.01 (Bonferroni corrected). The significance test shows that the performance of GNNs on the original graphs is significantly different from the performance of the feature-only baseline, i.e., GNNs significantly
Fig. 1. The accuracy of GNNs on the original graph and the ablation models of each dataset. The red dashed line represents the median accuracy of the feature-only baseline. The performance on the original graphs is generally higher than that of the baseline except for WebKB. On citation graphs, the baseline is clearly outperformed on the original graphs, and eliminating communities (CM graphs) results in a much higher accuracy drop than eliminating the degree sequence (SBM graphs). This is not the case for the other two datasets, where the baseline is not always outperformed on the original graphs and the drop in performance on SBM networks is substantial. This highlights the positive impact of communities on citation datasets and their negative impact on WebKB. The uncertainty coefficient is the highest for the three datasets in the top row and the lowest for WebKB, providing a potential explanation for the low GNN performance on this dataset.
outperform the baseline on all datasets except WebKB, where the baseline significantly outperforms the GNNs. Mapping our results back to the cluster assumption and the uncertainty coefficient, we notice that when U(L|C) is high, i.e., for CORA-ML (.691), CiteSeer (.647) and PubMed (.673), GNNs are able to consistently outperform the baseline. Meanwhile, when U(L|C) is low, i.e., for WebKB (.320), GNNs are not useful for the task.

Ablation Results. On citation datasets, the accuracy drop on the SBM graphs is smaller than for the other two ablation models (CM and random graphs). As SBM graphs preserve the community structure, this observation shows a noticeable impact of communities on node classification in citation datasets. However, this behavior is not always observed on the Twitter dataset, where the performances on the original graph and on the ablation models overlap. This performance overlap is even more prominent on WebKB. The SBM graphs generally yield the lowest accuracy on the WebKB dataset, showing a negative effect of preserving communities in this dataset. These observations again indicate that communities are boosting the prediction for the citation datasets (where U(L|C) is high) while hindering it for the WebKB dataset (where U(L|C) is low).
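The significance test used earlier in this section (an unpaired Mann-Whitney U test at a Bonferroni-corrected level) can be sketched with the standard library as follows. This is an illustrative approximation rather than the authors' code: the p-value uses the normal approximation without tie or continuity correction, and `n_comparisons` is a hypothetical parameter, since the text does not state how many tests the correction covers.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test for unpaired samples.

    Uses the normal approximation without tie or continuity correction,
    so the p-value is only a rough guide for small or heavily tied samples.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        rank_sum_x += mean_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


def is_significant(p_value, alpha=0.01, n_comparisons=1):
    """Bonferroni correction: compare against alpha divided by the number of tests."""
    return p_value < alpha / n_comparisons
```

Here `x` and `y` would hold the test accuracies of a GNN and of the baseline over the repeated trainings of one dataset.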
3.4 Discussion
In our experiments, we observe that the uncertainty coefficient is high for the citation datasets, relatively low for Twitter, and much lower for WebKB. This correlates with the classification performance on citation datasets, which show (a) a consistent performance on the original graphs and (b) a better accuracy on SBM graphs compared to the other ablations. In contrast, in the Twitter dataset, where GNNs only outperform the baseline by a small margin, this behavior is reflected in a coefficient value that is smaller than in three out of four citation datasets. For WebKB, the coefficient is particularly low, in line with the observation that GNNs are unable to beat the simple baseline on the original network. Our observations suggest that the uncertainty coefficient can indicate whether GNNs are applicable, depending on the relationship between the communities and labels.

To shed more light on the correlation between the uncertainty coefficient and the classification performance of GNNs, we now study the change of both GNN accuracy and the measured uncertainty coefficient on the given datasets after applying additional community perturbations. In particular, we start with the SBM networks for each dataset and perform the following perturbations. We randomly select a fraction of nodes and assign them to different communities by simply swapping the nodes' positions in the network. Then we gradually increase the fraction of the selected nodes to obtain a spectrum of the uncertainty coefficient. Finally, we compute U(L|C) and the accuracy of the GCN model on each of the obtained graphs and show the correlation between the two measures. We choose the SBM networks for this experiment to guarantee that the node swapping only changes nodes' communities and not their importance.

We expect that these perturbations increasingly reduce U(L|C) when the cluster assumption holds. We show that U(L|C) decreases and then converges for the first five datasets (cf. Fig. 2 [top]), supporting that the community perturbations decrease the correlation between communities and labels. In these cases, the GNN accuracy has a positive correlation with U(L|C) (cf. Fig. 2 [bottom]). However, when community perturbations do not reduce the correlation between communities and labels, U(L|C) is already at a convergence level. That is the case where GNNs were not able to outperform the feature-based baseline, i.e., WebKB. By reading the bottom row of Fig. 2, we see that when U(L|C) is below 0.3, the accuracy becomes unacceptably low. On the contrary, if U(L|C) is above 0.7, the accuracy is high enough for the respective dataset.

Guideline for Application of GNNs. To verify whether a GNN model is applicable on a certain real-world dataset, we suggest the following two-step guideline based on the previous observations.

• The first step is to perform community detection on the dataset and inspect the uncertainty coefficient for it. If the coefficient is particularly low, e.g., below 0.3, there is a high confidence that GNNs will not work. If the coefficient is particularly high, e.g., above 0.7, it is likely that GNNs can exploit the
Fig. 2. The top row shows the calculated U(L|C) for each dataset. The swap fraction represents the fraction of nodes which changed their community. We see a decline in the coefficient with increasing swap fraction, and then a convergence. This line is already converging in WebKB, as the measure is already low in the original network. The bottom row shows the test accuracy with changing U(L|C). We see a positive correlation between the uncertainty coefficient and the accuracy, which is weak for Twitter and non-existent for WebKB, supporting our observations from above.
network structure and be helpful for the prediction task. Otherwise, if the value is inconclusive, e.g., around 0.5, we advise performing the second step.
• The second step involves gradual community perturbations and inspection of the respective uncertainty coefficient. If the value of the coefficient decreases with more perturbations, this supports that the cluster assumption holds in the original graph, and GNNs are applicable. Otherwise, if it does not decrease, the cluster assumption most likely does not hold in the first place, and feature-based methods are more advisable than GNNs.
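The two-step guideline above can be sketched in a few lines of Python. The sketch assumes the standard definition of the uncertainty coefficient, U(L|C) = (H(L) − H(L|C)) / H(L), and uses hypothetical function names. The perturbation simply permutes community ids among a random fraction of nodes, which is a simplified stand-in for the node swapping on SBM networks described in this section; community detection itself is assumed to be done elsewhere.

```python
import math
import random
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of a distribution given by raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def uncertainty_coefficient(labels, communities):
    """U(L|C) = (H(L) - H(L|C)) / H(L), with H(L|C) = H(L, C) - H(C)."""
    h_l = entropy(Counter(labels).values())
    if h_l == 0:
        raise ValueError("U(L|C) is undefined when all nodes share one label")
    h_c = entropy(Counter(communities).values())
    h_lc = entropy(Counter(zip(labels, communities)).values())
    return (h_l + h_c - h_lc) / h_l


def first_step(u, low=0.3, high=0.7):
    """Step 1: interpret the coefficient using the thresholds suggested in the text."""
    if u < low:
        return "GNNs unlikely to help"
    if u > high:
        return "GNNs likely to help"
    return "inconclusive: run the perturbation step"


def perturb_communities(communities, fraction, seed=0):
    """Permute the community ids of a random fraction of nodes (simplified swap)."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(communities)), int(fraction * len(communities)))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    perturbed = list(communities)
    for i, j in zip(chosen, shuffled):
        perturbed[i] = communities[j]
    return perturbed


def perturbation_curve(labels, communities, fractions=(0.0, 0.2, 0.4, 0.6), seed=0):
    """Step 2: U(L|C) for increasing swap fractions, to check whether it decreases."""
    return [uncertainty_coefficient(labels, perturb_communities(communities, f, seed))
            for f in fractions]
```

A curve that starts well above its final value supports the cluster assumption; a flat curve suggests that a feature-only method is the safer choice.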
4 Conclusion
In this work, we analyzed the impact of community structures in graphs on the performance of GNNs in semi-supervised node classification, with the goal of uncovering limitations imposed by such structures. By conducting ablation studies on the given graphs, we showed that GNNs outperform a given baseline on this task when the cluster assumption holds. Otherwise, GNNs cannot effectively exploit the graph structure for label prediction. Additionally, we presented an analysis of the relation between labels and graph communities. Based on this analysis, we suggest that when community information does not contribute to label prediction, it is not advisable to use such GNNs. In particular, we showed that the uncertainty coefficient of node labels given their communities can indicate whether the cluster assumption holds. We further formalized a guideline for deciding where to apply GNNs based on community-label correlation. Our work contributes to the intrinsic validation of the applicability of GNN models. Future work can also investigate the effect of other graph structural properties such as edge directionality, degree distribution and graph assortativity on GNN performance.
References

1. Barabási, A.L.: Network science. Philos. Trans. Royal Soc. A: Math. Phys. Eng. Sci. 371(1987), 20120375 (2013)
2. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008 (2008)
3. Bojchevski, A., Günnemann, S.: Deep Gaussian embedding of graphs: unsupervised inductive learning via ranking. In: International Conference on Learning Representations, pp. 1–13 (2018)
4. Chapelle, O., Scholkopf, B., Zien, A.: Semi-supervised learning. IEEE Trans. Neural Netw. 20(3), 542–542 (2009). (chapelle, o. et al., eds.; 2006)[book reviews]
5. Cherifi, H., Palla, G., Szymanski, B.K., Lu, X.: On community structure in complex networks: challenges and opportunities. Appl. Netw. Sci. 4(1), 1–35 (2019)
6. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., Slattery, S.: Learning to extract symbolic knowledge from the World Wide Web. In: Proceedings of the National Conference on Artificial Intelligence, pp. 509–516 (1998)
7. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5(1), 17–60 (1960)
8. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems, pp. 1024–1034 (2017)
9. Hasani-Mavriqi, I., Kowald, D., Helic, D., Lex, E.: Consensus dynamics in online collaboration systems. Comput. Soc. Netw. 5(1), 2 (2018)
10. Holland, P.W., Laskey, K.B., Leinhardt, S.: Stochastic blockmodels: first steps. Soc. Netw. 5(2), 109–137 (1983)
11. Karrer, B., Newman, M.E.: Stochastic blockmodels and community structure in networks. Phys. Rev. E 83(1), 016107 (2011)
12. Kim, J., Wilhelm, T.: What is a complex graph? Phys. A: Stat. Mech. Appl. 387(11), 2637–2652 (2008)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR) (2017)
14. Klicpera, J., Bojchevski, A., Günnemann, S.: Predict then propagate: Graph neural networks meet personalized PageRank. In: 7th International Conference on Learning Representations, ICLR 2019 (2019)
15. Li, Q., Han, Z., Wu, X.M.: Deeper insights into graph convolutional networks for semi-supervised learning. In: 32nd AAAI Conference on Artificial Intelligence, AAAI 2018 (2018)
16. Loukas, A.: What graph neural networks cannot learn: depth vs width. In: International Conference on Learning Representations (2020). https://openreview.net/forum?id=B1l2bp4YwS
17. Namata, G., London, B., Getoor, L., Huang, B., EDU, U.: Query-driven active surveying for collective classification. In: 10th International Workshop on Mining and Learning with Graphs, Vol. 8 (2012)
18. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)
19. Press, W.H., Teukolsky, S.A., Flannery, B.P., Vetterling, W.T.: Numerical Recipes in FORTRAN 77. FORTRAN numerical recipes: the art of scientific computing, vol. 1. Cambridge University Press, Cambridge (1992)
20. Ribeiro, M.H., Calais, P.H., Santos, Y.A., Almeida, V.A., Meira, Jr., W.: “like sheep among wolves”: Characterizing hateful users on twitter (2017). arXiv preprint: arXiv:1801.00317
21. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Mag. 29(3), 93–93 (2008)
22. Shchur, O., Mumme, M., Bojchevski, A., Günnemann, S.: Pitfalls of graph neural network evaluation. In: Relational Representation Learning Workshop, NeurIPS 2018 (2018)
23. Tiao, L., Elinas, P., Nguyen, H., Bonilla, E.V.: Variational Spectral Graph Convolutional Networks. In: Graph Representation Learning Workshop, NeurIPS 2019 (2019)
24. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph Attention Networks. International Conference on Learning Representations (2018). https://openreview.net/forum?id=rJXMpikCZ. (Accepted as poster)
25. Wu, F., Zhang, T., Souza Jr., A.H., Fifty, C., Yu, T., Weinberger, K.Q.: Simplifying graph convolutional networks (2019). arXiv preprint: arXiv:1902.07153
26. Xu, K., Jegelka, S., Hu, W., Leskovec, J.: How powerful are graph neural networks? In: 7th International Conference on Learning Representations, ICLR 2019 (2019)
27. Zhu, X., Ghahramani, Z., Lafferty, J.D.: Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International conference on Machine learning (ICML2003), pp. 912–919 (2003)
Detecting Geographical Competitive Structure for POI Visit Dynamics

Teru Fujii¹, Masahito Kumano¹, João Gama², and Masahiro Kimura¹(B)

¹ Faculty of Advanced Science and Technology, Ryukoku University, Otsu, Japan
[email protected]
² LIAAD, INESC TEC, University of Porto, Porto, Portugal
Abstract. We provide a framework for analyzing geographical influence networks that have impacts on visit event sequences for a set of points of interest (POIs) in a city. Since mutually-exciting Hawkes processes can naturally model temporal event data and capture interactions between those events, previous work presented a probabilistic model based on Hawkes processes, called the CHP model, for finding cooperative structure among online items from their share event sequences. In this paper, based on Hawkes processes, we propose a novel probabilistic model, called the RH model, for detecting geographical competitive structure in the set of POIs, and present a method of inferring it from the POI visit event history. We mathematically derive an analytical approximation formula for predicting the popularity of each of the POIs for the RH model, and also extend the CHP model so as to extract geographical cooperative structure. Using synthetic data, we first confirm the effectiveness of the inference method and the validity of the approximation formula. Using real data of Location-Based Social Networks (LBSNs), we demonstrate the significance of the RH model in terms of predicting future events, and uncover the latent geographical influence networks from the perspective of geographical competitive and cooperative structures.
Keywords: Latent influence network · Point process model · Social network analysis

1 Introduction
The rise of Location-Based Social Networks (LBSNs) and the progress in sensor technology are increasing the availability of a large amount of spatio-temporal event data, and attention has been drawn to the analysis and mining of such data [17]. Recently, sets of points of interest (POIs) in a city have become available from LBSNs and social media. Through check-in and sensor data, it is also becoming possible to know when people's visit events for those POIs actually occurred, i.e., to obtain the history data of visit events for those POIs. It is desired to find the underlying geographical influence structures of such spatio-temporal data, which are of fundamental importance for tourism marketing,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 27–38, 2021. https://doi.org/10.1007/978-3-030-65351-4_3
T. Fujii et al.
urban planning, etc. In this paper, we aim to model the occurrence process of visit events for all the POIs involved, and detect latent geographical influence networks having impacts on the POI visit dynamics.

We consider the properties of POI visit dynamics in a city by analogy with the share event dynamics of online items in cyberspace [12,19]. First, we can simply assume that visit events for a POI are not only caused by its own attractiveness but also influenced by the occurrences of previous visit events, i.e., POI visit events have a self-exciting nature. We can also consider that visit events for a POI are influenced by the previous visit events for other relevant POIs, i.e., POI visit events exhibit a mutually-exciting nature. Moreover, it is natural to assume that the influence from a previous visit event temporally decays. These properties can be naturally captured through Hawkes processes [8].

In our previous work [12], we presented a probabilistic model, called the cooperative Hawkes process (CHP) model, for finding cooperative structure of a set of online items from their share event sequences by properly combining a Dirichlet process [13] (i.e., a Chinese restaurant process (CRP)) with Hawkes processes. Here, the cooperative structure is defined as a partition of the set of online items into cooperative groups, where every item in a cooperative group sends an influence of the same strength to any item under the same temporal decay rate. In this paper, we refer to the CHP model as the sender homogeneous model with CRP prior (SH-CRP model) to emphasize its basic scheme. Note that although the SH-CRP model is regarded as a special case of the multivariate Hawkes process with complete connection (MHP), it is not practical to accurately estimate the parameters of the MHP from a limited amount of observed data for a large number of items. In fact, using both synthetic and real social media data, we showed in the previous work [12] that the SH-CRP model significantly outperforms the MHP in predicting future events from a relatively small number of observed events. This implies the importance of the cooperative structure. Thus, we try to extend the SH-CRP model for an analysis of influence structures among POIs.

In this paper, as another interpretable influence structure in a set of POIs, we newly introduce geographical competitive structure, where each group constituting the structure is formed by geographically nearby POIs, and every POI in such a group receives an influence of the same strength from any POI under the same temporal decay rate. It is considered that POIs belonging to the same geographical competitive group compete with each other for gaining more popularity through their own attractiveness degrees. For detecting the geographical competitive structure from visit event sequences for the set of POIs, we propose the receiver homogeneous (RH) model by incorporating a distance dependent Chinese restaurant process (ddCRP) [1] into Hawkes processes in a novel way. For the RH model, we present its inference method from the POI visit event history, and mathematically derive an analytical approximation formula for the popularity prediction. Moreover, by appropriately extending the SH-CRP model to the one with ddCRP prior, we construct the sender homogeneous (SH) model for detecting geographical cooperative structure, where POIs belonging to each cooperative group are geographically clustered. Using synthetic data, we first confirm
the effectiveness of the inference method and the validity of the approximation formula. Using real data of LBSNs, we demonstrate the importance of the RH model in terms of predicting the future events, and uncover the latent geographical influence networks from the perspective of geographical competitive and cooperative structures.
2 Related Work
Gomez-Rodriguez et al. [7] and Daneshmand et al. [3] investigated the problem of inferring the underlying network of social influence from observed information diffusion sequences. In order to model information cascades in a social network, Hawkes processes are frequently used [5]. By exploiting Hawkes processes, several researchers extracted an implicit weighted network among a set of entities from event sequences associated with them to provide deep insights into the underlying structures of the dynamical data [11,17,20]. Also, by combining Hawkes processes with an infinite relational model or a stochastic block model, Blundell et al. [2] and Junuthula et al. [9] presented methods of discovering latent communities and their influence structure from continuous-time event-based dynamic network data (i.e., temporal interaction data). Unlike these investigations, we focus on geographical competitive and cooperative structures among a given set of POIs, and aim to analyze them by introducing receiver and sender homogeneous point process models based on Hawkes processes and presenting their inference method. On the other hand, Lin et al. [10] provided a univariate interaction point process based on the univariate Hawkes process to obtain the clustering and branching structure of observed spatio-temporal events, where a ddCRP is assumed as a prior for the branching structure. Although their model looks similar to our SH model, the two models are fundamentally different. In fact, unlike their model, the SH model is based on the multivariate Hawkes process (MHP) and generates a temporal event sequence for each node of a given set of nodes. Moreover, to the best of our knowledge, this paper is the first attempt at introducing the concept of the receiver homogeneous (RH) model for detecting competitive structure.
Several researchers investigated a problem of predicting which POI a user will visit in the next discrete-time point based on the history of users’ POI check-ins for LBSNs (see e.g., [6,18]). Also, by suitably taking into account geographical information of POIs, recent studies have improved the performance of predicting a set of POIs a user will visit in the near future [15]. For privacy protection and security reasons, we believe that it should be in general difficult to keep track of who visited which POI. Unlike those studies, we focus on modeling the occurrence process of visit events for all the involved POIs in a continuous-time axis to discover the latent geographical influence networks among them without knowing who visited which POI.
3 Proposed Model

In order to detect geographical competitive and cooperative structures for a set $U$ of POIs in a city during a time period $\mathcal{T} = [0, T)$, we propose probabilistic models for the occurrence process of visit events for $U$, where $T$ $(> 0)$ is not so large (e.g., several days).

3.1 Preliminaries
Based on Hawkes processes, we consider modeling the temporal sequence of visit events for an arbitrary $u \in U$ as a temporal point process $N_u(t)$ with conditional intensity function $\lambda_u(t)$, where $N_u(t)$ expresses the number of visit events for $u$ during time period $[0, t)$. Let $H_u(t) = \{(u, t_n) \mid n = 1, \dots, N_u(t)\}$ denote the history of visit events for $u$ up to but not including time $t$. We put $N(t) = \sum_{u \in U} N_u(t)$ and $H(t) = \bigcup_{u \in U} H_u(t)$. Note that $\lambda_u(t)\,dt$ is the conditional expectation of the number of visit events for $u$ within a small time window $[t, t + dt)$ given $H(t)$, i.e., $\lambda_u(t)\,dt = \mathbb{E}[dN_u(t) \mid H(t)]$ (see [5]), and the probability density (i.e., likelihood) of the observed data $H(T)$ is given by

$$p(H(T)) = \exp\left(-\sum_{u \in U} \int_0^T \lambda_u(t)\,dt\right) \prod_{(u_n, t_n) \in H(T)} \lambda_{u_n}(t_n). \tag{1}$$
We briefly recall the definition of the sender homogeneous model with CRP prior (SH-CRP model) (i.e., the CHP model [12]). For any $v \in U$, let $z(v)$ denote the assignment of $v$ to a cooperative group, where $z(v)$ is generated from a CRP. The conditional intensity function $(\lambda_u(t))_{u \in U}$ of the SH-CRP model is defined by

$$\lambda_u(t) = \mu_u + \sum_{v \in U} \sum_{(v, t_n) \in H_v(t)} W_{u, z(v)}\, g_{z(v)}(t - t_n),$$

where $\mu_u$ $(> 0)$ and $W_{u,k}$ $(> 0)$ represent the attractiveness degree of $u$ and the strength of influence from the $k$th cooperative group to $u$, respectively. Here, $g_k(t)$ is a kernel function with decay rate $\gamma_k$ $(> 0)$, i.e., $g_k(t) = \gamma_k e^{-\gamma_k t}$ for $t \ge 0$ and $g_k(t) = 0$ for $t < 0$. We set $G_k(t) = \int_0^t g_k(\tau)\,d\tau$.

3.2 Receiver and Sender Homogeneous Models
We propose the receiver homogeneous (RH) model for detecting geographical competitive groups $\{R_k\}$ in $U$ based on $H(T)$. For any $u \in U$, let $z(u)$ denote the assignment of $u$ to a geographical competitive group, i.e., $u \in R_{z(u)}$. The conditional intensity function $(\lambda_u(t))_{u \in U}$ of the RH model is defined by

$$\lambda_u(t) = \mu_u + \sum_{v \in U} \sum_{(v, t_n) \in H_v(t)} W_{z(u), v}\, g_{z(u)}(t - t_n), \tag{2}$$

where $\mu_u$ $(> 0)$ and $W_{k,v}$ $(> 0)$ represent the attractiveness degree of $u$ and the strength of influence from $v$ to the $k$th geographical competitive group, respectively. In order to cluster geographically nearby POIs and estimate the number $K$ of geographical competitive groups from the observed data, we incorporate the distance dependent Chinese restaurant process (ddCRP) [1], where $z(u)$ is generated from a ddCRP.
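To make Eq. (2) and the likelihood of Eq. (1) concrete, the following sketch evaluates the RH intensity and the corresponding log-likelihood for the exponential kernel $g_k(t) = \gamma_k e^{-\gamma_k t}$, for which $G_k(t) = 1 - e^{-\gamma_k t}$. The data layout is an assumption made for illustration: `history` is a list of `(poi, time)` pairs, `z` maps a POI to its group index, `W[k][v]` stores $W_{k,v}$, and the function names are hypothetical.

```python
import math


def rh_intensity(u, t, history, z, mu, W, gamma):
    """Eq. (2): mu_u plus decayed influence W[z(u)][v] from every earlier event (v, t_n)."""
    k = z[u]
    rate = mu[u]
    for v, t_n in history:
        if t_n < t:  # the history is taken up to but not including time t
            rate += W[k][v] * gamma[k] * math.exp(-gamma[k] * (t - t_n))
    return rate


def rh_log_likelihood(history, T, pois, z, mu, W, gamma):
    """Logarithm of Eq. (1) for the RH model on the observation window [0, T)."""
    integral = 0.0
    for u in pois:
        k = z[u]
        integral += mu[u] * T
        for v, t_n in history:
            # Integral of W * g_k(t - t_n) over [t_n, T) equals W * G_k(T - t_n).
            integral += W[k][v] * (1.0 - math.exp(-gamma[k] * (T - t_n)))
    log_terms = sum(math.log(rh_intensity(u_n, t_n, history, z, mu, W, gamma))
                    for u_n, t_n in history)
    return log_terms - integral
```

Because the influence strength is indexed by the receiver's group $z(u)$ and the sender $v$, all POIs in one group share the same row `W[k]`, which is what distinguishes the RH model from the sender homogeneous formulation.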
In order to present the generative process of the RH model, we specify the prior distributions of the model parameters. First, a ddCRP draws the partition $Z = (z(u))_{u \in U}$ in the following way. For each $u \in U$, we introduce a node assignment $c_u$ $(\in U)$ to induce a graph among $U$. The ddCRP independently draws node assignments $C = (c_u)_{u \in U}$ from the distribution

$$p(c_u) \propto \begin{cases} h(d(c_u, u)) & \text{if } c_u \neq u, \\ \alpha & \text{if } c_u = u, \end{cases} \tag{3}$$

for any $u \in U$, where $d(u, v)$ stands for the spatial distance between $u$ and $v$ for all $u, v \in U$, $h(s)$ $(\ge 0)$ is a decay function, and $\alpha$ is a given positive constant. For simplicity, we adopt a window decay $h(s) = I(0 \le s < \rho_0)$ in the experiments, where $\rho_0$ is a given positive constant. Here, $I(q)$ is the indicator function such that $I(q) = 1$ if $q$ is true, and $I(q) = 0$ otherwise. Let $Z(C)$ denote the partition of $U$ which is derived by the connected components of the graph among $U$ induced from $C$ (see [1]). Then, $Z(C)$ is the partition $Z$ drawn by the ddCRP.

Next, we specify the prior distributions of the model parameters other than $Z$. We set $\mu = (\mu_u)$, $W = (W_{k,u})$ and $\gamma = (\gamma_k)$. The parameters $\mu$, $W$ and $\gamma$ are independently drawn from Gamma distributions with hyperparameters $\beta^\mu = (\beta_0^\mu, \beta_1^\mu)$, $\beta^w = (\beta_0^w, \beta_1^w)$ and $\beta^\gamma = (\beta_0^\gamma, \beta_1^\gamma)$, respectively; i.e., $\mu_u \sim \mathrm{Gamma}(\beta^\mu)$, $W_{k,u} \sim \mathrm{Gamma}(\beta^w)$ and $\gamma_k \sim \mathrm{Gamma}(\beta^\gamma)$ for $k = 1, \dots, K$ and $u \in U$.

By extending the SH-CRP model to the one with ddCRP prior in the same way as the RH model, we also introduce the sender homogeneous (SH) model for detecting geographical cooperative groups $\{S_k\}$ in $U$ based on $H(T)$.
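The ddCRP draw of Eq. (3) with the window decay can be sketched as follows: each POI links to itself with weight $\alpha$ or to any other POI closer than $\rho_0$ with weight 1, and the groups are the connected components of the resulting link graph. The sketch assumes Euclidean distance between coordinate tuples and uses hypothetical names; it illustrates the prior only, not the posterior sampler of the inference method.

```python
import math
import random


def draw_ddcrp_partition(coords, alpha, rho0, seed=0):
    """Draw node assignments c_u from Eq. (3) and return (links, group index per node)."""
    rng = random.Random(seed)
    n = len(coords)
    links = []
    for u in range(n):
        # Weight alpha for a self-link, window decay I(0 <= d < rho0) for other nodes.
        weights = [alpha if c == u else float(math.dist(coords[c], coords[u]) < rho0)
                   for c in range(n)]
        links.append(rng.choices(range(n), weights=weights)[0])

    # Groups are the connected components of the link graph (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, c in enumerate(links):
        parent[find(u)] = find(c)

    group_of_root = {}
    z = [group_of_root.setdefault(find(u), len(group_of_root)) for u in range(n)]
    return links, z
```

With the window decay, a POI can only join a group through a chain of links between POIs that are pairwise closer than $\rho_0$, which is how the prior keeps each group geographically compact.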
4 Inference Method
For the observed data H(T) = {(u_n, t_n) | n = 1, ..., N(T)}, we present a parameter learning method of the RH and SH models, and provide a framework for prediction and influence network analysis in terms of geographical competitive and cooperative structures.
4.1 Parameter Learning
In our previous work [12], we developed a learning method for SH-CRP model. By newly taking into account the ddCRP prior, we derive a learning method for the RH and SH models as an extension of [12]. Here, for the RH model, we sketch the method of estimating the parameters Z, γ, μ and W from H(T), since the case of SH model is similar.
First, from the additivity for independent Poisson processes, we introduce latent variables Y = (y_n)_{n=1,...,N(T)} such that the nth event (u_n, t_n) is triggered by the y_n-th event (u_{y_n}, t_{y_n}), where y_n = 0, 1, ..., n − 1, and y_n = 0 means that the nth event is caused by the attractiveness μ_{u_n} of u_n (see [11,12]). Then, the intensity λ_{u_n}(t_n, y_n) of u_n at time t_n from the y_n-th event is given by λ_{u_n}(t_n, y_n) = μ_{u_n} for y_n = 0 and λ_{u_n}(t_n, y_n) = W_{z(u_n),u_{y_n}} g_{z(u_n)}(t_n − t_{y_n}) for 0 < y_n < n. Note that
\[ \lambda_{u_n}(t_n) = \sum_{y_n=0}^{n-1} \lambda_{u_n}(t_n, y_n). \]
Thus, given Z, γ, μ and W, the joint likelihood of H(T) and Y is provided by
\[ p(H(T), Y \mid Z, \gamma, \mu, W) = \prod_{n=1}^{N(T)} \lambda_{u_n}(t_n, y_n) \, \exp\Bigl( - \sum_{u \in U} \int_0^T \lambda_u(t) \, dt \Bigr) \]
(see Eq. (1)). From Eq. (2), this can be analytically marginalized over μ and W. Thus, we can obtain
\[ p(H(T), Y \mid Z, \gamma, \beta^\mu, \beta^w) = Q_0(\beta^\mu) \prod_{k=1}^{K} Q_k(Z, \gamma, \beta^w), \qquad (4) \]
where
\[ Q_0(\beta^\mu) = \prod_{u \in U} \frac{\Gamma(B_u + \beta_0^\mu) \, (\beta_1^\mu)^{\beta_0^\mu}}{(T + \beta_1^\mu)^{B_u + \beta_0^\mu} \, \Gamma(\beta_0^\mu)}, \]
\[ Q_k(Z, \gamma, \beta^w) = D_k \prod_{u \in U} \frac{\Gamma(B_{k,u} + \beta_0^w) \, (\beta_1^w)^{\beta_0^w}}{(D_{k,u} + \beta_1^w)^{B_{k,u} + \beta_0^w} \, \Gamma(\beta_0^w)}. \]
Here, Γ(s) is the gamma function, and
\[ B_u = \sum_{n=1}^{N(T)} I(y_n = 0) \, I(u_n = u), \qquad D_k = \prod_{n=1}^{N(T)} g_k(t_n - t_{y_n})^{I(y_n > 0) \, I(z(u_n) = k)}, \]
\[ B_{k,u} = \sum_{n=1}^{N(T)} I(y_n > 0) \, I(z(u_n) = k) \, I(u_{y_n} = u), \qquad D_{k,u} = \sum_{v \in U} \sum_{n=1}^{N(T)} G_{z(v)}(T - t_n) \, I(k = z(v)) \, I(u = u_n). \]
By exploiting Eq. (4), we derive a learning method for the RH model. Specifically, by iterating the following four steps, we obtain the estimates of Z, γ, μ and W: 1) ddCRP Gibbs sampling for Z; 2) Gibbs sampling for Y; 3) Metropolis-Hastings sampling for γ; 4) updating of μ and W by the expected values of their posteriors, and updating of β^μ, β^w and β^γ by the maximum likelihood method. In step 1), we get a sample for Z from the following ddCRP Gibbs sampler:
\[ p(c_u^{new} \mid C^{-u}, H(T), Y, \gamma, \beta^\mu, \beta^w) \propto p(c_u^{new}) \, p(H(T), Y \mid Z(C^{-u} \cup c_u^{new}), \gamma, \beta^\mu, \beta^w) \]
for each u ∈ U (see Eqs. (3) and (4)), where C^{-u} and C^{-u} ∪ c_u^{new} mean removing node assignment c_u from C and adding a new node assignment c_u^{new} to C^{-u}, respectively. Note that the Gibbs sampler is efficiently calculated, since we can show that it is proportional to p(c_u^{new}) unless Z(C^{-u} ∪ c_u^{new}) joins groups k and ℓ in Z(C^{-u}); otherwise it is proportional to
\[ \frac{h(d(c_u^{new}, u)) \, Q_k(Z(C^{-u} \cup c_u^{new}), \gamma, \beta^w)}{Q_k(Z(C^{-u}), \gamma, \beta^w) \, Q_\ell(Z(C^{-u}), \gamma, \beta^w)}. \]
Also, steps 2)-4) are derived in the same way as the case of SH-CRP model (CHP model) in [12].
4.2 Prediction and Influence Network Analysis
Based on the learned RH and SH models, we consider predicting the POI visit events for U during a future time period [T, T′). Basically, we utilize the simulations with Ogata's thinning algorithm [14]. As for the RH model, we consider mathematically deriving an analytical approximation formula for the conditional expectation of the number of visit events, N̄_u(T′; T) = ⟨N_u(T′) − N_u(T) | H(T)⟩ for each u ∈ U in the following way. For any t > T, let λ̄_u(t) = ⟨λ_u(t) | H(T)⟩ denote the conditional expectation of λ_u(t) given H(T). From Eq. (2), we have
\[ \lambda_u(t) = \mu_u + \sum_{v \in U} W_{z(u),v} \int_0^t g_{z(u)}(t - \tau) \, dN_v(\tau) \]
(see [4]). Thus, by noting that λ_v(τ) dτ = ⟨dN_v(τ) | H(τ)⟩ for τ ∈ [0, t), we can obtain
\[ \bar{\lambda}_u(t) = \lambda_u(T) + \sum_{v \in U} W_{z(u),v} \int_T^t g_{z(u)}(t - \tau) \, \bar{\lambda}_v(\tau) \, d\tau. \]
By differentiating this integral equation with respect to t, we have
\[ \frac{d \bar{\lambda}_u(t)}{dt} = \gamma_{z(u)} \sum_{v \in U} W_{z(u),v} \Bigl\{ \bar{\lambda}_v(t) - \int_T^t g_{z(u)}(t - \tau) \, \bar{\lambda}_v(\tau) \, d\tau \Bigr\} = \gamma_{z(u)} \Bigl\{ \sum_{v \in U} W_{z(u),v} \, \bar{\lambda}_v(t) - \bar{\lambda}_u(t) \Bigr\} + \gamma_{z(u)} \lambda_u(T). \qquad (5) \]
By solving the linear differential equation (5) with constant coefficients under initial condition λ̄_u(T) = λ_u(T), we can explicitly write λ̄_u(t) (t > T) in terms of matrix exponential¹. Hence, based on the mean field approximation
\[ \bar{N}_u(T'; T) \approx \int_T^{T'} \bar{\lambda}_u(t) \, dt, \]
we can obtain an analytical approximation formula for N̄_u(T′; T). Here, we omit the details due to space limitation.
Based on the learned RH model, we detect the latent influence network {A^R_{k,ℓ}} among geographical competitive groups {R_k}, where A^R_{k,ℓ} = Σ_{v∈R_ℓ} W_{k,v} / |R_ℓ| represents the influence degree from R_ℓ to R_k. Also, using the learned SH model, we detect the latent influence network {A^S_{k,ℓ}} among geographical cooperative groups {S_k}, where A^S_{k,ℓ} = Σ_{u∈S_k} W_{u,ℓ} / |S_k| represents the influence degree from S_ℓ to S_k. Moreover, to reveal the geographical competitive and cooperative structures, we explore geographical locations of {R_k} and {S_k}, and investigate the influence decay rates of them through the learned parameters {γ_k}.
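Since the details of the closed-form solution are omitted, the following sketch integrates Eq. (5) numerically instead of using a matrix exponential, and accumulates the mean field estimate of the expected counts along the way. The function name, the argument layout and the choice of a fourth-order Runge-Kutta scheme are assumptions for illustration; this is a swapped-in numerical technique, not the authors' analytical formula.

```python
def mean_field_forecast(lam_T, W, gamma, z, horizon, steps=2000):
    """Integrate Eq. (5) from t = T to t = T + horizon with classical RK4.

    lam_T[u] : intensity of POI u at time T (initial condition)
    W[k][v]  : influence strength from POI v to group k
    gamma[k] : decay rate of group k
    z[u]     : group index of POI u
    Returns (lam_bar at T + horizon, approximate expected counts per POI).
    """
    n = len(lam_T)

    def rhs(lam):
        # gamma_z(u) * (sum_v W_z(u),v * lam_v - lam_u) + gamma_z(u) * lam_u(T)
        return [
            gamma[z[u]]
            * (sum(W[z[u]][v] * lam[v] for v in range(n)) - lam[u] + lam_T[u])
            for u in range(n)
        ]

    h = horizon / steps
    lam = list(lam_T)
    counts = [0.0] * n
    for _ in range(steps):
        k1 = rhs(lam)
        l2 = [lam[u] + 0.5 * h * k1[u] for u in range(n)]
        k2 = rhs(l2)
        l3 = [lam[u] + 0.5 * h * k2[u] for u in range(n)]
        k3 = rhs(l3)
        l4 = [lam[u] + h * k3[u] for u in range(n)]
        k4 = rhs(l4)
        for u in range(n):
            # RK4 on the augmented system dN/dt = lam_bar, d lam_bar/dt = rhs.
            counts[u] += h * (lam[u] + 2 * l2[u] + 2 * l3[u] + l4[u]) / 6.0
            lam[u] += h * (k1[u] + 2 * k2[u] + 2 * k3[u] + k4[u]) / 6.0
    return lam, counts
```

For a single POI in a single group with W < 1, Eq. (5) reduces to a scalar linear equation whose solution relaxes exponentially towards λ_u(T) / (1 − W), which gives a convenient sanity check for the integrator.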
5 Experiments
For real LBSN data, we empirically evaluate the proposed RH model in terms of prediction performance, and analyze the latent geographical influence networks from the perspective of geographical competitive and cooperative structures by applying the RH and SH models.
5.1 Experiments with Synthetic Data
Using synthetic data, we first confirmed the effectiveness of the learning method for the RH and SH models, and the validity of the analytical approximation formula for the RH model. Throughout all experiments for synthetic and real data, we adopted a univariate Hawkes process (UHP) model ignoring influence relations among POIs as a baseline, since it can capture the most basic property of visit event occurrence for each POI, and the previous work [12] showed that the MHP model is not practical for a limited amount of observed data. Here, the UHP model has three parameters, μ_u, W_{u,u} and γ_u for every POI u ∈ U. We note that the MHP model was also examined as a reference in our experiments. For comparison, we further considered the previous SH-CRP model [12] and the RH model with CRP prior, which is referred to as RH-CRP model. Note that the RH-CRP and SH-CRP models detect the non-geographical competitive and cooperative structures, respectively. We set the hyperparameters as follows: β^μ = β^w = (0.1, 1), β^γ = (10, 100), α = 1 and ρ_0 is one-tenth of the maximum distance between POIs. We note that their small variations had little impact on the results. We measured the prediction accuracy by the prediction log-likelihood (PL) (see Eq. (1)). We implemented 100 iterations with 20 burn-in iterations for all the models.
¹ Note that a similar formula can also be obtained for λ̄_u(t) of the SH model when g_k(t) does not depend on k.
We assessed the developed learning method in terms of prediction performance, where the ground truth data (i.e., the training and test data) were generated by a model in question. For several synthetic datasets, we confirmed that the model in question always outperformed the other models. Those results support the effectiveness of the learning method we presented.
We also verified the analytical approximation formula for the RH model in the following way. For simplicity, we focus on the case of T = 0. For ∀u ∈ U and ∀t > 0, let N̂_u(t) denote the estimate of expectation N̄_u(t; 0) by the analytical approximation formula. In principle, N̄_u(t; 0) should be empirically estimated through m simulations of the RH model with specified parameters for a sufficiently large m. Let ⟨N_u(t)⟩_m denote an empirical estimate of N̄_u(t; 0) by m simulations. We examined the average error E_m(t) of |⟨N_u(t)⟩_m − N̂_u(t)| / N̂_u(t) (∀u ∈ U) for several synthetic datasets, and confirmed that E_m(t) was decreasing as m was increased. Figure 1 shows one of the experimental results, where E_m(t) is plotted as a function of t ∈ [0, 20] for m = 10^2, 10^3, 10^4. Here, 100 POIs were partitioned into ten competitive groups of equal size (i.e., |U| = 100, K = 10 and |R_k| = 10), and μ_u, W_{k,v} and γ_k were randomly selected from (0, 0.01), (0, 0.1) and (0, 0.1), respectively.
These results support the validity of the analytical approximation formula we presented.
5.2 Comparison of Prediction Performance for Real Data
Next, for real-world data, we exploited the dataset of "FourSquare - NYC and Tokyo Check-ins"², which consists of check-ins in New York City (NYC) and Tokyo (TKY) collected for about ten months [16]. The total numbers of POIs in NYC and TKY are 38,333 and 61,568, respectively. The total numbers of check-ins in NYC and TKY are 227,428 and 573,703, respectively. We simply regarded a check-in event as a visit event for a POI. By taking into account the difference of NYC and TKY in data size, we constructed our datasets in the following way. For NYC and TKY, we selected the POIs having more than five check-in events during periods of seven and five days, respectively. We focus on four datasets NYC-5 (May 1–7), NYC-6 (June 1–7), TKY-5 (May 1–5) and TKY-6 (June 1–5). Here, the numbers of POIs for NYC-5, NYC-6, TKY-5 and TKY-6 were 311, 259, 381 and 440, respectively. Also, the total numbers of check-in events for NYC-5, NYC-6, TKY-5 and TKY-6 were 2,300, 1,785, 4,924 and 5,604, respectively.
² https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset.
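The dataset construction rule (keep POIs with more than five check-ins inside a fixed window) can be sketched as follows. The function name, the (POI, timestamp) tuple layout and the toy records are hypothetical; the actual column names and parsing of the Kaggle files are not given in the text and are deliberately left out.

```python
from collections import Counter
from datetime import datetime

def select_active_pois(checkins, start, end, min_events=5):
    """Keep POIs with more than `min_events` check-ins in [start, end).

    checkins : iterable of (poi_id, timestamp) pairs (hypothetical layout)
    Returns (set of selected POIs, their events sorted by time).
    """
    in_window = [(poi, ts) for poi, ts in checkins if start <= ts < end]
    counts = Counter(poi for poi, _ in in_window)
    active = {poi for poi, c in counts.items() if c > min_events}
    events = sorted(
        ((poi, ts) for poi, ts in in_window if poi in active),
        key=lambda event: event[1],
    )
    return active, events

# Toy records (made up): "a" has 6 check-ins in the window, "b" has 5,
# and "c" has 7 check-ins that all fall outside the window.
toy_checkins = (
    [("a", datetime(2012, 5, day)) for day in range(1, 7)]
    + [("b", datetime(2012, 5, day)) for day in range(1, 6)]
    + [("c", datetime(2012, 6, day)) for day in range(1, 8)]
)
toy_active, toy_events = select_active_pois(
    toy_checkins, datetime(2012, 5, 1), datetime(2012, 5, 8)
)
```

A seven-day window such as May 1–7 is expressed here as a half-open interval ending at midnight on May 8, so that check-ins on the last day are included.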
[Fig. 1 plot: average error E_m(t) versus time t ∈ [0, 20] for m = 100, 1,000 and 10,000.]
Fig. 1. A validation result for the analytical approximation formula of the RH model.
[Fig. 2 plots: prediction log-likelihood (PL) of the RH, RH-CRP, SH, SH-CRP, MHP and UHP models; panels (a) NYC-5, (b) NYC-6, (c) TKY-5, (d) TKY-6.]
Fig. 2. Comparison results of prediction performance in terms of PL metric.
We evaluated the RH model in terms of prediction performance. For each of the four datasets, the six models including the MHP were trained, and the next day of the dataset was utilized as the test period [T, T′). Then, the numbers of events in the test period for NYC-5, NYC-6, TKY-5 and TKY-6 were 284, 284, 368 and 1,240, respectively. Figure 2 shows the average performance for PL metric on five trials. We first note that the MHP performed much worse than the other five models because of a relatively small number of observed events. For the RH and SH models and their variants (the RH-CRP and SH-CRP models), we first observe that at least one of the four models always performs better than the baseline UHP. This result indicates that there exists a mutually-exciting relation among POIs in the occurrence process of POI visit events for these cities. For NYC-5, NYC-6 and TKY-6, the RH model performs the best. These results imply that the geographical competitive structure had more significance than the non-geographical competitive structure and the cooperative structures for the POI visit dynamics in these datasets. Namely, they demonstrate the importance of geographical competitive structure. As for the popularity forecasts by the analytical approximation formula for the RH model, the average absolute error was about 20%. On the other hand, for TKY-5, the SH-CRP model performs the best and the RH-CRP model follows. Note that this period corresponds to the Japanese vacation week in May. This result suggests that for Tokyo in this period, the geographical influence structures had little impact on the POI visit dynamics, and the non-geographical cooperative structure played an important role. In this way, by examining which model outperforms other models in terms of prediction performance, we can find an interesting property of POI visit dynamics for a target city in a specified period from the viewpoints of "geographical and non-geographical" and "competitive and cooperative".
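To make the likelihood behind the baseline comparison concrete, the sketch below evaluates the log-likelihood of a single POI's event times under a univariate Hawkes process with parameters μ, W and γ. It assumes an exponential kernel g(t) = γ exp(−γt), which is consistent with the differentiation step leading to Eq. (5) but is stated here as an assumption, and it covers only an observation window [0, T) starting with an empty history rather than the conditional test-period PL used in the experiments. The function name is illustrative.

```python
import math

def uhp_log_likelihood(times, T, mu, W, gamma):
    """Log-likelihood of sorted event times in [0, T) for a univariate Hawkes
    process with intensity mu + W * sum_{t_j < t} gamma * exp(-gamma * (t - t_j)).
    """
    loglik = 0.0
    excitation = 0.0  # sum_{j < n} gamma * exp(-gamma * (t_n - t_j))
    prev = None
    for t in times:
        if prev is not None:
            # Recursive update: decay the old sum and add the previous event.
            excitation = (excitation + gamma) * math.exp(-gamma * (t - prev))
        loglik += math.log(mu + W * excitation)
        prev = t
    # Integral of the intensity over [0, T).
    compensator = mu * T + W * sum(1.0 - math.exp(-gamma * (T - t)) for t in times)
    return loglik - compensator
```

The recursive update keeps the cost linear in the number of events, which is the usual reason for preferring exponential kernels in Hawkes models.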
Fig. 3. Geographical locations of the major {R_k} and {S_k} for NYC-5.
Fig. 4. Analysis results for NYC-5.
5.3 Analysis of Geographical Influence Networks for Real Data
For the real-world data, we analyze the latent geographical influence networks {A^R_{k,ℓ}} and {A^S_{k,ℓ}} detected by the RH and SH models from the perspective of geographical competitive and cooperative structures. Due to space limitation, we only describe the analysis results for NYC-5. Figures 3a and b visualize the major geographical competitive and cooperative groups detected for NYC-5, namely the RH groups {R_k | k = 1, ..., 5} and the SH groups {S_k | k = 1, ..., 5}, respectively. Here, the groups including more than ten POIs are selected, and the locations of the corresponding POIs for each group are shown by an individual marker. We observe that groups R_k and S_k were quite similar for every k, and they especially coincided for k = 3, 4, 5. R_1 and S_1 are Manhattan's neighborhoods. However, an area from Flushing Meadows Corona Park's neighborhood to eastern Brooklyn's neighborhood is included in R_1, while it is not included in S_1 but included in S_2. R_2 is Pelham Bay Park's neighborhood. Also, R_3 = S_3 is eastern Queens's neighborhood including JFK International Airport, R_4 = S_4 is Paramus' neighborhood, and R_5 = S_5 is Newark's neighborhood. These results demonstrate that the proposed models can detect geographical structures of influence.
For those five {R_k}, Figs. 4a and b show the geographical influence network {A^R_{k,ℓ}} and the half-life log 2/γ_k for the temporal influence decay of R_k, respectively. Also, for those five {S_ℓ}, Figs. 4c and d show the geographical influence network {A^S_{k,ℓ}} and the half-life log 2/γ_ℓ for the temporal influence decay of S_ℓ, respectively. From the perspective of influence receiving, we first see that R_1 receives relatively strong influence from R_4, R_3 and itself. We also observe that R_1 and R_4 are influenced for a longer duration. On the other hand, R_2 and R_5 are influenced for a shorter duration. Here, R_2 and R_5 receive strong influence from R_3 and R_2, respectively. From the viewpoint of influence sending, we first observe that S_1 has an almost equal influence on every S_ℓ. We also see that the influence of S_1 and S_4 is more rapidly time-decayed. S_5 has a strong influence on S_1 and S_4, and its influence is more slowly time-decayed compared to other S_ℓ. These results imply that the RH and SH models can present an interesting analysis of the dynamics for POI visit event occurrences in terms of geographical competitive and cooperative structures.
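The two quantities plotted in Fig. 4 for the RH model, the group-level influence network and the half-life of the temporal decay, follow directly from the learned W and γ. The sketch below computes both; the function names and the toy values are illustrative assumptions, not learned parameters from the paper.

```python
import math

def group_influence(W, groups):
    """A[k][l] = average of W[k][v] over POIs v in group l (RH model).

    W[k][v] : influence strength from POI v to group k
    groups  : list of lists, groups[l] holds the POI indices of group l
    """
    K = len(groups)
    return [
        [sum(W[k][v] for v in groups[l]) / len(groups[l]) for l in range(K)]
        for k in range(K)
    ]

def half_life(gamma_k):
    """Time after which an exponentially decaying influence has halved."""
    return math.log(2.0) / gamma_k

# Toy values (made up): 3 POIs split into 2 groups.
toy_W = [[0.2, 0.4, 0.0], [0.1, 0.1, 0.6]]
toy_groups = [[0, 1], [2]]
toy_A = group_influence(toy_W, toy_groups)
```

A smaller γ_k yields a longer half-life, which is how the text reads "influenced for a longer duration" off the learned decay rates.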
6 Conclusion
We addressed the problem of finding latent geographical influence networks that have impacts on the occurrence process of visit events for a set of POIs in a target city, and have proposed a novel probabilistic model based on Hawkes processes, called the RH model, for detecting the geographical competitive structure among the POIs. For the RH model, we presented a method of inferring it from the observed data, and mathematically derived an approximation formula for the popularity prediction. Moreover, by appropriately extending the previous CHP model to the one with ddCRP prior, we have constructed the SH model for detecting geographical cooperative structure among the POIs. Using synthetic data, we confirmed the effectiveness of the inference method, and the validity of the approximation formula. Using New York City and Tokyo data from FourSquare, we demonstrated that there exist mutually-exciting relations among POIs, and the cases in which the geographical competitive structure has more impact on the POI visit dynamics than the non-geographical competitive structure and the cooperative structures, showing the significance of the RH model. Moreover, by applying the RH and SH models, we uncovered and analyzed the latent geographical influence networks for the cities from the perspective of geographical competitive and cooperative structures.
Acknowledgments. This work was supported in part by JSPS KAKENHI Grant Number JP17K00433 and Research Support Program of Ryukoku University.
References
1. Blei, D., Frazier, P.: Distance dependent Chinese restaurant processes. J. Mach. Learn. Res. 12, 2461–2488 (2011)
2. Blundell, C., Heller, K., Beck, J.: Modelling reciprocating relationships with Hawkes processes. In: Proceedings of NIPS 2012, pp. 2600–2608 (2012)
3. Daneshmand, H., Gomez-Rodriguez, M., Song, L., Schölkopf, B.: Estimating diffusion network structures: recovery conditions, sample complexity & soft-thresholding algorithm. In: Proceedings of ICML 2014, pp. 793–801 (2014)
4. Farajtabar, M., Du, N., Gomez-Rodriguez, M., Valera, I., Zha, H., Song, L.: Shaping social activity by incentivizing users. In: Proceedings of NIPS 2014, pp. 2474–2482 (2014)
5. Farajtabar, M., Wang, Y., Gomez-Rodriguez, M., Li, S., Zha, H., Song, L.: Coevolve: a joint point process model for information diffusion and network evolution. J. Mach. Learn. Res. 18(41), 1–49 (2017)
6. Feng, S., Li, X., Zeng, Y., Cong, G., Chee, Y., Yuan, Q.: Personalized ranking metric embedding for next new POI recommendation. In: Proceedings of IJCAI 2015, pp. 2069–2075 (2015)
7. Gomez-Rodriguez, M., Leskovec, J., Krause, A.: Inferring networks of diffusion and influence. In: Proceedings of KDD 2010, pp. 1019–1028 (2010)
8. Hawkes, A.: Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90 (1971)
9. Junuthula, R., Haghdan, M., Xu, K., Devabhaktuni, V.: Block point process model for continuous-time event-based dynamic networks. In: Proceedings of WWW 2019, pp. 829–839 (2019)
10. Lin, P., Zhang, B., Guo, T., Wang, Y., Chen, F.: Interaction point processes via infinite branching model. In: Proceedings of AAAI 2016, pp. 1853–1859 (2016)
11. Linderman, S., Adams, R.: Discovering latent network structure in point process data. In: Proceedings of ICML 2014, pp. 1413–1421 (2014)
12. Matsutani, K., Kumano, M., Kimura, M., Saito, K., Ohara, K., Motoda, H.: Discovering cooperative structure among online items for attention dynamics. In: Proceedings of ICDMW 2017, pp. 1033–1041 (2017)
13. Neal, R.: Markov chain sampling methods for Dirichlet process mixture models. J. Comput. Graph. Stat. 9(2), 249–265 (2000)
14. Ogata, Y.: On Lewis' simulation method for point processes. IEEE Trans. Inform. Theory 27(1), 23–31 (1981)
15. Wang, H., Shen, H., Ouyang, W., Cheng, X.: Exploiting POI-specific geographical influence for point-of-interest recommendation. In: Proceedings of IJCAI 2018, pp. 3877–3883 (2018)
16. Yang, D., Zhang, D., Zheng, V., Yu, Z.: Modeling user activity preference by leveraging user spatial temporal characteristics in LBSNs. IEEE Trans. Syst. Man Cybern. Syst. 45(1), 129–142 (2015)
17. Yuan, B., Li, H., Bertozzi, A., Brantingham, P., Porter, M.: Multivariate spatiotemporal Hawkes processes and network reconstruction. SIAM J. Math. Data Sci. 1(2), 356–382 (2019)
18. Zhang, J., Chow, C.: Spatiotemporal sequential influence modeling for location recommendations: a gravity-based approach. ACM Trans. Intell. Syst. Technol. 7(1), 11:1–11:25 (2015)
19. Zhao, Q., Erdogdu, M., He, H., Rajaraman, A., Leskovec, J.: Seismic: a self-exciting point process model for predicting tweet popularity. In: Proceedings of KDD 2015, pp. 1513–1522 (2015)
20. Zhou, K., Zha, H., Song, L.: Learning social infectivity in sparse low-rank networks using multi-dimensional Hawkes processes. In: Proceedings of AISTATS 2013, pp. 641–649 (2013)
Consensus Embeddings for Networks with Multiple Versions
Mengzhen Li(✉) and Mehmet Koyutürk
Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH, USA
[email protected]
Abstract. Machine learning applications on large-scale network-structured data commonly encode network information in the form of node embeddings. Network embedding algorithms map the nodes into a low-dimensional space such that the nodes that are "similar" with respect to network topology are also close to each other in the embedding space. Many real-world networks that are used in machine learning have multiple versions that come from different sources, are stored in different databases, or belong to different parties. Due to efficiency or privacy concerns, it may be desirable to compute consensus embeddings for the integrated network directly from the node embeddings of individual versions, without explicitly constructing the integrated network. Here, we systematically assess the potential of consensus embeddings in the context of processing link prediction queries on user-chosen combinations of different versions of a network. For the computation of consensus embeddings, we use linear (singular value decomposition) and non-linear (variational auto-encoder) dimensionality reduction methods. Our results on a large selection of protein-protein interaction (PPI) networks (eight versions with 255 potential combinations) show that consensus embeddings enable real-time processing of link prediction queries on user-defined combinations of networks, without requiring explicit construction of the integrated network. We observe that linear dimensionality reduction delivers better accuracy and higher efficiency than nonlinear dimensionality reduction. We also observe that the performance of consensus embeddings is amplified with increasing number of networks in the database, demonstrating the scalability of consensus embeddings to growing numbers of network versions.
Keywords: Node embedding · Dimensionality reduction · Link prediction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 39–52, 2021. https://doi.org/10.1007/978-3-030-65351-4_4
1 Introduction
Large-scale information networks are becoming ubiquitous. Mining knowledge from these information networks has become very popular in a broad range
of applications. Learning a representation of networks is useful for many network analysis applications [16], including social network analysis [10,24] and bioinformatics [2,21]. With the increase in the quantity and variety of network datasets, effective integration of different data sources is becoming a popular and challenging task for researchers [19]. Many networks have multiple versions, as different data providers may gather their data from different sources, some of the data may not be shared due to privacy concerns [11], or the data that is available may evolve over time. An important task in analyzing integrated networks is the computation of node embeddings, i.e. learning low-dimensional representation of integrated networks [18,24]. Node embeddings aim to map each node in the network to a low dimensional vector representation to extract features that represent the topological characteristics of the network. Many techniques are developed for this purpose [1,8,15,22], and these techniques are shown to be effective in addressing problems such as link prediction [3,23], node classification [4], and clustering [17]. Different versions of a network have the same set of nodes and different sets of edges. These different sets of edges may represent identical semantics but different sources (e.g., protein-protein interaction (PPI) networks obtained from different databases) or different semantics (e.g., physical PPIs vs. genetic interactions). In many settings, it may not be possible or desirable to superpose multiple versions of a network. For example, in integrated querying of networks from multiple databases, computation of embeddings for all possible combinations may not be feasible or efficient [6]. Cho et al. [5] develop a method that uses random walk diffusion states of individual networks to compute the node embedding of the integrated network. In their method, node embeddings are learned from multiple n × n matrices of probabilities. 
However, the learning has to be performed at query time. From the perspective of efficiency and real-time query processing, computation of node embeddings at query time is not desirable. A potentially more efficient approach is to use the node embeddings of different versions to compute the embeddings of the integrated network (as opposed to computing embeddings using the integrated network). Motivated by this observation, we introduce the notion of "consensus embeddings" as node embeddings for the integrated network that are computed from the embeddings of separate versions. To compute consensus embeddings, we use linear (singular value decomposition) and non-linear (variational autoencoder) dimensionality reduction. Using multiple versions of protein interaction networks, we systematically assess the accuracy and efficiency of consensus embeddings in the context of combinatorial link prediction queries. Our results show that use of consensus embeddings in processing link prediction queries significantly improves computational efficiency, without significantly compromising the accuracy of link prediction.
2 Methods
2.1 Node Embedding
Node embedding aims to learn a low-dimensional representation of nodes in networks [16]. Given a graph G = (V, E), a node embedding is a function f : V → R^d that maps each node v ∈ V to a vector in R^d, where d ≪ |V|. A node embedding method computes a vector for each node in the network such that the proximity in the embedding space reflects the proximity/similarity in the network. In the last few years, many methods [1,8,15,22] have been developed to compute node embeddings in a given network. Methods usually differ in terms of how they formulate the similarity between nodes (or the objective function that specifies the correspondence between the embedding and network topology). Node embedding methods can also be roughly divided into community-based approaches and role-based approaches [16]. Community-based approaches aim to preserve the similarity of the nodes in terms of the communities they induce in the network. In contrast, role-based approaches aim to capture the topological roles of the nodes and map nodes with similar topological roles close to each other in the embedding space. As representatives of these different approaches, we here consider node2vec [8] as a community-based approach and role2vec [1] as a role-based approach.
Fig. 1. The proposed framework for the computation of consensus embeddings using dimensionality reduction methods. The graphs labeled Version i represent multiple versions of a network with a fixed node set and different (possibly overlapping) edge sets. The objective is to compute node embeddings for the network obtained by superposing these versions. In the absence of the integrated network, we compute the consensus embedding by computing separate embeddings for each version and then using dimensionality reduction to compute a common reduced-dimensional space for these embedding spaces. Finally, we use the resulting consensus embeddings to perform downstream machine learning tasks on the integrated network.
2.2 Consensus Embeddings
In this section, we formalize the problem of integrating multiple networks and computing node embeddings. Let G_1 = (V, E_1), G_2 = (V, E_2), ..., G_k = (V, E_k) be k versions of a network with the same set of nodes and different sets of edges, i.e., all the versions have the same set of n nodes. We consider the integration of these k networks through superposition of their edges, i.e., we define the integrated graph as G = (V, E), where E = ∪_{i=1}^{k} E_i. Assume that d-dimensional node embeddings X_i for G_i are given for 1 ≤ i ≤ k, where the X_i are n × d matrices and X_i(j) is a d-dimensional vector representing the embedding of node v_j ∈ V according to graph G_i. Our objective is to use the X_i to compute d-dimensional node embeddings X_c for G, without using any other information on the G_i or G. We call X_c a consensus embedding. This framework is illustrated in Fig. 1, in which node embedding can be any method for the computation of node embeddings (we here use node2vec and role2vec), and dimensionality reduction can be any dimensionality reduction method (we here use SVD or variational autoencoder).
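For reference, the explicit superposition that consensus embeddings are meant to avoid can be written in a few lines, together with the number of non-empty version combinations a user could query. The function names and the representation of undirected edges as frozensets are illustrative assumptions.

```python
from itertools import combinations

def superpose(edge_sets):
    """Union of undirected edge sets; each edge is stored as frozenset({u, v})."""
    merged = set()
    for edges in edge_sets:
        merged |= {frozenset(edge) for edge in edges}
    return merged

def nonempty_combinations(k):
    """All non-empty subsets of the k version indices 0..k-1."""
    return [c for r in range(1, k + 1) for c in combinations(range(k), r)]

# Toy usage: two versions sharing the edge {1, 2}.
toy_versions = [[(1, 2), (2, 3)], [(2, 1), (3, 4)]]
toy_integrated = superpose(toy_versions)
```

With k versions there are 2^k − 1 non-empty combinations, so recomputing embeddings from scratch for every combination grows quickly with k.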
[Fig. 2 panels: (a) SVD (Singular Value Decomposition), (b) Autoencoder.]
Fig. 2. Illustration of dimensionality reduction methods used to compute consensus embeddings.
2.3 Computing Consensus Embeddings
The input to the computation of consensus embeddings is k n × d matrices X1 , X2 , ..., Xk . To integrate these embeddings, we first create an n × kd matrix X by concatenating these k matrices. We then use dimensionality reduction on this matrix to compute an n × d matrix Xc , which represents the consensus embedding for G.
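As a rough illustration of this concatenate-then-reduce step, the sketch below uses a truncated SVD computed with NumPy as the dimensionality reduction. The function name `consensus_svd` and the random toy embeddings are assumptions for illustration; the sketch is not the authors' implementation, and the per-version embeddings would in practice come from a method such as node2vec or role2vec.

```python
import numpy as np

def consensus_svd(embeddings, d):
    """Map k embeddings of shape (n, d) to one consensus embedding of shape (n, d).

    The embeddings are concatenated column-wise into an n x kd matrix X, and
    each row is projected onto the top-d right singular vectors of X.
    """
    X = np.hstack(embeddings)                        # n x kd
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_top = Vt[:d].T                                 # kd x d
    return X @ V_top                                 # n x d

# Toy usage: k = 3 random "embeddings" of n = 20 nodes in d = 4 dimensions.
rng = np.random.default_rng(0)
toy_embeddings = [rng.normal(size=(20, 4)) for _ in range(3)]
toy_consensus = consensus_svd(toy_embeddings, d=4)
```

Since the right singular vectors are orthonormal, inner products between rows of the result equal those of the best rank-d approximation of X, so pairwise similarities between nodes are preserved as well as any rank-d linear map allows.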
Singular Value Decomposition (SVD): Singular value decomposition is a matrix decomposition method that factorizes a matrix into its constituent parts. The singular value decomposition of an m × p matrix M, whose rank is r, is a factorization of the form M = U S V^T, where U is an m × r matrix with orthonormal columns, S is an r × r diagonal matrix, and V is a p × r matrix with orthonormal columns. The diagonal values of S are called the singular values of M. Letting M = X in this formulation, we obtain the n × r matrix U, the r × r matrix S, and the kd × r matrix V, where r denotes the rank of X and X = U S V^T. Our objective is to compute an n × d matrix X_c such that X_c X_c^T approximates X X^T well. If we set our objective as one of choosing an n × kd matrix Y with rank d to minimize the Frobenius or 2-norm of the difference ||X − Y||, then the optimal solution is given by the truncation of SVD to the largest d singular values (and corresponding singular vectors) of X. Namely, let U′, S′, and V′ denote the n × d, d × d, and kd × d matrices obtained by choosing the first d columns (also rows for S) of respectively U, S, and V. Then the matrix Y = U′ S′ V′^T provides the best rank-d approximation to X. Consequently, V′ provides an optimal mapping of the kd dimensions in X to d-dimensional space. Based on this observation, SVD-based dimensionality reduction sets
\[ X_c^{(\mathrm{SVD})} = X V', \qquad (1) \]
i.e., it maps the kd-dimensional concatenated embedding of each node of the graph into the d-dimensional space defined by the SVD of X. Figure 2(a) shows the dimensions of matrices when computing the consensus embeddings via SVD.
Variational Autoencoder: An autoencoder is an unsupervised learning algorithm that applies backpropagation to obtain a lower-dimensional representation of data, setting the target values to be equal to the inputs. The use of a convolutional autoencoder for dimensionality reduction in the context of computing consensus embeddings is shown in Fig. 2(b). As seen in the figure, the autoencoder is a neural network with kd inputs, each representing a column of the matrix X (i.e., a dimension in one of the k embedding spaces). The layer(s) on the left (encoder) map these kd inputs to d latent features shown in the middle, which are subsequently transformed into the kd outputs by the layer(s) on the right (decoder). While training the network, each row of the matrix X (i.e., the embedding of each node) is used as an input and the respective output. The neural network is trained using this loss function:
\[ L(X, Y) = \| X - Y \|_F^2, \qquad (2) \]
where Y denotes the n × kd matrix whose rows represent the outputs of the network corresponding to the inputs that represent the rows of X. Thus the idea behind the variational autoencoder is to learn an encoding of the kd input dimensions into the d latent features (shown in the middle) such that the kd inputs can be reconstructed by the decoder with minimum loss. Observe that
this loss function is identical to that of SVD; however, the use of neural networks provides the ability to perform non-linear dimensionality reduction. Once the neural network is trained, we perform dimensionality reduction by retaining the $d$-dimensional output of the encoder that corresponds to each of the $n$ training instances (rows of the matrix $X$, or nodes in $V$). These $n$ $d$-dimensional vectors comprise the matrix $X_c^{(VAE)}$, i.e., the consensus embeddings of the nodes in $V$ computed by the variational autoencoder. In our implementation, we use a convolutional autoencoder [14]. Like a standard autoencoder, a convolutional autoencoder aims to output the same vectors as the input. The convolutional autoencoder contains convolutional layers in its encoder part. In every convolutional layer, there is a filter that slides over the input matrix to compute the next layer. Convolutional autoencoders also have a pooling layer after each convolutional layer. In the decoder part, there are deconvolutional layers and unpooling layers that recover the input matrix.

2.4 Link Prediction
Link prediction is an important task in network analysis [13]. Given a network $G = (V, E)$, link prediction aims to predict the potential edges that are likely to appear in the network based on the topological relationships between pairs of nodes. Link prediction can be supervised [7] or unsupervised [9]. For supervised link prediction, the known links serve as positive samples and disconnected pairs of nodes serve as negative samples. The embedding vectors of nodes are treated as feature vectors and used to train classifiers [23]. For unsupervised link prediction, the distances between pairs of embedding vectors are used as a measure of proximity between nodes in the network, and potential edges are predicted by ranking these distances [3]. In our experiments, we use BioNEV [23] to evaluate the link prediction accuracy of the consensus embeddings. It is a supervised method that aims to systematically evaluate embeddings, and it outputs the AUC scores of the link predictions made using the embeddings.

2.5 Processing Combinatorial Link Prediction Queries for Versioned Networks
Consider the following scenario: A graph database houses $k$ versions of a network (as formulated at the beginning of this section). These $k$ versions may either come from different resources (e.g., different protein-protein interaction databases) or represent semantically different types of edges between a common set of nodes (e.g., genetic interactions vs. physical interactions vs. functional association among human proteins). In this setting, a “combinatorial” link prediction query can be formulated as follows: The user chooses (i) a node $q \in V$, and (ii) a subset $S \subseteq \{G_1, G_2, \ldots, G_k\}$ of networks. The query seeks to identify the nodes that are most likely to be associated with the query node $q$ based on the topology of the integrated network $G^{(S)} = (V, E^{(S)})$, where $E^{(S)} = \bigcup_{G_i \in S} E_i$.
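The query formulation above can be illustrated with a minimal sketch, assuming undirected edges stored as Python sets. The helper names `integrate` and `all_combinations` and the toy edge sets are hypothetical and serve only to illustrate the union $E^{(S)}$ and the set of non-empty subsets a user may choose from.

```python
from itertools import combinations


def integrate(edge_sets, selected):
    """Return E^(S): the union of the edge sets of the selected versions."""
    merged = set()
    for i in selected:
        merged |= edge_sets[i]
    return merged


def all_combinations(k):
    """Yield every non-empty subset of the version indices 0..k-1."""
    for size in range(1, k + 1):
        yield from combinations(range(k), size)


# Toy data (hypothetical): k = 3 versions over the nodes a, b, c, d.
# Undirected edges are frozensets, so (a, b) and (b, a) coincide.
edge_sets = [
    {frozenset(("a", "b")), frozenset(("b", "c"))},
    {frozenset(("b", "c")), frozenset(("c", "d"))},
    {frozenset(("a", "d"))},
]

merged_edges = integrate(edge_sets, selected=(0, 1))
num_combinations = sum(1 for _ in all_combinations(len(edge_sets)))
```

For $k$ versions the generator yields $2^k - 1$ subsets, so enumerating them quickly becomes impractical as $k$ grows.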
Such a flexible query framework is highly useful in many applications, since the relevance and reliability of different network versions can vary, and different users may have different needs and preferences. The above framework defines a “combinatorial” query in the sense that a user can select any combination of networks to integrate. This poses a significant computational challenge, as the number of possible combinations is exponential in the number of networks in the database, i.e., the user can choose from $2^k - 1$ possible combinations of networks.

As we discuss in Sect. 2.4, there are many different ways of processing link prediction queries. Among existing approaches, embedding-based link prediction techniques have demonstrated success in many applications [12,23]. Furthermore, embedding-based link prediction can facilitate the development of effective solutions to the combinatorial challenge associated with combinatorial link prediction queries, because link prediction algorithms that use node embeddings do not need access to the network topology while performing link prediction. By computing and storing node embeddings in advance, it is possible to process link prediction queries efficiently while giving the user the flexibility to choose the combination of networks to integrate. Two possible approaches to addressing the combinatorial challenge represent the two ends of the pre-processing/storage vs. query runtime trade-off:

– Exhaustive Pre-Computation: Compute the embeddings for each possible combination and store those embeddings. When the user selects a combination, use the embeddings for that combination directly. This approach minimizes query processing time while maximizing storage and pre-processing cost.
– Network Integration at Query Time: Store the individual network versions in the database without computing any embeddings before the query. When the user selects a combination, construct the integrated network, compute the embeddings, and then use the embeddings to process the query. This approach avoids computing and storing an exponential number of embeddings, but performs all computations during query processing.

Consensus embeddings provide an alternative solution that can render storage feasible while enabling real-time query processing for very large networks and a large number of versions:

– Consensus Embedding at Query Time: Compute and store the embeddings for each network separately. When the user selects a combination, compute a consensus embedding for that combination and use it to process the query.

One important consideration in the application of this idea is the “inexact” nature of consensus embeddings, i.e., consensus embeddings may not adequately capture the information represented by the embeddings computed on the integrated network. In the following section, we perform computational experiments to characterize the effect of the inexact nature of consensus embeddings on the accuracy of link prediction. We also investigate the savings provided by consensus embeddings in terms of the computational resources required to process combinatorial link prediction queries.

Table 1. The description and size of the human protein-protein interaction (PPI) networks used in our experiments.

Version   Interaction type            # Edges
G1        Affinity capture-MS         13472
G2        Affinity capture-RNA        3160
G3        Affinity capture-western    6132
G4        Negative genetic            65369
G5        Positive genetic            13018
G6        Synthetic growth defect     9295
G7        Synthetic lethality         6842
G8        Two-hybrid                  4202
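The “Consensus Embedding at Query Time” strategy of Sect. 2.5 can be sketched as follows, assuming NumPy and the SVD-based reduction of Sect. 2.3. This is an illustrative sketch rather than the authors' implementation: the names `stored`, `svd_consensus`, and `answer_query` are hypothetical, and the per-version embeddings are random stand-ins rather than Node2vec or Role2vec output.

```python
import numpy as np


def svd_consensus(embeddings, d):
    """Map the concatenation of several n x d embeddings to an n x d matrix."""
    X = np.hstack(embeddings)                         # n x (k*d) concatenated embedding
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD, X = U diag(s) Vt
    V_d = Vt[:d].T                                    # (k*d) x d: top-d right singular vectors
    return X @ V_d                                    # n x d consensus embedding


# Pre-processing step (done once): one embedding per network version.
# Random matrices stand in for real embeddings here (assumption).
rng = np.random.default_rng(0)
n, d, k = 50, 4, 3
stored = {i: rng.normal(size=(n, d)) for i in range(k)}


def answer_query(selected, d):
    """Query time: combine only the stored embeddings of the selected versions."""
    return svd_consensus([stored[i] for i in selected], d)


Xc = answer_query((0, 2), d)
```

Only $k$ embeddings are stored, and each query costs one SVD of an $n \times |S|d$ matrix; the network topology is not accessed at query time.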
3 Results and Discussion
In this section, we present comprehensive experimental results on link prediction in versioned networks and discuss the implications of the results.

3.1 Datasets
In our computational experiments, we use protein-protein interaction (PPI) networks obtained from BioGRID [20]. PPI networks contain physical interactions and functional associations between pairs of proteins. The dataset we use contains multiple PPI networks separated based on experimental systems. Each network (version) contains a unique type of PPI (genetic or physical). The types of the interactions represented by each network version are shown in Table 1. In order to obtain multiple networks with the same set of nodes, we remove the nodes (proteins) that do not exist in all versions. After preprocessing, all versions have 1025 nodes and different numbers of edges ranging from 3160 to 65369. The type of PPI and the number of edges for each network are shown in Table 1.

3.2 Accuracy of Link Prediction
We compare the link prediction performance of the node embeddings computed on integrated networks and consensus embeddings computed based on the embeddings of individual networks. We consider two embedding algorithms, Node2vec [8] and Role2vec [1], and two methods for computing consensus embeddings, SVD and variational autoencoder. To assess link prediction performance, we use BioNEV [23], a Python package that is developed to assess the performance of various tasks that utilize network embeddings. Given a network and its node embeddings, BioNEV generates random training and testing sets to evaluate the link prediction performance of the embedding. BioNEV uses the known interactions as positive samples and randomly selects the negative samples. Both samples are split into a training set (80%) and a testing set (20%).
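A supervised evaluation protocol of this kind can be sketched in plain Python. This is not BioNEV's code: the helper names are hypothetical, the toy graph and embedding are made up, and drawing as many negatives as positives is an assumption. The sketch builds the labeled node pairs, performs the 80/20 split, forms pair features by concatenating the two node embeddings, and includes a small AUC helper; any binary classifier could then be trained on the training features.

```python
import random


def build_samples(nodes, edges, seed=0):
    """Positives are the known edges; negatives are random non-edges.
    Drawing equally many negatives as positives is an assumption."""
    rng = random.Random(seed)
    known = {frozenset(e) for e in edges}
    negatives = set()
    while len(negatives) < len(known):
        pair = frozenset(rng.sample(nodes, 2))
        if pair not in known:
            negatives.add(pair)
    samples = sorted(
        [(tuple(sorted(p)), 1) for p in known]
        + [(tuple(sorted(p)), 0) for p in negatives]
    )
    rng.shuffle(samples)
    return samples


def split(samples, train_fraction=0.8):
    """Split the labeled pairs into a training and a testing set."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def pair_features(embedding, pair):
    """Edge feature: concatenation of the two node embeddings."""
    u, v = pair
    return embedding[u] + embedding[v]


def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy graph and 2-dimensional embedding (hypothetical data).
nodes = list(range(6))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
embedding = {v: [float(v), float(v % 2)] for v in nodes}

samples = build_samples(nodes, edges)
train, test = split(samples)
train_features = [pair_features(embedding, pair) for pair, _ in train]
```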
For each node pair, BioNEV concatenates the embeddings of the two nodes as the edge feature and then builds a binary classifier. Using BioNEV, we obtain the area under the ROC curve (AUC) for the link prediction performance of node embeddings generated using different methods.

Figure 3 shows the performance of consensus embeddings in link prediction compared with the performance of the integrated networks’ embeddings. In each panel, the AUC of link prediction is shown as a function of the number of network versions. When there is a single version, the consensus embedding is identical to the embedding of the individual network. We observe that, on average, the accuracy of link prediction goes down as the number of integrated versions increases. However, the performance difference between the embeddings of the integrated network and the consensus embeddings becomes smaller as the number of versions increases. This observation suggests that the utility of consensus embeddings can be more pronounced for network databases with a larger number of versions. We also observe that there is considerable variance in accuracy across different combinations with the same number of versions, indicating that some combinations of PPI types are more informative in predicting new PPIs than other combinations.

As seen in Fig. 3, the accuracy of link prediction improves as the number of dimensions in the node embeddings increases. Importantly, with a growing number of dimensions, the link prediction performance of consensus embeddings converges to that of the embeddings computed on the integrated network. Across the board, consensus embeddings computed using linear dimensionality reduction (SVD) deliver more accurate link prediction than those computed using the variational autoencoder. Since the edge set of the integrated network is the union of the edge sets of the individual networks, the adjacency matrix of the integrated network can be approximated by a linear combination of the adjacency matrices of the individual networks. This might be the reason why linear dimensionality reduction performs better than neural networks.

Finally, we observe that Node2vec delivers consistently more accurate link prediction than Role2vec. The performance of consensus embeddings on larger numbers of networks is also better with Node2vec than with Role2vec. This is not surprising, as Node2vec is based on communities in the network whereas Role2vec is based on roles. Integration of versions can change the topological features (e.g., degrees or paths) of the individual versions, and thus can have a stronger effect on the “roles” of nodes than on communities. Therefore, the embeddings computed via Role2vec are less robust to slight variations in network topology.

3.3 Computational Resource Requirements
In this section, we investigate whether consensus embeddings improve the efficiency of processing link prediction queries. For this purpose, we first compare the query processing time for consensus embeddings computed using different methods (SVD and autoencoder) against that for embeddings computed at query time after integrating the combination of networks selected by the user. We use a high performance computing environment with a 2.2 GHz processor and 4 GB memory. The results of this analysis are shown in Fig. 4. As seen in the figure, processing queries using consensus embeddings drastically improves the efficiency of query processing. For both Node2vec and Role2vec, “Consensus Embedding at Query Time” using SVD enables processing of combinatorial link prediction queries in real time across the board, while integration of networks at query time requires orders of magnitude more time to process these queries. In most cases, “Consensus Embedding at Query Time” using the convolutional autoencoder is also faster than “Network Integration at Query Time”, but its performance degrades as the number of integrated networks increases. The runtime of computing an embedding increases as networks become denser, especially for Node2vec. As seen in Fig. 4, the blue dots are separated into two groups for Node2vec. This is because $G_4$ is extremely dense (see Table 1), making the integrated networks that contain $G_4$ also dense. Therefore, combinations that contain $G_4$ have a significantly higher query runtime than those that do not contain $G_4$. Computation of consensus embeddings using SVD is more robust to this effect.

Fig. 3. Accuracy of consensus embeddings in link prediction. Results are shown for embedding methods Node2vec (left panels) and Role2vec (right panels). For each point $k$ on the x axis, each point in the plot shows the area under ROC curve (AUC) of link prediction for a specific combination of $k$ network versions. The lines show the average AUC across all combinations as a function of the number of network versions that are integrated. The blue, yellow, and red points/lines respectively show the accuracy provided by the embeddings computed directly on the integrated network, consensus embeddings computed using variational autoencoder, and consensus embeddings computed using SVD. (Panels are shown for number of dimensions = 16 and number of dimensions = 64.)

Fig. 4. Runtime of combinatorial link prediction queries using network embeddings. The blue dots (for each combination)/curves (average of all combinations with the respective number of versions) show the query time corresponding to the “Network Integration at Query Time” approach described in Sect. 2.5, while the red and yellow dots/curves show the query time corresponding to the “Consensus Embedding at Query Time” approach. Results are shown for two node embedding algorithms, Node2vec (left panel) and Role2vec (right panel), and two methods for computing consensus embeddings, SVD (yellow) and autoencoder (red). Each row shows a different number of dimensions for node embeddings (number of dimensions = 16 and number of dimensions = 64).
Next, we investigate the trade-off between the savings in query runtime and the pre-processing time/storage requirements. As discussed in Sect. 2.5, we consider three options for processing combinatorial link prediction queries. While the trade-off between storage/pre-processing vs. query runtime requirements for each of these approaches is intuitive, we also assess the performance of each approach in the context of this trade-off. The results of this analysis are shown in Table 2. As seen in Table 2, “Exhaustive Pre-Computation” makes query processing extremely efficient, since node embeddings are readily available during query processing with this approach. However, pre-processing time and storage requirements grow exponentially with the number of versions in the database,
making this approach infeasible for practical applications. “Network Integration at Query Time” has zero pre-processing time and storage that grows linearly with the number of versions, but its query processing is very slow because it needs to compute the embeddings at query time. “Consensus Embedding at Query Time” effectively balances this trade-off, as it requires pre-processing time and storage that grow linearly with the number of versions, yet its query processing is always fast. In our experiments, there are 8 networks. However, in many practical settings, the number of networks can be large enough to render storage of all combinations infeasible.

Table 2. Assessment of the trade-off between pre-processing time, storage requirements, and query runtime for combinatorial link prediction on versioned networks. Results are shown for a database of 8 network versions and 64-dimensional embeddings. SVD is used to compute consensus embeddings.

Method     Approach                             Storage    Preprocess time   Query runtime
Node2vec   Exhaustive precomputation            180 MB     32676.624 s       0 s
Node2vec   Network integration at query time    1.02 MB    0 s               128.144 ± 60.627 s
Node2vec   Consensus embedding at query time    10.0 MB    490.354 s         1.840 ± 0.745 s
Role2vec   Exhaustive precomputation            322 MB     38278.802 s       0 s
Role2vec   Network integration at query time    1.02 MB    0 s               150.113 ± 17.222 s
Role2vec   Consensus embedding at query time    5.53 MB    1036.038 s        1.616 ± 0.822 s

4 Conclusion
In this work, we consider the problem of computing node embeddings for integrated networks derived from multiple network versions. We define consensus embeddings as node embeddings of the integrated network that are computed using the embeddings of the individual versions. We test the link prediction performance of the consensus embeddings and find that their accuracy is close to the accuracy of embeddings computed directly from the integrated network. Our runtime analyses show that computing consensus embeddings is much more efficient than computing embeddings from the integrated network of multiple versions.
References
1. Ahmed, N., Rossi, R.A., Lee, J.B., Kong, X., Willke, T.L., Zhou, R., Eldardiry, H.: Learning role-based graph embeddings. arXiv abs/1802.02896 (2018)
2. Ata, S.K., Fang, Y., Wu, M., Li, X.-L., Xiao, X.: Disease gene classification with metagraph representations. Methods 131, 83–92 (2017)
3. Bojchevski, A., Günnemann, S.: Deep Gaussian embedding of graphs: unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815 (2017)
4. Cavallari, S., Zheng, V.W., Cai, H., Chang, K.C.-C., Cambria, E.: Learning community embedding with community detection and node embedding on graphs. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, pp. 377–386. Association for Computing Machinery, New York (2017)
5. Cho, H., Berger, B., Peng, J.: Compact integration of multi-network topology for functional analysis of genes. Cell Syst. 3(6), 540–548 (2016)
6. Cowman, T., Coşkun, M., Grama, A., Koyutürk, M.: Integrated querying and version control of context-specific biological networks. Database (2020). https://doi.org/10.1093/database/baaa018
7. de Sá, H.R., Prudêncio, R.B.C.: Supervised link prediction in weighted networks. In: Proceedings of the 2011 International Joint Conference on Neural Networks, pp. 2281–2288 (2011)
8. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. CoRR, abs/1607.00653 (2016)
9. Kuo, T.-T., Yan, R., Huang, Y.-Y., Kung, P.-H., Lin, S.-D.: Unsupervised link prediction using aggregative statistics on heterogeneous social networks. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, pp. 775–783. Association for Computing Machinery, New York (2013)
10. Lin, J., Zhang, L., He, M., Zhang, H., Liu, G., Chen, X., Chen, Z.: Multi-path relationship preserved social network embedding. IEEE Access 7, 26507–26518 (2019)
11. Ma, Z., Ma, J., Miao, Y., Liu, X.: Privacy-preserving and high-accurate outsourced disease predictor on random forest. Inf. Sci. 496, 225–241 (2019)
12. Mallick, K., Bandyopadhyay, S., Chakraborty, S., Choudhuri, R., Bose, S.: Topo2vec: a novel node embedding generation based on network topology for link prediction. IEEE Trans. Comput. Soc. Syst. 6(6), 1306–1317 (2019)
13. Martínez, V., Berzal, F., Cubero, J.-C.: A survey of link prediction in complex networks. ACM Comput. Surv. 49(4), 69:1–69:33 (2016)
14. Masci, J., Meier, U., Cireşan, D., Schmidhuber, J.: Stacked convolutional autoencoders for hierarchical feature extraction. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) Artificial Neural Networks and Machine Learning, ICANN 2011, pp. 52–59. Springer, Heidelberg (2011)
15. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. CoRR, abs/1403.6652 (2014)
16. Rossi, R.A., Jin, D., Kim, S., Ahmed, N., Koutra, D., Lee, J.B.: From community to role-based graph embeddings. ACM Trans. Knowl. Discov. Data 13(6), 1–25 (2019)
17. Rozemberczki, B., Davies, R., Sarkar, R., Sutton, C.: GEMSEC: graph embedding with self clustering. In: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2019, pp. 65–72. Association for Computing Machinery, New York (2019)
18. Shen, X., Dai, Q., Mao, S., Chung, F., Choi, K.: Network together: node classification via cross-network deep network embedding. In: IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14 (2020)
19. Shobha, K., Nickolas, S.: Integration and rule-based pre-processing of scientific publication records from multiple data sources. In: Satapathy, S. (ed.) Smart Intelligent Computing and Applications, pp. 647–655 (2020)
20. Stark, C., Breitkreutz, B.-J., Reguly, T., Boucher, L., Breitkreutz, A., Tyers, M.: BioGRID: a general repository for interaction datasets. NAR 34(suppl-1), D535–D539 (2006)
21. Su, C., Tong, J., Zhu, Y., Cui, P., Wang, F.: Network embedding in biomedical data science. Briefings Bioinform. 21(1), 182–197 (2018)
22. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: large-scale information network embedding. In: WWW, pp. 1067–1077 (2015)
23. Yue, X., Wang, Z., Huang, J., Parthasarathy, S., Moosavinasab, S., Huang, Y., Lin, S.M., Zhang, W., Zhang, P., Sun, H.: Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics 36(4), 1241–1251 (2020)
24. Zhang, J., Xia, C., Zhang, C., Cui, L., Fu, Y., Yu, P.S.: BL-MNE: emerging heterogeneous social network embedding through broad learning with aligned autoencoder. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2017), pp. 605–614 (2017)
Graph Convolutional Network with Time-Based Mini-Batch for Information Diffusion Prediction
Hajime Miyazawa and Tsuyoshi Murata
Department of Computer Science, School of Computing, Tokyo Institute of Technology, W8-59 2-12-2 Ookayama, Meguro, Tokyo 152-8552, Japan
[email protected], [email protected]
http://www.net.c.titech.ac.jp/
Abstract. Information diffusion prediction is a fundamental task for understanding the phenomenon of information spreading. Many previous works use a static social graph or cascade data for prediction. In contrast, a recently proposed deep learning model, DyHGCN [20], considers users’ dynamic preferences by using dynamic graphs and achieves better performance. However, the training phase of DyHGCN is computationally expensive due to its multiple graph convolution computations. Faster training is also important for reflecting users’ dynamic preferences quickly. Therefore, we propose a novel graph convolutional network model with time-based mini-batch (GCNTM) to improve training speed while modeling users’ dynamic preferences. Time-based mini-batch is a novel input form for handling dynamic graphs efficiently. Using this input, we reduce the graph convolution computation to only once per mini-batch. The experimental results on three real-world datasets show that our model achieves results comparable to those of baseline models. Moreover, our model learns about 5.97 times faster than DyHGCN.

Keywords: Dynamic graph · Graph neural network · Information diffusion prediction · Social networks

1 Introduction
With the rapid growth of online social media, a massive amount of information is propagated to users over social networks, and the diffusion is traceable. This provides a great opportunity to study the information diffusion phenomenon with real-world data, and many researchers explore how to model or predict information diffusion, for example by estimating social influence [21], predicting how far content propagates among users [1], or detecting rumors [12]. Since information diffusion is often recorded as time-series data, recent researchers have formulated information diffusion prediction as a sequential prediction task. The goal is to predict the next user to be activated given an observed cascade sequence, and to understand how past activation affects future diffusion trends. With the great success of neural networks in sequence modeling, several researchers apply deep learning techniques such as Recurrent Neural Network (RNN) or self-attention [13]. Several studies have built diffusion models using cascade sequences [2,4,15,17,18] and social graphs [14,16,19,20]. However, these methods do not capture users’ dynamic preferences, which change over time and influence information diffusion.

On the other hand, a recent study [20] proposed DyHGCN, a novel method that can learn users’ dynamic preferences by building $N_g$ dynamic graphs and using heterogeneous graph convolutional networks, where $N_g$ is the number of dynamic graphs. Although DyHGCN improved prediction accuracy, the model training is time-consuming. The main reason for the computation time issue in [20] is that the model has to compute graph convolution $N_g$ times, using all the dynamic graphs, for each mini-batch during training. In practice, training speed is also important because users repost a massive amount of information in real time. The prediction model should take those diffusion histories into consideration as quickly as possible.

Therefore, we propose a novel Graph Convolutional Network with Time-based Mini-batch (GCNTM) to improve training speed while learning users’ dynamic preferences. In GCNTM, we first compute a one-layer graph convolution only once per mini-batch, using the latest dynamic graph for the input cascade sequences, to learn dynamic user preferences. Then, we capture the historical dependency by using self-attention, and produce vector representations of the input cascades for prediction. To reduce graph convolution to only once per mini-batch, we also propose a novel mini-batch input called time-based mini-batch, which builds the mini-batch input based on the latest timestamp of each cascade sequence. We evaluate GCNTM and other baselines, including state-of-the-art models, on information diffusion prediction over three real-world datasets. Experimental results suggest that GCNTM improves computational speed significantly while achieving comparable accuracy in terms of map@k and hits@k scores.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 53–65, 2021. https://doi.org/10.1007/978-3-030-65351-4_5
2 Related Works
In this paper, we focus on the sequential prediction task using deep learning techniques. In this section, we summarize previous works for information diffusion prediction, especially for sequential diffusion prediction task using deep learning. Early works study information diffusion prediction based on graph, and often assume the underlying diffusion process such as Independent Cascade (IC) model or Linear Threshold (LT) model [5]. However, the underlying diffusion process is unknown in most cases and this assumption causes poor performance when the assumption is invalid. With the great success of deep learning methods, some works introduce deep learning models for information diffusion prediction [9,11], which do not assume any prior diffusion process and automatically learn from real-world data. For sequential diffusion prediction task, most previous works apply deep learning techniques for sequential data, such as Recurrent Neural Network
(RNN) or self-attention [13]. These models automatically learn a representation of the diffusion path from past cascade data. These works often use a social graph or sequential data. For example, Topo LSTM [14] extended the standard LSTM model to learn the information diffusion path and generate a topology-aware embedding. DeepDiffuse [4] and CYAN-RNN [15] employed RNN with an attention mechanism to use the activation timestamp information. SNIDSA [16] also employed RNN and computed structural attention based on the social graph. FOREST [19] employed graph embedding [10] and conducted neighborhood aggregation to obtain user representations. NDM [18] built a microscopic cascade model based on self-attention and convolutional neural networks. HiDAN [17] adopted self-attention and time attention to take time intervals into consideration. None of these methods learns users’ dynamic preferences, which change over time and are intuitively beneficial to the prediction. On the other hand, a recently proposed method called DyHGCN [20] used dynamic graphs to learn users’ dynamic preferences by employing multiple graph convolutional networks (GCN) [7]. The model also captured sequential patterns with self-attention. Since this model has to compute graph convolution many times, its training phase is time-consuming.
3 Notations and Problem Formulation
In this section, we introduce the necessary notation and formulate the information diffusion prediction task to be solved. Consider a static social graph $G = (V, E)$, where $V$ is the user set and $E$ is the edge set. Moreover, we consider that a document $i$ is diffused over $V$ by users’ reposting behavior. The diffusion process is represented as a sequence of tuples $c_i \in C$,

$$c_i = \{(v_j^i, t_j^i) \mid j = 1, 2, \ldots, N_{c_i},\ v_j^i \in V,\ t_j^i \in \mathbb{R}^+\}, \qquad (1)$$

where $N_{c_i}$ is the cascade length and $C$ is the set of all cascade sequences. Intuitively, $(v_j^i, t_j^i) \in c_i$ means that user $v_j^i$ reposted document $i$ at timestamp $t_j^i$. Tuples are ordered by their timestamps within each $c_i$. We further define $c_{i:k}$ as

$$c_{i:k} = \{(v_j^i, t_j^i) \mid j = 1, 2, \ldots, k \le N_{c_i},\ (v_j^i, t_j^i) \in c_i\}, \qquad (2)$$

which denotes the repost sequence up to the $k$-th repost. We call $c_{i:k}$ the partial cascade sequence of document $i$ up to the $k$-th user. Figure 1 illustrates the cascade sequence $c_i$ and the partial cascade sequence $c_{i:k}$. Furthermore, we define the dynamic graph $G_t = (V, E_t)$, which represents the graph at time $t$. Each dynamic graph $G_t$ is directed and unweighted. We construct $G_t$ using the static graph $G$ and the cascade sequences. The construction procedure of the graph $G_t$ is as follows. Note that this procedure follows the implementation of [20].¹

¹ Although the original paper defined the static graph and the diffusion graph separately and treated the latter as a weighted graph, both graphs are unified and treated as undirected in the authors’ implementation.
Fig. 1. Illustrations of the cascade sequence $c_i$ and the partial cascade sequence $c_{i:k}$. Each tuple $(v_k^i, t_k^i)$ denotes that user $v_k^i$ reposted document $i$ at time $t_k^i$. The red dashed box shows the range of the partial cascade sequence $c_{i:k}$.
First, we collect all cascade sequences until time $t$. Then we construct the diffusion graph $G'_t = (V, E'_t)$, where $E'_t$ includes an edge $(v_i, v_j)$ if $v_j$ reposted at least one document just after $v_i$. Finally, we define the dynamic edge set $E_t$ as the union of $E'_t$ and the static edge set $E$, i.e., $E_t = E'_t \cup E$, and define the dynamic graph $G_t$ as $G_t = (V, E_t)$. Figure 2 illustrates the construction procedure of $G_t$. In this paper, we split the observation period into $N_g$ intervals and create $N_g$ dynamic graphs. We define the dynamic graph sequence $G_d$ as $G_d = \{G_{t_1}, \ldots, G_{t_{N_g}}\}$, where each $G_{t_j}$ is the dynamic graph at the beginning of the $j$-th time interval $[t_j, t_{j+1})$.

Finally, we formulate information diffusion prediction as a sequential prediction task. As problem input, we have a set of cascade sequences $C$ and a dynamic graph sequence $G_d$. As problem output, we have a diffusion model $\mathcal{M}$, which predicts the next activated user $v_x^i$ given a test partial cascade sequence $c_{i:x-1}$ and $G_{t_j}$, i.e., the dynamic graph at time $t_j$, where the latest repost timestamp $t_{x-1}^i$ satisfies $t_{x-1}^i \in [t_j, t_{j+1})$.
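The construction of the dynamic graphs can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [20]: cascades are lists of `(user, timestamp)` tuples sorted by time, edges are directed pairs, reposts with timestamp at most `t` count as observed "until time t", and all function names are hypothetical.

```python
def diffusion_edges(cascades, t):
    """Edges (u, v) such that v reposted some document right after u, up to time t."""
    edges = set()
    for cascade in cascades:
        # Keep only the reposts observed up to time t (inclusive, by assumption).
        observed = [(user, ts) for user, ts in cascade if ts <= t]
        for (u, _), (v, _) in zip(observed, observed[1:]):
            edges.add((u, v))
    return edges


def dynamic_edges(static_edges, cascades, t):
    """E_t: union of the static edge set and the diffusion edges observed up to t."""
    return set(static_edges) | diffusion_edges(cascades, t)


def dynamic_graph_sequence(static_edges, cascades, interval_starts):
    """One dynamic edge set per interval start t_1, ..., t_Ng."""
    return [dynamic_edges(static_edges, cascades, t) for t in interval_starts]


# Toy data (hypothetical): one static edge and two cascades.
static_edges = {("a", "b")}
cascades = [
    [("a", 1.0), ("c", 2.0), ("b", 5.0)],
    [("b", 3.0), ("a", 4.0)],
]
graphs = dynamic_graph_sequence(static_edges, cascades, [2.0, 6.0])
```

In this toy example the first dynamic graph contains the static edge plus `("a", "c")`, while the second also contains `("c", "b")` and `("b", "a")`.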
Fig. 2. Illustration of the construction of the dynamic graph $G_t$. The left box shows the example cascade sequences. In the static graph $G$, the arrow $v_i \to v_j$ denotes that $v_j$ follows $v_i$ in the static social network. In the diffusion graph $G'_t$, the arrow denotes that $v_j$ reposted some documents just after $v_i$ until time $t$. We define the dynamic graph $G_t$ as the union of the static graph $G$ and the diffusion graph $G'_t$.
4 Proposed Method
In this section, we introduce GCNTM, a deep learning model that uses a novel mini-batch input and learns dynamic user representations effectively. Our method consists of three stages. First, we construct a novel mini-batch input called the time-based mini-batch (Subsect. 4.1). Second, we compute dynamic user representations using a dynamic graph (Subsect. 4.2). Finally, we capture dynamic cascade representations and predict the user to be activated next (Subsect. 4.3). Our method is illustrated in Fig. 3.
Fig. 3. An overview of GCNTM: it consists of the following three steps: (1) Construction of time-based mini-batch (bottom box, Subsect. 4.1) (2) Graph convolution using dynamic graph (top left box, Subsect. 4.2) (3) Capturing sequential characteristics using self-attention (top right box, Subsect. 4.3).
4.1 Time-Based Mini-Batch
In this section, we explain the time-based mini-batch, a novel mini-batch input. Most previous methods construct mini-batches in a document-based manner. Specifically, these methods simply sample up to $N_b$ cascade sequences and use their user id sequences, optionally together with their timestamps, as model input, where $N_b$ denotes the batch size of the document-based mini-batch. In the time-based mini-batch, on the other hand, we first select a time interval $[t_j, t_{j+1})$. Then we collect the partial cascade sequences that satisfy $t_k^i \in [t_j, t_{j+1})$, where $t_k^i$ is the latest timestamp in a partial cascade sequence $c_{i:k}$. This condition guarantees that the latest reposting behavior in each collected cascade sequence was observed within the same time interval. Finally, we randomly select up to $N_b$ partial cascade sequences from the collected ones and use their user id sequences together with the dynamic graph $G_{t_j}$ as model input. Here $N_b$ denotes the batch size of the time-based mini-batch. This mini-batch input enables us to compute dynamic vector representations of all input cascade sequences using only one dynamic graph. Figure 4 shows construction examples of both mini-batch types.
Fig. 4. Construction examples for the document-based and the time-based mini-batch. In this figure, we consider 3 cascade sequences $c_1, c_2, c_3$ and 2 dynamic graphs $G_{t_1}, G_{t_2}$, shown on the repost timeline (left). In the depiction of each mini-batch (right), each row denotes a sequence of reposting users $v_k^i$, and P denotes the padding token. In the document-based mini-batch, we collect all the repost user ids of the cascades as input. In contrast, when we construct a time-based mini-batch with the dynamic graph at time $t_1$, we sample partial cascades $c_{i:k}$ whose latest timestamp $t_k^i$ satisfies $t_k^i \in [t_1, t_2)$. After sampling, we use $G_{t_1}$ and the user ids in $c_{i:k}$ as input.
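The sampling step of the time-based mini-batch can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `time_based_minibatch`, the `(user, timestamp)` cascade format, and the choice to return each prefix together with its next user as the prediction target are assumptions made for the example.

```python
import random


def time_based_minibatch(cascades, t_start, t_end, batch_size, rng=random):
    """Sample (prefix user ids, next user) pairs for the interval [t_start, t_end).

    A prefix of length k qualifies when the timestamp of its latest (k-th)
    repost lies inside the interval; the (k+1)-th user is the target.
    """
    candidates = []
    for cascade in cascades:
        # k ranges over prefix lengths that still leave a next user to predict.
        for k in range(1, len(cascade)):
            latest_time = cascade[k - 1][1]
            if t_start <= latest_time < t_end:
                prefix_users = [user for user, _ in cascade[:k]]
                next_user = cascade[k][0]
                candidates.append((prefix_users, next_user))
    # Randomly keep at most batch_size of the collected prefixes.
    if len(candidates) > batch_size:
        candidates = rng.sample(candidates, batch_size)
    return candidates


cascades = [
    [("a", 1), ("b", 2), ("c", 5)],
    [("b", 3), ("a", 4), ("d", 9)],
]
batch = time_based_minibatch(cascades, t_start=2, t_end=5, batch_size=10)
small_batch = time_based_minibatch(cascades, t_start=2, t_end=5, batch_size=2)
```

Every prefix in `batch` ends inside the same interval, so a single dynamic graph (the one at `t_start`) suffices for the whole batch.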
4.2 Computing Dynamic User Representation
Assume that we input a dynamic graph $G_{t_j}$ and a partial cascade sequence $c_{i:k}$ as the sequence of user ids $[v_1^i, ..., v_k^i] \in \mathbb{N}^k$. Let $A_{t_j}$ be the adjacency matrix of $G_{t_j}$. Initially, we have a base user representation matrix $X \in \mathbb{R}^{|V| \times d}$, and we apply a graph neural network [23] to obtain the dynamic user representation matrix:

$$X_{l+1} = \mathrm{ReLU}(\hat{A}_{t_j} X_l W_l), \tag{3}$$

where $\hat{A}_{t_j}$ is the normalized adjacency matrix used in GCN [7], i.e., $\hat{A}_{t_j} = \tilde{D}^{-\frac{1}{2}} \tilde{A}_{t_j} \tilde{D}^{-\frac{1}{2}}$, $\tilde{A}_{t_j} = A_{t_j} + I_{|V|}$, $\tilde{D}_{mm} = \sum_n \tilde{A}_{mn}$, and $X_0 = X$. After applying Eq. (3) $a$ times, we use $X_a$ as the dynamic user representation matrix $X'$.
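One propagation step of Eq. (3) can be illustrated with plain Python lists. This is a didactic sketch only: the helper names are hypothetical, dense nested lists stand in for the sparse tensors a real implementation would use, and the tiny graph and weights are made up for the example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Add self-loops, then scale entry (m, n) by 1 / sqrt(deg_m * deg_n)."""
    size = len(adj)
    a_tilde = [[adj[m][n] + (1.0 if m == n else 0.0) for n in range(size)]
               for m in range(size)]
    degree = [sum(row) for row in a_tilde]  # row sums of the self-looped matrix
    return [[a_tilde[m][n] / math.sqrt(degree[m] * degree[n]) for n in range(size)]
            for m in range(size)]


def gcn_layer(a_hat, x, w):
    """One step of Eq. (3): ReLU(A_hat X W)."""
    return [[max(0.0, value) for value in row] for row in matmul(matmul(a_hat, x), w)]


adj = [[0.0, 1.0], [1.0, 0.0]]      # two users connected by one edge
a_hat = normalized_adjacency(adj)   # [[0.5, 0.5], [0.5, 0.5]]
x0 = [[1.0], [3.0]]                 # one-dimensional base representations
x1 = gcn_layer(a_hat, x0, [[1.0]])  # each user averages itself and its neighbor
```

With one layer, each user's representation mixes only its own vector and those of its direct neighbors, which is the setting the experiments later find to work best.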
4.3 Predicting Next Activation User
Once we obtain the dynamic user representation matrix $X'$, we extract the user vectors for the input cascade sequence. Specifically, given user ids $[v_1^i, ..., v_k^i] \in \mathbb{N}^k$, we obtain the participating user representation matrix $X'_{c_{i:k}} = [X'_{v_1^i}, ..., X'_{v_k^i}] \in \mathbb{R}^{k \times d}$, where $X'_l$ denotes the $l$-th row of $X'$. After obtaining $X'_{c_{i:k}}$, we capture the context dependency and create a vector representation of the cascade sequence using multi-head self-attention [13]:

$$\begin{aligned} \mathrm{Attention}(Q, K, V) &= \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d'}}\right) V, \\ \mathrm{head}_p &= \mathrm{Attention}(X'_{c_{i:k}} W_p^{Q}, X'_{c_{i:k}} W_p^{K}, X'_{c_{i:k}} W_p^{V}), \\ H &= [\mathrm{head}_1; \mathrm{head}_2; ...; \mathrm{head}_{N_h}] W^{O}, \end{aligned} \tag{4}$$

where $W_p^Q, W_p^K, W_p^V \in \mathbb{R}^{d \times d'}$, $W^O \in \mathbb{R}^{d \times d}$, $H \in \mathbb{R}^{k \times d}$, $d' = d/N_h$, and $N_h$ is the number of attention heads. Next, we transform $H$ into a single vector representation $h \in \mathbb{R}^d$ by applying column-wise mean pooling, $h_n = \frac{1}{k} \sum_{m=1}^{k} H_{mn}$, where $h_n$ is the $n$-th element of $h$. Finally, we compute the output $\hat{y}$ via a two-layer fully connected neural network and a softmax function:

$$\hat{y} = \mathrm{Softmax}(W_2^{out}\, \mathrm{ReLU}(W_1^{out} h + b_1) + b_2), \tag{5}$$

where $W_1^{out} \in \mathbb{R}^{d \times d}$, $W_2^{out} \in \mathbb{R}^{d \times |V|}$, $b_1 \in \mathbb{R}^d$, $b_2 \in \mathbb{R}^{|V|}$, and $\hat{y} \in \mathbb{R}^{|V|}$. $W_i^{out}$ and $b_i$ are model parameters, and $\hat{y}$ is the final output.
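The self-attention and pooling steps can be sketched with plain Python lists. This is a simplified illustration under assumptions, not the authors' PyTorch model: the helper names are hypothetical, the scaling uses the per-head dimension as in [13], and the output layer of Eq. (5), dropout, and padding masks are omitted.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def attention(q, k, v):
    """Scaled dot-product attention; the scale is the per-head dimension."""
    d_head = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(x, head_weights, w_o):
    """head_weights is a list of (W_q, W_k, W_v) triples, one per head."""
    heads = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
             for w_q, w_k, w_v in head_weights]
    # Concatenate the head outputs row by row, then apply the output projection.
    concatenated = [sum((head[r] for head in heads), []) for r in range(len(x))]
    return matmul(concatenated, w_o)


def mean_pool(h_matrix):
    """Column-wise mean over the k rows of H."""
    k = len(h_matrix)
    return [sum(column) / k for column in zip(*h_matrix)]


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]  # two users with 2-dimensional vectors
h_matrix = multi_head_attention(x, [(identity, identity, identity)], identity)
h = mean_pool(h_matrix)
```

Because all positions are processed in one matrix product, no step depends on the result of a previous position, in contrast to a recurrent model.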
4.4 Model Training
We train GCNTM using the time-based mini-batch (see Subsect. 4.1) as sequential input. Specifically, we use $A_{t_j}$, the adjacency matrix of a dynamic graph $G_{t_j}$, and user ids $U \in \mathbb{R}^{N_b \times l}$ to obtain the prediction matrix $\hat{Y} \in \mathbb{R}^{N_b \times |V|}$, where $N_b$ is the batch size and $l$ is the maximum cascade length among the cascade sequences in the mini-batch. Then, we update the model parameters to minimize the cross-entropy loss

$$L(\Theta) = -\sum_{i=1}^{N_b} \sum_{j=1}^{|V|} Y_{ij} \log \hat{Y}_{ij}, \tag{6}$$

where $\Theta$ denotes all model parameters, $\hat{Y}_{ij}$ is the $(i, j)$ element of the final output $\hat{Y} \in \mathbb{R}^{N_b \times |V|}$, and $Y_{ij} = 1$ if the actual next activated user for the $i$-th input sequence is $v_j$, and $Y_{ij} = 0$ otherwise. Since information about recent reposting users is important for prediction [18], we use at most the $N_l$ most recent user ids for computing $\hat{y}$, which limits the size of the user id matrix $U$ to $N_b \times N_l$.
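Because each row of $Y$ is one-hot, the double sum in Eq. (6) reduces to the negative log-probability assigned to the true next user of each sequence. The following short sketch illustrates this; the function name and the toy probabilities are made up for the example.

```python
import math


def cross_entropy(y_hat, targets):
    """Eq. (6) for one-hot labels: sum of -log(probability of the true next user).

    y_hat:   list of probability rows, one per input sequence
    targets: index of the actual next user for each sequence
    """
    return -sum(math.log(row[target]) for row, target in zip(y_hat, targets))


y_hat = [[0.5, 0.5], [0.25, 0.75]]
loss = cross_entropy(y_hat, targets=[0, 1])  # -(log 0.5 + log 0.75)
```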
5 Experiments

5.1 Experimental Settings
Datasets. Following previous studies [19,20], we conduct experiments on three publicly available datasets. The statistics of these datasets are shown in Table 1. #Users and #Links denote the number of users and the number of follower-followee links between users, respectively. #Cascades denotes the number of cascade sequences in each dataset. Ave. length denotes the average length of the observed cascade sequences.

Twitter [3] dataset records tweets containing URLs during October 2010. Each URL is interpreted as an information item spreading among users.

Douban [22] is a Chinese social website where users can update their book reading statuses and follow the statuses of other users. Each book is considered as an information item, and a user is activated if she reads the book.

Memetracker [8] collects millions of news stories and blog posts from online websites and tracks the most frequent quotes and phrases, i.e., memes, to analyze the migration of memes among people. Each meme is an information item and each website URL is treated as a user. Note that this dataset has no underlying social graph.

Table 1. Statistics of datasets

| Datasets | Twitter | Douban | Memetracker |
|---|---|---|---|
| #Users | 12,627 | 23,123 | 4,709 |
| #Links | 309,631 | 348,280 | - |
| #Cascades | 3,442 | 10,602 | 12,661 |
| Ave. length | 32.60 | 27.14 | 16.24 |

Following previous studies [19,20], we randomly sampled 80% of the cascades for training, 10% for validation, and the remaining 10% for testing.

Baselines. We compared GCNTM with previous deep learning methods for information diffusion prediction. The details of each method are as follows.

TopoLSTM² [14] regards information diffusion as a growing directed acyclic graph and extends the standard LSTM model to capture topology-aware user embeddings for information diffusion prediction.

DeepDiffuse (see Footnote 2) [4] employs an embedding technique and attention to model the activation timestamp information. The model can predict when and who is going to be activated in a social network based on previously observed cascades.

NDM (see Footnote 2) [18] builds a microscopic cascade model based on self-attention and a convolutional neural network to alleviate the long-term dependency problem.

SNIDSA (see Footnote 2) [16] computes pairwise similarities of all user-user pairs and incorporates the structural information into an RNN by a gating mechanism.

FOREST³ [19] is a multi-scale diffusion prediction model based on reinforcement learning. The model incorporates the macroscopic diffusion size information into the RNN-based microscopic diffusion model.

DyHGCN (see Footnote 3) [20] defines the dynamic graph from cascade sequences and obtains historical dynamic user representations using a heterogeneous graph convolutional neural network and self-attention. This method was the state of the art at the time of submitting this paper.

² Results of these baselines are cited from papers [19, 20].

³ Although results of these baselines are reported in [19, 20], we conducted the experiments again. These models used an additional user token that denotes the end of a cascade sequence and included this token as one of the target values. We found that this inclusion improved the results, but this setting was unfair to the other baselines. Therefore, we conducted the experiments without this token.
Table 2. Settings for DyHGCN (not our model). For our model, see Subsect. 5.1.

| Parameter | Value | Description |
|---|---|---|
| (Optimizer) | Adam [6] | Parameter update algorithm |
| $\beta_1$ | 0.9 | A parameter for Adam |
| $\beta_2$ | 0.999 | A parameter for Adam |
| Learning rate | $10^{-3}$ | Initial learning rate |
| Batch size $N_b$ | 16 | Batch size for training |
| GNN module | 2-layer GCN | Graph integration module |
| #Dim for GNN kernel | 128 | Intermediate dimension size for GNN module |
| #Heads $N_h$ | 14 | Attention heads for self-attention |
| #Dynamic graph $N_g$ | 8 | Create dynamic graphs with #number time intervals |
| Dropout | 0.1 | Drop rate |
Evaluation Metrics and Parameter Settings. Following the settings of previous studies [14,19,20], we treat next activated user prediction as a retrieval task and rank the inactivated users by their activation probabilities. We compare GCNTM with state-of-the-art baselines in terms of Mean Average Precision on the top k (map@k) and Hits score on the top k (hits@k). For FOREST and DyHGCN, we used the settings recommended in their papers; in particular, Table 2 shows the parameters of DyHGCN. We implemented our model in PyTorch. We use a 1-layer GCN for the GNN module. To align the number of mini-batches with that of the baselines, the size of the time-based mini-batch $N_b$ is set to about $N_b = 16 \times$ (average cascade length), e.g., 512 for Twitter, 480 for Douban, and 256 for Memetracker. The drop rate is set to 0.2. The maximum number of recent users $N_l$ is set to $N_l = 30$. Other parameters are set to the same values as for DyHGCN (see Table 2).
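The two metrics can be illustrated for a single test example as follows. This is a minimal sketch under the assumption that each example has exactly one ground-truth next user, so that average precision at k reduces to the reciprocal rank of that user when it appears in the top k and to 0 otherwise. The function name and the explicit masking of already activated users are illustrative choices, not taken from the authors' code.

```python
def hits_and_ap_at_k(scores, target, k, activated=()):
    """Return (hit, average precision) at k for one prediction.

    scores:    activation probability for every user id 0..|V|-1
    target:    id of the user who was actually activated next
    activated: ids of users already in the cascade (excluded from the ranking)
    """
    excluded = set(activated)
    candidates = [user for user in range(len(scores)) if user not in excluded]
    # Rank the remaining users by descending probability and keep the top k.
    top_k = sorted(candidates, key=lambda user: scores[user], reverse=True)[:k]
    if target in top_k:
        rank = top_k.index(target) + 1
        return 1.0, 1.0 / rank
    return 0.0, 0.0


scores = [0.1, 0.5, 0.3, 0.1]
result_top2 = hits_and_ap_at_k(scores, target=2, k=2)                 # ranked second
result_masked = hits_and_ap_at_k(scores, target=2, k=2, activated=[1])  # ranked first
result_top1 = hits_and_ap_at_k(scores, target=2, k=1)                 # missed
```

Averaging the two returned values over all test examples would give hits@k and map@k.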
5.2 Experimental Results
Comparison with Baselines. We compare GCNTM with the baselines on information diffusion prediction. Table 3 shows the performance of all methods. From the table, we can see that our model performs the best in most cases in terms of hits@k and is comparable in terms of map@k. Compared with NDM, SNIDSA, and FOREST, our model performs better in terms of hits@k, except for hits@10 on Memetracker. In terms of map@k, however, our model performs the best on Twitter, is comparable on Douban, and scores lower on Memetracker. These methods do not use dynamic information, and NDM does not use any graph at all. This result suggests that adding dynamic information ranks a broader range of promising candidates higher. However, this behavior sometimes worsens the quality of the highest ranked users, which matters for map@k. Compared with DyHGCN, our model performs better or comparably in many cases. This shows that performance remains comparable even if we consider only the latest dynamic graph.
Table 3. Experimental results of all baselines on three datasets. Bold value shows the highest, and underline shows the second highest. Because of the absence of a social graph in Memetracker, we omit the models that require an underlying social graph.

| Dataset | Model | hits@10 | hits@50 | hits@100 | map@10 | map@50 | map@100 |
|---|---|---|---|---|---|---|---|
| Twitter | DeepDiffuse | 4.57 | 8.80 | 13.39 | 3.62 | 3.79 | 3.85 |
| Twitter | TopoLSTM | 6.51 | 15.48 | 23.68 | 4.31 | 4.67 | 4.79 |
| Twitter | NDM | 21.52 | 32.23 | 38.31 | 14.30 | 14.80 | 14.89 |
| Twitter | SNIDSA | 23.37 | 35.46 | 43.49 | 14.84 | 15.40 | 15.51 |
| Twitter | FOREST | 24.97 | 37.83 | 45.76 | 16.35 | 16.94 | 17.06 |
| Twitter | DyHGCN | 26.76 | 46.03 | 56.75 | 15.99 | 16.87 | 17.02 |
| Twitter | GCNTM | 27.98 | 47.21 | 58.36 | 16.45 | 17.32 | 17.48 |
| Douban | DeepDiffuse | 9.02 | 14.93 | 19.13 | 4.80 | 5.07 | 5.13 |
| Douban | TopoLSTM | 9.16 | 14.94 | 18.93 | 5.00 | 5.26 | 5.32 |
| Douban | NDM | 10.31 | 18.87 | 24.02 | 5.54 | 5.93 | 6.00 |
| Douban | SNIDSA | 11.81 | 21.91 | 28.37 | 6.36 | 6.81 | 6.91 |
| Douban | FOREST | 10.50 | 20.91 | 26.91 | 5.13 | 5.60 | 5.68 |
| Douban | DyHGCN | 12.66 | 25.73 | 33.40 | 5.75 | 6.34 | 6.54 |
| Douban | GCNTM | 13.62 | 26.89 | 34.69 | 6.32 | 6.93 | 7.03 |
| Memetracker | DeepDiffuse | 13.93 | 26.50 | 34.77 | 8.14 | 8.69 | 8.80 |
| Memetracker | NDM | 25.44 | 42.19 | 51.14 | 13.57 | 14.33 | 14.46 |
| Memetracker | FOREST | 25.28 | 43.06 | 52.72 | 12.89 | 13.71 | 13.84 |
| Memetracker | DyHGCN | 25.27 | 45.50 | 55.66 | 12.53 | 13.45 | 13.60 |
| Memetracker | GCNTM | 24.69 | 45.12 | 55.84 | 11.82 | 12.75 | 12.91 |
Ablation Study. Table 4 shows the results of the ablation studies. We conducted the same experiment with the following ablation models. GCNTM-G: this model does not use any graph information. GCNTM-D: this model uses only the static social graph. Moreover, we also show the results when the number of GNN layers is set to 2. We omit the results on Memetracker because this dataset does not have a static network. Compared with the graph ablation models in terms of hits@50 and hits@100, results consistently get better as the amount of information increases. However, in terms of hits@10 and map@k, results sometimes get worse despite the increase in information. This is the same trend as in the comparison with the baselines. Moreover, results are consistently better when the number of GNN layers is set to 1 than when it is set to 2. These results imply that direct neighbors are the most influential users for each user when we construct dynamic graphs from cascade sequences, and that even 2-hop information can be noisy.
Table 4. Ablation study

| Dataset | Model | hits@10 | hits@50 | hits@100 | map@10 | map@50 | map@100 |
|---|---|---|---|---|---|---|---|
| Twitter | GCNTM-G | 28.25 | 46.20 | 57.14 | 17.00 | 17.82 | 17.98 |
| Twitter | GCNTM-D | 28.46 | 47.16 | 58.05 | 16.73 | 17.58 | 17.73 |
| Twitter | #GNN-layer=2 | 26.89 | 46.25 | 57.98 | 15.39 | 16.45 | 16.27 |
| Twitter | #GNN-layer=1 | 27.98 | 47.21 | 58.36 | 16.45 | 17.32 | 17.48 |
| Douban | GCNTM-G | 12.83 | 25.61 | 33.22 | 5.79 | 6.36 | 6.47 |
| Douban | GCNTM-D | 12.73 | 25.77 | 33.40 | 5.71 | 6.31 | 6.41 |
| Douban | #GNN-layer=2 | 13.45 | 26.48 | 34.35 | 6.21 | 6.80 | 6.91 |
| Douban | #GNN-layer=1 | 13.62 | 26.89 | 34.69 | 6.32 | 6.93 | 7.03 |
Learning Time. Table 5 shows the learning time for each dataset. We only show the results of the methods we ran ourselves. We conducted all experiments on a machine with half of an Intel Xeon E5-2680 V4 CPU, 64 GB memory, and a Tesla P100 for NVLink-Optimized Servers as GPU. The results indicate that our model learns much faster than the baselines while still learning dynamic information. Compared with DyHGCN, our model finishes an epoch 2.0–6.0 times faster because our model computes the GNN module only once per mini-batch, whereas DyHGCN computes it $N_g = 8$ times. Compared with FOREST, our model is faster because it uses self-attention to capture sequential dependency. The complexity of an RNN is $O(n \cdot d^2)$ and it requires $O(n)$ sequential operations, where $n$ is the cascade length and $d$ is the size of the hidden vector. For self-attention, on the other hand, the complexity is $O(n^2 \cdot d)$ and the number of sequential operations is $O(1)$. In our experiment, we have $n \le N_l = 30 < d = 64$. Compared with GCNTM-G, we can see that the GCN itself is still time-consuming. This can be problematic when we apply our model to a larger dataset. We can alleviate this problem by employing sampling-based methods. However, we leave this examination as future work.

Table 5. Learning time for each dataset. Unit: second/epoch

| Model | Douban | Twitter | Memetracker |
|---|---|---|---|
| FOREST | 97.6 | 35.5 | 43.8 |
| DyHGCN | 176.9 | 54.0 | 55.0 |
| GCNTM-G | 23.2 | 9.0 | 22.0 |
| GCNTM | 29.6 | 10.9 | 27.4 |
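As a quick arithmetic check of the reported speedup range, the per-epoch times of DyHGCN and GCNTM from Table 5 can be divided directly. The assignment of the values to the Douban, Twitter, and Memetracker columns is an assumption of this sketch, inferred from the column order of the table.

```python
# Seconds per epoch, read from Table 5 (column assignment assumed as stated above).
epoch_seconds = {
    "Douban": {"DyHGCN": 176.9, "GCNTM": 29.6},
    "Twitter": {"DyHGCN": 54.0, "GCNTM": 10.9},
    "Memetracker": {"DyHGCN": 55.0, "GCNTM": 27.4},
}

# Speedup of GCNTM over DyHGCN, rounded to one decimal place.
speedup = {
    dataset: round(times["DyHGCN"] / times["GCNTM"], 1)
    for dataset, times in epoch_seconds.items()
}
```

Under this reading, the ratios span the 2.0 to 6.0 range stated in the text.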
6 Conclusion
In this paper, we study the information diffusion prediction problem. To reduce the computation time of the model training phase while capturing users' dynamic preferences, we propose a novel mini-batch input called the time-based mini-batch and a deep learning model that handles it. We conduct experiments on three real-world datasets. The experimental results show that our model achieves results comparable to those of the baseline methods, including the recently proposed state-of-the-art model, while requiring less computation time, which shows the effectiveness and efficiency of the model.

Acknowledgement. This work was supported by JSPS Grant-in-Aid for Scientific Research (B) (Grant Number 17H01785) and JST CREST (Grant Number JPMJCR1687).
References

1. Cheng, J., Adamic, L., Dow, P.A., Kleinberg, J.M., Leskovec, J.: Can cascades be predicted? In: Proceedings of the 23rd International Conference on World Wide Web, pp. 925–936 (2014)
2. Du, N., Dai, H., Trivedi, R., Upadhyay, U., Gomez-Rodriguez, M., Song, L.: Recurrent marked temporal point processes: embedding event history to vector. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1555–1564 (2016)
3. Hodas, N.O., Lerman, K.: The simple rules of social contagion. Sci. Rep. 4, 4343 (2014)
4. Islam, M.R., Muthiah, S., Adhikari, B., Prakash, B.A., Ramakrishnan, N.: DeepDiffuse: predicting the 'who' and 'when' in cascades. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2018), pp. 1055–1060. IEEE (2018)
5. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146 (2003)
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
7. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
8. Leskovec, J., Backstrom, L., Kleinberg, J.: Meme-tracking and the dynamics of the news cycle. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 497–506 (2009)
9. Li, C., Ma, J., Guo, X., Mei, Q.: DeepCas: an end-to-end predictor of information cascades. In: Proceedings of the 26th International Conference on World Wide Web, pp. 577–586 (2017)
10. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)
11. Qiu, J., Tang, J., Ma, H., Dong, Y., Wang, K., Tang, J.: DeepInf: social influence prediction with deep learning. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2110–2119 (2018)
12. Takahashi, T., Igata, N.: Rumor detection on Twitter. In: Proceedings of the 6th International Conference on Soft Computing and Intelligent Systems, and the 13th International Symposium on Advanced Intelligence Systems, pp. 452–457. IEEE (2012)
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
14. Wang, J., Zheng, V.W., Liu, Z., Chang, K.C.C.: Topological recurrent neural network for diffusion prediction. In: Proceedings of the IEEE International Conference on Data Mining (ICDM 2017), pp. 475–484. IEEE (2017)
15. Wang, Y., Shen, H., Liu, S., Gao, J., Cheng, X.: Cascade dynamics modeling with attention-based recurrent neural network. In: Proceedings of the IJCAI, pp. 2985–2991 (2017)
16. Wang, Z., Chen, C., Li, W.: A sequential neural information diffusion model with structure attention. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1795–1798 (2018)
17. Wang, Z., Li, W.: Hierarchical diffusion attention network. In: Proceedings of the IJCAI, pp. 3828–3834 (2019)
18. Yang, C., Sun, M., Liu, H., Han, S., Liu, Z., Luan, H.: Neural diffusion model for microscopic cascade prediction. arXiv preprint arXiv:1812.08933 (2018)
19. Yang, C., Tang, J., Sun, M., Cui, G., Liu, Z.: Multi-scale information diffusion prediction with reinforced recurrent networks. In: Proceedings of the IJCAI, pp. 4033–4039 (2019)
20. Yuan, C., Li, J., Zhou, W., Lu, Y., Zhang, X., Hu, S.: DyHGCN: a dynamic heterogeneous graph convolutional network to learn users' dynamic preferences for information diffusion prediction. arXiv preprint arXiv:2006.05169 (2020)
21. Zhang, J., Liu, B., Tang, J., Chen, T., Li, J.: Social influence locality for modeling retweeting behaviors. In: Proceedings of the IJCAI, vol. 13, pp. 2761–2767 (2013)
22. Zhong, E., Fan, W., Wang, J., Xiao, L., Li, Y.: ComSoc: adaptive transfer of user behaviors over composite social network. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 696–704 (2012)
23. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., Sun, M.: Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434 (2018)
A Sentiment Enhanced Deep Collaborative Filtering Recommender System

Ahlem Drif¹(✉), Sami Guembour¹, and Hocine Cherifi²

¹ Faculty of Sciences, Ferhat Abbas University, Setif 1, Setif, Algeria
[email protected], [email protected]
² LIB, University of Burgundy, Dijon, France
[email protected]
Abstract. Recommender systems use advanced analytic and learning techniques to select relevant information from massive data and inform users' smart decision-making on their daily needs. Numerous works exploiting users' sentiments on products to enhance recommendations have been introduced. However, there has been relatively less work exploring higher-order user-item feature interactions for sentiment enhanced recommender systems. In this paper, a novel Sentiment Enhanced Deep Collaborative Filtering Recommender System (SE-DCF) is developed. The architecture is based on a neural attention network component aggregated with the output predictions of a Convolution Neural Network (CNN) recommender. Specifically, the developed neural attention component puts more emphasis on user and item interactions when constructing the latent spaces (user-item) by adding the mutual influence between the two spaces. Additionally, the CNN learns from the specific reviews of each user and their sentiment aspects. Hence, it accurately models the item latent factors and creates a profile model for each user. The proposed framework allows users to find suitable items through the comprehensive aggregation of user preferences, item attributes, and sentiments per user-item pair. Experiments on real-world data show that the proposed approach significantly outperforms the state-of-the-art methods in terms of recommendation performance.

Keywords: Recommender systems · Neural recommender models · Collaborative filtering · Content-based filtering · Convolution Neural Network · Attention mechanism · Aspect based opinion mining
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 66–78, 2021. https://doi.org/10.1007/978-3-030-65351-4_6

1 Introduction

Recommender systems have become an integral part of e-commerce sites and other platforms such as social networking sites and movie/music rendering sites. They have a huge impact on the revenue earned by these businesses and also benefit users by reducing the cognitive load of searching through an overload of data. Recommendation systems usually rely on the explicit (e.g., user ratings) or implicit (e.g., click behaviors) interactions between users and products for recommendation. Collaborative filtering recommender systems attempt to learn similarities between users and items. Unspecified ratings can be inferred because observed ratings are often highly correlated across various users and items. User-based methods for collaborative filtering evaluate user similarities based on similar tastes, while item-based methods compute predicted ratings as a function of the ratings of the same user on similar items. Several works present recommender systems based on Recurrent Neural Networks (RNNs) to model the temporal dynamics and sequential evolution of content information [1,2]. In [3], Tang et al. proposed a sequential recommendation model that incorporates Convolution Neural Networks (CNNs) to learn sequential features and a Latent Factor Model (LFM) to learn user-specific features. However, a rating can only reflect the user's overall attitude towards a product without including information about the underlying reasons for the user's behavior. As a result, it is difficult for recommender systems to model a user's fine-grained preferences on specific product features and to provide an explanation for the recommendations.

Several works use reviews in order to improve the recommendation accuracy. In [4], the Explicit Factor Model (EFM) conducts aspect-level sentiment analysis to extract the user's preference and the product's quality on specific product features. The authors incorporated the results into a Matrix Factorisation (MF) framework to provide more accurate recommendations. Meng et al. [5] incorporated the users' emotions towards a review into a matrix factorization model. Wang et al. [6] employed sentiment analysis to optimize movie recommendation. In [7], the authors expanded the user-item matrix to contain both ratings derived from reviews and rated data, and then fed it into a collaborative filtering algorithm. Da'u et al. [8] used a lexicon-based method to calculate the sentiments and build the aspect-ratings matrix using a CNN; they then integrated this matrix with the rating matrix in a factorization model to enhance the recommendation. The main drawback of these methods, which infer overall opinions and convert them into virtual ratings, is the difficulty of modeling the context dependence between the different entities.

To tackle these limitations, we propose a framework that exhibits interactions between users and items in latent embeddings while recommending items according to the users' preferences and the sentiment aspects. This work uses reviews to extract the user's sentiment on different product features, and thus contains more fine-grained information about the user's preferences. Our main contributions are summarized as follows:

– A novel Sentiment Enhanced Deep Collaborative Filtering Recommender System (SE-DCF) is proposed. This architecture uses a neural attention network to model the mutual influence between the user and item spaces. Moreover, it incorporates a sentiment-enhanced recommender based on a CNN.
– The two components developed are: (1) a neural attention network based collaborative filtering recommender, and (2) a CNN-based content filtering recommender. Merging these components reduces the bias towards items that are rated frequently by users.
– The attention network-CF component assigns the relevant weights for item and user mutual interactions. It boosts the accuracy of sentiment enhanced recommender systems by indicating which higher-order feature interactions are informative for the prediction.
– The CNN-based content filtering recommender incorporates multifaceted information such as user reviews, sentiments, and item descriptions. The main advantage is that the ability of the CNN to generate feature maps allows finding which items correlate the most with the user's interest. Furthermore, it can be easily extended to other item modalities.
– The empirical evaluation demonstrates that the SE-DCF framework significantly outperforms state-of-the-art baselines on several real-world datasets.

The rest of the paper is organized as follows. Section 2 introduces the proposed Sentiment Enhanced Deep Collaborative Filtering Recommender System. Section 3 covers the experimental setting and reports the results of the experimentation. Finally, conclusions are provided in Sect. 4.
2 SE-DCF: Sentiment Enhanced Deep Collaborative Filtering Recommender System

To address the rating prediction problem for recommendation, this work combines the strengths of CNN feature extraction with the ability of attention networks to learn the complex relationships between target users and their neighbors. Hence, the recommender architecture is enhanced (i) by modeling a high level of non-linearity that captures mutual interactions between users and items in latent embeddings, and (ii) by incorporating side information to extract sentiment aspects from the items and use them to make predictions. The proposed architecture is depicted in Fig. 1.

2.1 Notation
There are various recommendation tasks such as item ranking and rating prediction. In this work, the recommendation task is formulated as a prediction problem. The interactions between users and items are represented in the form of a utility matrix $R$. For each user $u \in U$, the recommender system attempts to predict all of the unspecified ratings $\hat{r}_u$. Let $U = \{u_1, u_2, ..., u_n\}$ and $I = \{i_1, i_2, ..., i_m\}$ be the sets of users and items respectively, where $n$ is the number of users and $m$ is the number of items. The matrix factorization algorithm decomposes a matrix $M_{(n \times m)}$ into two matrices $P \in \mathbb{R}^{n \times K}$ and $Q \in \mathbb{R}^{m \times K}$. A user's interaction with an item is modelled as the inner product (interaction function) of their latent vectors. Let $R_{ui}$ be the ground truth rating assigned by user $u$ to item $i$. The predicted rating is defined as:

$$\hat{R}_{ui} = p_u^{T} q_i = \sum_{k=1}^{K} p_{uk} q_{ik}, \tag{1}$$

where $p_u$ and $q_i$ are the latent vectors of user $u$ and item $i$ (rows of $P$ and $Q$), and $K$ denotes the dimension of the latent space. The training set $T$ consists of $N$ tuples. Each tuple $(u, i, r_{ui}, s_{ui})$ denotes a review $s_{ui}$ written by user $u$ for item $i$ with rating $r_{ui}$.

For a given predicted rating matrix $\hat{R}$, the normalization is done on a per-user basis: for each user $u \in U$, the minimal rating $\min = \min(r_s(u_i))$ and the maximal rating $\max = \max(r_s(u_i))$ are extracted. Then, the min/max scaling function is applied to $\hat{R}$ as follows:

$$\mathrm{minmax}(x) = \frac{x - \min}{\max - \min}, \quad \forall x \in \hat{r}(u_i). \tag{2}$$

The normalization step can be performed before, during, or after the generation of the rating matrix. Moreover, neural-based recommenders are easily adjusted for this purpose by using a sigmoid as the activation function of the output layer, which makes the min/max scaler unnecessary. In the rest of this paper, the notations reported in Table 1 are used.

Fig. 1. The sentiment enhanced deep collaborative filtering architecture "SE-DCF". It contains two major components: user-item space and item space.
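The inner-product prediction of Eq. (1) and the per-user scaling of Eq. (2) can be illustrated with a small sketch. The function names and the toy latent factors are hypothetical, and mapping a user whose predictions are all equal to 0.0 is an assumption added here to avoid division by zero; the text does not specify that case.

```python
def predict_ratings(p, q):
    """Eq. (1): predicted rating of every item for every user as an inner product."""
    return [[sum(a * b for a, b in zip(p_u, q_i)) for q_i in q] for p_u in p]


def minmax_per_user(r_hat):
    """Eq. (2): scale each user's predicted ratings to [0, 1] independently."""
    scaled = []
    for row in r_hat:
        low, high = min(row), max(row)
        span = high - low
        # Assumption: a constant row is mapped to 0.0 (case not covered in the text).
        scaled.append([(x - low) / span if span else 0.0 for x in row])
    return scaled


p = [[1.0, 2.0]]                          # one user, K = 2 latent factors
q = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0]]  # three items
r_hat = predict_ratings(p, q)             # [[11.0, 2.0, 1.0]]
r_scaled = minmax_per_user(r_hat)
```

After scaling, the highest predicted item of each user maps to 1 and the lowest to 0, which makes scores comparable across users with different rating ranges.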
2.2 Neural Attention Network Component
Attention mechanisms have proved effective in various machine learning tasks such as image/video captioning [9–12] and machine translation [13]. They allow different parts to contribute differently when compressing them into a single representation. The attention network-CF approach considered in this work is inspired by Chen's paper [14]. Previous works put more emphasis on learning only the complex relationship between the target users (or items) and their neighbors by an attention network. Here, we aim at exploiting the encoding ability of the interactive attention between the users and the items to learn deeply the most relevant weights that represent the users' mutual influence on the item. The underlying idea is that some correlation between users and items with particular characteristics can reveal the possibility that an item will be interesting to similar users (see Fig. 2).

Table 1. Notation

| Symbols | Definitions and descriptions |
|---|---|
| $r_{ui}$ | The rating value of item $i$ by user $u$ |
| $s_{ui}$ | The review details of item $i$ by user $u$ |
| $P \in \mathbb{R}^{n \times K}$ | The latent factors for user $u$ |
| $Q \in \mathbb{R}^{m \times K}$ | The latent factors for item $i$ |
| $g_1()$, $g_2()$ | The LSTM models applied to users and items respectively |
| $e_u$ | The user embedding layer |
| $e_i$ | The item embedding layer |
| $\alpha_u^{*}$ | Attention network of user $u$ |
| $\alpha_i^{*}$ | Attention network of item $i$ |
| $\alpha_u$ | The final attention weights of user $u$ |
| $\alpha_i$ | The final attention weights of item $i$ |
| $\wp$ | Each possible combination of the prediction set |
| $C$ | The number of observed ratings |
| $\oplus$ | The concatenation operator |
| $r_{ij}$ | A scalar referring to the rating of an item $i$ as specified by user $u$ |
| $W$, $b$ | The weight and bias in the interactive attention neural network |

First, the list of users $U$ and the list of items $I$ are fed to two different embedding layers, $e_u$ and $e_i$ respectively. This allows capturing some useful latent properties of users $p_u$ and items $q_i$. Each of these embedding layers is chained with a Long Short Term Memory (LSTM) layer. The LSTM is a variety of recurrent neural network that can learn long sequences with long time lags. The advantage of this architecture is that LSTM units are recurrent modules, which enables long-range learning. Each LSTM state has two inputs, the current feature vector and the previous state's output vector $h_{t-1}$, and one output vector $h_t$. The LSTM based representation learning can be denoted as:

$$h_u^{t} = g_1(p) \tag{3}$$

$$h_i^{t} = g_2(q) \tag{4}$$
The learned representations are denoted $H_p$ and $H_q$, respectively. Their dimensions are $d \times n$ and $d \times m$, where $d$ is the dimension of the LSTM output vectors. The attention mechanism is used to project the user and item embedding inputs into a common representation space. The proposed neural attention
A Sentiment Enhanced Deep CF Recommender
framework can model the high-order nonlinear relation between users and items as well as their mutual influence. To this end, an interactive attention on both the users and the items is applied, and the final rating prediction is based on all the interactively attended user and item features. The mechanism works as follows: joint user and item interactive attention maps are built and combined recursively to ultimately predict a distribution over the items. First, a matrix $L \in \mathbb{R}^{n \times m}$ is computed as

$$L = \tanh(H_p^{\top} W_{pq} H_q),$$

where $W_{pq}$ is a $d \times d$ matrix of learnable parameters. The feature interaction attention maps are given by:

$$\alpha_p^* = \tanh(W_p H_p + (W_q H_q) L^{\top}), \qquad \alpha_q^* = \tanh(W_q H_q + (W_p H_p) L) \qquad (5)$$
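The shapes involved in the interactive attention maps can be checked with a small sketch. It is only an illustration under stated assumptions: the transposes are placed so that the dimensions agree with $L \in \mathbb{R}^{n \times m}$, the projection matrices `W_p` and `W_q` are assumed to have shape k × d, and all helper names and toy values are hypothetical rather than taken from the authors' implementation.

```python
import math

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def tanh_all(a):
    """Apply tanh to every entry of a matrix."""
    return [[math.tanh(x) for x in row] for row in a]

def interactive_attention(H_p, H_q, W_pq, W_p, W_q):
    """H_p: d x n, H_q: d x m, W_pq: d x d, W_p and W_q: k x d (assumed shapes)."""
    L = tanh_all(matmul(matmul(transpose(H_p), W_pq), H_q))    # n x m
    WpHp = matmul(W_p, H_p)                                     # k x n
    WqHq = matmul(W_q, H_q)                                     # k x m
    alpha_p = tanh_all(add(WpHp, matmul(WqHq, transpose(L))))   # k x n
    alpha_q = tanh_all(add(WqHq, matmul(WpHp, L)))              # k x m
    return L, alpha_p, alpha_q

# Toy input with d = 2, n = 3 users, m = 2 items and identity weight matrices.
H_p = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
H_q = [[0.5, -0.5], [0.25, 0.75]]
identity = [[1.0, 0.0], [0.0, 1.0]]
L, alpha_p, alpha_q = interactive_attention(H_p, H_q, identity, identity, identity)
```

In this toy run `L` has one row per user and one column per item, while `alpha_p` and `alpha_q` keep one column per user and per item respectively.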
Therefore, the interactive attention models the mutual interactions between the user latent factors and the item latent factors by applying the hyperbolic tangent function (tanh). Afterward, an attention distribution is calculated as a probability distribution over the embedding space. The attention weights are generated through the softmax function:

$$\alpha_u = \mathrm{Softmax}(f(\alpha_p^*)) \qquad (6)$$

$$\alpha_i = \mathrm{Softmax}(f(\alpha_q^*)) \qquad (7)$$
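The softmax step that turns attention scores into a probability distribution can be sketched as follows. The input scores stand in for the output of the function f and are made-up values; subtracting the maximum before exponentiating is a standard numerical-stability choice, not something stated in the text.

```python
import math

def softmax(scores):
    """Map a list of real scores to positive weights that sum to one."""
    shift = max(scores)                      # avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for three users.
weights = softmax([0.2, -0.1, 0.7])
```

The largest score receives the largest weight, and the weights can be used directly in a weighted sum over the attended features.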
The function $f$ is a multi-layer perceptron (MLP). The attention vectors of the high-order interaction features, denoted $\beta_p$ and $\beta_q$, can then be generated through a weighted sum using the derived attention weights or through a sigmoid function. These latent spaces of users and items are then concatenated as follows:

$$f_1 = [\beta_u \oplus \beta_i] \qquad (8)$$
To form the predicted score $\hat{R}_{ui}$, the concatenated spaces are fed into a dense layer with a sigmoid activation function:

$$\hat{R}_{ui} = f(f_1) \qquad (9)$$

Fig. 2. Neural attention network-based collaborative filtering component.
During the training phase, a grid-search method is used to select the model parameters, and the cost function is set to the Mean Absolute Error (MAE), defined as:

$$L(R_{ui}, \hat{R}_{ui}) = \frac{1}{|C|} \sum_{(u,i) \in C} |R_{ui} - \hat{R}_{ui}| \qquad (10)$$
The model predicts interest probabilities for each possible combination of the prediction set ℘. The attention-CF training procedure is summarized in Algorithm 1.
Algorithm 1: Neural attention network-based collaborative filtering.
Input : U: list of user ids, size n
        I: list of item ids, size m
        R: list of ground-truth ratings per couple (u, i), size s ≤ n × m
        Θf: the model parameters
Output: R̂: predicted utility matrix, size (n × m)
begin
    // Preparing data to be passed to the network
    foreach u ∈ U do
        Ru = minmax(Ru);
    ℘ = U × I;                          /* Cartesian product */
    D = ∅;                              /* training set */
    foreach (u, i) ∈ ℘ do
        D ← D ∪ {(u, i, Rui)} if Rui ≠ null else D ∪ {(u, i, 0)};
    // Attention-Class()
    InteractiveAttention = BuildModel(usersize=|U|, itemsize=|I|, Θ);
    InteractiveAttention.trainModel(D);
    R̂ui = InteractiveAttention.predict(℘);
    return R̂
end
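The data-preparation part of Algorithm 1 can be sketched in plain Python. This is a minimal illustration, assuming that normalization is applied per user, that ratings are stored in a dictionary keyed by (user, item), and that unobserved pairs receive the value 0; the function names are hypothetical and the model-building and training calls are left out.

```python
from itertools import product

def minmax(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Arbitrary convention for a constant list (not specified in the text).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def prepare_training_set(users, items, ratings):
    """ratings: dict mapping (user, item) to a raw rating.

    Returns the prediction set (all user-item pairs) and the training set
    of (user, item, normalized rating) triples, with 0 for unobserved pairs.
    """
    normalized = {}
    for u in users:
        keys = [(u, i) for i in items if (u, i) in ratings]
        if keys:
            scaled = minmax([ratings[k] for k in keys])
            normalized.update(zip(keys, scaled))
    prediction_set = list(product(users, items))        # Cartesian product U x I
    training_set = [(u, i, normalized.get((u, i), 0.0)) for u, i in prediction_set]
    return prediction_set, training_set

# Toy example with two users and three items.
toy_ratings = {("u1", "a"): 1, ("u1", "b"): 5, ("u2", "c"): 3}
pairs, train = prepare_training_set(["u1", "u2"], ["a", "b", "c"], toy_ratings)
```

Note that with this convention a user's lowest observed rating and an unobserved pair both map to 0, which follows the algorithm as written.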
2.3 Convolution Neural Network Content-Based Recommender
The main advantage of the CNN content-based filtering component is that it makes a prediction about the user's interests according to his preferences. The CNN recommender extracts relevant features from the review text and, when these are combined with other side information, recommends the items that the user would be most interested in. As shown in Fig. 3, the CNN-based recommendation algorithm recommends items for the target user in the following steps. First, the item attributes are preprocessed from the input data. Then, the embedding layer, which is the first layer in the model, generates a feature vector for each attribute. More details about the embedding vectors are provided in Sect. 3. The embedding output is fed to the convolutional neural network (ConvNet) layer, which extracts features from local input patches. Ten filters of size 3 are used in the experiments to extract features from reviews. Each filter detects multiple features
in the text using the ReLU [15] activation function in order to represent them in the feature map. The same is done in the item id part with a filter size fixed to 1. Further, regularization (an L2 activity regularizer) is used to avoid overfitting. Then, a standard max-pooling operation is performed on the latent space, followed by a flatten layer. The highest value is selected in order to capture the most important feature and to reduce the computation in the subsequent layers. The flattening step is needed to merge the two CNN models and to feed the merged output to the fully connected layer. Thus, the outputs of the flatten layers of both models are concatenated. After that, the item features Z (the result of the concatenation) are fed into the fully connected layer with a sigmoid activation function to generate the rating prediction of the CNN-based content model as follows:

$$\hat{R}_{ui}^{CNN} = f_2(Z) \qquad (11)$$
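The convolution, ReLU and max-pooling steps can be illustrated on a toy one-dimensional input. This is a deliberately simplified sketch: each token is represented by a single number instead of an embedding vector, pooling is taken over the whole feature map, and the two kernels are made-up values rather than learned filters.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide a kernel over a list of numbers (no padding) and apply ReLU."""
    width = len(kernel)
    feature_map = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activation = sum(w * x for w, x in zip(kernel, window)) + bias
        feature_map.append(max(0.0, activation))     # ReLU
    return feature_map

def global_max_pool(feature_map):
    """Keep only the strongest response of a filter."""
    return max(feature_map)

# Toy input and two hypothetical filters of size 3.
signal = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0]
kernels = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
feature_maps = [conv1d_relu(signal, k) for k in kernels]
pooled = [global_max_pool(fm) for fm in feature_maps]
```

Each filter contributes one pooled value, so the pooled list plays the role of the flattened feature vector that is later concatenated and passed to the fully connected layer.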
After training the two components needed for the task at hand, an aggregation function is applied to merge their outputs into a single utility matrix. The framework SE-DCF uses the simple unweighted average aggregation function. The final predicted utility matrix r is written as:
$$r_{ui} = f_{agg}(\hat{R}_{ui}, \hat{R}_{ui}^{CNN}) \qquad (12)$$
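With the unweighted average as aggregation function, the merge reduces to an element-wise mean of the two predicted utility matrices. The sketch below assumes both predictions are stored as dictionaries keyed by (user, item); the function name and the toy scores are hypothetical.

```python
def aggregate_unweighted(cf_scores, cnn_scores):
    """Average two predicted utility matrices given as dicts keyed by (user, item)."""
    if cf_scores.keys() != cnn_scores.keys():
        raise ValueError("both components must score the same user-item pairs")
    return {pair: (cf_scores[pair] + cnn_scores[pair]) / 2.0 for pair in cf_scores}

# Toy predictions from the two components.
cf_pred = {("u1", "a"): 1.0, ("u1", "b"): 0.25}
cnn_pred = {("u1", "a"): 0.5, ("u1", "b"): 0.75}
final_pred = aggregate_unweighted(cf_pred, cnn_pred)
```

A weighted variant would only change the constant factors, which is why the unweighted mean is a convenient baseline for combining the two components.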
Fig. 3. CNN-based content filtering. Multiple filters are applied to create a stack of convolution results. The data structure is flattened and concatenated before the fully connected output layer.
3 Experimental Results and Discussion

3.1 Experimental Setting
The TensorFlow [16] distributed open-source framework is used to implement the proposed algorithm. It can effectively handle the problems of large data volume, large model, and slow speed of recommender systems. Three representative datasets from the Amazon reviews collection [17] are used in the experiments. These data have been reduced to extract the k-core, such that each of the remaining users and items has k reviews. Each dataset is divided into 80% training data and 20% testing data in a stratified manner, such that the proportion of appearance of each user is the same in the training and test data. Table 2 shows the dataset statistics.

– Amazon Fine Food Reviews: this dataset spans a period of more than 10 years. It includes product and user information, ratings, and a plain-text review.
– Amazon Toys and Games: this dataset contains product reviews and metadata from Amazon Toys and Games.
– Amazon Clothing, Shoes and Jewelry: this dataset includes reviews (ratings, text, helpfulness votes) and product metadata (descriptions, category information, price, brand).
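The stratified 80/20 split described above can be sketched as a per-user split. This is a minimal illustration, assuming reviews are (user, item, rating) tuples and that the test share of each user is obtained by rounding; the function name, the seed and the toy data are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split_by_user(reviews, test_fraction=0.2, seed=0):
    """Split (user, item, rating) tuples so each user keeps the same test share."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for review in reviews:
        by_user[review[0]].append(review)
    train, test = [], []
    for user in sorted(by_user):
        rows = by_user[user][:]
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)   # per-user test size
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test

# Toy data: two users with five reviews each.
toy_reviews = [(u, "item%d" % j, 4.0) for u in ("u1", "u2") for j in range(5)]
train_reviews, test_reviews = stratified_split_by_user(toy_reviews)
```

With five reviews per user and a 20% test fraction, each user contributes exactly one review to the test set and four to the training set.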
Table 2. Datasets statistics

Datasets                               Reviews#   Users#    Products#   Sparsity
Amazon fine food                       568 454    256 059   74 258      99.98%
Amazon toys and games                  167 597    19 412    11 924      99.92%
Amazon clothing shoes and jewellery    278 677    39 387    23 033      99.96%
To evaluate the predictive accuracy of the recommendation framework, two popular metrics are adopted. The Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. The Root Mean Squared Error (RMSE) is a quadratic scoring rule that also measures the average magnitude of the error. It is the square root of the average of squared differences between prediction and actual observation:

$$RMSE = \sqrt{\frac{1}{|C|} \sum_{(u,i) \in C} (r_{ui} - \hat{r}_{ui})^2}$$

To ensure the integrity of comparisons, baselines are executed in the same evaluation environment as SE-DCF. The proposed recommender system is compared with the following recommendation methods:

– Probabilistic Matrix Factorization (PMF) [18]: this method uses only the user-item rating matrix and models the latent factors of users and items by Gaussian distributions.
– Neural Collaborative Filtering (NCF) [19]: this recommender system applies a multi-layer perceptron to learn the user-item interaction function.
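The two evaluation metrics can be computed with a few lines of Python. The sketch assumes that the observed and predicted ratings are given as two aligned lists; the toy values are made up for illustration.

```python
import math

def mae(actual, predicted):
    """Mean absolute error over aligned lists of ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error over aligned lists of ratings."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

# Toy observed and predicted ratings on a [0, 1] scale.
observed = [1.0, 0.0, 0.5, 1.0]
estimated = [0.5, 0.0, 0.5, 0.0]
toy_mae = mae(observed, estimated)
toy_rmse = rmse(observed, estimated)
```

Because the errors are squared before averaging, RMSE penalizes the single large error in this toy example more heavily than MAE does, so RMSE is never smaller than MAE.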
3.2 Results and Discussion
In the recommendation field, positive information has a positive effect while negative information has a negative effect. Hence, sentiment analysis is applied to the users' reviews to understand their polarity. First, these reviews are studied using VADER [20] (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool. It relies on a dictionary that maps lexical features to emotion intensities, called sentiment scores. Table 3 reports an example of sentiment scores computed for the Amazon Clothing, Shoes and Jewellery dataset.

Table 3. The top positive and negative words in users' reviews on products.

Top 10 positive               Top 10 negative
Word         Coefficient      Word              Coefficient
Great        13.722053        Worst             −11.710415
Delicious    12.152401        Disappointing     −10.117334
Best         12.013713        Terrible          −9.263025
Perfect      10.606939        Disappointed      −8.492150
Excellent    9.860665         Disappointment    −8.312410
Loves        9.583115         Awful             −8.149215
Wonderful    7.965187         Horrible          −7.793645
Amazing      7.797122         Unfortunately     −7.437975
Awesome      7.629256         Tasteless         −6.838199
Good         7.339747         Threw             −6.733330
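The idea of a lexicon that maps words to sentiment intensities can be illustrated with a toy scorer. This sketch is not VADER and does not use its API or its lexicon: the valence values are invented for illustration, and the rule-based parts of a real tool (negation, intensifiers, normalization) are left out.

```python
def lexicon_score(text, lexicon):
    """Sum the valence of every known word in the text (unknown words count as 0)."""
    tokens = [token.strip(".,!?;:").lower() for token in text.split()]
    return sum(lexicon.get(token, 0.0) for token in tokens)

# Hypothetical valence values, chosen only for this example.
toy_lexicon = {"great": 3.0, "delicious": 2.5, "terrible": -2.5, "worst": -3.0}

positive_review = lexicon_score("Great taste, delicious!", toy_lexicon)
negative_review = lexicon_score("The worst, terrible.", toy_lexicon)
```

The sign of the summed score gives the polarity of the review, and its magnitude gives a rough intensity.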
In general, an item with mostly positive information will be recommended to the user. For this reason, we use the popular tool Word2Vec [21] for the embedding layers in the CNN content-based filtering. It explores the possibility of automatically generating domain-adapted polarity lexicons employing continuous word representations. In the input layer, each word in the text, which is one token at the word level, is embedded into a vector of length 300. Any text that contains fewer than the maximum number of tokens is padded to the maximum text length. The maximum number of words in one review is 1958. We used a TensorFlow embedding layer with a vocabulary size of 116373. For the item id part, the embedding size equals 8 (we tested values in the range [8, 16]). The outputs are fed into separate convolutional layers. There is a large number of hyperparameters to search and analyze because the hyperparameter analysis step is performed separately on each component. Here, we report only the evaluation of the assembled architecture due to the page limit. The hyperparameter optimization of the SE-DCF framework leads to batch size = 50, epochs = 100, and optimizer = Adam. Figures 4 and 5 depict the performance errors (RMSE and MAE) over epochs. One can observe the good convergence of the proposed model.
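The padding of tokenized reviews to a common length, mentioned for the input layer, can be sketched as follows. The sketch assumes that tokens are already mapped to integer ids, that id 0 is reserved for padding, and that padding is appended at the end; none of these choices is specified in the text.

```python
def pad_to_max_length(sequences, pad_id=0):
    """Right-pad every token-id sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Three toy reviews, already converted to token ids.
padded = pad_to_max_length([[5, 3], [7], [1, 2, 9]])
```

After padding, all reviews have the same length and can be stacked into a single input batch for the embedding layer.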
Fig. 4. Training RMSE and test RMSE over training epochs for the SE-DCF model.
Fig. 5. Training MAE and test MAE over training epochs for the SE-DCF model.
Table 4 shows the overall rating prediction error (RMSE and MAE) of the recommendation methods on the Amazon review datasets. Comparing the models by the lowest RMSE and MAE achieved, the proposed model outperforms the NCF and PMF models on all three datasets. On Amazon fine food, for example, it achieves an RMSE of 0.2771 and an MAE of 0.1562. The PMF model also works well, achieving its lowest RMSE of 0.4998 and MAE of 0.3763 on Amazon clothing, shoes and jewellery. One can conclude that the gap in performance between NCF and PMF is due to the fact that PMF puts more emphasis on the users' influence on the items to be recommended. The performance gain of SE-DCF mainly comes from the powerful encoding ability of the interactive attention. It performs better at modeling the mutual influence between users and items; in particular, it identifies the most relevant weights that represent the users' mutual influence on the item. To put it differently, the more positive the users' behaviour towards an item, the more likely it is to be recommended by users who have similar preferences. Moreover, applying an LSTM encoder allows the model to learn patterns based on historical data. Thus, the subsequent feature representation becomes more and more enriched by information from later interactions. On the other hand, the CNN-reinforced sentiment recommender boosts the recommendation accuracy. Incorporating user reviews provides the recommender system with the tastes and preferences of each user, so that it may suggest to him items recommended by people with some specific common interest. Although the deployment of the proposed framework to production in the online phase is very time-efficient because the CNN-based content filtering creates a personalized profile for each user, the CNN recommender neglects the much-needed novelty and serendipity aspects of recommendation.
Table 4. Performance comparison of different recommender systems.

Datasets                                Metrics   NCF      PMF      SE-DCF
Amazon fine food                        MAE       1.1311   0.6810   0.1562
                                        RMSE      1.3311   0.8031   0.2771
Amazon toys and games                   MAE       0.8360   0.3837   0.1625
                                        RMSE      1.0072   0.5033   0.2819
Amazon clothing, shoes and jewellery    MAE       0.8972   0.3763   0.1528
                                        RMSE      1.0965   0.4998   0.2772

4 Conclusion
Sentiment-enhanced recommender systems have gained increasing momentum in the past years, with a remarkable enhancement of their accuracy, accompanied by a boost in recommendation. In this work, a deep collaborative filtering recommender aimed at boosting the accuracy of sentiment-enhanced recommender systems is proposed and evaluated. The SE-DCF framework explores the complex nonlinear interaction between users and items. It effectively models the importance of the individual contributions and the mutual influence in the user-item interaction while incorporating sentiment aspects for recommendation. Empirical evaluations show that it provides competitive performance. To better understand the potential correlation between users' sentiments and users' interactions, more thorough experiments are required and will be conducted on more datasets in the future. Future work also aims to integrate context-awareness, which generates items relevant to the users according to a specific context.
References

1. Musto, C., Greco, C., Suglia, A., Semeraro, G.: Ask me any rating: a content-based recommender system based on recurrent neural networks. In: IIR (2016)
2. Okura, S., Tagami, Y., Ono, S., Tajima, A.: Embedding-based news recommendation for millions of users. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1933–1942 (2017)
3. Tang, J., Wang, K.: Personalized top-n sequential recommendation via convolutional sequence embedding. In: Proceedings of the 11th ACM International Conference on Web Search and Data Mining, pp. 565–573 (2018)
4. Zhang, Y., Lai, G., Zhang, M., Zhang, Y., Liu, Y., Ma, S.: Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 83–92 (2014)
5. Meng, X., Wang, S., Liu, H., Zhang, Y.: Exploiting emotion on reviews for recommender systems. In: AAAI, pp. 3788–3795 (2018)
6. Wang, Y., Wang, M., Xu, W.: A sentiment-enhanced hybrid recommender system for movie recommendation: a big data analytics framework. Wirel. Commun. Mob. Comput. 18 (2018)
7. Osman, N., Noah, S., Darwich, M.: Contextual sentiment based recommender system to provide recommendation in the electronic products domain. Int. J. Mach. Learn. Comput. 9(4), 425–431 (2019)
8. Da'u, A., Salim, N., Rabiu, I., Osman, A.: Recommendation system exploiting aspect-based opinion mining with deep learning method. Inf. Sci. 512, 1279–1292 (2020)
9. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
10. Pastrana-Vidal, R., Gicquel, J., Blin, J., Cherifi, H.: Predicting subjective video quality from separated spatial and temporal assessment. In: Human Vision and Electronic Imaging XI, vol. 6057, p. 60570S. SPIE (2006)
11. Demirkesen, C., Cherifi, H.: A comparison of multiclass SVM methods for real world natural scenes. In: International Conference on Advanced Concepts for Intelligent Vision Systems, pp. 752–763. Springer, Heidelberg (2008)
12. Lasfar, A., Mouline, S., Aboutajdine, D., Cherifi, H.: Content-based retrieval in fractal coded image databases. In: Proceedings of the 15th International Conference on Pattern Recognition, vol. 1, pp. 1031–1034. IEEE (2000)
13. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
14. Chen, J., Zhang, H., He, X., Nie, L., Liu, W., Chua, T.-S.: Attentive collaborative filtering: multimedia recommendation with item- and component-level attention. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–344 (2017)
15. Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 315–323 (2011)
16. Tensorflow. Accessed 01 Sept 2020
17. McAuley, J., Targett, C., Shi, Q., Van Den Hengel, A.: Image-based recommendations on styles and substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52 (2015)
18. Mnih, A., Salakhutdinov, R.R.: Probabilistic matrix factorization. In: Advances in Neural Information Processing Systems, pp. 1257–1264 (2008)
19. He, X., Liao, L., Zhang, H., Nie, L., Hu, X., Chua, T.-S.: Neural collaborative filtering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 173–182 (2017)
20. Gilbert, C., Hutto, E.: Vader: a parsimonious rule-based model for sentiment analysis of social media text. In: Eighth International Conference on Weblogs and Social Media (ICWSM-14), vol. 81, p. 82 (2014)
21. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Experimental Evaluation of Train and Test Split Strategies in Link Prediction

Gerrit Jan de Bruin1(B), Cor J. Veenman1,3, H. Jaap van den Herik2, and Frank W. Takes1

1 Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Leiden, The Netherlands
[email protected]
2 Leiden Centre of Data Science (LCDS), Leiden University, Leiden, The Netherlands
3 Data Science Department, TNO, The Hague, The Netherlands
Abstract. In link prediction, the goal is to predict which links will appear in the future of an evolving network. To estimate the performance of these models in a supervised machine learning model, disjoint and independent train and test sets are needed. However, objects in a real-world network are inherently related to each other. Therefore, it is far from trivial to separate candidate links into these disjoint sets. Here we characterize and empirically investigate the two dominant approaches from the literature for creating separate train and test sets in link prediction, referred to as random and temporal splits. Comparing the performance of these two approaches on several large temporal network datasets, we find evidence that random splits may result in too optimistic results, whereas a temporal split may give a more fair and realistic indication of performance. Results appear robust to the selection of temporal intervals. These findings will be of interest to researchers that employ link prediction or other machine learning tasks in networks.
Keywords: Link prediction · Performance estimation · Machine learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 79–91, 2021. https://doi.org/10.1007/978-3-030-65351-4_7

1 Introduction

Machine learning has emerged as a powerful instrument to analyze all kinds of datasets. Here, we focus on supervised learning, of which the use on non-relational (i.e., tabular) data is rather straightforward. However, supervised machine learning on network data is more challenging due to problems of obtaining an independent train and test set [1]. A common type of machine learning in networks is link prediction, where the goal is to predict whether a link will form in some future state of an evolving network. In this work, we focus specifically on the generalization of link prediction methods. Link prediction is defined as the problem where, given the current state of a network, new edges between pairs of nodes are inferred for the near future [2]. This method has many applications
in different kinds of real-world scenarios, such as spam mail detection, friend recommendations in online social networks and identifying related references in a publication. In recent years, there has been an increasing interest in link prediction and hence several review papers on this topic exist [3–5]. A crucial first step in machine learning in networks is feature engineering, where network topology data is converted into features with potentially useful information for a predictive model. The main established approaches for feature engineering in link prediction are based on similarity, probabilistic and maximum likelihood, and dimensionality reduction [3]. We will focus on the similarity-based approach. In this approach, pairs of nodes (candidates for links formed in the future) are assigned scores according to their similarity. We will exclusively use topological properties to assess similarity, such that we can apply the feature engineering also to networks where no additional information is available about the nodes. The similarity-based approach provides at least three benefits. First, similarity-based features provide more accurate results compared to embedding techniques [6]. Second, the similarity-based approach provides easily explainable features compared to other approaches. Third, most features can be obtained at relatively low computational costs for the larger networks used in this study. This brings us to the main problem addressed in this paper. For proper validation in any machine learning task, instances belonging to the training set, on which the model is trained, should be disjoint from and independent of instances belonging to the validation and test set. However, because many dependencies exist between nodes in a network, this is inherently difficult to achieve. This possibly results in too optimistic performance measurements or, equivalently, in overestimating the so-called generalization performance of the model [7].
According to Ghasemian et al., it is as yet unclear how common machine learning steps, such as cross-validation and model selection methods, extend from non-relational to network data [8]. Assessment of the performance in supervised machine learning is important for at least two reasons. The first is model selection. Different models can be constructed for a certain task, ranging from completely different classifiers to identical models with different (hyper)parameters. The performance of such models on train data is not informative, as one wants the model with the best generalization performance on an independent test set. An independent validation set allows the detection of overfitting. The second reason for assessing model performance is to estimate the prediction error on new, unseen data. This should be assessed using the test data, which is not used in any part of training the model, nor in choosing the right hyperparameters or selecting a model [7]. In the current research we investigate to what extent differences in collecting the train and test set influence the generalizability score of the classifier. The contributions of this work are as follows. First, we investigate the two most common ways in which pairs of nodes can be split into disjoint and independent train and test sets in link prediction. Second, we present an in-depth comparison of these two approaches on a number of evolving real-world networks. We thereby contribute to a better understanding of performance estimation in link prediction.
The remainder of this paper is organized as follows. Related work is discussed in Sect. 2. We continue with definitions and our approach towards reporting generalization performance in Sect. 3. Section 4 features information about the datasets used. Then, Sect. 5 is concerned with the experimental setup, results and discussion. Conclusions and future work are provided in Sect. 6.
2 Related Work
There is a relatively small body of literature that is directly concerned with splitting a network dataset into a train and test set to evaluate the performance for machine learning purposes. Hence, we start our exploration of the literature with work on performance estimation in general, before focusing specifically on prediction tasks in networks. One of the causes of too optimistic performance estimation is what is often described as "test set re-use" [9]. A well-known example is the p-hacking problem [10]. In short, p-hacking is the application of many different models to the same data in search of a statistically significant result with a low enough p-value. This misuse can result in an increased probability that applied research findings are false. More specific to data-driven research, too optimistic performance estimation is suspected in Kaggle competitions. In these online competitions, participants all get the same dataset and compete for the best classifier performance on some predictive task, without having access to the test data. However, Kaggle allows users to repeatedly probe test data to obtain the performance of a submitted model. This is argued to lead to too optimistic results [11], which was experimentally only observed to a limited extent [9]. Returning more specifically to the topic of machine learning on networks, Ghasemian et al. [8] investigated under- and overfitting in networks. They did so in an attempt to estimate the performance of various community detection algorithms. The performance on the link prediction and so-called link description tasks is used as a diagnostic to evaluate the general tendency of such algorithms to under- and overfit. The authors define the link prediction task a little differently, since they do not necessarily have temporal information about the edges. Hence they remove a fraction of edges from a network and employ a machine learner to find these removed links from all pairs of nodes that are not connected anymore.
The link description problem is different. Again a network is sampled, but now the task for the machine learner is to find the remaining edges of the sampled network from all pairs of nodes. The authors explain that no algorithm can excel at both the link prediction and link description task and that these two tasks force an algorithmic tradeoff, like the bias-variance tradeoff in non-relational data [7]. In this work, we likewise want to bring the notion of overfitting from non-relational data to relational data. While [8] focusses on overfitting caused by the bias-variance tradeoff, we investigate too optimistic estimation of generalization performance caused by test set reuse in networks.
3 Approach
This section will start with a formal description of the link prediction problem. In Sect. 3.2 we explain how we split the data into a train and test set for the link prediction classifier. Section 3.3 continues with the features used. Section 3.4 provides information about the used classifier. Finally, in Sect. 3.5 we explain the performance metrics used.

3.1 Link Prediction Problem
We base our procedure of supervised link prediction in evolving networks upon definitions used by Liben-Nowell et al. [2], Lichtenwalter et al. [12], and Kumar et al. [3]. To ensure uniformity, our terminology sometimes differs slightly from that of the aforementioned works. The temporal, potentially undirected, network $G = (V, E)$ consists of a set of nodes $V$ and edges $(u, v, t) \in E$ connecting nodes $u, v \in V$ at time $t \geq t_0$. Time $t_0$ indicates the time of the first edge occurring in $G$. Parallel edges with different timestamps can exist. Since the network is temporal, we can construct snapshots of network $G$ for a given time interval. We denote such a snapshot by $G_{[t_a,t_b]} = (V_{[t_a,t_b]}, E_{[t_a,t_b]})$, with $E_{[t_a,t_b]}$ the set consisting only of edges occurring between $t_a$ and $t_b$ (with $t_a < t_b$) and $V_{[t_a,t_b]}$ the nodes taking part in these edges. We make two such snapshots, $G_{[t_a,t_b]}$ and $G_{[t_b,t_c]}$, from two time intervals $[t_a, t_b]$ and $[t_b, t_c]$ with $t_a < t_b < t_c$. This procedure is shown in Fig. 1a. The task for the supervised binary link prediction classifier (see Sect. 3.4) is to predict from $G_{[t_a,t_b]}$ whether a pair of nodes will connect in $G_{[t_b,t_c]}$. Hence, the input for the classifier is all pairs of nodes $X_{[t_a,t_b]} = (V_{[t_a,t_b]} \times V_{[t_a,t_b]}) \setminus E_{[t_a,t_b]}$, see also Fig. 1b. The network $G_{[t_a,t_b]}$ needs to be "mature" enough that the underlying static topology is well captured [12], and hence we call $[t_a, t_b]$ the maturing interval. Subsequently, we define the probing interval $[t_b, t_c]$. For every pair of nodes $x_i \in X_{[t_a,t_b]}$, we probe whether the pair is present in the probing interval (indicated with $y_i = 1$) or not (indicated with $y_i = 0$). The entire procedure is summarized in Fig. 1.

3.2 Splitting Strategies
Now that we described the general procedure of link prediction, we need a strategy to separate the pairs of nodes into a train and test set for the classifier. The classifier is learned on the train set and the performance is determined on the test set. We will now explain two dominant ways encountered in literature to split the dataset. While the procedure of applying a temporal split is more complicated than the random split due to the various parameters, it prevents to a greater extent the reuse of node and edge set information from the test set in training.
Fig. 1. (a) The evolution of a temporal network divided into different snapshots. (b) Instances considered in the classifier. Positive instances (yi = 1) are shown in green solid lines, while negatives (yi = 0) are shown in red dashed lines.
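The maturing and probing procedure of Sect. 3.1, as depicted in Fig. 1, can be sketched in a few lines of Python. The sketch assumes an undirected network given as a list of (u, v, t) triples and, to avoid counting an edge at exactly the boundary time twice, treats the probing interval as open on the left; the function names and the toy edge list are hypothetical.

```python
from itertools import combinations

def snapshot_pairs(edges, t_start, t_end, include_start=True):
    """Unordered node pairs with at least one edge in the given time interval."""
    pairs = set()
    for u, v, t in edges:
        after_start = t >= t_start if include_start else t > t_start
        if after_start and t <= t_end and u != v:
            pairs.add(frozenset((u, v)))
    return pairs

def build_instances(edges, t_a, t_b, t_c):
    """Label every pair unconnected in [t_a, t_b] by whether it links in (t_b, t_c]."""
    mature = snapshot_pairs(edges, t_a, t_b)
    probe = snapshot_pairs(edges, t_b, t_c, include_start=False)
    nodes = sorted({node for pair in mature for node in pair})
    instances = []
    for pair in combinations(nodes, 2):
        if frozenset(pair) not in mature:
            instances.append((pair, int(frozenset(pair) in probe)))
    return instances

# Toy temporal network: (node, node, timestamp).
toy_edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3),
             ("a", "c", 5), ("b", "d", 6), ("a", "d", 7)]
toy_instances = build_instances(toy_edges, t_a=1, t_b=3, t_c=6)
```

In this toy network the pairs (a, c) and (b, d) become positive instances, while (a, d) is negative because its edge only appears after the probing interval.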
Random Split. This procedure for example used in [12], consists of three steps, which are also shown in Fig. 2a. The first step is to obtain all pairs of nodes that are not connected during the maturing phase, X[t0 ,tb ] . Second, we determine for each of these pairs of nodes whether they connect (the value of yi ) in the probing phase E[tb ,tc ] , as shown in Eq. 1. 1 if xi ∈ E[tb ,tc ] for xi ∈ X[t0 ,tb ] (1) yi = 0 if xi ∈ E[tb ,tc ] Third, these pairs of nodes X[t0 ,tb ] are separated into two disjoint sets X[ttrain 0 ,tb ] train test train test and X[ttest such that X ∪ X = X and X ∩ X = ∅. We [t ,t ] 0 b [t0 ,tb ] [t0 ,tb ] [t0 ,tb ] [t0 ,tb ] 0 ,tb ] will refer to this split procedure as the random split, as the train and test set are taken at random from the instances in X[t0 ,tb ] . Temporal Split. A different procedure is for example used by Hasan et al. [13], which we will call temporal split. In this procedure, the train and test set are made by applying the probing phase on two different, consecutive snapshots called the training interval [t0 , tc ] and test interval [t0 , td ]. The four steps of this process are shown schematically in Fig. 2b. The training set is constructed in the first two steps as follows. First, we consider all node pairs that are not connected in the maturing phase of the train interval X[t0 ,tb ] . Second, for each of these node pairs we determine whether it will connect in the probing phase of the train interval, like Eq. 1. In steps three and four the test set is constructed in a similar way as the train set. In step three, we consider all pairs of nodes X[t0 ,tc ] . Finally, in step four we determine for each of these pairs of nodes in whether they connect in the probing phase of the test interval, as shown in Eq. 2.
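\[ y_i = \begin{cases} 1 & \text{if } x_i \in E_{[t_c,t_d]} \\ 0 & \text{if } x_i \notin E_{[t_c,t_d]} \end{cases} \quad \text{for } x_i \in X_{[t_0,t_c]} \text{ and with } t_c < t_d \tag{2} \]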
Fig. 2. Different strategies to obtain train and test set for classifier f , discussed in Sect. 3.2. a) Train and test set are obtained by randomly splitting instances from a single probing phase. b) Train and test set are obtained by two consecutive probing phases obtained from two different time intervals.
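The two split procedures can be made concrete with a small sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the network is undirected, edges are `(u, v, timestamp)` triples, the maturing phase is the closed interval and the probing phase is half-open, and all function names are hypothetical.

```python
import random

def unconnected_pairs(edges, t_start, t_end):
    """Node pairs without an edge in the maturing phase [t_start, t_end]."""
    mature = [(u, v) for u, v, t in edges if t_start <= t <= t_end]
    nodes = sorted({n for e in mature for n in e})
    linked = {frozenset(e) for e in mature}
    return [(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:]
            if frozenset((u, v)) not in linked]

def label(pairs, edges, t_start, t_end):
    """y_i = 1 if the pair connects in the probing phase (t_start, t_end], else 0."""
    probe = {frozenset((u, v)) for u, v, t in edges if t_start < t <= t_end}
    return [int(frozenset(p) in probe) for p in pairs]

def random_split(edges, t0, tb, tc, train_frac=0.75, seed=0):
    """One probing phase; instances are divided at random (Eq. 1)."""
    X = unconnected_pairs(edges, t0, tb)
    y = label(X, edges, tb, tc)
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    train = [(X[i], y[i]) for i in idx[:cut]]
    test = [(X[i], y[i]) for i in idx[cut:]]
    return train, test

def temporal_split(edges, t0, tb, tc, td):
    """Two consecutive probing phases: (tb, tc] for train, (tc, td] for test (Eq. 2)."""
    X_train = unconnected_pairs(edges, t0, tb)
    X_test = unconnected_pairs(edges, t0, tc)
    train = list(zip(X_train, label(X_train, edges, tb, tc)))
    test = list(zip(X_test, label(X_test, edges, tc, td)))
    return train, test

# Toy temporal network (hypothetical data).
edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("a", "c", 5), ("b", "d", 7)]
train_r, test_r = random_split(edges, t0=0, tb=3, tc=5)
train_t, test_t = temporal_split(edges, t0=0, tb=3, tc=5, td=7)
```

In this toy example the pair (a, d) occurs in both the temporal train and test set, each time labelled against a different probing phase, whereas the random split never places the same instance in both sets.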
3.3 Features
As input for a classifier one needs a feature representation for all pairs of nodes $x_i \in X$. As discussed in the introduction, in this work we use the well-established similarity-based approach, where the feature representation of each pair of nodes $x_i = (u, v)$ consists of a score $S_{\text{feature}}(u, v)$ for each feature. These scores are based solely on topological properties intrinsic to the network itself and not on any contextual information [12,14]. Hence, the features can be employed in any network, without requirements on the node information available. Nodes with similar scores and hence a high similarity are then more likely to connect. The score is either neighbor-based (similarity in local properties of the two nodes) or path-based (quasi-local or global properties of the two nodes) [3,15]. We use the so-called HPLP feature set defined in [12], as these features are known to obtain good performance while keeping the number of features limited.

In the definitions that follow, the set $\Gamma(u)$ denotes the neighbors of node $u$ and $\deg(u)$ the degree of node $u$. Parallel edges can exist in the network, and hence the number of neighbors of node $u$, $|\Gamma(u)|$, is not necessarily equal to the degree of node $u$, $\deg(u)$. In directed networks, we differentiate between the neighbors connecting to node $u$, indicated by $\Gamma_{\text{in}}(u)$, and the neighbors node $u$ connects to, $\Gamma_{\text{out}}(u)$. Likewise, we also differentiate between the indegree and outdegree of node $u$, $\deg_{\text{in}}(u)$ and $\deg_{\text{out}}(u)$, respectively.

Neighbor-Based Features. Neighbor-based features take only the direct neighbors of the two nodes under consideration into account.

Number of Neighbors (NN). This feature is determined differently for undirected and directed networks. For directed networks, we use both the number of neighbors connecting to nodes $u$ and $v$ and the number of nodes connected to by $u$ and $v$. Hence, we get four features: $S_{\text{NN-in-}u}(u, v) = |\Gamma_{\text{in}}(u)|$, $S_{\text{NN-in-}v}(u, v) = |\Gamma_{\text{in}}(v)|$, $S_{\text{NN-out-}u}(u, v) = |\Gamma_{\text{out}}(u)|$, and $S_{\text{NN-out-}v}(u, v) = |\Gamma_{\text{out}}(v)|$. In the undirected case, the same score for pairs of nodes $(u, v)$ and $(v, u)$ is desired and there is no difference between the number of nodes connecting from or to node $u$. Hence we report both the maximum and minimum for a given pair of nodes, i.e., $S_{\text{NN-min}}(u, v) = \min(|\Gamma(u)|, |\Gamma(v)|)$ and $S_{\text{NN-max}}(u, v) = \max(|\Gamma(u)|, |\Gamma(v)|)$.

Degree (D). The degree feature is defined similarly to the number of neighbors, except that the number of edges is considered instead of the number of nodes connected. For directed networks, we again obtain four features, viz. $S_{\text{D-in-}u}(u, v) = \deg_{\text{in}}(u)$, $S_{\text{D-in-}v}(u, v) = \deg_{\text{in}}(v)$, $S_{\text{D-out-}u}(u, v) = \deg_{\text{out}}(u)$, and $S_{\text{D-out-}v}(u, v) = \deg_{\text{out}}(v)$. For undirected networks, we obtain the maximum and minimum degree of nodes $u$ and $v$, $S_{\text{D-min}}(u, v) = \min(\deg(u), \deg(v))$ and $S_{\text{D-max}}(u, v) = \max(\deg(u), \deg(v))$.

Common Neighbors (CN). The number of common neighbors for a given pair of nodes is calculated by $S_{\text{CN}}(u, v) = |\Gamma(u) \cap \Gamma(v)|$. For directed networks, the score is calculated by considering the nodes that are connected from nodes $u$ and $v$, i.e. $S_{\text{CN}}(u, v) = |\Gamma_{\text{out}}(u) \cap \Gamma_{\text{out}}(v)|$.

Path-Based Features. Path-based features take into account the paths between the two nodes under consideration. Since many paths can exist, these features are computationally more expensive than the neighbor-based features.

Shortest Paths (SP). This measure $S_{\text{SP}}(u, v)$ indicates the number of shortest paths that run between nodes $u$ and $v$.

PropFlow (PF). The PropFlow measure, $S_{\text{PF}}(u, v)$, corresponds to the probability that a restricted random walk starting from node $u$ ends at node $v$ within $l$ steps [12]. We use the commonly used value of $l = 5$. We collapse the network with multiple edges (occurring at different timestamps) to a weighted network where the weight is equal to the number of parallel edges running between two nodes. Higher weights result in a higher transition probability for the random walk. This method is known to potentially obtain different scores for pairs of nodes $(u, v)$ than for $(v, u)$, even in the undirected case [16]. Hence, we use the mean of the scores obtained for the pairs of nodes $(u, v)$ and $(v, u)$ in the undirected case.
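As an illustration of the undirected definitions above, the following minimal sketch computes the NN, D, CN and SP scores for one pair of nodes. It assumes an undirected multigraph given as a plain edge list with repeated entries for parallel edges, counts shortest paths on the underlying static network, and leaves PropFlow out; the helper names are hypothetical and not taken from [12].

```python
from collections import Counter, defaultdict, deque

def build(edges):
    """Neighbor sets (Gamma) and edge-count degrees (deg) of an undirected multigraph."""
    nbrs, deg = defaultdict(set), Counter()
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
        deg[u] += 1
        deg[v] += 1
    return nbrs, deg

def neighbor_features(nbrs, deg, u, v):
    """Undirected NN, D and CN scores for the pair (u, v)."""
    nn = (len(nbrs[u]), len(nbrs[v]))
    d = (deg[u], deg[v])
    return {
        "NN-min": min(nn), "NN-max": max(nn),
        "D-min": min(d), "D-max": max(d),
        "CN": len(nbrs[u] & nbrs[v]),
    }

def count_shortest_paths(nbrs, u, v):
    """Number of shortest paths between u and v via BFS (0 if unreachable)."""
    dist, sigma = {u: 0}, Counter({u: 1})
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if x == v:  # all predecessors of v have been processed by now
            break
        for w in nbrs[x]:
            if w not in dist:
                dist[w] = dist[x] + 1
                queue.append(w)
            if dist[w] == dist[x] + 1:
                sigma[w] += sigma[x]
    return sigma[v]

# Toy multigraph with a parallel edge between a and b (hypothetical data).
toy = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("b", "c")]
nbrs, deg = build(toy)
features = neighbor_features(nbrs, deg, "a", "d")
n_paths = count_shortest_paths(nbrs, "a", "d")
```

Node a has two neighbors but degree three here, which is exactly the distinction between $|\Gamma(u)|$ and $\deg(u)$ that parallel edges introduce.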
3.4 Classifier
We used a tree-based gradient boosting learner as our classifier, as these are known to perform well in generic classification tasks. The Python implementation of XGBoost was used [17]. This classifier has various hyperparameters. While extensive hyperparameter tuning is beyond the scope of this paper, we cross-validate two important hyperparameters, viz. the maximum tree depth and the class weights.
3.5 Performance Metric
Link prediction is associated with extreme class imbalance, lower bounded by the number of nodes in the network [12]. Ideally, performance metrics used to evaluate the classifier should be robust against this class imbalance. The commonly encountered Receiver Operating Characteristic (ROC) lacks this robustness [16,18] and is hence not used. We are especially interested in correctly predicting positives without losing precision, i.e., keeping the number of false positives low, and without losing recall, i.e., making sure we find all true positives. The Average Precision (AP) metric, which is equal to the weighted mean of precisions achieved at each threshold in the precision-recall curve, is well-suited in this case.
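To make the metric concrete, the sketch below computes AP as $\sum_n (R_n - R_{n-1}) P_n$, which reduces to the mean of the precision values at the rank of each true positive. It is a minimal illustration that assumes untied scores (ties are broken by input order); the function name is hypothetical.

```python
def average_precision(y_true, scores):
    """AP: mean of the precision measured at the rank of each true positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        return 0.0  # undefined without positives; 0.0 is a convention chosen here
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Positives are ranked first and third: AP = (1/1 + 2/3) / 2 = 5/6.
ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.5])
```

Because only the ranks of the positives enter the sum, adding more low-scoring negatives leaves AP unchanged, while negatives ranked above positives lower it.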
4 Data
Since our research aims to split the network into different snapshots based on time, temporal networks are needed. In this work, we use six different temporal networks, spanning a broad range of different domains. Properties of these networks are shown in Table 1. The density, diameter Ø and mean distance d̄ were calculated on the underlying static network, i.e., the network without parallel edges. Below, we briefly discuss the six datasets used in this work. Except for the Condmat network, all datasets were obtained from KONECT [19].

AU. The Ask Ubuntu (AU) network is an online contact network. Interactions were gathered from the StackExchange site “Ask Ubuntu”. The nodes are the users, and a directed edge is created when a user replies to a message of another user. These interactions can consist of an answer to a question of another user, a comment on another user’s question, and a comment on another user’s answer. Each edge is annotated with the time of interaction.

Condmat. This scientific co-authorship dataset entails condensed matter physics collaborations from 1995 to 2000, obtained from https://github.com/rlichtenwalter/LPmade. A temporal undirected network is created by adding a node for each author in a publication and adding an edge between all authors of a publication [18]. For each edge, the date of the publication connecting these authors is used. We observe that the number of authors per paper increases over time. This may cause varying performance in link prediction for different temporal snapshots. We deemed this outside the scope of the current research.
Table 1. Statistics of networks used in this study. Edges and nodes in giant component (GC) are indicated between brackets. Mean distance between nodes is given in column d¯ and the column Ø indicates the diameter of the networks.
Digg. The Digg network is a communication network and contains the reply network of the social news website Digg. Each node in the network is a person, and each edge connects the user replying to the receiver of the reply. Each reply is annotated with the time of that interaction.

Enron. The Enron dataset is a communication network and contains over one million emails sent between employees of Enron between 1999 and 2003 [20]. For each email present in the dataset, sender and recipient are added as nodes, and a directed edge from sender to recipient is annotated with the date of the email.

Slashdot. Technology website Slashdot is a popular English website. The website allows commenting on each page, where users can start threaded discussions. The communication network is constructed from these threads, where users are nodes and replies are edges, annotated with the time of the reply.

SO. Like AU, the Stack Overflow (SO) network is collected from StackExchange and can be considered an online contact network. Nodes are users, and directed edges represent interactions, annotated with the time of the interaction.
5 Experiments
This section starts with the experimental setup used. We then continue in Sects. 5.2 and 5.3 with the results and robustness checks.
5.1 Experimental Setup
A few parameters need to be addressed to run the link prediction task. We highlight the following four below. First, the selection of node pairs using their distance in the network is considered. Second, the time intervals for the maturing and probing phase(s) need to be chosen for both the random and temporal split, as explained in Sect. 3.2. Third, the number of pairs of nodes used for training and testing is discussed. Fourth, the values of two hyper-parameters of the classifier are determined. Finally, we explain how multiple snapshots from a network are constructed for robustness checks.
Distance Selection. Link prediction is computationally intensive for larger networks, since $|V_{[t_0,t_b]} \times V_{[t_0,t_b]} \setminus E_{[t_0,t_b]}|$ instances need to be considered in the classifier. One way to reduce computational complexity, and reduce class imbalance as well, is to only consider pairs of nodes at a certain distance in the network [18]. We consider only pairs of nodes at a two-hop distance.

Time Intervals. The time intervals used for the maturing and probing phase in both the random and temporal split can potentially have an effect on the obtained results. Hence, these values need to be chosen consistently across networks. Since a similar set-up is used as in [12], the timestamps $t_b$, $t_c$ and $t_d$ were set in such a way that the proportion of edges in the maturing and probing phase remains roughly similar to that of the Condmat network in [12]. This ratio $|E_{[t_0,t_b]}| : |E_{[t_b,t_c]}|$ is approximately equal to 5 : 1. To allow fair comparisons between the random and temporal split, the probing phase of the test interval should contain a similar number of edges as the probing phase of the training interval, i.e., $|E_{[t_b,t_c]}| \approx |E_{[t_c,t_d]}|$. For computational reasons, we choose $t_b$, $t_c$ and $t_d$ for all networks such that the number of edges in each interval remains similar to the Condmat network. This means that $|E_{[t_0,t_b]}| \approx 50000$ and $|E_{[t_b,t_c]}| \approx |E_{[t_c,t_d]}| \approx 10000$.

Training and Testing. In case of random splitting, the instances $X_{[t_0,t_b]}$ should be split into two disjoint sets, as explained in Sect. 3.2. 75% of the instances are used for training and the remainder for testing, i.e. $|X^{\text{train}}_{[t_0,t_b]}| = 3\,|X^{\text{test}}_{[t_0,t_b]}|$.

Hyper-parameters. Default parameters for XGBoost were used, except for the following. The weights of the positive instances can be adjusted during training in such a way that the total weights of the positive and negative samples are equal.
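The distance selection, the choice of interval boundaries by edge count, and the reweighting of positives described above can be sketched as follows. This is a hedged illustration rather than the code used in the experiments: the graph is assumed undirected, all helper names are hypothetical, tied timestamps make the edge counts only approximate, and the ratio returned by `positive_weight` is merely the kind of value one could pass to XGBoost's `scale_pos_weight` parameter.

```python
from collections import defaultdict

def two_hop_pairs(edges):
    """Unordered node pairs at distance exactly two in an undirected graph."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    pairs = set()
    for w in list(nbrs):
        around = sorted(nbrs[w])
        for i, u in enumerate(around):
            for v in around[i + 1:]:
                if v not in nbrs[u]:  # share neighbor w but are not adjacent
                    pairs.add((u, v))
    return pairs

def cutoffs_by_edge_count(timestamps, n_mature, n_probe):
    """t_b, t_c, t_d such that the phases hold n_mature, n_probe, n_probe edges."""
    ts = sorted(timestamps)
    if len(ts) < n_mature + 2 * n_probe:
        raise ValueError("not enough edges for the requested phases")
    return (ts[n_mature - 1],
            ts[n_mature + n_probe - 1],
            ts[n_mature + 2 * n_probe - 1])

def positive_weight(labels):
    """Weight for positives that equalises total positive and negative weight."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos  # raises ZeroDivisionError if there are no positives

# Toy usage (hypothetical data); the paper uses 50000 and 10000 edges per phase.
candidates = two_hop_pairs([("a", "b"), ("b", "c"), ("c", "d")])
t_b, t_c, t_d = cutoffs_by_edge_count(range(1, 8), n_mature=5, n_probe=1)
w_pos = positive_weight([1, 0, 0, 0])
```

Restricting candidates to two-hop pairs replaces the quadratic number of unconnected pairs by a set whose size depends on local neighborhood sizes.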
In a fivefold cross-validation setting applied on the training data, we determined for each network separately whether reweighting the positive instances improved performance on the train set. Furthermore, the maximum tree depth used for the base learners was also determined in the same fivefold cross-validation.

Robustness Checks. To check whether our results are robust, the entire procedure is repeated ten times on Ask Ubuntu. We create ten non-overlapping snapshots by shifting intervals such that each next interval starts ($t_a$) at the end of the previous interval ($t_c$ for the random split, $t_d$ for the temporal split).
5.2 Results
The average precision score (AP) of the classifiers for the six networks with the random split and temporal split method is shown in Table 2. This metric shows large performance differences between the random and temporal split.
For all networks, the performance of the temporal split is lower than that of the random split. This may indicate that the random split provides an overly optimistic indication of the performance. The difference between the random and temporal split varies widely between the networks, which may indicate that the extent to which the test set is reused varies per network. Notably, the AP of the Ask Ubuntu network drops by 80%, demonstrating that the test set reuse could be severe.
Table 2. Comparison of performance on the link prediction classification measured using average precision.
Dataset          Random split   Temporal split
Askubuntu        0.023          0.0046
Condmat          0.012          0.0048
Digg             0.0043         0.0014
Enron            0.016          0.012
Slashdot         0.0076         0.0021
StackOverflow    0.0029         0.0013

5.3 Robustness Checks
We check the robustness of the findings by following the procedure outlined in Sect. 5.1. We find an AP of 0.025 ± 0.009 (mean ± standard deviation) when using the random split, while an AP of only 0.0061 ± 0.0016 is found for the temporal split. The precision-recall curves of the ten snapshots are shown in Fig. 3. The random split precision-recall curves clearly dominate their temporal counterparts at all snapshots.
Fig. 3. Precision-recall curves for ten snapshots of the AskUbuntu network.
6 Conclusion
The aim of the present research was to analyze different ways of obtaining the train and test set in link prediction. The results of this investigation on various large networks indicate that the random split consistently shows better performance than the temporal split. In future work, we plan to investigate new splitting strategies to separate train and test set. While the procedure of the temporal split prevents using the exact same temporal information of a given node, it still allows the same node to be used in both the train and test set. More rigorous strategies should be devised to further ensure that the train and test set are truly disjoint and independent.
References
1. Hamilton, W.L., Ying, R., Leskovec, J.: Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584 (2017)
2. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inform. Sci. Technol. 58(7), 1019–1031 (2007)
3. Kumar, A., Singh, S.S., Singh, K., Biswas, B.: Link prediction techniques, applications, and performance: a survey. Physica A 553, 124289 (2020)
4. Lü, L., Zhou, T.: Link prediction in complex networks: a survey. Physica A: Stat. Mech. Appl. 390(6), 1150–1170 (2011)
5. Al Hasan, M., Zaki, M.J.: A survey of link prediction in social networks. In: Social Network Data Analytics, pp. 243–275. Springer (2011)
6. Ghasemian, A., Hosseinmardi, H., Galstyan, A., Airoldi, E.M., Clauset, A.: Stacking models for nearly optimal link prediction in complex networks. Proc. Natl. Acad. Sci. 117, 201914950 (2020)
7. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media (2009)
8. Ghasemian, A., Hosseinmardi, H., Clauset, A.: Evaluating overfit and underfit in models of network community structure. IEEE Trans. Knowl. Data Eng. 32, 1722–1735 (2019)
9. Roelofs, R., Miller, J., Hardt, M., Fridovich-keil, S., Schmidt, L., Recht, B.: A meta-analysis of overfitting in machine learning. In: NeurIPS, p. 11 (2019)
10. Ioannidis, J.P.: Why most published research findings are false. Get. Good: Res. Integr. Biomed. Sci. 2(8), 2–8 (2018)
11. Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., Roth, A.: Preserving statistical validity in adaptive data analysis. In: Proceedings of the Annual ACM Symposium on Theory of Computing, pp. 117–126 (2015)
12. Lichtenwalter, R.N., Lussier, J.T., Chawla, N.V.: New perspectives and methods in link prediction. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 243–252 (2010)
13. Hasan, M.A., Chaoji, V., Salem, S., Zaki, M., York, N.: Link prediction using supervised learning. In: SDM 2006: Workshop on Link Analysis, Counter-Terrorism and Security, pp. 798–805 (2006)
14. Mutlu, E.C., Oghaz, T.A.: Review on graph feature learning and feature extraction techniques for link prediction. arXiv preprint arXiv:1901.03425 (2019)
15. Huang, Z., Li, X., Chen, H.: Link prediction approach to collaborative filtering. In: ACM/IEEE Joint Conference on Digital Libraries, pp. 141–142 (2005)
16. Yang, Y., Lichtenwalter, R.N., Chawla, N.V.: Evaluating link prediction methods. Knowl. Inf. Syst. 45(3), 751–782 (2015)
17. Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y.: Xgboost: extreme gradient boosting. R package version 0.4-2, pp. 1–4 (2015)
18. Lichtenwalter, R., Chawla, N.V.: Link prediction: fair and effective evaluation. In: Proceedings of the 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2012, pp. 376–383 (2012)
19. Kunegis, J.: KONECT: the Koblenz network collection. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1343–1350 (2013)
20. Klimt, B., Yang, Y.: The enron corpus: a new dataset for email classification research. In: European Conference on Machine Learning, pp. 217–226 (2004)
Enriching Graph Representations of Text: Application to Medical Text Classification
Alexios Mandalios¹(✉), Alexandros Chortaras¹, Giorgos Stamou¹, and Michalis Vazirgiannis²
¹ School of Electrical and Computer Engineering, NTUA, Athens, Greece
[email protected], {achort,gstam}@cs.ntua.gr
² LIX, Ecole Polytechnique, Palaiseau, France
[email protected]
Abstract. Graph-based representations have been utilized to achieve state-of-the-art performance in text classification tasks. The same basic structure underlies knowledge graphs, large knowledge bases that contain rich information about the world. This paper capitalises on the graph of words model and enriches it with concepts from knowledge graphs, resulting in more powerful hybrid representations of a corpus. We focus on the domain of medical text classification and medical ontologies in order to test our proposed methods and analyze different alternatives in terms of text representation models and knowledge injection techniques. The method we present produces text representations that are both explainable and effective in improving the accuracy on the OHSUMED classification task, surpassing neural network architectures such as GraphStar and Text GCN.
Keywords: Graph of words · Semantic enrichment of text · Knowledge graphs · Natural language processing · Medical information systems
1 Introduction
Text representation is a fundamental task in the field of natural language processing. Choosing how to represent the written word has been at the forefront of research since the inception of automated language processing. One of the main machine learning tasks that depend heavily on text representation is that of text classification. In the problem of text classification, we learn the patterns contained in a labeled document corpus, and then we aim to predict the labels of unlabeled documents. Here we discuss the most effective approaches that have been proposed for the task of text classification in general, as well as the contribution of domain knowledge.

We acknowledge support of this work by the project “APOLLONIS” (MIS 5002738), which is implemented under the Action “Reinforcement of the Research and Innovation Infrastructure”, funded by the Operational Programme “Competitiveness, Entrepreneurship and Innovation” (NSRF 2014–2020) and co-financed by Greece and the European Union (European Regional Development Fund) (applicable for Alexios Mandalios, Alexandros Chortaras, Giorgos Stamou).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 92–103, 2021. https://doi.org/10.1007/978-3-030-65351-4_8

The mainstream approaches involve derivatives of the well-studied recurrent neural network architecture. For example, the state-of-the-art solution for the Yahoo! answers dataset involves incorporating position invariance of terms in recurrent neural networks [1]. However, it is worth noting that for several important text classification datasets, the best solution involves representing texts in a graph structure. The 20NEWS and OHSUMED tasks are best tackled by graph convolutional networks. Yao et al. [2] suggested a representation of a whole corpus using a graph, with nodes representing both terms and documents. Wu et al. [3] demonstrated how graph convolutional networks can become both simpler and more effective for the task at hand, yielding better results for both datasets. In addition, GraphStar [4], an extremely flexible graph-learning network, manages to produce state-of-the-art results for R52.

The idea of injecting domain knowledge in a text to improve the performance of downstream tasks has also been studied in past work. This paper focuses on medicine, so we limit our references to work on this domain. Liu et al. [5] used external knowledge in order to expand queries and improve information retrieval of medical text. G-Bean [6], a graph-based biomedical search engine, led to the improvement of MEDLINE’s indexing. Based on the work of Huang et al. [7], the team of Albitar et al. [8] applied the idea of Bag of Concepts as an alternative to the Bag of Words using resources that are also described in this paper.

On one hand, machine learning state-of-the-art indicates that representing texts as graphs is effective in the field of text classification.
On the other hand, knowledge graphs have been utilized to store domain information that cannot be deduced from any given dataset. It seems natural to try and combine these two similar structures into what could be an enriched representation of a corpus of texts. This is the main motivation behind our paper: we aim to find out if the graph of words is not only a superior structure for text representation, but also the best model to incorporate external knowledge, when it is available. The decision points of this process include the specific graph model used to represent texts, the type of knowledge graph selected to enrich our corpus, the caveats of the graph alignment process, and the generation of informative hybrid text representations. Our semantically-enriched graph of words model achieves state-of-the-art results on the OHSUMED classification task.

The rest of the paper is structured as follows: Sect. 2 presents an established and effective way of representing text using a graph of words. In Sect. 3 we model the process of text enrichment using a knowledge base in general terms. Section 4 transfers the problem to the medical domain, by presenting tools, resources, and finally a text representation and enrichment pipeline that is a special case of the general method discussed in Sect. 3. Section 5 tests our proposed architecture in different experimental configurations, and quantitatively demonstrates our method’s effectiveness. Finally, Sect. 6 concludes the paper.
2 Graph Representation of Text
This section aims to introduce the reader to the representations of text considered in the paper. We focus on the models used in the experimental evaluation presented in Sect. 5. For different approaches on using graphs for NLP, one can consult the work presented in Sect. 1.
2.1 The Bag of Words Model
The most established text vectorization method is the Bag of Words (BoW) model. This model describes documents using word occurrences, while ignoring their position in text. The essential steps of the Bag of Words model are as follows:

1. Tokenizing the documents in order to transform them into bags of words.
2. Counting the occurrences of tokens in each document.
3. Normalizing the counting results, as we need to recognize the most important terms in our documents, while diminishing the value of terms that appear frequently in the corpus.

The first step is performed by basic natural language processing tools, such as NLTK [9] or SpaCy [10], that convert a string to a multiset of words. The two final steps are combined into an information retrieval scheme known as TFIDF. This statistical method aims to weight a term of a document in a way that is proportional to the number of times it appears in the document (Term Frequency/TF) and to the inverse of the number of documents in the corpus that contain that term (Inverse Document Frequency/IDF). The main shortcoming of this method is the inability to incorporate word order. An example that demonstrates this issue is the BoW representations of the two phrases “the quick brown fox jumps over the lazy dog” and “the quick dog jumps over the lazy brown fox”. These two sentences have a very different meaning, but the BoW model fails to produce distinct representations for each of them.
2.2 The Graph of Words Model
A text representation alternative that has been studied in the literature is the graph of words (GoW) model. This model mends the main issue of the bag of words model, that is, the lack of consideration for word order, by partially preserving word order information in the form of a graph. Even though different graph structures have been proposed by various researchers, we focus on the work of Rousseau et al. [11], as they have the most extensive evaluation of the impact of the GoW model on the task of ad hoc IR in large scale text corpora. The main concept of the GoW model is that each text T in a corpus is represented by a directed graph G = (V, E) defined by its set of nodes (vertices) V and its set of edges E, where:
– $V = \{n : n \text{ is a term in } T\}$, which means that each node $n$ in $G$ represents a term $n$ of the text $T$. Note that duplicate terms are mapped to a single graph node.
– $E = \{n_1 \to n_2 : n_1, n_2 \in V,\ \text{adjacent}(n_1, n_2, T, W)\}$, where $\text{adjacent}(n_1, n_2, T, W) = \text{True}$ iff $n_2$ follows $n_1$ in $T$ with distance at most $W$ terms.

An edge $e = n_1 \to n_2$ serves to model syntactic proximity between terms $n_1$ and $n_2$ in $T$. $W$ is a parameter that lets us be more or less strict about how we define proximity. $W = 1$ means that only terms that follow one another in $T$ are syntactically close, $W = 2$ means that we also consider terms that are one step further in $T$, and so on. In other words, this model transfers adjacency of terms in text to the adjacency table of the graph of words $G$. The GoW model transfers our considerations to the realm of graph theory. Paths in $G$ can then be interpreted as word sequences appearing in the original text, and the text itself can be seen as the result of a walk on the GoW. Node degrees can be interpreted as term importance in text. The way GoW incorporates word order is illustrated in Fig. 1.
Fig. 1. Illustration of the way two sentences with different meaning are converted into different graphs of words, for the case of W = 2.
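The contrast between the two models can be checked directly on the two example sentences. The sketch below is a minimal illustration that assumes whitespace tokenization and represents a graph of words simply as a set of directed edges; the function names are hypothetical.

```python
from collections import Counter

def bag_of_words(text):
    """Multiset of terms; word order is discarded."""
    return Counter(text.split())

def graph_of_words(text, window=2):
    """Directed edges n1 -> n2 whenever n2 follows n1 within `window` terms."""
    terms = text.split()
    edges = set()
    for i, n1 in enumerate(terms):
        for n2 in terms[i + 1:i + 1 + window]:
            edges.add((n1, n2))
    return edges

s1 = "the quick brown fox jumps over the lazy dog"
s2 = "the quick dog jumps over the lazy brown fox"

same_bow = bag_of_words(s1) == bag_of_words(s2)      # identical bags of words
same_gow = graph_of_words(s1) == graph_of_words(s2)  # different edge sets
```

With $W = 2$, the edge quick → brown exists only for the first sentence and quick → dog only for the second, so the two graphs differ even though the bags of words coincide.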
Rousseau et al. also propose a vectorization method that extracts the same features as TFIDF. The catch is that term importances are not set equal to term frequencies, as in TFIDF, but equal to the corresponding nodes’ degrees in the GoW representation. This vectorization scheme is named Term-Weight/Inverse-Document-Frequency (TWIDF). In the case of directed graphs, it makes sense to use the indegree as the measure of node importance: a node n with syntactic proximity to a large chunk of a document must be important in its context.
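A simplified version of this weighting can be sketched as follows. The code takes the indegree in the graph of words as the term weight and multiplies it by an IDF factor; the specific IDF form $\log((N+1)/\mathit{df})$, the omission of any document-length normalization, and the exclusion of self-loops are assumptions made for this illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def indegree_weights(text, window=2):
    """Term weight = indegree of the term's node in the graph of words."""
    terms = text.split()
    edges = {(n1, n2) for i, n1 in enumerate(terms)
             for n2 in terms[i + 1:i + 1 + window] if n1 != n2}  # no self-loops
    indeg = Counter(n2 for _, n2 in edges)
    return {t: indeg[t] for t in set(terms)}

def tw_idf(corpus, window=2):
    """One {term: weight} dict per document: indegree times log((N + 1) / df)."""
    n_docs = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc.split()))
    return [{t: tw * math.log((n_docs + 1) / df[t])
             for t, tw in indegree_weights(doc, window).items()}
            for doc in corpus]

weights = indegree_weights("the quick brown fox jumps over the lazy dog")
vectors = tw_idf(["a b c", "a b"])
```

Unlike raw term frequency, the indegree does not grow when a term is repeated in an unchanged context, since duplicate edges collapse into one.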
3 Enriching Text Using a Knowledge Base
This section presents the general concept of text enrichment using external knowledge, in the form of a knowledge base. Assume we have a text T and a knowledge base KB = (ABox, TBox), where:
– ABox is the assertion component of the KB.
– TBox is the terminology component of the KB.
Within the KB, we may reason about the elements of its vocabulary, defined as $Voc = IN \cup CN \cup RN$.
– IN is the set of individual names. For a medical knowledge base, a specific patient at a hospital or a specific case of cancer would both be members of IN.
– CN is the set of concept names. In medicine, the set of all instances of vitiligo would belong to CN.
– RN is the set of role names. In the medical domain, the connection that states how an instance of lung cancer appears at an instance of lung would be a role.

The enrichment process of T using KB is performed in two steps.
1. Alignment, the process of finding common ground between T and KB. If we represent T as a sequence of characters, and we denote by $S_T$ the set of contiguous subsequences of T, then the alignment process can be formally defined as a mapping $S_T \to Voc$.
2. Enrichment, the process of further reasoning using KB in order to boost the information content of T. This step is only limited by the contents and expressiveness of KB. For example, we could find all the concept names in CN that contain individuals appearing in T, so we can get a better idea of the types of individuals that appear in T.

A special case of this process, which proves to be most effective in practice, is that of the semantically-enriched graph of words, denoted as $\text{GoW}^+ = (V^+, E^+)$. Enriching a GoW would work as follows:
1. The alignment step involves finding a mapping between the GoW’s nodes V and Voc.
2. The enrichment step involves expanding the GoW using knowledge contained in the KB. This is most effective when KB is in the form of a knowledge graph (KG), so we can transfer parts of it directly onto the GoW, as analyzed in the sections that follow.

The nodes and edges of $\text{GoW}^+$ can then be interpreted as follows:
– nodes in $V^+$ represent terms of T as well as elements from Voc.
– edges in $E^+$ can be interpreted either as syntactic proximity in T, if they originate from the GoW, or as semantic proximity in the KG, if they were inserted during the enrichment step.

$\text{GoW}^+$ proves to be a superior hybrid representation of text that allows for state-of-the-art performance in medical text classification.
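A toy version of the two steps can be sketched as follows. This is a minimal illustration under strong assumptions, not the enrichment strategy evaluated later in the paper: alignment is a given dictionary from terms to concept names, the knowledge graph is a list of `(source, role, target)` triples, and enrichment adds only the outgoing edges of each aligned concept. The concept and role names are taken from the labels visible in Fig. 2, but their pairing into triples is our reading of that figure.

```python
def enrich_gow(gow_edges, alignment, kg_edges):
    """Return labelled GoW+ edges: syntactic, alignment, and one-hop KG edges."""
    enriched = {(n1, n2, "syntactic") for n1, n2 in gow_edges}
    terms = {n for edge in gow_edges for n in edge}
    for term in [t for t in terms if t in alignment]:
        concept = alignment[term]
        enriched.add((term, concept, "alignment"))
        for src, role, dst in kg_edges:
            if src == concept:  # transfer the concept's outgoing KG edges
                enriched.add((src, dst, role))
    return enriched

# Hypothetical inputs for illustration.
gow_edges = [("patient", "with"), ("with", "vitiligo")]
alignment = {"vitiligo": "Vitiligo"}
kg_edges = [
    ("Vitiligo", "ISA", "Hypopigmentation disorder"),
    ("Vitiligo", "FINDING SITE", "Skin structure"),
    ("Lung cancer", "FINDING SITE", "Lung structure"),
]
gow_plus = enrich_gow(gow_edges, alignment, kg_edges)
```

Keeping a label on every edge preserves the distinction between syntactic and semantic proximity, so a later feature extraction step can weight the two kinds of edges differently.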
4 Classification of Medical Text
This section describes our architecture for classification of medical text. We describe the classification task at hand, as well as the tools and resources utilized during our experiments. Finally, we define a general pipeline that will be specialized in Sect. 5 for different experimental modes.
4.1 OHSUMED
OHSUMED is a medical text dataset that contains abstracts from publications in the MEDLINE bibliographic database¹. More specifically, OHSUMED contains 13929 unique cardiovascular disease abstracts from the year 1991. While it is a multi-label dataset, as each document can be associated with one or more of 23 disease categories, we focus on the single-label classification task. This allows us to make our experimental results comparable to those of other papers, described in Sect. 1. The transition from a multi-label classification task to a multi-class classification task is performed by filtering out any document that is associated with more than a single label.
4.2 SNOMED CT
SNOMED CT [12] is a comprehensive collection of medical terms, containing medical concepts and relationships between them. Going back to the modeling presented in Sect. 3, we note that SNOMED is a simplified version of a knowledge base. More specifically, we have no information about the ABox, only the axioms included in the TBox, containing medical concepts and the relationships, hierarchical or other, that exist between them. For the purposes of this paper, it is convenient to store SNOMED CT in the form of a graph, with nodes representing medical concepts and edges representing relationships between them. Even though it is not explicitly utilized in the experiments that follow, we note that an edge in SNOMED’s graph holds a specific meaning in terms of description logics: $A \xrightarrow{\text{ISA}} B$ means that $A \sqsubseteq B$, and $A \xrightarrow{r} B$ means that $A \sqsubseteq \exists r.B$. We can see how this representation of external knowledge is analogous to how the GoW model represents texts, with edges representing more specific relationships instead of syntactic proximity. In terms of DBMS, we use the Neo4j graph database [13] to store our knowledge graph. An example of a concept and the concepts it is connected to appears in Fig. 2.
[Figure 2: SNOMED CT subgraph around the concept Vitiligo, with ISA edges to Site-specific disorder of skin and Hypopigmentation disorder, a FINDING SITE edge to Skin structure, and an ASSOCIATED MORPHOLOGY edge to Hypopigmentation.]
Fig. 2. The example of Vitiligo in SNOMED CT. Here we can see some of its hierarchical connections, as well as some information about its finding site and its associated morphology.
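The graph view of SNOMED CT described above can be illustrated with a minimal sketch. It stores typed edges in a plain adjacency map rather than in Neo4j; the edge list is our reading of Fig. 2, and the helper names `build_graph` and `parents` are hypothetical.

```python
from collections import defaultdict

# Typed edges (source, relation, target), as read off Fig. 2 (an assumption
# about the figure layout, not an export from SNOMED CT itself).
EDGES = [
    ("Vitiligo", "ISA", "Hypopigmentation disorder"),
    ("Vitiligo", "ISA", "Site-specific disorder of skin"),
    ("Vitiligo", "FINDING SITE", "Skin structure"),
    ("Vitiligo", "ASSOCIATED MORPHOLOGY", "Hypopigmentation"),
]


def build_graph(edges):
    """Return an adjacency map: concept -> list of (relation, neighbour)."""
    graph = defaultdict(list)
    for source, relation, target in edges:
        graph[source].append((relation, target))
    return dict(graph)


def parents(graph, concept):
    """Concepts B such that `concept` ISA B, i.e. concept is subsumed by B."""
    return sorted(t for r, t in graph.get(concept, []) if r == "ISA")


kg = build_graph(EDGES)
print(parents(kg, "Vitiligo"))
```

Keeping the relation label on every edge preserves the distinction between hierarchical (ISA) and attribute relationships, even though the enrichment steps that follow only need reachability.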
1 https://www.nlm.nih.gov/bsd/medline.html.
4.3 MetaMap
Having described the OHSUMED dataset of medical abstracts and the SNOMED terminology, which contains rich information about medical terms, we now need to align the two: to harness SNOMED's information for the classification task, the terms found in OHSUMED's texts must be matched to SNOMED concepts. This is performed using MetaMap [14], an NLP tool developed at the National Library of Medicine (NLM) to map biomedical entities to the UMLS Metathesaurus. An example of how MetaMap performs on the title of an OHSUMED text is shown in Fig. 3.
[Figure 3: two graphs of words for the title "Computed tomography in patients with esophageal perforation." Left, without MetaMap: nodes computed, tomography, in, patients, with, esophageal, perforation. Right, with MetaMap: nodes computed tomography, in, patients, with, esophageal perforation.]
Fig. 3. Two GoWs for the same text, one with MetaMap (right) and one without (left). We can see how consecutive words can be clumped together into nodes.
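The effect shown in Fig. 3 can be reproduced with a small sketch. It assumes a sliding window of two tokens for the graph of words, and it replaces MetaMap with a hard-coded phrase list; both choices, as well as the function names, are illustrative assumptions rather than the setup used in the experiments.

```python
def tokenize(text, phrases=()):
    """Lowercase the text and merge known multi-word phrases into one token."""
    words = text.lower().rstrip(".").split()
    tokens, i = [], 0
    while i < len(words):
        for phrase in phrases:
            parts = phrase.split()
            if words[i:i + len(parts)] == parts:
                tokens.append(phrase)
                i += len(parts)
                break
        else:  # no phrase starts at position i
            tokens.append(words[i])
            i += 1
    return tokens


def graph_of_words(tokens, window=2):
    """Directed edges from each token to the tokens following it in the window."""
    edges = set()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1:i + window]:
            if source != target:
                edges.add((source, target))
    return edges


title = "Computed tomography in patients with esophageal perforation."
plain = tokenize(title)
# Hypothetical stand-in for the entities MetaMap would recognize in the title.
merged = tokenize(title, phrases=("computed tomography", "esophageal perforation"))
print(sorted(graph_of_words(plain)))
print(sorted(graph_of_words(merged)))
```

With the phrase list, the seven word nodes collapse into five nodes, two of which correspond to medical entities that can later be matched to knowledge graph concepts.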
4.4 Pipeline
Here we discuss how we use MetaMap and SNOMED in order to enrich medical text.
Text Representation and Enrichment Pipeline: The pipeline that aligns a text T with SNOMED is as follows:
1. Text alignment with knowledge graph: this step aims to find common ground between T and SNOMED, and results in an enriched form of text with added metadata, namely T_aligned, or T_a for short. This step is performed using MetaMap.
2. Text representation: this affects the amount of information preserved, but also the available options for utilizing the knowledge provided by SNOMED. This step produces a representation for T that is more manageable than the metadata-enriched sequence of characters that is T_a. The resulting text representation is denoted as T_representation, or T_r for short.
3. Text enrichment: the enrichment strategy describes the utilization of the knowledge contained in SNOMED in order to produce a more information-rich representation of T_r. The enriched text resulting from this step is T_enriched, or T_e for short.
4. Feature extraction: this step extracts specific features from the representations we have constructed in the previous steps. These features are used to evaluate our process in a machine learning task. The set of features resulting from this final step is T_features, or T_f.
An illustration of the pipeline can be found in Fig. 4.
[Figure 4: pipeline diagram. (T, KG) → T alignment with KG → T_a → T_a representation → T_r → T_r enrichment → T_e → T_e feature extraction → T_f.]
Fig. 4. Illustration of the general text representation and enrichment pipeline. With a text T and a knowledge graph KG as our inputs, we generate informative vector representations of T that incorporate information coming from KG.
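The four steps can be read as a composition of functions in which any step may be omitted. The sketch below is one possible way to express this; the function `run_pipeline` and its keyword arguments are hypothetical names introduced for illustration.

```python
from collections import Counter


def run_pipeline(text, kg, align=None, represent=None, enrich=None, extract=None):
    """Apply the four pipeline steps in order; a step left as None is skipped."""
    t_a = align(text, kg) if align else text    # step 1: T -> T_a
    t_r = represent(t_a) if represent else t_a  # step 2: T_a -> T_r
    t_e = enrich(t_r, kg) if enrich else t_r    # step 3: T_r -> T_e
    t_f = extract(t_e) if extract else t_e      # step 4: T_e -> T_f
    return t_f


# A baseline without external knowledge only uses steps 2 and 4.
features = run_pipeline(
    "heart failure heart",
    kg=None,
    represent=lambda t: t.lower().split(),  # bag of words as a token list
    extract=Counter,                        # raw term counts as features
)
print(features)
```

Expressing the baselines as the same function with fewer steps makes the experimental modes of Sect. 5 directly comparable, since only the plugged-in steps differ.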
5 Experimental Evaluation
This section expands on the pipeline described in Sect. 4 in order to demonstrate the following two points:
– Common literature baselines for text representation can be described as simplified versions of the proposed pipeline.
– The proposed pipeline can describe improved text enrichment and representation methods.
We start from baselines that utilize only textual information without incorporating a knowledge graph, and then proceed to describe methods that make use of external knowledge in order to produce more information-rich representations.
5.1 Bag of Words
The Bag of Words model, denoted as BoW, is the most popular baseline for text representation. Our pipeline is reduced to steps 2 and 4, as described below:
1. Text alignment with knowledge graph: this step is omitted.
2. Text representation: this step converts our texts into bags of words.
3. Text enrichment: this step is omitted.
4. Feature extraction: this step uses the traditional vectorization method TFIDF.
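As a reminder of what step 4 computes, the sketch below implements a textbook TF-IDF weighting, with raw term frequency multiplied by log(N / df). The exact variant and implementation used in the experiments are not stated, so this formula is an assumption.

```python
import math
from collections import Counter


def tfidf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


docs = [
    ["heart", "failure", "heart"],
    ["skin", "disorder"],
    ["heart", "disorder"],
]
vecs = tfidf(docs)
print(vecs[0])
```

In the first toy document, "heart" occurs twice but appears in two of the three documents, so its weight is 2 · log(3/2), whereas the rarer "failure" receives log(3).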
5.2 Graph of Words
The Graph of Words model, denoted as GoW, represents texts as graphs of the terms they contain, thus converting a corpus of documents into a corpus of graphs. Similar to the BoW model, we end up implementing just steps 2 and 4 of the pipeline, as follows:
1. Text alignment with knowledge graph: this step is omitted.
2. Text representation: this step converts our texts into graphs of words.
3. Text enrichment: this step is omitted.
4. Feature extraction: this step uses the TWIDF vectorization method in order to turn graphs into feature vectors (see Sect. 2).
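A simplified sketch of a TW-IDF style weighting follows. It assumes that the term weight of a node is its in-degree in the graph of words, built with a window of two tokens, and it uses a plain logarithmic IDF without length normalization. These are illustrative assumptions; the full scheme is described in [11].

```python
import math
from collections import Counter


def in_degrees(tokens, window=2):
    """In-degree of every term in the graph of words of one document."""
    edges = set()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1:i + window]:
            if source != target:
                edges.add((source, target))
    degree = Counter(target for _, target in edges)
    return {term: degree[term] for term in set(tokens)}


def twidf(corpus, window=2):
    """Replace the term frequency of TF-IDF by a graph-based term weight."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return [
        {t: tw * math.log(n_docs / df[t]) for t, tw in in_degrees(doc, window).items()}
        for doc in corpus
    ]


docs = [
    ["heart", "failure", "heart"],
    ["skin", "disorder"],
    ["heart", "disorder"],
]
vecs = twidf(docs)
print(vecs[0])
```

The feature set of each document is the same as under the bag-of-words model; only the weights differ. In the first toy document, "heart" has in-degree 1 even though it occurs twice.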
5.3 Enriched Bag of Words
The Enriched Bag of Words model, denoted as BoW+, expands on the BoW model described above by incorporating relevant entities extracted from a knowledge graph KG. The pipeline becomes as follows:
1. Text alignment with knowledge graph: a number of text terms are matched to KG nodes.
2. Text representation: this step converts our texts into bags of words.
3. Text enrichment: texts are enriched using triples extracted from KG. More specifically, for each term t1 contained in the text and matched to a KG node, and for each KG node t2 for which there exists a path t1 → ... → t2 of length at most N, the term t2 is added to the text's bag of words.
4. Feature extraction: this step uses the TFIDF vectorization method.
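The enrichment step of BoW+ amounts to a bounded breadth-first search from every matched term. The sketch below uses a toy knowledge graph without relation labels and treats a term as matched when it is a key of that graph; both are simplifying assumptions, and the function names are hypothetical.

```python
from collections import deque


def reachable(kg, start, max_hops):
    """Nodes reachable from `start` through directed paths of 1..max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond N hops
        for neighbour in kg.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                frontier.append((neighbour, depth + 1))
    return found


def enrich_bag(bag, kg, max_hops):
    """Append to the bag every KG node within N hops of a matched term."""
    enriched = list(bag)
    for term in bag:
        if term in kg:  # toy matching rule standing in for MetaMap alignment
            enriched.extend(reachable(kg, term, max_hops))
    return enriched


# Toy knowledge graph, for illustration only.
toy_kg = {
    "vitiligo": ["hypopigmentation disorder", "skin structure"],
    "hypopigmentation disorder": ["disorder of skin"],
    "disorder of skin": ["disease"],
}
bag = ["vitiligo", "patients"]
print(enrich_bag(bag, toy_kg, max_hops=2))
```

Increasing `max_hops` pulls in ever more general concepts ("disease" appears at N = 3 in the toy graph), which mirrors the trade-off examined in the experiments.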
5.4 Enriched Graph of Words
The Enriched Graph of Words model, denoted as GoW+, expands on the GoW model described above by incorporating relevant subgraphs extracted from a knowledge graph KG. The pipeline becomes as follows:
1. Text alignment with knowledge graph: a number of text terms are matched to KG nodes.
2. Text representation: this step converts our texts into graphs of words.
3. Text enrichment: texts are enriched using triples extracted from KG. More specifically, for each term t1 contained in the text and matched to a KG node, and for each KG node t2 for which there exists a path p = t1 → ... → t2 of length at most N, the path p is added to the text's graph of words.
4. Feature extraction: this step uses the TWIDF vectorization method.
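In contrast to BoW+, the GoW+ enrichment keeps the structure of the knowledge graph by adding whole paths to the text graph. A minimal sketch follows, assuming an unlabelled toy knowledge graph and a graph of words stored as a set of directed edges; the function names are hypothetical.

```python
def paths_up_to(kg, start, max_hops):
    """All simple directed paths from `start` with 1..max_hops edges."""
    paths = []

    def extend(path):
        if len(path) - 1 == max_hops:
            return
        for neighbour in kg.get(path[-1], ()):
            if neighbour not in path:  # keep paths simple
                new_path = path + [neighbour]
                paths.append(new_path)
                extend(new_path)

    extend([start])
    return paths


def enrich_graph(edges, kg, max_hops):
    """Add to the graph of words every KG path of length at most N from its nodes."""
    enriched = set(edges)
    nodes = {node for edge in edges for node in edge}
    for term in nodes:
        for path in paths_up_to(kg, term, max_hops):
            enriched.update(zip(path, path[1:]))  # consecutive pairs are edges
    return enriched


# Toy knowledge graph, for illustration only.
toy_kg = {
    "vitiligo": ["hypopigmentation disorder", "skin structure"],
    "hypopigmentation disorder": ["disorder of skin"],
    "disorder of skin": ["disease"],
}
gow = {("vitiligo", "patients")}
print(sorted(enrich_graph(gow, toy_kg, max_hops=2)))
```

Because edges rather than isolated terms are added, the new nodes take part in the graph-based term weighting of step 4.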
5.5 Enriched Graph of Words with Node Filtering
This pipeline modifies the GoW+ model by incorporating a set of KG nodes that are considered relevant, denoted as V_relevant. The enriched model with node filtering is denoted as GoW+_f.
1. Text alignment with knowledge graph: a number of text terms are matched to KG nodes.
2. Text representation: this step converts our texts into graphs of words.
3. Text enrichment: texts are enriched using triples extracted from KG. More specifically, for each term t1 contained in the text and matched to a KG node, and for each KG node t2 for which t2 ∈ V_relevant, the edge t1 → t2 is added to the text's graph of words. For the purposes of the method's evaluation, we experiment with a per-node personalized PageRank approach. More specifically, for each matched term t in a text, we calculate the most significant neighbors in KG using personalized PageRank with the node as a starting point, and we keep the top-M ranked entities.
Table 1. Accuracy achieved using the different text representation and enrichment models. The machine learning algorithm used is a simple SVM across the board. Note that for the GoW+_f model we use the best value of N = 3. For comparison, we add two state-of-the-art methods, GraphStar and Text GCN.

Model       Accuracy (parameter value in parentheses)
BoW         0.684
GoW         0.685
BoW+ (N)    0.636 (0)   0.693 (1)   0.711 (2)   0.6995 (3)  0.685 (4)   0.668 (5)   0.659 (6)
GoW+ (N)    0.648 (0)   0.679 (1)   0.720 (2)   0.727 (3)   0.725 (4)   0.722 (5)   0.723 (6)
GoW+_f (M)  0.691 (10)  0.709 (20)  0.712 (30)  0.718 (40)  0.716 (50)  0.717 (60)  0.715 (70)
GraphStar   0.642
Text GCN    0.684
4. Feature extraction: this step uses the TWIDF vectorization method in order to produce feature vectors that can be used to learn a classifier.
The experimental results in Table 1 use a basic linear SVM model, as linear SVMs are well suited to the sparse vectors our pipelines produce. From these results, we draw the following conclusions:
– For the OHSUMED dataset, the baseline BoW approach already works well, with results that are comparable to the state of the art.
– Without modification to the graph structure, the GoW model does not improve accuracy. This can be attributed to the fact that using a graph of words in combination with the TWIDF weighting scheme does not generate additional features for our texts, but only weights the existing features in a different way.
– Mapping medical entities in text to the SNOMED knowledge graph decreases classification accuracy if we do not consider the nodes' neighborhoods in the knowledge graph. This holds true for both the BoW+ and GoW+ cases, as seen for N = 0.
– The enrichment with additional knowledge graph terms boosts performance for both BoW+ and GoW+, as the classifier is fed additional relevant features that allow it to discriminate between classes. The best performance is achieved using the GoW+ approach, with an increase of about 4 percentage points in accuracy over the BoW baseline (0.727 vs. 0.684).
– Both the BoW+ and GoW+ models peak at about the same value of N (N = 2 and N = 3, respectively, according to Table 1), as seen in Fig. 5a. After that point, we observe diminishing returns. This can be explained if we consider the structure of the SNOMED knowledge graph used for the enrichment. The first steps add features that are discriminative between classes in our dataset, but further steps pollute the feature space with super-classes that are common throughout our entire corpus. However, the GoW+ model is more robust to additional features, as its performance declines significantly more slowly as N increases.
– The GoW+_f model serves as an alternative to GoW+, with control over the number of additional nodes attached per graph-of-words node. However, the proposed method, which calculates the most relevant knowledge graph entities for each matched text term, fails to capture terms that are relevant for a text as a whole but might not be as relevant for each text term separately. A comparison between BoW, GoW+ and GoW+_f can be found in Fig. 5. Figure 6 illustrates how the GoW+_f model saves on graph size.
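The per-node personalized PageRank filtering of Sect. 5.5 can be sketched with a short power iteration. The damping factor, the number of iterations, the handling of dangling nodes, and the toy graph are assumptions made for illustration; they are not taken from the experimental setup.

```python
def personalized_pagerank(kg, source, damping=0.85, iterations=50):
    """Power iteration with all restart mass placed on `source`."""
    nodes = set(kg) | {n for targets in kg.values() for n in targets}
    nodes.add(source)
    rank = {n: 0.0 for n in nodes}
    rank[source] = 1.0
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, score in rank.items():
            targets = kg.get(node, ())
            if targets:
                share = damping * score / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                nxt[source] += damping * score  # dangling mass returns to source
            nxt[source] += (1 - damping) * score  # restart
        rank = nxt
    return rank


def top_m_edges(kg, term, m):
    """Edges term -> t2 for the M highest ranked KG nodes other than `term`."""
    rank = personalized_pagerank(kg, term)
    candidates = [n for n in rank if n != term]
    best = sorted(candidates, key=lambda n: (-rank[n], n))[:m]
    return [(term, t2) for t2 in best]


# Toy knowledge graph, for illustration only.
toy_kg = {
    "vitiligo": ["hypopigmentation disorder", "skin structure"],
    "hypopigmentation disorder": ["disorder of skin"],
    "disorder of skin": ["disease"],
}
print(top_m_edges(toy_kg, "vitiligo", m=2))
```

Since the ranking is computed separately for each matched term, M bounds the number of nodes added per term, which is the source of the graph size savings, but no score reflects relevance to the text as a whole.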
Fig. 5. Illustration of different models' performance on OHSUMED. (a) Comparison of BoW+, GoW+, and baselines on OHSUMED. (b) Comparison of GoW+_f, GoW+ (N = 3), and baselines on OHSUMED.
Fig. 6. Illustration of the way the graph size changes with N and M. (a) Average graph size increase with N in the case of GoW+. (b) Average graph size increase with M for GoW+_f and average graph size for the best value of N = 3 for the case of GoW+.
6 Conclusions
In this paper we studied graph-based text representations that draw their structural characteristics both from a corpus and from external knowledge, in the form of a knowledge graph. After defining a general pipeline of steps, we covered several alternatives for text representation and enrichment, and concluded that the optimal choice in the case of OHSUMED + SNOMED is the use of graphs for both the text representation and enrichment steps. Representing texts as GoWs and expanding these using knowledge graphs produces better than state-of-the-art performance in the case of OHSUMED. While most research effort has focused on fine-tuning the process of converting a corpus of texts into graph form and then finding the optimal neural network architecture that learns from this structure, we focus on fine-tuning the representations of text instead. Our results indicate that moving the focal point to text representation instead of intricate machine learning yields better results in terms of explainability and classification performance.
References
1. Wang, B.: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2311–2320 (2018)
2. Yao, L., Mao, C., Luo, Y.: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 7370–7377 (2019)
3. Wu, F., Zhang, T., de Souza Jr., A.H., Fifty, C., Yu, T., Weinberger, K.Q.: arXiv preprint arXiv:1902.07153 (2019)
4. Haonan, L., Huang, S.H., Ye, T., Xiuyan, G.: arXiv preprint arXiv:1906.12330 (2019)
5. Liu, Z., Chu, W.W.: Inf. Retrieval 10(2), 173 (2007)
6. Wang, J.Z., Zhang, Y., Dong, L., Li, L., Srimani, P.K., Philip, S.Y.: BMC Bioinform. 15(S12), S1 (2014)
7. Huang, L., Milne, D., Frank, E., Witten, I.H.: J. Am. Soc. Inf. Sci. Technol. 63(8), 1593 (2012)
8. Albitar, S., Espinasse, B., Fournier, S.: The Twenty-Seventh International Flairs Conference (2014)
9. Loper, E., Bird, S.: arXiv preprint cs/0205028 (2002)
10. Honnibal, M., Montani, I.: spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (2017, To appear)
11. Rousseau, F., Vazirgiannis, M.: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 59–68 (2013)
12. Donnelly, K.: Stud. Health Technol. Inform. 121, 279 (2006)
13. Webber, J.: Proceedings of the 3rd Annual Conference on Systems, Programming, and Applications: Software for Humanity, pp. 217–218 (2012)
14. Aronson, A.R.: Proceedings of the AMIA Symposium (American Medical Informatics Association 2001), p. 17 (2001)
SaraBotTagger - A Light Tool to Identify Bots in Twitter
Carlos Magno Geraldo Barbosa1, Lucas Gabriel da Silva Félix2, Antônio Pedro Santos Alves1, Carolina Ribeiro Xavier1,2, and Vinícius da Fonseca Vieira1(B)
1 Federal University of São João del-Rei, São João del-Rei, Minas Gerais, Brazil [email protected]
2 Federal University of Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Abstract. In this work we present SaraBotTagger, a tool that efficiently identifies bots in Twitter using a Random Forest classifier trained only on the metadata of the accounts. We also propose a validation methodology that verifies the ability of the classifier to identify bots that are actually suspended by Twitter. We apply our methodology to two real-world discussions, regarding STF (the Supreme Court of Brazil) and anxiety. We analyze the descriptions of the accounts, the tweets in the timelines and the retweet networks in order to investigate the differences between bots and humans.
Keywords: Bot · Twitter · Social network · Word cloud
1 Introduction
Lately, the way news is produced, disseminated and reverberated in online social networks, like Twitter, Facebook and WhatsApp, represents a drastic paradigm shift when compared to the way it occurs in traditional media. These networks have also assumed an essential role in society: as users became an active source for the production and propagation of information, they started to have more control over the content. As a side effect of users' control over the content propagated in social networks, it is possible to observe an increase in the amount of low-quality content, fake news and attempts to manipulate discussions through automated profiles [7], often called robots, or bots, which is a growing phenomenon [1]. The investigation of the action of bots and automated accounts on social networks can bring a clearer understanding of the events occurring in the world and the way the population perceives and reacts to them. Thus, we have an important motivation for the development of methodologies that are able to characterize social network events taking into account the organicity of the profiles. Moreover, due to the agility and dynamism of the topics discussed in social networks, it is necessary that the computational tool is able to process a large volume of data.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 104–116, 2021. https://doi.org/10.1007/978-3-030-65351-4_9
In this work, we present a computational tool developed within a larger context of social network analysis, as part of the framework SARA (Automated System with Complex Networks and Analytics), presented in a previous work [3]. SARA takes the perspective of individuals (egos) as a starting point, identifies their communities considering topological aspects of the network, and analyzes the messages disseminated on topics of interest within these groups, the ego-communities. SaraBotTagger, added to SARA in the present work, is able to efficiently identify bots in Twitter, enabling a better understanding of discussions on Twitter about specific topics by distinguishing the organic human discussion from the discussion biased by automated accounts. SaraBotTagger is used to analyze discussions regarding two subjects in Twitter: STF (the Supreme Court of Brazil) and anxiety. In constructing the presented methodology, we sought to use only the metadata of the accounts and the network of users to which they belong, in order to preserve the agility and scalability of SARA to analyze a large volume of tweets and operate in real time in response to dynamic events, while still ensuring an efficient identification of bots through supervised machine learning techniques. We raised some research questions, which we intend to answer with this work: (RQI) Is it possible to obtain an efficient automated account detection model with only metadata of users? (RQII) Are topics with a greater potential for polarization, such as politics, more susceptible to the action of malicious bots? (RQIII) Does the organization of automated users on the network allow the observation of topological patterns? SaraBotTagger is built on the Random Forest [10] machine learning method, and one of the contributions of this work is to use only the metadata of the accounts.
In order to validate SaraBotTagger, we propose a strategy of quarantine of profiles by analyzing the suspension rate of accounts marked as suspicious by Twitter. Another important contribution is the application of SaraBotTagger to the investigation of the action of the automated accounts in two real-world contexts in social networks, allowing us to better understand how bot accounts behave in comparison to human accounts.
2 Related Work
Several works found in the literature aim to characterize coordinated actions of automated accounts with different goals. Ferrara et al. [7] present a taxonomy for the different categories of bots, defined by the authors as accounts automated by means of algorithms that seek to imitate human behavior and perform interactions and content production automatically. By default, these accounts may not play a harmful role for users of social networks, and may act as news aggregators or as marketing and entertainment pieces. However, there are bots that are created to deliberately harm the network and the discussion, hindering political discussions and spreading false news, malware and other harmful content. This is a clear motivation for methodologies and tools to identify this class of bots, which is the subject of many works in the literature [7,13]. As in this work, Yang et al. [14] present an approach that uses supervised machine learning and the random forest technique, focusing on scalability and generalization, which allows the identification of account patterns different from those used in the training of the model. Considering the large amount of data derived from social networks, a property that can be fundamental in methods for identifying bots is their ability to offer a quick response, so that they can be executed in near real time. Therefore, the computational time for executing the method must be low and the use of APIs that provide access to the social network data must be minimized, so that any request limits that the API may impose are avoided as much as possible. In this sense, it can be a reasonable decision to use only the metadata of an account, which can give good insight into its likelihood of being a bot. This is a particularly interesting strategy in the context of this work, which seeks to develop a methodology for identifying bots that integrates with the SARA framework for quick and scalable analysis of events on social networks. It follows the idea of Rossi et al. [12], who also develop a bot detection model that uses only metadata of Twitter accounts, trained on a database manually labeled from Cresci's spam bot datasets [5], and that shows accuracy and recall values above 80%.
3 Methodology
The methodology presented in this work was developed to fit in the context of the SARA Framework, which has already been presented in a previous work [3] and aims to quickly analyze large-scale events occurring on social networks. SARA allows the analysis of the discussion of specific topics in social networks from a more local perspective, detecting the ego-community around actors considered relevant with respect to centrality or other subjective criteria. From that, it is possible to investigate the content and the sentiment of the messages propagated in regions of the social network with different, and possibly antagonistic, views on a specific subject. The Twitter API, through the access points user lookup and Tweet lookup, allows the information of only 86400 users or tweets to be obtained per day. Such a limitation prevents the analysis of a large number of users in a viable time. In the methodology proposed in this work, we try to circumvent this limitation by training a supervised machine learning method using only the metadata of user profiles, thus avoiding the need for any information beyond what is obtained with a real-time access point. The 24 attributes used to train the model were generated considering the approaches in the literature [6,12–14]: is account verified, is image default, is background image default, profile standard, followers, followings, number of total favorites, number of total tweets, age of account, size of description, following/followers ratio, reputation, fav/following ratio, increase of followings, increase of followers per day, tweets per day, digits in screen name, digits in name, year of creation, increase of favorites per day, bot in description, size of screen name, fav/followers ratio, size of name.
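Several of the attributes listed above are derived from raw profile metadata. The sketch below shows how a few of them could be computed; the input field names are hypothetical, and the formula for reputation, followers / (followers + followings), is an assumption, since the text does not define it.

```python
from datetime import date


def derive_features(profile, today):
    """Compute a few derived attributes from raw account metadata."""
    age_days = max((today - profile["created"]).days, 1)  # avoid division by zero
    followers = profile["followers"]
    followings = profile["followings"]
    screen_name = profile["screen_name"]
    return {
        "age_of_account": age_days,
        "year_of_creation": profile["created"].year,
        "tweets_per_day": profile["total_tweets"] / age_days,
        "following_followers_ratio": followings / max(followers, 1),
        # Assumed definition; not given in the text.
        "reputation": followers / max(followers + followings, 1),
        "digits_in_screen_name": sum(c.isdigit() for c in screen_name),
        "size_of_screen_name": len(screen_name),
        "bot_in_description": "bot" in profile["description"].lower(),
    }


# Hypothetical profile, for illustration only.
profile = {
    "created": date(2020, 1, 1),
    "followers": 50,
    "followings": 150,
    "total_tweets": 1000,
    "screen_name": "user12345",
    "description": "I am a Bot",
}
features = derive_features(profile, today=date(2020, 4, 10))
print(features)
```

All of these values come from a single user object, so no additional API requests are needed per account.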
After a broad review of the literature, we identified that supervised machine learning approaches present outstanding results for this type of problem [2,10], and we decided to adopt this kind of strategy for the methodology in this work. Moreover, different classifiers were compared in preliminary tests, and we verified that the best results were reached by Random Forest. The experiments were performed with scikit-learn [11], using Grid Search for parameter evaluation and the implementation of the Random Forest Classifier, a meta estimator that combines the results of a series of decision trees (a forest) to predict the class of an object. The classifier is parameterized with n_estimators = 100 (the number of trees) and min_samples_split = 2. A very important step in the methodology of this work is the construction of the dataset used to train, evaluate and validate the proposed model, which is based on two groups: (I) datasets of users collected from discussions in two specific contexts (football and politics) and automatically labeled by Botometer [13], a widely used tool already available in the literature, which uses machine learning to generate a model that considers more than 1000 characteristics of user profiles (here called bases of type Botometer), and (II) datasets already used as benchmarks in other related works (here called bases of type Benchmark). A list of these datasets, as well as the number of users labeled as human and bot, can be seen in Table 1. It is important to highlight that, even in the Benchmark databases, only the metadata of users was considered.

Table 1. Datasets considered for training, evaluation and validation of the model proposed in this work.

Name                           Humans   Bots    Type
cresci-rtbust-2019 [9]         340      353     Benchmark
political-bots-2019 [14]       0        62      Benchmark
botometer-feedback-2019 [14]   380      139     Benchmark
verified-2019 [14]             1987     0       Benchmark
pronbots-2019 [14]             0        17882   Benchmark
vendor-purchased-2019 [14]     0        1087    Benchmark
botwiki-2019 [14]              0        698     Benchmark
cresci-stock-2018 [5]          6174     7102    Benchmark
midterm-2018 [14]              8092     42446   Benchmark
gilani-2017 [8]                1413     1090    Benchmark
Total                          18386    70859   Benchmark
Football                       141305   995     Botometer
Politics                       51691    538     Botometer
Total                          192996   1533    Botometer
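The classifier configuration described above can be sketched with scikit-learn as follows. The data is synthetic and only mimics the shape of the 24 metadata attributes, and the parameter grid is an assumption, since the text reports the selected values (100 trees, minimum split size of 2) but not the grid that was searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 24 metadata attributes (not the real datasets).
X, y = make_classification(
    n_samples=300, n_features=24, n_informative=8, random_state=0
)

# Hypothetical grid around the reported values.
param_grid = {"n_estimators": [50, 100], "min_samples_split": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_)

# Final model with the parameters reported in the text.
model = RandomForestClassifier(
    n_estimators=100, min_samples_split=2, random_state=0
).fit(X, y)
print(model.predict(X[:5]))
```

The fitted model also exposes `feature_importances_`, which is one way to obtain a ranking of the attributes.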
The users from the datasets were labeled using Botometer considering the CAP score (Complete Automation Probability), with a threshold of 0.5 separating humans from bots. From the datasets of Table 1, two datasets were built for training the model proposed in this work: one with unbalanced and another with balanced numbers of humans and bots. For the unbalanced-training dataset, users marked as bots were obtained by combining all the bots of the Benchmark type bases (70859 users) with all the bots of the Botometer type bases (1502 unique users), totaling 72361 bots. The users marked as human were obtained by combining all humans from the Benchmark type bases (18386) with 1502 humans randomly selected from the Botometer type bases, totaling 19881 humans (repeated entries were discarded). For the balanced-training dataset, 19881 human and bot users were randomly selected from all the datasets presented in Table 1. The model proposed in the present work was validated and evaluated considering real-world scenarios, as will be discussed in Sect. 4. Especially in these cases, one of the greatest difficulties in the design of methods for identifying bots in social networks is the validation of the model, since social networks are often not transparent regarding the criteria used to consider an account a bot. Thus, we performed a long-term quarantine assessment of some accounts marked as potential bots and examined, at some checkpoints, whether these accounts had been suspended by Twitter. The actors involved in the discussions were also evaluated from a network perspective, by analyzing a retweet network from the point of view of the importance of the nodes and other structural properties of the network. A topic modeling approach based on the Latent Dirichlet Allocation (LDA) algorithm [4] was also applied to the descriptions of the studied accounts and to the content of the tweets, using the SARA Framework.
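The labeling and balancing steps can be sketched as follows. The sketch assumes that a CAP score greater than or equal to the threshold marks a bot (the text does not state whether the comparison is strict) and that balancing draws the same number of users from each class; the user identifiers and scores are made up.

```python
import random


def label_by_cap(cap_scores, threshold=0.5):
    """Map each user id to 'bot' or 'human' according to its CAP score."""
    return {
        user: ("bot" if cap >= threshold else "human")
        for user, cap in cap_scores.items()
    }


def balanced_sample(labels, per_class, seed=0):
    """Randomly draw the same number of users from each class."""
    rng = random.Random(seed)
    sample = []
    for cls in ("human", "bot"):
        users = sorted(u for u, label in labels.items() if label == cls)
        sample.extend(rng.sample(users, per_class))
    return sample


# Made-up CAP scores, for illustration only.
labels = label_by_cap({"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.2, "e": 0.7})
sample = balanced_sample(labels, per_class=2)
print(labels)
print(sample)
```

A fixed seed keeps the random selection reproducible across training runs.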
4 Experiments and Discussion
4.1 Application of the Models in Benchmark Datasets
The data defined from the datasets in Table 1 were used to train and parameterize the Random Forest model presented in Sect. 3, in order to assess its applicability in real-world scenarios. The result of this experiment, considering 10-fold cross-validation, is presented in Table 2. We can observe that the unbalanced-training model presents slightly better results than the balanced-training model. However, the balanced-training model presents a True Positive rate higher than 90%, which is consistent with other state-of-the-art works in the literature [12,13].

Table 2. Application of the Random Forest classifier considering the datasets balanced-training and unbalanced-training.

Model                 Accuracy  Precision  Recall  F1 score  MCC
balanced-training     0.948     0.931      0.935   0.948     0.897
unbalanced-training   0.953     0.964      0.971   0.970     0.861
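For reference, the five metrics reported in Table 2 can all be derived from the entries of a binary confusion matrix. The sketch below uses made-up counts for illustration; they are not taken from the experiments.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1 and MCC from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the true positive rate
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 90 bots found, 10 false alarms, 5 bots missed, 95 humans kept.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

Unlike accuracy, the Matthews correlation coefficient (MCC) takes all four cells of the confusion matrix into account, which makes it informative when the classes are unbalanced.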
It is possible to see from Fig. 1 that accounts labeled as bots were suspended significantly more frequently than accounts labeled as human and random accounts. It is also possible to note that the balanced-training model identifies accounts suspended by Twitter more frequently. In general, the proposed models achieved a better result in indicating suspicious accounts that were later deleted in the discussion regarding anxiety. This may be related to the highly polarized nature of subjects such as the STF discussion. For the remainder of the work, the model trained with balanced data (balanced-training model) is considered for the analysis of the real-world subjects. Combining the results presented in Fig. 1 with the results in Table 2, it is possible to answer RQI. The Random Forest model is able to effectively identify bots in Twitter considering the benchmark and tagged datasets using only metadata of users. The quarantine validation also allows us to state that the trained models can be applied to real-world contexts, identifying bots with a significantly higher frequency than a random classifier. To complement the analysis of the model considered for the classification of profiles as humans or bots, it is important to highlight that total tweets, followers, total favorites, followings and year creation were the top ranked features for the balanced-training model, and that total tweets, year creation, followers, tweets day and total favorite were the top ranked features for the unbalanced-training model.
4.2 Application of the Models to Real-World Contexts
One of the main motivations for the construction of the methodology proposed in this work is its integration into the SARA framework for the analysis of large-scale events in real time and, therefore, it is important to analyze the model's behavior in events of this nature. In this sense, after training the Random Forest classifier as previously described, case studies were conducted on two different subjects: the first related to discussions about the Supremo Tribunal Federal (STF), the Supreme Court of Brazil, and the second related to anxiety. 914,914 tweets about the STF and 528,898 tweets about anxiety were collected between May and August 2020, totaling more than 1.4 million tweets. The subjects were chosen for their importance on social networks in the period studied: the STF is a topic that generates a very intense and polarized debate on social networks in Brazil and, in a different context, a great discussion about anxiety can also be observed on social networks, especially in the period studied, which is in the middle of the COVID-19 pandemic. First, it is important to observe that the task of identifying bots in Twitter is very hard, and we must always keep in mind the limitations of any approach that considers only the information made freely available by the Twitter API, which leads to methodologically flawed models, as many more effective tools are used by the Twitter team, like captchas and the verification of the reach of the accounts that are biasing public debate¹. Thus, the Twitter development team makes a distinction between harmless accounts, which do not reach a substantial number of timelines, and malicious bots that artificially amplify conversations in Twitter, which are penalized with suspension². In this sense, we propose to evaluate and validate the models presented in this work by comparing the bots tagged by the Random Forest models to the accounts suspended by Twitter. A sample of the accounts marked as bots was placed under observation, like a quarantine, so that it is possible to investigate whether Twitter itself, through its verification system, would suspend them according to its own criteria. Within the range considered, some checkpoints were defined for the assessment of quarantined accounts. Figure 1 shows the results of the account suspension assessment at the five defined checkpoints.
1 https://blogs.oglobo.globo.com/sonar-a-escuta-das-redes/post/amp/entrevistatermo-robo-acaba-sendo-usado-de-forma-generalizada-diz-executivo-do-twitter.html.
Fig. 1. Evolution of the number of suspended accounts: (a) STF; (b) anxiety.
Analysis of the Content of Tweets and Users' Descriptions. First, a high-level analysis of the users was performed, considering only the descriptions of their Twitter profiles. The profile descriptions of humans and bots were analyzed using the Latent Dirichlet Allocation (LDA) algorithm, resulting in the word clouds presented in Fig. 2. It is worth noting that only a fraction of the accounts display a description text (54.40% and 73.18% for bots and humans, respectively, tweeting about the STF; and 79.02% and 88.83% for bots and humans, respectively, tweeting about anxiety) and that most descriptions are in Portuguese and were translated to English for this work. The word clouds show substantial differences both when humans and bots are compared and when the two subjects are compared. The profiles of the users that tweet about the STF are frequently described with words that refer to political and ideological views (like "conservative", "christian", "family", "god"), especially for the bots. Some diversification, although small, can be observed in the human descriptions (with words like "university professor", "lawyer", "public management"). This may indicate that users that tweet about the STF are more engaged in political discussions and more inclined to impose their political/ideological views in these discussions. On the other hand, users that tweet about anxiety are rarely described with terms related to care and psychological aspects (the exceptions are terms like "nursing" and "psychologist", present in the humans' word cloud). In general, users that tweet about anxiety use many terms indicating fan accounts (the most obvious is the term "fan account", but many artists' and bands' names, especially in the human accounts, also indicate this, like "one direction", "taylor swift" and "kate perry"). It is also interesting to notice that the descriptions of the bots that tweet about anxiety show many terms promoting their Instagram pages (like "follow insta" and "insta page") and terms indicating that the profiles belong to inoffensive users that only retweet specific terms (like "retweets contains", "contains word" and "bot created").
Fig. 2. Word cloud considering the topics generated by LDA applied to the description of the accounts: (a) STF (bots); (b) STF (humans); (c) anxiety (bots); (d) anxiety (humans).
² https://blog.twitter.com/en_us/topics/company/2020/bot-or-not.html
A deeper investigation of the users was performed by analyzing the content of a sample of the tweets in the timelines of a sample of 4800 profiles, humans and bots, that tweet about STF and anxiety (2200 in each subset). The users were analyzed using the SARA framework, which performed pre-processing, topic modeling and the generation of word clouds. Figure 3 shows the word clouds obtained by applying LDA to the tweets. Again, it is important to mention that the terms were originally written in Portuguese and were translated to English in this work.
Fig. 3. Word cloud considering the topics generated by LDA applied to the timeline of sampled accounts: (a) STF (bots); (b) STF (humans); (c) anxiety (bots); (d) anxiety (humans).
Some observations made for the word clouds of the account descriptions also hold for the word clouds of the users' timelines, presented in Fig. 3. The accounts that tweet about the STF are much more engaged with political content, and the tweets from their timelines usually refer to political terms and political actors. Some terms related to the COVID-19 pandemic can also be observed, especially in the humans' cloud, but ideological terms related to this topic can also be seen (like "chinese virus"). The word clouds obtained for the timelines of accounts that tweet about anxiety are much more diverse, like the word clouds for the descriptions, but with distinct terms. Some terms that may be related to anxiety appear (like "cry" and "disgrace"). A clear difference between the behavior of humans and bots can be noticed in the great number of artists' names in the humans' cloud.
From the analysis of the word clouds, RQII can be partially answered. The users that tweet about the STF are more mono-thematic and very engaged in political discussion, describing themselves as conservatives and consuming right-wing content, and this behavior is reproduced by the bots. Thus, a right-wing and conservative political/ideological view can be substantially amplified by the bots, confusing the public debate on the subject. On the other hand, users that tweet about anxiety are very plural in the topics discussed, showing a lower potential to dictate a debate on Twitter.
Analysis of the Retweet Networks. The two subjects studied in this work were also investigated from a network perspective. For this, two retweet networks were created in which the nodes represent the users that tweeted about a subject (STF/anxiety) and the directed edges represent a retweet from one user to another. Table 3 shows some basic properties of the networks.
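The construction of a retweet network and the extraction of its weakly connected giant component can be sketched as below. This is only an illustration in plain Python: the edge direction (from the retweeting user to the original author), the removal of duplicate retweets and the function names are assumptions, since the text does not specify them.

```python
from collections import defaultdict

def build_retweet_network(retweets):
    """Nodes are users; one directed edge (retweeter -> author) per distinct pair."""
    nodes, edges = set(), set()
    for retweeter, author in retweets:
        nodes.update((retweeter, author))
        if retweeter != author:  # assumption: self-retweets are ignored
            edges.add((retweeter, author))
    return nodes, edges

def giant_weak_component(nodes, edges):
    """Largest weakly connected component, i.e. edge directions are ignored."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, best = set(), set()
    for start in nodes:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:  # depth-first traversal from `start`
            u = stack.pop()
            for v in neighbors[u]:
                if v not in component:
                    component.add(v)
                    stack.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Toy retweet log (hypothetical user ids); the repeated pair yields a single edge.
retweets = [("u1", "u2"), ("u3", "u2"), ("u4", "u5"), ("u1", "u2")]
nodes, edges = build_retweet_network(retweets)
giant = giant_weak_component(nodes, edges)
```

Restricting `nodes` and `edges` to the accounts labeled as bots (or as humans) before calling `giant_weak_component` gives the corresponding subgraph statistics.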
The retweet networks are very sparse, and this sparsity remains in the subgraphs that consider only humans or only bots. In the STF network, the sparsity of the complete network is consistent with the sparsity of the humans network, but the bots network is one order of magnitude sparser than the complete network, indicating that the bots are spread over the network and are not organized in a core of retweets. This observation is corroborated by the fact that the giant component of the bots network is very small, with only 46 nodes (0.5% of the bots), unlike that of the humans network, which has 115557 nodes (89.7% of the humans). In the anxiety network, the sparsity of the complete network is preserved in the humans network and, unlike in the STF network, the density of the bots network is greater than the density of the complete network, although the difference is not very substantial. Interestingly, the giant component of the bots network is also small, with only 208 nodes (2.9% of the nodes), unlike that of the humans network, which has 209368 nodes (87.5% of the nodes).

Table 3. Basic properties of the retweet networks considered in this work, where n is the number of nodes, m is the number of edges, d is the density, n_bots is the number of users labeled as bots, n_humans is the number of users labeled as humans and n_G is the number of nodes in the giant (weakly connected) component.

Network                      n        m        d              n_bots/n   n_humans/n   n_G
Complete network (STF)       137653   621205   3.27 × 10^−5   0.0648     0.9351       124636
Humans network (STF)         128725   597083   3.60 × 10^−5   0.0        1.0          115557
Bots network (STF)           8928     328      4.11 × 10^−6   1.0        0.0          46
Complete network (anxiety)   246159   283312   4.67 × 10^−6   0.0282     0.9717       217523
Humans network (anxiety)     239204   272776   4.76 × 10^−6   0.0        1.0          209368
Bots network (anxiety)       6955     339      7.01 × 10^−6   1.0        0.0          208

The fragmentation of the bots in the networks is also evident when we investigate the community structure of the networks. For the STF retweet network, the community detection algorithm identified 62 communities, and the proportion of bots among all users in the giant component (0.0678) is roughly preserved in all communities of the network. The standard deviation of the proportion of bots across the 62 communities identified for the STF network is 0.1208, and we cannot affirm that there is a typical community for the bots. For the anxiety network, the algorithm identified 139 communities and none of them can be considered a typical community for the bots, since the proportion of bots in the giant component (0.0274) is roughly preserved in all communities (with a standard deviation of 0.0354). The fragmentation of the bots in the networks, observed through the density and the communities, allows us to state that the bots largely do not retweet one another, but it does not reveal whether the bots are retweeted by humans or whether they play special roles in the network. To better understand who the most important users are in terms of retweets, the retweet networks were investigated regarding the degrees (in and out) and the Page Rank centralities of the nodes; Table 4 shows some of these results.
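The densities reported in Table 3 can be checked with a short computation. The sketch below assumes the usual definition of density for a directed graph without self-loops, d = m / (n(n − 1)); the table does not state the formula, but the values it reports agree with this definition to within rounding.

```python
def directed_density(n, m):
    """Fraction of the n * (n - 1) possible directed edges (no self-loops) present."""
    return m / (n * (n - 1))

# (n, m, reported density) copied from Table 3.
TABLE3 = {
    "Complete network (STF)": (137653, 621205, 3.27e-5),
    "Humans network (STF)": (128725, 597083, 3.60e-5),
    "Bots network (STF)": (8928, 328, 4.11e-6),
    "Complete network (anxiety)": (246159, 283312, 4.67e-6),
    "Humans network (anxiety)": (239204, 272776, 4.76e-6),
    "Bots network (anxiety)": (6955, 339, 7.01e-6),
}

computed = {name: directed_density(n, m) for name, (n, m, _) in TABLE3.items()}
for name, value in computed.items():
    print(f"{name}: {value:.3e}")
```

For the STF data, the bots subgraph is roughly eight times sparser than the complete network, which is the "one order of magnitude" gap discussed in the text.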
Again, only the giant component of each network was considered in this study. Table 4 shows that the mean indegree and the mean Page Rank are substantially greater for the humans than for the bots in both networks. The medians, however, are very similar, which indicates that the bots behave as typical humans and that a few highly central humans raise the mean. Hence, the bots do not play central roles in the networks, but they are able to amplify certain views or leanings in order to bias the public perception of the discussion.

Table 4. Basic statistics about the centralities observed for the retweet networks.

                    Indegree         Outdegree        Page Rank
                    Mean    Median   Mean    Median   Mean           Median
All (STF)           4.89    0.00     4.89    1.00     8.02 × 10^−6   3.25 × 10^−6
Humans (STF)        5.20    0.00     5.09    2.00     8.33 × 10^−6   3.25 × 10^−6
Bots (STF)          0.64    0.00     2.17    1.00     3.80 × 10^−6   3.25 × 10^−6
All (anxiety)       1.22    0.00     1.22    1.00     4.59 × 10^−6   1.54 × 10^−6
Humans (anxiety)    1.24    0.00     1.22    1.00     4.66 × 10^−6   1.54 × 10^−6
Bots (anxiety)      0.49    0.00     1.14    1.0      2.23 × 10^−6   1.54 × 10^−6
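The gap between means and medians in Table 4 is what one would expect from a heavy-tailed retweet network, where a few accounts receive most retweets. The toy example below (hypothetical nodes, edges pointing from the retweeter to the retweeted account by assumption) reproduces the pattern: equal mean indegree and outdegree, a median indegree of 0 and a median outdegree of 1. It also checks that, assuming Page Rank scores are normalized to sum to 1, the mean Page Rank over a component with n_G nodes is simply 1/n_G, which matches the "All" rows of Table 4.

```python
from statistics import mean, median

def in_out_degrees(nodes, edges):
    """In- and out-degree of every node for a list of directed edges (u, v)."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

# Toy network: four users retweet a single hub account.
nodes = ["hub", "a", "b", "c", "d"]
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub")]
indeg, outdeg = in_out_degrees(nodes, edges)

mean_in, median_in = mean(indeg.values()), median(indeg.values())
mean_out, median_out = mean(outdeg.values()), median(outdeg.values())

# Mean Page Rank implied by the giant component sizes n_G of Table 3.
mean_pagerank_stf = 1 / 124636
mean_pagerank_anxiety = 1 / 217523
```

Because every edge contributes one unit of indegree and one of outdegree, the two means always coincide over the whole network, as in the "All" rows of the table.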
Taking a closer look at the rankings of the most central users in both retweet networks, we observe that, for the STF network, all users in the top 100 for indegree and Page Rank are human, and 97 users in the top 100 for outdegree are human. For the anxiety retweet network, a similar result is observed for the indegree and Page Rank lists: 99 of the top 100 users by indegree are human, and all of the top 100 users by Page Rank are human. A slightly different result is observed for the top 100 by outdegree, where 13 users are bots. A closer inspection shows that these users are self-declared bots that monitor specific terms on Twitter (like "quarantine"). Another 19 users are dedicated to automatically retweeting specific terms, although they do not explicitly describe themselves as bots. The analysis of the networks allows us to answer RQIII since, for the contexts investigated, the bots are spread over the network and, although they are not organized in strong large cores, they behave like typical users with respect to the topology of the network. As we could not observe any bot with high structural relevance in the network, the results suggest that the bots retweet content from distinguished people in order to validate their views and positions. However, a further investigation of other types of networks, such as following networks, could bring even more understanding in this sense. The results presented in Tables 3 and 4 and in Figs. 2 and 3 show that the users involved in the discussion about the STF are very engaged in political debates, although they do not retweet one another, but rather users that are notably relevant to political discussions in Brazil.
Thus, when the word clouds are combined with the network analysis, we can establish a relation between RQII and RQIII: the bots do not necessarily need to produce original content or to be topologically central in order to be harmful to a public discussion. Instead, they only need to amplify certain discourses of distinguished actors to bias the public debate.
5 Conclusions and Future Directions
In this work we present SaraBotTagger, a tool for identifying bots on Twitter that uses a Random Forest classifier based only on the metadata of the accounts. We also propose a validation methodology that verifies whether the classifier is able to identify bots that are actually suspended by Twitter. The proposed methodology is able to efficiently identify bots, with high accuracy rates (answering RQI). We apply our methodology to two real-world discussions, regarding the STF (the Supreme Court of Brazil) and anxiety. The investigation of the descriptions and timelines of the accounts shows that accounts that tweet about the STF reveal a strong engagement with political content, unlike the users that tweet about anxiety, who cover a more diverse range of topics. The retweet networks of users involved in the discussions about the STF and anxiety reveal that the bots are spread all over the network and are not concentrated in strong cores. The combination of the analysis of the word clouds and the networks allows us to answer RQII and RQIII, since the engagement of the users that tweet about the STF with political themes and their fragmentation over the network could be used to influence the public debate about many subjects of public interest, which could be dangerous for a healthy discussion on social networks. The methodology presented in this work is still evolving. In future work we intend to incorporate more complex strategies that combine the metadata of the users and the content they produce and consume with other network strategies, such as the evaluation of the cascades produced by the tweets, in order to circumvent some limitations of the present methodology, especially when dealing with hybrid bots and training more general models.
Acknowledgement. The authors would like to thank the Brazilian research funding agencies Capes and CNPq for their support of this work.
References
1. Abokhodair, N., Yoo, D., McDonald, D.W.: Dissecting a social botnet: growth, content and influence in Twitter. Association for Computing Machinery, New York (2015)
2. Alothali, E., Zaki, N., Mohamed, E., Alashwal, H.: Detecting social bots on Twitter: a literature review. In: 2018 International Conference on IIT, pp. 175–180 (2018)
3. Barbosa, C.M.G., Felix, L.G.D.S., Xavier, C.R., Vieira, V.D.F.: A framework for the analysis of information propagation in social networks combining complex networks and text mining techniques. In: Proceedings of the 25th Brazilian Symposium on Multimedia and the Web, WebMedia 2019, pp. 401–408. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3323503.3360289
4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
5. Cresci, S., Lillo, F., Regoli, D., Tardelli, S., Tesconi, M.: Cashtag piggybacking: uncovering spam and bot activity in stock microblogs on Twitter. ACM Trans. Web (TWEB) 13(2), 1–27 (2019)
6. Davis, C.A., Varol, O., Ferrara, E., Flammini, A., Menczer, F.: BotOrNot: a system to evaluate social bots. In: Proceedings of the 25th International Conference Companion on World Wide Web, pp. 273–274 (2016)
7. Ferrara, E., Varol, O., Davis, C., Menczer, F., Flammini, A.: The rise of social bots. Commun. ACM 59(7), 96–104 (2016)
8. Gilani, Z., Farahbakhsh, R., Tyson, G., Wang, L., Crowcroft, J.: Of bots and humans (on Twitter). In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 349–354 (2017)
9. Mazza, M., Cresci, S., Avvenuti, M., Quattrociocchi, W., Tesconi, M.: RTbust: exploiting temporal patterns for botnet detection on Twitter. In: Proceedings of the 10th ACM Conference on Web Science, pp. 183–192 (2019)
10. Orabi, M., Mouheb, D., Al Aghbari, Z., Kamel, I.: Detection of bots in social media: a systematic review. Inf. Process. Manag. 57(4), 102250 (2020)
11. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
12. Rossi, S., Rossi, M., Upreti, B., Liu, Y.: Detecting political bots on Twitter during the 2019 Finnish parliamentary election. In: Proceedings of the 53rd Hawaii International Conference on System Sciences (2020)
13. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online human-bot interactions: detection, estimation, and characterization. In: Eleventh International AAAI Conference on Web and Social Media (2017)
14. Yang, K.C., Varol, O., Hui, P.M., Menczer, F.: Scalable and generalizable social bot detection through data selection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 1096–1103 (2020)
Graph Auto-Encoders for Learning Edge Representations
Virgile Rennard¹, Giannis Nikolentzos¹,²(B), and Michalis Vazirgiannis¹,²
¹ École Polytechnique, Palaiseau, France
[email protected]
² Athens University of Economics and Business, Athens, Greece
{nikolentzos,mvazirg}@aueb.gr
Abstract. Graphs evolved as very effective representations of different types of data including social networks, biological data or textual documents. In the past years, significant efforts have been devoted to methods that learn vector representations of nodes or of entire graphs. But edges, representing interactions between nodes, have attracted less attention. Surprisingly, there are only a few studies that focus on generating edge representations or deal with edge-related tasks such as the problem of edge classification. In this paper, we propose a new model (in the form of an auto-encoder) to learn edge embeddings in (un)directed graphs. The encoder corresponds to a graph neural network followed by an aggregation function, while a multi-layer perceptron serves as our decoder. We empirically evaluate our approach in two different tasks, namely edge classification and link prediction. In the first task, the proposed model outperforms the baselines, while in the second task, it achieves results that are comparable to the state-of-the-art.
Keywords: Edge embeddings · Representation learning · Graph mining

1 Introduction
In the past years, machine learning on graphs has received considerable attention. This was mainly driven by the ubiquity of graph-structured data. Indeed, such data arise in many application domains such as chemoinformatics [9], physics [3], and natural language processing [17]. Several of the associated tasks focus on different components of graphs such as nodes and edges. For instance, one might wish to predict the biological state of a protein in a protein-protein interaction network, to recommend new friendship relationships in social networks, or even to discover the type of the relationships between entities in a knowledge graph. To that end, the tasks of learning and analyzing large-scale real-world graph data are at the core of several important applications, but also present many challenges.
The major challenge in machine learning on graphs is how to incorporate information about the structure of the graph in the learning model. For example, in the case of friendship recommendations in social networks (also known as the link prediction problem), in order to determine whether two unlinked users are similar, we need to obtain an informative representation of the users and their proximity, one that potentially is not fully captured by graph statistics (e.g., centrality criteria) [6] and other handcrafted features extracted from the graph [14]. Recently, there has been an intense research effort to develop algorithms that learn graph representations, also known as embeddings, that encode and capture the structural information of the underlying graph. More precisely, a graph embedding methodology maps elements of the graph into a low-dimensional vector space, while the graph structure is preserved. Note that in most cases, the feature learning approach is purely unsupervised. That way, the obtained embeddings can further be used in any downstream machine learning task, such as classification and clustering.
In the past years, a significant amount of effort has been devoted to node embedding approaches, i.e., algorithms that map the nodes of a graph into a low-dimensional space [5,10,18,24,25]. However, the same does not apply to edge embedding approaches and, more generally, to approaches that project higher-order structures into low-dimensional spaces. Surprisingly, there are only a few studies that focus on generating edge representations or deal with edge-related tasks such as the problem of edge classification. It should be mentioned here that we can produce a representation for an edge by combining the (node) embeddings of its endpoints. This approach has been discussed in [10]. However, it is unlikely that such an approach can produce representations of the same quality as those generated by an approach that operates directly on edges.
In this paper, we propose a new model to learn edge embeddings in (un)directed graphs. The proposed architecture is an auto-encoder. The encoder corresponds to a graph neural network followed by an aggregation function, while a multi-layer perceptron serves as our decoder network. We empirically evaluate our approach in two different tasks, namely edge classification and link prediction. In the first task, the proposed model outperforms the baselines, while in the second task, it achieves results that are comparable to the state-of-the-art.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 117–129, 2021. https://doi.org/10.1007/978-3-030-65351-4_10
2 Related Work
As mentioned above, a large body of work has focused on node embedding algorithms. Early node embedding approaches perform random walks on the graph, and treat the emerging walks as sentences in a special language. They then capitalize on ideas from natural language processing such as the Skipgram model [15] to generate node representations. This family of approaches includes DeepWalk [18], which simulates simple random walks, and node2vec [10], which performs biased walks. LINE does not simulate random walks, but optimizes an objective function that preserves the first- and second-order proximities [23]. GraRep is an approach that preserves high-order proximities by applying SVD to high-order proximity matrices [5]. Other embedding approaches have also been proposed that capture different properties of graphs such as their community structure [25]. Other approaches, such as edge2vec [8], produce node representations which capture edge semantics. More specifically, edge2vec uses an Expectation-Maximization approach to train an edge-type transition matrix, and then uses a stochastic gradient descent model to learn node embeddings. Interestingly, some works have leveraged deep learning techniques to learn node representations. One notable example of this line of research is the variational graph autoencoder [12], which utilizes a graph neural network encoder and a simple inner product decoder. Variants of this model have also been proposed [19]. A detailed review of node embedding algorithms is beyond the scope of this paper; we refer the interested reader to [4].
The problem of node classification has been studied extensively in recent years. Both unsupervised and supervised node representation learning algorithms have been investigated to that end. In the former case, a node embedding algorithm is first applied to produce node representations, and then these representations are fed to a standard classifier (e.g., logistic regression) [10,18,23]. In the latter case, the graph is directly passed on to a neural network which classifies the nodes [11,13]. These neural networks are typically end-to-end trainable. On the other hand, the problem of edge classification has received considerably less attention. Edge classification is generally considered to be harder than node classification since it is not straightforward how to apply the homophily principle to the case of edges.
To the best of our knowledge, there are only a few approaches that have dealt with the edge classification problem, and those works are the closest to ours. Aggarwal et al. proposed in [2] a set of neighborhood-based methods to perform edge classification in networks. To account for the high complexity of these methods, the authors proposed the use of probabilistic, min-hash based data structures. The difference between our algorithm and this approach is that we use deep learning techniques to deal with the edge classification task instead of relying on graph statistics. Abu-El-Haija et al. proposed in [1] a model that learns edge representations where each edge is modeled as a function of the nodes. However, in contrast to our work, the authors do not use graph neural networks to generate node representations, and they employ a different objective function than ours. Zhou et al. proposed in [26] an approach for generating edge representations which takes into account the density-based local characteristics of edges and subgraphs. The major difference between this approach and ours is that we use an entirely different architecture (i.e., message passing) to obtain edge representations as well as a different loss function. To identify the relationships between WeChat users, Song et al. proposed in [22] LoCEC, an algorithm that partitions the users' ego networks into communities, classifies these communities using a convolutional neural network, and predicts a relationship type for each edge of the network based on the classification results.
3 Preliminaries
Let $G = (V, E)$ be a directed graph consisting of a set $V$ of vertices and a set $E$ of edges between them. We will denote by $n = |V|$ the number of vertices and by $m = |E|$ the number of edges. A graph $G$ can be represented by its adjacency matrix $A$. The $(i, j)$th entry of $A$ is $w_{ij}$ if the edge $(v_i, v_j)$ between vertices $v_i$ and $v_j$ exists and its weight is equal to $w_{ij}$, and 0 otherwise. Let also $D$ denote the set of pairs of vertices that are not connected by an edge, i.e., $D = \{(v_i, v_j) : v_i, v_j \in V,\ v_i \neq v_j,\ (v_i, v_j) \notin E\}$. We assume that the vertices of the graph are annotated with continuous multidimensional attributes. We use $X \in \mathbb{R}^{n \times d}$, where $d$ is the number of attributes, to denote the graph's node information matrix, with each row representing the attributes of a vertex.
We next give the definition of the representation learning problem that we study in this paper. Our goal is to learn an embedding for each pair of nodes of a graph such that these embeddings capture as much structural information of the graph as possible. Formally, we aim to learn a function $f : E \cup D \to \mathbb{R}^{d}$ that maps pairs of nodes to feature representations that can then be utilized for some downstream prediction task such as edge classification or link prediction. Here $d$ is a hyperparameter specifying the number of dimensions of the produced embeddings. While we assume directed graphs, our approach is general and can be applied to any (un)directed, (un)weighted graph.
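The definitions above can be made concrete on a toy graph. The sketch below is purely illustrative: it represents vertices by integer indices, builds the weighted adjacency matrix A and enumerates the set D of ordered pairs of distinct vertices that are not joined by an edge. The function names and the example graph are assumptions made for this illustration.

```python
from itertools import permutations

def adjacency_matrix(n, weighted_edges):
    """A[i][j] = w_ij if the directed edge (v_i, v_j) exists, 0 otherwise."""
    A = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        A[i][j] = w
    return A

def non_edges(n, weighted_edges):
    """The set D: ordered pairs (i, j) with i != j and (i, j) not in E."""
    E = {(i, j) for i, j, _ in weighted_edges}
    return [pair for pair in permutations(range(n), 2) if pair not in E]

# Toy directed graph with n = 3 vertices and m = 2 weighted edges.
n = 3
weighted_edges = [(0, 1, 1.0), (1, 2, 2.0)]
A = adjacency_matrix(n, weighted_edges)
D = non_edges(n, weighted_edges)
```

For a directed graph without self-loops, |E| + |D| = n(n − 1), which is why D is far larger than E in sparse networks.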
4 The Proposed Algorithm: Edge Auto-Encoder (EAE)
Our model is mainly inspired by the work of Kipf and Welling [12], who designed an auto-encoder for learning node representations. Our encoder corresponds to a graph neural network followed by a function that combines node representations to produce edge representations or, more generally, representations for pairs of vertices. Our decoder is a multi-layer perceptron that determines if a pair of vertices is linked by an edge or not. We next present the two modules in detail.

4.1 Encoder
Our encoder corresponds to an instance of a well-known family of graph neural networks, known as message passing neural networks (MPNNs) [9]. These networks consist of a series of neighborhood aggregation layers. Each layer uses the graph structure and the node feature vectors from the previous layer to generate new representations for the nodes. Specifically, in our implementation, to update the representations of the nodes, we use the following neighborhood aggregation scheme:

$$H^{(t+1)} = f\big(\tilde{A}\, H^{(t)}\, W^{(t+1)}\big)$$

where $H^{(t)}$ is a matrix that contains the node representations of the previous layer, with $H^{(0)} = X$, $W^{(t)}$ is the matrix of trainable parameters of layer $t$, and $f$ is a non-linear activation function such as ReLU. Following [13], we normalize the adjacency matrix $A$ such that the sum of the weights of the incoming edges of each node is equal to 1, i.e., $\tilde{A} = D^{-1} A$ where $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} A_{ij}$.
Typically, an MPNN contains $T$ neighborhood aggregation layers, and generates a vector representation for each node of the input graph (i.e., the rows of matrix $H^{(T)}$). However, we are interested in producing representations for pairs of nodes instead. To achieve that, we combine the representations of the two nodes. We use a set of simple permutation invariant functions, most of which were originally introduced in [10], but in a different context. These functions are illustrated in Table 1. In all cases except the last one (i.e., concatenation), the dimensionality of the emerging edge representations is identical to that of the two combined node representations. In the case of concatenation, the dimensionality is twice as large as that of the two endpoints. Note also that all functions besides concatenation are symmetric. Concatenation thus seems to be well-suited for directed graphs where each edge is represented as an ordered set of nodes. In the case of an undirected graph, to apply this function, we first need to impose an order on the two nodes that form the edge. This can be achieved by a node ranking function such as the degree function or the function that assigns pagerank scores to the nodes. The main drawback of the concatenation function is that it increases the computational complexity of the model since it provides higher-dimensional representations than the other functions.

Table 1. Functions that combine the representations of two nodes $v_i$ and $v_j$. $f(\cdot)$ is a function that maps nodes to vector representations, while $|f(\cdot)|$ denotes the pointwise absolute value of the representation of a node.

Operator        Function
Sum             $f(v_i) + f(v_j)$
Average         $\frac{1}{2}\big(f(v_i) + f(v_j)\big)$
Weighted-L1     $|f(v_i) - f(v_j)|$
Weighted-L2     $\big(f(v_i) - f(v_j)\big)^2$
Concatenation   $[f(v_i) \,\|\, f(v_j)]$

4.2 Decoder
The second module of the proposed model is a multi-layer perceptron (MLP) and serves as the decoder of the proposed architecture. Typically, a decoder is expected to reconstruct the encoder's input. Our decoding scheme is different from that of standard graph auto-encoders, which compute the inner product between node representations. Instead, we use an MLP to predict if a pair of vertices is connected by an edge or not. The input to the MLP is the representation of a pair of nodes $(v_i, v_j)$ where $(v_i, v_j) \in E \cup D$. For instance, for the pair $(v_i, v_j)$, the input to the MLP is the vector $[H_i^{(T)} \,\|\, H_j^{(T)}]$ where $H_i^{(T)}$ is the $i$th row of matrix $H^{(T)}$ and $\|$ is the concatenation operator. The output of the MLP is a scalar. We apply the sigmoid function to the output to obtain the probability that there is indeed an edge between the two nodes.
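The forward pass of the encoder and decoder described in Sects. 4.1 and 4.2 can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the weights are random and untrained, the decoder has a single hidden layer, and the layer sizes, the helper names (`encode`, `decode`, `COMBINE`, `rand_matrix`) and the example graph are all hypothetical.

```python
import math
import random

def matmul(A, B):
    """Plain-Python product of two row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_normalize(A):
    """Return D^{-1} A with D_ii = sum_j A_ij; all-zero rows stay zero."""
    normalized = []
    for row in A:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized

def encode(A, X, weights):
    """Apply H^(t+1) = ReLU(A~ H^(t) W^(t+1)) once per weight matrix, H^(0) = X."""
    A_norm = row_normalize(A)
    H = X
    for W in weights:
        H = [[max(0.0, x) for x in row] for row in matmul(matmul(A_norm, H), W)]
    return H

# Pair operators of Table 1, applied to two node vectors a and b.
COMBINE = {
    "sum": lambda a, b: [x + y for x, y in zip(a, b)],
    "average": lambda a, b: [(x + y) / 2 for x, y in zip(a, b)],
    "weighted_l1": lambda a, b: [abs(x - y) for x, y in zip(a, b)],
    "weighted_l2": lambda a, b: [(x - y) ** 2 for x, y in zip(a, b)],
    "concat": lambda a, b: list(a) + list(b),
}

def decode(z, W1, b1, w2, b2):
    """One-hidden-layer MLP with ReLU, then a sigmoid giving an edge probability."""
    hidden = [max(0.0, sum(zi * w for zi, w in zip(z, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    logit = sum(h * w for h, w in zip(hidden, w2)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

def rand_matrix(rows, cols, rng):
    """Random (untrained) weights, for illustration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
A = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]  # toy directed graph
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # two attributes per node
H = encode(A, X, [rand_matrix(2, 4, rng), rand_matrix(4, 3, rng)])  # T = 2 layers
z = COMBINE["concat"](H[0], H[1])                         # pair (v_0, v_1)
W1, b1 = rand_matrix(6, 5, rng), [0.0] * 5
w2, b2 = [rng.uniform(-1.0, 1.0) for _ in range(5)], 0.0
p = decode(z, W1, b1, w2, b2)                             # probability of an edge
```

Swapping `"concat"` for one of the symmetric operators halves the decoder's input size, at the cost of losing the ordering of the pair.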
Table 2. Summary of the three datasets that were used in our edge classification experiments.

Dataset     # Nodes   # Edges   # Classes
Slashdot    82,144    549,202   2
Wikipedia   7,118     103,747   2
Epinions    119,217   841,000   2

4.3 Learning
Our loss function is the binary cross-entropy:

$$\mathcal{L} = -\sum_{e \in E} \log(\hat{y}_e) - \sum_{d \in D} \log(1 - \hat{y}_d)$$

where $\hat{y}_e$ is the output of the proposed model for the edge $e$, and similarly, $\hat{y}_d$ is the output of the proposed model for a pair of nodes $d = (v_i, v_j)$ where $v_i$ is not connected to $v_j$ by an edge. The above loss function is expensive to compute for large networks. Note that real-world networks are usually very sparse, i.e., $|E| \ll |D|$. Therefore, in practice, we use negative sampling to approximate the second term of the above loss function. That is, we sample a specific number of pairs of nodes and use them as negative instances.
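A minimal sketch of the loss with negative sampling is given below. It assumes uniform sampling of unconnected ordered pairs by rejection, an edge list without self-loops or duplicates, and a small clipping constant to avoid log(0); the sampling distribution, the helper names and the toy inputs are assumptions, since the text does not specify them.

```python
import math
import random

def sample_negative_pairs(n, edges, k, rng):
    """Draw k distinct ordered pairs (i, j), i != j, that are not edges."""
    edge_set = set(edges)
    if k > n * (n - 1) - len(edge_set):
        raise ValueError("not enough unconnected pairs to sample from")
    negatives = set()
    while len(negatives) < k:  # rejection sampling; cheap when the graph is sparse
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j and (i, j) not in edge_set:
            negatives.add((i, j))
    return sorted(negatives)

def bce_loss(pos_probs, neg_probs, eps=1e-12):
    """-(sum over edges of log y) - (sum over sampled non-edges of log(1 - y))."""
    pos = sum(math.log(max(p, eps)) for p in pos_probs)
    neg = sum(math.log(max(1.0 - p, eps)) for p in neg_probs)
    return -(pos + neg)

# Toy example: 4 nodes, 3 edges, and as many negatives as edges.
edges = [(0, 1), (1, 2), (2, 3)]
negatives = sample_negative_pairs(4, edges, len(edges), random.Random(0))
loss = bce_loss(pos_probs=[0.9, 0.8, 0.7], neg_probs=[0.2, 0.1, 0.3])
```

In a training loop the probabilities would come from the decoder, and the negative pairs could be redrawn at every epoch or fixed once; the text does not say which.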
5 Experimental Evaluation
We demonstrate the ability of the proposed model to learn meaningful embeddings in two tasks, namely edge classification and link prediction.

5.1 Edge Classification
We first evaluate the proposed model in the task of edge classification. The main objective of edge classification is to assign class labels to unlabeled edges. Each edge $e_i \in E$ has an associated class label $y_i$ and the goal is to learn a representation vector $f(e_i)$ of $e_i$ such that $e_i$’s label can be predicted as $\hat{y}_i = h(f(e_i))$ where $h(\cdot)$ is a classifier.

Datasets. We experimented with three large-scale edge classification datasets: Slashdot, Wikipedia and Epinions. All three datasets are directed networks and each edge is classified into one of two categories. Slashdot is a news website. The vertices of the network represent users and two users are linked by an edge if the first user liked (labeled as 1) or disliked (labeled as 0) a comment posted by the second user. Wikipedia is a network where nodes correspond to users, and two nodes are connected by an edge if the first user voted for (labeled as 1) or against (labeled as 0) the promotion of the second user to Wikipedia administrator. Epinions is a product review website in which the vertices represent users and the edges denote trust ratings between users. The three datasets are fairly diverse, and are thus well suited for demonstrating the generality of our approach across settings. Table 2 shows statistics of the three datasets.

Baselines. We compare the representations generated by the proposed model against two baseline methods. The first approach uses the DeepWalk algorithm [18] to produce node embeddings and then generates edge representations with an aggregation function such as the ones illustrated in Table 1. In preliminary experiments, we found that the function that concatenates the node representations outperforms the others on all three considered datasets. Therefore, we only report results associated with this operator. Our second baseline is ExtWF, the best-performing variant of the approach presented in [2]. This approach first compares nodes to each other using a weighted Jaccard coefficient which treats edges that belong to each class label separately. Then, given an edge $(v_i, v_j)$, its class label is determined as the majority label in $(S_k(v_i) \times S_k(v_j)) \cap E$, where $S_k(v)$ is the set that contains the top-$k$ most similar nodes to $v$.

Experimental Setup. For each dataset, we create a sparsely labeled graph by selecting 50% of its edges and eliminating their class labels. Therefore, 50% of the samples (i.e., edges) belong to the training set, and the rest of the samples belong to the test set. For all configurations, we train the neural networks for 2000 epochs. We use the Adam optimizer with a learning rate of 0.01. The encoder consists of 2 neighborhood aggregation layers. All our dense layers use ReLU activation. We sample $m$ negative pairs (i.e., pairs of nodes that are not linked by an edge), where $m$ is the number of edges of the network.
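The label-hiding step of the setup can be sketched as below. This is an illustrative reading of the description above rather than the authors' code: the helper name `split_edge_labels`, the seed, and the toy edge list are hypothetical. Note that, on this reading, only the class labels of the selected edges are removed; the edges themselves remain part of the graph.

```python
import random


def split_edge_labels(labeled_edges, hidden_fraction=0.5, seed=0):
    """Randomly hide the labels of a fraction of the edges.

    Returns (train, test): edges whose labels stay visible, and edges whose
    labels are hidden and must be predicted.
    """
    rng = random.Random(seed)
    shuffled = list(labeled_edges)
    rng.shuffle(shuffled)
    n_hidden = int(len(shuffled) * hidden_fraction)
    return shuffled[n_hidden:], shuffled[:n_hidden]


# Toy signed, directed edges: (source, target, label) with label in {0, 1}.
labeled_edges = [(i, (i + 1) % 10, i % 2) for i in range(10)]
train, test = split_edge_labels(labeled_edges)
print(len(train), "labeled edges for training,", len(test), "held out")
```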
The hyper-parameters we tune are: (1) the number of hidden units of the message passing layers ∈ {64, 128, 256, 512}, (2) the number of layers of the MLP ∈ {3, 4, 5}, (3) the number of hidden units of the MLP ∈ {128, 256}, and (4) the dropout rate ∈ {0.0, 0.2}. The edge representations produced by the different approaches are fed into a logistic regression classifier. We use the classifier to predict the class labels of the edges that belong to the test set, and we report the accuracies of the different approaches.

Results. We first study the impact of the different aggregation functions on the quality of the generated representations. We experiment with the following five functions (also shown in Table 1): (1) Sum, (2) Average, (3) Weighted-L1, (4) Weighted-L2, and (5) Concatenation. In the absence of node attributes, we assign a feature vector to each node that corresponds to its embedding produced by the DeepWalk algorithm. Table 3 reports the obtained accuracies on the three datasets with respect to the different functions for combining node representations. We observe that the choice of the aggregation function can significantly affect the resulting model’s performance. In fact, the model achieves the highest accuracy when the representations of the endpoints of each edge are concatenated into a single feature vector. We hypothesize that this is related to the nature of the three datasets. Indeed, all three networks are directed. Therefore, encoding the direction of the edges into the learnt representations is likely to have a positive impact on performance. With regard to the other aggregation functions, Weighted-L1, Weighted-L2 and Average performed comparably on the three datasets, while Sum is the worst-performing function on Wikipedia and Epinions (though not on Slashdot).

Table 3. Performance of the different aggregation functions in the edge classification task.

Aggr. function   Slashdot   Wikipedia   Epinions
Sum              0.774      0.754       0.842
Average          0.772      0.796       0.864
Weighted-L1      0.764      0.799       0.874
Weighted-L2      0.764      0.794       0.872
Concatenation    0.792      0.835       0.913

In previous studies, it has been observed that node features are of paramount importance in some graph classification tasks. In some cases, they are even more important than the graph structure itself [7]. Since we need to provide our model with some initial node attributes, we also study the impact of different types of attributes on its performance. The first type of attributes corresponds to local node features. Specifically, we annotate each node with four such features: its in-degree, its out-degree, its core number, and the average degree of its neighbors (both in-degree and out-degree). The second type of attributes is the vector representation of the node that is produced by the DeepWalk algorithm. We also concatenate the above two attribute vectors. The performance of the different node attribute vectors is shown in Table 4. We can see that initializing the node features with pre-computed node embeddings provides the best results. On the other hand, when the nodes are annotated with local features, the quality of the representations produced by the proposed model is slightly lower. Surprisingly, combining the two types of features leads to the worst-performing edge representations on Slashdot and Wikipedia. One reason may be that the dimensionality of the node embeddings generated by DeepWalk is much larger than the number of considered local features.
Furthermore, the local features have much larger magnitudes than the elements of the DeepWalk embeddings, which may have prevented the model from learning from those features as expected.

Table 4. Performance of the different types of node features in the edge classification task.

Node features                          Slashdot   Wikipedia   Epinions
DeepWalk embeddings                    0.792      0.835       0.913
Local features                         0.783      0.825       0.878
DeepWalk embeddings + local features   0.770      0.813       0.901

We next compare the two baseline methods presented above against the best-performing variant of our model, i.e., the one that annotates nodes with their embeddings as produced by DeepWalk and concatenates the node representations to generate edge representations. The obtained results are given in Table 5. Overall, the proposed model, EAE, achieved the best performance, while ExtWF produced the second best results. In fact, ExtWF outperformed the proposed model on one dataset (i.e., Slashdot). DeepWalk failed to perform on par with the other two approaches. One interesting observation is that our model outperforms the DeepWalk approach on all three datasets, by margins ranging from 1.2 accuracy points on Slashdot to 8.2 points on Epinions. This highlights the effectiveness of the proposed model, since the results indicate that its representational power does not come directly from the node attributes that are initialized with the DeepWalk embeddings.

Table 5. Performance of the proposed model and the baselines in the edge classification task.

Method     Slashdot   Wikipedia   Epinions
DeepWalk   0.780      0.819       0.831
ExtWF      0.802      0.825       0.882
EAE        0.792      0.835       0.913

5.2 Link Prediction
We next evaluate the proposed model in the task of link prediction. The main objective of link prediction is to predict the presence or absence of edges between nodes of a graph. In our setting, the goal is to learn a representation vector $f((v_i, v_j))$ of a pair of nodes such that the presence or absence of an edge between the two nodes can be predicted as $\hat{y}_{ij} = h(f((v_i, v_j)))$ where $h(\cdot)$ is a classifier.

Datasets. We experimented with three real-world bibliographic datasets: Cora [21], CiteSeer [21] and PubMed [16]. The Cora dataset contains a number of machine learning papers, each assigned to one of 7 classes, while the CiteSeer dataset has 6 class labels. The PubMed dataset consists of articles related to diabetes from the PubMed database. The node-level features correspond to the text representation of the papers. To produce these features, the three datasets underwent standard preprocessing including stemming, stopword removal as well as removal of terms with document frequency less than 10. Table 6 shows statistics of the three datasets that were used for the evaluation.
Table 6. Summary of the three datasets that were used in our link prediction experiments.

Dataset    # Nodes   # Edges   # Features   # Classes
Cora         2,708     5,429        1,433   7
Citeseer     3,327     4,732        3,703   6
PubMed      19,717    44,338          500   3
Baselines. We compare the proposed model against a traditional approach, spectral clustering (SC), and three auto-encoders: variational graph auto-encoder (VGAE) [12], graph auto-encoder (GAE) [12], and gravity variational graph auto-encoder (GVGAE) [20]. Note that SC cannot handle the node features that are available for all three datasets. For SC, GAE and VGAE, we report the results presented in [12], while for GVGAE, we show the results in the original paper [20].

Experimental Setup. We compare models based on their ability to correctly classify edges and non-edges. The validation and test sets contain 5% and 10% of citation links, respectively. The validation set is used for optimization of hyperparameters. With regard to the hyperparameters of the proposed model, we use exactly the same configuration as in the case of edge classification. The representations produced by the proposed model are again fed into a logistic regression classifier, and we use the following two evaluation metrics: area under the curve (AUC) and average precision (AP).

Table 7. Performance of the different aggregation functions in the link prediction task.

Aggr. function   Cora            Citeseer        PubMed
                 AUC     AP      AUC     AP      AUC     AP
Sum              0.701   0.721   0.713   0.719   0.886   0.853
Average          0.715   0.730   0.716   0.726   0.899   0.902
Weighted-L1      0.716   0.738   0.751   0.758   0.916   0.859
Weighted-L2      0.773   0.756   0.731   0.741   0.891   0.862
Concatenation    0.882   0.885   0.861   0.867   0.967   0.940
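For readers unfamiliar with the two metrics, the following sketch computes AUC and AP from scratch on a handful of scored node pairs. It is a minimal illustration, assuming the common definitions (AUC as the probability that a random edge is scored above a random non-edge; AP as the mean of the precision values at the rank of each true edge); the function names and toy scores are hypothetical, and library implementations may treat tied scores differently.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean precision at the rank of each positive (assumes >= 1 positive)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Toy test set: 1 = held-out edge, 0 = sampled non-edge (hypothetical scores).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUC = {auc:.4f}, AP = {ap:.4f}")
```

Both metrics depend only on the ranking induced by the scores, so they can be computed directly from the classifier's predicted probabilities without choosing a decision threshold.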
Results. We first compare the different aggregation functions in terms of the quality of the representations that they produce. The obtained results are reported in Table 7. Again, we observe that concatenating the representations of the two nodes leads to the best performance. This function outperforms the others on all three datasets by very wide margins. Weighted-L1 and Weighted-L2 produced the second best and third best results, respectively, while Sum and Average are the worst-performing functions. We then compare the best-performing variant of our model against the baseline methods presented above. The results are shown in Table 8. We can see that the proposed model reaches lower performance levels than the three baseline auto-encoders on Cora and Citeseer, while it is the best-performing method on PubMed in terms of the AUC. As expected, SC does not perform well compared to the methods that take the node features into account, yielding much lower performance. Overall, the proposed model does not achieve a new state-of-the-art performance; however, it is competitive with the baselines on all three datasets.

Table 8. Performance of the proposed model and the baselines in the link prediction task.

Method   Cora            Citeseer        PubMed
         AUC     AP      AUC     AP      AUC     AP
SC       0.846   0.885   0.846   0.899   0.842   0.841
GAE      0.910   0.920   0.895   0.920   0.964   0.965
VGAE     0.914   0.926   0.908   0.920   0.944   0.947
GVGAE    0.919   0.924   0.876   0.897   -       -
EAE      0.882   0.885   0.861   0.867   0.967   0.940

6 Conclusion
In this paper, we have designed an auto-encoder for learning representations for pairs of nodes. The encoder is an instance of a graph neural network, while the decoder is an MLP that predicts if the two nodes are linked by an edge or not. We evaluated the proposed model in the tasks of edge classification and link prediction. Results indicate that our architecture is competitive with the state-of-the-art.
References

1. Abu-El-Haija, S., Perozzi, B., Al-Rfou, R.: Learning edge representations via low-rank asymmetric projections. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1787–1796 (2017)
2. Aggarwal, C., He, G., Zhao, P.: Edge classification in networks. In: Proceedings of the 32nd IEEE International Conference on Data Engineering, pp. 1038–1049 (2016)
3. Battaglia, P., Pascanu, R., Lai, M., Rezende, D.J.: Interaction networks for learning about objects, relations and physics. In: Advances in Neural Information Processing Systems, pp. 4502–4510 (2016)
4. Cai, H., Zheng, V.W., Chang, K.C.C.: A comprehensive survey of graph embedding: problems, techniques, and applications. IEEE Trans. Knowl. Data Eng. 30(9), 1616–1637 (2018)
5. Cao, S., Lu, W., Xu, Q.: GraRep: learning graph representations with global structural information. In: Proceedings of the 24th International Conference on Information and Knowledge Management, pp. 891–900 (2015)
6. Chakrabarti, D., Faloutsos, C.: Graph mining: laws, generators, and algorithms. ACM Comput. Surv. 38(1), 2–es (2006)
7. Errica, F., Podda, M., Bacciu, D., Micheli, A.: A fair comparison of graph neural networks for graph classification. In: 8th International Conference on Learning Representations (2020)
8. Gao, Z., Fu, G., Ouyang, C., Tsutsui, S., Liu, X., Yang, J., Gessner, C., Foote, B., Wild, D., Ding, Y., Yu, Q.: edge2vec: representation learning using edge semantics for biomedical knowledge discovery. BMC Bioinform. 20(1), 306 (2019)
9. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272 (2017)
10. Grover, A., Leskovec, J.: node2vec: scalable feature learning for networks. In: Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, pp. 855–864 (2016)
11. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems, pp. 1024–1034 (2017)
12. Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv preprint https://arxiv.org/abs/1611.07308 (2016)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 5th International Conference on Learning Representations (2017)
14. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol. 58(7), 1019–1031 (2007)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
16. Namata, G., London, B., Getoor, L., Huang, B., EDU, U.: Query-driven active surveying for collective classification. In: 10th International Workshop on Mining and Learning with Graphs (2012)
17. Nikolentzos, G., Tixier, A., Vazirgiannis, M.: Message passing attention networks for document understanding. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence, pp. 8544–8551 (2020)
18. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: online learning of social representations. In: Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining, pp. 701–710 (2014)
19. Salha, G., Hennequin, R., Vazirgiannis, M.: Keep it simple: graph autoencoders without graph convolutional networks. arXiv preprint https://arxiv.org/abs/1910.00942 (2019)
20. Salha, G., Limnios, S., Hennequin, R., Tran, V.A., Vazirgiannis, M.: Gravity-inspired graph autoencoders for directed link prediction. In: Proceedings of the 28th International Conference on Information and Knowledge Management, pp. 589–598 (2019)
21. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B., Eliassi-Rad, T.: Collective classification in network data. AI Mag. 29(3), 93 (2008)
22. Song, C., Lin, Q., Ling, G., Zhang, Z., Chen, H., Liao, J., Chen, C.: LoCEC: local community-based edge classification in large online social networks. In: Proceedings of the 36th International Conference on Data Engineering, pp. 1689–1700 (2020)
23. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., Mei, Q.: Line: Large-scale information network embedding. In: Proceedings of the 24th International Conference on World Wide Web, pp. 1067–1077 (2015)
24. Tsitsulin, A., Mottin, D., Karras, P., Müller, E.: Verse: Versatile graph embeddings from similarity measures. In: Proceedings of the 2018 World Wide Web Conference, pp. 539–548 (2018)
25. Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., Yang, S.: Community preserving network embedding. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 203–209 (2017)
26. Zhou, Y., Wu, S., Jiang, C., Zhang, Z., Dou, D., Jin, R., Wang, P.: Density-adaptive local edge representation learning with generative adversarial network multi-label edge classification. In: Proceedings of the 2018 IEEE International Conference on Data Mining, pp. 1464–1469 (2018)
Incorporating Domain Knowledge into Health Recommender Systems Using Hyperbolic Embeddings

Joel Peito and Qiwei Han

Nova School of Business and Economics, Universidade NOVA de Lisboa, Campus de Carcavelos, 2775-405 Carcavelos, Portugal
[email protected], [email protected]

Abstract. In contrast to many other domains, recommender systems in health services may benefit particularly from the incorporation of health domain knowledge, as it helps to provide meaningful and personalised recommendations catering to the individual’s health needs. With recent advances in representation learning enabling the hierarchical embedding of health knowledge into the hyperbolic Poincaré space, this work proposes a content-based recommender system for patient-doctor matchmaking in primary care based on patients’ health profiles, enriched by pre-trained Poincaré embeddings of the ICD-9 codes through transfer learning. The proposed model outperforms its conventional counterpart in terms of recommendation accuracy and has several important business implications for improving the patient-doctor relationship.

Keywords: Health recommender systems · Primary care · Poincaré embeddings · International classification of diseases · Patient-doctor relationship
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 130–141, 2021. https://doi.org/10.1007/978-3-030-65351-4_11

1 Introduction

With the emergence of healthcare analytics and the growing need to leverage the prevalent electronic health records from healthcare providers, machine learning (ML) solutions, such as recommender systems (RS), have experienced growing relevance in the healthcare sector [1]. In fact, patients increasingly seek bespoke and digital medical solutions, similar to what they are used to from e-commerce and other domains. However, as patients’ relationships with their doctors can be very personal and health conditions are a sensitive topic, healthcare recommender systems (HRS) are subject to a different set of rules and evaluation criteria than other commercial applications of RS. For instance, product or movie RS do not operate under the same scrutiny regarding the reliability and trustworthiness of their predictions, since the ramifications of specific treatment or doctor recommendations are more severe in nature.

In general, RS often capitalise on the target user’s interaction data without the need for any additional information about the user or the recommended entity. While such methods can be highly performant, they usually do not offer a straightforward explanation as to why a specific product or movie is being recommended. Still, as long as users receive interesting recommendations, one can assume that this is not a particular issue for them. Patients, on the other hand, may be highly interested in solutions that not only fit their personal medical profile, insofar as they are built on medically meaningful information about the patient, but also provide explanations of the recommendation itself. That is to say, patients will arguably prefer recommendations optimised towards their individual medical needs, instead of recommendations based on the similarity to other patients that may show very similar behavioural patterns but have an entirely different medical background. Analogously, healthcare providers can treat this property as a value proposition to their clients, offering medically personalised recommendations and thereby meeting current market trends.

As such, this paper aims to investigate the possibility of adding such a medical personalisation dimension to the HRS by incorporating complex, domain-specific knowledge into the underlying model. More specifically, we propose a content-based RS for patient-doctor matchmaking built on real data from a leading European private healthcare provider. Patients’ historical health records, as indicated by the ICD-9 codes¹, serve as the main source of domain knowledge. However, the use of ICD-9 codes for encoding patients’ health conditions faces a series of practical implementation problems. Chief among those is the structure of the data itself, as ICD codes are by nature organised in hierarchical, tree-like structures that are hard to embed into the continuous space necessary for most ML models. Nevertheless, recent works [2, 3] proposing hyperbolic embeddings for learning hierarchical representations appear to provide a way around this issue.
Consequently, we investigate how to incorporate complex domain knowledge, such as the ICD-9 hierarchy into a HRS using hyperbolic embeddings and examine whether such domain knowledge can add value to the HRS in terms of improving recommendation accuracy. For that purpose, we pursue the following approach: contextualising the topic, Sect. 2 begins with a bibliographical examination of related works on HRS and lays out the benefits of embedding hierarchical data into the hyperbolic space. Notably, it will be shown why hyperbolic embeddings are inherently better equipped than their Euclidean counterparts to embed hierarchical, tree-like data into the continuous space. Moving forward, Sect. 3 sheds light on the data at hand. Section 4 discusses the methods employed introducing the notion of hyperbolic distance as a similarity measure for RS and formulating two content-based models using said hyperbolic distance. For evaluation purposes, a conventional model is formulated to serve as a benchmark. Section 5 analyses the results of this investigation. Finally, Sect. 6 discusses the conclusions we draw from this work, as well as suggestions for future research.
¹ International Classification of Diseases (ICD) is a comprehensive standard of diseases or medical conditions maintained by the WHO and widely used among healthcare organizations worldwide. It is revised periodically and is now in its 10th version (known as ICD-10). However, the ICD codes used in this study are still in the 9th version (ICD-9).
2 Background and Related Work

2.1 Recommender Systems in Healthcare

In general, RS are a subclass of information filtering systems with the goal to provide meaningful suggestions to users for certain items or entities, by attempting to predict the affinity or preference of a given user for said items [4]. RS can be broadly divided into three major categories: collaborative filtering (CF) approaches, content-based (CB) recommenders, and hybrid models, which are a combination of the former two. CF approaches rely solely upon past interactions recorded between users and items, whereas CB approaches use additional information about users and/or items [5]. More precisely, CF capitalises on behavioural data, i.e. users’ co-occurrence patterns, in order to detect similar users and/or items and make predictions based on these similarities, while CB recommenders explore user or item metadata to derive user preferences and model the observed user-item interactions. Although CB recommenders do not suffer from the cold-start problem, i.e. the question of what to do with new users that have no prior interactions usable for predictions [6], CF approaches tend to outperform them, as usually even a few ratings are more valuable than metadata about users or items [7]. Ultimately, a method to balance the respective limitations of CF and CB is to use hybrid recommenders. While RS have been widely used in e-commerce, e.g. for movie or product recommendations, within the healthcare domain RS are only recently emerging, due to elevated requirements regarding reliability and trustworthiness, as well as increased data privacy regulations [8, 9]. Recent years, however, have shown an increase in studies and research papers on HRS. Among those works, medical user profiling and medical personalisation have been particularly trending topics [10].
Hence, noteworthy examples of HRS applications include recommenders for relevant medical home care products [11], lifestyle adaption recommendations for hypertension treatment and prevention [12], identification of key opinion leaders [13], as well as clinical decision support systems using inherent methods of RS to capitalise on the large volume of clinical data [14]. Finally, [15] address the topic of patient-doctor matchmaking, proposing RS for suggesting primary care doctors to patients based on their prior consultation history and metadata.

2.2 Hyperbolic Embeddings

As has been hypothesized in the introduction, HRS might profit more than other areas from incorporating domain knowledge into the model. Within the healthcare context, such knowledge may include, for instance, a catalogue and categorisation of health conditions, such as the ICD-9 hierarchy. Abstracting a hierarchy into mathematical terms, it is essentially a complex tree, defined as a connected graph in which for any pair of vertices $u \neq v$ there is exactly one path connecting them [16]. An inherent characteristic of hierarchies or trees, however, is that they are discrete structures, and thus embedding them in a way that can be used in machine learning models can be challenging, as the latter often rely on continuous representations [17]. Hence, the underlying question is how to efficiently and accurately model an increasingly complex hierarchy (and accordingly an increasingly complex tree) in a continuous space, such that the information of the hierarchy can be used for machine learning. Recent proposals by [2] and [3], suggesting hyperbolic embeddings to address this issue, have attracted considerable attention in the machine learning community. The rationale is that embeddings of hierarchical, tree-like data into the hyperbolic space perform better at the task of capturing and preserving the distances and complex relationships within a given hierarchy than embeddings in the Euclidean space would. As a matter of fact, these works show that hyperbolic embeddings, even in very low dimensions, consistently outperform their higher-dimensional, Euclidean counterparts when learning hierarchical representations. The reasons for said superiority lie within the properties of hyperbolic geometry itself. Hyperbolic space is a space with a constant negative curvature that expands exponentially, rendering it inherently well-suited for the task of embedding a tree into the continuous space [2]. Meanwhile, the preferred geometrical models for representation learning tasks, such as the one at hand, are the Poincaré models, as they offer a conformal mapping between hyperbolic and Euclidean space: angles are preserved, a convenient property when translating between spaces and models [17]. Recalling that the goal when embedding tree-like graphs into a continuous space is to preserve original graph distances, one needs to consider the hyperbolic distance:

$$d_H(x, y) = \operatorname{acosh}\left(1 + 2\,\frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\right) \quad (1)$$

In hyperbolic space, the shortest paths between two points, called geodesics, are curved (similarly to the space itself). Due to this curvature, the distance from the origin to a given point $d_H(O, x)$ grows towards infinity as $x$ approaches the edge of the disc, as can be observed in Fig. 1.
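Eq. (1) translates directly into code. The snippet below is a small, self-contained sketch (the function name `poincare_distance` and the sample radii are illustrative choices) that also shows how the distance from the origin grows without bound as a point approaches the edge of the unit disk.

```python
import math


def poincare_distance(x, y):
    """Hyperbolic distance of Eq. (1) between two points inside the unit ball."""
    def sq_norm(v):
        return sum(c * c for c in v)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    denom = (1.0 - sq_norm(x)) * (1.0 - sq_norm(y))
    return math.acosh(1.0 + 2.0 * diff / denom)


origin = (0.0, 0.0)
radii = [0.5, 0.9, 0.99, 0.999]
distances = [poincare_distance(origin, (r, 0.0)) for r in radii]

for r, d in zip(radii, distances):
    # For a point at Euclidean radius r, Eq. (1) reduces to 2 * atanh(r).
    print(f"||x|| = {r:<6} d_H(O, x) = {d:.4f}")
```

A Euclidean step from radius 0.99 to 0.999 is tiny, yet the hyperbolic distance to the origin still increases substantially, which is the "exponential expansion" that leaves room for the many leaves of a tree.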
Now, considering the embedding of a graph (or tree) into a continuous space, suppose $x$ and $y$ are children of a parent $z$, which is placed at the origin $O$. Then, the distance between $x$ and $y$ is:

$$d(x, y) = d(x, O) + d(O, y) \quad (2)$$

Normalizing this equation provides the distance ratio of the original graph, i.e. $\frac{d(x,y)}{d(x,O) + d(O,y)} = 1$. This equation will be relevant in the following, since when comparing its behaviour in hyperbolic and Euclidean space, quite different effects can be observed. As is visualised in Fig. 1, when moving towards the edge of the unit disk, i.e. $\|x\| \to 1$, the ratio $\frac{d_E(x,y)}{d_E(x,O) + d_E(O,y)}$ remains a constant in Euclidean space, whereas in hyperbolic space the ratio $\frac{d_H(x,y)}{d_H(x,O) + d_H(O,y)}$ approaches 1, which is exactly the original graph distance ratio! Therefore, it can be seen that Poincaré embeddings are inherently better suited for this kind of representation learning task, due to their better capacity to preserve original graph distances with arbitrarily low distortion [17].

While further analysis of the detailed mathematics of Poincaré embeddings, as laid out in [2, 3] and [17], is beyond the scope of this paper, we instead consider an actual use-case of Poincaré embeddings relevant to this work. For instance, [17] perform representation learning tasks for a variety of datasets, most of which are related to NLP. In light of the given topic, however, their work on embeddings of the UMLS diagnostic hierarchy from ICD-9 vocabularies is of particular interest, as they provide the very domain-specific knowledge needed for the proposed model.

Fig. 1. The Poincaré disk model (left) and distance ratios of hyperbolic and Euclidean distance in comparison with original input graph distance ratio (right) [17].
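The distance-ratio argument of Sect. 2.2 can be checked numerically. The sketch below places two siblings $x$ and $y$ at the same Euclidean radius $r$, separated by a right angle, with their parent at the origin, and evaluates the ratio from the normalized Eq. (2) in both geometries. The helper names and the choice of a 90-degree angle are assumptions made purely for illustration.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def poincare_distance(x, y):
    """Hyperbolic distance of Eq. (1) on the Poincaré disk."""
    def sq_norm(v):
        return sum(c * c for c in v)

    diff = sq_norm([a - b for a, b in zip(x, y)])
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - sq_norm(x)) * (1.0 - sq_norm(y))))


def sibling_ratio(dist, r, angle=math.pi / 2):
    """d(x, y) / (d(x, O) + d(O, y)) for two siblings at radius r."""
    origin = (0.0, 0.0)
    x = (r, 0.0)
    y = (r * math.cos(angle), r * math.sin(angle))
    return dist(x, y) / (dist(x, origin) + dist(origin, y))


radii = [0.5, 0.9, 0.99, 0.999]
euclid = [sibling_ratio(euclidean_distance, r) for r in radii]
hyper = [sibling_ratio(poincare_distance, r) for r in radii]

for r, e, h in zip(radii, euclid, hyper):
    print(f"r = {r:<6} Euclidean ratio = {e:.4f}   hyperbolic ratio = {h:.4f}")
```

In this configuration the Euclidean ratio stays at $\sin(45^\circ) \approx 0.707$ for every radius, while the hyperbolic ratio increases towards the tree value of 1 as the siblings move towards the boundary.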
3 Data The dataset used in this work was provided by a leading European private heath network operating 18 hospitals or clinic centres across the country. Typically, data can be divided into three categories: 1) patients’ demographic information, such as gender, age and home locations, and this information is further enriched with health records in the form of the ICD-9 code for inpatients, i.e. patients who stay at the hospital while under treatment; and 2) doctors’ demographic and professional information, such as gender, age and the hospital they are working at and 3) the interactions between patients and doctors according to their consultation history. Instead of learning the representation of ICD-9 code from our data, we resort to a transfer learning approach by relying on pre-trained Poincaré embeddings provided by [17]. In particular, they used the diagnostic hierarchy of ICD-9 vocabulary in the Unified Medical Language System Metathesaurus (UMLS) to retrieve Poincaré embeddings of medical concepts within the ICD-9 hierarchy. This method results in unique hyperbolic embeddings of medical concepts (identified by the CUI, i.e. Concept Unique Identifier) available in different levels of dimensionality (10, 20, 50 or 100d). We choose the 100d embeddings for our model, as [2] indicates that while Poincaré embeddings already perform well in low dimensions, their performance seems to further increase with dimensionality. Notably, the transfer learning approach allows us to adapt the meaningful medical knowledge from a different health context. Ultimately, since the pre-trained hyperbolic embeddings are only available in UMLS and not directly available for ICD-9 codes of our dataset, a mapping between UMLS and ICD-9 is needed. As a matter of fact, this process requires a multi-stage mapping, because the direct mapping from CUI codes in UMLS to the ICD-9 codes of the core dataset is not available. 
Instead, SNOMED CT² was selected as an intermediary, as it serves as a healthcare terminology standard and is transferable both to UMLS’ CUIs and to ICD-9 codes. Figure 2 provides an overview of the data flow. While SNOMED CT was used to link the CUIs with ICD-9 in order to establish a unique Poincaré embedding for each available ICD-9 code, the core dataset itself also needed to be filtered for patients that have an ICD-9 record. Consequently, this process reduced the original size of the dataset substantially, such that a dataset of 33k patients and 223 doctors with more than 166k interactions between them remains.
² SNOMED CT refers to the Systematized Nomenclature of Medicine Clinical Terms, which is used to encode healthcare terminology for electronic health records. All UMLS data, including SNOMED CT and CUIs, have been retrieved from the US National Library of Medicine (NLM).
Fig. 2. Data diagram describing the data sources, as well as the necessary mapping steps between terminologies and datasets.
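The multi-stage mapping described in this section can be illustrated with a minimal Python sketch. It assumes that the three lookup tables (CUI to SNOMED CT, SNOMED CT to ICD-9, and CUI to embedding) have already been loaded into dictionaries; the function names, the dictionary layout and the "first match wins" rule for resolving several candidate embeddings are hypothetical choices for illustration and are not taken from the original pipeline.

```python
from typing import Dict, List


def link_icd9_to_embeddings(
    cui_to_snomed: Dict[str, List[str]],
    snomed_to_icd9: Dict[str, List[str]],
    cui_embeddings: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Chain CUI -> SNOMED CT -> ICD-9 and attach one embedding per ICD-9 code."""
    icd9_embeddings: Dict[str, List[float]] = {}
    for cui, embedding in cui_embeddings.items():
        for snomed_id in cui_to_snomed.get(cui, []):
            for icd9 in snomed_to_icd9.get(snomed_id, []):
                # Assumption: keep the first embedding found so that each
                # ICD-9 code ends up with exactly one embedding.
                icd9_embeddings.setdefault(icd9, embedding)
    return icd9_embeddings


def filter_patients(
    patient_icd9: Dict[str, List[str]],
    icd9_embeddings: Dict[str, List[float]],
) -> Dict[str, List[str]]:
    """Keep only patients with at least one ICD-9 code that has an embedding."""
    filtered: Dict[str, List[str]] = {}
    for patient_id, codes in patient_icd9.items():
        usable = [code for code in codes if code in icd9_embeddings]
        if usable:
            filtered[patient_id] = usable
    return filtered
```

Patients whose codes cannot be reached through the chain are dropped, which mirrors the reduction in dataset size reported above.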
4 Methods
4.1 Hyperbolic Distance as a Similarity Measure for Recommender Systems
Since conventional similarity measures for recommender systems, such as cosine similarity or Pearson correlation, are inherently suited only for Euclidean space [18], we introduce the notion of hyperbolic distance as a more suitable similarity measure for the data at hand. Given that each of the derived ICD-9 diseases is represented by an embedding in the hyperbolic space, the objective is to determine how similar these diseases – and ultimately the patients admitted with or doctors having treated these diseases – are to one another. The basic principle of hyperbolic distance as a similarity measure is simple: once a unique embedding per patient or doctor is derived, a patient-patient or doctor-doctor similarity score can be determined using the hyperbolic distance function from Eq. (1). The resulting matrix of distances is subsequently scaled to the range from 0 to 1 and subtracted from 1, so that 0 is the minimum similarity and 1 the maximum. Applying this heuristic yields a similarity score that is not only consistent with hyperbolic space (i.e., preserving the hierarchical information and complexities of the input graph), but also as
easily interpretable as conventional similarity measures. This similarity measure shall be referred to as hyperbolic similarity for the remainder of this paper, and its implementation in the model at hand will be examined in the following section.
4.2 Implementation of a Recommender System Using Hyperbolic Distance
Since the ICD-9 embeddings represent metadata about patients or doctors, an RS using such embeddings can be classified as CB. While there is further data available for both patients and doctors (e.g. demographic or location data), the proposed model will consider only the ICD-9 information. In fact, we benchmark it against a conventional CB model using that very metadata for performance evaluation purposes. As discussed in Sect. 3, the ICD-9 information per patient from the core dataset has been enriched with the Poincaré embeddings provided by [17]. Since many patients have been admitted with more than one disease throughout their individual medical history, the majority of patients have multiple ICD-9 entries. Therefore, in order to determine a unique embedding per patient and per doctor, multiple entries need to be averaged. Due to the specific properties of the hyperbolic space, however, the usual Euclidean mean is not applicable and thus a generalisation is needed. In hyperbolic geometry, the averaging of feature vectors is done by using the Einstein midpoint [19]. The Einstein midpoint takes its simplest form in Klein coordinates and is defined as follows:
\[ \mathrm{HypAve}(x_1, \ldots, x_N) = \frac{\sum_{i=1}^{N} \gamma_i x_i}{\sum_{i=1}^{N} \gamma_i}, \quad \text{where } \gamma_i = \frac{1}{\sqrt{1 - c\lVert x_i \rVert^2}} \text{ and } c = 1 \tag{3} \]
The Klein model is consistent with the Poincaré ball model, but since the same point has different representations in the two models, the points first need to be translated from the Poincaré to the Klein model, then averaged and ultimately mapped back into the Poincaré model in order to complete the operation.
Thus, if $x_D$ and $x_K$ correspond to the same point in the Poincaré and the Klein model, respectively, then the following formulas serve for translating between them:
\[ x_D = \frac{x_K}{1 + \sqrt{1 - c\lVert x_K \rVert^2}} \quad \text{and} \quad x_K = \frac{2 x_D}{1 + c\lVert x_D \rVert^2} \tag{4} \]
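The averaging procedure of Eqs. (3) and (4) can be sketched in a few lines of Python. This is a minimal, dependency-free illustration that assumes c = 1 by default and represents embeddings as plain lists of floats; the function names are hypothetical and the code is not the authors' implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def poincare_to_klein(x: Sequence[float], c: float = 1.0) -> Vector:
    """Map a point from the Poincare ball to the Klein model, Eq. (4)."""
    sq_norm = sum(v * v for v in x)
    return [2.0 * v / (1.0 + c * sq_norm) for v in x]


def klein_to_poincare(x: Sequence[float], c: float = 1.0) -> Vector:
    """Map a point from the Klein model back to the Poincare ball, Eq. (4)."""
    sq_norm = sum(v * v for v in x)
    return [v / (1.0 + math.sqrt(1.0 - c * sq_norm)) for v in x]


def hyp_ave(points: Sequence[Sequence[float]], c: float = 1.0) -> Vector:
    """Einstein midpoint of Poincare-ball points, Eq. (3)."""
    # Step 1: translate every point into Klein coordinates.
    klein = [poincare_to_klein(p, c) for p in points]
    # Step 2: Lorentz factors gamma_i, computed from the Klein coordinates.
    gammas = [1.0 / math.sqrt(1.0 - c * sum(v * v for v in k)) for k in klein]
    total = sum(gammas)
    # Step 3: gamma-weighted average in the Klein model.
    dim = len(klein[0])
    midpoint = [
        sum(g * k[d] for g, k in zip(gammas, klein)) / total for d in range(dim)
    ]
    # Step 4: map the midpoint back into the Poincare ball.
    return klein_to_poincare(midpoint, c)
```

Because the Lorentz factor grows as a point approaches the boundary of the ball, points far from the origin receive a larger weight in the average than they would in a Euclidean mean.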
With an appropriate methodology for hyperbolic feature vector averaging in place, a content-based model for patient-doctor matchmaking can be formulated. In formal terms, for N patients and K doctors, the patient-doctor interaction matrix $Y \in \mathbb{R}^{N \times K}$ is denoted as:
\[ y_{ij} = \begin{cases} 1, & \text{if patient } i \text{ interacted with doctor } j \\ 0, & \text{otherwise} \end{cases} \tag{5} \]
Adapting [15], the patient-doctor interactions are furthermore weighted with a trust measure. The trust between a patient and a doctor is modelled by considering both the recency and frequency of their consultation history, i.e., doctors that have been visited repeatedly and recently will be weighted higher for a given patient.
Regarding feature creation, the ICD-9 embeddings need to be considered. If $V \subset \mathbb{R}^{100}$ is the set of all Poincaré embeddings, with each embedding being essentially a 1 × 100 dimensional row vector, then for each patient i the set of embedding vectors is denoted as $V_i \subset V$, corresponding to all ICDs that the patient has been diagnosed with. Similarly, for each doctor j the set of embedding vectors is specified by $V_j \subset V$, corresponding to the ICDs of all patients that visited doctor j. Hence, the feature vectors of patient i and doctor j are given by the hyperbolic average of their embeddings:
\[ f_i = \mathrm{HypAve}(V_i) \quad \text{and} \quad f_j = \mathrm{HypAve}(V_j) \tag{6} \]
With the feature matrices for patients and doctors established, the similarity across patients and doctors can be calculated. For simplicity, this process will be described only for doctor-doctor similarity; the method applies analogously to patients. The similarity between doctors j and k is described by the above-defined hyperbolic similarity $s_H$ of their feature embeddings:
\[ s_{j,k} = s_H(f_j, f_k) \tag{7} \]
Ultimately, the predicted affinity $p_{i,j}$ of a patient i towards a doctor j can be computed using the following operation:
\[ p_{i,j} = \frac{\sum_{k=1}^{K} y_{i,k} \, s_{j,k}}{\sum_{k=1}^{K} s_{j,k}} \tag{8} \]
Recalling that K is equal to the total number of doctors and $y_{i,k}$ is the trust-weighted interaction value between patient i and doctor k, it becomes evident that the predicted affinity of patient i is essentially given by the similarity-weighted sum of doctors the patient visited previously, divided by the sum of the weights. While RS in e-commerce usually aim to suggest primarily new, unseen items, our model does not exclude doctors the patient has already interacted with from recommendation. This is relevant because the goal of this model is to suggest to the patient the best-suited doctor for their next primary care visit, for which previously seen doctors are arguably highly relevant candidates and should by no means be excluded.
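The hyperbolic similarity of Sect. 4.1 and the predicted affinity of Eq. (8) can be sketched as follows. Two assumptions are made here and should be read as such: the distance function referred to as Eq. (1) is taken to be the standard Poincaré distance, and "scaled from 0 to 1" is interpreted as min-max scaling over the whole distance matrix. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def poincare_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Standard Poincare-ball distance (assumed to correspond to Eq. (1))."""
    sq_u = sum(a * a for a in u)
    sq_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


def hyperbolic_similarity(features: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise distances, min-max scaled to [0, 1] and subtracted from 1."""
    dist = [[poincare_distance(a, b) for b in features] for a in features]
    flat = [d for row in dist for d in row]
    lowest, highest = min(flat), max(flat)
    span = (highest - lowest) or 1.0  # guard against identical embeddings
    return [[1.0 - (d - lowest) / span for d in row] for row in dist]


def predicted_affinity(y: Matrix, s: Matrix) -> Matrix:
    """Eq. (8): similarity-weighted sum of (trust-weighted) interactions."""
    n_doctors = len(s)
    return [
        [
            sum(y_i[k] * s[j][k] for k in range(n_doctors)) / sum(s[j])
            for j in range(n_doctors)
        ]
        for y_i in y
    ]
```

Each row of the result holds the affinities of one patient towards all K doctors; sorting a row in descending order gives that patient's recommendation list.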
5 Results
While a substantial part of this paper has been dedicated to the theoretical benefits of Poincaré embeddings and their application to the given problem of patient-doctor matchmaking, it is ultimately necessary to evaluate their performance in comparison to conventional methods, in order to judge their actual value. For that purpose, we compare the following models³:
1. Conventional CB: a patient-patient-similarity based benchmark model using cosine similarity of demographic data and one-hot encoded ICD-9 data as patient features to identify patients with similar metadata,
2. Patient ICD-9 similarity: a patient-patient-similarity based RS using patients’ averaged, hyperbolic feature vectors to identify patients with similar diseases, and
3. Doctor ICD-9 similarity: a doctor-doctor-similarity based RS using doctors’ averaged, hyperbolic feature vectors to identify doctors that have similar expertise to the ones the patient visited in the past.
³ We emphasise that all proposed models are entirely CB, hence neglecting the similarity of interactions between patients or doctors. As this research is preliminary, we acknowledge that adding interaction data in a hybrid approach may boost performance substantially.
Since the output of each of the proposed RS is presented as a sorted list – with either 3, 5 or 10 recommended doctors – it is sensible to rely on hit rate (HR) and precision (p) as evaluation criteria, as the evaluation objective is to see whether the patient actually visited one of the recommended doctors or not. HR@n refers to the total number of hits divided by the number of patients, given the number of recommended doctors n ∈ {3, 5, 10}. Analogously, p@n indicates the amount of correctly predicted doctors depending on n. Intuitively, HR will increase with a growing number of recommendations, whereas p will decrease. The very reason to combine these two evaluation criteria is that although it is desirable to maximise the number of hits, patients should not be confused with too many options that do not meet their needs, as this might even have counterproductive effects. Figure 3 illustrates the performance of the three suggested models regarding HR and p. While the patient ICD-similarity model is apparently not able to add substantial value, scoring even slightly below the benchmark model, the doctor ICD-similarity model does, indeed, outperform the benchmark model.
In fact, this allows for two major conclusions in light of the theoretical considerations in the sections above. First, hyperbolic averaging appears to be a viable method for feature averaging of Poincaré embeddings, considering the substantial number of different patients and diseases doctors treat. This is noteworthy insofar as one might reasonably assume that the more disease embeddings are averaged, the less meaningful they become. Yet, the resulting averaged embeddings are evidently still capable of setting doctors apart fairly well. Second, the Poincaré embeddings – despite not having been trained on this dataset – can add value to this HRS. As such, these two findings show that Poincaré entity embeddings of hierarchical data are a powerful framework to help incorporate complex domain knowledge into an ML application in the healthcare sector. With this in mind, the business implications for the healthcare sector remain to be considered. As has been hypothesised in the introduction, the potential business value of successfully incorporating complex domain knowledge into machine learning applications in the healthcare sector may be substantial. Recalling that the goal of a matchmaking algorithm between patients and doctors differs from that of typical e-commerce RS in that it aims to recommend to patients the doctor best suited to their specific medical condition, instead of the “next best doctor”, different evaluation criteria may apply from a business value perspective. For instance, one might argue that technical performance evaluation metrics such as hit rate and precision are, in fact, secondary to a more qualitative evaluation. RS, in general, often suffer from popularity bias, in that they tend to suggest mostly popular doctors [20]. That being said, patients should not be matched with doctors because they are popular or because other patients with
similar demographics visited them (even if this yields high HR and p scores), but because they best fit their medical needs. Hence, we suggest for further research that recommenders akin to this work should be optimised not only with respect to hit rate and precision, since this may not fully account for popularity bias, but also towards the domain-specific quality of the recommendation. In light of these considerations, healthcare providers can treat this factor as a value proposition for their clients. With the increasing demand for personalised healthcare solutions, RS built on patients’ individual health records are arguably in line with current market trends. Picturing a potential customer journey, the RS would suggest to a patient who has been admitted with, for example, hypertension, doctors that have treated many cases of hypertension or similar diseases. In addition, making recommendations based on individual health profiles adds an explanatory perspective to the suggestions that many RS lack. Since health is a sensitive topic in general, and trust in AI solutions is a major concern in the healthcare domain in particular, this may be a substantial driver for the success and adoption of the recommendation engine in practice.
Fig. 3. Hit rate and precision per proposed model.
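The two evaluation criteria used in this section can be sketched in Python as follows. HR@n follows the definition given in the text (patients with at least one hit among the top n, divided by the number of patients). For p@n the text is less specific, so the sketch assumes the common definition, i.e. the share of the top-n recommended doctors that the patient actually visited, averaged over patients. The function and variable names are hypothetical.

```python
from typing import Dict, List, Set


def hit_rate_at_n(
    recommended: Dict[str, List[str]], visited: Dict[str, Set[str]], n: int
) -> float:
    """Share of patients with at least one visited doctor in their top-n list."""
    hits = sum(
        1
        for patient, doctors in recommended.items()
        if set(doctors[:n]) & visited.get(patient, set())
    )
    return hits / len(recommended)


def precision_at_n(
    recommended: Dict[str, List[str]], visited: Dict[str, Set[str]], n: int
) -> float:
    """Average share of the top-n recommendations that were actually visited.

    Assumption: this is the standard precision@n; the paper does not spell
    out its exact formula.
    """
    per_patient = [
        len(set(doctors[:n]) & visited.get(patient, set())) / n
        for patient, doctors in recommended.items()
    ]
    return sum(per_patient) / len(per_patient)
```

With these definitions, enlarging n can only add hits, while the denominator of p@n grows with n, which matches the opposite trends of HR and p described above.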
6 Conclusion
Overall, we demonstrate that incorporating complex domain knowledge into an HRS, using Poincaré embeddings of the ICD-9 hierarchy that reflect the patients’ pre-existing health conditions, yields an actual performance improvement in comparison to conventional approaches. In particular, this paper examined the benefits of the hyperbolic space for representation learning tasks in theory and, furthermore, applied it to a real-world setting. In doing so, we show that Poincaré embeddings can contribute meaningful value in domains beyond their original scope of NLP. Moreover, we find that the incorporation of domain knowledge is of particular value in the healthcare domain, as it allows for medically personalised recommendations. While the results of this preliminary investigation are promising in principle, a set of limitations remains to be resolved in future work. Firstly, since the proposed models are purely CB in nature, they neglect valuable information that can be
retrieved from patient-doctor interaction data. Further research on a hybrid RS leveraging both interaction data and ICD-9 embeddings is a viable approach. Secondly, data consistency remains a persistent issue, with a substantial portion of available data lost due to insufficient mapping between terminologies. As has been stressed before, transferability between terminologies is paramount to the further growth of AI in healthcare and healthcare analytics. Hence, the healthcare industry should continue to foster collaboration among different standardization initiatives such as SNOMED CT, UMLS, ICD, etc. Beyond the need for improved data consistency in the healthcare sector in general, healthcare service providers in particular need to drive digitization in their industry to improve data quality as well. For instance, instead of collecting ICD information only for inpatients, all patients should be assigned a diagnostic code in order to increase the scalability of ML solutions.
Acknowledgements. This work was funded by Fundação para a Ciência e a Tecnologia (UID/ECO/00124/2019, UIDB/00124/2020 and Social Sciences Data Lab, PINFRA/22209/2016), POR Lisboa and POR Norte (Social Sciences Data Lab, PINFRA/22209/2016).
References
1. Ghassemi, M., Naumann, T., Schulam, P., Beam, A., Ranganath, R.: A review of challenges and opportunities in machine learning for health. AMIA Joint Summits Transl. Sci. Proc. 2020, 191–200 (2020)
2. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. Adv. Neural. Inf. Process. Syst. 30, 6341–6350 (2017)
3. Chamberlain, B.P., Clough, J., Deisenroth, M.P.: Neural embeddings of graphs in hyperbolic space. arXiv:1705.10359 (2017)
4. Melville, P., Sindhwani, V.: Recommender systems. In: Encyclopedia of Machine Learning, pp. 1–8. Springer, New York (2010)
5. Ricci, F., Rokach, L., Shapira, B.: Introduction to recommender systems handbook. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P.B. (eds.) Recommender Systems Handbook, pp. 1–35. Springer, Boston (2011)
6. Ghazanfar, M.A., Prugel-Bennett, A.: A scalable, accurate recommender system. In: Proceedings of the 2010 Third International Conference on Knowledge Discovery and Data Mining, pp. 94–98 (2010)
7. Pilászy, I., Tikk, D.: Recommending new movies: even a few ratings are more valuable than metadata. In: Proceedings of the Third ACM Conference on Recommender Systems, RecSys 2009, pp. 93–100 (2009)
8. Ramakrishnan, N., Keller, B.J., Mirza, B.J., Grama, A.Y., Karypis, G.: Privacy risks in recommender systems. IEEE Internet Comput. 5(4), 54–62 (2001)
9. Sezgin, E., Ozkan, S.: A systematic literature review on health recommender systems. In: E-Health and Bioengineering Conference (EHB), pp. 1–4 (2013)
10. Schäfer, H., Hors-Fraile, S., Karumur, R.P., Valdez, A., Said, A., Torkamaan, H.: Towards health (aware) recommender systems. In: Proceedings of the 2017 International Conference on Digital Health, pp. 157–161 (2017)
11. Luo, G., Thomas, S., Tang, C.: Automatic home medical product recommendation. J. Med. Syst. 36, 383–398 (2012)
12. Radha, M., Willemsen, M.C., Boerhof, M., IJsselsteijn, W.A.: Lifestyle recommendations for hypertension through Rasch-based feasibility modelling. In: Proceedings of the 2016 Conference on User Modelling Adaptation and Personalization - UMAP 2016, pp. 239–247 (2016)
13. Guo, L., Jin, B., Yao, C., Yang, H., Huang, D., Wang, F.: Which doctor to trust: a recommender system for identifying the right doctors. J. Med. Internet Res. 18(7), 186–200 (2016)
14. Gräßer, F., Malberg, H., Zaunseder, S., Beckert, S., Küster, D., Schmitt, J., Abraham, S.: Neighborhood-based collaborative filtering for therapy decision support. In: Proceedings of the Second International Workshop on Health Recommender Systems, pp. 1–5 (2017)
15. Han, Q., Ji, M., de Troya, I.M.D.R., Gaur, M., Zejnilovic, L.: A hybrid recommender system for patient-doctor matchmaking in primary care. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, pp. 481–490 (2018)
16. Bender, E.A., Williamson, S.G.: Lists, decisions and graphs. In: Bender, E.A., Williamson, S.G. (eds.) With an Introduction to Probability (2010)
17. De Sa, C., Gu, A., Ré, C., Sala, F.: Representation tradeoffs for hyperbolic embeddings. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 4460–4469 (2018)
18. Leimeister, M., Wilson, B.J.: Skip-gram word embeddings in hyperbolic space. arXiv:1809.01498 (2019)
19. Khrulkov, V., Mirvakhabova, L., Ustinova, E., Oseledets, I., Lempitsky, V.: Hyperbolic image embeddings. arXiv:1904.02239 (2019)
20. Abdollahpouri, H., Burke, R., Mobasher, B.: Managing popularity bias in recommender systems with personalized re-ranking. In: The Thirty-Second International Florida Artificial Intelligence Research Society Conference (FLAIRS-32), pp. 413–418 (2019)
Image Classification Using Graph-Based Representations and Graph Neural Networks
Giannis Nikolentzos¹(✉), Michalis Thomas¹, Adín Ramírez Rivera³, and Michalis Vazirgiannis¹,²
¹ Athens University of Economics and Business, Athens, Greece
{nikolentzos,p3150048,mvazirg}@aueb.gr
² École Polytechnique, Palaiseau, France
³ University of Campinas, Campinas, Brazil
[email protected]
Abstract. Image classification is an important, real-world problem that arises in many contexts. To date, convolutional neural networks (CNNs) are the state-of-the-art deep learning method for image classification since these models are naturally suited to problems where the coordinates of the underlying data representation have a grid structure. On the other hand, in recent years, there has been growing interest in mapping data from different domains to graph structures. Such approaches have proved to be quite successful in different domains including physics, chemoinformatics and natural language processing. In this paper, we propose to represent images as graphs and capitalize on well-established neural network architectures developed for graph-structured data to deal with image-related tasks. The proposed models are evaluated experimentally in image classification tasks, and are compared with standard CNN architectures. Results show that the proposed models are very competitive, and yield in most cases accuracies better than or comparable to those of the CNNs.
Keywords: Graph-based representations · Graph neural networks · Image classification
1 Introduction
Image classification is a fundamental task in computer vision, where the goal is to classify an image based on its visual content. For instance, we can train an image classification algorithm to answer if a car is present in an image or not. While detecting an object is trivial for humans, robust image classification is still a challenge in computer vision applications. In the past years, convolutional neural network architectures (CNNs) have proven extremely successful on a wide variety of tasks in computer vision [13]. These models are naturally suited to problems where the input data take the form of a regular grid, and exhibit some inherent statistical properties such as local stationarity and compositionality. Images are examples of data that fall into this category.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 142–153, 2021. https://doi.org/10.1007/978-3-030-65351-4_12
In many domains, data is commonly represented as graphs. This is mainly due to the rich representation capabilities that these structures exhibit. Graphs can model both the entities and the relationships between them. Typically, the vertices of a graph correspond to some entities, and the edges model how these entities interact with each other. In a network of scientists, for instance, such interactions may correspond to collaborations. Graphs are a very flexible means of data representation, and several fundamental data structures can be thought of as instances of graphs. For example, a sequence can be thought of as a graph, with one node per element and edges between consecutive elements. In some cases, even data that does not exhibit graph-like structure, such as text, is mapped to graph representations [19]. In the past years, a vast number of learning algorithms have been developed in order to work with graphs and process the information they represent. Neural network models are now available that have achieved state-of-the-art performance on many real-world graph classification datasets [22]. More specifically, an explosion in research activity in the field of graph neural networks has taken place in the last few years. In this paper, we propose to represent images as graphs, and to apply machine learning algorithms that operate on graphs to the emerging representations. Specifically, we present different approaches for representing images as graphs, and we capitalize on well-established neural network architectures developed for graph-structured data to deal with image classification tasks. The proposed models exploit properties inherent in images such as stationarity of statistics and locality of pixel dependencies. We evaluate the proposed models in image classification tasks, and we compare them with standard CNN architectures.
We also study the robustness of the proposed models to transformations of the input images and to adversarial attacks. Results show that the proposed models are very competitive, and yield in most cases accuracies better or comparable to those of the CNNs. It should be mentioned that this is not the first work to apply graph neural networks to image data. However, in this paper, we evaluate a large combination of representations and graph neural network architectures, and to the best of our knowledge, this is the most complete evaluation to date of graph representations in computer vision, and graph neural networks for image-related tasks.
2 Related Work
Graph-based representations of images have a long history in the field of pattern recognition. A detailed review of these approaches is beyond the scope of this paper; we refer the interested reader to the work of Conte et al. [5] and Vento and Foggia [21]. The problem of image classification has been widely studied over the past years, while several of the proposed approaches borrowed ideas from graph mining techniques. For instance, some works have proposed the use of graph kernels and have studied their effectiveness in image classification [4,7,11], while others have produced new image representations using graph-based features [1,25].
Graph neural networks have been recently applied to image classification tasks. Specifically, some recent graph neural network models were evaluated on the benchmark MNIST classification problem [3,6,18]. The main difference between these works and ours is that they follow different approaches for representing images as graphs. For instance, Defferard et al. construct in [6] a weighted k-NN similarity graph where nodes represent pixels and each pixel is connected to its k most similar pixels in terms of intensity. The weights of the edges are computed using a function similar to the radial basis function kernel. Simonovsky and Komodakis represented in [18] each image as a point cloud with coordinates (x, y, 0) where x, y ∈ {0, . . . , 27}, while Bruna et al. subsample the normal 28 × 28 grid to get 400 coordinates [3]. Furthermore, all these works apply a single architecture to the MNIST dataset, while in our work, we evaluate a series of message passing layers and readout functions. Very recently, graph neural networks have been also applied to other computer vision tasks, such as to the problem of image matching (i.e., to find correspondences between points in images) [17,24].
3 Models and Representations
In this section, we present the graph representations of images that we employed and the different models that we applied to these representations. We start by fixing our notation. Let G = (V, E) be an undirected graph consisting of a set V of nodes and a set E of edges between them. We will denote by n the number of nodes and by m the number of edges. The neighborhood of a node v ∈ V is the set of all vertices adjacent to v, that is N(v) = {u : (v, u) ∈ E} where (v, u) is an edge between vertices v and u. The graph representations that we utilize are node-attributed graphs. That is, each node is annotated with one or more attributes.
3.1 Graph-Based Representations of Images
In the past, several approaches have been proposed for mapping images to graph structures. Almost all existing approaches are ad-hoc and are generally motivated by performance considerations. One usually adopts the representation that is shown to perform best in the considered task. In this paper, we experiment with two different graph representations of images, namely the king’s graph and a coarsened graph which we derive from the output of a community detection algorithm. We illustrate the two considered graph representations in Fig. 1.
King’s Graph. The m × n king’s graph is a graph with mn vertices in which each vertex represents a square in an m × n chessboard, and each edge corresponds to a legal move by a king. The m × n king’s graph can be constructed as the strong product of the path graphs Pm and Pn. In other words, the king’s graph is a graph whose nodes (except those belonging to the border of the grid) are each connected to their 8 neighboring nodes by an edge. In our setting, each node
of the graph represents a pixel. Furthermore, each node is annotated with a real value (i.e., intensity of the pixel) in the case of grayscale images or a 3-dimensional vector (i.e., intensity of the RGB channels) in the case of colored images.
Fig. 1. The two considered graph representations of images: (a) king’s graph, and (b) a coarsened graph whose nodes correspond to communities extracted from a weighted variant of the king’s graph.
Coarsened Graph. In the aforementioned representation, each node corresponds to a pixel in the input image. Since the number of pixels is usually large (even for low resolution images), we propose to use community detection algorithms to reduce the number of nodes in the graph to a representative subsample of pixels or regions. We start from the aforementioned king’s graph representation and we transform it into a weighted graph where edge weights capture the similarity between pairs of pixels. We assume that there is no a priori knowledge about the components of the input image, and therefore, we use the following function to compute the weight of the edge between two vertices $v_i$ and $v_j$:
\[ w_{ij} = 1 - \lvert x_i - x_j \rvert \tag{1} \]
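As an illustration, the king’s graph and the edge weights of Eq. (1) can be built with a few lines of plain Python. The sketch below assumes a grayscale image stored as a nested list of intensities in [0, 1] and reads Eq. (1) as one minus the absolute intensity difference; the function names are hypothetical and not taken from the authors' code.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (row, column) of a pixel
Edge = Tuple[Node, Node]


def kings_graph_edges(rows: int, cols: int) -> List[Edge]:
    """Undirected edges of the rows x cols king's graph, each listed once."""
    edges: List[Edge] = []
    for r in range(rows):
        for c in range(cols):
            # Only "forward" offsets, so every undirected edge appears once.
            for dr, dc in ((0, 1), (1, -1), (1, 0), (1, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    edges.append(((r, c), (rr, cc)))
    return edges


def edge_weights(image: List[List[float]], edges: List[Edge]) -> Dict[Edge, float]:
    """Eq. (1) for grayscale intensities: w_ij = 1 - |x_i - x_j|."""
    return {
        (u, v): 1.0 - abs(image[u[0]][u[1]] - image[v[0]][v[1]])
        for u, v in edges
    }
```

For colored images the absolute value would have to be replaced by a suitable norm of the difference between the RGB vectors, which this sketch does not cover.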
In Eq. (1), we have assumed that the pixel intensities take values between 0 and 1 and, therefore, 0 ≤ wij ≤ 1 holds. Other functions such as the Gaussian kernel could also be employed. To extract the representative subsample of pixels, we apply the Louvain method, a well-known community detection algorithm [2]. The algorithm returns a set of communities, and we treat each community as a node in the new graph. In the new graph, two nodes (i.e., communities) are linked to each other by an edge if one or more pixels of one community were connected to one or more pixels of the other community in the original king’s graph. Furthermore, each node is annotated with the average of the intensities (or vectors in the case of colored images) of the pixels that belong to the corresponding community.
3.2 Models
Graph neural networks (GNNs) have attracted a lot of attention in the past years. Most GNNs share the same basic idea, and can be reformulated into a
single common framework [8]. A GNN model consists of a series of message passing (MP) layers. Each one of these layers uses the graph structure and the node feature vectors from the previous layer to generate new representations for the nodes. The feature vectors are updated by aggregating local neighborhood information. To generate a vector representation over the whole graph, GNNs apply a readout function to node representations generated by the final message passing layer. We next present three general models which we will evaluate in image classification. Note that two of the models (MP+CNN and MP+Pool+Readout) are specifically designed for graph representations of images that exhibit a grid-like structure, and cannot be applied to general graphs.
MP+Readout. This model consists of a series of message passing layers followed by a readout function. Each message passing layer updates the representation of each node based on the representations of its neighbors and its own representation. Then, to produce an image representation, the model applies a readout function to the node representations of the final message passing layer.
MP+CNN. This model consists of a series of message passing layers followed by a CNN. The message passing procedure can be seen as a method for updating the features of the pixels. Therefore, after T message passing layers, we end up with an “image” with as many channels as the hidden dimension of the final message passing layer. This “image” can be passed on to a standard CNN model to produce a representation for the input image. Note that this model is specifically designed for the image classification task where images are modeled as king’s graphs, and it cannot be applied to general graphs.
MP+Pool+Readout. This model consists of a series of message passing and pooling layers followed by a readout function. A pooling layer is in fact a clustering layer which replaces a set of nodes with a single node. We incorporate spatial relations between pixels into clustering, and thus pixels are clustered together with their neighbors. Clearly, this model can only be applied to images represented as king’s graphs. Each pooling layer halves the size of each dimension of the grid. The features of the new nodes are computed using some permutation invariant function on the cluster’s nodes (e.g., sum, mean or max). Therefore, this model consists of alternating message passing and pooling (i.e., clustering) layers in the same spirit as CNNs are composed of alternating convolutional and pooling layers. To produce an image representation, the model applies a readout function to the node representations of the final pooling layer.
Employed Message Passing Layers. In this work, we experimented with the following five message passing layers: Gated Graph layer [14], GCN [12], GAT [20], GraphSAGE [10], and 1-GNN [15]. These layers were selected according to the following criteria: (1) publicly available implementations; (2) strong architectural differences; and (3) popularity. Due to space constraints, we cannot provide more details about the different message passing layers and we refer the reader to their respective papers.
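To make the two building blocks above more concrete, the following Python sketch shows a generic message passing update and a 2 × 2 grid pooling step. It is deliberately simple and is not one of the five layers listed above: the update rule (a linear map of the node's own features plus a linear map of the summed neighbor features, followed by ReLU) and all names are assumptions chosen for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List

Vector = List[float]
Matrix = List[List[float]]


def _matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def message_passing_layer(
    features: Dict[Hashable, Vector],
    neighbors: Dict[Hashable, List[Hashable]],
    w_self: Matrix,
    w_nbr: Matrix,
) -> Dict[Hashable, Vector]:
    """One generic MP step: h_v' = ReLU(W_self h_v + W_nbr * sum of neighbor h_u)."""
    updated: Dict[Hashable, Vector] = {}
    for v, h_v in features.items():
        # Aggregate neighbor features with a permutation-invariant sum.
        agg = [sum(features[u][d] for u in neighbors[v]) for d in range(len(h_v))]
        z = [a + b for a, b in zip(_matvec(w_self, h_v), _matvec(w_nbr, agg))]
        updated[v] = [max(0.0, value) for value in z]
    return updated


def grid_pool(
    grid: List[List[Vector]], reduce: Callable[[Iterable[float]], float] = max
) -> List[List[Vector]]:
    """Halve each grid dimension by merging 2 x 2 blocks of node features.

    A trailing odd row or column is dropped in this simplified version.
    """
    pooled: List[List[Vector]] = []
    for r in range(0, len(grid) - 1, 2):
        row: List[Vector] = []
        for c in range(0, len(grid[0]) - 1, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            row.append([reduce(values) for values in zip(*block)])
        pooled.append(row)
    return pooled
```

Stacking several such updates and interleaving them with `grid_pool` gives the overall shape of MP+Pool+Readout; the actual layers used in the experiments differ in how messages are weighted and combined.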
Employed Readout Functions. We utilized the following four readout functions: (1) sum: it computes the sum of the node representations; (2) mean: it computes the average of the node representations; (3) max: it computes a vector representation for the graph where each element is equal to the maximum value of the corresponding elements of all node representations; and (4) SortPool [23]: this layer generates graph representations of a fixed size by first sorting the nodes of the graph, and then retaining only the first k nodes. If the number of nodes is less than k, zero-padding is applied. To rank the nodes, the SortPool layer sorts them in descending order of the last element of their representations.
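The four readout functions can be sketched in a few lines. This is an illustrative version only: node representations are plain lists of floats, and flattening the k retained SortPool rows into a single vector is an assumption made here for simplicity (the function names are hypothetical).

```python
def sum_readout(nodes):
    """Element-wise sum of the node representations."""
    return [sum(col) for col in zip(*nodes)]


def mean_readout(nodes):
    """Element-wise average of the node representations."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]


def max_readout(nodes):
    """Element-wise maximum over the node representations."""
    return [max(col) for col in zip(*nodes)]


def sortpool_readout(nodes, k):
    """Sort nodes by their last feature (descending), keep the first k,
    zero-pad if there are fewer than k nodes, and flatten the result."""
    dim = len(nodes[0])
    ranked = sorted(nodes, key=lambda vec: vec[-1], reverse=True)[:k]
    ranked += [[0.0] * dim for _ in range(k - len(ranked))]
    return [value for row in ranked for value in row]


nodes = [[1.0, 2.0], [3.0, 0.5], [0.0, 4.0]]
print(sum_readout(nodes))
print(max_readout(nodes))
print(sortpool_readout(nodes, k=2))
print(sortpool_readout(nodes, k=4))
```

Note that sum, mean and max use every node, whereas SortPool discards all but k of them.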
4 Experimental Evaluation
We perform all our experiments on the MNIST dataset of handwritten digits. The dataset is split into a training set and a test set of 60 000 and 10 000 images, respectively. Each image is a 28 × 28 pixel square. There are 10 digits in total (from 0 to 9), and therefore 10 different classes. We represent each image as a 28 × 28 king’s graph as discussed above. All the images share the same underlying graph structure; however, their nodes may be annotated with different attributes. We assign weights to the edges of the king’s graph using the function shown in (1), and then we apply the Louvain algorithm to obtain the set of communities and generate the coarsened graph. Note that the Louvain algorithm automatically detects the number of communities. Hence, the coarsened graph representations of different images may have different numbers of nodes (we found that the number of nodes of the emerging graphs ranges from 10 to 20).

4.1 Model Selection
We created a training and a validation set of images (the two sets are disjoint) by randomly sampling 10 000 and 1000 images from the training set of MNIST, respectively. We trained the models on the 10 000 images and report their accuracy on the 1000 images of the validation set. The representations of the images produced by the different models are passed on to a 2-layer multilayer perceptron (MLP) with a softmax activation function in the output. For all configurations, we train the neural networks for 100 epochs. We use the Adam optimizer with a learning rate of 0.001. The batch size is set equal to 64. All our dense layers use ReLU activation. To prevent over-fitting, we use dropout with a rate of 0.2. The hyper-parameters we tune are: (1) the number of message passing layers ∈ {2, 3, 4} for MP+Readout and MP+CNN, and ∈ {2, 3} for MP+Pool+Readout; (2) the number of hidden units of the message passing layers ∈ {16, 64, 128} for MP+Readout and MP+Pool+Readout and ∈ {4, 16, 32} for MP+CNN; (3) the number of hidden units of the MLP layer ∈ {128, 256} for all models. For MP+Pool+Readout, we also tune the number of sub-sampling (i.e., clustering) layers ∈ {2, 3, 4}, and the type of the aggregation function of the features of the clustered nodes ∈ {Sum, Mean, Max}. For the SortPool readout function, k was set equal to 20. The baseline CNN and the CNN component of MP+CNN consist of two convolutional layers. The first layer contains 16 filters of size 4 × 4, while the second layer contains 32 filters of size 3 × 3. Both convolutional layers are followed by max-pooling layers of size 2 × 2.

Table 1. Performance (accuracy, %) of the different combinations of graph representations, message passing layers and readout functions on the validation set of MNIST.

                     MP+Readout                      MP+CNN   MP+Pool+Readout
MP layer             Sum    Max    Mean   SortPool            Sum    Max    Mean   SortPool
King’s graph
  GAT                80.2   73.8   60.9   41.3       96.6     93.3   91.3   91.0   94.6
  GCN                76.4   66.8   52.0   32.7       96.4     93.3   92.8   93.3   93.2
  GraphSAGE          79.5   54.6   56.9   33.0       97.1     92.6   91.9   93.3   90.9
  1-GNN              95.5   94.5   95.4   63.8       97.1     97.6   97.8   97.5   96.7
  Gated Graph layer  96.9   95.1   95.6   67.2       96.4     97.3   97.9   96.7   97.1
Coarsened graph
  GAT                65.9   66.5   65.0   49.8       –        –      –      –      –
  GCN                61.4   60.9   61.4   46.3       –        –      –      –      –
  GraphSAGE          59.0   59.1   55.6   42.8       –        –      –      –      –
  1-GNN              75.1   75.2   75.4   66.2       –        –      –      –      –
  Gated Graph layer  78.3   78.5   73.2   67.7       –        –      –      –      –

Table 1 illustrates the classification accuracies obtained from the different models. Note that in the case of the coarsened graphs, we can only apply the MP+Readout models. Indeed, the MP+CNN and MP+Pool+Readout models cannot be applied since the input data does not take the form of a regular grid anymore, and moreover, the graph has already been clustered.

We first focus on the MP+Readout model. We find that the king’s graph representation yields higher accuracies compared to the coarsened graph representation. In all cases, the difference in performance is significant. We believe that this is due to the information loss associated with the coarsening procedure (groups of nodes and their features are merged together). With regard to the different message passing layers, our results indicate that the 1-GNN and Gated Graph layer achieve much higher accuracies than the rest of the layers. Interestingly, these are the two layers that do not use mean aggregators in the message passing procedure. Mean aggregators capture the distribution of the features in the neighborhood of a node. Thus, they may fail to distinguish the exact neighbors of a node. We next compare the different readout functions to each other. We can see that SortPool is the worst-performing function. We believe that this is due to the fact that it ignores a large number of node representations. In the case of the king’s graph representation, Sum reached the highest accuracies, while Max and Mean reached the second and the third best accuracy levels among all considered functions.
On the other hand, in the case of coarsened graphs, Max is the best-performing function. The Sum function yielded similarly good results, while Mean produced slightly worse results than the other two functions.

As discussed above, the MP+CNN model can only be applied to the king’s graph representation of the images. This model delivers the “best of both worlds” from GNNs and CNNs. It combines the representational capacity of GNNs with the ability of CNNs to effectively deal with image data. In Table 1, we can see that it also achieves very high accuracies, regardless of the employed message passing layer. Interestingly, message passing layers that failed to achieve high levels of accuracy when integrated into the MP+Readout model (e.g., GAT, GCN, GraphSAGE) now achieve accuracies close to the maximum observed.

The MP+Pool+Readout model is also applied only to the king’s graph representation of images. Clearly, the MP+Pool+Readout model improves over the MP+Readout model for all combinations of message passing layers and readout functions. This highlights that clustering neighboring pixels is highly beneficial when dealing with image data due to the statistical properties inherent to this kind of data such as local stationarity and compositionality. It should be mentioned that one of the variants of the MP+Pool+Readout model (the one that uses the Gated Graph message passing layer and the Max readout function) achieved the highest validation accuracy among all considered models.

We also studied the impact of edge directions on the performance of the models. We assign directions to all the edges such that they start from nodes on the top and/or left and end at nodes on the bottom and/or right. We use the MP+CNN model to evaluate these representations. Table 2 illustrates the obtained accuracies for different message passing layers. We can see that for almost all layers, the use of directed edges has almost no impact on the classification accuracy. Notably, in the case of the Gated Graph layer, the use of the directed king’s graph representation led to an absolute improvement of 1.6%.

Table 2. Performance (accuracy, %) of the different message passing layers of the MP+CNN model on the validation set of MNIST for (un)directed king’s graph representations.

MP layer             Undirected   Directed
GAT                  96.6         96.7
GCN                  96.4         96.3
GraphSAGE            97.1         96.8
1-GNN                97.1         97.3
Gated Graph layer    96.4         98.0

4.2 Image Classification

Table 3. Performance (accuracy, %) of the selected models on the full MNIST dataset.

Model                            MNIST
CNN                              99.1
MP+Readout                       96.4
MP+Readout (coarsened graph)     72.7
MP+CNN                           99.4
MP+Pool+Readout                  98.9
We next evaluate the architectures that performed best in our model selection experiments on the full MNIST dataset. The obtained accuracies are shown in Table 3. We can see that MP+CNN achieves the highest accuracy, followed by CNN, MP+Pool+Readout, MP+Readout and MP+Readout (applied to the coarsened graph) in that order. Excluding the model applied to the coarsened graph, the difference in performance between the rest of the models is relatively small, indicating that all of them are effective in classifying the images contained in the MNIST dataset.

4.3 Robustness to Affine Transformations
We next investigate the sensitivity of the proposed models to affine transformations applied to the images of MNIST. Such natural transformations can be used to completely fool image classification models. We apply three different types of transformations: scaling, rotation and translation. In all three cases, we experimented with the full MNIST dataset, i.e., 60 000 training samples and 10 000 test samples. We next present the experimental setup for each one of the three transformations.

– Scaling: Given a scaling factor k, we scale down all the images of the test set as follows: for each image, we randomly sample a scaling factor from 1/10 to 1/k by steps of 1/10, i.e., we randomly sample one of the elements of the set {1/10, 2/10, 3/10, ..., 1/k} with equal probability. We also scale down 5% of the images of the training set following the same procedure.
– Rotation: In order to ensure that the perturbed images are not heavily distorted, we restrict our rotations to a maximum of 20°. Specifically, we run a series of experiments where, in each experiment, we rotate all the images of the test set by a specific number of degrees, while we rotate no images of the training set.
– Translation: The translation is applied as follows. We first randomly sample a valid pair of cardinal directions (i.e., northeast, southeast, southwest, or northwest) with uniform probability (i.e., 0.25 for each pair). Then, the image is shifted by no more than k pixels along each one of the two directions. For each direction, the amount of shift is chosen with uniform probability from {0, 1, ..., k}. This transformation is applied to all the images of the test set and to 5% of the training data.
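The translation procedure lends itself to a short sketch. The version below is illustrative only: it represents an image as a list of rows, fills the vacated pixels with zeros (an assumption, since the fill value is not stated above), and uses hypothetical names throughout.

```python
import random

# (row step, column step) for each diagonal pair; "north" means towards row 0.
DIAGONALS = {
    "northeast": (-1, 1),
    "southeast": (1, 1),
    "southwest": (1, -1),
    "northwest": (-1, -1),
}


def shift_image(image, d_row, d_col, fill=0):
    """Shift an image (list of rows) by (d_row, d_col); vacated pixels get `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            new_r, new_c = r + d_row, c + d_col
            if 0 <= new_r < height and 0 <= new_c < width:
                shifted[new_r][new_c] = image[r][c]
    return shifted


def random_translation(image, k, rng=random):
    """Pick one of the four diagonal direction pairs uniformly, then shift by an
    amount drawn uniformly from {0, 1, ..., k} along each of the two directions."""
    row_step, col_step = rng.choice(list(DIAGONALS.values()))
    d_row = row_step * rng.randint(0, k)   # randint is inclusive at both ends
    d_col = col_step * rng.randint(0, k)
    return shift_image(image, d_row, d_col)


toy = [[1, 2], [3, 4]]
print(shift_image(toy, 1, 0))        # one pixel towards the south
print(random_translation(toy, k=1))  # random diagonal shift of at most 1 pixel
```

Passing a seeded `random.Random` instance as `rng` makes the perturbed test set reproducible.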
Fig. 2. Performance of the different models under transformations of test set.
We show the results for the three types of transformations in Fig. 2. Interestingly, scaling and rotation do not have as large an impact on the performance of the models as translation. We observe that MP+CNN is generally the most robust approach to the three types of transformations. Specifically, it outperforms the other approaches across all considered scaling factors and rotation degrees. It is only outperformed by the other approaches in the case of largely-translated images (maximum translation greater than 7 pixels). MP+Pool+Readout achieves the second best accuracy levels among all models considered. It outperforms CNN and MP+Readout in the case of scaled and translated images, while it yields similar results to CNN in the case of rotated images. MP+Readout seems to be more invariant to scale-transformed images than CNN since it achieves higher accuracies in this set of experiments. However, the same does not hold for images that have undergone rotation or translation. In the latter case, MP+Readout is outperformed by all the other models by very wide margins.

4.4 Robustness Against Adversarial Examples

Fig. 3. Performance of the different models with respect to the amount of perturbation applied to the images of the test set.
It has been shown that many classes of machine learning algorithms are vulnerable to adversarial manipulation of their input. This can often lead to incorrect classification. More specifically, neural networks are highly vulnerable to attacks based on small modifications of the input to the model at test time [9,16]. An interesting research direction is to investigate how robust the different models are against adversarial attacks. To this end, we follow the work presented in [9] and we create a set of adversarial samples by applying perturbations to the images of the test set as follows: $\tilde{X} = X + P$, where $X$ is an image and $P$ is the matrix that contains the perturbations for the different elements of the image. It turns out that if worst-case perturbations are applied to test samples, then a model might produce incorrect predictions with high confidence [9]. We next present how such worst-case perturbations can be produced. Let $\theta$ be the parameters of a model, $X$ the input to the model, $y$ the target associated with $X$, and $J(\theta, X, y)$ the loss function used to train the neural network. An optimal max-norm constrained perturbation can be obtained as follows:

$$P = \epsilon \, \mathrm{sign}\big(\nabla_X J(\theta, X, y)\big) \qquad (2)$$

where $\epsilon$ sets the magnitude of the perturbation.
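As a worked illustration of Eq. (2), the sketch below applies the sign-of-gradient perturbation to a toy binary logistic regression, for which the gradient of the cross-entropy loss with respect to the input has the closed form (p − y)·w. The model, its weights and the input are hypothetical stand-ins chosen only to keep the example self-contained; the experiments above use the trained neural networks instead, and a real pipeline would typically also clip the perturbed pixels to the valid intensity range.

```python
import math


def sign(value):
    """Return -1, 0 or 1 according to the sign of `value`."""
    return (value > 0) - (value < 0)


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the binary cross-entropy loss J with respect to the input x.

    For logistic regression this is (p - y) * w, where p = predict(w, b, x).
    """
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def fgsm(w, b, x, y, epsilon):
    """Adversarial sample x + P with P = epsilon * sign(grad_x J), as in Eq. (2)."""
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input with true label y = 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm(w, b, x, y, epsilon=0.3)

print(predict(w, b, x))      # confidence in the true class before the attack
print(predict(w, b, x_adv))  # lower confidence after the attack
```

Every coordinate of the input moves by exactly ε (or stays put where the gradient is zero), which is what "max-norm constrained" means here.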
We use the above approach to produce 10 000 adversarial samples from the test images of MNIST. We use different values of $\epsilon$ that range from 0 to 0.3. We train the models on the standard training set of MNIST (i.e., 60 000 samples), and then we evaluate them on the set of adversarial samples. Figure 3 illustrates the obtained results. We observe that the performance of the different models decreases significantly as the value of $\epsilon$ increases. This is not surprising since the greater the value of $\epsilon$, the larger the amount of perturbation that is applied to the images of the test set. Clearly, MP+CNN outperforms the other models for the different values of $\epsilon$. CNN produced the second best results. Note that CNN and MP+Readout yield an error rate greater than 87% when $\epsilon$ is equal to 0.3.

Acknowledgement. This research is co-financed by Greece and the European Union (European Social Fund, ESF) through the Operational Programme «Human Resources Development, Education and Lifelong Learning» in the context of the project “Reinforcement of Postdoctoral Researchers - 2nd Cycle” (MIS-5033021), implemented by the State Scholarships Foundation (IKY).
References

1. Acosta-Mendoza, N., Gago-Alonso, A., Medina-Pagola, J.E.: Frequent approximate subgraphs as features for graph-based image classification. Knowl.-Based Syst. 27, 381–392 (2012)
2. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theor. Exp. 2008(10), P10008 (2008)
3. Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and deep locally connected networks on graphs. In: 2nd International Conference on Learning Representations (2014)
4. Camps-Valls, G., Shervashidze, N., Borgwardt, K.M.: Spatio-spectral remote sensing image classification with graph kernels. IEEE Geosci. Remote Sens. Lett. 7(4), 741–745 (2010)
5. Conte, D., Foggia, P., Sansone, C., Vento, M.: How and why pattern recognition and computer vision applications use graph. In: Applied Graph Theory in Computer Vision and Pattern Recognition, pp. 85–135 (2007)
6. Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems, pp. 3844–3852 (2016)
7. Duchenne, O., Joulin, A., Ponce, J.: A graph-matching kernel for object categorization. In: Proceedings of the 2011 International Conference on Computer Vision, pp. 1792–1799 (2011)
8. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1263–1272 (2017)
9. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: 3rd International Conference on Learning Representations (2015)
10. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. In: Advances in Neural Information Processing Systems, pp. 1024–1034 (2017)
11. Harchaoui, Z., Bach, F.: Image classification with segmentation graph kernels. In: Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
12. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint https://arxiv.org/abs/1609.02907 (2016)
13. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
14. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv preprint https://arxiv.org/abs/1511.05493 (2015)
15. Morris, C.: Weisfeiler and Leman go neural: higher-order graph neural networks. In: Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pp. 4602–4609 (2019)
16. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations of deep learning in adversarial settings. In: Proceedings of the 1st IEEE European Symposium on Security and Privacy, pp. 372–387 (2016)
17. Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: Learning feature matching with graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4938–4947 (2020)
18. Simonovsky, M., Komodakis, N.: Dynamic edge-conditioned filters in convolutional neural networks on graphs. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3693–3702 (2017)
19. Vazirgiannis, M., Malliaros, F.D., Nikolentzos, G.: GraphRep: Boosting text mining, NLP and information retrieval with graphs. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2295–2296 (2018)
20. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint https://arxiv.org/abs/1710.10903 (2017)
21. Vento, M., Foggia, P.: Graph matching techniques for computer vision. In: Image Processing: Concepts, Methodologies, Tools, and Applications, pp. 381–421 (2013)
22. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. arXiv preprint https://arxiv.org/abs/1901.00596 (2019)
23. Zhang, M., Cui, Z., Neumann, M., Chen, Y.: An end-to-end deep learning architecture for graph classification. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 4438–4445 (2018)
24. Zhang, Z., Lee, W.S.: Deep graphical feature learning for the feature matching problem. In: Proceedings of the 2019 IEEE International Conference on Computer Vision, pp. 5087–5096 (2019)
25. Zheng, M., Bu, J., Chen, C., Wang, C., Zhang, L., Qiu, G., Cai, D.: Graph regularized sparse coding for image representation. IEEE Trans. Image Process. 20(5), 1327–1336 (2010)
Graph-Based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles

M. Tarik Altuncu¹, Sophia N. Yaliraki², and Mauricio Barahona¹(B)

¹ Department of Mathematics, Imperial College London, London, UK
[email protected]
² Department of Chemistry, Imperial College London, London, UK
Abstract. Production of news content is growing at an astonishing rate. To help manage and monitor the sheer amount of text, there is an increasing need to develop efficient methods that can provide insights into emerging content areas, and stratify unstructured corpora of text into ‘topics’ that stem intrinsically from content similarity. Here we present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning that can reveal natural partitions at different resolutions without making a priori assumptions about the number of clusters in the corpus. We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods, and also evaluate different text vector embeddings, from classic Bag-of-Words to Doc2Vec to the recent transformers based model Bert. This comparative work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
1 Introduction
The explosion in the amount of news and journalistic content generated across the globe, coupled with extended and instantaneous access to information via online media, makes it difficult and time-consuming to monitor news and opinion formation in real time. There is an increasing need for tools that can pre-process, analyse and classify raw text to extract interpretable content; specifically, identifying topics and content-driven groupings of articles. This is in contrast with traditional approaches to text classification, typically reliant on human-annotated labels of pre-designed categories and hierarchies [3]. Methodologies that provide automatic, unsupervised clustering of articles based on content directly from free text without external labels or categories could thus provide alternative ways to monitor the generation and emergence of news content.

In recent years, fuelled by advances in statistical learning, there has been a surge of research on unsupervised topic extraction from text corpora [6,9]. In previous work [1], we developed a graph-based framework for the clustering of text documents that combines the advantages of paragraph vector representation of text through Doc2vec [8] with Markov Stability (MS) community detection [4], a multiscale graph partitioning method that is applied to a geometric similarity graph of documents derived from the high-dimensional vectors representing the text. Through this approach, we obtained robust clusters at different resolutions (from finer to coarser) corresponding to groupings of documents with similar content at different granularity (from more specific to more generic). Because the similarity graph of documents is based on vector embeddings that capture syntactic and semantic features of the text, the graph partitions produce clusters of documents with consistent content in an unsupervised manner, i.e., without training a classifier on hand-coded, pre-designed categories.

Our original application in Ref. [1] was a very large and highly specialised corpus of incident reports in a healthcare setting with specific challenges, i.e., highly specialised terminology mixed with informal and irregular use of language and abbreviations. The large amount of data (13 million records) in that corpus allowed us to train a specific Doc2vec model to successfully circumvent these challenges. In this work, we present the extension of our work in Ref. [1] to a more general use-case setting by focussing on topic extraction from a relatively small number of documents in standard English. We illustrate our approach through a corpus of ∼9,000 news articles published in the US during 2016. Although training a specific language model is less feasible for such small corpora, the use of standard language allows us to employ English language models trained on generic corpora (e.g., Wikipedia articles). Here, we use pre-trained language models obtained with Doc2Vec [8] and with recent transformer models like Bert [5], which are based on deeper neural networks, and compare them to classic Bag-of-Words (BoW) based features. In addition, we evaluate our multiscale graph-based partitioning (MS) against classic probabilistic topic clustering methods like LDA [2] and other widely-used clustering methods (k-means and hierarchical clustering).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 154–166, 2021. https://doi.org/10.1007/978-3-030-65351-4_13
2 Feature Vector Generation from Free Text
Data: The corpus consists of 8,748 online news articles published by US-based Vox Media during 2016, a presidential election year in the USA¹. The corpus excludes articles without specific news content, without an identifiable publication date, or with very short text (less than 30 tokens).

Pre-processing, Tokenisation and Normalisation: We removed HTML tags, code pieces, and repeated wrapper sentences (header or footer scripts, legal notes, directions to interact with multimedia content, signatures). We replaced accented characters with their nearest ASCII characters; removed white space characters; and divided the corpus into lists of word tokens via the regex word token divider ‘\w+’. To reduce semantic sparsity, we used Part of Speech (POS) tags to retain only adjectives, nouns, verbs, and adverbs. The remaining tokens are lowercased and converted to lemmas using the WordNet Lemmatizer [11]. Lastly, we removed common (and thus less meaningful) tokens².

¹ The news corpus is accessible on https://data.world/elenadata/vox-articles.
² The full list of common words is: {‘be’, ‘have’, ‘do’, ‘make’, ‘get’, ‘more’, ‘even’, ‘also’, ‘just’, ‘much’, ‘other’, ‘n't’, ‘not’, ‘say’, ‘tell’, ‘re’}.
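The normalisation steps can be sketched with the Python standard library alone. This is a simplified illustration rather than the pipeline used here: the HTML stripping is a crude regular expression, and the POS filtering and WordNet lemmatisation steps are deliberately omitted because they require an external NLP library. The function names are hypothetical.

```python
import re
import unicodedata

# Common tokens listed in the footnote above.
COMMON_TOKENS = {
    "be", "have", "do", "make", "get", "more", "even", "also",
    "just", "much", "other", "n't", "not", "say", "tell", "re",
}


def strip_accents(text):
    """Replace accented characters by their closest unaccented form."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenise(text):
    """Remove HTML tags, strip accents, split on the regex \\w+, lowercase,
    and drop the common tokens. POS filtering and lemmatisation are omitted."""
    text = re.sub(r"<[^>]+>", " ", text)      # crude HTML tag removal
    text = strip_accents(text)
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    return [token for token in tokens if token not in COMMON_TOKENS]


print(tokenise("<p>The café did not even open</p>"))
```

Because lemmatisation is skipped in this sketch, inflected forms such as "is" or "had" are not mapped to "be" or "have" and therefore survive the common-token filter.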
Bag-of-Words Based Features: We produce TF-IDF (Term Frequency-Inverse Document Frequency) BoW features filtered with Latent Semantic Analysis (LSA) [14] to reduce the high-dimensional sparse TF-IDF vectors to a 300-dimensional continuous space.

Neural Network Based Features: Doc2vec [8], a version of Word2vec [10] adapted to longer forms of text, was one of the original neural network based methods with the capability to represent full-length free text documents in an embedded fixed-dimensional vector space. Before it can be used to infer a vector for a given document, Doc2Vec must be trained on a corpus of text documents that share similar context and vocabulary with the documents to be embedded. As a source of common English compatible with the broad usage in news articles, we used Gensim version 3.8.0 [17] to train a Doc2Vec language model on a recent Wikipedia dump consisting of 5.4 million articles pre-processed as described above³. The optimised model had hyperparameters {training method = dbow, number of dimensions for feature vectors size = 300, number of epochs = 10, window size = 5, minimum count = 20, number of negative samples = 5, random down-sampling threshold for frequent words = 0.001}.

Transformers Based Features: Natural Language Processing (NLP) is currently evolving rapidly with the emergence of transformer-based, deep learning methods such as ELMo [15], GPT [16] and BERT, the recent state-of-the-art model by Google [5]. The first step for these models, called pre-training, uses different learning tasks (e.g., next sentence prediction or masked language model) to model the language of the training corpus without explicit supervision. Whereas the neural network in Doc2vec only has two layers, transformer-based models involve deeper neural networks, and thus require much more data and massive computational resources to pre-train. Fortunately, these methods have publicly available models for popular languages. Here, we use the model ‘BERT base, Uncased’ with 12 layers and 100 million parameters, which produces 768-dimensional vectors based on pre-training using BookCorpus [22] and English Wikipedia articles. Because Bert models carry out their own pre-processing (WordPiece tokenisation and Out-Of-Vocabulary handling steps [20]), we do not apply our tokenisation and normalisation routine. Since transformer-based models cannot process text longer than a few sentences (510 tokens in total) due to memory limits in GPUs, we analyse the sentences in an article individually; obtain embedded vectors for all of them; and compute the feature vector of the document as the average of the sentence vectors. We obtained two feature vectors from Bert: (i) baas: the mean of the token embeddings taken from the second-to-last of the 12 layers, as given by bert-as-service⁴; (ii) sbert: a sentence-level vector computed with Sentence Transformers⁵ [18], which improves the pre-trained Bert model using a 3-way softmax classifier for Natural Language Inference (NLI) tasks. Although BERT models are powerful, they are optimised for supervised downstream tasks and need to be fine-tuned further through a secondary step on specifically annotated data sets. Hence pre-trained BERT models without fine-tuning are not optimised for the quality of their vector embedding, which is the primary input for our unsupervised clustering task.

³ English Wikipedia corpus (1 December 2019) downloaded from https://dumps.wikimedia.org/enwiki/.
⁴ bert-as-service is an open-source library published at github.com/hanxiao/bert-as-service.
⁵ Sentence Transformers is an open-source library published at github.com/UKPLab/sentence-transformers.
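The document-level vector described above is simply the average of the per-sentence vectors. A minimal sketch follows; `toy_embed` is a hypothetical stand-in for a real sentence encoder (it is not the bert-as-service or Sentence Transformers API), used only so that the example runs without any model.

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_document(sentences, embed_sentence):
    """Embed each sentence separately and average the sentence vectors."""
    return mean_vector([embed_sentence(sentence) for sentence in sentences])


def toy_embed(sentence):
    """Hypothetical 2-dimensional 'encoder': [number of characters, number of words]."""
    return [float(len(sentence)), float(len(sentence.split()))]


doc_vector = embed_document(["a b", "abcde"], toy_embed)
print(doc_vector)
```

In practice `embed_sentence` would wrap the chosen pre-trained model and return a 768-dimensional vector per sentence.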
3 Finding Topic Clusters Using Feature Vectors
The above methods lead to five different feature vectors for the documents in the corpus. These vectors are then clustered through the graph-based MS framework, which we benchmark against alternative graph-less clustering methods.

3.1 Graph Construction and Community Detection
In many real-world applications, there is an absence of accurate prior knowledge or ground truth labels about the underlying data. In such cases, unsupervised clustering methods must be applied. Rather than fixing the number of clusters a priori, we use here Markov Stability, a multiscale method that produces a series of intrinsically robust partitions at different levels of resolution. MS thus allows flexibility in choosing the granularity of topics as appropriate for the analysis.

Constructing a Sparsified Similarity Graph: From the feature vectors of the 8,748 articles, we compute all pairwise cosine similarities. This dense similarity matrix is sparsified by using the MST-kNN method [21] to construct a sparse geometric graph. MST-kNN consists of two steps: first, obtain the Minimum Spanning Tree (MST) to ensure global connectivity in the graph; then, link each document to its k Nearest Neighbours (kNN), so as to connect highly similar documents and thus preserve the local geometry. Based on our work in [1], we set k = 13. We construct MST-kNN similarity graphs from each of the five sets of feature vectors: G_tfidf, G_tfidf+lsa, G_d2v, G_baas, G_sbert.

Multiscale Graph Partitioning with MS: We applied the MS partitioning algorithm [4,7,19] to these five similarity graphs. We compute the partitions P_t that optimise the MS cost function as we vary the Markov time t, a parameter that modifies the coarseness of the partitions. We choose the three partitions at different resolutions that are most robust, both across t and to the optimisation process [7,19], and label them as fine (F), medium (M), and coarse (C). The three partitions for the five similarity graphs shown in Fig. 1 are:

1. G_baas: [22, 10, 5] communities at t = [0.849, 3.881, 18.557]
2. G_sbert: [20, 10, 4] communities at t = [0.931, 4.159, 14.756]
3. G_tfidf: [22, 11, 5] communities at t = [2.037, 9.749, 34.592]
4. G_tfidf+lsa: [23, 11, 5] communities at t = [3.793, 15.811, 52.356]
5. G_d2v: [20, 9, 4] communities at t = [1.315, 2.944, 21.329].
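The two-step MST-kNN construction can be sketched as follows. This is an illustrative version under stated assumptions, not the implementation of [21]: the distance is taken as 1 minus the cosine similarity, the MST is computed with a simple O(n²) Prim's algorithm, and the final graph is the union of the MST edges and the kNN edges, returned as an unweighted edge set. Zero vectors are not handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mst_knn_edges(vectors, k):
    """Return the undirected edges (i, j), i < j, of the MST-kNN graph."""
    n = len(vectors)
    dist = [
        [1.0 - cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]
    edges = set()

    # Step 1: minimum spanning tree (Prim's algorithm) for global connectivity.
    # best[j] = (distance to the tree, closest tree node) for nodes outside the tree.
    best = {j: (dist[0][j], 0) for j in range(1, n)}
    while best:
        j = min(best, key=lambda node: best[node][0])
        _, parent = best.pop(j)
        edges.add((min(parent, j), max(parent, j)))
        for other in best:
            if dist[j][other] < best[other][0]:
                best[other] = (dist[j][other], j)

    # Step 2: link every node to its k nearest neighbours to keep local geometry.
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in nearest:
            edges.add((min(i, j), max(i, j)))
    return edges


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(sorted(mst_knn_edges(vectors, k=1)))
```

In the toy example the two tight pairs are joined by their kNN links, and the MST contributes the single bridge that keeps the graph connected.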
Fig. 1. Results of Markov Stability for the five similarity graphs constructed from different document feature vectors. As a function of Markov time t, we show: the number of clusters of the optimised partition P_t (red line); the variation of information VI(P_t) for the ensemble of optimised solutions at each t (blue line); and the variation of information VI(P_t, P_t') between the optimised partitions across Markov time (background colourmap). Robust partitions correspond to dips of VI(P_t) and extended plateaux of the number of communities and VI(P_t, P_t'). The selected partitions are indicated by vertical lines: green (Fine), orange (Medium) and purple (Coarse).
3.2 Graph-Less Clustering Methods as Benchmarks
We benchmark our graph-based clustering against: (i) two widely used graph-less clustering methods, k-means and hierarchical clustering with Ward linkage, both applied using the implementation and default parameters in Scikit-Learn version 0.22; and (ii) LDA probabilistic topic models [2] for each resolution (F, M, C), using the state-of-the-art implementation, i.e., the default parameters of LdaMulticore in Gensim version 3.8.0, with the number of passes set to 100 to ensure convergence. Unlike the other clustering methods, LDA does not produce a hard assignment of news articles into clusters, but returns probabilities over all clusters. We assign each article to its most probable topic (i.e., cluster). Because LDA works on term frequencies (tf), we cannot use the other feature vectors. We remark that, had we not already obtained our MS results, it would not be easy to make an informed a priori choice of the number of topics/clusters for any of these methods. Hence we use the three MS levels (F = 20, M = 9, C = 4) as the number of clusters in the benchmark methods. Counting five different feature vectors and three clustering algorithms along with the LDA, we have 16 experiments per resolution level (F, M, C). The next sections are devoted to the quantitative and qualitative comparison of all 48 experiments.
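The count of experiments and the hard assignment used for LDA can be made concrete with a short sketch. The names `experiments_for_level` and `hard_assignment` are hypothetical and chosen only for illustration; the cluster numbers are those quoted in the text for the benchmark methods.

```python
from itertools import product

# Feature sets and clustering algorithms compared in the text.
FEATURES = ["baas", "d2v", "sbert", "tfidf", "tfidf+lsa"]
CLUSTERINGS = ["MS", "Ward", "k-means"]
# Number of clusters given to the benchmark methods at each resolution.
LEVELS = {"F": 20, "M": 9, "C": 4}


def experiments_for_level(level):
    """List the (features, clustering, level) runs for one resolution level."""
    runs = [(feat, algo, level) for feat, algo in product(FEATURES, CLUSTERINGS)]
    runs.append(("tf", "LDA", level))  # LDA is only applied to term frequencies
    return runs


def hard_assignment(topic_probabilities):
    """Index of the most probable topic for one article."""
    return max(range(len(topic_probabilities)),
               key=topic_probabilities.__getitem__)


all_runs = [run for level in LEVELS for run in experiments_for_level(level)]
```

Here 5 feature sets times 3 clustering algorithms give 15 runs, plus one LDA run on tf, for 16 runs per level and 48 runs over the three levels.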
4 Evaluation of Topic Clusters
4.1 Quantitative Analysis of Partition Quality
In the absence of ‘ground truth’ topics in our corpus, we cannot score directly the quality of the clusters. Instead, we compute two different measures of the consistency and relevance of cluster content.

Measuring Intrinsic Topic Coherence with the Aggregate PMI: Following [12,13], we create an intrinsic measure of topic coherence without reference to external ground truth topics. Our measure considers the association of frequent ‘word couples’ in each cluster and computes the pointwise mutual information (PMI) compared to its reference score in a standard corpus (in our case, the English Wikipedia). The topical coherence of articles within a topic, averaged over all topics for each experiment, gives the Aggregate PMI scores (PMI) shown in Table 1.

Our results of topic coherence show that LDA performs poorly, producing the least coherent topics, whereas MS partitions are the most coherent overall for most types of features (except for a few cases where k-means and hierarchical Ward are better). On average across all features, MS is the best, followed by k-means and Ward. Regarding text features, there is no single best performing one across all resolutions. Interestingly, tfidf features produce coherent topic clusters, perhaps due to the fact that tfidf and tfidf+lsa features are based on counts of word occurrence, and hence very similar in spirit to PMI scores. Among the neural network based features, baas vectors from Bert work best at the fine and medium levels, while d2v is best at the coarse level.

Table 1. Aggregate topic coherence (PMI) of unsupervised topic clusters (all clusterings and document features) at three resolution levels. Best clustering for each resolution level in boldface.
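| Features | Fine MS | Fine Ward | Fine k-means | Fine LDA | Medium MS | Medium Ward | Medium k-means | Medium LDA | Coarse MS | Coarse Ward | Coarse k-means | Coarse LDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| baas | 1.866 | 1.692 | 1.832 | - | 1.775 | 1.553 | 1.717 | - | 1.656 | 1.332 | 1.498 | - |
| d2v | 1.843 | 1.660 | 1.856 | - | 1.584 | 1.672 | 1.673 | - | 1.724 | 1.532 | 1.349 | - |
| sbert | 1.731 | 1.706 | 1.829 | - | 1.772 | 1.639 | 1.638 | - | 1.482 | 1.405 | 1.635 | - |
| tf | - | - | - | 1.473 | - | - | - | 1.292 | - | - | - | 1.182 |
| tfidf | 1.844 | 1.803 | 1.860 | - | 1.855 | 1.584 | 1.665 | - | 1.845 | 1.481 | 1.209 | - |
| tfidf+lsa | 2.026 | 1.870 | 1.992 | - | 1.579 | 1.664 | 1.632 | - | 1.727 | 1.602 | 1.606 | - |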
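As a rough illustration of the PMI-based coherence reported in Table 1, the sketch below scores a list of top words against a reference corpus. It rests on assumptions that are not taken from the paper: co-occurrence is counted at document level, each reference document is a set of tokens, the aggregate is a plain mean over word pairs, and the names `pmi` and `topic_coherence` are hypothetical.

```python
import math
from itertools import combinations


def pmi(word_a, word_b, reference_docs):
    """PMI of two words from document-level co-occurrence in a reference corpus.

    Returns None when the pair never co-occurs, since the logarithm is undefined.
    """
    n_docs = len(reference_docs)
    count_a = sum(1 for doc in reference_docs if word_a in doc)
    count_b = sum(1 for doc in reference_docs if word_b in doc)
    count_ab = sum(1 for doc in reference_docs
                   if word_a in doc and word_b in doc)
    if count_ab == 0:
        return None
    p_ab = count_ab / n_docs
    return math.log(p_ab / ((count_a / n_docs) * (count_b / n_docs)))


def topic_coherence(top_words, reference_docs):
    """Mean PMI over all pairs of top words (assumed aggregation)."""
    scores = [pmi(a, b, reference_docs) for a, b in combinations(top_words, 2)]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging `topic_coherence` over all topic clusters of a partition would then give one aggregate score per experiment, in the spirit of the values in Table 1.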
Comparison to External Commercial Classifications: An alternative measure of topic cluster quality is to compare to other external classifications. Here we use categorical labels of semantic information produced by two commercial text classification services: Google Cloud Platform’s (GCP) Natural Language API, and Open Calais (OC) by Thomson Reuters. We compare our unsupervised topic clusters to the classes obtained by these commercial products (assumed to be trained on clean, human-labelled data) using the Normalised Mutual Information (NMI) and the Adjusted Rand Index (ARI), as shown in Table 2.

Table 2. Unsupervised topic clusters (all features and clusterings) at three resolution levels scored against two commercial classification labels: Google Cloud Platform (GCP) and Open Calais (OC). Best clustering for each resolution level in boldface.

| Commercial service | Features | Fine MS | Fine Ward | Fine k-means | Fine LDA | Medium MS | Medium Ward | Medium k-means | Medium LDA | Coarse MS | Coarse Ward | Coarse k-means | Coarse LDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCP | baas | 0.356 | 0.335 | 0.345 | - | 0.395 | 0.325 | 0.355 | - | 0.400 | 0.339 | 0.345 | - |
| GCP | d2v | 0.363 | 0.356 | 0.363 | - | 0.391 | 0.351 | 0.380 | - | 0.425 | 0.365 | 0.383 | - |
| GCP | sbert | 0.347 | 0.312 | 0.320 | - | 0.370 | 0.303 | 0.308 | - | 0.381 | 0.286 | 0.258 | - |
| GCP | tf | - | - | - | 0.183 | - | - | - | 0.161 | - | - | - | 0.180 |
| GCP | tfidf | 0.353 | 0.339 | 0.369 | - | 0.404 | 0.332 | 0.377 | - | 0.408 | 0.294 | 0.387 | - |
| GCP | tfidf+lsa | 0.389 | 0.350 | 0.370 | - | 0.400 | 0.327 | 0.345 | - | 0.428 | 0.313 | 0.387 | - |
| OC | baas | 0.362 | 0.319 | 0.349 | - | 0.392 | 0.315 | 0.350 | - | 0.391 | 0.318 | 0.343 | - |
| OC | d2v | 0.372 | 0.349 | 0.363 | - | 0.391 | 0.330 | 0.361 | - | 0.413 | 0.351 | 0.371 | - |
| OC | sbert | 0.354 | 0.309 | 0.321 | - | 0.361 | 0.284 | 0.293 | - | 0.394 | 0.259 | 0.225 | - |
| OC | tf | - | - | - | 0.168 | - | - | - | 0.175 | - | - | - | 0.191 |
| OC | tfidf | 0.364 | 0.341 | 0.354 | - | 0.410 | 0.310 | 0.371 | - | 0.393 | 0.273 | 0.387 | - |
| OC | tfidf+lsa | 0.389 | 0.341 | 0.372 | - | 0.378 | 0.303 | 0.332 | - | 0.411 | 0.296 | 0.344 | - |

(a) Normalised Mutual Information
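| Commercial service | Features | Fine MS | Fine Ward | Fine k-means | Fine LDA | Medium MS | Medium Ward | Medium k-means | Medium LDA | Coarse MS | Coarse Ward | Coarse k-means | Coarse LDA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GCP | baas | 0.171 | 0.152 | 0.163 | - | 0.349 | 0.241 | 0.259 | - | 0.483 | 0.295 | 0.306 | - |
| GCP | d2v | 0.212 | 0.180 | 0.150 | - | 0.360 | 0.252 | 0.254 | - | 0.559 | 0.494 | 0.415 | - |
| GCP | sbert | 0.161 | 0.151 | 0.147 | - | 0.319 | 0.292 | 0.227 | - | 0.487 | 0.304 | 0.291 | - |
| GCP | tf | - | - | - | 0.127 | - | - | - | 0.075 | - | - | - | -0.004 |
| GCP | tfidf | 0.190 | 0.156 | 0.167 | - | 0.488 | 0.253 | 0.245 | - | 0.520 | 0.195 | 0.312 | - |
| GCP | tfidf+lsa | 0.301 | 0.173 | 0.167 | - | 0.476 | 0.149 | 0.190 | - | 0.552 | 0.151 | 0.342 | - |
| OC | baas | 0.171 | 0.150 | 0.162 | - | 0.332 | 0.234 | 0.250 | - | 0.405 | 0.251 | 0.282 | - |
| OC | d2v | 0.215 | 0.176 | 0.157 | - | 0.336 | 0.229 | 0.224 | - | 0.468 | 0.428 | 0.408 | - |
| OC | sbert | 0.173 | 0.156 | 0.159 | - | 0.311 | 0.252 | 0.213 | - | 0.462 | 0.269 | 0.253 | - |
| OC | tf | - | - | - | 0.102 | - | - | - | 0.114 | - | - | - | 0.019 |
| OC | tfidf | 0.197 | 0.149 | 0.147 | - | 0.422 | 0.194 | 0.224 | - | 0.409 | 0.164 | 0.301 | - |
| OC | tfidf+lsa | 0.284 | 0.161 | 0.160 | - | 0.399 | 0.146 | 0.171 | - | 0.457 | 0.171 | 0.269 | - |

(b) Adjusted Rand Index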
From the NMI and ARI scores for all our experiments, we find that MS graph-partitioning provides the best correspondence to the external classifications followed by k-means and Ward. LDA provides the worst scores against the external labels. Again, there is no clear winner among the features: tfidf+lsa, tfidf and d2v (all with MS) produce the best results against external classes at the F, M and C levels. Among the Bert variants, baas performs better than sbert but they do not outperform d2v.
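For readers unfamiliar with the two scores, the following sketch computes NMI and ARI for two label lists in plain Python. It is an illustrative sketch only: the paper does not state which NMI normalisation was used, so the arithmetic mean of the two entropies is assumed here, and the function names are hypothetical.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (natural logarithm) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalised_mutual_information(labels_a, labels_b):
    """NMI with arithmetic-mean normalisation (an assumed choice)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    mi = sum((c / n) * math.log(c * n / (count_a[a] * count_b[b]))
             for (a, b), c in joint.items())
    mean_entropy = 0.5 * (_entropy(labels_a) + _entropy(labels_b))
    return mi / mean_entropy if mean_entropy > 0 else 1.0


def adjusted_rand_index(labels_a, labels_b):
    """ARI from pair counts; requires at least two labelled items."""
    def pairs(c):
        return c * (c - 1) / 2

    n = len(labels_a)
    sum_joint = sum(pairs(c) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(pairs(c) for c in Counter(labels_a).values())
    sum_b = sum(pairs(c) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / pairs(n)
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:
        return 1.0
    return (sum_joint - expected) / (maximum - expected)
```

Both scores are invariant to relabelling of the clusters, which is what makes them suitable for comparing unsupervised partitions with external classes. ARI can be negative when the agreement is below chance, as in one of the LDA entries of Table 2(b).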
Fig. 2. Sankey diagram and wordclouds for MS partitions of Gtfidf+lsa and the external labels of the commercial services GCP and OC.
As visualisation aids, we use multilevel Sankey diagrams to represent relationships between partitions, and wordclouds to represent the content of topic clusters. In Fig. 2, we show the mapping of the external classes of the commercial services OC and GCP against the best unsupervised clustering at the F level, given by MS of Gtfidf+lsa. Our clustering shows strong agreement with the large categories in both OC and GCP, with additional detail in several groupings. Overall, our results show that MS usually performs better than the other clustering algorithms both in terms of intrinsic word consistency and when comparing to commercial hand-labelled external classes.

4.2 Qualitative Analysis of the Topics
Using a multilevel Sankey diagram and wordclouds, Fig. 3 shows the MS topic clusters obtained from Gd2v at the three resolutions: fine, medium, coarse. We find four main topics at the C level, which, using information from the wordclouds, we label as ‘Politics & Elections’, ‘Societal Issues’, ‘Entertainment’, and ‘Health & Environment’. The last one divides into three communities at the medium resolution: ‘Public health & Medicine’, ‘Energy & Environment’, and ‘Healthcare & Insurance’. Among those, the last one also involves some contribution from the ‘Politics & Elections’ community over the discourse of healthcare policy of the presidential candidates. A similar mixed contribution is observed in M4, a finer resolution community of ‘Societal Issues’. M4 contains news articles related to ‘Black Lives Matter (BLM)’ and lies close to ‘Politics & Elections’. The other finer level of ‘Societal Issues’ is M6, which mixes ‘Politics & Elections’ and ‘Health & Environment’ as it involves ‘gender identity’ issues in the US as well as discussions around ‘planned parenthood’. This topic is a good example of the advantage of having intrinsically quasi-hierarchical partitions instead of forced hierarchies as in Ward’s hierarchical clustering. Indeed, in a strictly hierarchical clustering, we would have either lost M6, completely engulfed by another topic, or obtained less coherent topic cluster(s) at other levels. The flexibility of quasi-hierarchical structures, however, allows us to extract topics independently of the other resolutions.

Fig. 3. Multilevel Sankey diagram and wordclouds of the MS partitions of Gd2v at the three resolution levels (fine, medium, coarse).

Figure 3 also shows that the top community in the M level (M7) is largely composed of ‘Politics & Elections’ with small involvement of ‘Entertainment’. Since this topic is related to the election campaigns, the involvement of ‘Entertainment’ reflects the role of media in campaigns through, e.g., political speeches, interviews and debates in TV shows, as well as allegations of ‘fake news’ by one of the candidates who targeted the media industry (a small finer community F25). Although we only describe a few examples of interesting relationships across the three levels of resolution, all communities have distinct and high quality topics as reflected in their wordclouds.

To illustrate the importance of the choice of features for text embedding, in Fig. 4 we examine the MS topic clusters obtained from Gtfidf, i.e., from BoW-based features. Four of the five communities at the coarsest level of MS with tfidf are consistent with the four coarse level topics obtained by MS with d2v. The additional topic is ‘Foreign Policy’. Interestingly, a very similar cluster appears as a distinct grouping in the M level in the d2v partitions (M5 in Fig. 3) under ‘Politics & Elections’. The ‘Foreign Policy’ cluster in Fig. 4 includes as a subcluster ‘Brexit and EU’ news, one of the most unique and important events in 2016, and a subcluster that divides into a tiny but distinct topic of ‘Israeli and Palestinian conflict’, thus signalling the consistency of topic clusters. On the other hand, within the ‘Brexit and EU’ cluster we find a finer subcluster (F23) with news articles mentioning that Andrew Jackson’s portrait is being replaced by Harriet Tubman’s on $20 bills, thus indicating that the ‘BLM’ movement has been mistakenly grouped under M7. Hence BoW clusters at fine levels can lead to mixed topics due to coincidental use of specific words and the reliance on pure word counts, instead of contextual semantic predictions. A good example of this problem is F19, which mixes three unrelated topics of news articles from events like ‘Panama papers’, ‘super bowl’, and ‘Rio Olympics’, which share a list of common word tokens.

Fig. 4. Multilevel Sankey diagram and wordclouds for the three MS partitions of Gtfidf. The highlighted wordclouds correspond to topic clusters mentioned in the main text.

For further inspection of the resulting topic clusters, we provide a time line of monthly clusters accessible through interactive plots on http://bit.ly/Vox2016.
5 Conclusion
In this paper, we have presented a graph-based methodology for unsupervised topic extraction from text, which uses text vector embeddings and multiscale graph partitioning. Using a corpus of nearly 9,000 US news articles collected during 2016, we have compared our graph-based clustering to other widely used graph-less clustering methods, such as k-means, hierarchical clustering and LDA topic modelling. Taking advantage of the recent significant improvements in natural language processing, we have evaluated different document embedding vectors, from the simplest BoW features to Doc2Vec to the recent Bert-based alternatives. We benchmarked our results using measures of intrinsic word consistency, as well as comparisons to external commercial classifications, in order to quantify the quality of the resulting topic clusters. Using our quantitative analysis, we concluded that MS partitioning outperforms k-means, Ward and LDA clusters in almost all cases. We also observed that d2v embeddings are still among the best methods, outperforming the Bert variants on our data set. Most surprisingly, the traditional BoW-based tfidf and tfidf+lsa features are also successful, although their performance is less robust than d2v, with some impure topic clusters dominated by strong word counts. The qualitative analysis of the topic clusters displays quasi-hierarchical consistency in the topics, allowing for flexible subtopics to emerge at finer resolutions. Our analysis also shows that the cluster content is affected by both the features and the partitioning methods, especially at the finer levels. A conclusion of our qualitative analysis is that quantitative benchmarks based on word consistency or comparisons to external commercial classifications do not fully capture the quality of the topics. Future work should be aimed at improving the quantification of content quality.
For instance, the aggregation of topic coherence per topic does not take repetition into account, hence it does not penalise a partition with multiple topic clusters with similar content (see e.g., ‘Politics & Elections’ and ‘Foreign Policy’ in the same partition in Fig. 4). Overall, we conclude that our graph partitioning approach using Markov Stability works well for topic clustering by using community detection on geometric graphs obtained from high-dimensional embedding vectors of documents. Its advantages include the possibility of obtaining topic clusters at different levels of granularity, with a flexible quasi-hierarchy of related topics and robust results. We also conclude that d2v features serve well for the objective of topic clustering with good quantitative scores and low topic confusion in our qualitative analysis.
References
1. Altuncu, M.T., Mayer, E., Yaliraki, S.N., Barahona, M.: From free text to clusters of content in health records: an unsupervised graph partitioning approach. Appl. Netw. Sci. 4(1), 2 (2019). https://appliednetsci.springeropen.com/articles/10.1007/s41109-018-0109-9
2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003). http://dl.acm.org/citation.cfm?id=944919.944937
3. Burkhardt, S., Kramer, S.: A survey of multi-label topic models. SIGKDD Explor. Newslett. 21(2), 61–79 (2019). https://dl.acm.org/doi/10.1145/3373464.3373474
4. Delvenne, J.C., Yaliraki, S.N., Barahona, M.: Stability of graph communities across time scales. PNAS 107(29), 12755–12760 (2010). https://www.pnas.org/content/107/29/12755
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Long and Short Papers, vol. 1, pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://www.aclweb.org/anthology/N19-1423
6. Dieng, A.B., Ruiz, F.J.R., Blei, D.M.: Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguist. 8, 439–453 (2020)
7. Lambiotte, R., Delvenne, J., Barahona, M.: Random walks, Markov processes and the multiscale modular organization of complex networks. IEEE Trans. Netw. Sci. Eng. 1(2), 76–90 (2014)
8. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on International Conference on Machine Learning, ICML’14, Beijing, China, vol. 32, pp. 1188–1196 (2014). http://dl.acm.org/citation.cfm?id=3044805.3045025
9. Lenz, D., Winker, P.: Measuring the diffusion of innovations with paragraph vector topic models. PLoS ONE 15(1), e0226685 (2020). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0226685. Public Library of Science
10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, Lake Tahoe, Nevada, vol. 2, pp. 3111–3119. Curran Associates Inc., USA (2013). http://dl.acm.org/citation.cfm?id=2999792.2999959
11. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995). http://doi.acm.org/10.1145/219717.219748
12. Newman, D., Karimi, S., Cavedon, L.: External evaluation of topic models. In: Australasian Doc. Comp. Symp., pp. 11–18 (2009)
13. Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. HLT’10, Association for Computational Linguistics, Stroudsburg, PA, USA (2010). http://dl.acm.org/citation.cfm?id=1857999.1858011. Los Angeles, California
14. Papadimitriou, C.H., Tamaki, H., Raghavan, P., Vempala, S.: Latent semantic indexing: a probabilistic analysis. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. PODS ’98, ACM, New York, NY, USA, pp. 159–168 (1998). https://doi.org/10.1145/275487.275505. Seattle, Washington, USA
15. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations (2018). arXiv:1802.05365
16. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language Models are Unsupervised Multitask Learners. Technical report. OpenAI (2018)
17. Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta (2010). https://radimrehurek.com/gensim/lrec2010_final.pdf
18. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using siamese BERT-networks (2019). arXiv:1908.10084
19. Schaub, M.T., Delvenne, J.C., Yaliraki, S.N., Barahona, M.: Markov dynamics as a zooming lens for multiscale community detection: non clique-like communities and the field-of-view limit. PLoS ONE 7(2), e32210 (2012). https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0032210
20. Schuster, M., Nakajima, K.: Japanese and Korean voice search. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149–5152 (2012). ISSN: 1520-6149
21. Veenstra, P., Cooper, C., Phelps, S.: Spectral clustering using the kNN-MST similarity graph. In: 2016 8th Computer Science and Electronic Engineering (CEEC), pp. 222–227 (2016)
22. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler, S.: Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp. 19–27 (2015). ISSN: 2380-7504
Learning Parameters for Balanced Index Influence Maximization

Manqing Ma, Gyorgy Korniss, and Boleslaw K. Szymanski
Network Science and Technology Center, Rensselaer Polytechnic Institute, Troy, NY, USA
{mam6,korniss,szymab}@rpi.edu

Abstract. Influence maximization is the task of finding the smallest set of nodes whose activation in a social network can trigger an activation cascade that reaches the targeted network coverage, where threshold rules determine the outcome of influence. This problem is NP-hard and it has generated a significant amount of recent research on finding efficient heuristics. We focus on the Balanced Index (BI) algorithm, which relies on three parameters to tune its performance to the given network structure. We propose using a supervised machine-learning approach for such tuning. We select the most influential graph features for the parameter tuning. Then, using random-walk-based graph-sampling, we create small snapshots from the given synthetic and large-scale real-world networks. Using exhaustive search, we find for these snapshots the high-accuracy values of the BI parameters to use as a ground truth. Then, we train our machine-learning model on the snapshots and apply this model to the real-world network to find the best BI parameters. We apply these parameters to the sampled real-world network to measure the quality of the sets of initiators found this way. We use various real-world networks to validate our approach against other heuristics.

Keywords: Influence maximization · Threshold Model · Supervised machine learning · Random forest classification

1 Introduction
In a social network setting, influence maximization (IM) is a task motivated by viral marketing. Its goal is to identify the smallest set of social network nodes which, if initially activated to a new state, will collectively influence others to activate. Originally defined by Kempe et al. [1], the problem assumes a known directed social network with either weighted or unweighted edges. The problem uses a stochastic influence propagation model (i.e., the Linear Threshold Model (LTM) [2], in which threshold rules determine influence outcome). The challenge is to find the minimal set of initiators that maximizes the spread of their initiated state. The corresponding influence maximization problem is NP-hard [1] and it has generated a significant amount of recent research on finding efficient heuristics. One approach focuses on various node indexing heuristics [3,4], in which all nodes in the graph are indexed based on their properties, and the highest ranking nodes are selected to the seed set. In this approach, graph features related to percolation, such as degree or betweenness, are often used for indexing [5–8]. From the application perspective, it helps to include the specific context information in the node indexing heuristic.

In a survey paper [9], the authors concluded that the IM challenge includes finding how the graph structure affects the solution and how to identify a robust set of initiators given a limited number of graph changes. To address this challenge, we use the Balanced Index (BI) algorithm [10], which relies on three parameters to tune its performance to the given network structure. Here, we propose to use Machine-Learning (ML) for such tuning. We use a standard supervised ML approach in which the ML model learns from the training data, is validated on the test data, and predicts the best parameters for the given network. In summary, we use ML to find the most influential graph features and apply them to the parameter tuning for the BI algorithm.

The main contributions of this work are as follows. We propose a random-walk-based graph-sampling that quickly creates snapshots of large-scale real-world networks as training data. We developed a method of finding the most influential graph features for the BI algorithm. We also validated the applicability of the ML model trained on synthetic networks to various real-world networks.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 167–177, 2021. https://doi.org/10.1007/978-3-030-65351-4_14
2 Methodology
We use the following notation. We consider a social network with $N$ nodes and a set $E$ of edges, hence with $|E|$ edges, undergoing conversion from the old to the new state through a Threshold Model spreading process. We denote by $r_i$ the resistance of node $i$ to spreading, which is the number of neighbors of $i$ that need to turn active in order for node $i$ to become active. Each node in the network has a fractional threshold for activation, which represents the node's resistance to peer pressure. The spreading rule is that an inactive node $i$, with in-degree $k_i^{in}$ and threshold $\phi_i$, is activated by its in-neighbors only when the fraction of them that are activated reaches the node's threshold, that is, $\sum_{j \in N_i,\ j\ \mathrm{active}} 1 \ge \phi_i k_i^{in}$, where $N_i$ denotes the set of neighbors of node $i$. This is a deterministic process and, once activated, a node cannot return to its previous state. In addition, $k_i^{out}$ stands for the out-degree of node $i$, which represents the immediate decrease of the network resistance to spread when $i$ is activated. $k_i^{out,1}$ is the number of neighbors of $i$ with resistance 1. Such neighbors will be immediately activated once $i$ is activated. Hence, this value represents the immediate increase in the number of activated nodes when $i$ is activated. The Balanced Index (BI) introduced in [10] quantifies the combined potential of a node to be an effective initial spreader based on the node's resistance, its out-degree, and the number of its out-neighbors ready for activation with resistance 1, using parameters $a$, $b$ and $c$. It is defined as:

$$BI_i = a\, r_i + b\, k_i^{out} + c \sum_{j \in \partial i \mid r_j = 1} (k_j^{out} - 1) \qquad (1)$$

where $a + b + c = 1$ and $a, b, c \ge 0$. Given a large social network, attempting to find effective parameters for applying the BI algorithm to this network would be prohibitively expensive. So, our approach first creates many of its subgraphs to avoid random variance in their quality and then uses a supervised classification task to find those parameters. Next, the averaged parameters are applied to the original graph. Here, we use a number of real-world networks, instead of just one, to measure the efficiency of our approach for each of these graphs.

2.1 Random-Walk, Graph-Sampling and Supervised Classification Task
We need a graph-sampling method that creates subgraphs similar to the original graph in the features relevant to the values of the BI parameters. Many graph-sampling methods were tested in [11] for the similarity between the original network and the resulting subgraphs. The author found that random-walk sampling preserves the structural graph features well. This conclusion motivates us to use random-walk sampling in our approach. Each sample is created in one complete walk with no restarts to ensure that the created subgraph is connected.

We denote the dimension of the input space (also known as feature space) of this task as $n$. Here, $n$ is the number of graph features selected for our task and the feature space is $\mathbb{R}^n$. Each feature vector $x_i$ is represented as $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})$. The output (target) space of dimension $m$ is defined by $y_i = (y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(m)})$. The targets could be further sliced into classes $\{C_j\}_{l}$, enabling us to transform our task into a multi-class or binary (two-class) classification problem. Given the dataset of size $D$, we split it into two disjoint parts. The training dataset of size $M$ is represented as $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, and the complementary testing dataset is represented as $V = \{(x_{M+1}, y_{M+1}), \ldots, (x_D, y_D)\}$.

Several methods of classification have shown good performance for a small number of features, including Logistic Regression Classification and Random Forest Classification [12]. We chose the Random Forest Classification method for our task because of its high adaptability to input scales and input noise, and because it fits both linear and nonlinear problems with no prior hypotheses.

2.2 Datasets and Baseline Comparison
We use two types of original networks on which we want to run the BI algorithm: synthetic ER graphs with edge swapping, and real-world networks. For both types, we generate sample subgraphs for model training. All those networks are summarized in Table 1.

By generating synthetic graphs based on random graphs, we get a dataset covering a broad range of graph features, so we expect that the BI parameter values obtained with them will perform worse on real-world networks than the parameters obtained by real-world network sampling. We expect that a subgraph generated from a real-world network will preserve well its graph structure characteristics.

Table 1. Listing of datasets used for sample generation for learning and testing

Synthetic networks

| Network generation model | Parameters | Count |
|---|---|---|
| ER with edge swapping | N = 100, k = 5, 10 | 75 * 2 |
| ER with edge swapping | N = 300, k = 5, 10 | 75 * 2 |
| ER with edge swapping | N = 500, k = 5, 10 | 75 * 2 |

Real-world networks

| Network name | Parameters | Count |
|---|---|---|
| Amazon Co-purchasing network samples | N ∼ 1000 | 1000 |
| Twitter retweet network samples: “center” | N ∼ 1000 | 50 |
| Twitter retweet network samples: “lean left” | N ∼ 1000 | 50 |
| Facebook network samples | N ∼ 500 | 50 |
| CA-CondMat network samples | N ∼ 500 | 50 |
| CA-HepPh network samples | N ∼ 500 | 50 |

For finding the ground-truth best parameter values in each subgraph setting (cascade coverage and threshold distribution), we simply perform the search over the triangle grid of $\left(\frac{\max(a)}{2\,prec} + 1\right) \times \left(\frac{\max(b)}{prec} + 1\right)$ points, where $prec = 0.01$, so this is a triangle grid of $51 \times 101$ points, which requires 5,151 executions of indexing of the nodes, with complexity in the order of $O(N \langle k^{out} \rangle)$, and then running the spread, which also requires $O(N \langle k^{out} \rangle)$ steps. Then, the values of a and b that generate the smallest number of initiators are selected as the best. The synthetic dataset is split using $M = 2D/3$, so 2/3 of the data is used for training and 1/3 for testing. After the model is trained on the synthetic dataset, it is validated using the real-world network data. We call the new approach the tuned BI heuristic. The performance of each solution is measured using the number of initiators needed by this solution to reach the targeted network coverage, so a smaller value indicates better performance. We compare the tuned BI heuristic with the following node indexing based heuristics:
1. res: Node resistance based indexing, which corresponds to the BI with the values of (a;b;c) equal to (1;0;0).
2. deg: Adaptive high out-degree based indexing [4], corresponding to the BI with the parameter values set to (0;1;0).
3. RD: Resistance and node out-degree based heuristic strategy, corresponding to the BI with the parameter values set to (0.5;0.5;0).
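4. CI-TM: Collective influence based indexing for a sphere of influence with L = 1 [5]. Since the CI-TM metric is composed only of the out-degrees of the nodes surrounding the target node, it corresponds to the BI with the parameter values set to (0;0.5;0.5) [10] (Table 2).

Table 2. Graph features, where $C$ denotes a local clustering coefficient, $N_i^{out}$ stands for the average out-degree of the neighbors of node $i$, $cov$ is the targeted coverage of a cascade, and $\rho$ denotes the out-degree assortativity of the graph. The mean and standard deviation of a distribution of values $v$ are denoted as $\langle v \rangle$ and $\sigma_v$, respectively.

| Feature | Definition | Complexity |
|---|---|---|
| $N$ | Node count | - |
| $\langle C \rangle$ | $\frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \{ e_{jk} : v_j, v_k \in \delta_i,\ e_{jk} \in E \} \rvert}{k_i^{out} (k_i^{out} - 1)}$ | $O(N \langle k^{out} \rangle^2)$ |
| $\sigma_C$ | $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (C_i - \langle C \rangle)^2}$, with $C_i = \frac{\lvert \{ e_{jk} : v_j, v_k \in \delta_i,\ e_{jk} \in E \} \rvert}{k_i^{out} (k_i^{out} - 1)}$ | $O(N \langle k^{out} \rangle^2)$ |
| $\langle k^{out} \rangle$ | $\frac{1}{N} \sum_{i=1}^{N} k_i^{out}$ | $O(N)$ |
| $\sigma_{k^{out}}$ | $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (k_i^{out} - \langle k^{out} \rangle)^2}$ | $O(N)$ |
| $\langle N^{out} \rangle$ | $\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\lvert \delta_i \rvert} \sum_{j \in \delta_i} k_j^{out}$ | $O(N \langle k^{out} \rangle)$ |
| $\sigma_{N^{out}}$ | $\sqrt{\frac{1}{N} \sum_{i=1}^{N} \frac{1}{\lvert \delta_i \rvert} \sum_{j \in \delta_i} (k_j^{out} - \langle N^{out} \rangle)^2}$ | $O(N \langle k^{out} \rangle)$ |
| $\rho$ | See Eq. (21) in [13] | See [13] |
| $E_d$ | Edge density, $\lvert E \rvert / \binom{N}{2}$ | $O(N \langle k^{out} \rangle)$ |
| $cov$ | Targeted cascade coverage, $N_t / N$ | $O(1)$ |
| $\langle \phi \rangle$ | $\frac{1}{N} \sum_{i=1}^{N} \phi_i$ | $O(N)$ |
| $\sigma_\phi$ | $\sqrt{\frac{1}{N} \sum_{i=1}^{N} (\phi_i - \langle \phi \rangle)^2}$ | $O(N)$ |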
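The triangle grid of parameter values and the four baselines, which are all special cases of the Balanced Index of Eq. (1), can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is given as a dictionary of out-neighbour lists, the resistances are static (the actual algorithm may update them as nodes activate), and the names `triangle_grid`, `balanced_index` and `BASELINES` are hypothetical.

```python
def triangle_grid(prec=0.01):
    """All (a, b, c) with a, b, c >= 0 and a + b + c = 1 on a grid of step prec."""
    steps = round(1 / prec)
    grid = []
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            grid.append((i / steps, j / steps, (steps - i - j) / steps))
    return grid


def balanced_index(node, out_neighbours, resistance, a, b, c):
    """Eq. (1): a*r_i + b*k_i^out + c * sum of (k_j^out - 1) over the
    out-neighbours j of the node that have resistance 1."""
    ready = [j for j in out_neighbours[node] if resistance[j] == 1]
    return (a * resistance[node]
            + b * len(out_neighbours[node])
            + c * sum(len(out_neighbours[j]) - 1 for j in ready))


# (a, b, c) values of the four baseline heuristics listed in the text.
BASELINES = {
    "res": (1, 0, 0),
    "deg": (0, 1, 0),
    "RD": (0.5, 0.5, 0),
    "CI-TM": (0, 0.5, 0.5),
}
```

With `prec = 0.01` the grid contains 101 + 100 + ... + 1 = 5,151 points, which matches the number of executions quoted in the text. An exhaustive search would evaluate the resulting number of initiators at each grid point and keep the best (a, b).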
3 Result and Analysis
Here, we first examine the relationship between different parameter values and the cases in which they deliver their best performance. For the synthetic subgraphs and given the range of targeted cascade coverage, Fig. 1 shows the optimal a and b values in a triangle grid search with precision 0.01 (so with 51 × 101 = 5,151 points), while the third parameter is set as c = 1 − a − b. The first plot of Fig. 1 shows that as the network size increases, the plot moves toward the diagonal. There is also an increase of the negative correlation between the best values of a and b when the targeted cascade coverage increases, together with the increase of their sum to one, shown in the second plot. The third and fourth plots show the increasing importance of out-degree (b) and resistance (a) when a larger cascade coverage is needed. The conclusion is that the larger the targeted cascade coverage, the less important it is to focus on out-neighbors ready for immediate activation, and the more important it is to concentrate on the long-term strategy of selecting the most resistant (a) and influential (b) nodes.
M. Ma et al.
Fig. 1. Optimal values obtained in a triangle grid search with precision 0.01 over synthetic subgraphs and the different targeted cascade coverage values. (Left) The Spearman correlation between the best values of a and b. (Left-Center) Sum of the best a and b. (Right-Center) The best a. (Right) The best b.
Fig. 2. Feature importance (for cov = 0.9). (Left) Bars from left to right show features in order of importance for the BI coefficient a: $\sigma_\phi$ and $\bar{\phi}$, standard deviation and average value of the threshold; $\bar{C}$ and $\sigma_C$, average and standard deviation of the local clustering coefficient; $\rho$, assortativity; $\bar{N}_{out}$ and $\sigma_{N_{out}}$, average and standard deviation of the average out-degree of neighbors; $\sigma_{k_{out}}$, standard deviation of out-degree; $E_d$, edge density; $\bar{k}_{out}$, average out-degree; and $N$, the number of nodes. (Right) The same features in order of importance for the BI coefficient b. In both plots, the lines above the bars show the cumulative importance of the features below and to the left of a point of reference.
3.1 Identifying the Most Important Features
Although the classification model can be used as a black box, knowing which features are important can guide decisions to reduce or extend the feature set. For the Random Forest model, the importance of a feature corresponds to the cumulative entropy reduction achieved wherever that feature is the root of a sub-tree in the decision trees of the forest. Figure 2 shows the results of both classification tasks on synthetic subgraphs. The plots show that $\sigma_\phi$ dominates all other features, capturing over 30% of the overall importance. The second is $\bar{\phi}$, which claims over 15% of the importance. The next five features each account for nearly 10%, while the remaining four are negligible.
Learning Parameters for IM
3.2 Training the Classification Tasks on Synthetic Subgraphs
Figure 3 shows the performance of the Random Forest classification on the synthetic subgraphs. The first subplot shows the absolute differences between the predicted values of a and b and the optimal values obtained by the triangle grid search. The difference is less than 0.2 on both sides of the optimal values. The shaded band marks one standard deviation around the average line. The second and third subplots compare the performance of our method with other node-indexing heuristics. The bars in the second subplot show the total number of initiators, while the lines chart the number of initiators needed by each heuristic beyond the optimal number. The third subplot shows the fraction of additional initiators needed by each heuristic, relative to the number needed when the a and b values obtained by the triangle grid search are used. The synthetic networks span a spectrum of network features. Hence, Fig. 4 examines whether the result holds across different network realizations, characterized by varying degree assortativity $\rho$ and standard deviation $\sigma_\phi$ of the threshold $\phi$ distribution. Comparing the results at each targeted cascade coverage in Figs. 3 and 4 shows that our method, “tuned BI”, performs second only to the exhaustive triangle grid search, labeled “best performance BI”. In most cases when the cascade size is small, “CI-TM” with L = 1 performs comparably with “tuned BI”. The next two best-performing approaches are “deg” and “RD”, while “res” generally performs the worst.
Fig. 3. Comparison of performance of BI parameters found by model trained on synthetic subgraphs to the other node indexing based heuristics. (Left) Range of difference between a and b parameters found by the model and by exhaustive search. (Center) Bars show size of the set of initiators for each heuristic with scale on the left. Plots show additional initiators needed by each heuristic over what was required by BI parameters found by exhaustive search with scale on the right. (Right) Fraction of the best set of initiators needed by each heuristic to achieve the same coverage. Our tuned BI heuristic requires the smallest such fraction, with CI-TM matching it for smaller cascades.
In summary, the results show that parameter tuning using our Random Forest Classification has achieved a convincing performance boost on the synthetic dataset.
Fig. 4. Number of initiators needed for a range of values for cascade coverage and different node ranking metrics on the synthetic subgraphs. Each plot compares heuristics for different ranges of threshold’s standard deviation σφ (σ in the plot) and assortativity ρ
3.3 Validating the Approach on Real-World Networks
We applied the model trained on synthetic networks to real-world graphs (subgraph samples) whose graph metric values are not known beforehand. For the Amazon co-purchasing network subgraph samples, we further performed the grid search and compared the predicted a and b with the a and b found by the grid search. The difference was smaller than 0.05 for both parameters. In real-life scenarios, a single run of linear threshold influence maximization on a large graph may take several days. Hence, it is beneficial to apply to the large-scale original graph the average of the parameter values predicted for its subgraphs. The more nodes the subgraph samples include, the more accurately this average approximates the actual best parameters. Figure 5 shows this narrowing-range effect in response to an increase in the number of nodes in the subgraph samples for the Amazon co-purchasing network.
Fig. 5. Predictions of a and b are more stable for larger subgraph samples, indicating a narrowing range effect
Here, we use other real-world networks for testing. The results are averaged over 1000 network subgraph samples and 50 for others, and are summarized in Fig. 6 without specifying the experiment settings (i.e., the distribution of resistance thresholds), since the individual results were similar to each other. For the six real-world networks, the tuned BI approach performs consistently strongly across all kinds of real-world networks included (i.e., academic collaboration networks and social networks). However, CI-TM with L = 1 shows bifurcated behavior between the Twitter retweet graphs and the others, indicating that neglecting the resistance aspect of an influence propagation system can be detrimental to performance.

Fig. 6. Comparison of performance of tested heuristics on other real-world network subgraphs
4 Conclusion
We use synthetic network data to train a Random Forest classifier that tunes the BI algorithm parameters for high performance on the influence maximization problem. Our contributions include the following. We identified the most important features for all the BI parameters, among which the standard deviation of the threshold $\phi$ distribution dominates the others. We designed a novel tuned BI heuristic, which uses random-walk sampling to create subgraphs from the networks of interest, which we use to train an ML model for selecting efficient values of the BI parameters when solving the influence maximization problem on the networks of interest. We also compared the new heuristic with other node-indexing heuristics on six real-world networks. The results demonstrate that the tuned BI approach outperforms the other tested heuristics and reduces the number of needed initiators by up to 10%.

Acknowledgement. This work was supported in part by the Army Research Laboratory (ARL) through the Cooperative Agreement (NS CTA) Number W911NF-09-20053, the Office of Naval Research (ONR) under Grant N00014-15-1-2640, and by the Army Research Office (ARO) under Grant W911NF-16-1-0524. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.
References
1. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146 (2003)
2. Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under the linear threshold model. In: 2010 IEEE International Conference on Data Mining, pp. 88–97. IEEE (2010)
3. Kempe, D., Kleinberg, J., Tardos, É.: Influential nodes in a diffusion model for social networks. In: International Colloquium on Automata, Languages, and Programming, pp. 1127–1138. Springer (2005)
4. Kitsak, M., et al.: Identification of influential spreaders in complex networks. Nat. Phys. 6(11), 888–893 (2010)
5. Morone, F., Makse, H.A.: Influence maximization in complex networks through optimal percolation. Nature 524(7563), 65–68 (2015)
6. Pei, S., Teng, X., Shaman, J., Morone, F., Makse, H.A.: Efficient collective influence maximization in cascading processes with first-order transitions. Sci. Rep. 7, 45240 (2017)
7. Karsai, M., Iñiguez, G., Kikas, R., Kaski, K., Kertész, J.: Local cascades induced global contagion: How heterogeneous thresholds, exogenous effects, and unconcerned behaviour govern online adoption spreading. Sci. Rep. 6(1), (2010)
8. Unicomb, S., Iñiguez, G., Karsai, M.: Threshold driven contagion on weighted networks. Sci. Rep. 8(1), (2018)
9. Li, Y., Fan, J., Wang, Y., Tan, K.-L.: Influence maximization on social graphs: a survey. IEEE Trans. Knowl. Data Eng. 30(10), 1852–1872 (2018)
10. Karampourniotis, P.D., Szymanski, B.K., Korniss, G.: Influence maximization for fixed heterogeneous thresholds. Sci. Rep. 9(1), 1–12 (2019)
11. Backstrom, L., Leskovec, J.: Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 635–645 (2011)
12. Abu-Mostafa, Y.Y., Magdon-Ismail, M., Lin, H.-T.: Learning From Data (2012). amlbook.com
13. Newman, M.E.J.: Mixing patterns in networks. Phys. Rev. E 67(2), 026126 (2003)
Mobility Networks
Mobility Networks for Predicting Gentrification
Oliver Gardiner and Xiaowen Dong
Department of Engineering Science, University of Oxford, Oxford, UK
[email protected], [email protected]

Abstract. Gentrification is a contentious issue which local governments struggle to deal with because warning signs are not always visible. Unlike current literature that utilises solely socio-economic data, we introduce the use of large-scale spatio-temporal mobility data to predict which neighbourhoods of a city will gentrify. More specifically, from mobility data, which is associated with the exchange of ideas and capital between neighbourhoods, we construct mobility networks. Features are extracted from these mobility networks and used in gentrification prediction, which is framed as a binary classification. As a case study, we use the Taxi & Limousine Commission Trip Record Data to predict which census tracts would gentrify in New York City from 2010 to 2018, and show that considering network features alongside socio-economic features leads to a significant improvement in prediction performance.
Keywords: Gentrification · Mobility networks · Urban computing · New York City

1 Introduction
The precise definition of gentrification remains a topic of open debate, but at its core the term refers to a period of rapid change in a previously disadvantaged neighbourhood. Regardless of exact terminology, the impact of gentrification on neighbourhoods is undeniable. Moreover, public opinion, sometimes manifested as protests, demonstrates that it is a problem city governments are struggling with [26]. Fundamentally, local governments struggle to deal with gentrification because by the time obvious signs of gentrification appear, the process is already in full flow: ‘The tide of living expenses in a given neighbourhood may already be rising so fast... If you’re poor or working class, it’s just time to leave’ [8].

This has motivated a number of recent quantitative studies to understand how different factors contribute to gentrification, and to predict which neighbourhoods will gentrify. These studies have used either regression [22,25] or binary classification [1,5,14]. Studies which use regression face the natural challenge that there is no obvious continuous variable that can be used to represent gentrification. Binary classification, using a clear definition of gentrification, is therefore a more logical approach and is adopted in this paper. All of these studies, however, utilise only socio-economic data and face the significant limitations that such data often has poor spatial and temporal resolution, and fails to account for other factors, such as human behaviour and tastes, which undoubtedly also contribute to gentrification.

We investigate the use of mobility data, specifically taxi trajectory records, to overcome these limitations. The usage of taxis is documented to vary according to household income [24], and so we hypothesise that it should be useful in predicting gentrification. Moreover, mobility data is also interesting with respect to gentrification because the movement of people leads to interactions between neighbourhoods which, in turn, lead to the exchange of ideas, opportunities, and capital, all of which are of key importance to any urban process. Such spatio-temporal data streams, however, do not offer obvious features that can be related to gentrification.

To address this, we take inspiration from previous examples of spatial networks in urban computing. Liu et al. showed that a spatial network inferred from taxi trip data in Shanghai encoded useful information about the city structure [16]. Hristova et al. used Twitter and Foursquare data to infer spatial and social networks in London [12]. These networks were interconnected and node statistics were used to measure the social diversity of each neighbourhood. They then demonstrated a correlation between these statistics and the change in socio-economic well-being. We explore the hypothesis that node statistics from a mobility network (i.e. a spatial network inferred from mobility data) are useful in predicting gentrification. Unlike the analysis in [12], we formally define the gentrification process using census data, and quantitatively evaluate the predictive performance of the proposed method. As a case study, we explore the Taxi & Limousine Commission (TLC) Trip Record Data, a data set of taxi journeys in New York City (NYC) [18]. We then consider a number of different network definitions, each defined in Sect. 3.1.

Fig. 1. Proposed analysis pipeline.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 181–192, 2021. https://doi.org/10.1007/978-3-030-65351-4_15
From the mobility network we extract a set of network features, which are used alongside socio-economic features to train a binary classifier that identifies which census tracts in NYC would gentrify from 2010 to 2018. There is no commonly used definition of gentrification; for this paper we use the same definition as the Urban Displacement Project [6], which provides our labels for classification (the definition is given in Sect. 4.1). We show a significant increase in the performance of binary classification, measured by the area under the receiver operating characteristic curve (AUROC), compared to using only socio-economic features. An overview of this approach is shown in Fig. 1. In summary, the main contributions of this paper are:
– We propose a novel framework to use large-scale spatio-temporal mobility data for understanding and predicting gentrification.
– As a case study, we present a technical methodology and results for using both taxi trajectory records and socio-economic data in NYC to predict gentrification.
– We present a qualitative discussion of the network features particularly important to gentrification prediction.
2 Data Sources
We consider two main data sources: the TLC Trip Record Data [18] and the American Community Survey (ACS) [27]. The TLC Trip Record Data details over two billion taxi trips in NYC from 2009 to present. The ACS provides socio-economic and demographic information on each census tract in NYC. We also use Google Maps to provide the travel time via subway from each tract to downtown (Union Square Park) [10].

We consider both yellow taxis and green taxis, for the years where available, from the TLC Trip Record Data in the time period 2011–2014 (so as not to overlap with the ACS data). The data set was cleaned by removing data points that were obviously erroneous (e.g. those with a negative travel time). Following this, the pick-up and drop-off locations of each trip were assigned to the census tracts within which they were located, using a shapefile of 2018 NYC census tracts [28]. Some summary statistics from the Trip Record Data are shown in Table 1.

The TLC Trip Record Data has already been the subject of interest from a networks perspective [7,19,29]. We build on these studies via a discussion of different definitions of a mobility network (Sect. 3.1) and by presenting a formal methodology for the use of these networks in predicting gentrification (Sect. 4).

Table 1. TLC trip record data: summary statistics.

| Statistic | 2011 | 2012 | 2013 | 2014 |
|---|---|---|---|---|
| Number of trips | 170,941,180 | 173,097,854 | 169,884,723 | 176,295,717 |
| Mean travel time (±std) | 12:25 (±11:34) | 12:24 (±10:01) | 12:39 (±15:37) | 13:32 (±19:33) |
We use ACS 5-year estimates for the periods 2006–2010 and 2014–2018 (the most recent available at the time of analysis). The socio-economic data from the ACS is used both to provide features for gentrification prediction and to identify which census tracts have gentrified. Pre-processing was needed before the ACS data could be used: some data categories are only recorded up to a certain value, and we replaced these data points with their limit (e.g. 250,000+ was replaced by 250,000). We also removed tracts that had no population in either 2010 or 2018.
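The two pre-processing steps described above can be sketched as follows. The function names and the field names `pop_2010` and `pop_2018` are hypothetical and chosen only for illustration; the actual ACS column names and file layout are not given in the text.

```python
def clean_top_coded(value):
    """Convert an entry such as '250,000+' to its numeric limit, 250000.0."""
    text = str(value).replace(",", "").strip()
    if text.endswith("+"):
        text = text[:-1]  # a top-coded value is replaced by its limit
    return float(text)


def drop_unpopulated(tracts):
    """Keep only tracts with a non-zero population in both 2010 and 2018."""
    return [t for t in tracts if t["pop_2010"] > 0 and t["pop_2018"] > 0]


# Hypothetical records, for illustration only.
tracts = [
    {"tract": "A", "pop_2010": 3200, "pop_2018": 3400},
    {"tract": "B", "pop_2010": 0, "pop_2018": 150},
    {"tract": "C", "pop_2010": 2100, "pop_2018": 0},
]
kept = drop_unpopulated(tracts)
income_limit = clean_top_coded("250,000+")
print([t["tract"] for t in kept], income_limit)
```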
3 Mobility Networks

3.1 Network Definition
In NYC we define a spatial network in which each census tract (excluding those with no population) is represented by a node. Thus, the network has 2114 nodes. Below we describe the different network definitions we explored. For consistency between definitions, all of the networks defined here are undirected.

Origin-Destination Network. In an origin-destination network an edge is defined between two nodes if there is a taxi trip between the two tracts, similar to other studies on mobility networks [7,16,19,29]. The weight of an edge is the total number of trips between the two tracts in the time period considered (Eq. (1), where $t_{i \to j}$ is the number of trips from tract $i$ to tract $j$ in the time period considered, and $A$ is the adjacency matrix):

$$A_{i,j} = A_{j,i} = t_{i \to j} + t_{j \to i} \qquad (1)$$
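Equation (1) can be sketched in a few lines of Python. The function name, the representation of trips as (origin tract, destination tract) pairs, and the decision to ignore trips that start and end in the same tract are assumptions made for this illustration.

```python
from collections import defaultdict


def origin_destination_weights(trips):
    """Undirected edge weights A[i, j] = t_{i->j} + t_{j->i} from (origin, destination) pairs."""
    weights = defaultdict(int)
    for origin, destination in trips:
        if origin == destination:
            continue  # assumption: trips within a single tract (self-loops) are ignored
        edge = tuple(sorted((origin, destination)))  # one key per unordered pair
        weights[edge] += 1
    return dict(weights)


# Hypothetical trips between tracts A, B and C.
trips = [("A", "B"), ("B", "A"), ("A", "B"), ("A", "C"), ("C", "C")]
od_weights = origin_destination_weights(trips)
print(od_weights)
```

Storing each unordered pair once keeps the adjacency symmetric by construction, which matches the undirected networks used throughout.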
This definition follows from the idea that a trip from tract $i$ to tract $j$ likely leads to interactions between individuals in those tracts. We consider two time periods: 2011–2014 and 2014.

Co-work Location Network. This definition is inspired by the use of gathering events by Psorakis et al. [21]. It follows the assumption that two trips originating from tracts $i$ and $j$ and terminating at tract $k$ do not necessarily imply a link between tracts $i$ or $j$ and $k$, but instead may imply a link between tracts $i$ and $j$. An obvious example of this is two colleagues commuting to the same office. A gathering event is defined when there are many trips to a single tract during a short period of time, with the assumption that during this event there is a higher probability of interactions between individuals. We use the morning commute as an obvious candidate for a gathering event. Not only is it easy to define, but the growth of service sector employment is also commonly noted in gentrified neighbourhoods [13]. Specifically, we define a gathering event at each tract in Manhattan (which has a high density of offices) on weekday mornings between 07:00 and 10:00 for all of 2014. The strength of the relationship between tracts that send trips to the gathering event is given by the association factor, $r_{ij,e}$. Taking inspiration from ecological networks [9], we define the association factor between tracts $i$ and $j$ for gathering event $e$:

$$r_{ij,e} = \frac{x_i x_j}{\left(\sum_{k=1}^{n} x_k\right)^2} \qquad (2)$$

where $x_i$ is the number of trips to the gathering event from tract $i$, and $n$ is the total number of tracts in NYC. The association factor is, therefore, the number of possible interactions between individuals from tracts $i$ and $j$ divided by the maximal number of interactions possible between individuals at the event. To infer a network over some time period, we take the mean association factor (Eq. (3), where $E$ is the total number of gathering events in the time period):

$$A_{i,j} = A_{j,i} = \frac{\sum_{e=1}^{E} r_{ij,e}}{E} \qquad (3)$$

Weighted and Binary Networks. The edge weights of the origin-destination network and the co-work location network face the inherent problem that the Trip Record Data undoubtedly contains noise. To combat this we opt for the relatively simple solution of creating a binary network: the weight of a link is set to 1 if it is greater than the median edge weight and 0 otherwise. We explore both weighted networks and their binary counterparts in the remainder of this paper.

3.2 Network Visualisation and Summary Statistics
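Before turning to the summary statistics, the co-work association factor (Eqs. (2) and (3)) and the median-based binarisation from Sect. 3.1 can be sketched as below. This is a minimal illustration under stated assumptions: each gathering event is represented as a dictionary of trip counts that lists only tracts sending at least one trip, and the median is taken over the edges present in the weighted network. All names and the toy counts are hypothetical.

```python
from collections import defaultdict
from statistics import median


def association_factors(trips_to_event):
    """r_ij,e = x_i * x_j / (sum_k x_k)^2 for one gathering event (Eq. 2)."""
    total = sum(trips_to_event.values())
    tracts = sorted(trips_to_event)
    factors = {}
    for pos, i in enumerate(tracts):
        for j in tracts[pos + 1:]:
            factors[(i, j)] = trips_to_event[i] * trips_to_event[j] / total ** 2
    return factors


def mean_association(events):
    """A_ij = (1 / E) * sum_e r_ij,e over all E gathering events (Eq. 3)."""
    sums = defaultdict(float)
    for event in events:
        for edge, r in association_factors(event).items():
            sums[edge] += r
    return {edge: s / len(events) for edge, s in sums.items()}


def binarise(weights):
    """Set an edge to 1 if its weight exceeds the median edge weight, else 0."""
    threshold = median(weights.values())
    return {edge: int(w > threshold) for edge, w in weights.items()}


# Two hypothetical gathering events: tract -> number of trips sent to the event.
events = [{"A": 2, "B": 1, "C": 1}, {"A": 1, "B": 3}]
co_work = mean_association(events)
co_work_binary = binarise(co_work)
print(co_work, co_work_binary)
```

A pair that never co-occurs in an event simply has no entry, so it contributes no edge to either the weighted or the binary network.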
Using these definitions, we inferred six mobility networks from the Trip Record Data. Some summary statistics of these are shown in Table 2 and a visualisation of the 2011–2014 Weighted Origin-Destination Network is shown in Fig. 2.

Table 2. Network summary statistics. The degree is the number of edges each node has (i.e. for weighted graphs ignoring the weight of an edge) and the edge density is the number of edges in the graph divided by the total number of potential edges.

| Network | Mean degree | Edge density |
|---|---|---|
| 2011–2014 Weighted Origin-Destination | 881 | 0.42 |
| 2011–2014 Binary Origin-Destination | 334 | 0.16 |
| 2014 Weighted Origin-Destination | 720 | 0.34 |
| 2014 Binary Origin-Destination | 265 | 0.13 |
| 2014 Weighted Co-Work Location | 474 | 0.22 |
| 2014 Binary Co-Work Location | 200 | 0.09 |
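The two columns of Table 2 are linked: for a simple undirected graph with $n$ nodes, the edge density equals the mean degree divided by $n - 1$. The sketch below checks this relation for the reported values, assuming $n = 2114$ nodes (Sect. 3.1) and no self-loops. Since the mean degrees in the table are rounded, agreement is only expected to about two decimal places.

```python
N_NODES = 2114  # census tracts represented as nodes (Sect. 3.1)

# Mean degree and edge density as reported in Table 2.
reported = {
    "2011-2014 Weighted Origin-Destination": (881, 0.42),
    "2011-2014 Binary Origin-Destination": (334, 0.16),
    "2014 Weighted Origin-Destination": (720, 0.34),
    "2014 Binary Origin-Destination": (265, 0.13),
    "2014 Weighted Co-Work Location": (474, 0.22),
    "2014 Binary Co-Work Location": (200, 0.09),
}


def edge_density(mean_degree, n):
    """Density of a simple undirected graph: (n * mean_degree / 2) / (n * (n - 1) / 2)."""
    return mean_degree / (n - 1)


densities = {name: edge_density(k, N_NODES) for name, (k, _) in reported.items()}
for name, value in densities.items():
    print(f"{name}: {value:.2f}")
```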
The communities detected in Fig. 2a confirm that the networks contain useful information, as they closely match what would be expected in NYC.¹ To determine which definition is the most useful for gentrification prediction, all of the inferred networks are investigated in Sect. 4.

¹ The authors have discussed the networks and communities with Dr. Gerard Torrats-Espinosa, Assistant Professor in the Department of Sociology at Columbia University, New York City (in conversation 29 April 2020), and Charlie Dulik, Tenant Organizer at the Urban Homesteading Assistance Board, New York City (in conversation 23 April 2020).
Fig. 2. 2011–2014 Weighted Origin-Destination Network: (a) communities; (b) node degree visualisation. Visualisation created using Gephi (version 0.9.2) [2]. Each node is positioned at the centroid of the corresponding tract. In Fig. 2a the colours represent communities detected via modularity optimisation [3].
4 Gentrification Prediction

4.1 Methods
Gentrification Identification. First, we identify which tracts gentrified from 2010 to 2018; this provides class labels for training a binary classifier. We adopt the same definition for gentrification of a tract as Chapple et al. (the Urban Displacement Project) and Rigolon and Németh [6,25], and from this definition present the number of eligible and gentrified tracts in Table 3. The definition is as follows. First we identify whether a tract was eligible to gentrify in 2010:
– Owner-occupied home value or gross rent < 80% of NYC median
And (any 3 of 4):
– % low income households (annual income below $50,000) > NYC median
– % of residents college educated < NYC median
– % of residents who rent > NYC median
– % of residents who are non-white > NYC median
And then whether a tract gentrified from 2010 to 2018:
– Eligible to gentrify in 2010
– Increase in % of college educated residents > NYC median
– Percentage increase in real median household income > NYC median
And (either of):
– Increase in median real rent > NYC median
– Increase in median value of owner-occupied units > NYC median
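The labelling rule above can be expressed as two small predicates. This is a sketch under stated assumptions: the field names and the numeric values are hypothetical, "any 3 of 4" is read as "at least three", and the first criterion is read as either the home value or the gross rent falling below 80% of the corresponding NYC median.

```python
def is_eligible(tract, nyc):
    """Eligible in 2010: low housing cost AND at least 3 of the 4 vulnerability criteria."""
    low_cost = (tract["home_value"] < 0.8 * nyc["home_value"]
                or tract["gross_rent"] < 0.8 * nyc["gross_rent"])
    criteria = [
        tract["pct_low_income"] > nyc["pct_low_income"],
        tract["pct_college"] < nyc["pct_college"],
        tract["pct_renters"] > nyc["pct_renters"],
        tract["pct_non_white"] > nyc["pct_non_white"],
    ]
    return low_cost and sum(criteria) >= 3


def has_gentrified(tract, nyc):
    """Gentrified 2010-2018: eligible, with above-median growth in education, income and housing cost."""
    housing_up = (tract["d_rent"] > nyc["d_rent"]
                  or tract["d_home_value"] > nyc["d_home_value"])
    return (is_eligible(tract, nyc)
            and tract["d_pct_college"] > nyc["d_pct_college"]
            and tract["d_income_pct"] > nyc["d_income_pct"]
            and housing_up)


# Hypothetical NYC medians and one hypothetical tract (values are illustrative only).
nyc = {"home_value": 500000, "gross_rent": 1200, "pct_low_income": 45,
       "pct_college": 33, "pct_renters": 66, "pct_non_white": 60,
       "d_pct_college": 4, "d_income_pct": 5, "d_rent": 150, "d_home_value": 20000}
tract = {"home_value": 350000, "gross_rent": 1100, "pct_low_income": 60,
         "pct_college": 20, "pct_renters": 80, "pct_non_white": 50,
         "d_pct_college": 9, "d_income_pct": 12, "d_rent": 100, "d_home_value": 60000}

label = has_gentrified(tract, nyc)  # class label for the binary classifier
print(is_eligible(tract, nyc), label)
```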
Table 3. Number of gentrified tracts by borough in NYC.

| Borough | Total number of tracts | Eligible to gentrify in 2010 | Gentrified 2010 to 2018 | Percentage of eligible tracts gentrified (%) |
|---|---|---|---|---|
| Manhattan | 281 | 76 | 42 | 55.3 |
| Queens | 643 | 80 | 14 | 17.5 |
| The Bronx | 332 | 185 | 51 | 27.6 |
| Brooklyn | 750 | 118 | 56 | 47.5 |
| Staten Island | 108 | 9 | 1 | 11.1 |
Feature Extraction. We investigate a number of socio-economic features and network features. For brevity, here we only present the features used after a feature selection step to remove multicollinearity. Socio-economic feature selection was guided by numerous existing studies in the literature [5,14,22,25] and the features chosen are shown in Table 4. The ‘Distance to Downtown’ feature is highly non-linear and so was log-transformed. The network features chosen are also presented in Table 4 (where $V$ is the set of nodes, $N(i)$ is the set of neighbours of $i$, $N_T$ is the cardinality of the set of neighbours, $R$ is the set of recently gentrified tracts,² $R_T$ is the cardinality of the set of recently gentrified tracts, $\hat{w}_{ij}$ is the edge weight between $i$ and $j$ normalised by the maximum weight in the network, $T(i)$ is the number of triangles through $i$, and $\log()$ is the natural logarithm). Finally, we also include the borough as a one-hot encoded variable to account for unobserved heterogeneity and confounding variables in the data set associated with the boroughs, such as local trends or policies. This also increases our confidence in the analysis of feature importance presented below.

Binary Classification. We investigated two different binary classifiers, logistic regression (LR) and random forest (RF), both implemented using scikit-learn (version 0.22.2) [20]. We only consider tracts eligible to gentrify in 2010, and a positive label is assigned to tracts which gentrified from 2010 to 2018. For both classifiers the data is split into training and test sets in an 80%:20% stratified split. The hyperparameters are chosen using a random grid search and 4-fold cross-validation on the training data, optimising for AUROC. Finally, the model is fitted to the training data and its performance is evaluated on the test data.

4.2 Results
Prediction Performance. To investigate the utility of each network definition for gentrification prediction, we compare the performance of the classifiers using features extracted from each network. The performance, as measured by AUROC, is shown in Fig. 3. We see that most, although not all, of the network definitions lead to an improvement on the baseline of solely considering socio-economic features, and that considering only network features performs slightly worse than considering only socio-economic features. Both of these results demonstrate the added value of considering network features. The performance improvement is much more pronounced for the LR classifier than for the RF. This suggests that a linear relationship between the features and the log odds is an appropriate assumption for the network features.

LR considering features from the 2011–2014 Weighted Origin-Destination network is the best performing model, with a median AUROC of 0.73. Having been inferred from a four-year period of data, the 2011–2014 Weighted Origin-Destination network may have captured some temporal trends that are not captured in the networks defined over a shorter period of time. We also note that green taxis are only included in the data set from August 2013, which is another difference from the networks inferred solely from 2014. Whilst creating the binary network removes noise, valuable information is also lost, possibly explaining the slightly better performance of this network over its binary counterpart. We further consider this model in terms of feature importance below.

² ‘Recently Gentrified’ tracts are those identified to have gentrified from 2000–2010, using the definition of gentrification in Sect. 4.1 (for this different time period), data from the 2000 U.S. decennial census [27] and the 2006–2010 ACS data.

Table 4. Socio-economic and network features used. The network features are mostly calculated using the Python package NetworkX (version 2.4) [11].

| Socio-economic | Definition |
|---|---|
| % Black | Percentage of ‘Black or African American’ residents |
| % White | Percentage of ‘White Alone’ residents |
| Distance to Downtown | Distance from tract centroid to downtown NYC (m) |
| % College Educated | Percentage of residents with a Bachelor’s degree |
| % Unemployed | Percentage of residents with ‘Unemployed’ status |
| % Renters | Percentage of total housing units occupied by rent paying tenant |
| Subway Travel Time | The expected travel time by subway from tract centroid to downtown (minutes) |
| Borough | The borough within which the tract is situated, included as a one-hot encoded variable |

| Network | Definition |
|---|---|
| Degree | $k_i = \sum_{j=1}^{n} A_{ij}$ [17] |
| Average neighbour degree | $k_{nn,i} = \frac{1}{N_T}\sum_{j \in N(i)} k_j$ [17] |
| Links to recently gentrified | $l_i = \frac{1}{R_T}\sum_{j \in R} A_{ij}$ |
| Shannon entropy | weighted: $H(i) = \sum_{j \in N(i)} P(w_{ij}) \log\frac{1}{P(w_{ij})}$ |
| Clustering Coefficient | weighted: $C_c(i) = \frac{1}{k_i(k_i-1)}\sum_{jk}(\hat{w}_{ij}\hat{w}_{ik}\hat{w}_{jk})^{1/3}$ [17]; binary: $C_c(i) = \frac{2T(i)}{k_i(k_i-1)}$ [17] |
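The node-level network features used in this section (degree, average neighbour degree, Shannon entropy, and clustering coefficient) can be sketched in plain Python for a small weighted graph. The paper computes them mostly with NetworkX; the version below is a stand-alone illustration, not the authors' code. It assumes that $P(w_{ij})$ is the share of node $i$'s total edge weight carried by edge $(i, j)$, which the text does not define explicitly, and the toy graph is hypothetical.

```python
from math import log


def node_features(graph, i):
    """Features of node i in an undirected weighted graph stored as a dict of dicts."""
    nbrs = graph[i]
    k = len(nbrs)  # degree, ignoring edge weights
    avg_nbr_degree = sum(len(graph[j]) for j in nbrs) / k

    # Assumption: P(w_ij) is node i's share of weight on edge (i, j).
    strength = sum(nbrs.values())
    entropy = sum((w / strength) * log(strength / w) for w in nbrs.values())

    # Weights are normalised by the maximum weight in the network.
    w_max = max(w for row in graph.values() for w in row.values())
    closed_pairs, weighted_sum = 0, 0.0
    for j in nbrs:
        for m in nbrs:
            if j != m and m in graph[j]:
                closed_pairs += 1  # ordered pairs, so this equals 2 * T(i)
                weighted_sum += (nbrs[j] * nbrs[m] * graph[j][m] / w_max ** 3) ** (1 / 3)
    pairs = k * (k - 1)
    return {
        "degree": k,
        "avg_neighbour_degree": avg_nbr_degree,
        "entropy": entropy,
        "clustering_binary": closed_pairs / pairs if pairs else 0.0,
        "clustering_weighted": weighted_sum / pairs if pairs else 0.0,
    }


# Hypothetical graph: a triangle A-B-C plus a pendant node D attached to A.
graph = {
    "A": {"B": 4, "C": 2, "D": 2},
    "B": {"A": 4, "C": 1},
    "C": {"A": 2, "B": 1},
    "D": {"A": 2},
}
features_a = node_features(graph, "A")
print(features_a)
```

Summing over ordered neighbour pairs counts every triangle twice, which is why the binary coefficient reduces to $2T(i) / (k_i(k_i - 1))$.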
Fig. 3. LR AUROC results and RF AUROC results (one panel each). Boxplots show the AUROC of the classifiers over 1,000 training/testing data set splits. In each, both socio-economic features and features from the named network are considered, apart from ‘Socio-Economic Features Only’ and ‘Network Features Only’ (2011–2014 Weighted Origin-Destination Network), which provide baselines against which the performance can be compared. The boxplots show quartiles of the data; outliers are defined as points more than 1.5 times the interquartile range past the upper and lower quartiles. The boxplots are arranged from left to right in order of increasing median.
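For readers less familiar with the metric, AUROC equals the probability that a randomly chosen positive example (a gentrified tract) receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below implements this pairwise definition directly; the labels and scores are made up for illustration. In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = gentrified) and classifier scores.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.6, 0.2]
example_auroc = auroc(labels, scores)
print(example_auroc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the median AUROC of 0.73 reported above.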
Feature Importance. Since we Z-transform the features before training the LR classifier, the magnitude and sign of the LR beta values (coefficients) give an indication of a feature’s importance and relationship with gentrification. The beta values are presented in Table 5; we see that the ‘% Black’, ‘% White’ and ‘% College Educated’ are the most important socio-economic features and the ‘Degree’ and ‘Clustering Coefficient’ the most important network features. Tracts with a lower degree are shown to be more likely to gentrify. This may be explained by the fact that gentrification is often driven by developers buying plots of land and building luxury housing units [15]. Weems et al. showed a positive correlation between the degree and property prices for an origin-destination network in NYC [29]. Thus, less central tracts being more likely to gentrify may be indicative of developers seeking out cheap plots of land for their projects. We also see that a tract with a lower clustering coefficient is less likely to gentrify. A low clustering coefficient indicates structural holes in the network [17]. In the literature of social network analysis it has been argued that individuals located at structural holes are more likely to have good ideas [4]. It might be expected that tracts which are exposed to good ideas (or perhaps better referred to as trends in an urban context) may be more likely to gentrify. However, this interpretation is seemingly at odds with the relationship discovered here. Considering gentrification as a diffusion-like process, as argued by Redfern [23], may be useful in explaining this discrepancy. Tracts at structural holes, to which there is a limited flow of capital and ideas [17], may be less likely to be involved in this diffusion process.
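The Z-transform mentioned at the start of this paragraph is what makes the beta values comparable across features: each feature is rescaled to zero mean and unit standard deviation, so a coefficient describes the effect of a one-standard-deviation change. A minimal sketch follows, using the population standard deviation and made-up values.

```python
from statistics import mean, pstdev


def z_transform(column):
    """Rescale a feature column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


# Hypothetical raw feature values (e.g. node degrees of eight tracts).
raw = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_transform(raw)
print(z)
```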
Table 5. Mean beta values (over 1,000 training/testing splits) for a LR classifier considering the 2011–2014 Weighted Origin-Destination network. Statistically significant values, with p < 0.05, are in bold.

Socio-economic Feature          Beta value
% Black                              0.559
% White                              0.577
Distance to downtown                 0.468
% College Educated                  -0.838
% Unemployed                        -0.343
% Renters                           -0.509
Subway Travel Time                  -0.487
Bronx                                0.746
Brooklyn                             1.109
Manhattan                            0.565
Queens                               0.460

Network Feature                 Beta value
Degree                              -1.607
Average Neighbour Degree             0.059
Links to Recently Gentrified        -0.130
Shannon Entropy                      0.097
Clustering Coefficient               2.269
In fact, the degree and clustering coefficient have the largest-magnitude beta values of the features considered. This emphasises that, in addition to traditional factors, the potential exchange of capital and ideas associated with these network features is important to consider in understanding and predicting gentrification. Finally, the beta values indicate that, on average, tracts in the Bronx and Brooklyn are more likely to gentrify than tracts in the other boroughs. However, this tendency is not statistically significant (with p-values of 0.48 and 0.24, respectively). Nevertheless, analysing differences between the boroughs with regard to gentrification would be an interesting future direction.
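To illustrate why Z-transforming the features makes the LR beta values comparable, the following minimal sketch standardises two features on very different scales and fits a logistic regression by plain gradient descent. The data are synthetic and the function names (`z_transform`, `fit_logistic`) are hypothetical; this is not the classifier or the data used in the paper, only an illustration of the idea under those assumptions.

```python
import math
import random

def z_transform(column):
    """Standardise one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / std for x in column]

def fit_logistic(rows, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic loss; returns (betas, intercept)."""
    n, d = len(rows), len(rows[0])
    betas, intercept = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_b, grad_0 = [0.0] * d, 0.0
        for row, y in zip(rows, labels):
            z = intercept + sum(b * x for b, x in zip(betas, row))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_0 += err
            for k in range(d):
                grad_b[k] += err * row[k]
        intercept -= lr * grad_0 / n
        betas = [b - lr * g / n for b, g in zip(betas, grad_b)]
    return betas, intercept

# Synthetic tracts (NOT the paper's data): a degree in the hundreds and a
# clustering coefficient in [0, 1]; low degree and high clustering raise the
# chance of the positive label.
rng = random.Random(0)
degree = [rng.uniform(50, 500) for _ in range(200)]
clustering = [rng.uniform(0.0, 1.0) for _ in range(200)]
labels = []
for deg, clu in zip(degree, clustering):
    logit = -0.01 * (deg - 275) + 4.0 * (clu - 0.5)
    labels.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-logit)) else 0)

rows = list(zip(z_transform(degree), z_transform(clustering)))
betas, intercept = fit_logistic(rows, labels)
print(dict(zip(["degree", "clustering"], betas)))
```

Because both columns have unit variance after the transform, the fitted coefficients can be compared directly by magnitude, and their signs give the direction of the association.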
5 Discussion
We have presented a methodology for using large-scale mobility data alongside socio-economic data for gentrification prediction, using NYC as a case study. As part of this, we presented a discussion of different methods for inferring a mobility network from mobility data (Sect. 3), and showed that features extracted from these networks can be used to improve the performance of gentrification prediction (Sect. 4). Finally, we also provided a qualitative discussion of network features identified as particularly important (Sect. 4.2). The methodology and results presented in this paper have limitations. The vast majority of taxi trips in NYC take place in Manhattan, so the Trip Record Data may not accurately capture the movement of all the residents of NYC. There is also no single definition of gentrification used by social scientists, which means that results are difficult to compare across studies. To build upon the research presented in this paper, other forms of mobility data could be considered, for instance bus or subway data. The methodology used here could be adapted to a finer spatial and temporal resolution to further benefit from the advantages of mobility data. The Co-Work Location network definition presented here could be expanded by automatically detecting gathering events. Finally, the methodology presented in this paper should also be applied to different cities and time periods.
References
1. Alejandro, Y., Palafox, L.: Gentrification prediction using machine learning. In: 18th Mexican International Conference on Artificial Intelligence. MICAI 2019, Xalapa, Mexico, pp. 187–199. Springer (2019)
2. Bastian, M., Heymann, S., Jacomy, M.: Gephi: an open source software for exploring and manipulating networks. In: International AAAI Conference on Weblogs and Social Media, pp. 361–362 (2009)
3. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theor. Exp. 2008(10), 1–12 (2008)
4. Burt, R.S.: Structural holes and good ideas. Am. J. Sociol. 110(2), 349–399 (2004)
5. Chapple, K., et al.: Mapping Susceptibility to Gentrification: The Early Warning Toolkit. Technical report, University of California, Berkeley (2009). https://communityinnovation.berkeley.edu/sites/default/files/mapping susceptibility to gentrification.pdf
6. Chapple, K., Loukaitou-Sideris, A., Chatman, D., Waddell, P., Ong, P.: Developing a New Methodology for Analyzing Potential Displacement. Technical report, University of California, Berkeley (2016)
7. Deri, J.A., Moura, J.M.: Taxi data in New York city: a network perspective. In: 49th Asilomar Conference on Signals, Systems and Computers, pp. 1829–1833 (2015)
8. Frank, A.: What Does It Take To See Gentrification Before It Happens? (2017). https://www.npr.org/sections/13.7/2017/08/29/546980178/what-does-ittake-to-see-gentrification-before-it-happens?t=1587409355525
9. Ginsberg, J.R., Young, T.P.: Measuring association between individuals or groups in behavioural studies. Anim. Behav. 44(2), 377–379 (1992)
10. Google: Google Maps. https://www.google.com/maps/dir//Union+Square,+New+York,+NY+10003,+USA/@40.7358421,-74.0611234,12z/data=!3m1!4b1!4m9!4m8!1m0!1m5!1m1!1s0x89c259989e14aa8b:0xcd00afc9db20caa4!2m2!1d-73.9910835!2d40.7358633!3e3
11. Hagberg, A.A., Schult, D.A., Swart, P.J.: Exploring network structure, dynamics, and function using NetworkX. In: 7th Python in Science Conference (SciPy 2008), pp. 11–15 (2008)
12. Hristova, D., Williams, M.J., Musolesi, M., Panzarasa, P., Mascolo, C.: Measuring urban social diversity using interconnected geo-social networks. In: 25th International World Wide Web Conference, WWW 2016, pp. 21–30 (2016)
13. Hyra, D.S.: Race, Class, and Politics in the Cappuccino City. The University of Chicago Press, Chicago (2017)
14. Knorr, D.: Using Machine Learning to Identify and Predict Gentrification in Nashville, Tennessee. Technical report, Vanderbilt University (2019)
15. Kohli, S.: Developers have figured out the secret sauce for gentrifying neighborhoods (2015). https://qz.com/408986/developers-have-figured-out-the-secretsauce-to-gentrification/
16. Liu, X., Gong, L., Gong, Y., Liu, Y.: Revealing travel patterns and city structure with taxi trip data. J. Transp. Geogr. 43, 78–90 (2015). https://doi.org/10.1016/j.jtrangeo.2015.01.016
17. Newman, M.E.J.: Networks: An Introduction, 1st edn. Oxford University Press (2010)
18. NYC Taxi & Limousine Commission: NYC Taxi and Limousine Commission (TLC) Trip Record Data. https://www1.nyc.gov/site/tlc/about/tlc-trip-recorddata.page
19. Patil, N.: Characterizing & analyzing networks: NYC taxi data. https://nileshpatil.github.io/blog/transportation-graph-nyc-taxi-data/#full-network-analysis
20. Pedregosa, F.: Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
21. Psorakis, I., Roberts, S.J., Rezek, I., Sheldon, B.C.: Inferring social network structure in ecological systems from spatiotemporal data streams. J. R. Soc. Interface 9(76), 3055–3066 (2012)
22. Reades, J., De Souza, J., Hubbard, P.: Understanding urban gentrification through machine learning. Urban Stud. 56(5), 922–942 (2019)
23. Redfern, P.A.: A new look at gentrification: 2. A model of gentrification. Environ. Plann. A: Econ. Space 29(8), 1335–1354 (1997)
24. Renne, J., Bennett, P.: Socioeconomics of urban travel: evidence from the 2009 national household travel survey with implications for sustainability. World Transp. Policy Pract. 20(4), 7–27 (2014)
25. Rigolon, A., Németh, J.: Toward a socioecological model of gentrification: how people, place, and policy shape neighborhood change. J. Urban Affairs 41(7), 887–909 (2019)
26. Santus, R.: How anti-gentrification activists derailed Amazon’s New York City plans (2019). https://www.vice.com/en/article/nex34z/how-anti-gentrificationactivists-derailed-amazons-new-york-city-plans
27. US Census Bureau: American Community Survey 5-year estimates (2018). https://data.census.gov/cedsci/table?g=0500000US36005,36047,36061,36081,36085&d=ACS%205-Year%20Estimates%20Data%20Profiles&tid=ACSDP5Y2018.DP02&hidePreview=false
28. US Census Bureau Geography Division: TIGER/Line Shapefiles. https://www.census.gov/cgi-bin/geo/shapefiles/index.php
29. Weems, B., Field, E., Ward, T.: New York Taxi Network: Community Structure and Predictive Analysis. Technical report, Stanford (2016). http://snap.stanford.edu/class/cs224w-2016/projects/cs224w-59-final.pdf
Connecting the Dots: Integrating Point Location Data into Spatial Network Analyses
Shuto Araki1 and Aaron Bramson1,2,3(B)
1 GA Technologies, Inc., Roppongi 3-2-1, Minato-ku, Tokyo 106-6290, Japan
2 RIKEN Center for Biosystems Dynamics Research, Laboratory for Symbolic Cognitive Development, Minatojima-Minamimachi 6-7-3, Chuo-ku, Kobe 650-0047, Japan
3 Ghent University, Department of General Economics, Tweekerkenstraat 2, 9000 Ghent, Belgium
a [email protected]
Abstract. Transportation networks allow us to model flows of people and resources across geographic space, but the people and resources we wish to model are often not natively tied to our networks. Instead, they can occur as point data (such as store, train station, and domicile locations) and/or grid data (such as socio-economic and aggregate area data). Here we present a set of methods to integrate point data into an augmented transportation network. This method facilitates analyses of temporo-spatial measures (such as accessibility scores) using only efficient breadth first search algorithms. We demonstrate the approach by calculating walkability scores for the train stations within the central Tokyo area.
Keywords: Spatial networks · Transportation networks · Data integration · Accessibility · Walkability
1 Introduction
Performing transportation network analyses typically requires data from multiple sources, and some of them may not fit into the network as attributes of existing nodes or edges. Both point and grid data are examples of data requiring additional integration steps. If one considers these points as additional nodes, one needs to consider how these point nodes should be related to/from other nodes in the network and how those configurations affect network features. This paper presents a refined set of methods for integrating additional location point data into a transportation network in ways that improve accessibility scoring (among other things). Using the central Tokyo metropolitan area as an example, we demonstrate the effectiveness of our integration method by ranking railway stations by their ‘walkability,’ a measure reflecting the degree to which surrounding amenities are reachable by foot. Our method produces a more realistic and personalizable metric of accessibility to surrounding stores than a simple count of nearby stores.
The integration methods described here are useful beyond the calculation of walkability scores of train stations. We chose train stations because their locations are publicly available and the Tokyo urban area is highly train-centric: half of all transportation uses the rail system and urban development is planned around stations [2,3]. Considering this, it is natural to focus on these stations to assess and compare accessibility.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 193–205, 2021. https://doi.org/10.1007/978-3-030-65351-4_16
2 Data Sets
Network Data. Our base network data is the road network for the Tokyo area from OpenStreetMap (OSM) [13]. The OSM road network includes nodes for all intersections as well as nodes to capture the curvature of the roads with straight edges. We isolate the largest connected component (98% of the nodes); this action removes remote islands, many pedestrian walkways, some access roads (e.g., within amusement parks and gardens), etc. Ideally we would use the footpath network data that includes sidewalks, pedestrian bridges, multi-use paths, greenways, etc. to more accurately determine accessibility via walking [4]; however, such data is not reliably available for Tokyo at present. Using the coordinates of the nodes, we generate the length of each edge using Haversine distance.
Point Data. Our store data comes from NTT Townpage [12], a private data service that provides lists of stores and other entities by category based on phone numbers. For the current demonstration, we limit our analysis to within Tokyo’s 23 Wards (central Tokyo) and to establishments within the following categories: variety store, hobby, travel, restaurant, cafe, bar, supermarket, convenience store, hospital, drugstore, laundry, public bath, spa, hotel, sport shop, sporting, cram school, nursery school, religion, and areas of concern (such as gambling establishments). For simplicity we refer to all these establishments as ‘stores.’ Note that because store locations are considered as nodes, there can be multiple stores in one store node. In order to assign walkability scores to train and subway stations in Sect. 4, we also need to incorporate the locations of those railway stations. We use data from [11] for the names, longitude, and latitude of the stations, and augment this with the locations of subway entrances and station turnstiles from OSM [13].
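As an illustration of the edge-length step, the Haversine distance between two nodes given as (latitude, longitude) pairs can be sketched as follows. The function name, the mean Earth radius constant, and the toy node coordinates are assumptions made for this example, not values taken from the paper.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical road nodes and one edge between them.
nodes = {"n1": (35.6812, 139.7671), "n2": (35.6822, 139.7671)}
edges = [("n1", "n2")]
edge_length = {(u, v): haversine_m(*nodes[u], *nodes[v]) for u, v in edges}
```

Storing the length on each edge once, at construction time, means later traversals only need to sum precomputed weights.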
3 Integrated Analysis
An analysis using the integrated datasets requires two components: network construction and network traversal. First we describe the procedure to augment the road network with additional point data and edges, then we describe changes to network traversals incorporating mobility considerations.
3.1 Connecting the Road Network to Points
Both store and station/exit nodes require integration with the road network, and our method differs between them. Integrating the Stores to the Road Network. The first step of the multimodal network construction is to integrate the stores to the road network. The general process is shown in Fig. 1; each store’s building location is connected to the road network at its closest point. Specifically, for each store location, there needs to be an edge between the store and a newly created node at the closest point along its closest road edge. Because the distance is minimized, the road edge and store access edge are perpendicular. When the closest element of the road network is a node rather than an edge, we connect the store directly to that road network node. Although this method emulates how one enters the stores, the actual store entrance may be different from the closest road network edge/node.
Fig. 1. Steps to connect the store point data to the road network data to create the store-augmented multimodal transportation network. Panels: (a) Base Road and Building Data; (b) Road Network Data; (c) Store Location Point Data; (d) Road Network Extended to Stores.
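The geometric core of the store-integration step, finding the closest point on a road edge to a store, can be sketched as a projection onto a line segment. This is a minimal illustration assuming planar (already projected) coordinates in metres; the function name and the example coordinates are hypothetical.

```python
def closest_point_on_segment(p, a, b):
    """Return (closest point on segment a-b to p, parameter t in [0, 1]).

    t == 0 or t == 1 means the closest element is an existing road node, so the
    store can be linked to that node directly; otherwise a new access node is
    created at the returned point and the road edge is split there.
    """
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate edge: both endpoints coincide
        return (ax, ay), 0.0
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp so the point stays on the segment
    return (ax + t * dx, ay + t * dy), t

# A store 3 m north of the middle of a 10 m east-west road edge.
access_point, t = closest_point_on_segment((5.0, 3.0), (0.0, 0.0), (10.0, 0.0))
```

When the unclamped projection falls strictly inside the segment, the access edge is perpendicular to the road edge, which matches the property noted in the text.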
In order to minimize the computational time to find the nearest edge, the Sort-Tile-Recursive (STR) tree data structure [9] is used to identify the closest edge from each store location. First, we construct an STR tree by treating the network edges as line geometry objects. Second, we convert each store node into a circle geometry object by adding a 125 m radius buffer. In this way, we can
query the STR tree for all edges that overlap or intersect with the store’s circle. From this set of closest edge candidates, we use a binary heap data structure (min-heap) to track the actual closest edge. Because the distance calculation is computationally expensive compared to the STR tree query, this two-step method saves considerable time compared to checking the distance from each store to every edge.
Setting the Coordinate Reference System. One caveat to this method is that the distance between a store and its nearby edges can be calculated using standard Euclidean distance only if the Coordinate Reference System (CRS) approximates a Euclidean space around the area of interest. Since the earth is nearly spherical, analyzing geospatial data on a flat plane requires a projection, and any projection of a 3D surface onto a 2D surface comes with some amount of distortion. The Mercator projection, for example, is often used for mapping because it preserves shapes and angles, but it heavily distorts areas and distances as you get further away from the reference point. Figure 2 shows the difference between the Web Mercator projection (epsg:3857 in red) and a distance-preserving CRS (+proj=eqc +lat_0=35.6825 +lon_0=139.7671 +units=m in blue). Since the visualization engine (Kepler.gl [7]) uses the Web Mercator projection, it visualizes edges that are connected to minimize distance using the distance-preserving CRS as crooked (i.e., they do not appear perpendicular). If we calculate minimum distance using the Web Mercator projection, the edges appear perpendicular and correct to our eyes; however, the distances have been distorted. As seen in Fig. 2, this small discrepancy in CRS settings can cause some store nodes to be connected to a different edge. Misidentification of the nearest edge from a certain store could therefore influence the accuracy in assessing reachability of the store, and we utilize the distance-preserving CRS throughout.
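To give a feel for the size of the distortion, the sketch below compares planar distances under a local equirectangular approximation centred on the origin given in the text with those under the spherical Web Mercator formulas. Scaling longitudes by the cosine of the origin latitude is an assumption of this sketch and may not match the exact parameters of the CRS used in the paper; the function names and test points are likewise hypothetical.

```python
import math

R_MEAN = 6_371_000.0      # mean Earth radius in metres (assumed)
R_MERCATOR = 6_378_137.0  # sphere radius used by Web Mercator
LAT0, LON0 = 35.6825, 139.7671  # origin taken from the CRS string in the text

def local_xy(lat, lon):
    """Local equirectangular approximation; longitudes scaled by cos(LAT0)."""
    x = R_MEAN * math.radians(lon - LON0) * math.cos(math.radians(LAT0))
    y = R_MEAN * math.radians(lat - LAT0)
    return x, y

def web_mercator_xy(lat, lon):
    """Spherical Web Mercator forward projection."""
    x = R_MERCATOR * math.radians(lon)
    y = R_MERCATOR * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

# Two points at the same latitude, roughly 1 km apart east-west.
p, q = (LAT0, LON0), (LAT0, LON0 + 0.011)
d_local = math.dist(local_xy(*p), local_xy(*q))
d_mercator = math.dist(web_mercator_xy(*p), web_mercator_xy(*q))
inflation = d_mercator / d_local
```

At Tokyo's latitude the Web Mercator distance comes out noticeably larger than the local one, which is why a nearest-edge decision made in one CRS can differ from the decision made in the other.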
Fig. 2. The choice of coordinate reference system (CRS) impacts the calculation of distances and visualization of edges extended to stores. Red edges look perpendicular when calculated using the same Web Mercator projection used for visualization, but the blue edges are the shortest when calculating with a distance-preserving CRS.
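The two-step nearest-edge search described earlier in this section (a cheap spatial query for candidates, then exact distances tracked in a min-heap) can be sketched with the standard library alone. Note that a uniform grid index stands in for the STR tree here, purely to keep the example dependency-free; the cell size, function names, and toy edges are assumptions of this sketch, and coordinates are taken to be planar metres.

```python
import heapq
import math
from collections import defaultdict

CELL = 125.0  # grid cell size in metres; chosen to match the 125 m buffer

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to segment a-b (planar coordinates)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    t = 0.0 if length_sq == 0 else ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def cells_covering(x0, y0, x1, y1):
    """Yield the grid cells overlapped by an axis-aligned bounding box."""
    for cx in range(int(x0 // CELL), int(x1 // CELL) + 1):
        for cy in range(int(y0 // CELL), int(y1 // CELL) + 1):
            yield cx, cy

def build_index(edges):
    """Register each edge id in every cell its bounding box overlaps."""
    index = defaultdict(set)
    for edge_id, (a, b) in edges.items():
        box = (min(a[0], b[0]), min(a[1], b[1]), max(a[0], b[0]), max(a[1], b[1]))
        for cell in cells_covering(*box):
            index[cell].add(edge_id)
    return index

def nearest_edge(store, edges, index, radius=125.0):
    """Step 1: coarse candidate query. Step 2: exact distances on a min-heap."""
    x, y = store
    candidates = set()
    for cell in cells_covering(x - radius, y - radius, x + radius, y + radius):
        candidates |= index.get(cell, set())
    heap = [(point_segment_distance(store, *edges[e]), e) for e in candidates]
    heapq.heapify(heap)
    return heap[0] if heap else None  # (distance, edge id) or None

edges = {
    "e1": ((0.0, 0.0), (100.0, 0.0)),
    "e2": ((0.0, 50.0), (100.0, 50.0)),
    "far": ((1000.0, 1000.0), (1100.0, 1000.0)),
}
index = build_index(edges)
result = nearest_edge((40.0, 10.0), edges, index)
```

The expensive distance function is only evaluated for the handful of candidates returned by the coarse query, which is the source of the time saving; the coarse query may return a few edges slightly beyond the buffer, which is harmless because the heap still selects the true minimum.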
Integrating the Station Exits to the Road Network. Our approach to integrating rail station data differs from the method used for stores. While stores are single point data representing the center of the appropriate building, stations are often large and multiply connected structures. There are three kinds of rail systems: subway, surface rail, and trams (streetcars). Ideally we would connect the entrances/exits of the stations to the closest point on the road network, and then connect the entrances to a station’s main location point. In the OSM data, nearly all subway stations have accurate exit locations, but most surface and tram stations have only one exit point per station (and it is not an actual exit point). This limitation in our data requires us to approximate how the stations are connected to the road network. Rather than handle stations on a case-by-case basis, we decided to create an adaptive rule that can be parsimoniously applied in order to maintain generality and hence applicability beyond Tokyo. First, each exit node is connected to the station node to which it is closest. In some cases this differs from the station to which that exit officially corresponds, but this is a reasonable approximation considering the high level of interconnectedness within stations and the practical implications for access. Second, station exit nodes are connected to all road nodes within a certain, but adaptive, distance around them. The radius of connectivity is determined in the following manner: starting with r = 10, if there is at least one road node within a circle with a radius of r meters, connect to all road nodes within a larger circle of radius r + 10 m. Otherwise, increment r by 10 and repeat the process. This results in the station-augmented network shown in Fig. 3 (note that we use the road network after it has been augmented by the store access nodes).
Fig. 3. Stations (green) are connected to exits (yellow) which are in turn connected to their surrounding road nodes (red). Using increasingly larger circles ensures at least one connection, but may also include additional road nodes within a similar distance (Color figure online).
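The adaptive radius rule for connecting an exit to the road network can be written down in a few lines. This sketch assumes planar coordinates in metres; the function name, the `max_radius` safety cap, and the toy road nodes are additions for illustration and are not part of the rule as stated in the text.

```python
import math

def connect_exit(exit_xy, road_nodes, step=10.0, max_radius=1000.0):
    """Return ids of road nodes to link to one exit using the adaptive rule.

    Grow r in 10 m steps until at least one road node lies within r, then
    connect to every road node within r + 10 m. `max_radius` is a hypothetical
    cap so the loop terminates even when no road node is nearby.
    """
    r = step
    while r <= max_radius:
        if any(math.dist(exit_xy, xy) <= r for xy in road_nodes.values()):
            return sorted(n for n, xy in road_nodes.items()
                          if math.dist(exit_xy, xy) <= r + step)
        r += step
    return []

# Road nodes at 12 m, 25 m, 31 m and 80 m from an exit at the origin.
road_nodes = {"a": (0.0, 12.0), "b": (0.0, 25.0), "c": (0.0, 31.0), "d": (0.0, 80.0)}
linked = connect_exit((0.0, 0.0), road_nodes)
```

In this example the first hit occurs at r = 20 m, so the exit is linked to every road node within 30 m: the nodes at 12 m and 25 m, but not the one at 31 m.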
In practice this method overconnects the exit nodes to the dense road network, resulting in redundant edges as can be seen in Fig. 4 (yellow lines). However, a stricter rule suffers from underconnection; specifically, stations are often connected only on one side and this underestimates their accessibility. Although extra links worsen the performance of network traversals, they have a negligible effect on accessibility measurements, which we consider to be of greater importance. In the future we will explore ways to reduce the edge redundancy in a parsimonious and generalizable manner.
Fig. 4. Geospatial network diagram showing the subgraph induced by traversing 15 min along the integrated network from the node for Shinjuku station. Green station nodes are linked to yellow/red exit nodes via green links; yellow exit nodes are linked to red road nodes via yellow links; red road nodes are linked to each other via red links and to blue store nodes via blue links. Note that the Kepler.gl visualization engine introduces artifacts (missing nodes, wrong colors, etc.) not present in the data (Color figure online).
3.2 Integrated Network Traversals
The simplest measure of accessibility is the number of stores within a radius of a focal point. A simplistic network-based approach uses the number of stores within a distance to a reachable road segment. However, our fully integrated network approach allows us to precisely measure the time required to traverse any origin-destination path using standard breadth first search algorithms. In this paper, the accessibility score of a station is the time-weighted total number of stores reachable from that station. Each store node j has m_j ≥ 1 stores located there. The contribution ω_j of store node j to the accessibility of station node i is time-discounted using a cosine-based function of the traversal time t_ij from i to j that reaches zero at t_ij = T (shown in Eq. 1). We chose this functional form because it allows us to control the willingness to walk via the λ parameter to emulate objectively measured moderate and vigorous physical activity (MVPA) data by age cohort [5,17]. Lower λ values correspond to those who prefer shorter distances while larger λ values delay the reduction in score contribution. Obviously, the cosine function rebounds after T, so we prune t > T. The weights by traversal time for three values of λ are shown in Fig. 5. Further adjustments to T and λ, or alternative functional forms, can capture other means of transportation.

$$\omega_j = \frac{m_j}{2}\left(1 + \cos\frac{\pi\, t_{ij}^{\lambda}}{T^{\lambda}}\right) \tag{1}$$
Fig. 5. A plot of the function for discounting the number of reachable establishments by the time needed to reach them using T = 15 (15 min by foot; 1250 m at 5 kph). Higher λ values delay the score reduction.
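The time-discounted score can be sketched end to end: a shortest-path search pruned at T minutes gives the traversal time to each store node, and each store node then contributes its store count multiplied by the cosine discount. The sketch uses Dijkstra's algorithm, reads the discount as (1/2)(1 + cos(π t^λ / T^λ)), and runs on a toy graph; the function names and the graph are hypothetical and only illustrate the calculation.

```python
import heapq
import math

def discount(t, T=15.0, lam=1.0):
    """Cosine discount: 1 at t = 0, falling to 0 at t = T, pruned beyond T."""
    if t > T:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * t ** lam / T ** lam))

def travel_times(graph, source, T=15.0):
    """Dijkstra pruned at T minutes; graph maps node -> {neighbour: minutes}."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > best.get(node, math.inf):
            continue  # stale heap entry
        for nbr, minutes in graph.get(node, {}).items():
            nt = t + minutes
            if nt <= T and nt < best.get(nbr, math.inf):
                best[nbr] = nt
                heapq.heappush(heap, (nt, nbr))
    return best

def walkability(graph, station, store_counts, T=15.0, lam=1.0):
    """Sum of m_j * discount(t_ij) over store nodes reachable within T."""
    times = travel_times(graph, station, T)
    return sum(m * discount(times[j], T, lam)
               for j, m in store_counts.items() if j in times)

# Toy network: store node A is 5 min away, B is 14 min away, C is 20 min away.
graph = {
    "S": {"r1": 3.0},
    "r1": {"A": 2.0, "r2": 8.0},
    "r2": {"B": 3.0, "C": 9.0},
}
store_counts = {"A": 2, "B": 1, "C": 5}
times = travel_times(graph, "S")
score = walkability(graph, "S", store_counts, lam=1.0)
```

With λ = 1 the two stores at A (5 min) each count 0.75, the single store at B (14 min) counts close to zero, and the five stores at C are never reached because C lies beyond the 15 min cutoff.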
4 Demonstration via Walkability Scores
The focus of this work is presenting the refined physical network integration methods, which are generally applicable across mobility and accessibility research. By integrating a fine-grained road or footpath network augmented with access nodes and links to points of interest one can 1) determine best paths based on multiple criteria, 2) score and classify regions based on network features, 3) evaluate the impact of construction plans, and 4) assess various other social and transportation issues. As already noted, it is especially useful when scoring places based on their accessibility on the network. We demonstrate its effectiveness using a simple walkability scoring application.
4.1 Accessibility Scoring
We use the term ‘accessibility’ as an umbrella concept that includes any assessment of the ability to reach/use surrounding resources, broadly construed. Various measures of accessibility have been developed over the years, and applications have ranged from access to jobs, access to other people, access to food shopping, etc. [1,10,14]. Early accessibility research focused on efficiency and energy consumption [5], while recent work has focused more on personalized metrics. For example, Quercia et al. [15] identify the paths between points in London that are more beautiful, quiet, and happy. Using data about slopes, steps, ramps, elevators, etc. one can determine how well people with specific disabilities can access a location as well. All of these count as accessibility scoring, and in this paper, we refer to accessibility via walking as ‘walkability’.
4.2 Walkability Scores
Driven by a desire to promote exercise and reduce carbon emissions from vehicles, there has been a recent boom in research on walkability. One frequently used measure of walkability is Walk Score® [18], which is focused on North America but partially validated for Japan by [8]. That measure’s details are not public, so we cannot reproduce them for comparison; however, it seems to simply count the number of establishments reachable from nodes of the large-scale road network using decreasingly sized circular buffers based on the time to that node. The Walkability Index of [16] uses a uniform buffer on the fine-grained network, but includes other considerations such as diversity of establishments. No available method leverages an integrated network to discount the contribution of further establishments.
We compare five different walkability measures. The first one is a baseline that counts the stores within 1250 m (the distance an average person can walk in 15 min) of each station. The second one uses Dijkstra’s algorithm on the integrated network to determine the number of accessible stores, but no time-weighting is applied. The third, fourth, and fifth measures take the same results from Dijkstra’s algorithm, but use the discount function shown in Eq. 1 with T = 15 and three different λ values (2.0, 1.0, and 0.5) to weight the results. As explained in Sect. 3.2, these different λ values are designed to model one’s willingness to walk. For example, with a walkability score of λ = 0.5, a station with stores very close to the station could get a higher score than another station with twice as many stores within the 1250 m circle, but all more than 5 min away, because at t = 5 the store counts are already discounted to around 40% as shown in Fig. 5.
Case Analysis. Although there is no “true” walkability level against which to measure accuracy, we do find that these different types of walkability scores reveal interesting differences in what they are measuring. 
Table 1 shows the top 15 stations ranked by the five measures. We recognize that many readers are not familiar with the areas of Tokyo, so we will explain the kinds of insights our method reveals using a few example stations.

Table 1. Comparison of the highest walkability stations by score.

Rank  In 1250 m radius            Unweighted                  λ = 2                       λ = 1                       λ = 0.5
1     Yurakucho (6064)            Ginza (4749)                Ginza (3066)                Ginza (2023)                Ginza (980)
2     Ginza (6023)                Ginza Icchome (4327)        Shinbashi (2833)            Shinbashi (1880)            Shinbashi (934)
3     Ginza Icchome (6015)        Hibiya (4201)               Shinjuku Nishiguchi (2525)  Shinjuku Nishiguchi (1669)  Shinjuku Nishiguchi (838)
4     Kyobashi (5707)             Shinbashi (4180)            Shinjuku (2478)             Shinjuku Sanchome (1602)    Shinjuku Sanchome (821)
5     Hibiya (5548)               Yurakucho (4115)            Uchisaiwai (2427)           Shinjuku (1560)             Shinjuku (718)
6     Higashi Ginza (5389)        Higashi Ginza (3987)        Shinjuku Sanchome (2409)    Uchisaiwai (1450)           Ikebukuro (651)
7     Shinbashi (5320)            Uchisaiwai (3891)           Ginza Icchome (2261)        Seibu Shinjuku (1367)       Seibu Shinjuku (650)
8     Takaracho (5309)            Shinjuku Nishiguchi (3667)  Yurakucho (2194)            Ginza Icchome (1324)        Uchisaiwai (646)
9     Shiodome (5210)             Shinjuku (3658)             Seibu Shinjuku (2172)       Yurakucho (1269)            Ginza Icchome (621)
10    Tsukiji Market (4977)       Shiodome (3603)             Higashi Ginza (2167)        Higashi Ginza (1238)        Shinsen Shinjuku (583)
11    Uchisaiwai (4669)           Shinjuku Sanchome (3579)    Shinsen Shinjuku (2036)     Ikebukuro (1232)            Yurakucho (568)
12    Nihonbashi (4369)           Seibu Shinjuku (3501)       Hibiya (1797)               Shinsen Shinjuku (1218)     Ueno Hirokoji (558)
13    Onarimon (4232)             Shinsen Shinjuku (3464)     Shiodome (1723)             Shibuya (1074)              Higashi Ginza (556)
14    Tokyo (4207)                Kyobashi (3190)             Ikebukuro (1698)            Ueno Okachimachi (1017)     Ueno Okachimachi (547)
15    Shinjuku Nishiguchi (4179)  Takaracho (2892)            Kyobashi (1562)             Ueno Hirokoji (998)         Shibuya (547)
The most obvious pattern is that the Ginza metro station dominates this ranking. Ginza is known for its massive shopping streets, eateries, and entertainment venues; thus it is not surprising to see that Ginza and its nearby stations (Shinbashi and Yurakucho) are consistently ranked near the top of the lists. Perhaps more interesting is that many stations surrounding Ginza (Ginza Icchome, Kyobashi, Hibiya, Higashi Ginza) fill the top spots of the in-radius ranks, but are pushed down further and further as we move to unweighted, and increasingly strict discounting. We now take a closer look at one of those surrounding stations: Hibiya metro station. It is ranked 5th and 3rd in the circle baseline and unweighted approach, respectively; however, its rank drops significantly as the discounting is applied (12 → 16 → 20). The reason for this is clear when looking at a map: one can reach Yurakucho within 10 min and almost to Ginza within 15 so its reach includes many of the surrounding larger shopping streets. But there are not
many stores around Hibiya station itself: one corner has the sprawling Imperial Palace and another the famous Hibiya Park. This result demonstrates the need for a weighted walkability score; failing to discount the contribution of further stores falsely promotes locations on the fringe of major shopping districts while downplaying the convenience of locations in the middle of smaller shopping areas.
Surprisingly, Shinjuku station, the busiest station in the world [6], also famous for its huge shopping and entertainment areas, only appears at rank 15 in the circle baseline approach, and even then it is a satellite station rather than the main one. The reason is Shinjuku station’s immense size: the station itself is hundreds of meters long and wide, so other stations (especially subway stations that have practically zero footprint) have more stores that are physically close. Those stations benefit from the distance one can travel in 15 min and the proportion of the area that supports having stores. One can see Shinjuku (as well as Shinjuku Nishiguchi, Shinjuku Sanchome, Seibu Shinjuku, and Shinsen Shinjuku) rising up the ranking as λ gets smaller. Those ‘walkable’ stations that were ranked high in the circle baseline approach rapidly fell from the ranking because their scores got significantly discounted.
Ikebukuro station exhibits a similar trend, but more drastic. Ikebukuro is a secondary city center with many stores (though not as many as Ginza or Shinjuku) near the station, but as a more recent development they do not sprawl out into surrounding territory. It did not rank high in either baseline (67th and 34th), but because it has a somewhat large number of stores focused around the station it reaches ranks 14 → 11 → 6 as λ decreases. This is an important case because anybody familiar with Tokyo would agree that Ikebukuro is a major and convenient shopping and entertainment hub, but the unweighted measures could not reveal this characteristic. 
Similarity Analysis. Although an analysis of specific stations allows us to compare the resulting walkability scores with our intuitions, Fig. 6 shows the similarities between each pair of measures using the Kendall rank correlations. This statistic takes two ordered lists and computes the number of pairs in the same order, minus the number of pairs in a different order, and divides by the number of possible pairs. It informs us how similarly two lists of the same items are ranked. Note that the circle baseline approach is most similar to the unweighted approach and becomes less similar as λ decreases. This result is not surprising considering how the discount function heavily penalizes stores further away. One can also observe the high similarities among the four network-based measures. The fact that the unweighted approach is more similar to the case with λ = 2 than it is to the circle baseline approach suggests that there exist some distinct features that the network traversal was able to extract (i.e., stores that are within 1250 m but not actually reachable within 15 min due to circuity of the network and barriers such as rivers, railways, and highways) that are more important than the weighting. Although these similarity results are unsurprising, it is reassuring to get a confirmation of the intuitive relationships among these measures.
Fig. 6. Pairwise comparisons of the Kendall τ coefficients.
5 Conclusions and Future Work
Using the fine-grained road network data facilitates the discovery of accurate paths and therefore accurate traversal times. Augmenting this network to parsimoniously integrate access edges to points of interest (such as train stations and stores) allows us to calculate times from an origin to each potential destination using efficient network search algorithms. After describing our novel methods for capturing this physical network system, we presented a comparison of walkability scores showing the importance of network-based assessments and discounting establishments that are further away. We also demonstrated how varying the time-weighting parameter can capture differences in accessibility for different populations, such as the elderly or disabled. Based on this preliminary analysis, the integrated network achieves more believable scores compared to the circle baseline approach because walking paths in Tokyo are often meandering and complicated. To get from point A to point B, there rarely exists a straight path and therefore the circle baseline approach overestimates the number of realistically reachable stores. However, applying a discount function to the circle approach might be a good approximation of the integrated analysis because the further stores would get heavily discounted scores. We are currently investigating this approach for basic scoring. Although our network augmenting methodology produces more accurate paths, traversal times, and walkability scores, we recognize that accessibility measures that only include the time to places of business offer a narrow view of walkability. Rather than just focus on the degree to which people can get their shopping done on foot, one might also consider how pleasant an area is to walk through [15]. Including locations such as parks, gardens, riverside paths, scenic views, etc. offers a score of walk-worthiness. We could produce different measures for the various populations, interests, and purposes, and then generate a walkability score that combines these measures. For all these purposes and interests, the paths must be further analyzed beyond just traversal times. By incorporating building heights and footprints we can characterize neighborhoods by their openness. Data on green areas such as road-side trees and grassy medians is also clearly relevant. Typical noise and traffic levels can also be used to improve our assessment of walkability. Perhaps the most important factor needing inclusion is the slopes of road segments and a measure of the traversal effort. All these, along with parameterizations for bicycles, wheelchairs, and other mobility factors, are included in the walkability index we are developing based on the network methodology presented in this paper.
References
1. Biazzo, I., Monechi, B., Loreto, V.: Universal scores for accessibility and inequalities in urban areas. arXiv preprint arXiv:1810.03017 (2018)
2. Calimente, J.: Rail integrated communities in Tokyo. J. Transp. Land Use 5(1), 19–32 (2012)
3. Chorus, P., Bertolini, L.: An application of the node-place model to explore the spatial development dynamics of station areas in Tokyo. J. Transp. Land Use 4(1), 45–58 (2011)
4. Ellis, G., Hunter, R., Tully, M.A., Donnelly, M., Kelleher, L., Kee, F.: Connectivity and physical activity: using footpath networks to measure the walkability of built environments. Environ. Plan. B: Plan. Des. 43(1), 130–151 (2016)
5. Frank, L., Ulmer, J., Lerner, M.: Enhancing Walk Score's ability to predict physical activity and active transportation. In: Active Living Research Annual Conference, San Diego, CA (2013). http://activelivingresearch.org/sites/default/files/2013_Bike-WalkScore_Frank.pdf
6. Guinness World Records: Busiest station (2018). https://www.guinnessworldrecords.com/world-records/busiest-station
7. keplergl: kepler.gl, August 2020. https://github.com/keplergl/kepler.gl
8. Koohsari, M.J., Sugiyama, T., Hanibuchi, T., Shibata, A., Ishii, K., Liao, Y., Oka, K.: Validity of Walk Score® as a measure of neighborhood walkability in Japan. Prevent. Med. Rep. 9, 114–117 (2018)
9. Leutenegger, S.T., Lopez, M.A., Edgington, J.: STR: a simple and efficient algorithm for R-tree packing. In: Proceedings 13th International Conference on Data Engineering, pp. 497–506. IEEE (1997)
10. Levinson, D.: Network structure and city size. PLoS ONE 7(1), e29721 (2012)
11. Lüthy, M.: japan-train-data, May 2017. https://github.com/adieuadieu/japan-train-data
12. NTT Townpage Inc.: Townpage Database. Proprietary Dataset, July 2019
13. OpenStreetMap Contributors: Planet dump retrieved from planet.osm.org (2019). www.openstreetmap.org
14. Páez, A., Scott, D.M., Morency, C.: Measuring accessibility: positive and normative implementations of various accessibility indicators. J. Transp. Geogr. 25, 141–153 (2012)
15. Quercia, D., Schifanella, R., Aiello, L.M.: The shortest path to happiness: recommending beautiful, quiet, and happy routes in the city. In: Proceedings of the 25th ACM Conference on Hypertext and Social Media, pp. 116–125 (2014)
16. Shimizu, C., Baba, H., Kawase, T., Matsunawa, N.: Walkability and real estate value: development of walkability index. Online, June 2020. http://www.csis.u-tokyo.ac.jp/wp-content/uploads/2020/06/163.pdf
17. Trost, S.G., Pate, R.R., Sallis, J.F., Freedson, P.S., Taylor, W.C., Dowda, M., Sirard, J.: Age and gender differences in objectively measured physical activity in youth. Med. Sci. Sports Exerc. 34(2), 350–355 (2002)
18. Walk Score: Walk Score® (2020). https://www.walkscore.com/
Topological Analysis of Synthetic Models for Air Transportation Multilayer Networks
Marzena Fügenschuh (1), Ralucca Gera (2), and Andrea Tagarelli (3, ✉)
(1) Beuth University of Applied Sciences, Berlin, Germany
(2) Naval Postgraduate School, Monterey, CA, USA
(3) University of Calabria, Rende, Italy
Abstract. Airline transportation systems can naturally be modeled as multilayer networks, each layer capturing a different airline company. Originally conceived for mimicking real-world airline transportation systems, synthetic models for airline network generation can be helpful in a variety of tasks, such as simulation and optimization of the growth of the network system, analysis of its vulnerability, or strategic placement of airports. In this paper, we thoroughly investigate the behavior of existing generative models for airline multilayer networks, namely BINBALL, STARGEN, and ANGEL. To conduct our study, we used the European Air Transportation Network (EATN) and the domestic United States Airline Transportation Network (USATN) as references. Our extensive analysis of structural characteristics has revealed that ANGEL outperforms the two previously introduced generative models in terms of replication of the layers of the reference networks. To the best of our knowledge, this is the first study that provides a systematic comparison of generative models for airline transportation multilayer networks.
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 206–217, 2021. https://doi.org/10.1007/978-3-030-65351-4_17
Air transportation networks (ATNs) emerge as each airline carrier develops its own network based on economic and political factors, as well as interactions between airline companies. The resulting network system can conveniently be modeled as a multilayer network. Multilayer networks have been introduced as an extension of the monoplex networks [6,10,17]. For our research, this network representation model is desired, as each airline company would correspond to a layer, with the airports being represented by the nodes and the flights by the edges [5,16,19]. In addition, the multilayer network model is well-suited to represent the interrelations between airlines, which cannot be captured by studying the layers in isolation. Analyzing ATNs is essential to understand the logistics of airlines, including route planning and inter-dependencies of airports and airlines (e.g., [1,3,9,21]), but also to provide insights into the vulnerability of these systems to accidental events that can affect a country or even a continent, such as violent weather
phenomena, or virus propagation. To reliably conduct studies on these topics and validate hypotheses, multiple real-world data sets would be required. However, very few such data sets exist. To our knowledge, the European Air Transportation Network (EATN) [8] is the only collected data set that has each airline modeled as a different layer. This creates the need for multilayer synthetic models that can produce networks similar to real-world systems. Creating generative models for networks has been a very active research area, yet less attention has been devoted to synthetic multilayer network generation [4,6]. Common approaches to growing multilayer network models are based on the preferential attachment model, in analogy to most generative models that have been defined for social networks [5,16,19]. Very few models of airline transportation networks have been designed for multilayer networks, and their analysis has not been completely explored. In fact, apart from EATN [8], the world air transportation system has been modeled as a single-layer network and analyzed as such [14,15], as have the systems of individual countries such as the U.S. airline transportation system (USATN),¹ the Brazilian [20], the Indian [2], and the Chinese one [18,22]. It should be noted that in [18] the Chinese air traffic network is modeled as a multilayer network, but from a coarser perspective than the one our study requires: the model consists of three layers, i.e., a core layer including airports of provincial capital cities supporting most flight flow, and a bridge layer that connects the core layer with a periphery layer comprising more remote areas with no direct flight between them [11]. In the multilayer approach of [22], the Chinese air traffic network is divided into airway, route, and flight networks.
Our approach to modeling an airline network as a multiplex has its origins in [8], where the EATN is viewed as a composition of connections offered by different airlines, captured as layers. In this sense, a prototype of a generative model of EATN, called BINBALL, was introduced in [5]. Here, the layers are initialized with an equal number of nodes and edges are added in a preferential attachment manner. The STARGEN model introduced in [12] improves BINBALL by enforcing differentiated layer set sizes and their hub-spoke shape, but still relies on the preferential attachment method. In the ANGEL model presented in [13], we abandon preferential attachment in order to gain more influence on the intra- and inter-layer structure of the multiplex. The three models are outlined in Sect. 2. Our major contribution in this work is a comparison of all three models in terms of their topological structure, which is described in Sect. 3. As our results reveal, only the ANGEL model is capable of mimicking the macro- as well as the micro-structure of the underlying references, while BINBALL and STARGEN perform comparably mainly on the macro level. We summarize the results and propose further directions in Sect. 4.
2 Preliminaries
All the three models under consideration in this work, namely BINBALL, STARGEN, and ANGEL, are designed to mimic an airline network using the EATN as the reference network. We now provide a brief overview of how they work. Let us denote with l the number of layers of the multilayer network to be generated, with n and m the total number of nodes and edges, respectively, in the corresponding multiplex. The purpose of BINBALL is to generate a random multiplex network using a preferential attachment method on the basis of elementary information about the replicated multiplex, i.e., n, m, and l. In this model, nodes of the multiplex are distributed uniformly within the layers. The ends of links are chosen with respect to a local, i.e., layer-based, and global, i.e., multiplex-based, preferential attachment. The STARGEN model is an enhanced version of BINBALL. Here, node numbers in layers vary due to the random choice of edges (based on a fitted distribution, PedgeL) and the weighted local preferential attachment function. A different weight is assigned to each layer. Despite these differences, both models share a common structure captured in Algorithm 1. We refer to [5] and [12] for detailed descriptions.

¹ https://github.com/gephi/gephi/wiki/Datasets.

Algorithm 1: BINBALL and STARGEN models
Input: n, m, l, PedgeL (uniform in case of BINBALL)
Initialize L1, ..., Ll, empty graphs representing layers, and M, multiplex with n isolated nodes
foreach i ∈ 1 ... m do
    select a layer, say Li, with respect to PedgeL
    select node u and node v according to the local and global preferential attachment based on degree distributions, respectively
    add the edge e = (u, v) to Li and M
    update local and global degree distribution of u and v
Output: Layers L1, ..., Ll and the multiplex M

Algorithm 2: Generalized ANGEL model
Input: n, m, l, PedgeL, PnodeL, PlayerN
Initialize L1, ..., Ll, empty graphs representing layers, and M, multiplex with n isolated nodes (VM)
Assign nodes to layers:
    foreach u ∈ VM do
        sample layer repetition count, ru, from the PlayerN
        use PnodeL to select ru different layers to place u in
Create hub-sub-network:
    Assign hubs to layers and create a multigraph on all hubs using configuration model
Assign edge numbers to layers:
    assign number of edges to layers according to PedgeL
Create layers:
    foreach i = 1 to l do
        call a layer creation procedure for Li
        add all edges from Li to M
Output: Layers L1, ..., Ll and the multiplex M
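The growth loop shared by BINBALL and STARGEN (Algorithm 1) can be illustrated with a simplified Python sketch. This is not the authors' implementation: the function name, the "degree + 1" smoothing that lets isolated nodes be drawn, and the rejection of self-loops are assumptions made for illustration, and the layer-specific weighting of STARGEN's local attachment is left out.

```python
import random


def grow_multiplex(n, m, l, layer_weights=None, seed=None):
    """Add m edges one by one to l layers that share n nodes."""
    rng = random.Random(seed)
    nodes = list(range(n))
    layers = [[] for _ in range(l)]            # edge list of every layer
    local_deg = [[0] * n for _ in range(l)]    # degrees inside each layer
    global_deg = [0] * n                       # degrees in the multiplex M
    if layer_weights is None:
        layer_weights = [1.0] * l              # uniform layer choice (BINBALL-like)
    for _ in range(m):
        i = rng.choices(range(l), weights=layer_weights)[0]
        # Local preferential attachment: favour nodes already busy in layer i.
        # The +1 smoothing is an assumption so that isolated nodes can be picked.
        u = rng.choices(nodes, weights=[d + 1 for d in local_deg[i]])[0]
        # Global preferential attachment: favour nodes busy in the multiplex.
        v = u
        while v == u:                          # assumed: no self-loops
            v = rng.choices(nodes, weights=[d + 1 for d in global_deg])[0]
        layers[i].append((u, v))
        local_deg[i][u] += 1
        local_deg[i][v] += 1
        global_deg[u] += 1
        global_deg[v] += 1
    return layers, global_deg


layers, global_deg = grow_multiplex(n=20, m=50, l=3, seed=1)
print([len(edges) for edges in layers])
```

Passing non-uniform `layer_weights` mimics the fitted layer choice PedgeL of STARGEN in spirit. The multiplex M is simply the union of all layer edge lists, so parallel edges between the same node pair are possible.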
Table 1. Statistics on the EATN and USATN.

Multiplex | Node count | Edge count | Density | Transitivity | Degree μ | Degree σ | Short. path μ | Short. path σ | Clust. coeff. μ | Clust. coeff. σ
EATN | 417 | 3588 | 0.04 | 0.30∗ | 17.21 | 27.78 | 2.76 | 0.80 | 0.42∗ | 0.33∗
EATN, 37 layers, Min | 35 | 34 | 0.03 | 0 | 1.94 | 2.54 | 1.94 | 0.18 | 0.00 | 0.00
EATN, 37 layers, Max | 128 | 601 | 0.11 | 0.34 | 9.39 | 11.55 | 3.35 | 1.43 | 0.55 | 0.47
EATN, 37 layers, μ | 54.97 | 96.97 | 0.06 | 0.07 | 3.13 | 6.07 | 2.25 | 0.56 | 0.20 | 0.28
EATN, 37 layers, σ | 22.04 | 103.00 | 0.02 | 0.08 | 1.45 | 1.85 | 0.34 | 0.29 | 0.16 | 0.16
USATN | 436 | 4483 | 0.05 | 0.32∗ | 20.56 | 46.28 | 3.28 | 1.45 | 0.56∗ | 0.41∗
USATN, 14 layers, Min | 18 | 26 | 0.02 | 0.01 | 2.56 | 2.62 | 1.84 | 0.52 | 0.13 | 0.26
USATN, 14 layers, Max | 249 | 732 | 0.22 | 0.39 | 12.74 | 16.51 | 4.05 | 1.67 | 0.70 | 0.49
USATN, 14 layers, μ | 104 | 320.21 | 0.09 | 0.18 | 5.50 | 9.34 | 2.34 | 0.74 | 0.51 | 0.41
USATN, 14 layers, σ | 84.44 | 291.90 | 0.07 | 0.11 | 2.96 | 5.44 | 0.56 | 0.31 | 0.17 | 0.08

∗ Values calculated with discarded multiple edges
The preferential attachment method applied in BINBALL and STARGEN yields homogeneous hub-spoke layer structures. As a deeper analysis of the EATN revealed [13], besides a few strict hub-spoke and point-to-point structures, the majority of layers resemble a mixture of both patterns. In contrast to BINBALL and STARGEN, where all layers grow simultaneously, in ANGEL each layer is created independently. This approach, outlined in Algorithm 2, allows us to mimic the diversity of the layers with respect to the spatial location of the nodes in the network. We refer to [13] for further details.² Table 1 displays the statistics on the reference multilayer airline networks that we subsequently calculate for the synthetic models.³ The EATN data was collected in [7]. The USATN represents the airline network based on the US domestic airline connections. The data was extracted in 2018 from https://openflights.org. Airline networks are characterized by hubs. These nodes affect not only the layers that they span. Viewed globally, they are the core of the multiplex. Based on [13], we identify a node u as a hub in layer L if

$$ s_m^L(u) = \frac{\deg(u) \sum_{v \in N_L(u)} \deg(v)}{|E_L|^2} > 0.3, \qquad (1) $$

where $s_m^L(u)$ is the s-metric of a node u, with $N_L(u)$ being the set of neighbors of u in L and $|E_L|$ the number of edges in L. The s-metric value displays the affinity of a layer to build a hub-spoke structure.
² The parameters PnodeL and PlayerN stand for the probability distribution of the node count per layer and the random selection of the number of layers a node appears in, respectively.
³ We implemented BINBALL, STARGEN, and ANGEL, and carried out their analysis (presented in the next section) in Python 3.6.0 and networkx 2.0.
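The hub criterion (1) can be checked on a toy layer with a few lines of Python. This is a minimal sketch, assuming a layer is stored as a plain adjacency dictionary of a simple undirected graph; the helper names are hypothetical and the paper's own implementation (in networkx) is not reproduced here.

```python
def s_metric(adj, u):
    """s-metric of node u in one layer given as {node: set_of_neighbours}."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2  # |E_L|
    deg_u = len(adj[u])
    neighbour_degrees = sum(len(adj[v]) for v in adj[u])
    return deg_u * neighbour_degrees / n_edges ** 2


def hubs(adj, threshold=0.3):
    """Nodes whose s-metric exceeds the threshold used in (1)."""
    return [u for u in adj if s_metric(adj, u) > threshold]


# Toy layer: a star with centre 0 and four spokes.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
print(s_metric(star, 0), s_metric(star, 1), hubs(star))
```

In the star, the centre reaches 4 · 4 / 4² = 1.0 and is flagged as a hub, while each spoke gets 1 · 4 / 4² = 0.25 and stays below the 0.3 threshold.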
Table 2. ANGEL's reference dependent parameters.

Network to mimic | PlayerN | PnodeL | PedgeL
EATN | pdf_e(x, 1, 3.88) | pdf_e(x, 34.99, 19.75) | pdf_e(x, 33.99, 62.86)
USATN | pdf_e(x, 1, 2.19) | pdf_e(x, 18, 86) | pdf_e(x, 26, 294)

3 Topological Analysis of the Multiplex Models
In this section, we analyze the performance of the synthetic models with respect to each of the two real networks separately. In both cases, all models are initialized with the same input values for the number of nodes, edges, and layers that come from the respective real network, as displayed in Table 1. The remaining parameters of BINBALL and STARGEN are specified in [5] and [12], respectively. In Table 2, we report the reference dependent parameters required for ANGEL, i.e., fitted probability distribution functions, which are of the form $\mathrm{pdf}_e(x, l, s) = \frac{1}{s} e^{-\frac{x-l}{s}}$. For the statistics presented in this section, 100 replicas of each synthetic multiplex were generated. With the exception of boxplots, the average curves over the 100 samples are plotted as follows: per synthetic multiplex, the values for each node are collected and sorted; next, position by position in the sorted order, the average of all 100 values is taken. Throughout this section, we use the same color code as displayed in Fig. 1.

Table 3. Min-max values on 100 replicas per model and real network.

Multiplex | Min/Max | Max degree | Node centr. Min | Node centr. Max | Asp (node) Min | Asp (node) Max | Density Mul | Density Sim | Trans. | Asp | Betw. centr.
EATN | - | 156 | 0.20 | 0.55 | 1.82 | 4.86 | 0.04 | 0.03 | 0.30 | 2.75 | 0.0042
ANGEL | Min | 103 | 0.22 | 0.53 | 1.56 | 3.11 | 0.04 | 0.03 | 0.17 | 2.42 | 0.0034
ANGEL | Max | 282 | 0.32 | 0.64 | 1.87 | 4.48 | 0.04 | 0.04 | 0.22 | 2.59 | 0.0039
BINBALL | Min | 79 | 0.20 | 0.51 | 1.64 | 3.56 | 0.04 | 0.04 | 0.11 | 2.48 | 0.0036
BINBALL | Max | 295 | 0.28 | 0.61 | 1.94 | 4.89 | 0.04 | 0.04 | 0.14 | 2.59 | 0.0039
STARGEN | Min | 103 | 0.21 | 0.52 | 1.59 | 3.53 | 0.04 | 0.03 | 0.19 | 2.50 | 0.0037
STARGEN | Max | 295 | 0.28 | 0.63 | 1.9 | 4.81 | 0.04 | 0.04 | 0.23 | 2.6 | 0.0041
USATN | - | 352 | 0.13 | 0.51 | 1.96 | 7.25 | 0.05 | 0.03 | 0.32 | 3.27 | 0.0052
ANGEL | Min | 160 | 0.20 | 0.59 | 1.30 | 2.82 | 0.02 | 0.02 | 0.12 | 2.22 | 0.0028
ANGEL | Max | 395 | 0.35 | 0.77 | 1.70 | 5.03 | 0.05 | 0.04 | 0.26 | 2.67 | 0.0039
BINBALL | Min | 106 | 0.00 | 0.54 | 0.50 | 3.40 | 0.05 | 0.04 | 0.11 | 2.41 | 0.0033
BINBALL | Max | 395 | 0.29 | 0.63 | 1.86 | 4.57 | 0.05 | 0.04 | 0.15 | 2.49 | 0.0035
STARGEN | Min | 160 | 0.21 | 0.55 | 1.51 | 3.38 | 0.05 | 0.04 | 0.18 | 2.34 | 0.0031
STARGEN | Max | 395 | 0.30 | 0.66 | 1.80 | 4.86 | 0.05 | 0.04 | 0.21 | 2.53 | 0.0035
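The fitted densities of Table 2 and the curve-averaging procedure described at the start of this section can be sketched in Python as follows. The sketch assumes that pdf_e is the exponential density shifted to location l with scale s (zero below l) and that values are sorted in ascending order; all function names are illustrative and not taken from the authors' code.

```python
import math
import random


def pdf_e(x, l, s):
    """Shifted exponential density: location l, scale s (assumed zero for x < l)."""
    if x < l:
        return 0.0
    return math.exp(-(x - l) / s) / s


def sample_e(l, s, rng):
    """Draw one value from pdf_e(., l, s)."""
    return l + rng.expovariate(1.0 / s)


def average_sorted_curves(replicas):
    """Sort each replica's node values, then average position by position."""
    sorted_runs = [sorted(values) for values in replicas]
    return [sum(column) / len(column) for column in zip(*sorted_runs)]


rng = random.Random(0)
# Layer repetition counts drawn with the EATN parameters of PlayerN in Table 2.
print([round(sample_e(1, 3.88, rng), 2) for _ in range(3)])
# Two toy replicas with three node values each.
print(average_sorted_curves([[3, 1, 2], [2, 2, 6]]))
```

For the two toy replicas the sorted lists are [1, 2, 3] and [2, 2, 6], so the averaged curve is [1.5, 2.0, 4.5]. In the study the same averaging runs over 100 replicas.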
A link within a multiplex has a double meaning: it contributes globally to the multiplex, a huge multigraph, and locally to a layer, usually, a much smaller simple graph. In our topological analysis, we approach a multilayer network from the outside, considering it as a multigraph, and from the inside, viewing it as an integration of multiple simple graphs.
Fig. 1. Network statistics of the synthetic multiplexes versus the EATN (top) and the USATN (bottom).
3.1 Validation of the Multiplex
In Figs. 1, 2 and 3 we consider the multiplexes from a macroscopic point of view, i.e., as multigraphs: one-layered networks where multiple connections between two nodes are allowed. Fig. 1 shows, from the left, the average shortest path and the average closeness centrality of each node in the multiplex, then boxplots of the density and the transitivity of the synthetic multiplex, and, in the last column, boxplots of the average shortest path together with the betweenness centrality. (Labels den_m and den_s on the x-axis in the plots in the third column stand for the density calculated with and without multiple edges, respectively; the third x-value, the transitivity, can only be calculated on the simplified graph.) As one can observe, all synthetic models deliver a very good approximation of the references according to all criteria considered here. The differences remain within a narrow range, as can be seen in Table 3, where the minimum and maximum values of the measures are reported. In Fig. 2, we give an insight into structural similarity aspects. Here, we plot the cosine similarity per node pair (i.e., the number of common neighbors of the two nodes divided by the geometric mean of the two nodes' degrees) of the real multiplex (lower-left, left plot) versus a randomly selected multiplex replica of ANGEL (upper-right, left plot), STARGEN (lower-left, right plot), and BINBALL (upper-right, right plot). Nodes on both axes are sorted by degree. In all cases except for BINBALL, the fraction of common neighbors tends to increase with the growing degree of the nodes (the darkest area in the lower-right corner, shadowed by the hubs). Compared to both references, which show a partially discrete texture, all synthetic networks' similarities transition more smoothly; however, as also detailed in Table 4, ANGEL approximates the EATN as well as STARGEN does, while it outperforms both competitors in replicating the USATN. Finally, we consider the degree distribution of the multiplexes. As shown in Figs. 3 (a) and (d), the curve of the EATN is better approximated, especially by ANGEL and STARGEN (averages over 100 replicas), than that of the USATN. On the other hand, both ANGEL and STARGEN follow the power-law fitting of the USATN's degree distribution.

Fig. 2. Node pairwise cosine similarity of the synthetic multiplexes versus the EATN (top) and the USATN (bottom).

Table 4. Average and standard deviation of the values over each half-matrix in Fig. 2.

Statistic | EATN | ANGEL (EATN) | STARGEN (EATN) | BINBALL (EATN) | USATN | ANGEL (USATN) | STARGEN (USATN) | BINBALL (USATN)
Mean | 0.082 | 0.077 | 0.086 | 0.059 | 0.117 | 0.120 | 0.094 | 0.067
Std | 0.105 | 0.118 | 0.079 | 0.198 | 0.125 | 0.111 | 0.077 | 0.142

3.2 Validation of the Layers
In our second stage of evaluation, we compare the microstructure of the multiplexes, focusing on the topology of the layers. An important feature of layers that occur in ATNs is their tendency to build hub-spoke structures. The strength of this tendency can be measured with the s-metric. The boxplots presented in Figs. 3 (b) and (e) consolidate the s-metric values of all layers in all 100 multiplex replicas versus the layers in the reference. Despite a significant number of outliers when compared to the USATN, the s-metric values of the ANGEL layers remain the closest to both references. The low values of BINBALL indicate that its replicas form hub-spoke structures only poorly. Additionally, in Figs. 3 (c) and (f), we compare the total count (per multiplex) of nodes marked as hubs according to (1). This statistic confirms that BINBALL's layers do not evolve hub-spoke structures, since very few hubs, or none at all when the USATN is reproduced, are counted. In Fig. 4, we depict representative layers, selected randomly from one multiplex of each model and from the respective real network. BINBALL layers resemble a hub-spoke shape the least, whereas STARGEN layers exhibit it much more clearly. Recall that both models focus only on that form of airline networks. By contrast, the ANGEL method is designed to mimic both the hub-spoke (Fig. 4 top-left) and point-to-point (Fig. 4 bottom-left) layer structure. Figures 5 and 6 display further analysis of the synthetic layers. As shown in the top-left plot in Figs. 5 and 6, STARGEN and ANGEL adapt to the fitted distributions of the edge set sizes in layers given in the input. BINBALL's layers are characterized by uniform values for both edge and node numbers. The histograms on the right side of Figs. 5 and 6 show the distributions of layer repetition count per node, i.e., how many layers a node belongs to, broken down by non-hub nodes (top) and hubs (bottom). Recall that the fitting to the reference curves for non-hubs is used in the input for ANGEL (cf. PlayerN in Algorithm 2).

Fig. 3. Further statistics on the synthetic multiplexes versus the EATN (top) and the USATN (bottom): (a, d) multiplex degree distribution, (b, e) s-metric of the layers, (c, f) hub count within the multiplex. (Color code is the same as displayed in Fig. 1)
Fig. 4. Examples of layers’ layouts from the three models mimicking the EATN (top) and the USATN (bottom). (Color code is the same as displayed in Fig. 1)
Fig. 5. Synthetic layers against the EATN: distributions of edge and node set sizes (top-left and bottom-left, resp.), and distributions of layer repetition count per node (right plots). (Color code is the same as displayed in Fig. 1)
The BINBALL curves are less meaningful due to the insufficient number of hubs, which results in an imbalance between hubs and non-hubs. ANGEL performs moderately, having trouble keeping pace with the node numbers (bottom-left plots), but it again steadily approximates the references in all four plots, while STARGEN's performance declines when compared to the USATN.
Fig. 6. Synthetic layers against the USATN: Subfigure descriptions correspond to those in Fig. 5. (Color code is the same as displayed in Fig. 1)
4 Conclusions
Summary. We presented the first study analyzing the existing generative models for the creation of synthetic networks mimicking air transportation network systems modeled as multilayer networks. Here, we compared the ANGEL model with two previously formulated random multiplex models, BINBALL and STARGEN. Using the European Air Transportation Network (EATN) and the domestic U.S. Air Transportation Network (USATN) as references, our analysis has revealed that high accordance of the network statistics at the multiplex level does not necessarily imply consistency at the layer level. ANGEL is superior to BINBALL in general. The comparison with the STARGEN model in particular shows that considering the multiplex as a one-layer network is superficial. The results on the validation with the USATN underpin the robustness of the ANGEL model. Furthermore, ANGEL allows taking the spatial location of the nodes into consideration, which is ignored in the BINBALL and STARGEN algorithms and is a challenge to incorporate in a preferential attachment procedure.

Ongoing Work. We are currently investigating the spectral and eigenfunction properties of the networks generated by the BINBALL, STARGEN and ANGEL models: from a comparison with the properties of our real-world reference ATNs, i.e., EATN and USATN, we aim to deepen our understanding of the superiority of ANGEL in replicating the reference ATN layers.

Future Directions. Our study paves the way for further development of sophisticated generative models for air transportation network systems. One interesting research direction would be to enhance the modeling of ATNs by incorporating "feature-rich" information that may be associated with nodes and/or edges in each layer. In particular, this includes extending the existing models to account for time-aware variables (e.g., flight departure and arrival times), which would lead to a generalization of the multilayer network model to represent the evolution over time of an ensemble of airline layers. Moreover, integrating numerical/categorical attributes at node level, as well as node embeddings obtained via shallow or deep learning models, into a multilayer synthetic model for ATNs would be invaluable to push forward our understanding of the complex patterns that can be mined and learned from ATNs.
References
1. Amaral, L.A.N., Scala, A., Barthelemy, M., Stanley, H.E.: Classes of small-world networks. PNAS 97(21), 11149–11152 (2000)
2. Bagler, G.: Analysis of the airport network of India as a complex weighted network. Phys. A 387(12), 2972–2980 (2008)
3. Barrat, A., Barthelemy, M., Vespignani, A.: The architecture of complex weighted networks: measurements and models. In: Large Scale Structure and Dynamics of Complex Networks, pp. 67–92. World Scientific (2007)
4. Barthélemy, M.: Spatial networks. Phys. Rep. 499, 1–101 (2011)
5. Basu, P., Sundaram, R., Dippel, M.: Multiplex networks: a generative model and algorithmic complexity. In: Proceedings of IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 456–463 (2015)
6. Boccaletti, S., Bianconi, G., Criado, R., Del Genio, C.I., Gómez-Gardenes, J., Romance, M., Sendina-Nadal, I., Wang, Z., Zanin, M.: The structure and dynamics of multilayer networks. Phys. Rep. 544(1), 1–122 (2014)
7. Cardillo, A., Gómez-Gardeñes, J., Zanin, M., Romance, M., Papo, D., del Pozo, F., Boccaletti, S.: Emergence of network features from multiplexity. Sci. Rep. 3, 1344 (2013)
8. Cardillo, A., Zanin, M., Gómez-Gardenes, J., Romance, M., del Amo, A.J.G., Boccaletti, S.: Modeling the multi-layer nature of the European air transport network: resilience and passengers re-scheduling under random failures. Eur. Phys. J. ST 215(1), 23–33 (2013)
9. Colizza, V., Barrat, A., Barthélemy, M., Vespignani, A.: The role of the airline transportation network in the prediction and predictability of global epidemics. PNAS 103(7), 2015–2020 (2006)
10. De Domenico, M., Solé-Ribalta, A., Cozzo, E., Kivelä, M., Moreno, Y., Porter, M.A., Gómez, S., Arenas, A.: Mathematical formulation of multilayer networks. Phys. Rev. X 3(4), 041022 (2013)
11. Du, W.B., Zhou, X.L., Lordan, O., Wang, Z., Zhao, C., Zhu, Y.B.: Analysis of the Chinese airline network as multi-layer networks. Transp. Res. Part E 89, 108–116 (2016)
12. Fügenschuh, M., Gera, R., Lory, T.: A synthetic model for multilevel air transportation network. In: Proceedings of the Conference on OR, pp. 347–353 (2017)
13. Fügenschuh, M., Gera, R., Tagarelli, A.: ANGEL: a synthetic model for airline network generation emphasizing layers. IEEE Trans. Netw. Sci. Eng. 7, 1977–1987 (2020). https://doi.org/10.1109/TNSE.2020.2965207
14. Guimera, R., Mossa, S., Turtschi, A., Amaral, L.N.: The worldwide air transportation network: anomalous centrality, community structure, and cities' global roles. PNAS 102(22), 7794–7799 (2005)
15. Guimera, R., Amaral, L.A.N.: Modeling the world-wide airport network. Eur. Phys. J. B 38(2), 381–385 (2004)
16. Kim, J.Y., Goh, K.I.: Coevolution and correlated multiplexity in multiplex networks. Phys. Rev. Lett. 111(5), 058702 (2013)
17. Kivelä, M., Arenas, A., Barthelemy, M., Gleeson, J.P., Moreno, Y., Porter, M.A.: Multilayer networks. J. Complex Netw. 2(3), 203–271 (2014)
18. Li, W., Cai, X.: Statistical analysis of airport network of China. Phys. Rev. E 69(4), 046106 (2004)
19. Nicosia, V., Bianconi, G., Latora, V., Barthelemy, M.: Growing multiplex networks. Phys. Rev. Lett. 111, 5 (2013)
20. da Rocha, L.E.: Structural evolution of the Brazilian airport network. J. Stat. Mech. 2009(4), P04020 (2009)
21. Wuellner, D.R., Roy, S., D'Souza, R.M.: Resilience and rewiring of the passenger airline networks in the United States. Phys. Rev. E 82(5), 056101 (2010)
22. Zhou, Q., Yang, W., Zhu, J.: Mapping a multilayer air transport network with the integration of airway, route, and flight network. J. Appl. Math. 2019, 1–10 (2019). Article ID 8282954
Quick Sub-optimal Augmentation of Large Scale Multi-modal Transport Networks
Elise Henry (✉), Mathieu Petit, Angelo Furno, and Nour-Eddin El Faouzi
Univ. Gustave Eiffel, Univ. Lyon, ENTPE, LICIT, 69675 Lyon, France
{elise.henry,mathieu.petit,angelo.furno,nour-Eddin.faouzi}@univ-eiffel.fr
Abstract. With the recent and continuous growth of large metropolises, the development, management and improvement of their urban multi-modal transport networks have become a compelling need. Although the creation of a new transport mode often appears as a solution, it is usually impossible to construct a full public transport network at once. Therefore, there is a need for efficient solutions that prioritize the order of construction of the multiple lines or transport modes. The proposed work develops a simple and quick-to-compute methodology that prioritizes the order of construction of the lines of a newly designed transport mode by maximizing the network performance gain, as described by complex network metrics. In a resilience context, the proposed methodology could also support a rapid response to disruptions by setting up or reinforcing an adapted emergency transport line (e.g., a bus service) over a set of predefined itineraries.
Keywords: Multi-modal transport modelling · Multi-layer networks · Transport network design

1 Introduction
According to the United Nations, 2.5 billion more people will live in cities in 2050.¹ To ensure the urban mobility of such large volumes of people, it is crucial to adapt and augment the current transport offer by improving network performance through the addition of new transport modes or the development of new transport lines. Currently, some transport networks already operate at their capacity limits. Given the urgency of the situation and the need to stagger the construction of a public transport network over time because of budget constraints and roadworks occupancy, it is essential to schedule the construction optimally so that network performance improves quickly. Such analyses lie at the interface of network design and graph augmentation. The first field focuses on the problems of planning and implementing a (transport) network [9]. The second aims at identifying the smallest set of edges/nodes that, once added to the input graph, improves one or more graph properties [16]. In our study, the new transport lines are considered as a given input and our goal is to optimize the construction in terms of short-term efficiency, by proposing a methodology to quantify the positive impact of adding these already designed transport lines. Additionally, with a view to emergency planning (e.g., when hardly predictable disruptions occur), our methodology satisfies the requirement of rapid computation, which allows a new line or transport mode to be defined quickly for deployment over the perturbed network.

The contribution is twofold. Firstly, we propose a multi-layer modelling approach that simplifies a large-scale multi-modal transport network by summarizing its main topological characteristics via a weighting process based on the computation of multi-layer (weighted) shortest paths. Secondly, we propose a lightweight methodology to rank different transport line construction scenarios, based on both global and local complex network metrics that are computed atop the proposed multi-layer modelling solution.

¹ https://twitter.com/ONUinfo/status/996852098536034304.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 218–230, 2021. https://doi.org/10.1007/978-3-030-65351-4_18
2 Network Performances Quantification
To describe the positive impact of constructing a new transport line, it is essential to quantify the performance of the multi-modal transport network at both global and local scales. Whereas the global indicators provide insights about the overall impact of adding a transport line, the local analysis describes the geographical distribution of the improvements. Two approaches to network improvement become possible: a higher improvement in network efficiency localized in a tight area could, for instance, be preferred to a moderate enhancement of the global performance uniformly distributed over the network. However, in this paper, the second strategy is preferred from a resilience point of view, as it increases network properties more homogeneously over the whole network. In other words, although the public transport network is simplified, each node of the graph represents most of the time a unique station (e.g., a public transport station, an inter-modal interchange, etc.). Thus, a large localized improvement of network performance, induced by the construction of a transport line, is seen as a vulnerable choice: the removal or degradation of this specific part of the network would strongly degrade network performance.

The multi-modal transport network is modelled as a weighted directed multi-layer graph [23], $G = (V, E, L)$, with $L = \{L_m\}_{m=1}^{M}$ the set of elementary layers, each representing a specific transport mode $m$, $V$ the set of nodes and $E$ the set of edges. Each layer contains a node subset $V_l \subset V$ and an edge subset $E_l \subset E$, which correspond, respectively, to the set of intersections and the intra-layer links composing the transport mode $m$ represented on layer $l \in L$. The edge set $E$ also contains inter-layer edges that allow crossing between layers, i.e., between the transport modes composing the graph. Some network metrics have been adapted to our multi-layer modelling. Specifically, we consider the degree of node $u$, i.e., $k_u$, to be the number of edges directed to or exiting from $u$, independently of their layer. We likewise consider that shortest paths can contain any edge $e \in E$ regardless of its layer.

2.1 Global Metrics
2.1.1 Degree Centrality Distribution

The degree centrality distribution is an important indicator for network characterization [1,13,18]. For the sake of simplicity, and although this is a strong assumption, we reduce this distribution to its average (network density $\langle k \rangle$):

$$\langle k \rangle = \frac{1}{|V|} \sum_{u \in V} k_u \tag{1}$$

where $|V|$ is the cardinality of $V$ and $k_u$ is the degree centrality of node $u$. $\langle k \rangle$ characterizes the nodes’ connectivity in the whole network, which is an important aspect from the perspective of redundancy and capability to absorb perturbations. The denser the graph, the more connections exist between nodes. For such an indicator, it could be interesting to weight the graph by the level-of-service of each link under real traffic conditions. In fact, an increase of travel time due to a disturbance reduces the capacity of the road and deteriorates the connection from or to the linked intersection [20]. However, in this paper we assume the network to be under free-flow conditions (cf. Sect. 3). Thus, no disruption occurs and the measure only characterizes the network topology, as the level-of-service is always equal to 1.

2.1.2 Network Efficiency

We also compute the Average Efficiency (AE) (Eq. 2) [24], frequently employed in the related literature [3,7,12,14], to take into account the change in travel time induced by the addition of new transport lines. It quantifies how efficiently information (or any other kind of flow) is exchanged over the network. To directly compare the improvement in connectivity implied by the addition of new edges, Filippidou et al. [15] considered the percentage change in the average shortest path length between the initial graph and the graph augmented by the selected edges. The resulting metric is the Gain (G), which is closely related to the AE. The higher G and AE are, the more the augmentation improves network performance by reducing the length of the shortest paths.

$$AE = \frac{1}{n(n-1)} \sum_{\substack{u,v \in V \times V \\ u \neq v}} \frac{1}{d_{uv}} \tag{2}$$

$$G = \frac{L - L'}{L} \times 100 \tag{3}$$

where $n$ is the number of nodes composing the graph, $d_{uv}$ is the length of the shortest path between two nodes $u$ and $v$, $L$ is the average shortest path length of the original graph and $L'$ the average shortest path length of the graph augmented with the new transport mode lines.
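The three global indicators of Eqs. (1) to (3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-node road network, the travel times and the added `tram` line are hypothetical, the new line is collapsed onto existing nodes so that both graphs share the same node set, and the average shortest path length is taken over connected ordered pairs only.

```python
import heapq

def build_adj(edges):
    """Build a directed adjacency map from (u, v, travel_time) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, [])
    return adj

def dijkstra(adj, source):
    """Shortest travel time from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def all_pairs(adj):
    """d_uv for every ordered pair of distinct, connected nodes."""
    return {(u, v): d for u in adj
            for v, d in dijkstra(adj, u).items() if v != u}

def average_degree(adj):
    """Eq. (1): each directed edge adds one out-degree and one in-degree."""
    return 2 * sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def average_efficiency(adj):
    """Eq. (2): unreachable pairs contribute 0."""
    n = len(adj)
    return sum(1 / d for d in all_pairs(adj).values()) / (n * (n - 1))

def average_path_length(adj):
    dists = all_pairs(adj)
    return sum(dists.values()) / len(dists)

def gain(adj_before, adj_after):
    """Eq. (3): percentage reduction of the average shortest path length."""
    l_before = average_path_length(adj_before)
    l_after = average_path_length(adj_after)
    return (l_before - l_after) / l_before * 100

def both_ways(pairs):
    return list(pairs) + [(v, u, w) for u, v, w in pairs]

# Hypothetical data: a road line A-B-C and a candidate tram line A-C.
road = both_ways([("A", "B", 10), ("B", "C", 10)])
tram = both_ways([("A", "C", 5)])
base = build_adj(road)
augmented = build_adj(road + tram)

print(average_degree(base), average_degree(augmented))
print(average_efficiency(base), average_efficiency(augmented))
print(gain(base, augmented))
```

Ranking several candidate lines then amounts to calling `gain(base, build_adj(road + candidate))` for each candidate and sorting the results.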
2.2 Local Metrics
To describe network performance at the local scale, the edge betweenness centrality (EBC) (Eq. 4) [17] is among the preferred metrics in the transport field. This metric characterizes the importance of an edge by considering the fraction of shortest paths that traverse the edge for each pair of nodes, and is widely used to identify bottlenecks of the traffic network. Although an edge with a high EBC is vulnerable in terms of resilience, a high EBC also means that the edge is very attractive and important to sustain flow in the graph, being crossed by a high number of shortest paths. Another widely used metric, which assesses the network performance at the local scale, is the closeness centrality (NCC) [17] (Eq. 5). The latter quantifies how far a given node $u$ is from all the other nodes by summing the reciprocals of the lengths of the shortest paths to $u$ from all other nodes. As for the AE, the higher the NCC, the shorter the shortest paths, meaning that nodes are better connected. Both local metrics are used in the analysis of road networks and public transport networks [5,10,12].

$$EBC_e = \sum_{\substack{u,v \in V \times V \\ u \neq v}} \frac{\sigma_{uv}(e)}{\sigma_{uv}} \tag{4}$$

$$NCC_u = \sum_{\substack{v \in V \\ u \neq v}} \frac{1}{d_{vu}} \tag{5}$$
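As an illustration of Eqs. (4) and (5), the following Python sketch computes both local metrics by brute force on a small directed graph. It is only a sketch under stated assumptions: the three-node graph is hypothetical, $\sigma_{uv}$ is read as the number of shortest paths from $u$ to $v$ and $\sigma_{uv}(e)$ as the number of those that traverse edge $e$, there are no parallel edges, and edge weights are integers so that path lengths can be compared exactly.

```python
import heapq

def shortest_path_counts(adj, source):
    """Dijkstra from source: distances and numbers of shortest paths (sigma)."""
    dist = {source: 0}
    sigma = {source: 1}
    heap = [(0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if v not in dist or nd < dist[v]:
                dist[v] = nd          # strictly shorter path found
                sigma[v] = sigma[u]
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                sigma[v] += sigma[u]  # another path of equal length
    return dist, sigma

def edge_betweenness(adj):
    """Eq. (4): sum over ordered pairs of the share of shortest paths using e."""
    sp = {s: shortest_path_counts(adj, s) for s in adj}
    ebc = {}
    for a in adj:
        for b, w in adj[a]:
            dist_b, sigma_b = sp[b]
            total = 0.0
            for u in adj:
                dist_u, sigma_u = sp[u]
                if a not in dist_u:
                    continue
                for v in adj:
                    if v == u or v not in dist_u or v not in dist_b:
                        continue
                    # Edge (a, b) lies on a shortest u-v path when lengths add up.
                    if dist_u[a] + w + dist_b[v] == dist_u[v]:
                        total += sigma_u[a] * sigma_b[v] / sigma_u[v]
            ebc[(a, b)] = total
    return ebc

def closeness(adj):
    """Eq. (5): sum of reciprocal shortest-path lengths towards each node u."""
    dist = {s: shortest_path_counts(adj, s)[0] for s in adj}
    return {u: sum(1 / dist[v][u] for v in adj if v != u and u in dist[v])
            for u in adj}

# Hypothetical line a - b - c, traversable in both directions.
line = {"a": [("b", 1)], "b": [("a", 1), ("c", 1)], "c": [("b", 1)]}
ebc = edge_betweenness(line)
ncc = closeness(line)
print(ebc)
print(ncc)
```

The all-pairs loop is cubic in the number of nodes, so it only suits small examples; a production computation would rather rely on an accumulation scheme such as Brandes' algorithm.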
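[Table 1 (fragment): combination functions and their parameters. Logistic sum: steepness σ > 0, excitability threshold τ. Euclidean: $\sqrt[n]{\dfrac{V_1^n + \cdots + V_k^n}{\lambda}}$, with order n > 0 and scaling factor λ > 0.]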
Self-models Representing Network Characteristics by Network States

As indicated above, ‘network characteristics’ and ‘network states’ are two distinct concepts for a network. Self-modeling is a way to relate these distinct concepts to each other in an interesting and useful way:

• A self-model is making the implicit network characteristics (such as connection weights and excitability thresholds) explicit by adding states for these characteristics; thus the network gets an internal self-model of part of the network structure itself.
Self-modeling Networks Using Adaptive Internal Mental Models
• In this way, different self-modeling levels can be created where network characteristics from one level relate to explicit states at the next level. By iteration, an arbitrary number of self-modeling levels can be modeled, covering second-order or higher-order effects.

Self-modeling causal networks can be recognized both in physical and mental domains. For example:

• In the physical domain, in the brain, information about the characteristics of the network of causal relations between activation states of neurons is, for example, represented in physical configurations for synapses (e.g., connection weights), neurons (e.g., excitability thresholds) and/or chemical substances (e.g., neurotransmitters).
• In the mental domain, a person can create mental states in the form of representations of his or her own (personal) characteristics, thus forming a subjective self-model (acquired by experiences); e.g., of being very sensitive to pain or to critical feedback, or of having an anger issue.

Adding a self-model for a temporal-causal network is done as follows: for some of the states $Y$ of the base network and some of the network structure characteristics for connectivity, aggregation and timing (in particular, some from $\omega_{X,Y}$, $\gamma_{i,Y}$, $\pi_{i,j,Y}$, $\eta_Y$), additional network states $W_{X,Y}$, $C_{i,Y}$, $P_{i,j,Y}$, $H_Y$ (self-model states) are introduced (see the blue upper plane in Fig. 2):

(a) Connectivity self-model
• Self-model states $W_{X_i,Y}$ are added representing connectivity characteristics, in particular connection weights $\omega_{X_i,Y}$
(b) Aggregation self-model
• Self-model states $C_{i,Y}$ are added representing aggregation characteristics, in particular combination function weights $\gamma_{i,Y}$
• Self-model states $P_{i,j,Y}$ are added representing aggregation characteristics, in particular combination function parameters $\pi_{i,j,Y}$
(c) Timing self-model
• Self-model states $H_Y$ are added representing timing characteristics, in particular speed factors $\eta_Y$

The notations $W_{X,Y}$, $C_{i,Y}$, $P_{i,j,Y}$, $H_Y$ for the self-model states indicate the referencing relation with respect to the characteristics $\omega_{X,Y}$, $\gamma_{i,Y}$, $\pi_{i,j,Y}$, $\eta_Y$: here $W$ refers to $\omega$, $C$ refers to $\gamma$, $P$ refers to $\pi$, and $H$ refers to $\eta$, respectively. For the processing, these self-model states define the dynamics of state $Y$ in a canonical manner according to Eq. (1), whereby $\omega_{X,Y}$, $\gamma_{i,Y}$, $\pi_{i,j,Y}$, $\eta_Y$ are replaced by the state values of $W_{X,Y}$, $C_{i,Y}$, $P_{i,j,Y}$, $H_Y$ at time $t$, respectively.
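The replacement of characteristics by self-model state values can be sketched in Python as follows. This is an illustration, not the author's software: Eq. (1) is not reproduced in this section, so the sketch assumes the usual temporal-causal update $Y(t+\Delta t) = Y(t) + \eta_Y\,[c_Y(\omega_{X_1,Y} X_1(t), \ldots, \omega_{X_k,Y} X_k(t)) - Y(t)]\,\Delta t$, and the names `resolve`, `euler_step`, `W_X_Y` and `H_Y` are hypothetical.

```python
def resolve(characteristic, state):
    """Return a constant as is, or look up the self-model state representing it."""
    if isinstance(characteristic, str):
        return state[characteristic]
    return characteristic

def euler_step(state, name, spec, dt):
    """One Euler step for state `name` (assumed form of the update rule)."""
    # Connection weights may be constants or names of W-states.
    impacts = [resolve(w, state) * state[x] for x, w in spec["incoming"]]
    aggregated = spec["combine"](impacts)
    # The speed factor may be a constant or the name of an H-state.
    eta = resolve(spec["speed"], state)
    return state[name] + eta * (aggregated - state[name]) * dt

state = {"X": 1.0, "Y": 0.0, "W_X_Y": 0.5, "H_Y": 0.2}

# Non-adaptive variant: weight and speed factor are fixed characteristics.
static_spec = {"incoming": [("X", 0.5)], "combine": sum, "speed": 0.2}
# Self-modeling variant: the same characteristics are read from network states.
self_model_spec = {"incoming": [("X", "W_X_Y")], "combine": sum, "speed": "H_Y"}

print(euler_step(state, "Y", static_spec, dt=0.5))
print(euler_step(state, "Y", self_model_spec, dt=0.5))
```

Because `W_X_Y` and `H_Y` are ordinary entries of `state`, they can be given their own update specifications, which is what makes the network adaptive; repeating the construction on those specifications gives the higher-order levels.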
J. Treur
An example of an aggregation self-model state $P_{i,j,Y}$ for a combination function parameter $\pi_{i,j,Y}$ is the one for the excitability threshold $\tau_Y$ of state $Y$, which is the second parameter of the logistic sum combination function; then $P_{i,j,Y}$ is usually indicated by $T_Y$, where $T$ refers to $\tau$. Such aggregation self-model states $T_Y$ will play an important role in the network model addressed below, as will connectivity self-model states $W_{X,Y}$, referring to connection weights $\omega_{X,Y}$. As the outcome of the addition of a self-model is itself a temporal-causal network model, as has been proven in [21], Ch. 10, this construction can easily be applied iteratively to obtain multiple levels of self-models.
3 Domain Description: Cognitive Analysis and Support Processes

In many cases, when humans perform complex or demanding tasks, it makes sense to keep an eye on them, to see how they are doing and to assess to what extent their functioning is deteriorating. If it is, some support actions may be needed or desirable. Determining such assessments and support actions requires complex and adaptive cognitive processes. For example, for a car driver, based on sensor or observation data, it may involve judgements about the driver’s alcohol usage, gaze and steering behaviour, and whether (s)he takes enough rest on long trips. If the gaze is unfocused or the steering behaviour unstable, this may be assessed as a driving risk, and if that occurs, a support action like slowing down the car may be adequate. The knowledge behind such assessments may be adaptive, so that the underlying cognitive processes can improve over time. Within such complex and adaptive cognitive processes, internal mental models are usually used. For example, in [3–6] internal mental models were used for the analysis process and for the support process (see also Fig. 1):

• analysis model: This is used to assess the human’s states and processes using observations (possibly using specific sensors) and domain knowledge. Examples of observations that are used in the car driver example are a long period of driving, a gaze that is not well-focused, unstable steering, and alcohol usage. Examples of assessments that come out of this process are that there is a risk of getting exhausted or that there are other risks for driving.
• support model: This is used to generate support for the human based on domain knowledge. This uses as input the assessments made by the analysis model. Examples of actions that come out of this process are advice to take some rest, blocking the starting of the car (when it is not driving), and slowing down the car (when it is driving).

As such processes are in principle adaptive, a third internal mental model is needed [18], Ch. 16:

• adaptation model: This is used to make the analysis and support models better fit the specific characteristics of the driver, the car and the further situation. This can be done by adapting certain characteristics of the internal mental models for analysis and support.
[Figure 1: block diagram with three components: adaptation model, analysis model, support model.]
Fig. 1. Adaptive model-based architecture to analyse and support humans; adapted from [18], Ch 16, p. 469
Section 4 addresses the question of how these internal mental models can be modeled by a self-modeling network, resulting in a second-order self-modeling network model.
4 The Second-Order Adaptive Self-modeling Network Model

In this section it is shown how the modeling approach briefly described in Sect. 2 has been used to model, within one self-modeling network, the adaptive mental models for analysis and support sketched in Sect. 3. A useful network architecture to handle internal mental models in general is a self-modeling network that covers at least two levels (see also [2]): a base level representing the mental model as a network so that it can be used to process it (based on within-network dynamics), and a first-order self-model explicitly representing the (network) characteristics of the mental model, which can be used for formation and adaptation of the mental model. In addition, a third level with a second-order self-model can be used to control these processes. This general setup has been applied here.

First, two useful adaptation principles for plasticity and metaplasticity from the Cognitive Neuroscience literature are discussed. When self-models change over time, this offers a useful method to model adaptive networks. This applies not only to first-order adaptive networks, but also to higher-order adaptive networks, using higher-order self-models. For example, two types of (connectivity and aggregation) self-model states can be used to model adaptive connection weights and intrinsic neuronal excitability as described in [7]:

‘Learning-related cellular changes can be divided into two general groups: modifications that occur at synapses and modifications in the intrinsic properties of the neurons. While it is commonly agreed that changes in strength of connections between neurons in the relevant networks underlie memory storage, ample evidence suggests that modifications in intrinsic neuronal properties may also account for learning related behavioral changes’. [7], p. 30.
More in particular, the following quote indicates that synaptic activity relates to long-lasting modifications in excitability of neurons:

‘Long-lasting modifications in intrinsic excitability are manifested in changes in the neuron’s response to a given extrinsic current (generated by synaptic activity or applied via the recording electrode)’. [7], p. 30 (3)
The above refers to a form of plasticity, which can be described by a first-order adaptive network that is modeled using a dynamic first-order self-model for aggregation characteristics of the base network, in particular for the excitability threshold used in aggregation. Whether and to what extent such plasticity actually takes place is controlled by a form of metaplasticity; e.g., [1, 8, 13, 14, 16, 17]. For example, in [14] the following compact quote is found, summarizing that due to stimulus exposure, adaptation speed will increase:

‘Adaptation accelerates with increasing stimulus exposure’ [14], p. 2 (4)
This indeed refers to a form of metaplasticity, which can be described by a second-order adaptive network that is modeled using a dynamic second-order self-model for timing characteristics of the first-order self-model for the first-order adaptation. In this way, both (first- and second-order) adaptation principles for plasticity and metaplasticity summarized in (3) and (4) will be applied in the network model presented below.

Because of its complexity, the model will be presented in two steps, as depicted in Fig. 2 and Fig. 3. In Fig. 2 the connectivity of the first two levels of the proposed network model is depicted. This covers the base network within the base (pink) plane, and the first-order self-model in the upper (blue) plane. For an overview of all states of the network model, see Table 2; here the first 10 states describe the base level and the next 15 states (up to state 25) the network’s first-order self-model. The base network consists of two subnetworks. The first describes a mental model for the analysis, which determines (by within-network dynamics), from monitored information about the driver’s situation (long drive, alcohol, unstable steering, unfocused gaze), an assessment of the situation of the driver (within the considered scenarios the two options are exhaustion risk and driving risk). The second describes a mental model for the support process, which determines (by within-network dynamics), from the assessment, a suitable support action for the driver (in the considered scenarios three options: rest advice, slow down, and block start).

For these mental models described at the base level, corresponding self-models have been added to be able to change them, for example by learning. The first-order self-model in the upper plane in Fig. 2 models some of the network characteristics of the two (sub)networks at the base level:

Analysis Self-Model: First-order self-model W-states and T-states $X_{11}$ to $X_{16}$.
Support Self-Model: First-order self-model W-states and T-states $X_{17}$ to $X_{25}$.

For each of the subnetworks for mental models at the base level, the first-order self-model has two submodels: a first-order connectivity self-model (based on W-states) and a first-order aggregation self-model (based on T-states). The connectivity self-model represents the connectivity characteristics of the particular mental model by self-model states $W_{X,Y}$, and the aggregation self-model represents the excitability thresholds of the assessment options (for the analysis model) and the support action options (for the support model) by self-model states $T_Y$. Each of these first-order self-model states $W_{X,Y}$ and $T_Y$ has a downward connection (in pink) to indicate the state $Y$ of the mental model at the base level for which they have their special effect; so, based on these downward links, the value of $W_{X,Y}$ plays the role of the indicated connection weight and the value of $T_Y$ the role of excitability threshold for the state $Y$ pointed at.

[Figure 2: network diagram with two planes. Base Network (lower plane): longdrive, alcohol, unstabsteer, unfocgaze, exhrisk, driving, drivingrisk, restadvice, slowdown, blockstart. First-order Self-Model (upper plane): the W-states and T-states $X_{11}$ to $X_{25}$ listed in Table 2.]
Fig. 2. Connectivity of the first-order self-modeling network model

For the sake of simplicity, the connectivity self-model states $W_{X,Y}$ have no incoming connections from other states; for the scenarios considered here they are kept constant, but in further extensions of the model they may easily be made dynamic as well, for example based on Hebbian learning as modeled in [21], Ch. 3. The aggregation self-model states for excitability thresholds do have incoming connections which make them dynamic, to model the above-mentioned adaptation principle (3) for plasticity from [7]. This is modeled by specifying a negative weight for the connections from the states that causally precede the indicated state $Y$. In addition, to counterbalance an excess of this negative effect, a positive weight 1 is used for all upward connections from $Y$ itself to $T_Y$. The simulation outcomes discussed in the next section show how these two opposite effects create an equilibrium value for each aggregation self-model state $T_Y$, which illustrates one form of adaptivity in the model: in this way the aggregation self-model learns. However, by including a second-order self-model as shown (in the purple plane) in Fig. 3, this learning has been made adaptive itself (in particular the learning speed), which creates second-order adaptation used as a form of control over the first-order adaptation.

In Fig. 3, the connectivity of the entire second-order adaptive network model is depicted. Compared to Fig. 2, a third (upper, purple) plane was added, consisting of a second-order self-model for the network. The second-order self-model states in this upper (purple) plane are explained in Table 2 (the last 19 states, from state 26 on):

Adaptation Self-Model: Second-order self-model W-states and $H_T$-states $X_{26}$ to $X_{44}$.
As in the first-order self-model, the second-order self-model includes a second-order connectivity self-model using states $W_{X,T_Y}$ for all incoming connections of the T-states of the first-order self-model. Again, as in the first-order case, these states $W_{X,T_Y}$ are kept constant for now. In addition, the second-order self-model includes a second-order timing self-model for the first-order T-states based on states $H_{T_Y}$. This second-order self-model is dynamic, which makes the whole network second-order adaptive. The special effect of each state $H_{T_Y}$ as speed factor for state $T_Y$ is effectuated by the downward (pink) connection to the related state $T_Y$. To make them dynamic, the states $H_{T_Y}$ themselves are affected by upward connections from the base level network, in this case following the above-mentioned adaptation principle (4) for metaplasticity, ‘Adaptation accelerates with increasing stimulus exposure’ [14]. Therefore, there are (blue) upward links with positive weights to each state $H_{T_Y}$ from the base states causally preceding base state $Y$. As a result, as soon as these causal ‘antecedents’ of $Y$ get higher activation levels, the adaptation speed (starting from 0: no adaptation initially) will increase, as will be shown in the example scenarios.

Table 2. Explanation of the states of the second-order self-modeling network model

| State | Name | Explanation |
|---|---|---|
| $X_1$ | longdrive | The driver is driving for a long period of time |
| $X_2$ | alcohol | Alcohol is detected |
| $X_3$ | unstabsteer | The driver’s steering is unstable |
| $X_4$ | unfocgaze | The driver’s gaze is not focused |
| $X_5$ | exhrisk | Assessment of a risk that the driver will get exhausted |
| $X_6$ | drivingrisk | Assessment of a safety risk for the driving |
| $X_7$ | driving | The car is driving |
| $X_8$ | restadvice | Supporting action to advise the driver to take some rest |
| $X_9$ | slowdown | Supporting action to slow down the car |
| $X_{10}$ | blockstart | Supporting action to block the starting of the car |
| $X_{11}$ | $W_{longdrive,exhrisk}$ | First-order connectivity self-model state for weight of the connection from longdrive to exhrisk |
| $X_{12}$ | $W_{alcohol,drivingrisk}$ | First-order connectivity self-model state for weight of the connection from alcohol to drivingrisk |
| $X_{13}$ | $W_{unstabsteer,drivingrisk}$ | First-order connectivity self-model state for weight of the connection from unstabsteer to drivingrisk |
| $X_{14}$ | $W_{unfocgaze,drivingrisk}$ | First-order connectivity self-model state for weight of the connection from unfocgaze to drivingrisk |
| $X_{15}$ | $T_{exhrisk}$ | First-order aggregation self-model state for excitability threshold of exhrisk |
| $X_{16}$ | $T_{drivingrisk}$ | First-order aggregation self-model state for excitability threshold of drivingrisk |
| $X_{17}$ | $W_{exhrisk,restadvice}$ | First-order connectivity self-model state for weight of the connection from exhrisk to restadvice |
| $X_{18}$ | $W_{driving,restadvice}$ | First-order connectivity self-model state for weight of the connection from driving to restadvice |
| $X_{19}$ | $W_{drivingrisk,slowdown}$ | First-order connectivity self-model state for weight of the connection from drivingrisk to slowdown |
| $X_{20}$ | $W_{driving,slowdown}$ | First-order connectivity self-model state for weight of the connection from driving to slowdown |
| $X_{21}$ | $W_{drivingrisk,blockstart}$ | First-order connectivity self-model state for weight of the connection from drivingrisk to blockstart |
| $X_{22}$ | $W_{driving,blockstart}$ | First-order connectivity self-model state for weight of the connection from driving to blockstart |
| $X_{23}$ | $T_{restadvice}$ | First-order aggregation self-model state for excitability threshold of restadvice |
| $X_{24}$ | $T_{slowdown}$ | First-order aggregation self-model state for excitability threshold of slowdown |
| $X_{25}$ | $T_{blockstart}$ | First-order aggregation self-model state for excitability threshold of blockstart |
| $X_{26}$ | $W_{longdrive,T_{exhrisk}}$ | Second-order connectivity self-model state for weight of the connection from longdrive to $T_{exhrisk}$ |
| $X_{27}$ | $W_{exhrisk,T_{exhrisk}}$ | Second-order connectivity self-model state for weight of the connection from exhrisk to $T_{exhrisk}$ |
| $X_{28}$ | $W_{alcohol,T_{drivingrisk}}$ | Second-order connectivity self-model state for weight of the connection from alcohol to $T_{drivingrisk}$ |
| $X_{29}$ | $W_{unstabsteer,T_{drivingrisk}}$ | Second-order connectivity self-model state for weight of the connection from unstabsteer to $T_{drivingrisk}$ |
| $X_{30}$ | $W_{unfocgaze,T_{drivingrisk}}$ | Second-order connectivity self-model state for weight of the connection from unfocgaze to $T_{drivingrisk}$ |
| $X_{31}$ | $W_{drivingrisk,T_{drivingrisk}}$ | Second-order connectivity self-model state for weight of the connection from drivingrisk to $T_{drivingrisk}$ |
| $X_{32}$ | $W_{exhrisk,T_{restadvice}}$ | Second-order connectivity self-model state for weight of the connection from exhrisk to $T_{restadvice}$ |
| $X_{33}$ | $W_{restadvice,T_{restadvice}}$ | Second-order connectivity self-model state for weight of the connection from restadvice to $T_{restadvice}$ |
| $X_{34}$ | $W_{drivingrisk,T_{slowdown}}$ | Second-order connectivity self-model state for weight of the connection from drivingrisk to $T_{slowdown}$ |
| $X_{35}$ | $W_{driving,T_{slowdown}}$ | Second-order connectivity self-model state for weight of the connection from driving to $T_{slowdown}$ |
| $X_{36}$ | $W_{slowdown,T_{slowdown}}$ | Second-order connectivity self-model state for weight of the connection from slowdown to $T_{slowdown}$ |
| $X_{37}$ | $W_{drivingrisk,T_{blockstart}}$ | Second-order connectivity self-model state for weight of the connection from drivingrisk to $T_{blockstart}$ |
| $X_{38}$ | $W_{driving,T_{blockstart}}$ | Second-order connectivity self-model state for weight of the connection from driving to $T_{blockstart}$ |
| $X_{39}$ | $W_{blockstart,T_{blockstart}}$ | Second-order connectivity self-model state for weight of the connection from blockstart to $T_{blockstart}$ |
| $X_{40}$ | $H_{T_{exhrisk}}$ | Second-order timing self-model state for the speed of $T_{exhrisk}$ |
| $X_{41}$ | $H_{T_{drivingrisk}}$ | Second-order timing self-model state for the speed of $T_{drivingrisk}$ |
| $X_{42}$ | $H_{T_{restadvice}}$ | Second-order timing self-model state for the speed of $T_{restadvice}$ |
| $X_{43}$ | $H_{T_{slowdown}}$ | Second-order timing self-model state for the speed of $T_{slowdown}$ |
| $X_{44}$ | $H_{T_{blockstart}}$ | Second-order timing self-model state for the speed of $T_{blockstart}$ |

Fig. 3. Connectivity of the entire second-order self-modeling network model

For full specifications of the adaptive network model, see the Appendix at URL https://www.researchgate.net/publication/344165044. For example, for connectivity characteristics, all connection weights not determined by W-states are 1, except for the connection from driving to $H_{T_{blockstart}}$, which is −1. For aggregation characteristics, the logistic sum combination function is used for the base states for assessment and support options (with steepness σ = 8 and adaptive excitability threshold) and for the second-order $H_T$-states (with steepness σ = 4 and excitability threshold τ = 0.7 or 1.4, depending on the number of incoming connections). All other states use the Euclidean combination function with n = 1 and λ = 1 (see Table 1), which actually is just a sum function. For timing characteristics, the speed factors of the base states for assessment and support options are 0.5 and those of the second-order $H_T$-states are 0.05. All other speed factors are adaptive (the first-order T-states) or 0 (for the other base states and for all W-states). The initial values for all W-states (which are constant due to speed factor 0) are 1 when they represent a positive connection; the negative ones are $W_{driving,blockstart}$, $W_{longdrive,T_{exhrisk}}$, $W_{alcohol,T_{drivingrisk}}$, $W_{unstabsteer,T_{drivingrisk}}$, $W_{unfocgaze,T_{drivingrisk}}$ and $W_{exhrisk,T_{restadvice}}$, which have initial value −1, and $W_{drivingrisk,T_{slowdown}}$, $W_{driving,T_{slowdown}}$, $W_{drivingrisk,T_{blockstart}}$ and $W_{driving,T_{blockstart}}$, with initial value −0.5. The initial values of all $H_T$-states are 0, as are those of all base states except the observables shown in Table 3, which depend on the chosen scenario.
Finally, the initial values for the five T-states were purposely set too high, at 2, 1.4, 2.4, 2.8 and 2.4, respectively (in relation to the number of their incoming connections), in order to let adaptation happen.
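To make the interplay of the three levels concrete, the following Python sketch simulates a four-state fragment of the model (longdrive, exhrisk, $T_{exhrisk}$ and $H_{T_{exhrisk}}$) with the parameter values quoted above. It is a toy sketch under several assumptions, not the model from the Appendix: the "advanced logistic sum" formula in `alogistic`, the Euler update, the step size `dt = 0.25` and the choice τ = 0.7 for the single incoming connection of $H_{T_{exhrisk}}$ are not given in this section, and the fragment ignores the rest of the network. It is therefore not expected to reproduce the reported curves quantitatively.

```python
import math

def alogistic(impacts, sigma, tau):
    """Logistic sum combination function (assumed 'advanced logistic' form)."""
    total = sum(impacts)
    return ((1 / (1 + math.exp(-sigma * (total - tau)))
             - 1 / (1 + math.exp(sigma * tau)))
            * (1 + math.exp(-sigma * tau)))

def euclidean(impacts, n=1, lam=1):
    """Euclidean combination function; with n = 1 and lam = 1 it is a plain sum."""
    return (sum(v ** n for v in impacts) / lam) ** (1 / n)

def simulate(t_end=40.0, dt=0.25):
    # Initial values from the text: T_exhrisk starts too high (2), H starts at 0.
    s = {"longdrive": 1.0, "exhrisk": 0.0, "T_exhrisk": 2.0, "H_T_exhrisk": 0.0}
    # Constant W-states: X11 = 1, X26 = -1, X27 = 1.
    w_ld_er, w_ld_t, w_er_t = 1.0, -1.0, 1.0
    trace = [dict(s, t=0.0)]
    for k in range(1, int(t_end / dt) + 1):
        # Base level: exhrisk uses the T-state value as its excitability threshold.
        agg_er = alogistic([w_ld_er * s["longdrive"]], 8, s["T_exhrisk"])
        # First-order self-model: T_exhrisk sums a negative and a positive impact.
        agg_t = euclidean([w_ld_t * s["longdrive"], w_er_t * s["exhrisk"]])
        # Second-order self-model: stimulus exposure raises the adaptation speed.
        agg_h = alogistic([1.0 * s["longdrive"]], 4, 0.7)
        new = dict(s)
        new["exhrisk"] += 0.5 * (agg_er - s["exhrisk"]) * dt
        new["T_exhrisk"] += s["H_T_exhrisk"] * (agg_t - s["T_exhrisk"]) * dt
        new["H_T_exhrisk"] += 0.05 * (agg_h - s["H_T_exhrisk"]) * dt
        s = new
        trace.append(dict(s, t=k * dt))
    return trace

trace = simulate()
for row in trace[::40]:
    print(round(row["t"], 2), round(row["exhrisk"], 3),
          round(row["T_exhrisk"], 3), round(row["H_T_exhrisk"], 3))
```

In this fragment the adaptation speed rises from 0 under stimulus exposure, the threshold then falls from its deliberately high initial value, and the assessment only comes up once the threshold is low enough, which is the qualitative pattern the text describes for plasticity controlled by metaplasticity.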
5 Outcomes of Example Simulation Scenarios

In Figs. 4, 5 and 6 simulation results are shown for three realistic scenarios, defined by the common settings shown in the role matrices in the Appendix mentioned in the last paragraph of Sect. 4 and by specific constant values 0 or 1 for the states $X_1$ to $X_4$ and $X_7$ as shown in Table 3. These graphs show the following:

• the relevant assessment (resulting from the analysis model) and support action (resulting from the support model)
• how the excitability thresholds used within the analysis model and the support model adapt over time and how the adaptation speed for them changes over time (resulting from the adaptation model)

The initial values for the excitability thresholds for the analysis model and support model were deliberately set too high, so that the adaptation process needed to get results is illustrated. Note that the adaptation speeds have initial value 0, so that in the first phase nothing happens in the analysis model and support model until a suitable adaptation process has started and, in a next phase, has resulted in successful adaptation of the analysis and support models.

Table 3. The three displayed scenarios

| Scenario | $X_1$ longdrive | $X_2$ alcohol | $X_3$ unstabsteer | $X_4$ unfocgaze | $X_7$ driving | Explanation |
|---|---|---|---|---|---|---|
| Scenario 1 (Fig. 4) | 1 | 0 | 0 | 0 | 1 | A driver who has been driving too long |
| Scenario 2 (Fig. 5) | 0 | 0 | 0 | 1 | 1 | A driver who drives with an unfocused gaze |
| Scenario 3 (Fig. 6) | 0 | 1 | 0 | 0 | 0 | A driver who consumed alcohol and wants to start driving |
For Scenario 1, Fig. 4 shows that, through the second-order self-model, the adaptation speed for the exhaustion risk excitability threshold (within the analysis model) increases from time 0 on (the purple line). This conforms to the ‘Plasticity Versus Stability Conundrum’ discussed in [17], p. 773: only adapt when relevant (adaptation speed > 0), otherwise keep stable (adaptation speed 0). This increase in adaptation speed (due to stimulus exposure) results in adaptation of this excitability threshold (in line with (3) from [7]): starting at value 2, it decreases and finally (after time 13) reaches values between 0.2 and 0.4 (the brown line). Apparently this is low enough, as after time 10 the exhaustion risk assessment is generated and reaches value 1 after time 15 (the red line), which is a successful analysis model outcome for this scenario. In turn, through the adaptation model, the adaptation speed for the excitability threshold of the support action rest advice (in the support model) increases after time 10 (the orange line). This results in adaptation of that threshold: the value, which initially was 2.4, starts to decrease after time 10 and reaches values between 1.4 and 1.6 after time 18 (the dark purple line). Again, apparently this is low enough, as the support action rest advice comes up after time 18 and reaches 1 after time 25 (the dark green line). This is a successful support model outcome.

[Figure 4: state values over time (0 to 40) for X5 exhaustion-risk, X8 rest-advice, X15 T-exhaustion-risk, X23 T-rest-advice, X40 H-T-exhaustion-risk and X42 H-T-rest-advice]
Fig. 4. Long drive leads to an exhaustion risk assessment and to the support action rest advice

For Scenario 2, Fig. 5 shows that, through the adaptation model, the adaptation speed for the driving risk excitability threshold (within the analysis model) increases from time 0 on (the light blue line), which results in adaptation of this threshold: starting at value 1.4, it decreases and (after time 7) reaches values below 0.7 (the light green line).
[Figure 5: state values over time (0 to 40) for X6 driving-risk, X9 slow-down, X16 T-driving-risk, X24 T-slow-down, X41 H-T-driving-risk and X43 H-T-slow-down]
Fig. 5. Driving with an unfocused gaze leads to a driving risk assessment and to the support action slow down
Apparently this is low enough, as between times 5 and 10 the driving risk assessment is generated and reaches value 1 after time 15 (the pink line), which is a successful analysis model outcome for this case. In turn, through the adaptation model, the adaptation speed for the excitability threshold of the support action slow down (in the support model) increases after time 5 (the dark green line). This results in adaptation of that threshold: the value, which initially was 2.8, starts to decrease after time 10 and reaches values between 1.4 and 1.6 after time 18 (the middle green line). Again, apparently this is low enough, as the support action slow down comes up after time 18 and reaches 1 after time 20 (the brown line). This is a successful support model outcome for this case.
[Figure 6: state values over time (0 to 40) for X6 driving-risk, X10 block-start, X16 T-driving-risk, X25 T-block-start, X41 H-T-driving-risk and X44 H-T-block-start]
Fig. 6. Alcohol usage leads to a driving risk assessment and to the support action block start
For Scenario 3, Fig. 6 initially shows (for the analysis model) a pattern similar to that of Scenario 2. However, for the second part of the process (for the support model), this scenario shows that fluctuating patterns can also occur. More specifically, it illustrates how the adaptation of the excitability threshold is reinforced by the outcome of the support model, so that in the end they reach an equilibrium via a fluctuating pattern.
6 Discussion

In complex cognitive processes, internal mental models are often used; e.g., [9, 10, 12, 15]. Such models can simply be applied, but they are also often adaptive, so that they can be formed and improved. The focus in this paper was on adaptive cognitive analysis and support processes for the situation and states of a human in a demanding task; the adaptive network model was illustrated for a car driver. Within these processes, internal mental models are used for the analysis and support processes. An adaptive network model was presented that models such adaptive cognitive analysis and support processes. The network model makes use of adaptive first-order self-models for the internal mental models used for the cognitive analysis and support processes. To control the adaptation of these first-order self-models, second-order self-models are included. The adaptive network model was illustrated for realistic scenarios for a car driver who gets exhausted, shows unstable steering or an unfocused gaze, and/or has used alcohol.

For the adaptivity and its control, the network model makes use of two biologically plausible adaptation principles informed by the Cognitive Neuroscience literature: one within the first-order self-model for adaptation of aggregation characteristics of the base network, in particular the excitability threshold [7], and the other [14] within the second-order self-model for adaptation of timing characteristics of the first-order self-model by metaplasticity [1, 8, 13, 14, 16, 17]. This study shows how complex adaptive cognitive processes based on internal mental models can be modeled in an adequate manner by multi-order self-modeling networks.
References

1. Abraham, W.C., Bear, M.F.: Metaplasticity: the plasticity of synaptic plasticity. Trends Neurosci. 19(4), 126–130 (1996)
2. Bhalwankar, R., Treur, J.: Modeling the development of internal mental models by an adaptive network model. In: Proceedings of the 11th Annual International Conference on Brain-Inspired Cognitive Architectures for AI, BICA*AI’20. Advances in Intelligent Systems and Computing, Springer Nature Publishers (2020)
3. Bosse, T., Both, F., Duell, R., Hoogendoorn, M., van Lambalgen, R., Klein, M.C.A., van der Mee, A., Oorburg, R., Sharpanskykh, A., Treur, J., de Vos, M.: An ambient agent system assisting humans in complex tasks by analysis of a human’s state and performance. Int. J. Intell. Inf. Database Syst. 7, 3–3 (2013)
4. Bosse, T., Both, F., Gerritsen, C., Hoogendoorn, M., Treur, J.: Methods for model-based reasoning within agent-based ambient intelligence applications. Knowl.-Based Syst. J. 27, 190–210 (2012)
5. Bosse, T., Hoogendoorn, M., Klein, M.C.A., Lambalgen, R.M. van, Maanen, P.P. van, Treur, J.: Incorporating human aspects in ambient intelligence and smart environments. In: Chong, N.Y., Mastrogiovanni, F. (eds.) Handbook of Research on Ambient Intelligence and Smart Environments: Trends and Perspectives, pp. 128–164. IGI Global (2011)
6. Bosse, T., Hoogendoorn, M., Klein, M.C.A., Treur, J.: An ambient agent model for monitoring and analysing dynamics of complex human behaviour. J. Ambient Intell. Smart Environ. 3, 283–303 (2011)
7. Chandra, N., Barkai, E.: A non-synaptic mechanism of complex learning: modulation of intrinsic neuronal excitability. Neurobiol. Learn. Mem. 154, 30–36 (2018)
8. Garcia, R.: Stress, metaplasticity, and antidepressants. Curr. Mol. Med. 2, 629–638 (2002)
9. Gentner, D., Stevens, A.L.: Mental Models. Erlbaum, Hillsdale NJ (1983)
10. Greca, I.M., Moreira, M.A.: Mental models, conceptual models, and modelling. Int. J. Sci. Educ. 22(1), 1–11 (2000)
11. Hebb, D.O.: The Organization of Behavior: A Neuropsychological Theory. Wiley, New York (1949)
12. Kieras, D.E., Bovair, S.: The role of a mental model in learning to operate a device. Cognit. Sci. 8(3), 255–273 (1984)
13. Magerl, W., Hansen, N., Treede, R.D., Klein, T.: The human pain system exhibits higher-order plasticity (metaplasticity). Neurobiol. Learn. Mem. 154, 112–120 (2018)
14. Robinson, B.L., Harper, N.S., McAlpine, D.: Meta-adaptation in the auditory midbrain under cortical influence. Nat. Commun. 7, 13442 (2016)
15. Seel, N.M.: Mental models in learning situations. In: Advances in Psychology, vol. 138, pp. 85–107. North-Holland, Amsterdam (2006)
16. Sehgal, M., Song, C., Ehlers, V.L., Moyer, J.R., Jr.: Learning to learn – intrinsic plasticity as a metaplasticity mechanism for memory formation. Neurobiol. Learn. Mem. 105, 186–199 (2013)
17. Sjöström, P.J., Rancz, E.A., Roth, A., Hausser, M.: Dendritic excitability and synaptic plasticity. Physiol. Rev. 88, 769–840 (2008)
18. Treur, J.: Network-Oriented Modeling: Addressing Complexity of Cognitive, Affective and Social Interactions. Springer (2016)
19. Treur, J.: Modeling higher-order adaptivity of a network by multilevel network reification. Network Sci. 8, S110–S144 (2020)
20. Treur, J.: Network-Oriented Modeling for Adaptive Networks: Designing Higher-Order Adaptive Biological, Mental and Social Network Models. Springer Nature Publishing, Cham (2020)
Evolution of Spatial Political Community Structures in Sweden 1985–2018

Jérôme Michaud, Ilkka H. Mäkinen, Emil Frisk, and Attila Szilva
Uppsala University, 751 26 Uppsala, Sweden
[email protected]
https://katalog.uu.se/empinfo/?id=N17-1665
Abstract. Understanding how the electoral behaviour of a population changes in a country is key to understanding where and why social change is happening. In this paper, we apply methods from network science to study the medium- to long-term evolution of Swedish electoral geography. Sweden is an interesting case since its political landscape has changed significantly over the last three decades, with the rise of the Sweden Democrats and the Green Party and the fall of the Social Democrats. By partitioning the Swedish municipalities according to the similarity of their voting profiles, we show that Sweden can be divided into three or four main politico-cultural communities. More precisely, a transition from three to four main politico-cultural communities is observed. The fourth community emerged in the early 2000s and is characterized by a large vote share for the Sweden Democrats, while almost all other parties underperform.

Keywords: Electoral geography · Swedish parliamentary elections · Network science · Community detection · Partition · Fragmentation · Convergence · Evolution
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 275–285, 2021. https://doi.org/10.1007/978-3-030-65351-4_22

Understanding where and why political change is happening in a country is a fundamental issue in political geography. In this paper, we propose to use methods from network science to help characterize where political change is happening. Every country is divided into administrative regions such as municipalities or counties. The main idea of this work is to represent a country as a network of its administrative regions, connected by weighted edges measuring their similarity in political/electoral behaviour. The resulting network can be analyzed using standard network science methods, such as community detection. In this paper, we focus on the Swedish case. Over the last three decades, the Swedish political landscape has changed significantly, with the rise of the Sweden Democrats from nothing to about 18% and the fall of the Social Democrats from about 45% to below 30%, but the geographical patterns of electoral behaviour have remained fairly stable. Henrik Oscarsson and Sören
Holmberg have studied various perspectives on Swedish elections and electoral behavior within the framework of the Swedish National Election Studies Program (SNES). As for the political geography of Sweden, they suggest two dividing lines that pervade electoral behavior: one between the North and the South and one between cities and the countryside [16, p. 247]. Although regional variation of electoral behavior in Sweden is quite weak compared with many other European countries [12] and may even have decreased for some parties during the 20th century [8, p. 220], there are still clear differences between different parts of Sweden. According to the literature, the electoral geography of Sweden has changed little over the years and the following patterns are generally described:

• Strong support for the Left Party (V) and the Social Democratic Party (S) in the northern parts of Sweden [3,9,10,14–17].
• Strong support for the Center Party (C) in smaller cities and on Gotland [9,15,16].
• Strong support for the Green Party (MP) in Stockholm and other university cities [9,15].
• Strong support for the Sweden Democrats (SD) in Scania and Blekinge [15, 18–20].
• Strong support for the Liberal Party (L) in larger cities and on the West Coast (for instance Gothenburg) [9,15].
• Strong support for the Moderate Party (M) in the three largest cities (Stockholm, Gothenburg, and Malmö) [9,15].
• Strong support for the Christian Democrats (KD) in Småland and Norrland [9,15,16].

Few studies are concerned with the evolution of the electoral geography of Sweden over time. One important observation is that the Sweden Democrats (SD), a relatively new party (established 1988), have grown rapidly since 1998 [18–20]. Another important study is by Oscarsson et al. [15], where the geographical convergence of party support was studied by calculating the evolution of the coefficient of variation measure (CV).
The CV is defined as σ/μ, where μ is the mean vote share of a party and σ its standard deviation. The CV is a measure of dispersion, and a high value for a certain party indicates a large variation in support for the party between the regions of Sweden. Oscarsson et al. [15] discuss the evolution of the CV for each party since 1991, noting in particular that S and M have the smallest coefficients and thus the smallest regional differences in support. One drawback of the state-of-the-art approach to Swedish electoral geography is that it normally focuses on one specific election or on the change from one election to the next, ignoring a longer perspective. Furthermore, it is common to simply present the locations where a certain political party has its strongest and weakest support [16]. Sometimes more sophisticated methods are used, e.g., the regression-analysis-based method employed by Lidström [14]. However, these studies are often performed in a party-by-party fashion and do not provide an integrated view of electoral behavior. It appears that an integrated approach to
the topic would be advantageous not least when analysing the factors underlying the political divisions. In this study, an alternative methodology to investigate electoral geography, based on network science [13], is used. The analysis provides a partition of Swedish municipalities into “communities” based on the similarity in their inhabitants’ electoral behavior. The term “community” is here employed as in network science rather than in political sciences or in sociology. In the functional network analysis [13] performed here, “connections” signify similarity in electoral behavior. Communities, in turn, reflect the regions where electoral behavior is relatively homogeneous. Such a method provides a way to investigate subnational politico-cultural geography. In contrast to a party-by-party analysis, this approach also takes into account the full spectrum of possible electoral choices, including abstention, blank votes, and invalid votes. Furthermore, since this analysis can be performed for successive elections, it allows us to study the evolution of the communities identified. This paper is organized as follows. Section 2 presents the data and methodology used in this study. Section 3 presents the results and Sect. 4 provides some discussion and concluding remarks.
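As a small illustration of the coefficient of variation discussed earlier in this section, the following sketch computes σ/μ for one party over a handful of regions. The vote shares are hypothetical, and the use of the population standard deviation is an assumption, since the source does not state which estimator was used in [15].

```python
from statistics import mean, pstdev


def coefficient_of_variation(shares):
    """CV = sigma / mu for one party's vote shares across regions.

    Assumption: population standard deviation (pstdev); the sample
    version would give slightly larger values for few regions.
    """
    mu = mean(shares)
    return pstdev(shares) / mu


# Hypothetical vote shares (in percent) of two parties in five regions.
evenly_spread_party = [29.0, 30.0, 31.0, 30.0, 30.0]
regionally_concentrated_party = [2.0, 4.0, 18.0, 3.0, 3.0]

cv_even = coefficient_of_variation(evenly_spread_party)
cv_concentrated = coefficient_of_variation(regionally_concentrated_party)
```

A party with nearly uniform support obtains a CV close to 0, while regionally concentrated support yields a much larger value.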
2 Data and Methodology

2.1 Data
For this project, the results of parliamentary elections in Sweden from 1985 to 2018 at the municipal level are used [21–26]. The choice of the level is motivated by it being coarse enough for the maps to be informative, yet fine-grained enough to provide some reasonable insights. Note that the map of municipalities has changed a little during the study period: their number increased from 284 to 290. The creation of new municipalities occurred without exception as a division of one municipality into two or three new ones. The changes were handled by assuming homogeneity in electoral behavior before the division and using the finer decomposition when comparing partitions into communities.

The vote-share distribution characterizing a municipality i is represented as an 11-dimensional vector v_i, comprising the following choices: the Social Democratic Party (S); the Moderate Party (M); the Sweden Democrats (SD); the Green Party (MP); the Center Party (C); the Left Party (V); the Liberals (L, formerly FP); the Christian Democrats (KD); Others (minor parties); Invalid or blank vote; Abstention. The components of the vote-share vector are obtained by computing the percentage of votes for each option in each municipality.

2.2 Methods
In order to partition Sweden into politico-cultural communities we perform a functional network analysis similar to that presented in [7], where bipartisanship
in Spanish elections was analyzed. The authors of that study extracted the functional network measuring the similarity of electoral behaviour between municipalities using the cosine similarity measure, but discussed the resulting partition only briefly. In this paper, an improved version of their methodology is applied to Swedish parliamentary elections. The functional network associated with a given election can be specified by a matrix S, whose elements S_ij represent the similarity in electoral behavior, as captured by the vote-share vectors v_i and v_j, between two municipalities i and j. The nodes of this network are the Swedish municipalities, and the edges are weighted by the similarity between municipalities. The degree of similarity between two municipalities i and j is given by the Bhattacharyya coefficient (BC) [2] as

$$S_{ij} = \mathrm{BC}(v_i, v_j) = \sum_{k} \sqrt{v_i^{k}\, v_j^{k}}. \tag{1}$$
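The construction of the functional network can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vote counts are hypothetical and use three options instead of eleven, and the vectors are normalized to proportions summing to 1 (rather than percentages), which is an assumption made here so that the coefficient stays within [0, 1].

```python
import math


def vote_shares(counts):
    """Turn raw counts per option into proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def bhattacharyya(p, q):
    """Bhattacharyya coefficient of two discrete distributions, as in Eq. (1)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def similarity_matrix(vectors):
    """Dense matrix S with S[i][j] = BC(v_i, v_j) for all pairs of municipalities."""
    n = len(vectors)
    return [[bhattacharyya(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical counts for three municipalities and three options.
counts = [[500, 300, 200], [450, 350, 200], [100, 200, 700]]
vectors = [vote_shares(c) for c in counts]
S = similarity_matrix(vectors)
```

In this toy example the first two municipalities have nearly identical profiles and therefore a coefficient close to 1, while the third is less similar to both.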
The Bhattacharyya coefficient is an approximate measure of the overlap between two probability distributions, here two vote-share distributions. Its values vary between 0 and 1, reaching 0 when there is no overlap between the distributions. It increases with the number of parties present in both municipalities and with the amount of overlap in the vote shares for a party. In our dataset, all choice options are present in every municipality, which yields generally high values of the coefficient. Other choices are possible for the similarity measure. In [7], the cosine similarity is used. We argue that a similarity measure tailored to comparing probability distributions is more natural. We tested both the BC and the Jensen-Shannon similarity measure [6] and found that the resulting partitions were indistinguishable. We decided to use the BC as it is computationally cheaper.

Using the BC as a similarity measure, Sweden was partitioned into politico-cultural communities, each consisting of a number of municipalities, for each of the ten parliamentary elections held in Sweden between 1985 and 2018. The partition was obtained by applying the Louvain community detection algorithm [4], which aims at maximizing (in terms of robustness) the modularity of a partition of a network.

In order to compare the partitions at different times, it is necessary to quantify the difference between partitions. This is done using the normalized mutual information (NMI) measure [5,11]. The NMI is based on the confusion matrix N, where the rows correspond to the communities detected in partition A and the columns to those detected in partition B. The elements N_ij of N represent the number of nodes (here: municipalities) in community i of partition A that are also present in community j of partition B. Let c_A and c_B be the number of communities found in partitions A and B, respectively.
We denote the sum over row i of the confusion matrix N by N_{i·} and the sum over column j by N_{·j}, and let N be the total number of municipalities (the sum of all elements of the matrix N). With these definitions, the NMI measure is given by

$$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} N_{ij} \log\left(\frac{N_{ij}\, N}{N_{i\cdot}\, N_{\cdot j}}\right)}{\sum_{i=1}^{c_A} N_{i\cdot} \log\left(\frac{N_{i\cdot}}{N}\right) + \sum_{j=1}^{c_B} N_{\cdot j} \log\left(\frac{N_{\cdot j}}{N}\right)}. \tag{2}$$
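Equation (2) can be evaluated directly from two label assignments, as in the following sketch. The function name and the toy partitions are hypothetical; empty cells of the confusion matrix are skipped (they contribute nothing to the numerator), and returning 1 when both partitions consist of a single community is a convention assumed here for the otherwise undefined 0/0 case.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information of two partitions of the same nodes, Eq. (2)."""
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))  # non-empty confusion-matrix cells N_ij
    rows = Counter(labels_a)                  # row sums N_i.
    cols = Counter(labels_b)                  # column sums N_.j
    numerator = -2.0 * sum(
        n_ij * math.log(n_ij * n / (rows[a] * cols[b]))
        for (a, b), n_ij in joint.items()
    )
    denominator = (sum(c * math.log(c / n) for c in rows.values())
                   + sum(c * math.log(c / n) for c in cols.values()))
    if denominator == 0.0:
        return 1.0  # assumed convention: two trivial partitions are identical
    return numerator / denominator


# Hypothetical community labels of six municipalities in two elections.
election_1 = ["north", "north", "urban", "urban", "rural", "rural"]
election_2 = ["north", "north", "urban", "rural", "rural", "rural"]
score = nmi(election_1, election_2)
```

Because only co-membership matters, relabeling the communities of one partition leaves the value unchanged.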
The values of the NMI measure vary between 0 and 1, being 0 when the two community structures are independent and 1 when they are identical. The partitions can be visualized by projecting the network onto the map of Sweden and coloring the municipalities according to their community. Note that municipalities belonging to the same community are not necessarily geographically adjacent, since the criterion for grouping them together is based on similarity in voting only. The largest communities with large geographical overlaps between successive elections have been identified as being “the same” community. This also allows us to study the change in the community structure over time. For example, the evolution of the size of the major communities can be expressed by the number of municipalities within them. Over time, this number changes, providing some insights into the overall dynamics of the main communities.

In order to complete the analysis, electoral behavior in each of the major communities is studied by computing the average of the vote-share distributions of all municipalities in the respective community. This is called the prototypical vote-share distribution of a community. The similarity measure used to construct the functional network is then also used to estimate the similarity between the communities, as well as their evolution. In addition, electoral behavior in the major communities is characterized using standardized support scores. These are computed by the formula (μ_C − μ)/σ, where μ_C is the prototypical vote share of community C, μ is the mean vote share over all Swedish municipalities, and σ is the corresponding standard deviation. Thus, the score measures the over- or underrepresentation of a party in a community with respect to the national municipality average. Finally, the evolution of the number of communities identified by the Louvain algorithm is accounted for.
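The prototypical vote-share distribution and the standardized support scores described above can be sketched as follows. The data are hypothetical (two options, four municipalities), the function names are illustrative, and the use of the population standard deviation is an assumption, since the source does not specify the estimator.

```python
from statistics import mean, pstdev


def prototype(vectors):
    """Component-wise average of the vote-share vectors of a community."""
    return [mean(column) for column in zip(*vectors)]


def support_scores(community_vectors, all_vectors):
    """Standardized support score (mu_C - mu) / sigma for every option."""
    proto = prototype(community_vectors)
    scores = []
    for k, mu_c in enumerate(proto):
        column = [v[k] for v in all_vectors]  # option k in all municipalities
        scores.append((mu_c - mean(column)) / pstdev(column))
    return scores


# Hypothetical vote shares of four municipalities for two options.
all_vectors = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5], [0.5, 0.5]]
community = [all_vectors[0]]  # a community consisting of the first municipality
scores = support_scores(community, all_vectors)
```

A positive score indicates that an option is overrepresented in the community relative to the average municipality, a negative score that it is underrepresented.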
3 Results
Applying functional network analysis to the ten Swedish parliamentary elections between 1985 and 2018, a partition of the country into politico-cultural communities was obtained for each election. The maps of the partitions are displayed in Fig. 1. In each of them, the four largest communities are colored; municipalities outside these are grey. We identify four main communities:

North. The community displayed in green in Fig. 1, which covers most of the North of Sweden as well as some coastal municipalities in the South East.
Urban. The community displayed in yellow in Fig. 1, covering the major Swedish cities, Stockholm, Gothenburg, and Malmö, along with many municipalities around them.
Fig. 1. Partition of Sweden into communities for the 10 parliamentary elections held between 1985 and 2018. The largest 4 communities are colored. Smaller communities are grey.
Rural South. The community displayed in blue in Fig. 1, covering rural parts of the South of Sweden as well as some municipalities in the North.
Far South. The community displayed in brown in Fig. 1, which emerges in the far South of Sweden and expands northward. This community is only identified from 2002 onward. Before that, it is merely a residual category.

In order to account for the characteristics of the major communities in terms of electoral behavior, standardized support scores were computed for all parties (and other possibilities) in the four main communities. The scores are displayed in Table 1. The main features are as follows:

• In the North community, S and V are strongly overrepresented, while M, L, and KD are underrepresented. This pattern is stable over time and provides a good characterization of the North community.
• In the Urban community, M and L are strongly overrepresented, while S is underrepresented. Abstentions tend to be fewer. The underrepresentation of C decreases over time. Interestingly, the initial overrepresentation of SD in
Table 1. Standardized support score for the different parties in each of the main communities averaged over the 10 elections. Scores outside the [−0.5, 0.5] interval are displayed in bold for readability. Community
M
North
−0.83 −0.17 −0.61 −0.52
Urban
C
L
1.08 −0.64
KD 0.98
1.14 −0.34
S
V 0.99
MP
0.07 −0.71 −0.33
Rural South
−0.10
Far South
−0.11 −0.45 −0.27 −0.44 0.01
SD
Others Invalid N-Vot
0.81 −0.42 −0.26 −0.12 −0.26 0.60
0.78 −0.51 −0.63 −0.19 −0.59 −0.47
0.03
0.37
0.33
0.11 −0.48
0.09 −0.23
0.21
0.03
1.75 −0.18
0.30
0.46
Fig. 2. Left: Evolution of the size of the main communities. The dotted line reports the cumulated size of smaller communities. Right: Homogeneity of the main communities. The Urban community is the least homogeneous and the most variable. Legends are the same for the two figures.
1998 and 2002 turns into underrepresentation in 2014 and 2018, while KD has moved in the opposite direction since the 1980s.
• The Rural South community is characterized by an overrepresentation of both C and KD and an underrepresentation of S and V. This pattern is stable over time, although some decline for MP can be discerned.
• The Far South community is mainly characterized by a large overrepresentation of the SD party and an underrepresentation of V. It is also noticeable that the support for S has gone from being markedly overrepresented to being markedly underrepresented, and that V becomes even more underrepresented over time.

Overall, the analysis shows that the major communities display marked differences, most of which are stable over time.

The communities have changed over time. Most visibly (see Fig. 1 and Fig. 2), the Rural South community tends to become smaller; its geographical area diminishes in favor of the other communities. Figure 2 (left) shows the evolution of the size of the main communities in terms of the number of “their” municipalities. The shrinking of the Rural South community from 99 to 31 municipalities dominates the picture, while the North and Urban communities tend to grow at least until 2010. In 2014 and 2018, they also shrink, while the emerging Far South community grows from 15 to 42 municipalities between 2014 and 2018.
Fig. 3. Left: Distribution of similarity measure (BC) for each election year. Right: Evolution of the similarity between the four main communities.
The number of municipalities outside the large communities remains approximately constant until 2014 and then jumps from 22 to 39. The reduction of the Rural South community seems to be due to three different dynamics. Starting from the 1990s, the municipalities in the North that were previously similar to the Rural South community switch to the North community. By 2018, only one municipality (Bjurholm) in Northern Sweden remains in “Rural South”. The Urban community also gains some territory from the Rural South community. This is particularly visible in 2010, when Gothenburg and Stockholm are almost connected by the Urban community, whereas in 1985 these two regions were separated by a blue region of the Rural South community. Finally, in the southernmost parts of Sweden, the Far South community has grown mainly on the territory of the Rural South community.

The communities also vary with regard to their internal homogeneity. The differences in this regard are shown in Fig. 2 (right). Between 1985 and 1998, the municipalities within the North and the Rural South communities were more similar to each other than those within the Urban community. In 2002, the internal variation of the North community suddenly increased, but it has decreased since then. After 2002, both the Rural South and Far South communities have been internally more cohesive than the other two.

In order to compare the main communities with regard to their electoral behavior, prototypical vote-share distributions for each of them were computed by averaging those of “their” municipalities. Pairwise similarity scores (BC) [2] between the major communities were then calculated for each election. The results are displayed in Fig. 3 (right). Overall, Sweden is relatively homogeneous [12] at this level of measurement, and the similarity scores between communities are high. The average BC between the main communities varies between 0.971 (in 2002) and 0.980 (in 2018).
The values are high because, amongst other things, all options are present in every municipality, and the low proportions of blank and invalid votes do not vary greatly across municipalities. The North and Urban communities are the most dissimilar throughout the study period. The Rural South community becomes more similar to both the Urban and the North communities over time, most in the 1980s and 1990s. The Far South community
starts out being very similar to all other major communities in 2002, then becomes less similar to the North and Urban communities but increasingly similar to the Rural South community. The almost uniform increase in similarity observed between 2014 and 2018 can, somewhat unexpectedly, be explained by the rise of the SD party, which occurred in 2018 in all parts of Sweden and made the voting profiles more similar than before. This can also be seen from the distribution of similarities displayed in Fig. 3 (left), in which the distribution for the 2018 election shows a higher degree of similarity.

The division into communities has changed from election to election. In order to measure the rate of this change, normalized mutual information (NMI) is used here. The results are displayed in Table 2.

Table 2. Evolution of the NMI measure between consecutive elections.

Elections  85–88  88–91  91–94  94–98  98–02  02–06  06–10  10–14  14–18
NMI        0.840  0.734  0.773  0.793  0.701  0.710  0.727  0.699  0.701
Over the study period, the NMI has decreased from 0.84 to 0.70, indicating an acceleration of the change in the community structure. While the partition into communities was more stable between 1985 and 1988, a larger change occurred between the 1988 and the 1991 elections. After a temporary restabilisation, the community structure has since 2002 been changing at a markedly faster rate.

Table 3. Number of communities detected by the Louvain algorithm for each election.

Election       1985  1988  1991  1994  1998  2002  2006  2010  2014  2018
# Communities  12    14    15    17    18    19    14    14    22    24
In addition, the total number of communities (of which only the major ones have been considered here) detected by the Louvain algorithm increases over time. Table 3 shows a doubling between 1985 and 2018. This can be interpreted as a sign of fragmentation of the Swedish political landscape.
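The communities counted in Table 3 result from modularity maximization. The sketch below does not implement the Louvain algorithm [4]; it only evaluates the weighted modularity Q of a given partition, the quantity that Louvain seeks to maximize, under the assumption of an undirected network given as a symmetric weight matrix without self-loops. The toy network and its weights are hypothetical.

```python
def modularity(weights, communities):
    """Weighted modularity Q of a partition of an undirected network.

    weights: symmetric matrix (list of lists) with zero diagonal (assumed).
    communities: community label of each node, in node order.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]  # weighted degree of each node
    two_m = sum(strength)                     # twice the total edge weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # observed weight minus its expectation under random rewiring
                q += weights[i][j] - strength[i] * strength[j] / two_m
    return q / two_m


# Hypothetical network: two pairs of strongly similar municipalities.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
q_split = modularity(weights, [0, 0, 1, 1])   # the two natural groups
q_merged = modularity(weights, [0, 0, 0, 0])  # everything in one community
```

In this toy case, separating the two pairs yields a higher modularity than merging all nodes, which is the kind of comparison a modularity-based method performs when deciding how many communities to report.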
4 Concluding Remarks
In free and fair elections, voters can freely choose among the parties on offer, or choose not to vote. The individual nature of voting, along with the variation in preference always present among individuals, divides the votes within an administrative unit among the parties in different proportions and with some votes having been declared as invalid. Seen from an aggregate-level perspective, many
administrative units—often neighboring ones—are reminiscent of each other in the division of votes among parties. In previous research, numerous examples of regional propensities towards over- or underrepresentation of certain parties can be found. Such tendencies can be fairly constant despite the changing results of the parties at the national level. At the same time, entire regions can change, however slowly, and some parts of them can change more than others. This paper constitutes an attempt to identify and systematically analyze these processes in Sweden over a 33-year period. The analysis conducted here has shown that nine out of ten Swedish municipalities could be assorted into three or four major, stable regions between 1985 and 2018. The communities are not entirely cohesive geographically (see Fig. 1). In 1985, only the North community is spatially highly concentrated, but over time the other communities seem to become more so as well. Overall, the picture is rather stable. However, the emergence of the Sweden Democrats has caused a marked change in the South of Sweden, where a belt of communities with a new common voting profile has sprung up and expanded since 2002. The rural community type dominated by Center and Christian Democrats has been on the wane during the entire study period, losing the majority of its municipalities. In politico-scientific literature, the notion of (party system) “fragmentation” is normally used to depict an increase in the number of political parties [1]. In this study, analysing the political geography of Sweden, the meaning of “fragmentation” is tied to the number of different recognizable voting patterns in the municipalities. These two processes are not necessarily independent of each other. Similarly, “convergence” would here not necessarily point at fewer parties, or more similar political ideas across parties, but at a decrease in the number of distinct voting patterns. 
We have found that there are more large parties and more types of collective voting profiles at the municipal level in 2018 than there were in 1985. Both of these indicate that fragmentation in the Swedish political field, as indicated by its geography, has increased, especially after 2010. The fact that the identified dissimilarities are smaller in 2018 than 1985 indicates, in turn, that the voting patterns in Swedish municipalities as such have tended to converge, due mainly to the relatively ubiquitous nature of the fragmentation process. In this manner, it seems that the fragmentation of the field of alternatives has been accompanied by a simultaneous convergence as regards the contents of the collective voting profiles. In order to estimate the real importance of the new entities to the national politics it is, however, essential to understand their social, and especially demographic, character that will in the last end determine their political weight in the future. As regards single parties, it is also important to take into account their possibilities of “relocating” their vote—in this regard, the slight gains in the Urban community may in the longer run be more important to the Center Party (or the Christian Democrats) than the losses in the Rural South.
References
1. Best, R.E.: How party system fragmentation has altered political opposition in established democracies. Gov. Opposition 48(3), 314–342 (2013)
2. Bhattacharyya, A.: On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35, 99–109 (1943)
3. Blomgren, M.: Det röda Norrland och det blå Sverige. Forskningsrapporter i statsvetenskap vid Umeå universitet 1, 85–104 (2012)
4. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. 2008(10), P10008 (2008)
5. Danon, L., Diaz-Guilera, A., Duch, J., Arenas, A.: Comparing community structure identification. J. Stat. Mech: Theory Exp. 2005(09), P09008 (2005)
6. Endres, D.M., Schindelin, J.E.: A new metric for probability distributions. IEEE Trans. Inf. Theory 49(7), 1858–1860 (2003)
7. Fernández-Gracia, J., Lacasa, L.: Bipartisanship breakdown, functional networks, and forensic analysis in Spanish 2015 and 2016 national elections. Complexity 2018 (2018)
8. Gilljam, M.: Sveriges politiska geografi. In: Väljarna inför 90-talet, pp. 216–221. C.E. Fritzes AB, Stockholm (1993)
9. Hagevi, M.: Den svenska väljaren [The Swedish Voter]. Boréa, Umeå, Sweden (2011)
10. Holmberg, S.: Välja parti. Norstedts juridik, Stockholm (2000)
11. Kuncheva, L.I., Hadjitodorov, S.T.: Using diversity in cluster ensembles. In: 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 2, pp. 1214–1219. IEEE (2004)
12. Lane, J.E., Ersson, S.O., Ersson, S.: Politics and society in Western Europe. Sage, London (1999)
13. Latora, V., Nicosia, V., Russo, G.: Complex Networks: Principles, Methods and Applications. Cambridge University Press, Cambridge (2017)
14. Lidström, A.: Socialdemokraternas tillbakagång 1973-2014: strukturella förklaringar och regionala variationer. Umeå universitet (2018)
15. Oscarsson, H., Andersson, D., Falk, E., Forsberg, J.: Förhandlingsvalet 2018. Analyser av valresultatet i 2018 års riksdagsval (2018)
16. Oscarsson, H., Holmberg, S.: Regeringsskifte: väljarna och valet 2006. Norstedts juridik (2008)
17. Oscarsson, H., Holmberg, S.: Nya svenska väljare. Norstedts Juridik AB (2013)
18. Oscarsson, H., Holmberg, S.: Svenska väljare. Wolters Kluwer, Stockholm (2016)
19. Sannerstedt, A.: Sverigedemokraternas sympatisörer fler än någonsin. In: Ohlsson, J., Oscarsson, H., Solevi, M. (eds.) Ekvilibrium (2016)
20. Sannerstedt, A.: Sverigedemokraterna: Skånegapet krymper. Larmar och gör sig till, pp. 451–471 (2017)
21. Statistics Sweden: Swedish National Data Service. Version 1.0 (1989)
22. Statistics Sweden: Swedish National Data Service. Version 1.0 (1992)
23. Statistics Sweden: Swedish National Data Service. Version 1.0 (1996)
24. Statistics Sweden: Swedish National Data Service. Version 1.0 (1999)
25. Statistics Sweden: Swedish National Data Service. Version 1.0 (2017)
26. Statistics Sweden: Swedish National Data Service. Version 1.0 (2018)
Graph Comparison and Artificial Models for Simulating Real Criminal Networks
Lucia Cavallaro1(B), Annamaria Ficara2, Francesco Curreri2, Giacomo Fiumara3, Pasquale De Meo4, Ovidiu Bagdasar1, and Antonio Liotta5
1 School of Computing and Engineering, University of Derby, Kedleston Road, Derby DE22 1GB, UK {l.cavallaro,o.bagdasar}@derby.ac.uk
2 DMI Department, University of Palermo, via Archirafi 34, 90123 Palermo, Italy {aficara,fcurreri}@unime.it
3 MIFT Department, University of Messina, Viale Ferdinando Stagno d'Alcontres 31, 98166 Messina, Italy [email protected]
4 DICAM Department, University of Messina, Viale Giovanni Palatuci 13, 98168 Messina, Italy [email protected]
5 Faculty of Computer Science, Free University of Bozen-Bolzano, Piazza Domenicani 3, 39100 Bolzano, Italy [email protected]
Abstract. Network Science is an active research field, with numerous applications in areas like computer science, economics, or sociology. Criminal networks, in particular, possess specific topologies which allow them to exhibit strong resilience to disruption. Starting from a dataset related to meetings between members of a Mafia organization which operated in Sicily during the 2000s, we here aim to create artificial models with similar properties. To this end, we use specific tools of Social Network Analysis, including network models (with Barabási-Albert identified as the most promising) and metrics which allow us to quantify the similarity between two networks. To the best of our knowledge, the DeltaCon and spectral distances have never been applied in this context. The construction of artificial but realistic models can be a very useful tool for Law Enforcement Agencies, who could reconstruct and simulate the evolution and structure of criminal networks based on the information available.
Keywords: Criminal networks · Complex networks · Social network analysis · Graph theory · Graph comparison · Graph similarity · Graph matching
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 286–297, 2021. https://doi.org/10.1007/978-3-030-65351-4_23
Criminal organizations [16] often profit from providing illicit goods and services in public demand, or by offering legal goods and services in an illicit manner. One of the most renowned criminal organisations (i.e., clans, gangs, syndicates) is the
Sicilian Mafia. This organisation was analysed in Gambetta’s classic work on its economics and dynamics [17], where it is referred to as the original “Mafia”. In a more recent work [23], Letizia Paoli provided a clinically accurate portrait of mafia behavior, motivations, and structure in Italy, relying on previously undisclosed confessions of former mafia members now cooperating with the police. The analysis of the Sicilian Mafia syndicates social structure generated great scientific interest [20]. Currently, scholars and practitioners alike are increasingly adopting a network science perspective to explore criminal phenomena [6]. Social Network Analysis (SNA) has emerged as an important component in the study of criminal networks and in criminal intelligence analysis. This tool is used to describe the structure and functioning of a criminal organisation, to construct crime prevention systems [5] or to identify leaders within a criminal organisation [19]. Indeed, some studies had the unique opportunity to examine real datasets and to use the data sources to build networks and to examine them by means of classical SNA tools [5,8,11,15,25,26,30]. Law Enforcement Agencies (LEAs) increasingly employ SNA in the study of criminal networks, as well as to analyse the relations amongst criminals based on calls, meetings and other events derived from investigations [1,14,15]. When dealing with practical networks, missing data may refer to nodes and/or edges. Often, criminal networks are incomplete, incorrect, and inconsistent, either due to deliberate deception on the part of criminals, or to limited resources or unintentional errors by LEAs [1,4,5,9,14]. SNA is also used to evaluate LEA interventions aimed at dismantling and disrupting criminal networks [8,11]. Another interesting application of SNA and graph theory is to develop random graph models which mimic the structure and behaviour of real criminal networks. 
Indeed, even if the growth mechanism of this criminal network remains largely unknown, growth and preferential attachment mechanisms are most probably at the core of the affiliation process. In this respect, comparing an artificial model network to a real network is not only plausible, but also fruitful in terms of useful insights about the structure and behaviour of the real network. The growth of available data and of the number of network models [22,24,28] has led researchers to face the problem of comparing networks, i.e., finding and quantifying similarities and differences between them. Network comparison requires measures for the distance between graphs [29]. This is a non-trivial task, which involves a set of features that are often sensitive to the specific application domain, such as the effectiveness of the results, their interpretability, and the computational efficiency. There is some debate about the weakness of spectral comparison techniques, principally due to cospectrality issues, but there is evidence that the fraction of cospectral graphs is 21% for networks composed of 10 nodes and is lower for 11 nodes [34]. We may, therefore, expect that cospectrality becomes negligible for larger graphs. Granted the reliability of these techniques, we selected the simplest yet effective ones among the various metrics. The literature on this topic is abundant, but the classification of the best methods for specific situations (including the comparison of real-world networks) remains an open field. A few critical reviews of the literature on this subject have already
been compiled [10,12,27]. Wills and Meyer [33] compared commonly used graph metrics and distance measures, and demonstrated their ability to discern between common topological features found in both random graph models and real-world networks. They put forward a multi-scale picture of graph structure wherein they studied the effect of global and local structures on changes in distance measures. The number of useful graph comparison techniques [2] drastically reduces when one requires an algorithm which runs in reasonable time on large graphs. In recent years, many random graph models emulating features of real-world graphs [3,31] have been developed. An accurate probabilistic study of the application of graph distances to these random models is difficult, as they are often defined in terms of their generative process. For this reason, most researchers restrict their attention to very simple random models such as that of Erdős and Rényi [13]. Even so, rigorous probabilistic analysis can be difficult. A possible solution is the one proposed by Wills and Meyer [33], namely a numerical approach where a sample is taken from random graph distributions and the empirical performance of various distance measures is observed. Despite the growing scholarly attention to network comparison, to the best of our knowledge, there is no previous research aiming to identify the best measures for the distance between graphs related to real criminal networks. Filling this gap is a first step towards comparing and generating artificial networks which mirror the topology and functionality of real criminal networks. More importantly, LEAs could considerably benefit from such a discovery. A surrogate network on which to conduct their investigations could help predict the evolution of new connections between criminals or, conversely, indicate how to break those links by arresting one (or more) of the suspects, based on the network topology.
To this end, we borrow some of the distance techniques proposed by [33]. We first generate data using popular artificial network models like Erdős-Rényi (ER), Watts-Strogatz [31] (WS), and different configurations of Barabási-Albert (BA) [3]. These are compared against a real criminal network dataset, the Meetings network from our earlier works [5,8,15], which is publicly available on Zenodo [7]. It captures the physical meetings among suspects in an anti-mafia investigation called "Montagna Operation", concluded in 2007 by the Public Prosecutor's Office of Messina (Sicily).
2 Materials and Methods
This section presents the standard definitions used in this work, briefly describes the real dataset against which the artificial networks are compared, and outlines the method followed in the experiments.
2.1 Background
In this paper we deal with unweighted undirected graphs. An unweighted graph $G = \langle N, E \rangle$ consists of a finite set $N$ of $n$ nodes (also called vertices or actors) and a set $E \subseteq N \times N$ of $m$ edges (or links/ties). A graph is undirected when all the edges between nodes are bidirectional, as opposed to a directed graph, where each edge points in one direction. The adjacency matrix of a graph $G$ defined over the set of nodes $N = \{1, \ldots, n\}$ is an $n \times n$ square matrix denoted by $A = (a_{ij})$, $1 \le i, j \le n$, where $a_{ij} = 1$ if there exists an edge joining vertices $i$ and $j$, and $a_{ij} = 0$ otherwise. In the case of an undirected graph, the adjacency matrix is symmetric, i.e., $a_{ij} = a_{ji}$. This matrix, along with the Laplacian and Normalized Laplacian matrices, is among the most common representation matrices for a graph. The spectrum of a graph consists of the set of sorted (increasing or decreasing) eigenvalues of one of its representation matrices. It is used to characterise graph properties and extract information from its structure. The spectra derived from each representation matrix may reveal different properties of the graph. The largest eigenvalue (in absolute value) of the graph is called the graph's spectral radius. In the case of the adjacency matrix $A$, if $\lambda_k$ is its $k$-th eigenvalue, the spectrum is given in descending order as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. The spectral distance [34] between two graphs $G$ and $G'$ of size $n$ is the Euclidean distance between their spectra, i.e., the sets of eigenvalues $\lambda_i$ and $\lambda'_i$ (according to the chosen representation matrix). In the case of the adjacency matrix, the Adjacency Spectral Distance is
$$d(G, G') = \sqrt{\sum_{i=1}^{n} (\lambda_i - \lambda'_i)^2}. \tag{1}$$
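Equation (1) can be sketched in a few lines with NumPy. This is only an illustrative sketch, not the authors' code: the function names are hypothetical, and the handling of graphs of different sizes (comparing only the k largest eigenvalues, with k defaulting to the size of the smaller graph) is an assumption.

```python
import numpy as np


def adjacency_spectrum(adj):
    """Eigenvalues of a symmetric adjacency matrix, largest first."""
    return np.linalg.eigvalsh(np.asarray(adj, dtype=float))[::-1]


def adjacency_spectral_distance(adj_a, adj_b, k=None):
    """Euclidean distance between the k largest adjacency eigenvalues."""
    spec_a = adjacency_spectrum(adj_a)
    spec_b = adjacency_spectrum(adj_b)
    if k is None:
        # Assumption: use as many eigenvalues as the smaller graph has.
        k = min(len(spec_a), len(spec_b))
    return float(np.linalg.norm(spec_a[:k] - spec_b[:k]))


# Toy graphs: a triangle, a three-node path and a single edge.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
edge = [[0, 1], [1, 0]]

d_triangle_path = adjacency_spectral_distance(triangle, path)
d_edge_triangle = adjacency_spectral_distance(edge, triangle)
```

The triangle has adjacency eigenvalues 2, −1, −1 and the path has √2, 0, −√2, so the distance between them follows directly from Eq. (1).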
If the two spectra have different sizes, the spectrum of the smaller graph (of size $k \le n$) is brought to the same cardinality as the other by adding zero values to it. In such a case, only the first $k$ eigenvalues are compared, which for the Adjacency Spectral Distance $d$ are the largest $k$ eigenvalues. Comparing the largest eigenvalues places the focus on global features. Another class of graph distances is the matrix distance [33]. A matrix of pairwise distances $\delta(v, w)$ between graph nodes is constructed for each graph, where $\delta(v, w)$ is the length of the shortest path connecting the nodes $v$ and $w$. Such matrices provide a signature of each graph's characteristics and carry important structural information. Given two graphs defined on the same set of nodes, their respective matrices of pairwise distances are built and then the distance between the two matrices is computed with any of the many available norms. In this work we adopt the DeltaCon distance. This matrix distance method is based on the Matsusita difference (also called root Euclidean distance) $d_{rootED}(G, G')$ between matrices $S$ and $S'$, created from the fast belief propagation method of measuring node affinities [21]. The fast belief propagation matrix is defined as $S = [I + \varepsilon^2 D - \varepsilon A]^{-1}$, where $I$ is the identity matrix, $D$ is the degree matrix, namely a diagonal matrix whose elements are defined as $d_{ii} = k_i$, $k_i$ being the degree of the $i$-th node, $A$ is the adjacency matrix and $\varepsilon = 1/(1 + \max_i d_{ii})$ [21]. The DeltaCon similarity, with values in the interval [0, 1], is introduced as
$$\mathrm{sim}_{DC}(G, G') = \frac{1}{1 + d_{rootED}(G, G')}, \tag{2}$$
where $d_{rootED}(G, G')$ is given by
$$d_{rootED}(G, G') = \sqrt{\sum_{i,j} \left(\sqrt{S_{i,j}} - \sqrt{S'_{i,j}}\right)^2}. \tag{3}$$
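The DeltaCon quantities in Eqs. (2) and (3) can be sketched with NumPy as follows. This is a minimal dense-matrix sketch under the assumption that both graphs are given as adjacency matrices over the same node set; the function names are hypothetical and no attempt is made at the scalable approximations of [21].

```python
import numpy as np


def fabp_affinity(adj):
    """Fast belief propagation matrix S = [I + eps^2 D - eps A]^(-1)."""
    adj = np.asarray(adj, dtype=float)
    degrees = adj.sum(axis=1)
    eps = 1.0 / (1.0 + degrees.max())
    n = adj.shape[0]
    return np.linalg.inv(np.eye(n) + eps**2 * np.diag(degrees) - eps * adj)


def deltacon(adj_a, adj_b):
    """Return (d_rootED, sim_DC) for two graphs on the same node set."""
    s_a, s_b = fabp_affinity(adj_a), fabp_affinity(adj_b)
    # Affinities are non-negative in exact arithmetic; clipping only
    # guards against tiny negative values caused by rounding.
    root_a = np.sqrt(np.clip(s_a, 0.0, None))
    root_b = np.sqrt(np.clip(s_b, 0.0, None))
    d_root_ed = float(np.sqrt(((root_a - root_b) ** 2).sum()))
    return d_root_ed, 1.0 / (1.0 + d_root_ed)


# Toy graphs: a four-node path and the cycle obtained by closing it.
path4 = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
cycle4 = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

distance, similarity = deltacon(path4, cycle4)
```

Comparing a graph with itself gives a distance of 0 and a similarity of 1, while any structural change yields a similarity strictly between 0 and 1.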
The Matsusita difference is used instead of the classical Euclidean distance because, unlike the latter, it detects even small changes in the graphs. Random network theory emulates the irregularity and unpredictability of real networks by constructing from scratch, and characterizing, graphs that are truly random. Some of the most popular random network models are Erdős-Rényi (ER), Watts-Strogatz (WS) and Barabási-Albert (BA). According to the ER model [13], a network is generated by first laying down a number n of isolated nodes. Then each pair of nodes is selected and a random number in the interval [0, 1] is generated. If the generated number is smaller than a chosen probability p, the selected nodes are connected; otherwise they are left disconnected. The procedure is performed for all the n(n − 1)/2 pairs of nodes. This is the simplest model, also known as the G(n, p) model [18]. A closely related variant is the G(n, M) model, where n labeled nodes are connected with M randomly placed links; this is the model we used to conduct our experiments. Even though it is unlikely that real social networks form like this, such models can predict a number of different properties [13]. While the ER model may exhibit a small clustering coefficient along with a small average shortest path length, the WS model [31] can produce graphs with small-world properties, which are highly clustered but with small characteristic path lengths. Most nodes are not neighbors, but the neighbors of a node are likely to be connected and most nodes can be reached from every other one by a small number of steps (also called the six degrees of separation property) [31]. In a small-world network, the distance L in steps between two randomly chosen nodes grows proportionally to the logarithm of the number of nodes n: L ∝ log(n). The model is constructed as follows. Starting from a ring of nodes, each node is connected to its previous and next neighbours.
Each link is then rewired with probability p to a randomly chosen node. For small values of p, the network maintains high clustering but the random long-range links can drastically decrease the distances between the nodes. When p = 1, all links are rewired, so the network turns into a random ER network [31]. The BA model [3] exploits a preferential attachment mechanism to develop a scale-free network, i.e., one whose degree distribution follows a power law. The algorithm starts from a network with $m_0$ nodes, whose links are chosen arbitrarily, as long as each node has at least one link. At each step, a new node with $m \le m_0$ links is added. The preferential attachment ensures that the probability $p_i$ that the new node is connected to a node $i$ depends on the degree $d_i$ of the latter as follows:
$$p_i = \frac{d_i}{\sum_j d_j}. \tag{4}$$
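The preferential attachment rule of Eq. (4) can be illustrated with a small standard-library sketch. The seed network (a star on m + 1 nodes) and the function names are assumptions made for illustration; library implementations such as NetworkX may differ in such details.

```python
import random


def attachment_probabilities(degrees):
    """Eq. (4): probability of attaching to node i is d_i / sum_j d_j."""
    total = sum(degrees.values())
    return {node: d / total for node, d in degrees.items()}


def barabasi_albert_edges(n, m, seed=None):
    """Grow a BA-style edge list with n nodes and m links per new node.

    Assumption: the seed network is a star on m + 1 nodes, so every
    initial node has at least one link.
    """
    rng = random.Random(seed)
    edges = [(0, j) for j in range(1, m + 1)]
    # Each node appears in `stubs` once per incident edge, so a uniform
    # draw from this list selects a node proportionally to its degree.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # redraw until m distinct targets
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.append((new, target))
            stubs.extend((new, target))
    return edges


probs = attachment_probabilities({0: 3, 1: 1, 2: 1, 3: 1})
ba2_edges = barabasi_albert_edges(101, 2, seed=1)
ba3_edges = barabasi_albert_edges(101, 3, seed=1)
```

In the toy degree dictionary the hub (node 0) holds half of all edge end-points and therefore receives half of the attachment probability. With this seed network the final number of edges is m(n − m) for every run.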
So the new node prefers to attach itself to already heavily linked nodes, called hubs, that tend to accumulate even more links at each step, while nodes with only few links are unlikely to be chosen [3].
2.2 Dataset
Our dataset is available on Zenodo [7] and was discussed in detail in our earlier studies [5,8,15]. It is derived from the pre-trial detention order issued by the Court of Messina's preliminary investigation judge on March 14, 2007, towards the end of the major anti-mafia effort referred to as the "Montagna Operation". The operation was concluded in 2007 by the Public Prosecutor's Office of Messina (Sicily) and conducted by the Special Operations Unit of the Italian Police (the R.O.S., Reparto Operativo Speciale of the Carabinieri, which specialises in anti-Mafia investigations). This prominent operation focused on two Mafia clans, known as the "Mistretta" family and the "Batanesi clan". From 2003 to 2007, these families were found to have infiltrated several economic activities, including major infrastructure works, through a cartel of entrepreneurs close to the Sicilian Mafia. We created two networks, capturing phone calls and physical meetings, respectively. Herein, we focus on the Meetings network, which accounts for the physical meetings among suspects (observed during police stakeouts) and is composed of 101 nodes and 256 edges.
2.3 Methodology
The main question we address in this paper is how well an artificial network can capture the features of a real network. To this end, we computed the simDC similarity and the drootED distance (see Eqs. 2 and 3). We compared the Meetings dataset with three network models (ER, WS and BA), the last in several configurations (BA2, BA3, and EBA), for a total of five artificial networks. We chose BA because in [15] we found that the criminal network under scrutiny follows a scale-free power law. Furthermore, although it is not the main focus of this study, there are reasons to believe that criminals follow specific criteria for recruiting new affiliates (growth and preferential attachment dynamics) [32]; however, this behaviour cannot be identified from a single network snapshot such as the real network investigated here. Moreover, in order to have a yardstick, we also selected the ER and WS models. In particular, WS is not a totally unrealistic model because it is characterised by a short diameter and short distances between nodes. The models were created with the NetworkX library, and the source code was developed in Python. Table 1 summarises the required input parameters and the values we assigned to them. The number of nodes n is defined a priori in all the models considered, whereas the number of edges m is set only in the ER model. In WS, k represents the average degree. It has been set equal to 6 in order to obtain a final configuration as close as possible to the real criminal network in terms of the total number of links. The same has been done for the input parameters of all the BA models chosen herein.
Indeed, three different flavours have been selected: BA2 and BA3, in which the number of edges added at each iteration, m_i, is equal to two and three, respectively, and the extended BA version (EBA), in which two more parameters are required: (i) p, the probability that m already existing pairs of nodes may be connected by a link, and (ii) q, the probability that an already existing link may be rewired, so that, instead of creating a new link, an old one is reconnected between another pair of nodes. However, we set q = 0 to avoid injecting more randomness into the network building process.
Table 1. Artificial models configurations.
Model  Parameters
ER     n = 101, m = 256
WS     n = 101, k = 6, p = 0.6
BA2    n = 101, m_i = 2
BA3    n = 101, m_i = 3
EBA    n = 101, m = 2, p = 0.225, q = 0
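The configurations of Table 1 map naturally onto NetworkX generator functions. The sketch below shows one plausible way of instantiating them; the choice of these particular generators, the helper name `build_models` and the seed value are assumptions, since the original source code is not reproduced here.

```python
import networkx as nx

N_NODES = 101  # number of nodes in the Meetings network


def build_models(seed=None):
    """Instantiate the five artificial models with the Table 1 parameters."""
    return {
        # G(n, M): 256 links placed uniformly at random.
        "ER": nx.gnm_random_graph(N_NODES, 256, seed=seed),
        # Ring with k = 6 neighbours per node, rewiring probability 0.6.
        "WS": nx.watts_strogatz_graph(N_NODES, 6, 0.6, seed=seed),
        # Preferential attachment with 2 or 3 links per new node.
        "BA2": nx.barabasi_albert_graph(N_NODES, 2, seed=seed),
        "BA3": nx.barabasi_albert_graph(N_NODES, 3, seed=seed),
        # Extended BA: m = 2, p = 0.225, q = 0 (no rewiring).
        "EBA": nx.extended_barabasi_albert_graph(
            N_NODES, 2, 0.225, 0, seed=seed
        ),
    }


models = build_models(seed=42)
edge_counts = {name: g.number_of_edges() for name, g in models.items()}
```

With these generators the ER, WS, BA2 and BA3 edge counts are fixed by the parameters (256, n·k/2 = 303, and m_i(n − m_i) = 198 and 294, respectively), whereas the EBA edge count varies from run to run because edges between existing nodes are added at random steps.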
Afterwards, we first computed the DeltaCon distance, as it can be used to compare two graphs with different numbers of nodes and/or edges. Unfortunately, the results this metric provided did not allow us to determine which network model is closest to the real network. For this reason, we also computed the spectral distance using the adjacency matrix A, which clearly identified the BA models as the best at capturing the features of the real network among those analysed here. The last refinement concerned the number of edges: as BA2 and BA3 produced networks with a number of edges different from that of the real network, we decided to investigate whether the spectral distance could be reduced by increasing (resp., decreasing) the number of edges of BA2 (resp., BA3). The experiments consisted of adding an edge to the BA2 network (resp., removing an edge from the BA3 network) and computing the spectral distance. This procedure ends when the number of edges reaches the number of edges in the real network. We devised two strategies to add (resp., remove) edges: (i) the preferential attachment selection, according to which the edge is created (removed) between the most attractive nodes, and (ii) the random selection, in which the pair of nodes is selected in a purely random way. In order to have statistically sound results, 1000 artificial networks of each type (ER, WS, EBA, BA2, BA3) have been produced, from which the average values have been computed.
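The edge-addition experiment can be sketched as follows using only the standard library. The text does not specify how "the most attractive nodes" are chosen, so this sketch assumes that a non-adjacent pair is drawn with probability proportional to the product of its end-point degrees; the function name `add_edges` and the toy graph are hypothetical. In the actual experiment the spectral distance would be recomputed after each added edge.

```python
import random
from itertools import combinations


def add_edges(edges, n, target, strategy="random", seed=None):
    """Add edges one at a time until `target` edges exist.

    strategy="random": non-adjacent pairs are chosen uniformly.
    strategy="preferential": pairs are weighted by the product of their
    end-point degrees (an assumption, see the lead-in).
    Returns the list of added edges in insertion order.
    """
    rng = random.Random(seed)
    present = {frozenset(e) for e in edges}
    degree = [0] * n
    for e in present:
        for node in e:
            degree[node] += 1
    added = []
    while len(present) < target:
        candidates = [pair for pair in combinations(range(n), 2)
                      if frozenset(pair) not in present]
        if not candidates:
            raise ValueError("graph is already complete")
        weights = None  # None means a uniform choice
        if strategy == "preferential":
            weights = [degree[u] * degree[v] for u, v in candidates]
            if sum(weights) == 0:
                weights = None  # fall back if all candidates are isolated
        u, v = rng.choices(candidates, weights=weights)[0]
        present.add(frozenset((u, v)))
        degree[u] += 1
        degree[v] += 1
        added.append((u, v))
    return added


# Toy example: grow a six-node path (5 edges) to 9 edges.
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
added_random = add_edges(path_edges, 6, 9, strategy="random", seed=0)
added_pref = add_edges(path_edges, 6, 9, strategy="preferential", seed=0)
```

Edge removal for the BA3 case would mirror this loop, selecting among existing edges instead of non-adjacent pairs.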
3 Results
This section shows the main findings obtained from our comparative investigation between artificial and real networks. As stated in Sect. 2.3, our study starts from the computation of S, the fast belief propagation matrix required to compute the Matsusita difference, whose outcomes are discussed in Sect. 3.1. Next, the spectral distance previously described is discussed in Sect. 3.2.
3.1 Matrix Distance
The search for an artificial network that closely mirrors the topology of the Meetings network begins with the DeltaCon distance. As shown in Table 2, the largest distances emerge for the ER and WS models, whereas the BA variants differ only slightly from one another, and no distance stands out among them. Thus, this metric is insufficient on its own to point out a model with significantly better performance in emulating a criminal network topology. Nor do the values of simDC allow us to conclude which artificial model network is more similar to the real network. As expected, the values of simDC lie in the interval [0, 1], and there is little to no difference among the various artificial model networks. It could be interesting to investigate the similarity among them, but this lies outside the scope of this work. For these reasons we have opted to also use the spectral distance.
Table 2. DeltaCon distance and similarity between the Meetings dataset and the artificial models.
Model  m    Dist. S      simDC
ER     256  2.2 ± 0.2    0.317
WS     202  2.5 ± 0.2    0.287
EBA    246  1.31 ± 0.08  0.433
BA2    198  1.28 ± 0.08  0.438
BA3    294  1.27 ± 0.07  0.441
3.2 Spectral Distance
The spectral distances are computed for the adjacency matrix A. Table 3 confirms that ER and WS perform worst; however, we still cannot identify the best BA configuration, because the significant error ranges lead to overlapping outcomes across all BA tests. Thus, we have adjusted these configurations by adding (resp., deleting) links in the BA2 (resp., BA3) model following two options: first, choosing the pair of nodes through preferential attachment (resp., detachment); second, picking at random the pairs between which links are added (resp., deleted). The resulting plot in Fig. 1 suggests that adjusting the number of edges reduces the distance, without any BA configuration emerging as preferable.
Table 3. Degree distribution variation between the Meetings dataset and the artificial models computed by the spectral distance.
Model  m    Dist. A
ER     256  8.4 ± 0.2
WS     303  9.2 ± 0.2
EBA    255  6.6 ± 0.2
BA2    198  6.9 ± 0.2
BA3    294  7.1 ± 0.2
Fig. 1. Spectral distances evolution during the addition/deletion of edges. Upper subplot: spectral distance of the adjacency matrix using the preferential attachment-based selection of edges; Lower subplot: same distance using a random selection of edges.
4 Discussion and Conclusions
This paper paves the way to a new branch of criminal network analysis by providing a new perspective on how SNA methods can help LEAs. We applied tools
from Network Science and Graph Theory to a real criminal network dataset with the aim of discovering new ways to use artificial networks in police investigations. The idea is to replicate the topology of real criminal networks through classical models widely used in the state of the art in several domains. Consequently, we computed two distance metrics on different artificial models to find the one which best reproduces the features of a real criminal network topology. To do so, we started by computing the DeltaCon distance on the ER, WS, and BA models. This metric is independent of the graphs' size and, in our experiments, it only suggested which model(s) could be discarded, which led us to compute the spectral distance. However, even with this metric only small differences emerge, because the error ranges of some of the models overlap. Hence, we adjusted the number of edges of the artificial networks in order to match the size of the real criminal network used as a litmus test. The link selection criterion was twofold: first, we selected the pairs of nodes according to preferential attachment (resp., detachment), which takes the nodes' degree into account; second, we opted for a random choice of the links to be added (resp., removed). So far, the conclusion is that the BA model proved to be the closest to the real dataset among the models investigated here. Performance was not significantly affected by its construction. The results obtained suggest pathways to new scenarios and applications; indeed, the use of an artificial model may significantly help LEAs. Starting from the investigation data (even though affected by noise or missing information), it could be possible in the future to create a substitute model that replicates, closely enough, the criminal network under scrutiny.
Such a model could help investigators decide how to allocate their resources efficiently (i.e., police officers, patrols, etc.): on the one hand, the artificial model could predict (and help prevent) the creation of relationship ties between criminals; on the other hand, LEAs could quickly intervene to break the links among them (when already present) by arresting one or more of the suspects. As future work, we wish to extend these tests to different spectral distance configurations (such as choosing the Laplacian rather than the adjacency matrix, as the latter appears to be the weakest among the matrix representations of a graph [34]) and to include both of the real criminal networks we modelled (i.e., Meetings and Phone Calls), as they are complementary to each other and a joint analysis may offer a better view of the overall interconnections. As those networks are weighted, we would like to discover whether and how weights influence the performance obtained by the artificial models investigated here. Another interesting point is to address a further open question, namely how to compare two different real criminal networks through SNA in order to unveil whether there are analogies between them despite their different sizes.
References
1. Agreste, S., Catanese, S., De Meo, P., Ferrara, E., Fiumara, G.: Network structure and resilience of Mafia syndicates. Inf. Sci. 351, 30–47 (2016). https://doi.org/10.1016/j.ins.2016.02.027
2. Akoglu, L., Tong, H., Koutra, D.: Graph based anomaly detection and description: a survey. Data Min. Knowl. Discov. 29(3), 626–688 (2015). https://doi.org/10.1007/s10618-014-0365-y
3. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999). https://doi.org/10.1126/science.286.5439.509
4. Berlusconi, G., Calderoni, F., Parolini, N., Verani, M., Piccardi, C.: Link prediction in criminal networks: a tool for criminal intelligence analysis. PLoS ONE 11(4), 1–21 (2016). https://doi.org/10.1371/journal.pone.0154244
5. Calderoni, F., Catanese, S., De Meo, P., Ficara, A., Fiumara, G.: Robust link prediction in criminal networks: a case study of the Sicilian Mafia. Expert Syst. Appl. 161, 113666 (2020). https://doi.org/10.1016/j.eswa.2020.113666
6. Campana, P.: Explaining criminal networks: strategies and potential pitfalls. Methodological Innov. 9, 2059799115622748 (2016). https://doi.org/10.1177/2059799115622748
7. Cavallaro, L., Ficara, A., De Meo, P., Fiumara, G., Catanese, S., Bagdasar, O., Song, W., Liotta, A.: Criminal Network: the Sicilian Mafia. “Montagna Operation” (2020). https://doi.org/10.5281/zenodo.3938818
8. Cavallaro, L., Ficara, A., De Meo, P., Fiumara, G., Catanese, S., Bagdasar, O., Song, W., Liotta, A.: Disrupting resilient criminal networks through data analysis: the case of Sicilian Mafia. PLoS ONE 15(8), 1–22 (2020). https://doi.org/10.1371/journal.pone.0236476
9. De Moor, S., Vandeviver, C., Vander Beken, T.: Assessing the missing data problem in criminal network analysis using forensic DNA data. Soc. Netw. 61, 99–106 (2020). https://doi.org/10.1016/j.socnet.2019.09.003
10. Donnat, C., Holmes, S.: Tracking network dynamics: a survey using graph distances. Ann. Appl. Stat. 12(2), 971–1012 (2018). https://doi.org/10.1214/18-AOAS1176
11. Duijn, P.A.C., Kashirin, V., Sloot, P.M.A.: The relative ineffectiveness of criminal network disruption. Sci. Rep. 4(1), 4238 (2014). https://doi.org/10.1038/srep04238
12. Emmert-Streib, F., Dehmer, M., Shi, Y.: Fifty years of graph matching, network alignment and network comparison. Inf. Sci. 346–347, 180–197 (2016). https://doi.org/10.1016/j.ins.2016.01.074
13. Erdős, P., Rényi, A.: On random graphs I. Publicationes Mathematicae 6, 290–297 (1959)
14. Ferrara, E., De Meo, P., Catanese, S., Fiumara, G.: Detecting criminal organizations in mobile phone networks. Expert Syst. Appl. 41(13), 5733–5750 (2014). https://doi.org/10.1016/j.eswa.2014.03.024
15. Ficara, A., Cavallaro, L., De Meo, P., Fiumara, G., Catanese, S., Bagdasar, O., Liotta, A.: Social network analysis of Sicilian Mafia interconnections. In: Complex Networks and Their Applications VIII, pp. 440–450. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-36683-4_36
16. Finckenauer, J.O.: Problems of definition: what is organized crime? Trends Organized Crime 8(3), 63–83 (2005). https://doi.org/10.1007/s12117-005-1038-4
17. Gambetta, D.: The Sicilian Mafia: The Business of Private Protection. Harvard University Press, Cambridge (1996)
18. Gilbert, E.N.: Random graphs. Ann. Math. Stat. 30(4), 1141–1144 (1959). https://doi.org/10.1214/aoms/1177706098
19. Johnsen, J.W., Franke, K.: Identifying central individuals in organised criminal groups and underground marketplaces. In: Shi, Y., Fu, H., Tian, Y., Krzhizhanovskaya, V.V., Lees, M.H., Dongarra, J., Sloot, P.M.A. (eds.) Computational Science – ICCS 2018, pp. 379–386. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-93713-7_31
20. Kleemans, E.R., de Poot, C.J.: Criminal careers in organized crime and social opportunity structure. Eur. J. Criminol. 5(1), 69–98 (2008). https://doi.org/10.1177/1477370807084225
21. Koutra, D., Vogelstein, J.T., Faloutsos, C.: DeltaCon: A principled massive-graph similarity function. In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 162–170 (2013). https://doi.org/10.1137/1.9781611972832.18
22. Newman, M.E.J.: Estimating network structure from unreliable measurements. Phys. Rev. E 98, 062321 (2018). https://doi.org/10.1103/PhysRevE.98.062321
23. Paoli, L.: Mafia brotherhoods: organized crime, Italian style. Oxford University Press, Oxford Scholarship Online (2008). https://doi.org/10.1093/acprof:oso/9780195157246.001.0001
24. Peixoto, T.P.: Reconstructing networks with unknown and heterogeneous errors. Phys. Rev. X 8, 041011 (2018). https://doi.org/10.1103/PhysRevX.8.041011
25. Robinson, D., Scogings, C.: The detection of criminal groups in real-world fused data: using the graph-mining algorithm “GraphExtract”. Secur. Inform. 7(1), 2 (2018). https://doi.org/10.1186/s13388-018-0031-9
26. Rostami, A., Mondani, H.: The complexity of crime network data: a case study of its consequences for crime control and the study of networks. PLoS ONE 10(3), 1–20 (2015). https://doi.org/10.1371/journal.pone.0119309
27. Soundarajan, S., Eliassi-Rad, T., Gallagher, B.: A guide to selecting a network similarity method. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 1037–1045 (2014). https://doi.org/10.1137/1.9781611973440.118
28. Squartini, T., Mastrandrea, R., Garlaschelli, D.: Unbiased sampling of network ensembles. New J. Phys. 17(2), 023052 (2015). https://doi.org/10.1088/1367-2630/17/2/023052
29. Tantardini, M., Ieva, F., Tajoli, L., Piccardi, C.: Comparing methods for comparing networks. Sci. Rep. 9(1), 17557 (2019). https://doi.org/10.1038/s41598-019-53708-y
30. Villani, S., Mosca, M., Castiello, M.: A virtuous combination of structural and skill analysis to defeat organized crime. Socio-Econ. Plann. Sci. 65(C), 51–65 (2019). https://doi.org/10.1016/j.seps.2018.01.002
31. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998). https://doi.org/10.1038/30918
32. Williams, P.: Transnational criminal networks. Netw. Netwars Future Terror, Crime, Militancy 1382, 61 (2001)
33. Wills, P., Meyer, F.G.: Metrics for graph comparison: a practitioner’s guide. PLoS ONE 15(2), 1–54 (2020). https://doi.org/10.1371/journal.pone.0228728
34. Wilson, R.C., Zhu, P.: A study of graph spectra for comparing graphs and trees. Pattern Recogn. 41(9), 2833–2841 (2008). https://doi.org/10.1016/j.patcog.2008.03.011
Extending DeGroot Opinion Formation for Signed Graphs and Minimizing Polarization

Inzamam Rahaman and Patrick Hosein
The University of the West Indies, St. Augustine, Trinidad and Tobago
Abstract. Signed graphs offer a richer representation of social networks than unsigned graphs. Most opinion formation models are developed for unsigned graphs. In this paper, we extend DeGrootian opinion dynamics to accommodate signed graphs. Furthermore, we define the task of minimizing polarization on a budget, through the lens of this DeGrootian model, as an optimization problem and provide numerical results that demonstrate a decrease in polarization.
Keywords: Signed graphs · Opinion formation · Optimization · Polarization

1 Introduction
Online social networks have become increasingly important facets of contemporary life, thereby making it paramount that we develop better theoretical models for how human behaviour shapes and is shaped by these online social networks. A better understanding will help us design better social networks and social network algorithms to stymie unhealthy outcomes. For instance, there has been a recent increase in the consumption of social media as a source of information about current events [41], and the media shared between users on social media platforms can have important effects [7] such as leading to echo chambers [13] or to a form of online “trench-warfare” between rival factions [26,28]. Consequently, understanding opinion formation has become increasingly crucial.

Most graphical opinion formation models are defined upon unsigned graphs. However, signed graphs can, in some instances, offer a richer picture of human relationships [43]. Signed graphs capture not only positive relationships and interactions but antagonistic relationships and adversarial interactions as well. In this way, signed graphs can be seen as generalizations of unsigned graphs. However, the differing valence on edges makes analyzing and developing algorithms around signed graphs more difficult, as conventional approaches presume the signals conveyed by edge presence to be more homogeneous [39]. By using signed graphs, we can incorporate not only agents' psychological tendency towards conformity, but also their tendency to behave in accord with the social boomerang theory [11]. To this end, we contribute the following: we develop an extension to DeGrootian opinion dynamics for directed signed networks, and we present an optimization problem for reducing polarization upon said opinion dynamics model.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 298–309, 2021. https://doi.org/10.1007/978-3-030-65351-4_24
2 Previous Work
There has been an abundance of work in formulating continuous opinion dynamics models. A continuous opinion model is a model in which an agent's opinion is represented by a single real number. Often, this opinion is bounded in some range. The field of formalizing opinion dynamics owes its genesis to the work of DeGroot [15]. DeGroot represented opinion formation as a social learning process [5] in which the opinion expressed by an agent is the weighted sum of their neighbours' opinions. Friedkin and Johnsen [18] then extended DeGroot's model to incorporate exogenous factors and their persistent effects on an agent's opinions. Most subsequent work in opinion dynamics adopts Friedkin and Johnsen's assumption of exogenous variables affecting opinion formation. For example, Bindel et al. [8] demonstrated that opinion formation can be interpreted through the prism of game theory. From this, they derive a stable equilibrium for a system in terms of the graph Laplacian. Parsegov et al. [38] analyzed the convergence and consensus properties of these DeGrootian models. Many other researchers have used Bindel et al.'s and Parsegov et al.'s results in their own work [32,36].

Some work has been done in minimizing polarization in social networks assuming a DeGrootian model [32,36]. Our work differs from Matakos et al. [32] in the DeGrootian model used and in the underlying mechanism of intervention; Matakos et al. focus on setting some agents to neutrality. Furthermore, our work differs from Musco et al. [36] in the definition of polarization used, the budgeting mechanism constraining perturbations, the range of perturbations used, and the spectrum of opinions assumed.

In terms of general opinion manipulation, Bauso et al. [6] and Dong et al. [17] have examined the problem of steering opinions towards a specified consensus. Hegselmann et al. [24] examined using a seed agent whose opinions can be changed “on the fly” to coax the overall system into a particular state. Garimella et al. [20] have investigated connecting disparate users in a graph as a means of reducing polarization. Gaitonde et al. [19] also examined how adversarial attacks can potentially be used to engender high levels of opinion variance in the network; they also characterised the spectral properties of the social graphs that are most susceptible to such adversarial attacks.

Some work has been done on extending opinion dynamics to signed settings. However, these results differ from ours in that they rely on a bounded confidence model [3], model opinion spread in terms of ODEs [34], or model opinion spread in an asynchronous setting [2,42].
3 Model

3.1 Agent Properties
We let our system contain $N \in \mathbb{Z}^+$ agents. Each agent is equipped with three psychological factors: a conformity factor, a consistency factor, and an enmity factor. The conformity factor describes the relative degree to which an agent prefers to stay aligned with their friends; the consistency factor describes the relative degree to which an agent prefers to remain true to their own opinion regardless of other agents' opinions; the enmity factor describes the relative degree to which an agent prefers to disagree with their enemies rather than agree with them. All three of the aforementioned psychological factors for an agent are contingent on the individual psychology of that agent [16,27]. An agent's individual psychology can be inferred from their behaviours in the system [21].

We denote the conformity factor, consistency factor, and enmity factor for agent $i$ by $\beta_i$, $\alpha_i$, and $\gamma_i$ respectively. Since these psychological factors are relative factors, we let $0 < \alpha_i, \beta_i, \gamma_i < 1$ and $\alpha_i + \beta_i + \gamma_i = 1$. Moreover, we describe these factors across the entire system by the vectors $\alpha$, $\beta$, $\gamma$, which are the consistency, conformity, and enmity factors across the system respectively. Note that $\alpha_i$, $\beta_i$, and $\gamma_i$ remain the consistency, conformity, and enmity factors for agent $i$. For notational convenience, if $M = \mathrm{diagm}(v)$ then $M$ is a diagonal matrix where $M_{i,i} = v_i$; using this notation, we define the matrices $A = \mathrm{diagm}(\alpha)$, $B = \mathrm{diagm}(\beta)$, and $\Gamma = \mathrm{diagm}(\gamma)$.

In addition, much like Friedkin and Johnsen [18] and attendant literature [8,32], we let every node $i$ possess an internal opinion $y_i$. Like Matakos et al. [32], we let $y_i \in [-1, 1]$, where an opinion of 0 indicates neutrality. These internal opinions can be assessed through a combination of both exogenous factors [22] such as cultural background [40], demographics [9], or neurobiology [33], and endogenous factors such as the content consumed from the social network [12]. Furthermore, we let the vector $y$ denote the internal opinions such that $y_i$ is the opinion of agent $i$.

An agent's expressed opinion can be different from its internal opinion. We suppose, like Friedkin and Johnsen [18] and others [8,32], that opinion expression occurs in discrete time intervals. The opinions expressed at time $t$ are denoted by the vector $z(t)$ such that $z_i(t)$ denotes the opinion expressed by agent $i$ at time $t$. At $t = 0$, each agent expresses its internal opinion; hence $z(0) = y$. The agent then observes the opinions of its neighbours and changes the opinion expressed in the subsequent time step. The opinion expressed is contingent on the behaviour an agent observes and its psychological factors. We explicate our model for this behaviour in Sect. 3.3.

3.2 Interaction Properties
A signed graph $G$ over $N$ agents can be described as a triple $G = (V, W^+, W^-)$, where $V$ is the set of nodes (each node represents an agent), $W^+ \in \mathbb{R}^{N \times N}$ is the weight matrix of positive relationships, and $W^- \in \mathbb{R}^{N \times N}$ is the weight matrix of negative relationships. Henceforth, the terms agent and node will be used interchangeably. All entries in $W^+$ are non-negative, and all entries in $W^-$ are non-positive. Furthermore, $W^+_{i,j} > 0 \implies W^-_{i,j} = 0$ and $W^-_{i,j} < 0 \implies W^+_{i,j} = 0$; in other words, node $i$ cannot simultaneously possess a positive and a negative relationship with another node $j$. In addition, $\forall i$, $W^+_{i,i} = 0$ and $W^-_{i,i} = 0$.

Let $r_i = \sum_{j=1}^{N} W^+_{i,j}$ and $s_i = \sum_{j=1}^{N} W^-_{i,j}$. We define the normalized positive and negative relationship matrices by $R_{i,j} = \frac{W^+_{i,j}}{r_i}$ and $S_{i,j} = \frac{W^-_{i,j}}{s_i}$ respectively; hence $R$ and $S$ denote the relative strengths of positive and negative relationships in the signed graph. Also note that while $W^-$ is non-positive, $S$ is non-negative.

3.3 Opinion Formation
Much like Bindel et al. [8], we assume an agent incurs a psycho-social cost that is a function of the opinion it decides to express at $t + 1$, the opinions expressed by its neighbours in time step $t$, and its internal opinion. We can express the psycho-social cost function of an agent in terms of its psychological factors and its relationships to its neighbours. The cost function for agent $i$ is given by

$$\mathrm{Cns}_i(z_i(t+1), y_i) = \alpha_i \left(y_i - z_i(t+1)\right)^2 \tag{1}$$

$$\mathrm{Cnf}_i(z_i(t+1), z(t)) = \beta_i \left(\sum_{j=1}^{N} R_{i,j}\left(z_j(t) - z_i(t+1)\right)\right)^2 \tag{2}$$

$$\mathrm{Enm}_i(z_i(t+1), z(t)) = \gamma_i \left(\sum_{j=1}^{N} S_{i,j}\left(-z_j(t) - z_i(t+1)\right)\right)^2 \tag{3}$$

$$c_i(z_i(t+1), y_i, z(t)) = \mathrm{Cns}_i(z_i(t+1), y_i) + \mathrm{Cnf}_i(z_i(t+1), z(t)) + \mathrm{Enm}_i(z_i(t+1), z(t)) \tag{4}$$
Equation (1) describes agent $i$'s cost incurred from deviating from its internal opinion. Equation (2) represents agent $i$'s cost of disagreeing with their friends: by social representation theory [35], an agent would typically value, at least to some extent, agreeing with their friends. Equation (3) describes agent $i$'s cost of agreeing with their enemies; under social boomerang theory [11], an agent would want to adopt the position opposite to that of their enemies and, as such, would like to minimize the distance between their opinion and the opposite (negation) of their enemies' position. By proceeding like Bindel et al. [8], we can derive an equation for the opinion we expect an agent $i$ to express at time $t + 1$. Like Bindel et al. [8], we assume that an agent will act to myopically minimize their cost. By determining the actions of individual agents at time $t + 1$, we can then determine the expressed opinion vector across the entire system for $t + 1$.
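The cost in Eqs. (1)–(4) can be made concrete with a small numerical sketch. The snippet below is an illustration only, not the authors' implementation: the function name `agent_cost`, the three-agent setup, and every number in it are hypothetical. It evaluates the cost of one agent over a grid of candidate opinions and picks the cheapest, which should agree with the closed-form myopic choice derived in Proposition 1.

```python
def agent_cost(z_next, y_i, alpha_i, beta_i, gamma_i, r_row, s_row, z_prev):
    """Psycho-social cost c_i of Eq. (4) for one agent expressing z_next."""
    consistency = alpha_i * (y_i - z_next) ** 2
    conformity = beta_i * sum(r * (z_j - z_next) for r, z_j in zip(r_row, z_prev)) ** 2
    enmity = gamma_i * sum(s * (-z_j - z_next) for s, z_j in zip(s_row, z_prev)) ** 2
    return consistency + conformity + enmity


# Hypothetical agent 0 in a 3-agent system: agent 1 is a friend, agent 2 an enemy.
y_0 = 0.2
alpha_0, beta_0, gamma_0 = 0.5, 0.3, 0.2   # relative factors, summing to 1
r_row = [0.0, 1.0, 0.0]                    # row 0 of R (sums to 1)
s_row = [0.0, 0.0, 1.0]                    # row 0 of S (sums to 1)
z_prev = [0.2, 0.8, 0.6]                   # expressed opinions z(t)

# Scan candidate opinions in [-1, 1] and keep the cheapest one.
grid = [k / 1000.0 for k in range(-1000, 1001)]
best = min(grid, key=lambda z: agent_cost(z, y_0, alpha_0, beta_0, gamma_0,
                                          r_row, s_row, z_prev))

# Closed-form myopic choice stated in Proposition 1, Eq. (10).
closed_form = alpha_0 * y_0 + sum(
    (beta_0 * r - gamma_0 * s) * z_j for r, s, z_j in zip(r_row, s_row, z_prev))
print(best, closed_form)
```

A grid scan is used here only because it makes no assumption about the shape of the cost; since the cost is a convex quadratic in the expressed opinion, the closed form is the more practical choice.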
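Proposition 1. Given the expressed opinion vector $z(t)$, the internal opinion vector $y$, the psychological factor matrices $A$, $B$, $\Gamma$ per Sect. 3.1, and assuming an agent will act to minimize their cost computed by Eq. (4), then

$$z(t+1) = Ay + Qz(t) \tag{5}$$

where $Q = (BR - \Gamma S)$.

Proof. Consider an agent $i$. Agent $i$ will choose $z_i(t+1)$ to minimize their cost. Hence they will choose $z_i(t+1)$ such that $\frac{\partial c_i}{\partial z_i(t+1)} = 0$. Since $\frac{\partial c_i}{\partial z_i(t+1)} = \frac{\partial \mathrm{Cns}_i}{\partial z_i(t+1)} + \frac{\partial \mathrm{Cnf}_i}{\partial z_i(t+1)} + \frac{\partial \mathrm{Enm}_i}{\partial z_i(t+1)}$, let us consider $\frac{\partial \mathrm{Cns}_i}{\partial z_i(t+1)}$, $\frac{\partial \mathrm{Cnf}_i}{\partial z_i(t+1)}$, and $\frac{\partial \mathrm{Enm}_i}{\partial z_i(t+1)}$. Recalling that each row of $R$ and of $S$ sums to 1, we have

$$\frac{\partial \mathrm{Cns}_i}{\partial z_i(t+1)} = -2\alpha_i \left(y_i - z_i(t+1)\right) = -2\alpha_i y_i + 2\alpha_i z_i(t+1) \tag{6}$$

$$\frac{\partial \mathrm{Cnf}_i}{\partial z_i(t+1)} = -2\beta_i \left(\sum_{j=1}^{N} R_{i,j} z_j(t)\right) + 2\beta_i \left(\sum_{j=1}^{N} R_{i,j} z_i(t+1)\right) \tag{7}$$

$$\frac{\partial \mathrm{Enm}_i}{\partial z_i(t+1)} = 2\gamma_i \left(\sum_{j=1}^{N} S_{i,j} z_j(t)\right) + 2\gamma_i \left(\sum_{j=1}^{N} S_{i,j} z_i(t+1)\right) \tag{8}$$

Since $\frac{\partial \mathrm{Cns}_i}{\partial z_i(t+1)} + \frac{\partial \mathrm{Cnf}_i}{\partial z_i(t+1)} + \frac{\partial \mathrm{Enm}_i}{\partial z_i(t+1)} = 0 = \frac{\partial c_i}{\partial z_i(t+1)}$,

$$z_i(t+1) = \frac{\alpha_i y_i + \sum_{j=1}^{N} \left(\beta_i R_{i,j} - \gamma_i S_{i,j}\right) z_j(t)}{\alpha_i + \beta_i \sum_{j=1}^{N} R_{i,j} + \gamma_i \sum_{j=1}^{N} S_{i,j}} \tag{9}$$

Since $\sum_{j=1}^{N} R_{i,j} = 1$, $\sum_{j=1}^{N} S_{i,j} = 1$, and $\alpha_i + \beta_i + \gamma_i = 1$, then

$$z_i(t+1) = \alpha_i y_i + \left(\sum_{j=1}^{N} \left(\beta_i R_{i,j} - \gamma_i S_{i,j}\right) z_j(t)\right) \tag{10}$$

Re-writing in matrix form using the definition of matrix-vector multiplication, we get

$$z(t+1) = Ay + (BR - \Gamma S)z(t) = Ay + Qz(t) \tag{11}$$

where $Q = (BR - \Gamma S)$. Q.E.D.

Note that this result is in accord with Anderson's notion of cognitive algebra [4]. Having found the stationary point of Eq. (4), we must now show that it is a minimizer.

Proposition 2. Equation (11) is a minimizer of Eq. (4).

Proof. Using Eqs. (6), (7), and (8), together with $\sum_{j=1}^{N} R_{i,j} = 1$ and $\sum_{j=1}^{N} S_{i,j} = 1$, it is easy to see that $\frac{\partial^2 \mathrm{Cns}_i}{\partial z_i(t+1)^2} = 2\alpha_i$, $\frac{\partial^2 \mathrm{Cnf}_i}{\partial z_i(t+1)^2} = 2\beta_i$, and $\frac{\partial^2 \mathrm{Enm}_i}{\partial z_i(t+1)^2} = 2\gamma_i$. This means that $\frac{\partial^2 c_i}{\partial z_i(t+1)^2} = 2(\alpha_i + \beta_i + \gamma_i) = 2$. Since $\frac{\partial^2 c_i}{\partial z_i(t+1)^2} > 0$, the optimizer of Eq. (4) is a minimizer. Q.E.D.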
3.4 Long-Term and Equilibrium Behaviour
Before proceeding to derive the equilibrium properties of the model, we must first show that the matrix $Q$ defined in Proposition 1 is convergent. A matrix $M$ is convergent if $\lim_{k \to \infty} M^k = 0$, where $0$ is a matrix of zeros.

Proposition 3. The matrix $Q$ defined by Proposition 1 is a convergent matrix.

Proof. $Q$ is convergent if and only if $\rho(Q) < 1$, where $\rho(Q)$ is the spectral radius of $Q$. We can prove $\rho(Q) < 1$ using the properties of $Q$ and its construction per Proposition 1. Consider the max row sum matrix norm, denoted by $|||\cdot|||_\infty$, defined by Horn and Johnson [25]. This norm is defined over an $N \times N$ real matrix as follows:

$$|||M|||_\infty = \max_{1 \le i \le N} \sum_{j=1}^{N} |M_{i,j}| \tag{12}$$

$|||\cdot|||_\infty$ is induced by the $\infty$ vector norm and as such is a valid matrix norm. Recall that if $f$ is a valid matrix norm, then $\rho(M) \le f(M)$. Since $W^+_{i,j} > 0 \implies W^-_{i,j} = 0$ and $W^-_{i,j} < 0 \implies W^+_{i,j} = 0$, this means that $R_{i,j} > 0 \implies S_{i,j} = 0$ and $S_{i,j} > 0 \implies R_{i,j} = 0$. Since $R$ and $S$ are both stochastic matrices, for all $i$, $\sum_{j=1}^{N} |Q_{i,j}| \le \beta_i + \gamma_i$. Since, by the definitions in Sect. 3.1, $\forall i$, $\beta_i + \gamma_i < 1$, this means that $|||Q|||_\infty < 1$, which in turn implies that $\rho(Q) < 1$. Consequently, $Q$ is a convergent matrix. Q.E.D.

Since we have shown that $Q$ is convergent, we can now show that an equilibrium exists and that there is a closed-form equation to compute it.

Proposition 4. Let $z_\infty$ be the long-term behaviour (equilibrium expressed opinion vector) of the system. Then $z_\infty = (I - Q)^{-1} A y$.

Proof. Let us consider $z(t+1) - z(t)$. By Eq. (11), $z(t+1) - z(t) = Q(z(t) - z(t-1))$. Applying this recursively, we can see that $z(t+1) - z(t) = Q^t (z(1) - z(0)) = Q^t (z(1) - y)$. Since $Q$ is convergent, this means that $z(t+1) - z(t)$ will tend to 0, meaning that an equilibrium point exists. Let $z_\infty$ be said equilibrium point. Using Eq. (11), $z_\infty = Ay + Qz_\infty$; this means that $(I - Q) z_\infty = Ay$, which in turn means that $z_\infty = (I - Q)^{-1} A y$. Q.E.D.

As a consequence of Proposition 4, $\{z(t)\}_{t=1}^{\infty}$ is a convergent sequence. As noted by Abbott [1], every convergent sequence is a Cauchy sequence. Consequently, if $b(t+1) = \|z(t+1) - z(t)\|_\infty$, then $\{b(t)\}_{t=1}^{\infty}$ is decreasing and converges to 0. Suppose that we want to find a $t$ such that $\|z(t+1) - z(t)\|_\infty \le \epsilon$; we can exploit the fact that $\{b(t)\}_{t=1}^{\infty}$ is decreasing. This would be useful as it would allow us to approximate $z_\infty$ without needing to calculate the inverse of $I - Q$.
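To illustrate Propositions 1, 3, and 4 together, the following sketch builds $Q = BR - \Gamma S$ from a small signed graph, checks that $|||Q|||_\infty < 1$, and iterates Eq. (11) until the expressed opinions stop changing. It is a minimal sketch under stated assumptions rather than the authors' Julia code: the helper names (`normalize_rows`, `build_q`, `max_row_sum_norm`, `step`, `equilibrium`), the weights, and the factors are hypothetical, and rows of $W^+$ or $W^-$ that sum to zero are simply left as zeros, a case the text does not discuss.

```python
def normalize_rows(w):
    """Divide every row by its own sum; an all-zero row is left as zeros."""
    out = []
    for row in w:
        total = sum(row)
        out.append([x / total if total != 0 else 0.0 for x in row])
    return out


def build_q(w_pos, w_neg, beta, gamma):
    """Q = B R - Gamma S, with R and S the row-normalised W+ and W-."""
    r = normalize_rows(w_pos)
    s = normalize_rows(w_neg)  # non-positive entries over a negative sum give S >= 0
    n = len(beta)
    return [[beta[i] * r[i][j] - gamma[i] * s[i][j] for j in range(n)]
            for i in range(n)]


def max_row_sum_norm(m):
    """|||M|||_inf of Eq. (12): the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in m)


def step(q, alpha, y, z):
    """One synchronous update z(t+1) = A y + Q z(t), Eq. (11)."""
    n = len(z)
    return [alpha[i] * y[i] + sum(q[i][j] * z[j] for j in range(n))
            for i in range(n)]


def equilibrium(q, alpha, y, tol=1e-12, max_iter=10000):
    """Iterate Eq. (11) from z(0) = y until successive iterates agree to tol."""
    z = list(y)
    for _ in range(max_iter):
        z_next = step(q, alpha, y, z)
        if max(abs(a - b) for a, b in zip(z_next, z)) <= tol:
            return z_next
        z = z_next
    raise RuntimeError("did not converge; check that |||Q|||_inf < 1")


# Hypothetical 3-agent signed graph (all weights and factors are made up).
w_pos = [[0, 2, 0], [1, 0, 0], [0, 3, 0]]
w_neg = [[0, 0, -1], [0, 0, -4], [-2, 0, 0]]
alpha = [0.5, 0.4, 0.6]
beta = [0.3, 0.4, 0.2]
gamma = [0.2, 0.2, 0.2]
y = [0.2, 0.8, 0.6]

q = build_q(w_pos, w_neg, beta, gamma)
q_norm = max_row_sum_norm(q)
z_inf = equilibrium(q, alpha, y)

# At equilibrium z = A y + Q z, which is equivalent to z = (I - Q)^{-1} A y.
residual = max(abs(a - b) for a, b in zip(z_inf, step(q, alpha, y, z_inf)))
print(q_norm, z_inf, residual)
```

Checking the fixed-point residual avoids forming $(I - Q)^{-1}$ explicitly; for large sparse graphs a linear solver would be the natural alternative to the plain iteration shown here.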
Proposition 5. Suppose that we want to find an expressed opinion vector $z(t)$ such that $\|z(t+1) - z(t)\|_\infty \le \epsilon$. Finding said $z(t)$ would require $k$ iterations of Eq. (11), where

$$k = \left\lceil \frac{\log(\epsilon) - \log(\|z(1) - y\|_\infty)}{\log(|||Q|||_\infty)} \right\rceil$$

to guarantee that $\|z(t+1) - z(t)\|_\infty \le \epsilon$.

Proof. Consider $\|z(k+1) - z(k)\|_\infty$. Recall from the proof of Proposition 4 that $z(k+1) - z(k) = Q^k (z(1) - y)$. This means that $\|z(k+1) - z(k)\|_\infty = \|Q^k (z(1) - y)\|_\infty$. Through repeated use of the inequality $\|Mx\|_\infty \le |||M|||_\infty \|x\|_\infty$, which holds because $|||\cdot|||_\infty$ is induced by the $\infty$ vector norm, we get that $\|z(k+1) - z(k)\|_\infty \le |||Q|||_\infty^k \|z(1) - y\|_\infty$. If $|||Q|||_\infty^k \|z(1) - y\|_\infty \le \epsilon$, then clearly $\|z(k+1) - z(k)\|_\infty \le \epsilon$. Taking logarithms and dividing by $\log(|||Q|||_\infty)$, which is negative because $|||Q|||_\infty < 1$, we get that $\frac{\log(\epsilon) - \log(\|z(1) - y\|_\infty)}{\log(|||Q|||_\infty)} \le k$. Since $k \in \mathbb{Z}^+$, we take the ceiling to arrive at the number of iterations required, thereby showing Proposition 5. Q.E.D.

3.5 Polarization

Much like Matakos et al. [32], we define the polarization of a system as $\frac{\|z_\infty\|_2}{N}$. The smallest possible polarization value is 0 and occurs when every agent is neutral at equilibrium. Suppose that we can perturb the internal opinion of agent $i$ by some quantity $\delta_i$. Moreover, suppose that each perturbation incurs a cost that is quadratic in the magnitude of the perturbation. From this, we can define the following problem:
Definition 1. Continuous Budgeted Polarization Problem: Suppose that we are given a signed graph $G = (V, W^+, W^-)$ with $|V| = N$ agents, the internal opinions over all agents $y \in [-1, 1]^N$, the conformity factors of every agent $\beta \in (0, 1)^N$, the consistency factors of every agent $\alpha \in (0, 1)^N$, the enmity factors of every agent $\gamma \in (0, 1)^N$, and a budget $B$. Using these we compute $A$ and $Q$ as described in Sects. 3.1 and 3.3. We want to find the perturbation vector $\delta \in [-1, 1]^N$ such that $\frac{1}{N} \|(I - Q)^{-1} A (y + \delta)\|_2$ is minimized, $-1 \le (\delta + y) \le 1$, and $\sum_{i=1}^{N} \delta_i^2 \le B$.

From Definition 1, we can represent the Continuous Budgeted Polarization Problem as the following equivalent mathematical program:

$$\begin{aligned} \min_{\delta} \quad & \|(I - Q)^{-1} A (y + \delta)\|_2 \\ \text{s.t.} \quad & \sum_{i=1}^{N} \delta_i^2 \le B \\ & -1 \le (\delta + y) \le 1 \end{aligned} \tag{13}$$

Using Grant et al.'s [23] Disciplined Convex Programming, the above problem can be represented as a second-order cone programming problem [31] that is solvable using O'Donoghue et al.'s [37] Splitting Conic Solver.
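The iteration bound of Proposition 5 and the polarization measure of this section can be combined to evaluate a candidate perturbation without inverting $I - Q$. The sketch below is illustrative only: the function names, the matrix `q`, and all numbers are hypothetical, `polarization` assumes the measure is the Euclidean norm of the equilibrium vector divided by $N$, and the candidate `delta` is chosen by hand rather than by solving program (13).

```python
import math


def step(q, alpha, y, z):
    """One update z(t+1) = A y + Q z(t), Eq. (11)."""
    n = len(z)
    return [alpha[i] * y[i] + sum(q[i][j] * z[j] for j in range(n))
            for i in range(n)]


def sup_gap(u, v):
    """Infinity-norm distance between two vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def iterations_needed(q_norm, first_gap, eps):
    """Iteration count k of Proposition 5 (0 if the first gap is already small)."""
    if first_gap <= eps:
        return 0
    return math.ceil((math.log(eps) - math.log(first_gap)) / math.log(q_norm))


def approx_equilibrium(q, alpha, y, eps):
    """Run exactly k updates from z(0) = y, with k taken from Proposition 5."""
    q_norm = max(sum(abs(x) for x in row) for row in q)  # assumed to lie in (0, 1)
    k = iterations_needed(q_norm, sup_gap(step(q, alpha, y, y), y), eps)
    z = list(y)
    for _ in range(k):
        z = step(q, alpha, y, z)
    return z, k


def polarization(z):
    """Assumed measure: Euclidean norm of z divided by the number of agents."""
    return math.sqrt(sum(v * v for v in z)) / len(z)


# Hypothetical 3-agent system (all numbers are illustrative).
q = [[0.0, 0.3, -0.2], [0.4, 0.0, -0.2], [-0.2, 0.2, 0.0]]
alpha = [0.5, 0.4, 0.6]
y = [0.2, 0.8, 0.6]
budget = 0.5
eps = 1e-6

z_before, k = approx_equilibrium(q, alpha, y, eps)

# Hand-picked candidate: pull every internal opinion halfway towards neutrality.
delta = [-0.5 * v for v in y]
feasible = (sum(d * d for d in delta) <= budget
            and all(-1.0 <= d + v <= 1.0 for d, v in zip(delta, y)))
z_after, _ = approx_equilibrium(q, alpha, [v + d for v, d in zip(y, delta)], eps)
print(k, feasible, polarization(z_before), polarization(z_after))
```

Because the equilibrium is linear in the internal opinions, scaling them towards zero scales the polarization by the same factor; an actual solver would instead search over all feasible perturbations.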
4 Experiments
To perform our numerical experiments, we collected five datasets in total from the SNAP repository [30] and the KONECT repository [29]. Table 1 provides a summary of the datasets and their properties. We developed all experiments in Julia¹ and ran them on a machine with a Core i9 processor and 64 GB of RAM. All experimental code is made available to assist in reproducing the experiments².

Table 1. Dataset description
Dataset          Num nodes   Num edges
Cloister         18          189
Highlandtribes   16          58
Congress         219         521
Bitcoinalpha     3783        35592
Bitcoinotc       5881        35592

Both the psychological factors and internal opinions were generated independently for each agent.

4.1 Convergence
As a sanity check, we used Eq. (11) to compute the opinions expressed at each time step across 20 time steps for all five datasets. Doing this allowed us to trace the changes in expressed opinions over time to monitor for convergence. To conserve space, we only show the trace plot for the Cloister dataset (Fig. 1). However, the trace plots for all other datasets showed similar convergence behaviour across time.

Fig. 1. Plot showing change in opinions across time per agent (each color represents a different agent)

4.2 Reducing Polarization

To solve the Continuous Budgeted Polarization Problem outlined in Definition 1, we evaluated several methods for computing the perturbations to apply to agents' internal opinions to reduce polarization. These methods were:
– optimal: we used the SCS solver to determine the optimal perturbations.
– ks exact: the smallest possible polarization is achieved when $z_\infty = 0$. Hence, in this method, we try to set as many agents to neutrality as possible. Every agent's value is determined by the drop in polarization caused by them being set to neutrality. Every agent's cost is the square of their internal opinion. Using these values and costs, we then use Dantzig's algorithm [14] for the fractional knapsack problem.
– ks approx: this method is similar to the aforementioned ks exact, except that it uses the approximation procedure described in Proposition 5 to compute the equilibrium vector. Since most real graphs are sparse, and sparse matrix multiplication is generally computationally cheaper than matrix inversion, we expected this method to reduce polarization on par with ks exact while taking less time.
– stress select: similar to ks exact, but uses the stress centrality [10] on the unsigned graph as a proxy for the value of perturbing an agent to neutrality.
– pagerank select: similar to ks exact, but uses the PageRank on the unsigned graph as a proxy for the value of perturbing an agent to neutrality.

To evaluate our methods, we recorded both the average execution time and the mean polarization reduction achieved across several budgets. For the cloister, highlandtribes, and congress datasets, we considered budgets of 1 to 10 in increments of 1, and for the bitcoinalpha and bitcoinotc datasets, we considered budgets of 1 and 10 to 100 in increments of 10.

¹ https://julialang.org/
² https://github.com/InzamamRahaman/SignedDegroot

Table 2. Mean time (s) for each method to compute the perturbation vector across all datasets
Dataset\Method   Optimal   ks exact   ks approx   stress select   pagerank select
Cloister         0.0058    0.0003     0.0002      0.0006          0.0005
Highlandtribes   0.0068    0.0004     0.0003      0.0005          0.0003
0.0476
Congress Bitcoinalpha Bitcoinotc
0.0937
0.0200
0.0201
0.0007
79.5208 28.0474
4.5672
0.1412
314.3565 218.3950 41.9709
10.4321
0.5192
62.3074
Table 3. Mean % reduction in polarization in all datasets
Dataset\Method   Optimal   ks exact   ks approx   stress select   pagerank select
Cloister         93.0      82.2       80.9        77.2            72.9
Highlandtribes   95.9      88.2       88.1        85.5            85.6
Congress         41.3      3.7        3.7         7.3             8.2
Bitcoinalpha     28.3      2.4        2.4         4.4             4.4
Bitcoinotc       23.7      2.0        2.0         4.5             4.6
As can be seen in Table 2, the non-optimal methods took less time than the optimal method, but, as seen in Table 3, they achieved smaller reductions in polarization. Moreover, as expected, in most cases ks approx took less time than ks exact. The optimal method outperforms all other methods by a significant margin, and this margin grows as the number of nodes in the graph increases. All methods performed less well on the larger datasets because of the smaller ratio of budget to the number of nodes.
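The ks exact and ks approx baselines of Sect. 4.2 rely on the greedy rule for the fractional knapsack problem: sort items by value per unit cost and take them in that order until the budget runs out. A minimal sketch of that rule follows. The function name and the per-agent values are hypothetical, and how a fractional selection is turned into a perturbation is not spelled out in the text, so the sketch stops at the selected fractions.

```python
def fractional_knapsack(values, costs, budget):
    """Greedy rule for the fractional knapsack problem.

    Items are taken in decreasing order of value/cost; the last item that does
    not fit is taken fractionally. Returns the fraction taken of each item.
    """
    # Items with zero cost are skipped here (an assumption of this sketch).
    order = sorted((i for i in range(len(values)) if costs[i] > 0),
                   key=lambda i: values[i] / costs[i], reverse=True)
    fractions = [0.0] * len(values)
    remaining = budget
    for i in order:
        if remaining <= 0:
            break
        take = min(1.0, remaining / costs[i])
        fractions[i] = take
        remaining -= take * costs[i]
    return fractions


# Hypothetical example: cost is the squared internal opinion, as in ks exact;
# the values (drops in polarization) are made-up numbers.
y = [0.2, 0.8, 0.6]
costs = [v * v for v in y]
values = [0.01, 0.20, 0.18]
fractions = fractional_knapsack(values, costs, budget=0.5)
spent = sum(f * c for f, c in zip(fractions, costs))
print(fractions, spent)
```

In this example the agent with the best value-to-cost ratio is selected in full and the next one only partially, which is the behaviour that distinguishes the fractional problem from the 0/1 knapsack.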
5 Conclusion and Future Work

Mitigating polarization in online social networks has become increasingly important. However, mitigating such polarization would be more difficult without developing theoretical models for opinion formation. In this paper, we have developed a DeGrootian opinion formation model over signed graphs that accounts for heterogeneity in the psychology of the agents embedded in the graph. We have proved the theoretical properties of this model and defined polarization minimization as an optimization problem on said model. The model's definition opens up new avenues for future work, in particular the development of robust methods for parameter estimation. Also, like most DeGrootian models, our model assumes that internal opinions and graph structure are immutable. It would be interesting to modify the model to account for changes to internal opinions and graph structure over time. In particular, drawing from bounded confidence models of opinion spread seems like a worthwhile avenue for future developments.
References
1. Abbott, S.: Understanding Analysis, vol. 2. Springer, Cham (2001)
2. Altafini, C.: Consensus problems on networks with antagonistic interactions. IEEE Trans. Autom. Control 58(4), 935–946 (2012)
3. Altafini, C., Ceragioli, F.: Signed bounded confidence models for opinion dynamics. Automatica 93, 114–125 (2018)
4. Anderson, N.H.: A Functional Theory of Cognition. Psychology Press, New York (2014)
5. Bandura, A., Walters, R.H.: Social Learning Theory, vol. 1. Prentice-Hall, Englewood Cliffs (1977)
6. Bauso, D., Cannon, M.: Consensus in opinion dynamics as a repeated game. Automatica 90, 204–211 (2018)
7. Bergström, A., Jervelycke Belfrage, M.: News in social media: incidental consumption and the role of opinion leaders. Digit. Journal. 6(5), 583–598 (2018)
8. Bindel, D., Kleinberg, J., Oren, S.: How bad is forming your own opinion? Games Econ. Behav. 92, 248–265 (2015)
9. Boseovski, J.J., Lee, K.: Seeing the world through rose-colored glasses? Neglect of consensus information in young children’s personality judgments. Soc. Dev. 17(2), 399–416 (2008)
10. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25(2), 163–177 (2001)
11. Brehm, S.S., Brehm, J.W.: Psychological Reactance: A Theory of Freedom and Control. Academic Press, New York (2013)
12. Chang, C.C., Chiu, S.I., Hsu, K.W.: Predicting political affiliation of posts on Facebook. In: Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, pp. 1–8 (2017)
13. Colleoni, E., Rozza, A., Arvidsson, A.: Echo chamber or public sphere? Predicting political orientation and measuring political homophily in Twitter using big data. J. Commun. 64(2), 317–332 (2014)
14. Dantzig, G.B.: Discrete-variable extremum problems. Oper. Res. 5(2), 266–288 (1957)
15. DeGroot, M.H.: Reaching a consensus. J. Am. Stat. Assoc. 69(345), 118–121 (1974)
16. DeYoung, C.G., Peterson, J.B., Higgins, D.M.: Higher-order factors of the big five predict conformity: are there neuroses of health? Personality Individ. Differ. 33(4), 533–552 (2002)
17. Dong, Y., Ding, Z., Martínez, L., Herrera, F.: Managing consensus based on leadership in opinion dynamics. Inf. Sci. 397, 187–205 (2017)
18. Friedkin, N.E., Johnsen, E.C.: Social influence and opinions. J. Math. Sociol. 15(3–4), 193–206 (1990)
19. Gaitonde, J., Kleinberg, J., Tardos, E.: Adversarial perturbations of opinion dynamics in networks. arXiv preprint arXiv:2003.07010 (2020)
20. Garimella, K., De Francisci Morales, G., Gionis, A., Mathioudakis, M.: Reducing controversy by connecting opposing views. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 81–90 (2017)
21. Golbeck, J., Robles, C., Turner, K.: Predicting personality with social media. In: CHI 2011 Extended Abstracts on Human Factors in Computing Systems, pp. 253–262 (2011)
22. Graham, J., Haidt, J., Nosek, B.A.: Liberals and conservatives rely on different sets of moral foundations. J. Pers. Soc. Psychol. 96(5), 1029 (2009)
23. Grant, M., Boyd, S., Ye, Y.: Disciplined convex programming. In: Global Optimization, pp. 155–210. Springer (2006)
24. Hegselmann, R., König, S., Kurz, S., Niemann, C., Rambau, J.: Optimal opinion control: the campaign problem. arXiv preprint arXiv:1410.8419 (2014)
25. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (2012)
26. Karlsen, R., Steen-Johnsen, K., Wollebæk, D., Enjolras, B.: Echo chamber and trench warfare dynamics in online debates. Eur. J. Commun. 32(3), 257–273 (2017)
27. Krueger, J., Clement, R.W.: The truly false consensus effect: an ineradicable and egocentric bias in social perception. J. Pers. Soc. Psychol. 67(4), 596 (1994)
Extending DeGroot Opinion Formation for Signed Graphs
309
28. Kumar, S., Hamilton, W.L., Leskovec, J., Jurafsky, D.: Community interaction and conflict on the web. In: Proceedings of the 2018 World Wide Web Conference, pp. 933–943 (2018) 29. Kunegis, J.: KONECT – The Koblenz network collection. In: Proceedings of International Conference on World Wide Web Companion, pp. 1343–1350 (2013). http://dl.acm.org/citation.cfm?id=2488173 30. Leskovec, J., Krevl, A.: SNAP Datasets: Stanford large network dataset collection, June 2014. http://snap.stanford.edu/data 31. Lobo, M.S., Vandenberghe, L., Boyd, S., Lebret, H.: Applications of second-order cone programming. Linear Algebra Appl. 284(1–3), 193–228 (1998) 32. Matakos, A., Terzi, E., Tsaparas, P.: Measuring and moderating opinion polarization in social networks. Data Min. Knowl. Disc. 31(5), 1480–1505 (2017) 33. Mendez, M.F.: A neurology of the conservative-liberal dimension of political ideology. J. Neuropsychiatry Clin. Neurosci. 29(2), 86–94 (2017) 34. Meng, Z., Shi, G., Johansson, K.H., Cao, M., Hong, Y.: Behaviors of networks with antagonistic interactions and switching topologies. Automatica 73, 110–116 (2016) 35. Moscovici, S.: La psychanalyse, son image et son public. Presses universitaires de France (2015) 36. Musco, C., Musco, C., Tsourakakis, C.E.: Minimizing polarization and disagreement in social networks. In: Proceedings of the 2018 World Wide Web Conference, pp. 369–378 (2018) 37. O’donoghue, B., Chu, E., Parikh, N., Boyd, S.: Conic optimization via operator splitting and homogeneous self-dual embedding. J. Optim. Theory Appl. 169(3), 1042–1068 (2016) 38. Parsegov, S.E., Proskurnikov, A.V., Tempo, R., Friedkin, N.E.: Novel multidimensional models of opinion dynamics in social networks. IEEE Trans. Autom. Control 62(5), 2270–2285 (2016) 39. Rahaman, I., Hosein, P.: A method for learning representations of signed networks. In: Proceedings of the International Workshop on Mining and Learning on Graphs (MLG 2018) (2018) 40. 
Schwabe, I., Jonker, W., Van Den Berg, S.M.: Genes, culture and conservatism-a psychometric-genetic approach. Behav. Genet. 46(4), 516–528 (2016) 41. Shearer, E.: Social media outpaces print newspapers in the US as a news source. Pew research center 10 (2018) 42. Shi, G., Proutiere, A., Johansson, M., Baras, J.S., Johansson, K.H.: The evolution of beliefs over signed social networks. Oper. Res. 64(3), 585–604 (2016) 43. Tang, J., Chang, Y., Aggarwal, C., Liu, H.: A survey of signed network mining in social media. ACM Comput. Surv. (CSUR) 49(3), 1–37 (2016)
Market Designs and Social Interactions. How Trust and Reputation Influence Market Outcome?
Sylvain Mignot¹ and Annick Vignes²
¹ Lille Catholic University and LEM, Lille, France
² INRAE UMR Territoires and CAMS-EHESS, Paris, France
Abstract. This article analyses the influence of trust on the functioning of a market for perishable goods, where there is no quality signal and quantities can be scarce. On this market, agents choose between bidding and exchanging through bilateral transactions. It is well accepted in economics that trust plays an important role in transactions, but its definition and measurement remain, as far as we know, very elusive. We first propose an original measure of trust, based on the dynamics of agents’ encounters. We then analyze the differences in the network structures and estimate how they affect the market outcomes. We show that, while the transaction links on the auction market reflect the economic constraints of the partners, the relationships on the bilateral market result from both economic and non-economic determinants. At first glance, the stable co-existence of two market structures looks like a paradox. Our results help to understand the distinctive characteristics and functioning of each mechanism.
Introduction
It is now largely accepted that social relationships affect the efficiency of a market structure, whether centralized or decentralized (Babus et al. 2013, Opp and Glode 2016, Glode and Opp 2017). This article pays particular attention to the influence of social connections on the process of transactions in both kinds of market. We carry out an empirical analysis of a particular fish market (the Boulogne s/mer fish market). This wholesale market is organized through two sub-markets, one centralized (auctions), the other decentralized (pairwise transactions), and traders decide each day which sub-market they will trade on. We exploit this remarkable organization to analyze the influence of loyalty between traders on the market outcome and the impact of market structures on trading relationships. We postulate that a daily market can be represented as a social network and that trust is an output of loyalty between a buyer and a seller. A large literature has now underlined the importance of trust in ensuring the repetition of exchanges (Perelman 1998, Williamson 1993). Trust combined with trustworthy behavior turns out to be crucial for reducing transaction costs (Meidinger et al. 1999, Williamson and Masten 1999), uncertainty (Guiso et al. 2008) or risk (Mccabe et al. 2007). It is often associated with cooperation (Milgrom et al. 1990).
In what follows, trust is measured between pairs of individuals and is directly linked to repeated social relationships. In line with Hernández et al. (2018), we define the level of trust between two persons by the number of encounters (the number of days on which the two persons traded together), relative to the number of encounters the same persons have with other traders: the more two persons exchange together, the higher the level of trust. We then assume that the intensity of social relationships affects the way people exchange together. Following this train of thought, we build a specific social network, based on trust relationships. We first associate a trust index with each pair of agents (a buyer and a seller). We then consider that two traders trust each other when their trust index is high (belonging to the top 10% of the trust index distribution). From this subset of trust indices, we create a network where a link between two nodes depends on the intensity of their trust index. This procedure is carried out for both sub-markets, pairwise and auctions.
This article is organized as follows: Sect. 1 outlines the main characteristics of the market and describes the database. Section 2 presents some descriptive statistics. The measure of trust is presented in Sect. 3, and Sect. 4 concerns the network analysis. The conclusion follows.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 310–321, 2021. https://doi.org/10.1007/978-3-030-65351-4_25
1 The Main Market Features and the Data
We present here some particular features of the market we analyze, the Boulogne s/mer fish market, through the analysis of a detailed database consisting of 300 000 daily transactions over the period 2006–2007.
The Market: This market is a daily one, open 6 days a week. Agents are heterogeneous on both sides of the market. They are either sellers or buyers. Sellers are boat owners, and their boats are of different capacities. Buyers are restaurant owners, retail buyers and fish processors. Buyers thus form a heterogeneous population, facing different budget and time constraints. They can freely buy on both sub-markets. Mignot et al. (2012) show the existence of two behaviors: some agents purchase most of the time on the same sub-market, while others switch regularly. Loyal sellers, the ones who rarely change, are mainly present on the bilateral market. The auction market opens at 4 a.m. and always operates at the same place. The transaction prices are known by everybody and thus constitute a public information signal. Each lot offered for sale is carefully described (type of fish and quantities, name of the boat). On the bilateral market, the prices are not displayed and emerge from a bargaining process. Buyers who are retailers look for specific species that correspond to their expected demand. Here agents have different sources of private information, depending on their past history, their ability to bargain and transact, and the special links they may have with agents of the other type (buyers or sellers).
The Data: 200 boats are registered in this market and designated as “sellers” in what follows. 100 buyers purchase regularly, most of them on both sub-markets. The database we use covers a year and a half (2006–2007) during which both sub-markets coexist. For each transaction, the date, the species, the characteristics of the traded fish (size, presentation, quality), the buyer’s and seller’s identities, the type of trade mechanism (auction or negotiated), the quantity exchanged and the transaction price are known. The analysis of the database tells a story of heterogeneity. First statistical results exhibit heterogeneous behaviors in terms of quality and quantities exchanged, on both sides of the market. Over the period studied, the two sub-markets (auctions and negotiated) are of comparable importance (45% of volume for the auction market, 55% for the bilateral one): the same agents transact on the two sub-markets and the same types of fish are sold through both mechanisms (80 different species of fish are traded). Between 37% and 54% of each of the four main fish species (in terms of quantities) are sold on the auction market, which suggests an equivalent distribution of the production between the two market mechanisms.
2 Stylized Facts
2.1 Prices Distributions
We now compare the distributions of transaction prices and the agents’ behavior on both sub-markets. As a first step, we compute the weekly aggregate prices per sub-market, using a classic Paasche index, which makes it possible to take into account the heterogeneity of the goods:

\[ \hat{P}_w = \sum_{i=1}^{N} \left( p_{i,w} \, \frac{q_{i,w}}{\sum_{j=1}^{N} q_{j,w}} \right) \tag{1} \]

$p_{i,w}$ being the price per kilo of transaction $i$ in week $w$ (¹), $q_{i,w}$ the quantity sold in this transaction, and $N$ the number of transactions made in week $w$.
The first observation we can make from Table 1 is that the prices are higher on the negotiated market (average and median) and that the price distributions behave differently on the two markets. The very high kurtosis value on the bilateral market suggests a leptokurtic distribution, with fat tails: the “rare events”, i.e., very low or very high prices (outliers), are quite frequent. The higher standard deviation confirms a higher uncertainty on the bilateral market. Finally, we observe that both distributions have a positive skewness, which is far higher for the pairwise distribution. A positive skewness is associated with asymmetry: the mass of the distribution is concentrated on the left, where the values are lower, with a fat tail on the right. The “rare events” correspond to high prices. The auction distribution, even if it does not follow a normal law, is less asymmetric than the pairwise one (skewness of 0.87 vs. 3.00 and kurtosis of 1.71 vs. 16.74 on the bilateral market) and thus exhibits relatively few high values. Clearly, pairwise exchanges are riskier, and this result is in line with the literature.
¹ In this article, unless stated otherwise, all the prices considered are prices per kilo.

Table 1. Descriptive statistics of the weekly aggregate price distributions on the two sub-markets. Both the standard deviations and the average prices differ significantly. Note the leptokurtic distribution of pairwise prices.

Price      Auction   Pairwise
Average    2.38      2.94
Median     2.31      2.71
Skewness   0.87      3.00
Kurtosis   1.71      16.74
Std dev    0.51      0.82

2.2 Buyers’ and Sellers’ Pairing
We start by checking the pairing strategies of buyers and sellers on both sub-markets, looking at the distributions of the number of buyers (respectively sellers) that each seller (respectively buyer) transacts with over the whole period considered. For sellers, the number of partners is not a strategic variable on the auction market, as they cannot act on it directly. Looking at the buyers’ strategy, we observe that, on average, they exchange with a higher number of sellers on the negotiated market than on the auction one. We then make the assumption that the trade network is denser on the negotiated market than on the auction one.
As a second step, we seek to estimate the intensity of social links for the different agents. Equation 2 represents, for an agent $i$, the ratio $\alpha_i$ of the number of transactions carried out by the trader to the number of agents he traded with over the whole period. This ratio should give a first estimation of the intensity of links for each agent. A high value should indicate an agent involved in loyal relationships.

\[ \alpha_i = \frac{\text{number of transactions}}{\text{number of traders}} \tag{2} \]

The aggregate descriptive statistics are summarized in Table 2. The mean and the median of $\alpha_i$ are higher on the bilateral market than on the auction one (for both buyers and sellers). For a given number of transactions, a buyer (respectively seller) trades with fewer sellers (respectively buyers) on the bilateral market than on the auction one. We observe a large difference between the mean and the median on the auction market, which may be due to a large number of people coming rarely and buying at random. When looking at the data, we observe a high number of buyers coming rarely to the auction market, while on the decentralized market traders are present more regularly.

Table 2. Mean and median of the ratio $\alpha_i$ for buyers and sellers on both markets.

α_i, buyers’ side    Auction   Negotiated
Mean                 13.51     19.63
Median               6.31      12.17

α_i, sellers’ side   Auction   Negotiated
Mean                 16.77     23.15
Median               4.77      19.61

These first results suggest that the bilateral market is riskier, and that on this sub-market traders seem to choose their partners more carefully. Empirical evidence suggests the existence of loyal behavior on the decentralized market, which could help to mitigate the risk.
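The two descriptive quantities of this section, the quantity-weighted weekly price of Eq. 1 and the ratio $\alpha_i$ of Eq. 2, can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layouts `(week, price_per_kilo, quantity)` and `(agent, partner)` and both function names are hypothetical and do not come from the authors’ database or code.

```python
from collections import defaultdict


def weekly_price_index(transactions):
    """Quantity-weighted mean price per week, as in Eq. 1.

    transactions: iterable of (week, price_per_kilo, quantity) tuples
    (hypothetical layout chosen for this sketch).
    """
    value = defaultdict(float)   # sum of p * q per week
    volume = defaultdict(float)  # sum of q per week
    for week, price, qty in transactions:
        value[week] += price * qty
        volume[week] += qty
    return {w: value[w] / volume[w] for w in value if volume[w] > 0}


def loyalty_ratio(trades):
    """alpha_i of Eq. 2: number of transactions / number of distinct partners.

    trades: iterable of (agent, partner) pairs, one pair per transaction
    (hypothetical layout chosen for this sketch).
    """
    count = defaultdict(int)
    partners = defaultdict(set)
    for agent, partner in trades:
        count[agent] += 1
        partners[agent].add(partner)
    return {a: count[a] / len(partners[a]) for a in count}
```

Applied separately to the auction and the negotiated records, these two functions would give the kind of per-market figures summarized in Tables 1 and 2.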
3 A Measure of Trust
Let us now assume that auctions do not facilitate loyal strategies, while on the bilateral market people can choose with whom they exchange. Once on the decentralized market, people can either exchange with someone they trust or exchange at random. Because people are not present every day, it can happen that a very loyal agent has no other choice than to exchange with someone other than a usual partner. Consider now a bilateral market, where there is no arbitrage, composed of $N$ buyers $i$ and $M$ sellers $j$, who buy and sell regularly during periods $\tau = 1, \ldots, T$. At each period $\tau$, a buyer $i$ and a seller $j$ can both be present ($P_{i,j} = 1$) or not ($P_{i,j} = 0$). If both are present, they can exchange ($L_{i,j} = 1$) or not ($L_{i,j} = 0$).
3.1 Looking for a Signal of Trust
We now measure trust by the intensity of the matching. Do buyers and sellers match at random, following the opportunity of common presence, or do they strategically choose their partners? We first look at the correlation between $M_{i,j}$, the number of days a couple (buyer $i$ and seller $j$) is present on each sub-market, and $B_{i,j}$, the number of encounters. We compute $M_{i,j} = \sum_{\tau=1}^{T} P_{i,j,\tau}$ and $B_{i,j} = \sum_{\tau=1}^{T} L_{i,j,\tau}$. Looking at the correlation between $B_{i,j}$ and $M_{i,j}$, we observe a stronger correlation between these two values when exchanges are centralized. The value is higher on the auction market than on the bilateral one (0.77 vs 0.58). The linear regression of $B_{i,j}$ on $M_{i,j}$ (Table 3) shows that the $R^2$ of the linear fit is 0.60 for the auction market versus 0.35 for the decentralized market.
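As a hedged sketch of the computation described above, the following Python code builds $M_{i,j}$ and $B_{i,j}$ from daily records and fits the regression of $B$ on $M$. The input layout is an assumption: it supposes that daily presence is recorded per agent, which the source does not specify, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pair_counts(presence, trades):
    """Return (M, B) for every (buyer, seller) pair.

    presence: dict day -> (set of buyers present, set of sellers present)
    trades:   set of (day, buyer, seller) with at least one transaction that day
    M counts days of common presence, B counts days with an encounter.
    """
    M, B = Counter(), Counter()
    for buyers, sellers in presence.values():
        for b in buyers:
            for s in sellers:
                M[(b, s)] += 1
    for _day, b, s in trades:
        B[(b, s)] += 1
    return M, B


def linear_fit(x, y):
    """Ordinary least squares of y on x: (slope, intercept, r, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    slope = sxy / sxx
    r = sxy / sqrt(sxx * syy)
    return slope, my - slope * mx, r, r * r
```

With `pairs = list(M)`, calling `linear_fit([M[p] for p in pairs], [B[p] for p in pairs])` on each sub-market would yield a coefficient and an $R^2$ of the kind reported in Table 3.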
Table 3. Strength of the relation between $B_{i,j}$ and $M_{i,j}$ on the negotiated and the auction market.

           Auction   Negotiated
R²         0.60      0.35
Coef       0.32      0.16
Std dev    0.002     0.002
Pr > t
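[Table residue; caption and layout not recoverable from the source. Surviving labels: t, Price, Coef, Intercept. Surviving values: −0.07, 0.12, 0.56, 0.73, 0.08.]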
\[ y_i = \sum_{X \in Nai_i} \begin{cases} 1 & \text{if } \sum_{x_j \in X} w_j x_j > 0 \\ 0 & \text{otherwise} \end{cases} \]
A Methodology for Evaluating the Extensibility of Boolean Networks
Remark 3. By construction, $Nai_0$ always contains only the configuration in which all the inputs are deactivated, i.e. $\mathbf{0} = (0, 0, \ldots, 0)$. Since no SBF $f$ is activated on $\mathbf{0}$ ($f(\mathbf{0}) = 0$ for any SBF $f$), $y_0$ is always equal to 0. We therefore discard $y_0$ from the vectors $Y$ in the following.
The vectors $Y$ allow us to define the following notion, which captures the fundamental equivalence of SBFs with respect to reordering of the inputs.
Definition 3. The abstract representation of a SBF $f$ is the vector $A(f) = (y_1, \ldots, y_d)$ where $y_i$ is the number of Boolean input vectors with $i$ activated inputs for which the SBF is 1 (as above).
The abstract representations of any two SBFs $f_1$ and $f_2$ whose weight vectors are permutations of one another are the same: $A(f_1) = A(f_2)$. We empirically verified the converse statement for dimensions $d \in \{1, 2, 3\}$: any two minimal SBFs $f_1$ and $f_2$ of dimension $d \in \{1, 2, 3\}$ with the property $A(f_1) = A(f_2)$ are given by weight vectors which are permutations of one another. For example, as shown in Table 1, $f_1(x_1, x_2) = f_2(x_2, x_1)$ and $A(f_1) = A(f_2)$.

Table 1. Example of two equivalent SBFs of dimension 2 sharing a common abstract representation.

                     f1                              f2
X        x1   x2     f1(X)   Inequalities    Y       f2(X)   Inequalities    Y
Nai_0    0    0      0                       0       0                       0
Nai_1    1    0      1       w1 > 0          1       0       w1 ≤ 0          1
         0    1      0       w2 ≤ 0                  1       w2 > 0
Nai_2    1    1      1       w1 + w2 > 0     1       1       w1 + w2 > 0     1
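The abstract representation of Definition 3 can be computed directly by enumerating inputs grouped by their number of active entries. The sketch below is a minimal illustration, assuming that a SBF is given by its weight vector and outputs 1 exactly when the weighted sum of its inputs is strictly positive; the function names are hypothetical.

```python
from itertools import combinations


def sbf(weights, x):
    """Sign Boolean function: 1 iff the weighted sum of the inputs is > 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def abstract_representation(weights):
    """A(f) = (y_1, ..., y_d).

    y_i counts the input vectors with exactly i active entries on which
    the SBF outputs 1. y_0 is always 0 and is therefore not returned.
    """
    d = len(weights)
    ys = []
    for i in range(1, d + 1):
        y = 0
        for active in combinations(range(d), i):
            x = [1 if j in active else 0 for j in range(d)]
            y += sbf(weights, x)
        ys.append(y)
    return tuple(ys)
```

For instance, the weight vectors `(2, -1)` and `(-1, 2)` satisfy the inequalities listed for $f_1$ and $f_2$ in Table 1, and both give the same abstract representation.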
Interaction Graph of a SBN. The interaction graph of a SBN is the weighted directed graph in which the nodes are the variables of the SBN, and which contains the weighted edge $E_{n_A n_B} : n_A \xrightarrow{w} n_B$ between nodes $n_A$ and $n_B$ if the SBF updating $n_B$ receives the state of $n_A$ weighted by $w$. Let us remark that:
– the output of a SBF only depends on the sign of the weighted sum of its inputs,
– the interaction graph of a SBN completely describes the SBN,
– setting a weight to 0 models an absence of interaction and consequently an absence of the corresponding edge,
– we require networks to have connected interaction graphs: disconnected nodes are not allowed.
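Since the interaction graph completely describes a SBN, a weight matrix is enough to represent one. The sketch below assumes the convention `W[i][j]` = weight with which node `i` feeds the SBF of node `j`, and interprets the connectivity requirement as weak connectivity (edge directions ignored); both choices are assumptions of this example, not statements from the source.

```python
def interaction_edges(W):
    """List the weighted edges (i, j, w) of the interaction graph.

    A zero weight means no interaction, hence no edge.
    """
    d = len(W)
    return [(i, j, W[i][j]) for i in range(d) for j in range(d) if W[i][j] != 0]


def is_connected(W):
    """True if every node is reachable from node 0 when directions are ignored."""
    d = len(W)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in range(d):
            if v not in seen and (W[u][v] != 0 or W[v][u] != 0):
                seen.add(v)
                stack.append(v)
    return len(seen) == d
```

A network made only of self-loops, such as `[[1, 0], [0, 1]]`, is rejected by `is_connected`, whereas `[[0, 1], [1, 0]]` is accepted.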
R. Segretain et al.
Transition Graph of a SBN. A state of a SBN with $d$ nodes $\{n_1, \ldots, n_d\}$ is a vector $s = (x_1, \ldots, x_d) \in \{0, 1\}^d$ giving the value of each of the nodes of the SBN. A network updates all of its nodes at every step. In this article we only consider updates under the parallel update mode (also called synchronous mode); however, our method may be adapted to non-synchronous update modes. The initial state of a SBN is given by the vector $s_0 \in \{0, 1\}^d$, which sets the initial values of the nodes of the network. We will often refer to nodes by their update functions (SBFs). While several nodes may have the same update function, we assume that the labels of the SBFs are different and are in bijective correspondence with nodes. Given a SBN $N$, its transition graph is a graph whose vertices are the states of $N$ and which has an edge $s_1 \to s_2$ iff updating all the nodes in $s_1$ according to their update functions yields $s_2$. The dynamics of a SBN is deterministic: if the transition graph contains the edges $s_1 \to s_2$ and $s_1 \to s_3$, then $s_2 = s_3$. Consequently, the connected components of the state graph are cycles, possibly with some pre-cycles (non-cyclic prefixes) attached. These cycles are generally posited to correspond to particular behaviours (phenotypes) of biological networks. The output of a SBN $N$ is recovered by first designating an output node and then listing the successive values it may take when $N$ evolves from a particular starting state. Since the dynamics of a SBN always ends up in a cycle, and since SBNs never stop updating their states, any node output sequence they generate has the form $uv^*$, $u, v \in \{0, 1\}^*$ (that is, $u$ is a prefix and $v$ is a suffix which can be repeated arbitrarily many times). Because we limit the definition of behavior to the repeated suffix $v^*$, we will ignore the prefix $u$ and associate the sequence $S$ played by a node of $N$ with the repeating suffix $v^*$.
2.2 Extension of Structures and Behaviours
We will now define the central question addressed in this paper: the ability of SBNs to extend by addition of new elements while maintaining existing structure and function. To formulate the extension problem in a logical way, we only need to fix a given binary sequence $S_1$, a suffix $S_k$ ($S_1, S_k \in \{0, 1\}^*$), the dimension $d$, and the constraints over the quadruplet $(N_1, S_1, N, S)$ as follows:
– $N_1$ is a SBN composed of $d$ nodes,
– $N_1$ shows the behavior described by $S_1$ on a node $n_i$, meaning that the successive states of node $n_i$ along a cycle in the transition graph match the binary sequence $S_1$,
– $S$ is a binary sequence defined as $S = S_1 \cdot S_k$, where $\cdot$ stands for sequence concatenation,
– $N$ is a SBN composed of $d + 1$ nodes,
– $N$ shows the behavior $S$ on the same node $n_i$ as in $N_1$,
– $N$ contains a sub-network $N_1'$ such that:
  • $N_1'$ is composed of $d$ nodes,
  • $N_1$ and $N_1'$ share the same structure of edges: $\forall\, Edge_{n_i \to n_j} \in N_1,\ \exists!\, Edge'_{n_i \to n_j} \in N_1'$, $i, j \in [1; d]$,
  • $N_1$ and $N_1'$ share the same SBFs: $\forall f_i \in N_1,\ \exists!\, f_i' \in N_1',\ f_i \equiv f_i'$, $i \in [1; d]$, where $\equiv$ stands for equivalence as defined in Sect. 2.1,
  • $N_1$ and $N_1'$ possibly differ regarding the weights of their edges, with the following restriction: $\forall w_{n_i \to n_j} \in N_1,\ \exists!\, w'_{n_i \to n_j} \in N_1',\ w_{n_i \to n_j} \le w'_{n_i \to n_j}$, $i, j \in [1; d]$.
Figure 2 gives two detailed examples of extensions of 2-dimensional SBNs. In practice, due notably to the huge number of possible $d$-size networks and associated binary sequences, it is not possible to directly calculate all the quadruplets $(N_1, S_1, N, S)$ within a reasonable CPU time and amount of memory. To avoid this issue, we used a particular $(N_1, S_1)$ instance-centered implementation of this logical formulation of the network and behavior extension problem: for a given couple $(N_1, S_1)$ and a given behavior extension $S_k$, we infer (in ASP) the set of networks $N$ that comply with those constraints. In order to avoid duplicates of $N_1$ networks, i.e. to keep only one network instance per equivalence class, we must exploit the abstract representation of their functions presented above (see Sect. 2.1). Full details concerning the implementation of this computation, which involves a Java processing pipeline orchestrating ASP modules, are given in the online archive [18].
2.3 Complexity of SBNs and SBFs
Recently, we described a way to compute both the complexity of the binary sequences $S$ played by network nodes (sequences that we associate with behaviors) and the structural complexity of TBFs and TBNs [2]. In the present paper, we present an improved version of the latter. As before for TBFs, to compute the structural complexity of a SBF $f$, we consider the equivalence class $A(f)$ of $f$ and evaluate the probability $P$ of randomly picking an instance of $f$ in this equivalence class, under the uniform distribution over the unit ball of parameter space (Fig. 1 illustrates this for SBFs). The complexity of a given TBF was defined as the inverse of this probability. To compute the complexity of a TBN $N$ in [2], we multiplied the individual complexities of its constituent TBFs. This computation, however, presents an issue: it does not take into account the way these TBFs are connected together, i.e. the topology of the network and the associated parameters (weights), in other words its structure. In the present paper, we take network structure into account in the form of another probability measurement, $C^s$. Furthermore, although the complexity $C^f$ of a TBF is related to a probability, it is not itself a probability. Combining it with the structural complexity $C^s$ is therefore not straightforward. We decided to make the two measures homogeneous: both the SBF complexities $C^f$ and the structural complexities $C^s$ are now probabilities. To that end, we updated $C^f$ to $C^f = 1 - P(A(f))$. $C^f$ is therefore the probability of not picking $f$. Thus, the complexity $C^f$ of a function $f$ is high when $P(A(f))$ is low.
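One possible way to approximate $P(A(f))$, and hence $C^f = 1 - P(A(f))$, is Monte Carlo sampling of weight vectors in the unit ball. This is an assumption of the sketch below: the source does not say how the probability is evaluated, and an exact computation may well be used instead. All names are hypothetical.

```python
import random
from collections import Counter
from itertools import product


def abstract_representation(weights):
    """(y_1, ..., y_d): number of firing inputs per count of active entries."""
    d = len(weights)
    ys = [0] * (d + 1)
    for x in product((0, 1), repeat=d):
        if sum(w * xi for w, xi in zip(weights, x)) > 0:
            ys[sum(x)] += 1
    return tuple(ys[1:])


def estimate_class_probabilities(d, samples=20000, seed=0):
    """Estimate P(A(f)) by drawing weights uniformly in the d-dimensional unit ball.

    Uniform points in the ball are obtained by rejection from the cube [-1, 1]^d.
    """
    rng = random.Random(seed)
    counts = Counter()
    drawn = 0
    while drawn < samples:
        w = [rng.uniform(-1, 1) for _ in range(d)]
        if sum(v * v for v in w) > 1:
            continue  # outside the unit ball, reject
        counts[abstract_representation(w)] += 1
        drawn += 1
    return {a: c / samples for a, c in counts.items()}


def function_complexity(probability):
    """C^f = 1 - P(A(f)): high when the class is rarely drawn."""
    return 1 - probability
```

Rejection sampling is kept for simplicity; it becomes wasteful as $d$ grows, so a different sampler would be preferable in higher dimensions.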
Fig. 2. Two examples of SBN extension from a 2D SBN. Panels: (a) $N_1$, (b) $N_A$, (c) $N_B$, (d) TG of $N_1$, (e) TG of $N_A$, (f) TG of $N_B$. In the networks $N_1$ (a), $N_A$ (b) and $N_B$ (c), the nodes are labelled A, B or C. The node whose behavior we follow is red and the additional node in the extended networks is gray. The corresponding transition graphs (TG) are shown below in (d), (e) and (f). In the TGs, the states and behavior of node A are shown in red while the state of the additional node in extended networks is shown in gray. From network $N_1$ (a) and its asymptotic behavior $S_1 = (10)^*$ (d) played on node A, networks $N_A$ (b) and $N_B$ (c) are examples of extended networks that can be found when asking for an extension of $N_1$ by one node, and an extension of one of its behaviors from $S_1 = (10)^*$ to $S = (101)^*$. Within the potentially large set of extended networks that satisfy the constraints given above, in this case 48 extended SBNs from $N_1$, $S_1$ and $S$, some will partly preserve the initial transition graph, as shown in (e) for network $N_A$, while it may be largely reconfigured for others, as shown in (f). In any case, the structure of network $N_1$ is preserved in the extended networks, and the initial behavior $(10)^*$ is encapsulated in a larger one, here $(101)^*$.
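The transition graph and the repeated suffix $v^*$ defined in Sect. 2.1 can be computed by brute force for small networks. The sketch below assumes the convention `W[i][j]` = weight from node `i` to node `j` and the synchronous update mode. The 2-node example used in the usage note is hypothetical: it is not the network $N_1$ of Fig. 2, whose weights are not given in the text, but it also plays $(10)^*$ on its first node.

```python
from itertools import product


def step(W, state):
    """Synchronous update: node j becomes 1 iff sum_i W[i][j] * state[i] > 0."""
    d = len(W)
    return tuple(
        1 if sum(W[i][j] * state[i] for i in range(d)) > 0 else 0
        for j in range(d)
    )


def transition_graph(W):
    """Map every state of {0, 1}^d to its unique successor."""
    return {s: step(W, s) for s in product((0, 1), repeat=len(W))}


def attractor(W, start):
    """Follow the dynamics from start; return (prefix states, cycle states)."""
    seen, path, s = {}, [], tuple(start)
    while s not in seen:
        seen[s] = len(path)
        path.append(s)
        s = step(W, s)
    k = seen[s]
    return path[:k], path[k:]


def node_behavior(W, start, node):
    """Repeated suffix v played by a node on the cycle reached from start."""
    _, cycle = attractor(W, start)
    return "".join(str(s[node]) for s in cycle)
```

With `W = [[0, 1], [1, 0]]`, each node copies the other, so starting from `(1, 0)` the network alternates between `(1, 0)` and `(0, 1)` and `node_behavior(W, (1, 0), 0)` returns `"10"`.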
We choose the structural complexity $C^s$ to be a centrality measure because centrality naturally expresses the concept of influence of one node $n_i$ on another one. We construct $C^s_i$ as follows: for each directed edge of the network $E_{ij} : n_i \xrightarrow{w_{ij}} n_j$ ($i, j \in [1; d]$), we define the influence probability $P^I(E_{ij})$ that node $n_i$ influences node $n_j$ through the weighted directed edge $E_{ij}$, among all the incoming edges $E_{kj}$ that influence node $n_j$:

\[ P^I(E_{ij}) = \frac{|w_{ij}|}{\sum_{k=1}^{d} |w_{kj}|}. \]

We can then define the $P^I$ of a longer path going from a node $a$ to a node $b$, $P^I(Path_{ab})$, as the geometric mean of the influence probabilities $P^I(E_{ij})$ of the edges of this path, $E_{ij} \in Path_{ab}$:

\[ P^I(Path_{ab}) = \sqrt[L_{ab}]{\prod_{E \in Path_{ab}} P^I(E)}, \]

where $L_{ab}$ is the length of the path.
Remark 4. In the equation above, we express the mean influence of a given path as an average probability that can be compared to that of other paths. We used the geometric mean because each SBF along the path receives a variable number of input edges, so the influence $|w_{ij}|$ at each step on one node (one edge $E_{ij}$) may be normalized by a different number of weights, from 1 (single edge) to $d$ (edges from all nodes including $n_j$).
Since several different paths can exist from $a$ to $b$, the general probability $P^I_{ab}$ that $a$ influences $b$ must take all of them into account. We therefore calculate $P^I_{ab}$ as follows:

\[ P^I_{ab} = P\Big( \bigcup_{Path \in Paths_{ab}} Path \Big), \]

where $Paths_{ab}$ is the set of all paths from node $a$ to node $b$, and we overload $Path$ to refer both to a path from $a$ to $b$ and to the event that node $a$ influences node $b$ over this path. The probability of this event is $P^I(Path)$. We define the centrality $C^s_i$ of a node $n_i$ as the probability that it influences at least one node $n_k$, including itself:

\[ C^s_i = P\Big( \bigcup_{k=1}^{d} \ \bigcup_{Path \in Paths_{ik}} Path \Big). \]

Finally, we use the centrality $C^s_i$ of each node $n_i$ to modulate the corresponding SBF complexity $C^f_i$, and define the complexity of a network $N$ as follows:

\[ C(N) = \prod_{i=1}^{d} \begin{cases} C^f_i \times C^s_i & \text{if } C^s_i > 0 \\ 1 & \text{otherwise.} \end{cases} \]

Remark 5. Nodes with centrality equal to zero (i.e. pure readers, whose state is never read by any node, including themselves) are de facto excluded from the network complexity calculation.
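The edge and path influence probabilities defined above translate directly into code. The sketch below covers only these two quantities, under the assumed convention `W[i][j]` = weight from node `i` to node `j`. It deliberately does not implement the union over paths used for $P^I_{ab}$ and $C^s_i$, because the text does not state how the probability of that union is evaluated.

```python
from math import prod


def edge_influence(W, i, j):
    """P^I(E_ij) = |w_ij| / sum_k |w_kj|, or 0.0 if node j has no incoming edge."""
    total = sum(abs(W[k][j]) for k in range(len(W)))
    return abs(W[i][j]) / total if total else 0.0


def path_influence(W, path):
    """Geometric mean of the edge influences along path = [n_0, n_1, ..., n_L]."""
    edges = list(zip(path, path[1:]))
    if not edges:
        raise ValueError("a path needs at least one edge")
    return prod(edge_influence(W, i, j) for i, j in edges) ** (1 / len(edges))
```

For example, with `W = [[0, 1, 0], [0, 0, 3], [1, 0, 1]]`, the edge from node 1 to node 2 has influence 3/4, since node 2 also receives a self-loop of weight 1, and the path `[0, 1, 2]` has influence equal to the square root of 1 × 3/4.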
3 Discussion
In this article we presented a scientific question that concerns various biological networks, including regulatory networks, ecosystems, neural networks, etc.: how do networks and their associated behavior grow together when the initial network structure and behavior must be preserved? We also ask how much this extensibility is related to the complexity of networks and behaviors. Among the further research directions that emerge in this context, we hypothesize that most of the highly complex $d$-dimensional networks playing complex behaviors cannot easily extend into complex $(d + 1)$-dimensional networks because they are too constrained. Conversely, networks that are too simple cannot play complex behaviors and cannot become larger complex networks playing complex extended behaviors. We expect, however, networks of moderate structural and behavioral complexity to be the most capable of generating complex extended networks and behaviors when growing. In this case, the network structure and its asymptotic behavior are not too tightly linked, a large part of the transition graph being occupied by transitory states or other basins of attraction, so additional nodes do not necessarily break the initial behavior.
Here, we focus on the exhaustive description of the method we developed, from both the theoretical and the implementation points of view, to compute both network extensions and network characteristics such as complexities. Sign Boolean Networks are well suited to materialize our questions and test our hypotheses. SBNs are simple enough to work with in logical constraint programming and to limit the number of extensions obtained from any triplet $(N_1, S_1, S)$. The homogeneous description of their constituent SBFs, entirely determined by their input weights, also makes it easier to express complexity scores for SBNs than for ordinary Boolean Networks or even Threshold Boolean Networks.

Table 2. Number of SBFs, SBNs, triplets $(N_1, S_1, S)$ and extended networks $N$ per dimension $d$. Ranged values are estimations based on SBN extension from dimension 2 to 3.

d   SBF   SBN          {N1, S1, S}                  N
2   7     101          96 × 10^3                    777216
3   17    206 × 10^3   [180 × 10^6 – 32 × 10^9]     [1.5 × 10^9 – 280 × 10^9]
4   47    76 × 10^9
We tested our method by computing the extension of 2-dimensional SBNs towards 3-dimensional SBNs, i.e. 777216 quadruplets $(N_1^{d=2}, S_1, N^{d=3}, S)$. For each network (initial and extended), we also computed several characteristics, including network and behavioral complexities, but also the number of attractors and the size of basins of attraction. The volume of data to process is thus already huge when starting from dimension 2, for which only 7 SBFs and 101 SBNs exist. Our program, running on a workstation with 32 CPU threads (2 × Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20 GHz) and 64 GB RAM (ECC DDR4 2400 MHz), took 20 min to complete. This is quite efficient, but computation times for extensions from larger dimensions may grow fast, as shown in Table 2.
To be applicable to larger networks, in particular biological ones, our method should be scalable. If we look for exhaustive results, i.e. infer all quadruplets $(N_1, S_1, N, S)$ as we did for 2 to 3 dimensional extensions but for larger dimensions, the computation as it is realized here may quickly reach its limits. The interest of exhaustive exploration is to embrace the entire diversity of network structures and behaviors, and to study their relationships in a systematic way. At the least, one would aim to yield comparable information for larger dimensions or for larger extensions (i.e. extensions that involve more than one additional node at a time). The workaround is to reduce the exploration of quadruplets. There are several ways in which this can be done. In larger dimensions, one way to save computational time is to use Monte Carlo approaches. Another way to reduce the number of quadruplets to compute, while allowing us to explore larger extensions, is to cut thin slices in initial and extended behavior complexities (e.g. 3 slices along initial behavior complexities: small, moderate and large complexities, and 3 slices in extended behaviors, also limited to small, moderate and large complexities). Finally, even the extension of very large specific networks (dozens of nodes) for specific behaviors could be computed in a reasonable time using the same ASP modules. All these reduced exploration methods will be evaluated on the benchmark constituted by our full 2 to 3 dimension exploration. In addition, from the technical point of view, the computation of larger extensions (e.g. dimension 3 to 4) will take advantage of the optimizations realized (modular structure, parallelism, etc.) in our implementation.
We are only beginning to analyse and interpret the results obtained for the 2D to 3D extension, and will soon compute partial data for larger extensions. As an example of what is obtained for the 2D to 3D extension, the 3D histogram in Fig. 3 shows notably that most 2D SBN behaviors are not complex and that most of them, when extended, show behaviors of limited complexity produced by extended networks that are also of limited complexity. It also shows that complex extended behaviors and networks (red points) are mostly obtained from networks of moderate complexity. Although this result seems to point in the right direction, this brief overview must be refined with more complete statistics (e.g., min, max and standard deviation, in addition to average complexity values) of network and behavior complexities. An increase in behavior complexity is not the only way networks can evolve: extended networks can also develop new behaviors (an increase in the number of attractors), or the increased complexity of the networks can reinforce the robustness of their behaviors, i.e. by increasing the size of their basins of attraction. Finally, beyond the theoretical study, the position of the complexity cursor in biological systems is an open question. We expect our work to give clues on how networked systems can or cannot evolve, as they are constrained by the existing conditions and by the necessity to maintain vital functions. A specific article will be dedicated to this analysis and to developing the biological question.
R. Segretain et al.
Fig. 3. 3D histogram of network structure and behavior complexities. Counts of all 777216 $(N_1, S_1, N, S)$ quadruplets obtained by extension are divided into 3D classes of complexities: network $N_1$ complexity along the C(Initial Network $N_1$) axis, initial $S_1$ and extended $S$ behavior complexities along the C(Initial Behavior) and C(Extended Behavior) axes respectively. Point size (in logarithmic scale) denotes the number of networks in a class, while point color corresponds to the average complexity of extended networks $N$ in this class.

Acknowledgements. Sergiu Ivanov is partially supported by the Paris region via the project DIM RFSI n° 2018-03 “Modèles informatiques pour la reprogrammation cellulaire”. The authors would also like to thank the IDEX program of the University Grenoble Alpes for its support through the project COOL; this work is supported by the French National Research Agency in the framework of the Investissements d’Avenir program (ANR-15-IDEX-02). This work is also supported by the Innovation in Strategic Research program of the University Grenoble Alpes. The authors also thank Ibrahim Cheddadi for fruitful discussions.
References
1. Pardo, J., Ivanov, S., Delaplace, F.: Sequential reprogramming of biological network fate. In: Proceedings of the Computational Methods in Systems Biology - 17th International Conference. Lecture Notes in Computer Science (2019). https://doi.org/10.1007/978-3-030-31304-3_2
2. Christen, U., Ivanov, S., Segretain, R., Trilling, L., Glade, N.: On computing structural and behavioral complexities of threshold Boolean networks. Acta Biotheor. 68, 119–138 (2020)
3. Thomas, R.: On the relation between the logical structure of systems and their ability to generate multiple steady states or sustained oscillations. Springer Series in Synergetics, vol. 9, pp. 180–193 (1980)
4. Mendoza, L., Alvarez-Buylla, E.R.: Dynamics of the genetic regulatory network for Arabidopsis thaliana flower morphogenesis. J. Theor. Biol. 193, 307–319 (1998)
5. Delaplace, F., Ivanov, S.: Bisimilar Booleanization of multivalued networks. Biosyst. 197 (2020)
6. Delahaye, J.-P., Zenil, H.: Numerical evaluation of the complexity of short strings: a glance into the innermost structure of algorithmic randomness. Appl. Math. Comput. 219, 63–77 (2012)
7. Soler-Toscano, F., Zenil, H., Delahaye, J.-P., Gauvrit, N.: Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines. PLoS ONE 9(5), e96223 (2014)
8. Segretain, R.: Repository of the inference pipeline in ASP and Java. https://gitlab.com/rsegretain/java-parallel-pipeline
9. Hopfield, J.J.: Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79, 2554–2558 (1982)
10. Glass, L., Kauffman, S.: The logical analysis of continuous, nonlinear biochemical control networks. J. Theor. Biol. 39, 103–129 (1973)
11. Zañudo, J.G.T., Aldana, M., Martínez-Mekler, G.: Boolean threshold networks: virtues and limitations for biological modeling. In: Niiranen, S., Ribeiro, A. (eds.) Information Processing and Biological Systems. Intelligent Systems Reference Library, vol. 11. Springer, Berlin, Heidelberg (2011)
12. Bornholdt, S.: Boolean network models of cellular regulation: prospects and limitations. J. R. Soc. Interface 5, S85–S94 (2008)
13. Tran, V., McCall, M.N., McMurray, H.R., Almudevar, A.: On the underlying assumptions of threshold Boolean networks as a model for genetic regulatory network behavior. Frontiers Gen. 4, 263 (2013)
14. Gebser, M., Kaminski, R., Kaufmann, B., Ostrowski, M., Schaub, T., Schneider, M.: Potassco: the Potsdam answer set solving collection. AI Comm. 24, 107–124 (2011)
15. Ostrowski, M., Schaub, T.: ASP modulo CSP: the clingcon system. Theory and Practice of Logic Programming (2012)
16. Banbara, M., Inoue, K., Kaufmann, B., Okimoto, T., Schaub, T., Soh, T., Tamura, N., Wanko, P.: teaspoon: solving the curriculum-based course timetabling problems with answer set programming. Ann. Oper. Res. 275, 3–37 (2019)
17. Vuong, Q.-T., Chauvin, R., Ivanov, S., Glade, N., Trilling, L.: A logical constraint-based approach to infer and explore diversity and composition in threshold Boolean automaton networks. Studies in Computational Intelligence Series; Proceedings of the Complex Networks 2017 conference (2017). https://doi.org/10.1007/978-3-319-72150-7_46
18. Segretain, R., Ivanov, S., Trilling, L., Glade, N.: Implementation of a computing pipeline for evaluating the extensibility of Boolean networks’ structure and function. bioRxiv 2020.10.02.323949. https://doi.org/10.1101/2020.10.02.323949
NETME: On-the-Fly Knowledge Network Construction from Biomedical Literature

Alessandro Muscolino1, Antonio Di Maria1, Salvatore Alaimo2, Stefano Borzì3, Paolo Ferragina4, Alfredo Ferro2, and Alfredo Pulvirenti2(B)

1 Department of Physics and Astronomy, University of Catania, Catania, Italy [email protected], [email protected]
2 Department of Clinical and Experimental Medicine, University of Catania, Catania, Italy {salvatore.alaimo,alfredo.ferro,alfredo.pulvirenti}@unict.it
3 Department of Maths and Computer Science, University of Catania, Catania, Italy [email protected]
4 Department of Computer Science, University of Pisa, Pisa, Italy [email protected]
Abstract. The huge and daily increasing amount of biological literature is a strategic resource from which knowledge about relations among biological elements can be automatically extracted. Knowledge networks are helpful tools in the context of biological knowledge discovery and modeling. Here we introduce a novel system called NETME, which, starting from a set of full texts obtained from PubMed, through an easy-to-use web interface, interactively extracts a group of biological elements stored in a selected list of ontological databases and then synthesizes a network with inferred relations among such elements. The results show that our tool is capable of efficiently and effectively inferring reliable functional biological networks.

Keywords: Network analysis · Knowledge graph · Text mining

1 Introduction
The increasing amount of scientific literature is posing new challenges for scientists. Identifying the most suitable set of articles dealing with a topic is not a straightforward task, and the chance of missing important references and relevant literature is high. In particular, in research areas like biology or biomedicine, thanks also to fast-track publication journals, the number of published papers increases very quickly. On the other hand, network analysis has become a key enabling technology for understanding the mechanisms of life and living organisms and, in general, for uncovering the underlying fundamental biological processes. Examples of applications include: i) analyzing disease networks to identify disease-causing genes and pathways [1]; ii) discovering the functional interdependence among molecular mechanisms through functional network mining [2]; and iii) network-based inference models with application to drug repurposing [3].

A. Muscolino and A. Di Maria: equal contributors.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 386–397, 2021. https://doi.org/10.1007/978-3-030-65351-4_31

Thanks to the availability of open-access article repositories such as PubMed Central [4], arXiv [5], and bioRxiv [6], as well as ontology databases which hold entities and their relations [7], in the last few years the research community has focused on text mining tools and machine learning algorithms to digest corpora and extract semantic knowledge. Text mining [8] and Natural Language Processing [9] tools employ information extraction methods to translate unstructured textual knowledge into a form that can be easily analyzed and used to build functional networks or knowledge graphs [2,10,11]. This technology makes it possible to infer putative relations among molecules, such as how proteins interact with each other or which gene mutations are involved in a disease. In the context of biology and biomedicine, the Biological Expression Language (BEL) [12] and the Resource Description Framework (RDF) [13] have been widely applied to convert a text into semantic triplets of the form ⟨subject, predicate, object⟩. The subject and object represent biological elements, whereas the predicate represents a logical or physical relationship between them [2,14].

However, the implementation of biological text mining tools requires highly specialized skills in natural language processing and information retrieval. Therefore, several ecosystems and tools have been implemented and made available to the bio-science community. Relevant tools include PubAnnotation [15], a public resource for sharing annotated biomedical texts based on the “Agile text mining” concept, and PubTator (PTC) [16], a web service for viewing and retrieving bioconcept annotations (for genes/proteins, genetic variants, diseases, chemicals, species and cell lines) in full-text biomedical articles.
This latter tool annotates all PubMed abstracts and more than three million full texts. The annotations are downloadable in multiple formats (XML, JSON and tab delimited) through the online interface, a RESTful web service and bulk FTP. SemRep [17] extracts relationships from biomedical sentences in PubMed articles by mapping textual content to an ontology which represents its meaning. To establish the binding relation, SemRep relies on internal rules (called “indicator rules”) which map syntactic elements, such as verbs, prepositions, and nominalizations, to predicates in the Semantic Network. In [18] the authors propose a minimum-supervision approach for knowledge graph construction based on 24,687 unstructured biomedical abstracts. Their pipeline includes entity recognition, unsupervised entity and relation embedding, latent relation generation via clustering, relation refinement, and relation assignment to assign cluster-level labels. The proposed framework is able to extract 16,192 structured facts with high precision. Hetionet [3] is a heterogeneous network of biomedical knowledge that unifies data from a collection of several available databases and millions of publications. In addition, its edges are extracted from omics-scale resources and consolidated through multiple studies or resources.
In this paper we present NETME, a web-based app (publicly available at http://netme.atlas.dmi.unict.it/netme/) which is capable of extracting knowledge from a collection of full-text documents. The tool orchestrates two different technologies which enable the inference of biological networks:
– A customized version of the entity linker TAGME [19], named OntoTAGME, which is designed to run on a set of biological ontology knowledge bases (such as GeneOntology [20], DrugBank [21], DisGeNET [22], and Obofoundry [23]) and extracts biological entities (i.e. genes, drugs, diseases) from a collection of full-text articles obtained through a query to PubMed. These biological entities represent the nodes of the network;
– Next, OntoTAGME derives relations (edges) among these nodes by using a software module built upon the NLTK library. More precisely, an edge (i, j) between nodes i and j expresses a biological relation (in the form of: the drug i treats the disease j, the gene x is involved in the disease j, the gene x regulates the gene y). Edges are weighted according to the frequency of appearance of their corresponding relation in the chosen list of full-text articles.

Inferred networks of this kind are particularly useful in biomedicine, where it is important to understand the difference between various components and mechanisms, such as genes and diseases, and their relations, such as up-regulation and binding. The tool therefore helps to quickly identify putative relations among the biological entities under investigation, based on their occurrences and mentions in PubMed articles. To the authors’ knowledge, NETME is the first tool which makes it possible to interactively synthesize biological knowledge graphs on-the-fly starting from a query on PubMed. Section 3 reports a preliminary comparison between a knowledge graph obtained with NETME and a network extracted from Hetionet [3], using SemRep [17] as ground truth. This comparison indicates the robustness and efficacy of our approach in terms of Precision and Recall.

The paper is organized as follows. Section 2 describes the system. Subsection 2.1 introduces OntoTAGME, whereas Subsection 2.2 gives the details of the network edge inference model. Section 3 provides the technical details of the back-end and the front-end of NETME, together with a preliminary experimental analysis. Section 4 ends the paper and sketches future research directions.
2 The NETME Model
Knowledge graphs portray, in a systematic way, data and information representing common knowledge. A knowledge graph $G = (V, E)$ consists of an entity set $V$ and an ontology relation set $E$. Each entity can be obtained from several structured knowledge bases such as ontologies $(O_1, O_2, \dots, O_k)$. Each relation represents a connection among entities of an ontology. Knowledge graphs also enable inter-ontology relations; therefore, there might exist a relation $e = (v_1, v_2) \in E$ where $v_1 \in O_i$ and $v_2 \in O_j$ with $i \neq j$. Thus, for any ontology graph $O_i$ we can state that $O_i \subseteq G$ for each $1 \le i \le k$.

NETME builds biomedical knowledge sub-networks starting from a set of publications obtained through a PubMed query. The network contains nodes representing genes, diseases, and drugs, and edges among them describing possible relationships. In Fig. 1 we outline the architecture of our system.
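The definition above can be made concrete with a small data-structure sketch. It is only an illustration under our own assumptions: the class names, the entity identifiers and their assignment to ontologies are hypothetical and are not taken from the NETME implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    ident: str     # identifier of the entity inside its ontology (hypothetical)
    ontology: str  # name of the source ontology O_i


@dataclass
class KnowledgeGraph:
    entities: set = field(default_factory=set)   # the entity set V
    relations: set = field(default_factory=set)  # the relation set E, as (v1, v2)

    def add_relation(self, v1, v2):
        """Insert the relation e = (v1, v2), adding both endpoints to V."""
        self.entities.update((v1, v2))
        self.relations.add((v1, v2))

    def inter_ontology_relations(self):
        """Relations whose endpoints come from two different ontologies."""
        return {(a, b) for a, b in self.relations if a.ontology != b.ontology}


# Illustrative entities; the ontology assignment is an assumption.
dasatinib = Entity("dasatinib", "DrugBank")
src = Entity("SRC", "DisGeNET")
egfr = Entity("EGFR", "DisGeNET")

kg = KnowledgeGraph()
kg.add_relation(dasatinib, src)  # inter-ontology relation (i != j)
kg.add_relation(src, egfr)       # relation inside a single ontology

print(len(kg.entities), len(kg.relations))
print(kg.inter_ontology_relations())
```

Storing the source ontology on each entity keeps the condition $i \neq j$ checkable with a single comparison per relation.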
Fig. 1. NETME pipeline architecture.
Once the user provides a list of search keywords, the tool generates the network in two phases: i) first, a customized version of the TAGME [19] annotation tool named OntoTAGME, which uses the ontology datasets GeneOntology, DisGeNET, DrugBank and Obofoundry, is run on all full texts to extract a set of entities which will be the nodes of the network; ii) next, an NLP model, based on the NLTK library [24], is executed to infer the relations among nodes belonging to the same or neighbouring sentences. The final network contains both directed and undirected edges according to the predictions made by the model. At the end of the process the network is rendered through Cytoscape JS. In the following paragraphs we provide the details of the two phases.

2.1 OntoTAGME: Ontology on Top of TAGME
TAGME [19] identifies short sequences of significant words (spots) in unstructured text and links them to relevant Wikipedia pages. It annotates brief unstructured texts such as snippets of search-engine results, tweets, news, etc. This kind of annotation is quite informative and is a step forward compared to the bag-of-words paradigm. Furthermore, it provides more details than descriptive-link text enrichment, since it tries to infer relations among annotations. Each detected spot is characterized by a direct reference to a Wikipedia page and a value $\rho \in [0, 1]$ that estimates the “goodness” of the annotation compared to other entities of the input text.

Here we introduce OntoTAGME, a vertical, customized version of TAGME which uses biological and biomedical ontology databases suitable for annotating text coming from scientific papers. The annotation process has been modified to use the following databases: GeneOntology [20], DrugBank [21], DisGeNET [22], and Obofoundry [23]. Replacing the Wikipedia corpus with topic-specific ontology databases yields more accurate and domain-related knowledge, with both reduced annotation errors and fewer disambiguation problems. We developed a two-step procedure to integrate such databases, converting a generic ontology into a Wikipedia-like database. First, each biological element within the downloaded database or ontology is transformed into an XML file containing the unique ID of the biological element assigned by our customized version, its name (title), type (category), and other details such as abstract and body of the page. These elements, called pages, are then transformed into tuples ⟨uniqueID, title, category, ...⟩ and stored within an SQL table (i.e. “wiki-latest-page”). Since an element j could have several linked pages “LPs” (e.g. DOID:0002116 is a DOID:10124) or redirect pages “RPs” (e.g. GO:0006200 replaced by GO:0016887), we generated a tuple ⟨uniqueID_j, uniqueID_k⟩ for each element k belonging to LPs and a tuple ⟨uniqueID_j, uniqueID_i⟩ for each element i belonging to RPs. These tuples are stored in the tables “wiki-latest-pagelinks” and “wiki-latest-redirect”, respectively. Next, the SQL tables and XML pages are converted into nodes and edges which can be manipulated by OntoTAGME.

2.1.1 Ontology Databases

Below we report a brief description of the ontology databases used for the annotation process.

Gene Ontology. Gene Ontology (GO) [20] is a project conceived to unify the description of gene product characteristics in all species. It contains more than 44 thousand GO terms, 8 million annotations, 1.5 million gene products and nearly 5 thousand species. In particular, the project aims to: (i) maintain and develop a controlled vocabulary to describe genes and gene products for every living organism; (ii) record genes and gene products, and disseminate such data; (iii) provide tools for easy access to data provided by the project. Gene Ontology is part of a larger classification project, the Open Biomedical Ontologies (OBO).

DrugBank. DrugBank [21] contains detailed drug data and comprehensive drug-target information. It also uses and links Wikipedia content; furthermore, Wikipedia often links to DrugBank. The DrugBank database used in this project is released (v5.1) as a 1.5 GB XML file which contains 13,367 drug entries, including 2,611 approved small molecule drugs, 1,300 approved biotech (protein/peptide) drugs, 130 nutraceuticals and over 6,315 experimental drugs. Additionally, 5,155 non-redundant protein (i.e. drug target/enzyme/transporter/carrier) sequences are linked to these drug entries. Each entry contains more than 200 data fields, with half of the information being devoted to drug/chemical data and the other half devoted to drug target or protein data.
DisGeNET. DisGeNET [22] is a platform containing one of the largest publicly available collections of genes and variants associated with human diseases. It integrates data from scientific literature, GWAS catalogues, expert-curated repositories and animal models. Additionally, several original metrics are provided to assist the prioritization of genotype–phenotype relationships. DisGeNET releases two types of databases, Gene-Disease Associations and Variant-Disease Associations.

OBO Foundry. The Open Biological and Biomedical Ontology (OBO) Foundry [23] is a collective of ontology developers committed to collaboration and adherence to shared principles. The OBO Foundry mission is to develop a family of inter-operable ontologies that are both logically well-formed and scientifically accurate. To achieve this, OBO Foundry participants voluntarily adhere to and contribute to the development of an evolving set of principles including open use, collaborative development, non-overlapping and strictly-scoped content, and common syntax and relations, based on ontology models that work well, such as the Gene Ontology (GO).

2.2 Network Edge Inference
Once network nodes have been deduced, the textual analysis step infers edges between them. The analysis is performed through a methodology developed on top of the Python library NLTK [24]. Sentences have an internal organization that can be represented using a tree. Solving a syntax analysis problem for a sentence means looking for predefined syntactic forms, which, like a tree, branch out from the single words. The main syntactic form is the sentence (S), which contains noun phrases (NP) or verb phrases (VP) that are formed by further elementary syntactic forms such as nouns (N), verbs (V), determiners (DET), etc. (see Table 1).

Table 1. List of syntactic categories

Symbol   Meaning                Example
S        Sentence               the man walked
NP       Noun phrase            a dog
VP       Verb phrase            saw a park
PP       Prepositional phrase   with a telescope
DET      Determiner             the
N        Noun                   dog
V        Verb                   walked
P        Preposition            in
Commonly, two approaches are available for the syntactic analysis of sentences: top-down and bottom-up. In the top-down approach, the syntactic form of a sentence is determined before parsing by using a grammar to predict the input. This allows the prediction of phrase constituents, which can be recursively analyzed until the smallest units are reached. This approach, however, could lead to infinite loops. Furthermore, the backtracking process could discard analyzed components that will later be reconstructed. In the bottom-up parsing approach, the objects represented by the smallest units of text are recognized first, together with their syntactic class. Next, the parser tries to find sequences of words and phrases that correspond to the right-hand side of a grammar production and replaces them with the left-hand side, until the whole sentence is reduced to an S. This approach does not guarantee that a valid parse will be found for the input even if one exists. Moreover, no check is performed to establish whether the derived entity is globally consistent with the grammar.

Our approach is based on a left-corner parser, which integrates both the bottom-up and the top-down approaches. First, a left-corner parser pre-processes the context-free grammar to build a table where each row contains two cells. The first one holds a non-terminal category, and the second one holds the collection of possible left corners of that non-terminal; for example, for a grammar production like S → NP VP, we store S as the non-terminal category and NP as a possible left corner. Next, it parses each phrase towards higher syntactic forms, filtering the results starting from the smallest text units. This approach should overcome the shortcomings of both parsing methodologies. The procedure yields all parts of speech (POS) of a sentence (S) that are nouns (N) linked through verbal relations (V). The final network is built by collecting all annotations given by OntoTAGME (network nodes). Such annotations are then compared with the elements identified in the phrase, in order to infer whether any edges should connect two entities.
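The table-building step of a left-corner parser can be sketched as follows. This is a minimal illustration, not the NETME code: the toy grammar over the categories of Table 1 is our own assumption, and the transitive closure is a common extension that the text does not explicitly describe.

```python
from collections import defaultdict

# Toy context-free grammar (illustrative only): each production is
# (left-hand side, right-hand side symbols).
GRAMMAR = [
    ("S", ["NP", "VP"]),
    ("NP", ["DET", "N"]),
    ("NP", ["N"]),
    ("VP", ["V", "NP"]),
    ("VP", ["V"]),
    ("PP", ["P", "NP"]),
]


def direct_left_corners(grammar):
    """Map each non-terminal to the first symbols of its productions."""
    table = defaultdict(set)
    for lhs, rhs in grammar:
        table[lhs].add(rhs[0])
    return dict(table)


def transitive_left_corners(grammar):
    """Extend the table so that left corners of left corners are included."""
    table = {lhs: set(c) for lhs, c in direct_left_corners(grammar).items()}
    changed = True
    while changed:
        changed = False
        for corners in table.values():
            for corner in list(corners):
                extra = table.get(corner, set()) - corners
                if extra:
                    corners |= extra
                    changed = True
    return table


for category, corners in sorted(direct_left_corners(GRAMMAR).items()):
    print(category, sorted(corners))
print("S (transitive):", sorted(transitive_left_corners(GRAMMAR)["S"]))
```

For the production S → NP VP the direct table stores NP as the left corner of S, as in the example given in the text; the transitive variant additionally records that a sentence may start with a DET or an N.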
We employ a regular expression pattern in each period to look for potential interactions. More precisely, through a PubMed search, we obtain a list of $n$ full-text papers $[p_1, p_2, \dots, p_n]$, which are divided into grammatical periods. Therefore, given an article $p_i$ we derive a set of $m_i$ sentences $s_{i,1}, s_{i,2}, \dots, s_{i,m_i}$. Then, for each sentence, we employ OntoTAGME to compute a set of spots $n_1, \dots, n_z$. These spots will be the nodes in our final network. To define the edges in such a network, we tokenize all sentences and perform a syntactic analysis. This analysis yields, for each sentence $s_{i,j}$ of a paper $p_i$, a set of labelled tokens $lt_{i,j,1}, lt_{i,j,2}, \dots, lt_{i,j,k_i}$, where each token is a pair (token, POS) and POS is selected from the list in Table 1. Irrelevant tokens (stop-words, URLs, etc.) are then removed, keeping only useful verb forms and the spots extracted through the annotation procedure. The remaining tokens are stored in a list together with end-of-period (EOP) markers. For each pair of annotated spots, through the regular expression pattern [?=\b{node1}\b(.*?)\b{node2}\b] we associate edges to nodes and put them in an edge list. A further filtering procedure is performed on this list by employing a dictionary of biological verb forms. Finally, each element contained in the list is marked as a network edge and labelled using the token.

In our final network, each edge $e = (a, b)$ is weighted as
$$w(e) = \frac{1}{z} \sum_{k=1}^{z} e_k,$$
where $z$ is the number of sentences (in all the papers) in which an edge between $a$ and $b$ is reported and $e_k$ is the number of edges between $a$ and $b$ in the $k$-th sentence. Indeed, nodes $a$ and $b$ may be connected more than once in the same sentence. In Fig. 2, we report an example of annotation.
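A minimal sketch of the edge weighting is given below. It assumes that the weight is the mean of the per-sentence counts over the sentences that report the edge, which is our reading of the formula; the function name and the input layout (one list of node pairs per sentence) are hypothetical.

```python
from collections import Counter


def edge_weights(sentence_edges):
    """Average number of occurrences of each edge per reporting sentence.

    sentence_edges: one list per sentence, each holding (a, b) node pairs;
    a pair may appear several times within the same sentence.
    """
    total_occurrences = Counter()    # sum over k of e_k, per edge
    reporting_sentences = Counter()  # z, per edge
    for edges in sentence_edges:
        for pair, e_k in Counter(edges).items():
            total_occurrences[pair] += e_k
            reporting_sentences[pair] += 1
    return {
        pair: total_occurrences[pair] / reporting_sentences[pair]
        for pair in total_occurrences
    }


# Illustrative input: EGR1-PTEN is reported once in the first sentence and
# twice in the second; SRC-EGFR is reported once in the third.
sentences = [
    [("EGR1", "PTEN")],
    [("EGR1", "PTEN"), ("EGR1", "PTEN")],
    [("SRC", "EGFR")],
]
weights = edge_weights(sentences)
print(weights)
```

Here the pair (EGR1, PTEN) is reported in two sentences with counts 1 and 2, so its weight is 1.5 under the assumed reading.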
Fig. 2. NETME example of annotation of the sentence “[...] EGR1 upregulates PTEN following treatment with the phosphatase inhibitor calyculin A, which might contribute to the apoptotic effects of this agent. EGR1 levels strongly correlate with PTEN levels in a cohort of non-small cell lung cancer tumors. [...]”. Through the annotation system we detect the spots [“EGR1”, “PTEN”]; after syntactic analysis and noise reduction we have a list of ordered and labeled tokens: [“egr1”, “upregulates”, “pten”, “follow”, “contribute”, “.”, “egr1”, “correlate”, “pten”]. We detect two valid edges: [“EGR1”, “upregulates”, “PTEN”] and [“EGR1”, “correlate”, “PTEN”].
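The example of Fig. 2 can be reproduced with a short sketch of the pattern-based edge search. The code is a simplified illustration rather than the NETME implementation: the verb dictionary BIO_VERBS and the function names are hypothetical, the pattern is written as a lookahead (our assumption about the intended form), and every dictionary verb found between two spots is emitted as a separate labelled edge.

```python
import re

# Hypothetical dictionary of biological verb forms used for the final filter.
BIO_VERBS = {"upregulates", "correlate", "activate", "inhibit", "binds", "regulates"}
EOP = "."  # end-of-period marker kept in the token list


def split_periods(tokens):
    """Split the filtered token list into periods at the EOP markers."""
    periods, current = [], []
    for token in tokens:
        if token == EOP:
            if current:
                periods.append(current)
            current = []
        else:
            current.append(token)
    if current:
        periods.append(current)
    return periods


def extract_edges(tokens, spots):
    """Return (source, verb, target) triples found inside each period."""
    edges = []
    for period in split_periods(tokens):
        text = " ".join(period)
        for source in spots:
            for target in spots:
                if source == target:
                    continue
                pattern = re.compile(
                    rf"(?=\b{re.escape(source.lower())}\b(.*?)"
                    rf"\b{re.escape(target.lower())}\b)"
                )
                for match in pattern.finditer(text):
                    for word in match.group(1).split():
                        if word in BIO_VERBS:
                            edges.append((source, word, target))
    return edges


tokens = ["egr1", "upregulates", "pten", "follow", "contribute", ".",
          "egr1", "correlate", "pten"]
edges = extract_edges(tokens, ["EGR1", "PTEN"])
print(edges)
```

Splitting at the EOP markers before matching keeps a spot at the end of one period from being paired with a spot at the start of the next one.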
3 The Annotation Tool
NETME consists of a front-end developed in PHP and JavaScript. Network rendering is performed with the CytoscapeJS library [25]. The application back-end, integrating OntoTAGME, is written in Java, with the support of the Python NLTK library [24] for the NLP module. PubMed search is performed with the Entrez Programming Utilities [26], a set of server-side programs providing a stable interface to the Entrez database and query system at the National Center for Biotechnology Information (NCBI).

NETME is equipped with an easy-to-use web interface providing two major functions (see Fig. 3): (i) query-based network annotation, and (ii) user-provided free-text network annotation. In the query-based network annotation, the user provides a list of keywords, which are employed to run a query on PubMed. The top resulting papers are retrieved and the network inference procedure is performed. Several parameters can be defined by the user: (i) the TAGME ρ, to increase or reduce the number of spots, (ii) the maximum number of network nodes, (iii) the number of top articles to retrieve from PubMed, and (iv) the criterion used to sort papers (relevance or date). In the user-provided free-text network annotation, a text is provided directly by the user, and the network inference procedure is applied directly to it. Once the user starts the analysis, the directed network is generated. Furthermore, all inference details are provided in three tables containing (i) the list of extracted papers, (ii) the list of annotations, and (iii) the list of edges together with their weights.
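How the user parameters could map onto an Entrez query can be sketched as below. The sketch only builds the request URL for the ESearch utility and sends nothing; the function name, the AND-combination of keywords and the mapping of the “date” criterion to the `pub_date` sort value are our assumptions, not details given by the authors.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base URL of the ESearch utility (part of the Entrez Programming Utilities).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Assumed mapping from the two sorting criteria named in the text
# to ESearch "sort" values for the PubMed database.
SORT_VALUES = {"relevance": "relevance", "date": "pub_date"}


def build_esearch_url(keywords, top_n=20, sort_by="relevance", db="pubmed"):
    """Build (but do not send) an ESearch request for the given keywords."""
    params = {
        "db": db,
        "term": " AND ".join(keywords),  # assumption: keywords are AND-ed
        "retmax": top_n,                 # number of top articles to retrieve
        "sort": SORT_VALUES[sort_by],
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    url = build_esearch_url(["SRC"], top_n=20, sort_by="relevance")
    print(url)
    print(parse_qs(urlsplit(url).query))
```

The identifiers returned by such a query would then have to be resolved to full texts before annotation, a step that is outside the scope of this sketch.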
Fig. 3. NETME web interface in (a), generated network in (b)
In Fig. 4 we show a case study application of NETME. We built a network by using the query “SRC” (Proto-oncogene tyrosine-protein kinase). We chose to build a network of 20 nodes from the top 20 PubMed articles, with ρ set to 0.3; these settings are suitable for an interactive test. We can observe several interesting edges between the nodes. For each node having the gene SRC as source or destination we performed a comparison with HetioNet. The results are reported in Table 3.

Table 2. NETME performance results

                      NETME    HetioNet   SemRep (ground truth)
Detected nodes        200      112        189
Valid nodes (TP)      123      61         189
Wrong nodes (FP)      77       51         -
Missing nodes (FN)    66       128        -
Precision             61.5%    54.5%      -
Recall                65.1%    32.3%      -

Detected edges        495      56         292
Valid edges (TP)      178      34         292
Wrong edges (FP)      317      22         -
Missing edges (FN)    114      258        -
Precision             56.2%    60.7%      -
Recall                61%      11.6%      -

For a more systematic comparison using SemRep as ground truth, we built a network of 200 nodes from the top 20 PubMed articles using the previous query, with ρ again set to 0.3. Results from SemRep were obtained using the same list of PubMed articles obtained with our query. For each node and edge
Table 3. List of nodes and labeled edges connected with SRC
Nodes    Incoming edges’ labels (NETME, HetioNet)    Outgoing edges’ labels (NETME, HetioNet)
Artesunate Activate CSK
regulates
DLC1
interact, Reactivate
regulates, cause Interact
regulates, interact regulate activate
interact
TNF
Activate
CAV1
Includes, binds
interact
interact
interact
regulates
EGFR
Associate, include
interact
includes
regulates
SGK1
Decrease
RHOA
Regulates
regulates
Dasatinib
Binds
binds
CASP8
Inhibit
EGF
Upregulates
Nsclc
Treat
FYN
Activate
regulates interact inhibit
binds
activate
interact
regulates activate activate, include interact
interact
in SemRep results, we checked whether NETME was able to infer it. The same analysis was done using HetioNet. In this analysis edge labels were ignored. Finally, all comparisons were performed in terms of Precision and Recall. All results are shown in Table 2. Our analysis, although preliminary, shows that NETME has a precision comparable to that of HetioNet for inferred nodes, and a higher recall. Concerning inferred edges, HetioNet precision is slightly higher, although NETME recall is much higher.
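The Precision and Recall figures can be recomputed from the TP, FP and FN counts of Table 2 with the standard definitions. The sketch below is ours: the function name and the dictionary layout are hypothetical, while the counts are copied from the table.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: P = TP / (TP + FP), R = TP / (TP + FN)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# (TP, FP, FN) counts as listed in Table 2, with SemRep as ground truth.
TABLE2_COUNTS = {
    ("NETME", "nodes"): (123, 77, 66),
    ("HetioNet", "nodes"): (61, 51, 128),
    ("NETME", "edges"): (178, 317, 114),
    ("HetioNet", "edges"): (34, 22, 258),
}

for (tool, kind), counts in TABLE2_COUNTS.items():
    p, r = precision_recall(*counts)
    print(f"{tool:9s} {kind:6s} precision={p:.1%} recall={r:.1%}")
```

With these counts the definitions reproduce all reported node figures and all reported recalls. For NETME edges they give a precision of 178/495 ≈ 36.0%, whereas Table 2 lists 56.2%, a value that coincides with 178/317 (TP divided by FP), so the edge precision comparison should be read with this in mind.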
Fig. 4. A sample of NETME with Proto-oncogene tyrosine-protein kinase (Src). The partial list of edges between the nodes is reported in Table 3
4 Conclusions
In this paper we introduced NETME, a novel approach for the inference of knowledge graphs from a collection of full-text articles gathered through a PubMed query. It builds upon a customized version of TAGME, called OntoTAGME, in connection with a syntactic analysis module based on the NLTK library. Our preliminary results highlight that NETME is able to quickly identify reliable relations among molecules in specific research topics. However, networks are generated on-the-fly starting from partial knowledge based on a limited list of extracted papers. A further limitation is that NETME works on data extracted from open-access papers of PubMed Central; hence, relevant literature related to a topic could be missed. Future work will include the construction of labeled networks on a user-provided set of papers, the extension of the ontology data through more databases, and the temporary storage of networks for fast computation of updates.
References
1. Barabási, A., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12(1), 56–68 (2010)
2. Szklarczyk, D., Morris, J.H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N.T., Roth, A., Bork, P., Jensen, L.J., Von Mering, C.: The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Res. 45(D1), D362–D368 (2016)
3. Scott Himmelstein, D., Lizee, A., Hessler, C., Brueggeman, L., Chen, S.L., Hadley, D., Green, A., Khankhanian, P., Baranzini, S.E.: Systematic integration of biomedical knowledge prioritizes drugs for repurposing. eLife 6 (2017)
4. Beck, J.: Report from the field: PubMed central, an XML-based archive of life sciences journal articles. In: Proceedings of the International Symposium on XML for the Long Haul: Issues in the Long-term Preservation of XML, Mulberry Technologies, Inc
5. Ginsparg P.: arXiv. https://arxiv.org
6. bioRxiv. https://www.biorxiv.org/
7. Lambrix, P., Tan, H., Jakoniene, V., Strömbäck, L.: Biological ontologies. In: Semantic Web, pp. 85–99. Springer (2010)
8. Cohen, A.M.: A survey of current work in biomedical text mining. Brief. Bioinform. 6(1), 57–71 (2005)
9. Krallinger, M., Erhardt, R.A., Valencia, A.: Text-mining approaches in molecular biology and biomedicine. Drug Discovery Today 10(6), 439–445 (2005)
10. Dörpinghaus, J., Apke, A., Lage-Rupprecht, V., Stefan, A.: Data exploration and validation on dense knowledge graphs for biomedical research (2019)
11. Nicholson, D.N., Greene, C.S.: Constructing knowledge graphs and their biomedical applications. Comput. Struc. Biotechnol. J. 18, 1414–1428 (2020)
12. Slater, T.: Recent advances in modeling languages for pathway maps and computable biological networks. Drug Discovery Today 19(2), 193–198 (2014)
13. McBride, B.: The resource description framework (RDF) and its vocabulary description language RDFS. In: Handbook on Ontologies, pp. 51–65. Springer, Heidelberg (2004)
14. Himmelstein, D.S., Baranzini, S.E.: Heterogeneous network edge prediction: a data integration approach to prioritize disease-associated genes. PLOS Comput. Biol. 11(7), e1004259 (2015)
15. Kim, J., Wang, Y., Fujiwara, T., Okuda, S., Callahan, T.J., Cohen, K.B.: Open agile text mining for bioinformatics: the PubAnnotation ecosystem. Bioinformatics 35(21), 4372–4380 (2019)
16. Wei, C., Allot, A., Leaman, R., Lu, Z.: PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 47(W1), W587–W593 (2019)
17. Rindflesch, T.C., Fiszman, M.: The interaction of domain knowledge and linguistic structure in natural language processing: interpreting hypernymic propositions in biomedical text. J. Biomed. Inform. 36(6), 462–477 (2003)
18. Yuan, J., Jin, Z., Guo, H., Jin, H., Zhang, X., Smith, T., Luo, J.: Constructing biomedical domain-specific knowledge graph with minimum supervision. Knowl. Inf. Syst. 62(1), 317–336 (2019)
19. Ferragina, P., Scaiella, U.: TAGME. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management - CIKM 2010. ACM Press (2010)
20. Gene Ontology Consortium: The gene ontology (GO) database and informatics resource. Nucleic Acids Res. 32(90001), 258D–261 (2004)
21. Wishart, D.S., Feunang, Y.D., Guo, A.C., Lo, E.J., Marcu, A., Grant, J.R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., Assempour, N., Iynkkaran, I., Liu, Y., Maciejewski, A., Gale, N., Wilson, A., Chin, L., Cummings, R., Le, D., Pon, A., Knox, C., Wilson, M.: DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46(D1), D1074–D1082 (2017)
22. Piñero, J., Ramírez-Anguita, J.M., Saüch-Pitarch, J., Ronzano, F., Centeno, E., Sanz, F.I., Furlong, L.: The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. (2019)
23. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J., Leontis, N., Rocca-Serra, P., Ruttenberg, A., Sansone, S., Scheuermann, R.H., Shah, N., Whetzel, P.L., Lewis, S.: The OBO foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 25(11), 1251–1255 (2007)
24. Loper, E., Bird, S.: NLTK. In: Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics. Association for Computational Linguistics (2002)
25. Franz, M., Lopes, C.T., Huck, G., Dong, Y., Sumer, O., Bader, G.D.: Cytoscape.js: a graph theory library for visualisation and analysis. Bioinformatics, btv557 (2015)
26. Sayers, E.: Entrez Programming Utilities Help. https://www.ncbi.nlm.nih.gov/books/NBK25501/
Statistics of Growing Chemical Network Originating from One Molecule Species and Activated by Low-Temperature Plasma

Yasutaka Mizui1, Shigeyuki Miyagi1,2, and Osamu Sakai1,2(B)

1 Department of Electronic Systems Engineering, The University of Shiga Prefecture, Hassaka-cho 2500, Hikone, Shiga 522-8533, Japan
[email protected]
2 Regional ICT Research Center for Human, Industry and Future, The University of Shiga Prefecture, Hassaka-cho 2500, Hikone, Shiga 522-8533, Japan
Abstract. Chemistry in plasma is complicated because many reactions proceed in parallel and in series. A complex network is suitable for visualizing and analyzing this complexity. A numerical calculation based on hundreds of rate equations is a typical tool for plasma chemistry, but such a computational process does not clarify the underlying physical and chemical properties that stabilize many industrial plasma processes for a number of applications. In this study, we focus on low-temperature plasma, in which high-energy electrons are activators for chemical reactions, and investigate the origin of the stability by examining the statistical properties of networks for silane (SiH4) plasma. There is only one seed species in the initial space, SiH4, which is surrounded by high-energy electrons. SiH4 is decomposed into several fragments composed of Si and/or H atoms with possible charges, and such radical and ion species are decomposed or synthesized into other species, leading to the formation of temporal reaction networks in chemistry. Taking into account the rate constants that determine chemical reaction rates, we create temporal networks and observe preferential attachments that induce a new reaction in a transient state. The centrality indices for participant species and degree distributions reveal what is occurring in this complex system, and during the sequential process we observe an exponential-tail degree distribution, which is a significant source of reaction stability.
Keywords: Temporal network · Preferential attachment · Plasma chemistry

1 Introduction
A chemical reaction space activated by plasma is highly complex because plasma, a high-energy state containing charged particles, simultaneously enhances many chemical reactions [1–3]. By keeping the neutral gas temperature at a low level that is similar to the surrounding room temperature, we can exclude the effects of thermal energy in equilibrium [4] and focus on the activities of electrons e−, whose temperature typically exceeds 1 eV [1] (approximately 10,000 K, although the energy of the positive/negative ions resembles that of neutral particles due to frequent elastic collisions). Electrons with temperatures over 1 eV include a high-energy fraction in the tail of their Boltzmann distribution and consequently decompose molecules and radicals that contain multiple atoms into smaller ones; many threshold energies for the decomposition of such species range from a few eV to a few tens of eV [3]. The resultant smaller particles coalesce into heavy particles that differ from the seed species [5,6]. Such successive reactions in series and in parallel are summarized in a reaction network that displays reactants, intermediates, and products [7,8].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 398–409, 2021. https://doi.org/10.1007/978-3-030-65351-4_32

The transient and dynamic formation of multiple species gives rise to temporal networks, which have been extensively examined recently [6,9–13]. In a temporal network, intermittent edges basically hinder the information flow by disrupting paths that have begun to fade. However, such temporality sometimes enhances the flow's controllability; by analogy, ion channels in a cell membrane control ion transport by changing the membrane potential, leading to the sustainable activities of life [14]. In particular, for the growing dynamics of complex networks, one prominent understanding is summarized as the Barabási-Albert model [15]. It includes two elements: growth and preferential attachment, in which a new node joining a network attaches with a higher probability to an existing node with more linked edges. Preferential attachment frequently occurs together with other properties recognized in complex network science.
The particular focus of this study is that the starting seed of a complex network is one simple molecule species, SiH4, and energetic electrons in plasma trigger the network's successive growth [5]. All the species considered in this study are composed of Si and H atoms, except e−. The growth of the reaction network, in which nodes are species and edges are chemical reactions, is completely governed by chemical principles. We restrict our analysis to particles smaller than Si4Hm with m less than 10, which corresponds to the initial polymerization stage of Si-based macromolecules. Despite such strong constraints, a certain statistical property emerges in the growing graph, and part of the observed features is shared with other networks, such as the metabolic network in a biological cell [16].

In Sect. 2, we describe how we set up our model of growing chemical networks of SiH4. In Sect. 3, the statistics of various graphs and their features are addressed. Section 4 compares experimental and numerical results and assesses the validity of our model. Although the phenomena investigated here are rarely observed in experiments due to their very rapid growth, such growth may include a key feature that stabilizes this plasma chemical system. Our analytical procedure reveals a growth mechanism similar to preferential attachment, and the macroscopic property resembles a scale-free network with some anomalies. The difference from the Barabási-Albert model is the multiple attachments of new nodes at one step, and it is worthwhile to clarify the ongoing processes for this growing network and its statistics.

Table 1. Statistical data of growing networks for SiH4 plasma.

Step               Nodes   Directed edges   Reactions
Before reactions   3       -                -
1                  11      28               6
2                  20      144              32
3                  30      216              52
4                  38      359              92
In steady state    56      749              185

Fig. 1. Partial views of first and second steps of reactions with corresponding reactants and products at each step. Notation “(v1,3)” etc. indicates vibrationally excited state of molecule.
2 Model and Topology of Growing Network
As described in our previous reports [7,8,17], visualizing and analyzing a graph composed of chemical species and reactions in a plasma-enhanced chemical space works well and reveals several novel aspects that other methods, such as the numerical computation of rate equations [5,18–22], cannot. Since many reactions in plasma chemistry are irreversible, a directed graph is suitable for display. Edges are created from one reaction equation as follows. When a chemical reaction is written as A + B → C + D, four directed edges are created: from node A to C, from A to D, from B to C, and from B to D. Our previous studies considered all the reactions listed in published papers, which included the data of every possible reaction, to calculate the centrality indices of the species. In other words, our previous analysis is valid for a steady-state reaction space. Although no information is missing from the listed data, such an analysis cannot capture the temporal effects of reactions whose rates vary by several orders of magnitude from case to case; since some reactions are less frequent, we can exclude them from our model.

To select reactions that are significant in a real reactor in an industrial fabrication process, we compare their rate constants with the residence time t_r, which is the time each species spends in a smooth gas flow through a hypothetical reaction chamber whose parameters are typical of a practical reactor. Specifically, we set reaction volume v to 3.14 l (3.14 × 10^-3 m^3), gas flow rate f to 0.85 l/min (1.4 × 10^-5 m^3 s^-1), and gas pressure p to 27.5 Pa. In this case, t_r = (p/101,325) v/f = 0.060 s, and the seed gas density is 4.0 × 10^21 m^-3 for a temperature of 500 K. Next we compare t_r with the rate constant of each reaction. When reaction A + B → C + D has rate constant k [m^3/s], a rough estimate of the saturation span for a given ongoing reaction is possible by

t_r > 1/(k[A]) ≡ τ_s    (1)
assuming that the density [A] of species A is larger than that of species B and that [A] is constant. In the first step of network growth, the reactants are SiH4, e−, and H+. After selecting the reactions that satisfy Eq. (1) from the information in Refs. [5,19,20] and Tables 2 and 3, we can deduce the products in the first step, as shown in Fig. 1. In the next step, the feasible reactants are a mixture of the reactants and products from the previous steps. Again, in each step, we apply Eq. (1) to select the reactions that reach saturation. We note that, for simplicity, the densities of all the species are assumed to be constant, since we consider the reactions starting from one molecule species, SiH4, and the background densities of the species are set as external boundary conditions. If views and information on the plasma ignition stage with transient signals are required, numerical calculations that provide the absolute evolution of the species densities must be performed. Unfortunately, transient phenomena hinder the clarification of the properties of a reaction network.

The growth of the networks of SiH4 plasma is displayed in Fig. 2, and Table 1 lists the basic statistical data of these networks. At each step, about ten species join the previous network, which reaches a steady state when we assume infinite reaction steps with sufficiently weaker restrictions than Eq. (1). The maximum polymerization number n in the general molecular form SinHm corresponds to the step number: n = 1 in the first and second steps, n = 2 in the third, and n = 3 in the fourth, where m ≤ 2n + 2. In a steady-state reaction network, we recognize several microscopic properties possessed by this network.
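The selection rule of Eq. (1) and the stepwise growth can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the three reactions and their rate constants are invented for illustration, and the densities are illustrative values of the order of those quoted in the text. Only the reactor settings (27.5 Pa, 3.14 l, 0.85 l/min) are taken from the description above.

```python
ATM_PA = 101_325.0  # standard atmosphere [Pa], as used in the expression for t_r


def residence_time(pressure_pa, volume_m3, flow_m3_per_s):
    """t_r = (p / 101325) * v / f, in seconds."""
    return (pressure_pa / ATM_PA) * volume_m3 / flow_m3_per_s


def saturation_time(k_m3_per_s, density_a):
    """tau_s = 1 / (k [A]) from Eq. (1), with [A] the denser reactant."""
    return 1.0 / (k_m3_per_s * density_a)


def grow(reactions, seeds, density, t_r, steps):
    """Return the set of species present after each step (index 0 = seeds)."""
    present = set(seeds)
    history = [set(present)]
    for _ in range(steps):
        produced = set()
        for reactants, products, k in reactions:
            if not set(reactants) <= present:
                continue  # a reactant has not emerged yet
            denser = max(density[r] for r in reactants)
            if t_r > saturation_time(k, denser):  # Eq. (1)
                produced.update(products)
        present |= produced  # products become available from the next step
        history.append(set(present))
    return history


# Reactor settings quoted in the text (flow converted from l/min to m^3/s).
t_r = residence_time(27.5, 3.14e-3, 0.85e-3 / 60.0)

# Illustrative densities [m^-3] and hypothetical rate constants [m^3/s].
density = {"e-": 1.0e17, "SiH4": 4.0e21, "SiH3": 3.0e18, "SiH2": 3.0e16}
reactions = [
    (("e-", "SiH4"), ("SiH3", "H", "e-"), 1.0e-16),
    (("SiH3", "SiH3"), ("SiH2", "SiH4"), 1.0e-17),
    (("SiH2", "SiH2"), ("Si2H4",), 1.0e-20),  # too slow: excluded by Eq. (1)
]
history = grow(reactions, {"e-", "SiH4"}, density, t_r, steps=3)
for step, species in enumerate(history):
    print(step, sorted(species))
```

In this sketch products only become reactants in the following step, which mirrors the step-by-step growth described in the text; the third reaction never fires because its saturation time is far longer than the residence time.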
Although we mainly reported these properties using centrality indices in our previous reports [8,17], here we describe some significant points related to our study using the entry distribution of the adjacency matrix, which represents the source and target nodes of the edges, i.e., the reactions (Fig. 2(f)). A large number of superposed edges exist between the same pairs of nodes; this superposition level is shown as contours. Except for reactions among the positive ions, after electrons trigger successive and parallel reactions originating from SiH4, many reactions start from SiH4 itself as an important reactant. Examples of the main products of reactions, shown as target nodes in Fig. 2(f), are H2 and H. These species initially exist or emerge in the early reaction steps. Large and heavy molecules with n > 3, emerging in later steps, are less significant, with small in- and out-degrees over the entire network.

Fig. 2. Growth of reaction networks for SiH4 plasma with dataset of edges from source (reactant) to target (product) species: (a) in first step; (b) in second step; (c) in third step; (d) in fourth step; (e) in steady state with all species in reaction system. (f) Edge dataset for (e). Superposition levels of edges between identical paired nodes are shown in color contours, and node (or species) numbers on both axes are listed in Table 2.
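The edge-creation rule of this section and the superposition count behind an adjacency-matrix view can be sketched as below. It is an illustrative sketch only: the function names and the two generic reactions are hypothetical, and the text does not specify how a species that appears twice in one reaction is handled, so the sketch simply counts the pairs as given.

```python
from collections import Counter
from itertools import product


def reaction_edges(reactants, products):
    """Every reactant points to every product: A + B -> C + D gives 4 edges."""
    return list(product(reactants, products))


def edge_multiplicity(reactions):
    """Count superposed edges per (source, target) pair."""
    counts = Counter()
    for reactants, products in reactions:
        counts.update(reaction_edges(reactants, products))
    return counts


def degrees(counts):
    """Out- and in-degree per species, counting superposed edges."""
    out_degree, in_degree = Counter(), Counter()
    for (source, target), multiplicity in counts.items():
        out_degree[source] += multiplicity
        in_degree[target] += multiplicity
    return out_degree, in_degree


# Generic reactions, for illustration only.
reactions = [(("A", "B"), ("C", "D")), (("A", "C"), ("D",))]
counts = edge_multiplicity(reactions)
out_degree, in_degree = degrees(counts)
print(counts[("A", "D")], out_degree["A"], in_degree["D"])  # prints: 2 3 4
```

The `counts` mapping plays the role of the adjacency-matrix entries: a value above 1 means that several reactions connect the same source and target species.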
3 Analytical Results
In Sect. 2, we overviewed the growing reaction networks in SiH4 plasma. Although the seed is only one molecule species (SiH4), the network grows continuously up to the fourth step. In the current semiconductor industry, SiH4 plasma is widely used for the fabrication of Si-containing electronic devices, and the reactor stability and reproducibility are outstanding [1,2]. Here, we examine the statistics of this growing network to understand this chemically reactive field.

To identify its microscopic features, we calculate a centrality index in this network. In a previous report [8] we proposed that betweenness and closeness values are useful for clarifying the roles of species, such as reactants, intermediates, and products, throughout an entire reaction system. Here we use a simple version of PageRank [23], C_SPR, which we previously described [17]. As our original report argued, C_SPR indicates a level of influence over the whole graph, and our recent analytical approach suggests that C_SPR reflects the rates of increase and decrease of species densities when we apply it to the variables in a set of rate equations that form a network of multiple variables [17]. Furthermore, when we calculate a value along the original directions of the edges, C_SPR indicates the level of importance as a product, whereas it indicates the level of importance as a reactant if we reverse the directions of the edges. This calculation technique is useful in general networks, beyond chemical reactions, when one needs the importance level of a given node as an information source.

Figure 3 shows the calculated C_SPR for the graph at each step. Changes in the graphs lead to significant changes in C_SPR in some cases. For instance, Si2H6, which emerges in the third step, initially has a low C_SPR but obtains a high value at the next step. C_SPR in the normal direction varies gradually as new species are created at each step. However, C_SPR in the reverse direction, which indicates the importance as a reactant, does not vary much throughout all the steps and shows only small fluctuations. This fact indicates that the important species emerge in early steps. Polymerization number n is lower for species that emerge at earlier steps, which, together with the invariance of C_SPR, suggests that smaller molecules, which include radicals and ions, play more important roles in SiH4 plasma reaction systems.

Fig. 3. Simple PageRank C_SPR of growing reaction network for SiH4 plasma in Figs. 2(a)–(e). (a) C_SPR changes for all species and (b) C_SPR trends for typical species.

From a macroscopic point of view, the degree distribution is a well-known display of the statistical properties of graphs. As listed in Table 1, the total node counts in the graphs in Fig. 2 are so small that the kind of judgment on scale-free behavior made in general cases of network science is impossible. However, it is worthwhile to plot a degree histogram to identify the
statistical properties in our growing network. To obtain the degree distribution, we apply the logarithmic binning procedure described below [24]. For the lowest degrees k, we set the range from k_a to k_b as [k_a^(1), k_b^(1)], and for the next range, we set [k_a^(2), k_b^(2)] = [k_a^(1) + 1, x k_b^(1)] with multiplier x. In a similar manner, using an arbitrary integer i, we set [k_a^(i+1), k_b^(i+1)] = [k_a^(i) + 1, x k_b^(i)], and divide the summation for each range by (k_b^(i) − k_a^(i)) to normalize the histogram. Figure 4 shows the degree distribution for the steady-state network and the components of the emerging steps for each range. Here we set k_a^(1) = 1, k_b^(1) = 5, and x = 2, and the center value of k represents a given spectrum range. The exponent of this degree distribution is 1.36, which is lower than the typical values reported for complex networks [15]. We note that the exponent for degree k > 10 is 1.72 (dotted line in Fig. 4). To check the robustness of these values, we changed the settings as follows: k_b^(1) = 3, 5, 7, and x = 2, 3. The exponent varies from 1.05 to 1.48, but this range of variation is small, meaning that the exponent is indeed anomalous [15]. Although this discrepancy can be partly attributed to the uncertainty caused by insufficient node counts (Fig. 5), the attachments of emerging nodes to the network at each step are not single but multiple. About ten nodes that correspond to species simultaneously join the network, which is fairly different from the Barabási-Albert model, where new nodes join a network one by one [15]. In other words, this analysis stresses the effects of multiple information generations and flows in parallel, beyond chemical systems.

Fig. 4. Degree distribution of steady-state reaction network in Fig. 2(e) with identified components from each reaction step.

Despite the difference in the growing process, we observe certain preferential attachments of newly emerging nodes in this growing network. In Fig. 4, we display the component fractions for each degree spectrum, where the spectrum components are shown as fractions of the emerging steps of nodes. As a general tendency, nodes emerging at an earlier step are located in a higher-degree spectrum. Within species groups where the degree exceeds 40, except one species (Si2H6), all the nodes emerge before the first step. At the lowest spectrum, more than 60% of
nodes emerge at steps after the fourth step. This tendency clearly suggests that the important species emerge at earlier steps and that a process similar to preferential attachment occurs.

Closer scrutiny of the networks in statistical analyses reveals the specific and anomalous features of this growing network (Fig. 5). Although nodes are linked by chemical reactions in the model described so far, the linkage between the chemical reactions themselves might be even more important for network connectivity and stability. Since many reactions in series and in parallel occur in plasma chemistry, the temporal sequences of the reactions and their connections offer other aspects of importance. In other words, the number of species is insufficient for statistical stabilization effects, and each species plays multiple roles as a participant in a chemical reaction system. This reaction network is automatically derived from the species network, since the reaction network represents the adjacency of the reactions, i.e., of the edges in a graph. We visualize this network in the inset in Fig. 5, where its node and edge numbers are 749 and 24,258, respectively. The degree distribution of this network in Fig. 5(a) has an exponential component in the tail region with an exponent of ∼1.16 for k > 30. This exponential tail arises not from the height of each spectrum but from the spectrum density in its distribution (Fig. 5(b)). The numbers of nodes with a given k value in the network do not decrease continuously with k; instead, the spectrum appears at discrete k values, and the interval between the discrete values increases exponentially as k increases. Thus, this long-tail distribution implies diverse spectra in this reaction network, possibly leading to the stability of the chemical system, even though the spectra are fairly discrete, partly because multiple nodes or species join the network on limited discrete steps.

Fig. 5. Reaction network, which is a graph whose nodes are the graph edges shown in Figs. 2(e), (f) and 4, and its degree distribution spectrum for steady-state reaction networks for SiH4 plasma. (a) Degree distribution of reaction network. Inset is corresponding graph display. (b) Raw spectra of degree distributions for species network (Fig. 4) and reaction network (Fig. 5(a)).
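A reaction network of this kind, together with a logarithmically binned degree histogram, can be sketched as follows. Two assumptions are made here and are not confirmed by the text: two edges of the species network are treated as adjacent when the target of one is the source of the other, and the binning ranges are read as contiguous ([1, 5], [6, 10], [11, 20], ...), so that each lower bound is one above the previous upper bound. All function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def line_graph(edges):
    """Link edge i to edge j when the target of i is the source of j."""
    starting_at = defaultdict(list)
    for j, (source, _target) in enumerate(edges):
        starting_at[source].append(j)
    return [(i, j)
            for i, (_source, target) in enumerate(edges)
            for j in starting_at[target] if i != j]


def total_degrees(n_nodes, links):
    """In-degree plus out-degree of each node of the line graph."""
    degree = [0] * n_nodes
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    return degree


def log_binned(degree_values, k_a=1, k_b=5, x=2):
    """Counts over ranges [k_a, k_b], each divided by (k_b - k_a).

    Returns (center of range, normalized count) pairs.
    """
    k_max = max(degree_values, default=0)
    histogram = []
    while k_a <= k_max:
        count = sum(k_a <= k <= k_b for k in degree_values)
        histogram.append(((k_a + k_b) / 2, count / (k_b - k_a)))
        k_a, k_b = k_b + 1, x * k_b  # next contiguous range (assumed reading)
    return histogram


# Toy species network: each tuple is one directed edge (source, target).
species_edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "A")]
links = line_graph(species_edges)
print(links)
print(log_binned(total_degrees(len(species_edges), links)))
```

Edges are indexed by position rather than by their endpoints, so superposed edges between the same pair of species remain distinct nodes of the reaction network.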
SiH4 chemical reactions develop with sequential n from 1 up to several at most, which maintains good-quality Si thin films, a typical industrial product [2], unlike the huge molecules in biological reactions [16]. Although the degree distributions with the estimated exponent are anomalous, we observed clear statistical properties in this growing network. Figure 3 suggests the importance of lower-n or lighter molecules, which include radical species and ions, and Fig. 4 stresses that preferential attachments take place. These two facts indicate that lower-n molecules stabilize the total balance of the network configurations. In real space, lighter molecules surround heavier molecules and various reactions take place, in which lighter molecules are more active in a reactive field. These reaction balances (performed by lighter molecules) are substantial for the macroscopic stabilization of systems. In other words, although lighter molecules are in the periphery around the heavier molecules, they are likely to be in the center of this growing network. Beyond chemical issues, even if the appearance of a given node in real space is indistinct, networking of nodes by their interrelations clarifies their functions in the entire system.
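The two-direction centrality calculation used in this section can be illustrated with a generic damped PageRank. This is a hedged sketch: it is a textbook power iteration and is not claimed to match the exact definition of the simple PageRank C_SPR in Ref. [17]. The damping value, the uniform redistribution of rank from nodes without outgoing edges, and the toy edge list are assumptions made here.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Generic PageRank by power iteration on a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


def reverse(edges):
    """Flip every edge so that scores measure importance as a source."""
    return [(target, source) for source, target in edges]


# Toy directed edges (reactant -> product), for illustration only.
edges = [("e-", "SiH3"), ("e-", "H"), ("SiH4", "SiH3"), ("SiH4", "H"), ("SiH3", "H")]
as_product = pagerank(edges)
as_reactant = pagerank(reverse(edges))
print(max(as_product, key=as_product.get))    # highest score as a product
print(sorted(as_reactant, key=as_reactant.get, reverse=True)[:2])
```

Running the same function on the reversed edge list is the only change needed to switch from importance as a product to importance as a reactant.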
Fig. 6. Comparison of our network analysis and experimental results (13.3 Pa) [25]. Results of network analysis are summed-up values of C_SPR for each n value in SinHm notation.
4 Discussion
The analytical results described in Sect. 3 reveal statistical properties of the growing network for the SiH4 plasma reaction system. Direct experimental confirmation of this analysis is difficult, since species detection in plasma chemistry diagnostics has severe limitations in spatiotemporal resolution. Although the comparisons are indirect, we show one example for possible discussion. Figure 6 shows an experimental result of mass spectra from SiH4 plasma measured by a quadrupole mass analyzer [25]. It detected the particle fluxes to one portion of the reactor wall at each mass number as a mass spectrum, and its order of magnitude is relatively reliable. However, an absolute particle count is impossible. Since the mass flux signals with similar mass numbers overlap, we are forced to sum up the values for each number n. Furthermore, the detected flux is a linear function of the species
densities, and the derived spectra of C_SPR are the rates of the temporal changes of the density. Based on these assumptions, we conclude that these spectra are correlated, even though direct comparisons of their quantities are impossible. p and t_r in the reactor were at the same level as those in our model shown in Sect. 2, so network evolution similar to that shown in Fig. 2 may be ongoing. In Fig. 6, certain correlations exist between these data plots. However, the distribution of C_SPR is broader than that of the mass spectra. Although p and t_r are similar, Si4Hm exists in our C_SPR but not in the mass spectra. This discrepancy may arise from the fluid motions in the reactor. In our model using t_r, we assumed that all the particles are well mixed and can react with each other, but some spatial separation of particle groups is likely to hinder sufficient mixing in the experiments, potentially slowing the growth of the network in a real SiH4-plasma reactor. Although this quantitative comparison is incomplete, the values we derived from the centrality indices qualitatively share some commonality with the mass spectra in the experiment.

Table 2. Species with notation numbers considered in our model. Parentheses indicate a vibrationally-excited state, and “∗” denotes an electronically-excited state. Any species works as “M.” For instance, a rare-gas atom such as Ar or He can serve as M.

Species        No.   Species        No.   Species   No.   Species    No.
e−             1     Si3H8          15    SiH2+     29    Si3H5+     43
SiH4           2     Si2H3          16    SiH+      30    Si4H7+     44
Si2H6          3     Si2H2          17    Si+       31    Si4H2+     45
H2             4     Si3H7          18    Si2H7+    32    H+         46
H              5     Si4H9          19    Si2H6+    33    H2+        47
Si             6     Si4H10         20    Si2H5+    34    H3+        48
SiH            7     Si5H11         21    Si2H4+    35    Si5H10+    49
SiH2           8     H2∗            22    Si2H3+    36    SiH3−      50
SiH3           9     H∗             23    Si2H2+    37    SiH2−      51
Si2H4          10    SiH4(v2,4)     24    Si2H+     38    H2(v1)     52
Si2H5          11    SiH4(v1,3)     25    Si3H6+    39    H2(v2)     53
Si2H6∗         12    Si2H6(v2,4)    26    Si3H4+    40    Si5H12     54
M              13    Si2H6(v1,3)    27    Si4H6+    41    Si5H9+     55
Si2H6∗∗        14    SiH3+          28    Si4H8+    42    Si5H4+     56
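Summing a per-species score over all species with the same number of Si atoms n, as done for the comparison with the mass spectra, can be sketched as below. The function names are hypothetical, the species labels follow the plain-text notation of Table 2, and the scores are invented placeholders rather than values from this study.

```python
import re
from collections import defaultdict


def silicon_count(species):
    """Number of Si atoms n in a label such as 'Si2H6', 'SiH3+' or 'H2'."""
    match = re.search(r"Si(\d*)", species)
    if match is None:
        return 0  # no silicon, e.g. H2 or e-
    return int(match.group(1) or 1)


def sum_by_n(scores):
    """Add up scores of all species that share the same n."""
    totals = defaultdict(float)
    for species, value in scores.items():
        totals[silicon_count(species)] += value
    return dict(totals)


# Hypothetical scores, not values from the paper.
scores = {"SiH4": 0.25, "SiH3+": 0.125, "Si2H6": 0.25,
          "Si2H6(v1,3)": 0.125, "H2": 0.25}
print(sum_by_n(scores))
```

Charge signs and excitation labels are ignored by the pattern, so ions and excited states are grouped with the neutral species of the same n.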
5 Conclusion
We examined growing networks in a plasma chemical system starting from one reactive molecule. The statistical properties found in our study include a scale-free-like degree distribution with an exponential tail, in which multiple nodes simultaneously join the network at one step. This property is likely to help stabilize the SiH4 plasma, in which tens of species are in kinetic motion without any specific connections; this growing chemical reaction network makes a chemical linkage between particles. Lighter molecules as network nodes have more linked edges, and their roles in the center of the network are constant throughout the network growth, with very few exceptions. Experimental results on mass spectra measurement share some features with those of our model, and the existing discrepancies will become comprehensible through future progress in both experimental diagnostic methods and analytical network models.

Appendix. Table 2 lists the species in our model, and Table 3 shows the densities of each species assumed here. The datasets are based on Refs. [5,19,20].

Table 3. Density of species assumed in our model at 27.5 Pa. Species
Density (m−3 ) Species Density (m−3 )
e−
1.0 × 1017 21
SiH4
4.0 × 1021
H2
3.0 × 10
SiH3
3.0 × 1018
SiH2
3.0 × 1016
SiH
3.0 × 1016
Si2 H6
3.0 × 10
Si2 H5
1.0 × 1018
Si2 H2
4.0 × 1012
Si3 H8
1.0 × 1020
17
Si2 H∗6
4.0 × 1021
SiH4 (v2.4) 4.0 × 1021
Si3 H7
1.0 × 1018
Si4 H9
1.0 × 1018
H Si4 H10
20
4.0 × 10
19
6.0 × 10
Acknowledgements. One of the authors (OS) thanks Prof. T. Murakami at Seikei University, Prof. M. J. Kushner at the University of Michigan, and Dr. S. Nunomura at National Institute of Advanced Industrial Science and Technology for their useful comments on this study. This work was partly supported by JSPS KAKENHI Grant Numbers JP18H03690 and JP18K18756.
References
1. Lieberman, M.A., Lichtenberg, A.J.: Principles of Plasma Discharges and Material Processing. Wiley, New York (1994)
2. Bruno, G., Capezzuto, P., Madan, A.: Plasma Deposition of Amorphous Silicon-Based Materials. Academic Press, San Diego (1995)
3. Itikawa, Y.: Molecular Processes in Plasmas. Springer, Berlin (2007)
4. Kittel, C.: Thermal Physics. Wiley, Hoboken (1969)
5. Kushner, M.J.: A model for the discharge kinetics and plasma chemistry during plasma enhanced chemical vapor deposition of amorphous silicon. J. Appl. Phys. 63, 2532–2551 (1988)
6. Bohdan, M., Sprakel, J., van der Gucht, J.: Multiple relaxation modes in associative polymer networks with varying connectivity. Phys. Rev. E 94, 032507-1–032507-7 (2016)
7. Sakai, O., Nobuto, K., Miyagi, S., Tachibana, K.: Analysis of weblike network structures of directed graphs for chemical reactions in methane plasmas. AIP Adv. 5, 107140-1–107140-6 (2015)
8. Mizui, Y., Kojima, T., Miyagi, S., Sakai, O.: Graphical classification in multicentrality-index diagrams for complex chemical networks. Symmetry 9, 309-1–309-11 (2017)
9. Laut, I., Räth, C., Wörner, L., Nosenko, V., Zhdanov, S.K., Schablinski, J., Block, D., Thomas, H.M., Morfill, G.E.: Network analysis of three-dimensional complex plasma clusters in a rotating electric field. Phys. Rev. E 89, 023104-1–023104-9 (2014)
10. Holme, P.: Temporal network structures controlling disease spreading. Phys. Rev. E 94, 022305-1–022305-8 (2016)
11. Lusch, B., Maia, P.D., Kutz, J.N.: Inferring connectivity in networked dynamical systems: challenges using Granger causality. Phys. Rev. E 94, 032220-1–032220-14 (2016)
12. Bellesia, G., Bales, B.B.: Population dynamics, information transfer, and spatial organization in a chemical reaction network under spatial confinement and crowding conditions. Phys. Rev. E 94, 042306-1–042306-8 (2016)
13. Li, A., Cornelius, S.P., Liu, Y.-Y., Wang, L., Barabási, A.-L.: The fundamental advantages of temporal networks. Science 358, 1042–1046 (2017)
14. Alberts, B., Bray, D., Hopkin, K., Johnson, A., Lewis, J., Raff, M., Roberts, K., Walter, P.: Essential Cell Biology, 4th edn. Garland Science, New York (2013)
15. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47–97 (2002)
16. Jeong, H., Tombor, B., Albert, R., Oltvai, Z.N., Barabási, A.-L.: The large-scale organization of metabolic networks. Nature 407, 651–654 (2000)
17. Mizui, Y., Nobuto, K., Miyagi, S., Sakai, O.: Complex reaction network in Silane Plasma chemistry. In: Complex Networks VIII, pp. 135–140. Springer, Cham (2017)
18. Tachibana, K., Nishida, M., Harima, H., Urano, Y.: Diagnostics and modelling of a methane plasma used in the chemical vapour deposition of amorphous carbon films. J. Phys. D 17, 1727–1742 (1984)
19. Gogolides, E., Mary, D., Rhallabi, A., Turban, G.: Rf plasmas in methane: prediction of plasma properties and neutral density with combined gas-phase physics and chemistry model. Jpn. J. Appl. Phys. 34, 261–270 (1995)
20. Bleecker, K.D., Bogaerts, A., Goedheer, W., Gijbels, R.: Investigation of growth mechanisms of clusters in a silane discharge with the use of a fluid model. IEEE Trans. Plasma Sci. 32, 691–698 (2004)
21. Murakami, T., Niemi, K., Gans, T., O’Connell, D., Graham, W.G.: Chemical kinetics and reactive species in atmospheric pressure helium-oxygen plasmas with humid-air impurities. Plasma Sources Sci. Technol. 22, 015003-1–015003-29 (2013)
22. Bie, C.D., Dijk, J., Bogaerts, A.: The dominant pathways for the conversion of methane into oxygenates and syngas in an atmospheric pressure dielectric barrier discharge. J. Phys. Chem. C 119, 22331–22350 (2015)
23. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst. 30, 107–117 (1998)
24. Newman, M.E.J.: Power laws, Pareto distributions and Zipf’s law. Contemporary Phys. 46, 323–351 (2005)
25. Nunomura, S., Kondo, M.: Positive ion polymerization in hydrogen diluted silane plasmas. Appl. Phys. Lett. 93, 231502-1–231502-3 (2008)
Joint Modeling of Histone Modifications in 3D Genome Shape Through Hi-C Interaction Graph
Emre Sefer1,2(B)
1 Ozyegin University, Computer Science Department, Istanbul, Turkey [email protected]
2 JP Morgan Applied AI Research, New York City, USA
Abstract. Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. Even though Hi-C interactions are not biased towards any histone modification, previous analyses have revealed denser interactions around many histone modifications. Nevertheless, the simultaneous effects of these modifications on the Hi-C interaction graph have not been fully characterized, limiting our understanding of genome shape. Here, we propose Coverage Hi-C to decompose the Hi-C interaction graph in terms of known histone modifications. Coverage Hi-C is based on set multicover with pairs, where each Hi-C interaction is covered by histone modification pairs. We find 4 histone modifications (H3K4me1, H3K4me3, H3K9me3, H3K27ac) to be significantly predictive of most Hi-C interactions across species and cell types. Coverage Hi-C is quite effective in predicting Hi-C interactions and topologically-associated domains (TADs) in one species or cell type when it is trained on another species or cell type.
Keywords: Hi-C · Set cover · Bioinformatics · Algorithms · Epigenetics
1 Introduction
Graph theory has emerged as a powerful tool for quantifying the connectivity patterns within complex systems [3]. Networks are graphs consisting of nodes connected by edges that can be used to represent the underlying structure of biological, social, physical and information systems. Biological interactions at many different levels of detail, from the genomic interactions in a folded genome structure to the relationships of organisms in a population or ecosystem, can be modeled as networks. One complex system exhibiting network structure is the chromatin interaction network, which models the higher-order folding of chromatin in the three-dimensional nucleus. Here, we focus on chromatin interaction networks, which we define as a set of nodes, representing restriction fragments or genome regions, and a set of undirected edges, representing the physical interactions between these locations [1].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 410–421, 2021. https://doi.org/10.1007/978-3-030-65351-4_33
Chromatin interactions obtained from a variety of recent chromosome conformation capture experimental techniques such as Hi-C [17] have resulted in significant advances in our understanding of the geometry of chromatin structure [23]. Hi-C provides genome-wide maps of contact frequencies, a proxy for how often any given pair of loci are sufficiently close in space to be captured together. As a result, Hi-C yields matrices of counts that represent the frequency of cross-linking between restriction fragments of DNA at a certain resolution. At the chromosomal level, Hi-C has revealed a spatial segregation of open and closed chromatin into two compartments, called A and B compartments [4]. A and B compartments largely correlate with accessible, transcriptionally active euchromatin and compacted, transcriptionally silent heterochromatin, respectively. Similarly, analysis of the resulting matrix by Dixon et al. [6] at a higher resolution led to the discovery of topologically-associated domains (TADs), which correspond to consecutive, highly-interacting matrix regions that are close in densely packed chromatin. TADs are a ubiquitous unit of genome organization and are highly reproducible features of Hi-C matrices. Higher-order genome organization (including TADs and compartments) is correlated with long-range transcriptional regulation and cell differentiation [20,22]. Emerging evidence has revealed that epigenetics is critical in understanding the basic molecular mechanisms that take place in chromatin (transcription, splicing, replication, and DNA repair). Even though Hi-C is not biased towards any histone modification, previous analyses have revealed denser interactions around many histone modifications [6,10]. Interactions between these one-dimensional histone modifications determine the 3D structure of the genome. For instance, histone modifications H3K4me3 and H3K27ac and insulator proteins are enriched, and H3K27me3 is depleted, within TAD boundaries [8], although the causal direction of these associations is unknown. Despite these analyses, the complete picture of how histone modifications, through their binding sites, jointly affect 3D genome shape remains poorly understood across species, cell types, and cell cycles. This is partially because previous analyses relating histone modifications to TADs and A/B compartments have often considered each histone modification independently, without accounting for their combined quantitative effects. It is not fully known to what extent relationships between histone modifications are important across species and cell types, or whether there is a small set of histone modifications that is of primary importance in explaining observed Hi-C interactions, and thus 3D genome shape.
In this paper, we consider the problem of identifying the relationships between high-order chromatin interactions and histone modifications. Concretely, we aim to understand and predict how Hi-C interactions are formed as a result of these modifications and the interactions among them. We propose a covering-type method, Coverage Hi-C, to decompose the Hi-C interaction graph in terms of known histone modifications. Coverage Hi-C selects a subset of histone modifications based on set multicover with pairs, where each Hi-C interaction is covered by histone modification pairs. We systematically identify 4 histone modifications (H3K4me1, H3K4me3, H3K9me3, H3K27ac) to be highly
predictive of most Hi-C interactions across species and cell types when considered in combination. We complete missing Hi-C interactions and predict inter- and intra-chromosomal Hi-C interactions at a high resolution by using this sparse set of inferred modifications. These histone modifications account for a large proportion of the accuracy of Hi-C prediction, consistent with their known roles, yet they fail to predict Hi-C interactions when considered independently. We show that these modifications are conserved across human and mouse species, as well as across embryonic stem cells and GM12878 cells. Overall, our contributions are as follows: 1- We propose a covering-type formulation to identify a subset of histone modifications over the Hi-C interaction graph across genome locations, 2- We propose an efficient relaxation-based method with a provable approximation guarantee, 3- We show that most of the identified histone modifications appear consistently across different mammals and cell types, 4- We demonstrate the effectiveness of the identified histone modifications in predicting Hi-C interactions and TADs. Although [25] discusses the low performance of biological data prediction across cell types, our method’s performance across cell types is quite promising.
1.1 Related Work
Previous work has focused on understanding a subset of genome architectures through epigenetic modifications while ignoring the interactions between modifications. [23] analyzed the distribution of various genomic elements such as histone modifications, CTCF, and enhancers in terms of Hi-C interactions. Another work [13] shows how Hi-C interactions are distributed around regulatory sequences. [19] discusses the degree of overlap between Hi-C interactions and the known promoter and enhancer sites. [16,26] have considered the impact of epigenetic modifications in predicting TADs, which is different from predicting Hi-C interactions. Another set of work has focused on analyzing epigenetic data with deep non-generative models that lack a high-quality interpretation of the relationships. [5] uses a neural network-based algorithm to predict subcompartment annotations from epigenetic modifications. [15] proposes a bootstrapping deep learning model that predicts interactions only between regulatory elements, without utilizing histone modifications. Similarly, [27] proposes a deep learning approach to predict the impact of only non-coding sequence variants on 3D chromatin structure. All of these methods use Deep Neural Networks (DNNs), which tend to be seen as black boxes that can perform a great variety of tasks but offer little mechanistic explanation of how the inputs of the model are used to generate the output. They also do not model the relationships within a generative framework, limiting the interpretability of the relationships between epigenetic modifications and Hi-C data. There are a number of differences between our work and the existing work: 1- Some of these methods consider each histone modification independently, ignoring the global dynamics between the modifications, 2- They do not develop explanatory models of Hi-C interactions in terms of histone modifications, so they lack interpretability of these relationships, and 3- They do not quantify
the strength of relationships between histone modifications in the Hi-C interaction dataset so they cannot capture the most informative subset of histone modifications.
2 Problem Formulation
Hi-C provides a set of interactions between restriction fragments over the whole genome. More formally, let $R$ be the set of restriction sites over the considered genome. Hi-C provides an undirected interaction graph $G = (V = R, E)$, where $E = \{E_{uv},\ u < v \in R^2\}$ is the set of interactions between restriction sites and $E_{uv}$ is the number of interactions between $u$ and $v$. These interactions can be analyzed in two ways: 1- We can work directly at the restriction fragment level, where each node is a restriction site and $G$ is an unweighted graph, or 2- We can bin the data at a given resolution and analyze the resulting graph $G' = (V' = R', E')$, where $R'$ represents nonoverlapping genomic regions of fixed length (called bins), and each edge $E'_{u'v'}$ is the total number of interactions between restriction sites of bins $u'$ and $v'$.
Let $M$ be the set of histone modifications that are candidates to explain observed Hi-C interactions and associated biases. Histone modifications have previously been shown to be associated with several Hi-C interaction patterns [6]. We define $c_m^v$ to be the number of occurrences of histone modification $m \in M$ around restriction site $v$. If the data is not binned, $c_m^v$ is binary: modification $m$ either exists around $v$ or it does not. Let $H[v] = \{(m, c_m^v) \mid m \in M,\ c_m^v > 0\}$ be the set of histone modifications around restriction site $v$. After binning, $H[v'] = \{(m, \sum_{k=1}^{t} c_m^{v_k}) \mid m \in M,\ \sum_{k=1}^{t} c_m^{v_k} > 0\}$, where bin $v' = \{v_1, v_2, \ldots, v_t\} \in R'$ includes $t$ restriction sites. Given $H = \{H[v],\ v \in R\}$ (or $H' = \{H[v'],\ v' \in R'\}$ if the data is binned), we propose the following problem to identify the subset of modifications in terms of which Hi-C data can be explained:
Problem 1. Coverage Hi-C: Given histone modification data $H$ and the Hi-C interaction graph $G$ over a genome, infer the minimum-weight set of histone modifications whose pairs cover all observed Hi-C interactions. The problem where data is binned at a given resolution is defined similarly.
Coverage Hi-C identifies a subset of histone modifications that covers Hi-C interactions to explain 3D genome shape by taking both interacting and non-interacting genomic regions into account.
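To make the notation concrete, the following minimal Python sketch builds $H[v]$ from per-site counts and aggregates it over bins. The function names (`build_H`, `bin_H`) and the dictionary layout are assumptions made for this illustration; they are not taken from the Coverage Hi-C code.

```python
from collections import defaultdict


def build_H(counts):
    """Build H[v] = {m: c_m^v} keeping only modifications with c_m^v > 0.

    counts: dict mapping site -> dict mapping modification -> count
            (hypothetical input layout).
    """
    return {v: {m: c for m, c in mods.items() if c > 0}
            for v, mods in counts.items()}


def bin_H(H, bins):
    """Aggregate H over bins: the count of m in a bin is the sum over its sites.

    bins: dict mapping bin id -> list of restriction sites in that bin.
    """
    binned = {}
    for b, sites in bins.items():
        total = defaultdict(int)
        for v in sites:
            for m, c in H.get(v, {}).items():
                total[m] += c
        binned[b] = {m: c for m, c in total.items() if c > 0}
    return binned


# Toy data: three restriction sites and two bins.
counts = {
    "r1": {"H3K4me1": 1, "H3K27ac": 0},
    "r2": {"H3K4me3": 1},
    "r3": {"H3K4me1": 1, "H3K9me3": 1},
}
H = build_H(counts)
H_binned = bin_H(H, {"b1": ["r1", "r3"], "b2": ["r2"]})
```

In the unbinned case every stored count is 1, so `H[v]` is effectively a set of modifications; the binned dictionary carries the summed counts used by the binning variant.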
3 Coverage Hi-C: Covering Chromatin Interactions by Subset of Histone Modification Pairs
We propose a covering-type solution to select a subset of modifications that explains observed Hi-C interactions between restriction sites. We assume each Hi-C interaction to be covered by at least one modification pair. Similar covering problems have been studied in primer selection and haplotyping [12,14]. Let $x_{mn}$ be a
binary variable taking value 1 when modification $m$ interacts with modification $n$, and let $y_m$ be a binary variable taking value 1 when modification $m$ is in the solution. Without loss of generality, we assume each restriction site to have at least a single modification. Otherwise, we remove restriction sites to which there is no mapped modification. We assume that two histone modifications can interact only when both modifications individually belong to the solution. The resulting Program (1)–(5) is defined as follows:
$$\operatorname*{argmin}_{Y} \sum_{m \in M} w_m y_m \tag{1}$$
$$\text{s.t.} \quad \sum_{(m, c_m^u) \in H[u]} \ \sum_{(n, c_n^v) \in H[v]} x_{mn} \ge 1, \qquad (u, v) \in E \tag{2}$$
$$x_{mn} \le y_m, \quad x_{mn} \le y_n, \qquad m \le n \in M^2 \tag{3}$$
$$x_{mn} \ge 0, \qquad m \le n \in M^2 \tag{4}$$
$$y_m \ge 0, \qquad m \in M \tag{5}$$
where $w_m$ is the cost of adding modification $m$ to the solution. We define
$$w_m = \frac{\sum_{(u,v) \notin E} \min(c_m^u, c_m^v)}{\binom{|R|}{2} - |E|},$$
which increases if $m$ occurs frequently across non-interacting restriction sites. This heuristic weighting scheme penalizes modifications that are also seen across non-interacting sites, which the unweighted problem formulation cannot penalize. Constraint (2) ensures that each interaction is covered by at least one modification pair present at the corresponding sites, and constraint (3) ensures that a modification pair can cover an interaction only when both histone modifications belong to the solution. Since interactions are independent sets of modification pairs, we can replace constraints (3) by the following stronger set of constraints:
$$\sum_{(n, c_n^u) \in H[u]} x_{mn} + \sum_{(n, c_n^v) \in H[v]} x_{mn} \le y_m, \qquad m \in M,\ (u, v) \in E \tag{6}$$
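The heuristic weight $w_m$ used in objective (1) can be sketched as follows. This is a minimal illustration that enumerates all non-interacting site pairs, which is quadratic in the number of sites and therefore only practical for small inputs. The function name `modification_weights` and the input layout are assumptions made for this example.

```python
from itertools import combinations


def modification_weights(H, sites, edges, modifications):
    """Compute w_m = sum over non-interacting pairs of min(c_m^u, c_m^v),
    divided by the number of non-interacting pairs, C(|R|, 2) - |E|.

    H: dict site -> dict modification -> count (hypothetical layout).
    """
    edge_set = {frozenset(e) for e in edges}
    non_edges = [(u, v) for u, v in combinations(sites, 2)
                 if frozenset((u, v)) not in edge_set]
    denom = len(sites) * (len(sites) - 1) // 2 - len(edge_set)
    weights = {}
    for m in modifications:
        total = sum(min(H[u].get(m, 0), H[v].get(m, 0)) for u, v in non_edges)
        weights[m] = total / denom if denom else 0.0
    return weights


# Toy example: four sites, one observed interaction (r1, r2).
H_toy = {"r1": {"A": 1}, "r2": {"A": 1, "B": 1}, "r3": {"A": 1}, "r4": {"B": 1}}
w = modification_weights(H_toy, ["r1", "r2", "r3", "r4"], [("r1", "r2")], ["A", "B"])
```

In this toy example modification A co-occurs on two of the five non-interacting pairs and B on one, so A receives the larger penalty.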
Let $Q = \max_{(u,v) \in E} (|H[u]|\,|H[v]|)$ be the maximum number of histone modification pairs that can cover an interaction. Coverage Hi-C is NP-hard, and Program (1)–(5) with the replaced constraint can be approximated within $O(\sqrt{Q \log |E|})$, as in Theorem 1, which follows from an approximation-preserving reduction to the Minimum Weight Multicolored Subgraph Problem [11]. This is achieved by solving its LP relaxation and running a randomized rounding that adds each modification $m$ to the solution with probability $y_m$. If constraints are still not satisfied after rounding, we keep adding the modification $m$ with the maximum increase in the number of satisfied constraints per unit cost ($w_m$) until all constraints are satisfied. The problem is of polynomial size in the number of variables and constraints: the number of variables is $\frac{|M|(|M|+1)}{2} + |M|$, and the number of constraints is $|E| + |M||E|$.
Theorem 1. Coverage Hi-C can be approximated within $O(\sqrt{Q \log |E|})$.
Proof. Minimum Weight Multicolored Subgraph Problem (MWMCSP) instance: Given an undirected graph $G_M = (V_M, E_M)$ with non-negative vertex weights and a color function that assigns to each edge one or more of $n$ given colors as input, the goal is
to find a minimum weight set of vertices of $G_M$ inducing edges of all $n$ colors. $U$ is the universe of colors, and $\chi = (\chi_1, \ldots, \chi_n)$ is the family of nonempty “color classes” of edges (without loss of generality we assume that $\cup_i \chi_i = U$). When mapping Coverage Hi-C into an MWMCSP instance, the set $M$ of histone modifications to be selected maps to $V_M$. Each edge in $E_M$ defines a color class $\chi_{m,n} = \{(u, v) \mid (u, v) \in E,\ m \in H[u],\ n \in H[v]\}$ on Hi-C interactions $E$ for the corresponding histone pair $m$ and $n$. MWMCSP can be approximated within $O(\sqrt{m \log n})$, which becomes $O(\sqrt{Q \log |E|})$ in our case, where $m = Q = \max_{(u,v) \in E} (|H[u]|\,|H[v]|)$ is the maximum size of a color class and $n = |E|$ is the number of colors.
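The rounding and repair steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the fractional values `y_frac` are taken as given (for example, from an LP solver), the function name `round_and_repair` is hypothetical, and the tie-breaking rule (prefer the cheapest modification when no single addition covers a new interaction) is an assumption added so that the loop always terminates.

```python
import random


def round_and_repair(y_frac, weights, H, edges, seed=0):
    """Randomized rounding of fractional y_m followed by greedy repair.

    y_frac:  dict modification -> fractional LP value in [0, 1] (assumed given).
    weights: dict modification -> cost w_m.
    H:       dict site -> collection of modifications present at that site.
    edges:   list of (u, v) interactions that must be covered.
    """
    rng = random.Random(seed)
    # Rounding: keep modification m with probability y_m.
    selected = {m for m, y in y_frac.items()
                if rng.random() < min(1.0, max(0.0, y))}

    def uncovered(sel):
        # (u, v) is covered if some selected m is at u and some selected n is at v.
        return [(u, v) for u, v in edges
                if not (any(m in sel for m in H[u]) and any(n in sel for n in H[v]))]

    remaining = uncovered(selected)
    while remaining:
        candidates = [m for m in y_frac if m not in selected]
        if not candidates:
            raise ValueError("an interaction has an endpoint with no modification")

        def score(m):
            gain = len(remaining) - len(uncovered(selected | {m}))
            # Primary key: newly covered interactions per unit cost.
            # Secondary key (assumption): prefer the cheaper modification.
            return (gain / max(weights[m], 1e-12), -weights[m])

        selected.add(max(candidates, key=score))
        remaining = uncovered(selected)
    return selected


# Toy instance: with all fractional values at 0 the result is purely greedy.
H_toy = {"r1": {"A"}, "r2": {"B"}, "r3": {"A", "C"}}
edges_toy = [("r1", "r2"), ("r1", "r3")]
w_toy = {"A": 0.4, "B": 0.2, "C": 0.5}
greedy_only = round_and_repair({"A": 0.0, "B": 0.0, "C": 0.0}, w_toy, H_toy, edges_toy)
```

The sketch only guarantees feasibility of the returned set; it makes no claim about reproducing the approximation factor of Theorem 1.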
3.1 Binning Variant
If we bin the Hi-C data to a given resolution, there will be multiple Hi-C interactions to be explained by multiple histone pairs. Similar to the unbinned case, we assume each interaction to be explained by a single histone pair. There can also be self-interactions in $G'$, as each node is a bin over multiple restriction sites. In this case, the problem becomes a Multiset Multicover variant of the problem in Sect. 3, where constraint (2) is replaced by:
$$\sum_{(m, c_m^{u'}) \in H[u']} \ \sum_{(n, c_n^{v'}) \in H[v']} c_m^{u'} c_n^{v'} x_{mn} \ge E'_{u'v'}, \qquad (u', v') \in E' \tag{7}$$
We define the weights in the objective function in Eq. (1) as
$$w_m = \frac{\sum_{(u',v') \notin E'} c_m^{u'} c_m^{v'}}{\binom{|R'|}{2} - |E'|}.$$
To the best of our knowledge, this variant of the problem has not been defined before. It can again be solved by LP relaxation and randomized rounding. However, this scheme no longer gives the $O(\sqrt{Q \log |E|})$ approximation guarantee of the unbinned case.
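For an integral selection of modifications, the left-hand side of constraint (7) factorizes into a product of per-bin sums, which gives a simple feasibility check. The sketch below assumes $x_{mn} = 1$ exactly when both $m$ and $n$ are selected; the function name `multicover_slack` and the input layout are hypothetical.

```python
def multicover_slack(H_binned, edge_counts, selected):
    """Return, per binned interaction, (covered multiplicity) - E'_{u'v'}.

    With x_mn = 1 iff m and n are both selected, the double sum in (7) equals
    (sum of selected counts in bin u') * (sum of selected counts in bin v').
    A negative value means the interaction is not yet multicovered.
    """
    slack = {}
    for (u, v), e_uv in edge_counts.items():
        su = sum(c for m, c in H_binned[u].items() if m in selected)
        sv = sum(c for n, c in H_binned[v].items() if n in selected)
        slack[(u, v)] = su * sv - e_uv
    return slack


# Toy example with one inter-bin interaction and one self-interaction.
H_b = {"b1": {"A": 2, "C": 1}, "b2": {"B": 1}}
E_b = {("b1", "b2"): 2, ("b1", "b1"): 3}
slack_ab = multicover_slack(H_b, E_b, {"A", "B"})
slack_a = multicover_slack(H_b, E_b, {"A"})
```

Here selecting A and B satisfies both constraints, while selecting A alone leaves the interaction between b1 and b2 two units short.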
4 Results
4.1 Implementation
We use Hi-C data from embryonic stem (ES) cells in mouse and human [24], and from GM12878 cells in human only [23], covering autosomal chromosomes. We download the genome assemblies from the UCSC genome browser. We use Juicer [7] to process the Hi-C sequencing reads of each species to obtain the Hi-C contact pairs based on the corresponding genome assembly. We obtain histone modifications for human and mouse from NIH Roadmap Epigenomics [2] and UCSC Encode [9]. In the binned case, we bin Hi-C and ChIP-Seq histone modifications at 1 kb resolution, estimate the RPKM (Reads Per Kilobase per Million) measure for each bin, and transform the value x in each bin by log(x + 1), which reduces the distorting effects of high values. In the case of 2 or more replicates, the RPKM level for each bin is averaged to get a single histone modification file, in order to minimize batch-related differences. Then, the normalized values are turned into binary values by thresholding at 0.5. In the unbinned case, we map histone modification sites to the restriction sites of neighboring Hi-C interactions. A modification is said to belong to a restriction site if the distance between the modification and the restriction site is less than 100. After this mapping, a histone modification either exists or does not exist at a given restriction site. Results are based on the unbinned case unless otherwise noted.
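The preprocessing described above can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: replicate RPKM values are averaged before the log transform, the logarithm is natural, a bin is set to 1 when the transformed value is strictly greater than the threshold, and the distance cutoff of 100 is measured in the same coordinate units as the positions. All function names are hypothetical.

```python
import math


def rpkm(read_count, bin_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads for one bin."""
    return read_count * 1e9 / (bin_length_bp * total_mapped_reads)


def binarize_bins(replicate_counts, bin_length_bp, totals, threshold=0.5):
    """Average RPKM over replicates, apply log(x + 1), then threshold.

    replicate_counts: one list of per-bin read counts per replicate.
    totals:           total mapped reads per replicate.
    """
    n_bins = len(replicate_counts[0])
    out = []
    for i in range(n_bins):
        vals = [rpkm(rep[i], bin_length_bp, tot)
                for rep, tot in zip(replicate_counts, totals)]
        mean_rpkm = sum(vals) / len(vals)
        out.append(1 if math.log(mean_rpkm + 1.0) > threshold else 0)
    return out


def map_modifications_to_sites(site_positions, mod_positions, max_dist=100):
    """Unbinned case: modification m belongs to a site if any occurrence of m
    lies strictly closer than max_dist to the site position."""
    return {s: {m for m, plist in mod_positions.items()
                if any(abs(p - pos) < max_dist for p in plist)}
            for s, pos in site_positions.items()}


binary = binarize_bins([[10, 0], [20, 0]], 1000, [1_000_000, 2_000_000])
mapped = map_modifications_to_sites(
    {"r1": 1000, "r2": 5000},
    {"H3K4me1": [1050, 9000], "H3K27ac": [5100]},
)
```

In the toy call, the first bin has an average RPKM of 10 in both replicates and is set to 1, while the empty second bin is set to 0. The H3K27ac occurrence at distance exactly 100 from r2 is not assigned because the cutoff is strict.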
We implement Coverage Hi-C in Python, and use Gurobi to solve LP relaxations [21]. Code and datasets can be found at http://www.github.com/seferlab/chrocoverage. Coverage Hi-C is reasonably fast: we can solve coverage formulations even without binning in less than 10 minutes on a laptop with a 2.3 GHz Dual-Core Intel Core i5 processor and 8 GB RAM. We prevent overfitting and optimize regularization parameters by following a 5-fold nested cross-validation with inner and outer steps. The outer 5-fold cross-validation, for example, trains on all autosomal chromosomes except the one to be predicted. Within each loop of the outer cross-validation, we perform 4-fold inner cross-validation to estimate the regularization parameters.
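One way to organize such a nested scheme at the chromosome level is sketched below. The text does not specify how folds are formed, so the split used here (hold out one chromosome in the outer loop, then partition the remaining chromosomes into 4 inner folds by striding) is an assumption, and `nested_chromosome_splits` is a hypothetical name.

```python
def nested_chromosome_splits(chromosomes, inner_folds=4):
    """Yield (held_out, outer_train, inner) for each outer split.

    inner is a list of (fit, validation) chromosome lists used to tune
    regularization parameters on the outer training set only.
    """
    for held_out in chromosomes:
        train = [c for c in chromosomes if c != held_out]
        inner = []
        for k in range(inner_folds):
            val = train[k::inner_folds]  # every inner_folds-th chromosome
            fit = [c for c in train if c not in val]
            inner.append((fit, val))
        yield held_out, train, inner


chroms = ["chr1", "chr2", "chr3", "chr4", "chr5"]
splits = list(nested_chromosome_splits(chroms))
```

The held-out chromosome never appears in its own inner folds, which is the property that keeps parameter tuning separate from the final evaluation.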
4.2 Four Histone Modifications Are Predictive of Most Hi-C Interactions
We find that only 4 histone modifications (H3K4me1, H3K4me3, H3K9me3, H3K27ac) out of 16 are enough to explain most of the genome-wide Hi-C interactions of human ES cells by Coverage Hi-C. This holds for both the 5 kb binned and unbinned cases. Figure 1a shows the percentage of covered interactions for an increasing number of histone modifications, where more than 93% of interactions are covered by these 4 histone modifications for most chromosomes in human ES cells. As we increase the number of included modifications from 1 to 16, the coverage nearly stabilizes after 4 modifications, with a small additional increase up to 8 histone modifications from H3K79me2, H4K20me1, H3K36me3, and H3K27me3 for the 5 kb binned case. This non-redundant set of histone modifications is highly preserved when we repeat this procedure across human GM12878 and mouse ES cells. Similarly, Fig. 1b shows the percentage of times each histone modification is in the Coverage Hi-C solution, where Coverage Hi-C is run independently for each chromosome on human ES cells.
Fig. 1. a) Hi-C Coverage percentage by increasing number of histone modifications on 5 kb and unbinned cases, b) The probability of histone modifications appearing in Coverage Hi-C solution per chromosome for human ES cells.
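A coverage curve of the kind summarized in Fig. 1a can be computed with a few lines of Python. The sketch assumes the unbinned covering rule (an interaction is covered when a selected modification is present at each endpoint) and takes the order in which modifications are added as an input, since the ordering used for the figure is not specified in the text. The function names are hypothetical.

```python
def coverage_fraction(H, edges, selected):
    """Fraction of interactions (u, v) with a selected modification at u and at v."""
    if not edges:
        return 0.0
    covered = sum(
        1 for u, v in edges
        if any(m in selected for m in H[u]) and any(n in selected for n in H[v])
    )
    return covered / len(edges)


def coverage_curve(H, edges, ordered_modifications):
    """Coverage after including the first k modifications, for k = 1..len(order)."""
    return [coverage_fraction(H, edges, set(ordered_modifications[:k]))
            for k in range(1, len(ordered_modifications) + 1)]


H_toy = {"r1": {"A"}, "r2": {"B"}, "r3": {"A", "C"}}
curve = coverage_curve(H_toy, [("r1", "r2"), ("r1", "r3")], ["A", "B", "C"])
```

In the toy example the curve saturates after two modifications, mirroring the plateau behaviour described above on a much smaller scale.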
4.3 Coverage Hi-C Can Predict Hi-C Interactions and TADs from Identified Histone Modifications
We are able to detect false positive interactions on human ES cells by applying 5-fold nested cross-validation independently on each chromosome. We predict Hi-C interactions by Coverage Hi-C over the 4 previously identified histone modifications. We evaluate the performance by the F1 score, which is the harmonic mean of precision and recall and reflects the tradeoff between them. According to the matrix in Fig. 2, chromosome 4 has the best performance, with an F1 score of 0.75. Our cross-chromosome experiments show that the performance decreases but remains reasonable when training on one chromosome and testing on another; training with interactions on chromosome 6 and predicting interactions on chromosome 4 gives an F1 score of 0.72. The modifications identified as important across different chromosomes are very similar, suggesting that the overall properties governing chromosomal contacts are similar across chromosomes. However, there may be fine-grained differences that are not captured by Coverage Hi-C.
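For reference, the F1 score over sets of undirected interactions can be computed as in the following minimal sketch. The function name and the representation of interactions as pairs are assumptions made for this illustration.

```python
def f1_score(predicted, observed):
    """F1 = harmonic mean of precision and recall over undirected interactions."""
    pred = {frozenset(e) for e in predicted}  # (u, v) and (v, u) are the same edge
    obs = {frozenset(e) for e in observed}
    tp = len(pred & obs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(obs)
    return 2 * precision * recall / (precision + recall)


f1 = f1_score([(1, 2), (2, 3), (3, 4)], [(2, 1), (3, 4), (4, 5), (5, 6)])
```

In this toy case two of three predictions are correct (precision 2/3) and two of four observed interactions are recovered (recall 1/2), giving F1 = 4/7.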
Fig. 2. F1 score for Hi-C interaction prediction per chromosome pair on human ES cells. 5-fold nested cross-validation is independently applied to each chromosome.
We also evaluate the performance of Coverage Hi-C in predicting chromosomal structures. Figure 3 shows TAD prediction performance from histone modifications on human ES cells. We compare the performance by Normalized Variation of Information (NVI) [18], where VI measures the distance between two partitions, so a lower score means better performance. We use Armatus [10] to detect TADs over the identified Hi-C interactions. The prediction performance of training with all histone modifications is almost the same as that of training with only the 4 modifications.
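The variation of information between two domain partitions can be sketched as below, with each genomic bin labeled by the domain it belongs to. The normalization is not specified in the text; dividing by $\log n$ (an upper bound on VI for $n$ items) is an assumption made for this example, as is the function name.

```python
import math
from collections import Counter


def normalized_vi(labels_a, labels_b):
    """VI(A, B) = H(A) + H(B) - 2 I(A; B), divided by log(n) (assumed normalization).

    labels_a, labels_b: domain label of each bin under the two partitions.
    """
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("partitions must label the same bins")
    if n <= 1:
        return 0.0
    pa, pb = Counter(labels_a), Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in pa.values())
    h_b = -sum(c / n * math.log(c / n) for c in pb.values())
    mi = sum(c / n * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
             for (a, b), c in pab.items())
    return max(0.0, h_a + h_b - 2 * mi) / math.log(n)


same = normalized_vi([0, 0, 1, 1], [0, 0, 1, 1])
independent = normalized_vi([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical partitions give a score of 0, and the two maximally mismatched partitions of four bins in the toy example give a score of 1.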
Fig. 3. Performance of TAD prediction by NVI on human ES cells for histone modifications.
4.4 Histone Modifications Are Important in Hi-C Interaction Prediction Across Species and Cell Types
We predict Hi-C interactions on mouse ES chromosomes with Coverage Hi-C trained on chromosome-wide human ES cell data. Figure 4a shows the F1 score in mouse chromosomes. Our cross-species prediction shows that the performance can vary from one training species to another. Prediction performance is the worst for the X chromosome, and intrachromosomal interactions can be predicted more accurately than interchromosomal ones. Similarly, Fig. 4b repeats the same analysis on human ES cells by training Coverage Hi-C over mouse ES chromosomes.
Fig. 4. F1 score for predicting a) mouse ES from human ES, b) human ES from mouse ES cells. 5-fold nested cross-validation is independently applied to each chromosome.
Fig. 5. F1 score for predicting a) human GM12878 from human ES, b) human ES from human GM12878 cells. 5-fold nested cross-validation is independently applied to each chromosome.
We also examine the impact of cell types on Hi-C interaction prediction in Figs. 5a and 5b. Prediction performance between human GM12878 and human ES cells is better than the prediction between species, suggesting that a common subset of histone modifications explains the genome shapes of different cell types. There is no significant performance difference between training on human ES and training on human GM12878.
5 Conclusions
We investigate how histone modifications and the interactions between them explain Hi-C interactions, and thus the three-dimensional genome organization. We present a novel covering-based method with a provable approximation guarantee to decompose Hi-C interactions in terms of the modifications. Experimental results on human and mouse show that a common subset of histone modifications can accurately predict Hi-C interactions across species and cell types. Via our trained models, we also accurately predict Hi-C interactions without using any Hi-C data, which is especially useful for understanding the 3D genome conformation of species with limited Hi-C data. Coverage Hi-C is also effective in identifying topologically-associated domains. Our cross-chromosome experiments show that the performance decreases when training on one chromosome and testing on another. The features identified as important across different chromosomes are quite similar, suggesting that the overall properties governing chromosomal interactions are similar across chromosomes. Overall, the analysis performed in this work provides insight into the impact of histone modifications, and of the interactions between them, on 3D genome shape. In the future, Coverage Hi-C can be extended to more recent multilocus chromatin interaction experiments through a hypergraph formalism instead of a graph. In addition, the problem can be cast as a constrained supermodular minimization
problem where the covered interactions are a supermodular function of the added modifications. Another variant of Coverage Hi-C is its prize-collecting version, where we allow a number of Hi-C interactions to remain uncovered by paying a penalty.
References
1. Babaei, S., Mahfouz, A., Hulsman, M., Lelieveldt, B.P.F., de Ridder, J., Reinders, M.: Hi-c chromatin interaction networks predict co-expression in the mouse cortex. PLOS Comput. Biol. 11(5), 1–21 (2015)
2. Bernstein, B.E., et al.: The NIH roadmap epigenomics mapping consortium. Nat. Biotechnol. 28(10), 1045–1048 (2010)
3. Bullmore, E., Sporns, O.: Complex brain networks: graph theoretical analysis of structural and functional systems. Nat. Rev. Neurosci. 10(3), 186–198 (2009)
4. Dekker, J., Marti-Renom, M.A., Mirny, L.A.: Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14(6), 390–403 (2013)
5. Di Pierro, M., Cheng, R.R., Lieberman Aiden, E., Wolynes, P.G., Onuchic, J.N.: De novo prediction of human chromosome structures: Epigenetic marking patterns encode genome architecture. Proc. Nat. Acad. Sci. 114(46), 12126–12131 (2017)
6. Dixon, J.R., Selvaraj, S., Yue, F., Kim, A., Li, Y., Shen, Y., Hu, M., Liu, J.S., Ren, B.: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485(7398), 376–380 (2012)
7. Durand, N.C., Shamim, M.S., Machol, I., Rao, S.S., Huntley, M.H., Lander, E.S., Aiden, E.L.: Juicer provides a one-click system for analyzing loop-resolution hi-c experiments. Cell Systems 3(1), 95–98 (2016)
8. Emre, S., Geet, D., Carl, K.: Deconvolution of ensemble chromatin interaction data reveals the latent mixing structures in cell subpopulations. J. Comput. Biol. 23(6), 425–438 (2016)
9. ENCODE Project Consortium, et al.: An integrated encyclopedia of DNA elements in the human genome. Nature 489(7414), 57–74 (2012)
10. Filippova, D., Patro, R., Duggal, G., Kingsford, C.: Identification of alternative topological domains in chromatin. Algorithms Molecular Biol. 9(1), 14 (2014)
11. Hajiaghayi, M.T., Jain, K., Lau, L.C., Măndoiu, I., Russell, A., Vazirani, V.V.: Minimum multicolored subgraph problem in multiplex PCR primer set selection and population haplotyping. In: Computational Science–ICCS 2006, pp. 758–766. Springer (2006)
12. Halldórsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., Istrail, S.: A survey of computational methods for determining haplotypes. Lecture Notes Comput. Sci. 2983, 26–47 (2004)
13. Hughes, J.R., Roberts, N., McGowan, S., Hay, D., Giannoulatou, E., Lynch, M., De Gobbi, M., Taylor, S., Gibbons, R., Higgs, D.R.: Analysis of hundreds of cis-regulatory landscapes at high resolution in a single, high-throughput experiment. Nat. Genetics 46(2), 205–212 (2014)
14. Konwar, K.M., Mandoiu, I.I., Russell, A., Shvartsman, A.A.: Improved algorithms for multiplex PCR primer set selection with amplification length constraints, pp. 41–50
15. Li, W., Wong, W.H., Jiang, R.: DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning. Nucleic Acids Res. 47(10), e60–e60 (2019)
16. Libbrecht, M.W., Ay, F., Hoffman, M.M., Gilbert, D.M., Bilmes, J.A., Noble, W.S.: Joint annotation of chromatin state and chromatin conformation reveals relationships among domain types and identifies domains of cell type-specific expression. Genome Research (2015)
17. Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., et al.: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326(5950), 289–293 (2009)
18. Meilă, M.: Comparing clusterings–an information based distance. J. Multivar. Anal. 98(5), 873–895 (2007)
19. Mifsud, B., Tavares-Cadete, F., Young, A.N., Sugar, R., Schoenfelder, S., Ferreira, L., Wingett, S.W., Andrews, S., Grey, W., Ewels, P.A., et al.: Mapping long-range promoter contacts in human cells with high-resolution capture hi-c. Nature genetics 47(6), 598–606 (2015)
20. Nora, E.P., Dekker, J., Heard, E.: Segmental folding of chromosomes: A basis for structural and regulatory chromosomal neighborhoods? In: BioEssays: News and Reviews in Molecular, Cellular and Developmental Biology (2013)
21. Optimization, G.: Gurobi optimizer reference manual (2020). http://www.gurobi.com
22. Phillips-Cremins, J.E., Sauria, M.E., Sanyal, A., Gerasimova, T.I., Lajoie, B.R., Bell, J.S., Ong, C.T., Hookway, T.A., Guo, C., Sun, Y., Bland, M.J., Wagstaff, W., Dalton, S., McDevitt, T.C., Sen, R., Dekker, J., Taylor, J., Corces, V.G.: Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell 153(6), 1281–1295 (2013)
23. Rao, S.S., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S., et al.: A 3d map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159(7), 1665–1680 (2014)
24. Schmitt, A., Hu, M., Jung, I., Xu, Z., Qiu, Y., Tan, C., Li, Y., Lin, S., Lin, Y., Barr, C., Ren, B.: A compendium of chromatin contact maps reveals spatially active regions in the human genome. Cell Reports 17(8), 2042–2059 (2016)
25. Schreiber, J., Singh, R., Bilmes, J., Noble, W.S.: A pitfall for machine learning methods aiming to predict across cell types. bioRxiv (2019)
26. Sefer, E., Kingsford, C.: Semi-nonparametric modeling of topological domain formation from epigenetic data. Algorithms Molecular Biol. 14(1), 4 (2019)
27. Trieu, T., Martinez-Fundichely, A., Khurana, E.: Deepmilo: a deep learning approach to predict the impact of non-coding sequence variants on 3d chromatin structure. Genome Biol. 21(1), 79 (2020)
Network Models
Fast Multipole Networks
Steve Huntsman
Virginia, USA
Abstract. Two prerequisites for robotic multiagent systems are mobility and communication. Fast multipole networks (FMNs) enable both ends within a unified framework. FMNs can be organized very efficiently in a distributed way from local information and are ideally suited for motion planning using artificial potentials. We compare FMNs to conventional communication topologies, and find that FMNs offer competitive communication performance (including higher network efficiency per edge at marginal energy cost) in addition to advantages for mobility.
1 Introduction
A multirobot system [23] is a group of autonomous, networked robots. In order to achieve a complex goal such as swarming [6], the system requires distributed coordination of both mobility and communication, among other objectives. This is nontrivial, and “[e]fficient networking of many-robot systems is considered one of the grand challenges of robotics” [30]. The respective enabling technologies for mobility and communication are path planning and mobile ad hoc networks (MANETs) [27]. While networks are inevitably analyzed from the perspective of graph theory [2], path planning may be considered in either graph-theoretical [29] or continuous settings. Meanwhile, because geometrical considerations such as distance and motion strongly influence the structure of MANETs, it is natural to try to address mobility and communication for multirobot systems together, e.g., as in [37]. Much effort has focused on connectivity maintenance in situations where, e.g., multirobot systems maintain periodic connectivity [15] or communicate by physically meeting [19] while pursuing a motion objective, or maintain continuous connectivity relative to a fixed set of access points [11,17,18]. Additionally, co-optimization of communication and motion or coverage for an individual robot has been considered in [30,41]. More recently, tree-based approaches for connectivity maintenance have been considered in [24,28,39]. In this paper, we assume connectivity is possible (by using more energy if necessary) without any optimization, and we introduce a class of network backbones that can be trivially formed using an efficient local motion planning technique. These fast multipole networks (FMNs) support both mobility and communication within a unified framework.
The basic idea is to follow common practice in modeling robots, goals, and obstacles as (superpositions of) charged particles satisfying the Laplace equation $\nabla^2 \phi = 0$ [7,21,34] and exploit the fast multipole method (FMM), an efficient algorithm for simulating particle dynamics [3,4,12], to simultaneously determine a sparse network topology that supports efficient communication. The animating principle that the far-field behavior of point charges [16] should determine a communication topology is geometrically natural. More surprisingly, we shall demonstrate that it leads to network topologies that perform well in their own right, with higher network efficiency per edge (at marginal energy cost) than standard topologies that ignore mobility. After briefly reviewing the artificial potential approach to path planning in Sect. 2 and the FMM in Sect. 3, we introduce FMNs in Sect. 4, and compare them to conventional MANET topologies in Sect. 5 before making concluding remarks in Sect. 6.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 425–436, 2021. https://doi.org/10.1007/978-3-030-65351-4_34
2 Artificial Potentials
The use of artificial potentials in motion and path planning has a long history, most frequently identified as beginning with [20]. The basic idea is to design a potential $\phi$ such that the equation of motion $m\ddot{x} = -\nabla\phi$ results in a desired trajectory $x$. Towards this end, goals and obstacles are respectively modeled by attractive and repulsive terms contributing to the total potential $\phi$. Depending on circumstances, we may choose to model the robots as “sources” with potentials of their own (e.g., to avoid collisions), or as passive “targets” that simply move along the gradient of an ambient potential. In general, we might consider essentially arbitrary forms for each term to produce very detailed behavior. Alternatively, we might rely on a single simple form for all the terms. Our approach is in the latter vein. The relative strengths and spatial distribution of these terms are chosen to establish priorities, spatially extended features, etc. In order to represent sufficiently complex spatial relationships along these lines, it is helpful to have an algorithmic framework that scales better in total computational effort, parallelism, and locality than evaluating $O(N^2)$ interactions, since the number $N$ of terms in the potential can be much larger than the number of robots involved. Besides these computational concerns, a problem with using artificial potentials that was identified at an early stage is the possible presence of local minima in the potential field that can trap agents [22]. To remedy this by construction, the notion of a navigation function that has a single minimum at the goal was developed, along with algorithms for constructing such functions [36]. A particularly simple way to avoid local minima while using a single form for all the potential terms is with a superposition of harmonic potentials [7,21,34], i.e., solutions to the Laplace equation $\nabla^2 \phi = 0$, with a dominant term at the goal.
This is most readily achieved through a discrete (if perhaps quasi-continuous) superposition of point charges, i.e., potentials of the form $-qV(|x - x_0|)$ (the sign is for physical reasons), where the fundamental solution $V(|x|)$ to the Laplace equation is defined by $\nabla^2 V(|x|) = \delta(x)$, and as usual $\delta$ indicates the Dirac delta distribution [38]. For $\mathbb{R}^d$, it turns out that $V'(r) = 1/A_d(r)$, where $A_d(r)$ is the Minkowski content (i.e., generalized perimeter, surface area, etc.) of the sphere of radius $r$ in $\mathbb{R}^d$. Choosing the most convenient constants of integration, for $d = 2$ we have $V(r) = \frac{1}{2\pi} \log r$, and for $d = 3$ we have $V(r) = -1/(4\pi r)$.
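To make the sign conventions concrete, the following minimal Python sketch evaluates a planar superposition of point charges and the resulting force $-\nabla\phi$ by direct summation. It assumes only the two-dimensional fundamental solution $V(r) = \frac{1}{2\pi}\log r$; the function names `potential` and `force` and the representation of charges as (location, q) pairs are illustrative choices, not part of the original work.

```python
import math

def potential(x, charges):
    """phi(x) = sum_j -q_j * V(|x - xi_j|), with V(r) = log(r) / (2*pi) in the plane.

    `charges` is a list of ((cx, cy), q) pairs (a hypothetical representation).
    """
    return sum(-q * math.log(math.dist(x, xi)) / (2 * math.pi) for xi, q in charges)

def force(x, charges):
    """Return -grad phi(x), the right-hand side of m x'' = -grad phi."""
    fx = fy = 0.0
    for (cx, cy), q in charges:
        dx, dy = x[0] - cx, x[1] - cy
        r2 = dx * dx + dy * dy
        # grad of -q*log(r)/(2*pi) is -q*(x - xi)/(2*pi*r^2); the force is its negative.
        fx += q * dx / (2 * math.pi * r2)
        fy += q * dy / (2 * math.pi * r2)
    return fx, fy

# A negative charge (goal) at the origin pulls a test point at (1, 0) toward it;
# a positive charge (obstacle) pushes it away.
goal = [((0.0, 0.0), -1.0)]
obstacle = [((0.0, 0.0), 1.0)]
pull = force((1.0, 0.0), goal)
push = force((1.0, 0.0), obstacle)
```

With this convention a negative charge attracts and a positive charge repels. The direct sum costs one kernel evaluation per (target, source) pair, which is the quadratic cost that motivates a faster summation scheme.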
3 The Fast Multipole Method
Naive simulation of $N$ interacting point charges (e.g., the goals and obstacles modeled in Fig. 1) requires computing the interactions of each pair of charges, and hence $O(N^2)$ operations per time step, which is prohibitive for large-scale $N$-body simulations. The FMM [3,4,12] enables the simulation cost to be reduced to $O(N)$ with an extremely high degree of locality and parallelism [13] (footnote 1). The key ideas underlying the FMM are i) a specification of accuracy (for truncating expansions in a controlled way); ii) decomposing space hierarchically to get well-separated charge clusters (footnote 2); iii) representing well-separated clusters of point charges with multipole expansions that maintain a desired approximation error $\varepsilon$ with as few ($\log_2(1/\varepsilon)$) terms as possible, leaving nearby particles to interact directly. In particular, the FMM recursively builds a quad-tree (Fig. 1; in three dimensions, an octo-tree is used instead) whose leaves are associated with boxes and truncated multipole expansions. This tree approximates a (typically much) finer tree whose leaves are associated with individual point charges that are well-separated and their monopoles. Importantly, the FMM tree topology essentially ignores the values of charges, depending only on the desired level of accuracy $\varepsilon$ (footnote 3) and the locations of the charges. The computationally expedient part of the FMM is to manipulate the origins and coefficients of controlled series approximations to far-field potentials for clusters of point charges that are well-separated. More general incarnations of the FMM (see, e.g., [26,42,43]) amount to a very efficient scheme for computing sums of the form $\sum_{j=1}^{N} K(x_i, \xi_j)\psi(\xi_j)$ for a given kernel $K$: i.e., the FMM and its generalizations are essentially specialized matrix multiplication algorithms.
From this perspective, item iii) in the list above separates into [3]
– a far-field expansion of the kernel $K(x, \xi)$ that decouples the influence of the evaluation/target point $x$ and the source point $\xi$;
– (optionally) a conversion of far-field expansions into local ones.

Footnote 1. For the calculations in this paper, we used the very user-friendly library FMMLIB2D, available at https://cims.nyu.edu/cmcl/fmm2dlib/fmm2dlib.html.
Footnote 2. Two clusters of points $\{x_j\}$ and $\{y_k\}$ are well-separated iff there exist $x_0, y_0$ such that $\{x_j\} \subset B^\circ_{x_0}(r)$ and $\{y_k\} \subset B^\circ_{y_0}(r)$ with $|x_0 - y_0| > 3r$; here $\circ$ denotes interior. Two squares with side length $r$ are well-separated iff they are at distance $\ge r$.
Footnote 3. Though in principle the desired level of accuracy can be affected by charge values, this situation is sufficiently pathological that we can safely disregard it in practice.

Fig. 1. (L) A toy scenario in $[-1, 1]^2$. Goals are modeled by negative charges and shown in blue; obstacles are modeled by positive charges and shown in red. Opacity indicates relative magnitude. $10^3$ robots are modeled by test points (versus, e.g., test charges of small positive sign) and their locations and velocities are indicated by black gradient vectors of the artificial potential. The target locations are distributed as $\frac{4}{5} U(\text{top half}) + \frac{1}{5} U(\text{bottom half})$, where here $U$ indicates a uniform distribution. (R) The quad-tree associated to the scenario on the left. Varying the desired precision in the FMM has very little effect on this tree; as a practical matter it can be assumed unique.

The FMM’s remarkable scaling performance has enabled petascale simulations of turbulence [45], molecular dynamics [32], and cosmological dynamics [35], and will also enable future exascale simulations across hundreds of thousands of nodes [44]. This performance makes the FMM a natural choice for large scale path planning using artificial potentials. Equally important for the considerations of this paper, however, are the hierarchical and spatial locality properties that the FMM exploits in order to communicate internally. The FMM’s patterning of a logical intra-algorithm communication network after the spatial distribution of particles suggests that it can be used not only for large-scale multirobot path planning in complex geometries, but also to help organize the communications between robots in a distributed way. Furthermore, although the FMM’s hierarchical properties might seem to imply centralization, the computational load is small enough that these functions can be easily duplicated among robots with low overhead, i.e., the FMM tree does not impose centralization.
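The far-field expansion in item iii) of this section can be illustrated in the plane, where points are identified with complex numbers. For sources $z_j$ clustered around a center $c$, the standard expansion $\sum_j q_j \log(z - z_j) = Q \log(z - c) + \sum_{k \ge 1} a_k/(z - c)^k$, with $Q = \sum_j q_j$ and $a_k = -\sum_j q_j (z_j - c)^k / k$, converges geometrically for well-separated targets $z$. The sketch below is only an illustration under these assumptions: it omits the factor $1/2\pi$ and all tree logic, its function names are hypothetical, and it is not the FMMLIB2D interface.

```python
import cmath
import math

def direct_potential(z, sources):
    """Direct sum of q_j * log|z - z_j| (the 1/(2*pi) factor is omitted)."""
    return sum(q * math.log(abs(z - zj)) for zj, q in sources)

def multipole_coefficients(sources, center, p):
    """Total charge Q and coefficients a_1..a_p of the expansion about `center`."""
    total_charge = sum(q for _, q in sources)
    coeffs = [sum(-q * (zj - center) ** k / k for zj, q in sources)
              for k in range(1, p + 1)]
    return total_charge, coeffs

def multipole_potential(z, center, total_charge, coeffs):
    """Evaluate the truncated expansion at a target z far from the cluster."""
    w = z - center
    series = sum(a_k / w ** k for k, a_k in enumerate(coeffs, start=1))
    # The real part of the complex potential is sum_j q_j * log|z - z_j|.
    return (total_charge * cmath.log(w) + series).real

# Three charges within distance 0.1 of the origin, evaluated at a distant target.
sources = [(0.05 + 0.02j, 1.0), (-0.03 + 0.08j, -2.0), (0.01 - 0.09j, 0.5)]
target = 1.0 + 1.0j
Q, a = multipole_coefficients(sources, 0j, 10)
approx = multipole_potential(target, 0j, Q, a)
exact = direct_potential(target, sources)
```

Once the coefficients are formed, each far-field evaluation touches only the truncated series rather than every source, which is the decoupling of target and source points described in the first item.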
4 Fast Multipole Networks
We construct the fast multipole network $\mathrm{FMN}(\xi)$ corresponding to a configuration of points $\xi_j \in \mathbb{R}^2$ as follows: vertices correspond to the charge locations and we introduce edges that
– connect all vertices in the same FMM leaf box;
– connect nearest vertices in adjacent leaf boxes;
– connect otherwise isolated vertices to their nearest neighbors.
These edges are respectively colored blue, cyan, and red in Fig. 2 (footnote 4). By construction, $\mathrm{FMN}(\xi)$ is connected, and the information required to generate it is automatically produced by the FMM. We note that while $\mathrm{FMN}(\xi)$ is constructed using the quad- or octo-tree of the FMM, it is very far from a tree. Rather, the FMM tree and its corresponding coarse-graining of space determines which nodes are permitted to communicate directly (footnote 5). Within a clique of permitted communications corresponding to a leaf of the FMM tree, we may further restrict communications to avoid quadratic bandwidth overhead and/or energy, though we do not consider such tactics further here.
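The three edge rules above can be sketched in a few lines of Python. This is a simplified illustration rather than the construction used for the figures: it assumes a uniform grid of square boxes as a stand-in for the adaptive FMM leaf boxes, treats boxes sharing an edge or a corner as adjacent, and connects only the single closest pair of vertices between adjacent boxes. The function name `fmn_edges` and the parameter `box_size` are hypothetical.

```python
import math
from itertools import combinations

def fmn_edges(points, box_size):
    """Apply the three FMN edge rules on a uniform grid (a stand-in for FMM leaves)."""
    # Assign each vertex index to the grid box that contains it.
    boxes = {}
    for i, (x, y) in enumerate(points):
        key = (math.floor(x / box_size), math.floor(y / box_size))
        boxes.setdefault(key, []).append(i)

    def dist(i, j):
        return math.dist(points[i], points[j])

    edges = set()
    # Rule 1: connect all vertices in the same box.
    for members in boxes.values():
        for i, j in combinations(members, 2):
            edges.add((min(i, j), max(i, j)))
    # Rule 2: connect the nearest pair of vertices in each pair of adjacent boxes.
    # Four of the eight neighbor offsets suffice to visit every unordered pair once.
    for (bx, by), members in boxes.items():
        for dx, dy in ((1, -1), (1, 0), (1, 1), (0, 1)):
            others = boxes.get((bx + dx, by + dy))
            if others:
                u, v = min(((a, b) for a in members for b in others),
                           key=lambda pair: dist(*pair))
                edges.add((min(u, v), max(u, v)))
    # Rule 3: connect otherwise isolated vertices to their nearest neighbor.
    degree = [0] * len(points)
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    for i in range(len(points)):
        if degree[i] == 0 and len(points) > 1:
            j = min((k for k in range(len(points)) if k != i),
                    key=lambda k: dist(i, k))
            edges.add((min(i, j), max(i, j)))
    return edges

pts = [(0.1, 0.1), (0.2, 0.2), (1.1, 0.1), (5.5, 5.5)]
example_edges = fmn_edges(pts, 1.0)
```

In this small example the first two points share a box, the third sits in an adjacent box, and the fourth is isolated and therefore linked to its nearest neighbor. The simplified grid version does not by itself guarantee connectivity.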
Fig. 2. The FMN corresponding to the scenario in Fig. 1. Nodes are colored by betweenness centrality according to the colorbar on the right. The spatial decomposition from Fig. 1 is shown in gray for reference. Edges within an FMM box are blue, while edges connecting nearest nodes in adjacent boxes are cyan and edges connecting otherwise isolated nodes to their nearest neighbors are red.
Footnote 4. The key difference between FMNs and the networks considered in [48] is that the latter are formed by inserting and permanently linking nearby charges, then dynamically evolving to obtain small-world features, whereas FMNs are (re)formed by linking nearby charges in a way that partially anticipates the next timestep of dynamical evolution. However, both types of networks exhibit aspects of small-world behavior (see Sect. 5 and [25]).
Footnote 5. Limiting permission for direct communication in FMNs can be enforced by, e.g., cognitive radios [46] whose spectrum allocation cooperates with the FMM tree.
5 Evaluation
We now introduce several families of graphs for evaluation purposes. Let $\xi_j \in \mathbb{R}^2$ for $1 \le j \le N$, and let $r > 0$. The random geometric graph or disk graph (RGG; Fig. 3) $\mathrm{RGG}(\xi; r)$ has vertices $\xi_j$ and edges $E(\mathrm{RGG}(\xi; r)) := \{(\xi_j, \xi_k) : d(\xi_j, \xi_k) \le r\}$ [14,33]. By construction, an RGG is both the most effective network topology from the point of view of information exchange, and the least effective network topology from the point of view of infrastructure costs. A more conservative topology is based on subgraphs of the Delaunay graph. The Delaunay graph $D(\xi)$ has vertices $\xi_j$ and edges defined from a triangulation of the vertices such that no vertex is interior to a circle circumscribed about a triangle [5,9,10] (footnote 6).
Fig. 3. (L) RGG(ξ; r) for ξ corresponding to the scenario in Fig. 1 and r = 0.135, slightly above the threshold for connectivity. (R) RD(ξ; r).
The Gabriel graph $G(\xi)$ [29,31] is the unique (for the general position case) subgraph of the Delaunay graph such that each edge corresponds to the diameter of a disk that does not contain any other vertices; it is frequently considered as a potential candidate for “virtual backbones” in MANETs. It is worth noting however that $G(\xi)$ and $D(\xi)$ are more computationally expensive to construct than $\mathrm{FMN}(\xi)$, and parallelism does not change this. Because the Delaunay and Gabriel graphs do not have an intrinsic range parameter that will give a granular mechanism for evaluating their performance, we shall focus our attention on the (minimal) restricted Delaunay graph (Fig. 3) $\mathrm{RD}(\xi; r) := D(\xi) \cap \mathrm{RGG}(\xi; r)$ [1] and the restricted Gabriel graph (Fig. 4) $\mathrm{RG}(\xi; r) := G(\xi) \cap \mathrm{RGG}(\xi; r)$. Similarly, we shall consider the restricted FMN (Fig. 4) obtained along the lines $\mathrm{RFMN}(\xi; r) := \mathrm{FMN}(\xi) \cap \mathrm{RGG}(\xi; r)$.
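The definitions of the random geometric graph, the Gabriel graph, and their intersection translate directly into a brute-force Python sketch. It is meant only to make the definitions concrete: the Gabriel test below is cubic in the number of points (practical implementations typically start from a Delaunay triangulation), points exactly on the boundary of a diameter disk are treated as not blocking an edge, and all function names are illustrative assumptions.

```python
import math
from itertools import combinations

def rgg_edges(points, r):
    """Edges of RGG(xi; r): all pairs at Euclidean distance at most r."""
    return {(j, k) for j, k in combinations(range(len(points)), 2)
            if math.dist(points[j], points[k]) <= r}

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def gabriel_edges(points):
    """Edges (j, k) whose diameter disk contains no other vertex in its interior."""
    n = len(points)
    edges = set()
    for j, k in combinations(range(n), 2):
        d2 = squared_distance(points[j], points[k])
        # A point m is strictly inside the disk with diameter (j, k) exactly when
        # |m - j|^2 + |m - k|^2 < |j - k|^2 (the angle at m is obtuse).
        if all(squared_distance(points[m], points[j])
               + squared_distance(points[m], points[k]) >= d2
               for m in range(n) if m not in (j, k)):
            edges.add((j, k))
    return edges

def restricted_gabriel_edges(points, r):
    """RG(xi; r) as the intersection of the Gabriel graph with RGG(xi; r)."""
    return gabriel_edges(points) & rgg_edges(points, r)

pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.5), (0.0, 3.0)]
gabriel = gabriel_edges(pts)
restricted = restricted_gabriel_edges(pts, 1.5)
```

The same intersection pattern gives the restricted Delaunay graph and the restricted FMN once the corresponding unrestricted edge sets are available.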
Footnote 6. For $\xi_j$ in general position, the Delaunay graph is unique.
Fig. 4. (L) $\mathrm{RG}(\xi; r)$ for $\xi$ corresponding to the scenario in Fig. 1 and $r = 0.135$, slightly above the threshold for connectivity. (R) $\mathrm{RFMN}(\xi; r)$.
The basic evaluation metric we use is the efficiency of a graph $G = (V(G), E(G))$, defined as the average inverse distance between distinct vertices, i.e.,

$$\mathrm{eff}(G) := \binom{|V(G)|}{2}^{-1} \sum_{\substack{j,k \in V(G) \\ j \ne k}} \frac{1}{d_{jk}}, \tag{1}$$

where the distance $d_{jk}$ between vertices $j$ and $k$ is computed in the obvious way from a given distance on edges (by default, we may always choose the hop metric assigning 1 to each edge). While the efficiency characterizes how well a network supports information flow [25], it neglects costs (e.g., bandwidth, energy, etc.) associated to edges as infrastructure. For this reason we will also consider the efficiency per edge, i.e., $\mathrm{eff}(G)/|E(G)|$. Although other normalizations may be more appropriate in certain situations, this particular one strikes a good balance between convenience/generality and detail, especially for the hop metric. Figure 5 shows the metrics above for 100 simulations of $10^3$ uniformly distributed test points in $[-1, 1]^2$ subject to the ambient potential from Fig. 1. It is apparent from the figure that FMNs and their range-restricted versions are worthy candidates for network backbones in their own right even before accounting for their mobility-specific advantages. Furthermore, although there exist efficient local and parallel algorithms for constructing Delaunay graphs [5,9,10], their computation and communication complexity and scaling behavior are still inferior to the FMM. Figure 6 shows metrics relating to degree distributions and efficiency per unit energy, i.e., $\mathrm{eff}(G)/\mathrm{energy}_\bullet(G)$, where (ignoring an irrelevant constant of proportionality) the unidirectional energy for a metric graph $G$ is

$$\mathrm{energy}_{\mathrm{uni}}(G) := \sum_{(j,k) \in E(G)} d_{jk}^2 \tag{2}$$
Fig. 5. Network metrics for $\mathrm{RGG}(\xi; r)$ (in black), $\mathrm{RD}(\xi; r)$ (in red), $\mathrm{RG}(\xi; r)$ (in magenta), and $\mathrm{RFMN}(\xi; r)$ (in blue) for 100 simulations of $N = 10^3$ uniformly distributed test charges in $[-1, 1]^2$. Although $\mathrm{RGG}(\xi; r)$ is most efficient, this network performance comes at a high cost in edges, and $\mathrm{RFMN}(\xi; r)$ performs well (and for hop efficiency per edge, the best) for all measures of efficiency. Note that $\mathrm{RFMN}(\xi; r) = \mathrm{FMN}(\xi)$ for sufficiently large $r$ within the range shown. We also have that, e.g., $\mathrm{RD}(\xi; r') = D(\xi)$, and though the corresponding $r'$ is outside the range shown, the residual effects are minimal.
and the omnidirectional energy is

$$\mathrm{energy}_{\mathrm{omni}}(G) := \sum_{j \in V(G)} \Bigl( \max_{\substack{k \in V(G) \\ (j,k) \in E(G)}} d_{jk} \Bigr)^2. \tag{3}$$
These quantities model the total energy budgets required to transmit uni- and omnidirectional signals, respectively. Figure 6 highlights that FMNs continue to perform marginally better than Delaunay graphs and marginally worse than Gabriel graphs for energy-normalized measures of network efficiency.
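The metrics of this section are straightforward to compute for small graphs. The Python sketch below uses the hop metric for the efficiency and Euclidean edge lengths for the two energies. It makes two assumptions that should be read as such: the sum in (1) is taken over unordered pairs of distinct vertices, so that the result is an average, and unreachable pairs contribute zero; the omnidirectional energy squares each vertex's longest incident edge before summing over vertices. All function names are illustrative.

```python
import math
from collections import deque

def hop_efficiency(n, edges):
    """Average inverse hop distance over unordered pairs of distinct vertices."""
    adj = [[] for _ in range(n)]
    for j, k in edges:
        adj[j].append(k)
        adj[k].append(j)
    total = 0.0
    for s in range(n):
        # Breadth-first search gives hop distances from s.
        hops = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in hops:
                    hops[v] = hops[u] + 1
                    queue.append(v)
        # Count each unordered pair once; unreachable pairs add nothing.
        total += sum(1.0 / d for v, d in hops.items() if v > s)
    return total / math.comb(n, 2)

def energy_uni(points, edges):
    """Sum of squared Euclidean edge lengths."""
    return sum(math.dist(points[j], points[k]) ** 2 for j, k in edges)

def energy_omni(points, edges):
    """Sum over vertices of the squared length of the longest incident edge."""
    reach = [0.0] * len(points)
    for j, k in edges:
        d = math.dist(points[j], points[k])
        reach[j] = max(reach[j], d)
        reach[k] = max(reach[k], d)
    return sum(d * d for d in reach)

# A three-vertex path with edge lengths 1 and 2.
pts = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
path = [(0, 1), (1, 2)]
eff = hop_efficiency(len(pts), path)
eff_per_edge = eff / len(path)
```

For the path example, the pairwise hop distances are 1, 1, and 2, and the middle vertex must reach its farther neighbor, which is why the omnidirectional energy exceeds the unidirectional one.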
Fig. 6. Clockwise from top left, and for the same simulations as Fig. 5: the degree distributions of $\mathrm{RGG}(\xi; r)$, $\mathrm{RD}(\xi; r)$, $\mathrm{RG}(\xi; r)$, and $\mathrm{RFMN}(\xi; r)$ for $r$ equal to the connectivity threshold; the total energy (in arbitrary units) required for the networks; the hop efficiency per unit energy, and the Euclidean efficiency per unit energy.
6 Remarks
By virtue of calculating potentials and forces, the FMM/FMN approach enables dynamic and predictable network topology reconfiguration with minimal cost and effort. In other words, as robots use the FMM to efficiently compute their motion according to a navigation function supplied by a superposition of point charges, the FMN is easily updated and efficiently represented. Incorporating resilient routing reconfiguration [8,40] on (F + 1)-connected local subgraphs of the FMN can be done with reasonable computational effort (e.g., the key linear program is quickly and easily solved in MATLAB for realistic networks of ≈ 50 nodes). This enables virtually instantaneous failover and rerouting in the presence of ≤ F link failures. Combining this local approach with a separate (perhaps similar) routing protocol to handle wide-area network traffic and obstacle potentials that prevent deterioration of basic connectivity can ensure network integrity and basic quality of service (QoS). These features can render our framework competitive with the approach of [37], which centers on the higher-level functions of network integrity and QoS, and which uses a convex program instead of an algorithmically simpler linear program. Along similar lines, [47] shows how to construct artificial potentials that discourage loss of connectivity. Although these fields are not harmonic, it is plausible that this idea can be adapted to the present context. It is worth pointing out that there are FMM variants for non-harmonic potentials, e.g. power laws, (generalized) multiquadrics [3,43], or more general kernels [26], and many of these have actually been applied in the context of interpolation and/or physical simulation. However, using non-harmonic potentials eliminates the automatic guarantee that there are no metastable local minima. 
We note in particular that the kernel-independent FMM variant of [42] exploits the existence and uniqueness of solutions to elliptic boundary value problems [38] to represent clustered sources in far field based on their behavior on a suitable boundary. This perspective suggests an extension of FMNs to sources modeled by fundamental solutions of elliptic partial differential equations.

Acknowledgements. We thank Brendan Fong, Marco Pravia, and David Spivak for their comments.
References
1. Avin, C.: Fast and efficient restricted Delaunay triangulation in random geometric graphs. Internet Math. 5, 195 (2008)
2. Barrat, A., Barthélemy, M., Vespignani, A.: Dynamical Processes on Complex Networks. Cambridge (2008)
3. Beatson, R., Greengard, L.: A short course on fast multipole methods. In: Ainsworth, M., et al. (eds.) Wavelets, Multilevel Methods, and Elliptic PDEs, Oxford (1997)
4. Board, J., Schulten, L.: The fast multipole algorithm. Comp. Sci. Eng. 2, 76 (2000)
5. Chen, R., Gotsman, C.: Localizing the Delaunay triangulation and its parallel implementation. In: ISVD (2012)
6. Chung, S.-J., et al.: A survey on aerial swarm robotics. IEEE Trans. Robot. 34, 837 (2018)
7. Connolly, C.I., Burns, J.B., Weiss, R.: Path planning using Laplace’s equation. In: ICRA (1990)
8. DeCleene, B., Huntsman, S.: Wireless resilient routing reconfiguration. arXiv:1904.04865 (2019)
9. Fuetterling, V., Lojewski, C., Pfreundt, F.-J.: High-performance d-D Delaunay triangulations for many-core computers. In: HPG (2014)
10. Funke, D., Sanders, P.: Parallel d-D Delaunay triangulations in shared and distributed memory. In: ALENEX (2017)
11. Ghaffarkhah, A., Mostofi, Y.: Communication-aware motion planning in mobile networks. IEEE Trans. Auto. Control 56, 2478 (2011)
12. Greengard, L., Rokhlin, V.: A fast algorithm for particle simulations. J. Comp. Phys. 73, 325 (1987)
13. Greengard, L., Gropp, W.D.: A parallel version of the fast multipole method. Comp. Math. Appl. 20, 63 (1990)
14. Haenggi, M.: Stochastic Geometry for Wireless Networks. Cambridge (2013)
15. Hollinger, G., Singh, S.: Multi-robot coordination with periodic connectivity. In: ICRA (2010)
16. Jackson, J.D.: Classical Electrodynamics. 3rd ed. Wiley (1998)
17. Kantaros, Y., Zavlanos, M.M.: Distributed communication-aware coverage control by mobile sensor networks. Automatica 63, 209 (2016)
18. Kantaros, Y., Zavlanos, M.M.: Global planning for multi-robot communication networks in complex environments. IEEE Trans. Robotics 32, 1045 (2016)
19. Kantaros, Y., Guo, M., Zavlanos, M.M.: Temporal logic task planning and intermittent connectivity control of mobile robot networks. IEEE Trans. Auto. Control 64, 4105 (2019)
20. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. In: ICRA (1985)
21. Kim, J.-O., Khosla, P.K.: Real-time obstacle avoidance using harmonic potential functions. IEEE Trans. Robot. Automat. 8, 501 (1992)
22. Koren, Y., Borenstein, J.: Potential field method and their inherent limitations for mobile robot navigation. In: ICRA (1991)
23. Knorn, S., Chen, Z., Middleton, R.H.: Overview: collective control of multiagent systems. IEEE Trans. Cont. Net. Sys. 3, 334 (2016)
24. Krupke, D., et al.: Distributed cohesive control for robot swarms: maintaining good connectivity in the presence of exterior forces. In: IROS (2015)
25. Latora, V., Marchiori, M.: Efficient behavior of small-world networks. Phys. Rev. Lett. 87, 198701 (2001)
26. Létourneau, P.-D., Cecka, C., Darve, E.: Cauchy fast multipole method for general analytic kernels. SIAM J. Sci. Comp. 36, A396 (2014)
27. Loo, J., Mauri, J.L., Ortiz, J.H. (eds.) Mobile Ad Hoc Networks. CRC (2016)
28. Majcherczyk, N., et al.: Decentralized connectivity-preserving deployment of large-scale robot swarms. In: IROS (2018)
29. Mesbahi, M., Egerstedt, M.: Graph Theoretic Methods in Multiagent Networks. Princeton (2010)
30. Minelli, M., et al.: Stop, think and roll: online gain optimization for resilient multirobot topologies. In: DARS (2019)
31. Norrenbrock, C.: Percolation threshold on planar Euclidean Gabriel graphs. Eur. Phys. J. B 89, 111 (2016)
32. Ohno, Y., et al.: Petascale molecular dynamics simulation using the fast multipole method on K computer. Comp. Phys. Comm. 185, 2575 (2014)
33. Penrose, M.: Random Geometric Graphs. Oxford (2003)
34. Pimenta, L.C.A., et al.: On computing complex navigation functions. In: ICRA (2005)
35. Potter, D., Stadel, J., Teyssier, R.: PKDGRAV3: beyond trillion particle cosmological simulations for the next era of galaxy surveys. Comp. Astrophys. Cosmol. 4, 2 (2017)
36. Rimon, E., Koditschek, D.E.: Exact robot navigation using artificial potential functions. IEEE Trans. Robot. Automat. 8, 501 (1992)
37. Stephan, J., et al.: Concurrent control of mobility and communication in multirobot systems. IEEE Trans. Robot. 33, 1248 (2017)
38. Taylor, M.E.: Partial Differential Equations: Basic Theory. Springer, Cham (1996)
39. Varadharajan, V.S., Adams, B., Beltrame, G.: The unbroken telephone game: keeping systems connected. In: AAMAS (2019)
40. Wang, Y., et al.: R3: resilient routing reconfiguration. In: SIGCOMM (2010)
41. Yan, Y., Mostofi, Y.: Co-optimization of communication and motion planning of a robotic operation in fading environments. In: ACSSC (2011)
42. Ying, L., Biros, G., Zorin, D.: A kernel-independent adaptive fast multipole algorithm in two and three dimensions. J. Comp. Phys. 196, 591 (2004)
43. Ying, L.: A kernel-independent fast multipole algorithm for radial basis functions. J. Comp. Phys. 213, 457 (2006)
44. Yokota, R., Barba, L.A.: A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems. Int. J. High Perf. Comp. Appl. 26, 337 (2012)
45. Yokota, R., et al.: Petascale turbulence simulation using a highly parallel fast multipole method on GPUs. Comp. Phys. Comm. 184, 445 (2013)
46. Yu, R.F. (ed.) Cognitive Radio Mobile Ad Hoc Networks. Springer (2011)
47. Zavlanos, M.M., Pappas, G.J.: Potential fields for maintaining connectivity of mobile networks. IEEE Trans. Robotics 23, 812 (2007)
48. Zitin, A., et al.: Spatially embedded growing small-world networks. Sci. Rep. 4, 7047 (2015)
A Random Growth Model with Any Real or Theoretical Degree Distribution
Frédéric Giroire (1), Stéphane Pérennes (1), and Thibaud Trolliet (2)
(1) Université Côte d’Azur/CNRS, Sophia-Antipolis, France
(2) INRIA Sophia-Antipolis, Sophia-Antipolis, France
Abstract. The degree distributions of complex networks are usually considered to follow a power law. However, this is not the case for a large number of them. We thus propose a new model able to build random growing networks with (almost) any wanted degree distribution. The degree distribution can either be theoretical or extracted from a real-world network. The main idea is to invert the recurrence equation commonly used to compute the degree distribution in order to find a convenient attachment function for node connections, which is commonly chosen to be linear. We compute this attachment function for some classical distributions, such as the power-law, broken power-law, geometric and Poisson distributions. We also use the model on an undirected version of the Twitter network, for which the degree distribution has an unusual shape.
Keywords: Complex networks · Random growth model · Preferential attachment · Degree distribution · Twitter
1 Introduction
Complex networks appear in the empirical study of real-world networks from various domains, such as social science, biology, economics, technology, etc. Most of those networks exhibit common properties, such as a high clustering coefficient, communities, etc. Probably the most studied of those properties is the degree distribution (named DD in the rest of the paper), which is often observed as following a power-law distribution. Random network models have thus focused on being able to build graphs exhibiting power-law DDs, such as the well-known Barabási-Albert model [2] or the Chung-Lu model [7], but also models for directed networks [4] or for networks with communities [20]. However, it is common to find real networks with DDs not perfectly following a power-law. For instance, among social networks, Facebook has been shown to follow a broken power-law (footnote 1) [13], while Twitter only has the distribution tail following a power-law and some atypical behaviors due to Twitter’s policies, as we report in Sect. 5.1.
Footnote 1. We call a broken power-law a concatenation of two power-laws, as defined in [14].
This work has been supported by the French government through the UCA JEDI (ANR-15-IDEX-01) and EUR DS4H (ANR-17-EURE-004) Investments in the Future projects, by the SNIF project, and by Inria associated team EfDyNet.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 437–449, 2021. https://doi.org/10.1007/978-3-030-65351-4_35
(a) DD of the number of unique callers and callees from a mobile phone operator [21].
(b) In-DD between shop-to-shop recommendations from an online marketplace [22].
(c) Graphlet DD from a biological model [19].
(d) DD of users of Cyworld, the largest online social network of South Korea [1].
(e) DDs of users of Flickr, an online social network [6].
(f) DD of the length of the contact list in Microsoft Messenger network [15].
(g) DD of the number of friends from FaceBook, a social network [13].
(h) Out-DD of the number of followees on Twitter [23].
Fig. 1. DDs extracted from different seminal papers studying networks from various domains.
Yet it is crucial to build models able to reproduce the properties of real networks. Indeed, some studies, such as those of fake news propagation or of the evolution of networks over time, cannot always be done empirically, for technical or ethical reasons. Carrying out simulations with random networks created with well-built models is a solution to study real networks without directly experimenting on them. Those models have to create networks with properties similar to those of real ones, while staying as simple as possible. In this paper, we propose a random growth model able to create graphs with almost any (under some conditions) given DD. Classical models usually choose the nodes receiving new edges proportionally to a linear attachment function $f(i) = i$ (or $f(i) = i + b$) [2,4]. The theoretical DD of the networks generated by those models is computed using a recurrence equation. The main idea of this paper is to reverse this recurrence equation to express the attachment function $f$ as a function of the DD. This way, for a given DD, we can compute the associated attachment function, and use it in a proposed random growth model to create graphs with the wanted DD. The given DD can either be theoretical, or extracted from a real network. We compute the attachment function associated with some classical DDs, homogeneous ones such as Poisson or geometric distributions, and heterogeneous ones such as exact power-law and broken power-law. We also study the undirected DD of a Twitter snapshot of 400 million nodes and 23 billion edges, extracted by Gabielkov et al. [10] and made available by the authors. We notice it has an atypical shape, due to Twitter’s policies. We compute empirically the associated attachment function, and use the model to build random graphs with this DD. A necessary condition is that the given DD must be defined for all degrees under the (arbitrarily chosen) maximum value. However, this condition can be circumvented by interpolating between existing points to estimate the missing ones, as discussed in Sect. 5. The rest of the paper is organized as follows. We first discuss the related work in Sect. 2. In Sect. 3, we present the new model, and invert the recurrence equation to find the relation between the attachment function and the DD. We apply this relation to compute the attachment function associated to a power-law DD, a broken power-law DD, and other theoretical distributions. In Sect. 5 we apply our model on a real-world DD, the undirected DD of Twitter.
2 Related Work
The degree distribution has been computed for a lot of networks, in particular for social networks such as Facebook [13] or Microsoft Messenger [15]. Note that Myers et al. have also studied DDs for Twitter in [17], using a different dataset than the one of [10]. Questioning the relevance of power-law fits is not new: for instance, Clauset et al. [8] and Lima-Mendez and van Helden [16] have already deeply questioned the myth of power-law (as Lima-Mendez and van Helden call it), and developed tools to verify if a distribution can be considered as a power-law or not. Clauset et al. apply the developed tools on 24 distributions extracted from various domains of literature, which have all been considered to be power-laws. Among them, “17 of the 24 data sets are consistent with a power-law distribution”, and “there is only one case in which the power law appears to be truly convincing, in the sense that it is an excellent fit to the data and none of the alternatives carries any weight”. In the continuity of this work, Broido and Clauset study in [5] the DD of nearly 1000 networks from various domains, and conclude that “fewer than 36 networks (4%) exhibit the strongest level of evidence for scale-free structure”. The study of Clauset et al. [8] only considered distributions which have a power-law shape when looking at the distribution in log-log scale. As a complement, we gathered DDs from the literature which clearly do not follow power-law distributions to show their diversity. We extracted from the literature DDs of networks from various domains: biology, economics, computer science, etc. Each presented DD comes from a seminal, well-cited paper of the respective domain. They are gathered in Fig. 1. Various shapes can be observed from those DDs, which could (by eye) be associated with exponential (Fig. 1b, 1c), broken power-law (Fig. 1a, 1e, 1g), or even some kind of inverted broken power-law (Fig. 1d). We also observe DDs with specific behaviors (Fig. 1f, 1h).
The first proposed models of random networks, such as the Erdős–Rényi model [9], build networks with a homogeneous DD. The observation that many real-world networks follow power-law DDs led Albert and Barabási to propose their famous model with linear preferential attachment [2]. It has been followed by many random growth models, e.g. [4,7], which also give a power-law DD. A few models make it possible to build networks with any DD: for instance, the configuration model [3,18] takes as parameters a DD P and a number of nodes n, creates n nodes with degrees randomly drawn from P, then randomly connects the half-edges of every node. Ghoshal and Newman propose in [11] a model generating non-growing networks (where, at each time step, a node is added and another is deleted) which can achieve any DD, using a method close to the one proposed in this paper. However, both of those models generate non-growing networks, while most real-world networks are constantly growing.
3
Presentation of the Model
The proposed model is a generalization of the model introduced by Chun and Lu in [7]. At each time step, we have either a node event or an edge event. During a node event, a node is added with an edge attached to it; during an edge event, an edge is added between two existing nodes. Each node to which the edge is connected is randomly chosen among all nodes with a probability proportional to a given function $f$, called the attachment function. The model is as follows: We start with an initial graph $G_0$. At each time step $t$:
– With probability $p$: we add a node $u$, and an edge $(u, v)$ where the node $v$ is chosen randomly among all existing nodes with probability $\frac{f(\deg(v))}{\sum_{w \in V} f(\deg(w))}$;
– With probability $(1 - p)$: we add an edge $(u, v)$ where the nodes $u$ and $v$ are chosen randomly among all existing nodes with probabilities $\frac{f(\deg(u))}{\sum_{w \in V} f(\deg(w))}$ and $\frac{f(\deg(v))}{\sum_{w \in V} f(\deg(w))}$.
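The two events described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the name `grow_graph`, the two-node initial graph $G_0$, and the fact that both endpoints of an edge event are drawn independently (so they may coincide) are choices made here for the sketch.

```python
import random


def grow_graph(f, p, steps, seed=0):
    """Run `steps` time steps of the growth model with attachment function f.

    Returns (degrees, edges), where degrees[v] is the degree of node v.
    Assumption: the initial graph G0 is two nodes joined by one edge.
    """
    rng = random.Random(seed)
    degrees = [1, 1]
    edges = [(0, 1)]
    for _ in range(steps):
        nodes = range(len(degrees))
        # Each node is picked with probability f(deg) / sum_w f(deg(w)).
        weights = [f(d) for d in degrees]
        if rng.random() < p:
            # Node event: a new node u arrives with one edge (u, v).
            v = rng.choices(nodes, weights=weights)[0]
            degrees[v] += 1
            degrees.append(1)
            edges.append((len(degrees) - 1, v))
        else:
            # Edge event: both endpoints are drawn among existing nodes.
            u, v = rng.choices(nodes, weights=weights, k=2)
            degrees[u] += 1
            degrees[v] += 1
            edges.append((u, v))
    return degrees, edges


# Linear attachment f(i) = i, i.e. the particular case discussed in the text.
degrees, edges = grow_graph(lambda d: d, p=0.5, steps=200)
```

Recomputing all weights at every step costs time linear in the number of nodes, which is acceptable for a sketch; a large-scale run would need an incremental sampling structure.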
Note that the Chun-Lu model is the particular case where $f(i) = i$ for all $i \geq 1$. We call generalized Chun-Lu model the proposed model where $f(i) = i + b$ for all $i \geq 1$, with $b > -1$.
3.1
Inversion of the Recurrence Equation
The common way to find the DD of classical random growth models is to study the recurrence equation describing the evolution of the number of nodes with degree i between two time steps. This equation can sometimes be solved easily, sometimes not. What matters for us is that the common process is to start from a given model (and thus an attachment function f), and use the recurrence equation to find the DD P. In this section, we show that the recurrence equation of the proposed model can be inverted such that, if P is given, we can find an associated attachment function f.
Theorem 1. In the proposed model, if the attachment function is chosen as
$$\forall i \geq 1, \quad f(i) = \frac{1}{P(i)} \sum_{k=i+1}^{\infty} P(k), \tag{1}$$
then the DD of the created graph is distributed according to $P$.²

Proof. We consider the variation of the number $N(i, t)$ of nodes of degree $i$ between time steps $t$ and $t + 1$. During this time step, a node with degree $i$ may gain a degree and thus decrease by 1 the number of nodes of degree $i$. This happens with probability $(p + 2(1 - p)) \times \frac{f(i)}{\sum_{j \geq 1} f(j) N(j, t)}$, where $p + 2(1 - p)$ is the mean number of half-edges connected to existing nodes during a time step and the fraction is the probability for this particular node of degree $i$ to be chosen. Since it is the same for all nodes of degree $i$, the number of nodes going from degree $i$ to $i + 1$ during a time step is $(p + 2(1 - p)) \times \frac{f(i)}{\sum_{j \geq 1} f(j) N(j, t)} \times N(i, t)$. In the same way, some nodes with degree $i - 1$ may be connected to an edge and increase the number of nodes of degree $i$. Finally, with probability $p$, a node of degree 1 is added. Gathering those contributions, taking the expectation, and using concentration results gives the following equation:
$$E[N(i, t+1)] - E[N(i, t)] = p\,\delta_{i,1} + (2 - p) \frac{f(i-1)}{\sum_{j \geq 1} f(j) E[N(j, t)]} E[N(i-1, t)] - (2 - p) \frac{f(i)}{\sum_{j \geq 1} f(j) E[N(j, t)]} E[N(i, t)] \tag{2}$$
where $\delta_{i,j}$ is the Kronecker delta. The first term of the right-hand side is the probability of addition of a node. The second (resp. third) term is the probability that a node of degree $i - 1$ (resp. $i$) gets chosen to be the end of an edge. The factor $(2 - p) = p + 2(1 - p)$ comes from the fact that this happens with probability $p$ during a node event (connection of a single half-edge) and with probability $2(1 - p)$ during an edge event (possible connection of 2 half-edges).

Let $P(i) = \lim_{t \to +\infty} \frac{E[N(i, t)]}{pt}$ (the $p$ in the denominator comes from the fact that $E[N(t)] = pt$). We denote $g(i) = \frac{2 - p}{p} \frac{f(i)}{\sum_{j \geq 1} f(j) P(j)}$. We first show that $g(i) = \frac{1}{P(i)} \sum_{k=i+1}^{\infty} P(k)$. We will then show that we can choose $f = g$.

We use the following lemma from [7]:

Lemma 1. Let $(a_t)$, $(b_t)$, $(c_t)$ be three sequences such that $a_{t+1} = \left(1 - \frac{b_t}{t}\right) a_t + c_t$, $\lim_{t \to +\infty} b_t = b > 0$, and $\lim_{t \to +\infty} c_t = c$. Then $\lim_{t \to +\infty} \frac{a_t}{t}$ exists and equals $\frac{c}{1 + b}$.

For $i = 1$, the equation becomes:
$$E[N(1, t+1)] - E[N(1, t)] = p - (2 - p) \frac{f(1)}{\sum_{j \geq 1} f(j) E[N(j, t)]} E[N(1, t)]. \tag{3}$$

² Note that Eq. 1 can also be expressed as $f(i) = \frac{P(k > i)}{P(i)}$.
Taking $a_t = \frac{E[N(1, t)]}{p}$, $b_t = \frac{(2 - p) f(1)}{p \sum_{j \geq 1} f(j) \frac{E[N(j, t)]}{pt}}$, and $c_t = 1$, we have $\lim_{t \to +\infty} b_t = g(1) > 0$ and $\lim_{t \to +\infty} c_t = 1$. We can thus apply Lemma 1:
$$\lim_{t \to +\infty} \frac{E[N(1, t)]}{pt} = P(1) = \frac{1}{1 + g(1)}. \tag{4}$$
Now, $\forall i \geq 2$, taking $a_t = \frac{E[N(i, t)]}{p}$, $b_t = \frac{(2 - p) f(i)}{p \sum_{j \geq 1} f(j) \frac{E[N(j, t)]}{pt}}$, and $c_t = \frac{(2 - p) f(i-1)}{p \sum_{j \geq 1} f(j) \frac{E[N(j, t)]}{pt}} \frac{E[N(i-1, t)]}{pt}$, we have $\lim_{t \to +\infty} b_t = g(i) > 0$ and $\lim_{t \to +\infty} c_t = g(i-1) P(i-1)$. Lemma 1 gives:
$$\lim_{t \to +\infty} \frac{E[N(i, t)]}{pt} = P(i) = \frac{g(i-1) P(i-1)}{1 + g(i)}. \tag{5}$$
Iterating over Eq. 5 (and using $g(1) P(1) = 1 - P(1)$, which follows from Eq. 4), we express $g$ as a function of $P$:
$$g(i) P(i) = g(i-1) P(i-1) - P(i) = g(1) P(1) - \sum_{k=2}^{i} P(k) = 1 - \sum_{k=1}^{i} P(k) \implies g(i) = \frac{1}{P(i)} \sum_{k=i+1}^{\infty} P(k). \tag{6}$$
Now, notice that:
$$\sum_{k=1}^{\infty} g(k) P(k) = \sum_{k=1}^{\infty} \frac{2 - p}{p} \frac{f(k)}{\sum_{k'=1}^{\infty} f(k') P(k')} P(k) = \frac{2 - p}{p}. \tag{7}$$
So $g$ satisfies $g(i) = \frac{2 - p}{p} \frac{g(i)}{\sum_{k=1}^{\infty} g(k) P(k)}$. Hence the attachment function can be chosen as $f = g$, which concludes the proof.

For a given probability law, Theorem 1 can be used to compute the attachment function which, when used in the model, will give this probability law as DD. With the presented model, we also have an implicit constraint between the mean degree and the parameter $p$. Indeed, by construction, we have $E[N(t)] = pt$ and $E(|E|(t)) = t$, with $|E|(t)$ the number of edges at time $t$, leading to a mean degree of $\frac{1}{p}$. But the mean degree can also be expressed as $\sum_{k \geq 1} k P(k)$.
Condition 1. The parameter $p$ has to satisfy:
$$\sum_{k=1}^{\infty} k P(k) = \frac{1}{p}. \tag{8}$$
We can finally combine the previous results and present the method to build a random network with a fixed DD: 1) Use Eq. 1 to compute f from P; 2) Compute p using Condition 1; 3) Build the graph with the proposed model, given (f, p) as parameters.
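Step 1 of this method can be illustrated with a short Python sketch that evaluates Eq. 1 for a DD given on a finite range of degrees. The helper name `attachment_from_dd` and the list-based representation (index 0 holds P(1)) are assumptions made for this illustration; the DD is truncated at its last listed degree, so the largest degree gets f = 0.

```python
def attachment_from_dd(P):
    """Return [f(1), ..., f(K)] from Eq. 1, given P = [P(1), ..., P(K)].

    f(i) is the tail mass above degree i divided by P(i). The result does
    not depend on whether P is normalised, since it is a ratio.
    """
    f = [0.0] * len(P)
    tail = 0.0  # running value of sum_{k > i} P(k)
    for idx in range(len(P) - 1, -1, -1):
        if P[idx] <= 0:
            raise ValueError("P must be positive for every listed degree")
        f[idx] = tail / P[idx]
        tail += P[idx]
    return f


# Geometric law P(i) = q (1 - q)^(i - 1), truncated at degree 200.
q = 0.3
P = [q * (1 - q) ** (i - 1) for i in range(1, 201)]
f = attachment_from_dd(P)
```

For the geometric law, the first values of `f` are numerically indistinguishable from the constant (1 − q)/q, up to the negligible truncation at degree 200.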
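Table 1. Attachment functions $f$ and conditions on $p$ for some classical probability distributions $P$. $\zeta(s)$ is the Riemann zeta function, $\zeta(s, q)$ the Hurwitz zeta function, and $\gamma(a, x)$ is the lower incomplete Gamma function.

| Name | $P(i)$ | $f(i)$ | Condition |
|---|---|---|---|
| Generalized Chun-Lu | $C \frac{\Gamma(i+b)}{\Gamma(i+b+\alpha)}$ | $\frac{1}{\alpha-1} i + \frac{b}{\alpha-1}$ | $p = \frac{\alpha-2}{\alpha+b-1}$ |
| Exact Power-Law | $\frac{i^{-\alpha}}{\zeta(\alpha)}$ | $\frac{\zeta(\alpha, i+1)}{i^{-\alpha}}$ | $p = \frac{\zeta(\alpha)}{\zeta(\alpha-1)}$ |
| Geometric Law | $q(1-q)^{i-1}$ | $\frac{1-q}{q}$ | $p = q$ |
| Poisson Law | $\frac{1}{e^{\lambda}-1} \frac{\lambda^i}{i!}$ | $\frac{e^{\lambda} \gamma(i+1, \lambda)}{\lambda^i}$ | $p = \frac{1-e^{-\lambda}}{\lambda}$ |
| Broken Power-Law | $C \frac{\Gamma(i+b_1)}{\Gamma(i+b_1+\alpha_1)}$ if $i \leq d$; $C\gamma \frac{\Gamma(i+b_2)}{\Gamma(i+b_2+\alpha_2)}$ if $i > d$ | cf. Eq. 17 & 18 | cf. Eq. 16 |

4
Application to Some Distributions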
We now apply Eq. 1 to compute the attachment function for some classical distributions. We first start in Sect. 4.1 from the distribution obtained with the generalized Chun-Lu model to show we find a linear dependence, as expected. We then compute in Sect. 4.2 the associated attachment function of the broken power-law distribution. Using similar computations (which can be found in Report [12]), we computed the attachment function of other classical distributions. Table 1 summarizes those results. 4.1
Preliminary: Generalized Chun-Lu Model
As a first example, by taking the DD of the generalized Chun-Lu model, we should recover a linear attachment function. In the generalized Chun-Lu model, we can show that the real DD is not an exact power law but a ratio of Gamma functions, equivalent to a power law for high degrees, of the form:
$$\forall i \geq 1, \quad P(i) = C \frac{\Gamma(i+b)}{\Gamma(i+b+\alpha)} \underset{i \gg 1}{\sim} i^{-\alpha} \tag{9}$$
where $C = (\alpha - 1) \frac{\Gamma(b+\alpha)}{\Gamma(b+1)}$, and $\alpha > 2$. The choice of $\alpha$ determines the slope of the DD, while the choice of $b$ determines the mean degree of the graph.

Constraint on p: Condition 1 gives:
$$\frac{1}{p} = \sum_{k=1}^{\infty} k P(k) = (\alpha - 1) \frac{\Gamma(b+\alpha)}{\Gamma(b+1)} \times \frac{\alpha^2 + \alpha(2b-1) + b(b-1)}{(\alpha-2)(\alpha-1)} \frac{\Gamma(b+1)}{\Gamma(\alpha+b+1)} \implies p = \frac{\alpha - 2}{\alpha + b - 1} \tag{10}$$
Attachment Function f: Using Theorem 1:
$$f(i) = \frac{1}{P(i)} \sum_{k \geq i+1} P(k) = \frac{\Gamma(i+b+\alpha)}{\Gamma(i+b)} \frac{\Gamma(i+b+1)}{(\alpha-1)\Gamma(i+\alpha+b)} \tag{11}$$
$$\implies f(i) = \frac{1}{\alpha-1} i + \frac{b}{\alpha-1} \tag{12}$$
As expected, we find a linear attachment function. To create a graph with a wanted slope $\alpha$ and mean degree $p^{-1}$, one only has to choose $\alpha$ as the wanted slope and $b$ following Eq. 10. In the particular case $b = 0$, we recover the Chun-Lu model of [7], with a slope of $\alpha = 2 + \frac{p}{2-p}$ as expected.
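The linear form of Eq. 12 can be checked numerically by summing the tail of the Gamma-ratio distribution directly. The sketch below is an illustration under stated assumptions: the function names, the parameter values α = 3 and b = 1, and the truncation of the infinite sum at `cutoff` are choices made here, not part of the paper. The normalisation constant C cancels in Eq. 1, so it is omitted.

```python
import math


def gamma_ratio(i, b, alpha):
    """Gamma(i + b) / Gamma(i + b + alpha), computed in log space."""
    return math.exp(math.lgamma(i + b) - math.lgamma(i + b + alpha))


def f_numeric(i, b, alpha, cutoff=100_000):
    """Eq. 1 with the infinite tail sum truncated at `cutoff` (assumption)."""
    tail = sum(gamma_ratio(k, b, alpha) for k in range(i + 1, cutoff + 1))
    return tail / gamma_ratio(i, b, alpha)


def f_linear(i, b, alpha):
    """Closed form of Eq. 12."""
    return i / (alpha - 1) + b / (alpha - 1)


for i in (1, 10, 50):
    print(i, f_numeric(i, b=1.0, alpha=3.0), f_linear(i, b=1.0, alpha=3.0))
```

The two columns agree up to the truncation error of the tail sum, which shrinks as `cutoff` grows.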
4.2
Broken Power-Law
We now study the case of a broken power-law, which corresponds to the DD of some real-world complex networks, as discussed in Sect. 2, and which was the one we were interested in initially. We consider a distribution of the form:
$$P(i) = \begin{cases} C \frac{\Gamma(i+b_1)}{\Gamma(i+b_1+\alpha_1)} & \text{if } i \leq d \\ C\gamma \frac{\Gamma(i+b_2)}{\Gamma(i+b_2+\alpha_2)} & \text{if } i > d \end{cases} \tag{13}$$
where $d$, $b_1$, $\alpha_1$, $b_2$, and $\alpha_2$ are parameters of our distribution such that $\alpha_1 > 2$, $\alpha_2 > 2$, $C$ is a normalisation constant, and $\gamma$ is chosen in order to obtain continuity at $i = d$. As seen in Sect. 4.1, the ratio of Gamma functions is close to a power law as soon as $i$ gets large. Hence, this distribution corresponds to two power laws with different slopes, and a switch between the two at the value $d$. We can easily find the continuity constant $\gamma$, since it satisfies:
$$\frac{\Gamma(d+b_1)}{\Gamma(d+b_1+\alpha_1)} = \gamma \frac{\Gamma(d+b_2)}{\Gamma(d+b_2+\alpha_2)} \implies \gamma = \frac{\Gamma(d+b_1)\Gamma(d+b_2+\alpha_2)}{\Gamma(d+b_1+\alpha_1)\Gamma(d+b_2)}. \tag{14}$$
Constraints on C and p: The value of $C$ can be computed by summing over all degrees:
$$C = \left( \sum_{k=1}^{\infty} \frac{P(k)}{C} \right)^{-1} = \left( \frac{1}{\alpha_1 - 1} \frac{\Gamma(b_1+1)}{\Gamma(\alpha_1+b_1)} + \frac{\Gamma(b_1+d)}{\Gamma(\alpha_1+b_1+d)} \left( \frac{b_2+d}{\alpha_2-1} - \frac{b_1+d}{\alpha_1-1} \right) \right)^{-1} \tag{15}$$
Using Condition 1, $p$ is defined by the following equation:
$$\frac{1}{pC} = \sum_{k=1}^{d} k \frac{\Gamma(k+b_1)}{\Gamma(k+b_1+\alpha_1)} + \gamma \sum_{k=d+1}^{\infty} k \frac{\Gamma(k+b_2)}{\Gamma(k+b_2+\alpha_2)}$$
$$= \frac{\alpha_1^2 + \alpha_1(2b_1-1) + b_1(b_1-1)}{(\alpha_1-2)(\alpha_1-1)} \frac{\Gamma(b_1+1)}{\Gamma(\alpha_1+b_1+1)}$$
$$- \frac{\alpha_1^2(d+1) + \alpha_1(b_1(d+2) + d^2 - 1) + b_1(b_1-1) - d(d+1)}{(\alpha_1-2)(\alpha_1-1)} \frac{\Gamma(b_1+d+1)}{\Gamma(\alpha_1+b_1+d+1)}$$
$$+ \gamma \frac{\alpha_2^2(d+1) + \alpha_2(b_2(d+2) + d^2 - 1) + b_2(b_2-1) - d(d+1)}{(\alpha_2-2)(\alpha_2-1)} \frac{\Gamma(b_2+d+1)}{\Gamma(\alpha_2+b_2+d+1)} \tag{16}$$
Attachment Function f: For the computation of the attachment function, we have to distinguish two cases:

Case 1: $i \geq d$
$$f(i) = \frac{\Gamma(i+b_2+\alpha_2)}{\Gamma(i+b_2)} \frac{1}{\alpha_2-1} \frac{\Gamma(i+b_2+1)}{\Gamma(i+b_2+\alpha_2)} = \frac{1}{\alpha_2-1} i + \frac{b_2}{\alpha_2-1} \tag{17}$$
We find a linear attachment function: indeed, for $i > d$, we only take into account the second power law, hence we expect to find the same result as in Sect. 4.1.

Case 2: $i < d$
$$f(i) = \frac{\Gamma(i+b_1+\alpha_1)}{\Gamma(i+b_1)} \left( \sum_{k=i+1}^{d} \frac{\Gamma(k+b_1)}{\Gamma(k+b_1+\alpha_1)} + \gamma \sum_{k=d+1}^{\infty} \frac{\Gamma(k+b_2)}{\Gamma(k+b_2+\alpha_2)} \right)$$
$$= \frac{i+b_1}{\alpha_1-1} + \frac{\Gamma(i+b_1+\alpha_1)\Gamma(d+b_1)}{\Gamma(i+b_1)\Gamma(d+b_1+\alpha_1)} \left( \frac{b_2+d}{\alpha_2-1} - \frac{b_1+d}{\alpha_1-1} \right) \tag{18}$$
In this second case, we have a linear part, in addition to a more complicated part. Note that, for $(\alpha_1, b_1) = (\alpha_2, b_2)$, i.e., when the two power laws are equal, this second term vanishes, leaving as expected only the linear part. Figure 2a shows the shape of $f$. We see that, while the second part is linear as discussed before, the first part is sub-linear. We used this attachment function to build a network using our model. The DD is shown in Fig. 2b: we see that we built a random network with a broken power-law distribution, as wanted.
Fig. 2. Theoretical attachment function f and degree distribution of a random network for the broken power-law distribution. Parameters are $N = 5 \cdot 10^5$, $b_1 = b_2 = 1$, $\alpha_1 = 2.1$, $\alpha_2 = 4$ and $d = 100$.
5
Real Degree Distributions
The model can also be applied to an empirical DD. Indeed, we observe in Theorem 1 that f(i) only depends on the values of P, which can be arbitrary, that is, they need not follow any classical function. This is a good way to model random networks with an atypical DD. As an example, we apply our model to the DD of an undirected version of Twitter, which shows atypical behavior due to Twitter's policies. We start with a presentation of this DD, then apply our model to build a random graph with this distribution.
Fig. 3. Modelization of the undirected Twitter’s graph.
5.1
Undirected DD of Twitter
For this study, we use a Twitter snapshot from 2012, recovered by Gabielkov and Legout [10] and made available by the authors. This network contains 505 million nodes and 23 billion edges, making it one of the biggest social graphs available nowadays. Each node corresponds to an account, and an arc (u, v) exists if the account u follows the account v. The in- and out-DDs are presented in [23].
In our case, we look at an undirected version of the Twitter snapshot. We consider the degree of each node to be the sum of its in- and out-degrees. The distribution of this undirected graph is presented in Fig. 3a. We notice two spikes, around d = 20 and d = 2000. We do not know the reason for the first one (which could be social, or due to the recommendation system). The second spike is explained by a specificity of Twitter: until 2015, to avoid bots which were following a very large number of users, Twitter limited the number of possible followings to max(2000, number of followers). In other words, a user is allowed to follow more than 2000 people only if they are also followed by more than 2000 people. This leads to a lot of accounts with around 2000 followings. This highlights the fact that some networks have their own specificities, sometimes due to internal policies, which can only be captured by a model specifically built for them.
5.2
Modelization
Figure 3c presents the obtained form of the attachment function f computed using Eq. 1 with the DD of Twitter. We notice that the overall function is mainly increasing, showing that nodes of higher degrees have a higher chance to connect with new nodes, as in classical preferential attachment models. We also notice two drops, around 20 and 2000. They are associated with the rises of the DD at the same degrees: to increase the number of nodes with those degrees, the attachment function has to be smaller, so that nodes with these degrees have less chance to gain new edges. We finally use our model with the empirical attachment function of Fig. 3c. Note that, in an empirical study, P can be equal to zero for some degrees, namely those that no node has in the network. In Twitter, the smallest of those degrees occurs around 18,000. In that case, f cannot be computed. To get around this difficulty, we interpolate the missing values of P, using the closest smaller and larger degrees around the missing points. Since we observe the probability distribution on a log-log scale, we interpolate between the two points with a straight line on a log-log scale, i.e., with a power-law function. We believe this is a fair choice since we only look at the tail of the distribution, which looks like a straight line, and since we interpolate between each pair of closest points only, instead of fitting the whole tail of the distribution. The DD of a random network built with our model is presented in Fig. 3b. For computation time reasons, the built network only has $N = 2 \cdot 10^5$ nodes, to be compared to the $5 \cdot 10^8$ nodes of Twitter. However, this is enough to verify that the shape of its DD follows that of the real Twitter DD: in particular, we recognize the spikes around d = 20 and d = 2000.
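The interpolation step described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `fill_missing_loglog` and the dictionary representation of the empirical DD are assumptions, and the sketch does not renormalise the filled distribution (the text does not say whether the authors did).

```python
import math


def fill_missing_loglog(P):
    """Fill gaps in an empirical DD by straight lines on a log-log scale.

    P maps each observed degree to its (positive) probability. Degrees
    missing between two observed degrees lo < hi receive the power law
    passing through (lo, P[lo]) and (hi, P[hi]).
    """
    known = sorted(P)
    filled = dict(P)
    for lo, hi in zip(known, known[1:]):
        if hi - lo > 1:
            slope = (math.log(P[hi]) - math.log(P[lo])) / (
                math.log(hi) - math.log(lo)
            )
            for d in range(lo + 1, hi):
                filled[d] = P[lo] * (d / lo) ** slope
    return filled


# Toy example: P(d) proportional to d^-2, with degree 3 unobserved.
filled = fill_missing_loglog({1: 1.0, 2: 0.25, 4: 0.0625})
```

In the toy example the observed points lie exactly on a power law of exponent −2, so the value filled in at degree 3 lies on the same law.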
6
Conclusion
In this paper, we proposed a new random growth model in which the nodes to be connected are picked with a flexible probability given by an attachment function f. We expressed f as a function of any distribution P, making it possible to build a random network with any desired degree distribution. We computed f for some classical distributions, as well as for a snapshot of Twitter of 505 million nodes and 23 billion edges. We believe this model is useful for anyone studying networks with atypical degree distributions, regardless of the domain. Although the presented model is undirected, we also believe that a directed version of it, based on the Bollobás et al. model [4], can easily be derived from the presented one.
References
1. Ahn, Y.-Y., Han, S., Kwak, H., Moon, S., Jeong, H.: Analysis of topological characteristics of huge online social networking services. In: Proceedings of the 16th International Conference on World Wide Web, pp. 835–844 (2007)
2. Albert, R., Barabási, A.-L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47 (2002)
3. Bollobás, B.: A probabilistic proof of an asymptotic formula for the number of labelled regular graphs. Eur. J. Comb. 1(4), 311–316 (1980)
4. Bollobás, B., Borgs, C., Chayes, J.T., Riordan, O.: Directed scale-free graphs. In: SODA, vol. 3, pp. 132–139 (2003)
5. Broido, A.D., Clauset, A.: Scale-free networks are rare. Nat. Commun. 10(1), 1–10 (2019)
6. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th International Conference on World Wide Web, pp. 721–730 (2009)
7. Chung, F., Chung, F.R.K., Graham, F.C., Lu, L., Chung, K.F., et al.: Complex graphs and networks. Am. Math. Soc. (2006)
8. Clauset, A., Shalizi, C.R., Newman, M.E.J.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)
9. Erdős, P., Rényi, A.: On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5(1), 17–60 (1960)
10. Gabielkov, M., Legout, A.: The complete picture of the Twitter social graph. In: Proceedings on CoNEXT Student Workshop, pp. 19–20. ACM (2012)
11. Ghoshal, G., Newman, M.E.J.: Growing distributed networks with arbitrary degree distributions. Eur. Phys. J. B 58(2), 175–184 (2007)
12. Giroire, F., Pérennes, S., Trolliet, T.: A random growth model with any real or theoretical degree distribution. arXiv preprint arXiv:2008.03831 (2020)
13. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: a case study of unbiased sampling of OSNs. In: IEEE INFOCOM (2010)
14. Jóhannesson, G., Björnsson, G., Gudmundsson, E.H.: Afterglow light curves and broken power laws: a statistical study. Astrophys. J. Lett. 640(1), L5 (2006)
15. Leskovec, J., Horvitz, E.: Planetary-scale views on a large instant-messaging network. In: Proceedings of the 17th International Conference on World Wide Web (2008)
16. Lima-Mendez, G., van Helden, J.: The powerful law of the power law and other myths in network biology. Mol. BioSyst. 5(12), 1482–1493 (2009)
17. Myers, S.A., Sharma, A., Gupta, P., Lin, J.: Information network or social network?: The structure of the Twitter follow graph. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 493–498. ACM (2014)
18. Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64, 026118 (2001)
19. Pržulj, N.: Biological network comparison using graphlet degree distribution. Bioinformatics 23(2), e177–e183 (2007)
20. Sallaberry, A., Zaidi, F., Melançon, G.: Model for generating artificial social networks having community structures with small-world and scale-free properties. Soc. Netw. Anal. Min. 3(3), 597–609 (2013)
21. Seshadri, M., Machiraju, S., Sridharan, A., Bolot, J., Faloutsos, C., Leskove, J.: Mobile call graphs: beyond power-law and lognormal distributions. In: ACM SIGKDD, pp. 596–604 (2008)
22. Stephen, A.T., Toubia, O.: Explaining the power-law degree distribution in a social commerce network. Soc. Netw. 31(4), 262–270 (2009)
23. Trolliet, T., Cohen, N., Giroire, F., Hogie, L., Pérennes, S.: Interest clustering coefficient: a new metric for directed networks like Twitter. arXiv preprint arXiv:2008.00517 (2020)
Local Degree Asymmetry for Preferential Attachment Model
Sergei Sidorov, Sergei Mironov, Igor Malinskii, and Dmitry Kadomtsev
Saratov State University, Saratov 410012, Russia
Abstract. One of the well-known phenomena of the sociological experience is the friendship paradox which states that your friends are more popular than you, on average. The friendship paradox is widely detected empirically in various complex networks including social, coauthorship, citation, collaboration, online social media networks. A local network metric called “the friendship index” has been recently introduced in order to evaluate some aspects of the friendship paradox for complex networks. The value of the friendship index for a node is defined as the ratio of the average degree of the neighbors of the node to the degree of this node. In this paper we examine the theoretical properties of the friendship index in growth networks generated by the Barabási-Albert model by deriving the equation that describes the evolution of the expected value of the friendship index of a single node over time. Results show that there is a clear presence of the friendship paradox for networks evolved in accordance with the Barabási-Albert model in which each new node acquires a single edge. Moreover, the distributions of the friendship index for such networks are heavy-tailed and comparable with the empirical distribution obtained for some real networks.
Keywords: Complex networks · Social networks · Friendship paradox · Network analysis · Preferential attachment model · Network models
1
Introduction
One of the non-trivial heterogeneous properties of complex networks is captured by the phenomenon of the friendship paradox, which says that people have fewer friends than their friends have. The friendship paradox has recently been an object of interest in the context of social networks in many studies, including [1,3–8,10]. To quantify some characteristics of the friendship paradox for real networks, a local network metric (called “the friendship index”, FI) was proposed in [11]. The value of FI for a node is defined in [11] as the ratio of the average degree of the neighbors of the node to the degree of this node. The measure represents important aspects associated with the friendship paradox, including the ‘direction of influence’ (i.e. whether the node is exceeded by its neighbors in popularity, or it is more favored than its neighbors). It also quantifies the ‘disparity’ compared to a network without the friendship paradox: values of FI higher than 1 indicate the presence of the paradox. The authors of [11] aggregated the FI measure over all nodes and examined this aggregate measure theoretically and experimentally. Some measures and approaches for studying whether a vertex is more or less popular than its neighbors were developed in [4,6,9,10]; e.g. a binary measure was used for analyzing the friendship paradox, based on the comparison of a node’s degree with both the mean and the median of its neighbors’ degrees.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 450–461, 2021. https://doi.org/10.1007/978-3-030-65351-4_36

Let $d_i(t)$ denote the degree of node $i$ at iteration $t$. Denote by $s_i(t)$ the sum of the degrees of all neighbors of node $i$ at iteration $t$:
$$s_i(t) = \sum_{j:\,(i,j) \in E(t)} d_j(t).$$
Let $\alpha_i(t)$ denote the average degree of all neighbors of node $i$ (at iteration $t$), i.e. the ratio of the sum of the degrees of the neighbors of node $i$ to the number of its neighboring vertices:
$$\alpha_i(t) = \frac{s_i(t)}{d_i(t)}.$$
The friendship index of node $i$ (at iteration $t$) is defined in [11] as follows:
$$\beta_i(t) = FI_i(t) = \frac{\alpha_i(t)}{d_i(t)} = \frac{s_i(t)}{d_i^2(t)}.$$
The theoretical properties of FI were studied in [11]. The index quantifies the local degree asymmetry, as it shows the differences in the structure of the node degree distributions at a local level. For example, for a graph in which the degrees of all vertices are equal (e.g. a complete graph or a cycle graph), the value of this index is equal to 1 for all vertices. At the same time, for a star graph (one central and $n$ peripheral vertices), the value of this index for the central vertex is equal to $\frac{1}{n}$, while for peripheral vertices it is equal to $n$. Differences in the value of this index for different vertices indicate the unevenness of the local structures of the graph in terms of node degree distributions. If all neighbors of node $i$ have the same degree as this node, then the friendship index is equal to 1. In this paper we examine the theoretical properties of the friendship index $\beta_i(t)$ in growth networks generated by the Barabási-Albert model. First, we find the dynamics of the expected values of the sum of the degrees of the neighbors of a single node, $s_i(t)$ (Sect. 2.1), as well as of the average degree of all neighbors of the node, $\alpha_i(t)$, over time $t$ (Sect. 2.2). Then in Sect. 3 we derive the equation describing the evolution of the expected value of the friendship index of a single node over time. Results indicate that there is a clear presence of the friendship paradox for networks evolved in accordance with the Barabási-Albert model. Moreover, the empirical distributions of $\beta_i(t)$ are heavy-tailed and comparable with the empirical distribution of $\beta_i(t)$ obtained for the real network of phone calls (data was taken from [12]). In contrast with the paper [11], which examines the behavior of the average value $\frac{1}{t} \sum_i \beta_i(t)$, this paper focuses on the evolution of the friendship index for a fixed node of the Barabási-Albert network.
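The definitions of $s_i$, $\alpha_i$ and $\beta_i$ and the star and cycle examples above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `friendship_index` and the adjacency-dictionary representation are assumptions made for illustration.

```python
def friendship_index(adj):
    """Return {node: beta} with beta = s / d^2 for every node of degree > 0.

    adj maps each node to the set of its neighbours (undirected graph).
    s is the sum of the neighbours' degrees and d is the node's degree.
    """
    deg = {v: len(nb) for v, nb in adj.items()}
    return {
        v: sum(deg[u] for u in nb) / deg[v] ** 2
        for v, nb in adj.items()
        if deg[v] > 0
    }


# Star graph: centre 0 and n = 4 peripheral vertices.
n = 4
star = {0: set(range(1, n + 1))}
star.update({v: {0} for v in range(1, n + 1)})
star_fi = friendship_index(star)

# Cycle graph on 5 vertices: all degrees equal.
cycle = {v: {(v - 1) % 5, (v + 1) % 5} for v in range(5)}
cycle_fi = friendship_index(cycle)
```

For the star, the centre gets 1/n and each peripheral vertex gets n; for the cycle, every vertex gets 1, in line with the examples in the text.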
2
Dynamics of the Average Degree of Neighbors
The network growth in the Barabási-Albert model [2] is carried out iteratively according to the following rules (in this paper we restrict our analysis to the simplest form of the Barabási-Albert model, in which each new node acquires a single edge):
1. At each iteration t, one node t is added;
2. At each iteration t, one link is added to the graph, and each new node is connected by this link to one of the existing nodes i with a probability proportional to the degree of this node $d_i(t)$.
2.1
Dynamics of the Total Degree of a Node Neighbors
Let us calculate the changes in the value of $s_i(t)$ after adding a new node at iteration $t + 1$. The value of $s_i(t)$ may increase at iteration $t + 1$ in two cases only:
– if the new node $t + 1$ selects node $i$ during iteration $t + 1$, and then the value of $s_i(t)$ is increased by 1;
– if the new node $t + 1$ selects one of the nodes already linked to node $i$, and then the value of $s_i(t)$ is increased by 1, i.e. the growth in the total degree of all existing neighbors of node $i$ will be equal to 1.
The probability of the second case is
$$\sum_{j=1,\, j \neq i}^{t} p(i, j) \frac{d_j(t)}{2t} = \frac{1}{2t} s_i(t), \tag{1}$$
where $p(i, j)$ equals 1 if nodes $i$ and $j$ are linked and 0 otherwise.

Let the random variable $\xi_i^{(t+1)} = 1$ if node $i$ is chosen at iteration $t + 1$, and $\xi_i^{(t+1)} = 0$ otherwise. Let the random variable $\eta_i^{(t+1)} = 1$ if the new node $t + 1$ selects one of the nodes already linked to node $i$, and $\eta_i^{(t+1)} = 0$ otherwise. We have
$$\Delta s_i(t+1) = s_i(t+1) - s_i(t) = \xi_i^{(t+1)} (s_i(t) + 1) + \eta_i^{(t+1)} (s_i(t) + 1) + \left(1 - \xi_i^{(t+1)} - \eta_i^{(t+1)}\right) s_i(t) - s_i(t) = \xi_i^{(t+1)} + \eta_i^{(t+1)}. \tag{2}$$
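Since $E(\xi_i^{(t+1)}) = \frac{d_i(t)}{2t}$ and $E(\eta_i^{(t+1)}) = \frac{s_i(t)}{2t}$, we get the linear nonhomogeneous differential equation of first order (as an approximation to the difference Eq. (2)):
$$\frac{ds_i(t)}{dt} = \frac{d_i(t)}{2t} + \frac{s_i(t)}{2t},$$
the solution of which has the form $s_i(t) = u(t)v(t)$, where $v(t)$ follows
$$\frac{dv(t)}{dt} = \frac{1}{2t} v(t) \tag{3}$$
and $u(t)$ satisfies the differential equation
$$\frac{du(t)}{dt} v(t) = \frac{d_i(t)}{2t}. \tag{4}$$
The solution of (3) is $v(t) = t^{\frac{1}{2}}$. Then the solution of (4) is
$$u(t) = \int \frac{d_i(t)}{2t^{\frac{3}{2}}} dt + C. \tag{5}$$
Therefore,
$$s_i(t) = t^{\frac{1}{2}} \left( \int \frac{d_i(t)}{2t^{\frac{3}{2}}} dt + C \right). \tag{6}$$
Note that at each moment $t$, $d_i$ is a random variable with a probability density function $\kappa_i(x)$ such that $\bar{k}_i(t) := E(d_i(t)) = \int x \kappa_i(x) dx = \left(\frac{t}{i}\right)^{\frac{1}{2}}$ and $\int \kappa_i(x) dx = 1$. Then the expected value of $s_i(t)$ is
$$\bar{s}_i(t) = E(s_i(t)) = t^{\frac{1}{2}} \int \kappa_i(x) \left( \int \frac{x}{2t^{\frac{3}{2}}} dt + C \right) dx = t^{\frac{1}{2}} \left( \int \frac{\int x \kappa_i(x) dx}{2t^{\frac{3}{2}}} dt + C \int \kappa_i(x) dx \right)$$
$$= t^{\frac{1}{2}} \left( \frac{1}{2 i^{\frac{1}{2}}} \int \frac{1}{t} dt + C \right) = \frac{1}{2} \left(\frac{t}{i}\right)^{\frac{1}{2}} (\log t + c(i)). \tag{7}$$
The expected initial value of $s_i(t)$ at moment $t = i$ is
$$E(s_i(i)) = \sum_{j=1}^{i-1} P(i, j) d_j(i) = \sum_{j=1}^{i-1} \frac{1}{2i} \left(\frac{i}{j}\right)^{\frac{1}{2}} \left(\frac{i}{j}\right)^{\frac{1}{2}} = \frac{1}{2} \sum_{j=1}^{i-1} \frac{1}{j} \sim \frac{1}{2} \log i, \tag{8}$$
where $P(i, j)$ denotes the probability that at the moment of its appearance node $i$ will be linked with node $j$. Figure 1 presents the evolution of $s_i(t)$ over $t$, averaged over 100 independent simulations. The figure shows that the empirical behavior of $s_i(t)$ is indistinguishable from the predictions of Eq. (7).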
Fig. 1. The sum of neighbor’s degrees for nodes i = 50 (a) and i = 1000 (b). Both figures (a) and (b) show the time dynamics of theoretical values of si (red) and values of si obtained by simulations (blue), respectively. The simulated results were averaged for 100 independent runs (all networks are of size N = 10, 000).
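A simulation of the kind summarized in Fig. 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the 0-based node labels, and the initial graph (two nodes joined by one edge) are choices made here. Degree-proportional selection is obtained by drawing uniformly from the list of all edge endpoints.

```python
import random


def simulate_ba(num_nodes, seed=0):
    """Grow a Barabási-Albert network in which each new node adds one edge.

    Returns the neighbour lists. Assumption: the process starts from two
    nodes (0 and 1) joined by a single edge.
    """
    rng = random.Random(seed)
    neighbours = [[1], [0]]
    ends = [0, 1]  # every node appears once per unit of degree
    for new in range(2, num_nodes):
        target = rng.choice(ends)  # probability proportional to degree
        neighbours.append([target])
        neighbours[target].append(new)
        ends.extend((new, target))
    return neighbours


def neighbour_degree_sums(neighbours):
    """Return the list of s_i: the sum of the degrees of node i's neighbours."""
    deg = [len(nb) for nb in neighbours]
    return [sum(deg[j] for j in nb) for nb in neighbours]


neighbours = simulate_ba(10_000, seed=1)
s = neighbour_degree_sums(neighbours)
```

Averaging `s[i]` for a fixed node over many seeds, and recording it at several network sizes, gives the empirical curve that can be compared with Eq. (7).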
2.2
Dynamics of αi (t)
Let us now find the expected value of $\alpha_i(t)$. Note that it would be incorrect to find $E(\alpha_i(t))$ as the ratio of the expected value of the sum of the degrees of the neighbors of node $i$ to the expected number of its neighboring vertices:
$$\bar{\alpha}_i(t) := E(\alpha_i(t)) = E\left(\frac{s_i(t)}{d_i(t)}\right) \neq \frac{E(s_i(t))}{E(d_i(t))}.$$
Let us calculate the changes in the value of $\alpha_i(t)$ after adding a new node at iteration $t + 1$ (one link is added, connecting this new node with an existing vertex). We have
$$\Delta \alpha_i(t+1) = \alpha_i(t+1) - \alpha_i(t) = \xi_i^{(t+1)} \frac{s_i(t) + 1}{d_i(t) + 1} + \eta_i^{(t+1)} \frac{s_i(t) + 1}{d_i(t)} + \left(1 - \xi_i^{(t+1)} - \eta_i^{(t+1)}\right) \frac{s_i(t)}{d_i(t)} - \frac{s_i(t)}{d_i(t)}$$
$$= \xi_i^{(t+1)} \left( \frac{s_i(t) + 1}{d_i(t) + 1} - \frac{s_i(t)}{d_i(t)} \right) + \eta_i^{(t+1)} \left( \frac{s_i(t) + 1}{d_i(t)} - \frac{s_i(t)}{d_i(t)} \right)$$
$$= \xi_i^{(t+1)} \left( \frac{1}{d_i(t) + 1} - \frac{1}{d_i(t) + 1} \frac{s_i(t)}{d_i(t)} \right) + \eta_i^{(t+1)} \frac{1}{d_i(t)}, \tag{9}$$
where
– the random variable $\xi_i^{(t+1)}$ is equal to 1 if the new node $t + 1$ selects node $i$ at iteration $t + 1$, and in this case both the degree of node $i$ and the sum of all neighbor degrees are increased by 1 (since node $i$ gains the new neighbor $t + 1$ with degree 1, the sum of degrees of neighbors of vertex $i$ increases by 1), and $\xi_i^{(t+1)} = 0$ otherwise;
– the random variable $\eta_i^{(t+1)}$ is equal to 1 if the new node $t + 1$ selects one of the nodes already linked to node $i$, and then the value of $s_i(t)$ is increased by 1, while the degree of node $i$ stays the same; and $\eta_i^{(t+1)} = 0$ otherwise.
Note that
$$E(\xi_i^{(t+1)}) = \frac{d_i(t)}{2t} \tag{10}$$
is the probability that node $i$ will be chosen at iteration $t + 1$. The expected value of $\eta_i^{(t+1)}$ is
$$E(\eta_i^{(t+1)}) = P_i(t+1) = \sum_{j=1,\, j \neq i}^{t} p(i, j) \frac{d_j(t)}{2t} = \frac{1}{2t} s_i(t). \tag{11}$$
We have
$$\frac{d_i(t)}{d_i(t) + 1} = 1 - \frac{1}{d_i(t) + 1}. \tag{12}$$
Then it follows from (9), (10), (11) and (12) that
$$\Delta \alpha_i(t+1) = \frac{1}{2t} - \frac{1}{2t(d_i(t) + 1)} + \alpha_i(t) \frac{1}{2t(d_i(t) + 1)}. \tag{13}$$
The difference equation (13) can be approximated by the following linear nonhomogeneous differential equation of first order:
$$\frac{d(\alpha_i(t) - 1)}{dt} = \frac{1}{2t} + (\alpha_i(t) - 1) \frac{1}{2t(d_i(t) + 1)}. \tag{14}$$
We have $\alpha_i(t) - 1 = u(t)v(t)$, where $v(t)$ satisfies the differential equation
$$\frac{dv(t)}{dt} = v(t) \frac{1}{2t(d_i(t) + 1)} \tag{15}$$
and $u(t)$ follows
$$\frac{du(t)}{dt} v(t) = \frac{1}{2t}. \tag{16}$$
The solution of (15) is
$$v(t) = \exp\left( -\int \frac{dt}{2t(d_i(t) + 1)} \right). \tag{17}$$
Then the solution of (16) is
$$u(t) = \int \frac{1}{2t} \exp\left( \int \frac{dt}{2t(d_i(t) + 1)} \right) dt + C. \tag{18}$$
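Then
$$\alpha_i(t) = u(t)v(t) + 1 = \exp\left( -\int \frac{dt}{2t(d_i(t) + 1)} \right) \left( \int \frac{1}{2t} \exp\left( \int \frac{dt}{2t(d_i(t) + 1)} \right) dt + C \right) + 1$$
$$\sim \left( 1 - \int \frac{dt}{2t(d_i(t) + 1)} \right) \left( \int \frac{1}{2t} \left( 1 + \int \frac{dt}{2t(d_i(t) + 1)} \right) dt + C \right) + 1$$
$$= C + \frac{1}{2} \log t - \left( C + \frac{1}{2} \log t \right) \int \frac{dt}{2t(d_i(t) + 1)} + \int \frac{1}{2t} \int \frac{dt}{2t(d_i(t) + 1)} dt - \int \frac{dt}{2t(d_i(t) + 1)} \int \frac{1}{2t} \int \frac{dt}{2t(d_i(t) + 1)} dt, \tag{19}$$
where the additive constant 1 has been absorbed into $C$. Since $\int \frac{\kappa_i(x) dx}{x + 1} \leq \int \frac{\kappa_i(x) dx}{x} \sim \left(\frac{i}{t}\right)^{\frac{1}{2}}$, we get
$$\bar{\alpha}_i(t) := E(\alpha_i(t)) \sim C + \frac{1}{2} \log t - \left( C + \frac{1}{2} \log t \right) \int \frac{1}{2t} \int \frac{\kappa_i(x)}{x + 1} dx\, dt + \int \frac{1}{2t} \int \frac{1}{2t} \int \frac{\kappa_i(x)}{x + 1} dx\, dt\, dt - \int \frac{1}{2t} \int \frac{\kappa_i(x)}{x + 1} dx\, dt \int \frac{1}{2t} \int \frac{1}{2t} \int \frac{\kappa_i(x)}{x + 1} dx\, dt\, dt$$
$$\sim C + \frac{1}{2} \log t + o(t^{\frac{1}{2}}), \tag{20}$$
i.e. the average degree of the neighbors of every node asymptotically follows $\frac{1}{2} \log t$ (for the simplest growth networks generated by the Barabási-Albert preferential attachment model with the number of attached links $m = 1$). The dynamics of $\alpha_i(t)$ for two nodes, $i = 50$ and $i = 1{,}000$, averaged over 100 independent simulations, are shown in Fig. 2. It can be seen that the empirical behavior of $\alpha_i(t)$ fluctuates around the theoretical predictions of Eq. (20).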
3
Dynamics of the Friendship Index for the Barabási-Albert Model
Let us calculate the changes in the value of $\beta_i(t)$ after adding a new node at iteration $t+1$. We have
\[
\begin{aligned}
\Delta\beta_i(t+1) &= \beta_i(t+1) - \beta_i(t) \\
&= \xi_i^{(t+1)}\frac{s_i(t)+1}{(d_i(t)+1)^2} - \xi_i^{(t+1)}\frac{s_i(t)}{d_i^2(t)} + \eta_i^{(t+1)}\frac{s_i(t)+1}{d_i^2(t)} - \eta_i^{(t+1)}\frac{s_i(t)}{d_i^2(t)} \\
&= \xi_i^{(t+1)}\left(\frac{s_i(t)+1}{(d_i(t)+1)^2} - \frac{s_i(t)}{d_i^2(t)}\right) + \eta_i^{(t+1)}\left(\frac{s_i(t)+1}{d_i^2(t)} - \frac{s_i(t)}{d_i^2(t)}\right) \\
&= \xi_i^{(t+1)}\left(\frac{1}{(d_i(t)+1)^2} - 2\,\frac{d_i(t)}{(d_i(t)+1)^2}\,\beta_i(t) - \frac{1}{(d_i(t)+1)^2}\,\beta_i(t)\right) + \eta_i^{(t+1)}\frac{1}{d_i^2(t)}.
\end{aligned}
\tag{21}
\]
Local Degree Asymmetry
[Figure 2: panel (a) plots α50(t), theoretical and simulation, and panel (b) plots α1000(t), theoretical and simulation, against t (horizontal axis from 0 to 1·10^4).]
Fig. 2. Panels (a) and (b) show the time dynamics of the theoretical values of αi (blue) and of the values of αi obtained by simulations (red) for i = 50 (a) and i = 1000 (b), respectively. The simulated results were averaged over 100 independent runs (all networks are of size N = 10,000).

Since $E\big(\xi_i^{(t+1)}\big) = \frac{d_i(t)}{2t}$ and $E\big(\eta_i^{(t+1)}\big) = \frac{s_i(t)}{2t}$, we get
\[ \Delta\beta_i(t+1) = \frac{d_i(t)}{2t(d_i(t)+1)^2} - \beta_i(t)\left(\frac{d_i^2(t)}{t(d_i(t)+1)^2} + \frac{d_i(t)}{2t(d_i(t)+1)^2} - \frac{1}{2t}\right). \]
Using the asymptotic $\frac{k^2}{(k+1)^2} \sim 1 - \frac{2}{k+1}$, the difference equation can be approximated by the following first-order linear nonhomogeneous differential equation:
\[ \frac{d\beta_i(t)}{dt} \sim \frac{d_i(t)}{2t(d_i(t)+1)^2} - \beta_i(t)\left(\frac{1}{2t} - \frac{3d_i(t)}{2t(d_i(t)+1)^2}\right). \tag{22} \]
The solution has the form $\beta_i(t) = u(t)v(t)$, where $v(t)$ satisfies the differential equation
\[ \frac{dv(t)}{dt} = -v(t)\left(\frac{1}{2t} - \frac{3d_i(t)}{2t(d_i(t)+1)^2}\right), \tag{23} \]
and $u(t)$ follows
\[ \frac{du(t)}{dt}\,v(t) = \frac{d_i(t)}{2t(d_i(t)+1)^2}. \tag{24} \]
The solution of (23) is
\[ v(t) = t^{-1/2}\exp\left(\frac{3}{2}\int \frac{d_i(t)}{t(d_i(t)+1)^2}\,dt\right). \tag{25} \]
Then the solution of (24) is
\[ u(t) = \int \frac{d_i(t)}{2t^{1/2}(d_i(t)+1)^2}\exp\left(-\frac{3}{2}\int \frac{d_i(t)}{t(d_i(t)+1)^2}\,dt\right)dt + C. \tag{26} \]
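Then
\[
\begin{aligned}
\beta_i(t) = u(t)v(t) &= t^{-1/2}\exp\left(\frac{3}{2}\int \frac{d_i(t)}{t(d_i(t)+1)^2}\,dt\right) \\
&\quad \times \left[\int \frac{d_i(t)}{2t^{1/2}(d_i(t)+1)^2}\exp\left(-\frac{3}{2}\int \frac{d_i(t)}{t(d_i(t)+1)^2}\,dt\right)dt + C\right] \\
&\sim t^{-1/2}\left(1 + \frac{3}{2}\int \frac{d_i(t)}{t(d_i(t)+1)^2}\,dt\right) \\
&\quad \times \left[\int \frac{d_i(t)}{2t^{1/2}(d_i(t)+1)^2}\left(1 - \frac{3}{2}\int \frac{d_i(t)}{t(d_i(t)+1)^2}\,dt\right)dt + C\right].
\end{aligned}
\tag{27}
\]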
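Since $\int \frac{x\,\kappa_i(x)\,dx}{(x+1)^2} \sim \int \frac{\kappa_i(x)\,dx}{x} \sim \left(\frac{i}{t}\right)^{1/2}$, and repeating the reasoning of (20), it can be shown that the expected value of $\beta_i(t)$ asymptotically follows
\[ \bar\beta_i(t) := E(\beta_i(t)) \sim \left(\frac{i}{t}\right)^{1/2}\log(c_i t) + o(t^{-1/2}). \tag{28} \]
The initial value of $\bar\beta_i$ is $\bar\beta_i(i) = E(s(i)) = \sum_{j=1}^{i-1}\frac{1}{j} \sim \frac{1}{2}\log i$, since $d_i^2(i) = 1$. Let us now find the dynamics of $\beta_i(t)$ by averaging its increments $\Delta\beta_i(t)$.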
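Using the asymptotics $E\left(\frac{1}{d_i(t)}\right) \sim \left(\frac{i}{t}\right)^{1/2}$ and $\frac{k^2}{(k+1)^2} \sim 1 - \frac{2}{k+1}$, we obtain from (22) the following first-order linear nonhomogeneous differential equation for $\bar\beta_i(t)$:
\[ \frac{d\bar\beta_i(t)}{dt} \sim \frac{i^{1/2}}{2t^{3/2}} - \bar\beta_i(t)\left(\frac{1}{2t} - \frac{3i^{1/2}}{2t^{3/2}}\right). \]
The solution has the form $\bar\beta_i(t) = f(t)g(t)$, where $g(t)$ and $f(t)$ satisfy the differential equations
\[ \frac{dg(t)}{dt} = -g(t)\left(\frac{1}{2t} - \frac{3i^{1/2}}{2t^{3/2}}\right), \qquad \frac{df(t)}{dt}\,g(t) \sim \frac{i^{1/2}}{2t^{3/2}}. \]
We have
\[ g(t) = t^{-1/2}\exp\left(-3\left(\frac{i}{t}\right)^{1/2}\right), \qquad f(t) = -i^{1/2}\,\mathrm{Ei}\left(3\left(\frac{i}{t}\right)^{1/2}\right) + C, \]
where Ei(·) denotes the exponential integral function. Then
\[ \bar\beta_i(t) = f(t)g(t) = \left(-\left(\frac{i}{t}\right)^{1/2}\mathrm{Ei}\left(3\left(\frac{i}{t}\right)^{1/2}\right) + C\right)\exp\left(-3\left(\frac{i}{t}\right)^{1/2}\right). \]
Using the Ramanujan formula for Ei(·), it can be shown that $\bar\beta_i(t)$ asymptotically follows
\[ \bar\beta_i(t) \sim \left(\frac{i}{t}\right)^{1/2}\left(c(i) + \frac{1}{2}\log t - 3\left(\frac{i}{t}\right)^{1/2}\exp\left(\frac{3}{2}\left(\frac{i}{t}\right)^{1/2}\right)\right)\exp\left(-3\left(\frac{i}{t}\right)^{1/2}\right). \]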
The dynamics of the friendship index $\beta_i(t)$ for two nodes, i = 1 and i = 10, in a simulated network of size N = 100,000 are shown in Fig. 3. It can be seen that $\beta_i(t)$ gradually decreases to 0 as the network grows.
[Figure 3: trajectories of β1(t) and β10(t) against t (horizontal axis from 0 to 1·10^5, vertical axis from 0 to 0.2).]
Fig. 3. The figure presents (empirical) trajectories of the values of βi(t) over t = 1, . . . , 100,000 for nodes i = 1 (blue) and i = 10 (red) obtained by simulations.

Thus, if $i < \frac{\sqrt{2t}}{\log^2(c_i t)}$ then $\beta_i(t) < 1$; otherwise we have $\beta_i(t) > 1$. As $t$ tends to ∞, we get two groups of nodes, with $\beta_i \to 0$ or $\beta_i \to \infty$, i.e. there is a huge degree asymmetry between old nodes (for which $\beta_i \to 0$) and new nodes (for which $\beta_i \to \infty$). Moreover, as was shown in [11], the average value $\frac{1}{t}\sum_i \beta_i(t)$ tends to ∞ as $t \to \infty$. This is a clear indication of the presence of the friendship paradox in networks generated by the Barabási-Albert model in which each new node acquires a single link.

Now let us look at how the friendship index values are distributed in the real telephone call network from [12], and how this distribution differs from the distribution of the friendship index in Barabási-Albert networks. The histogram for the real network and its log-log variant are shown in Figs. 4(b) and 4(d), respectively. The empirical distribution for the real network is asymmetric, heavy-tailed, and has a large variance ($E(\beta_i) = 2.3$, $\mathrm{VAR}(\beta_i) = 9.4$). To construct a histogram for the Barabási-Albert network, we generated 500 networks of the same size as the real network, i.e. 36,595 nodes, and then averaged the obtained values over each interval of length 1. The generated BA networks turned out to be sparser than the real one. The obtained histogram and its log-log version are shown in Figs. 4(a) and 4(c); they show that the distribution of the friendship index values in the Barabási-Albert network also has a heavy tail and a huge variance ($E(\beta_i) = 13.16$, $\mathrm{VAR}(\beta_i) = 439.8$). The figures also indicate that the overwhelming majority of nodes in both the BA network and the real network have a friendship index greater than 1, which means that there is a noticeable friendship paradox in both networks.
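The contrast between hubs and low-degree nodes can be checked by hand on a tiny graph. The sketch below is illustrative only: it takes the friendship index as $\beta_i = s_i / d_i^2$ (the form used in Eq. (21)), assumes an adjacency dictionary with no isolated nodes, and the helper name `friendship_index` and the star graph are hypothetical choices.

```python
def friendship_index(neighbors):
    """Return {i: beta_i} with beta_i = s_i / d_i**2 for an adjacency dict."""
    beta = {}
    for i, adj in neighbors.items():
        d_i = len(adj)                               # degree of node i
        s_i = sum(len(neighbors[j]) for j in adj)    # sum of neighbours' degrees
        beta[i] = s_i / d_i ** 2
    return beta

# Hypothetical toy graph: a star with hub 0 and four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
beta = friendship_index(star)
# Fraction of nodes whose neighbours have, on average, a higher degree.
share_above_one = sum(b > 1 for b in beta.values()) / len(beta)
```

In this star the hub gets an index below 1 and every leaf an index above 1, so most nodes have a friendship index greater than 1, which is the qualitative pattern described above for both histograms.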
[Figure 4: panels (a) and (b) show histograms of #βi for the BA model and for the real network; panels (c) and (d) show the corresponding log-log plots, log(#βi in interval), for the BA model and for the real network. Legends: log(#βi(t)) together with the fitted lines −2.02 log t + C and −2.7 log t + C.]
Fig. 4. Panels (a) and (c) show the empirical histogram of βi(N) and its log-log plot, respectively, obtained by simulation of the Barabási-Albert model. The simulated results were averaged over 500 independent runs (all networks are of size N = 36,000). Panels (b) and (d) present the empirical histogram of βi(N) and its log-log plot, respectively, obtained for a real network of telephone calls of size N = 36,000 [12].
4
Conclusion
This paper studies the theoretical properties of the friendship index for growth networks generated by the simplest version of the Barabási-Albert model, in which each new node has only one link. The results were obtained using mean-field methods and rate equations. They show a clear presence of the friendship paradox in this type of network. However, some questions that were left outside the scope of the paper may be of interest for further work. In particular, it would be interesting to find out whether similar conclusions are valid for the general version of the Barabási-Albert model (with an arbitrary number of attached links). It would also be appealing to examine whether the observed features are due to special properties of the BA model, or whether configuration model networks with the same degree distribution would show similar properties.
Acknowledgment. This work was supported by the Ministry of science and education of the Russian Federation in the framework of the basic part of the scientific research state task, project FSRR-2020-0006.
References
1. Alipourfard, N., Nettasinghe, B., Abeliuk, A., Krishnamurthy, V., Lerman, K.: Friendship paradox biases perceptions in directed networks. Nat. Commun. 11(1), 707 (2020)
2. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
3. Bollen, J., Goncalves, B., van de Leemput, I., et al.: The happiness paradox: your friends are happier than you. EPJ Data Sci. 6(1), 1–17 (2017)
4. Eom, Y.H., Jo, H.H.: Generalized friendship paradox in complex networks: the case of scientific collaboration. Sci. Rep. 4(1), 4603 (2014)
5. Feld, S.L.: Why your friends have more friends than you do. Am. J. Sociol. 96(6), 1464–1477 (1991). http://www.jstor.org/stable/2781907
6. Fotouhi, B., Momeni, N., Rabbat, M.G.: Generalized friendship paradox: an analytical approach. In: Aiello, L.M., McFarland, D. (eds.) Social Informatics, pp. 339–352. Springer, Cham (2015)
7. Higham, D.J.: Centrality-friendship paradoxes: when our friends are more important than us. J. Complex Netw. 7(4), 515–528 (2018)
8. Jackson, M.O.: The friendship paradox and systematic biases in perceptions and social norms. J. Polit. Econ. 127(2), 777–818 (2019)
9. Lee, E., Lee, S., Eom, Y.H., Holme, P., Jo, H.H.: Impact of perception models on friendship paradox and opinion formation. Phys. Rev. E 99(5), 052302 (2019)
10. Momeni, N., Rabbat, M.: Qualities and inequalities in online social networks through the lens of the generalized friendship paradox. PLoS ONE 11(2), e0143633 (2016)
11. Pal, S., Yu, F., Novick, Y., Bar-Noy, A.: A study on the friendship paradox – quantitative analysis and relationship with assortative mixing. Appl. Netw. Sci. 4, 71 (2019)
12. Song, C., Qu, Z., Blumm, N., Barabási, A.L.: Limits of predictability in human mobility. Science 327(5968), 1018–1021 (2010)
Edge Based Stochastic Block Model Statistical Inference
Louis Duvivier1(B), Rémy Cazabet2, and Céline Robardet1
1 Univ Lyon, INSA Lyon, CNRS, LIRIS UMR5205, 69621 Paris, France
{louis.duvivier,celine.robardet}@insa-lyon.fr
2 Univ Lyon, Université Lyon 1, CNRS, LIRIS UMR5205, 69622 Paris, France
[email protected]

Abstract. Community detection in graphs often relies on ad hoc algorithms with no clear specification of the node partition they define as the best, which leads to uninterpretable communities. Stochastic block models (SBM) offer a framework to rigorously define communities, and to detect them using statistical inference methods that distinguish structure from random fluctuations. In this paper, we introduce an alternative definition of SBM based on edge sampling. We derive from this definition a quality function to statistically infer the node partition used to generate a given graph. We then test it on synthetic graphs, and on the Zachary Karate Club network.

Keywords: Community · Stochastic block model · Statistical inference
1
Introduction
Since the introduction of modularity by Girvan and Newman [1], it has been shown that many networks coming from scientific domains as diverse as sociology, biology and computer science exhibit a modular structure [2], in the sense that their nodes can be partitioned into groups characterized by their connectivity. Yet, there is no universal definition of a community. Many techniques and algorithms have been developed for detecting remarkable node partitions in graphs, most of the time by optimizing a quality function which assigns a score to a node partition [1,3,4]. The problem is that these algorithms rarely account for random fluctuations, and it is thus impossible to say whether the communities obtained reflect a real property of the graph under study or are just an artefact. In particular, it has been shown that even the very popular modularity may find communities in random graphs [5].

Stochastic block models offer a theoretical framework to take random fluctuations into account while detecting communities [6]. Since they are probabilistic generative models, one can perform statistical inference in order to find the most probable model used to generate a given observed graph. The most common way to do this inference is to associate with each SBM the set of graphs it may generate: the larger the set, the smaller the probability of generating each of them [7,8]. This methodology, based on the minimization of entropy, has the strength of being rigorously mathematically grounded. Yet it suffers from one drawback: as it considers probability distributions on graph ensembles, the random variable considered is the whole graph. Thus statistical inference is performed on a single realization, which leads to overfitting. Although techniques have been introduced to mitigate this effect, it cannot be totally eliminated and it induces counterintuitive behavior in some tricky situations [9].

In this paper, we propose a new quality function for node partitions, based on stochastic block models defined as probability distributions on a set of edges. This allows us to use statistical inference methods in a more relevant way, relying on several realizations of the same random variable. To do so, we first define an edge-based stochastic block model, then use the minimum description length method [10] to infer its parameters from an observed graph. Finally, we test this quality function on synthetic graphs, plus the Zachary Karate Club network.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 462–473, 2021. https://doi.org/10.1007/978-3-030-65351-4_37
2
Methodology Presentation
Traditionally, a stochastic block model is defined as a couple $(B, M)$, with $B$ a partition of the set of nodes $[1, n]$ into $p$ blocks $b_1, \dots, b_p$, and $M$ a $p \times p$ block adjacency matrix whose entries correspond to the number of edges between any two blocks (or equivalently to the density). These parameters define a set of generable graphs $\Omega_{B,M}$ from which graphs are sampled according to some probability distribution. As the probability distribution is defined on a set of graphs, we call the stochastic block models defined in this way generative models of graphs.

In this paper, we will consider stochastic block models as generative models of edges. Such a model also takes as parameters a set of nodes $V = [1, n]$ partitioned into $p$ blocks $B = b_1, \dots, b_p$, but instead of a block adjacency matrix, it relies on a $p \times p$ block probability matrix $M$ such that:
– $\forall i, j,\ M[i, j] \in [0, 1]$
– $\sum_{i,j} M[i, j] \times |b_i||b_j| = 1$

For a given partition $B$, the set of all matrices verifying those conditions will be denoted $\mathrm{Mat}(B)$. Given two nodes $u$ and $v$, belonging respectively to the blocks $b_i$ and $b_j$, the edge $u \to v$ is generated with probability $P_{B,M}[u, v] = M[i, j]$. This probability distribution can be seen as a block-constant $n \times n$ matrix, and in the following, the notation $P_{B,M}$ will refer interchangeably to the probability distribution and to the corresponding matrix. We will also denote by $\mathrm{Prob\_mat}(B)$ the set of all $B$-constant edge probability matrices on $[1, n]^2$, defined as:
\[ \mathrm{Prob\_mat}(B) = \{P \mid \exists M_P \in \mathrm{Mat}(B),\ P = P_{B,M_P}\} \tag{1} \]
Generating a graph $G = (V, E)$ made of $m$ edges $e_1, \dots, e_m$ with such a generative model of edges means generating each of its edges independently. Thus, $G$ is generated with probability:
\[ P_{B,M}[G] = \prod_{i=1}^{m} P_{B,M}[e_i] \]
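The edge-based generative model above can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the authors' code: nodes are labelled from 0 instead of 1, blocks are given as lists of node labels, and the function name `sample_edges` is hypothetical. It first draws a block pair with probability $M[i, j] \times |b_i||b_j|$, then a uniform node in each block, which gives every pair $(u, v)$ the probability $M[i, j]$.

```python
import random

def sample_edges(blocks, M, m, seed=0):
    """Draw m independent directed edges (u, v) with P[u, v] = M[i][j]."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(len(blocks)) for j in range(len(blocks))]
    # Total probability mass of each block pair: M[i][j] * |b_i| * |b_j|.
    weights = [M[i][j] * len(blocks[i]) * len(blocks[j]) for i, j in pairs]
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("M does not satisfy the normalisation constraint")
    edges = []
    for i, j in rng.choices(pairs, weights=weights, k=m):
        # Within a block pair, all node pairs are equally likely.
        edges.append((rng.choice(blocks[i]), rng.choice(blocks[j])))
    return edges
```

Because edges are drawn independently, the same pair can appear several times and self-loops are possible, so the result is a directed multigraph given as an edge list.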
L. Duvivier et al.
In particular, this means that the same edge $u \to v$ can be sampled more than once, so for the rest of the paper we will work with multigraphs. To simplify computations, we will consider directed graphs with self-loops.

In practice, we study a graph $G$ made of a set of vertices $V = [1, n]$ and a list $E$ of $m$ edges: $e_1 = u_1 \to v_1, \dots, e_m = u_m \to v_m$. We suppose that $G$ was generated by a stochastic block model $(B_0, M_0)$, and thus that all edges in $E$ were independently sampled from the same probability distribution $P_{B_0,M_0}$. Our objective is to identify the original parameters $B_0$ and $M_0$ used to generate $G$. To do so, we rely on the minimum description length principle. This principle, borrowed from information theory, relies on the fact that any statistical regularity can be used for compression. Therefore, the quality of a statistical model can be measured by the compression of the data under study that it allows.

Let's give an example: Alice draws messages independently at random from a set $\Omega$, with a probability distribution $P$, and she transmits them to Bob through a binary channel. Each message needs to be encoded through a coding pattern $C : \Omega \to \{0, 1\}^*$. For any message $x \in \Omega$, we denote by $|C(x)|$ the length of its code. The expected length of the encoded message will then be:
\[ E[|C(x)|] = \sum_{x \in \Omega} P[x] \cdot |C(x)| \]
It can be shown that this expected value is minimal when $C$ is such that $\forall x,\ |C(x)| = -\log_2(P[x])$, and in this case, the previous expression is called the entropy of $P$. This result means that finding an optimal code $C^*$ and finding the original probability distribution $P$ are the same problem, because $P[u, v] = 2^{-|C^*[u,v]|}$. This is what we will use to recover $P_{B_0,M_0}$. Let's suppose that Alice does not know $P$, but that she can draw as many random messages as she wants from $\Omega$. Then, for any probability distribution $Q$ on $\Omega$, she can define a code $C_Q$, under which the mean length of the messages $e_1, \dots, e_m$ transmitted will be:
\[ \mathrm{code\_len}(e_1, \dots, e_m, C_Q) = -\sum_{x \in \Omega} \frac{\#\{k \mid e_k = x\}}{m} \cdot \log_2(Q[x]) \tag{2} \]
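Equation (2) is easy to evaluate numerically, which makes the link between code length and distribution fitting concrete. The snippet below is a small sketch with hypothetical names (`code_len`, and the toy message set `"a"`, `"b"`); it assumes `Q` assigns a strictly positive probability to every observed message.

```python
import math
from collections import Counter

def code_len(messages, Q):
    """Mean code length, in bits per message, of `messages` under the code C_Q."""
    m = len(messages)
    counts = Counter(messages)   # #{k | e_k = x} for each observed message x
    return -sum(c / m * math.log2(Q[x]) for x, c in counts.items())

# Hypothetical sample: "a" is drawn three times as often as "b".
messages = ["a"] * 6 + ["b"] * 2
uniform = {"a": 0.5, "b": 0.5}
empirical = {"a": 0.75, "b": 0.25}
```

Under `uniform` each message costs one bit, while the code built from the empirical frequencies is shorter on average; a distribution that skews the wrong way (more weight on `"b"`) is longer than both.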
And, as we know that $\frac{\#\{k \mid e_k = x\}}{m} \xrightarrow[m \to \infty]{} P[x]$ because of the law of large numbers, it means that if $m$ is large enough, the best code $C^*$ will correspond to a distribution $Q$ which is a good approximation of $P$.

In our case, the messages to be transmitted are the edges of $G$: $\{e_1, \dots, e_m\}$, drawn from the set $[1, n]^2$ with the probability distribution $P_{B_0,M_0}$. We want to approximate this distribution in order to deduce $B_0$ and $M_0$ from it, but to avoid overfitting, we do not minimize the encoding length of all edges $e_1, \dots, e_m$ at the same time: we consider them sequentially. This corresponds to a situation in which Alice observes the edges one at a time and transmits them right away, updating her code on the fly. At the other end, Bob updates his code in the same way. When Alice draws the $x$th edge, Bob only knows edges $e_1, \dots, e_{x-1}$, so they optimize their code on this limited sample. For the remaining $m - x$ edges, as they have no information, they suppose they are random. Finally, as they know that edges are generated by a stochastic block model, they limit themselves to codes based on $B$-constant probability distributions, for some partition $B$. Thus $Q_{B,x}$ is defined as:
\[ Q_{B,x} = \underset{Q \in \mathrm{Prob\_mat}(B)}{\mathrm{argmin}} \left[ \frac{x}{m} \cdot \mathrm{code\_len}(e_1, \dots, e_x, Q) - \sum_{u,v \in [1,n]} \frac{(m - x)}{m \cdot n^2} \cdot \log_2(Q[u, v]) \right] \]
And the mean code length of the messages sent from Alice to Bob will be:
\[ \mathrm{code\_len}(E, B) = -\frac{1}{m} \sum_{x=1}^{m} \log_2(Q_{B,x-1}[e_x]) \tag{3} \]
Of course, it depends on the partition $B$ used by Alice and Bob. If we now imagine that each partition $B$ is tested in parallel, we can approximate $B_0$ by:
\[ B^* = \underset{B}{\mathrm{argmin}}\ \big(\mathrm{code\_len}(E, B)\big) \tag{4} \]
This partition corresponds to the best sequential compression of edges $e_1, \dots, e_m$, and according to the minimum description length principle, it should correspond to the original partition $B_0$. It should be noted that sequential encoding supposes that edges are ordered, which is typically not the case (except for temporal graphs). Therefore, we need to choose an order, and it will necessarily be arbitrary. Yet, we observe in practice that, although the order modifies the precise value of $\mathrm{code\_len}(E, B)$, these fluctuations have a limited impact on the estimation $B^*$.
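Equations (3) and (4) can be sketched as follows. The code relies on a closed form for $Q_{B,x}$ that is not stated in the text and is derived here as an assumption: the objective is a cross-entropy against a mixture of the first $x$ observed edges and a uniform distribution, so the block-constant minimizer gives each block pair its mixture mass spread evenly, i.e. $Q_{B,x}[u, v] = \frac{c_{ij}(x)}{m\,|b_i||b_j|} + \frac{m - x}{m\,n^2}$, where $c_{ij}(x)$ counts the first $x$ edges falling in block pair $(i, j)$. Node labels start at 0 and the names `mean_code_len` and `best_partition` are hypothetical.

```python
import math

def mean_code_len(edges, blocks, n):
    """Sequential mean code length of Eq. (3) for an ordered edge list."""
    m = len(edges)
    block_of = {u: i for i, b in enumerate(blocks) for u in b}
    counts = {}          # edges already seen, per block pair
    total_bits = 0.0
    for seen, (u, v) in enumerate(edges):   # `seen` = x - 1 edges known so far
        i, j = block_of[u], block_of[v]
        size = len(blocks[i]) * len(blocks[j])
        # Assumed closed form of Q_{B, x-1}[e_x]: observed part + uniform part.
        q = counts.get((i, j), 0) / (m * size) + (m - seen) / (m * n * n)
        total_bits -= math.log2(q)
        counts[(i, j)] = counts.get((i, j), 0) + 1
    return total_bits / m

def best_partition(edges, candidates, n):
    """Eq. (4), restricted to an explicit list of candidate partitions."""
    return min(candidates, key=lambda blocks: mean_code_len(edges, blocks, n))
```

With the single-block partition the estimate stays uniform, so the mean code length is $2\log_2 n$ bits whatever the edges are. Searching over all partitions is not attempted here; `best_partition` only compares the candidates it is given.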
3
Tests on Synthetic Graphs
In order to test this estimator, we generated random graphs using edge-based stochastic block models, and observed how it behaves for various partitions of the nodes, in particular with underfitted and overfitted partitions. Before looking at the estimator itself, we investigated how the prediction probability of the next edge evolves as Alice draws more and more edges. Then, we tested how the mean code length behaves on partitions which are a coarsening or a refinement of the original partition, and on partitions with the same number of blocks as the original, but with blocks of different sizes, or shifted. Finally, we considered more complex SBMs, with blocks of different sizes and densities.
3.1
Prediction Probability
We start by considering three graphs G0 , G1 and G2 . Each of them is made of n = 128 nodes and m = 2800 edges (density is about 0.17), generated using three different stochastic block models described in Table 1.
Table 1.
SBM | Node partition | Block probability matrix
$S_0 = (B_0, M_0)$ | ([1, 128]) | $\frac{1}{n^2}$
$S_1 = (B_1, M_1)$ | ([1, 64], [65, 128]) | $\frac{1}{n^2} \cdot \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$
$S_2 = (B_2, M_2)$ | ([1, 32], [33, 64], [65, 96], [97, 128]) | $\frac{1}{n^2} \cdot \begin{pmatrix} 4 & 0 & 0 & 0 \\ 0 & 4 & 0 & 0 \\ 0 & 0 & 4 & 0 \\ 0 & 0 & 0 & 4 \end{pmatrix}$
Fig. 1. Prediction probability against edge rank for three different graphs
For each of these graphs, we consider the prediction probability of the next edge $Q_{B,x-1}[e_x]$ against $x$ for the three different partitions $B_0$, $B_1$ and $B_2$, each of which is a refinement of the previous one. Results are shown in Fig. 1. We observe that for all three graphs, whatever $x$, the prediction probability based on the null partition $B_0$ is constant at $\frac{1}{128^2} \approx 0.00006$. This is logical, as the only $B_0$-constant probability matrix is the one corresponding to the uniform distribution. Therefore, $\forall x,\ Q_{B_0,x} = \frac{1}{n^2}$.

For other partitions, the results depend on the graph. On $G_0$, generated with $B_0$ and thus presenting no block structure, the probability distributions associated with more refined partitions do not perform better than the one based on $B_0$. For some edges their prediction probability is better, but just as often it is worse. On average, they have the same predictive power; they are only more sensitive to the random fluctuations due to the order in which edges are drawn. On the other hand, for $G_1$, generated with $B_1$ (two blocks), we observe that refining the partition from one block to two allows the prediction probability to increase quickly. While it remains $\frac{1}{n^2}$ for the partition $B_0$, it rises up to $\frac{2}{n^2}$ for the partition $B_1$. Yet, refining the partition even further is worthless, as illustrated by the four-block partition $B_2$, which does not bring any improvement on average. Finally, considering $G_2$, we observe that refining the partition brings more and more improvement to the prediction probability. With $B_0$ it remains stable at $\frac{1}{n^2}$, with $B_1$ it rises up to $\frac{2}{n^2}$, and with $B_2$ up to $\frac{4}{n^2}$.

To investigate further how the prediction probability evolves when refining the partition, we considered the mean prediction probability
\[ \mathrm{mean\_prob}(E, B) = \frac{1}{m} \cdot \sum_{x=1}^{m} Q_{B,x-1}[e_x] \]
on 10 graphs generated with $S_2$. For each of them, the mean prediction probability is computed for 8 different partitions $B_i$, with respectively 1, 2, 4, 8, 16, 32, 64 and 128 blocks. For each $i$, $B_{i+1}$ is obtained by dividing each block of $B_i$ into two blocks of equal size. The mean prediction probability is then plotted against the number of communities for each graph in Fig. 2. We observe that the mean prediction probability increases sharply as long as $B$ is coarser than or equal to $B_2$ (the partition used to generate the edges). As soon as it becomes finer, it keeps increasing or decreasing a bit, depending on the graph considered, but it mainly remains stable. This is due to the fact that $Q_{B_i,x}$ tries to converge toward $M_2$. As $M_2$ belongs neither to $\mathrm{Prob\_mat}(B_0)$ nor to $\mathrm{Prob\_mat}(B_1)$, its convergence is limited for these two partitions, and therefore the mean prediction probability is limited. On the other hand, for $i \ge 2$, $M_2 \in \mathrm{Prob\_mat}(B_i)$, so the convergence is not limited, but refining the partition is pointless, since $M_2$ is $B_2$-constant.
3.2
Mean Code Length
If we now plot the same curves, replacing the mean prediction probability by the mean code length, we obtain the results shown in Fig. 3. We observe that for all 10 graphs, the mean code length sharply decreases until $i = 2$. This is because, as we have just seen, for $i \le 2$, refining the partition leads to a quick increase of $Q_{B_i,x-1}[e_x]$, and therefore to a decrease of the code length of the $x$th edge, $-\log_2(Q_{B_i,x-1}[e_x])$. On the other hand, for $i \ge 2$, the mean code length starts to increase again, slowly and then faster, in contrast with the mean prediction probability, which remained stable in this regime. We have seen that, as $i$ grows larger than 2, $Q_{B_i,x-1}[e_x]$ oscillates more and more due to random fluctuations. When computing the mean prediction probability, these oscillations compensate each other, but as the logarithm is a concave function:
\[ -\log_2\left(\frac{1}{m} \cdot \sum_{x=1}^{m} Q_{B_i,x-1}[e_x]\right) < -\frac{1}{m} \sum_{x=1}^{m} \log_2\big(Q_{B_i,x-1}[e_x]\big) \]
Therefore, the more $Q_{B_i,x-1}[e_x]$ oscillates, the larger the mean code length in the end.
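The effect of the concavity of the logarithm can be seen on two made-up sequences of prediction probabilities with the same mean. The values below are illustrative and do not come from the experiments; the helper name `mean_prob_and_code_len` is hypothetical.

```python
import math

def mean_prob_and_code_len(probs):
    """Return (mean prediction probability, mean code length in bits)."""
    mean_prob = sum(probs) / len(probs)
    mean_bits = -sum(math.log2(p) for p in probs) / len(probs)
    return mean_prob, mean_bits

# Two hypothetical sequences with the same mean probability.
steady = [0.25, 0.25, 0.25, 0.25]        # no oscillation
noisy = [0.125, 0.375, 0.125, 0.375]     # oscillates around 0.25
steady_prob, steady_bits = mean_prob_and_code_len(steady)
noisy_prob, noisy_bits = mean_prob_and_code_len(noisy)
```

Both sequences have the same mean prediction probability, but the oscillating one has the larger mean code length, which is the mechanism that penalizes overly fine partitions.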
Fig. 2. Mean prediction probability against partition refinement
Fig. 3. Mean code length against partition refinement: four communities graphs
Fig. 4. Mean code length against partition refinement: two communities graphs of various sharpness
These two phenomena are very important, because they explain how the mean code length as a quality function prevents both overfitting and underfitting. If the partition tested is too coarse with respect to the original partition, $Q_{B,x}$ cannot converge toward the original block probability matrix, and the mean code length increases. On the other hand, if it is too fine, the convergence occurs but in a noisier way, and this too leads to an increase of the mean code length. Of course, this can work only if the edge generation probabilities are different enough and if the total number of edges drawn is large enough for $\#\{k \mid u_k \in b_i \wedge v_k \in b_j\}$ to be significantly different from one pair of blocks $(b_i, b_j)$ to another.

To illustrate this, we considered a set of 10 graphs, still with 128 nodes and 2800 edges, generated by stochastic block models based on the partition $B_1$ (two blocks) and block probability matrices:
\[ M_i = \frac{1}{n^2} \cdot \begin{pmatrix} 2 - \frac{i}{10} & \frac{i}{10} \\ \frac{i}{10} & 2 - \frac{i}{10} \end{pmatrix} \]
Therefore, $S_0$ generates graphs with two perfectly separated communities, while $S_9$ generates graphs with almost no community structure. For each stochastic block model, we generate a graph $G_i$ and compute the mean code length for 6 different partitions, $B_0$ to $B_5$, defined as before with 1 to 32 blocks. Results are plotted in Fig. 4. We observe that for $i = 0, 2, 4, 6, 8$, the minimum mean code length is obtained for the two-block partition $B_1$, while for the others, it is obtained for the four-block partition $B_2$. This shows that fuzzy communities may lead to limited overfitting, but that the quality function is very robust against underfitting.

Finally, we considered the performance of the mean code length when modifying blocks' sizes or shifting blocks. To do so, we generated 10 graphs with 128 nodes and 2800 edges, made of two perfectly separated communities of equal size. Then, for each of these graphs, we computed the mean code length for two sequences of partitions:
– $S_{cut} = \big(B(c) = ([1, c], [c, 128])\big)_{c \in \{0, 8, 16, 24, \dots, 128\}}$
– $S_{offset} = \big(B(o) = ([1 + o, 64 + o], [1, o] \cup [65 + o, 128])\big)_{o \in \{0, 4, 8, 12, \dots, 32\}}$

Results are plotted, respectively against $c$ and $o$, in Fig. 5. We observe that for all graphs, the minimum of the mean code length is reached when $c = 64$ in the first sequence, and when $o = 0$ in the second, which both correspond to the partition $B_1$ used to generate them. This means that the mean code length is robust against shifting blocks and modifying blocks' sizes.
3.3
Merge/Split Issue
As of today, the main stochastic block model statistical inference methodology is based on SBMs considered as generative models of graphs, as explained at the beginning of Sect. 2. The best set of parameters $B, M$ is inferred by minimizing the entropy of the set of generable graphs $\Omega_{B,M}$, as detailed in [6]. In the following, this entropy will be denoted by entropy(G, B), and we compute it using the Python library graph-tool¹. It has been shown that this methodology can lead to a phenomenon of block inversion in graphs made of one large community and a set of smaller ones [9]. Here, we will show how the mean code length allows us to overcome the issue.

To illustrate the phenomenon on a simple example, let's consider a stochastic block model $S_1$ defined on a set of $n = 12$ nodes, partitioned into three communities, $B = ([0; 5], [6; 8], [9; 11])$, and a probability matrix:
\[ M = \begin{pmatrix} 0.026 & 0 & 0 \\ 0 & 0.003 & 0 \\ 0 & 0 & 0.003 \end{pmatrix} \]
¹ https://graph-tool.skewed.de.

Fig. 5. Mean code length against cut (left) and offset (right)

Fig. 6. Three different partitions of the Zachary Karate Club network. Sociological (upper left), maximum modularity (upper right), minimum entropy (lower)
Fig. 7. Mean code length for different partitions of the Zachary Karate Club network

We test two different partitions: the original one, $B$, and the inverse partition $B_i = ([0; 2], [3; 5], [6; 11])$. To do so, we generate 100 graphs $G_i$ made of $m = 378$ edges with $S_1$ and, for each graph, we compute the mean code length and the entropy for both partitions. Then, for both quality functions, we compute the percentage of graphs for which the original partition is identified as better than the inverse one. Results are shown in Table 2.

Table 2. Percentage of correct match for heterogeneous graphs
SBM | Mean code length | Entropy
S1 | 96% | 0%
S2 | 100% | 0%

While the mean code length almost always correctly identifies the original partition, the entropy of the microcanonical ensemble never does so. The graphs considered here had a very high density, which makes them not very realistic, but the same results can be obtained with low-density graphs. Let's consider a stochastic block model $S_2$ on $n = 256$ nodes, partitioned into 33 communities: one of size 128, and 32 of size 4. The internal probability of the big community is 0.00006, that of the small communities is 0.00076, and the probability between communities is null. As before, we generate 100 graphs with $S_2$ and test for each of them the original partition and the inverse partition obtained by splitting the big community into 32 small ones and merging the small ones into one big community. The percentage of graphs for which the mean code length (resp. the entropy) is smaller for the original partition than for the inverse one is shown in Table 2. In this case too, the mean code length always recovers the original partition, while minimum entropy never does.
4
Zachary Karate Club
Finally, we test the mean code length quality function on the Zachary Karate Club network. We study three different partitions of it. First of all, the sociological partition, $B_{100}$, which is the partition described in the original paper as corresponding to the sociological ground truth about communities in the karate club. $B_{200}$ is the partition obtained by maximizing the modularity using the Louvain algorithm, and $B_{300}$ the partition obtained by minimizing the entropy using the graph-tool library. Those partitions are illustrated in Fig. 6. For each of these partitions, we compute the mean code length. We also do so for 100 random partitions of the graph, with 1 to 5 blocks, and for each of these partitions, we compute the mean code length for 99 random refinements of them, obtained by randomly dividing each block in two. Results are plotted in Fig. 7. We observe that the mean code length is minimal for the minimum entropy partition. All studied partitions perform better than the random ones, so the mean code length captures the fact that they reproduce part of the structure of the network. Yet, for $B_{100}$ and $B_{200}$, many of their random refinements improve the compression, sometimes by a large amount, indicating that they are not optimal. This is not the case for the minimum entropy partition $B_{300}$: only 2 refinements out of 99 perform slightly better, which, as we have seen, may happen due to random fluctuations. These results are coherent with previous work showing that $B_{100}$ is actually not fully supported by statistical evidence in the network. In the case of $B_{200}$, modularity is defined based on nodes' degrees, so the selected partition compensates for node degrees, which are not considered here. Finally, minimizing the entropy without correcting for the degree leads to the identification of two blocks of hubs, at the center of each sociological community, and two blocks corresponding to their periphery.
This is not necessarily what we expect, because we are used to communities defined with an implicit or explicit degree correction, but as we have not imposed constraints so far, this result corresponds to the statistical evidence present in the network.
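The random baseline described above (random partitions with 1 to 5 blocks, each refined by dividing every block in two) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, nodes are assigned to blocks uniformly at random, and a split is allowed to leave one half of a block empty. The mean code length itself is not reproduced here because its definition lies outside this section.

```python
import random

def random_partition(nodes, n_blocks, rng):
    """Assign each node to one of n_blocks blocks uniformly at random (assumption)."""
    return {v: rng.randrange(n_blocks) for v in nodes}

def random_refinement(partition, rng):
    """Divide every block in two by sending each node to one of two sub-blocks."""
    return {v: 2 * b + rng.randrange(2) for v, b in partition.items()}

def is_refinement(fine, coarse):
    """True if every block of `fine` lies inside a single block of `coarse`."""
    parent = {}
    for v, b in fine.items():
        if parent.setdefault(b, coarse[v]) != coarse[v]:
            return False
    return True

rng = random.Random(0)
nodes = range(34)  # the Zachary karate club network has 34 nodes
coarse = random_partition(nodes, rng.randint(1, 5), rng)
refinements = [random_refinement(coarse, rng) for _ in range(99)]
```

Each refinement could then be scored with the quality function and compared with the score of the partition it refines.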
5
Conclusion
In conclusion, we have defined a new quality function, the mean code length, to evaluate node partitions. It relies on an alternative definition of the stochastic block model as a probability distribution over edges. We make the hypothesis that the edges of the graph G under study were sampled independently from the same stochastic block model probability distribution. Then, we make use of the law of large numbers and of the minimum description length
principle to derive a statistical estimator of the partition used to generate G. The mathematical derivation of this estimator allows a clear interpretation of the partition identified. Moreover, it provides a basis for mathematically proving properties of the estimator, for example its convergence toward the original partition. We then test this estimator on synthetic graphs generated with a known block structure. These tests show that the mean code length correctly identifies blocks of nodes whose internal connections are homogeneous, avoiding both the tendency to merge distinct communities, which leads to underfitting, and the tendency to split communities into smaller blocks, which leads to overfitting. Finally, we test it on different partitions of the Zachary karate club network, and the results are consistent with previous results based on statistical inference of the stochastic block model. These results are preliminary. This quality function should be tested more thoroughly, against graphs of various sizes and densities, with heterogeneous communities. In particular, it would be interesting to measure the density thresholds that allow stochastic block models to be recovered using this method, as has been done for other methodologies.
Acknowledgments. This work was supported by the ACADEMICS grant of the IDEXLYON, project of the Université de Lyon, PIA operated by ANR-16-IDEX0005, and of the project ANR-18-CE23-0004 (BITUNAM) of the French National Research Agency (ANR).
References
1. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proc. Natl. Acad. Sci. 99(12), 7821–7826 (2002)
2. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. 659, 1–44 (2016)
3. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70(6), 066111 (2004)
4. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74(3), 036104 (2006)
5. Guimera, R., Sales-Pardo, M., Amaral, L.A.N.: Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70(2), 025101 (2004)
6. Peixoto, T.P.: Bayesian stochastic blockmodeling. In: Advances in Network Clustering and Blockmodeling, pp. 289–332 (2019)
7. Peixoto, T.P.: Entropy of stochastic blockmodel ensembles. Phys. Rev. E 85(5), 056122 (2012)
8. Bianconi, G.: Entropy of network ensembles. Phys. Rev. E 79(3), 036114 (2009)
9. Duvivier, L., Robardet, C., Cazabet, R.: Minimum entropy stochastic block models neglect edge distribution heterogeneity. In: International Conference on Complex Networks and Their Applications, pp. 545–555. Springer (2019)
10. Grünwald, P.: A tutorial introduction to the minimum description length principle. In: Advances in Minimum Description Length: Theory and Applications, pp. 3–81 (2005)
Modeling and Evaluating Hierarchical Network: An Application to Personnel Flow Network
Jueyi Liu1(B), Yuze Sui1, Ling Zhu2, and Xueguang Zhou1
1 Stanford University, Stanford, CA 94305, USA {jueyiliu,yuzesui,xgzhou}@stanford.edu
2 The Chinese University of Hong Kong, Shatin, NT 999077, Hong Kong SAR, China [email protected]
Abstract. This study evaluates (1) the properties of a hierarchical network of personnel flow in a large and multilayered Chinese bureaucracy, in light of selected classical network models, and (2) the robustness of the hierarchical network with regard to the edge weights as strength of “weak ties” that hold different offices together. We compare the observed hierarchical network with the random graph model, the scale-free model, the small-world model, and the hierarchical random graph model. The empirical hierarchical network shows a higher level of local clustering (in both LCC and GCC) and a lower level of fluidity of flow (i.e., high APL) across offices in the network, as compared with the small-world model and the hierarchical random graph model. We also find that the personnel flow network is vulnerable to the removal of “weak ties” that hold together a large number of offices on an occasional rather than regular basis. The personnel flow network tends to dissolve into locally insulated components rather than to maintain an integrated hierarchy.
Keywords: Hierarchical social network · Weak ties · Bureaucratic personnel flow
1
Introduction
In this study, we use social network data based on personnel flow among offices in a large and multilayered Chinese bureaucracy to investigate its hierarchical network properties by comparing the constructed network to selected theoretical network models in the literature. Our observed network is directed and weighted, and it is hierarchical by organizational design. While many network models, including the hierarchical random graph model (HRGM), are developed for undirected and unweighted networks [3], we focus on the undirected and weighted version of our observed hierarchical network. We consider edge weights in our network by comparing a series of unweighted models after successive removal of edges with various weights from the original model to examine changes in network properties.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 474–484, 2021. https://doi.org/10.1007/978-3-030-65351-4_38
We have two goals in this paper. First, we evaluate how the properties of the observed hierarchical network compare with those of selected network models in the literature. We search for network models that better capture the key characteristics of the empirical hierarchical network. We are interested in the actual fluidity of flows across offices and hierarchical levels (i.e., administrative jurisdictions). Second, we are interested in how robust the hierarchical network is if we successively remove those “weak ties” (i.e., those edges with low weights) that link offices to form the hierarchical network. A network is robust against such removal of weak ties if most flows are frequent, and hence most edge weights are large. We find that, first, the observed hierarchical network has a strong tendency of local clustering, as measured by both the global clustering coefficient (GCC) and the local clustering coefficient (LCC). It also has low fluidity of flow to other parts of the network, in terms of high average path length (APL) compared with those in the small-world and HRGM models. Second, our hierarchical network is sensitive to the strength of weak ties. That is, the removal of the weak ties (edges with weights between 1 and 3) leads to a large number of isolates and components.
2
Related Work
To model our hierarchical network data, we begin with Clauset, Moore, and Newman’s hierarchical random graph model (HRGM) for inferring hierarchical structure from network data. It was demonstrated that the hierarchical model shows many commonly observed properties, including right-skewed degree distributions, high clustering coefficients, and short path lengths [3]. Watts and Strogatz’s small-world model also captures many features of a hierarchical topology, with nodes having dense local clusters and short path lengths spanning vertices over the entire network [8]. The Erdős-Rényi random graph model and the scale-free model developed by Albert and Barabási are two other network models that may fit some aspects of social networks [1,4,6]. We evaluate our observed network in comparison with the models mentioned above. Intuitively, we expect that HRGM provides the closest fit to the observed network in our data. Granovetter developed the idea of “weak ties” to identify those vertices that provide the only (or shortest) link between pairs of vertices [5]. As Onnela et al. show, the removal of weak ties can quickly lead to the disintegration of a large telecommunication network [7]. We exploit these ideas in the literature to examine the robustness of the hierarchical network in our data.
3
Data and Models
We model and examine the properties of an observed social network based on personnel flow among government offices in JS Province in southern China. The offices in our data are multilayered, involving three levels of hierarchical and
administrative jurisdictions: provincial-level (frequency = 1), prefecture (13), and county (110). Personnel flow involves mobility events across offices within or across these jurisdictions. These moves may be vertical (i.e., across offices at different hierarchical levels) or horizontal (i.e., across offices at the same hierarchical level). The network is hierarchical, as the offices and jurisdictions among which mobility takes place have different administrative ranks, and thus they are hierarchically organized. Our data record all personnel mobility events of about 40,000 chief officials across n = 1678 offices between 1990 and 2008. The mobility information is updated every year. We construct our network by aggregating all these personnel flow events into one network. For further analysis, these offices can be grouped into distinct functional areas, including “party HQ”, “government HQ”, and “economy”, among others. We first constructed a directed and weighted network of mobility, where offices are nodes, and personnel flows are the edges that connect pairs of offices. As a person moves from one office to another, a directed flow is created. The weight on a directed edge A → B is the number of flow events from office A to office B. Then, we collapse flows into an undirected and weighted network and label it as job flows. The weight of an edge between a pair of nodes is the sum of the weights of all flows between the two nodes, in either direction. Hence, each edge weight provides information on the strength of the tie (or the robustness of the link) between a pair of offices. By construction, there are no multiple edges or self-loops in our graph. We focus on the undirected network because of our interest in the role of information transmission and identity associated with personnel flow in the network. We compare the observed hierarchical network with selected network models to evaluate the properties of our personnel-flow network.
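The construction of job flows described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the input format (a list of origin and destination office pairs), the function name, and the choice to skip moves within the same office are assumptions made for the example.

```python
from collections import Counter

def build_job_flows(moves):
    """moves: iterable of (origin_office, destination_office) mobility events."""
    directed = Counter()
    for src, dst in moves:
        if src != dst:                     # assumption: same-office moves are dropped (no self-loops)
            directed[(src, dst)] += 1      # weight of A -> B = number of flow events
    undirected = Counter()
    for (src, dst), w in directed.items():
        undirected[frozenset((src, dst))] += w   # A -> B and B -> A collapse into one edge
    return directed, undirected

# Hypothetical toy events, not taken from the data set.
moves = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C")]
directed, job_flows = build_job_flows(moves)
```

Keying the undirected edges by a frozenset guarantees a single edge per pair of offices, so no multiple edges can arise.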
We consider the following undirected models:1
– random gnp is the Erdős-Rényi random graph model G(n, p), where n is the number of nodes and p is the edge density of job flows.
– small world is the Watts and Strogatz model with the number of nodes n and neighborhood size being the average degree of nodes of job flows. The rewiring probability is adjusted according to the ratio of flow events across administrative jurisdictions to the total number of flow events.
– scale free is the Barabási-Albert model with the number of nodes n. The power of the preferential attachment is the exponent obtained by fitting a power-law distribution to the degree distribution of job flows. The “attractiveness” of the nodes with no adjacent edges is the minimum value from which the power-law distribution was fitted.
– To construct HRGM as illustrated by Clauset, Moore, and Newman [3], we first fit a HRGM to job flows, and then we create a consensus tree so that the consensus tree contains common features of multiple samples. Next, we sample a HRGM from the consensus tree as the final hierarchical random graph HRGM. We used 1000 samples to construct the consensus tree for the analysis reported here.
1 The italic words refer to the models, with the specific parameters given here, whose properties we analyse.
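As a worked check of the parameters above, the sketch below derives the average degree and the edge density p from the node and edge counts reported for job flows in Table 2, and draws one G(n, p) sample. It is an illustration only: the sampler is a plain standard-library implementation, not the generator used by the authors, and the reading of "neighborhood size" in the last lines is an assumption.

```python
import random
from itertools import combinations

n, m = 1678, 6753                 # nodes and edges of job flows (Table 2)
avg_degree = 2 * m / n            # about 8.049, as reported in Table 2
p = m / (n * (n - 1) / 2)         # edge density, used as p in G(n, p)

def gnp(n, p, rng):
    """Erdős-Rényi G(n, p): keep each of the n(n-1)/2 possible edges with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

sample = gnp(n, p, random.Random(1))

# Assumption: if the neighborhood size k of the ring lattice counts neighbours on
# each side, the lattice has n * k edges; k = 8 gives the small world edge count of Table 2.
ring_edges = n * round(avg_degree)
```

The expected clustering coefficient of G(n, p) is approximately p, which is in line with the value of 0.004 reported for random gnp.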
4
Properties of the Empirical Network Compared with Theoretical Models
Figure 1 visualizes the entire network of job flows. Edges in blue are cross-jurisdiction flows, and edges in red are within-jurisdiction flows. There are actually more red edges in Fig. 1, but they are not visible because they are highly clustered locally around nodes within the same jurisdictions. We consider both edge weight distribution and degree distribution because we are interested in the connectivity among the offices. From a social network point of view, the smaller the edge weight, the weaker the tie [5]. For example, for offices connected by edges with weight 1, removing these edges would disconnect those pairs of offices. Table 1 presents the distribution of the edge weights in the observed network. As we can see, most edges have a weight of 1 or 2 over the 19 years. In other words, most ties among the offices are relatively weak, and such flows are more occasional than regular. Figure 2a shows that the degree distribution based on the unweighted version of job flows is heavily skewed to the right, which is one of the features of hierarchical random graphs. Figures 2b–e are degree distributions of the theoretical models, with HRGM closely resembling the observed pattern. Table 2 (see footnote 2) reports network properties of the observed hierarchical network and those of selected network models.
Table 1. Edge weight distribution of job flows
Weight     1     2    3    4    5   6   7   8   9   10  11  12  13  14  15
Frequency  4930  963  386  205  97  61  35  28  20  3   9   7   6   1   2
The job flows column in Table 2 reports network properties of our observed network, including the number of nodes (N of nodes), number of edges (N of edges), average degree (Avg. degree), global clustering coefficient (GCC), local clustering coefficient (LCC), number of components (N of comp.), and average path length (APL) [6]. Other columns report the same properties for the selected models. All graphs have the same number of nodes by construction. Note that the row “N of comp.” is the number of components of the graph, where “N of comp. = 1” refers to one giant component with all vertices being connected. Here, random gnp, small world and scale free are connected graphs, and job flows and HRGM are composed of multiple components. We only report APL in the largest component because of the space limit.
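The properties listed above can be computed with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions, not the authors' code: GCC is taken as the ratio of closed to all connected triples, LCC as the mean of the node-level clustering coefficients with nodes of degree below 2 counted as 0 (conventions differ between libraries), and nodes without edges are not represented.

```python
from collections import deque
from itertools import combinations

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def linked_pairs(adj, v):
    """Number of pairs of neighbours of v that are themselves connected."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])

def lcc(adj):
    """Mean local clustering coefficient (degree < 2 counted as 0)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k >= 2:
            total += 2 * linked_pairs(adj, v) / (k * (k - 1))
    return total / len(adj)

def gcc(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = sum(linked_pairs(adj, v) for v in adj)
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return closed / triples if triples else 0.0

def bfs_dist(adj, s):
    dist, queue = {s: 0}, deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def components(adj):
    seen, comps = set(), []
    for v in adj:
        if v not in seen:
            comp = set(bfs_dist(adj, v))
            seen |= comp
            comps.append(comp)
    return comps

def apl_largest(adj):
    """Average shortest path length within the largest component."""
    big = max(components(adj), key=len)
    if len(big) < 2:
        return 0.0
    total = sum(d for s in big for d in bfs_dist(adj, s).values())
    return total / (len(big) * (len(big) - 1))

# Hypothetical toy graph: a triangle with a pendant node, plus a separate edge.
adj = adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")])
```

On the toy graph, the triangle drives both clustering coefficients, while the separate edge E–F adds a second component that does not enter the APL.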
Fig. 1. Visualization of the observed hierarchical network job flows. Edges in blue (red) indicate cross-jurisdiction flow (within-jurisdiction flow)
Table 2. Properties of the observed network and selected network models
             job flows  random gnp  small world  scale free  HRGM
N of nodes   1678       1678        1678         1678        1678
N of edges   6753       6729        13424        1677        6832
Avg. degree  8.049      8.02        16           1.999       8.143
GCC          0.141      0.004       0.029        0           0.018
LCC          0.227      0.004       0.03         0           0.02
N of comp.   3          1           1            1           13
APL          4.398      3.795       2.966        2.004       3.649
An inspection of Table 2 shows that the observed network has higher clustering coefficients (in both GCC and LCC) as well as higher APL than the selected network models. Of the theoretical models, the small-world model and HRGM are closer to our empirical data in key network properties (GCC, LCC, and APL), and these two theoretical models are also more meaningful for hierarchical data. The findings show that personnel flow patterns in job flows are more locally insulated, as the network has a much stronger tendency toward local clustering. At the same time, our network has stronger barriers in reachability, as measured by APL. Note that the number of edges in the simulated small-world model far exceeds that in our empirical network, suggesting that the number of mobility events expected in the small-world model is much larger than observed in our data. random gnp shows GCC and LCC further away from the empirical pattern. scale free fits the empirical network least well, which means that the observed hierarchical network is not organized by the preferential attachment mechanism that underlies the scale-free model.
2 Tables 2, 3, 4 and 5 are rounded to 3 decimal places.
Fig. 2. Degree distributions for job flows and generated models: (a) job flows, (b) random gnp, (c) small world, (d) scale free, (e) HRGM. Each panel plots frequency against degree.
5
Connectivity and Robustness of the Hierarchical Network
Thus far, we have studied the entire hierarchical network based on personnel flows in the multilayered bureaucracy. Our second research question is to evaluate the extent of connectivity and robustness among different parts of the network. Following Onnela et al., we examine changes in network configuration in terms of edge weights in job flows [7]. Edge weights measure the strength of connectivity among different offices and, by extension, among functional areas and administrative jurisdictions. By removing edges successively in order of increasing weight, we can assess the robustness of the network and the properties of new configurations as the original hierarchical network decomposes into different components. Specifically, we are interested in the following issues: (1) How robust are the network properties as we remove weak ties among the offices? (2) What are the characteristics of the functional areas of those offices that are connected by weak ties in the hierarchical network? Recall that Table 1 presents the distribution of edge weights (w) of job flows, ranging from 1 to 15. We evaluate changes in network configuration by successively removing edges with w = 1 to 5. Figure 3a–e show the graphs after removing edges and new isolates in each round. Figure 4 shows the percentage of offices becoming isolates and the percentage of offices that are still connected in the largest component. It appears that weak ties hold up the overall hierarchical network (Fig. 1). As we remove edges from the weakest ones (w = 1) to the relatively stronger ones (w = 5), the empirical network becomes more fragmented, with an increasing number of isolates and components. The offices in each component are more likely to be held together by within-jurisdiction ties than by cross-jurisdiction ties. This is consistent with the earlier finding that the empirical network is more locally clustered, with low fluidity across the overall network.
Fig. 3. Network configuration after successive removal of edges with weight w: (a) w = 1, (b) w ≤ 2, (c) w ≤ 3, (d) w ≤ 4, (e) w ≤ 5. Edges in blue (red) indicate cross-jurisdiction flow (within-jurisdiction flow)
We now take a look at the properties of the networks3 at different levels of weight strength. We adjust parameters for the simulated theoretical models accordingly. As Fig. 4 shows, after removing edges with w = 1, about 23% of the offices become isolates. After removing new isolates, there are 73 components, and about 65% of offices are connected with the largest component. Table 3 reports the network properties of the new configuration and the simulated graphs. The resulting network has properties that are similar to job flows. As Table 4 and Fig. 4 indicate, after removing edges with w ≤ 2, about 47% of the offices are isolates, and about 10% of offices are still hanging together in the largest component. The network is fragmented into 155 components (excluding new isolates). In this network configuration, the observed network shows higher clustering and higher fluidity (i.e., lower APL) in the
largest component than the small-world model or the hierarchical random graph model. Table 5 and Fig. 4 show the network information after removing edges with w ≤ 3. About 64% of nodes are isolated, and only 2% of the offices are still in the largest component. We observe a similar pattern as in the previous round. These findings suggest that weak ties play an important role in the configuration of the hierarchical network. Moreover, as the hierarchical network becomes fragmented when weak ties are taken away, the resulting network properties also change. In the analysis above, we only focused on the largest connected component, as other components quickly become fragmented with the increasing removal of edges.
3 Because of the space limit, we only study the networks that drop edges with weights w ≤ 1, 2, 3.
Fig. 4. Percentage of nodes as new isolates (black) and percentage of nodes in the largest component (red) after removing edges with weight w ≤ i
Table 3. Properties for graphs after removing w = 1 edges and isolates
             job flows  random gnp  small world  scale free  HRGM
N of nodes   1286       1286        1286         1286        1286
N of edges   1823       1817        2572         1285        1793
Avg. degree  2.835      2.826       4            1.998       2.788
GCC          0.147      0.002       0.096        0           0.017
LCC          0.316      0.002       0.112        0           0.031
N of comp.   73         96          1            1           144
APL          8.6        6.629       5.981        2.253       5.957
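The successive removal of weak ties can be sketched as follows. The first part is a worked check that uses only numbers reported in Table 1 and Tables 3 to 5; the second part is a hypothetical helper (its name and the toy graph are assumptions, not the authors' code) that computes the share of isolates and the share of nodes in the largest component after dropping all edges with weight at most a given threshold.

```python
n_offices, n_edges = 1678, 6753
weak = {1: 4930, 2: 963, 3: 386}          # edge counts by weight (Table 1)
nodes_left = {1: 1286, 2: 883, 3: 597}    # non-isolated nodes (Tables 3-5)

edges_left, isolate_share = {}, {}
remaining = n_edges
for w in (1, 2, 3):
    remaining -= weak[w]                  # edges surviving removal of weights <= w
    edges_left[w] = remaining
    isolate_share[w] = round(1 - nodes_left[w] / n_offices, 2)

def fragmentation(weights, threshold, all_nodes):
    """Share of isolates and share of nodes in the largest component
    after removing every edge with weight <= threshold."""
    adj = {}
    for (u, v), w in weights.items():
        if w > threshold:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    largest, seen = 0, set()
    for s in adj:
        if s in seen:
            continue
        comp, stack = {s}, [s]
        while stack:                       # depth-first search of one component
            u = stack.pop()
            for x in adj[u] - comp:
                comp.add(x)
                stack.append(x)
        seen |= comp
        largest = max(largest, len(comp))
    n = len(all_nodes)
    return (n - len(adj)) / n, largest / n

# Hypothetical toy network, not taken from the data.
toy = {("A", "B"): 1, ("B", "C"): 4, ("C", "D"): 2, ("D", "E"): 5}
```

The surviving edge counts (1823, 860, 474) agree with the N of edges rows for job flows in Tables 3 to 5, and the isolate shares (0.23, 0.47, 0.64) agree with the percentages quoted in the text.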
Next, we look at the characteristics of those offices that serve as the weak ties to hold the network together. In particular, we are interested in the functional areas to which these offices belong. That is, we highlight those key functional areas that play a central role in maintaining the network.
Table 4. Properties for graphs after removing w ≤ 2 edges and isolates
             job flows  random gnp  small world  scale free  HRGM
N of nodes   883        883         883          883         883
N of edges   860        862         883          882         859
Avg. degree  1.948      1.952       2            1.998       1.946
GCC          0.153      0           0            0           0.061
LCC          0.299      0           0            0           0.088
N of comp.   155        144         37           1           182
APL          6.34       9.185       21.468       3.387       8.266
Table 5. Properties for graphs after removing w ≤ 3 edges and isolates
             job flows  random gnp  small world  scale free  HRGM
N of nodes   597        597         597          597         597
N of edges   474        455         597          596         478
Avg. degree  1.588      1.524       2            1.997       1.601
GCC          0.13       0           0            0           0.064
LCC          0.241      0           0            0           0.088
N of comp.   158        165         12           1           165
APL          3.969      11.27       26.037       2.286       10.278
Table 6 shows the top five functional areas of those offices having the most flow events at each level of edge weight. The number in parentheses is the average number of flows for the offices in that area at the particular level of edge weight. “party HQ” and “govt HQ” are the two top leadership areas in each jurisdiction. The “LPC” (Local People’s Congress) and “CPPCC” (Chinese People’s Political Consultative Conference) are the two legislative bodies in the jurisdiction. These four areas include the highest offices in each administrative jurisdiction. In addition, the “economy” area and the “party office” area are the other two functional areas on the list. These findings indicate that these high-status offices, including those in the economy and party management areas, play a central role in integrating different parts of the hierarchy. As we successively remove edges, some offices become isolates. It is interesting to find out which functional areas are prone to be isolates because they provide information on how insulated or closed these functional areas are in the multilayered bureaucracy. Table 7 lists the top five functional areas to which those office isolates belong; the share of offices as isolates in that area is given in parentheses. Interestingly, those areas with most isolated offices are all related to local affairs – “economy”, “regulatory”, “educ/health”. Even “party office” tends to be localized for edges with weight 1. For example, as edges with weight 1 are removed, 35% of the “economy” area becomes isolated from other parts of the network. Offices in the party and government HQs are also likely to become isolates as edges with weight 2 are removed. In other words, even those offices in these central and high-status functional areas tend to become unconnected to the bureaucratic hierarchy after the removal of a few edges that link them together, indicating that local affairs in the administrative jurisdictions tend to be insulated from changes in other parts of the personnel flow network.
Table 6. Top five functional areas to which offices belong on edges with weight w
w  Functional areas
1  govt HQ (12.98)  economy (4.74)  party HQ (10.8)  party office (5.44)  CPPCC (6.28)
2  govt HQ (3.03)   economy (0.78)  party HQ (2.04)  CPPCC (2)            party office (0.9)
3  govt HQ (1.23)   CPPCC (1.04)    party HQ (0.77)  economy (0.28)       LPC (0.61)
Table 7. Top five functional areas to which new isolates belong after removing edges with weight w
w  Functional areas
1  economy (0.35)  regulatory (0.43)  educ/health (0.39)  govt other (0.47)  party office (0.23)
2  economy (0.21)  CPPCC (0.36)       govt HQ (0.27)      party HQ (0.27)    LPC (0.25)
3  economy (0.18)  CPPCC (0.25)       LPC (0.23)          govt HQ (0.21)     govt other (0.20)
6
Conclusion and Future Work
In this study, we applied selected network models to evaluate a hierarchical network based on personnel flow in a multilayered bureaucracy and examined the robustness of connectivity in the observed hierarchical network. We summarize the main findings as follows. First, the empirical hierarchical network has a higher level of local clustering and a lower level of fluidity of flow, compared with those properties in the small-world model and the HRGM, indicating strong social closure locally and relatively weaker connectivity across administrative jurisdictions. Second, our hierarchical network is held together by weak ties, as measured by occasional rather than regular personnel flow among many offices. After removing edges with small edge weights, the network is quickly fragmented into many components and isolates. The observed hierarchical network consists of a large number of self-sustained local networks.
Third, our investigation of those weak ties shows that offices in the central functional areas (i.e., party HQs, government HQs, and legislative bodies) provide critical links among offices. Many offices and areas in the local jurisdiction become detached from the hierarchical network once weak ties are removed. These features suggest that rather than a well-integrated hierarchy, the personnel flow network is vulnerable to local insulation and fragmentation. Future work will address issues such as directed networks and changes over time, with refined analyses of network patterns across the hierarchical levels. We will explore better techniques for evaluating weighted hierarchical networks [2]. The examination of weak ties also needs to take into consideration other components in the network beyond the largest connected component.
References
1. Albert, R., Barabási, A.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47–97 (2002). https://doi.org/10.1103/revmodphys.74.47
2. Allen, D., Lu, T., Huber, D., Moon, H.: Hierarchical Random Graphs for Networks with Weighted Edges and Multiple Edge Attributes (2011)
3. Clauset, A., Moore, C., Newman, M.E.: Hierarchical structure and the prediction of missing links in networks. Nature 453(7191), 98–101 (2008). https://doi.org/10.1038/nature06830
4. Erdős, P., Rényi, A.: On Random Graphs. I. Publicationes Mathematicae 6, 290–297 (1959)
5. Granovetter, M.S.: The strength of weak ties. Am. J. Sociol. 78(6), 1360–1380 (1973). https://doi.org/10.1086/225469
6. Newman, M.E.: Networks: An Introduction. Oxford University Press, Oxford (2010)
7. Onnela, J., Saramäki, J., Hyvönen, J., Szabó, G., Lazer, D., Kaski, K., Kertész, J., Barabási, A.: Structure and tie strengths in mobile communication networks. Proc. Natl. Acad. Sci. 104(18), 7332–7336 (2007). https://doi.org/10.1073/pnas.0610245104
8. Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440–442 (1998). https://doi.org/10.1038/30918
GrowHON: A Scalable Algorithm for Growing Higher-order Networks of Sequences
Steven J. Krieg(B), Peter M. Kogge, and Nitesh V. Chawla
University of Notre Dame, Notre Dame, IN 46556, USA {skrieg,kogge,nchawla}@nd.edu
Abstract. Networks are powerful and flexible structures for expressing relationships between entities, but in traditional network models an edge can only represent a relationship between a single pair of entities. Higher-order networks (HONs) overcome this limitation by allowing each node to represent a sequence of entities, which allows edges to naturally express higher-order relationships. While HONs have proven effective in several domains, previous works have been forced to choose between scalability and thorough detection of higher-order dependencies. To achieve both scalability and accurate encoding of higher-order dependencies, we introduce GrowHON, an algorithm that generates a HON by embedding the sequence input in a tree structure, pruning the non-meaningful sequences, and converting the tree into a network (Code available at https://github.com/sjkrieg/growhon). We demonstrate that GrowHON is scalable with respect to both the size of the input and order of the network, and that it preserves important higher-order dependencies that are not captured by prior approaches. These contributions lay an important foundation for higher-order networks to continue to evolve and represent larger and more complex data.
Keywords: Higher order networks · Sequence mining · Graph models
1
Introduction
Networks are powerful and flexible structures for expressing relationships between entities. However, some relationships are too complex to be represented by traditional network representations, which typically rely on a first-order representation. In this first-order network (FON) representation, each entity in a sequence (typically a set of events or states that are ordered by time) is represented by a single node, with an edge connecting each pair of entities that are adjacent in the sequence. The resulting network assumes the first-order Markov property, which means that at a given state, all necessary information about that state is available locally. For example, in Fig. 1, a random walker that arrives at node 2 could transition to nodes 3 or 4. But the underlying sequences exhibit higher-order dependencies: 2 is followed by 3 only when preceded by 1, and followed by 4 when preceded by 3. This vital information is lost during network construction. Several recent works have shown that this limitation of FONs is problematic in a number of domains, including the study of user behaviors [3], citation networks [18], trade relations [9], human mobility and navigation patterns [14,22], information networks [21], the spread of invasive species [24], anomaly detection [20], and others [10,14].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 485–496, 2021. https://doi.org/10.1007/978-3-030-65351-4_39
Fig. 1. A toy example of the differences between a FON and HON representation of a set of sequences.
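The loss of information in a first-order representation can be illustrated with a short sketch. The two sequences below are hypothetical: they were chosen only to reproduce the dependency described in the text (2 is followed by 3 after 1, and by 4 after 3) and are not necessarily the sequences behind Fig. 1. The function name is likewise an assumption.

```python
from collections import Counter, defaultdict

def transition_counts(sequences, order):
    """Count next-entity occurrences conditioned on the previous `order` entities."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(order, len(seq)):
            context = tuple(seq[i - order:i])
            counts[context][seq[i]] += 1
    return counts

sequences = [[1, 2, 3], [3, 2, 4]]           # hypothetical toy data
first = transition_counts(sequences, 1)      # first-order view: context is the current entity
second = transition_counts(sequences, 2)     # second-order view: context includes the predecessor
```

In the first-order counts, entity 2 leads to 3 and to 4 equally often, whereas the second-order counts separate the two cases by the entity that preceded 2.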
A higher-order network (HON) representation, introduced as BuildHON by Xu et al. [24], is a solution to this problem that seeks to preserve dependencies by allowing each node to represent a variable-length sequence of entities, rather than a single entity. For example, in Fig. 1, node 3|2 (read as 3 given 2) represents the current state 3 conditioned on the prior state 2. This distinguishes it from node 3, and thus allows the network to more fully represent the underlying statistical patterns. The BuildHON framework is flexible because it allows nodes of varying order to coexist, and generalizable because it allows for existing network analysis tools, like clustering and random walking, to be used without modification. Finally and perhaps most importantly, a HON seeks to only preserve higher-order patterns that are statistically significant, which helps control the size of the network and prevent overfitting. The trade-offs for the increased representative accuracy of such a HON are increased network size and the cost of network generation, which becomes computationally expensive as the order of the network and size of the input increases. This paper introduces GrowHON, an algorithm that offers two main advantages over the algorithms utilized in prior work (BuildHON/BuildHON+) [20,24]: 1. Increased scalability with respect to order and input size. By embedding sequences in a tree, GrowHON avoids repeated searches through the input and enables efficient computation. 2. More thorough detection of dependencies at higher orders. By testing higherorder nodes at first, GrowHON is able to preserve important sequences that are missed by other methods. The rest of the paper proceeds as follows. First we survey related work (Sect. 2). Next we introduce the problem of HON generation and present GrowHON
(Sect. 3). Then we experimentally demonstrate GrowHON's increased scalability and ability to detect dependencies at higher orders (Sect. 4). Finally we conclude and discuss opportunities for future work (Sect. 5).
2 Related Work
Several recent works share the conclusion that first-order Markov models are an inadequate representation of many real-world processes, which often exhibit complex spatiotemporal dependencies. While a multitude of approaches exist, we focus our discussion on studies that utilize network structures to model these dependencies alongside other patterns of connectivity between entities [15]. These works span many domains, including the study of user behaviors [3], citation networks [18], trade relations [9], human mobility and navigation patterns [14,22], information networks [21], the spread of invasive species [24], anomaly detection [20], and more [10,14]. These works agree on the need to reach beyond the limits of first-order networks, but differ in methodological approach. We build on the model of Xu et al., which encodes dependencies of variable order into a single-layered network structure [20,24]. Compared with multi-layered networks [21] and models that rely on supplemental higher-order path information [14,18], this approach has the advantage that existing network analysis tools can be utilized without modification. Additionally, rather than inferring a fixed order for the entire model (which can produce a model that is overfit on some sequences and underfit on others), it allows nodes of variable order to coexist in the same space. Xu et al. accomplished this by testing each higher-order pattern and only preserving those that were expected to reduce the entropy of the network. Saebi et al. noted that this approach was combinatorially explosive and developed a lazy algorithm that starts with lower-order patterns and seeks to extend them one entity at a time [20]. This method, BuildHON+, successfully mitigated the combinatorial problem, but is still limited (by repeated searches through the input and lazy pattern testing) in its scalability and ability to thoroughly detect dependencies at higher orders.
We additionally distinguish our task of higher-order representation from that of higher-order network analysis, which uses substructures like motifs [11] or graphlets [16] to model statistically significant patterns of connectivity between nodes. Higher-order analysis is important and has sparked new approaches to clustering [1,23,25], representation learning [17], and more, but is ultimately concerned with analyzing an existing network. In this work, however, we are concerned with the upstream task of creating a network—one that accurately represents the underlying data and thus enables meaningful analysis.
3 Methods

3.1 Problem Setting
We define a sequence $s = \langle a_0, a_1, \dots, a_n \rangle$ as an ordered collection of elements. Each $a_i \in s$ represents a discrete state or entity, and each adjacent pair $a_i$ and $a_{i+1}$ represents a transition from $a_i$ to $a_{i+1}$. We call $s' = \langle a'_0, a'_1, \dots, a'_m \rangle$ a substring¹ of $s$, denoted as $s' \sqsubseteq s$, if and only if all the elements in $s'$ also appear in $s$ in exactly the same order, i.e.

$$ s' \sqsubseteq s \iff \exists j : \forall a'_i \in s',\ a'_i = a_{i+j}. \tag{1} $$
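As an illustration of Eq. (1), the following minimal Python sketch checks whether one list occurs contiguously inside another. The function name `is_substring` and the toy sequence are assumptions made for this example and are not taken from the GrowHON code base.

```python
from typing import Sequence


def is_substring(sub: Sequence, seq: Sequence) -> bool:
    """Return True if `sub` occurs in `seq` as a contiguous run (Eq. 1)."""
    m, n = len(sub), len(seq)
    # Try every offset j and compare the aligned window with `sub`.
    return any(list(seq[j:j + m]) == list(sub) for j in range(n - m + 1))


s = [1, 2, 3, 1, 2, 4]
print(is_substring([2, 3, 1], s))  # True: the elements are adjacent in s
print(is_substring([1, 3], s))     # False: a subsequence, but not a substring
```

The second call shows the distinction between a substring and a subsequence: 1 and 3 both appear in order, but not adjacently.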
Fig. 2. An overview of GrowHON on toy data using k = 3. Nodes are colored according to their destination state.
In a FON $G_1 = (V_1, E_1)$, a set of sequences $S = \{s_0, s_1, \dots, s_N\}$ is represented by a set of nodes, $V_1$, and a set of edges, $E_1$. Each entity in $\{\bigcup S\}$ maps to exactly one node in $V_1$, and two nodes $u$ and $v$ are joined by a directed edge $u \to v$ if $u$ precedes $v$ in some sequence $s \in S$, i.e. $\langle u, v \rangle \sqsubseteq s$. Edges are additionally weighted such that $w(u, v)$ is the number of times $\langle u, v \rangle$ appears across all sequences in $S$. Statistically significant patterns involving more than two entities, which we call higher-order dependencies, are thus lost in the construction of $G_1$. A HON, on the other hand, seeks to preserve these dependencies. Let $G_k = (V_k, E_k)$ be a HON such that $k$ is the maximum order, or the amount of history that can be encoded by each node. The set of possible nodes, $V_k$, is the set of all substrings of length $k + 1$ or less:

$$ V_k = \bigcup_{m=1}^{k+1} \{\langle a_{i-m+1}, \dots, a_i \rangle : (\exists s \in S) \wedge (\exists i > 0),\ \langle a_{i-m+1}, \dots, a_i \rangle \sqsubseteq s\}. \tag{2} $$
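The two constructions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the helper names `first_order_edges` and `candidate_nodes` and the toy input `S` are hypothetical, and substrings are stored as tuples with the oldest state first.

```python
from collections import Counter


def first_order_edges(sequences):
    """Weighted FON edges: w(u, v) counts how often u is directly followed by v."""
    weights = Counter()
    for s in sequences:
        for u, v in zip(s, s[1:]):
            weights[(u, v)] += 1
    return weights


def candidate_nodes(sequences, k):
    """All observed substrings of length 1..k+1, i.e. the candidate set V_k of Eq. (2)."""
    nodes = set()
    for s in sequences:
        for m in range(1, k + 2):
            for i in range(len(s) - m + 1):
                nodes.add(tuple(s[i:i + m]))
    return nodes


S = [[1, 2, 3, 1], [4, 2, 3, 4]]
print(first_order_edges(S))
print(sorted(candidate_nodes(S, k=1)))
```

In this toy input the edge (2, 3) receives weight 2, but the FON cannot record that 3 is followed by 1 after the prefix 1, 2 and by 4 after the prefix 4, 2; the candidate set for k = 2 retains that information.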
In practice, we encode nodes in the following form in order to emphasize that the destination state $a_m$ is conditioned on a sequence of prior states $a_{m-1}, \dots, a_0$:

$$ a_m | a_{m-1} | \dots | a_1 | a_0 \equiv \langle a_0, \dots, a_{m-1}, a_m \rangle. \tag{3} $$

¹ We use the term "substring" instead of "subsequence" because the formal definition of a subsequence allows for intermediate elements to be removed, and thus does not preserve adjacency from the original sequence.
In both cases, $a_0$ is the earliest element in the original sequence and $a_m$ is the latest. For example, node 3|2 in Figs. 1 and 2 represents the substring $\langle 2, 3 \rangle$. This definition of a node allows for edges in $E_k$, which are defined in the same way as in $E_1$ but generalized to allow for the fact that $u$ and $v$ are substrings, to naturally represent higher-order interactions between nodes. For example, the edge 3|2 → 1 represents the substring $\langle 2, 3, 1 \rangle$ that cannot be represented by a single edge in $E_1$. We define a node's order as the number of states (current and prior) it represents. A HON can contain nodes with variable order, so we use cardinality to denote the order of a node, such that for a given node $u = a_m|a_{m-1}|\dots|a_0$, we have $|u| = m + 1$. This heterogeneity with respect to node order means that there are many possible representations of a given sequence. For example, 1 → 2 and 1 → 2|1 are both valid representations of $\langle 1, 2 \rangle$. The key objective in generating a HON is to produce a representation, i.e. a final set of nodes $V_k$ and edges $E_k$, that preserves the statistically significant higher-order sequences in a generalizable (avoids overfitting) and concise form. GrowHON accomplishes this through three phases:
1. Grow a sequence tree, which is a compact and connected representation of the original sequences that enables efficient pruning and extraction.
2. Prune statistically insignificant sequences. Pruning is performed in-place and top-down (with respect to order) in order to ensure all higher-order dependencies are preserved.
3. Extract sequences by converting them to edges in a manner that preserves the topological integrity of the network.
The following sections detail each of the three phases, and Fig. 2 illustrates the output of each phase on a toy data set.

3.2 Phase 1: Grow
Grow processes each sequence by embedding the observed substrings of length $m \le k + 1$ as branches in a tree. For example, in Fig. 2, the first substring in $s_1$, $\langle 1, 2, 3, 1 \rangle$, produces the leftmost branch on the grown tree: 1 → 2|1 → 3|2|1 → 1|3|2|1. Algorithm 1 details the procedure, which is somewhat similar to growing a frequent-pattern tree (FP-tree) [6]. The main benefit to this structure is that it is very efficient to compute transition probability distributions during the prune phase, since all of a node's outgoing neighbors are represented as its children in the tree. Alongside the tree, we utilize a hash table called the nmap (node map) to store object references and thus enable efficient node lookups. For a given node $u = a_m|a_{m-1}|\dots|a_0$, we call any node $u' = a_m|a_{m-1}|\dots|a_z$ for some $0 < z < m$ a lower-order counterpart of $u$, since it represents the same destination state $a_m$ but with $z$ fewer prior states. We express this relationship using the logical shift operator $\gg$:

$$ a_m|a_{m-1}|\dots|a_0 \gg z = a_m|a_{m-1}|\dots|a_z. \tag{4} $$
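The node encoding of Eq. (3) and the shift operator of Eq. (4) can be illustrated with two small helpers. This is a sketch under assumptions: the names `encode` and `shift` are hypothetical, a node is stored as a tuple with the oldest state first, and the bound on z follows the condition 0 < z < |u| used for lower-order counterparts in the text.

```python
def encode(states):
    """Render a substring (oldest state first) in the a_m|...|a_0 form of Eq. (3)."""
    return "|".join(str(a) for a in reversed(states))


def shift(states, z=1):
    """Lower-order counterpart of Eq. (4): drop the z oldest prior states."""
    if not 0 < z < len(states):
        raise ValueError("z must satisfy 0 < z < order of the node")
    return states[z:]


u = (1, 2, 3)               # the substring <1, 2, 3>
print(encode(u))            # 3|2|1
print(encode(shift(u)))     # 3|2
print(encode(shift(u, 2)))  # 3
```

Storing the oldest state first makes the shift a plain slice, and the shifted tuple can be used directly as a key into a node map.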
Algorithm 1. GrowHON Phase 1: Grow
 1: function Grow(S, k)
 2:   Q ← an empty queue with length k + 1
 3:   t.root ← a dummy node
 4:   t.nmap ← an empty hash table
 5:   for each sequence s in S do
 6:     Prime Q with the first k + 1 elements in s
 7:     for each remaining element a in s do
 8:       parent ← t.root
 9:       for each element u in Q do
10:         if u in parent.children then
11:           child ← parent.GetChild(u)      ▷ Retrieve child from hash table
12:           child.indeg ← child.indeg + 1
13:         else
14:           child ← parent.AddChild(u)      ▷ Store child in a hash table
15:           child.indeg ← 1
16:           child.outdeg ← 0
17:           t.nmap.Insert(child)
18:         parent.outdeg ← parent.outdeg + 1
19:         parent ← child
20:       Q.pop()
21:       Q.push(a)
22:     Repeat lines 9-20 until Q is empty
23:   return t
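Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the published implementation: the `Node` class, the `grow` function, and the tuple-keyed `nmap` dictionary are assumptions, and the queue handling is rearranged slightly (a bounded `deque` is embedded whenever it is full, then drained at the end of each sequence) while visiting the same windows.

```python
from collections import deque


class Node:
    def __init__(self, key, parent=None):
        self.key = key        # substring represented by this node, oldest state first
        self.parent = parent
        self.children = {}    # next state -> child Node (hash table)
        self.indeg = 0
        self.outdeg = 0


def grow(sequences, k):
    """Embed every window of length <= k + 1 as a branch of a sequence tree."""
    root = Node(())
    nmap = {}                 # substring -> Node, for constant-time lookups

    def embed(window):
        parent = root
        for a in window:
            child = parent.children.get(a)
            if child is None:
                child = Node(parent.key + (a,), parent)
                parent.children[a] = child
                nmap[child.key] = child
            child.indeg += 1
            parent.outdeg += 1
            parent = child

    for s in sequences:
        window = deque(maxlen=k + 1)
        for a in s:
            if len(window) == k + 1:
                embed(window)     # window is full: embed it before sliding
            window.append(a)      # a full deque drops its oldest element here
        while window:             # drain the trailing, shorter windows
            embed(window)
            window.popleft()
    return root, nmap


root, nmap = grow([[1, 2, 3, 1]], k=3)
print(sorted(nmap))               # every embedded substring
print(nmap[(1,)].indeg)           # state 1 is observed twice
```

Because `nmap` is keyed by the substring itself, the lower-order counterpart of a node with key `t` can be fetched as `nmap[t[1:]]` without walking the tree from the root.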
Every higher-order node $u$ has exactly one lower-order counterpart for each $0 < z < |u|$. This is trivially proven by the fact that if there exists some $s \in S$ such that $\langle a_0, a_1, \dots, a_m \rangle \sqsubseteq s$, then $\langle a_z, a_{z+1}, \dots, a_m \rangle \sqsubseteq s$. This means that we can look up a node's lower-order counterpart in constant time using the nmap, rather than traversing the tree from the root, which could require up to $k$ lookups.

3.3 Phase 2: Prune
Given a fully grown tree, Prune decides which nodes to preserve in the HON. Algorithm 2 details the procedure. We follow Saebi et al. in seeking to preserve a given node $u$ if the Kullback-Leibler divergence (relative entropy) of its outgoing transition probabilities, measured with respect to its lower-order counterpart $u' = u \gg 1$, exceeds a threshold function $f$ [20]. We define both as follows:

$$ D_{KL}(u \,\|\, u \gg 1) = \sum_{v \in N(u)} P(u \to v) \log_2 \frac{P(u \to v)}{P(u \gg 1 \to v \gg 1)}, \tag{5} $$

$$ f(u, \tau) = \frac{\tau |u|}{\log_2(1 + \mathrm{indeg}(u))}, \tag{6} $$

where $N(u)$ represents the set of outgoing neighbors of $u$, or its children in the grown tree, $P(u \to v) = \frac{w(u,v)}{\mathrm{outdeg}(u)}$ represents the probability of transition from node $u$ to node $v$, and $\tau$ is a free parameter (we use a default of 1.0). $f$ increases monotonically as $\tau$ and $|u|$ increase and decreases monotonically as $\mathrm{indeg}(u)$ increases, which helps control for the fact that $D_{KL}$ is biased at higher orders, where edge weights are sparser and transition probabilities are noisier.

The key advantage afforded by Prune is that higher-order nodes are tested first. This means that when a higher-order node is preserved, we can also ensure that its ancestors are preserved (Algorithm 2, line 5); otherwise it may have no in-edges in the resulting network. Previous approaches, limited by computational complexity, tested higher-order dependencies in bottom-up fashion [20,24]. This bottom-up testing implicitly assumes that a node $u$ can only have a dependency if $u \gg 1$ also has a dependency, which is often not the case.

Algorithm 2. GrowHON Phase 2: Prune
 1: function Prune(t, τ)
 2:   for i ← t.height − 1 down to 1 do
 3:     for each node u in t where |u| = i do
 4:       if u.marked or D_KL(u ‖ u ≫ 1) > f(u, τ) then
 5:         u.parent.marked ← true            ▷ Ensure ancestors are preserved
 6:       else
 7:         u.outdeg ← 0
 8:         for each child c in u.children do
 9:           c.indeg ← 0

3.4 Phase 3: Extract

Algorithm 3. GrowHON Phase 3: Extract
 1: function Extract(t)
 2:   E ← an empty edgelist
 3:   Extract-Helper(E, t.root)
 4:   return E
 5: function Extract-Helper(E, u)
 6:   if u.indeg > 0 then
 7:     if |u| > 1 then                       ▷ Nodes at level 1 cannot be destinations
 8:       v ← u
 9:       while v.outdeg = 0 and |v| > 1 do   ▷ Handle the dead-end (leaf) case
10:         v ← v ≫ 1
11:       E.insert(u.parent, v, u.indeg)
12:   for child in u.children do
13:     Extract-Helper(E, child)
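Two small pieces of Algorithms 2 and 3 can be sketched in isolation: the significance test of Eqs. (5) and (6), and the dead-end redirect of Algorithm 3 (lines 8-10). The sketch below is illustrative only. The function names, the use of raw next-state counts as input, and the toy numbers are assumptions, and nodes are represented as tuples with the oldest state first so that the shift by one is the slice `v[1:]`.

```python
from math import log2


def kl_divergence(p_counts, q_counts):
    """D_KL of Eq. (5); both arguments map a next state to an observed count."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    total = 0.0
    for state, count in p_counts.items():
        p = count / p_total
        # Assumes the lower-order counterpart has also seen this next state,
        # which holds when both count tables come from the same sequence tree.
        q = q_counts[state] / q_total
        total += p * log2(p / q)
    return total


def threshold(order, indeg, tau=1.0):
    """f(u, tau) of Eq. (6); assumes indeg >= 1, as for any node in the tree."""
    return tau * order / log2(1 + indeg)


def redirect(node, outdeg):
    """Shift a dead-end destination to a lower-order counterpart with out-edges."""
    v = node
    while outdeg.get(v, 0) == 0 and len(v) > 1:
        v = v[1:]
    return v


# Hypothetical counts: after <2, 3> the next state is always 1,
# whereas after <3> alone the next state is 1 or 4 equally often.
p = {1: 4}
q = {1: 4, 4: 4}
d = kl_divergence(p, q)
f = threshold(order=2, indeg=4)
print(d, f, d > f)   # the node 3|2 would be preserved in this toy case

print(redirect((1, 2, 3), {(1, 2, 3): 0, (2, 3): 0, (3,): 5}))
```

In the toy case the divergence is exactly one bit, which exceeds the threshold 2 / log2(5), so the second-order node would be kept. The redirect example falls back through 3|2 to the first-order node 3 because neither higher-order node has out-edges.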
Table 1. A summary of the data sets used for evaluation. $N$ and $\bar{n}$ represent the number of sequences and average sequence length, respectively. The Airport data served as the source for five synthetic sets of sequences.

Name        N         n̄        N·n̄                          |V1|    |E1|      |E1|/|V1|
Shipping    54,892    151.15   8,296,770                    5,590   369,965   66.18
T2D (ICD)   913,475   39.18    35,792,618                   914     481,020   526.28
T2D (CCS)   913,475   31.54    28,809,027                   304     85,025    279.69
Airport     -         -        {1,2,3,4,5} × 100,000,000    1,922   31,491    16.38

The final step is to convert the pruned tree into a HON. Extract, detailed in Algorithm 3, recursively traverses the pruned tree and converts each tree node into an edge in the HON. Because each tree node represents the final element in a substring, the parent/child relationship is directly converted to a
source/destination relationship in the HON for all non-leaf nodes. When a leaf node is detected, Extract attempts to redirect the edge to a lower-order counterpart. This seeks to maximize the preserved higher-order information while ensuring that tree leaves do not produce nodes with no out-edges, which would disrupt the flow of information.

3.5 Asymptotic Complexity
Grow requires $n$ queue pushes and $n$ pops for a sequence of $n$ elements. For each substring of length $k + 1$, each element induces either a weight increment (for an existing child) or the insertion of a new child, both of which are constant-time operations². In total, Grow considers $n - k$ substrings with length $k + 1$ and $k$ substrings with length $< k + 1$ (Algorithm 1, line 22). The time complexity of processing a single sequence is thus bounded by $O(kn)$. If there are $N$ total sequences with an average length of $\bar{n}$, then the function's complexity is bounded by $O(kN\bar{n})$. Prune requires a constant number of operations for each tree node, since each node above order 1 could be considered in a calculation of $D_{KL}$, and each node below order $k$ will be looked up once in nmap. Extract similarly requires a constant number of operations, on average, for each node. If a leaf node $u$ requires more than one lookup to find an appropriate lower-order destination, that means $u \gg 1$ was pruned and will be skipped by Extract. Thus the complexity of both Prune and Extract is bounded by the number of nodes in the tree, i.e. $|V_k|$. While it is possible to derive an upper bound for $|V_k|$, GrowHON's complexity will always be dominated by Grow, and is bounded by $O(kN\bar{n})$.
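The per-sequence bound can be made explicit by counting the tree-node visits performed by Grow. The worked equation below is a sketch of that count, assuming a sequence of length $n > k$; the trailing windows have lengths $k, k-1, \dots, 1$.

```latex
\underbrace{(n-k)(k+1)}_{\text{full windows}}
\;+\; \underbrace{\sum_{j=1}^{k} j}_{\text{trailing windows}}
\;=\; (k+1)\left(n - \frac{k}{2}\right)
\;\le\; (k+1)\,n \;=\; O(kn)
```

Summing this count over $N$ sequences with average length $\bar{n}$ gives the stated $O(kN\bar{n})$ bound.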
4 Experimental Results
² In practice, creating a new node takes much longer than a weight increment, so the number of unique substrings has a significant effect on runtime.
³ https://github.com/sjkrieg/growhon

GrowHON is implemented in Python 3.7.3, and all code is publicly available³. We evaluated the scalability of GrowHON and the representation of the resulting networks using eight data sets (three real and five synthetic), each of which is summarized in Table 1. The real data includes the set of global shipping
routes over 15 years (1997–2012) that Xu et al. studied in their seminal HON manuscript [24], and a set of diagnosis sequences for type-2 diabetes (T2D) patients under two distinct mapping schemas: the first using the ninth revision of the International Classification of Diseases (ICD) and the second using the Clinical Classification Software (CCS) taxonomy [7]. Both represent the same set of real patients, but with diagnoses aggregated at different levels of granularity and density. We generated the synthetic data by constructing a first-order network of U.S. airport travel in 2018 based on data from the Bureau of Transportation Statistics [2]. From this network we used a random walker to generate five sets of sequences, each of a different size.
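The text does not specify how the random walker was configured, so the following Python sketch only illustrates the general idea of sampling sequences from a weighted first-order network. The function name, the uniform choice of start node, the fixed walk length, and the toy airport weights are all assumptions made for this example.

```python
import random


def random_walk_sequences(weights, num_sequences, length, seed=0):
    """Sample sequences by walking a weighted, directed first-order network."""
    rng = random.Random(seed)
    out_edges = {}
    for (u, v), w in weights.items():
        targets, ws = out_edges.setdefault(u, ([], []))
        targets.append(v)
        ws.append(w)
    starts = sorted(out_edges)            # assumption: uniform random start node
    sequences = []
    for _ in range(num_sequences):
        node = rng.choice(starts)
        walk = [node]
        # Stop early only if the walker reaches a node with no out-edges.
        while len(walk) < length and node in out_edges:
            targets, ws = out_edges[node]
            node = rng.choices(targets, weights=ws, k=1)[0]
            walk.append(node)
        sequences.append(walk)
    return sequences


# Hypothetical edge weights, not taken from the Bureau of Transportation Statistics data.
toy_weights = {("ORD", "JFK"): 3, ("JFK", "ORD"): 2, ("ORD", "SFO"): 1, ("SFO", "ORD"): 1}
for seq in random_walk_sequences(toy_weights, num_sequences=3, length=5, seed=42):
    print(seq)
```

Because each step depends only on the current node, sequences generated this way contain no higher-order dependencies beyond those implied by the first-order weights.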
Fig. 3. Execution times, measured as wallclock time, of each algorithm on all data sets using k = 2..5. The reported values are the means of 10 iterations, and error bars represent standard deviations. Panels: execution time as a function of $k$ for all real data sets; execution time as a function of input size ($N\bar{n}$) for the Airport data.
Table 2. Differences between network sizes when k = 5.

Algorithm    Shipping |V5|   Shipping |E5|   T2D (ICD) |V5|   T2D (ICD) |E5|   T2D (CCS) |V5|   T2D (CCS) |E5|
BuildHON+    2,010,511       5,937,933       17,918,723       50,272,035       11,620,878       36,423,805
GrowHON      2,596,214       6,893,689       21,683,241       54,378,822       15,285,534       40,461,838

Fig. 4. Distribution of nodes by order for the networks produced by both algorithms.

Fig. 5. Results of using a random walker to reproduce the original sequences. Each plot shows weighted Jaccard similarity ($J_W$) as a function of substring length (i.e. number of steps taken by the random walker). The reported values are the means of 10 samples, and standard deviations were < .0001 in all cases.

Figure 3 shows the execution times of each algorithm using k = 2..5 for 10 iterations on each data set. All Shipping and Airport experiments utilized Intel Xeon E5-2680 v3 @2.50GHz CPUs, and all T2D experiments (due to data sensitivity) utilized Intel Xeon E5-2686 v4 @2.30GHz CPUs. GrowHON was faster than BuildHON+ in all experiments, in many cases by almost an order of magnitude, especially at higher k. Perhaps more importantly, GrowHON demonstrates greater scalability, with respect to both k and input size. In addition to the base version of GrowHON, we used Ray, a Python framework for efficient distributed computation [12], to implement a parallel version of Grow. This modified version utilizes a driver for growing the tree and a group of workers (four, in this case) for enumerating substrings from the input sequences. Despite Ray's efficiency, our experiments show that, when k > 3,
the combinatorial explosion of substrings causes the cost of message passing to outweigh its benefits. Table 2 shows the difference in network sizes at k = 5, and Fig. 4 visualizes the distributions of nodes by order. These differences are attributable to GrowHON's top-down pruning procedure, which allowed it to preserve higher-order dependencies that BuildHON+'s lazy testing did not capture. To quantify the effect these additional nodes have on the representational quality, we utilized a random walker to generate samples of synthetic sequences from each network. Since random walking is core to many network applications like PageRank [13], community detection [5], and network embedding [4,19], it is critical that such walks reflect real patterns in the underlying data. We compared each synthetic sample to the real sequences using weighted Jaccard (Ruzicka) similarity $J_W$, which returns a score between 0.0 (sets are totally disjoint) and 1.0 (sets are identical with respect to both membership and frequency) [8]. Figure 5 shows the results for m = 2..6 (substring length) on a FON ($G_1$) and the HONs generated with k = 5 ($G_5$) by both algorithms. At m = 2, where a substring is an edge in $E_1$, all models performed similarly. As m increased, $G_1$'s performance deteriorated quickly. While BuildHON+'s $G_5$ significantly outperformed $G_1$ at all m > 2, GrowHON's $G_5$ was best able to reproduce the sets of longer substrings. This supports the conclusion that the additional higher-order dependencies preserved by GrowHON are important in representing the underlying data.
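The weighted Jaccard (Ruzicka) comparison can be sketched as follows. This is a minimal illustration, assuming that the two samples are compared through raw counts of their length-m substrings; the helper names and the toy sequences are hypothetical, and the evaluation reported above may normalize frequencies differently.

```python
from collections import Counter


def substring_counts(sequences, m):
    """Count every contiguous substring of length m across the sequences."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - m + 1):
            counts[tuple(s[i:i + m])] += 1
    return counts


def weighted_jaccard(a, b):
    """Ruzicka similarity: sum of element-wise minima over sum of maxima."""
    keys = set(a) | set(b)
    numerator = sum(min(a[x], b[x]) for x in keys)     # Counter yields 0 if missing
    denominator = sum(max(a[x], b[x]) for x in keys)
    return numerator / denominator if denominator else 1.0


real = [[1, 2, 3, 1]]
synthetic = [[1, 2, 3, 4]]
score = weighted_jaccard(substring_counts(real, 2), substring_counts(synthetic, 2))
print(score)   # two of the four distinct length-2 substrings are shared
```

In this toy case both samples contain the substrings (1, 2) and (2, 3) once, and each contains one substring the other lacks, so the score is 2 / 4.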
5 Conclusions and Future Work
Higher-order networks (HONs) overcome the Markovian limitations of first-order networks by allowing each node to represent part of a sequence, rather than a single entity. This allows edges to naturally encode higher-order relationships between entities and improves the representative quality of the network with respect to the original sequences. However, the process of enumerating and testing for higher-order dependencies is computationally complex, especially as the order of the network increases, and previous approaches have been limited by the trade-off between efficient computation and thorough detection of dependencies. We introduced GrowHON, an algorithm that grows a HON by embedding the input in a tree, pruning the non-meaningful sequences, and converting the preserved sequences into an edgelist. We demonstrated that GrowHON is scalable with respect to both the size of the input and order of the network, and that its top-down pruning procedure preserves important higher-order dependencies that are missed by prior approaches. While our work has mostly focused on the computational procedure of growing a HON, there are still many opportunities for advancing the HON framework. At present, GrowHON only captures information about sequence order, but this could be extended to consider additional information like distance in time or space. Additionally, it does not test the assumption that sequences are strictly ordered, i.e. that 1|2|3 is different from 1|3|2, and is limited in its ability to compute how relevant each entity is to the overall sequence. GrowHON could also be extended to include heterogeneous information, and even use this information in deciding whether a given sequence is statistically meaningful. We believe that GrowHON lays a foundation for studying these problems and creating even more meaningful representations of large and complex data.
References
1. Benson, A.R., Gleich, D.F., Leskovec, J.: Higher-order organization of complex networks. Science 353(6295), 163–166 (2016)
2. Bureau of Transportation Statistics: Transtats. https://www.transtats.bts.gov/. Accessed 30 Sep 2019
3. Chierichetti, F., Kumar, R., Raghavan, P., Sarlos, T.: Are web users really Markovian? In: Proceedings of the 21st International Conference on World Wide Web, pp. 609–618 (2012)
4. Cui, P., Wang, X., Pei, J., Zhu, W.: A survey on network embedding. IEEE Trans. Knowl. Data Eng. 31(5), 833–852 (2018)
5. Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
6. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. ACM Sigmod Rec. 29(2), 1–12 (2000)
7. Healthcare Cost and Utilization Project (HCUP): Clinical classification software, March 2017. http://www.hcup-us.ahrq.gov. Accessed 8 Jan 2020
8. Ioffe, S.: Improved consistent sampling, weighted minhash and l1 sketching. In: 2010 IEEE International Conference on Data Mining, pp. 246–255. IEEE (2010)
9. Koher, A., Lentz, H.H., Hövel, P., Sokolov, I.M.: Infections on temporal networks - a matrix-based approach. PloS ONE 11(4), e0151209 (2016)
10. Lambiotte, R., Rosvall, M., Scholtes, I.: From networks to optimal higher-order models of complex systems. Nat. Phys. 15(4), 313–320 (2019)
11. Milo, R., Shen-Orr, S., et al.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
12. Moritz, P., Nishihara, R., et al.: Ray: A distributed framework for emerging AI applications. In: 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577 (2018)
13. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web. Technical report, Stanford InfoLab (1999)
14. Peixoto, T.P., Rosvall, M.: Modelling sequences and temporal networks with dynamic community structures. Nat. Commun. 8(1), 582 (2017)
15. Porter, M.A.: Nonlinearity+ networks: a 2020 vision. In: Emerging Frontiers in Nonlinear Science, pp. 131–159. Springer, Cham (2020)
16. Pržulj, N., Corneil, D.G., Jurisica, I.: Modeling interactome: scale-free or geometric? Bioinformatics 20(18), 3508–3515 (2004)
17. Rossi, R.A., Ahmed, N.K., Koh, E.: Higher-order network representation learning. In: Companion Proceedings of the The Web Conference 2018, pp. 3–4. International World Wide Web Conferences Steering Committee (2018)
18. Rosvall, M., Esquivel, A.V., Lancichinetti, A., West, J.D., Lambiotte, R.: Memory in network flows and its effects on spreading dynamics and community detection. Nat. Commun. 5, 4630 (2014)
19. Saebi, M., Ciampaglia, G.L., Kaplan, L.M., Chawla, N.V.: Honem: learning embedding for higher order networks. Big Data 8(4), 255–269 (2020)
20. Saebi, M., Xu, J., Kaplan, L.M., Ribeiro, B., Chawla, N.V.: Efficient modeling of higher-order dependencies in networks: from algorithm to application for anomaly detection. EPJ Data Sci. 9(1), 15 (2020)
21. Scholtes, I.: When is a network a network? In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1037–1046. ACM (2017)
22. Scholtes, I., et al.: Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks. Nat. Commun. 5, 5024 (2014)
23. Tsourakakis, C.E., Pachocki, J., Mitzenmacher, M.: Scalable motif-aware graph clustering. In: Proceedings of the 26th International Conference on World Wide Web, pp. 1451–1460 (2017)
24. Xu, J., Wickramarathne, T.L., Chawla, N.V.: Representing higher-order dependencies in networks. Sci. Adv. 2(5), e1600028 (2016)
25. Yin, H., Benson, A.R., Leskovec, J., Gleich, D.F.: Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 555–564. ACM (2017)
Analysis of a Finite Mixture of Truncated Zeta Distributions for Degree Distribution

Hohyun Jung and Frederick Kin Hing Phoa(✉)

Institute of Statistical Science, Academia Sinica, Taipei City 11529, Taiwan
{hhjung,fredphoa}@stat.sinica.edu.tw
Abstract. The power-law distribution has been widely used to describe the degree distribution of a network, especially when the range of degree is large. However, the deviation from such behavior appears when the range of degrees is small. Even worse, the conventional employment of the continuous power-law distribution usually causes an inaccurate inference as the degree should be discrete-valued. To remedy these obstacles, we propose a finite mixture model of truncated zeta distributions for a broad range of degrees that disobeys a power-law nature in a small degree range while maintaining the scale-free nature of a network. The maximum likelihood algorithm alongside the model selection method is presented to estimate model parameters and the number of mixture components. We apply our method to scientific collaboration networks with remarkable interpretations.

Keywords: Truncated zeta distribution · Power-law · Maximum likelihood · Collaboration network · Degree distribution
1 Introduction
Network science has emerged to address various properties of a network, which can be found from neuroscience to statistical physics. A network consists of nodes and links, and the topological structure, which explores how nodes are connected in the system, has been investigated with great interest [4,16,20]. The degree of a node is the number of links connected to the node. The degree distribution $P(k)$ of a network is defined as the fraction of nodes in the network with degree $k$, and it is considered as an important characteristic that describes the network structure. Many types of research have been devoted to the study of the degree distribution using Poisson, exponential, and power-law distributions. Among them, the most frequently observed form is the power-law, and networks that have the power-law degree distribution are often referred to as scale-free networks. We can observe power-law degree distributions in collaboration, World Wide Web, protein-protein interaction, and semantic networks [2,9,23,24]. The presence of highly connected nodes (hubs) is a key feature of the power-law degree distribution. A rich-get-richer mechanism, also called a popularity effect [17], has been known to produce hubs [4,6].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 497–507, 2021. https://doi.org/10.1007/978-3-030-65351-4_40

Many dynamic network models have been developed to explain the power-law degree distribution. The pioneering work is the generative model proposed by Barabási and Albert (1999) [4], called the BA model. They employ the preferential attachment mechanism on the network growth. This model reflects the phenomenon that nodes with more neighbors tend to receive more links from other nodes. The BA model yields a power-law degree distribution with exponent 3, i.e., $P(k) \propto k^{-3}$. The exact form, given by Bollobás et al. in [7], is:

$$ P(k|m) = \frac{2m(m+1)}{k(k+1)(k+2)}, \quad k = m, m+1, \cdots, \tag{1} $$
where $m$ is the number of connections at each time. Many variants have been developed to cover a broad range of power-law exponents [10,11,18]. Another model for generating a power-law degree distribution is the copy model presented in Kumar et al. (2000) [19]. In this model, newly entering nodes randomly select some existing nodes and copy some of the links.

There are many empirical distributions where the power-law nature is not observed in the range of small degrees. Many variants of the power-law distribution are developed to address this issue, such as generalized power-law distributions [3,25], composite distributions with threshold [13,21], and power-law distributions with an exponential cutoff [12,22]. However, these methods do not consider the essential foundation of the power-law. According to the BA model and its variants, the power-law nature is an inherent property exhibited by the preferential attachment rule. The model presented in this paper retains the power-law nature in the sense that we do not modify the power-law distribution function, given by $P(k) \propto k^{-\alpha}$.

Note that the BA model has a parameter $m$. Jordan (2006) [15] relaxed the constant-$m$ condition so that the number $M$ of connections at each time can change over time according to the distribution of $M$. The degree distribution turned out to be $P(k) = 2E[M(M+1)\mathbf{1}(M \le k)] / (k(k+1)(k+2))$, $k = 1, 2, \cdots$. Suppose that $M$ has a finite support, $M = 1, \cdots, M_{\max}$. Then we can easily show that the degree distribution can be represented by

$$ P(k) = \sum_{m=1}^{M_{\max}} w_m P(k|m), \quad k = 1, 2, \cdots, \tag{2} $$
where $w_m = P(M = m)$ is the mixture weight of the mixture component $P(k|m)$ in Eq. (1). This suggests that the degree distribution might be expressed as a mixture distribution. Inspired by this property, we consider a mixture model to explain the deviation from the power-law in the range of small degrees. Moreover, many studies have analyzed the degree distribution as if the degree is a continuous-valued variable using continuity assumptions [1,5]. This may yield an inaccurate result on the statistical inference. Therefore, we employ a truncated zeta distribution, which is a discrete version of the power-law distribution.
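Equations (1) and (2) can be checked numerically with exact rational arithmetic. The sketch below is illustrative: the function names and the example weights are assumptions, and $P(k|m)$ is taken to be zero for $k < m$, which is what makes the mixture agree with the indicator form quoted from Jordan.

```python
from fractions import Fraction


def ba_pmf(k, m):
    """Exact BA degree distribution of Eq. (1); zero below the minimum degree m."""
    if k < m:
        return Fraction(0)
    return Fraction(2 * m * (m + 1), k * (k + 1) * (k + 2))


def mixture_pmf(k, weights):
    """Eq. (2): weights[m] = P(M = m) for m = 1..M_max."""
    return sum(w * ba_pmf(k, m) for m, w in weights.items())


# Hypothetical mixing weights with M_max = 3.
weights = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}

# The partial sums of Eq. (1) telescope, so the mass up to K = 100 is known in closed form.
partial = sum(ba_pmf(k, 3) for k in range(1, 101))
print(partial, 1 - Fraction(3 * 4, 101 * 102))

# For k >= M_max the mixture is an exact multiple of 1 / (k (k + 1) (k + 2)).
print([mixture_pmf(k, weights) * k * (k + 1) * (k + 2) for k in range(1, 7)])
```

The last line shows the behavior that motivates the mixture view: the rescaled probabilities vary for degrees below $M_{\max}$ and are constant from $M_{\max}$ onward.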
In this study, we present a mixture model of truncated zeta distributions for the analysis of degree distributions. This method covers the whole range of the degree distribution through a mixture of truncated zeta distributions while maintaining the scale-free nature of a network. Also, the characteristics of the degree distribution can be analyzed more accurately via the discrete form of the power-law distribution. We present the maximum likelihood estimation algorithm along with a model selection method. In addition, we apply the model to the collaboration network for analyzing the properties of the degree distribution.
2 The Power-Law Distribution
The probability density function (pdf) of the continuous power-law distribution parametrized by the power-law exponent $\alpha > 1$ and the minimum value $l > 0$, denoted by $PL(\alpha, l)$, is given by

$$ f(x|\alpha, l) = (\alpha - 1)\, l^{\alpha-1} x^{-\alpha}, \quad x \ge l. \tag{3} $$
The continuous power-law distribution has been widely used in the analysis of the degree distribution. However, the degree is discrete-valued, and in this case, an approximation method is required. Setting a constant $c$, $0 \le c \le 1$, for the correction of the continuity, we can approximate the degree distribution by $f(k|\alpha, l) \approx \int_{k-c}^{k+1-c} f(x|\alpha, l)\,dx$, for the integer value $k$. However, it is not easy to choose the constant $c$ since there is no such constant to satisfy the exact equality of left-hand and right-hand sides for all integer values $k = 1, 2, \cdots$. The case of constant $c = 0.5$, i.e., rounding to the nearest integer, is a common approach to approximate the discrete power-law behavior [9,14]. It is quite reasonable when considering the tail part of the power-law distribution. However, if $k$ is not large, the constant $c$ that satisfies $f(k|\alpha, l) = \int_{k-c}^{k+1-c} f(x|\alpha, l)\,dx$ is considerably less than 0.5. Thus, the rounding approach should be avoided. Clauset et al. (2009) in [9] also mention that the rounding method is only valid for $k > 6$.

The truncated zeta distribution, denoted by $TZ(\alpha, l)$, is a discrete form of the power-law distribution. The parameters are the same as the continuous power-law distribution in which $\alpha > 1$ is the power-law exponent and $l > 0$ is the minimum value. The probability mass function (pmf) of $TZ(\alpha, l)$ is given by

$$ g(k|\alpha, l) = \frac{1}{\zeta(\alpha, l)}\, k^{-\alpha}, \quad k = l, l+1, \cdots, \tag{4} $$
where $\zeta(\cdot, \cdot)$ is the Hurwitz zeta function $\zeta(\alpha, l) = \sum_{k=l}^{\infty} k^{-\alpha}$, which can be regarded as the normalizing constant of the distribution. The Hurwitz zeta function in Eq. (4) is closely related to the continuous counterpart $1/((\alpha - 1)\, l^{\alpha-1})$ in Eq. (3) via the upper Riemann sum approximation, expressed by $\zeta(\alpha, l) \approx \int_{l}^{\infty} x^{-\alpha}\,dx = 1/((\alpha - 1)\, l^{\alpha-1})$. Note that the right-hand side is not the exact value of $\zeta(\alpha, l)$.
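The truncated zeta pmf of Eq. (4) and the gap between the Hurwitz zeta function and its continuous counterpart can be sketched in plain Python. This is a minimal illustration under assumptions: `hurwitz_zeta` is a hypothetical helper that sums the first terms directly and estimates the remaining tail with a midpoint-rule integral, which is adequate for integer $l \ge 1$ and moderate $\alpha > 1$ but is not a general-purpose implementation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def hurwitz_zeta(alpha, l, terms=100_000):
    """Approximate zeta(alpha, l) = sum over k >= l of k^-alpha (alpha > 1, integer l >= 1)."""
    head = sum(k ** -alpha for k in range(l, l + terms))
    n = l + terms
    # Tail estimate: integral of x^-alpha from n - 1/2 to infinity.
    tail = (n - 0.5) ** (1 - alpha) / (alpha - 1)
    return head + tail


def tz_pmf(k, alpha, l=1):
    """pmf g(k | alpha, l) of the truncated zeta distribution, Eq. (4)."""
    if k < l:
        return 0.0
    return k ** -alpha / hurwitz_zeta(alpha, l)


def continuous_normalizer(alpha, l):
    """Integral of x^-alpha from l to infinity, the continuous counterpart in Eq. (3)."""
    return 1.0 / ((alpha - 1) * l ** (alpha - 1))


alpha = 2.5
for l in (1, 2, 5):
    print(l, hurwitz_zeta(alpha, l), continuous_normalizer(alpha, l))
```

Since $x^{-\alpha}$ is decreasing, the sum exceeds the integral for every $l$, and the relative gap is largest for small $l$, which is where the continuous approximation is least reliable.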
3 Truncated Zeta Mixture Model
We consider the finite mixture model of truncated zeta distributions obtained by fixing the power-law exponent $\alpha$ for all mixture components while varying the minimum values. The probability mass function is represented by
$$p(k \mid \alpha, L, w) = \sum_{l=1}^{L} w_l\, g(k \mid \alpha, l), \quad k = 1, 2, \cdots, \tag{5}$$
where $g(k \mid \alpha, l)$ is the pmf of $TZ(\alpha, l)$, and $L$ is the number of mixture components. Mixture weights $w = (w_1, w_2, \cdots, w_L)$ satisfy $w_l \ge 0$ for $l = 1, 2, \cdots, L - 1$, $w_L > 0$, and $\sum_{l=1}^{L} w_l = 1$. In this paper, we assume that the smallest minimum value $l$ among the components is equal to 1, but it can be modified according to the data. The tail of most real networks follows the power-law distribution, and Eq. (5) has the exact power-law behavior for sufficiently large degrees.
Theorem 1 For $k$ larger than or equal to $L$, the truncated zeta mixture distribution in Eq. (5) has the exact power-law relationship, given by $p(k \mid \alpha, L, w) \propto k^{-\alpha}$.
Proof. By using the pmf of $TZ(\alpha, l)$, we can write $p(k \mid \alpha, L, w) = \left[\sum_{l=1}^{L} \frac{w_l}{\zeta(\alpha, l)}\right] k^{-\alpha}$ for $k \ge L$. Since the term inside the bracket is independent of $k$, the pmf of the mixture is proportional to $k^{-\alpha}$.
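Theorem 1 can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: `hurwitz_zeta`, `mixture_pmf`, and `loglog_slope` are hypothetical helper names, the zeta function is approximated with a partial sum and an Euler-Maclaurin tail, and the weights are those quoted in the caption of Fig. 1.

```python
import math


def hurwitz_zeta(alpha, l, n_terms=200):
    """Approximate zeta(alpha, l) = sum_{k>=l} k**(-alpha) for alpha > 1."""
    m = l + n_terms
    head = sum(k ** -alpha for k in range(l, m))
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def mixture_pmf(k, alpha, weights):
    """Eq. (5): component l (1-based) is TZ(alpha, l) and contributes only if k >= l."""
    total = 0.0
    for l, w in enumerate(weights, start=1):
        if k >= l:
            total += w * k ** -alpha / hurwitz_zeta(alpha, l)
    return total


def loglog_slope(k, alpha, weights):
    """Slope of the pmf between k and k + 1 on a log-log scale."""
    rise = (math.log(mixture_pmf(k + 1, alpha, weights))
            - math.log(mixture_pmf(k, alpha, weights)))
    return rise / (math.log(k + 1) - math.log(k))


if __name__ == "__main__":
    weights = (0.4, 0.3, 0.2, 0.1)  # L = 4, as in the left panel of Fig. 1
    for k in range(1, 8):
        print(k, loglog_slope(k, 2.5, weights))
```

With these weights the printed slope differs from -2.5 for k = 1, 2, 3 and equals -2.5 (up to rounding) from k = 4 = L onward, which is the behavior the theorem describes.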
The non-identifiability problem is frequently observed in mixture distributions, even in the case of finite mixtures. The following theorem shows the identifiability of the truncated zeta mixture model.
Theorem 2 The mixture distribution in Eq. (5) is identifiable with respect to $\alpha$, $L$, and $w$.
Proof. Let $p(k \mid \alpha_1, L_1, w_1)$ and $p(k \mid \alpha_2, L_2, w_2)$ be mixture distributions with parameters $\alpha_1$, $L_1$, $w_1 = (w_{11}, w_{12}, \cdots, w_{1L_1})$ and $\alpha_2$, $L_2$, $w_2 = (w_{21}, w_{22}, \cdots, w_{2L_2})$, respectively. Suppose that the two distributions are identical, i.e., $p(k \mid \alpha_1, L_1, w_1) = p(k \mid \alpha_2, L_2, w_2)$ for all $k = 1, 2, \cdots$. Further, we define the slope function $s(k \mid \alpha, L, w)$ of the log-log degree distribution, given by
$$s(k \mid \alpha, L, w) = \frac{\ln p(k+1 \mid \alpha, L, w) - \ln p(k \mid \alpha, L, w)}{\ln(k+1) - \ln(k)}, \quad k = 1, 2, \cdots.$$
Since the two mixture distributions are identical, their slope functions are identical, i.e., $s(k \mid \alpha_1, L_1, w_1) = s(k \mid \alpha_2, L_2, w_2)$ for all $k = 1, 2, \cdots$. For sufficiently large $k$ ($\ge \max\{L_1, L_2\}$), the power-law form of Theorem 1 gives $s(k \mid \alpha, L, w) = -\alpha$. Therefore, $\alpha_1 = \alpha_2$. The number of mixture components $L$ is the largest integer $k$ such that $s(k - 1 \mid \alpha, L, w) \neq -\alpha$, and hence $L_1 = L_2$. Let $L$ ($= L_1 = L_2$) and $\alpha$ ($= \alpha_1 = \alpha_2$) be the common number of mixture components and the common power-law exponent. Using $p(k \mid \alpha_1, L_1, w_1) = p(k \mid \alpha_2, L_2, w_2)$ for $k = 1, 2, \cdots, L$, we have the following $L$ equations:
$$w_{11}\, g(1 \mid \alpha, 1) = w_{21}\, g(1 \mid \alpha, 1),$$
$$w_{11}\, g(2 \mid \alpha, 1) + w_{12}\, g(2 \mid \alpha, 2) = w_{21}\, g(2 \mid \alpha, 1) + w_{22}\, g(2 \mid \alpha, 2),$$
$$\cdots$$
$$w_{11}\, g(L \mid \alpha, 1) + \cdots + w_{1L}\, g(L \mid \alpha, L) = w_{21}\, g(L \mid \alpha, 1) + \cdots + w_{2L}\, g(L \mid \alpha, L).$$
By solving these equations, we have $w_{1l} = w_{2l}$ for $l = 1, 2, \cdots, L$. Hence, the mixture of truncated zeta distributions is identifiable.
This model can handle frequently observed degree distributions that do not follow the power-law distribution at small degrees. Figure 1 shows some log-log plots of mixture distributions, and we can see that the degree distribution in Fig. 2 is explained in our model framework.
Fig. 1. The pmfs of the mixture of truncated zeta distributions with L = 4 on a log-log scale. Mixture weights are w = (0.4, 0.3, 0.2, 0.1) (left) and w = (0.1, 0.4, 0.4, 0.1) (right). Power-law exponents are α = 2.0 (blue), α = 2.5 (black), and α = 3.0 (red).
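The L equations in the identifiability proof form a lower-triangular system, so the weights follow from the first L pmf values by forward substitution. The sketch below illustrates this under the assumption that α is already known; `recover_weights` and the other helper names are hypothetical and not part of the paper.

```python
def hurwitz_zeta(alpha, l, n_terms=200):
    """Approximate zeta(alpha, l) = sum_{k>=l} k**(-alpha) for alpha > 1."""
    m = l + n_terms
    head = sum(k ** -alpha for k in range(l, m))
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def mixture_pmf(k, alpha, weights):
    """Eq. (5) evaluated at degree k for components l = 1..L."""
    return sum(w * k ** -alpha / hurwitz_zeta(alpha, l)
               for l, w in enumerate(weights, start=1) if k >= l)


def recover_weights(pmf_values, alpha):
    """Solve p(k) = sum_{l<=k} w_l g(k | alpha, l), k = 1..L, by forward substitution.

    pmf_values[k - 1] must hold p(k); the result holds w_1, ..., w_L.
    """
    n_comp = len(pmf_values)
    zetas = [hurwitz_zeta(alpha, l) for l in range(1, n_comp + 1)]
    weights = []
    for k in range(1, n_comp + 1):
        # Contribution of the already recovered components l < k.
        known = sum(weights[l - 1] * k ** -alpha / zetas[l - 1] for l in range(1, k))
        g_kk = k ** -alpha / zetas[k - 1]  # g(k | alpha, k), the diagonal entry
        weights.append((pmf_values[k - 1] - known) / g_kk)
    return weights


if __name__ == "__main__":
    true_w = [0.1, 0.4, 0.4, 0.1]  # right panel of Fig. 1
    pmf = [mixture_pmf(k, 2.5, true_w) for k in range(1, 5)]
    print(recover_weights(pmf, 2.5))
```

The diagonal entries g(k | α, k) are strictly positive, so the system always has exactly one solution, which is the step the proof relies on.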
4 Estimation
The Expectation-Maximization (EM) algorithm can be employed to estimate the exponent parameter $\alpha$ and the mixture weights $w$ for a given number of mixture components $L$. Let $k = (k_1, k_2, \cdots, k_N)$ be the observed data, and let $z_n$ be the membership of $k_n$, defined by $z_n = l$ if $k_n$ is from the $l$-th mixture component $TZ(\alpha, l)$. We treat the membership variable $z = (z_1, z_2, \cdots, z_N)$ as missing data. Let $\theta = (\alpha, w) = (\alpha, w_1, \cdots, w_L)$ be the parameters of the mixture model.
The complete-data likelihood function is given by
$$L(\theta \mid k, z) = \prod_{n=1}^{N} p(z_n \mid w)\, g(k_n \mid \alpha, z_n) = \prod_{n=1}^{N} w_{z_n} \frac{1}{\zeta(\alpha, z_n)}\, k_n^{-\alpha}, \quad k_n = z_n, z_n + 1, \cdots.$$
Then the complete-data log-likelihood function can be written as
$$\ln L(\theta \mid k, z) = \sum_{n=1}^{N} \left( \ln w_{z_n} - \ln \zeta(\alpha, z_n) - \alpha \ln k_n \right).$$
We define $Q(\theta \mid \theta^*)$ as the expected value of the log-likelihood given the observed data $k$ and the current parameter estimate $\theta^* = (\alpha^*, w^*)$, which can be expressed by
$$Q(\theta \mid \theta^*) = E\left[\ln L(\theta \mid k, Z) \mid k, \theta^*\right] = \sum_{n=1}^{N} \sum_{l=1}^{L} \gamma(l, n, \theta^*) \left( \ln w_l - \ln \zeta(\alpha, l) - \alpha \ln k_n \right).$$
Here, $\gamma(l, n, \theta^*)$ is the membership responsibility of the $l$-th mixture component $TZ(\alpha, l)$ for the $n$-th observation $k_n$, i.e., the posterior probability of the component membership of each observation, defined by
$$\gamma(l, n, \theta^*) = P(z_n = l \mid k_n, \theta^*) = \frac{w_l^*\, g(k_n \mid \alpha^*, l)}{\sum_{l'=1}^{L} w_{l'}^*\, g(k_n \mid \alpha^*, l')}, \quad l = 1, 2, \cdots, L, \quad n = 1, 2, \cdots, N. \tag{6}$$
The computation of Eq. (6) is the E-step of the EM algorithm. The M-step is to find $\check{\theta} = \arg\max_{\theta} Q(\theta \mid \theta^*)$ using the membership responsibilities computed in the E-step. To find $\check{w}$, we need to solve the following optimization problem: maximize $\sum_{n=1}^{N} \sum_{l=1}^{L} \gamma(l, n, \theta^*) \ln w_l$, subject to $\sum_{l=1}^{L} w_l = 1$ and $w_l \ge 0$, $l = 1, 2, \cdots, L$. Using the Lagrange multiplier method, we have
$$\check{w}_l = \frac{1}{N} \sum_{n=1}^{N} \gamma(l, n, \theta^*), \quad l = 1, 2, \cdots, L. \tag{7}$$
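The E-step in Eq. (6) and the weight update in Eq. (7) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the Hurwitz zeta function is approximated with a partial sum plus an Euler-Maclaurin tail, and the sketch assumes every observation has positive probability under the current weights (for example, w_1 > 0 whenever the data contain k = 1).

```python
def hurwitz_zeta(alpha, l, n_terms=200):
    """Approximate zeta(alpha, l) = sum_{k>=l} k**(-alpha) for alpha > 1."""
    m = l + n_terms
    head = sum(k ** -alpha for k in range(l, m))
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def responsibilities(data, alpha, weights):
    """E-step, Eq. (6): gamma[n][l - 1] = P(z_n = l | k_n, current parameters)."""
    zetas = [hurwitz_zeta(alpha, l) for l in range(1, len(weights) + 1)]
    gamma = []
    for k in data:
        # Component l can only generate degrees k >= l.
        joint = [w * k ** -alpha / z if k >= l else 0.0
                 for l, (w, z) in enumerate(zip(weights, zetas), start=1)]
        total = sum(joint)
        gamma.append([j / total for j in joint])
    return gamma


def update_weights(gamma):
    """M-step for the weights, Eq. (7): average responsibility of each component."""
    n_obs = len(gamma)
    n_comp = len(gamma[0])
    return [sum(row[l] for row in gamma) / n_obs for l in range(n_comp)]


if __name__ == "__main__":
    degrees = [1, 1, 2, 3, 5, 8]
    gamma = responsibilities(degrees, 2.5, [0.5, 0.3, 0.2])
    print(update_weights(gamma))
```

An observation with k = 1 can only come from the first component, so its responsibility vector is (1, 0, 0); larger degrees are shared among all components whose minimum value does not exceed them.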
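Next, $\check{\alpha}$ can be found from the partial derivative of $Q$ with respect to $\alpha$, given by
$$\frac{\partial Q(\theta \mid \theta^*)}{\partial \alpha} = -\sum_{n=1}^{N} \sum_{l=1}^{L} \gamma(l, n, \theta^*) \left( \frac{\zeta_\alpha(\alpha, l)}{\zeta(\alpha, l)} + \ln k_n \right),$$
where $\zeta_\alpha(\alpha, l)$ is the partial derivative of the Hurwitz zeta function with respect to $\alpha$, which can be expressed by $\zeta_\alpha(\alpha, l) = \frac{\partial \zeta(\alpha, l)}{\partial \alpha} = -\sum_{k=l}^{\infty} (\ln k)\, k^{-\alpha}$. We need to find $\check{\alpha}$ that satisfies the equation
$$\frac{\partial Q(\theta \mid \theta^*)}{\partial \alpha} = 0. \tag{8}$$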
Unfortunately, $\check{\alpha}$ has no closed-form solution. Thus, we need a numerical method to solve Eq. (8) with respect to $\alpha$. In this paper, we employ Brent's method [8]. We repeat the E- and M-steps until convergence and then obtain the final parameter estimate $\hat{\theta}$.
We select the number of mixture components $L$ by applying a model selection criterion. In this paper, we employ the Bayesian information criterion (BIC), which considers the trade-off between the goodness of fit and the complexity of the model, given by
$$\mathrm{BIC} = L \ln N - 2 \sum_{n=1}^{N} \ln p(k_n \mid \hat{\alpha}, L, \hat{w}), \tag{9}$$
where $\hat{\alpha}$ and $\hat{w}$ are the estimated parameters obtained for the given number of mixture components $L$. We select the $L$ that minimizes the BIC in Eq. (9).
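The update of α and the BIC in Eq. (9) can be sketched as below. Note the substitution: instead of Brent's method used in the paper, this sketch maximizes the α-dependent part of Q with a plain golden-section search, which is valid here because that part is concave in α. All function names, the search interval [1.05, 10], and the zeta approximation are assumptions made for illustration only.

```python
import math


def hurwitz_zeta(alpha, l, n_terms=200):
    """Approximate zeta(alpha, l) = sum_{k>=l} k**(-alpha) for alpha > 1."""
    m = l + n_terms
    head = sum(k ** -alpha for k in range(l, m))
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def mixture_pmf(k, alpha, weights):
    """Eq. (5) evaluated at degree k for components l = 1..L."""
    return sum(w * k ** -alpha / hurwitz_zeta(alpha, l)
               for l, w in enumerate(weights, start=1) if k >= l)


def q_alpha(alpha, comp_mass, sum_log_k):
    """Alpha-dependent part of Q: -sum_l N_l ln zeta(alpha, l) - alpha sum_n ln k_n.

    comp_mass[l - 1] = N_l = sum_n gamma(l, n); sum_log_k = sum_n ln k_n.
    """
    penalty = sum(n_l * math.log(hurwitz_zeta(alpha, l))
                  for l, n_l in enumerate(comp_mass, start=1))
    return -penalty - alpha * sum_log_k


def argmax_alpha(comp_mass, sum_log_k, lo=1.05, hi=10.0, tol=1e-8):
    """Golden-section search for the maximizer of q_alpha on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if q_alpha(c, comp_mass, sum_log_k) > q_alpha(d, comp_mass, sum_log_k):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


def bic(data, alpha, weights):
    """Eq. (9): L ln N minus twice the log-likelihood of the fitted mixture."""
    log_lik = sum(math.log(mixture_pmf(k, alpha, weights)) for k in data)
    return len(weights) * math.log(len(data)) - 2 * log_lik


if __name__ == "__main__":
    degrees = [1] * 60 + [2] * 20 + [3] * 10 + [4] * 5 + [10] * 5
    s = sum(math.log(k) for k in degrees)
    a_hat = argmax_alpha([len(degrees)], s)  # single component, L = 1
    print(a_hat, bic(degrees, a_hat, [1.0]))
```

In a full EM loop, `comp_mass` would be recomputed from the responsibilities in every iteration, and the whole fit would be repeated for each candidate L before comparing BIC values.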
5 Application: Degree Distribution of Collaboration Network
5.1 The Data
We study the scientific collaboration network obtained from the Web of Science, a large-scale database that collects the information of all published scientific articles in the world. Among all 275 disciplines, we randomly choose five for demonstration: Biotechnology, Computer Science, Environmental Science, Materials Science, and Physical Chemistry. We reorganize the Web of Science database into a network data structure such that the nodes of the networks are authors, and two authors are connected by an undirected link if they have co-authored at least one paper. We investigate the degree distribution of the data from 2001 to 2016.
Table 1. The estimation results of L and α. The first three w estimates of the Computer Science field are also presented.
Year | Comp. Sci. ($\hat{L}$, $\hat{\alpha}$, $\hat{w}_1$, $\hat{w}_2$, $\hat{w}_3$) | Biotech. ($\hat{L}$, $\hat{\alpha}$) | Mat. Sci. ($\hat{L}$, $\hat{\alpha}$) | Env. Sci. ($\hat{L}$, $\hat{\alpha}$) | Phy. Chem. ($\hat{L}$, $\hat{\alpha}$)
2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
0.30 0.31 0.30 0.30 0.29 0.30 0.30 0.30 0.30 0.30 0.29 0.29 0.30 0.29 0.29 0.29
0.18 0.18 0.18 0.18 0.18 0.19 0.20 0.20 0.21 0.21 0.21 0.21 0.22 0.22 0.23 0.23
21 21 25 36 36 38 38 38 38 36 36 36 36 36 36 36
16 16 16 16 21 21 21 21 21 21 21 23 17 19 19 19
3.36 3.27 3.23 3.16 3.24 3.23 3.27 3.22 3.21 3.21 3.20 3.23 2.97 2.97 2.87 2.85
0.30 0.29 0.28 0.28 0.27 0.26 0.25 0.25 0.24 0.23 0.22 0.21 0.21 0.20 0.20 0.19
2.96 2.93 2.93 3.05 3.00 2.98 3.02 2.99 3.01 2.95 2.93 2.92 2.93 2.89 2.86 2.84
36 36 36 37 37 37 36 36 38 38 41 41 43 43 53 53
3.04 2.98 2.92 2.88 2.82 2.79 2.82 2.85 2.90 2.92 2.95 2.95 2.94 2.91 2.94 2.89
27 27 31 30 31 30 38 36 38 40 38 40 38 39 43 37
3.57 3.53 3.59 3.49 3.41 3.38 3.59 3.57 3.62 3.55 3.42 3.41 3.22 3.18 3.10 2.94
35 35 35 38 38 39 38 36 35 38 39 40 42 44 42 44
3.29 3.21 3.14 3.13 3.06 3.04 3.05 3.05 3.06 3.11 3.13 3.12 3.12 3.08 2.99 2.95
5.2 Application of the Truncated Zeta Mixture Model
We apply the presented estimation method to degree distributions from 2001 to 2016. Table 1 shows the estimated results of $L$ and $\alpha$. Figure 2 shows the degree distributions and their estimated mixtures of truncated zeta distributions for the years 2001 and 2016. The plots indicate that the truncated zeta mixture fits the degree distribution well. The beginning points (the vertical line in Fig. 2) of power-law behaviors are reasonably estimated via $\hat{L}$. This result suggests that the proposed model is useful for degree distributions that deviate from the power-law distribution in the small-degree part, as they usually do.
Fig. 2. The degree distributions of collaboration network data of the Computer Science field of the years 2001 (left) and 2016 (right). The estimated mixture of truncated zeta distributions is presented.
Figure 3 shows how the estimated values $\hat{L}$ and $\hat{\alpha}$ change over time. The estimated number $\hat{L}$ of mixture components tends to increase over time. On the other hand, $\hat{\alpha}$ tends to decrease, even though it appears to oscillate in the discipline of Materials Science. These results indicate that the degree distribution has not yet stabilized, and that it keeps moving toward larger $L$ and smaller $\alpha$ in all fields. Many existing dynamic network models [4,11,18] have sought to derive the stationary or converged degree distribution. The increasing tendency of $\hat{L}$ suggests, however, that non-stationary network models, such as acceleration models [10], are of great importance for the degree distribution of scientific collaboration networks. Also, a large number of works in real data analysis have focused on the power-law exponent of a snapshot of a network. However, the temporal variation of $\hat{\alpha}$ points out the importance of temporal models that can account for the change of the power-law exponent over time.
Fig. 3. The change of the estimated number of mixture components L (top) and power-law exponents α (bottom).
Note that $\hat{L}$ seems to approach different values across fields. According to Eq. (2) and the relevant interpretation in Sect. 1, the number of mixture components $L$ and the mixture weights $w$ are closely related to the distribution of the number of links in the system. Therefore, $\hat{L}$ could be a measure for the distribution of the number of links. On the other hand, $\hat{\alpha}$ tends to converge to a value near 2.80 for all fields, suggesting that $\alpha$ could be a network-specific quantity instead of a field-specific quantity.
6 Concluding Remark
In this paper, we propose a mixture model to cover the whole range of degrees while maintaining the power-law or scale-free property of a network. We see that the adequacy of the mixture model can be justified by assuming m of the BA model as a random variable. In addition, the truncated zeta distribution is employed to analyze discrete distributions more accurately. We present the maximum likelihood estimation along with BIC for determining the number of mixture components. The proposed model is applied to five collaboration networks. We find that the number of mixture components tends to increase, whereas the power-law exponent decreases over time, and mixture weights change over time accordingly (see Table 1). These tendencies in all fields indicate that collaboration networks are still evolving, and that the degree distribution has not yet converged. The result highlights the practical importance of non-stationary evolving network models. Power-law distributions are not limited to the degree distribution of networks; they are ubiquitous in empirical data. For example, we can find the power-law distribution in the frequencies of words, the frequencies of family names, and the population of cities. The proposed mixture model of truncated zeta distributions could be useful in the following situations: the degree distribution disobeys the power law in a small range, manual modifications of the distribution function $P(k) \propto k^{-\alpha}$ are not supported by the background of the data, or the mixture distribution is shown to be a reasonable description of the data.
Acknowledgement. The authors thank Clarivate Analytics for providing access to the raw data of the Web of Science database for research investigations via the international collaboration between the Institute of Statistical Mathematics (ISM) of Japan and the Institute of Statistical Science, Academia Sinica (ISSAS) of Taiwan. The authors also thank Ms. Ula Tzu-Ning Kung for her English editing service, which improved the quality of this paper. This work was supported partially by the thematic project (ASCEND) of Academia Sinica (Taiwan) grant number AS-109-TP-M07 and the Ministry of Science and Technology (Taiwan) grant numbers 107-2118-M-001-011MY3 and 109-2321-B-001-013.
References
1. Albert, R., Barabási, A.L.: Statistical mechanics of complex networks. Rev. Mod. Phys. 74(1), 47 (2002)
2. Albert, R., Jeong, H., Barabási, A.L.: Diameter of the world-wide web. Nature 401(6749), 130–131 (1999)
3. Arnold, B.C.: Pareto Distributions. Chapman and Hall/CRC, Boca Raton (2015)
4. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
5. Barabási, A.L., Albert, R., Jeong, H.: Mean-field theory for scale-free random networks. Physica A Stat. Mech. Appl. 272(1–2), 173–187 (1999)
6. Bianconi, G., Barabási, A.L.: Competition and multiscaling in evolving networks. Europhys. Lett. 54(4), 436 (2001)
7. Bollobás, B.E., Riordan, O., Spencer, J., Tusnády, G.: The degree sequence of a scale-free random graph process. Random Struct. Algorithms 18(3), 279–290 (2001)
8. Brent, R.P.: Algorithms for Minimization Without Derivatives. Courier Corporation (2013)
9. Clauset, A., Shalizi, C.R., Newman, M.E.: Power-law distributions in empirical data. SIAM Rev. 51(4), 661–703 (2009)
10. Dorogovtsev, S.N., Mendes, J.F.F.: Effect of the accelerating growth of communications networks on their structure. Phys. Rev. E 63(2), 025101 (2001)
11. Dorogovtsev, S.N., Mendes, J.F.F., Samukhin, A.N.: Structure of growing networks with preferential linking. Phys. Rev. Lett. 85(21), 4633 (2000)
12. Fenner, T., Levene, M., Loizou, G.: A model for collaboration networks giving rise to a power-law distribution with an exponential cutoff. Soc. Networks 29(1), 70–80 (2007)
13. García, F., García, R., Padrino, J., Mata, C., Trallero, J., Joseph, D.: Power law and composite power law friction factor correlations for laminar and turbulent gas-liquid flow in horizontal pipelines. Int. J. Multiph. Flow 29(10), 1605–1624 (2003)
14. Gillespie, C.: Fitting heavy tailed distributions: the powerlaw package. J. Stat. Softw. 64(2), 1–16 (2015)
15. Jordan, J.: The degree sequences and spectra of scale-free random graphs. Random Struct. Algorithms 29(2), 226–242 (2006)
16. Jung, H., Lee, J.G., Lee, N., Kim, S.H.: Comparison of fitness and popularity: fitness-popularity dynamic network model. J. Stat. Mech. 2018(12), 123403 (2018)
17. Jung, H., Lee, J.G., Lee, N., Kim, S.H.: PTEM: a popularity-based topical expertise model for community question answering. Ann. Appl. Stat. 14(3), 1304–1325 (2020)
18. Krapivsky, P.L., Redner, S.: Organization of growing random networks. Phys. Rev. E 63(6), 066123 (2001)
19. Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A., Upfal, E.: Stochastic models for the web graph. In: Proceedings of 41st Annual Symposium on Foundation of Computer Science, pp. 57–65. IEEE (2000)
20. Mazzarisi, P., Barucca, P., Lillo, F., Tantari, D.: A dynamic network model with persistent links and node-specific latent variables, with an application to the interbank market. Eur. J. Oper. Res. 281(1), 50–65 (2020)
21. Meibom, A., Balslev, I.: Composite power laws in shock fragmentation. Phys. Rev. Lett. 76(14), 2492 (1996)
22. Mossa, S., Barthelemy, M., Stanley, H.E., Amaral, L.A.N.: Truncation of power law behavior in “scale-free” network models due to information filtering. Phys. Rev. Lett. 88(13), 138701 (2002)
23. Newman, M.E.: The structure and function of complex networks. SIAM Rev. 45(2), 167–256 (2003)
24. Newman, M.E.: Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46(5), 323–351 (2005)
25. Prieto, F., Sarabia, J.M.: A generalization of the power law distribution with nonlinear exponent. Commun. Nonlinear Sci. Numer. Simul. 42, 215–228 (2017)
De-evolution of Preferential Attachment Trees
Chen Avin and Yuri Lotker
School of Electrical and Computer Engineering, Ben Gurion University of the Negev, Beer-Sheva, Israel
[email protected], [email protected]
Abstract. Given a graph $G_t$ that is the result of a $t$-step evolutionary process, the goal of graph de-evolution of $G_t$ is to infer the structure of the graph $G_{t'}$ for $t' < t$. This general inference problem is very important for understanding the mechanisms behind complex systems like social networks and their asymptotic behavior. In this work we take a step in this direction and consider undirected, unlabeled trees that are the result of the well-known random preferential attachment process. We compute the most likely root set (possible isomorphic patient zero candidates) of the tree, as well as the most likely previous graph structure $G_{t-1}$. While one-step forward reasoning in preferential attachment is very simple, backward (past) reasoning is more complex and includes the use of the automorphism and isomorphism of graphs, which we elucidate here.
Keywords: Social networks · Preferential attachment · Trees · Time · Evolution · De-evolution
1 Introduction
Complex systems are usually dynamic and change over time in an evolutionary type of process. Understanding this process is at the core of making good predictions about the future behavior of the system. But how is it possible to understand and learn the evolution process? A basic approach could be to study the network history, i.e., the state of the system at previous times, and then to create a model for the network (system) dynamics. If we have temporal information, e.g., on node and edge arrival times, it is easy to restore previous states of the network. Unfortunately, in many cases, such information does not exist or is only partially available. Motivated by this approach, we investigate what we call the network de-evolution process, a process in which, for a given current state of the network, one tries to infer the future of the network by de-evolution, i.e., by going backwards in time from the current state to the network’s past. We believe that introducing a general theory and framework for network de-evolution will contribute to research on network science and complex systems.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 508–519, 2021. https://doi.org/10.1007/978-3-030-65351-4_41
This paper presents a step in this direction and studies the recovery of the network history based on its current state with no temporal information (network “de-evolution”). As a network model, we use an arbitrary undirected tree topology and assume a preferential attachment (PA) model [1,7,8]. For a tree $G_t$ at time $t$, we propose a novel approach for computing the most likely root set (possible isomorphic patient zero candidates), as well as the most likely previous graph structure $G_{t-1}$. At the heart of our approach lie the automorphism and isomorphism properties of the tree. We first use the fact that the probability of any tree network formed by a PA process depends only on the nodes’ degrees and does not depend on the chronology of its formation process (see Theorem 2). In turn, we show that the probability of a node being the root, and the previous network topology at time $t - 1$, depend only on the sizes of the isomorphism and automorphism classes of the investigated network (see Corollaries 2 and 3). We also present algorithms that run in $O(n^2)$ time and find the most probable root and the most probable tree topology at time $t - 1$, given the network at time $t$. The presented algorithm combines our version of a well-known algorithm that computes the size of the automorphism class of an arbitrary tree [4] with an algorithm, proposed here, that computes the size of the isomorphism class of an arbitrary tree.
2 Model and Problem Statement
We consider a random, undirected, rooted tree that is formed according to the basic PA model, which we explain next. Let $\mathcal{G}_t = (V_t, E_t)$ denote a random preferential attachment tree of size $t$, where $V_t = \{v_1, v_2, ..., v_t\}$ is the set of nodes and $E_t = \{e_1, e_2, ..., e_{t-1}\}$ is the set of edges. The process starts with an initial graph $\mathcal{G}_2$ that consists of nodes $v_1, v_2$ and the edge $(v_1, v_2)$, where $v_1$ is the root of the tree and the parent of $v_2$. At each time step $t > 2$ a new node $v_t$ arrives and selects a single neighbor $v_i \in V_{t-1} = \{v_1, v_2, ..., v_{t-1}\}$ from the nodes that arrived before time $t$. The selected neighbor, $v_i$, is called the parent of node $v_t$. The parent of $v_t$ is selected randomly according to what is known as preferential attachment (PA). Let $P_t$ be a random variable that denotes the index of the parent of $v_t$; then, the probability that $v_i$ is selected as the parent in the PA model for $t > 2$ is
$$\Pr(P_t = i) = \frac{d_i(t-1)}{\sum_{j=1}^{t-1} d_j(t-1)} = \frac{d_i(t-1)}{2(t-2)}, \tag{1}$$
where $d_i(t-1)$ is the degree of node $v_i$ in the graph at time $t - 1$ and $\sum_{j=1}^{t-1} d_j(t-1)$ is the sum of the degrees of all nodes at time $t - 1$. Note that the numbers of nodes and edges in $\mathcal{G}_t$ are $t$ and $t - 1$, respectively. Therefore, the sum of degrees at time $t$ is $\sum_{j=1}^{t} d_j(t) = 2(t - 1)$. Informally, the goal of this work is as follows. Given an unlabeled, unrooted tree $G_t$ that was generated by the PA process $\mathcal{G}_t$, we study the de-evolution of
$G_t$, namely, we want to infer how the graph looked at earlier times. As a first step, in this paper we study two questions: (i) which node is the most likely root of the tree, and (ii) which leaf is the most likely to have been the last to join the tree? Interestingly, as is evident from Eq. (1), it is simple to determine, at each time step, the most probable parents of the node that will join the graph at the next time step: these are just the nodes with the highest degree in $G_t$. However, inferring backwards how the graph looked, even for one step, is not quite as simple. Formally, assume that an unlabeled (unrooted) tree $G_t$ was generated by $\mathcal{G}_t$. Our first task is to identify the most likely root of $G_t$, $v^* \in V(G_t)$:
$$v^* = \arg\max_{v \in V(G_t)} \Pr(v \text{ is the root of } G_t \mid G_t). \tag{2}$$
Our second question concerns the last node, $\ell^*$, that joined $G_t$. Let $L(G_t)$ be the set of leaves in $G_t$, and for a leaf node $\ell \in L(G_t)$ denote by $G_t^{\ell}$ the graph $G_t$ without $\ell$. Then,
$$\ell^* = \arg\max_{\ell \in L(G_t)} \Pr(G_t^{\ell} \mid G_t). \tag{3}$$
At first glance, these problems might seem simple, but it transpires that they have not been solved before (to the best of our knowledge), and the solutions require a complex combination of several intermediate results. For example, as we will see later, automorphism numbers of graphs and the orbits of nodes will play a significant role in the solutions we obtain.
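The forward PA process of Eq. (1) is easy to simulate, which is useful for generating test trees for the backward questions. The following is a minimal sketch under the model stated above; the function names `simulate_pa_tree` and `degrees` and the dictionary representation of the tree are assumptions made for illustration.

```python
import random


def simulate_pa_tree(t, rng=None):
    """Return {node: parent} for nodes 1..t of a PA tree (the root 1 has parent None)."""
    rng = rng or random.Random()
    parent = {1: None, 2: 1}
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from the list picks node i with probability d_i / (2 (t - 2)).
    endpoints = [1, 2]
    for new in range(3, t + 1):
        chosen = rng.choice(endpoints)
        parent[new] = chosen
        endpoints.extend([chosen, new])
    return parent


def degrees(parent):
    """Degree of every node in the tree described by the parent map."""
    deg = {v: 0 for v in parent}
    for child, par in parent.items():
        if par is not None:
            deg[child] += 1
            deg[par] += 1
    return deg


if __name__ == "__main__":
    tree = simulate_pa_tree(10, random.Random(1))
    print(tree)
    print(degrees(tree))
```

The parent map keeps the arrival order, which is exactly the information that is hidden in the de-evolution problem; dropping the labels yields an input instance for Eqs. (2) and (3).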
3 Related Work
We now briefly review studies that, to some degree, share common threads with our investigation of graph evolution history. First, we should mention a paper by Shah and Zaman that shows a way to identify the source of rumors that spread in a network [10] under the susceptible-infected (SI) model. They were motivated by applications such as the detection of sources of viruses in computer networks and the determination of causes of cascading failures in large systems such as financial markets. The study introduces the notion of rumor centrality and claims that a node with maximal rumor centrality is the most probable rumor source node. The authors started their investigation from regular tree network topologies (trees where every node has the same degree), and then extended it to arbitrary graphs formed according to the small-world and scale-free models. Their formula for the number of permitted permutations of nodes that result in a rooted tree is similar to the formula we deduce for the size of the isomorphism class of a rooted tree. They use this intermediate conclusion to find a rumor source node, while we exploit this formula as part of our algorithm for reaching a decision regarding the previous tree topology.
A second related study, by Rozenshtein et al. [9], developed an algorithm to reconstruct an epidemic over time (in other words, the propagation of activity in a network). However, their approach is based on the Steiner tree problem, which is associated with the all-pairs shortest path and minimum spanning tree problems, and thus is quite different from ours. Finally, “Finding Adam” [3] proposes a way to identify a set of nodes of a tree that contains the root of this tree with high probability. The trees they deal with are generated by either the uniform attachment or the PA model. This study, like ours, takes into account the isomorphism class of a labeled tree, but, in contrast to our study, “Finding Adam” adopts an approximation approach and does not try to find the exact probabilities. In particular, in all of their algorithms all leaves receive the same value, meaning they could have been the last to join the tree with the same probability. We show that this is not true.
4 Preliminaries
A finite undirected graph $G$ is a pair $(V, E)$, where $V$ is a set of vertices and $E \subseteq V \times V$ is the set of edges. For a graph $G$ let $V(G)$ and $E(G)$ denote the sets of vertices and edges of $G$, respectively. The size of the graph is denoted as $|G| = |V(G)| = n$. The degree of a vertex $u$ in graph $G$ is the number of vertices adjacent to it, $d_u(G) = |\{v \mid (u, v) \in E(G)\}|$. A graph $G$ is called labeled when its vertices are distinguished from one another by names such as $v_1, v_2, ..., v_n$. The degree vector of a labeled graph $G$ of size $t$ is
$$d(G) = \{d_1(G), d_2(G), ..., d_i(G), ..., d_t(G)\}, \tag{4}$$
where $d_i(G)$ is the degree of node $v_i \in G$. A tree is a connected acyclic graph. A rooted tree is a tree with exactly one distinguished vertex, which is called the root. Let $T = (V, E)$ be a tree rooted at $w$. If $u \neq v$ is on the (unique) path from $w$ to $v$, then $u$ is an ancestor of $v$ and $v$ is a descendant of $u$. If, in addition, $(u, v) \in E$, then $u$ is the parent of $v$ and $v$ is a child of $u$. A vertex with no descendants is called a leaf. A vertex $u$ and all its descendants are called a rooted subtree of $T$, denoted $T_u$.
An isomorphism of graphs $G$ and $G'$ is a bijection $\psi$ between the vertex sets of $G$ and $G'$, $\psi : V(G) \to V(G')$, such that any two vertices are adjacent in $G$ (i.e., $(u, v) \in E(G)$) if and only if their images are adjacent in $G'$ (i.e., $(\psi(u), \psi(v)) \in E(G')$). We say that $G$ and $G'$ are isomorphic if there exists an isomorphism between these graphs, and denote it as $G \cong G'$. We write $G \cong_\psi G'$ to denote that $\psi$ is an isomorphism between $G$ and $G'$. An automorphism of a graph $G$ is an isomorphism of $G$ with itself. Two vertices $u$ and $v$ are similar or automorphically equivalent in $G$ whenever there exists an automorphism $\phi$ of $G$ in which $\phi(u) = v$. Similarity [5,11] is an equivalence relation on $V$, whose equivalence classes (also known as orbits) form the automorphism partition of $G$. Namely, two vertices in the same orbit are automorphically equivalent. For a graph $G$ and a node $u \in G$, let $Orbit(G, u)$ be the equivalence class of $u$ in the automorphism partition of $G$ [2,6].
A root-preserving isomorphism (automorphism) is an isomorphism (automorphism) where the image of the root is also the root. Two rooted trees $G$ and $G'$ are root isomorphic if there exists a root-preserving isomorphism between them, and we denote it as $G \overset{r}{\cong} G'$. The number of root-preserving automorphisms for a rooted tree $T$, denoted $Auto(T)$, can be calculated using a tree coding algorithm proposed by Colbourn and Booth [4]. They proposed the following recursive equation, which can be applied to an arbitrary tree rooted at $x_0$ where $x_0$ has $p$ children $x_1, x_2, ..., x_p$ that are partitioned into $q$ equivalence classes (orbits):
$$Auto(T) = k_0 = \prod_{s=1}^{p} k_s \prod_{t=1}^{q} c_t!, \tag{5}$$
where $k_s$ is the automorphism number of the subtree rooted at $x_s$ for $1 \le s \le p$ and $c_t$ is the size of the $t$-th equivalence class for $1 \le t \le q$. Equation (5), and hence $Auto(T)$, can be calculated in linear time by going over the tree three times: leaves-to-root, root-to-leaves, and again leaves-to-root, where in each such pass a different linear-time coding procedure $V \to \mathbb{N}$ is executed. The above algorithm [4] also finds the automorphism partition for rooted trees (i.e., the orbits). For unrooted trees, the algorithm can also work by simply making the tree rooted in a unique fashion by finding the center or bicenter of the tree. This can also easily be accomplished in linear time.
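The recursion in Eq. (5) can be sketched with a simple canonical-code approach. This is not the linear-time coding algorithm of [4]: it sorts string codes of the child subtrees, so it is slower, and it uses plain recursion, so it is meant for small trees only. The function name `rooted_automorphisms` and the child-list representation are assumptions made for illustration.

```python
from collections import Counter
from math import factorial


def rooted_automorphisms(children, root):
    """Count root-preserving automorphisms of a tree given as {node: [child, ...]}."""

    def visit(v):
        # Returns (canonical code, automorphism count) of the subtree rooted at v.
        results = [visit(c) for c in children.get(v, [])]
        count = 1
        for _, k_s in results:
            count *= k_s  # product of k_s over the children
        # Children with equal codes are root isomorphic and form one class of size c_t.
        for c_t in Counter(code for code, _ in results).values():
            count *= factorial(c_t)
        code = "(" + "".join(sorted(code for code, _ in results)) + ")"
        return code, count

    return visit(root)[1]


if __name__ == "__main__":
    star = {0: [1, 2, 3]}
    two_level = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
    print(rooted_automorphisms(star, 0))       # 3! = 6
    print(rooted_automorphisms(two_level, 0))  # 2 * 2 * 2! = 8
```

In the second example each child subtree has 2! = 2 automorphisms and the two child subtrees are interchangeable, giving 2 · 2 · 2! = 8 as prescribed by Eq. (5).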
5 Tree Formation Probability
We start by defining some properties of trees that are the result of a random PA process, $\mathcal{G}_t$.
Definition 1 (Valid Labeled Tree). We say that a tree $G_t$ of size $t$ is a valid labeled tree if and only if its nodes are labeled as $V(G_t) = \{v_1, v_2, \ldots, v_t\}$, and $G_t$ could have resulted from a random PA process $\mathcal{G}_t$.
Valid labeled trees have a unique labelling property:
Claim 1 (Chronological Property of Valid Labeled Trees). A tree is a valid labeled tree if and only if it is a rooted tree (with $v_1$ as the root) and is chronologically labeled: its node labels are $\{v_1, v_2, \ldots, v_t\}$ and, for each node $v_i$, if it has a parent $v_j$ then $j < i$.
For a valid labeled tree $G_t$ of size $t$ and $2 \le i \le t$, let $G_t[i]$ denote the induced subgraph of $G_t$ with vertex set $\{v_1, \ldots, v_i\}$, i.e., the evolution of $G_t$ up to time $i$. Next we consider unlabeled trees, which are the input to our problems. An important component in our calculations will be the isomorphism class of rooted trees. Formally,
Definition 2 (Valid Isomorphism Class). The valid isomorphism class of a rooted tree $G_t$, denoted $\Delta(G_t)$, is the set of all valid labeled trees that are root isomorphic to $G_t$:
$$\Delta(G_t) = \{G'_t \mid G'_t \text{ is a valid labeled tree and } G'_t \overset{r}{\cong} G_t\}. \tag{6}$$
The size of this isomorphism class is denoted $K(G_t) = |\Delta(G_t)|$. Note that $\Delta(G_t) \neq \emptyset$ (e.g., a valid labeling can be obtained by DFS), and that every graph in $\Delta(G_t)$ is a valid labeled tree. We can now start to tackle our first question and find the probability that $\mathcal{G}_t$ will result in a specific valid tree. We first show (what is probably known as folklore) that this probability is only a function of the tree’s degree vector.
Theorem 2. Let $G_t = (V_t, E_t)$ be a valid labeled tree with $|G_t| = t$ nodes. Then, the probability of $G_t$ is only a function of the degree vector $d(G_t)$; let
$$\Gamma(G_t) = 2(t - 1) \prod_{i=1}^{t-1} \frac{(d_i(G_t) - 1)!}{2i}, \tag{7}$$
where $0! = 1$. Then,
$$\Pr(\mathcal{G}_t = G_t) = \Gamma(G_t). \tag{8}$$
Further, expression (8) will be called the labeled tree probability.
Proof. Let $p_i$ denote the parent of node $v_i$ in $G_t$. The probability that node $v_i$, $i > 2$, selects its parent $v_{p_i}$ (at time $i$) is
$$\Pr(P_i = p_i \mid G_t[i-1]) = \frac{d_{p_i}(G_t[i-1])}{2(i-2)}.$$
Therefore, the probability that $G_t$ was formed is equal to
$$\Pr(\mathcal{G}_t = G_t) = \prod_{i=3}^{t} \Pr(P_i = p_i \mid G_t[i-1]) = \prod_{i=3}^{t} \frac{d_{p_i}(G_t[i-1])}{2(i-2)} = \frac{\prod_{i=3}^{t} d_{p_i}(G_t[i-1])}{\prod_{i=3}^{t} 2(i-2)}.$$
Now consider the expression
$$\prod_{i=3}^{t} d_{p_i}(G_t[i-1]). \tag{9}$$
For each node $v_j$ consider its degree $d_j(G_t)$ in the final graph $G_t$. How many times does node $v_j$ take part in Eq. (9) as a parent of some node $v_i$ (so $p_i = j$)? If $v_j$ is a leaf it is not a parent of any node, so it does not appear in Eq. (9). If $v_j$ is not a leaf, then since the degree of $v_j$ is $d_j(G_t) > 1$ it must appear $d_j(G_t) - 1$ times as a parent of some nodes. In particular, it appears once for each degree between one and $d_j(G_t) - 1$ (once for each new leaf that selected $v_j$ as its parent). Using $0! = 1$ we can rewrite (9) as a function of all nodes instead of a function of only parent nodes:
$$\prod_{i=3}^{t} d_{p_i}(G_t[i-1]) = \prod_{i=1}^{t} (d_i(G_t) - 1)!.$$
Note that $v_t$ must be a leaf, so $\prod_{i=1}^{t} (d_i(G_t) - 1)! = \prod_{i=1}^{t-1} (d_i(G_t) - 1)!$. To conclude, notice that $\prod_{i=3}^{t} 2(i-2) = \frac{1}{2(t-1)} \prod_{i=1}^{t-1} 2i$, so the theorem follows.
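The counting argument in the proof can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a tree is given as a parent vector in the style of Fig. 2 (entry $i$ holds the label of the parent of $v_i$, with 0 for the root), and the function names are hypothetical. It compares the step-by-step product of attachment probabilities with the closed form $\prod_{i=1}^{t-1}(d_i(G_t)-1)! \,/\, \prod_{i=3}^{t} 2(i-2)$ used in the proof.

```python
from fractions import Fraction
from math import factorial


def stepwise_probability(parent):
    """Multiply the attachment probabilities d_{p_i}(G_t[i-1]) / (2(i-2)) for i = 3..t.

    parent[i - 1] is the label of the parent of node v_i (0 for the root v_1).
    """
    t = len(parent)
    degree = [0] * (t + 1)          # degree[k] is the current degree of v_k
    degree[1] = degree[2] = 1       # at time 2 the tree is the single edge (v_1, v_2)
    prob = Fraction(1)
    for i in range(3, t + 1):
        p = parent[i - 1]
        prob *= Fraction(degree[p], 2 * (i - 2))
        degree[p] += 1
        degree[i] = 1
    return prob


def closed_form_probability(parent):
    """Closed form: prod_{i<t} (d_i - 1)! divided by prod_{i=3}^{t} 2(i-2) = 2^(t-2) (t-2)!."""
    t = len(parent)
    degree = [0] * (t + 1)
    for i in range(2, t + 1):
        degree[i] += 1
        degree[parent[i - 1]] += 1
    numerator = 1
    for i in range(1, t):           # v_t is always a leaf and contributes 0! = 1
        numerator *= factorial(degree[i] - 1)
    return Fraction(numerator, 2 ** (t - 2) * factorial(t - 2))


if __name__ == "__main__":
    p0 = [0, 1, 1, 2, 3]            # parent vector p0 of Fig. 2
    print(stepwise_probability(p0), closed_form_probability(p0))
```

Exact rational arithmetic (`Fraction`) is used so that the two expressions can be compared for equality rather than up to rounding.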
A consequence of the last theorem is that any two trees with the same degree sequence will be generated with the same probability. In particular, this is the case when two trees are isomorphic. Formally we can claim:
Corollary 1. If two valid labeled trees are isomorphic, $G_t \cong G'_t$, then the probability that each of them is generated by $\mathcal{G}_t$ is equal.
The last corollary allows us to take another step towards our goal, remove the labeling, and compute the probability of an unlabeled topology, which we call the rooted tree probability. To do this, consider a rooted tree $G_t$ (with arbitrary, or no labelling) where $|G_t| = t$, and define the rooted tree probability of $G_t$ as the probability that $\mathcal{G}_t$ will be root isomorphic to $G_t$. Following Corollary 1 we obtain:
Theorem 3. Let $G_t$ be an unlabeled rooted tree where $|G_t| = t$. The probability that $\mathcal{G}_t$ will be root isomorphic to $G_t$ is:
$$\Pr(\mathcal{G}_t \overset{r}{\simeq} G_t) = K(G_t)\,\Gamma(G_t), \tag{10}$$
where $K(G_t)$ is the size of the isomorphism class $\Delta(G_t)$, i.e., $K(G_t) = |\Delta(G_t)|$.
Proof. Recall that any rooted tree is root isomorphic to at least one valid labeled tree, and let $G'_t$ be a valid labeled tree s.t. $G'_t \overset{r}{\simeq} G_t$. Then
$$\Pr(\mathcal{G}_t \overset{r}{\simeq} G_t) = \Pr(\mathcal{G}_t \in \Delta(G_t)) = K(G_t) \Pr(\mathcal{G}_t = G'_t) = K(G_t)\,\Gamma(G'_t).$$
The equalities hold from Corollary 1 and since all the graphs in $\Delta(G_t)$ are isomorphic. Clearly, the degree sequences of $G_t$ and $G'_t$ are also the same, so $\Gamma(G_t) = \Gamma(G'_t)$, and the result follows.

5.1 Computing $K(G)$, the Size of the Valid Isomorphism Class
In this section we show how to compute, for a rooted tree $G_t$, the size of the valid isomorphism class $\Delta(G_t)$, i.e., $K(G_t)$, in linear time. Let $\Phi(G_t) = \{\phi_1, \phi_2, \ldots, \phi_k\}$ denote the set of all root-preserving automorphism functions of $G_t$. Note that $\Phi(G_t)$ is never empty since it always contains the identity function. Recall that the size of this set, $\mathrm{Auto}(G_t) = |\Phi(G_t)|$, is called the automorphism number and can be calculated in linear time [4]. Next we define $\Psi(G_t) = \{\psi_1, \psi_2, \ldots, \psi_m\}$ as the valid isomorphism functions set for the graph $G_t$. This is the set of all root-preserving isomorphism functions that map $G_t$ to a root-isomorphic valid labeled tree. Formally,
$$\Psi(G_t) = \{\psi \mid G_t \overset{r}{\simeq}_{\psi} G'_t,\ G'_t \text{ is a valid labeled tree}\}. \tag{11}$$
Figures 1 and 2 provide an example for these sets. In Fig. 1(a) we show an example of a rooted tree Gt . In Fig. 1 (b) we present the automorphism functions set, Φ(Gt ), where |Φ(Gt )| = 2, Fig. 1(c) shows the valid isomorphism function sets Ψ (Gt ) where |Ψ (Gt )| = 6. The class Δ(Gt ), which consists of three graphs, so K(Gt ) = 3, is shown in Fig. 2. Next we state the relation between K(Gt ), |Ψ (Gt )| and |Φ(Gt )|.
[Figure 1: panel (a) is a drawing of the rooted tree on the nodes vr, va, vb, vc, vd; panels (b) and (c) are the tables below.]

(b)
      vr  va  vb  vc  vd
φ1    vr  va  vb  vc  vd
φ2    vr  vb  va  vd  vc

(c)
      vr  va  vb  vc  vd   ≅
ψ1    v0  v1  v2  v3  v4   G1
ψ2    v0  v2  v1  v4  v3   G1
ψ3    v0  v1  v2  v4  v3   G2
ψ4    v0  v2  v1  v3  v4   G2
ψ5    v0  v1  v3  v2  v4   G3
ψ6    v0  v3  v1  v4  v2   G3

Fig. 1. (a) A rooted tree $G_t$ (directed for clarity) with automorphism number $|\Phi(G_t)| = 2$. (b) Two possible automorphism functions: $\phi_1$ is the identity function, and $\phi_2$. (c) The valid isomorphism functions set, $\Psi(G_t)$, where $|\Psi(G_t)| = 6$. The last column indicates the image of the isomorphism function, which is a graph from $\Delta(G_t)$ shown in Fig. 2.

[Figure 2: panel (a) is the table below; panels (b), (c) and (d) are drawings of the valid labeled trees G1, G2 and G3 on the nodes v1, ..., v5.]

(a)
t     1  2  3  4  5
p0    0  1  1  2  3
p1    0  1  1  3  2
p2    0  1  2  1  4

Fig. 2. The class $\Delta(G_t)$ of the graph from Fig. 1. (a) Class representation using three possible parent vectors $p_0$, $p_1$, $p_2$. (b) Valid labeled tree $G_1$. (c) Valid labeled tree $G_2$. (d) Valid labeled tree $G_3$.
Theorem 4. The number of valid labeled trees that are isomorphic to $G_t$, $K(G_t)$, is equal to the ratio between the size of the valid isomorphism functions set, $|\Psi(G_t)|$, and the automorphism number of the graph, $|\Phi(G_t)|$:
$$K(G_t) = \frac{|\Psi(G_t)|}{|\Phi(G_t)|}. \tag{12}$$
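As a sanity check of Eq. (12) on a small instance, the three quantities can be enumerated by brute force. The sketch below is illustrative only: it assumes the tree of Fig. 1 is the root with two children, each having one child (which is what the parent vectors of Fig. 2 describe), and the representation as a child-to-parent dictionary and the function names are hypothetical.

```python
from itertools import permutations

# Assumed shape of the rooted tree of Fig. 1, as a child -> parent map (root maps to None).
FIG1 = {"r": None, "a": "r", "b": "r", "c": "a", "d": "b"}


def relabeled_edges(parent, mapping):
    """Set of (child, parent) pairs after renaming every node through `mapping`."""
    return frozenset((mapping[c], mapping[p]) for c, p in parent.items() if p is not None)


def brute_force_counts(parent):
    """Return (|Psi|, |Phi|, K) by enumerating all root-preserving bijections."""
    nodes = list(parent)
    root = next(v for v in nodes if parent[v] is None)
    others = [v for v in nodes if v != root]
    t = len(nodes)

    # Phi: root-preserving bijections of the node set that keep the edge set unchanged.
    original = relabeled_edges(parent, {v: v for v in nodes})
    automorphisms = 0
    for perm in permutations(others):
        mapping = {root: root, **dict(zip(others, perm))}
        if relabeled_edges(parent, mapping) == original:
            automorphisms += 1

    # Psi: bijections onto {1, ..., t} with root -> 1 and every parent labeled before its child.
    valid_functions = 0
    labeled_trees = set()           # distinct images, i.e. the class Delta
    for perm in permutations(range(2, t + 1)):
        mapping = {root: 1, **dict(zip(others, perm))}
        if all(mapping[parent[v]] < mapping[v] for v in others):
            valid_functions += 1
            labeled_trees.add(relabeled_edges(parent, mapping))
    return valid_functions, automorphisms, len(labeled_trees)


if __name__ == "__main__":
    print(brute_force_counts(FIG1))
```

The enumeration is exponential in $t$ and is only meant to confirm the identity on toy inputs.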
Proof. We will prove the claim by showing that for each $G'_t \in \Delta(G_t)$ there are $|\Phi(G_t)|$ different valid isomorphism functions. Consider a valid root-preserving isomorphism function $\psi$ s.t. $\psi(G_t) = G'_t$ and a root-preserving automorphism function $\phi$ of $G_t$. We define the function composition $\psi \circ \phi^{-1}$, where $\phi^{-1}$ is the inverse function of $\phi$:
$$\psi \circ \phi^{-1} : V(G_t) \to V(G'_t), \text{ where } \forall v \in V(G_t),\ v \mapsto \psi(\phi^{-1}(v)). \tag{13}$$
To continue, we need the next claim.
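Claim 5. Let $\psi$ be a valid isomorphism function. Then, for two different valid automorphism functions $\phi$ and $\tau$, $\psi \circ \phi^{-1}$ and $\psi \circ \tau^{-1}$ are two different isomorphism functions where $\psi(G_t) = \psi \circ \phi^{-1}(G_t) = \psi \circ \tau^{-1}(G_t)$.
Proof. Clearly if $\phi \neq \tau$ then from Eq. (13) $\psi \circ \phi^{-1} \neq \psi \circ \tau^{-1}$. Next we show that for every automorphism $\phi$, $\psi(G_t) = \psi \circ \phi^{-1}(G_t)$, i.e., $(i, j) \in \psi(G_t)$ if and only if $(i, j) \in \psi \circ \phi^{-1}(G_t)$:
$$\begin{aligned}
(i, j) \in \psi(G_t) &\iff (\psi^{-1}(i), \psi^{-1}(j)) \in G_t \\
&\iff (\phi(\psi^{-1}(i)), \phi(\psi^{-1}(j))) \in G_t \\
&\iff (\psi(\phi^{-1}(\phi(\psi^{-1}(i)))), \psi(\phi^{-1}(\phi(\psi^{-1}(j))))) \in \psi \circ \phi^{-1}(G_t) \\
&\iff (i, j) \in \psi \circ \phi^{-1}(G_t).
\end{aligned}$$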
Claim 5 implies that for each $G'_t \in \Delta(G_t)$ there are $|\Phi(G_t)|$ different valid isomorphism functions. Clearly, for each $\psi(G_t) \neq \psi'(G_t)$ we have $\psi \neq \psi'$, so overall $|\Psi(G_t)| = K(G_t)|\Phi(G_t)|$ and the theorem follows.
Theorem 4 allows us to compute $K(G_t)$: we already know how to compute $|\Phi(G_t)|$ in linear time [4]; the next theorem states the size of $\Psi(G_t)$, and, in turn, allows us to also compute it in linear time.
Theorem 6. For a given rooted tree $G_t$ of size $t$, the size of the valid isomorphism functions set, $|\Psi(G_t)|$, is:
$$|\Psi(G_t)| = \frac{t!}{\prod_{u \in G_t} |T_u|}, \tag{14}$$
where $T_u$ is the sub-tree of $G_t$ rooted at $u$ and $|T_u|$ is its size. We prove Theorem 6 by providing an algorithm (Algorithm 1) that samples a valid root-preserving isomorphism function from $\Psi(G_t)$ with uniform probability. The inverse of this probability gives $|\Psi(G_t)|$.
Claim 7. Algorithm 1 generates a valid root-preserving isomorphism function, $\psi$, of $G_t$ uniformly at random.
Proof. Clearly, Algorithm 1 generates a valid isomorphism function since $\psi(G_t)$ is chronologically labeled. Now consider any valid isomorphism function $\psi \in \Psi(G_t)$. $\psi(G_t)$ is a valid labeled tree, chronologically labeled, and therefore can be generated by Algorithm 1. What is the probability that Algorithm 1 will generate $\psi$? Let $\psi^{-1}[i] = \{\psi^{-1}(v_1), \psi^{-1}(v_2), \ldots, \psi^{-1}(v_i)\}$ and note that when $\psi$ is generated, $S_i = \psi^{-1}[i]$ and $|S_i| = i$. Therefore, at step $i$ we have $\sum_{w \in N_i} |T_w| = |V \setminus S_{i-1}| = t + 1 - i$, and the probability to generate $\psi$ is
$$\Pr(S_1 = \psi^{-1}[1]) \prod_{i=2}^{t} \Pr(S_i = \psi^{-1}[i] \mid S_{i-1} = \psi^{-1}[i-1]) = \prod_{i=1}^{t} \frac{|T_{\psi^{-1}(v_i)}|}{t + 1 - i}. \tag{15}$$
However, since $\psi$ is a bijection we can rewrite Eq. (15) as $\frac{\prod_{u \in G_t} |T_u|}{t!}$. Since this holds for any $\psi \in \Psi(G_t)$ the claim holds.
We now have all the necessary tools to answer our de-evolution questions.
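The two ingredients of Theorem 4 can be combined into a short routine. The sketch below is an illustration under stated assumptions rather than the linear-time procedure of [4]: the automorphism number is obtained from a parenthesis-string canonical form of each subtree (a standard technique, swapped in here for simplicity and not linear time), the tree is given as a hypothetical child-to-parent dictionary, and the function name is invented for the example.

```python
from collections import Counter
from math import factorial, prod


def count_valid_labeled_trees(parent):
    """Return (|Psi|, |Phi|, K) for a rooted tree given as a child -> parent map.

    |Psi| follows Eq. (14); |Phi| multiplies, at every node, the factorials of the
    multiplicities of isomorphic child subtrees; K is their ratio as in Eq. (12).
    """
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)

    # Pre-order traversal; its reverse visits every child before its parent.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])

    size, shape, autos = {}, {}, {}
    for v in reversed(order):
        size[v] = 1 + sum(size[c] for c in children[v])
        child_shapes = sorted(shape[c] for c in children[v])
        shape[v] = "(" + "".join(child_shapes) + ")"   # canonical form of the subtree T_v
        autos[v] = prod(autos[c] for c in children[v])
        for multiplicity in Counter(child_shapes).values():
            autos[v] *= factorial(multiplicity)

    psi = factorial(len(parent)) // prod(size.values())
    phi = autos[root]
    return psi, phi, psi // phi


if __name__ == "__main__":
    fig1 = {"r": None, "a": "r", "b": "r", "c": "a", "d": "b"}   # assumed shape of Fig. 1
    print(count_valid_labeled_trees(fig1))
```

The product of subtree sizes is formed before dividing $t!$ by it, because dividing by one size at a time is not guaranteed to stay integral at every intermediate step.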
Algorithm 1. Random valid root-preserving isomorphism function
Require: A rooted tree $G_t$ of size $t$
Ensure: A valid root-preserving isomorphism function $\psi$ chosen uniformly at random from $\Psi(G_t)$
1: Set $\psi(\mathrm{root}(G_t)) \to v_1$
2: $S_1 = \{\mathrm{root}(G_t)\}$
3: for $i = 2$ to $t$ do
4:   $N_i = \{u \mid (u, w) \in E(G_t),\ w \in S_{i-1},\ u \notin S_{i-1}\}$
5:   Select a random $u \in N_i$ with probability $|T_u| / \sum_{w \in N_i} |T_w|$
6:   Set $\psi(u) \to v_i$
7:   $S_i = S_{i-1} \cup \{u\}$
8: end for
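Algorithm 1 translates almost line by line into code. The following is a minimal sketch under assumptions: the tree is passed as a hypothetical child-to-parent dictionary, time labels are returned as integers $1, \ldots, t$ instead of nodes $v_1, \ldots, v_t$, and the set $N_i$ is maintained incrementally as a frontier list.

```python
import random


def sample_valid_isomorphism(parent, rng=random):
    """Sample psi (node -> time label in 1..t) following the steps of Algorithm 1.

    `parent` maps every node to its parent; the root maps to None.
    `rng` is the `random` module or a `random.Random` instance.
    """
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)

    # Subtree sizes |T_u|, computed bottom-up over a reversed pre-order.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size = {}
    for v in reversed(order):
        size[v] = 1 + sum(size[c] for c in children[v])

    psi = {root: 1}
    frontier = list(children[root])      # N_i: unlabeled nodes adjacent to S_{i-1}
    for i in range(2, len(parent) + 1):
        weights = [size[u] for u in frontier]
        u = rng.choices(frontier, weights=weights, k=1)[0]
        psi[u] = i
        frontier.remove(u)
        frontier.extend(children[u])
    return psi


if __name__ == "__main__":
    fig1 = {"r": None, "a": "r", "b": "r", "c": "a", "d": "b"}   # assumed shape of Fig. 1
    print(sample_valid_isomorphism(fig1, random.Random(0)))
```

Every returned map labels a parent before its children, so its image is chronologically labeled; on the five-node example each of the six valid functions should appear with probability 1/6.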
6 De-evolution: Peeling the Tree
We are now ready to answer our first question: Given a tree $G_t$ assumed to be generated by the PA process $\mathcal{G}_t$, what node in the tree is most likely to be the root?
Definition 3 (Tree Root Likelihood).
$$v^* = \arg\max_{v \in V(G_t)} \Pr(v \text{ is the root of } G_t \mid G_t). \tag{16}$$
From Theorem 3, we can now restate the above question:
$$v^* = \arg\max_{v \in V(G_t)} \Pr(\mathcal{G}_t \overset{r}{\simeq} G_t^v \mid \mathcal{G}_t \cong G_t) = \arg\max_{v \in V(G_t)} \frac{\Pr(\mathcal{G}_t \overset{r}{\simeq} G_t^v)}{\Pr(\mathcal{G}_t \cong G_t)} = \arg\max_{v \in V(G_t)} K(G_t^v)\,\Gamma(G_t) = \arg\max_{v \in V(G_t)} K(G_t^v), \tag{17}$$
and by Theorem 4 we have:
Corollary 2 (Tree Root Likelihood).
$$v^* = \arg\max_{v \in V(G_t)} \Pr(v \text{ is the root of } G_t \mid G_t) = \arg\max_{v \in V(G_t)} \frac{|\Psi(G_t^v)|}{|\Phi(G_t^v)|}. \tag{18}$$
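Corollary 2 suggests a direct procedure: root the tree at every candidate node and keep the node with the largest $K(G_t^v)$. The sketch below illustrates this under assumptions: the unrooted tree is given as a hypothetical adjacency dictionary, the automorphism number is computed from parenthesis-string canonical forms rather than by the linear-time method of [4], and the function names are invented for the example.

```python
from collections import Counter
from math import factorial, prod


def valid_class_size(adjacency, root):
    """K(G^v) = |Psi(G^v)| / |Phi(G^v)| for the tree rooted at `root`."""
    parent = {root: None}
    order, stack = [], [root]
    while stack:                          # pre-order traversal that also fixes the parents
        v = stack.pop()
        order.append(v)
        for w in adjacency[v]:
            if w != parent[v]:
                parent[w] = v
                stack.append(w)

    size, shape, autos = {}, {}, {}
    for v in reversed(order):             # children are processed before their parent
        kids = [w for w in adjacency[v] if w != parent[v]]
        size[v] = 1 + sum(size[c] for c in kids)
        kid_shapes = sorted(shape[c] for c in kids)
        shape[v] = "(" + "".join(kid_shapes) + ")"
        autos[v] = prod(autos[c] for c in kids)
        for multiplicity in Counter(kid_shapes).values():
            autos[v] *= factorial(multiplicity)

    psi = factorial(len(order)) // prod(size.values())
    return psi // autos[root]


def root_likelihoods(adjacency):
    """Map every node v to K(G^v); the arg max is the most likely root."""
    return {v: valid_class_size(adjacency, v) for v in adjacency}


if __name__ == "__main__":
    # Hypothetical example tree with edges 0-1, 0-2, 0-3, 3-4.
    tree = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
    scores = root_likelihoods(tree)
    print(scores, max(scores, key=scores.get))
```

Since $\Gamma(G_t)$ does not depend on the choice of root, comparing the integers $K(G_t^v)$ is enough and no probabilities need to be formed.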
Since we know how to compute $K(G)$ in linear time, we can claim the following theorem:
Theorem 8. Given an unlabeled tree $G_t$ of size $t$ that is a result of the PA process, the most likely root of the tree can be found in $O(t^2)$.
We can now turn to our second question: Which leaf is most likely to have been the last to join the graph? More formally, given that at time $t$, $\mathcal{G}_t \cong G_t$, we are interested in finding what the graph was at time $t - 1$. Clearly, the last
node to join $G_t$ was a leaf, so at time $t - 1$ the graph was identical to $G_t$ but without one of its leaves. Recall that $L(G_t)$ is the set of leaves of a tree $G_t$ and consider $\ell \in L(G_t)$ to be a leaf node. Since $\ell$ is a leaf it has a single neighbor (parent) in $G_t$; let this node be $u$ and so $(\ell, u) \in E(G_t)$. Let $G_t^{-\ell}$ denote the resulting graph when node $\ell$ and the edge $(\ell, u)$ are removed from $G_t$. Note that $G_t^{-\ell}$ is the vertex induced subgraph $G_t[V \setminus \{\ell\}]$. We are interested in finding which leaf was most likely to have been the last to join $G_t$, or equivalently, what is the graph $G_t^{-\ell}$ which $\mathcal{G}_{t-1}$ is most likely to be isomorphic to. Formally, we would like to solve the following:
Definition 4 (Previous Graph Likelihood).
$$\ell^* = \arg\max_{\ell \in L(G_t)} \Pr(G_t^{-\ell} \mid G_t) = \arg\max_{\ell \in L(G_t)} \Pr(\mathcal{G}_{t-1} \cong G_t^{-\ell} \mid \mathcal{G}_t \cong G_t). \tag{19}$$
We will solve the above expression by computing analytically the probability $\Pr(G_t^{-\ell} \mid G_t)$ for a given leaf $\ell$ and a rooted tree $G_t$, and then select the most likely case among all possible roots and leaves. Assume $G_t$ is a rooted tree and let $\Delta(G_t, \ell) \subseteq \Delta(G_t)$ denote the subclass of $\Delta(G_t)$ where $\ell$ was the last node that joined $G_t$. Formally, $\Delta(G_t, \ell)$ is the set of valid labeled trees that are root isomorphic to $G_t$ and in which the image of $\ell$ is $v_t$. Namely, for each rooted tree $G \in \Delta(G_t, \ell)$, we have $G_t \overset{r}{\simeq} G$ (using $\psi$) and $\psi(\ell) = v_t$. Then, the definition of this subset is:
$$\Delta(G_t, \ell) = \{G \mid G \text{ is a valid labeled tree} \wedge G_t \overset{r}{\simeq}_{\psi} G \wedge \psi(\ell) = v_t\}. \tag{20}$$
Let the size of this subclass be denoted $K(G_t, \ell) = |\Delta(G_t, \ell)|$. We can now express the probability that $\mathcal{G}_t$ will generate a graph from $\Delta(G_t, \ell)$. Recall that all graphs in $\Delta(G_t, \ell) \subseteq \Delta(G_t)$ are isomorphic and therefore have the same labeled tree probability. Let $G'_t$ be an arbitrary graph in $\Delta(G_t, \ell)$, then
$$\Pr(\mathcal{G}_t \in \Delta(G_t, \ell)) = K(G_t, \ell) \Pr(\mathcal{G}_t = G'_t) = K(G_t, \ell)\,\Gamma(G'_t) = K(G_t, \ell)\,\Gamma(G_t). \tag{21}$$
From Eq. (21) and Eq. (10) we can now state and prove:
Theorem 9. The probability that a leaf $\ell$ joined last to a rooted $G_t$, namely that $G_t^{-\ell}$ was the previous graph, is equal to the ratio between the size of $\Delta(G_t, \ell)$ and the size of $\Delta(G_t)$:
$$\Pr(G_t^{-\ell} \mid G_t) = \Pr(\mathcal{G}_{t-1} \overset{r}{\simeq} G_t^{-\ell} \mid \mathcal{G}_t \overset{r}{\simeq} G_t) = \frac{K(G_t, \ell)}{K(G_t)}. \tag{22}$$
Recall that Theorem 4 allowed us to compute $K(G_t)$, but not directly to compute $K(G_t, \ell)$. The next theorem expresses $K(G_t, \ell)$ in terms of Theorem 4 and will enable us to use Theorem 9 to find the most likely previous graph.
Theorem 10. Consider a rooted tree $G_t$. Let $\ell \in L(G_t)$ be a leaf and let $a_\ell$ be the parent node of $\ell$. Let $K(G_t, \ell)$ be the size of the subclass $\Delta(G_t, \ell)$ and $K(G_t^{-\ell})$ be the size of the isomorphism class of $G_t^{-\ell}$. Let $|\mathrm{Orbit}(G_t^{-\ell}, a_\ell)|$ be the orbit size of node $a_\ell$ in the graph $G_t^{-\ell}$. Then,
$$K(G_t, \ell) = K(G_t^{-\ell})\,|\mathrm{Orbit}(G_t^{-\ell}, a_\ell)|. \tag{23}$$
Proof. Recall that the size of $G_t^{-\ell}$ is $t - 1$ and consider any valid labeled tree $G'_{t-1} \in \Delta(G_t^{-\ell})$ of size $t - 1$ and $a_\ell \in G_t^{-\ell}$. Recall that $G_t^{-\ell} \overset{r}{\simeq} G'_{t-1}$ and let $v_j \in G'_{t-1}$ be a (possible) image of $a_\ell$. Clearly, $|\mathrm{Orbit}(G_t^{-\ell}, a_\ell)| = |\mathrm{Orbit}(G'_{t-1}, v_j)|$. Now for each $v_i \in \mathrm{Orbit}(G'_{t-1}, v_j)$ we can create a different valid labeled tree $G'_t \in \Delta(G_t, \ell)$, of size $t$, by adding the node $v_t$ to $G'_{t-1}$ and setting its parent to $v_i$. Since $K(G_t^{-\ell}) = |\Delta(G_t^{-\ell})|$ the claim follows.
Finally, we can conclude our journey. Using Theorems 10 and 4 we reach our last expression that will be used to compute the most likely previous graph.
Corollary 3 (Previous Graph Probability).
$$\Pr(G_t^{-\ell} \mid G_t) = \frac{K(G_t^{-\ell})}{K(G_t)}\,|\mathrm{Orbit}(G_t^{-\ell}, a_\ell)| = \frac{|\Psi(G_t^{-\ell})|\,|\Phi(G_t)|}{|\Psi(G_t)|\,|\Phi(G_t^{-\ell})|}\,|\mathrm{Orbit}(G_t^{-\ell}, a_\ell)|. \tag{24}$$
Since there could be at most a linear number of leaves we can also claim:
Theorem 11. Given an unlabeled rooted tree $G_t$ of size $t$ that is a result of the PA process, the most likely graph $G_t^{-\ell}$ for a leaf $\ell \in L(G_t)$ can be found in $O(t^2)$.
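The previous-graph probability of Corollary 3 can be evaluated leaf by leaf. The sketch below is illustrative and rests on two assumptions that are not taken from the text: subtree isomorphism is decided with parenthesis-string canonical forms, and the orbit size of a node under root-preserving automorphisms is obtained by multiplying, along the path to the root, the number of siblings whose subtree has the same canonical form. The tree representation and function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from math import factorial, prod


def analyze(parent):
    """Return (K, shape, children) for a rooted tree given as a child -> parent map."""
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    size, shape, autos = {}, {}, {}
    for v in reversed(order):                  # children before parents
        size[v] = 1 + sum(size[c] for c in children[v])
        kid_shapes = sorted(shape[c] for c in children[v])
        shape[v] = "(" + "".join(kid_shapes) + ")"
        autos[v] = prod(autos[c] for c in children[v])
        for multiplicity in Counter(kid_shapes).values():
            autos[v] *= factorial(multiplicity)
    k = factorial(len(parent)) // prod(size.values()) // autos[root]
    return k, shape, children


def orbit_size(node, parent, shape, children):
    """Orbit size of `node` under root-preserving automorphisms (assumed recursive rule)."""
    result = 1
    while parent[node] is not None:
        p = parent[node]
        result *= sum(1 for s in children[p] if shape[s] == shape[node])
        node = p
    return result


def previous_graph_probabilities(parent):
    """Map every leaf l to K(G - l) * |Orbit(G - l, a_l)| / K(G)."""
    k_full, _, children = analyze(parent)
    probabilities = {}
    for leaf, a in parent.items():
        if a is None or children[leaf]:
            continue                           # skip the root and internal nodes
        reduced = {v: p for v, p in parent.items() if v != leaf}
        k_reduced, shape, reduced_children = analyze(reduced)
        orbit = orbit_size(a, reduced, shape, reduced_children)
        probabilities[leaf] = Fraction(k_reduced * orbit, k_full)
    return probabilities


if __name__ == "__main__":
    tree = {"r": None, "a": "r", "b": "r", "c": "a"}   # hypothetical four-node example
    print(previous_graph_probabilities(tree))
```

Leaves that lie in the same orbit of $G_t$ lead to isomorphic previous graphs and therefore receive the same value, so the values add up to one only over leaves in distinct orbits.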
References
1. Barabási, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
2. Borgatti, S.P., Everett, M.G.: Notions of position in social network analysis. Sociol. Methodol. 22, 1–35 (1992)
3. Bubeck, S., Devroye, L., Lugosi, G.: Finding Adam in random growing trees. arXiv preprint arXiv:1411.3317 (2014)
4. Colbourn, C.J., Booth, K.S.: Linear time automorphism algorithms for trees, interval graphs, and planar graphs. SIAM J. Comput. 10(1), 203–225 (1981)
5. Everett, M.G.: Role similarity and complexity in social networks. Soc. Netw. 7(4), 353–359 (1985)
6. Everett, M.G., Borgatti, S.: Calculating role similarities: an algorithm that helps determine the orbits of a graph. Soc. Netw. 10(1), 77–91 (1988)
7. Newman, M.: Networks: An Introduction. Oxford University Press, Oxford (2010)
8. Price, D.D.S.: A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27(5), 292–306 (1976)
9. Rozenshtein, P., Gionis, A., Prakash, B.A., Vreeken, J.: Reconstructing an epidemic over time. In: KDD 2016, August 13–17, San Francisco, CA, USA (2016)
10. Shah, D., Zaman, T.: Rumors in a network: who's the culprit? IEEE Trans. Inf. Theory 57(8), 5163–5181 (2011)
11. Winship, C.: Thoughts about roles and relations: an old document revisited. Soc. Netw. 10(3), 209–231 (1988)
An Algorithmic Information Distortion in Multidimensional Networks
Felipe S. Abrahão^{1,3,4} (✉), Klaus Wehmuth^{1,3,4}, Hector Zenil^{2,3,4}, and Artur Ziviani^{1,3,4}
1 National Laboratory for Scientific Computing (LNCC), Petropolis, RJ 25651-075, Brazil {fsa,klaus,ziviani}@lncc.br
2 Oxford Immune Algorithmics, Reading RG1 3EU, UK
3 Algorithmic Dynamics Lab, Unit of Computational Medicine, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institute, 171 77 Stockholm, Sweden [email protected]
4 Algorithmic Nature Group, Laboratoire de Recherche Scientifique (LABORES) for the Natural and Digital Sciences, 75005 Paris, France
Abstract. Network complexity, network information content analysis, and lossless compressibility of graph representations have played an important role in network analysis and network modeling. As multidimensional networks, such as time-varying, multilayer, or dynamic multilayer networks, gain more relevancy in network science, it becomes crucial to investigate in which situations universal algorithmic methods based on algorithmic information theory applied to graphs cannot be straightforwardly imported into the multidimensional case. In this direction, as a worst-case scenario of lossless compressibility distortion that increases linearly with the number of distinct dimensions, this article presents a counter-intuitive phenomenon that occurs when dealing with networks within non-uniform and sufficiently large multidimensional spaces. In particular, we demonstrate that the algorithmic information necessary to encode multidimensional networks that are isomorphic to logarithmically compressible monoplex networks may display exponentially larger distortions in the general case.
Keywords: Multidimensional networks · Lossless compression · Network complexity · Information distortion

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 520–531, 2021. https://doi.org/10.1007/978-3-030-65351-4_42
Algorithmic information theory (AIT) gives a set of formal universal tools for studying network complexity in the form of data compression, irreducible information content, or randomness of individual networks [18–20,23], especially in the case these networks were not generated, constructed, or defined by stochastic processes. In addition, such an algorithmic approach to the study of complex
networks (and not only graphs or networks, but tensors in general) has presented important refinements of more traditional statistical approaches, for example in the context of: automorphism group size [27]; graph summarization [24,25]; typicality and null models by replacing the principle of maximum entropy with the principle of maximum algorithmic randomness [26]; and the reducibility problem of multiplex networks into aggregate monoplex networks [20]. Since proper representations of multidimensional networks into new extensions of graph-theoretical abstractions have been one of the central topics of investigation in network science [3,10,15,16], the situations in which previous methods based on AIT cannot be straightforwardly imported into the multidimensional case also become an important question. In this sense, we present in this article a theoretical analysis of worst-case distortions with respect to the algorithmic complexity of node-aligned multidimensional networks, in particular those represented by multiaspect graphs [21,22] with a large number of non-uniform aspects. We show that in the general case of a multidimensional network there are algorithmic information distortions that grow linearly with the number of aspects and exponentially with respect to the algorithmic information of a monoplex network, whereas both the multidimensional network and this monoplex network are isomorphic structures. The results in this article hold independently of the choice of the encoding method or the universal programming language. This is because, given any two distinct encoding methods or any two distinct universal programming languages, the algorithmic complexity of an object represented in one way or the other can only differ by a constant whose value only depends on the choice of encoding methods or universal programming languages, but not on the choice of the object [8,12,17]. 
That is, algorithmic complexity is pairwise invariant for any two arbitrarily chosen encodings. Although only dealing with pairs of isomorphic objects in addition to this encoding invariance, we will see later on in Corollary 2 that algorithmic information distortions can in fact result from changing the multidimensional spaces into which isomorphic copies of the objects are embedded. Thus, contributing to multidimensional network complexity analysis, our results establish a worst-case error margin for topological information content evaluation and lossless compressibility. In addition, it shows the importance of multidimensional network encodings into which the multidimensional space itself is also encoded. The article is organized as follows. In Sect. 2, we recover necessary concepts, definitions, and results from the literature. In Sect. 3, we study basic properties of encoded multiaspect graphs. In Sect. 4, we demonstrate the main results in Theorem 1 and Corollaries 1 and 2. Section 5 concludes the paper.
2 Background
2.1 Multiaspect Graphs
We directly base our notation regarding classical graphs on [4,5,11] and regarding multiaspect graphs (MAGs) on [21,22]. In order to avoid ambiguities, minor
differences in the notation from [21,22] will be introduced here. In particular, the notation of MAG $H = (A, E)$ is replaced with $\mathcal{G} = (\mathcal{A}, \mathcal{E})$, where the list $A$ of aspects is replaced with $\mathcal{A}$ and the composite edge set $E$ is replaced with $\mathcal{E}$. This way, note that $\mathcal{A} = (\mathcal{A}(\mathcal{G})[1], \ldots, \mathcal{A}(\mathcal{G})[i], \ldots, \mathcal{A}(\mathcal{G})[p])$ is a list of sets, where each set in this list is an aspect (or node dimension [1]) denoted by $\mathcal{A}(\mathcal{G})[i]$. The companion tuple of a MAG $\mathcal{G}$ is then denoted by $\tau(\mathcal{G})$, where $\tau(\mathcal{G}) = (|\mathcal{A}(\mathcal{G})[1]|, \ldots, |\mathcal{A}(\mathcal{G})[p]|)$ and $p$ is called the order of the MAG. As established in [22], it is important to note that the companion tuple completely determines the size of the node-aligned set $\mathbb{V}(\mathcal{G}) = \mathcal{A}(\mathcal{G})[1] \times \cdots \times \mathcal{A}(\mathcal{G})[p]$ of all composite vertices $\mathbf{v} = (a_1, \ldots, a_p)$ of $\mathcal{G}$, and as a direct consequence also determines the size of the set $\mathbb{E}(\mathcal{G}) = \mathbb{V}(\mathcal{G}) \times \mathbb{V}(\mathcal{G})$ of all possible composite edges $e = ((a_1, \ldots, a_p), (b_1, \ldots, b_p))$ of $\mathcal{G}$. This way, for every MAG $\mathcal{G}$, one has $\mathcal{E}(\mathcal{G}) \subseteq \mathbb{E}(\mathcal{G})$. In this article, we employ hereafter the term multidimensional networks to refer to node-aligned multidimensional networks that can be mathematically represented by MAGs. In addition, we denote an undirected MAG without self-loops by $\mathcal{G}_c = (\mathcal{A}, \mathcal{E})$, so that the set $\mathbb{E}_c$ of all possible undirected and non-self-loop composite edges is defined by
$$\mathbb{E}_c(\mathcal{G}_c) := \{\{\mathbf{u}, \mathbf{v}\} \mid \mathbf{u}, \mathbf{v} \in \mathbb{V}(\mathcal{G}_c)\}$$
and $\mathcal{E}(\mathcal{G}_c) \subseteq \mathbb{E}_c(\mathcal{G}_c)$ always holds. In a direct analogy to simple graphs, we refer to these MAGs $\mathcal{G}_c$ as simple MAGs.
Regarding graphs, we follow the common notation and nomenclature [4,11,13]: we denote a general (directed or undirected) graph by $G = (V, E)$, where $V$ is the finite set of vertices and $E \subseteq V \times V$; if a graph only contains undirected edges and does not contain self-loops, then it is called a simple graph. A graph $G$ is (vertex-)labeled when the members of $V$ are distinguished from one another by labels such as $v_1, v_2, \ldots, v_{|V|}$. If a simple graph is labeled by natural numbers, i.e., $V = \{1, \ldots, n\}$ with $n \in \mathbb{N}$, then it is called a classical graph. For the present purposes of this article, all graphs $G$ will be classical graphs and all MAGs will be simple MAGs. One may adopt the convention of calling the elements of the first aspect of a MAG vertices, i.e., $\mathcal{A}(\mathcal{G})[1] = V(\mathcal{G})$. Thus, a classical graph $G$ is a labeled first-order (i.e., $p = 1$) simple MAG $\mathcal{G}_c$ with $V(G) = \mathbb{V}(\mathcal{G}_c) = \{1, \ldots, |\mathbb{V}(\mathcal{G}_c)|\}$. Note that the term 'vertex' should not be confused with the term 'composite vertex', since they refer to the same entity only in the case of first-order MAGs.
As established in [21], one can define a MAG-graph isomorphism analogously to the classical notion of graph isomorphism: a MAG $\mathcal{G}$ is isomorphic to a graph $G$ iff there is a bijective function $f : \mathbb{V}(\mathcal{G}) \to V(G)$ such that $e \in \mathcal{E}(\mathcal{G}) \iff (f(\pi_o(e)), f(\pi_d(e))) \in E(G)$, where $\pi_o$ is a function that returns the origin composite vertex of a composite edge and $\pi_d$ is a function that returns the destination composite vertex of a
composite edge. In order to avoid ambiguities with the classical isomorphism in graphs, which is usually a vertex label transformation, we call: such an isomorphism between a MAG and a graph from [21] a MAG-graph isomorphism; the usual isomorphism between graphs [4,11] a graph isomorphism; and the isomorphism between two MAGs $\mathcal{G}$ and $\mathcal{G}'$ (i.e., $(\mathbf{u}, \mathbf{v}) \in \mathcal{E}(\mathcal{G})$ iff $(f(\mathbf{u}), f(\mathbf{v})) \in \mathcal{E}(\mathcal{G}')$) a MAG isomorphism. It is shown in [21] that a MAG is isomorphically equivalent to a graph:
Theorem 1. For every MAG $\mathcal{G}$ of order $p > 0$, where all aspects are non-empty sets, there is a unique (up to a graph isomorphism) graph $G_{\mathcal{G}} = (V, E)$ that is MAG-graph-isomorphic to $\mathcal{G}$, where
$$|V(G_{\mathcal{G}})| = \prod_{n=1}^{p} |\mathcal{A}(\mathcal{G})[n]| = |\mathbb{V}(\mathcal{G})|.$$
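To make the MAG-graph isomorphism concrete, one possible bijection $f$ between composite vertices and the vertex labels $1, \ldots, |V|$ of a classical graph is a mixed-radix index. The sketch below is only one such choice, made for illustration (the theorem asserts existence and does not prescribe this construction); aspects are assumed to be $\{1, \ldots, n_i\}$, and the function names are hypothetical.

```python
from itertools import product
from math import prod


def composite_vertices(companion_tuple):
    """All composite vertices (a_1, ..., a_p) with a_i in {1, ..., companion_tuple[i]}."""
    return list(product(*(range(1, size + 1) for size in companion_tuple)))


def to_graph_vertex(composite_vertex, companion_tuple):
    """Mixed-radix index of a composite vertex, as a label in {1, ..., prod(companion_tuple)}."""
    index = 0
    for coordinate, size in zip(composite_vertex, companion_tuple):
        index = index * size + (coordinate - 1)
    return index + 1


def mag_to_graph(composite_edges, companion_tuple):
    """Relabel every composite edge (u, v) as the graph edge (f(u), f(v))."""
    return {
        (to_graph_vertex(u, companion_tuple), to_graph_vertex(v, companion_tuple))
        for u, v in composite_edges
    }


if __name__ == "__main__":
    tau = (2, 3, 2)                               # hypothetical companion tuple
    print(len(composite_vertices(tau)), prod(tau))
    print(mag_to_graph({((1, 1, 1), (2, 3, 2))}, tau))
```

Note that computing $f$ requires the companion tuple, which hints at why the multidimensional space itself carries information that a bare graph does not.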
However, we shall show in this article that, although both a MAG and its isomorphic graph can be encoded and both represent the same abstract relational structure, they may diverge in terms of compressibility or algorithmic information content in the general case.

2.2 Algorithmic Information Theory (AIT)
In this section, we recover some basic notations and definitions from the literature regarding algorithmic information theory (aka Kolmogorov complexity theory or Solomonoff-Kolmogorov-Chaitin complexity theory). For an introduction to these concepts and notation, see [8,9,12,17]. First, regarding some basic notation, let $l(x)$ denote the length of a string $x \in \{0,1\}^*$. Let $(x)_2$ denote the binary representation of the number $x \in \mathbb{N}$. Let $x \upharpoonright_n$ denote the ordered sequence of the first $n$ bits of the fractional part in the binary representation of $x \in \mathbb{R}$. That is, $x \upharpoonright_n = x_1 x_2 \ldots x_n$, where $(x)_2 = y.x_1 x_2 \ldots x_n x_{n+1} \ldots$ with $y \in \{0,1\}^*$ and $x_1, x_2, \ldots, x_n \in \{0,1\}$. We denote the result of the computation of an arbitrary Turing machine $\mathbf{M}$ with input $x \in \mathbf{L}$ by the partial computable function $\mathbf{M} : \mathbf{L} \to \mathbf{L}$. Let $\mathbf{L_U}$ denote a binary prefix-free (or self-delimiting) universal programming language for a prefix universal Turing machine $\mathbf{U}$. As usual, let $\langle \cdot, \cdot \rangle$ denote an arbitrary computable bijective pairing function [12,17], which can be recursively extended in order to encode any finite ordered $n$-tuple in the form $\langle \cdot, \ldots, \cdot \rangle$. Let $w^*$ denote the lexicographically first $p \in \mathbf{L_U}$ such that $l(p)$ is minimum and $\mathbf{U}(p) = w$. The algorithmic information content of an object $w$ is given by the (unconditional) prefix algorithmic complexity (also known as K-complexity, prefix Kolmogorov complexity, self-delimited program-size complexity, or Solomonoff-Kolmogorov-Chaitin complexity for prefix universal Turing machines), denoted by $K(w)$, which is the length of the shortest program $w^* \in \mathbf{L_U}$ such that $\mathbf{U}(w^*) = w$. The conditional prefix algorithmic complexity of a binary string $y$ given a binary string $x$, denoted by $K(y \mid x)$, is the length of the shortest program $w \in \mathbf{L_U}$ such that $\mathbf{U}(\langle x, w \rangle) = y$.
With respect to weak asymptotic dominance of function f by a function g, we employ the usual f (x) = O(g(x)) for the big O notation when f is asymptotically upper bounded by g; and with respect to strong asymptotic dominance by a function g, we employ the usual f (x) = o(g(x)) when g dominates f .
3 Basic Properties of Encoded Multiaspect Graphs
In a general sense, a MAG $\mathcal{G}_c$ is said to be encodable (i.e., recursively labeled, or with a univocal computably ordered data representation) given $\tau(\mathcal{G}_c)$ iff there is an algorithm that, given the companion tuple $\tau(\mathcal{G}_c)$ as input, computes a bijective ordering of composite edges $e \in \mathbb{E}_c(\mathcal{G}_c)$ from composite vertices $\mathbf{v} \in \mathbb{V}(\mathcal{G}_c)$. That is, if the companion tuple $\tau(\mathcal{G}_c)$ of the MAG $\mathcal{G}_c$ is known, then one can computably retrieve the position of any composite edge $e = \{\mathbf{u}, \mathbf{v}\}$ in the chosen data representation of $\mathcal{G}_c$ from both composite vertices $\mathbf{u}$ and $\mathbf{v}$, and vice-versa.^1 This way, following the usual definition of encodings, a MAG is encodable given $\tau(\mathcal{G}_c)$ iff there is an algorithm that, given $\tau(\mathcal{G}_c)$ as input, can univocally encode any possible $\mathcal{E}(\mathcal{G}_c)$ that shares the same companion tuple. As expected, MAGs that have every element of their aspects labeled as a natural number can always be encoded. The proof of Lemma 1 follows directly from the definition of MAG and the recursive bijective nature of the pairing function.^2 In other words, a MAG can always be encoded if the information necessary to determine the companion tuple $\tau(\mathcal{G}_c)$ is previously given.
Lemma 1. Any arbitrary simple MAG $\mathcal{G}_c$ with $\mathcal{A}(\mathcal{G}_c)[i] = \{1, \ldots, |\mathcal{A}(\mathcal{G}_c)[i]|\} \subset \mathbb{N}$, where $|\mathcal{A}(\mathcal{G}_c)[i]| \in \mathbb{N}$ and $1 \le i \le p = |\mathcal{A}(\mathcal{G}_c)| \in \mathbb{N}$, is encodable given $\tau(\mathcal{G}_c)$.
Note that there is then an algorithm that, given a bit string $x \in \{0,1\}^*$ of length $|\mathbb{E}_c(\mathcal{G}_c)|$ as input, computes a composite edge set $\mathcal{E}(\mathcal{G}_c)$ and there is another algorithm that, given the encoded composite edge set $\mathcal{E}(\mathcal{G}_c)$ as input, returns a string $x$. Such strings univocally represent (up to a MAG isomorphism or up to a reordering of composite edges) the characteristic function (or indicator function) of membership in the set $\mathcal{E}(\mathcal{G}_c)$, and thus we call them characteristic strings of the MAG:
Definition 1. Let $(e_1, \ldots, e_{|\mathbb{E}_c(\mathcal{G}_c)|})$ be any arbitrary ordering of all possible composite edges of a simple MAG $\mathcal{G}_c$. We say that a string $x \in \{0,1\}^*$ with $l(x) = |\mathbb{E}_c(\mathcal{G}_c)|$ is a characteristic string of a simple MAG $\mathcal{G}_c$ iff, for every $e_j \in \mathbb{E}_c(\mathcal{G}_c)$, $e_j \in \mathcal{E}(\mathcal{G}_c) \iff$ the $j$-th digit in $x$ is 1, where $1 \le j \le l(x)$.
1 An explicit formal definition of encodability (i.e., recursive labeling) given the companion tuple $\tau(\mathcal{G}_c)$ can be found for example in [2].
2 The reader can find a proof of Lemma 1 in [2].
In order to ensure uniqueness of representations (now only up to a MAG automorphism) from which the algorithmic complexity is calculated, one may also choose to encode a MAG into a string-based representation using the pairing function $\langle \cdot, \cdot \rangle$ and a fixed ordering/indexing of the composite edges:
Definition 2. Let $(e_1, \ldots, e_{|\mathbb{E}_c(\mathcal{G}_c)|})$ be any arbitrary ordering of all possible composite edges of a simple MAG $\mathcal{G}_c$. Then, $\langle \mathcal{E}(\mathcal{G}_c) \rangle$ denotes the composite edge set string $\langle \langle e_1, z_1 \rangle, \ldots, \langle e_n, z_n \rangle \rangle$ such that $z_i = 1 \iff e_i \in \mathcal{E}(\mathcal{G}_c)$, where $z_i \in \{0,1\}$ with $1 \le i \le n = |\mathbb{E}_c(\mathcal{G}_c)|$.
In the case of graphs (or monoplex networks), we recall that there is always a unified and decidable way to encode a sequence of all possible undirected edges given any unordered pair $\{x, y\}$ of natural numbers $x, y \in \mathbb{N}$, for example by encoding characteristic strings or adjacency matrices of arbitrary finite size. Thus, encoding classical graphs with characteristic strings or with composite edge set strings is Turing equivalent and, therefore, it is also equivalent in terms of algorithmic information. This is indeed an underlying basic property previously explored, e.g., in [6,23,24]. Additionally, in the case of infinite graphs, it was shown in [14] that encoding with infinite characteristic strings may generate other counter-intuitive phenomena with respect to algorithmic randomness. The present article only deals with finite MAGs and graphs and with infinite families of finite MAGs and graphs. Unlike classical graphs, we shall see later on in Corollary 1 that the relationship between characteristic strings and composite edge set strings in the case of simple MAGs does not behave so well. Nevertheless, if the ordering assumed in Definition 1 matches the ordering in Definition 2, we have in Lemma 2 below that both the MAG and its respective characteristic string are indeed "equivalent" in terms of algorithmic information, except for the minimum information necessary to encode the multidimensional space (e.g., the algorithmic information of the encoded companion tuple in the form $\langle \tau(\mathcal{G}_c) \rangle = \langle |\mathcal{A}(\mathcal{G}_c)[1]|, \ldots, |\mathcal{A}(\mathcal{G}_c)[p]| \rangle$). As expected, the proof follows directly from the fact that an ordering of composite edges is always embedded into the notion of encodability by composite edge set strings (a complete proof can be found in [2]).
Lemma 2. Let $x \in \{0,1\}^*$. Let $\mathcal{G}_c$ be an encodable MAG given $\tau(\mathcal{G}_c)$ such that $x$ is the respective characteristic string. Then,
$$K(\langle \mathcal{E}(\mathcal{G}_c) \rangle \mid x) \le K(\langle \tau(\mathcal{G}_c) \rangle) + O(1) \tag{1}$$
$$K(x \mid \langle \mathcal{E}(\mathcal{G}_c) \rangle) \le K(\langle \tau(\mathcal{G}_c) \rangle) + O(1) \tag{2}$$
$$K(x) = K(\langle \mathcal{E}(\mathcal{G}_c) \rangle) \pm O\big(K(\langle \tau(\mathcal{G}_c) \rangle)\big). \tag{3}$$
Note that, since a graph is a MAG of order 1, and in this case characteristic strings and composite edge set strings become Turing equivalent, Lemma 2 can be improved in the case of graphs so that one can eliminate $K(\langle \tau(\mathcal{G}_c) \rangle)$ in Eqs. (1) and (2). In addition, one can replace $O\big(K(\langle \tau(\mathcal{G}_c) \rangle)\big)$ in Eq. (3) with $O(1)$.
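The difference between the two representations in Definitions 1 and 2 can be illustrated with a small sketch. Everything below is an assumption made for the example: aspects are taken to be $\{1, \ldots, n_i\}$, the ordering of possible composite edges is the lexicographic one produced by `itertools`, a Python list of pairs stands in for the pairing-function encoding, and the function names are hypothetical.

```python
from itertools import combinations, product


def possible_edges(companion_tuple):
    """A fixed ordering e_1, ..., e_n of all undirected non-self-loop composite edges."""
    vertices = list(product(*(range(1, n + 1) for n in companion_tuple)))
    return list(combinations(vertices, 2))


def characteristic_string(edges, companion_tuple):
    """Bit j is 1 iff the j-th possible composite edge is present (Definition 1)."""
    present = {frozenset(e) for e in edges}
    return "".join(
        "1" if frozenset(e) in present else "0" for e in possible_edges(companion_tuple)
    )


def edge_set_string(edges, companion_tuple):
    """List of (e_i, z_i) pairs, standing in for the composite edge set string (Definition 2)."""
    present = {frozenset(e) for e in edges}
    return [(e, 1 if frozenset(e) in present else 0) for e in possible_edges(companion_tuple)]


def edges_from_characteristic_string(x, companion_tuple):
    """Decoding x requires the companion tuple in order to know which edge each bit refers to."""
    return {e for e, bit in zip(possible_edges(companion_tuple), x) if bit == "1"}


if __name__ == "__main__":
    x = characteristic_string([((1, 1), (2, 2))], (2, 2))
    print(x)
    print(edges_from_characteristic_string(x, (2, 2)))
    print(edges_from_characteristic_string(x, (4,)))    # same bits, different space
```

The same bit string decodes to different composite edge sets under different companion tuples, whereas the list of pairs already names the composite vertices; this is the informal reason why the companion tuple appears in the bounds of Lemma 2.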
4 A Worst-Case Algorithmic Information Distortion from Increasing the Number of Aspects
Basically, Lemma 2 assures that the information contained in a simple MAG $\mathcal{G}_c$ and in the characteristic string are the same, except for the algorithmic information necessary to computably determine the companion tuple. Unfortunately, one can show in Theorem 2 below that this information deficiency between the data representation of a MAG (in the form e.g. $\langle \mathcal{E}(\mathcal{G}_c) \rangle$) and its characteristic string cannot be much more improved in general. In other words, as we show^3 in Theorem 2, there are worst-case scenarios of multidimensional spaces in which the algorithmic information necessary for retrieving the encoded form of the MAG from its characteristic string is close (except for a logarithmic term) to the upper bound given by Eq. (1) in Lemma 2. This shows a fundamental difference between encoding MAGs with characteristic strings (or, equivalently, adjacency matrices [24]) and encoding MAGs with composite edge set strings.
Theorem 2. There are arbitrarily large encodable simple MAGs $\mathcal{G}_c$ given $\tau(\mathcal{G}_c)$ such that
$$K(\langle \tau(\mathcal{G}_c) \rangle) + O(1) \ge K(\langle \mathcal{E}(\mathcal{G}_c) \rangle \mid x) \ge K(\langle \tau(\mathcal{G}_c) \rangle) - O\big(\log_2 K(\langle \tau(\mathcal{G}_c) \rangle)\big)$$
with $K(\langle \mathcal{E}(\mathcal{G}_c) \rangle) \ge p - O(1)$ and $K(x) = O(\log_2(p))$, where $x$ is the respective characteristic string and $p$ is the order of the MAG $\mathcal{G}_c$.
Proof. The main idea of the proof is to define an arbitrary companion tuple such that the algorithmic complexity of the characteristic string is sufficiently small compared to the algorithmic complexity of the companion tuple, while we can prove that there is a computable procedure that always recovers the companion tuple from $\langle \mathcal{E}(\mathcal{G}_c) \rangle$. First, let $\mathcal{G}_c$ be any simple MAG with $\tau(\mathcal{G}_c) = (|\mathcal{A}(\mathcal{G}_c)[1]|, \ldots, |\mathcal{A}(\mathcal{G}_c)[p]|)$ such that
$\mathcal{A}(\mathcal{G}_c)[i] = \{1, 2\} \iff$ the $i$-th digit of $w$ is 1
$\mathcal{A}(\mathcal{G}_c)[i] = \{1\} \iff$ the $i$-th digit of $w$ is 0
where $p \in \mathbb{N}$ and $w \in \{0,1\}^*$ are arbitrary. Since $w$ is arbitrary, let $w$ be a long enough finite initial segment of a 1-random real number $y$. Remember that, if $y$ is a 1-random real number (i.e., an algorithmically random infinite sequence [8,12]), then $K(y \upharpoonright_n) \ge n - O(1)$, where $n \in \mathbb{N}$ is arbitrary. From Lemma 1, we have that $\mathcal{G}_c$ is encodable given $\tau(\mathcal{G}_c)$. Therefore, there is a program $q$ that represents an algorithm running on a prefix universal Turing machine $\mathbf{U}$ that proceeds as follows:
(i) receive $\langle \mathcal{E}(\mathcal{G}_c) \rangle^*$ as input;
(ii) calculate the value of $\mathbf{U}(\langle \mathcal{E}(\mathcal{G}_c) \rangle^*)$ and build a sequence $\langle e_1, \ldots, e_n \rangle$ of the composite edges $e_i \in \mathbb{E}_c(\mathcal{G}_c)$ in the exact same order that they appear in $\langle \mathcal{E}(\mathcal{G}_c) \rangle = \mathbf{U}(\langle \mathcal{E}(\mathcal{G}_c) \rangle^*)$;
(iii) build a finite ordered set $V := \{\mathbf{v} \mid e \in \langle e_1, \ldots, e_n \rangle, \text{ where } (e = \langle \mathbf{v}, \mathbf{u} \rangle \vee e = \langle \mathbf{u}, \mathbf{v} \rangle)\}$;
(iv) build a finite list $[A_1, \ldots, A_p]$ of finite ordered sets $A_i := \{a_i \mid a_i \text{ is the } i\text{-th element of } \mathbf{v} = (a_1, \ldots, a_p) \in V\}$, where $p$ is finite and is smaller than or equal to the length of the longest $\mathbf{v} \in V$;
(v) for every $i$ with $1 \le i \le p$, make $z_i := |A_i|$;
(vi) return the binary sequence $s = x_1 x_2 \cdots x_p$ from
$z_j \ge 2 \iff x_j = 1$
$z_j = 1 \iff x_j = 0$.
Therefore, from our construction of $\mathcal{G}_c$, we will have that
$$K(y \upharpoonright_p) \le l(\langle \langle \mathcal{E}(\mathcal{G}_c) \rangle^*, q \rangle) \le K(\langle \mathcal{E}(\mathcal{G}_c) \rangle) + O(1) \tag{4}$$
3 A preliminary version of Theorem 2 can also be found in [2].
holds by the minimality of K(·) and by our construction of q. Moreover, one can trivially construct an algorithm that returns y p from the chosen companion tuple τ (Gc ) and another algorithm that performs the inverse computation. This way, we will have that K(τ (Gc )) ≤ K(y p ) + O(1) ≤ p + O (log2 (p))
(5)
and, since y is 1-random, p − O(1) ≤ K(y p ) ≤ K(τ (Gc )) + O(1).
(6)
Additionally, since $E(\mathcal{G}_c)$ and $p$ were arbitrary, we can choose any characteristic string $x$ such that

$$K(x) = O(\log_2(p)) \tag{7}$$

holds. For example, one can take a trivial $x$ as a binary sequence starting with 1 and repeating 0's until the length matches the total number of all possible composite edges

$$|E_c(\mathcal{G}_c)| = \frac{\left(2^{\frac{p}{2} \pm o(p)}\right)^2 - 2^{\frac{p}{2} \pm o(p)}}{2}, \tag{8}$$

which we know is possible because of the Borel normality of $y$ [7, 8]. (This is only an example. In fact, one can choose any characteristic string $x$ for which $K(x) \leq O(\log_2(p))$ always holds.) Note that, in this case, our construction of the simple MAGs $\mathcal{G}_c$ ensures that the number of possible composite vertices only varies in accordance with the number of 1's in $y$. Therefore, together with basic inequalities in AIT, we have that

$$K(\tau(\mathcal{G}_c)) \leq K(E(\mathcal{G}_c)) + O(1) \leq K(x) + K(E(\mathcal{G}_c) \mid x) + O(1) \leq O(\log_2 K(\tau(\mathcal{G}_c))) + K(E(\mathcal{G}_c) \mid x).$$

Finally, the proof of $K(\tau(\mathcal{G}_c)) + O(1) \geq K(E(\mathcal{G}_c) \mid x)$ follows directly from Lemma 2.
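Steps (iii) to (vi) of the program q in the proof can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the decoded composite edge set string is available as a list of `((u, v), present)` pairs enumerating every possible composite edge, with composite vertices given as tuples. The names `enumerate_composite_edges` and `recover_bits` are hypothetical and do not come from the paper.

```python
from itertools import combinations, product


def enumerate_composite_edges(w):
    """Hypothetical helper: list every possible undirected composite edge
    for the companion tuple induced by the bit string w
    (aspect i is {1, 2} if the i-th digit of w is 1, else {1})."""
    aspects = [(1, 2) if bit == "1" else (1,) for bit in w]
    vertices = list(product(*aspects))
    # The presence flag is irrelevant for recovering the companion tuple.
    return [((u, v), False) for u, v in combinations(vertices, 2)]


def recover_bits(edge_entries):
    """Sketch of steps (iii)-(vi): rebuild the bit string from the edges."""
    # (iii) ordered set of composite vertices appearing in the edge sequence
    vertices = []
    for (u, v), _present in edge_entries:
        for vertex in (u, v):
            if vertex not in vertices:
                vertices.append(vertex)
    if not vertices:
        raise ValueError("no composite edges: at least one aspect needs two elements")
    # (iv) one ordered set of observed values per aspect
    p = len(vertices[0])
    aspects = [sorted({vertex[i] for vertex in vertices}) for i in range(p)]
    # (v) aspect sizes z_i, then (vi) z_i >= 2 <=> bit 1, z_i = 1 <=> bit 0
    sizes = [len(a) for a in aspects]
    return "".join("1" if z >= 2 else "0" for z in sizes)


if __name__ == "__main__":
    edges = enumerate_composite_edges("1011")
    print(len(edges), recover_bits(edges))
```

For w = 1011 the enumeration yields 8 composite vertices and (8² − 8)/2 = 28 possible composite edges, which is the same counting that underlies Eq. (8).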
The reader is then invited to note that the proof of Theorem 2 also works for many other forms of companion tuples $\tau(\mathcal{G}_c)$, as long as Eqs. (4), (5), (6), and (7) hold. For example, keep $w$ being a long enough finite initial segment of a 1-random real number $y$ and then define $\tau(\mathcal{G}_c) = (|A(\mathcal{G}_c)[1]|, \dots, |A(\mathcal{G}_c)[p]|)$ such that

$$A(\mathcal{G}_c)[i] = \{1, \dots, (f_1(p) + f_2(p))\} \iff \text{the } i\text{-th digit of } w \text{ is } 1$$
$$A(\mathcal{G}_c)[i] = \{1, \dots, f_1(p)\} \iff \text{the } i\text{-th digit of } w \text{ is } 0,$$

where $f_1 : \mathbb{N} \to \mathbb{N} \setminus \{0\}$ and $f_2 : \mathbb{N} \to \mathbb{Z} \setminus \{0\}$ are arbitrary total computable functions.

Moreover, as a consequence of Theorem 2, we show in Corollary 1 below a phenomenon that can only occur for families of objects embedded into arbitrarily large and non-uniform multidimensional spaces. Note that the companion tuple completely determines the discrete multidimensional space of the MAGs in which $A(\mathcal{G})[i] = \{1, \dots, |A(\mathcal{G})[i]|\} \subset \mathbb{N}$ holds for every $i \leq p$. In the particular case in which $A(\mathcal{G})[i] = A(\mathcal{G})[j]$ holds for every $i, j \leq p$, we say the multidimensional space of the MAG is uniform. Also note that the number of dimensions of a node-aligned multidimensional network that is mathematically represented by a MAG is given by the value $p$, i.e., the order of the MAG. Thus, arbitrarily large multidimensional spaces formally refers to arbitrarily large values of $p$. Specifically, Corollary 1 shows that there are two infinite sets of objects (in particular, one of data representations of multiaspect graphs and the other of strings) such that every member of one set is an encoding of a member of the other, but these members of the two sets are not always equivalent in terms of algorithmic information, which is a phenomenon that some may deem to be counter-intuitive at first glance:

Corollary 1. There is an infinite family $F$ of simple MAGs and an infinite set $X$ of the corresponding characteristic strings such that, for every constant $c \in \mathbb{N}$, there are $\mathcal{G}_c \in F$ and $x \in X$, where $x$ is the characteristic string of $\mathcal{G}_c$ and

$$O(\log_2 K(E(\mathcal{G}_c))) > c + K(x). \tag{9}$$

Proof. Let $c \in \mathbb{N}$ be arbitrary. Then, in order to construct the family $F$, it suffices to select an infinite number of finite initial segments of a 1-random infinite binary sequence $y$ such that, for each selected $y\upharpoonright_n$, we choose another $k > n$ with

$$K(y\upharpoonright_k) \geq c + K(y\upharpoonright_n) + O(\log_2 K(y\upharpoonright_k)). \tag{10}$$

This procedure can be applied infinitely many times because $y$ is 1-random. Now, from the proof of Theorem 2, construct an infinite set of companion tuples based on these initial segments of $y$. From each of these companion tuples, construct the characteristic strings in the same way as in the proof of Theorem 2. Finally, the desired inequality in Eq. (9) then follows from Theorem 2 and Eq. (10).
We can now combine Corollary 1 with Theorems 1 and 2 in order to show that, although for every MAG there is a graph that is isomorphic to this MAG, they are not always equivalent in terms of algorithmic information, where in fact the distortion may be exponential with respect to the algorithmic information of the graph:

Corollary 2. There are an infinite family $F_1$ of simple MAGs and an infinite family $F_2$ of classical graphs, where every classical graph in $F_2$ is MAG-graph-isomorphic to at least one MAG in $F_1$, such that, for every constant $c \in \mathbb{N}$, there are $\mathcal{G}_c \in F_1$ and a $G_{\mathcal{G}_c} \in F_2$ that is MAG-graph-isomorphic to $\mathcal{G}_c$, where

$$O(\log_2 K(E(\mathcal{G}_c))) > c + K(E(G_{\mathcal{G}_c})).$$

Proof. Let $F_1$ be an infinite family of simple MAGs that satisfies Corollary 1. Let $F_2$ be a family composed of the classical graphs that are MAG-graph-isomorphic to the MAGs in $F_1$. Then, for every $\mathcal{G}_c \in F_1$ and $G_{\mathcal{G}_c} \in F_2$ that is MAG-graph-isomorphic to $\mathcal{G}_c$, both have the same characteristic string. Remember that, for classical graphs, Eq. (3) holds in the form $K(x) = K(E(G_{\mathcal{G}_c})) \pm O(1)$. Finally, the proof then follows from Corollary 1.
5 Conclusion

This article presented mathematical results on the limitations of algorithmic information theory (AIT) applied to the study of multidimensional networks with a large number of non-uniform node dimensions (i.e., aspects). In the case of importing previous approaches for graphs or monoplex networks to node-aligned multidimensional networks, we demonstrated in Theorem 2, Corollary 1, and Corollary 2 the existence of worst-case distortions for network complexity analysis based on network information content or lossless compressibility. When comparing a logarithmically compressible network topology embedded into a high-algorithmic-complexity multidimensional space with this low-algorithmic-complexity network topology embedded into a unidimensional space, we showed that, in the general case, there are algorithmic complexity distortions that grow linearly with the number of aspects and exponentially with respect to the algorithmic complexity of the monoplex network. These distortions occur even though the multidimensional network and the monoplex network are isomorphic structures.

These results show that a more careful analysis is needed to evaluate how the number of distinct aspects, the respective sizes of each aspect, and the ordering in which these might be encoded affect the algorithmic information of the whole network. This way, the present article highlights the importance of: (i) taking into account the algorithmic complexity of the data structure itself; and (ii) going beyond the algorithmic complexity of encoding multidimensional networks with characteristic strings or adjacency matrices, for instance. Unlike graphs (or monoplex networks), the irreducible information content of a multidimensional network may be highly dependent on the choice of the encoded isomorphic copy. As we have only dealt with node-aligned multidimensional networks in the form of MAGs, future research is needed for establishing worst-case scenarios when the multidimensional network is not node aligned.

Acknowledgments. Authors acknowledge the partial support from CNPq: F. S. Abrahão (301.322/2020-1), K. Wehmuth (303.193/2020-4), and A. Ziviani (310.201/2019-5). Authors acknowledge the INCT in Data Science – INCT-CiD (CNPq 465.560/2014-8) and FAPERJ (E-26/203.046/2017). We also thank Cristian Calude, Mikhail Prokopenko, and Gregory Chaitin for suggestions and directions on related topics investigated in this article.
References

1. Abrahão, F.S., Wehmuth, K., Zenil, H., Ziviani, A.: On incompressible multidimensional networks. arXiv Preprints (2018). http://arxiv.org/abs/1812.01170
2. Abrahão, F.S., Wehmuth, K., Zenil, H., Ziviani, A.: Algorithmic information and incompressibility of families of multidimensional networks. Research report no. 8/2018, National Laboratory for Scientific Computing (LNCC), Petrópolis, Brazil (2020). https://arxiv.org/abs/1810.11719v9
3. Boccaletti, S., Bianconi, G., Criado, R., del Genio, C., Gómez-Gardeñes, J., Romance, M., Sendiña-Nadal, I., Wang, Z., Zanin, M.: The structure and dynamics of multilayer networks. Phys. Rep. 544(1), 1–122 (2014)
4. Bollobás, B.: Modern Graph Theory. Graduate Texts in Mathematics. Springer, New York (1998)
5. Brandes, U., Erlebach, T.: Fundamentals. In: Brandes, U., Erlebach, T. (eds.) Network Analysis. Lecture Notes in Computer Science, vol. 3418, pp. 7–15. Springer, Heidelberg (2005)
6. Buhrman, H., Li, M., Tromp, J., Vitányi, P.: Kolmogorov random graphs and the incompressibility method. SIAM J. Comput. 29(2), 590–599 (1999)
7. Calude, C.S.: Borel normality and algorithmic randomness. In: Developments in Language Theory, pp. 113–129. World Scientific Publishing (1994)
8. Calude, C.S.: Information and Randomness: An Algorithmic Perspective, 2nd edn. Springer, Berlin (2002)
9. Chaitin, G.: Algorithmic Information Theory, 3rd edn. Cambridge University Press, Cambridge (2004)
10. De Domenico, M., Solé-Ribalta, A., Cozzo, E., Kivelä, M., Moreno, Y., Porter, M.A., Gómez, S., Arenas, A.: Mathematical formulation of multilayer networks. Phys. Rev. X 3(4), 041022 (2013)
11. Diestel, R.: Graph Theory, Graduate Texts in Mathematics, vol. 173, 5th edn. Springer, Heidelberg (2017)
12. Downey, R.G., Hirschfeldt, D.R.: Algorithmic Randomness and Complexity. Theory and Applications of Computability. Springer, New York (2010)
13. Harary, F.: Graph Theory. Addison Wesley Series in Mathematics. CRC Press, Boca Raton (2018)
14. Khoussainov, B.: A quest for algorithmically random infinite structures. In: Proceedings of the Joint Meeting of the Twenty-Third EACSL Annual Conference on Computer Science Logic (CSL) and the Twenty-Ninth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS) - CSL-LICS, pp. 1–9. ACM Press, New York (2014)
15. Kivela, M., Arenas, A., Barthelemy, M., Gleeson, J.P., Moreno, Y., Porter, M.A.: Multilayer networks. J. Complex Netw. 2(3), 203–271 (2014)
16. Lambiotte, R., Rosvall, M., Scholtes, I.: From networks to optimal higher-order models of complex systems. Nat. Phys. 15(4), 313–320 (2019)
17. Li, M., Vitányi, P.: An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science, 4th edn. Springer, Cham (2019)
18. Morzy, M., Kajdanowicz, T., Kazienko, P.: On measuring the complexity of networks: Kolmogorov complexity versus entropy. Complexity 2017, 1–12 (2017)
19. Mowshowitz, A., Dehmer, M.: Entropy and the complexity of graphs revisited. Entropy 14(3), 559–570 (2012)
20. Santoro, A., Nicosia, V.: Algorithmic complexity of multiplex networks. Phys. Rev. X 10(2), 021069 (2020)
21. Wehmuth, K., Fleury, É., Ziviani, A.: On MultiAspect graphs. Theor. Comput. Sci. 651, 50–61 (2016)
22. Wehmuth, K., Fleury, É., Ziviani, A.: MultiAspect graphs: algebraic representation and algorithms. Algorithms 10(1), 1–36 (2017)
23. Zenil, H., Kiani, N., Tegnér, J.: A review of graph and network complexity from an algorithmic information perspective. Entropy 20(8), 551 (2018)
24. Zenil, H., Kiani, N.A., Abrahão, F.S., Rueda-Toicen, A., Zea, A.A., Tegnér, J.: Minimal Algorithmic Information Loss Methods for Dimension Reduction, Feature Selection and Network Sparsification. arXiv Preprints (2019). https://arxiv.org/abs/1802.05843
25. Zenil, H., Kiani, N.A., Tegnér, J.: Quantifying loss of information in network-based dimensionality reduction techniques. J. Complex Netw. 4(3), 342–362 (2016)
26. Zenil, H., Kiani, N.A., Tegnér, J.: The thermodynamics of network coding, and an algorithmic refinement of the principle of maximum entropy. Entropy 21(6), 560 (2019)
27. Zenil, H., Soler-Toscano, F., Dingle, K., Louis, A.A.: Correlation of automorphism group size and topological properties with program-size complexity evaluations of graphs and complex networks. Phys. A: Stat. Mech. Appl. 404, 341–358 (2014)
Hot-Get-Richer Network Growth Model

Faisal Nsour¹(✉) and Hiroki Sayama¹,²

¹ Department of Systems Science and Industrial Engineering, Binghamton University, Binghamton, NY, USA
{fnsour1,sayama}@binghamton.edu
² Waseda Innovation Lab, Waseda University, Tokyo, Japan

Abstract. Under preferential attachment (PA) network growth models, late arrivals are at a disadvantage with regard to their final degrees. Previous extensions of PA have addressed this deficiency by either adding the notion of node fitness to PA, usually drawn from some fitness score distributions, or by using fitness alone to control attachment. Here we introduce a new dynamical approach to address late arrivals by adding a recent-degree-change bias to PA so that nodes with higher relative degree change in temporal proximity to an arriving node get an attachment probability boost. In other words, if PA describes a rich-get-richer mechanism, and fitness-based approaches describe good-get-richer mechanisms, then our model can be characterized as a hot-get-richer mechanism, where hotness is determined by the rate of degree change over some recent past. The proposed model produces much later high-ranking nodes than the PA model and, under certain parameters, produces networks with structure similar to PA networks.

Keywords: Preferential attachment · Network growth · First-mover advantage · Degree dynamics · Winner-take-all · Hot-get-richer

1 Introduction
Under the preferential attachment (PA) growth model, late-arriving nodes are at a disadvantage with regard to their final degree. To account for high-degree later arrivals that are often observed in real-world empirical networks, several methods of extending or replacing the PA growth model have been explored, often with some sort of node fitness either replacing or modifying PA [1]. Some approaches use node age as a key factor to determine node fitness [4,5,7]. Node extinction is an extreme form of fitness-based growth that accounts for nodes becoming ineligible for new attachment by aging out of the pool of potential attachments for arriving nodes [10]. Fitness-based growth has been found to produce so-called scale-free networks even in the absence of preferential attachment [2]. It is interesting to note that so long as ranking information is preserved, fitness values themselves need not be available to arriving nodes [3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 532–543, 2021. https://doi.org/10.1007/978-3-030-65351-4_43

Comparatively fewer studies have focused on the late arrivals themselves [4,9]. As opposed to tying growth to fitness measures, the study by Mokryn et al. [9] proposes a new model called Trendy Preferential Attachment (TPA) where attachment probability is a function of decaying edge relevance, allowing high-degree nodes to wane in their ability to attract new edges, thereby opening the way for late arrivals to grow in importance.

In this paper we propose a new degree-recency-biased preferential attachment (DRPA) growth model. The DRPA model balances relative recent degree change against overall degree to determine the attachment probability. Furthermore, we explore not only the final network structure resulting from the growth model, but also the qualitative changes in rank of high-ranking late arrivals in particular as the network grows. DRPA models the creation of high-ranking nodes from late-arriving nodes by adding the concept of degree recency to classical PA. The intuition is simple: late-arriving nodes can become attractive because they have, in some recent past, been connected to at a higher rate relative to their degree. Recent degree change signals potential future growth, which biases PA growth. To analyze results, and compare DRPA to PA in terms of late arrivals, we also introduce measures of arrival position and degree change trajectory in growing networks.

The DRPA model is similar to the TPA model in that it accounts for shifting regimes of degree distribution (i.e., not only degree distribution but which nodes gain or lose rank) based on temporal dynamics. However, the two models approach trend determination in different ways. TPA determines the attachment probability of a new node to an existing node $i$ as a time-weighted sum $p(i) = f(1)k_i(t-1) + f(2)k_i(t-2) + \dots + f(t-1)k_i(1)$, where $f(t)$ is a monotonically decreasing function. The function $f(t)$ in TPA can change between networks (and even be monotonically increasing to provide conservation of degree history), and TPA requires the node's full history to be available at each attachment. In contrast, DRPA (see Eq. 1 below) relies only on the degree of node $i$ at arrival time and the most recent degree change. These differences suggest that, while similar in intent, the models will likely produce differing results, and in some cases (e.g., due to the history requirement of TPA) may not be interchangeable. We note the similarities and differences and recognize that a follow-up in-depth comparison is needed to better understand the effect of temporal dynamics on change in degree distribution.

In real-world networks, fitness is indeed often intrinsic to each node. A new website, for example, may offer a unique value to users and thus gain inbound links at a much higher rate than its age-proportional rate would dictate. However, even with inherent fitness, recent changes in topology may exert an influence. Some product may be interesting to consumers simply because it has suddenly been purchased disproportionately to its previous sales, more so than competing products, even long-established ones. In a word, it has become hot. If preferential attachment is a rich-get-richer mechanism, and fitness-based approaches are good-get-richer mechanisms, then we might say that DRPA (and related models) is a hot-get-richer mechanism wherein late-movers may have an advantage over first movers due to their increased likelihood of higher relative degree change.
2 Degree-Recency-Biased Preferential Attachment Model
The DRPA model can be thought of as PA in which node degrees are weighted by their relative recent changes, used as a multiplicative factor. The model is denoted as follows, with an arriving node having probability of attaching to existing node $i$:

$$p_i(t) = \frac{k_i(t) R_i(t)}{\sum_{j}^{t} k_j(t) R_j(t)} \tag{1}$$

Here $k_i(t)$ is the degree of node $i$, and $R_i(t)$ is the recency attachment factor of node $i$, defined as

$$R_i(t) = \left( \frac{k_i(t) - k_i(t - r)}{k_i(t)} \right)^{\beta}, \tag{2}$$

where $r$ is the recency span parameter indicating how far back in the network's growth cycle to determine a given node's degree change. Parameter $\beta$ is a tuning parameter, on the unit interval, that controls how much the model should prefer absolute degree $k_i(t)$ or recent change in degree $(k_i(t) - k_i(t - r))$. When $\beta$ goes to 0, $p_i(t)$ becomes classical preferential attachment. As it goes to 1, $p_i(t)$ goes to an attachment based only on recent degree change, defined as:

$$p_i(t) = \frac{k_i(t) - k_i(t - r)}{\sum_{j}^{t} \left( k_j(t) - k_j(t - r) \right)} \tag{3}$$
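Equations (1) to (3) can be evaluated directly from two degree snapshots. The following Python sketch is only an illustration under stated assumptions: the function name `drpa_probabilities` and its arguments are hypothetical, all degrees are assumed to be positive, and the snapshot `k_past` stands for the degrees $k_i(t - r)$.

```python
def drpa_probabilities(k_now, k_past, beta):
    """Attachment probabilities of Eq. (1) with the recency factor of Eq. (2).

    k_now  -- current degrees k_i(t), assumed positive
    k_past -- degrees r steps earlier, k_i(t - r) (assumed snapshot)
    beta   -- tuning parameter on the unit interval
    """
    weights = []
    for k, k_old in zip(k_now, k_past):
        recency = ((k - k_old) / k) ** beta  # R_i(t); equals 1 when beta == 0
        weights.append(k * recency)          # numerator k_i(t) * R_i(t)
    total = sum(weights)
    if total == 0:
        # With beta > 0 this happens when no node changed degree recently,
        # which is the situation the warm-up period is meant to avoid.
        raise ValueError("no recent degree change in the network")
    return [w / total for w in weights]


if __name__ == "__main__":
    degrees_now = [10, 4, 2]
    degrees_past = [9, 1, 0]
    for beta in (0.0, 0.5, 1.0):
        print(beta, drpa_probabilities(degrees_now, degrees_past, beta))
```

With `beta = 0` the result is proportional to degree alone (classical PA), and with `beta = 1` it is proportional to the recent degree change alone, matching Eq. (3).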
A warm-up period is allowed so that at any given moment Eq. (3) is valid. Negative values of $\beta$ would reward the opposite of hotness, what we might call “conservativeness” (i.e., slow growth of high-degree nodes), but these ranges were not explored in the present study.

We next attempt to obtain an analytical understanding of the DRPA model dynamics. For this purpose, we assume that time is continuous (even though the numerical simulations described later were conducted with discrete time) and that the recent degree change $(k_i(t) - k_i(t - r))$ can be approximated by a separate dynamical variable called recency, denoted as $r_i(t)$, that decays exponentially with time but grows by $k_i'(t)$. The latter assumption allows us to avoid using time-delay differential equations and thus significantly simplifies the analytical work. With these continuous-time assumptions, we describe the dynamics of the model as

$$k_i'(t) = m\,p_i(t) = m \frac{k_i(t)^{1-\beta}\, r_i(t)^{\beta}}{\sum_{j} k_j(t)^{1-\beta}\, r_j(t)^{\beta}}, \tag{4}$$

… $C_{RD}^{IN}$, global industrial expansion is greater than industrial agglomeration, the international division of labor is further developed, and the global economy is in the stage of globalization; when $C_{RD}^{OUT} < C_{RD}^{IN}$, the situation is just the opposite: the global economy will be facing the trend of de-globalization (Fig. 4).
Fig. 4. Centralization of GVC backbone
According to our hypothesis, before 2008, driven by the wave of globalization, the GVC continuously evolved towards global economic integration. For instance, China's accession to the WTO in 2001 exerted a positive influence on globalization. After 2008, the US subprime crisis increased the resistance of many countries/regions to trade and industrial transfer and dwarfed the international division of labor, which shifted the orientation of industrial policy in various countries towards the adjustment and optimization of domestic industrial structures, in order to respond to the risks that may occur during the process of globalization.

L. Xing and Y. Han

3.4 Global Efficiency

Global Efficiency, or GE for short, quantitatively reflects the average efficiency of sending information between nodes in the network [28]. It is introduced in this paper to measure the overall capacity of the turnover value stream of the industrial sectors that form the backbone of GVC. In a directed network, it is defined as the average of the reciprocal distances between node pairs (i.e., the reciprocal of the harmonic mean of the distances):

$$GE = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{1}{d_{ij}} \tag{8}$$
According to Eq. (8), when the distance between two nodes is infinite, the reciprocal will be zero, and the GE will always have a finite value, i.e., 0 ≤ GE ≤ 1; when GE = 0, there are only isolated nodes in the network, without any edge in between; when GE = 1, all node pairs are directly connected; the larger the GE, the better the connectivity between the nodes, the stronger the ability of the network to spread information.
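As a rough illustration of Eq. (8), the following Python sketch computes GE for a small directed network. It assumes that $d_{ij}$ is the hop count of the shortest directed path, which may differ from the distance actually used for the weighted GVC backbone; the function name `global_efficiency` is hypothetical.

```python
from collections import deque


def global_efficiency(nodes, edges):
    """GE of a directed, unweighted graph: mean of 1/d_ij over ordered pairs.

    Unreachable pairs have infinite distance and contribute 0 to the sum.
    """
    adjacency = {u: [] for u in nodes}
    for u, v in edges:
        adjacency[u].append(v)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for source in nodes:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for target, d in dist.items() if target != source)
    return total / (n * (n - 1))


if __name__ == "__main__":
    nodes = ["a", "b", "c"]
    print(global_efficiency(nodes, []))                        # isolated nodes
    print(global_efficiency(nodes, [("a", "b"), ("b", "c")]))  # directed path
```

On three isolated nodes the value is 0, on a complete directed graph it is 1, and on the directed path a → b → c it is (1 + 1 + 1/2)/6, which matches the boundary cases described for Eq. (8).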
Fig. 5. Global efficiency of GVC Backbone
From Fig. 5, the GE of the GVC backbone steadily increased before 2009, went through a sharp decline between 2009 and 2010, and then entered a recovery phase. After five years of adjustments, GE in 2014 finally recovered and surpassed that in 2009. The decrease in ND shown in Fig. 3 indicates that the number of edges in sequential networks is decreasing, and the disappearing edges may cause two situations. In the first, these edges are redundant ones inside communities, and the GE increases due to the downsizing of the network, since GE is inversely proportional to $N(N-1)$. In the second, the situation is diametrically opposite: the disappearing edges acted as bridges linking different communities in the original network, so the decrease in ND is accompanied by a decrease in GE, because the shortest paths of the network are lengthened, making the term $\sum_{i \neq j} \frac{1}{d_{ij}}$ smaller.

We therefore believe that 2009 is a watershed for the evolution of GVC. Prior to this, the adjustment of the global industrial structure and the changes in the international division of labor were proactive and positive, and the overall efficiency of the global economy was improving. For some time after the US subprime crisis, these changes became passive and negative, because the collapse of some pivotal industrial linkages in the world impeded them.

Extracting the Backbone of Global Value Chain

All in all, globalization is a double-edged sword, and one of its negative effects is that local turbulence may spread at a faster rate. At present, the COVID-19 pandemic has caused a rapid global economic recession, in that the value flow on various segments of the GVC backbone has been hindered or even blocked. Moreover, the impact caused by the Coronavirus Recession will be more serious and far-reaching, because it occurs simultaneously in many countries and regions around the world, unlike the subprime crisis, which started in the United States and gradually affected the world through the cascade effect.
4 Conclusions

In this paper, we extract the backbone of the GVC and carry out a network-based principal component analysis. Recent years have witnessed a number of studies using high-dimensional data to model this complicated world from different angles, for instance, the global economic system. Although many algorithms and ideas have been introduced to analyze this directed, weighted, and dense network, filtering techniques aimed at uncovering the most important inter-industry relations are still necessary. We thus refer to network pruning as the abstraction of a sub-network that contains far fewer edges and allows the discrimination and computational tractability of the relevant features of the original networks. By doing this, we get the backbone of GVC according to the trade-off between the conduction efficiency and velocity of intermediate goods. Many trends that went undiscovered in former international trade network analyses then show up, which provides a new perspective to understand the principal characteristics of globalization. In the future, the backbone of GVC may pave the way for community detection, link prediction, and spatial econometrics on the ICIO network.

Acknowledgement. The author acknowledges support from National Natural Science Foundation of China (Grant No. 71971006), Natural Science Foundation of Beijing Municipality (Grant No. 9194024), Humanities and Social Science Foundation of Ministry of Education of the People's Republic of China (Grant No. 19YJCGJW014), Ri-Xin Talents Project of Beijing University of Technology (Grant Recipient: Lizhi Xing), Technology Plan Key Program of Beijing Municipal Education Commission (Grant No. KZ20181005010).
References

1. Zhou, M., Wu, G., Xu, H.: Structure and formation of top networks in international trade, 2001–2010. Soc. Netw. 44(44), 9–21 (2016)
2. Ahmed, N.K., Neville, J., Kompella, R.: Network sampling: from static to streaming graphs. ACM Trans. Knowl. Discov. Data 8(2), 7 (2014)
3. Toivonen, H., Zhou, F., Hartikainen, A., Hinkka, A.: Network compression by node and edge mergers. In: Bisociative Knowledge Discovery, pp. 199–217. Springer (2012)
4. Blagus, N., Šubelj, L., Bajec, M.: Self-similar scaling of density in complex real-world networks. Phys. A Stat. Mech. Appl. 391(8), 2794–2802 (2012)
5. Carmi, S., Havlin, S., Kirkpatrick, S., Shavitt, Y., Shir, E.: A model of internet topology using k-shell decomposition. Proc. Natl. Acad. Sci. U.S.A. 104(27), 11150–11154 (2007)
6. Kitsak, M., Gallos, L.K., Havlin, S., Liljeros, F., Muchnik, L., Stanley, H.E., Makse, H.A.: Identification of influential spreaders in complex networks. Nat. Phys. 6(11), 888–893 (2010)
7. Siganos, G., Tauro, S.L., Faloutsos, M.: Jellyfish: a conceptual model for the AS internet topology. J. Commun. Netw. 8(3), 339–350 (2006)
8. Chen, D.B., Lü, L.Y., Shang, M.S., Zhang, Y.C.: Identifying influential nodes in complex networks. Phys. A Stat. Mech. Appl. 391(4), 1777–1787 (2012)
9. Lü, L.Y., Zhang, Y.C., Yeung, C.H., Zhou, T.: Leaders in social networks, the delicious case. PLoS ONE 6(6), e21202 (2011)
10. Li, Q., Zhou, T., Lü, L.Y., Chen, D.B.: Identifying influential spreaders by weighted LeaderRank. Phys. A Stat. Mech. Appl. 404, 47–55 (2014)
11. Malang, K., Wang, S.H., Phaphuangwittayakul, A., Lü, Y.Y., Yuan, H.N., Zhang, X.Z.: Identifying influential nodes of global terrorism network: a comparison for skeleton network extraction. Phys. A Stat. Mech. Appl. 545, 123769 (2020)
12. Zhang, X.H., Zhu, J.: Skeleton of weighted social network. Phys. A Stat. Mech. Appl. 392(6), 1547–1556 (2013)
13. Zhao, S.X., Zhang, P.L., Li, J., Tan, A.M., Ye, F.Y.: Abstracting the core subnet of weighted networks based on link strengths. J. Assoc. Inf. Sci. Technol. 65(5), 984–994 (2014)
14. Zhang, R.J., Stanley, H.E., Ye, F.Y.: Extracting h-backbone as a core structure in weighted networks. Sci. Rep. 8(1), 1–7 (2018)
15. Cao, J., Ding, C., Shi, B.: Motif-based functional backbone extraction of complex networks. Phys. A Stat. Mech. Appl. 526, 121123 (2019)
16. Kim, D.H., Noh, J.D., Jeong, H.: Scale-free trees: the skeletons of complex networks. Phys. Rev. E 70(4), 046126 (2004)
17. Grady, D., Thiemann, C., Brockmann, D.: Robust classification of salient links in complex networks. Nat. Commun. 3(1), 199–202 (2011)
18. Zhang, X.H., Zhang, Z.C., Zhang, H., Wang, Q., Zhu, J.: Extracting the globally and locally adaptive backbone of complex networks. PLoS ONE 9(6), e100428 (2011)
19. Serrano, M.A., Boguna, M., Vespignani, A.: Extracting the multiscale backbone of complex weighted networks. Proc. Natl. Acad. Sci. U.S.A. 106(16), 6483–6488 (2009)
20. Radicchi, F., Ramasco, J.J., Fortunato, S.: Information filtering in complex weighted networks. Phys. Rev. E 83(4), 046101 (2011)
21. Foti, N.J., Hughes, J.M., Rockmore, D.N.: Nonparametric sparsification of complex multiscale networks. PLoS ONE 6(2), e16431 (2011)
22. Bu, Z., Wu, Z.A., Qian, L.Q., Cao, J., Xu, G.D.: A backbone extraction method with local search for complex weighted networks. In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 85–88. IEEE (2014)
23. Coscia, M., Neffke, F.: Network backboning with noisy data. In: 2017 IEEE 33rd International Conference on Data Engineering, pp. 425–436. IEEE (2017)
24. Wang, Z., Wei, S., Yu, X., Zhu, K.: Characterizing global value chains: production length and upstreamness. National Bureau of Economic Research (2017)
25. Timmer, M.P., Los, B., Stehrer, R., Vries, G.J.D.: An anatomy of the global trade slowdown based on the WIOD 2016 release. GGDC Research Memorandum (2016)
26. Xing, L.Z., Dong, X.L., Guan, J., Qiao, X.Y.: Betweenness centrality for similarity-weight network and its application to measuring industrial sectors' pivotability on the global value chain. Phys. A Stat. Mech. Appl. 516, 19–36 (2019)
27. Foster, J.G., Foster, D.V., Grassberger, P., Paczusk, M.: Edge direction and the structure of networks. Proc. Natl. Acad. Sci. U.S.A. 107(24), 10815–10820 (2010)
28. Latora, V., Marchiori, M.: Efficient behavior of small-world networks. Phys. Rev. Lett. 87(19), 198701 (2001)
Analysis of Tainted Transactions in the Bitcoin Blockchain Transaction Network

María Óskarsdóttir(✉), Jacky Mallett, Arnór Logi Arnarson, and Alexander Snær Stefánsson

Reykjavík University, Reykjavík, Iceland
{mariaoskars,jacky}@ru.is

Abstract. Blockchain technology, with its decentralised peer-to-peer network and cryptographic protocols, has led to a proliferation of cryptocurrencies, with Bitcoin at the forefront. The blockchain publicly records all Bitcoin transactions, which can be used to build a dynamic and complex network to give a representation of the transactions in the underlying monetary system. Despite the cryptographic guarantees, there exist inconsistencies and suspicious behavior in the chain. We reported on two such anomalies related to block mining in previous work. In this paper, we build a network using bitcoin transactions and apply techniques from network science to analyse its complex structure. We focus our analysis on sub-networks induced by the two sets of anomalies, and investigate how inequality in terms of wealth and anomaly fraction evolves from the blockchain's origin. Thereby we present a novel way of using network science to detect and investigate cryptographic anomalies.

Keywords: Bitcoin · Transaction network · Cryptography · Blockchain

1 Introduction
The blockchain is a publicly available ledger that stores all transactions made using bitcoin, the first cryptocurrency. The blockchain technology, proposed by Nakamoto in 2008, is based on an open peer-to-peer network to authenticate transactions using cryptographic technologies and implement a decentralized distributed digital ledger. Its introduction has led to a proliferation of cryptocurrencies in recent years [16]. The public bitcoin blockledger is now, 12 years later, the most prominent and impactful version. To date, it records over half a billion bitcoin transactions, which it stores in 620,000 blocks on the blockchain. In total, 18 million bitcoins are currently stored in over 46 million digital wallets, accompanied by details of the transactions they have been used in. The impact of this novel technology and the accompanying financial system is already considerable, and it has attracted researchers from various disciplines, including cryptography, economics and network science.

By construction, the bitcoin blockledger lends itself extremely well to network analysis since all transactions using the ledger are publicly recorded, with information about both the originator and the recipient. The dynamic nature of blockchain, the vast number of transactions, intricate patterns, richness of node and edge features, and exogenous effects (such as those of markets and the economy) all contribute to the complexity of the network and its analysis.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 571–581, 2021. https://doi.org/10.1007/978-3-030-65351-4_46

The bitcoin transaction network has been studied before to some extent, including investigation of the acquisition and spending behaviour of bitcoin owners [19]. The network shows evidence of the Pareto principle during the first four years, in that linear or sub-linear preferential attachment drives the network’s growth and wealth distribution [9]. More recently, there has been a data driven analysis of price fluctuations, user behaviour, and wealth accumulation in the bitcoin transaction network, including an investigation of the richest wallets [17]. Finally, an analysis of the transaction network for the first nine years after its creation identified a causal relationship between the movements of bitcoin prices and changes in the transaction network topology [4]. As the bitcoin infrastructure has evolved, a number of measures have been introduced to address the inherent scaling limitations of a peer-to-peer network. A recent review of research on the bitcoin transaction network identified three types of these networks, namely the Bitcoin Address Network, the Bitcoin User Network and the Bitcoin Lightning Network. In addition, the authors conclude that the distribution of bitcoin is very uneven and that the network is becoming increasingly sparse [21].

Another stream of research is focused on anomalies and suspicious behaviour in the bitcoin blockledger using data science and machine learning. In an attempt to find anomalous transactions, Pham and Lee [18] extracted features from the transaction network, from the origin until 2014, and applied k-means clustering to find outliers. Similar approaches have been proposed by other researchers [14,15].
Some studies investigate certain types of suspicious behaviour. Firstly, to identify Ponzi schemes, transactions and wallets related to known schemes were extracted and compared to regular transactions and wallets in a supervised learning setting [3]. Secondly, researchers have looked into money laundering specifically, using network methods, in particular network representation learning and supervised machine learning models [8]. Recently, Elliptic¹ introduced a public data set which contains several sub-networks for the blockchain transaction network, with rich node features and labels for licit and illicit transactions. Researchers have trained several supervised learning methods to detect illicit transactions and compared their performance [22]. Others have also worked with the Elliptic data set [1,11,20], for example using active learning to address the high class imbalance in the data set [11].

¹ Elliptic is a cryptocurrency intelligence company focused on safeguarding cryptocurrency ecosystems from criminal activity.

In spite of the blockchain’s structural and operational properties that are designed to safeguard it, i.e. the decentralized peer-to-peer network, cryptographic protocols, validation of transactions, openness etc., inconsistencies and suspicious behaviour have been observed and reported. These have been connected with colluding miners [6], enhanced performance mining [5,7], the so-called Patoshi pattern, which appears in the first 30,000 blocks [13], and selfish mining, where miners publish the blocks they mine selectively [10].

In this paper we use network science to analyse the complex network of bitcoin transactions with respect to two particular anomalies which we have identified in blocks mined in the early years of the blockchain [12]². Given the magnitude of these anomalies (the blocks in question represent well over 3 million bitcoin), we investigate whether they may have led to false conclusions about some aspects of bitcoin transactions. We construct sub-networks of transactions that originate with the anomalous, or tainted, blocks and compare the structural properties of the sub-networks with the full network as well as sub-networks that arise from non-tainted blocks. Furthermore, motivated by the analysis of wealth distribution presented by Kondor et al. (2014) [9] and irregularities observed there, we compare the evolution of Gini coefficients of node features in the various sub-networks.

In the next section we discuss the two anomalies on which the analyses in this paper are based. Then we describe our methodology and present the results. The paper concludes with a summary of our findings and directions for future work.
² The paper is currently under review, but will be shared upon request.

2 Background

The origin story of bitcoin is that the technology originated with a posting by a Satoshi Nakamoto to the cryptography mailing list in 2008, followed by a slow expansion in 2009–10 as early adopters installed mining software and began creating bitcoins. Although there has been some question as to whether a single individual could have developed and tested this system, simply due to the range of expertise required, this story has been broadly accepted by researchers.

At the end of 2019 we performed a simple frequency analysis of the hexadecimal values (nibbles) by position in the bitcoin blockchain [12]. This revealed two distinct anomalous patterns, both in the nonce, which is a key part of the proof of work performed by all miners to obtain bitcoins. One anomaly occurs in the first hexadecimal position (nibble) of the block’s nonce field, as shown in Fig. 1b, where in a disproportionate number of blocks this has a value in the range 0–3. The other is in the penultimate position of the nonce, where an abnormal number of 0’s occur in the first 18 months of mining (Fig. 1a). We refer to these as the P anomaly and the Z anomaly, respectively.

Fig. 1. Anomalous patterns discovered by frequency analysis of the hexadecimal values by position in the bitcoin blockchain.

Both patterns seem to be associated either with the originators of bitcoin or very early adopters. The Extended Patoshi anomaly in the first nibble of the nonce is a notable feature of the first months of mining, part of which has already been attributed by Sergio Lerner to mining by Nakamoto. The second, “penultimate zero”, pattern can also be seen almost from the start of mining, and is either part of Nakamoto’s mining, or that of a very early adopter. After accounting for the expected number of blocks that would contain these values (6.25% in the penultimate zero case, and 25% in the Patoshi anomaly in the first nibble), we estimate that approximately one third of all coins mined at the first difficulty level are obtained from blocks mined with these features. Across the entire ten years of both patterns, well over 3 million bitcoins appear to have been obtained from blocks with these distinguishing features.

The magnitude of these two patterns clearly warrants further investigation into any associated patterns in the transactions associated with the coins mined in these blocks. Previous research into early transactions in the bitcoin network has thrown up evidence of suspicious clusters, notably Ron and Shamir’s work [19], which discovered a large number of coins being progressively consolidated into a small number of apparently connected wallets. However, research in this area has generally not had a clear marker in the blocks themselves on which to attach suspicion.
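The expected shares quoted above follow from treating each nonce nibble as uniform over the 16 hexadecimal digits: a single value (0) has probability 1/16 = 6.25%, and a range of four values (0–3) has probability 4/16 = 25%. The following is a minimal sketch of such a positional frequency count, not the analysis code of [12]. It assumes that nonces are available as 8-character hexadecimal strings in display order (the byte order used in [12] is not stated here); the function names and the toy data are hypothetical.

```python
from collections import Counter

def nibble_counts(nonces_hex, position):
    """Count the hex digit found at `position` across 8-digit nonce strings."""
    return Counter(n[position] for n in nonces_hex)

def excess_fraction(nonces_hex, position, values):
    """Observed minus expected share of nonces whose nibble is in `values`.

    The expectation assumes each nibble is uniform over the 16 hex digits
    (an assumption of this sketch).
    """
    counts = nibble_counts(nonces_hex, position)
    observed = sum(counts[v] for v in values) / len(nonces_hex)
    expected = len(values) / 16
    return observed - expected

# Toy data for illustration only, not real block nonces.
sample = ["1a2b3c0d", "2fffff0e", "9abcde0f", "3000001a"]
first_nibble_excess = excess_fraction(sample, 0, "0123")  # P-style pattern
penultimate_excess = excess_fraction(sample, 6, "0")      # Z-style pattern
```

A positive excess at a position suggests that the corresponding values occur more often than uniform nonces would predict; whether that excess is significant would still need a proper statistical test.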
3 Methodology

3.1 Bitcoin Transaction Network

To carry out our analysis, we extract the entire bitcoin blockchain from origin to November 2019. Using these blocks, we create a database of transactions, with information about the from transaction and one or more to transactions which correspond to the movement of bitcoin between wallets. Wallets that received the miner’s reward coins (otherwise known as coinbase transactions) from blocks with the two patterns are marked as tainted, and as these coins are transferred to other wallets, the percentage taint for each pattern is calculated and updated for the receiving wallet. This allows us to accrue information on the from and to nodes (wallet addresses) of the transaction, as well as the amount that was transferred, the transactions’ tainted P ratio and tainted Z ratio and the timestamp of each transaction. In this way we obtain an edgelist of timestamped transactions from which we create a directed network.
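The text does not spell out how the percentage taint is updated, so the sketch below makes an explicit assumption: coins mix proportionally, meaning that outgoing coins carry the sender's current tainted fraction and the receiver's fraction becomes the amount-weighted average of its old balance and the incoming coins. The function name, the tuple layouts and the toy data are hypothetical, and one such pass would be run per pattern (P or Z).

```python
def propagate_taint(coinbase_rewards, transfers):
    """Track per-wallet balance and tainted fraction for one pattern.

    coinbase_rewards: list of (wallet, amount, is_tainted) tuples.
    transfers: list of (timestamp, sender, receiver, amount) tuples
               with positive amounts.
    Returns (edges, wallets); each edge carries the sender's taint ratio
    at the time of the transfer.
    """
    wallets = {}  # wallet -> [balance, tainted_fraction]
    for wallet, amount, is_tainted in coinbase_rewards:
        bal, frac = wallets.get(wallet, [0.0, 0.0])
        new_bal = bal + amount
        tainted_in = amount if is_tainted else 0.0
        wallets[wallet] = [new_bal, (bal * frac + tainted_in) / new_bal]

    edges = []
    for ts, sender, receiver, amount in sorted(transfers):
        s_bal, s_frac = wallets.get(sender, [0.0, 0.0])
        r_bal, r_frac = wallets.get(receiver, [0.0, 0.0])
        # Assumption: outgoing coins carry the sender's current fraction.
        new_r_bal = r_bal + amount
        new_r_frac = (r_bal * r_frac + amount * s_frac) / new_r_bal
        wallets[sender] = [s_bal - amount, s_frac]
        wallets[receiver] = [new_r_bal, new_r_frac]
        edges.append((ts, sender, receiver, amount, s_frac))
    return edges, wallets

# Toy example: A mined a tainted block, B an untainted one.
rewards = [("A", 50.0, True), ("B", 50.0, False)]
moves = [(1, "A", "C", 20.0), (2, "B", "C", 20.0), (3, "C", "D", 10.0)]
edges, wallets = propagate_taint(rewards, moves)
```

In the toy example wallet C receives equal amounts of tainted and untainted coins, so the transfer from C to D is recorded with a taint ratio of 0.5. Other mixing rules (for example first-in-first-out) would give different ratios.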
3.2 Generation of Sub-networks

Having identified two types of anomalous transactions in the coinbase, namely the Z and the P anomaly, we continue to investigate their prominence in and effect on the bitcoin transaction network. To do this, starting from the full network, we extract sub-networks of transactions that have an origin with a specific set of coinbase transactions. We consider five sets of coinbase (cb) transactions as listed below.

$$
\begin{aligned}
T_Z &= \{cb \mid \text{the Z anomaly is in the nonce of the cb block}\}\\
T_P &= \{cb \mid \text{the P anomaly is in the nonce of the cb block}\}\\
T_Z \cap T_P &= \{cb \mid \text{the Z and the P anomalies are in the nonce of the cb block}\}\\
\neg T_Z &= \{cb \mid \text{the Z anomaly is not in the nonce of the cb block}\}\\
\neg T_P &= \{cb \mid \text{the P anomaly is not in the nonce of the cb block}\}
\end{aligned}
$$
We create a sub-network for each set using snowball sampling. In the snowball sampling, we start off with a sub-network of source nodes that consists of the coinbase transactions in the respective set. Any transaction that is linked to one of these source nodes in the full network is added to the sub-network. Subsequently, any transaction in the full network that is linked to one of the most recently added transactions in the sub-network is also added to the sub-network. This process is repeated until no more transactions can be added. Since the full network is timestamped and directional, the process will terminate. As a result, we obtain, in addition to the full network (which we refer to as All), five sets of sub-networks, each one originating with the sub-sets listed above. We refer to these as Tainted Z, Tainted P, Tainted P&Z, Not Z and Not P, respectively.

These sub-networks and the full network are created for each month starting in January 2010 until May 2012. Due to the size of the entire dataset it is not feasible to build the sub-networks with the snowball sampling technique using all the nodes in each set. Therefore we randomly sample 1000 nodes from each set before doing the snowball sampling. This is repeated ten times for each sub-network in each month that we analyse. The values shown in the plots below are the mean value for each measure in these ten samples.
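The snowball procedure described above can be sketched as a breadth-first expansion from a random sample of seeds. This is a minimal illustration rather than the authors' implementation. It assumes that "linked to" means following edges in their forward direction, that the network is given as a list of (source, target) pairs, and that seeds are coinbase transactions from one of the five sets; the function name and the toy edge list are hypothetical.

```python
import random
from collections import defaultdict, deque

def snowball_subnetwork(edges, seed_pool, n_seeds=1000, rng=None):
    """Return the nodes and edges reachable downstream from sampled seeds.

    edges: iterable of (source, target) pairs of the directed full network.
    seed_pool: candidate source nodes (e.g. one set of coinbase transactions).
    """
    rng = rng or random.Random()
    pool = sorted(seed_pool)
    seeds = rng.sample(pool, min(n_seeds, len(pool)))

    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)

    visited = set(seeds)
    frontier = deque(seeds)
    sub_edges = []
    while frontier:
        u = frontier.popleft()
        for v in successors[u]:
            sub_edges.append((u, v))
            if v not in visited:  # each node is expanded only once
                visited.add(v)
                frontier.append(v)
    return visited, sub_edges

# Toy network: three coinbase transactions and their descendants.
toy_edges = [("cb1", "t1"), ("t1", "t2"), ("cb2", "t3"),
             ("t3", "t2"), ("cb3", "t4")]
nodes, sub_edges = snowball_subnetwork(toy_edges, {"cb1"}, n_seeds=1)
```

Because every node is expanded at most once, the loop terminates on any finite network. Repeating the call with ten different random generators and averaging a measure over the resulting sub-networks would mirror the repetition described in the text.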
3.3 Network Measures

In order to compare the characteristics of the sub-networks to those of the full network, we consider several network measures. First we measure basic properties of the networks, namely the number of nodes, the density and the diameter [2]. The number of nodes is simply the total number of nodes in the respective sub-network. The second measure is the network’s density, or the number of edges divided by the maximum possible number of edges. It gives an intuition of how well connected the network is. Finally, the diameter measures the length of the longest shortest path in the network. For any given pair of connected nodes, there is a path between them that is no longer than any other path between them. The diameter is the longest of such paths in the network and represents the size of the network. Since computing the shortest path between all pairs of nodes in a network can get quite time consuming as networks grow in size, we randomly sample 1000 pairs of nodes from each network and use those pairs to estimate the networks’ diameters.

Based on Kondor et al. (2014), we focus on the Gini coefficient, clustering coefficient and the degree correlation to quantify the inequality in the network and sub-networks [9]. Firstly, we use the Gini coefficient to characterize the heterogeneity of the distribution of in-degree, out-degree, transaction amount, tainted Z ratio and tainted P ratio. Generally, the Gini coefficient is defined as

$$G = \frac{2\sum_{i=1}^{n} i\,x_i}{n\sum_{i=1}^{n} x_i} - \frac{n+1}{n} \tag{1}$$

where $\{x_i\}$ is a monotonically non-decreasing ordered sample of size $n$. Thus, $G = 0$ indicates perfect equality, or every node being equal in terms of the value being considered, whereas $G = 1$ indicates complete inequality. As in [9] we measure this for the distribution of in-degree, out-degree and transaction amount, but in addition we compute the distribution of tainted P and tainted Z ratios.

Secondly, we look at the assortativity or degree correlation of the network [2]. We compute it using the Pearson correlation coefficient of the out- and in-degrees of connected node pairs

$$r = \frac{\sum_{e} \left(j_e^{\mathrm{out}} - \bar{j}^{\mathrm{out}}\right)\left(k_e^{\mathrm{in}} - \bar{k}^{\mathrm{in}}\right)}{L\,\sigma_{\mathrm{out}}\,\sigma_{\mathrm{in}}} \tag{2}$$

where, for the edge $e$ that links node $v_{\mathrm{from}}$ to $v_{\mathrm{to}}$, $j_e^{\mathrm{out}}$ is the out-degree of node $v_{\mathrm{from}}$ and $k_e^{\mathrm{in}}$ is the in-degree of node $v_{\mathrm{to}}$, $L$ is the number of edges, and

$$\bar{k}^{\mathrm{in}} = \frac{1}{L}\sum_{e} k_e^{\mathrm{in}} \quad\text{and}\quad \sigma_{\mathrm{in}}^2 = \frac{1}{L}\sum_{e} \left(k_e^{\mathrm{in}} - \bar{k}^{\mathrm{in}}\right)^2. \tag{3}$$

$\sigma_{\mathrm{out}}$ and $\bar{j}^{\mathrm{out}}$ are computed in a similar way. Degree correlation measures the nodes’ tendency to be linked to nodes with a similar degree. In an assortative network (where $r > 0$) high degree nodes are linked to other high degree nodes and low degree nodes are linked to other low degree nodes. In disassortative networks ($r < 0$), in contrast, high degree nodes have a tendency to connect to low degree nodes, creating a hub and spoke structure.

Finally, we measure the networks’ clustering coefficient, that is, the density of triangles in the networks, given by

$$C = \frac{1}{N}\sum_{v} \frac{2\Delta_v}{d_v (d_v - 1)} \tag{4}$$

where $\Delta_v$ is the number of triangles containing node $v$ and $d_v$ is the degree of node $v$. The sum runs over all $N$ nodes in the network [2]. To compute $C$ we must ignore the directionality of the network. The clustering coefficient measures how connected the nodes are in their closest neighborhoods.
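As a concrete reading of the Gini coefficient and the degree correlation defined in this section, the sketch below computes both with the standard library only. It is an illustration under stated assumptions, not the code used for the figures: the function names are hypothetical, edges are given as a list of (from, to) pairs, and degenerate inputs (an all-zero sample, or zero variance in the degrees) are handled by returning 0.0 or NaN, a choice made here for convenience.

```python
import math

def gini(values):
    """Gini coefficient of a sample, using the sorted-sample formula."""
    x = sorted(values)  # monotonically non-decreasing order
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0  # convention chosen for this sketch
    weighted = sum(i * xi for i, xi in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def degree_correlation(edges):
    """Pearson correlation between the out-degree of the source node and
    the in-degree of the target node, taken over all edges."""
    out_deg, in_deg = {}, {}
    for u, v in edges:
        out_deg[u] = out_deg.get(u, 0) + 1
        in_deg[v] = in_deg.get(v, 0) + 1
    L = len(edges)
    j = [out_deg[u] for u, _ in edges]
    k = [in_deg[v] for _, v in edges]
    j_bar, k_bar = sum(j) / L, sum(k) / L
    s_out = math.sqrt(sum((a - j_bar) ** 2 for a in j) / L)
    s_in = math.sqrt(sum((b - k_bar) ** 2 for b in k) / L)
    if s_out == 0 or s_in == 0:
        return float("nan")  # correlation undefined without variance
    cov = sum((a - j_bar) * (b - k_bar) for a, b in zip(j, k))
    return cov / (L * s_out * s_in)

equal = gini([1, 1, 1, 1])        # 0.0: every node holds the same value
unequal = gini([0, 0, 0, 1])      # 0.75: one node holds everything
r = degree_correlation([("a", "b"), ("a", "c"), ("d", "c")])  # -0.5
```

Note that for a finite sample the maximum of this Gini formula is (n − 1)/n rather than exactly 1, which is why the fully concentrated four-element sample gives 0.75.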
4 Results

Fig. 2. Evolution of the network’s characteristics.
Figure 2 shows the evolution of some of the network’s characteristics as presented by Kondor et al. (2014) [9], namely the Gini coefficient of in-degree, out-degree and amount in Fig. 2a and the degree correlation and clustering coefficient in Fig. 2b. Since we are looking at transactions only, and not wallets, these graphs are slightly different from the ones presented in [9], although the trends are very similar, except for the clustering coefficient. Given this close similarity, we continue to work with the network of transactions only. In addition, we have added the Gini coefficient for tainted P ratio and tainted Z ratio to the plot in Fig. 2a. We can see that both start off relatively low, but increase sharply in mid 2010, with the tainted Z inequality increasing much more than the tainted P inequality.

Figure 3 shows the evolution of the networks’ diameter, number of nodes and density. Note the log scale on the y-axis. We can see that the sub-networks are both smaller and denser than the full transaction network, which is to be expected, since they are samples of the full network. The sub-networks are smaller because their origin can only be traced to particular subsets of coinbase transactions, and yet as time goes by they mix in with all the other transactions, and hence the measures presented in Fig. 3 converge. The diameter is noisier in the beginning, but eventually all networks show a similar tendency in this regard.

Fig. 3. Evolution of diameter, number of nodes and density in the network of all transactions and in the five sub-networks.

Figure 4 shows the evolution of the Gini coefficient for in-degree, out-degree, transaction amount, tainted Z and tainted P, in addition to the degree correlation and clustering coefficient for each of the five sub-networks on a monthly basis. In each plot, the red line denotes the whole network, and we can see how the values for each sub-network all converge towards each other and are slowly nearing the red line. Moreover, we see that in the beginning, the in-degree tends to be more equally distributed in the sub-networks than in the whole network, whereas the out-degree shows the opposite behavior: its distribution is less equal in the sub-networks. We also see that in the Tainted P and Tainted P&Z networks, the inequality in the amount distribution increases in early 2010 and remains very high. In terms of the Gini coefficient for tainted Z ratio, the inequality in the Tainted P sub-network is very high early on. We see the converse for the Gini coefficient of tainted P ratio, where the Tainted Z sub-network scores very high, at least until November 2010. Both sub-networks of not tainted transactions have a high clustering coefficient in the beginning, whereas all converge to the same low value towards the end of the period. The Not P sub-network behaves differently from the other ones. In terms of out-degree, tainted Z and tainted P it dips in April 2010 and jumps at the same time in terms of in-degree and clustering coefficient. Its amount inequality remains high throughout the whole period. For degree correlation, all sub-networks show a similar trend, except for the Tainted P&Z sub-network, which takes a downwards turn in September 2010 and stays negative for a couple of months. This particular observation clearly demonstrates an irregularity that needs to be studied further.

The evolution of the various Gini coefficients in the full network in comparison to the sub-networks can tell us a great deal about how the tainted coinbase transactions have blended in with the other transactions, thus hiding in plain sight. It also informs us of points in time where the transaction network ought to be investigated more in-depth. In terms of in-degree, the Gini coefficient is much lower in the sub-networks than in the full network, which indicates a more homogeneous in-degree distribution.
The opposite holds for the out-degree: there is more inequality in the out-degree in the tainted networks. This could indicate that owners of tainted bitcoin were behaving differently when trading them, while mixing them with untainted coins. In terms of amount inequality, it is the highest in the tainted sub-networks. It is interesting to see such a high tainted P inequality in the Tainted Z network and a high tainted Z inequality in the Tainted P network in the first year. Finally, the networks’ assortativity raises many questions, because of the varied patterns in the sub-networks. Furthermore, the fact that the Tainted P&Z network becomes disassortative for two months is highly irregular. All of these observations require further investigation, for example by looking at the degree distribution of the sub-networks, and a closer inspection of the structure of transactions at various moments.

Fig. 4. Evolution of Gini coefficients of in-degree, out-degree, transaction amount, tainted Z ratio and tainted P ratio, as well as degree correlation and clustering coefficient for the whole transaction network and five types of sub-networks.
5 Conclusion

In this paper we used network science to detect and investigate cryptographic anomalies. Based on two types of anomalies, we constructed sub-networks of bitcoin transactions and compared their structural properties. We saw that the distribution of several node properties, such as in-degree, transaction amount and tainted ratio, is different in the sub-networks when compared to the full network. This is apparent in the networks until late 2010, when the properties start to converge to what is observed in the full network. In particular, the degree correlation of the sub-network with both anomalies shows a great deviation from the rest at the same time as both these anomalies were prominent in block mining.

This paper has an additional contribution. The size of the blockchain and its transactions places a prohibitively high computational complexity on analysing its network behaviour. The technique used here of sampling when creating the sub-networks has allowed us to adequately estimate the networks’ properties, as Figs. 3 and 4 show. Using this as a basis for similar methods to compress computation time for blockchain transaction analysis is worth exploring.

Further work is needed to get a full grasp on what exactly is happening in the networks we examined. Our analysis is based on monthly updates of the network, whereas weekly or daily updates might give a better sense of when and how the anomalies are having an effect on transaction patterns. Moreover, we are looking at a network of transactions only, and not including the wallets. Having wallets as nodes would change the network structure and may well provide other insights. Finally, we have only analysed transactions until mid 2012. In our continued work, our plan is to consider the entire blockchain.
References

1. Alarab, I., Prakoonwit, S., Nacer, M.I.: Comparative analysis using supervised learning methods for anti-money laundering in bitcoin. In: Proceedings of the 2020 5th International Conference on Machine Learning Technologies, pp. 11–17 (2020)
2. Barabási, A., et al.: Network Science. Cambridge University Press, Cambridge (2016)
3. Bartoletti, M., Pes, B., Serusi, S.: Data mining for detecting bitcoin ponzi schemes. In: 2018 Crypto Valley Conference on Blockchain Technology (CVCBT), pp. 75–84. IEEE (2018)
4. Bovet, A., Campajola, C., Mottes, F., Restocchi, V., Vallarano, N., Squartini, T., Tessone, C.J.: The evolving liaisons between the transaction networks of bitcoin and its price dynamics (2019). arXiv preprint: arXiv:1907.03577
5. Courtois, N.T., Grajek, M., Naik, R.: The unreasonable fundamental incertitudes behind bitcoin mining (2013). arXiv preprint: arXiv:1310.7935
6. Dev, J.A.: Bitcoin mining acceleration and performance quantification. In: 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), pp. 1–6. IEEE (2014)
7. Eyal, I., Sirer, E.G.: Majority is not enough: bitcoin mining is vulnerable. In: Christin, N., Safavi-Naini, R. (eds.) Financial Cryptography and Data Security. FC 2014. Lecture Notes in Computer Science, vol. 8437, pp. 436–454. Springer, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45472-5_28
8. Hu, Y., Seneviratne, S., Thilakarathna, K., Fukuda, K., Seneviratne, A.: Characterizing and detecting money laundering activities on the bitcoin network (2019). arXiv preprint: arXiv:1912.12060
9. Kondor, D., Pósfai, M., Csabai, I., Vattay, G.: Do the rich get richer? An empirical analysis of the bitcoin transaction network. PLoS ONE 9(2) (2014)
10. Li, S.-N., Yang, Z., Tessone, C.J.: Mining blocks in a row: a statistical study of fairness in bitcoin mining. In: 2020 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), pp. 1–4. IEEE (2020)
11. Lorenz, J., Silva, M.I., Aparício, D., Ascensão, J.T., Bizarro, P.: Machine learning methods to detect money laundering in the bitcoin blockchain in the presence of label scarcity (2020). arXiv preprint: arXiv:2005.14635
12. Mallett, J.: A report on cryptographic anomalies in the bitcoin blockchain (2020)
13. McGinn, D., McIlwraith, D., Guo, Y.: Towards open data blockchain analytics: a bitcoin perspective. R. Soc. Open Sci. 5(8), 180298 (2018)
14. Monamo, P., Marivate, V., Twala, B.: Unsupervised learning for robust bitcoin fraud detection. In: 2016 Information Security for South Africa (ISSA), pp. 129–134. IEEE (2016)
15. Monamo, P.M., Marivate, V., Twala, B.: A multifaceted approach to bitcoin fraud detection: global and local outliers. In: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 188–194. IEEE (2016)
16. Nakamoto, S.: Bitcoin: A peer-to-peer electronic cash system (2008). https://bitcoin.org/bitcoin.pdf
17. Pavithran, D., Al-Karaki, J.N., Thomas, R., Shibu, C., Gawanmeh, A.: Data-driven analysis of price change, user behavior and wealth accumulation in bitcoin transactions. In: 2019 Advances in Science and Engineering Technology International Conferences (ASET), pp. 1–6. IEEE (2019)
18. Pham, T., Lee, S.: Anomaly detection in the bitcoin system - a network perspective (2016). arXiv preprint: arXiv:1611.03942
19. Ron, D., Shamir, A.: Quantitative analysis of the full bitcoin transaction graph. In: Sadeghi, A.R. (ed.) Financial Cryptography and Data Security. FC 2013. Lecture Notes in Computer Science, vol. 7859. Springer, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39884-1_2
20. Turner, A.B., McCombie, S., Uhlmann, A.J.: Discerning payment patterns in bitcoin from ransomware attacks. Journal of Money Laundering Control (2020)
21. Vallarano, N., Tessone, C., Squartini, T.: Bitcoin transaction networks: an overview of recent results (2020). arXiv preprint: arXiv:2005.00114
22. Weber, M., Domeniconi, G., Chen, J., Weidele, D.K.I., Bellei, C., Robinson, T., Leiserson, C.E.: Anti-money laundering in bitcoin: experimenting with graph convolutional networks for financial forensics (2019). arXiv preprint: arXiv:1908.02591
Structural Network Measures

Top-k Connected Overlapping Densest Subgraphs in Dual Networks
Riccardo Dondi¹, Pietro Hiram Guzzi², and Mohammad Mehdi Hosseinzadeh¹
¹ Università degli Studi di Bergamo, Bergamo, Italy
{riccardo.dondi,m.hosseinzadeh}@unibg.it
² Magna Graecia University, Catanzaro, Italy
[email protected]
Abstract. Networks are widely used for modelling and analysing data and the relations among them. Recently, it has been shown that the use of a single network may not be the optimal choice, since a single network may miss some aspects. Consequently, it has been proposed to use a pair of networks to better model all the aspects, and the main approach is referred to as dual networks (DNs). A DN consists of a pair of related graphs (one weighted, the other unweighted) that share the same set of vertices and have two different edge sets. It is often interesting to extract common subgraphs in the two networks that are dense in the conceptual network and connected in the physical one. The simplest instance of this problem is finding a common densest connected subgraph (DCS), while here we focus on the detection of the Top-k Densest Connected subgraphs, i.e. a set of k subgraphs having the largest density in the conceptual network which are also connected in the physical network. We formalise the problem and then we propose a heuristic to find a solution, since the problem is computationally hard. A set of experiments on synthetic and real networks is also presented to support our approach.

Keywords: Dual networks · Dense subgraphs · Graph algorithms · Heuristics

1 Introduction
Networks are widely used to represent and analyse data and the relations among them in many fields. For instance, in biology and medicine networks are used to model relationships among macromolecules in a cell (e.g. nucleic acids, proteins and genes). Similarly, in social network analysis, graphs are used to model associations among users and the analysis may reveal association patterns or communities of similar users [23]. Classically, a single network has been used to model data and to extract relevant knowledge by looking at topological parameters, i.e. community-related structures [3] such as groups of related genes or users [20]. More recently, it has been shown that the use of a single network may not be able to capture efficiently all the relationships among the elements considered; therefore, more complex models have been introduced, such as heterogeneous networks [21] or dual networks [25]. A dual network is a pair of related graphs sharing the same vertex set, with two different edge sets. One network has unweighted edges, and it is called the physical graph. The second one has weighted edges and it is called the conceptual graph. Dual networks have been used in other works to model interactions among genetic variants [22].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 585–596, 2021. https://doi.org/10.1007/978-3-030-65351-4_47

Fig. 1. Workflow of the proposed approach. In the first step the input conceptual and physical networks are merged together using a network alignment approach; then Weighted-Top-k-Overlapping DCS is applied on the alignment graph. Each extracted subgraph induces a connected subgraph in the physical network and one of the top-k overlapping weighted densest subgraphs in the conceptual one.
In dual networks, finding a Densest Connected Subgraph (DCS) may reveal hidden and relevant knowledge that cannot be captured by using a single graph. Formally, given two input graphs $G_c(V, E_c)$ (undirected and edge-weighted) and $G_p(V, E_p)$ (undirected and unweighted), the problem consists in finding a subset of nodes $I_s$ that induces a densest community in $G_c$ and a connected subgraph in $G_p$. As proved in [25], the DCS problem is NP-hard; therefore, there is a need for novel heuristics and computational approaches to solve it. Recently, researchers have focused on the identification of a set of densest subgraphs that may better capture the presence of communities in real networks [2,10]. Many classical problems look for a single dense subgraph, while in [8,10,16] the problem has been extended to the identification of a set of k, with k ≥ 1, overlapping densest subgraphs. Such subgraphs may represent relevant communities sharing some nodes, such as hubs.

In this paper, we explore the problem of finding top-k weighted overlapping densest connected subgraphs in dual networks. While finding such subgraphs in a single network has been addressed by ad-hoc heuristics [19], the variant of the problem in dual networks is still challenging. We model the problem as a variant of the local network alignment problem and we propose a novel algorithm to solve it. The approach is based on a two-step strategy: first a single alignment graph is built from the dual networks [13,21], then we look for dense subgraphs in this network with an ad-hoc heuristic. Notice that these subgraphs correspond to dense subgraphs in the conceptual network and connected subgraphs in the physical one; therefore, they are solutions of the initial problem. Figure 1 depicts the workflow of our approach.

Our approach is conceptually different from the existing approaches for the DCS problem and it enables more flexibility. For instance, Wu et al. [25] do not consider overlapping subgraphs and their approach is limited to the exact correspondence of nodes between networks. On the other hand, with respect to other approaches for finding densest subgraphs in a network [2,8,10,12], we consider weighted networks, an extension that can be useful in many contexts, in particular for biological networks. We provide an implementation of our heuristic and we show the effectiveness of our approach on synthetic datasets and through three case studies on social network data, on biological networks and on a co-authorship network. The experimental results confirm the effectiveness of our approach.

The paper is structured as follows: Sect. 2 discusses related works; Sect. 3 gives definitions and formally introduces the problem we are interested in; Sect. 4 presents our heuristic; Sect. 5 discusses the case studies; finally, Sect. 6 concludes the paper.
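The two requirements in the DCS problem statement of this introduction (high density in the conceptual graph, connectivity in the physical graph) can be checked for any candidate node set with a few lines of code. The sketch below is only an illustration and not the heuristic proposed in the paper. It assumes that weighted density means the sum of the conceptual edge weights inside the set divided by the number of nodes in the set; the function names, the edge representations and the toy graphs are hypothetical.

```python
from collections import deque

def weighted_density(nodes, conceptual_edges):
    """Sum of conceptual edge weights inside `nodes`, divided by |nodes|."""
    nodes = set(nodes)
    inside = sum(w for (u, v), w in conceptual_edges.items()
                 if u in nodes and v in nodes)
    return inside / len(nodes)

def is_connected(nodes, physical_edges):
    """True if `nodes` induces a connected subgraph of the physical graph."""
    nodes = set(nodes)
    adj = {u: set() for u in nodes}
    for u, v in physical_edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u] - seen:  # unseen neighbours inside the node set
            seen.add(v)
            queue.append(v)
    return seen == nodes

# Toy dual network on the shared vertex set {a, b, c, d}.
conceptual = {("a", "b"): 2.0, ("b", "c"): 1.0,
              ("a", "c"): 3.0, ("c", "d"): 0.5}
physical = [("a", "b"), ("b", "c"), ("c", "d")]

density_abc = weighted_density({"a", "b", "c"}, conceptual)  # 2.0
feasible_abc = is_connected({"a", "b", "c"}, physical)       # True
feasible_ac = is_connected({"a", "c"}, physical)             # False
```

In the toy example the pair {a, c} has the heaviest conceptual edge but is not connected in the physical graph, so it is not a feasible solution, whereas {a, b, c} is.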
2 Related Work
Many complex systems cannot be efficiently modelled using a single network without loss of information. Therefore, the use of dual networks is growing [25]. The problem of finding communities in a network or a dual network is based on the specific model of dense or cohesive graph considered. Several models of cohesive subgraph have been considered in the literature and applied in different contexts. One of the first definitions of a cohesive subgraph is a fully connected subgraph, i.e. a clique. However, the determination of a clique of maximum size, also referred to as the Maximum Clique Problem, is NP-hard [15], and it is difficult to approximate [27]. Moreover, real networks sometimes miss some edges; therefore, the clique model is often too strict and may fail to find some important information. Consequently, many alternative definitions of cohesive subgraphs that are not fully interconnected have been introduced, including s-club, s-plex and densest subgraph [9,18]. A densest subgraph is a subgraph with maximum density (the ratio between the number of edges and the number of nodes of the subgraph), and the Densest-Subgraph problem asks for a subgraph of maximum density in a given graph. The problem can be solved in polynomial time [11,17] and approximated within factor 1/2 [1,5]. Notice that the Densest-Subgraph problem can also be extended to edge-weighted networks. Recently, Wu et al. [25] proposed an algorithm for finding a densest connected subgraph in a dual network. The approach is based on a two-step strategy. In the first step, the algorithm prunes the dual network without eliminating
R. Dondi et al.
the optimal solution. In the second step, two greedy approaches are developed to build a search strategy for finding a densest connected subgraph. Briefly, the first step finds the densest subgraph in the conceptual network. The second step refines this subgraph to guarantee that it is connected in the physical network. While the literature on network mining has mainly focused on the problem of finding a single subgraph, interest in finding more than one subgraph has recently emerged [2,8,10,16]. The proposed approaches usually allow overlap between the computed dense subgraphs. Indeed, there can be nodes that are shared between interesting dense subgraphs, for example hubs. The proposed approaches differ in the way they deal with overlap. The problem defined in [2] controls the overlap by limiting the Jaccard coefficient between each pair of subgraphs of the solution. The Top-k-Overlapping problem, introduced in [10], includes a distance function in the objective function. In this paper, we follow this last approach and we extend it to weighted networks.
3 Definitions
In this section we define the main concepts related to our problem, starting with the definition of dual network.

Definition 1. Dual Network. A dual network (DN) $G(V, E_c, E_p)$ is a pair of networks: a conceptual weighted network $G_c(V, E_c)$ and a physical unweighted one $G_p(V, E_p)$.

Now, we introduce the definition of density.

Definition 2. Density. Given a weighted graph $G(V, E, \mathrm{weight})$, let $v \in V$ be a node of $G$, and let

$$\mathrm{vol}(v) = \sum_{w : (v,w) \in E} \mathrm{weight}(v, w)$$

be the sum of the weights of the edges incident to $v$. The density of the weighted graph is defined as

$$\rho(G) = \frac{\sum_{v \in V} \mathrm{vol}(v)}{|V|}.$$

Given a graph (weighted or unweighted) $G$ with a set $V$ of nodes and a subset $Z \subseteq V$, we denote by $G[Z]$ the subgraph of $G$ induced by $Z$. Given $E' \subseteq E$, we denote by $\mathrm{weight}(E')$ the sum of the weights of the edges in $E'$. Given a dual network, we may consider the subgraphs $G_p[I]$ and $G_c[I]$, induced in the two networks by the node set $I \subseteq V$. A densest common subgraph (DCS) is a subset of nodes $I$ such that the density of the induced conceptual network is maximised and the induced physical network is connected, as formally defined in the following.

Definition 3. Densest Common Subgraph. Given a dual network $G(V, E_c, E_p)$, a densest common subgraph in $G(V, E_c, E_p)$ is a subset of nodes $I \subseteq V$ such that $G_p[I]$ is connected and the density of $G_c[I]$ is maximum.
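As a minimal sketch of Definitions 2 and 3, the following Python fragment computes the density of an induced conceptual subgraph and checks connectivity of the induced physical subgraph. The dictionary-based graph representation and the function names `induced_density` and `is_connected` are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from collections import deque


def induced_density(gc, nodes):
    """Density of Gc[nodes]: sum of vol(v) over the induced subgraph, divided by |nodes|.

    gc is assumed to be a symmetric dict of dicts: gc[u][v] == gc[v][u] == weight(u, v).
    """
    nodes = set(nodes)
    if not nodes:
        return 0.0
    total_vol = sum(
        w for v in nodes for u, w in gc.get(v, {}).items() if u in nodes
    )
    return total_vol / len(nodes)


def is_connected(gp, nodes):
    """True if Gp[nodes] is connected (breadth-first search restricted to nodes).

    gp is assumed to be a symmetric dict mapping each node to a set of neighbours.
    """
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for u in gp.get(v, ()):
            if u in nodes and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen == nodes


# Toy dual network on the node set {a, b, c, d} (hypothetical data).
gc = {
    "a": {"b": 1.0, "c": 0.5},
    "b": {"a": 1.0, "c": 0.5},
    "c": {"a": 0.5, "b": 0.5, "d": 0.2},
    "d": {"c": 0.2},
}
gp = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

In this toy example the node set {a, b, c} is a feasible candidate (connected in the physical network), whereas {a, c} is not, since a and c are linked in the physical network only through b.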
In this paper, we are interested in finding k ≥ 1 densest connected subgraphs. However, to avoid taking several copies of the same subgraph, or subgraphs that are very similar, we consider the following distance function introduced in [10].

Definition 4. Let $G(V, E_c, E_p)$ be a dual network and let $G[A]$, $G[B]$, with $A, B \subseteq V$, be two induced subgraphs of $G$. The distance between $G[A]$ and $G[B]$, denoted by $d : 2^V \times 2^V \to \mathbb{R}_+$, is defined as

$$d(G[A], G[B]) = \begin{cases} 2 - \dfrac{|A \cap B|^2}{|A||B|} & \text{if } A \neq B, \\ 0 & \text{otherwise.} \end{cases}$$

Notice that $2 - \frac{|A \cap B|^2}{|A||B|}$ decreases as the overlap between $A$ and $B$ increases. Now, we are able to introduce the problem we are interested in.

Problem 1. Weighted-Top-k-Overlapping DCS
Input: A dual network $G(V, E_c, E_p)$, a parameter λ > 0.
Output: a set $\mathcal{X} = \{G[X_1], \ldots, G[X_k]\}$ of k connected subgraphs of $G$, with k ≥ 1, such that the following objective function is maximised:

$$\sum_{i=1}^{k} \rho(G_c[X_i]) + \lambda \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} d(G[X_i], G[X_j])$$
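The distance of Definition 4 and the objective function of Problem 1 can be illustrated with a short sketch. The function names `distance` and `objective` are hypothetical, node sets are assumed to be non-empty Python sets, and the densities of the conceptual subgraphs are assumed to be precomputed and passed in as a list.

```python
def distance(a, b):
    """Distance of Definition 4 between two non-empty node sets."""
    a, b = set(a), set(b)
    if a == b:
        return 0.0
    return 2.0 - len(a & b) ** 2 / (len(a) * len(b))


def objective(densities, node_sets, lam):
    """Sum of the densities plus lam times the sum of pairwise distances.

    densities[i] is assumed to be the density of the conceptual subgraph
    induced by node_sets[i].
    """
    total = sum(densities)
    k = len(node_sets)
    for i in range(k - 1):
        for j in range(i + 1, k):
            total += lam * distance(node_sets[i], node_sets[j])
    return total
```

For instance, two disjoint node sets have the maximum distance 2, while the sets {1, 2} and {2, 3} share one node out of two each and have distance 2 − 1/4 = 1.75; larger values of λ therefore reward solutions whose subgraphs overlap less.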
Weighted-Top-k-Overlapping DCS, for k ≥ 3, is NP-hard, as it is already NP-hard on unweighted graphs [8]. Notice that for k = 1, Weighted-Top-k-Overlapping DCS is exactly the problem of finding a single weighted densest connected subgraph, hence it can be solved in polynomial time.

3.1 Greedy Algorithms for DCS
One of the ingredients of our method is a variant of a greedy algorithm for DCS. This latter algorithm, denoted by Greedy, is an approximation algorithm for computing a connected densest subgraph of a given graph. Given a weighted graph $G$, Greedy [1,5] iteratively removes from $G$ a vertex $v$ having lowest $\mathrm{vol}(v)$ and stops when all the vertices of the graph have been removed. At each iteration i, with 1 ≤ i ≤ |V|, Greedy computes a subgraph $G_i$, and it returns a densest subgraph among $G_1, \ldots, G_{|V|}$. The algorithm has time complexity $O(|E| + |V| \log |V|)$ on weighted graphs and achieves an approximation factor of 1/2 [1,5]. We introduce here a variant of the Greedy algorithm, called V-Greedy. Given an input weighted graph $G$, V-Greedy, similarly to Greedy, at each iteration i, with 1 ≤ i ≤ |V|, removes a vertex $v$ having lowest $\mathrm{vol}(v)$ and computes a subgraph $G_i$. Then, among the subgraphs $G_1, \ldots, G_{|V|}$, V-Greedy returns a subgraph $G_i$ that maximises the value $\rho(G_i) + 2\,\frac{\rho(G_i)}{|V_i|}$. Essentially, we add to the density the correction factor $2\,\frac{\rho(G_i)}{|V_i|}$ to avoid that the subgraph returned by V-Greedy contains almost disjoint dense subgraphs. For example, consider a graph with two equal-size cliques $K_1$ and $K_2$ having the same (large) weighted density and a single edge of large weight connecting
them. Then the union of $K_1$ and $K_2$ is denser than both $K_1$ and $K_2$, hence Greedy returns the union of $K_1$ and $K_2$. This may prevent us from finding $K_1$ and $K_2$ as a solution of Weighted-Top-k-Overlapping DCS. In this example, when the density of $K_1$ and $K_2$ is close enough to the density of their union, V-Greedy will instead return one of $K_1$, $K_2$.
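The difference between Greedy and V-Greedy on the two-clique example can be sketched in a few lines of Python. This is an illustrative peeling routine under stated assumptions, not the authors' MATLAB implementation: the graph is a symmetric dict of dicts without self-loops, $G_1$ is taken to be the whole input graph, ties on $\mathrm{vol}(v)$ are broken by node label (so labels must be comparable), and the names `peel`, `greedy` and `v_greedy` are hypothetical.

```python
import heapq


def peel(graph, score):
    """Remove a lowest-vol vertex at each step; return the node set of the best subgraph.

    score(rho, size) ranks each intermediate subgraph G_i from its density and size.
    """
    vol = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    alive = set(graph)
    total_vol = sum(vol.values())  # equals twice the total edge weight
    heap = [(d, v) for v, d in vol.items()]
    heapq.heapify(heap)
    removed = []
    best_score, best_size = float("-inf"), len(alive)
    while alive:
        s = score(total_vol / len(alive), len(alive))
        if s > best_score:
            best_score, best_size = s, len(alive)
        # Lazy deletion: skip heap entries that are stale or refer to removed nodes.
        while True:
            d, v = heapq.heappop(heap)
            if v in alive and d == vol[v]:
                break
        alive.remove(v)
        removed.append(v)
        total_vol -= 2 * vol[v]  # each incident edge counted at both endpoints
        for u, w in graph[v].items():
            if u in alive:
                vol[u] -= w
                heapq.heappush(heap, (vol[u], u))
    return set(graph) - set(removed[: len(graph) - best_size])


def greedy(graph):
    """Plain peeling: keep the intermediate subgraph of maximum density."""
    return peel(graph, lambda rho, size: rho)


def v_greedy(graph):
    """V-Greedy scoring: density plus the correction term 2 * rho / |V_i|."""
    return peel(graph, lambda rho, size: rho + 2.0 * rho / size)


# Two triangles with unit weights joined by a single edge a1-b1 (hypothetical data).
edges = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
         ("b1", "b2"), ("b1", "b3"), ("b2", "b3"), ("a1", "b1")]
g = {}
for u, v in edges:
    g.setdefault(u, {})[v] = 1.0
    g.setdefault(v, {})[u] = 1.0
```

On this toy graph the union of the two triangles has density 14/6, higher than the density 2 of a single triangle, so `greedy(g)` keeps all six nodes; with the correction term, a single triangle scores 2 + 4/3 against 14/6 + 14/18 for the union, so `v_greedy(g)` returns one triangle.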
4 The Proposed Algorithm
In this section we present our heuristic for Weighted-Top-k-Overlapping DCS in dual networks. The approach is based on two main steps: (i) first, the input networks are integrated into a single weighted alignment graph that preserves the connectivity properties of the physical network; (ii) second, the obtained alignment graph is mined by using an ad-hoc heuristic for Weighted-Top-k-Overlapping DCS.

4.1 Building of the Alignment Graph
In the first step, the algorithm receives as input a weighted graph $G_c(V, E_c)$, an unweighted graph $G_p(V, E_p)$, an initial set of node pairs between the input networks (seed nodes) representing corresponding nodes, and a distance threshold δ that represents the maximum distance that two nodes may have in the physical network. The algorithm starts by building the nodes of the alignment graph: each node of the alignment graph represents a pair of corresponding nodes of the input networks, according to the seed nodes. Then it adds the edges among nodes. The algorithm adds an edge connecting two nodes whenever the corresponding nodes are connected in both input networks. If the nodes are adjacent in both networks, then the edge connecting them has the same weight as the corresponding edge in the conceptual network. Conversely, if the nodes are adjacent in the physical network and they have a distance lower than δ (a user-defined threshold of length), then the weight of the edge is the average of the weights of the path linking the considered nodes.

4.2 A Heuristic for Weighted-Top-k-Overlapping DCS
In this section, we present our heuristic for Weighted-Top-k-Overlapping DCS, called Iterative Weighted Dense Subgraphs (IWDS). The heuristic starts with a set $\mathcal{X} = \emptyset$ and consists of k iterations. At each iteration i, 1 ≤ i ≤ k, given a set $\mathcal{X} = \{G[X_1], \ldots, G[X_{i-1}]\}$ of subgraphs, IWDS computes a subgraph $G[X_i]$ and adds it to $\mathcal{X}$. The first iteration of IWDS applies the V-Greedy algorithm (see Sect. 3.1) on the dual network $G$ and computes $G[X_1]$. In iteration i, with 2 ≤ i ≤ k, IWDS applies one of the two following cases, depending on a parameter f, 0 < f ≤ 1, and on the size of the set $C_{i-1} = \bigcup_{j=1}^{i-1} X_j$ (the set of nodes covered by the subgraphs in $\mathcal{X}$).
Case 1. If $|C_{i-1}| \le f|V|$ (that is, at most $f|V|$ nodes of $G$ are covered by the subgraphs in $\mathcal{X}$), IWDS applies the V-Greedy algorithm on a graph obtained by retaining α nodes (α is a parameter) of $C_{i-1}$ having highest weighted degree in $G$ and removing the other nodes of $C_{i-1}$. $G[X_i]$ is a weighted connected dense subgraph, distinct from those in $\mathcal{X}$, in the resulting graph.

Case 2. If $|C_{i-1}| > f|V|$ (more than $f|V|$ nodes of $G$ are covered by the subgraphs in $\mathcal{X}$), IWDS applies the V-Greedy algorithm on a graph obtained by removing (1 − α) nodes (recall that α is a parameter of IWDS) of $C_{i-1}$ having lowest weighted degree in $G$. IWDS computes $G[X_i]$ as a weighted connected dense subgraph, distinct from those in $\mathcal{X}$, in the resulting graph.

Complexity Evaluation. We denote by n (by m, respectively) the number of nodes (of edges, respectively) of the dual network. The first step requires the analysis of both graphs and the construction of the alignment graph, and it requires $O(n^2)$ (edge-weight calculation) time. The calculation of the edge weights requires the computation of the shortest paths among all the node pairs in the physical graph using Chan's algorithm [4]; therefore it requires $O(n m_p)$ time ($m_p$ is the number of edges of the physical graph). As for Step 2, IWDS makes k iterations. Each iteration applies V-Greedy on $G$ and requires $O(mn \log n)$ time [6]. Iteration i, with 2 ≤ i ≤ k, first computes the set of covered nodes to retain or remove. For this purpose, we sort the nodes in $C_{i-1}$ based on their weighted degree in $O(n \log n)$ time. Thus the overall time complexity of IWDS is $O(kmn \log n)$.
5 Experiments
In this section, we provide an experimental evaluation of IWDS on synthetic and real networks¹. The design of a strong evaluation scheme for our algorithm is not simple, since we have to face two main issues: (1) existing methods for computing the top-k overlapping subgraphs are defined for unweighted graphs and cannot be used on dual networks; (2) existing network alignment algorithms do not aim to extract densest subgraphs. Consequently, we cannot easily compare our approach with the existing state-of-the-art methods, and we design an ad-hoc procedure for the evaluation based on the following steps. First, we demonstrate on synthetic networks that our approach can find known densest subgraphs. In this way, we show that IWDS can correctly recover top-k weighted densest subgraphs. Then we apply our method to some real-world dual networks. The alignment algorithm is implemented in Python 3.7 using the NetworkX package for managing networks [14]. We implemented IWDS in MATLAB R2020a and we performed the experiments on a MacBook Pro (OS version 10.15.3) with a 2.9 GHz Intel Core i5 processor, 8 GB of 2133 MHz LPDDR3 RAM, and Intel Iris Graphics 550 with 1536 MB.

¹ The source code and data used in our experiments are available at https://github.com/mehdihosseinzadeh/-k-overlapping-densest-connected-subgraphs.

5.1 Synthetic Networks
In the first part of our experimental evaluation, we analyse the ability of IWDS to find planted ground-truth subgraphs on synthetic datasets.

Datasets. We generate a synthetic dataset consisting of k = 5 planted dense subgraphs (cliques). Each planted dense subgraph contains 30 nodes and has edge weights randomly generated in the interval [0.8, 1]. These cliques are then connected to a background subgraph of 100 nodes. We consider three different ways to generate the background subgraph: Erdős-Rényi with parameter p = 0.1, Erdős-Rényi with parameter p = 0.2 and Barabási-Albert with parameter equal to 10. Weights of the background graphs are randomly generated in the interval [0, 0.5]. Then 50 edges are randomly added (with weights randomly generated in the interval [0, 0.5]). Based on this approach, we generate two different sets of synthetic networks, called Synthetic1 and Synthetic2. Synthetic1 is generated as described above. Synthetic2 is obtained by applying noise to the synthetic networks of Synthetic1. In this latter case, noise is added by varying 5% or 10% of the node relations of each network. Pairs of nodes are chosen randomly: if they belong to the same clique, the weight of the edge connecting the two nodes is changed to a random value in the interval [0, 0.5]; otherwise, an edge connecting the two nodes is added (if not already in the network) and its weight is randomly assigned a value in the interval [0.8, 1].

Outcome. Table 1 reports the F1-score, based on precision and recall, to measure the accuracy of our method in detecting the ground-truth subgraphs. The F1-score takes values between 0 and 1, where the value 1 is obtained when the detected subgraphs exactly correspond to the ground-truth subgraphs. Following [26], we consider the number of shared nodes between each ground-truth subgraph and each detected subgraph, so that we are able to define the best matching of ground-truth subgraphs and detected subgraphs.
Then, we compute the F1[t/d] measure as the average F1-score of the best-matching ground-truth subgraph to each detected subgraph (truth to detected) and the F1[d/t] measure as the average F1-score of the best-matching detected subgraph to each ground-truth subgraph (detected to truth). Table 1 shows that, for the noiseless Synthetic1, IWDS is able to find the ground-truth subgraphs for all three values of α, averaged over 300 examples. Table 1 also shows the performance of IWDS on the Synthetic2 datasets using noise equal to 0.05 and 0.10, averaged over 90 examples. For noise equal to 0.05, IWDS outputs solutions that are very close to the ground-truth subgraphs for all three values of α. For noise equal to 0.10, the performance of the algorithm is slightly worse, but still remains close to the ground-truth subgraphs.

5.2 Dual Networks
We evaluate IWDS on three real-world dual network datasets:
Table 1. Performance of IWDS: (top) performance on Synthetic1 for k = 5, varying α from 0.05 to 0.25, averaged over 300 examples; (bottom) performance of IWDS on Synthetic2 for k = 5, varying α from 0.05 to 0.25, averaged over 90 examples.

Synthetic1
          Measure    α = 0.05   α = 0.1   α = 0.25
          F1[t/d]    1.00       1.00      1.00
          F1[d/t]    1.00       1.00      1.00

Synthetic2
Noise     Measure    α = 0.05   α = 0.1   α = 0.25
0.05      F1[t/d]    0.99       1.00      1.00
          F1[d/t]    0.99       0.99      0.99
0.10      F1[t/d]    0.98       0.97      1.00
          F1[d/t]    0.96       0.96      0.97
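The two measures reported in Table 1 can be illustrated with a short sketch. It assumes that each subgraph is given as a Python set of nodes and that the F1-score between a detected and a ground-truth subgraph is computed from the precision and recall of their shared nodes; the function names are hypothetical and the code is not taken from the authors' evaluation scripts.

```python
def f1(detected, truth):
    """F1-score of one detected node set against one ground-truth node set."""
    shared = len(detected & truth)
    if shared == 0:
        return 0.0
    # Harmonic mean of precision (shared/|detected|) and recall (shared/|truth|).
    return 2.0 * shared / (len(detected) + len(truth))


def best_match_average(sources, targets):
    """Average, over sources, of the F1-score of the best-matching target."""
    if not sources or not targets:
        return 0.0
    return sum(max(f1(s, t) for t in targets) for s in sources) / len(sources)


def f1_truth_to_detected(truth, detected):
    """F1[t/d]: best-matching ground-truth subgraph for each detected subgraph."""
    return best_match_average(detected, truth)


def f1_detected_to_truth(truth, detected):
    """F1[d/t]: best-matching detected subgraph for each ground-truth subgraph."""
    return best_match_average(truth, detected)
```

The two directions differ when the numbers of detected and ground-truth subgraphs differ: with ground truth [{1, 2, 3}, {4, 5, 6}] and a single detected subgraph {1, 2, 3}, F1[t/d] is 1.0 while F1[d/t] is 0.5, because the second planted subgraph has no match.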
Datasets. G-graphA. The G-graphA dataset is derived from the GoWalla social network, where users share their locations (expressed as GPS coordinates) by checking in on the web site [7]. Each node represents a user and each edge links two friends in the network. We obtained the physical network by considering the friendship relation on the social network. We calculated the conceptual network by considering the distance among users. Then we ran the first step of our algorithm and obtained the alignment graph G-graphA, containing 2241339 interactions and 9878 nodes (we set δ=4). In this case a DCS represents a set of friends that share check-ins in nearby locations.

DBLP-graphA. The DBLP-graphA dataset is derived from a computer science bibliography and represents interactions between authors. Nodes represent authors. Each edge in the physical network connects two authors that co-authored at least one paper. Edges in the conceptual network represent the similarity of the research interests of the authors, calculated on the basis of all their publications. After running the first step of the algorithm (using δ=4), we obtained the alignment graph DBLP-graphA, containing 553699 interactions and 18954 nodes. In this case a DCS represents a set of co-authors that share some strong common research interests, and the use of DNs is mandatory, since the physical network shows only co-authors, who may not have many common interests, and the conceptual network represents authors with common interests, who may not be co-authors.

HS-graphA. HS-graphA is a biological dataset taken from the STRING database [24]. Each node represents a protein, and each edge takes into account the reliability of the interactions. We use two networks for modelling the database: a conceptual network represents such reliability values, and a physical network stores the binary interactions.
The HS-graphA dataset contains 5879727 interactions and 19354 nodes (we set δ=4) (Table 2).

Table 2. Properties of the alignment graphs obtained for each dataset.

Graph         Representation         Nodes   Edges     Density
DBLP-graphA   Co-authorship          18954   553699    0.0026
G-graphA      Social                 9878    2241339   0.0448
HS-graphA     Protein interactions   19354   5879727   0.0313

Outcome. For these large datasets, we set the value of k to 20, following the approach in [10]. Table 3 reports the running time of IWDS, and the density and distance of the solutions returned by IWDS. We consider three different values of α. As shown in Table 3, by increasing the value of α from 0.05 to 0.25, IWDS (except in one case, HS-graphA with α = 0.1) returns solutions that are denser, but with lower distance. Table 3 also shows how the running time of IWDS is influenced by the size of the network and by the value of α. The running time is affected not only by the number of nodes of the network, but also by its density. DBLP-graphA and HS-graphA have almost the same number of nodes, but HS-graphA is much denser than DBLP-graphA. IWDS is considerably slower on HS-graphA than on DBLP-graphA (1.986 times slower for α = 0.05, 6.218 times slower for α = 0.25). The running time of IWDS increases as α increases. Indeed, by increasing the value of α, fewer nodes are removed by Case 1 and Case 2 of IWDS, hence in the iterations of IWDS V-Greedy is applied to larger subgraphs.

Table 3. Performance of IWDS on real-world networks for k = 20, varying α from 0.05 to 0.25. For each network, we report the size of the network (number of vertices |V| and edges |E|), the running time in minutes, the density and the distance.

Dataset       |V|     |E|       Measure    α = 0.05   α = 0.1   α = 0.25
G-graphA      9878    2241339   Time       89.84      98.72     184.87
                                Density    2863.99    4000.73   6345.67
                                Distance   275.82     257.84    220.16
DBLP-graphA   18954   553699    Time       105.69     125.71    165.25
                                Density    39.61      52.39     74.12
                                Distance   307.72     231.25    213.04
HS-graphA     19354   5879727   Time       209.88     749.06    1027.58
                                Density    1326.07    1153.68   1799.22
                                Distance   226.40     212.34    205.55

6 Conclusion
DNs are used to model two kinds of relationships among elements in the same scenario. A DN is a pair of networks that have the same set of nodes. One network has unweighted edges (physical network), while the second one has weighted edges (conceptual network). In this contribution, we introduced an approach that first integrates a physical and a conceptual network into an alignment graph.
Then, we solved the Weighted-Top-k-Overlapping DCS problem on the alignment graph to find k dense connected subgraphs. These subgraphs represent subsets of nodes that are strongly related in the conceptual network and that are connected in the physical one. We presented a heuristic, called IWDS, for Weighted-Top-k-Overlapping DCS and an experimental evaluation of IWDS. We first showed, as a proof of concept, the ability of our algorithm to retrieve known densest subgraphs in synthetic networks. Then we tested the approach on some real networks to demonstrate its effectiveness. Future work will consider a possible high-performance implementation of our approach and the application of the IWDS algorithm to other scenarios (e.g. financial or marketing datasets).
References

1. Asahiro, Y., Iwama, K., Tamaki, H., Tokuyama, T.: Greedily finding a dense subgraph. J. Algorithms 34(2), 203–221 (2000)
2. Balalau, O.D., Bonchi, F., Chan, T.H., Gullo, F., Sozio, M.: Finding subgraphs with maximum total density and limited overlap. In: Cheng, X., Li, H., Gabrilovich, E., Tang, J. (eds.) Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, Shanghai, China, February 2-6, 2015, pp. 379–388. ACM (2015)
3. Cannataro, M., Guzzi, P.H., Veltri, P.: Impreco: distributed prediction of protein complexes. Fut. Gener. Comput. Syst. 26(3), 434–440 (2010)
4. Chan, T.M.: All-pairs shortest paths for unweighted undirected graphs in o(mn) time. ACM Trans. Algorithms 8(4) (2012)
5. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Jansen, K., Khuller, S. (eds.) Approximation Algorithms for Combinatorial Optimization, Third International Workshop, APPROX 2000, Proceedings. Lecture Notes in Computer Science, vol. 1913, pp. 84–95. Springer, Heidelberg (2000)
6. Charikar, M.: Greedy approximation algorithms for finding dense components in a graph. In: Jansen, K., Khuller, S. (eds.) Approximation Algorithms for Combinatorial Optimization. APPROX 2000. Lecture Notes in Computer Science, vol. 1913, pp. 84–95. Springer, Berlin, Heidelberg (2000). https://doi.org/10.1007/3-540-44436-X_10
7. Cho, E., Myers, S.A., Leskovec, J.: Friendship and mobility: user movement in location-based social networks. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1082–1090. ACM (2011)
8. Dondi, R., Hosseinzadeh, M.M., Mauri, G., Zoppis, I.: Top-k overlapping densest subgraphs: approximation and complexity. In: Proceedings of the 20th Italian Conference on Theoretical Computer Science, ICTCS 2019, Como, Italy, September 9-11, 2019, pp. 110–121 (2019)
9. Dondi, R., Mauri, G., Sikora, F., Zoppis, I.: Covering a graph with clubs. J. Graph Algorithms Appl. 23(2), 271–292 (2019)
10. Galbrun, E., Gionis, A., Tatti, N.: Top-k overlapping densest subgraphs. Data Min. Knowl. Discov. 30(5), 1134–1165 (2016)
11. Goldberg, A.: Finding a maximum density subgraph. Technical report. Uni. California, Berkeley (1984)
12. Guzzi, P.H., Cannataro, M.: μ-cs: an extension of the TM4 platform to manage Affymetrix binary data. BMC Bioinform. 11(1), 315 (2010)
13. Guzzi, P.H., Milenković, T.: Survey of local and global biological network alignment: the need to reconcile the two sides of the same coin. Briefings in Bioinformatics, bbw132 (2017)
14. Hagberg, A., Swart, P., S Chult, D.: Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab. (LANL), Los Alamos, NM (United States) (2008)
15. Håstad, J.: Clique is hard to approximate within n^{1−ε}. In: Proceedings of 37th Conference on Foundations of Computer Science, pp. 627–636. IEEE (1996)
16. Hosseinzadeh, M.M.: Dense subgraphs in biological networks. In: Chatzigeorgiou, A., et al. (eds.) SOFSEM 2020: Theory and Practice of Computer Science. SOFSEM 2020. Lecture Notes in Computer Science, vol. 12011, pp. 711–719. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38919-2_60
17. Kawase, Y., Miyauchi, A.: The densest subgraph problem with a convex/concave size function. Algorithmica 80(12), 3461–3480 (2018)
18. Komusiewicz, C.: Multivariate algorithmics for finding cohesive subnetworks. Algorithms 9(1), 21 (2016)
19. Lee, V.E., Ruan, N., Jin, R., Aggarwal, C.: A survey of algorithms for dense subgraph discovery. In: Aggarwal, C., Wang, H. (eds.) Managing and Mining Graph Data. Advances in Database Systems, vol. 40, pp. 303–336. Springer, Boston, MA (2010). https://doi.org/10.1007/978-1-4419-6045-0_10
20. Liu, X., Shen, C., Guan, X., Zhou, Y.: Digger: detect similar groups in heterogeneous social networks. ACM Trans. Knowl. Disc. Data (TKDD) 13(1), 2 (2018)
21. Milano, M., Milenković, T., Cannataro, M., Guzzi, P.H.: L-HetNetAligner: a novel algorithm for local alignment of heterogeneous biological networks. Sci. Reports 10(1), 3901 (2020)
22. Phillips, P.C.: Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9(11), 855–867 (2008)
23. Sapountzi, A., Psannis, K.E.: Social networking data analysis tools & challenges. Fut. Gener. Comput. Syst. 86, 893–913 (2018)
24. Szklarczyk, D., Morris, J.H., Cook, H., Kuhn, M., Wyder, S., Simonovic, M., Santos, A., Doncheva, N.T., Roth, A., Bork, P., et al.: The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible. Nucleic Acids Research, gkw937 (2016)
25. Wu, Y., Zhu, X., Li, L., Fan, W., Jin, R., Zhang, X.: Mining dual networks: models, algorithms, and applications. TKDD (2016)
26. Yang, J., Leskovec, J.: Community-affiliation graph model for overlapping network community detection. In: 2012 IEEE 12th International Conference on Data Mining, pp. 1170–1175. IEEE (2012)
27. Zuckerman, D.: Linear degree extractors and the inapproximability of max clique and chromatic number. In: Kleinberg, J.M. (ed.) Proceedings of the 38th Annual ACM Symposium on Theory of Computing, Seattle, WA, USA, May 21-23, 2006, pp. 681–690. ACM (2006)
Interest Clustering Coefficient: A New Metric for Directed Networks Like Twitter

Thibaud Trolliet1(B), Nathann Cohen2, Frédéric Giroire2, Luc Hogie1, and Stéphane Pérennes2

1 INRIA Sophia-Antipolis, Sophia-Antipolis, France
[email protected]
2 Université Côte d’Azur/CNRS, Sophia-Antipolis, France
Abstract. We study here the clustering of directed social graphs. The clustering coefficient has been introduced to capture the social phenomenon that a friend of a friend tends to be my friend. This metric has been widely studied and has been shown to be of great interest to describe the characteristics of a social graph. In fact, the clustering coefficient is adapted for a graph in which the links are undirected, such as friendship links (Facebook) or professional links (LinkedIn). For a graph in which links are directed from a source of information to a consumer of information, it is no longer adequate. We show that former studies have missed much of the information contained in the directed part of such graphs. We thus introduce a new metric to measure the clustering of a directed social graph with interest links, namely the interest clustering coefficient. We compute it (exactly and using sampling methods) on a very large social graph, a Twitter snapshot with 505 million users and 23 billion links. We additionally provide the values of the formerly introduced directed and undirected metrics, a first on such a large snapshot. We show that the interest clustering coefficient is larger than the classic directed clustering coefficients introduced in the literature. This shows the relevance of the metric to capture the informational aspects of directed graphs.

Keywords: Complex networks · Clustering coefficient · Directed networks · Social networks · Twitter.

This work has been supported by the French government through the UCA JEDI (ANR-15-IDEX-01), EUR DS4H (ANR-17-EURE-004) Investments in the Future projects, and ANR DIGRAPHS, by the SNIF project, and by Inria associated team EfDyNet. The authors are grateful to the OPAL infrastructure from Université Côte d’Azur for providing resources and support.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 597–609, 2021. https://doi.org/10.1007/978-3-030-65351-4_48

1 Introduction

Networks appear in a large number of complex systems, whether they are social, biological, economical or technological. Examples include neuronal networks, the Internet, financial transactions, and online social networks. Most “real-world” networks exhibit some properties that are not due to chance and that are really different from those of random networks or regular lattices. In this paper, we focus on the study of the clustering coefficient of social networks. Nodes in a network tend to form highly connected neighborhoods. This tendency can be measured by the clustering coefficient. It is classically defined for undirected networks as three times the number of triangles divided by the number of open triangles (formed by two incident edges). This clustering coefficient has been computed in many social networks and has been observed to be much higher than what randomness would give. Triangles thus are of crucial interest to understand “real-world” networks. However, a large number of those networks are in fact directed (e.g. the web, online social networks like Instagram, financial transactions). It is for instance the case of Twitter, one of the largest and most influential social networks, with 126 million daily active users [14]. In Twitter, a person can follow someone she is interested in; the resulting graph, where there is a link u → v if the account associated to the node u follows the account associated to the node v, is thus directed. In this study, we used as main dataset the snapshot of Twitter (TS in short) extracted by Gabielkov et al. as explained in [6] and made available by the authors. The TS has around 505 million nodes and 23 billion arcs, making it one of the biggest snapshots of a social network available today. The classic definition of the clustering coefficient cannot be directly applied to directed graphs. This is why most of the studies computed it on the so-called mutual graph, as defined by Myers et al. in [11], i.e., on the subgraph built with only the bidirectional links. We call mutual clustering coefficient (mcc for short) the clustering coefficient associated with this graph. We computed this coefficient in the TS, using both exact and approximated methods.
We find a value for the mcc of 10.7%. This is a high value, of the same order as those found in other web social networks. However, this classical way to operate leaves out 2/3 of the graph! Indeed, we computed that the bidirectional edges represent only 35% of the edges of the TS. A way to avoid this is to consider all links as undirected and to compute the clustering coefficient of the obtained undirected graph. We refer to the corresponding clustering coefficient as the undirected clustering coefficient (ucc for short). Such a computation in the TS gives a value of ucc of only 0.11%. This is far lower than what was found in most undirected social networks. It is thus necessary to introduce specific clustering coefficients for directed graphs. More generally, when analyzing any directed dataset, it is of crucial importance to take into account the information contained in its directed part in the most adequate way. A first way to do that is to look at the different ways to form triangles with directed edges. Fagiolo computed the expected values of clustering coefficients considering directed triangles for random graphs in [5] and illustrated his method on empirical data on world-trade flows. There are two possible orientations of triangles: transitive and cyclic triangles, see Figs. 1b and 1c. Each type of triangle corresponds to a directed clustering coefficient:
Interest Clustering Coefficient: A New Metric for Directed Networks
– the transitive clustering coefficient (tcc in short), defined as: $tcc = \frac{\#\,\text{transitive triangles}}{\#\,\text{open transitive triangles}}$,
– the cyclic clustering coefficient (ccc in short), defined as: $ccc = \frac{3 \cdot \#\,\text{cyclic triangles}}{\#\,\text{open transitive triangles}}$.
We computed both coefficients for the snapshot, obtaining tcc = 1.9% and ccc = 1.7%. However, note that a large part of the transitive and cyclic triangles comes from bidirectional triangles. When removing them, we arrive at values of tcc = 0.51% and ccc = 0.24%. We believe those metrics miss an essential aspect of the Twitter graph: while the clustering coefficient was defined to represent the social cliques between people, it is not adequate to capture the information aspect of Twitter, known to be both a social and information media [8,11]. In this work, we go one step further in the way directed relationships are modeled. We argue that in directed networks, the best way to define a relation or similarity between two individuals (Bob and Alice) is not always by a direct link, but by a common interest, that is, two links towards the same node (e.g., Bob → Carol and Alice → Carol). Indeed, consider two nodes having similar interests. Unless they are friends, these two nodes do not have any reason to be directly connected. However, they would tend to be connected to the same out-neighbors. We exploit this to study a new notion of connections in directed networks and the naturally associated clustering coefficient, which we name interest clustering coefficient, or icc in short, and define as follows:
$$icc = \frac{4 \cdot \#\,\text{K22s}}{\#\,\text{open K22s}},$$
where a K22 is defined as a set of four nodes in which two of them follow the two others, and an open K22 is a K22 with a missing link, see Fig. 1d. We computed the icc on the Twitter snapshot, obtaining icc = 3.3% (3.1% when removing the bidirectional structures). This value, an order of magnitude higher than the previous clustering coefficients computed on the non-bidirectional directed graph, confirms the interest of this metric. While the triangle-based clustering coefficients are good metrics to capture the social aspect of a graph, the interest clustering coefficient is a good metric to capture the informational aspect. In summary, our contributions are the following:
– We define a new clustering coefficient for graphs with interest links.
– We succeeded in computing it, both exactly and using sampling methods, for a snapshot of Twitter with 505 million nodes and 23 billion edges.
– We additionally provide the values of the directed and undirected clustering coefficients previously defined in the literature. We believe this is the first time that such coefficients are computed exactly for a large directed online social network.
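To make the definition of the icc concrete, the following is a minimal brute-force sketch for very small digraphs. It assumes the counting convention used later in the paper (an open K22 is a fork towards a top vertex plus one extra arc from one of the two bottom vertices, so each closed K22 contains four open ones, hence the factor 4). The function name `brute_force_icc` and the arc-list input format are illustrative assumptions, not part of the original work.

```python
from itertools import combinations


def brute_force_icc(arcs):
    """Return (#K22, #open K22, icc) for a small digraph.

    `arcs` is an iterable of (source, target) pairs with comparable node labels.
    Quartic-time enumeration, only meant to illustrate the definition.
    """
    arcs = set(arcs)
    nodes = sorted({node for arc in arcs for node in arc})
    closed = 0
    opened = 0
    for u, v in combinations(nodes, 2):  # unordered pair of bottom vertices
        for x in nodes:
            # (u -> x, v -> x) must be a fork with top vertex x
            if x in (u, v) or (u, x) not in arcs or (v, x) not in arcs:
                continue
            for w in nodes:
                if w in (u, v, x):
                    continue
                # Open K22: the fork plus one arc from a bottom vertex to w.
                opened += ((u, w) in arcs) + ((v, w) in arcs)
                # Closed K22: both arcs exist; x < w counts each K22 once.
                if x < w and (u, w) in arcs and (v, w) in arcs:
                    closed += 1
    icc = 4 * closed / opened if opened else 0.0
    return closed, opened, icc


# Bob (0) and Alice (1) both follow Carol (2) and Dave (3): one closed K22.
print(brute_force_icc([(0, 2), (0, 3), (1, 2), (1, 3)]))
# Alice does not follow Dave: the structure stays open.
print(brute_force_icc([(0, 2), (0, 3), (1, 2)]))
```

This enumeration is only usable on toy graphs; it is given to fix the convention, not as a practical method.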
T. Trolliet et al.
The paper is organized as follows. We first discuss related work in Sect. 2. In Sect. 3, we present the algorithms we used to compute the values of the interest clustering coefficient, both exactly and by sampling. We discuss the results on the clustering coefficients of Twitter in Sect. 4.
Fig. 1. Closed (left) and open (right) undirected and directed triangles and K22s.
2 Related Work
Clustering coefficient. The clustering coefficient shows that, when two people know each other, there is a high probability that those people have common friends. The clustering coefficient has numerous important applications, such as spam detection [4], link recommendation [15], information spread [7], etc. There are different definitions of the clustering coefficient. The global clustering coefficient, sometimes also called transitivity, was first introduced by Barrat and Weigt in [2]. It is defined as 3 times the number of triangles in the graph, divided by the number of connected triplets of vertices in the graph. Another definition was given by Watts and Strogatz [19] and is called the local clustering coefficient. It is defined as the mean over all nodes of the graph of the local clustering of each node, that is the probability that two random neighbors of the node are also connected together. We use the global clustering coefficient in this paper. The clustering coefficient has also been defined for weighted graphs [13]. Computations for social graphs. The undirected clustering coefficient of some social networks has been provided in the literature. It has been computed on very large snapshots for Facebook [17], Flickr, and YouTube [10]. The local clustering coefficient has also been studied in the undirected mutual graph of Twitter [11]. The undirected clustering coefficient is usually much higher in social networks than in random models. Directed graphs. All these studies only consider the undirected clustering coefficient, even for directed graphs like Twitter. Fagiolo introduced definitions of directed clustering coefficients, that we named tcc and ccc [5], but those definitions had never been computed and discussed on large datasets to our
knowledge, as we do in this paper. Moreover, we believe that these metrics are not the most relevant ones for directed graphs with interest links. Computing substructures. Researchers studied methods to efficiently compute the number of triangles in a graph, as naive methods are computationally very expensive on large graphs. Two families of methods have been proposed: exact triangle counting or enumeration, and estimation. In the first family, the fastest algorithm is due to Alon, Yuster, and Zwick [1] and runs in $O(m^{1.41})$, with m the number of edges. However, methods using matrix multiplication cannot be used for large graphs because of their memory requirements. In practice, enumeration methods are often used, see e.g., [9]. Methods to count rectangles and butterfly structures in undirected bipartite networks were also proposed in [18] and in [12]. In this paper, we propose an efficient enumeration algorithm to count the number of K22s and open K22s in a very large graph. We focused on the case in which only one adjacency can be stored, as this was our case for the TS. To the best of our knowledge, we are the first to consider this setting.
3 Computing Clustering Coefficients in Twitter
To compute the interest clustering coefficient and the triangle clustering coefficients of very large networks, we used two different methods presented here: an exact count and an estimation using sampling techniques, either with a Monte Carlo algorithm or with a sampling of the graph. As a typical example of a massive directed social network with interest links, we carried out the computations for a Twitter snapshot (TS in short) with 505 million nodes and 23 billion links, described in Report [16].
3.1 Exact Count
We computed the exact numbers of K22s and open K22s in the Twitter Snapshot. Recall that we are discussing a dataset with hundreds of millions of nodes and billions of arcs. Results are reported in Table 1 and discussed in Sect. 4. We first present in this section the algorithms we use, and discuss their complexity. In the rest of this paper, we call top vertices (resp. bottom vertices) of a K22 the vertices which are destinations (resp. sources) of the K22 edges. We call a fork a set of two edges of a K22 connected to the same vertex. We say that a fork has top (resp. bottom) vertex x if both edges are connected to x and x is a top (resp. bottom) vertex of the K22. The same terminology applies to open K22s. Trivial algorithm. The trivial algorithm would consider all quadruplets of vertices with 2 upper vertices. Then, for each quadruplet, it would check the existence of a K22 and of open K22s. There are $\binom{4}{2}\binom{n}{4}$ such quadruplets. It thus gives a complexity of $O(n^4)$. This method thus cannot be considered for the TS as it would perform $6.4 \times 10^{33}$ iterations. Improved algorithm. The practical complexity can be greatly improved by only considering connected quadruplets, and by mutualizing the computations of
the common neighbors of the in-neighbors of a vertex, as explained below. The pseudo-code is given in Algorithm 1. The algorithm’s main loop iterates on the vertices of the graph. For each vertex x, we consider its in-neighborhood $N^-(x)$. We then compute how many times a vertex w (with w < x to avoid counting a K22 twice) appears in the out-neighborhoods of the vertices of $N^-(x)$. We denote it #occ(w). We use a table to store the value of #occ(w) in order to be able to do a single pass on each out-neighbor. For a vertex w, any pair of its #occ(w) in-neighbors common with x forms a K22 with x and w as top vertices. There are hence $\binom{\#occ(w)}{2}$ K22s with x and w as top vertices. The number of K22s with x as a top vertex is then
$$\#K22(x) = \sum_{w \,\mid\, \#occ(w) \ge 2} \binom{\#occ(w)}{2}.$$
The number of open K22s with x as the top vertex is computed by noticing that, for any pair of vertices u and v of $N^-(x)$, we have $d^+(u) - 1 + d^+(v) - 1 - \mathbf{1}_{v \in N^+(u)} - \mathbf{1}_{u \in N^+(v)}$ open K22s containing this fork (ux, vx). We can count the number of open K22s with x as a top vertex, u as the bottom vertex of out-degree at least 2 (and thus another vertex v as the bottom vertex of out-degree at least 1). A vertex $u \in N^-(x)$ is thus in $(d^+(u) - 1)(d^-(x) - 1) - \sum_{v \in N^-(x) \setminus \{u\}} \mathbf{1}_{v \in N^+(u)}$ such open K22s. The only subtlety is that we count the number of arcs that are between two vertices of $N^-(x)$ during the loop on the out-neighborhoods of the vertices of $N^-(x)$. We denote this number #internalArcs. We then have:
$$\#openK22(x) = \Big( \sum_{u \in N^-(x)} (d^+(u) - 1)(d^-(x) - 1) \Big) - \#internalArcs.$$
Lastly, the global number of K22s (resp. open K22s) in the digraph is just the sum of the number of K22s (resp. open K22s) with a vertex x as a top vertex: since we only consider K22s formed with a vertex w such that w < x, we only count each K22 once. Complexity of the used algorithm. The complexity thus is $m + \sum_u d^+(u)(d^+(u) - 1)$, with m the number of edges. Indeed, each edge is only considered once as an in-arc and $d^+ - 1$ times as an out-arc. Note that, in the Twitter Snapshot, the sum of the squares of the degrees is equal to $8 \cdot 10^{13}$. The order of the number of iterations needed to compute the number of K22s was thus massively decreased from the $6.4 \times 10^{33}$ iterations of the trivial algorithm. For graphs following a power-law degree distribution with exponent between 2 and 3, we show in [16] that this gives a complexity between $O(m + n)$ and $O(n^2)$, to be compared to the $O(n^4)$ of the naive method. Note that the number of undirected and directed triangles can be easily computed while counting the K22s, see [16].
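The enumeration described in this section can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions (both adjacency directions fit in memory, self-loops are ignored, node labels are comparable), not the implementation used for the Twitter Snapshot; the name `count_k22` is hypothetical.

```python
from collections import defaultdict
from math import comb


def count_k22(arcs):
    """Return (#K22, #open K22, icc) following the per-top-vertex counting above."""
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for u, v in arcs:
        if u != v:  # assumption: self-loops are ignored
            out_nbrs[u].add(v)
            in_nbrs[v].add(u)

    closed = 0
    opened = 0
    for x, preds in in_nbrs.items():
        occ = defaultdict(int)  # occ[w] = number of in-neighbors shared by x and w
        internal_arcs = 0       # arcs between two vertices of N^-(x)
        for v in preds:
            opened += (len(out_nbrs[v]) - 1) * (len(preds) - 1)
            for w in out_nbrs[v]:
                if w == x:
                    continue
                occ[w] += 1
                if w in preds:
                    internal_arcs += 1
        # Only pairs of top vertices with w < x, so each K22 is counted once.
        closed += sum(comb(count, 2) for w, count in occ.items() if w < x)
        opened -= internal_arcs
    icc = 4 * closed / opened if opened else 0.0
    return closed, opened, icc


# Three accounts (0, 1, 2) follow the same two accounts (3, 4); 0 also follows 1.
arcs = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (0, 1)]
print(count_k22(arcs))
```

A fresh `occ` dictionary per top vertex replaces the table reset of Algorithm 1; on a graph of the size of the TS one would rather reuse a single array and reset only the touched entries.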
Algorithm 1. Enumeration of K22s and open K22s
Input: Digraph (V, A)
#occ ← table initialized to 0
for x ∈ V do
    #internalArcs ← 0        ▷ We count the number of arcs internal to $N^-(x)$, as these arcs do not form open K22s
    for v ∈ $N^-(x)$ do
        #openK22s += $(d^+(v) - 1)(d^-(x) - 1)$
        for w ∈ $N^+(v) \setminus \{x\}$ do
            #occ[w] += 1
            if w ∈ $N^-(x)$ then        ▷ We use a second table to test that
                #internalArcs += 1
    for w with #occ[w] ≥ 2 do
        #K22 += $\binom{\#occ[w]}{2}$
    #openK22s −= #internalArcs
    #occ ← 0        ▷ Done with a double loop
icc ← 4 · #K22 / #openK22s

3.2 Approximate Counts
As discussed later in Sect. 4, the exact count of the number of K22s and open K22s in Twitter requires massive computations. This number can be estimated using the Monte Carlo method and/or computations on a sample of the graph. We discuss both methods below. One of our goals was to see how good computations made in the literature using smaller Twitter snapshots were. Exact icc on Twitter Samples. We built samples of the TS to estimate the interest clustering coefficient. Several choices can be made to build the samples. To avoid missing nodes of high degrees (which would lead to a high variance), we sampled the arcs (and not the nodes). Given a sampling probability p, we keep an arc in the sample with probability p. We generated samples of different sizes corresponding to sampling probabilities from p = 1/100 to p = 1/16000. Estimator of the number of K22 and open K22s. We use the classic estimator $X = \sum_{A \in \mathcal{A}} X_A$, where $X_A$ is the random variable which is equal to 1 if all the arcs of pattern A are selected in the sample and 0 otherwise. Theoretical bounds for expectation and variance are given in [16]. The difficulty for the variance comes from the fact that two K22s can share a common link. Results. We present in Fig. 2a the results of the algorithm for different sample sizes, corresponding to sampling probabilities from p = 1/100 to p = 1/16000. For each sample size, we generated 30 samples. The distribution over the samples of the interest clustering coefficient is provided by a boxplot for each value of p. Note that a K22 of the TS appears in a sample with a probability of only $p^4$, and of $p^3$ for an open K22. The clustering coefficient of a sample is thus an estimate of p · icc.
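The arc-sampling scheme above can be illustrated with a short sketch. The helper names (`sample_arcs`, `rescale_icc`, `expected_survivors`) are hypothetical, and the code only encodes the facts stated in the text: each arc is kept independently with probability p, a K22 survives with probability $p^4$, an open K22 with probability $p^3$, and the icc measured on a sample estimates p · icc.

```python
import random


def sample_arcs(arcs, p, seed=None):
    """Keep each arc independently with probability p."""
    rng = random.Random(seed)
    return [arc for arc in arcs if rng.random() < p]


def rescale_icc(icc_of_sample, p):
    """The icc of a sample estimates p * icc, so divide by p to estimate icc."""
    return icc_of_sample / p


def expected_survivors(n_closed, n_open, p):
    """Expected numbers of K22s and open K22s that remain in a sample."""
    return n_closed * p**4, n_open * p**3


# Counts of K22s and open K22s reported for the Twitter Snapshot (Table 1).
for p in (1 / 1000, 1 / 16000):
    closed, opened = expected_survivors(2.6e16, 3.1e18, p)
    print(f"p = 1/{round(1 / p)}: about {closed:.3g} K22s and {opened:.3g} open K22s expected")
```

With the Table 1 counts, the expected number of surviving K22s drops below one for p = 1/16000 (roughly 0.4), which gives a feel for why very small samples become unreliable.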
Fig. 2. Estimation of the interest clustering coefficient with Approximate Counts.
We observe that the clustering coefficient is well estimated using any sample for a sampling probability of 1/1000 or larger. Indeed, for this range of probabilities, the distribution over all samples is very concentrated around the exact value of the icc. Note that, for p = 1/1000, a K22 is present in the sample with a probability of only $10^{-12}$. The expected number of nodes with an edge is only 23 million (out of 500 million) and the number of edges is also around 23 million. Thus, in the TS, a small sample (5% of the nodes and 0.1% of the edges) allows an efficient estimation of the icc. For smaller values of p, the variance increases. The median estimates the icc well for p between 1/8000 and 1/1000, but individual samples of these sizes may have an error of 100% of the value. Lastly, for p = 1/16000, most of the time there is no surviving K22 in the sample, leading to a value of zero for the icc. The icc thus cannot be estimated correctly. In conclusion, a sample with sampling probability 1/1000 is enough to efficiently estimate the interest clustering coefficient, with a computation time of around 1 min (instead of days for the whole TS) on a machine of the cluster. Monte Carlo Method. The difficulty in estimating the clustering coefficients using the Monte Carlo method is that the probability to observe a (closed or open) K22 or a triangle is very small. In the case of triangles, this difficulty can be easily circumvented by knowing the node degrees. This allows selecting an open triangle uniformly at random. In the case of K22s, this information is not sufficient to select an open K22 uniformly at random. In fact, achieving this goal is very costly. We thus present a new method in which, by picking only forks (as we do for triangles), we can compute the interest clustering coefficient.
The idea is to select a vertex v as a root according to the square of its in-degree (as in the case of triangles), but without knowing its number of open K22s (first step). We then select two arcs $u_1 v$ and $u_2 v$ uniformly at random (second step). We then compute the number of K22s and open K22s with the selected fork $(u_1 v, u_2 v)$ (third step). The formal justification of this new method is developed in [16]. We present here the results of the experiments. Experiments. We carried out two runs with 10 million iterations. It took about 2 min 30 s for one run (60,000 iterations per second). The value of the estimator of the icc for the two runs is plotted as a function of the number of iterations in Fig. 2b. We see that the estimator converges as expected to the value of the icc of the TS, represented by a straight horizontal line (and which was computed exactly in the previous section). We also plotted the estimated standard deviation as a function of the number of iterations. To obtain it, we did one billion iterations. We then estimated the standard deviation σ, and plotted $\sigma / \sqrt{n}$. We see that large jumps or discontinuities happen, but only at the beginning. They correspond to the draw of a fork with a lot of K22s and open K22s, corresponding to a user who does not have the same icc as the global network. Then, the convergence is quick. After 300 iterations, the standard deviation is below 10%, and after 1000 iterations, none of the runs deviates from the exact value by more than 10%.

Table 1. Clustering coefficients in the TS. The first (resp. second) column represents the number of closed (resp. open) K22s or triangles in the Twitter Snapshot. Each line corresponds to a clustering coefficient defined in Sect. 1.
       #closed       #open         cc
icc    2.6 · 10^16   3.1 · 10^18   3.3%
tcc    2.5 · 10^12   1.3 · 10^14   1.9%
ccc    7.2 · 10^11   1.3 · 10^14   1.7%
ucc    6.2 · 10^11   1.6 · 10^15   0.11%
mcc    3.2 · 10^11   8.9 · 10^12   10.7%
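The fork-sampling idea of the Monte Carlo method can be sketched as follows. This is a simplified illustration and not necessarily the estimator justified in [16]: it assumes that forks are drawn uniformly (roots weighted by their number of in-arc pairs rather than by the squared in-degree), that the graph fits in memory without self-loops, and it uses the ratio estimator 2 · ΣK / ΣO, where K and O are the numbers of K22s and open K22s containing the drawn fork. The factor 2 comes from the fact that each K22 contains two top forks. The name `monte_carlo_icc` is hypothetical.

```python
import random
from collections import defaultdict


def monte_carlo_icc(arcs, iterations, seed=None):
    """Estimate the icc by sampling forks (u1 -> v, u2 -> v) uniformly at random."""
    rng = random.Random(seed)
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(list)
    for u, v in set(arcs):
        out_nbrs[u].add(v)
        in_nbrs[v].append(u)

    roots = [v for v, preds in in_nbrs.items() if len(preds) >= 2]
    if not roots:
        return 0.0
    # Number of forks with top vertex v: choosing v with this weight and then
    # two distinct in-neighbors makes every fork equally likely.
    weights = [len(in_nbrs[v]) * (len(in_nbrs[v]) - 1) // 2 for v in roots]

    closed_sum = 0
    open_sum = 0
    for v in rng.choices(roots, weights=weights, k=iterations):
        u1, u2 = rng.sample(in_nbrs[v], 2)
        # K22s with this fork: common out-neighbors of u1 and u2 other than v.
        closed_sum += len((out_nbrs[u1] & out_nbrs[u2]) - {v})
        # Open K22s with this fork: one extra arc from u1 or u2 to a fourth node.
        open_sum += len(out_nbrs[u1]) - 1 - (u2 in out_nbrs[u1])
        open_sum += len(out_nbrs[u2]) - 1 - (u1 in out_nbrs[u2])
    return 2 * closed_sum / open_sum if open_sum else 0.0


arcs = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 3), (2, 4), (0, 1)]
print(monte_carlo_icc(arcs, iterations=20000, seed=1))  # exact icc of this graph is 12/14
```

On a real graph one would stream forks instead of materializing the drawn roots, and track the running standard deviation as done in the experiments.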
4 Results: Clustering Coefficients in Real Datasets
Twitter. To compute the number of K22s, open K22s, directed triangles, and undirected triangles in the Twitter Snapshot, we used a cluster with a rack of 16 Dell C6420 dual-Xeon 2.20 GHz (20 cores), with 192 GB RAM, all sharing an NFS Linux partition over Infiniband. It took 51 hours to compute the exact numbers of K22s and open K22s, corresponding to 265 h of cumulative computation time on the cluster. We reported the results in Table 1. Number of K22s and triangles. We see that the numbers of K22s and open K22s are huge, $2.6 \times 10^{16}$ and $3.1 \times 10^{18}$, respectively. This has to be compared with the numbers of triangles, which are several orders of magnitude smaller: e.g., $2.5 \times 10^{12}$ and $1.3 \times 10^{14}$ for transitive triangles. Clustering coefficient in the mutual graph. The mutual graph captures the friendship relationships in the social network. The mutual clustering coefficient is thus high (mcc = 10.7%), as cliques of friends are frequent in Twitter.
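As a quick consistency check, the coefficients of Table 1 can be recomputed from the rounded counts using the definitions of Sect. 1 (factor 4 for the icc, 1 for the tcc, 3 for the ccc, and the classical factor 3 for the ucc and mcc). The dictionary layout below is an illustrative assumption; small differences with the reported percentages are expected because the counts are rounded to two significant digits.

```python
# (factor, #closed, #open) taken from Table 1.
table1 = {
    "icc": (4, 2.6e16, 3.1e18),
    "tcc": (1, 2.5e12, 1.3e14),
    "ccc": (3, 7.2e11, 1.3e14),
    "ucc": (3, 6.2e11, 1.6e15),
    "mcc": (3, 3.2e11, 8.9e12),
}


def coefficient(factor, closed, opened):
    """Clustering coefficient as factor * #closed / #open."""
    return factor * closed / opened


values = {name: coefficient(*row) for name, row in table1.items()}
for name, value in values.items():
    print(f"{name}: {100 * value:.2f}%")
```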
Table 2. Clustering coefficients without the mutual structures.
          icc    tcc     ccc     ucc
Twitter   3.1%   0.51%   0.24%   0.057%
Clustering coefficients in the whole graph. We observe that icc = 3.3% > tcc = 1.9% > ccc = 1.7% > ucc = 0.11%. Directed metrics better capture the interest relationships in the TS, as ucc is very low. The highest parameter is the icc. It confirms the hypothesis of this paper that common interests between two users are better captured by the notion of K22 than by a direct link between these users. As expected, the second parameter is the one using transitive triangles. Indeed, they capture a natural way for a user to find a new interesting user, that is, considering the followings of a following, especially after having seen retweets. A bit surprisingly, the ccc is not very low. In fact, a large fraction of the cyclic triangles is explained by corresponding triangles in the mutual graph (triangles of bidirectional links). We believe bidirectional links contain a part of the social aspect of Twitter. Indeed, two friends will tend to follow each other, while a celebrity has little chance to follow back a person she does not know. A way to artificially remove the social influence in order to focus exclusively on the directed interest part of the graph is to remove the (open and closed) triangles and K22s contained in the mutual graph from the total count. Indeed, each undirected triangle of the mutual graph induces two cyclic triangles and six transitive triangles, and each undirected open triangle induces two open triangles. In the same way, each undirected K22 induces two K22s and each undirected open K22 induces two open K22s. The obtained results are shown in Table 2. If we remove those mutual triangles, the tcc and the ccc values drop to 0.51% and 0.24%, respectively, while the icc stays about the same at 3.1%. This tends to confirm the hypothesis that the directed triangle clusterings somehow measure the friendship part of the TS more than the interest part.
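The correction described above can be replayed on the rounded counts of Table 1. This is only an approximate check under the stated multiplicities (two cyclic and six transitive triangles per mutual triangle, two open triangles per mutual open triangle); the function name `without_mutual` is hypothetical. Because the subtraction cancels most of the cyclic triangle count, the ccc obtained this way is very sensitive to rounding and should not be expected to match Table 2 exactly.

```python
def without_mutual(transitive, cyclic, directed_open, mutual_closed, mutual_open):
    """Return (tcc, ccc) after removing the triangles induced by the mutual graph."""
    open_left = directed_open - 2 * mutual_open
    tcc = (transitive - 6 * mutual_closed) / open_left
    ccc = 3 * (cyclic - 2 * mutual_closed) / open_left
    return tcc, ccc


# Rounded counts from Table 1 (tcc, ccc and mcc rows).
tcc_wo, ccc_wo = without_mutual(
    transitive=2.5e12,
    cyclic=7.2e11,
    directed_open=1.3e14,
    mutual_closed=3.2e11,
    mutual_open=8.9e12,
)
print(f"tcc without mutual triangles: about {100 * tcc_wo:.2f}%")
print(f"ccc without mutual triangles: about {100 * ccc_wo:.2f}%")
```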
We also looked at the distributions of K22, open K22 and icc for each node, using definitions adapted from the triangle ones [19]. We obtain a value of 7.7% for the local icc. More details can be found in [16].
Other networks. We computed the different metrics on four other directed networks: two social networks, a web network and a citation network. The main dataset characteristics (more details in [16]) as well as their clustering coefficients are reported in Table 3. The main takeaways are the following:
– A high value of icc indicates the presence of clusters of interests such as research communities or interest fields.
– A high value of tcc is the sign of an important local phenomenon of neighbors’ recommendations and/or of a highly hierarchical structure in the dataset.
– The ccc has no real social meaning. If its value can be high in a directed graph, this is only due to the presence of bidirectional arcs and triangles.
– Directed networks have a high mcc. Indeed, their bidirectional parts (mutual graph) have strong social communities, leading to a high clustering coefficient.
– The ucc is usually significantly lower, showing that the directed part of the network is better understood using directed clustering coefficients.
– Directed social networks have similar mixes of values of their undirected and directed clustering coefficients, with, however, some notable differences due to their diverse usages and information.

Table 3. Information and clustering coefficients of the directed datasets. SN means Social Network. N is the number of nodes, |E| the number of edges, and |E|_m/|E| the fraction of edges involved in a bidirectional link.
Dataset      Is a SN   N            |E|          |E|_m/|E|   icc     tcc     ccc     mcc      ucc
Instagram    Yes       4.5 × 10^4   6.7 × 10^5   11%         12.0%   15.4%   3.7%    22.6%    4.1%
Flickr       Yes       2.6 × 10^6   3.3 × 10^7   62%         12.4%   12.2%   9.3%    13.9%    10.8%
Web (.edu)   No        6.9 × 10^5   7.6 × 10^6   25%         46.3%   59.6%   18.8%   78.5%    0.69%
Citations    No        3.8 × 10^6   1.7 × 10^7   0%          22.3%   9.1%    0%      (none)   6.7%

Additional work. As a complement to this study, we also proposed two applications using the K22s, which can be found in [16]:
– Recommendations. We propose to use the K22s to carry out link recommendation, as we advocate that the interest clustering coefficient is a good measure of common user interests. The principle is to recommend links closing a large number of K22s (instead of, classically, triangles). We discuss the strengths and weaknesses of this method for a set of Twitter users.
– Models with addition of K22s. We propose a new directed random growing model, based on the one introduced by Bollobás et al. [3], with addition of K22s. This way, the model is able to build random networks with a high icc, in order to represent the high values of icc found in real-world networks as presented in this paper. We prove that the in- and out-degree distributions of this model follow power laws, as in most real-world networks, and show empirically the high value of the icc.
5 Conclusion, Discussion, and Future Work
In this paper, we introduce a new metric, the interest clustering coefficient, to capture the interest phenomenon in a directed graph. Indeed, the classical undirected clustering coefficient captures the social phenomenon that my friends tend to be connected. However, it is not adequate to take into account directed interest links. The interest clustering coefficient is based on the idea that, if two people are following a common neighbor, they have a higher chance to have other common neighbors, since they have at least one interest in common. We computed this new metric on a network known to be at the same time a social
and information media, a snapshot of Twitter from 2012 with 505 million users and 23 billion links. The computation was made on the total graph, giving the exact value of the interest clustering coefficient, and using sampling methods. The value of the interest clustering coefficient of Twitter is around 3.3%, higher than the (undirected and directed) clustering coefficients introduced in the literature and based on triangles, which we also computed on the snapshot. Since both icc and classical cc represent a probability to close a structure, comparing their values with each other makes sense. However, an additional comparison with random models would help quantify what a “high value” means. In a network built with the G(n, p) model, both directed cc and icc are equal to p. For the Twitter Snapshot, $p \sim 10^{-7}$, far smaller than the values found. An interesting study would be to compute those metrics on preferential attachment models; we leave this for future work. The icc is defined here for unweighted networks: it would be interesting to generalize it to weighted ones as future work. We would also like to further investigate link recommendation based on the K22 structure defined for the interest clustering coefficient. Indeed, as there are several orders of magnitude more K22s than triangles in social networks, it could lead to a more diverse recommendation system than the ones based on triangles. It would be interesting to carry out a real-world user case study to investigate whether users are more satisfied by such recommendations.
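The order of magnitude of p quoted above can be recovered from the size of the snapshot. The short computation below assumes the usual density of a directed G(n, p) graph, p = m / (n(n − 1)), with the rounded values n = 505 million and m = 23 billion given in the paper.

```python
n = 505e6  # number of nodes of the Twitter Snapshot
m = 23e9   # number of arcs of the Twitter Snapshot

# Density of a directed graph without self-loops: expected icc of a comparable G(n, p).
p = m / (n * (n - 1))
icc_twitter = 0.033

print(f"p = {p:.2e}")
print(f"measured icc / p = {icc_twitter / p:.1e}")
```

The measured icc is thus several orders of magnitude above what a G(n, p) graph of the same density would give.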
References
1. Alon, N., Yuster, R., Zwick, U.: Finding and counting given length cycles. Algorithmica 17(3), 209–223 (1997)
2. Barrat, A., Weigt, M.: On the properties of small-world network models. Eur. Phys. J. B-Condens. Matter Complex Syst. 13(3) (2000)
3. Bollobás, B., Borgs, C., Chayes, J., Riordan, O.: Directed scale-free graphs. In: ACM-SIAM Symposium on Discrete Algorithms (2003)
4. Boykin, P.O., Roychowdhury, V.P.: Leveraging social networks to fight spam. Computer 38(4), 61–68 (2005)
5. Fagiolo, G.: Clustering in complex directed networks. Phys. Rev. E 76 (2007)
6. Gabielkov, M., Rao, A., Legout, A.: Studying social networks at scale: macroscopic anatomy of the twitter social graph. In: ACM SIGMETRICS Performance Evaluation Review, vol. 42, pp. 277–288. ACM (2014)
7. Granovetter, M.S.: The strength of weak ties. In: Social networks. Elsevier (1977)
8. Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web. ACM (2010)
9. Latapy, M.: Main-memory triangle computations for very large (sparse (power-law)) graphs. Theoret. Comput. Sci. 407(1–3), 458–473 (2008)
10. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. In: ACM IMC, pp. 29–42 (2007)
11. Myers, S.A., Sharma, A., Gupta, P., Lin, J.: Information network or social network?: the structure of the twitter follow graph. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 493–498. ACM (2014)
12. Sanei-Mehri, S.-V., Sariyuce, A.E., Tirthapura, S.: Butterfly counting in bipartite networks. In: ACM SIGKDD, pp. 2150–2159 (2018)
13. Saramäki, J., Kivelä, M., Onnela, J.-P., Kaski, K., Kertész, J.: Generalizations of the clustering coefficient to weighted complex networks. Phys. Rev. E 75(2) (2007)
14. Shaban, H.: https://www.washingtonpost.com/technology/2019/02/07/twitter-reveals-its-daily-active-user-numbers-first-time/
15. Silva, N.B., Tsang, R., Cavalcanti, G.D.C., Tsang, J.: A graph-based friend recommendation system using genetic algorithm. In: 2010 IEEE Congress on Evolutionary Computation (CEC), pp. 1–7. IEEE (2010)
16. Trolliet, T., Cohen, N., Giroire, F., Hogie, L., Pérennes, S.: Interest clustering coefficient: a new metric for directed networks like twitter (2020). arXiv preprint: arXiv:2008.00517
17. Ugander, J., Karrer, B., Backstrom, L., Marlow, C.: The anatomy of the facebook social graph (2011). arXiv preprint: arXiv:1111.4503
18. Wang, J., Fu, A.W.-C., Cheng, J.: Rectangle counting in large bipartite graphs. In: 2014 IEEE International Congress on Big Data. IEEE (2014)
19. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature 393(6684), 440 (1998)
Applying Fairness Constraints on Graph Node Ranks Under Personalization Bias
Emmanouil Krasanakis, Symeon Papadopoulos, and Ioannis Kompatsiaris
CERTH-ITI, 57001 Thessaloniki, Greece
{maniospas,papadop,ikom}@iti.gr
Abstract. In this work we address algorithmic fairness concerns that arise when graph nodes are ranked based on their structural relatedness to a personalized set of query nodes. In particular, we aim to mitigate disparate impact, i.e. the difference in average rank between nodes of a sensitive attribute compared to the rest, while also preserving node rank quality. To do this, we introduce a personalization editing mechanism that helps ranking algorithms achieve different trade-offs between fairness constraints and rank changes. In experiments across four real-world social graphs and two base ranking algorithms, our approach outperforms baseline and existing methods in uniformly mitigating disparate impact, even when personalization suffers from extreme bias. In particular, it achieves better trade-offs between fairness and node rank quality under disparate impact constraints.
Keywords: Node ranking · Personalized ranking · Algorithmic fairness · Disparate impact mitigation
1 Introduction
Machine learning has been widely adopted in systems that affect important aspects of people’s lives, from recommending social media friends to assisting jurisdictional or employment decisions. Since these systems often learn to replicate human-generated and systemic biases, fairness concerns arise when automated decisions end up correlated to sensitive attributes, such as gender or ethnicity [1,2]. Fairness is commonly defined as similar assessment between sensitive and non-sensitive groups of data samples under a statistical measure [1,3–5]. In this work, we focus on disparate impact elimination [6–9], which requires (approximate) statistical parity between sensitive and non-sensitive positive predictions. Node ranking refers to a class of methods that organize relational data into graphs and score the structural relatedness of their nodes to a set of query ones. This process can be personalized, in the sense that query nodes share an attribute, such as their political views, in which case scores (ranks) can be used as estimators for that attribute [10–12]. If no personalization takes place and all nodes become queries, ranks reflect their structural importance [13,14].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 610–622, 2021. https://doi.org/10.1007/978-3-030-65351-4_49
Fairness Constraints on Graph Node Ranks under Personalization Bias
Although graph node ranking is an important machine learning discipline, remarkably little work has been done to make it fair. In fact, the first (to our knowledge) principled understanding of node rank fairness was only recently proposed by Tsioutsiouliklis et al. [15], who explore disparate impact mitigation for the node ranks of Google’s PageRank algorithm [16]. We now initiate a discussion on the fairness of personalized node ranking algorithms. Contrary to the non-personalized case, where node rank quality is tied to ad hoc definitions of structural importance, there exist objective notions of personalized node rank quality that fairness-aware approaches should ideally respect. For example, ranking social graph nodes to recommend friends should aim for relevance, yet its outcome should not be influenced by sensitive attributes, such as race. In this work we leverage a model that weights training samples to make classifiers fair [3] and adapt it to estimate an unbiased personalization that yields fairer node ranks. Our new adaptation can be trained towards a variety of fairness objectives, such as fully or partially eliminating disparate impact while minimizing rank edits. We demonstrate its effectiveness by comparing it to baseline and existing practices across two ranking algorithms and four real-world graphs with both unbiased and extremely biased personalization. Our contribution lies in initiating a discussion on fairness-aware personalized ranking algorithms, where we address the trade-off between biased personalization and the preservation of prediction-related node rank quality. Furthermore, we investigate whether approaches uniformly introduce fairness, in the sense that they do so for both the whole graph and an evaluation subset of nodes.
2 Background

2.1 Personalized Node Ranking Algorithms
Personalized node ranking starts from a set of query nodes sharing an attribute of interest and scores nodes $v$ per some notion of structural proximity to the query ones. We organize node scores, which are called ranks and are not to be confused with ordinalities, into vectors $r$ with elements $r[v] \geq 0$. We similarly consider a user-provided personalization vector $p$ with elements $p[v] \in [0, 1]$ reflecting the importance of nodes $v$ being used as queries (0 corresponds to non-query nodes). Ranking algorithms are often expressed as graph filters [17,18]. These use a normalization $W$ of the graph's adjacency matrix, whose elements $W[u, v]$ define transitions from nodes $u$ to $v$. Then, given that propagating the personalization $n$ hops away can be written as $W^n p$, they weight different propagation distances:

$$r = H(W)p, \qquad H(W) = \sum_{n=0}^{\infty} h_n W^n \tag{1}$$

where $H(W)$ is called a graph filter. Different filters can be obtained for different weights $h_n$ and methods of calculating $W$. For example, the graph's adjacency matrix $M$ can be normalized column-wise $W = M D^{-1}$ or symmetrically $W = D^{-\frac{1}{2}} M D^{-\frac{1}{2}}$, where $D = \mathrm{diag}\big(\big[\sum_u M[v, u]\big]_v\big)$ is the diagonal matrix of node
E. Krasanakis et al.
degrees. Two well-known graph filters are Personalized PageRank [19,20] and Heat Kernels [21], which respectively arise from hop weights $h_n = (1 - a)a^n$ and $h_n = e^{-t} t^n / n!$ for parameters $a \in [0, 1]$ and $t \in \{1, 2, 3, \dots\}$.

2.2 Sweeping Node Ranks
The sweep procedure [22,23] utilizes node ranking algorithms to identify congregations of nodes that are tightly knit together and well separated from the rest of the graph, a concept known as low subgraph conductance [24]. This procedure assumes that a base ranking algorithm $R$ with strong locality [25], such as personalized PageRank and Heat Kernels, yields ranks $R(p)$ for a personalization $p$ that comprises structurally close query nodes. It then compares ranks with their non-personalized counterparts $R(\mathbf{1})$, where $\mathbf{1}$ is a vector of ones:

$$r_{sweep}[v] = \frac{R(p)[v]}{R(\mathbf{1})[v]} \tag{2}$$
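As a rough illustration of Eqs. (1) and (2), the sketch below truncates the infinite sum of a graph filter to a fixed number of hops and divides personalized ranks by their non-personalized counterparts. The function names, the truncation length `hops` and the dense list-of-lists adjacency matrix are assumptions made for illustration only; they are not taken from the authors' implementation, and every node is assumed to have at least one edge.

```python
from math import exp, factorial

def column_normalize(M):
    """W = M D^{-1}: divide column v of the adjacency matrix by the degree of v."""
    degrees = [sum(row) for row in M]  # assumes every node has at least one edge
    n = len(M)
    return [[M[u][v] / degrees[v] for v in range(n)] for u in range(n)]

def graph_filter(M, p, weights):
    """Truncated r = sum_n h_n W^n p for the given list of hop weights h_n."""
    W = column_normalize(M)
    ranks = [0.0] * len(p)
    hop = [float(x) for x in p]  # W^0 p
    for h in weights:
        ranks = [r + h * x for r, x in zip(ranks, hop)]
        hop = [sum(w * x for w, x in zip(row, hop)) for row in W]  # next power of W
    return ranks

def pagerank_weights(a, hops):
    """Personalized PageRank hop weights h_n = (1 - a) a^n."""
    return [(1 - a) * a**n for n in range(hops)]

def heat_kernel_weights(t, hops):
    """Heat kernel hop weights h_n = e^{-t} t^n / n!."""
    return [exp(-t) * t**n / factorial(n) for n in range(hops)]

def sweep_ratio(M, p, weights):
    """Eq. (2): personalized ranks divided by ranks of the all-ones personalization."""
    personalized = graph_filter(M, p, weights)
    baseline = graph_filter(M, [1.0] * len(p), weights)
    return [r / b for r, b in zip(personalized, baseline)]

# Toy graph (hypothetical): two triangles joined by one edge; node 0 is the only query.
M = [[0, 1, 1, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 0, 1, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
p = [1, 0, 0, 0, 0, 0]
ranks = graph_filter(M, p, pagerank_weights(0.5, 30))
ratios = sweep_ratio(M, p, pagerank_weights(0.5, 30))
```

On this toy graph the query node keeps a larger sweep ratio than the node farthest from it, which is the locality that the sweep procedure relies on. Swapping in `heat_kernel_weights` changes only the hop weights, not the propagation.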
From now on, we will refer to the post-processing of Eq. (2) as the sweep ratio. The sweep procedure orders all nodes based on their sweep ratio and cuts the graph into two partitions so that conductance is minimized. This practice statistically yields well-separated partitions for a variety of node ranking algorithms [22–24]. From a high-level perspective, this indicates that the sweep ratio tends to improve node rank quality.

2.3 Algorithmic Fairness and Graph Mining
Algorithmic fairness is broadly understood as parity between sensitive and non-sensitive group samples over a chosen statistical property. Three popular fairness-aware objectives [1,3–5] are disparate treatment elimination, disparate impact elimination and disparate mistreatment elimination. These correspond, respectively, to not using the sensitive attribute in predictions, preserving statistical parity between the fraction of sensitive and non-sensitive positive labels, and achieving identical predictive performance on the two groups under a measure of choice. In this work, we focus on mitigating disparate impact unfairness [1,6–9]. An established measure that quantifies this fairness objective is the pRule [6]; denoting as $R[v]$ the binary outputs of a system $R$ for samples $v$, $S$ the sensitive group, $S'$ the non-sensitive group (which is the complement of $S$) and $P(a|b)$ the probability of $a$ conditioned on $b$, this is defined as:

$$\text{pRule} = \frac{\min(p_S, p_{S'})}{\max(p_S, p_{S'})} \in [0, 1], \qquad p_S = P(R[v] = 1 \mid p \in S), \quad p_{S'} = P(R[v] = 1 \mid p \in S') \tag{3}$$
The higher the pRule, the fairer a system is. There is precedent [6] for considering a pRule of 80% or higher as fair. Calders-Verwer disparity $|p_S - p_{S'}|$ [7] is a correlated measure optimized at the same point, but is less descriptive in that it biases fairness assessment against high fractions of positive predictions.
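The two measures can be compared on toy data with the minimal sketch below. The helper names and the convention of returning 1 when neither group has positive outputs are assumptions for illustration, not part of the cited definitions.

```python
def positive_rate(outputs, group):
    """Fraction of positive (1) outputs among the sample indices in `group`."""
    return sum(outputs[v] for v in group) / len(group)

def group_rates(outputs, sensitive):
    """Positive rates of the sensitive group and of its complement."""
    in_s = [v for v, s in enumerate(sensitive) if s]
    rest = [v for v, s in enumerate(sensitive) if not s]
    return positive_rate(outputs, in_s), positive_rate(outputs, rest)

def prule(outputs, sensitive):
    """pRule of Eq. (3) for binary outputs and a boolean sensitive-group mask."""
    p_s, p_rest = group_rates(outputs, sensitive)
    if max(p_s, p_rest) == 0:
        return 1.0  # assumption: no positives in either group counts as parity
    return min(p_s, p_rest) / max(p_s, p_rest)

def calders_verwer(outputs, sensitive):
    """Calders-Verwer disparity |p_S - p_S'|."""
    p_s, p_rest = group_rates(outputs, sensitive)
    return abs(p_s - p_rest)

sensitive = [True] * 4 + [False] * 4
few_positives = [1, 1, 0, 0, 1, 0, 0, 0]   # rates 0.5 and 0.25
many_positives = [1, 1, 1, 1, 1, 1, 0, 0]  # rates 1.0 and 0.5
```

Both toy outputs have a pRule of 0.5, yet the Calders-Verwer disparity doubles from 0.25 to 0.5 for the one with more positive predictions, which illustrates the bias against high positive fractions mentioned above.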
In domains related to ranking, fairness has been defined for the order of recommended items [26–29] as equity in the ranking positions between sensitive and non-sensitive items. However, these notions of fairness are not applicable to the more granular understanding provided by node ranks.

In graphs, the notion of achieving fair node embeddings has been proposed [30,31]. These are the first approaches that introduce fair random walks, a stochastic process modeled by personalized PageRank. However, the fairness of these walks is only implicitly asserted through embedding fairness. A more advanced understanding has been achieved recently in the more general domain of graph neural networks [32], which can be trained to produce fair recommendations, even under partial knowledge of the sensitive attribute.

Last, a recent work by Tsioutsiouliklis et al. [15] has initiated a discourse on node rank fairness. Although focused on non-personalized ranking, it is the first to recognize the need for optimizing a trade-off between fairness and preserving rank quality. Furthermore, it provides a first definition of node rank fairness, called φ-fairness. Under a stochastic interpretation of node ranks, where they are proportional to the probability of nodes assuming positive labels, φ-fairness becomes equivalent to disparate impact elimination when $\phi = \frac{|S|}{|S| + |S'|}$.

In this work we consider the similar objectives of a) trading off deviation from the original ranks against a high pRule and b) preserving rank quality under fairness constraints. The pRule is calculated according to the above-mentioned stochastic interpretation of ranks through:

$$p_S = P(R[v] = 1 \mid p \in S) = \frac{1}{|S|} \sum_{v \in S} L_\infty(r)[v], \qquad p_{S'} = P(R[v] = 1 \mid p \in S') = \frac{1}{|S'|} \sum_{v \in S'} L_\infty(r)[v] \tag{4}$$

where $L_\infty(r)$ is a normalization that divides ranks by their maximum value and $R$ is a stochastic process with probability $P(R[v] = 1) = \frac{r[v]}{\max_u r[u]} = L_\infty(r)[v]$.

2.4 The CULEP Model
In previous work [3], we tackled the problem of making black box calibrated binary classifiers fair by pre-processing training data. To this end, we proposed a Convex Underlying Error Permutation (CULEP) model that weighs the importance of training samples to treat unfairness similarly to how an ideal but unobserved distribution of fair training labels would. To do this, we theorized that unfairness correlates to misclassification error (i.e. the difference between binary classification labels and calibration probabilities) and whether samples are sensitive. Furthermore, we recognized that strongly misclassified samples could exhibit different degrees of bias from correctly classified ones. Under these considerations the CULEP model tries to promote fairness by introducing a type of parameterized balancing between these sources of unfairness. In particular, after a stochastic analysis similar to the one we will later
conduct in this work, training samples $i$ with misclassification error $err_i$ are assigned weights proportional to $\alpha_i E_{\beta_i}(err_i) + (1 - \alpha_i) E_{\beta_i}(-err_i)$, where the values of $\alpha_i \in [0, 1]$ and $\beta_i \geq 0$ depend only on whether samples are sensitive or not and $E_\beta(\cdot)$ is a function asymmetric around 0, such as the exponential $E_\beta(x) = e^{\beta x}$. These parameters can be tuned to satisfy various fairness objectives, including trade-offs between preserving accuracy (when $\beta_i = 0$) and improving the pRule (when $\beta_i$ are large enough and $\alpha_i$ balance towards mitigating positive label disparity).
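The weighting rule can be sketched as follows, using the exponential choice $E_\beta(x) = e^{\beta x}$. The function name, the dictionary-based parameter passing and the normalization of weights to sum to 1 are illustrative assumptions rather than details of the original CULEP implementation.

```python
from math import exp

def culep_weights(errors, sensitive, alpha, beta):
    """Weights proportional to alpha*E(err) + (1 - alpha)*E(-err), E_beta(x) = exp(beta*x).

    `alpha` and `beta` map True (sensitive) / False (non-sensitive) to parameter values,
    so the model has four parameters in total.
    """
    raw = []
    for err, s in zip(errors, sensitive):
        a, b = alpha[s], beta[s]
        raw.append(a * exp(b * err) + (1 - a) * exp(-b * err))
    total = sum(raw)
    return [w / total for w in raw]  # normalizing to sum 1 is an illustrative choice

errors = [0.9, 0.1, 0.8, 0.2]            # hypothetical misclassification errors
sensitive = [True, True, False, False]
uniform = culep_weights(errors, sensitive,
                        alpha={True: 0.5, False: 0.5}, beta={True: 0.0, False: 0.0})
skewed = culep_weights(errors, sensitive,
                       alpha={True: 1.0, False: 0.0}, beta={True: 2.0, False: 2.0})
```

With $\beta_i = 0$ all samples receive the same weight, which matches the accuracy-preserving case described above; with larger $\beta_i$ the weights emphasize high-error sensitive samples and low-error non-sensitive ones for this particular choice of $\alpha_i$.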
3 Our Approach: Fair Personalizer
We theorize that there exist two types of potential node rank bias: stationary and rank-related. The first arises when ranks end up multiplied with a fixed bias-related quantity for each node, whereas the second depends on the personalization, which transfers either its own or graph edge bias to the ranks. Of the two, stationary bias is easier to treat, as it does not depend on the personalization and only attacks the ranking algorithm's outcome. In fact, the sweep ratio eliminates it, as it ends up dividing node ranks with their bias term.

On the other hand, rank-related bias is harder to tackle. To see why, let us consider an invertible graph filter, such as the closed form of personalized PageRank $H(W) = (1 - a)(I - aW)^{-1}$, and a personalization vector $p$ that yields ranks $r = H(W)p$. We assume that there exist ideal ranks $r_{fair}$ satisfying a fairness-aware objective, such as minimizing the following trade-off between preserving ranks and improving the pRule with weight $w_{pRule}$ up to $sup_{pRule}$:

$$\text{minimize} \quad \frac{1}{|V|} \left\| L_\infty(r_{fair}) - L_\infty(r) \right\|_1 - w_{pRule} \cdot \min\{\text{pRule}(r_{fair}), sup_{pRule}\} \tag{5}$$

where $\text{pRule}(r_{fair})$ calculates the pRule of those ranks across all graph nodes $V$ and $\|\cdot\|_1$ is the $L_1$ norm that sums the absolute values of vector elements. Then, the graph's structure (e.g. edge sparseness) may cause $H(W)$ to be near-singular and hence propagate back small fair rank changes as large differences between the original personalization and its fair counterpart $p_{fair} = H^{-1}(W) r_{fair}$.

Setting aside the potential intractability of optimally editing the personalization, we argue that this practice should be preferred to post-processing ranks, as it respects underlying structural characteristics exposed when the graph filter diffuses the personalization through edges. To keep this upside, we propose that searching for fairness-inducing personalization edits can be made easier if these are expressed through parametric models with only a few parameters to learn. The CULEP model could in theory fit this role, since it depends on only four parameters ($\alpha_i$ and $\beta_i$ can each assume only two values, depending on whether $i$ are sensitive). However, it cannot be ported to graph ranking as-is, since weighting zero elements of the personalization vector through multiplication does not affect ranks at all and there is no rank validation set on which to tune its parameters. To address these issues, we adapt this model to perform non-linear edits on the personalization vector and use the original personalization as a rough one-class validation set of known positive examples.
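The objective of Eq. (5), together with the rank-based pRule of Eq. (4), can be evaluated with a few lines of code. The sketch below is a minimal illustration under assumed names (`linf_normalize`, `rank_prule`, `fairness_objective`); it only scores candidate ranks and does not perform the minimization itself.

```python
def linf_normalize(ranks):
    """L_inf(r): divide ranks by their maximum value (assumed to be positive)."""
    top = max(ranks)
    return [r / top for r in ranks]

def rank_prule(ranks, sensitive):
    """pRule under the stochastic interpretation of Eq. (4)."""
    scaled = linf_normalize(ranks)
    in_s = [x for x, s in zip(scaled, sensitive) if s]
    rest = [x for x, s in zip(scaled, sensitive) if not s]
    p_s, p_rest = sum(in_s) / len(in_s), sum(rest) / len(rest)
    return min(p_s, p_rest) / max(p_s, p_rest)

def fairness_objective(fair_ranks, ranks, sensitive, w_prule, sup_prule):
    """Value of Eq. (5) for candidate fair ranks; lower is better."""
    deviation = sum(abs(f - r) for f, r in
                    zip(linf_normalize(fair_ranks), linf_normalize(ranks)))
    reward = min(rank_prule(fair_ranks, sensitive), sup_prule)
    return deviation / len(ranks) - w_prule * reward

ranks = [4, 2, 2, 0]                      # hypothetical original ranks
sensitive = [True, True, False, False]
unchanged = fairness_objective(ranks, ranks, sensitive, w_prule=1.0, sup_prule=0.8)
flattened = fairness_objective([1, 1, 1, 1], ranks, sensitive, w_prule=1.0, sup_prule=0.8)
```

Here the unchanged ranks have a pRule of 1/3 and no deviation, while fully flattened ranks reach the pRule cap of 0.8 at the cost of a mean absolute deviation of 0.5. The weight `w_prule` decides which of the two the objective prefers.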
We start from a stochastic interpretation of ranks that snaps them to 1 with probability $P(\cdot)$ proportional to their value and to 0 otherwise. We also consider an edited personalization vector $p_{est}$ that estimates ranks $r_{est} = H(W)p_{est}$ of similar fairness to some unobserved ideal ones $r_{fair}$. For ease of notation, in the rest of this section we consider all vector operations (including multiplication) to be applied element-wise. We first analyse whether estimated ranks match the ideal fair ones:

$$P(r_{fair} = r_{est}) = P(r_{fair} = r_{est} \mid p = r_{est})\,P(p = r_{est}) + P(r_{fair} = r_{est} \mid p \neq r_{est})\,P(p \neq r_{est})$$

Borrowing CULEP's theorization, the probabilities of estimated node ranks being fair given that they approximate well the original personalization are correlated with the probability of the personalization being fair, $P(p_{fair} = p_{est})$, and the same holds true given that estimated node ranks do not approximate well the original personalization. Furthermore, if one of these two types of probabilities becomes larger for a node, the other should become smaller, and conversely. Finally, we consider an exponential-based approximation (whose ability to achieve fairness has been experimentally demonstrated [3]) of how these types of probabilities differ from their correlated fair personalization. This approximation depends on rank and personalization differences and on whether nodes are sensitive:

$$P(r_{fair} = r_{est} \mid p = r_{est}) \approx K\, P(p_{fair} = p_{est})\, e^{-b(L_\infty(r) - p)}$$

$$P(r_{fair} = r_{est} \mid p \neq r_{est}) \approx K\, P(p_{fair} = p_{est})\, e^{b(L_\infty(r) - p)}$$

where $b$ is a vector of real values such that $b[v] = \{b_S \text{ if } v \in S,\ b_{S'} \text{ otherwise}\}$ and $K > 0$ is a normalization constant that makes probabilities sum to 1. We further assume that selecting sensitive and non-sensitive nodes as part of the personalization is done with fixed probabilities $a_S$ and $a_{S'}$ pertaining to the personalization bias and organize those into a vector $a = P(p = r_{est})$ with elements $a[v] = \{a_S \text{ if } v \in S,\ a_{S'} \text{ otherwise}\} \in [0, 1]$.
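Substituting the two exponential approximations into the expansion of $P(r_{fair} = r_{est})$ above, with $P(p = r_{est}) = a$ and $P(p \neq r_{est}) = 1 - a$, leaves a per-node mixture $a\,e^{-b(L_\infty(r) - p)} + (1 - a)\,e^{b(L_\infty(r) - p)}$ up to the common factor $K\,P(p_{fair} = p_{est})$. The sketch below evaluates that mixture element-wise. The function name, the dictionary-based parameters and the final rescaling to a maximum of 1 are assumptions for illustration, not the authors' code.

```python
from math import exp

def fair_personalization(ranks, personalization, sensitive, a, b):
    """Element-wise a*exp(-b*(L_inf(r) - p)) + (1 - a)*exp(b*(L_inf(r) - p)).

    `a` and `b` map True (sensitive) / False (non-sensitive) to parameter values.
    """
    top = max(ranks)  # L_inf normalization divides ranks by their maximum
    estimate = []
    for r, p, s in zip(ranks, personalization, sensitive):
        diff = r / top - p
        estimate.append(a[s] * exp(-b[s] * diff) + (1 - a[s]) * exp(b[s] * diff))
    peak = max(estimate)
    return [x / peak for x in estimate]  # rescaling into [0, 1] is an assumption

ranks = [2.0, 1.0]            # hypothetical ranks of two nodes
personalization = [1.0, 0.0]  # node 0 is a query node
sensitive = [True, False]
constant = fair_personalization(ranks, personalization, sensitive,
                                a={True: 0.3, False: 0.7}, b={True: 0.0, False: 0.0})
edited = fair_personalization(ranks, personalization, sensitive,
                              a={True: 1.0, False: 0.5}, b={True: 1.0, False: 2.0})
```

Unlike a multiplicative reweighting, this edit can assign nonzero values to nodes whose original personalization is zero, which is the reason given above for preferring non-linear edits.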
Given the above analysis, we select a fair personalization estimation $p_{fair}$ based on the self-consistency criterion that, when it approaches fairness-inducing personalization, estimated fair ranks should also approach the ideal fair ones:

$$p_{est} = P(r_{fair} = r_{est} \mid p_{fair} = p_{est}) \approx \frac{P(r_{fair} = r_{est})}{P(p_{fair} = p_{est})} \propto a\, e^{-b(L_\infty(r) - p)} + (1 - a)\, e^{b(L_\infty(r) - p)} \tag{6}$$

4 Experiment Setup

4.1 Graphs
To assess whether our approach can achieve fairness while preserving node rank quality, we experiment on four graphs: two Facebook friendship graphs [33],
one Twitter graph of political retweets [34] and one Amazon graph of frequent product co-purchases [35]. The first three comprise real-world fairness-sensitive attributes, but not adequately many nodes and edges to calculate ranks of high quality. On the other hand, the Amazon graph is not annotated with fairness-related attributes but is large enough for ranking algorithms to boast high quality, which our approach aims to maintain.

The Facebook graphs each start from a given user and record social relations between them and their friends, including relations between friends. Ten such graphs are available in the source material, out of which we randomly select two to experiment on. These are denoted as FacebookX, where X is their starting user. We select the anonymized binary 'gender' attribute as sensitive and the first anonymized binary 'education' attribute as the prediction label. The Twitter graph comprises only one anonymized sensitive attribute of binary political opinions (left or right). The Amazon graph does not contain sensitive information and we consider the product category 'Book' to be sensitive. Due to lack of predictive attributes for the Twitter and Amazon graphs, we define predictions for the sensitive attribute's binary complement, which makes those graphs exhibit what we later dub as extreme unfairness.

These graphs are overviewed in Table 1. Columns correspond to graph names, number of nodes, number of edges, fraction of nodes with positive labels, fraction of nodes designated as sensitive and pRule value of their positive labels.

Table 1. Experiment graph characteristics

Graph        Nodes    Edges    Positive%  Sensitive%  pRule
Facebook0    347      5,038    68%        36%
Facebook686  170      3,312    55%        46%
Twitter      18,470   48,365   61%        39%
Amazon       334,863  925,872  >99%       0.

4.2

Similarly, $\tilde{\beta}_p = \beta_p - \delta_{p0}$, where $\delta_{jk} := 1$ iff $j = k$ and $\delta_{jk} := 0$ otherwise.
S. Chowdhury et al.

2.1 Path Homologies of the Mutual Dyad Motifs
As a further example of practical relevance, we characterize the path homologies of a family of network motifs that we call the n-uplinked mutual dyads (or dually, the n-downlinked mutual dyads) in reference to the original terminology going back to [23]. Given an integer $n \geq 1$, an n-uplinked mutual dyad is a digraph $W_n$ with vertex set $\{a, b, 1, 2, \dots, n\}$ and edge set $\{(a, b), (b, a)\} \cup \{(a, i) : 1 \leq i \leq n\} \cup \{(b, i) : 1 \leq i \leq n\}$. This is illustrated in Fig. 8. The n-downlinked mutual dyad is defined by reversing all the arrows (cf. Fig. 7). We have:

Proposition 1. Let $n \in \mathbb{Z}_{>0}$ and let $W_n$ denote the n-uplinked or downlinked mutual dyad. Then $\tilde{\beta}_2(W_n) = n - 1$, and $\tilde{\beta}_p(W_n) = 0$ for all $p \geq 0$, $p \neq 2$.

Proof. Suppose first that $W_n$ is the uplinked mutual dyad motif. From [9] we know that $\beta_0$ counts the number of connected components of the underlying undirected graph, so $\beta_0 = 1$ and $\tilde{\beta}_0 = 0$.

Next consider the case $p = 1$. We have $\partial_{[1]}(e_{(a,b)} + e_{(b,a)}) = 0$, but we also have $\partial_{[2]}(e_{(a,b,1)} + e_{(b,a,1)}) = e_{(a,b)} + e_{(b,a)}$, and so $e_{(a,b)} + e_{(b,a)}$ cannot contribute to $\tilde{\beta}_1$. Similarly, terms of the form $e_{(a,b)} + e_{(b,i)} - e_{(a,i)} \in Z_1$, $1 \leq i \leq n$, cannot contribute to $\tilde{\beta}_1$ as they belong to $B_1$, being the images of $e_{(a,b,i)}$ for $1 \leq i \leq n$.

Next consider $p \geq 3$. In these cases we find $\Omega_p = \{0\}$, as all the 3-paths have boundaries with non-allowed paths, and taking linear combinations does not cancel out these non-allowed paths.

Finally we deal with the case $p = 2$. First let $1 \leq i \neq j \leq n$. Then $\partial_{[2]}(e_{(a,b,i)} + e_{(b,a,i)} - e_{(a,b,j)} - e_{(b,a,j)}) = e_{(a,b)} + e_{(b,a)} - e_{(a,b)} - e_{(b,a)} = 0$. Thus all 2-paths of the form $e_{(a,b,i)} + e_{(b,a,i)} - e_{(a,b,j)} - e_{(b,a,j)}$ belong to $Z_2$, but not to $B_2$ as $\Omega_3$ is trivial. Some linear algebra shows that a basis for $Z_2$ is given by the collection $\{e_{(a,b,1)} + e_{(b,a,1)} - e_{(a,b,j)} - e_{(b,a,j)} : 2 \leq j \leq n\}$. It follows that $\tilde{\beta}_2 = n - 1$.

To conclude the proof, observe that all of these arguments hold for the downlinked mutual dyad by replacing terms of the form $e_{(a,b,i)}$ with $e_{(i,a,b)}$.
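The boundary computations used in the proof can be checked mechanically. The sketch below represents a chain as a dictionary from vertex tuples to integer coefficients and applies the alternating-sum boundary; the function names are hypothetical and the code checks only the identities quoted in the proof, not the full proposition.

```python
def boundary(chain):
    """Non-regular boundary of a chain given as {path tuple: coefficient}."""
    result = {}
    for path, coeff in chain.items():
        for i in range(len(path)):
            face = path[:i] + path[i + 1:]  # delete the i-th vertex
            result[face] = result.get(face, 0) + (-1) ** i * coeff
    return {face: c for face, c in result.items() if c != 0}

def dyad_cycle(i, j):
    """The 2-chain e(a,b,i) + e(b,a,i) - e(a,b,j) - e(b,a,j) from the proof."""
    return {("a", "b", i): 1, ("b", "a", i): 1,
            ("a", "b", j): -1, ("b", "a", j): -1}

# e(a,b) + e(b,a) is a 1-cycle ...
mutual_pair = {("a", "b"): 1, ("b", "a"): 1}
pair_boundary = boundary(mutual_pair)
# ... and it is the boundary of e(a,b,1) + e(b,a,1), so it is trivial in homology.
filling = boundary({("a", "b", 1): 1, ("b", "a", 1): 1})
# The 2-chains spanning Z_2 have vanishing boundary.
cycle_boundary = boundary(dyad_cycle(1, 2))
```

An empty dictionary stands for the zero chain, so `pair_boundary` and `cycle_boundary` being empty corresponds to the two vanishing boundaries stated in the proof.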
3 Algorithm
Although an algorithm to compute non-regular path homology is fairly obvious, our implementation [31] is apparently among the first for dimension $> 1$. Though [26,27] are considerably older, we were unaware of these for some time and could not find any published work drawing on them, so we outline our approach here.

First, to reduce computation, we remove all nonbranching limbs (i.e. chains of vertices of total degree 2 terminating in leaves of degree 1), since these do not affect homology (Theorem 5.1 of [9]). Similarly (see Proposition 3.25 of [9]), we break the graph into weakly connected components and compute homology componentwise. For each component $D$, we extend an order on vertices $V(D) \equiv [n]$ to the lexicographical ordering on paths. We inductively construct $A_p(D)$ for $0 \leq p \leq p_{max}$ as follows: $A_0(D) = V(D)$, and $A_p(D)$ is constructed by appending to every path in $A_{p-1}(D)$ every vertex that has an arc from the path's terminal vertex. The paths are constructed in lexicographical order for each $p$.
Path Homology and Temporal Networks
From here, we compute the indices (using a radix-$n$ expansion) that specify the inclusion $A_p \hookrightarrow V^{p+1}$ under lexicographical ordering. We then construct the matrix representation $\partial_{[p,A]}$ of the restriction of $\partial_{[p]}$ to $\mathbb{F}^{A_p}$ using the standard basis. Let $\nabla_{[p,A]}$ be the projection of $\partial_{[p,A]}$ onto $\mathbb{F}^{V^p \setminus A_{p-1}}$ (i.e., the matrix obtained by removing rows of $\partial_{[p,A]}$ that correspond to elements of $A_{p-1}$), and $\Delta_{[p,A]}$ be the projection onto $\mathbb{F}^{A_{p-1}}$. The kernel of $\nabla_{[p,A]}$ is $\Omega_p$. In practice, we remove rows of $\nabla_{[p,A]}$ that are identically zero before computing this kernel, which yields $\Omega_p$ much more efficiently.

With a matrix representation $\Omega_{[p,A]}$ for the kernel above in hand, we compute $\partial_p = \Omega_{[p-1,A]}^{-1} \Delta_{[p,A]} \Omega_{[p,A]}$ (i.e., the projection of the chain boundary operator $\partial_{[p,A]}$ onto $\mathbb{F}^{A_{p-1}}$, projected onto and restricted to the invariant space). We then compute the homology of this chain complex, e.g. we use the rank-nullity theorem and compute the matrix ranks to obtain Betti numbers, or we compute representatives for the homology groups (albeit somewhat lacking in geometric meaning) as the cokernels of $[\ker \partial_p]^T \partial_{p+1}$, or take the Smith normal form of the boundary matrices to find any torsion over $\mathbb{Z}$ (cf. Fig. 5).³
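The rank-based route to Betti numbers described above can be sketched in plain Python for very small digraphs. This is not the implementation of [31]: it skips limb removal and the split into weak components, indexes non-allowed faces with a dictionary instead of a radix-$n$ expansion, and uses exact rational Gaussian elimination in place of floating-point rank computations. All function names and the integer vertex labels are assumptions for illustration.

```python
from fractions import Fraction

def row_reduce(rows, ncols):
    """Reduced row echelon form over Q; returns (nonzero rows, pivot columns)."""
    rows = [[Fraction(x) for x in row] for row in rows]
    pivots = []
    for col in range(ncols):
        r = len(pivots)
        hit = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if hit is None:
            continue
        rows[r], rows[hit] = rows[hit], rows[r]
        lead = rows[r][col]
        rows[r] = [x / lead for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append(col)
    return rows[:len(pivots)], pivots

def kernel_basis(rows, ncols):
    """Basis of {x : rows . x = 0}, one vector per free column."""
    reduced, pivots = row_reduce(rows, ncols)
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        vec = [Fraction(0)] * ncols
        vec[free] = Fraction(1)
        for row, piv in zip(reduced, pivots):
            vec[piv] = -row[free]
        basis.append(vec)
    return basis

def allowed_paths(vertices, arcs, p_max):
    """A_0 = V; A_p extends each path in A_{p-1} by every arc leaving its last vertex."""
    succ = {v: sorted(w for (u, w) in arcs if u == v) for v in vertices}
    levels = [[(v,) for v in sorted(vertices)]]
    for _ in range(p_max):
        levels.append([path + (w,) for path in levels[-1] for w in succ[path[-1]]])
    return levels

def betti_numbers(vertices, arcs, p_max):
    """Non-regular path homology Betti numbers beta_0..beta_{p_max} over Q."""
    arcs = {(u, v) for (u, v) in arcs if u != v}  # loops are removed
    levels = allowed_paths(vertices, arcs, p_max + 1)
    dim_omega, rank = [len(levels[0])], [0]
    for p in range(1, p_max + 2):
        paths = levels[p]
        allowed = {q: i for i, q in enumerate(levels[p - 1])}
        other = {}             # non-allowed faces, indexed as they are first met
        delta, nabla = {}, {}  # sparse boundary entries: (row, column) -> coefficient
        for col, path in enumerate(paths):
            for i in range(p + 1):
                face = path[:i] + path[i + 1:]
                if face in allowed:
                    target, row = delta, allowed[face]
                else:
                    target, row = nabla, other.setdefault(face, len(other))
                target[(row, col)] = target.get((row, col), 0) + (-1) ** i
        nabla_rows = [[nabla.get((r, c), 0) for c in range(len(paths))]
                      for r in range(len(other))]
        omega = kernel_basis(nabla_rows, len(paths))  # Omega_p = ker(nabla)
        images = [[sum(delta.get((r, c), 0) * x[c] for c in range(len(paths)))
                   for r in range(len(allowed))] for x in omega]
        dim_omega.append(len(omega))
        rank.append(len(row_reduce(images, len(allowed))[1]))
    # rank-nullity: beta_p = dim Omega_p - rank d_p - rank d_{p+1}
    return [dim_omega[p] - rank[p] - rank[p + 1] for p in range(p_max + 1)]

def uplinked_dyad(n):
    """W_n with a = 0, b = 1 and the remaining vertices labelled 2..n+1."""
    return {(0, 1), (1, 0)} | {(x, i) for x in (0, 1) for i in range(2, n + 2)}
```

For example, `betti_numbers(range(5), uplinked_dyad(3), 2)` gives unreduced Betti numbers $[1, 0, 2]$, in agreement with $\tilde{\beta}_2(W_n) = n - 1$ from Proposition 1. The number of allowed paths grows quickly with $p$, so this sketch is only practical for small motifs.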
4 Phenomenology of Path Homology for Small Digraphs
Small Digraphs. In Fig. 3, we illustrate digraphs on 4 vertices with nontrivial homology in dimensions greater than 1. Surprisingly, nontrivial homology arises even in dimension 3 with just 4 vertices. Additionally, the left panel of Fig. 4 shows directed acyclic graphs (DAGs) on 6 vertices with nontrivial homology in dimension 2. Observations from these DAGs led us to formulate and prove a conjecture about the path homology of deep feedforward neural networks [4]. Finally, the right panel of Fig. 4 shows undirected graphs (considered as digraphs) on 6 vertices with nontrivial homology in dimension 2, highlighting that path homology is relevant to the undirected case as well.

Torsion. Though heretofore defined over fields, path homology still makes sense over rings, e.g. $\mathbb{Z}$. By sampling Erdős-Rényi digraphs, M. Yutin [30] was able to find a family of digraphs with torsion. In Fig. 5, we show the smallest two digraphs in this family; larger digraphs are formed by adding more vertices to the central cycle and linking to one of the two vertices external to the cycle (in an alternating fashion). We conjecture that the digraph in this family with a central cycle of length $2n$ has a torsion subgroup of $\mathbb{Z}/n\mathbb{Z}$ in one-dimensional path homology. This conjecture has been computationally verified up to $n = 8$.

³ Further optimizations are imaginable. We can pre-process the graph more, in accord with Theorem 5.7 of [9], and deal with the fallout in low dimensions. Instead of performing a singular value decomposition on rather large boundary matrices (en route to computing the rank), we can recursively build up the invariant spaces from lower dimensions: each sub-path of an invariant path is itself an invariant path (since we assume no loops). A simple algorithm for this latter approach might be to check every pair of paths in dimension $p$ against every vertex to see where we can append 'triangles' and 'squares' (terminology borrowed from [9]); while promising, this approach generates too many paths, and reducing it to a basis is computationally nontrivial. For low dimensions, we could also directly compute Betti numbers from the digraph itself (e.g. Proposition 3.24 of [9]).

Fig. 3. (L) $\tilde{\beta}_2 > 0$ for these 6 (of 218 total) digraphs on 4 vertices. In each case $\tilde{\beta}_p = \delta_{p,2}$. (R) $\tilde{\beta}_3 > 0$ for these 5 digraphs on 4 vertices. In each case $\tilde{\beta}_p = \delta_{p,3}$.

Fig. 4. (L) $\tilde{\beta}_2 > 0$ for these 17 (of 5984 total) DAGs on 6 vertices. (R) $\tilde{\beta}_2 > 0$ for these 17 (of 156 total) undirected graphs on 6 vertices.

Fig. 5. These two digraphs exhibit torsion. Larger digraphs with torsion can be formed by adding more vertices to the central cycles of these and linking them each to one of the two vertices external to the cycle (in an alternating fashion). We conjecture that the digraph in this family with a central cycle of length $2n$ has a torsion subgroup of $\mathbb{Z}/n\mathbb{Z}$ in $\tilde{H}_1$. This conjecture has been computationally verified up to $n = 8$. These digraphs may be analogues of so-called lens spaces that can be formed by gluing two tori together with a twist, and that themselves exhibit similar torsion.
Erdős-Rényi Random Graphs. In Fig. 6 we plot empirical distributions for the first few Betti numbers of Erdős-Rényi random graphs [7] on 4 nodes. A standard application of these distributions could be to test if a stochastic digraph generating process could be modeled via a random graph generating model.
Fig. 6. Empirical distributions of $\tilde{\beta}_p(D_{4,q})$.
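An empirical distribution of this kind can be estimated by repeated sampling. The sketch below assumes that $D_{n,q}$ denotes a digraph on $n$ vertices in which each of the $n(n-1)$ possible arcs is present independently with probability $q$, and it only tracks $\beta_0$ (the number of weakly connected components), since higher Betti numbers need a full path homology routine. The function names and the seed are illustrative choices.

```python
import random
from collections import Counter

def sample_er_digraph(n, q, rng):
    """Each ordered pair (u, v) with u != v becomes an arc with probability q."""
    return {(u, v) for u in range(n) for v in range(n)
            if u != v and rng.random() < q}

def beta0(n, arcs):
    """Number of weakly connected components, via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in arcs:
        parent[find(u)] = find(v)
    return len({find(x) for x in range(n)})

def beta0_distribution(n, q, samples, seed=0):
    """Empirical distribution of beta_0 over `samples` random digraphs."""
    rng = random.Random(seed)
    counts = Counter(beta0(n, sample_er_digraph(n, q, rng)) for _ in range(samples))
    return {value: count / samples for value, count in sorted(counts.items())}

distribution = beta0_distribution(4, 0.5, samples=1000)
```

Replacing `beta0` with a routine that returns higher Betti numbers would give the other distributions; comparing such distributions against those of an observed digraph process is the application suggested above.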
5 Examples of Applications to Temporal Networks
We analyze three temporal networks [17,22], more specifically directed contact networks as described in [5]: MathOverflow, an email network, and activity on a Facebook group: these illustrate how path homology can find high-order interactions that are respectively indicators of dilution, recurring motifs, and concentration within network behavior. In our analyses, we take a standard approach of taking sliding windows over time, aggregating each window into a digraph (see, e.g. [1,2] for such “dynamic”
approaches to network measures such as centrality and motif structure) for which we then compute the path homology. An interesting alternative approach would be to compute persistent path homology [3], which would compute path homology signatures over the entire time axis instead of over sliding windows. However, there are two obstructions to this approach. One is that currently there are efficient algorithms for only (up to) one-dimensional persistent path homology [6]. The other is that this method would actually call for zigzag persistent path homology to deal with vanishing edges, and this technique has not yet been developed.⁴

5.1 MathOverflow
To illustrate the ability of path homology to identify structurally relevant motifs, we analyzed the answer-to-question portion of the sx-mathoverflow temporal network available at [20]. This network has 21688 vertices and 107581 directed temporal contacts, spanning 2350 days. It has previously been analyzed in [24]; for a discussion of question/answer phenomenology on MathOverflow, see [28]. We filtered the contacts through a sliding time window of 24 h, moving every eight hours, and aggregated each window into a static digraph. We then computed the first three Betti numbers.⁵ Only two windows, immediately adjacent and overlapping, had $\beta_2 > 0$, corresponding to a period over 13–14 Oct 2009. Subsequent inspection of homology representatives revealed that this phenomenon originated in the presence of the 2-downlinked mutual dyad (cf. Subsect. 2.1) motif presented in Fig. 7, and additional inspection of MathOverflow itself revealed the particular questions and answers involved.

The rarity of 2-homology is related to the fact that it happened very early in the history of MathOverflow, in fact just two weeks after its beginning. As MathOverflow changed over time, opportunities for such tightly coupled patterns of questions and answers diminished. For example, most of the first 200 users asked and answered many fewer questions over time, while the overall size of and activity on MathOverflow grew much larger.

5.2 An Email Network
The phenomenon of the previous example is actually ubiquitous and generalized in email networks, for reasons attributable to well-known behaviors unique to the medium. We analyzed the email-Eu-core-temporal network available at [20]. This network has 986 vertices and 332334 directed temporal contacts, spanning 804 days of activity. We filtered the contacts through a sliding window of the most recent 100 contacts, moving every 50 contacts, and again aggregated each window into a static digraph. We then computed the first three Betti numbers. Many windows exhibited very high values of $\beta_2$ due to instances of the n-uplinked mutual dyad (cf. Subsect. 2.1) motif shown in Fig. 8.

⁴ During review, a referee suggested another alternative approach, viz. the representation of [25], which would dovetail with the results of [4]. We believe this approach is very promising and plan to address it in future work. Note that unlike the temporal digraph representation of [5], such an approach always yields a DAG, albeit at the cost of enforcing a time discretization.

⁵ NB. Our path homology code removes any loops from digraphs.

Fig. 7. Digraph of activity on MathOverflow over a 24-hour period during 13–14 Oct 2009. Vertices are labeled by user ID; arcs are directed from answerer to questioner (any parallel arcs are merged). Arcs participating in a 2-homology representative are highlighted, with pairs of arcs and associated questions ((25, 65), 437), ((25, 83), 451), ((65, 83), 446), ((83, 65), 437), ((121, 65), 433), and ((121, 83), 446) as well as ((121, 83), 451). Three of the four users {25, 65, 83, 121} share the same first subject tag at time of writing; all four share the same second subject tag.

5.3 A Facebook Group
As a final example, we consider the first 1000 days of activity on a Facebook group [19,29]. Because the associated temporal network (13295 vertices; 187750 contacts) has a daily lull with virtually no activity, we aggregated the temporal network into daily digraphs. Figure 9 shows the number of posts per day and the first three Betti numbers. Besides the obvious correlation between activity and $\tilde{\beta}_0$, the appearance of progressively more and higher-dimensional homology classes over time is also evident, indicating the emergence of higher-order network structure. Figure 10 shows the first daily digraph with $\tilde{\beta}_2 > 0$.
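The three windowing schemes used in this section (a 24 h window moving every eight hours, the most recent 100 contacts moving every 50, and daily aggregation) share the same aggregation step. The sketch below is one possible way to write them, assuming contacts are given as `(source, target, time)` triples with numeric timestamps; the function names and the tuple layout are illustrative and not taken from the authors' code.

```python
def aggregate(contacts):
    """Static digraph of a window: a set of arcs, parallel arcs merged, loops dropped."""
    return {(u, v) for (u, v, _) in contacts if u != v}

def time_windows(contacts, width, step):
    """Digraphs for windows [s, s + width), s advancing by `step` from the first contact."""
    contacts = sorted(contacts, key=lambda c: c[2])
    digraphs = []
    if not contacts:
        return digraphs
    start, last = contacts[0][2], contacts[-1][2]
    while start <= last:
        digraphs.append(aggregate(c for c in contacts
                                  if start <= c[2] < start + width))
        start += step
    return digraphs

def count_windows(contacts, size, step):
    """Digraphs for windows of `size` consecutive contacts, advancing by `step` contacts."""
    contacts = sorted(contacts, key=lambda c: c[2])
    starts = range(0, max(len(contacts) - size, 0) + 1, step)
    return [aggregate(contacts[i:i + size]) for i in starts]

# Hypothetical contacts (source, target, time); (3, 3, 12) is a loop and gets dropped.
contacts = [(1, 2, 0), (2, 1, 5), (2, 3, 10), (3, 3, 12), (1, 3, 20)]
by_time = time_windows(contacts, width=10, step=5)
by_count = count_windows(contacts, size=2, step=1)
```

With timestamps in seconds, `time_windows(contacts, 24 * 3600, 8 * 3600)` would correspond to the MathOverflow setting, `count_windows(contacts, 100, 50)` to the email setting, and `time_windows(contacts, 86400, 86400)` to daily digraphs. Each resulting arc set can then be passed to a path homology routine.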
6 Remarks
Although the computational requirements for path homology scale exponentially with dimension p, even the case p = 2 can highlight salient network structure and behavior. By decomposing temporal networks into time windows, path homology can be successfully brought to bear in this regard, illuminating both motifs with nontrivial path homology as well as the temporal networks themselves.
Fig. 8. (L) The distribution of $\tilde{\beta}_2$ for windowed digraphs obtained from the email network. (R) The digraph $W_n$ depicted here has $\tilde{\beta}_2(W_n) = n - 1$, and it is the cause of high values of $\tilde{\beta}_2$ in windowed digraphs obtained from the email network. The underlying dynamics is common in large organizations: two people ("Alice" and "Bob") both send email to the same wide distribution and to each other.

Fig. 9. (L) Daily Facebook group posts. (R) Betti numbers of daily digraphs. As activity increases, so do topological features in dimensions 0 through 2.

Fig. 10. (L) First daily digraph with $\tilde{\beta}_2 > 0$: day 756. The only weak component with $\tilde{\beta}_2 > 0$ is indicated with a box. (R) Detail with (different graph layout and) arcs representing $\tilde{H}_2$ highlighted. This homology representative is highly symmetrical.
Acknowledgements. The authors thank Michael Robinson for many helpful discussions. This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) and the Air Force Research Laboratory (AFRL). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA or AFRL.
References

1. Braha, D., Bar-Yam, Y.: From centrality to temporary fame: dynamic centrality in complex networks. Complexity 12, 59 (2006)
2. Braha, D., Bar-Yam, Y.: Time-dependent complex networks: dynamic centrality, dynamic motifs, and cycles of social interactions. In: Gross, T., Sayama, H. (eds.) Adaptive Networks: Theory, Models and Applications. Springer (2009)
3. Chowdhury, S., Mémoli, F.: Persistent path homology of directed networks. In: Symposium on Discrete Algorithms (2018)
4. Chowdhury, S., et al.: Path homologies of deep feedforward networks. In: IEEE International Conference on Machine Learning and Applications (2019)
5. Cybenko, G., Huntsman, S.: Analytics for directed contact networks. Appl. Net. Sci. 4, 106 (2019)
6. Dey, T. K., Tianqi, L., Wang, Y.: An efficient algorithm for 1-dimensional (persistent) path homology. In: Symposium on Computational Geometry (2020)
7. Frieze, A., Karoński, M.: Introduction to Random Graphs. Cambridge (2016)
8. Ghrist, R.: Elementary Applied Topology. Createspace, Scotts Valley (2014)
9. Grigor'yan, A., et al.: Homologies of path complexes and digraphs. arXiv:1207.2834 (2012)
10. Grigor'yan, A., Muranov, Yu., Yau, S.-T.: Graphs associated with simplicial complexes. Homology Homotopy Appl. 16, 295 (2014)
11. Grigor'yan, A., et al.: Homotopy theory for digraphs. Pure Appl. Math. Quart. 10, 619 (2014)
12. Grigor'yan, A., et al.: Cohomology of digraphs and (undirected) graphs. Asian J. Math. 19, 887 (2015)
13. Grigor'yan, A., Muranov, Yu., Yau, S.-T.: Homologies of graphs and Künneth formulas. Comm. Anal. Geom. 25, 969 (2017)
14. Grigor'yan, A., et al.: On the path homology theory of digraphs and Eilenberg-Steenrod axioms. Homology Homotopy Appl. 20, 179 (2018)
15. Grigor'yan, A., et al.: Path homology theory of multigraphs and quivers. Forum Math. 30, 1319 (2018)
16. Hatcher, A.: Algebraic Topology. Cambridge (2002)
17. Holme, P.: Modern temporal network theory: a colloquium. Eur. Phys. J. B 88, 234 (2015)
18. Huntsman, S.: Generalizing cyclomatic complexity via path homology. arXiv:2003.00944 (2020)
19. Kunegis, J.: The Koblenz network collection. In: Proceedings of the WOW (2013)
20. Leskovec, J., Krevl, A.: SNAP datasets: Stanford large network dataset collection (2014). http://snap.stanford.edu/data
21. Lin, Y., et al.: Weighted path homology of weighted digraphs and persistence. arXiv:1910.09891 (2019)
22. Masuda, N., Lambiotte, R.: A Guide to Temporal Networks. World Scientific (2016)
23. Milo, R.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
24. Montoya, L.V., Ma, A., Mondragón, R.J.: Social achievement and centrality in MathOverflow. In: Complex Networks (2013)
25. Pósfai, M., Hövel, P.: Structural controllability of temporal networks. New J. Phys. 16, 123055 (2014)
26. Shajii, A.R. (2013). https://github.com/arshajii/digraph-homology
27. Slawinski, M. (2013). https://github.com/gtownrocks/digraph homology
28. Tausczik, Y.R., Kittur, A., Kraut, R.E.: Collaborative problem solving: a study of MathOverflow. In: CSCW (2014)
29. Viswanath, B., et al.: On the evolution of user interaction in Facebook. In: WOSN (2009)
30. Yutin, M.: Personal communication (2019)
31. Yutin, M. (2020). https://github.com/SteveHuntsmanBAESystems/PerformantPathHomology
Computing Temporal Twins in Time Logarithmic in History Length

Binh-Minh Bui-Xuan (1), Hugo Hourcade (1,2) (✉), and Cédric Miachon (2)

(1) LIP6 (CNRS – Sorbonne Université), Paris, France
(2) Courtanet, Paris, France
{buixuan,hugo.hourcade}@lip6.fr, {hugo.hourcade,cedric.miachon}@lesfurets.com
Abstract. A temporal graph G is a sequence of static graphs indexed by a set of integers T representing time instants. For an integer Δ, a pair of Δ-twins is a pair of vertices u ≠ v which, starting at some time instant, have exactly the same neighbourhood outside {u, v} for Δ consecutive instants. We address the problem of enumerating all pairs of Δ-twins in G in such a way that the overall runtime depends as little as possible on the history length, namely max{t : G_t ∈ G not empty} − min{t : G_t ∈ G not empty}. We give solutions with runtime logarithmic in the history length, using a red-black tree data structure. Numerical analysis of our implementation on graphs collected from real world data scales up to a history length of 10^8 (source code at https://github.com/DaemonFire/deltatwinsMEI).
Keywords: Graph theory · Historical data · Modular decomposition

1 Introduction
Graph data from historical databases are well captured in the formalism of a link stream [13], or of a time varying [3], temporal [7] or evolving graph [2]. These notions occur in use cases as varied as transportation timetables [5,8,11], navigation programs [4,14], email exchanges [12], proximity interactions [16], and many other types of dataset [17]. Therein, duplicated data correspond to twin vertices, in the sense of [15]: a pair of true twins are vertices u ≠ v sharing the same neighbourhood; otherwise, they are still false twins if their neighbourhoods outside {u, v} are the same. In a static graph, removing redundant twin vertices helps in reducing both the space and time complexity of graph problems via modular decomposition, see e.g. [6,9,15] for a broad survey.

Supported by Courtanet – Sorbonne Université convention C19.0665 and ANRT grant 2019.0485.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 651–663, 2021. https://doi.org/10.1007/978-3-030-65351-4_52

The notion of twin vertices in a temporal graph is ambiguous along the time dimension: should twins be so eternally or only temporarily, and in the latter case for how long? We formally define two versions of such a notion in the subsequent section. Informally, in the EternalTwins version, we address the problem of enumerating all
pairs of vertices which are twins at every instant of a historical dataset of graphs; we refer to Δ-Twins, with Δ an integer, as the problem of enumerating all pairs of vertices which are twins for at least Δ consecutive time instants, at some moment in the dataset. Data duplicates are well modeled by EternalTwins: two vertices in this case have exactly the same behaviour at any time in the link stream. Δ-Twins, on the other hand, could represent a temporary likeness between two agents who are not duplicates per se. Indeed, outside the corresponding Δ time window, they are prone to behave differently. Δ-Twins could therefore help in detecting behavioural patterns over a specific time period, which would characterize populations of the same group.

Previous Works. In a static graph, standard approaches for finding twin vertices include modular decomposition [15], partition refinement [10], and the following bucket sort for finding true twins. The idea is to stream the neighbourhood N(v) of every vertex v into the corresponding entry T[N(v)] of an array T, in time linear in the number n of vertices, using the following function: v → T[N(v)]==null ? T[N(v)]=v : print(v,T[N(v)]), where N(v) is encoded as an n-bit array of 1s and 0s representing the row of the adjacency matrix corresponding to vertex v. This does not enumerate (print) all possible pairs of true twins: (a, b) and (a, c) can be found in this way, but then not (b, c). Besides, buckets come with a costly space complexity, as the machine representation of N(v) can be a very big integer, forcing table T to have a lot of entries. However, this latter inconvenience can be circumvented by replacing T[N(v)] with T[hash(N(v))], accepting some very low probability of false positives. Consider now a temporal graph as a sequence G = (G_t)_{t∈T} of static graphs. Likewise, the true twins case of EternalTwins can be solved by replacing T[hash(N(v))] with T[hash(N_1(v) × N_2(v) × . . .
)], where N_t(v) is the neighbourhood of vertex v in graph G_t. The main crux here is that the pair of vertices must be twins at every time instant. Hence, we “only” need to check the property everywhere, e.g. in the Cartesian product of all neighbourhoods. If we consider that applying the hash function to an array already present in RAM (reading the N_t(v)'s from adjacency matrices) takes constant time, then the previous bucket streaming is very fast. The problem is much less simple with Δ-Twins, where we do not know when the Δ time window starts. Note that problems with a sliding Δ time window have recently been widely studied, e.g. for vertex cover [1] and cliques [18]; see also the extensive references therein. However, to our knowledge, twin vertices have not been considered before. Basically, for the specific case of true twins of EternalTwins, the above described bucket sort could be considered a solution in O(n + τ) time, where n is the number of vertices and τ is the history length τ = max{t : G_t ∈ G not empty} − min{t : G_t ∈ G not empty}. More generally, EternalTwins can be solved by calling modular decomposition algorithms [9,15] on every graph G_t for t ∈ T, for a global time complexity in O((n + m) × τ), where m = m_1 +
m_2 + . . . is the total number of all recorded edges in (G_t)_{t∈T}. As for Δ-Twins, partition refinement techniques such as in [10] can be deployed Δ consecutive times at every instant graph G_t, for a global polynomial time complexity with a factor in τ × Δ. In settings where the history length τ is large, a new algorithm with time logarithmic in τ is desirable. Note that we can have τ ≫ m if there are many instants t where G_t is an empty graph. If we only consider T as the set of time instants with at least one recorded edge, then computing Δ-twins, which are defined on Δ consecutive instants, would require additional arithmetic operations to take into account those deleted instants.

Our paper addresses the following questions: Would there be an instantaneous response to Δ-Twins on historical graph data collected from human activities? Can it be numerically confirmed by an implementation of those algorithms? In particular, the foci of this paper are: long history (big τ), few vertices (small n) and a good number of recorded edges (medium m). We revisit the red-black tree data structure and devise a computation with runtime logarithmic in the history length of the input, and confront its implementation with one generated dataset and three datasets collected from real world data.

Theoretical Contribution. We revisit the matrix-based implementation of partition refinement and use a red-black tree data structure in order to devise two variants of an algorithm for Δ-Twins. The two variants differ in the use of a large matrix in memory, the matrix-less version being the key to avoiding out of RAM problems. Furthermore, the use of red-black trees allows our algorithm to compute even in the case where the input graphs G_t for t ∈ T are given unordered, mixing parts of one graph with another. This feature provides fault tolerance for batched data which come asynchronously.
All in all, the computation time is O(m × n log τ + N) with an adjacency matrix of size O(n^2 × τ) in memory, and O(m^2 × n log τ + N) without it, where N represents the size of the output.

Numerical Experiments. We confront our implementations with generated data in order to confirm that the implementation is sound and that its runtime is logarithmic in the history length τ. We then confront them with real world datasets, with two collections from previous experiments [12,16] and a new one called LesFurets. The runtime of our algorithm averages 12 s for all but one dataset, where it averages 70 s.

The formal framework of Δ-Twins is defined in Sect. 2. In Sect. 3, we use a red-black tree data structure in order to achieve a runtime logarithmic in the history length of the input data. All numerical analyses of our implementation are presented in Sect. 4, before we close the paper with concluding remarks.
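The bucket streaming recalled under Previous Works can be sketched as follows, in a variant that groups vertices by neighbourhood so that every pair is listed, including (b, c) when (a, b) and (a, c) are found. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `twin_pairs_by_bucket`, the dictionary input format and the `closed` flag (whether a vertex counts as its own neighbour) are hypothetical. Python dictionaries compare keys for equality after hashing, so this version has no false positives.

```python
from collections import defaultdict
from itertools import combinations


def twin_pairs_by_bucket(adj, closed=False):
    """List all pairs of vertices with identical neighbourhoods.

    adj: dict mapping each vertex to the set of its neighbours
         (undirected graph without loops; hypothetical input format).
    closed: if True, compare closed neighbourhoods N(v) + {v};
            if False, compare the adjacency rows N(v) as they are.
    """
    buckets = defaultdict(list)
    for v, neighbours in adj.items():
        key = frozenset(neighbours | {v}) if closed else frozenset(neighbours)
        buckets[key].append(v)  # one bucket per distinct neighbourhood
    pairs = []
    for group in buckets.values():
        # every pair inside a bucket shares the same neighbourhood
        pairs.extend(combinations(sorted(group), 2))
    return sorted(pairs)


if __name__ == "__main__":
    graph = {0: {2, 3}, 1: {2, 3}, 4: {2, 3}, 2: {0, 1, 4}, 3: {0, 1, 4}}
    print(twin_pairs_by_bucket(graph))
```

For the eternal version, the same idea would apply with a key built from the sequence of per-instant neighbourhoods, at the cost of storing one key per vertex over the whole history.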
2 Twin Vertices in a Historical Recording of Graphs
Graphs in this paper are simple, undirected, and unweighted. A temporal graph is a sequence of graphs indexed by integers representing time instants. For practical use, it can also be formalized as a link stream L = (T, V, E) such that T ⊆ ℕ is an interval, V is a finite set, and E ⊆ T × (V choose 2), where (V choose 2) denotes the set of unordered pairs of distinct elements of V. The elements of V are called vertices and the elements of E are called (recorded) edges. For t ∈ T, the subgraph G_t of L induced by t is a graph over the same vertex set V, with edge set E_t = {{u, v} : (t, {u, v}) ∈ E}. In this paper, we refer interchangeably to temporal graphs and link streams. The adjacency matrix sequence (M_t)_{t∈T} is the sequence of adjacency matrices of the graphs (G_t)_{t∈T} (Figs. 1 and 2).

Fig. 1. In this link stream, vertices 1 and 2 (represented by rows with the corresponding identifiers) have exactly the same links to other vertices at every instant from 0 to 5 = τ − 1. They form a pair of eternal-twins.

Fig. 2. In this link stream, vertices 1 and 2 (represented by rows with the corresponding identifiers) have exactly the same links to other vertices for instants 1, 2 and 3. They form a pair of Δ-twins for any Δ ≤ 3.

Temporal twin vertices have two variants. A pair of eternal twins {u, v} ∈ (V choose 2) is a pair of vertices for which the neighbourhoods N_t(u) and N_t(v) are strictly equal in V \ {u, v} for every instant t ∈ T. For an integer Δ, a pair of Δ-twins {u, v} ∈ (V choose 2) is a pair of vertices for which the neighbourhoods N_t(u) and N_t(v) are strictly equal in V \ {u, v} for Δ consecutive instants t_0 ≤ t < t_0 + Δ, with [t_0, t_0 + Δ) ⊆ T. Our paper addresses the following problems, which both have polynomial time solutions.

EternalTwins
Input: A link stream L.
Output: A list of all pairs of eternal twins in L.

Δ-Twins
Input: A link stream L and an integer Δ.
Output: A list of all pairs of Δ-twins in L.

The matrix based technique for partition refinement is defined as follows. Assume initially that all pairs of vertices are eternal twins: Tw(u, v) = true for all u ≠ v. For every pair of vertices u ≠ v, time instant t ∈ T, and vertex w ∈ V \ {u, v}, if M_t(w, u) ≠ M_t(w, v), then Tw(u, v) = false. Such a vertex w is called a splitter of {u, v}. At the end of the process, output every entry u ≠ v of table Tw where Tw(u, v) = true. This results in a naive O(n^3 × τ) solution for EternalTwins. A similar process repeated Δ times using Δ different tables Tw allows us to solve Δ-Twins in time O(n^3 × τ × Δ).
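The naive matrix based technique above can be sketched in a few lines. This is an illustrative sketch only: the helper names (`adjacency_matrices`, `is_split`, `eternal_twins_naive`, `delta_twins_naive`) and the list-of-lists matrix format are assumptions made here, with vertices numbered 0..n−1 and instants 0..τ−1. For Δ-Twins the sketch keeps a counter of consecutive twin instants per pair instead of Δ separate tables Tw, which is a simplification of the process described in the text.

```python
from itertools import combinations


def adjacency_matrices(n, tau, edges):
    """Build the sequence (M_t) from recorded edges given as (t, u, v)."""
    mats = [[[0] * n for _ in range(n)] for _ in range(tau)]
    for t, u, v in edges:
        mats[t][u][v] = mats[t][v][u] = 1
    return mats


def is_split(M, u, v):
    """True if some w outside {u, v} is a splitter of {u, v} in matrix M."""
    return any(M[w][u] != M[w][v] for w in range(len(M)) if w != u and w != v)


def eternal_twins_naive(matrices):
    """Pairs that are never split, in O(n^3 * tau) time."""
    n = len(matrices[0])
    return [(u, v) for u, v in combinations(range(n), 2)
            if not any(is_split(M, u, v) for M in matrices)]


def delta_twins_naive(matrices, delta):
    """Pairs that stay unsplit for at least delta consecutive instants."""
    n = len(matrices[0])
    result = []
    for u, v in combinations(range(n), 2):
        run = 0  # length of the current run of instants where {u, v} are twins
        for M in matrices:
            run = 0 if is_split(M, u, v) else run + 1
            if run >= delta:
                result.append((u, v))
                break
    return result
```

Such a direct implementation is mainly useful as a control group for unit-testing faster variants on small instances.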
3 Temporal Twins in Time Logarithmic in History Length
In many cases, an input link stream L = (T, V, E) is not given by its adjacency matrix sequence (M_t)_{t∈T}, but rather by a list of its recorded edges E. Then, computing EternalTwins in time independent of the history length τ can be done by using the triangular structure of splitters, as in Algorithm 1.

Data: Link stream L = (T, V, E)
Result: List of all eternal-twins of L
for vertex u do
    for vertex v ≠ u do
        Initialize Tw(u, v) = true;
    end
end
for recorded edge (t, {u, v}) ∈ E do
    for vertex w ∈ V \ {u, v} do
        if (t, {u, w}) ∉ E then
            Tw(v, w) = false;   // u is a splitter of {v, w}
        end
    end
end
return every entry u ≠ v of table Tw where Tw(u, v) = true.

Algorithm 1: Edge Iteration algorithm for eternal-twins listing
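Algorithm 1 can be sketched in Python as follows. The sketch rests on two assumptions that are not in the pseudocode: each recorded edge is processed in both orientations (u as a splitter of {v, w}, then v as a splitter of {u, w}), since edges are unordered pairs, and membership in E is tested with a hash set, which is neither the matrix (MEI) nor the scanning (MLEI) variant. The function name and the (t, u, v) input format are hypothetical.

```python
from itertools import combinations


def eternal_twins_edge_iteration(vertices, edges):
    """List eternal twins by iterating over recorded edges only.

    vertices: iterable of vertex identifiers.
    edges: iterable of recorded edges (t, u, v) with u != v.
    """
    vertices = sorted(vertices)
    recorded = {(t, frozenset((u, v))) for t, u, v in edges}
    not_twins = set()  # pairs for which a splitter has been found
    for t, pair in recorded:
        u, v = tuple(pair)
        # handle both orientations of the unordered edge {u, v}
        for splitter, other in ((u, v), (v, u)):
            for w in vertices:
                if w == u or w == v:
                    continue
                if (t, frozenset((splitter, w))) not in recorded:
                    # splitter is adjacent to `other` but not to w at instant t
                    not_twins.add(frozenset((other, w)))
    return [p for p in combinations(vertices, 2) if frozenset(p) not in not_twins]
```

Note that the running time of the loops depends on the number of recorded edges and vertices only; the history length τ does not appear.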
The overall complexity is O(m × n + n^2) if (M_t)_{t∈T} is also given as input, allowing a constant time test of (t, {u, w}) ∉ E. This is the matrix version, or Matrix Edge Iteration algorithm (MEI). When (M_t)_{t∈T} is part of the input, O(n^2 × τ) space is required at runtime, which usually causes out of RAM problems for big τ. The complexity is in O(m^2 × n + n^2) otherwise, since testing (t, {u, w}) ∉ E could take O(m) time, especially if E is given unordered by t ∈ T. This is the matrix-less version, or Matrix-less Edge Iteration algorithm (MLEI). We note that in practice the latter O(m) factor is small, especially when E is chronologically ordered, in which case dichotomy search can be used. We have proved the following property.

Property 1. EternalTwins can be solved in time independent of the history length.

In order to address Δ-Twins, we will use a tree-based data structure inspired from red-black trees. Each node represents a time range P ⊆ T and contains a time range D ⊆ P of consecutive instants that have been removed from T. The two sons of this node represent time ranges Q ⊆ P and R ⊆ P, with Q the time range preceding D and R the time range following D. Removing an instant t from T results in trying to remove it from the node at the root of this tree, which represents T. If this node contains no deleted time range, its deleted range becomes [t, t] and we create both sons of this node, which represent {a ∈ T, a < t} and {a ∈ T, a > t}. If the node contains a time range of deleted instants D = {a ∈ T, t_0 ≤ a ≤ t_1}, then if t = t_0 − 1 or t = t_1 + 1 we add t to D. If not, we try to remove t in the adequate son of this node. After an update of a node, we check whether the deleted time ranges contained in nodes are colliding. If D(left_son) = [t_0, t_1] and D(father) = [t_1 + 1, t_2], we fuse the son into the father: D(father) ← [t_0, t_2] and left_son(father) ← left_son(left_son(father)).

Fig. 3. Representation of [0, 16] where 3, 6, 7, 8, 9, 12, 14 are removed. The two leaves respectively represent time ranges preceding and following [6, 9]. All remaining intervals can be enumerated from the intervals in blue contained in the leaves.

Fig. 4. This tree represents the same time partitioning as the one in Fig. 3 but is one level deeper and therefore requires more operations and a greater computation time to delete a new instant.

This data structure allows us to store a discontinuous time range. We can then extract from it all time spans of at least Δ consecutive instants by exploring the tree and looking up the time ranges represented by the leaves of this tree (Fig. 4). The complexity of deleting an instant and of computing the time ranges of Δ consecutive instants in such a structure is in the worst case in O(d), with d the depth of the tree, this depth being at most the number of non-colliding deleted time ranges. Some optimizations can be accomplished using this data structure. For instance, when using it to solve the Δ-twins listing problem, when an instant t is removed, if either t_f − t < Δ, with t_f the last instant in the node in which t is being removed, or t − t_i < Δ, with t_i the first instant in the node, we can consider that the interval [t, t_f] (respectively [t_i, t]) is removed, thus diminishing computation time by avoiding useless computations.

This type of tree can be balanced in order to minimize the computation time of deletion. But this balancing comes at a price: it requires computing the depth of each sub-tree and recursively balancing each node. If we perform this balancing operation at each deletion of an instant, we only need to balance the node in which we deleted the instant and all its ancestors, root included. We therefore perform O(d) operations, where d is the depth of the tree.
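The behaviour expected from this structure (remove instants in any order, fuse colliding deleted ranges, then read off the surviving spans of at least Δ consecutive instants) can be illustrated with a deliberately simpler stand-in. The sketch below is not the balanced tree of the paper: it keeps the deleted ranges in two sorted Python lists and locates them by binary search, so a lookup is logarithmic but an insertion into the list is linear in the number of stored ranges. The class and method names are hypothetical, and Δ ≥ 1 is assumed.

```python
import bisect


class RemainingInstants:
    """Integer time range [lo, hi] from which instants are removed."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.starts = []  # sorted starts of the disjoint deleted ranges
        self.ends = []    # matching ends, ends[i] belongs to starts[i]

    def remove(self, t):
        if t < self.lo or t > self.hi:
            return
        i = bisect.bisect_right(self.starts, t)  # ranges starting at or before t
        if i > 0 and self.ends[i - 1] >= t:
            return  # t was already removed
        merge_left = i > 0 and self.ends[i - 1] == t - 1
        merge_right = i < len(self.starts) and self.starts[i] == t + 1
        if merge_left and merge_right:
            # t bridges two deleted ranges: fuse them into one
            self.ends[i - 1] = self.ends[i]
            del self.starts[i]
            del self.ends[i]
        elif merge_left:
            self.ends[i - 1] = t
        elif merge_right:
            self.starts[i] = t
        else:
            self.starts.insert(i, t)
            self.ends.insert(i, t)

    def is_exhausted(self):
        """True when every instant of [lo, hi] has been removed."""
        return self.starts == [self.lo] and self.ends == [self.hi]

    def spans(self, delta):
        """Maximal surviving ranges of at least delta consecutive instants."""
        out, cursor = [], self.lo
        for start, end in zip(self.starts, self.ends):
            if start - cursor >= delta:
                out.append((cursor, start - 1))
            cursor = end + 1
        if self.hi - cursor + 1 >= delta:
            out.append((cursor, self.hi))
        return out
```

On the example of Fig. 3, removing 3, 6, 7, 8, 9, 12 and 14 from [0, 16] in any order leaves the deleted ranges [3, 3], [6, 9], [12, 12] and [14, 14], from which the surviving spans are read in one left-to-right pass.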
The balancing operation in itself takes constant time, provided the depth of a sub-tree is saved in its root node and updated accordingly. Balancing operation of a node: if depth(left_son) − depth(right_son) > 2, then we can rotate left:

t_f(father) ← t_i(right_son) − 1;
t_i(right_son) ← t_i(father);
right_son(father) ← left_son(right_son);
left_son(right_son) ← father;
father ← right_son;

A rotation of a node loses no information, which means that the sub-trees rooted at the two sons of any node will have depths differing by at most one level. We can therefore ensure a depth in log(p), where p is the number of non-colliding time ranges deleted from the tree. This means that the depth of those trees is at all times less than log(τ), where τ is the history length of the input link stream. Therefore, the complexity of deletion and balancing operations is in the worst case in O(log(τ)).

Δ-Twins Listing Algorithm Based on Edges Iteration (MEI and MLEI). We now adapt the algorithm solving the eternal-twins problem to our Δ-twins listing problem, as in Algorithm 2.

Data: A link stream L = (T, V, E) and an integer Δ
Result: A list of all Δ-twins of L
Initialize Tw(u, v) = Tree(T) for each pair of vertices;
Initialize a list R of all entries in Tw;
for recorded edge (t, {u, v}) of E do
    for vertex w ∈ V \ {u, v} do
        if (t, {u, w}) ∉ E then
            Remove instant t in Tw(v, w);
            if a removal exhausts the time instants in Tw(v, w) then
                remove entry (v, w) from R;
            end
        end
    end
end
for (u, v) left in R do
    scan all time ranges of at least Δ consecutive instants in Tw(u, v) and add all ranges to output
end
return output

Algorithm 2: Edge Iteration algorithm for Δ-twins listing
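The overall flow of Algorithm 2 can be sketched as below. Several simplifications are assumed and should not be read as the authors' implementation: Tw(v, w) is replaced by a plain set of removed instants per pair (scanned in sorted order at the end) instead of a balanced tree, membership in E uses a hash set, each unordered edge is processed in both orientations, and T is taken to be the integer interval [t_min, t_max] containing every recorded instant. All names are hypothetical.

```python
from itertools import combinations


def delta_twins_edge_iteration(vertices, edges, t_min, t_max, delta):
    """Map each pair of delta-twins to its maximal twin ranges (start, end).

    vertices: iterable of vertex identifiers.
    edges: iterable of recorded edges (t, u, v), t_min <= t <= t_max, u != v.
    delta: required number of consecutive instants (delta >= 1).
    """
    vertices = sorted(vertices)
    recorded = {(t, frozenset((u, v))) for t, u, v in edges}
    removed = {}  # pair -> set of instants at which the pair is split
    for t, pair in recorded:
        u, v = tuple(pair)
        for splitter, other in ((u, v), (v, u)):
            for w in vertices:
                if w == u or w == v:
                    continue
                if (t, frozenset((splitter, w))) not in recorded:
                    removed.setdefault(frozenset((other, w)), set()).add(t)
    result = {}
    for p in combinations(vertices, 2):
        spans, cursor = [], t_min
        for t in sorted(removed.get(frozenset(p), ())):
            if t - cursor >= delta:
                spans.append((cursor, t - 1))  # surviving range before t
            cursor = t + 1
        if t_max - cursor + 1 >= delta:
            spans.append((cursor, t_max))
        if spans:
            result[p] = spans
    return result
```

Because recorded edges are visited in arbitrary order, the sketch also accepts an edge list that is not chronologically ordered.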
The overall complexity of this Matrix Edge Iteration algorithm (MEI) is O(m × n log τ + N), with n the number of vertices, m the number of recorded edges, τ the history length and N the number of pairs of Δ-twins, if (M_t)_{t∈T} is given. The space complexity is O(n^2 × τ), due to the use of (M_t)_{t∈T}. If (M_t)_{t∈T} is not given, scanning E to ascertain that (t, {u, w}) ∉ E adds to the complexity. If E is chronologically ordered, the scanning can be fast, by dichotomy search. However, in the worst case this step requires O(m) time. The complexity of this version of the algorithm is therefore O(m^2 × n log τ + N). This is the Matrix-less Edge Iteration algorithm (MLEI). The overall space complexity of algorithm MLEI is O(n^2 log τ), due to the use of Tw. This space complexity remains reasonable as our focus is on graphs with small n. Due to space restrictions, we omit the proof of the following theorem.

Theorem 1. On input a link stream with n vertices, m recorded edges, history length τ, and N pairs of Δ-twins, Δ-Twins can be solved in time O(m × n log τ + N) with O(n^2 × τ) space complexity, or in time O(m^2 × n log τ + N) with O(n^2 log τ) space complexity.

Fig. 5. Computation time of the MLEI algorithm solving the Δ-Twins listing problem as a function of history length on the Timeprogression dataset. We do not have consistent results for MEI, which runs out of RAM.
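The dichotomy search mentioned above for a chronologically ordered E can be sketched with Python's standard `bisect` module. The sketch assumes that recorded edges are stored as tuples (t, a, b) with a < b in one sorted list; the helper names `normalise` and `has_edge` are hypothetical.

```python
import bisect


def normalise(edges):
    """Sort recorded edges (t, u, v) and put the smaller endpoint first."""
    return sorted({(t, min(u, v), max(u, v)) for t, u, v in edges})


def has_edge(sorted_edges, t, u, v):
    """Test (t, {u, v}) in E by binary search, in O(log m) comparisons."""
    key = (t, min(u, v), max(u, v))
    i = bisect.bisect_left(sorted_edges, key)
    return i < len(sorted_edges) and sorted_edges[i] == key
```

Sorting the edge list once costs O(m log m), after which each membership test no longer needs a linear scan.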
4 Numerical Analysis
All the algorithms listed in the previous sections have been implemented in Java (source code at https://github.com/DaemonFire/deltatwinsMEI) and run on a standard laptop clocked at 2.7 GHz. Since the practical use of EternalTwins is somewhat limited, and since the corresponding algorithm is independent of history length, we only present numerical results for Δ-Twins. For the correctness of the various implementations, our methodology is unit-testing. The control groups are obtained by running an implementation of the naive O(n^3 × τ × Δ) computation described in Sect. 2. Due to the high time complexity of the naive computation, we do unit-testing only for instances where it does not exceed 45 min. This covers ≈33% of the instances of all our experiments in Subsects. 4.1 and 4.2. Results are positive on all these samples. In what follows we skip the discussion about correctness entirely, and only focus on computation time. We first stress-test the implementation on big values of history length with a generated dataset, in Subsect. 4.1. Then, we confront our implementation with three different datasets collected from real world data, in Subsect. 4.2. Overall, our experiments ran for more than 3000 (three thousand) hours of CPU time.
Fig. 6. Overview of computation time of all experiments.
4.1 Logarithmic Dependency in History Length on Runtime
Theoretically, our algorithms compute twin vertices in time logarithmic in the history length of the input temporal graph. We would like to confirm this on the runtime of their implementation.

Hypothesis 1. The runtime is logarithmic in the history length of the input temporal graph.

Dataset and Experiment Result. Our methodology is to generate an artificial dataset called Timeprogression in order to monitor the influence of history length on computation time while keeping the numbers of vertices and edges constant. The number of vertices was set to 50 and the number of edges to 10^5, both small enough that the influence of history length on the computation time would not be negligible. There are 199 instances, with history length varying from 5000 to 10^6 time instants. This dataset is not ordered by time instants. Results are presented in Fig. 5.

Discussion: Confirmation of Hypothesis 1. The progression of computing time is logarithmic, with a few jumps (2 cases) probably due to some noisy use of the PC during computation.

4.2 Runtime on Real World Datasets
We confront our implementations with real world datasets, and would like to test the two hypotheses below.

Hypothesis 2. Δ-twins can be enumerated in reasonable time.

Hypothesis 3. Algorithm MLEI is able to process link streams that cause exhaustion of memory for the MEI version.
Fig. 7. Computation time of the MLEI and MEI algorithms solving the Δ-twins listing problem as a function of the number of edges in the link stream on the Rollernet datasets.

Fig. 8. Computation time of the MLEI algorithm solving the Δ-twins listing problem as a function of the number of edges in the link stream on the Enron datasets.
Datasets and Experiment Result. Our methodology is to confront the implementations with three different datasets collected from real world data. We focus on Δ-Twins with Δ = 102. In the sequel, we describe our sampling method over the three datasets. Then, a global view of all computation times is given in Fig. 6. We continue with detailed views of each of the three datasets, in Figs. 7, 8 and 9, respectively. We leave all discussions for the corresponding paragraph at the end of this section.

The Rollernet dataset [16] has been collected from rollerbladers touring Paris. A link is recorded at instant t whenever two rollerbladers are close enough during a given period. This is a dense link stream. This dataset is likely to present a relatively greater number of Δ-twins than the two other datasets. We run our experiments on the following seven batches of extracts, each batch containing 100 extracts. The first two batches contain link streams induced by n_1 = 40, resp. n_2 = 50, vertices of the raw Rollernet dataset. Three batches contain link streams induced by m_1 = 10^5, m_2 = 2 · 10^5, and m_3 = 3 · 10^5 recorded edges of the raw dataset. The last two batches contain link streams induced by τ_1 = 5000, resp. τ_2 = 8000, successive time instants of the raw dataset.

The Enron dataset [12] is parsed from the log of e-mail exchanges between employees of the same company over a period of 3 years. Δ-twins would emerge from this link stream as people from the same service are sent the same e-mails for a certain period of time. This link stream is very sparse. We do the following seven batches of 100 extracts each, similarly to Rollernet, with n_1 = 50, n_2 = 100, m_1 = 5000, m_2 = 10000, m_3 = 20000, τ_1 = 10^7, and τ_2 = 5 · 10^7.

The LesFurets dataset is parsed from the log of user behaviour on the LesFurets funnel, some vertices representing the various users and the others representing events on the funnel. This link stream is therefore a bipartite link stream. This dataset is not ordered by time instants. The latter feature also provides a way to test the robustness of our approach, in the sense of fault tolerance. We do the following seven batches of 100 extracts each, similarly to the other datasets, with n_1 = 300, n_2 = 600, m_1 = 3000, m_2 = 6000, m_3 = 8000, τ_1 = 10000, and τ_2 = 13000.

Fig. 9. Computation time of the MLEI algorithm solving the Δ-twins listing problem as a function of the number of edges in the link stream on the LesFurets datasets.

Discussion: Slight Confirmation of Hypothesis 2; Confirmation of Hypothesis 3. The Rollernet datasets are the ones on which the naive algorithm runs in reasonable time, allowing us to ascertain that the MEI and MLEI algorithms compute Δ-twins correctly, cf. Fig. 7. It is also the only dataset we treated on which MEI does not encounter out of RAM issues, and there we observe that MEI tends to be somewhat quicker than MLEI. According to the heat-map, |T| seems to have a minor impact on computation time, as many datasets with large |T| compute faster than datasets with small |T|. The overall tendency towards greater computation time is due to the increase in the number of vertices and of edges more than to history length. As soon as we reach the Enron and LesFurets datasets, |V| gets too big and the computation time of the naive algorithm grows unreasonably. |T| also grows to a large number and MEI, as expected, starts to face memory issues. Hypothesis 2 seems to be strained on the Enron dataset. But we can still use those datasets to experiment on the MLEI algorithm. For Enron, the left hand side of Fig.
8 allows us to confirm our theoretical complexity regarding |E|, as our experimental curve of computation time as a function of the number of edges seems to describe a second degree polynomial function. The heat-map once again shows that the influence of history length on complexity is not so clear, indicating that the number of edges has a greater impact on computation time than the history length of the link stream. On the other hand, the right hand side of Fig. 8, where the heat-map corresponds to the number of vertices, shows what can be expected of a heat-map for an important complexity factor, as darker points correspond
to greater computation time than lighter ones. We proceed similarly for the results on the LesFurets datasets, cf. Fig. 9. They confirm the same tendencies as with Enron. We conclude from our experiments that Hypothesis 2 is strained on the Enron dataset, whereas Hypothesis 3 is confirmed. All in all, twin vertices can be computed within a few minutes, even on the Enron dataset.
5 Conclusion and Perspectives
We introduced two variants of twin vertices in a historical collection of graphs. The corresponding algorithmic problems of enumerating all such twin vertices are polynomial. We address the problem of solving them in time depending as little as possible on the history length. Revisiting partition refinement techniques along with red-black tree data structures, we devise a solution with runtime logarithmic in the history length. Our solution comes in two sub-variants: with or without the use of adjacency matrices in runtime memory. Confronted with datasets collected from real world data, our solutions scale up to a history length of 10^8. An interesting development of this work could be the replacement of all matrix data by hash-maps or sorted arrays. Then, extensive numerical analysis should be made in order to compare these three approaches (matrix, hash-maps and sorted arrays).

Acknowledgements. We are grateful to Emmanuel Chailloux for helpful discussion and pointers. We are grateful to the anonymous reviewers for their helpful comments which greatly improved the paper.
References

1. Akrida, E., Mertzios, G., Spirakis, P., Zamaraev, V.: Temporal vertex cover with a sliding time window. J. Comput. Syst. Sci. 107, 108–123 (2020)
2. Bui-Xuan, B.M., Ferreira, A., Jarry, A.: Computing shortest, fastest, and foremost journeys in dynamic networks. Int. J. Found. Comput. Sci. 14(2), 267–285 (2003)
3. Casteigts, A., Flocchini, P., Godard, E., Santoro, N., Yamashita, M.: Expressivity of time-varying graphs. In: Proceedings of the 19th International Symposium on Fundamentals of Computation Theory, pp. 95–106 (2013)
4. Dean, B.: Continuous-time dynamic shortest path algorithms. Ph.D. thesis, Massachusetts Institute of Technology (1999)
5. Dibbelt, J., Pajor, T., Strasser, B., Wagner, D.: Connection scan algorithm. ACM J. Exp. Algorithmics 23, 1–56 (2018)
6. Ehrenfeucht, A., Harju, T., Rozenberg, G.: The Theory of 2-Structures: A Framework for Decomposition and Transformation of Graphs. World Scientific, River Edge (1999)
7. Erlebach, T., Hoffmann, M., Kammer, F.: On temporal graph exploration. In: Proceedings of the 42nd International Colloquium on Automata, Languages, and Programming. LNCS, vol. 9134, pp. 444–455 (2015)
8. Foschini, L., Hershberger, J., Suri, S.: On the complexity of time-dependent shortest paths. Algorithmica 68(4), 1075–1097 (2014)
9. Habib, M., Paul, C.: A survey of the algorithmic aspects of modular decomposition. Comput. Sci. Rev. 4(1), 41–59 (2010)
10. Habib, M., Paul, C., Viennot, L.: Partition refinement techniques: an interesting algorithmic tool kit. Int. J. Found. Comput. Sci. 10(2), 147–170 (1999)
11. Kempe, D., Kleinberg, J., Kumar, A.: Connectivity and inference problems for temporal networks. J. Comput. Syst. Sci. 64(4), 820–842 (2002)
12. Klimt, B., Yang, Y.: Introducing the Enron corpus. In: CEAS (2004)
13. Latapy, M., Viard, T., Magnien, C.: Stream graphs and link streams for the modeling of interactions over time. Soc. Netw. Anal. Min. 8(61), 611–6129 (2018)
14. Ros, F., Ruiz, P., Stojmenovic, I.: Acknowledgment-based broadcast protocol for reliable and efficient data dissemination in vehicular ad-hoc networks. IEEE Trans. Mob. Comput. 11(1), 33–46 (2012)
15. Spinrad, J.: Efficient Graph Representations. Fields Institute Monographs, vol. 19. American Mathematical Society (2003)
16. Tournoux, P.U., Leguay, J., Benbadis, F., Conan, V., De Amorim, M.D., Whitbeck, J.: The Accordion phenomenon: analysis, characterization, and impact on DTN routing. In: Proceedings of the 28th IEEE Conference on Computer Communications (2009)
17. Tsalouchidou, I., Baeza-Yates, R., Bonchi, F., Liao, K., Sellis, T.: Temporal betweenness centrality in dynamic graphs. Int. J. Data Sci. Analytics 9, 1–16 (2019)
18. Viard, T., Magnien, C., Latapy, M.: Enumerating maximal cliques in link streams with durations. Inf. Proc. Lett. 133, 44–48 (2018)
A Dynamic Algorithm for Linear Algebraically Computing Nonbacktracking Walk Centrality

Eisha Nathan (✉)

Lawrence Livermore National Lab, Livermore, CA 94551, USA
[email protected]

Abstract. Dynamic graph data is used to represent the changing relationships in society, biology, web traffic, and more. When computing analytics on such evolving data, it is important to have algorithms that can update analytics quickly as data changes, without needing to recompute the analytics from scratch. A common analytic performed on graph data is that of centrality: identifying the most important (highly ranked) vertices in the graph. In this work we examine centrality scores based on nonbacktracking walks in graphs and propose methods to update such scores in dynamic graphs. We propose two dynamic methods and demonstrate that both are faster than statically recomputing the scores at each graph change. We additionally show that one of these methods is far superior to the other with regard to the quality of the scores obtained, and is able to produce good quality approximations with respect to a static recomputation. Our methods use properties of iterative methods to update local portions of the centrality vector as the graph changes (in this paper, we focus exclusively on edge additions). Experiments are performed on real-world networks with millions of vertices and edges.

Keywords: Dynamic graphs · Nonbacktracking walks · Centrality

1 Introduction
Several application domains contain data modeling relationships between entities that can be represented as graphs, such as financial transactions, computer networks, societal relationships, web traffic, or road networks. Many of these domains also contain data that is constantly changing over time, necessitating a model that can capture these evolving relationships. Dynamic graph data represents the changing relationships in such networks. Consider a network modeling Facebook friendships. As new connections between people in the online world are made and/or erased, these changes are reflected in the underlying graph structure modeling these relationships. One key data mining question in network analysis is that of identifying the most important vertices (in the Facebook example, people) [1]. “Importance” of a vertex can be defined in many different ways, but a very common approach is to identify vertices that are ‘most central’ to the graph in some manner. This approach gives rise to a host of centrality metrics which quantify traversals, or walks, around a network to obtain scores indicating relative importance of vertices. A walk in a graph is a sequence of vertices connected by edges that allows for both vertices and edges to repeat; however, recent work has suggested that walks that do not backtrack on themselves ought to be given more importance in networks (where a backtracking walk is one that visits a particular vertex i, visits its neighbor j, then immediately backtracks to vertex i) [2]. Again using our Facebook example, a walk that backtracks upon itself provides little to no useful information about the rest of the friendships in the graph. Additionally, backtracking walks in information diffusion networks or networks modeling disease spread do not allow the user to glean any new information [3,4].

In this paper, we study nonbacktracking walks and their associated centrality scores in the context of dynamic graphs. Specifically, we present an algorithm to efficiently compute the scores as the underlying network changes. Dynamic algorithms are inherently local, relying on the change in the graph and focusing efforts on updating the area affected by that change, while static recomputation uses the whole graph. Dynamic methods therefore avoid this unnecessary computation, giving enormous computational savings. Our presented algorithms are faster than recomputing centrality scores from scratch each time the graph is changed, and the results obtained from our second method demonstrate that it returns high-quality results similar to a simple static recomputation. To our knowledge, this is the first work on calculating scores for nonbacktracking walk centrality in dynamic networks using linear algebraic techniques.

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 with release number LLNL-CONF-814052.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 664–674, 2021. https://doi.org/10.1007/978-3-030-65351-4_53
2 Background

2.1 Definitions
Let A be the adjacency matrix associated with the graph G = (V, E) with n = |V| vertices and m = |E| edges. For all edges (i, j) ∈ E, A_ij = 1, and A_ij = 0 otherwise. A dynamic graph can be modeled as a sequence of static graphs, or by taking snapshots of the graph G as it evolves over time. Let G_t and A_t be the graph and corresponding adjacency matrix at time t, and let ΔA represent the change between two timesteps, or ΔA = A_{t+1} − A_t. This work presents algorithms for updating nonbacktracking walk centrality (NBT) scores efficiently. In a static graph, NBT scores, denoted x*, are calculated by solving the linear system in Eq. 1 [2].

$$(I - \alpha A + \alpha^2 (D - I))\,x^* = (1 - \alpha^2)\,\mathbf{1} \tag{1}$$
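As an illustration of Eq. 1, the following is a minimal sketch that approximates the NBT scores of a small undirected graph. It assumes that D holds the row sums of A on its diagonal and that A has a zero diagonal. Jacobi iteration is used here only because it is short to write down; it is an illustrative choice, not necessarily the solver used in this work, and the names `nbt_scores_jacobi` and `residual` are hypothetical.

```python
def nbt_scores_jacobi(adj, alpha, iters=200):
    """Approximate x in (I - alpha*A + alpha^2 (D - I)) x = (1 - alpha^2) 1.

    adj is a dense 0/1 adjacency matrix (list of lists) with a zero diagonal.
    Jacobi iteration is an assumption made for this sketch.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]  # row sums of A, i.e. the diagonal of D
    # Diagonal of the system matrix: 1 + alpha^2 * (d_i - 1).
    diag = [1.0 + alpha ** 2 * (deg[i] - 1) for i in range(n)]
    rhs = 1.0 - alpha ** 2
    x = [rhs] * n
    for _ in range(iters):
        # Off-diagonal entries are -alpha * A[i][j], so they move to the right-hand side.
        x = [
            (rhs + alpha * sum(adj[i][j] * x[j] for j in range(n) if j != i)) / diag[i]
            for i in range(n)
        ]
    return x


def residual(adj, alpha, x):
    """Largest absolute entry of M x - b for the system in Eq. 1."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    rhs = 1.0 - alpha ** 2
    worst = 0.0
    for i in range(n):
        mx = (1.0 + alpha ** 2 * (deg[i] - 1)) * x[i]
        mx -= alpha * sum(adj[i][j] * x[j] for j in range(n))
        worst = max(worst, abs(mx - rhs))
    return worst


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
scores = nbt_scores_jacobi(A, alpha=0.1)
```

With α = 0.1 the system matrix of this example is strictly diagonally dominant, so the iteration converges; larger values of α may require a different solver.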
Here, I is the n × n identity matrix and D is the diagonal matrix whose values on the diagonal are the row sums of the corresponding rows of A. The scores represent the weighted sum of NBT walks starting at each vertex, where walks are weighted by successive powers of some weighting parameter α. For the dynamic case, we seek (at any time t) the solution x_t to the system given in Eq. 2.

$$(I - \alpha A_t + \alpha^2 (D_t - I))\,x_t = (1 - \alpha^2)\,\mathbf{1} \tag{2}$$

2.2 Iterative Methods
Directly solving for the exact NBT scores is an extremely expensive computation on the order of O(n^3), which quickly becomes impractical as n grows large. In practice we use iterative methods to obtain an approximation to the exact solution, which typically costs O(m) assuming the number of iterations is not very large [5]. These methods are very fast and practical since many real-world graphs are sparse and m […]

[…] e2.time;
– Parallel flow F|(e1, e2): a flow with e1.time = e2.time.

With F→(n) we denote the set of all forward flows F→(e1, e2) with e1.dest = e2.source = n, i.e. passing through node n. Similar definitions hold for the sets F←(n) and F|(n) of backward and parallel flows passing through n, respectively. Starting from F→(n), F←(n) and F|(n), we can build the so-called temporal signature of n.

Definition 4 (Temporal signature). The temporal signature (or fingerprint) of a node n, S(n), is a triplet of integer numbers (f, b, p) where f = |F→(n)|, b = |F←(n)| and p = |F|(n)|.

1 This is equivalent to the notion of non-induced subgraph. For induced subgraphs, we have E_S = (V_S × V_S) ∩ E.
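The temporal signature of Definition 4 can be sketched as follows. Because the flow definitions are only partly legible above, this sketch assumes that a forward flow has e1.time < e2.time and a backward flow has e1.time > e2.time; the `Event` tuple and the function name `temporal_signature` are hypothetical and introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical event representation: a directed edge with a timestamp.
Event = namedtuple("Event", "source dest time")


def temporal_signature(events, n):
    """Return (f, b, p): counts of forward, backward and parallel flows through n.

    A flow is a pair (e1, e2) with e1.dest == e2.source == n.
    Assumption of this sketch: forward means e1.time < e2.time,
    backward means e1.time > e2.time, parallel means equal timestamps.
    """
    incoming = [e for e in events if e.dest == n]
    outgoing = [e for e in events if e.source == n]
    f = b = p = 0
    for e1 in incoming:
        for e2 in outgoing:
            if e1.time < e2.time:
                f += 1
            elif e1.time > e2.time:
                b += 1
            else:
                p += 1
    return (f, b, p)


events = [Event(1, 2, 1), Event(2, 3, 2), Event(4, 2, 5), Event(2, 5, 5)]
signature_of_2 = temporal_signature(events, 2)
```

For node 2 in this toy example, two pairs are forward, one is backward and one is parallel.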
TemporalRI
By comparing temporal signatures, we can derive the following partial order binary relation.

Definition 5 (Temporal inclusion). Given two nodes m and n, let S(m) = (f′, b′, p′) and S(n) = (f, b, p) be their temporal signatures. m is temporally included in n iff f′ ≤ f ∧ b′ ≤ b ∧ p′ ≤ p.

As described in Sect. 4, temporal inclusion is used in the computation of compatibility domains in order to prune the search space as soon as possible before the matching process starts. Finally, we define the Temporal Subgraph Isomorphism (TSI) problem.

Definition 6 (Temporal subgraph isomorphism). Let Q = (V_Q, E_Q) and T = (V_T, E_T) be two temporal graphs, named query and target, respectively. The Temporal Subgraph Isomorphism (TSI) problem aims to find an injective function f : V_Q → V_T, called mapping, which maps each node in Q to a node in T, such that the following conditions hold:
1. ∀ e_Q = (s, d, t_Q) ∈ E_Q : ∃ e_T = (f(s), f(d), t_T) ∈ E_T;
2. ∀ e_Q = (a, b, t_Q), e′_Q = (c, d, t′_Q) ∈ E_Q s.t. t_Q ≤ t′_Q : ∃ e_T = (f(a), f(b), t_T), e′_T = (f(c), f(d), t′_T) ∈ E_T s.t. t_T ≤ t′_T.

Condition 1 implies that each query event e_Q must have a match with a target event e_T, but not necessarily conversely. Condition 2 means that the chronological order derived from the timestamps of query events must be respected in the target too. So, the timestamps of query events are more like indexes that indicate in which order target events should happen. The TSI problem can have more than one solution, i.e. there may exist one or more mappings. Given a mapping f, a match of Q in T is the set of pairs of query and target matched nodes M = {(q1, f(q1)), (q2, f(q2)), ..., (qk, f(qk))}, where k = |V_Q|. An occurrence of Q in T is the temporal subgraph O of T formed by nodes f(q1), f(q2), ..., f(qk) and all edges e_T = (f(qi), f(qj), t_T) such that e_Q = (qi, qj, t_Q) ∈ E_Q for all 1 ≤ i < j ≤ k. In Fig. 1 there is one match of query Q in target T, M = {(1, 4), (2, 1), (3, 2), (4, 5), (5, 6)}, and nodes and edges of the corresponding occurrence are drawn in red.
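The two conditions of Definition 6 can be checked for a candidate mapping with a short sketch like the one below. It assumes at most one event per ordered node pair, so each graph is stored as a dictionary from (source, dest) to timestamp; the function name `is_temporal_match` and the toy graphs are hypothetical.

```python
def is_temporal_match(query_edges, target_edges, mapping):
    """Check the conditions of Definition 6 for a given node mapping.

    query_edges / target_edges: dict {(source, dest): timestamp}
    (assumption of this sketch: one event per ordered node pair).
    mapping: dict from query node to target node.
    """
    # The mapping must be injective.
    if len(set(mapping.values())) != len(mapping):
        return False
    matched = []
    for (s, d), t_query in query_edges.items():
        key = (mapping[s], mapping[d])
        if key not in target_edges:
            return False  # Condition 1 fails: a query event has no target event.
        matched.append((t_query, target_edges[key]))
    # Condition 2: the chronological order of query events is kept in the target.
    for tq1, tt1 in matched:
        for tq2, tt2 in matched:
            if tq1 <= tq2 and not tt1 <= tt2:
                return False
    return True


query = {("a", "b"): 1, ("b", "c"): 2}
target = {(1, 2): 10, (2, 3): 20, (3, 1): 5}
```

Here the query timestamps act only as an ordering: the mapping a→1, b→2, c→3 is accepted because 10 ≤ 20, while a→2, b→3, c→1 is rejected because the matched target events occur in the opposite order.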
3 RI Algorithm
TemporalRI is inspired by the RI algorithm for subgraph isomorphism in static graphs [2,3]. In this section we summarize the main steps of RI. In particular, we refer to RI-DS [3], which includes a preprocessing step based on the computation of compatibility domains to filter candidate pairs of query and target nodes for the matching.
G. Locicero et al.
The three main steps of RI are: i) computation of compatibility domains, ii) computation of the ordering of query nodes for the matching, iii) matching process. In the following subsections each of these steps will be detailed. For the description of RI we refer to a query graph Q = (V_Q, E_Q) with k nodes and a target graph T = (V_T, E_T).

3.1 Computation of Compatibility Domains
The first step of RI computes for each query node q the compatibility domain Dom(q), which is the set of nodes in the target graph that could match q based on node in- and out-degrees. This step speeds up the matching process, because only target graph nodes in Dom(q) are considered as possible candidates for a match to q during the search. Formally, a node t ∈ V_T is compatible to a node q ∈ V_Q iff: i) the in-degree of q is less than or equal to the in-degree of t, ii) the out-degree of q is less than or equal to the out-degree of t.

3.2 Ordering of Query Nodes
In the next step RI computes the order in which query nodes have to be processed for the search during the matching. The processing order is calculated without considering the target graph. The key idea is that query nodes which both have high degree and are highly connected to nodes already present in the partial ordering come earlier in the final ordering. Formally, let μ_{i−1} = (q1, q2, ..., q_{i−1}) be the partial ordering up to the (i − 1)th node, with i < k. Let U_{i−1} be the set of unordered query nodes, i.e. nodes that are not in the partial ordering μ_{i−1}. To choose the next node of the ordering, we define for each candidate query node q three sets:
1. V_{q,vis}: the set of nodes in μ_{i−1} that are neighbors of q;
2. V_{q,neig}: the set of nodes in μ_{i−1} that are neighbors of at least one node which is in U_{i−1} and is a neighbor of q;
3. V_{q,unv}: the set of nodes which are neighbors of q, are not in μ_{i−1} and are not neighbors of any node in μ_{i−1}.

The next node in the ordering is the one with: (i) the highest value of |V_{q,vis}|, (ii) in the case of a tie in (i), the highest value of |V_{q,neig}|, (iii) in the case of a tie in (ii), the highest value of |V_{q,unv}|. In case of a tie according to all criteria, the next node is chosen arbitrarily.

3.3 Matching Process
Following the previously defined ordering of query nodes, RI performs matching to find occurrences of the query within the target. The matching process starts with the first node of the ordering and an initially empty match M. If a new match between a query node q and a target node t is found, the pair (q, t) is added to M and RI continues the search with the next node in the ordering. If all query nodes have been matched, M constitutes a new match of Q in T, so it can be added to the list of matches found. Whenever all query nodes have been matched or no match has been found for a query node, the algorithm performs backtracking and continues the search from the last matched node. For the first node of the ordering the set of candidate target nodes for matching is its compatibility domain, while for any other query node q candidates are both: i) neighbors of the target node t that has already been matched to the previous query node in the ordering, ii) nodes that are compatible to q. The choice of whether or not to add a candidate pair (q, t) to M is made based on the following feasibility rules: i) t has not already been matched, ii) for every already mapped node q′ that is a neighbor of q, there must be an edge between t and f(q′) in T. The latter rule ensures the consistency of the partial mapping M in case the new pair (q, t) is added to M.
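The degree-based domains of Sect. 3.1 and the backtracking search of Sect. 3.3 can be sketched together as follows. This is a simplified illustration rather than the RI implementation: the node ordering is passed in by the caller instead of being computed as in Sect. 3.2, candidates are taken from the whole compatibility domain rather than from the neighbors of the previously matched target node, and all function names are hypothetical.

```python
def degrees(nodes, edges):
    """In- and out-degree of every node of a directed graph."""
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for s, d in edges:
        outdeg[s] += 1
        indeg[d] += 1
    return indeg, outdeg


def compatibility_domains(q_nodes, q_edges, t_nodes, t_edges):
    """Dom(q): target nodes whose in/out-degrees are at least those of q."""
    q_in, q_out = degrees(q_nodes, q_edges)
    t_in, t_out = degrees(t_nodes, t_edges)
    return {
        q: {t for t in t_nodes if q_in[q] <= t_in[t] and q_out[q] <= t_out[t]}
        for q in q_nodes
    }


def find_matches(order, q_edges, t_edges, dom):
    """Backtracking search over query nodes in the given order."""
    q_edges, t_edges = set(q_edges), set(t_edges)
    results = []

    def feasible(q, t, mapping):
        if t in mapping.values():  # rule i): t must not be matched already
            return False
        for q2, t2 in mapping.items():  # rule ii): edges to mapped neighbors
            if (q, q2) in q_edges and (t, t2) not in t_edges:
                return False
            if (q2, q) in q_edges and (t2, t) not in t_edges:
                return False
        return True

    def extend(i, mapping):
        if i == len(order):
            results.append(dict(mapping))
            return
        q = order[i]
        for t in sorted(dom[q]):
            if feasible(q, t, mapping):
                mapping[q] = t
                extend(i + 1, mapping)
                del mapping[q]  # backtrack

    extend(0, {})
    return results


q_nodes, q_edges = ["a", "b", "c"], [("a", "b"), ("b", "c")]
t_nodes, t_edges = [1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]
dom = compatibility_domains(q_nodes, q_edges, t_nodes, t_edges)
matches = find_matches(["a", "b", "c"], q_edges, t_edges, dom)
```

On this toy input the directed path a→b→c is found twice in the target path 1→2→3→4.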
4 TemporalRI Algorithm
Now we describe the TemporalRI algorithm, focusing on the main differences with RI. The general framework of TemporalRI and the computation of the ordering of query nodes for the matching are the same as in RI.

Compatibility Domains. To further reduce the size of the domains and prune the set of candidates during the search, we can exploit the concepts of temporal signature and temporal inclusion defined in Sect. 2. Indeed, provided that inDeg(q) ≤ inDeg(t) and outDeg(q) ≤ outDeg(t), if a query node q is not temporally included in a target node t, then at least one flow F_q (of a specific type) is missing in the target for node t. In fact, suppose F_q is a forward flow. It follows that there must exist at least one flow F_t for t of a different type (backward or parallel). Therefore, if we matched q to t, the chronological order for at least the two events of the flow would not be respected. In TemporalRI a node t ∈ V_T is compatible to a node q ∈ V_Q iff: i) the in-degree of q is less than or equal to the in-degree of t, ii) the out-degree of q is less than or equal to the out-degree of t, iii) q is temporally included in t.

Matching Process. Whenever a new pair (q, t) is added to the partial match M, we need to ensure that the chronological order of the edges of the partial occurrence O corresponding to M is respected. To this end we extend the set of feasibility rules. More specifically, let f be the partial mapping, E_{Q,map} the set of query edges between already mapped query nodes (including q) and E_O the set of edges of O. We consider, for each query edge (q, q′) linking q to an already mapped query node q′, its rank in E_{Q,map} (i.e. the position of the edge in the set E_{Q,map} sorted by timestamps) and compare it with the rank of the matched target edge (t, f(q′)) in E_O. If the two ranks are different then q cannot be matched to t. This condition, however, is applied locally and does not guarantee that the final occurrence O will respect the chronological order imposed by query edges.
For this reason, once all query nodes have been mapped, we also need to compare the rank of each edge (u, v) in Q with the rank of the corresponding edge (f (u), f (v)) in O. If all ranks are equal the match is added to the list of found matches.
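The two temporal additions of this section, the extended compatibility test and the final rank comparison, can be sketched as follows. The sketch assumes that signatures are (f, b, p) triplets, that graphs are dictionaries from (source, dest) to timestamp with one event per node pair, and that equal timestamps share the same rank; the rank convention and all function names are assumptions made here for illustration.

```python
def temporally_compatible(q_deg, t_deg, q_sig, t_sig):
    """Compatibility test of TemporalRI for a query node q and a target node t.

    q_deg / t_deg: (in_degree, out_degree); q_sig / t_sig: (f, b, p).
    """
    degrees_ok = q_deg[0] <= t_deg[0] and q_deg[1] <= t_deg[1]
    included = all(qs <= ts for qs, ts in zip(q_sig, t_sig))  # temporal inclusion
    return degrees_ok and included


def ranks(edge_times):
    """Rank of each edge by timestamp (assumption: ties share a rank)."""
    distinct = sorted(set(edge_times.values()))
    position = {t: r for r, t in enumerate(distinct)}
    return {e: position[t] for e, t in edge_times.items()}


def same_chronological_order(query_times, target_times, mapping):
    """Compare the rank of every query edge with the rank of its matched edge.

    Assumes every mapped query edge exists in target_times.
    """
    occurrence = {
        (mapping[u], mapping[v]): target_times[(mapping[u], mapping[v])]
        for (u, v) in query_times
    }
    q_rank = ranks(query_times)
    o_rank = ranks(occurrence)
    return all(
        q_rank[(u, v)] == o_rank[(mapping[u], mapping[v])] for (u, v) in query_times
    )


query_times = {("a", "b"): 1, ("b", "c"): 2}
target_times = {(1, 2): 10, (2, 3): 20, (3, 1): 5}
```

A mapping that preserves the order of events passes the rank comparison, while one that reverses it is discarded even though all matched edges exist in the target.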
5 Experimental Results
In this section we evaluate the performance of TemporalRI, considering a dataset of 8 real networks of small and medium size taken from Network Repository [26]. Table 1 lists the networks used and their main features. For each network we report the number of nodes |V|, the number of edges |E|, the maximum degree d_max and the average degree d_avg.

Table 1. Dataset for the experiments.

Network                   |V|       |E|       d_max    d_avg
SFHH-conf-sensor          403       70.3K     2.4K     348
email-dnc                 1.9K      37.4K     5.5K     40
edit-enwikibooks          8.4K      162K      11.3K    38
ia-contacts-dublin        11K       415.9K    616      75
ca-cit-HepPh              28.1K     4.6M      11.1K    327
fb-wosn-friends           63.7K     1.3M      1.8K     39
ia-enron-email            87K       1M        39K      26
soc-epinions-trust-dir    131.6K    840.8K    3.6K     12
For the comparison with other tools, we focused only on exact motif counting or subgraph isomorphism algorithms having the same or a very similar definition of temporal queries [17,20,28]. Among these, Mackey’s algorithm [17] was the only one we managed to compare with TemporalRI. For all other tools, no comparison was possible. Sun et al. [28] did not provide any implementation of their algorithm. The SNAP temporal algorithm [20] works for specific 2-node and 3-node queries only. The general algorithm described in their paper is not actually temporal, because it performs subgraph matching in a static graph and then, in a post-processing step, removes the occurrences that do not match the temporal constraints.

TemporalRI has been implemented in Java and is available at https://github.com/josura/TemporalRI. All experiments have been performed on an ACER Nitro 5 AN515-52, with an Intel Core 8300H and 16 GB of RAM. We randomly extracted from each network queries with k = 4, 8, 16, 32, 48 nodes. We considered 50 queries for each value of k. Figure 2 shows boxplots of running times measured for both TemporalRI and Mackey’s algorithm for the 4 smallest networks of the dataset. Figures 2b and 2c show that for email-dnc and edit-enwikibooks, which are small and sparse networks, Mackey’s algorithm
is generally faster, even though the difference between the running times of the two algorithms is a few seconds. However, when we consider the other two small but denser networks, SFHH-conf-sensor and ia-contacts-dublin (Figs. 2a and 2d), the performance of Mackey’s algorithm begins to degrade when queries become large. This could also be related to memory leak issues in the algorithm, which appear to be more evident when we increase the size of targets. Indeed, for the 4 biggest networks of the dataset, Mackey’s algorithm always ran out of memory. For this reason, concerning these networks, we report the running times of TemporalRI only (Fig. 3). Results on the biggest networks show that the running time of TemporalRI increases exponentially with the query size. However, even for large queries our algorithm finishes within a few minutes, and is therefore quite efficient.
Fig. 2. Running times of TemporalRI and Mackey’s algorithm for the 4 smallest real networks: a) SFHH-conf-sensor, b) email-dnc, c) edit-enwikibooks, d) ia-contacts-dublin.
Fig. 3. Running times of the TemporalRI algorithm for the 4 biggest real networks, where Mackey’s algorithm ran out of memory: a) ca-cit-HepPh, b) fb-wosn-friends, c) ia-enron-email, d) soc-epinions.
6 Conclusions
Temporal graphs describe dynamic systems, where nodes are entities and edges represent interactions between entities which take place at a specific time. In this paper we focused on the Temporal Subgraph Isomorphism (TSI) problem, which consists in finding all the occurrences of a small temporal graph (the query) in a larger temporal graph (the target), such that the timestamps of target edges follow the same temporal order as the corresponding matched query edges. TSI can be considered a baseline framework to study related problems, such as motif search, anomaly detection and evaluation of node centralities. We introduced a novel algorithm, TemporalRI, to solve the TSI problem. We showed that TemporalRI is general and therefore suitable to be run on queries of any size and topology. In addition, it is several orders of magnitude faster and more memory-efficient than the state of the art when executed on large queries and large networks. In an extended version of the paper we plan to conduct a comprehensive experimental analysis on both real and synthetic networks. We note that the concept of temporal flow and the techniques introduced to prune the search space and speed up the process, both in the preprocessing and in the matching, are general and can therefore, in principle, be plugged into any subgraph matching algorithm to handle the TSI problem.
For the future, we aim to further extend our framework, introducing multiple events between nodes and giving the user the possibility to define additional constraints on the time difference between adjacent events. We also plan to optimize TemporalRI and implement it on top of a SPARK framework to deal with very large networks.
References

1. Bi, F., Chang, L., Lin, X., Qin, L., Zhang, W.: Efficient subgraph matching by postponing cartesian products. In: SIGMOD 2016, pp. 1199–1214 (2016). https://doi.org/10.1145/2882903.2915236
2. Bonnici, V., Giugno, R., Pulvirenti, A., Shasha, D., Ferro, A.: A subgraph isomorphism algorithm and its application to biochemical data. BMC Bioinform. 14(Suppl 7), 1–13 (2013). https://doi.org/10.1186/1471-2105-14-S7-S13
3. Bonnici, V., Giugno, R.: On the variable ordering in subgraph isomorphism algorithms. IEEE/ACM Trans. Comput. Biol. Bioinform. 14(1), 193–203 (2017). https://doi.org/10.1109/TCBB.2016.2515595
4. Cordella, L.P., Foggia, P., Sansone, C., Vento, M.: A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26(10), 1367–1372 (2004). https://doi.org/10.1109/TPAMI.2004.75
5. Crawford, J., Milenkovic, T.: ClueNet: Clustering a temporal network based on topological similarity rather than denseness. PLoS ONE 13(5), e0195993 (2018). https://doi.org/10.1371/journal.pone.0195993
6. Divakaran, A., Mohan, A.: Temporal link prediction: a survey. New Gener. Comput. 38, 213–258 (2020). https://doi.org/10.1007/s00354-019-00065-z
7. Han, W., Lee, J., Lee, J.H.: Turboiso: towards ultrafast and robust subgraph isomorphism search in large graph databases. In: SIGMOD 2013, pp. 337–348 (2013). https://doi.org/10.1145/2463676.2465300
8. Han, M., Kim, H., Gu, G., Park, K., Han, W.: Efficient subgraph matching: harmonizing dynamic programming, adaptive matching order, and failing set together. In: SIGMOD 2019, pp. 1429–1446 (2019). https://doi.org/10.1145/3299869.3319880
9. Hiraoka, T., Masuda, N., Li, A., Jo, H.: Modeling temporal networks with bursty activity patterns of nodes and links. Phys. Rev. Res. 2(2), 023073 (2020). https://doi.org/10.1103/PhysRevResearch.2.023073
10. Holme, P., Saramaki, J.: Temporal networks. Phys. Rep. 519, 97–125 (2012). https://doi.org/10.1016/j.physrep.2012.03.001
11. Holme, P., Saramaki, J.: Temporal Network Theory. Springer, Heidelberg (2019). https://doi.org/10.1007/978-3-030-23495-9
12. Hulovatyy, Y., et al.: Exploring the structure and function of temporal networks with dynamic graphlets. Bioinformatics 31(12), i171–i180 (2015). https://doi.org/10.1093/bioinformatics/btv227
13. Kim, K., Seo, I., Han, W.S., Lee, J.H., Hong, S., Chafi, H., Shin, H., Jeong, G.: TurboFlux: a fast continuous subgraph matching system for streaming graph data. In: SIGMOD 2018, pp. 411–426 (2018). https://doi.org/10.1145/3183713.3196917
14. Lv, L., et al.: PageRank centrality for temporal networks. Phys. Lett. A 383(12), 1215–1222 (2019). https://doi.org/10.1016/j.physleta.2019.01.041
15. Kovanen, L., Karsai, M., Kaski, K., Kertész, J., Saramaki, J.: Temporal motifs in time-dependent networks. J. Stat. Mech. Theor. Exp. 2011(11), P11005 (2011). https://doi.org/10.1088/1742-5468/2011/11/p11005
16. Liu, P., Benson, A.R., Charikar, M.: Sampling methods for counting temporal motifs. In: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 294–302 (2019). https://doi.org/10.1145/3289600.3290988
17. Mackey, P., Porterfield, K., Fitzhenry, E., Choudhury, S., Chin, G.: A chronological edge-driven approach to temporal subgraph isomorphism. In: 2018 IEEE International Conference on Big Data (Big Data), pp. 3972–3979 (2018). https://doi.org/10.1109/BigData.2018.8622100
18. Masuda, N., Lambiotte, R.: A Guide to Temporal Networks. World Scientific, London (2016). https://doi.org/10.1142/q0268
19. Masuda, N., Holme, P.: Small inter-event times govern epidemic spreading on networks. Phys. Rev. Res. 2(2), 023163 (2020). https://doi.org/10.1103/PhysRevResearch.2.023163
20. Paranjape, A., Benson, A.R., Leskovec, J.: Motifs in temporal networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 601–610 (2017). https://doi.org/10.1145/3018661.3018731
21. Petit, J., Gueuning, M., Carletti, T., Lauwens, B., Lambiotte, R.: Random walk on temporal networks with lasting edges. Phys. Rev. E 98(5), 052307 (2018). https://doi.org/10.1103/PhysRevE.98.052307
22. Redmond, U., Cunningham, P.: Temporal subgraph isomorphism. In: Proceedings of 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 1451–1452 (2013). https://doi.org/10.1145/2492517.2492586
23. Redmond, U., Cunningham, P.: Subgraph isomorphism in temporal networks. arXiv preprint arXiv:1605.02174 (2016)
24. Rocha, L.E., Masuda, N., Holme, P.: Sampling of temporal networks: methods and biases. Phys. Rev. E 96(5), 052302 (2017). https://doi.org/10.1103/PhysRevE.96.052302
25. Rossetti, G., Cazabet, R.: Community discovery in dynamic networks: a survey. ACM Comput. Surv. 51(2) (2018). https://doi.org/10.1145/3172867
26. Rossi, R.A., Ahmed, N.K.: The network data repository with interactive graph analytics and visualization. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, pp. 4292–4293 (2015). https://doi.org/10.5555/2888116.2888372
27. Singh, E.A., Cherifi, H.: Centrality-based opinion modeling on temporal networks. IEEE Access 8, 1945–1961 (2020). https://doi.org/10.1109/ACCESS.2019.2961936
28. Sun, X., Tan, Y., Wu, Q., Wang, J., Shen, C.: New algorithms for counting temporal graph pattern. Symmetry 11(10), 1188 (2019). https://doi.org/10.3390/sym11101188
29. Sun, X., Tan, Y., Wu, Q., Chen, B., Shen, C.: TM-Miner: TFS-based algorithm for mining temporal motifs in large temporal network. IEEE Access 7, 49778–49789 (2019). https://doi.org/10.1109/ACCESS.2019.2911181
30. Sun, S., Luo, Q.: Subgraph matching with effective matching order and indexing. Trans. Knowl. Data Eng. (2020). https://doi.org/10.1109/TKDE.2020.2980257
31. Tizzani, M., Lenti, S., Ubaldi, E., Vezzani, A., Castellano, C., Burioni, R.: Epidemic spreading and aging in temporal networks with memory. Phys. Rev. E 98(6), 062315 (2018). https://doi.org/10.1103/PhysRevE.98.062315
32. Torricelli, M., Karsai, M., Gauvin, L.: weg2vec: Event embedding for temporal networks. Sci. Rep. 10, 7164 (2020). https://doi.org/10.1038/s41598-020-63221-2
33. Tsalouchidou, I., et al.: Temporal betweenness centrality in dynamic graphs. Int. J. Data Sci. Anal. 9, 257–272 (2020). https://doi.org/10.1007/s41060-019-00189-x
34. Williams, O.E., Lillo, F., Latora, V.: Effects of memory on spreading processes in non-Markovian temporal networks. New J. Phys. 21(4), 043028 (2019). https://doi.org/10.1088/1367-2630/ab13fb
StreamFaSE: An Online Algorithm for Subgraph Counting in Dynamic Networks

Henrique Branquinho1, Luciano Grácio1,2, and Pedro Ribeiro1,2
1 DCC-FCUP, Universidade do Porto, Porto, Portugal
{hbranquinho,lgracio}@fc.up.pt
2 CRACS & INESC-TEC, Porto, Portugal

Abstract. Counting subgraph occurrences in complex networks is an important analytical task with applicability in a multitude of domains such as sociology, biology and medicine. This task is a fundamental primitive for concepts such as motifs and graphlet degree distributions. However, there is a lack of online algorithms for computing and updating subgraph counts in dynamic networks. Some of these networks exist as a stream of edge additions and deletions that are registered as they occur in the real world. In this paper we introduce StreamFaSE, an efficient online algorithm for keeping track of exact subgraph counts in dynamic networks, and we explain in detail our approach, showcasing its general applicability in different network scenarios. We tested our method on a set of diverse real directed and undirected network streams, showing that we are always faster than the current existing methods for this task, achieving several orders of magnitude speedup when compared to a state-of-the-art baseline.

Keywords: Subgraph census · Dynamic networks · Temporal networks · Streaming · Online algorithm · Motifs · Graphlets

1 Introduction
Complex networks are the mathematical tools that model interaction-based real-world systems, allowing the development of abstract tasks to extract useful information from them. One of these tasks is called Subgraph Census and consists in counting how many induced subgraphs of each isomorphic class exist in the network. However, this task is known to be hard, demanding the development of efficient algorithms. Classical methods for computing Subgraph Census on a network require the whole network to be known beforehand. However, this is often not the case, as many real-world systems are constantly undergoing change (e.g. friendships in social networks, financial transaction networks and routes in packet switching networks). Not only that, but analysing dynamic networks poses a number of additional challenges when compared to static ones. For example, the number of edges may build up over time to unmanageable dimensions [9]. Here, we present StreamFaSE, an online algorithm that keeps track of subgraph counts in dynamic networks. The core of our method lies in identifying the region of the network that is affected by each update and restricting the count of subgraphs to that same region.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021. R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 688–699, 2021. https://doi.org/10.1007/978-3-030-65351-4_55
2 Preliminaries
A graph G = (V, E) is a tuple of vertices V = {v1, v2, ..., vn} and edges E represented as a set of pairs of vertices. In a directed graph, an edge (v1, v2) is considered to have its origin in v1. The size of the graph is determined by the number of vertices, denoted as |V(G)|. A k-graph is a graph of size k. A subgraph G_k of a graph G is a k-graph in which V(G_k) ⊆ V(G) and E(G_k) ⊆ E(G). A subgraph is said to be induced if ∀v, w ∈ V(G_k) : (v, w) ∈ E(G) ⇒ (v, w) ∈ E(G_k). Two graphs G and H are isomorphic, denoted as G ∼ H, if there is a bijection between V(G) and V(H) such that two vertices in G are adjacent if and only if their corresponding vertices are adjacent in H.

The neighbourhood of a vertex v ∈ V(G) is the set of all vertices that share an edge with v and is defined as N(v) ≡ {w : (v, w) ∈ E}. We say these vertices are adjacent to v. The neighbourhood of a set of vertices is the union of the neighbourhoods of each of them and is denoted as N({v1, v2, ..., vn}) ≡ N(v1) ∪ N(v2) ∪ ... ∪ N(vn). The k-neighbourhood of a vertex v is the subgraph induced by all vertices whose distance to v is less than or equal to k. The k-neighbourhood of a set of vertices is the union of the k-neighbourhoods of each vertex in the set. The exclusive neighbourhood of a vertex v with respect to a subgraph S is defined as N_exc(v, S) = {u : u ∈ N(v) ∧ u ∉ N(S) ∧ u ∉ S}. In simpler terms, it consists of the neighbours of v that are neither in S nor adjacent to any vertex in S.

A graph stream consists of an initial graph G0 and a list of updates. These updates can be of four types: vertex addition, vertex removal, edge addition and edge removal. We focus on edge operations, and denote edge additions and removals as +(v1, v2) and −(v1, v2), respectively.

2.1 Problem Definition
Problem 1 (Subgraph Census). Given a network N and an integer k, determine the frequencies of all its connected induced k-subgraphs. Two occurrences are considered different if they differ in at least one node.

Problem 2 (Streaming Subgraph Census). Given a network N, an integer k, a solution to Subgraph Census(N, k) and a stream of edge updates S = (e1, e2, ..., en), compute the Subgraph Census of all networks S_N = (e1(N), e2(e1(N)), ...) resulting from successively performing each operation in S on N.

In this paper we propose a solution to Problem 2.
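To make the two problems concrete, the following is a naive brute-force sketch for small undirected graphs. It is not the StreamFaSE algorithm: it enumerates every k-subset of vertices and recomputes the census from scratch after each update, which is exactly the cost an online algorithm aims to avoid. Isomorphism classes are identified by trying all relabelings, which is only practical for very small k, and all function names are hypothetical.

```python
from itertools import combinations, permutations


def canonical_form(nodes, edges):
    """Smallest sorted edge tuple over all relabelings (isomorphism class id)."""
    nodes = list(nodes)
    best = None
    for perm in permutations(range(len(nodes))):
        relabel = dict(zip(nodes, perm))
        form = tuple(sorted(tuple(sorted((relabel[u], relabel[v]))) for u, v in edges))
        if best is None or form < best:
            best = form
    return best


def is_connected(nodes, edges):
    """Depth-first search over an undirected edge list."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == u and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes


def subgraph_census(vertices, edges, k):
    """Count connected induced k-subgraphs per isomorphism class (Problem 1)."""
    edge_set = {frozenset(e) for e in edges}
    counts = {}
    for subset in combinations(sorted(vertices), k):
        induced = [(u, v) for u, v in combinations(subset, 2)
                   if frozenset((u, v)) in edge_set]
        if is_connected(subset, induced):
            key = canonical_form(subset, induced)
            counts[key] = counts.get(key, 0) + 1
    return counts


TRIANGLE = ((0, 1), (0, 2), (1, 2))
PATH = ((0, 1), (0, 2))

vertices = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
before = subgraph_census(vertices, edges, 3)
# Problem 2, done naively: apply the update +(1, 3) and recount everything.
after = subgraph_census(vertices, edges + [(1, 3)], 3)
```

In this example the edge addition turns one disconnected triple into a path and one path into a triangle, so only the subgraphs containing both endpoints of the new edge change.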
2.2 Related Work
2.2.1 Taxonomy

The concept of subgraphs in static networks can be mapped, in at least two different meaningful ways, to dynamic networks: (1) Dynamic subgraphs consist in incorporating the dynamic property of the network in the subgraphs. Although the addition of temporal information can bring additional challenges, like unmanageable numbers of temporal edges and the need to account for patterns that occur at different time scales, the study of dynamic subgraphs can bring further insights into temporal phenomena [9]. (2) The second mapping consists in unfolding the dynamic network into a series of static snapshots. In this model, static subgraphs keep their traditional meaning. In this paper, we address the problem of counting static subgraphs in dynamic networks.

Extensive work has been done to solve the static subgraph census problem. We refer to a survey by Ribeiro et al. [11] for a better insight. The authors propose a taxonomy that accounts for the existing variations of the problem. Regarding the cardinality of subgraphs counted, algorithms exist to solve three variations of the subgraph census. (1) Subgraph centric algorithms count the number of occurrences of a single subgraph. (2) Set centric approaches count the occurrences of a given set of subgraphs, and (3) Network centric algorithms count the occurrences of all existing subgraphs. Note that any subgraph centric approach can be computed for every subgraph, resulting in a network centric algorithm. Similarly, set centric approaches are trivially reduced to network centric ones by considering the set of all possible subgraphs. Precision-wise, algorithms follow one of two approaches. (1) Exact counting algorithms compute the exact frequencies of the subgraphs, while (2) approximation based algorithms use sampling and analytical estimators to compute approximated solutions. Note that approximation based solutions deal with a simpler version of the subgraph census problem since they scale differently with the size of the network. We propose that the same taxonomy be used when considering the streaming subgraph census problem. Taking this into account, in this paper we present StreamFaSE, a network centric, exact counting algorithm for the streaming subgraph census problem.

2.2.2 Relevant Static Algorithms

Wernicke developed a network-centric, exact-counting and enumeration-based algorithm, called ESU [14], that enumerates subgraphs in a network through a backtracking approach, by performing a depth-first search on each network node. ESU uses two vertex sets: V_subgraph or V_sub, to which vertices are added until it contains k vertices that form a connected induced subgraph; and V_extension or V_ext, which contains vertices that can be added to V_sub. Whenever ESU finishes enumerating a subgraph, a tool called nauty [7] is used to compute the isomorphic class of the enumerated subgraph. When a vertex is added to V_sub, all of its neighbours are added to V_ext, provided they
are not already in Vext. Vertices are labeled with integers, so that no vertex can be added to Vsub if its label is less than Vsub[0]. These two conditions avoid duplicates and repeated computation. ESU implicitly creates a recursion tree.
FaSE [10], developed by our group, is another network-centric, exact-counting and enumeration-based algorithm that takes advantage of the recursion tree implicitly created by ESU. FaSE combines this enumeration with a data structure called g-trie [12], a structure that can store and compress multiple graphs in a prefix-tree manner by taking advantage of common substructures. Each level i of the tree represents i-graphs, and the descendants of a g-trie node all share a common substructure. Throughout the enumeration process of ESU, FaSE creates paths in a g-trie, up to a depth k. A leaf node in a g-trie represents a k-subgraph. At the end of the enumeration process, isomorphism tests are performed on the leaf nodes of the g-trie using nauty [7]. By using a g-trie to encapsulate isomorphism information, isomorphism tests only need to take place at the end of the enumeration, which results in large speedups. However, this approach allows redundant information to be created in the g-trie (many leaf nodes represent the same isomorphic class), and due to the combinatorial explosion of subgraph types as subgraph sizes grow, FaSE is currently limited to subgraphs with size up to 19. For a better understanding of g-tries, we refer to [12]. FaSE is extensible to many kinds of graphs, such as directed and colored ones, and also provides a sampling option. Our approach is an adaptation of FaSE to streaming networks that accounts for undirected and directed graphs.
2.2.3 Relevant Dynamic Algorithms
Regarding the streaming subgraph census, Xiaowei Chen and John C. S.
Lui propose an approximation-based algorithm that generates samples by leveraging consecutive steps of random walks and the observation of the neighbours of visited nodes [2]. This algorithm falls outside the scope of this paper, since it is an approximate solution. Another approximate solution has been proposed by Al-Thaedan and Carvalho [1], which consists in inferring subgraph frequencies from the exact frequencies of smaller subgraphs by using the Pascal triangle. An exact subgraph-centric approach has been developed by Mukherjee et al. [8], based on keeping in memory an association between edges and every occurrence of the target subgraph, trading memory for efficiency. To the best of our knowledge, only one network-centric, exact-counting solution exists. Schiller [13] presented StreaM-k, an algorithm that updates subgraph frequencies for every update by retrieving the (k − 2)-neighbourhood of an updated edge and using a small adjacency matrix-like representation of each affected subgraph in that neighbourhood to identify the isomorphic classes of the subgraph before and after the update. This correspondence between adjacency codes and isomorphic classes is computed in advance and stored. The implementation made available only accounts for subgraphs up to size 7. Moreover, the implementation does not account for directed graphs, unlike our approach.
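The ESU enumeration summarised in Sect. 2.2.2 can be sketched as follows for undirected networks. This is a hedged illustration of Wernicke's scheme, not the FaSE or StreamFaSE implementation: the generator name `esu`, the adjacency-dictionary input and the use of Python sets are assumptions, and no isomorphism classification is performed.

```python
def esu(adj, k):
    """Yield every connected induced k-subgraph (as a frozenset) exactly once."""

    def extend(v_sub, v_ext, root):
        if len(v_sub) == k:
            yield frozenset(v_sub)
            return
        v_ext = set(v_ext)  # work on a copy so the caller's set is untouched
        while v_ext:
            w = v_ext.pop()  # remove an arbitrary vertex from the extension set
            # Vertices already in the subgraph or adjacent to it.
            closed = set(v_sub)
            for s in v_sub:
                closed |= adj[s]
            # Exclusive neighbours of w: larger label than the root and not
            # reachable from the current subgraph in one step.
            exclusive = {u for u in adj[w] if u > root and u not in closed}
            yield from extend(v_sub | {w}, v_ext | exclusive, root)

    for v in sorted(adj):
        yield from extend({v}, {u for u in adj[v] if u > v}, v)

# Illustrative network: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
found = list(esu(adj, 3))
```

The two conditions mentioned in the text appear as the `u > root` test and the `closed` set; together they ensure that each vertex set is produced by exactly one branch of the recursion tree.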
3 Method
The core of our approach consists in exploring only the area of the network affected by an update. Recomputing subgraph frequencies for the whole network, for every update, is computationally expensive and unnecessary. Therefore, we devised a depth-first search algorithm that only reaches the (k − 2)-neighbourhood of the vertices of the updated edge. We can guarantee that no additional vertices are needed: since we are counting k-subgraphs, any k-subgraph that includes an edge (v1, v2) must be composed of vertices that are at a maximum distance of k − 2 from either v1 or v2. Figure 1 illustrates this idea.
Fig. 1. An example of the core idea of StreamFaSE for k = 3. When adding/deleting the green edge (1, 2), the subgraph induced on the blue vertices is sufficient to compute the topology changes. This corresponds to the 1-neighbourhood of {1, 2}.
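The affected region can be computed with a breadth-first search bounded at depth k − 2, started from both endpoints at once. The sketch below is an illustration under assumptions (the name `affected_region` and the adjacency-dictionary input are not from the paper); StreamFaSE itself explores this region implicitly during its depth-first enumeration rather than materialising it.

```python
from collections import deque

def affected_region(adj, v1, v2, k):
    """Vertices within distance k - 2 of v1 or v2 (endpoints included)."""
    dist = {v1: 0, v2: 0}
    queue = deque([v1, v2])
    while queue:
        u = queue.popleft()
        if dist[u] == k - 2:
            continue  # do not expand beyond the (k - 2)-neighbourhood
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

# Illustrative path network 0-1-2-3-4-5 with updated edge (2, 3).
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
region_k3 = affected_region(path, 2, 3, 3)
region_k4 = affected_region(path, 2, 3, 4)
```

For k = 3 only the direct neighbours of the endpoints are included, which matches the 1-neighbourhood shown in Fig. 1.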
Furthermore, even though we are restricting the search space to the affected area, we still need a way to efficiently figure out which isomorphic classes correspond to the subgraphs that existed before and after the update, in order to correctly update their frequencies. FaSE does this by maintaining a path in the g-trie until a leaf node is reached. However, in a streaming environment, we need an efficient way to determine the isomorphic classes of the affected subgraphs both before and after the update. StreamFaSE performs both tasks during the same subnetwork traversal. This translates into verifying whether each k-subgraph is connected with and without the (added/deleted) edge. If a subgraph is connected in both its updated and non-updated versions, then we know that the update (either an edge addition or deletion) changed the topology of the subgraph. As a consequence, we decrease the count of the old subgraph and increase the count of the new one. If the edge addition (deletion) connects (disconnects) a previously disconnected (connected) subgraph, then we only increment (decrement) the count of the new (old) subgraph. Our approach lies in verifying whether the subgraph is connected with and without the updated edge while enumerating vertices in the (k − 2)-neighbourhood of the updated edge. This is done through a process we call origin propagation.
Considering the update of an edge (v1, v2), we define the origin of a vertex u in the depth-first search as: O(u) = (N(u) ∩ {v1, v2}) ∪ O(parent(u)). In simpler terms, the origin of a vertex u is the subset of {v1, v2} that is reachable from u within the subgraph induced on the current enumeration, taking into account that the updated edge may not be used. This is depicted in Fig. 2. Initially, we define the origin of a vertex v as the set of initial vertices (v1 or v2) from which it was obtained during the search-tree growth. When other vertices are expanded from v, its origin is propagated (i.e. inherited), so these newly expanded vertices will (at least) also have v's origin. In summary, we enumerate all k-subgraphs that belong to the (k − 2)-neighbourhood of the vertices in the updated edge and, by propagating origins, we verify whether each subgraph is connected without the updated edge in order to update subgraph frequencies. We keep two paths in a g-trie to deal with the identification of isomorphic classes: one considering the updated edge, and another one not considering it. This is a modification of the original g-trie structure, as disconnected subgraphs will be present in the tree, and more nodes will be created. However, leaf nodes still only represent connected k-subgraphs. Because the g-trie is kept and updated at all times, all possible paths are built early in the computation. Since isomorphism testing is only done once for each leaf node of the tree, no more isomorphism tests need to take place once the tree is fully built. Empirically, we verified that the total time spent in isomorphism testing is negligible when compared to the total run time.
Fig. 2. An example of how origin propagation is used to determine the connectedness of subgraphs with k = 3. S1, S2 and S3 are obtained during the search. Disregarding the updated edge (1, 2), only S2 is connected. We conclude that (1) if the update is an addition S1 and S3 are new occurrences. (2) If the update is a deletion S1 and S3 are no longer connected. (3) Subgraph S2 was connected before and after the update.
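The meaning of an origin can be illustrated with a plain reachability search that is forbidden from using the updated edge. This sketch computes origins from scratch for a given vertex set, whereas StreamFaSE obtains them incrementally by propagation during the enumeration; the function name `origin` and the small network below are hypothetical and only mirror the situation described in Fig. 2, not its actual drawing.

```python
def origin(u, nodes, adj, edge):
    """Endpoints of `edge` reachable from u inside `nodes` without using `edge`."""
    v1, v2 = edge
    nodes = set(nodes)
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for y in adj[x] & nodes:
            if {x, y} == {v1, v2}:
                continue  # the updated edge may not be used
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen & {v1, v2}

# Hypothetical network: updated edge (1, 2); vertex 3 touches only 1,
# vertex 4 touches both endpoints, vertex 5 touches only 2.
adj = {1: {2, 3, 4}, 2: {1, 4, 5}, 3: {1}, 4: {1, 2}, 5: {2}}
o_s1 = origin(3, {1, 2, 3}, adj, (1, 2))
o_s2 = origin(4, {1, 2, 4}, adj, (1, 2))
o_s3 = origin(5, {1, 2, 5}, adj, (1, 2))
```

A subgraph that is connected with the edge stays connected without it exactly when some vertex has both endpoints in its origin, which here holds only for the set {1, 2, 4}.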
3.1 Detailed Description
Algorithm. StreamFaSE
Input: A network N, an integer k, a stream of updates S and a g-trie
Output: k-subgraphs' counts after each update
procedure StreamFaSE(N, k, S)
    for all (op, v1, v2) ∈ S do
        if op is addition then
            N ← N + (v1, v2)
        VSub ← {v1, v2}
        VExt ← (neighbors(v1) ∪ neighbors(v2)) \ {v1, v2}
        is_connected ← false
        if N is directed and (v2, v1) ∈ E(N) then
            is_connected ← true
        Initialize g-trie paths
        for all u ∈ VExt do
            O(u) ← {v1, v2} ∩ neighbors(u)
        DFS_UPDATE(VSub, VExt, op, k, is_connected, O, g-trie paths)
        if op is deletion then
            N ← N − (v1, v2)

procedure DFS_UPDATE(VSub, VExt, op, k, is_connected, O, g-trie paths)
    if |VSub| = k then
        Retrieve isomorphic class S1 of VSub from the g-trie path
        if is_connected then
            Retrieve isomorphic class S2 of VSub − (v1, v2) from the g-trie path
        if op is addition then
            Increment frequency of S1
            Decrement frequency of S2
        else
            Decrement frequency of S1
            Increment frequency of S2
    else
        while VExt ≠ ∅ do
            Remove an arbitrarily chosen vertex w from VExt
            VSub ← VSub ∪ {w}
            if O(w) = {v1, v2} then
                is_connected ← true
            for all u ∈ exclusive_neighbors(w, VSub) do
                VExt ← VExt ∪ {u}
                O(u) ← O(u) ∪ O(w)
            Update g-trie paths
            DFS_UPDATE(VSub, VExt, op, k, is_connected, O, g-trie paths)
            Reset changes made to VExt and O
For every update to an edge (v1, v2), the first thing we do is include v1 and v2 in Vsub (the vertex set of the current subgraph being enumerated). The neighbours of v1 and v2 are added to Vext (the set of vertices that can still be added to the subgraph). While adding these neighbours to Vext, we also define each vertex's origin as {v1}, {v2} or {v1, v2}, according to which endpoints it is adjacent to. Furthermore, we initialize two paths in the g-trie, as explained above. A boolean flag, is_connected, declares whether the current subgraph being enumerated is connected without the updated edge. Initially, this flag is set to false. However, a special verification is needed for directed graphs: if an edge (v2, v1) exists, we can be sure that every subgraph enumerated is also connected without the edge (v1, v2).
StreamFaSE enumerates subgraphs on the target graph considering the existence of the updated edge at all times. This means that if the update is an edge addition, we first add the edge and then run StreamFaSE to enumerate subgraphs. Subgraphs that exist with the edge (and are detected by the normal FaSE procedure) have their frequency incremented; if the subgraphs are also connected without the edge (which is detected through origin propagation), the frequency of the version without the edge is decremented. If the update is an edge deletion, we first run StreamFaSE and delete the edge afterwards. The frequencies are then updated in the inverse manner with respect to edge additions.
After this initialization procedure, a recursive function (DFS_UPDATE) that enumerates subgraphs in a depth-first manner is used to explore the (k − 2)-neighbourhood of the updated edge. The base case of the recursion is reached when Vsub has a size of k, meaning we have finished enumerating a k-subgraph. In every call of DFS_UPDATE, we iteratively remove every vertex v3 from Vext and add it to the subgraph (to Vsub). When adding the vertex to the subgraph, we verify whether it can be reached from both v1 and v2, by checking whether its origin is {v1, v2}. In that case, the flag that indicates that the subgraph is connected without the updated edge is set to true. Then, we iterate over v3's neighbours, in order to propagate origins and verify whether they can be added to Vext (if they are already in Vext, they are not added again). For every neighbour v4 of v3, we set its origin as origin(v4) ← origin(v4) ∪ origin(v3), as we know that v4 can be reached from v3 and therefore from the vertices in v3's origin. v4 is added to Vext if it does not already belong to the extension set. Finally, the g-trie paths are updated, and a new recursive call is made. Changes made to the origins in each call must be undone when the call terminates, as v4 only shares v3's origin if they are both in the same subgraph. Therefore, when v3 is removed from the subgraph (upon the recursive call terminating), v4's origin must be reset.
When the recursive function reaches its base case (Vsub has size k), we only need to retrieve the corresponding isomorphic class from the g-trie path and update its frequency according to the type of edge update being made. Furthermore, if is_connected is set to true, we know that the subgraph is connected without considering the updated edge, and we also retrieve the isomorphic class of the subgraph not considering (v1, v2) and update its frequency.
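The update rule described in this section can be checked on small inputs with a deliberately simplified sketch for undirected networks. It keeps the core idea (enumerate only the connected k-sets that contain the updated edge, then adjust the counts of the classes with and without that edge) but replaces the g-trie, nauty and origin propagation with brute force, so it should be read as a reference for correctness, not as the StreamFaSE implementation. All names (`canon`, `local_sets`, `apply_update`, `census`) are assumptions.

```python
from itertools import combinations, permutations

def canon(nodes, adj, skip=frozenset()):
    """Canonical edge tuple of the induced subgraph, or None if disconnected.
    `skip` is an edge (frozenset) to ignore. Brute force: tiny k only."""
    nodes = sorted(nodes)
    edges = [(u, w) for u, w in combinations(nodes, 2)
             if w in adj[u] and {u, w} != skip]
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:  # depth-first connectivity check over the kept edges
        x = stack.pop()
        for u, w in edges:
            y = w if u == x else u if w == x else None
            if y is not None and y not in seen:
                seen.add(y)
                stack.append(y)
    if len(seen) < len(nodes):
        return None
    forms = []
    for perm in permutations(range(len(nodes))):
        lab = dict(zip(nodes, perm))
        forms.append(tuple(sorted(tuple(sorted((lab[u], lab[w]))) for u, w in edges)))
    return min(forms)

def local_sets(adj, v1, v2, k):
    """All connected k-vertex sets containing the (present) edge v1-v2."""
    frontier = {frozenset((v1, v2))}
    for _ in range(k - 2):
        frontier = {s | {w} for s in frontier for u in s for w in adj[u] if w not in s}
    return frontier

def apply_update(adj, counts, op, v1, v2, k):
    """Update `counts` in place for one edge addition ("add") or deletion ("del")."""
    if op == "add":  # additions are committed before the enumeration
        adj.setdefault(v1, set()).add(v2)
        adj.setdefault(v2, set()).add(v1)
    sign = 1 if op == "add" else -1
    edge = frozenset((v1, v2))
    for s in local_sets(adj, v1, v2, k):
        with_edge = canon(s, adj)
        counts[with_edge] = counts.get(with_edge, 0) + sign
        without = canon(s, adj, skip=edge)
        if without is not None:  # also connected without the updated edge
            counts[without] = counts.get(without, 0) - sign
    if op == "del":  # deletions are committed after the enumeration
        adj[v1].discard(v2)
        adj[v2].discard(v1)

def census(adj, k):
    """Full recount, used only to validate the incremental update."""
    counts = {}
    for s in combinations(sorted(adj), k):
        c = canon(s, adj)
        if c is not None:
            counts[c] = counts.get(c, 0) + 1
    return counts

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # illustrative path 0-1-2-3
counts = census(adj, 3)
apply_update(adj, counts, "add", 0, 2, 3)
after_add = {c: n for c, n in counts.items() if n}
recount_add = census(adj, 3)
apply_update(adj, counts, "del", 0, 2, 3)
after_del = {c: n for c, n in counts.items() if n}
recount_del = census(adj, 3)
```

Comparing the incrementally maintained counts with a full recount after each update, as done at the end of the sketch, is a simple way to test any faster implementation of the same rule.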
4 Experimental Results
All tests were executed on a machine with a 16-core AMD Opteron processor with a 2.3 GHz base clock speed and a total of 252 GB of memory installed. We used six real-world networks to test the performance of the proposed algorithm, StreamFaSE. Due to the novelty of our work, no data sets were available for testing the performance of StreamFaSE. With that in mind, we adapted networks with timestamped edges by transforming them into a chronological stream of edge additions and deletions. We opted to use a sliding window model, in which an edge is deleted after some predefined amount of time has passed since its addition. Table 1 contains a detailed description of the networks used, as well as how they were adapted to a stream-friendly format.
Table 1. Real-world networks used in our experiments.
Name          Digraph  #V       Updates  Avg. #E  Description
email         No       142      6,023    259      12-hour sliding window of email exchanges between members of a research institution over a period of 803 days. Adapted from [9]
mooc          No       4,008    39,298   19,231   Active students' interactions with course activities on a popular MOOC platform. Adapted from [6]
9/11          No       13,314   414,872  6,273    Co-appearance of words in stories released by the news agency Reuters over a period of 66 days after the 9/11 attack on the US. Edges were updated daily. Adapted from [3]
violence      Yes      29       359      82       12-month sliding window of violent activities between political actors in Italy between the years of 1919 and 1922. Adapted from [5]
mathoverflow  Yes      21,594   88,711   44,356   2350-day-long network of questions and answers between users of MathOverflow [9]
retweets      Yes      321,307  443,548  221,775  Retweet network during the first observation of gravitational waves, in 2016. The data was gathered over 6 days. Adapted from [4]
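The sliding-window adaptation described before Table 1 can be sketched as follows. The exact conversion used for each data set is not specified here, so this is only one plausible reading: it assumes undirected edges, integer timestamps, that an edge observed again while still alive has its expiry refreshed, and that edges still alive at the end of the data are not deleted. The function name `sliding_window_stream` is hypothetical.

```python
import heapq

def sliding_window_stream(timestamped_edges, window):
    """Turn (t, u, v) contacts into a chronological list of ("add"|"del", u, v)."""
    expiry = {}  # edge -> time at which it currently expires
    heap = []    # (expiry_time, edge) candidates, possibly stale
    stream = []
    for t, u, v in sorted(timestamped_edges):
        # Emit deletions for edges whose window has elapsed by time t.
        while heap and heap[0][0] <= t:
            exp, e = heapq.heappop(heap)
            if expiry.get(e) == exp:  # ignore stale entries of refreshed edges
                del expiry[e]
                stream.append(("del", *e))
        e = (min(u, v), max(u, v))
        if e not in expiry:
            stream.append(("add", *e))
        expiry[e] = t + window  # assumption: re-observation refreshes the edge
        heapq.heappush(heap, (t + window, e))
    return stream

# Illustrative contacts (time, u, v) with a window of 10 time units.
contacts = [(0, 1, 2), (5, 2, 3), (12, 1, 2), (30, 3, 4)]
updates = sliding_window_stream(contacts, window=10)
```

In this example the edge (1, 2) expires at time 10, is re-added by the contact at time 12, and expires again before the last contact arrives.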
We tested our algorithm by running a thorough experiment in which we iteratively increased the order, k, of the counted subgraphs. Each experiment was run twice to ensure consistency. The following statistics were gathered:
– Types: number of occurring non-isomorphic subgraphs of order k found in the network throughout all updates;
– Time: execution time, measured as the elapsed time between launching the program and computing the frequencies of all possible subgraphs after each update;
– Speedup: time of the other algorithm divided by the time of our own.
We compared the performance of StreamFaSE against FaSE and StreaM-k. FaSE was chosen not only because it is the basis for our algorithm, but also because it is a state-of-the-art algorithm for computing the static Subgraph Census; the comparison showcases the importance of taking a different approach to solve the dynamic version of the problem. Table 2 contains the results.
Table 2. Experimental results.
                      Time (s)                             Speedup vs.
Network       Size    StreamFaSE  FaSE       StreaM-k      FaSE       StreaM-k
email         3       0.03        1.41       0.71          46.1x      23.4x
email         4       0.14        8.48       1.23          62.6x      9.1x
email         5       1.19        52.95      6.59          44.5x      5.5x
email         6       9.99        337.45     43.24         33.8x      4.3x
email         7       64.30       1747.70    275.99        27.2x      4.3x
email         8       325.43      >4h        *             >44.3x     *
email         9       1720.09     >4h        *             >8.4x      *
mooc          3       0.70        1976.95    2.00          2843.1x    2.9x
mooc          4       173.98      >4h        2714.29       >82.8x     15.6x
9/11          3       5.36        11338.62   19.66         2115.4x    3.7x
9/11          4       196.08      >4h        726.85        >73.5x     3.7x
violence      3       0.00        0.07       **            23.9x      **
violence      4       0.03        0.79       **            23.1x      **
violence      5       0.51        5.87       **            11.6x      **
violence      6       4.57        30.07      **            6.6x       **
violence      7       25.13       112.98     **            4.5x       **
violence      8       101.45      341.79     **            3.4x       **
mathoverflow  3       5.57        >4h        **            >2585.3x   **
mathoverflow  4       1043.37     >4h        **            >13.8x     **
retweets      3       125.88      >4h        **            >114.4x    **
* Algorithm does not compute for k > 7
** Algorithm does not compute on directed networks.
Results are very promising, showing that StreamFaSE outperforms the other algorithms. It is worth noting that FaSE, being the non-native approach to the Streaming Subgraph Census Problem, has a major drop in performance when handling large networks (i.e. mooc, 9/11, mathoverflow, retweets). This is because, unlike StreamFaSE and StreaM-k, FaSE recomputes all subgraph counts after each update.
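The speedup entries of Table 2 follow directly from the definition given above; for runs that exceeded the four-hour limit only a lower bound can be stated. The helper below is an illustrative sketch (the name `speedup` and the use of `None` to mark a timed-out run are assumptions), and rounding in the table may differ slightly because the published times are themselves rounded.

```python
TIMEOUT_S = 4 * 3600  # runs marked ">4h" in Table 2 were stopped at four hours

def speedup(other_time, own_time):
    """Return (value, is_lower_bound); `other_time=None` marks a timed-out run."""
    if other_time is None:
        return TIMEOUT_S / own_time, True
    return other_time / own_time, False

# email, k = 5: FaSE 52.95 s vs. StreamFaSE 1.19 s
exact, exact_is_bound = speedup(52.95, 1.19)
# mooc, k = 4: FaSE timed out, StreamFaSE took 173.98 s
bound, bound_is_bound = speedup(None, 173.98)
```

Rounded to one decimal place, these two calls reproduce the 44.5x and >82.8x entries of the table.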
5 Conclusions and Future Work
In this paper we presented an efficient algorithm for the Streaming Subgraph Census Problem. Our method, StreamFaSE, is a network-centric approach that performs the exact enumeration of all static subgraphs occurring in a dynamic network. The main novelty of the presented algorithm is that it explores only the regions of the network that have undergone a topology change after each stream update. Also, the main algorithm is the same for edge additions and deletions, varying only in the moment at which the update is actually committed to the network. In the future, we plan to boost the performance of our method when handling real-world stream networks, by developing an online algorithm that can process multiple edge additions and deletions at the same time. This has applications in high-throughput network monitoring systems that demand fast response times, such as detecting anomalies in financial transactions prior to their approval.
Acknowledgements. This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia, within project UIDB/50014/2020.
References
1. Al-Thaedan, A., Carvalho, M.: Online estimation of motif distribution in dynamic networks. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference, CCWC 2019, pp. 758–764. Institute of Electrical and Electronics Engineers Inc., March 2019
2. Chen, X., Lui, J.C.S.: Mining graphlet counts in online social networks. ACM Trans. Knowl. Discov. Data 12(4), 1–38 (2018)
3. Corman, S.R., Kuhn, T., Mcphee, R.D., Dooley, K.J.: Studying complex discursive systems. Hum. Commun. Res. 28(2), 157–206 (2002)
4. De Domenico, M., Altmann, E.G.: Unraveling the origin of social bursts in collective attention. Sci. Rep. 10(1), 4629 (2020)
5. Franzosi, R.: Narrative as data: linguistic and statistical tools for the quantitative study of historical events. Int. Rev. Soc. Hist. 43, 81–104 (1998)
6. Kumar, S., Zhang, X., Leskovec, J.: Predicting dynamic embedding trajectory in temporal interaction networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1269–1278. ACM (2019)
7. McKay, B.D., Piperno, A.: Practical graph isomorphism, II. J. Symb. Comput. 60, 94–112 (2014)
8. Mukherjee, K., Hasan, M.M., Boucher, C., Kahveci, T.: Counting motifs in dynamic networks. BMC Syst. Biol. 12(S1), 6 (2018)
9. Paranjape, A., Benson, A.R., Leskovec, J.: Motifs in temporal networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, pp. 601–610. Association for Computing Machinery, New York (2017)
10. Paredes, P., Ribeiro, P.: Rand-FaSE: fast approximate subgraph census. Soc. Netw. Anal. Min. 5, 17 (2013)
11. Ribeiro, P., Paredes, P., Silva, M.E., Aparicio, D., Silva, F.: A survey on subgraph counting: concepts, algorithms and applications to network motifs and graphlets. arXiv preprint arXiv:1910.13011 (2019)
12. Ribeiro, P., Silva, F.: G-Tries: a data structure for storing and finding subgraphs. Data Min. Knowl. Discov. 28(2), 337–377 (2014)
13. Schiller, B.: Graph-based Analysis of Dynamic Systems. Ph.D. Thesis, Faculty of Computer Science, Technische Universität Dresden (2016)
14. Wernicke, S.: A faster algorithm for detecting network motifs. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), LNBI, vol. 3692, pp. 165–177. Springer, Heidelberg (2005)
Temporal Bibliometry Networks of SARS, MERS and COVID19 Reveal Dynamics of the Pandemic
Ramya Gupta1, Abhishek Prasad1, Suresh Babu2, and Gitanjali Yadav1,3(B)
1 National Institute of Plant Genome Research, New Delhi 110067, India
2 Ambedkar University of Delhi, Delhi 110007, India
3 Department of Plant Science, University of Cambridge, Cambridge CB23EA, UK
Abstract. A world crisis brings forth new, often unexpected responses that are fascinating to investigate from both scientific and social standpoints. A comprehensive bibliometric investigation of such an event can offer insights into the politics of the pandemic, not just providing incentives for improving scientific quality and productivity, but also dissecting the role of global competition and marginalization in terms of funding and peerage. The sheer number of publications witnessed in less than 10 months of the novel coronavirus outbreak indicates how scientists from all walks of life, irrespective of their respective fields of interest, shifted to COVID19 research, leading to discoveries and new directions of research for many. However, this shift has also resulted in shocking factoids based on incomplete interpretations of scientific data, which have continued to be foisted on the public at an alarming rate during the past nine months of COVID, the most colossal of these being the Lancet HCQ story. In this work, we use the 2020 COVID-19 publications to identify bibliometric communities that we compare temporally across two major epidemics of SARS and MERS.
Keywords: Bibliometric networks · Coronavirus · Data mining · SARS · MERS
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 700–711, 2021. https://doi.org/10.1007/978-3-030-65351-4_56
1 Introduction
1.1 Bibliometry and the Coronavirus
The capacity of any state to cope effectively with a pandemic like COVID-19 depends upon inherent capacity, economic equality and the ability to make informed policy decisions based on scientific evidence (Guillen 2020). The same holds true for organisations, be it in the public or social domain, and the world has witnessed an unprecedented flurry of scientific and social media publications during the past nine months of the ongoing coronavirus pandemic, which has infected more than 35 million people on the planet, killing over 1 million as of early October 2020. Here, we report a detailed investigation of >24000 published documents on viral epidemics, authored by over 50,000 individuals from 140 nations, in order to understand epidemic dynamics and how these have impacted international relations, science education and research systems. Our aim is to advance our understanding of the manner in which the pandemic has moved through society, and how the world has responded to it.
1.2 Related Work
Extraction and analysis of knowledge from the scholarly corpus can add valuable insights and enable synthesis of existing research findings while delineating new directions for future research (Skippari et al. 2005). Rigorous bibliometric methodologies can identify coherent clusters in existing research that can serve as reference points, and identify knowledge gaps that remain to be addressed (Jimenez and Bjorvatn 2018). In this regard, visualization and conceptualization of a complex co-citation corpus as networks enables derivation of biologically significant inferences from systematic analysis of detailed conceptual relationships (Yadav and Babu 2012). Very recently, we have developed a new decision support system based on recursive partitioning of bibliometric evidence, to simplify exploratory literature review, enabling rational design of research objectives for scholars, as well as development of comprehensive grant proposals that address gaps in research (Mishra et al. 2020). As a specific case study, we examined the current global pandemic. This work represents a synthesis of the key pandemic-related literature over a duration of twenty years, covering two major epidemics of SARS and MERS, apart from COVID19, and it revealed three broad analytical foci: the sources of risk; the scope of epidemic research over the past two decades; and how actors in science and education (countries, authors and publishers) have responded to the crisis. In summary, this is the first comprehensive meta-analysis of the expanding global scientific literature before and after the coronavirus outbreak, and we find very interesting patterns that reveal insights into the politics of the pandemic.
2 Methods
2.1 Data Collection
Four search terms were used to query the ‘Web of Science’ database for entries from the past twenty years, namely ‘coronavirus’, ‘COVID-19’, ‘SARS-CoV-2’, and ‘SARS’. The metadata collected for comparative analysis included, for each collection, the document types (journals, books, etc.), Keywords Plus (ID), author's keywords (DE), period, average citations per document, authors (appearances, single-authored documents, multi-authored documents), documents per author, authors per document, co-authors per document, and collaboration index.
2.2 Bibliometric Analysis
All bibliometry-related parameters were investigated in R using the Bibliometrix package (Aria and Cuccurullo 2017; Aria et al. 2020). All word map networks were constructed in Cytoscape (Shannon et al. 2003). The parameters used for network analyses included ‘Collaboration index’ (number of authors of multi-authored articles divided by the number of multi-authored articles), Annual Scientific Production (ASP), and citations per year. For the analysis of sources, we interpreted citations, impact, source and clustering through Bradford's law (sources that have the greatest number of articles are classified as ‘core sources’, followed by ‘zone 1’ sources, ‘zone 2’ sources, and so on). For the analysis of authors, we used authors' production over time, productivity (as per Lotka's Law), author impact and affiliations. For global mapping of bibliometry, we made use of the corresponding authors' country, the number of documents per country, total citations per country, and country collaboration maps. For keyword analysis, we performed word dynamics, conceptual structure mapping and thematic mapping, by construction of co-word maps, co-occurrence networks and historiographs.
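Lotka's law, on which the author-productivity analysis relies, states in its classical form that the share of authors with n documents is proportional to 1/n². The sketch below compares an observed distribution with that inverse-square expectation. It assumes the classical exponent of 2 (the Bibliometrix package may instead estimate the exponent from the data), and the function name `lotka_comparison` and the toy input are hypothetical.

```python
import math
from collections import Counter

def lotka_comparison(docs_per_author, max_n=5):
    """Rows of (n, observed share, inverse-square share) for n = 1..max_n."""
    freq = Counter(docs_per_author)
    total = len(docs_per_author)
    c = 6 / math.pi ** 2  # normalising constant: sum over n of c / n^2 equals 1
    return [(n, freq.get(n, 0) / total, c / n ** 2) for n in range(1, max_n + 1)]

# Toy collection: ten authors, six of whom contributed a single document.
toy = [1] * 6 + [2] * 2 + [3] + [7]
rows = lotka_comparison(toy)
```

Under the inverse-square form, roughly 61% of authors are expected to contribute exactly one document, which gives a baseline against which the share of occasional authors in a collection can be read.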
3 Results
3.1 Data Collection
Of the four search terms described in Methods, ‘coronavirus’ returned the most results (12,015 documents), followed by ‘SARS’, ‘COVID-19’ and then ‘SARS-CoV-2’. The documents returned for ‘SARS’ and ‘Coronavirus’ are mostly classified as ‘article’, while most of the documents returned for ‘COVID-19’ are classified as ‘editorial material’. Table 1 provides a summary of the data collated.
Table 1. Bibliometric data collected for this study.
Description                           SARS       SARS-CoV-2  CORONAVIRUS  COVID-19
Documents                             11611      239         12015        573
Sources (Journals, Books, etc.)       2948       122         1826         174
Keywords Plus (ID)                    16303      264         12134        265
Author's Keywords (DE)                16707      361         11479        511
Period                                2001–2020  2020        2001–2020    2020
Average citations per document        23.13      1.293       25.08        0.7086
Authors                               30803      1313        32438        1904
Author Appearances                    64811      1601        79463        2428
Authors of single-authored documents  1099       31          658          122
Authors of multi-authored documents   29704      1282        31780        1782
Single-authored documents             1554       32          887          235
Documents per Author                  0.377      0.182       0.37         0.301
Authors per Document                  2.65       5.49        2.7          3.32
Co-Authors per Document               5.58       6.7         6.61         4.24
Collaboration Index                   2.95       6.19        2.86         5.27
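The ratio indicators in Table 1 can be recomputed from the raw counts in the same table, using the definition of the collaboration index given in Methods (authors of multi-authored documents divided by the number of multi-authored documents). The helper name `indicators` is an assumption; the input numbers are those of the ‘SARS’ and ‘COVID-19’ columns.

```python
def indicators(documents, authors, single_docs, multi_authors):
    """Derive the per-author and collaboration ratios from raw counts."""
    multi_docs = documents - single_docs  # multi-authored documents
    return {
        "documents_per_author": documents / authors,
        "authors_per_document": authors / documents,
        "collaboration_index": multi_authors / multi_docs,
    }

sars = indicators(documents=11611, authors=30803, single_docs=1554, multi_authors=29704)
covid = indicators(documents=573, authors=1904, single_docs=235, multi_authors=1782)
```

Rounded as in the table, the ‘SARS’ column yields 0.377, 2.65 and 2.95, and the ‘COVID-19’ column yields a collaboration index of 5.27.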
As can be seen in Table 1, all entries for ‘COVID-19’ and ‘SARS-CoV-2’ were from 2020, while ‘Coronavirus’ and ‘SARS’ returned results for the entire twenty-year duration. We pooled these two sets of terms to construct two distinct temporal networks: one spanning the past twenty years (‘Coronavirus’ and ‘SARS’), and the other spanning the single (current) pandemic year of 2020 (‘COVID-19’ and ‘SARS-CoV-2’). In all subsequent sections, we call these Group A and Group B, respectively. At the end of this paper we also bring in a third set (Group C), representing the first year of the SARS epidemic (2002–2003), in order to compare immediate response curves.
3.2 Annual Scientific Production (ASP) Reflecting Past Case Surges
Annual Scientific Production (ASP) curves for the two groups are shown in Fig. 1. Group A showed a peak from 2002 to 2004, followed by a steady decline until 2007 and a second peak from 2012 to 2016, with relative stability since. This trend makes sense given that SARS first appeared in China in 2002, spreading worldwide within a few months and lasting until 2004, with no known cases since. The second peak can be explained by the first occurrence of MERS in Saudi Arabia in 2012; cases of MERS are still occurring in some parts of the world, such as the Middle East.
Fig. 1. Annual Scientific Production (ASP) curves in the two groups; 20 years data on coronavirus (left) and 2020 data on COVID19 (right panel).
The average number of article citations also reflects similar trends, peaking in 2003 and 2013, with ‘SARS-CoV-2’ having the highest collaboration index, followed by ‘COVID-19’, suggesting new relationships that did not exist before.
3.3 Analysis of Sources
The top twenty journals by number of articles published in each collection included major international journals such as Lancet and The Journal of Virology in Group A, while Group B showed Cureus and Eurosurveillance among the top rankers. Curiously, these two journals did not appear at all in the long-term collection. Bradford's law was used to group the sources of the two collections into zones, as shown in Fig. 2. Sources that have the greatest number of articles are classified as ‘core sources’, and it is clear from the figure that the Group A collection has several core sources, while the Group B collection has very few. This trend is not surprising in a pandemic situation, when global calamity strikes, but we focused on the non-core parts. An assessment of the full data (not just top rankers) revealed the presence of >65% new sources/journals that had not been publishing related work in the past two decades.
Fig. 2. Bradford’s curves for Group A and Group B reveal distinct differences in relative size of core sources (Dark Grey Shaded regions).
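The zoning behind a Bradford plot can be sketched as follows: sources are ranked by output and split into zones that each hold roughly the same number of articles, so that the core zone contains few highly productive sources. This is the textbook formulation with three zones; it is an assumption that this matches the zoning applied in the analysis, and the function name `bradford_zones` and the toy counts are hypothetical.

```python
def bradford_zones(articles_per_source, n_zones=3):
    """Map each source to a zone index (0 = core) with roughly equal article totals."""
    ranked = sorted(articles_per_source.items(), key=lambda kv: -kv[1])
    total = sum(articles_per_source.values())
    zones, cumulative = {}, 0
    for source, n in ranked:
        # The zone is decided by the share of articles accumulated before this source.
        zones[source] = min((cumulative * n_zones) // total, n_zones - 1)
        cumulative += n
    return zones

# Toy collection: one dominant source, three mid-sized ones, thirty single-article ones.
toy = {"A": 30, "B": 12, "C": 10, "D": 8}
toy.update({f"s{i}": 1 for i in range(30)})
zones = bradford_zones(toy)
core = [s for s, z in zones.items() if z == 0]
outer = [s for s, z in zones.items() if z == 2]
```

In the toy data each zone holds 30 articles, contributed by 1, 3 and 30 sources respectively, which illustrates why a collection dominated by occasional sources shows a very small core.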
3.4 Sources Impact; An Analysis of Bibliometric Indices Source impact can be measured in terms of h-index, m-index, or g-index of the source and we focused on the first of these. Figure 3 shows the source impact for the top five journals in both collections, and ‘Journal of Virology’ was a top ranker for ‘SARS’ and ‘coronavirus’ collections (yellow line), closely followed by Plos One (red curve). In Group A, the curve saw its highest peak during 2004–2006 (ending of the SARS epidemic) and a smaller peak during 2012–2013 (first MERS epidemic case). In Group B, Eurosurveillance is the steepest, followed closely by Cureus. Surprisingly, Group B (with 300 journals) is roughly 1/10th of Group A (roughly 300) despite 20x time having passed, but it lists papers from 60 journals that never occurred in the 20 years data. 3.5 Analysis of Authors This analysis conventionally identifies the most relevant authors based on the most number of documents. But we conducted the analysis from the viewpoint of understanding whether and to what extent the pandemic witnessed the rise of authors who had never previously published, or at least not in the area of virology (following up on the observation in the previous section about 60 new sources in less than ten months). One way to address this is to assess authors’ production over time, as shown in Fig. 4. From the Group A collections in Fig. 4, it could be seen that most authors had been publishing since 2003. In sharp contrast, 1/3rd Group B authors have an h-index of Zero, and majority of these appear to have been born overnight during 2020, as the names do not appear in past twenty years data. It is possible that these ‘new authors’ were prolific
Temporal Bibliometry Networks of SARS, MERS and COVID19
Fig. 3. Source Growth curves for Group A and Group B reveal distinct peaks in Group A for the past epidemics of SARS and MERS.
Fig. 4. Author Production Timelines. Each red line represents an author’s timeline, the size of each bubble is proportional to the number of documents they authored, and colour intensity of the bubble is proportional to the total citations per year for the documents authored that year.
publishers before the pandemic as well, but in other areas of research, which our keyword-based bibliometry would not capture. This, however, is unlikely, as we found the >550 authors in this category to have h-indices of either zero or 1. To dissect this feature further, we assessed author productivity using Lotka’s law, as shown in Fig. 5. This graph identifies ‘core authors’, who have contributed to multiple documents in the collection, and ‘occasional authors’, who have contributed very few. The dotted line in the graph shows the theoretical distribution of authors. As can be seen in Fig. 5, the Group A collections coincide with the theoretical distribution (dotted line) and have many core authors, with about 60% occasional authors who have contributed only one document. The Group B collections, in contrast, have far fewer core authors: >75% of authors contributed only one publication, followed by 10% who contributed only two. Interestingly, the authors in these lists were quite different from those in the list of top authors based on number of documents and citations. Since this graph was based on the actual impact of their work, it may be a more meaningful way of picking top authors in
Fig. 5. Frequency Distribution of Documents as per Lotka’s Law
a field. For instance, Author XG, top ranked in both collections (when ranked by number of publications), is now missing from both lists.
3.6 Affiliations and Global Impact
Analysis of the most relevant affiliations by number of documents revealed the University of Hong Kong at the top for Group A, followed by the Chinese University of Hong Kong, University of North Carolina, University of Toronto, University of Iowa, and the Centers for Disease Control and Prevention. For Group B, the top-ranked affiliations were Fudan University, University of Macau, and Wuhan University. To understand the extent of international collaborations, we used two key terms: SCP (single country publication) and MCP (multiple country publication). MCP counts the documents in which at least one co-author is from a different country. China and the USA were the top two countries for each collection, followed by Germany, Canada, and the United Kingdom. Curiously, Korea, Turkey and Italy also appeared in these lists but with few or no MCPs. Figure 6 depicts the number of documents per country in each collection, and it is interesting to note that both collections, despite huge differences in absolute numbers, show extensive contribution from every continent, dominated by North America and Europe, as expected.
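The SCP/MCP distinction can be made concrete with a short sketch. It assumes, hypothetically, that each document is represented by the list of its authors’ countries and that the document is attributed to the first country listed; neither the data layout nor the function name `scp_mcp_counts` comes from the analysis above.

```python
from collections import defaultdict


def scp_mcp_counts(documents):
    """Count single- and multiple-country publications per country.

    documents: list of lists, each holding the countries of one document's
    authors. Assumption: the first country listed is the one the document
    is attributed to.
    """
    counts = defaultdict(lambda: {"SCP": 0, "MCP": 0})
    for countries in documents:
        if not countries:
            continue  # no affiliation data for this document
        home = countries[0]
        # MCP: at least one co-author comes from a different country.
        kind = "MCP" if len(set(countries)) > 1 else "SCP"
        counts[home][kind] += 1
    return dict(counts)


# Hypothetical toy data, not taken from either collection.
docs = [
    ["China", "China"],
    ["China", "USA"],
    ["USA"],
    ["Korea", "Korea", "Korea"],
]
result = scp_mcp_counts(docs)
print(result)
```

A country with documents but few or no MCPs, as noted for Korea, Turkey and Italy, would show a non-zero SCP count next to an MCP count at or near zero.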
Fig. 6. No. of documents per country in Group A (left) and Group B (right) collections. The intensity of color is proportional to the number of documents produced by the country.
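For reference, the theoretical curve against which author productivity is compared in Fig. 5 (Sect. 3.5) can be illustrated with the classical inverse-square form of Lotka’s law, in which the share of authors with n documents is proportional to 1/n^2. The sketch below assumes that classical exponent of 2; the exponent actually fitted for Fig. 5 is not stated in the text, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def lotka_expected_share(n, beta=2.0, n_max=10_000):
    """Expected share of authors with exactly n documents under Lotka's law.

    beta=2 is the classical inverse-square assumption; n_max truncates the
    normalising sum.
    """
    norm = sum(k ** -beta for k in range(1, n_max + 1))
    return (n ** -beta) / norm


def observed_shares(docs_per_author):
    """Observed share of authors for each document count."""
    freq = Counter(docs_per_author.values())
    total = len(docs_per_author)
    return {n: count / total for n, count in sorted(freq.items())}


# Hypothetical toy collection: three one-paper authors and one two-paper author.
toy = {"a": 1, "b": 1, "c": 1, "d": 2}
observed = observed_shares(toy)
for n, share in observed.items():
    print(n, round(share, 3), round(lotka_expected_share(n), 3))
```

Under this assumption the expected share of single-document authors is 6/π², roughly 61%, which is in line with the ‘about 60%’ reported for the Group A collections. A share above 75%, as reported for Group B, lies well above that theoretical value.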
An analysis of the actual cited documents in the two collections (altogether over 300,000 citations) showed several overlapping documents, including peaks from the 2003 SARS epidemic and a smaller peak from the first MERS outbreak in 2012. The
same peaks were seen for the Group B collections as well, suggesting that authors cited relevant papers in both cases.
3.7 Knowledge Extraction and Word Maps
We constructed a keyword network to reveal the most frequently occurring author keywords as well as keywords plus lists for each collection, as shown in Fig. 7. Only five of these were common to the two lists. For the Group A collections, the top author keywords included the search terms themselves (‘sars’ and ‘coronavirus’), ‘severe acute respiratory syndrome’, ‘sars cov’, and ‘mers cov’. Top author keywords and top words from the keywords plus lists of the Group B collections included ‘coronavirus’, ‘2019 ncov’, ‘pneumonia’, and ‘pandemic’. Top common words from the keywords plus lists included ‘pneumonia’, ‘sars’, ‘acute respiratory syndrome’, ‘outbreak’, ‘wuhan’, and ‘protein’. Interestingly, the term ‘spike protein’ was a top author keyword and a top entry in the keywords plus list for the Group A collection, but not for the Group B collection.
Fig. 7. Network of word mappings in both collections: 20 years (left) and 2020 (right).
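A keyword network of the kind shown in Fig. 7 can be built from per-document keyword lists by counting how often two keywords appear together. The following is a minimal sketch under stated assumptions: keywords are normalised only by trimming and lower-casing, every co-occurring pair gets an undirected weighted edge, and the function name and sample keywords are illustrative rather than taken from the actual pipeline.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(keyword_lists):
    """Return weighted, undirected keyword pairs.

    keyword_lists: one list of author keywords per document.
    The weight of a pair is the number of documents containing both keywords.
    """
    edges = Counter()
    for keywords in keyword_lists:
        # Normalise and de-duplicate so a pair is counted once per document.
        unique = sorted({k.strip().lower() for k in keywords})
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical toy documents.
docs = [
    ["SARS", "coronavirus", "spike protein"],
    ["coronavirus", "sars"],
    ["coronavirus", "pneumonia"],
]
edges = cooccurrence_edges(docs)
for pair, weight in edges.most_common():
    print(pair, weight)
```

Sorting the keywords inside each document keeps every pair in a canonical order, so (‘coronavirus’, ‘sars’) and (‘sars’, ‘coronavirus’) are counted as the same edge.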
3.8 Conceptual Structure Mapping
In bibliometry, this is done to understand the main themes and trends being discussed in the field being investigated. Co-word networks are created to understand the areas of research or study in that field. Factorial analysis is used to identify sub-fields. Thematic networks are used to further probe the areas of research or study and map the evolution of their relevance over time. Co-word mapping for Group A showed two clusters as illustrated in Fig. 8. The larger red cluster appeared to be composed of keywords related to the disease and the resulting pandemic. The smaller blue cluster seemed to focus on
the virus itself, with keywords such as ‘recombination’, ‘evolution’, and ‘coronavirus spike protein’ that focus on understanding how the virus survives and infects hosts. Co-word mapping for Group B also showed two clusters, as illustrated in Fig. 9. The larger red cluster contains keywords related to the disease and the virus causing it. The smaller blue cluster, with keywords such as ‘mental health’, ‘anxiety’, and ‘depression’, showed that the impact of COVID-19 on mental health is an emerging area of study.
Fig. 8. Conceptual Structure Map for Group A
Word co-occurrence maps and author co-citation maps were constructed to understand which keywords or authors typically occur together and what sub-fields exist as a result of their clustering. The strength of their co-occurrence and the importance of each cluster can be understood from these maps. For Group A, we found strong pairing of ‘respiratory syndrome coronavirus’ and ‘virus’ in a cluster related to the biology of the virus itself, with keywords such as ‘recombination’, ‘replication’, ‘inhibition’, and ‘antibodies’. The more peripheral (and therefore less relevant) clusters showed important keywords like ‘spike protein’, ‘angiotensin converting enzyme’, ‘receptor’, ‘crystal structure’, ‘glycoprotein’ and more. This suggests that most publications focus on discussing the pathology of the disease and vital functions of the virus. While specific biological features of the virus have been identified and discussed, this has not yet happened at a large scale and is not the focus of current research. For Group B, co-occurrence mappings revealed three disconnected clusters with limited individual importance (data not shown). Patterns in co-citation networks of authors in the two collections, followed by historiographic mapping, revealed about 20 co-authorship clusters in Group A (twenty years of data, 50K authors), whereas the Group B dataset, with just under 3000 authors, revealed over 300 clusters over ten months of published literature. These numbers differ greatly, but the contrast is even more pronounced when we consider data from
Fig. 9. Conceptual Structure Map for Group B
Set C (only 2003 SARS-related publications), which showed 71 clusters over a duration of 12 months. The pattern suggests a much more emphatic response to COVID19 than has ever been recorded for any earlier epidemic. In general, collaboration network maps for the Group A collections revealed several strong collaborations, while those for the Group B collections revealed very little collaboration, possibly an artifact of all publications being extremely recent.
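The text does not state how clusters were delimited in the collaboration networks. One simple way to count groups of connected authors, shown here purely as an illustrative assumption, is to treat every paper as linking all of its co-authors and to count the connected components of the resulting co-authorship graph. A community-detection algorithm would generally give different (usually larger) counts, and the function name and toy data below are hypothetical.

```python
def count_coauthor_clusters(papers):
    """Count connected components of a co-authorship graph.

    papers: list of author-name lists, one per paper. Authors on the same
    paper are linked; a solo author forms a cluster of their own.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for authors in papers:
        for author in authors:
            find(author)  # register every author, including solo ones
        for author in authors[1:]:
            union(authors[0], author)

    return len({find(author) for author in list(parent)})


# Hypothetical toy data: A-B-C are linked through B, D is alone, E-F co-author.
papers = [["A", "B"], ["B", "C"], ["D"], ["E", "F"]]
n_clusters = count_coauthor_clusters(papers)
print(n_clusters)
```

On this reading, a young literature with little cross-group collaboration yields many small components relative to its number of authors, which is the qualitative pattern described for Group B.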
4 Conclusions
This work began as an exploration of the dynamics of the ongoing global coronavirus pandemic using time-resolved bibliometric networks as a refreshing change of viewpoint, but the large volume of data and patterns soon became a rich resource for extrapolating the impact and responses of the scientific world to local and global crises such as SARS, MERS and COVID19. We were able to compare and contrast publications arising in 2020 alone with the publications seen over the past twenty years. In all, we analyzed about 24,000 publications, 50K authors and over 300K citations in terms of individuals, affiliations, timelines, citations, indices of impact, keywords, subject categories and much more. Overall, we found emergent patterns that reflect how the world has responded to a global pandemic. For instance, we observe a sharp decline in the gap between specialist authors and generalist authors in the new collections,
which contrasts strongly with a bias towards the power law in the past twenty years. Most strikingly, we observed >75 new author communities in the 2020 network, comprising individuals and groups with no prior history or record of expertise in the field. An increasing number of authors are producing an increasing number of publications, and these are not the same groups who ranked highest in the history of virology or epidemics. Surprisingly, analysis of the 2002–03 data (as a separate Set C) revealed that the observed pattern is not a reflection of single-year data; publications during the first year of the SARS outbreak showed very different curves compared to 2020 COVID19 (data not shown due to lack of space). On the one hand, this could be a very positive change, suggesting global collaborations and people coming together to tackle the pandemic. On the other hand, it might well reflect the extent to which opportunists have tried to make the most of the calamity by digging for gold, across boundaries of nations, funding agencies and large publishers. Throughout history, pandemics have been fertile ground for propaganda, during which diverse agencies such as countries, companies, political leaders and parties, and religious institutions have engaged in intense efforts to spread goodwill and to gain commercial benefits, political mileage and/or social capital. Our study provides interesting insights into the politics of the pandemic, and paves the way for more detailed investigations into the bibliometry of the pandemic.
Acknowledgements. RG and AP acknowledge CSIR Junior Research Fellowship. GY acknowledges CSIR Grant ID 38(1461)/18/EMR-II and RCUK-BBSRC Grant ID BBSRC BB/P027970/1TIGR2ESS, as well as NIPGR support. SB acknowledges support from School of Human Ecology, AUD.
References
Aria, M., Cuccurullo, C.: bibliometrix: an R-tool for comprehensive science mapping analysis. J. Informetr. 11(4), 959–975 (2017). https://doi.org/10.1016/j.joi.2017.08.007
Aria, M., Misuraca, M., Spano, M.: Mapping the evolution of social research and data science on 30 years of social indicators research. Soc. Indic. Res. 1–29 (2020). https://doi.org/10.1007/s11205-020-02281-3
Babu, S., Yadav, G.: NEXCADE: a method for perturbation of complex Networks. PLoS ONE (2012)
Cuccurullo, C., Aria, M., Sarto, F.: Twenty years of research on performance management in business and public administration domains. In: Academy of Management Proceedings, vol. 2013, no. 1, p. 14270. Academy of Management (2013)
Jiménez, A., Bjorvatn, T.: The building blocks of political risk research: a bibliometric co-citation analysis. Int. J. Emerg. Mark. (2018)
Mishra, P., Prasad, A., Babu, S., Yadav, G.: Bibliometric Networks for Research: Decision Support Systems based on Scientific Evidence (2020)
Mali, F., Kronegger, L., Doreian, P., Ferligoj, A.: Dynamic scientific co-authorship networks. In: Models of Science Dynamics, pp. 195–232 (2012). https://doi.org/10.1007/978-3-642-23068-4_6
Newman, M.E.: Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. 101(Suppl 1), 5200–5205 (2004). https://doi.org/10.1073/pnas.0307545100
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T.: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003)
Skippari, M., Eloranta, J., Lamberg, J., Parviainen, P.: Conceptual and theoretical underpinnings in the research of corporate political activity: a bibliometric analysis. Liiketaloudellinen aikakauskirja 2, 185 (2005)
Ziman, J.: Are debatable scientific questions debatable? Soc. Epistemol. 14(2–3), 187–199 (2000). https://doi.org/10.1080/02691720050199225
Author Index
A Abrahão, Felipe S., 520 Alaimo, Salvatore, 386 Altuncu, M. Tarik, 154 Araki, Shuto, 193 Arnarson, Arnþór Logi, 571 Avin, Chen, 508 B Babu, Suresh, 700 Bagdasar, Ovidiu, 286 Barahona, Mauricio, 154 Bhalwankar, Rajesh, 245 Borzì, Stefano, 386 Bramson, Aaron, 193 Branquinho, Henrique, 688 Bui-Xuan, Binh-Minh, 651 C Cavallaro, Lucia, 286 Cazabet, Rémy, 462 Chawla, Nitesh V., 485 Cherifi, Hocine, 66 Chortaras, Alexandros, 92 Chowdhury, Samir, 639 Cohen, Nathann, 597 Contractor, Noshir, 322, 346 Curreri, Francesco, 286 D da Fonseca Vieira, Vinícius, 104 da Silva Félix, Lucas Gabriel, 104 Das, Archan, 346 de Bruin, Gerrit Jan, 79 De Meo, Pasquale, 286
DeChurch, Leslie, 322 Di Maria, Antonio, 386 Diesner, Jana, 623 Dinh, Ly, 623 Dondi, Riccardo, 585 Dong, Xiaowen, 181 Drif, Ahlem, 66 Duricic, Tomislav, 15 Duvivier, Louis, 462 F Faouzi, Nour-Eddin El, 218 Ferragina, Paolo, 386 Ferrara, Emilio, 3 Ferro, Alfredo, 386, 675 Ficara, Annamaria, 286 Fiumara, Giacomo, 286 Frisk, Emil, 275 Fügenschuh, Marzena, 206 Fujii, Teru, 27 Furno, Angelo, 218 G Gama, João, 27 Gardiner, Oliver, 181 Gera, Ralucca, 206 Geraldo Barbosa, Carlos Magno, 104 Giroire, Frédéric, 437, 597 Glade, Nicolas, 372 Gómez-Zará, Diego, 322, 346 Grácio, Luciano, 688 Guan, Jun, 547 Guembour, Sami, 66 Gupta, Ramya, 700 Guzzi, Pietro Hiram, 585
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 R. M. Benito et al. (Eds.): COMPLEX NETWORKS 2020, SCI 944, pp. 713–715, 2021. https://doi.org/10.1007/978-3-030-65351-4
H Han, Qiwei, 130 Han, Yu, 559 Heimann, Mark, 3 Helic, Denis, 15 Henry, Elise, 218 Hogie, Luc, 597 Hosein, Patrick, 298 Hosseinzadeh, Mohammad Mehdi, 585 Hourcade, Hugo, 651 Huntsman, Steve, 425, 639 Hussain, Hussain, 15 I Ivanov, Sergiu, 372 J Jiang, Lan, 623 Jung, Hohyun, 497 K Kadomtsev, Dmitry, 450 Kaven, Emily, 322 Kaven, Ilana, 322 Kermani, Mehrdad Agha Mohammad Ali, 335 Kern, Roman, 15 Kimura, Masahiro, 27 Kogge, Peter M., 485 Kompatsiaris, Ioannis, 610 Korniss, Gyorgy, 167 Koyutürk, Mehmet, 39 Krasanakis, Emmanouil, 610 Krieg, Steven J., 485 Kumano, Masahito, 27 L Lex, Elisabeth, 15 Li, Mengzhen, 39 Liotta, Antonio, 286 Liu, Jueyi, 474 Locicero, Giorgio, 675 Lotker, Yuri, 508 M Ma, Manqing, 167 Mäkinen, Ilkka H., 275 Malinskii, Igor, 450 Mallett, Jacky, 571 Mandalios, Alexios, 92 Miachon, Cédric, 651
Micale, Giovanni, 675 Michaud, Jérôme, 275 Mignot, Sylvain, 310 Mironov, Sergei, 450 Miyagi, Shigeyuki, 231, 398 Miyazawa, Hajime, 53 Mizui, Yasutaka, 398 Moschoyiannis, Sotiris, 361 Murata, Tsuyoshi, 53 Murić, Goran, 3 Muscolino, Alessandro, 386 N Nathan, Eisha, 664 Nikolentzos, Giannis, 117, 142 Nsour, Faisal, 532 O Óskarsdóttir, María, 571 P Papadopoulos, Symeon, 610 Papagiannis, Georgios, 361 Peito, Joel, 130 Pérennes, Stéphane, 437, 597 Petit, Mathieu, 218 Phoa, Frederick Kin Hing, 497 Prasad, Abhishek, 700 Pulvirenti, Alfredo, 386, 675 R Rahaman, Inzamam, 298 Ren, Jiaqi, 547 Rennard, Virgile, 117 Rezapour, Rezvaneh, 623 Ribeiro Xavier, Carolina, 104 Ribeiro, Pedro, 688 Rivera, Adín Ramírez, 142 Robardet, Céline, 462 S Sakai, Osamu, 231, 398 Sani, Samane Abbasi, 335 Santos Alves, Antônio Pedro, 104 Sayama, Hiroki, 532 Sefer, Emre, 410 Segretain, Rémi, 372 Sidorov, Sergei, 450 Stamou, Giorgos, 92 Stefánsson, Alexander Snær, 571
Sui, Yuze, 474 Szilva, Attila, 275 Szymanski, Boleslaw K., 167 T Tagarelli, Andrea, 206 Takes, Frank W., 79 Thomas, Michalis, 142 Treur, Jan, 245, 260 Trilling, Laurent, 372 Trolliet, Thibaud, 437, 597 V van den Herik, H. Jaap, 79 Vazirgiannis, Michalis, 92, 117, 142 Veenman, Cor J., 79 Vignes, Annick, 310
W Wehmuth, Klaus, 520 X Xing, Lizhi, 547, 559 Y Yadav, Gitanjali, 700 Yaliraki, Sophia N., 154 Yamamoto, Keigo, 231 Yutin, Matvey, 639 Z Zand, Hanie, 335 Zenil, Hector, 520 Zhou, Xueguang, 474 Zhu, Ling, 474 Ziviani, Artur, 520