English Pages 113 Year 2018
Springer Series in the Data Sciences
Lijun Chang · Lu Qin
Cohesive Subgraph Computation over Large Sparse Graphs Algorithms, Data Structures, and Programming Techniques
Springer Series in the Data Sciences

Series Editors:
David Banks, Duke University, Durham
Jianqing Fan, Princeton University, Princeton
Michael Jordan, University of California, Berkeley
Ravi Kannan, Microsoft Research Labs, Bangalore
Yurii Nesterov, Université Catholique de Louvain, Louvain-la-Neuve
Christopher Ré, Stanford University, Stanford
Ryan Tibshirani, Carnegie Mellon University, Pittsburgh
Larry Wasserman, Carnegie Mellon University, Pittsburgh
Springer Series in the Data Sciences focuses primarily on monographs and graduate level textbooks. The target audience includes students and researchers working in and across the fields of mathematics, theoretical computer science, and statistics. Data Analysis and Interpretation is a broad field encompassing some of the fastest-growing subjects in interdisciplinary statistics, mathematics and computer science. It encompasses a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, including diverse techniques under a variety of names, in different business, science, and social science domains. Springer Series in the Data Sciences addresses the needs of a broad spectrum of scientists and students who are utilizing quantitative methods in their daily research. The series is broad but structured, including topics within all core areas of the data sciences. The breadth of the series reflects the variation of scholarly projects currently underway in the field of machine learning.
More information about this series at http://www.springer.com/series/13852
Lijun Chang School of Computer Science The University of Sydney Sydney, NSW, Australia
Lu Qin Centre for Artificial Intelligence University of Technology Sydney Sydney, NSW, Australia
ISSN 2365-5674 ISSN 2365-5682 (electronic) Springer Series in the Data Sciences ISBN 978-3-030-03598-3 ISBN 978-3-030-03599-0 (eBook) https://doi.org/10.1007/978-3-030-03599-0 Library of Congress Control Number: 2018962869 Mathematics Subject Classification: 05C85, 05C82, 91D30 © Springer Nature Switzerland AG 2018 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my wife, Xi
my parents, Qiyuan and Yumei
Lijun Chang

To my wife, Michelle
my parents, Hanmin and Yaping
Lu Qin
Preface
The graph model has been widely used to represent the relationships among entities in a wide spectrum of applications such as social networks, communication networks, collaboration networks, information networks, and biological networks. As a result, we are nowadays facing a tremendous number of large real-world graphs. For example, SNAP [49] and Network Repository [71] are two representative graph repositories hosting thousands of real graphs. The availability of rich graph data not only brings great opportunities for realizing the value of data to serve key applications but also brings great challenges in computation.

The main purpose of this book is to survey the recent technical developments on efficiently processing large sparse graphs, in view of the fact that real graphs are usually sparse graphs. Algorithms designed for large sparse graphs should be analyzed with respect to the number of edges in a graph, and ideally should run in time linear or near-linear in the number of edges. In this book, we illustrate the general techniques and principles, toward efficiently processing large sparse graphs with millions of vertices and billions of edges, through the problems of cohesive subgraph computation. Although real graphs are sparsely connected from a global point of view, they usually contain subgraphs that are locally densely connected [11]. Computing cohesive/dense subgraphs can either be the main goal of a graph analysis task or act as a preprocessing step aiming to reduce/trim the graph by removing sparse/unimportant parts such that more complex and time-consuming analysis can be conducted. In the literature, the cohesiveness of a subgraph is usually measured by the minimum degree, the average degree, their higher-order variants, or edge connectivity. Cohesive subgraph computation based on different cohesiveness measures extracts cohesive subgraphs with different properties and also requires different levels of computational effort.
The book can be used as an extended survey for people who are interested in cohesive subgraph computation, as a reference book for a postgraduate course on the related topics, or as a guideline book for writing effective C/C++ programs to efficiently process real graphs with billions of edges. In this book, we
will introduce algorithms, in the form of pseudocode, analyze their time and space complexities, and also discuss their implementations. C/C++ code for all the data structures and some of the presented algorithms is available at the author's GitHub website.¹

Organization. The book is organized as follows. In Chapter 1, we present the preliminaries of large sparse graph processing, including characteristics of real-world graphs and the representation of large sparse graphs in main memory. In this chapter, we also briefly introduce the problems of cohesive subgraph computation over large sparse graphs and their applications. In Chapter 2, we illustrate three data structures (specifically, linked list-based linear heap, array-based linear heap, and lazy-update linear heap) that are useful for algorithm design in the remaining chapters of the book. In Chapter 3, we focus on minimum degree-based graph decomposition (aka core decomposition); that is, compute the maximal subgraphs with minimum degree at least k (called k-core), for all different k values. We present an algorithm to conduct core decomposition in O(m) time, where m is the number of edges in a graph, and also discuss h-index-based local algorithms that have higher time complexities but can be naturally parallelized. In Chapter 4, we study the problem of computing the subgraph with the maximum average degree (aka densest subgraph). We present a 2-approximation algorithm that has O(m) time complexity, a 2(1 + ε)-approximation streaming algorithm, and also an exact algorithm based on minimum cut. In Chapter 5, we investigate higher-order variants of the problems studied in Chapters 3 and 4. As the building blocks of higher-order analysis of graphs are k-cliques, we first present triangle enumeration algorithms that run in O(α(G) × m) time and k-clique enumeration algorithms that run in O(k × (α(G))^(k−2) × m) time, where α(G) is the arboricity of a graph G and satisfies α(G) ≤ √m [20].
Then, we discuss how to extend the algorithms presented in Chapters 3 and 4 for higher-order core decomposition (specifically, truss decomposition and nucleus decomposition) and higher-order densest subgraph computation (specifically, k-clique densest subgraph computation), respectively. In Chapter 6, we discuss edge connectivity-based graph decomposition. Firstly, given an integer k, we study the problem of computing all maximal k-edge connected subgraphs in a given input graph. We present a graph partition-based approach to conduct this in O(h × l × m) time, where h and l are usually bounded by small constants for real-world graphs. Then, we present a divide-and-conquer approach, which invokes the graph partition-based approach as a building block, for computing the maximal k-edge connected subgraphs for all different k values in O((log α (G)) × h × l × m) time.
¹ https://github.com/LijunChang/Cohesive_subgraph_book
Acknowledgments. This book is partially supported by Australian Research Council Discovery Early Career Researcher Award (DE150100563).

Sydney, NSW, Australia    Lijun Chang
Sydney, NSW, Australia    Lu Qin
September 2018
Contents

1 Introduction
  1.1 Background
    1.1.1 Graph Terminologies
    1.1.2 Real Graph Datasets
    1.1.3 Representation of Large Sparse Graphs
    1.1.4 Complexity Analysis
  1.2 Cohesive Subgraphs
    1.2.1 Cohesive Subgraph Computation
    1.2.2 Applications

2 Linear Heap Data Structures
  2.1 Linked List-Based Linear Heap
    2.1.1 Interface of a Linked List-Based Linear Heap
    2.1.2 Time Complexity of ListLinearHeap
  2.2 Array-Based Linear Heap
    2.2.1 Interface of an Array-Based Linear Heap
    2.2.2 Time Complexity of ArrayLinearHeap
  2.3 Lazy-Update Linear Heap

3 Minimum Degree-Based Core Decomposition
  3.1 Preliminaries
    3.1.1 Degeneracy and Arboricity of a Graph
  3.2 Linear-Time Core Decomposition
    3.2.1 The Peeling Algorithm
    3.2.2 Compute k-Core
    3.2.3 Construct Core Hierarchy
  3.3 Core Decomposition in Other Environments
    3.3.1 h-index-Based Core Decomposition
    3.3.2 Parallel/Distributed Core Decomposition
    3.3.3 I/O-Efficient Core Decomposition
  3.4 Further Readings

4 Average Degree-Based Densest Subgraph Computation
  4.1 Preliminaries
    4.1.1 Properties of Densest Subgraph
  4.2 Approximation Algorithms
    4.2.1 A 2-Approximation Algorithm
    4.2.2 A Streaming 2(1 + ε)-Approximation Algorithm
  4.3 An Exact Algorithm
    4.3.1 Density Testing
    4.3.2 The Densest-Exact Algorithm
    4.3.3 Pruning for Densest Subgraph Computation
  4.4 Further Readings

5 Higher-Order Structure-Based Graph Decomposition
  5.1 k-Clique Enumeration
    5.1.1 Triangle Enumeration Algorithms
    5.1.2 k-Clique Enumeration Algorithms
  5.2 Higher-Order Core Decomposition
    5.2.1 Truss Decomposition
    5.2.2 Nucleus Decomposition
  5.3 Higher-Order Densest Subgraph Computation
    5.3.1 Approximation Algorithms
    5.3.2 Exact Algorithms
  5.4 Further Readings

6 Edge Connectivity-Based Graph Decomposition
  6.1 Preliminaries
  6.2 Deterministic k-Edge Connected Components Computation
    6.2.1 A Graph Partition-Based Framework
    6.2.2 Connectivity-Aware Two-Way Partition
    6.2.3 Connectivity-Aware Multiway Partition
    6.2.4 The KECC Algorithm
  6.3 Randomized k-Edge Connected Components Computation
  6.4 Edge Connectivity-Based Decomposition
    6.4.1 A Bottom-Up Approach
    6.4.2 A Top-Down Approach
    6.4.3 A Divide-and-Conquer Approach
  6.5 Further Readings

References
Index
Chapter 1
Introduction
With the rapid development of information technology such as social media, online communities, and mobile communications, huge volumes of digital data are accumulated with data entities involving complex relationships. These data are usually modelled as graphs in view of the simple yet strong expressive power of the graph model; that is, entities are represented by vertices and relationships are represented by edges. Managing large graphs and extracting knowledge and insights from them are in high demand in many key applications [93], including public health, science, engineering, business, environment, and more. The availability of rich graph data not only brings great opportunities for realizing the value of data to serve key applications but also brings great challenges in computation. This book surveys recent technical developments on efficiently processing large sparse graphs, in view of the fact that real graphs are usually sparse graphs. In this chapter, we first present in Section 1.1 the background information, including graph terminologies, some example real graphs that serve the purposes of illustrating properties of real graphs and of empirically evaluating algorithms, and the space-effective representation of large sparse graphs in main memory. Then, in Section 1.2 we briefly introduce the problem of cohesive subgraph computation and also discuss its applications.
1.1 Background

1.1.1 Graph Terminologies

In this book, we focus on unweighted and undirected graphs and consider only the interconnection structure (i.e., edges) among vertices of a graph, while ignoring possible attributes of vertices and edges. That is, we consider the simplest form of a graph that consists of a set of vertices and a set of edges.
We denote a graph by g or G. For a graph g, we let V(g) and E(g) denote the set of vertices and the set of edges of g, respectively, and we also represent g by (V(g), E(g)). We denote the edge between u and v by (u, v), the set of neighbors of a vertex u in g by

    Ng(u) = {v ∈ V(g) | (u, v) ∈ E(g)},

and the degree of u in g by

    dg(u) = |Ng(u)|.

We denote the minimum vertex degree, the average vertex degree, and the maximum vertex degree of g by dmin(g), davg(g), and dmax(g), respectively. Given a subset Vs of vertices of g (i.e., Vs ⊆ V(g)), we use g[Vs] to denote the subgraph of g induced by Vs; that is,

    g[Vs] = (Vs, {(u, v) ∈ E(g) | u ∈ Vs, v ∈ Vs}).

Given a subset of edges of g, Es ⊆ E(g), we use g[Es] to denote the subgraph of g induced by Es; that is,

    g[Es] = (⋃_{(u,v)∈Es} {u, v}, Es).
g[Vs] is referred to as a vertex-induced subgraph of g, while g[Es] is referred to as an edge-induced subgraph of g. Throughout the book, we use the notation G either in definitions or to specifically denote the input graph that we are going to process, while using g to denote a general (sub)graph. For the input graph G, we abbreviate V(G) and E(G) as V and E, respectively; that is, G = (V, E). We also omit the subscript G in other notations, e.g., d(u) and N(u). We denote the number of vertices and the number of undirected edges in G by n and m, respectively, which will be used for analyzing the time and space complexity of algorithms when taking G as the input graph. Without loss of generality, we assume that G is connected; that is, there is a path between every pair of vertices. We also assume that m ≥ n for presentation simplicity; note that every connected graph G satisfies m ≥ n − 1.
Fig. 1.1: An example unweighted undirected graph (drawing omitted; the graph has vertices v1, ..., v11)

Example 1.1. Figure 1.1 shows an example graph G consisting of 11 vertices and 13 undirected edges; that is, n = 11 and m = 13. The set of neighbors of v1 is N(v1) = {v2, v3, v4, v5}, and the degree of v1 is d(v1) = |N(v1)| = 4. The vertex-induced subgraph G[{v1, v2, v3, v4}] is a clique consisting of 4 vertices and 6 undirected edges.
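The notation of this section can be illustrated with a small C++ sketch. It is only an illustration under stated assumptions: the 13 edges are those that can be read off the adjacency array in Figure 1.3, vertex vi is stored as id i − 1, and the names example_edges, build_adjacency, and induced_edge_count are hypothetical helpers that do not come from the book's code.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<unsigned, unsigned>;

// The 13 undirected edges of the example graph; vertex v_i is stored as id i-1.
inline std::vector<Edge> example_edges() {
    return {{0, 1}, {0, 2}, {0, 3}, {0, 4}, {1, 2}, {1, 3}, {1, 7},
            {2, 3}, {2, 9}, {3, 10}, {4, 5}, {4, 6}, {7, 8}};
}

// adj[u] is the neighbor set N(u); its size is the degree d(u).
std::vector<std::vector<unsigned>> build_adjacency(unsigned n, const std::vector<Edge>& edges) {
    std::vector<std::vector<unsigned>> adj(n);
    for (const Edge& e : edges) {
        assert(e.first < n && e.second < n);
        adj[e.first].push_back(e.second);   // each undirected edge appears
        adj[e.second].push_back(e.first);   // in both neighbor sets
    }
    return adj;
}

// Number of edges of the vertex-induced subgraph g[Vs]:
// the edges of g whose two endpoints both belong to Vs.
std::size_t induced_edge_count(unsigned n, const std::vector<Edge>& edges,
                               const std::vector<unsigned>& vs) {
    std::vector<bool> in_vs(n, false);
    for (unsigned v : vs) in_vs[v] = true;
    std::size_t count = 0;
    for (const Edge& e : edges)
        if (in_vs[e.first] && in_vs[e.second]) ++count;
    return count;
}
```

With n = 11, build_adjacency(11, example_edges())[0] holds the ids of v2, v3, v4, and v5, so d(v1) = 4, and induced_edge_count on the ids of {v1, v2, v3, v4} counts the 6 edges of the clique in Example 1.1.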
1.1.2 Real Graph Datasets

In this book, we focus on techniques for efficiently processing real graphs that are obtained from real-life applications. In the following, we introduce several real graph data repositories and present some example real graphs that serve the purposes of illustrating properties of real graphs and of empirically evaluating algorithms in the remainder of the book.

Real Graph Data Repositories. Several real graph data repositories have been actively maintained by different research groups, which in total cover thousands of real graphs. A few example repositories are as follows:

• Stanford Network Analysis Project (SNAP) [49] maintains a collection of more than 50 large network datasets, ranging from tens of thousands of vertices and edges to tens of millions of vertices and billions of edges. It includes social networks, web graphs, road networks, internet networks, citation networks, collaboration networks, and communication networks.
• Laboratory for Web Algorithmics (LAW) [12] hosts a set of large networks with sizes of up to 1 billion vertices and tens of billions of edges. The networks of LAW are mainly web graphs and social networks.
• Network Repository [71] is a large network repository archiving thousands of graphs with up to billions of vertices and tens of billions of edges.
• The Koblenz Network Collection (KONECT)¹ contains several hundred network datasets with up to tens of millions of vertices and billions of edges. The networks of KONECT cover many diverse areas such as social networks, hyperlink networks, authorship networks, physical networks, interaction networks, and communication networks.

Five Real-World Graphs. We choose five real-world graphs from different domains to show the characteristics of real-world graphs; these graphs will also be used to demonstrate the performance of algorithms and data structures in the remainder of the book.
The graphs are as-Skitter, soc-LiveJournal1, twitter-2010, uk-2005, and it-2004; the first two are downloaded from SNAP, while the remaining three are downloaded from LAW. as-Skitter is an internet topology graph. soc-LiveJournal1 is an online community, where members maintain journals and make friends. twitter-2010 is a social network. uk-2005 and it-2004 are two web graphs crawled within the .uk and .it domains, respectively. For each graph, we make its edges undirected, remove duplicate edges, and then choose the largest connected component (i.e., giant component) as the corresponding graph. Statistics of the five graphs are given in Table 1.1, where the last column shows the degeneracy (see Chapter 3) of G. We can see that davg(G) ≪ n holds for all these graphs; that is, real-world graphs are usually sparse graphs.
¹ http://konect.uni-koblenz.de/
Graphs             n            m               davg(G)   dmax(G)     δ(G)
as-Skitter         1,694,616    11,094,209      13.09     35,455      111
soc-LiveJournal1   4,843,953    42,845,684      17.69     20,333      372
uk-2005            39,252,879   781,439,892     39.82     1,776,858   588
it-2004            41,290,577   1,027,474,895   49.77     1,326,744   3,224
twitter-2010       41,652,230   1,202,513,046   57.74     2,997,487   2,488

Table 1.1: Statistics of five real graphs (δ(G) is the degeneracy of G)
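The davg(G) column of Table 1.1 can be reproduced from n and m, since every undirected edge contributes to the degree of both of its endpoints, giving davg(G) = 2m/n. The following C++ sketch shows this arithmetic; the function names average_degree and edge_density are illustrative assumptions, and edge_density (the fraction of all possible edges that are present) is added here only to make the sparsity visible.

```cpp
#include <cassert>
#include <cstdint>

// Average degree of an undirected graph with n vertices and m edges:
// every edge is counted in the degree of both of its endpoints.
double average_degree(std::uint64_t n, std::uint64_t m) {
    assert(n > 0);
    return 2.0 * static_cast<double>(m) / static_cast<double>(n);
}

// Fraction of the n(n-1)/2 possible undirected edges that are present.
double edge_density(std::uint64_t n, std::uint64_t m) {
    assert(n > 1);
    return 2.0 * static_cast<double>(m) /
           (static_cast<double>(n) * static_cast<double>(n - 1));
}
```

For twitter-2010, average_degree(41652230, 1202513046) is about 57.74, which is tiny compared with n, and the corresponding edge density is on the order of 10^-6.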
Fig. 1.2: Degree distributions of (a) as-Skitter, (b) soc-LiveJournal1, (c) it-2004, and (d) twitter-2010 (plots omitted; x-axis: Degree, y-axis: #Vertices)

Figure 1.2 shows the degree distribution for four of the graphs. Note that both the x-axis and the y-axis are in log scale. On such log-log axes, a power-law distribution appears as a roughly straight line, and the plotted degree distributions approximately follow this pattern; this demonstrates that real-world graphs are usually power-law graphs.
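A degree distribution such as the ones in Figure 1.2 is simply a histogram of vertex degrees. The sketch below shows one way to compute it; the function name degree_histogram and the choice of a dense vector indexed by degree are assumptions made for illustration, not the procedure used to produce the figure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hist[d] = number of vertices whose degree is exactly d.
std::vector<std::size_t> degree_histogram(const std::vector<unsigned>& degree) {
    unsigned dmax = 0;
    for (unsigned d : degree)
        if (d > dmax) dmax = d;
    std::vector<std::size_t> hist(static_cast<std::size_t>(dmax) + 1, 0);
    for (unsigned d : degree) ++hist[d];
    // Sanity check: every vertex is counted exactly once.
    std::size_t total = 0;
    for (std::size_t c : hist) total += c;
    assert(total == degree.size());
    return hist;
}
```

For the graph of Figure 1.1, whose degrees are 4, 4, 4, 4, 3, 1, 1, 2, 1, 1, 1, the histogram has five vertices of degree 1 and four of degree 4. Plotting hist[d] against d with both axes in log scale would give a plot in the style of Figure 1.2.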
1.1.3 Representation of Large Sparse Graphs

There are two standard ways to represent a graph in main memory [24]: adjacency matrix and adjacency list. For a graph with n vertices and m edges, the adjacency matrix representation consumes Θ(n²) space, while the adjacency list representation consumes Θ(n + m) space. As we are dealing with large sparse graphs containing tens of millions (or even hundreds of millions) of vertices in this book (see Table 1.1), the adjacency matrix representation is not feasible for such a large number of vertices. On the other hand, an adjacency list consumes more space than an
array due to explicitly storing pointers in the adjacency list, and moreover, accessing linked lists has the pointer-chasing issue that usually results in random memory access. Thus, we use a variant of the adjacency list representation, called adjacency array representation, which is also known as the Compressed Sparse Row (CSR) representation in the literature [73]. Note that, as we focus on static graphs in this book, the adjacency array representation is sufficient; however, if the input graph dynamically grows (i.e., new edges are continuously added), then the adjacency list representation or other representations may be required.

In the adjacency array representation, an unweighted graph is represented by two arrays, denoted pstart and edges. It assumes that each of the n vertices of G takes a distinct id from {0, . . . , n − 1}; note that if this assumption does not hold, then a mapping from V to {0, . . . , n − 1} can be explicitly constructed. Thus, in the remainder of the book, we also occasionally use i (0 ≤ i ≤ n − 1) to denote a vertex. The adjacency array representation stores the set of neighbors of each vertex consecutively in an array (rather than a linked list as done in the adjacency list representation) and then concatenates all such arrays into a single large array edges, by putting the neighbors of vertex 0 first, followed by the neighbors of vertex 1, and so on. The start position (i.e., index) of the set of neighbors of vertex i in the array edges is stored in pstart[i], while pstart[n] stores the length of the array edges. In this way, the degree of vertex i can be obtained in constant time as pstart[i + 1] − pstart[i], and the set of neighbors of vertex i is stored consecutively in the subarray edges[pstart[i], . . . , pstart[i + 1] − 1]. As a result, the neighbors of each vertex occupy consecutive space in the main memory, which can improve the cache hit-rate.
Note that this representation also supports the removal of edges from the graph. That is, we move all the remaining neighbors of vertex i to be consecutive in edges starting at position pstart[i], and we use another array pend to explicitly store in pend[i] the last position of the neighbors of i.

    vertex:  v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  v11
    pstart:  0   4   8   12  16  19  20  21  23  24   25   26

    edges:   v2 v3 v4 v5 | v1 v3 v4 v8 | v1 v2 v4 v10 | v1 v2 v3 v11 | v1 v6 v7 | v5 | v5 | v2 v9 | v8 | v3 | v4

Fig. 1.3: Adjacency array graph representation (vertical bars separate the neighbor lists of consecutive vertices)

Figure 1.3 demonstrates the adjacency array representation for the graph in Figure 1.1. It is easy to see that, by the adjacency array representation, an unweighted undirected graph with n vertices and m undirected edges can be stored in main memory by n + 2m + 1 machine words; note that here each undirected edge is stored twice in the graph representation, once for each direction. An example C++ code for allocating memory to store the graph G is shown in Listing 1.1. Here, pstart and edges are two arrays of the data type unsigned int. For presentation simplicity, we define unsigned int as uint in Listing 1.1, which will also be used in the data structures presented in Chapter 2. Note that, as the range of unsigned int in a typical machine nowadays is from 0 to 4,294,967,295, the
example C++ code in Listing 1.1 can be used to store a graph containing up to 2 billion undirected edges; for storing larger graphs, the data type of pstart, or even edges, needs to be changed to long.

Listing 1.1: Graph memory allocation

    typedef unsigned int uint;
    uint *pstart = new uint[n+1]; // pstart[i]: start position of the neighbors of vertex i in edges
    uint *edges = new uint[2*m];  // each undirected edge is stored twice, once for each direction
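To make the adjacency array representation concrete, the following C++ sketch builds pstart and edges from an edge list and supports edge removal through pend. It is a minimal sketch under stated assumptions rather than the book's implementation: it uses std::vector instead of the raw arrays of Listing 1.1, the names AdjacencyArray, build, degree, remove_neighbor, and remove_edge are hypothetical, and pend[i] is kept as a one-past-the-end position (the text describes pend[i] as the last position), which avoids an unsigned underflow when a vertex has no remaining neighbors.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef unsigned int uint;

struct AdjacencyArray {
    std::vector<uint> pstart; // n+1 entries; pstart[n] is the length of edges
    std::vector<uint> edges;  // 2m entries: neighbors of vertex 0, then of vertex 1, ...
    std::vector<uint> pend;   // assumption: pend[i] is one past the last remaining neighbor of i
};

AdjacencyArray build(uint n, const std::vector<std::pair<uint, uint>>& edge_list) {
    AdjacencyArray g;
    g.pstart.assign(n + 1, 0);
    // Count degrees: pstart[v + 1] temporarily holds d(v).
    for (const auto& e : edge_list) {
        assert(e.first < n && e.second < n);
        ++g.pstart[e.first + 1];
        ++g.pstart[e.second + 1];
    }
    // Prefix sums turn the degrees into start positions.
    for (uint i = 0; i < n; ++i) g.pstart[i + 1] += g.pstart[i];
    g.edges.resize(g.pstart[n]);
    // Store each undirected edge twice; next[v] is the next free slot of v.
    std::vector<uint> next(g.pstart.begin(), g.pstart.end() - 1);
    for (const auto& e : edge_list) {
        g.edges[next[e.first]++] = e.second;
        g.edges[next[e.second]++] = e.first;
    }
    g.pend.assign(g.pstart.begin() + 1, g.pstart.end());
    return g;
}

// Current degree of vertex i, in constant time.
uint degree(const AdjacencyArray& g, uint i) { return g.pend[i] - g.pstart[i]; }

// Remove v from the neighbor list of u, keeping the remaining
// neighbors consecutive starting at position pstart[u].
bool remove_neighbor(AdjacencyArray& g, uint u, uint v) {
    for (uint p = g.pstart[u]; p < g.pend[u]; ++p) {
        if (g.edges[p] == v) {
            for (uint q = p + 1; q < g.pend[u]; ++q) g.edges[q - 1] = g.edges[q];
            --g.pend[u];
            return true;
        }
    }
    return false;
}

// Remove the undirected edge (u, v), i.e., both stored directions.
bool remove_edge(AdjacencyArray& g, uint u, uint v) {
    bool a = remove_neighbor(g, u, v);
    bool b = remove_neighbor(g, v, u);
    return a && b;
}
```

Calling build with n = 11 and the 13 edges of Figure 1.1 (vertex vi stored as id i − 1, edges listed in increasing order of endpoints) yields the pstart array 0, 4, 8, 12, 16, 19, 20, 21, 23, 24, 25, 26 of Figure 1.3. The linear scan in remove_neighbor is chosen for brevity; it costs time proportional to the degree of u.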
1.1.4 Complexity Analysis

In this book, we will provide time complexity analysis for all the presented algorithms by using the big-O notation O(·). Specifically, for two given functions f(n) and f′(n), f′(n) ∈ O(f(n)) if there exist positive constants c and n0 such that f′(n) ≤ c × f(n) for all n ≥ n0; note that O(·) denotes a set of functions. Occasionally, we will also use the Θ-notation. Specifically, for two given functions f(n) and f′(n), f′(n) ∈ Θ(f(n)) if there exist positive constants c1, c2, and n0 such that c1 × f(n) ≤ f′(n) ≤ c2 × f(n) for all n ≥ n0.

As we aim to process large graphs with billions of edges in main memory, it is also important to keep the memory consumption of an algorithm small such that larger graphs can be processed with the available main memory. Thus, we also analyze the space complexities of algorithms in this book. As the number m of edges usually is much larger than the number n of vertices for large real-world graphs (see Table 1.1), we analyze the space complexity in the form of c × m + O(n) by explicitly specifying the constant c, since c × m usually is the dominating factor. Recall that the adjacency array representation of a graph in Section 1.1.3 consumes 2m + O(n) memory space. Thus, if an algorithm takes only O(n) extra memory space besides the graph representation, then a graph with 1 billion undirected edges may be processed on a machine with 16GB of main memory, which is common for today's commodity machines. Note that a graph with 1 billion undirected edges takes slightly more than 8GB of main memory to store in the adjacency array representation, as the array edges alone holds 2 billion 4-byte entries.
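The memory estimate above is plain arithmetic over the n + 2m + 1 machine words of the adjacency array representation. The C++ sketch below spells it out; the function name adjacency_array_bytes and the default of 4 bytes per entry (matching unsigned int on typical machines) are assumptions of this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Bytes used by the adjacency array representation:
// (n + 1) entries of pstart plus 2m entries of edges.
std::uint64_t adjacency_array_bytes(std::uint64_t n, std::uint64_t m,
                                    std::uint64_t word_bytes = 4) {
    assert(word_bytes > 0);
    return (n + 1 + 2 * m) * word_bytes;
}
```

For a hypothetical graph with m = 1,000,000,000 undirected edges and n = 41,652,230 vertices (the vertex count of twitter-2010, borrowed here only as a plausible value), the function returns 8,166,608,924 bytes, that is, slightly more than 8GB.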
1.2 Cohesive Subgraphs

In this book, we illustrate the general techniques and principles towards efficiently processing large real-world graphs with millions of vertices and billions of edges, through the problems of cohesive subgraph computation.
1.2.1 Cohesive Subgraph Computation

A common phenomenon of real-world graphs is that they are usually globally sparse but locally dense [11]. That is, the entire graph is sparse in terms of having a small average degree (e.g., in the order of tens), but it contains subgraphs that are cohesive/dense (e.g., it contains a large clique of up to thousands of vertices). Thus, it is of great importance to extract cohesive subgraphs from large sparse graphs.

Given an input graph G, cohesive subgraph computation is either to find all maximal subgraphs of G whose cohesiveness values are at least k for all possible k values, or to find the subgraph of G with the largest cohesiveness value. Here, the cohesiveness value of a subgraph g is solely determined by the structure of g while being independent of other parts of G that are not in g, and sometimes is also referred to as the density of the subgraph; thus, a cohesive subgraph sometimes is also called a dense subgraph. In this book, we focus on cohesive subgraph computation based on the following commonly used cohesiveness measures:

1. Minimum degree (aka k-core, see Chapter 3); that is, the maximal subgraph whose minimum degree is at least k, which is called k-core. The problem is either to compute the k-core for a user-given k or to compute k-cores for all possible k values [59, 76, 81].
2. Average degree (aka dense subgraph, see Chapter 4); that is, a subgraph with average degree at least k. The problem studied usually is to compute the subgraph with the largest average degree (i.e., densest subgraph) [18, 35].
3. Higher-order variants of k-core and densest subgraph (see Chapter 5); for example, the maximal subgraph in which each edge participates in at least k triangles within the subgraph (i.e., k-truss) [22, 92], and the subgraph where the average number of triangles each vertex participates in is the largest (i.e., triangle-densest subgraph) [89].
4. Edge connectivity (aka k-edge connected components, see Chapter 6); that is, the maximal subgraphs each of which is k-edge connected. The problem studied is either to compute the k-edge connected components for a user-given k [3, 17, 102] or to compute k-edge connected components for all possible k values [16, 99].

Besides the above commonly used ones, other cohesiveness measures have also been defined in the literature [48]. For example, a graph g is a clique if each vertex is connected to all other vertices [80] (i.e., |E(g)| = |V(g)|(|V(g)| − 1)/2); a graph g is a γ-quasi-clique if at least a γ portion of its vertex pairs are connected by edges (i.e., |E(g)| ≥ γ × |V(g)|(|V(g)| − 1)/2) [1]; a graph g is a k-plex if every vertex of g is connected to all but no more than (k − 1) other vertices (i.e., dg(u) ≥ |V(g)| − k for each u ∈ V(g)) [82]. Nevertheless, cohesive subgraph computation based on these definitions usually leads to NP-hard problems, and thus is generally computationally too expensive to be applied to large graphs [23]. Consequently, we do not consider these alternative cohesiveness measures in this book.
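Several of the cohesiveness measures above are simple functions of the degrees and the edge count of a (sub)graph g. The following C++ sketch evaluates the minimum degree, the average degree, and the γ-quasi-clique and k-plex conditions for a small graph given as an edge list. It only checks the definitions on a given g; it does not search for cohesive subgraphs, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Edge = std::pair<unsigned, unsigned>;

// Degrees of a simple undirected graph g with vertices 0..n-1.
std::vector<unsigned> degrees(unsigned n, const std::vector<Edge>& edges) {
    std::vector<unsigned> d(n, 0);
    for (const Edge& e : edges) {
        assert(e.first < n && e.second < n && e.first != e.second);
        ++d[e.first];
        ++d[e.second];
    }
    return d;
}

// Minimum degree d_min(g), the measure behind k-core.
unsigned min_degree(const std::vector<unsigned>& d) {
    assert(!d.empty());
    return *std::min_element(d.begin(), d.end());
}

// Average degree d_avg(g) = 2|E(g)| / |V(g)|, the measure behind densest subgraph.
double avg_degree(unsigned n, std::size_t m) {
    assert(n > 0);
    return 2.0 * static_cast<double>(m) / n;
}

// gamma-quasi-clique: |E(g)| >= gamma * |V(g)|(|V(g)| - 1)/2; gamma = 1 means clique.
bool is_quasi_clique(unsigned n, std::size_t m, double gamma) {
    return static_cast<double>(m) >= gamma * n * (n - 1.0) / 2.0;
}

// k-plex: d_g(u) >= |V(g)| - k for every vertex u of g.
bool is_k_plex(const std::vector<unsigned>& d, unsigned k) {
    for (unsigned du : d)
        if (du + k < d.size()) return false;
    return true;
}
```

For example, the 4-vertex clique with one edge removed has minimum degree 2 and average degree 2.5; it is not a clique (γ = 1) but is a 0.8-quasi-clique and a 2-plex.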
1.2.2 Applications

Cohesive subgraph computation can either be the main goal of a graph analysis task or act as a preprocessing step that reduces/trims the graph by removing sparse/unimportant parts, so that more complex and time-consuming analysis can then be conducted. Some of the applications of cohesive subgraph computation are illustrated as follows.

Community Search. Cohesive subgraphs can be naturally regarded as communities [42]. In the literature on community search, which computes the communities for a given set of query users, there are many models based on cohesive subgraphs, e.g., k-core-based community search [10, 84] and k-truss-based community search [41]. A tutorial on cohesive subgraph-based community search was given at ICDE 2017 [42].

Locating Influential Nodes. Cohesive subgraph detection has been used to identify the entities in a network that act as influential spreaders for propagating information to a large portion of the network [46, 55]. Analysis on real networks shows that vertices belonging to the maximal k-truss subgraph for the largest k show good spreading behavior [55], leading to fast and wide epidemic spreading. Moreover, vertices belonging to such dense subgraphs dominate the small set of vertices that achieve the optimal spreading in the network [55].

Keyword Extraction from Text. The graph-of-words representation has been shown to be very promising for text mining tasks [91]. In this representation, each distinct word is represented by a vertex, and there is an edge between two words if they co-occur within a sliding window. It has been shown in [72, 88] that keywords in the main core (i.e., the k-core with the largest k) or main truss (i.e., the k-truss with the largest k) are more likely to form higher-order n-grams. A tutorial on cohesive subgraph-based keyword extraction from text was given at EMNLP 2017 [57].

Link Spam Detection. Dense subgraph detection is a useful primitive for spam detection [33]. A study in [33] shows that many of the dense bipartite subgraphs in a web graph are link spam, i.e., websites that attempt to manipulate search engine rankings through aggressive interlinking to simulate popular content.

Real-Time Story Identification. A graph can be used to represent entities and their relationships that are mentioned in the texts of an online social media platform such as Twitter, where edge weights correspond to the pairwise association strengths of entities. It is shown in [4] that, given such a graph, a cohesive group of strongly associated entity pairs usually indicates an important story.
Chapter 2
Linear Heap Data Structures
In this chapter, we present linear heap data structures that will be useful in the remainder of the book for designing algorithms to efficiently process large sparse graphs. In essence, the linear heap data structures can be used to replace Fibonacci Heap and Binary Heap [24] to achieve better time complexity as well as better practical performance, when some assumptions are satisfied. In general, the linear heap data structures store (element, key) pairs and make two assumptions, regarding element and key, respectively. Firstly, the total number of distinct elements is denoted by n, and the elements are 0, 1, ..., n − 1; note that, in this book, each element usually corresponds to a vertex of G. Secondly, let key_cap be the upper bound of the values of key; then the possible values of key are integers in the range [0, key_cap]; note that, in this book, key_cap is usually bounded by n. By utilizing the above two assumptions, the linear heap data structures can support updating the key of an element in constant time and also support retrieving/removing the element with the minimum (or maximum) key in amortized constant time. Recall that the famous Fibonacci Heap is constant-time updatable but has a logarithmic popping cost [24].
2.1 Linked List-Based Linear Heap

The linked list-based linear heap organizes all elements with the same key value by a doubly linked list [17]; that is, there is one doubly linked list for each distinct key value. Moreover, the heads (i.e., the first elements) of all such doubly linked lists are stored in an array heads, such that the doubly linked list for a specific key value key can be retrieved in constant time from heads[key]. For memory efficiency, the information of the doubly linked lists is stored in two arrays, pres and nexts, such that the elements that precede and follow element i in the doubly linked list of i are pres[i] and nexts[i], respectively. In addition, there is an array keys for storing the key values of elements; that is, the key value of element i is keys[i].

© Springer Nature Switzerland AG 2018. L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0_2
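(a) Conceptual view: one doubly linked list per key value
  heads[4]: v4, v3, v2, v1
  heads[3]: v5
  heads[2]: v8
  heads[1]: v11, v10, v9, v7, v6
  heads[k] is empty (-) for every other key value k in 0, ..., 10

(b) Actual storage
  key value:  0    1    2    3    4    5    6    7    8    9    10
  heads:      -    v11  v8   v5   v4   -    -    -    -    -    -

  element:    v1   v2   v3   v4   v5   v6   v7   v8   v9   v10  v11
  keys:       4    4    4    4    3    1    1    2    1    1    1
  pres:       v2   v3   v4   -    -    v7   v9   -    v10  v11  -
  nexts:      -    v1   v2   v3   -    -    v6   -    v7   v9   v10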
Fig. 2.1: An example of linked list-based linear heap

For example, Figure 2.1 illustrates the linked list-based linear heap constructed for the vertices in Figure 1.1, where the key value of a vertex is its degree. Figure 2.1(a) shows the conceptual view in the form of adjacency lists, while Figure 2.1(b) shows the actual data stored for the data structure; here, n = 11 and key_cap = 10. Note that the dashed arrows in Figure 2.1 are for illustration purposes only and are not actually stored. Each index in the array heads corresponds to a key value, while each index in the arrays pres, nexts, and keys corresponds to an element; note that the same index in the arrays pres, nexts, and keys corresponds to the same element.
2.1.1 Interface of a Linked List-Based Linear Heap

The interface of the linked list-based linear heap, denoted ListLinearHeap, is given in Listing 2.1, while the full C++ code is available online as a header file.¹ ListLinearHeap has member variable n storing the total number of possible distinct elements, key_cap storing the maximum allowed key value, and four arrays keys, pres, nexts, and heads for representing the doubly linked lists. In addition, ListLinearHeap also stores max_key (an upper bound of the current maximum key value) and min_key (a lower bound of the current minimum key value), which are used to retrieve the element with the maximum key value and the element with the minimum key value, respectively. In ListLinearHeap, the elements range over {0, ..., n − 1}, while the key values range over {0, ..., key_cap}. The arrays keys, pres, and nexts are of size n, and the array heads is of size key_cap + 1. Thus, the space complexity of ListLinearHeap is 3n + key_cap + O(1). Note that, to use ListLinearHeap after constructing it, the member function init needs to be invoked first to properly initialize the member variables before invoking any other member functions.

¹ https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/ListLinearHeap.h
Listing 2.1: Interface of a linked list-based linear heap

    class ListLinearHeap {
    private:
        uint n;        // total number of possible distinct elements
        uint key_cap;  // maximum allowed key value
        uint max_key;  // upper bound of the current maximum key value
        uint min_key;  // lower bound of the current minimum key value
        uint *keys;    // key values of elements
        uint *heads;   // the first element in a doubly linked list
        uint *pres;    // previous element in a doubly linked list
        uint *nexts;   // next element in a doubly linked list

    public:
        ListLinearHeap(uint _n, uint _key_cap);
        ~ListLinearHeap();
        void init(uint _n, uint _key_cap, uint *_elems, uint *_keys);
        void insert(uint element, uint key);
        uint remove(uint element);
        uint get_n() { return n; }
        uint get_key_cap() { return key_cap; }
        uint get_key(uint element) { return keys[element]; }
        uint increment(uint element, uint inc);
        uint decrement(uint element, uint dec);
        bool get_max(uint &element, uint &key);
        bool pop_max(uint &element, uint &key);
        bool get_min(uint &element, uint &key);
        bool pop_min(uint &element, uint &key);
    };
Algorithm 1: init(_n, _key_cap, _elems, _keys)
    /* Initialize max_key, min_key and heads */
    max_key ← 0; min_key ← key_cap;
    for key ← 0 to key_cap do heads[key] ← null;
    /* Insert (element, key) pairs into the data structure */
    for i ← 0 to n − 1 do insert(_elems[i], _keys[i]);
Initialize a Linked List-Based Linear Heap (init). The main task of init is to: (1) allocate proper memory space for the data structure, (2) assign proper initial values to max_key, min_key, and heads, and (3) insert the (element, key) pairs, supplied in the input parameters to init, into the data structure. The pseudocode of init is shown in Algorithm 1, where memory allocation is omitted. Specifically, max_key is initialized as 0, min_key is initialized as key_cap, and heads[key] is initialized as nonexistent (denoted by null) for each distinct key value. Each (element, key) pair is inserted into the data structure by invoking the member function insert.
Algorithm 2: insert(element, key)
    /* Update doubly linked list */
    keys[element] ← key; pres[element] ← null; nexts[element] ← heads[key];
    if heads[key] ≠ null then pres[heads[key]] ← element;
    heads[key] ← element;
    /* Update min_key and max_key */
    if key < min_key then min_key ← key;
    if key > max_key then max_key ← key;
Insert/Remove an Element into/from a Linked List-Based Linear Heap. The pseudocode of inserting an (element, key) pair into the data structure is shown in Algorithm 2, which puts element at the beginning of the doubly linked list pointed to by heads[key]. Note that, after inserting (element, key) into the data structure, the values of max_key and min_key may also be updated.
Algorithm 3: remove(element)
    if pres[element] = null then
        /* element is at the beginning of a doubly linked list */
        heads[keys[element]] ← nexts[element];
        if nexts[element] ≠ null then pres[nexts[element]] ← null;
    else
        nexts[pres[element]] ← nexts[element];
        if nexts[element] ≠ null then pres[nexts[element]] ← pres[element];
    return keys[element];
To remove an element from the data structure, the doubly linked list containing element is updated by adding a direct link between the immediately preceding element pres[element] and the immediately succeeding element nexts[element] of element. The pseudocode is given in Algorithm 3, which returns the key value of the removed element. Note that if element is at the beginning of a doubly linked list, then heads[keys[element]] also needs to be updated.
Algorithm 4: decrement(element, dec)
    key ← remove(element);
    key ← key − dec;
    insert(element, key);
    return key;
Update the Key Value of an Element. To update the key value of an element, the element is first removed from the doubly linked list corresponding to the key value keys[element] and is then inserted into the doubly linked list corresponding to the updated key value. As a result, the key value of element in the data structure is updated, and moreover min_key and max_key may also be updated by the new key value of element. The pseudocode of decrement is shown in Algorithm 4, which returns the updated key value of element. The pseudocode of increment is similar and is omitted.

Algorithm 5: pop_min(element, key)
    while min_key ≤ max_key and heads[min_key] = null do
        min_key ← min_key + 1;
    if min_key > max_key then
        return false;
    else
        element ← heads[min_key]; key ← min_key;
        remove(element);
        return true;
Pop/Get Min/Max from a Linked List-Based Linear Heap. To pop the element with the minimum key value from the data structure, the value of min_key is first updated to be the smallest value such that heads[min_key] ≠ null. If such a min_key exists, then all elements in the doubly linked list pointed to by heads[min_key] have the same minimum key value, and the first element is removed from the data structure and is returned. Otherwise, min_key is updated to be larger than max_key, which means that the data structure currently contains no element. The pseudocode is shown in Algorithm 5. The pseudocodes of get_min, pop_max, and get_max are similar and are omitted. Note that, during the execution of the data structure, the value of min_key is guaranteed to be a lower bound of the key values of all elements in the data structure. Thus, Algorithm 5 correctly obtains the element with the minimum key value.
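The operations described above (Algorithms 2 to 5) can be sketched compactly in C++. The following is a minimal illustration under stated assumptions, not the authors' header file: the class name ListLinearHeapSketch, the sentinel kNil (standing in for null), and the demo function drain_figure_2_1 are hypothetical, std::vector replaces the raw arrays, and init, increment, get_min, get_max, and pop_max are left out.

```cpp
#include <cstdint>
#include <vector>

const uint32_t kNil = UINT32_MAX;  // assumption: plays the role of "null"

// Illustrative sketch only; the book's ListLinearHeap.h differs in details.
class ListLinearHeapSketch {
    uint32_t min_key, max_key;
    std::vector<uint32_t> keys, pres, nexts, heads;

public:
    ListLinearHeapSketch(uint32_t n, uint32_t key_cap)
        : min_key(key_cap), max_key(0),
          keys(n, 0), pres(n, kNil), nexts(n, kNil), heads(key_cap + 1, kNil) {}

    // Algorithm 2: put element at the front of the list for key.
    void insert(uint32_t element, uint32_t key) {
        keys[element] = key;
        pres[element] = kNil;
        nexts[element] = heads[key];
        if (heads[key] != kNil) pres[heads[key]] = element;
        heads[key] = element;
        if (key < min_key) min_key = key;
        if (key > max_key) max_key = key;
    }

    // Algorithm 3: unlink element from its list and return its key.
    uint32_t remove(uint32_t element) {
        if (pres[element] == kNil) {  // element is the head of its list
            heads[keys[element]] = nexts[element];
            if (nexts[element] != kNil) pres[nexts[element]] = kNil;
        } else {
            nexts[pres[element]] = nexts[element];
            if (nexts[element] != kNil) pres[nexts[element]] = pres[element];
        }
        return keys[element];
    }

    // Algorithm 4: remove, lower the key, reinsert.
    uint32_t decrement(uint32_t element, uint32_t dec) {
        const uint32_t key = remove(element) - dec;
        insert(element, key);
        return key;
    }

    // Algorithm 5: advance min_key to the first non-empty list, pop its head.
    bool pop_min(uint32_t &element, uint32_t &key) {
        while (min_key <= max_key && heads[min_key] == kNil) ++min_key;
        if (min_key > max_key) return false;
        element = heads[min_key];
        key = min_key;
        remove(element);
        return true;
    }
};

// Build the heap of Figure 2.1 (v1..v11 stored as elements 0..10), decrement
// the key of v4 once, then pop everything and collect the popped keys.
std::vector<uint32_t> drain_figure_2_1() {
    const std::vector<uint32_t> degree = {4, 4, 4, 4, 3, 1, 1, 2, 1, 1, 1};
    ListLinearHeapSketch heap(static_cast<uint32_t>(degree.size()), 10);
    for (uint32_t v = 0; v < degree.size(); ++v) heap.insert(v, degree[v]);
    heap.decrement(3, 1);  // v4: key 4 -> 3
    std::vector<uint32_t> popped;
    uint32_t element = 0, key = 0;
    while (heap.pop_min(element, key)) popped.push_back(key);
    return popped;
}
```

Because the sketch inserts at the front of a list, inserting v1, ..., v11 in order leaves v11 at heads[1] and v4 at heads[4], matching the layout in Figure 2.1. Draining the heap returns the keys in non-decreasing order.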
2.1.2 Time Complexity of ListLinearHeap

The time complexities of the member functions of ListLinearHeap are as follows. Firstly, the initialization in Algorithm 1 takes O(n + key_cap) time if memory allocation is invoked and takes O(_n + _key_cap) time otherwise, where _n and _key_cap are the arguments passed to init. Secondly, each of the remaining member functions of ListLinearHeap, other than get_min, pop_min, get_max, and pop_max, runs in constant time. The difficult part is the four member functions for popping/getting the element with the minimum/maximum key value. Let _key_cap, which is given as an input to init, be the maximum key value that is allowed during the execution of the member functions of ListLinearHeap. In the worst case, an invocation of one of these four member functions takes O(_key_cap) time. Nevertheless, a better time complexity is possible, as proved in the theorem below.

Theorem 2.1. After the initialization by init, a sequence of x decrement(id, 1), increment(id, 1), get_min, pop_min, get_max, pop_max, and remove operations takes O(_key_cap + x) time. Note that: (1) decrement and increment are only allowed to change the key value of an element by 1, and (2) insert is not allowed.

Proof. Firstly, as discussed above, each of the member functions decrement, increment, and remove takes constant time. Secondly, pop_min and pop_max are simply get_min and get_max, respectively, followed by remove. We prove in the following that a sequence of x decrement(id, 1), increment, get_min, and remove operations takes O(_key_cap + x) time. It is worth mentioning that here the restriction of incrementing only by 1 for increment is removed. The most time-consuming part of get_min (similar to Algorithm 5) is updating min_key, while the other parts can be conducted in constant time. Let t⁻ denote the number of times that min_key is decreased; note that min_key is only decreased in decrement, and by 1 each time. Thus, t⁻ is at most the number of invocations of decrement(id, 1), and t⁻ ≤ x. Similarly, let t⁺ denote the number of times that min_key is increased; note that min_key is increased only in get_min, but not in decrement, increment, or remove. Then, the time complexity of a sequence of x decrement(id, 1), increment, get_min, and remove operations is O(x + t⁻ + t⁺). It can be verified that t⁺ ≤ t⁻ + _key_cap, since min_key always stays within the range of allowed key values. Therefore, the above time complexity is O(_key_cap + x). The general statement in this theorem can be proved similarly. □

Following Theorem 2.1 and assuming _key_cap = O(_n), the amortized cost of one invocation of init followed by a sequence of (≥ n) decrement(id, 1), increment(id, 1), get_min, pop_min, get_max, pop_max, and remove operations is constant per operation. Thus, a set of n elements that are given as input to init can be sorted in O(_key_cap + n) time, as shown in the following lemma, which is similar to the idea of counting sort [24].

Lemma 2.1. A set of n elements that are given as input to init can be sorted in non-decreasing key value order (or non-increasing key value order) in O(_key_cap + n) time.

Proof. To sort in non-decreasing key value order, we invoke pop_min n times after initializing by init. The time complexity follows from Theorem 2.1. □
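The sorting argument of Lemma 2.1 can be illustrated without the full heap. The sketch below is a simplified stand-in rather than the book's code: the function name sort_by_key is hypothetical, elements are assumed to be exactly 0, ..., n − 1, and singly linked lists suffice because nothing is removed from the middle of a list. It builds one list per key value and then sweeps the key range once, which is what n successive pop_min calls amount to.

```cpp
#include <cstdint>
#include <vector>

// Sort element ids 0..n-1 by key in non-decreasing order in O(key_cap + n)
// time: one linked list per key value, then a single forward sweep over the
// key range (the min_key pointer never moves backwards).
std::vector<uint32_t> sort_by_key(const std::vector<uint32_t> &keys, uint32_t key_cap) {
    const uint32_t nil = UINT32_MAX;  // assumption: stands in for "null"
    std::vector<uint32_t> heads(key_cap + 1, nil), nexts(keys.size(), nil);
    for (uint32_t e = 0; e < keys.size(); ++e) {  // insert e at the list front
        nexts[e] = heads[keys[e]];
        heads[keys[e]] = e;
    }
    std::vector<uint32_t> order;
    order.reserve(keys.size());
    for (uint32_t k = 0; k <= key_cap; ++k)
        for (uint32_t e = heads[k]; e != nil; e = nexts[e]) order.push_back(e);
    return order;
}
```

The outer loop contributes the key_cap term and the inner loop visits each element once, giving the O(key_cap + n) bound. Within one key value, elements come out in reverse insertion order because each insertion goes to the front of its list.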
2.2 Array-Based Linear Heap

In ListLinearHeap, the doubly linked lists are represented by explicitly storing the immediately preceding and the immediately succeeding elements of an element i in pres[i] and nexts[i], respectively. Alternatively, all elements in the same doubly linked list can be stored consecutively in an array, in the same way as the adjacency array graph representation discussed in Section 1.1.3. This strategy is used in the array-based linear heap in [7]. Specifically, all elements with the same key value are stored consecutively in an array ids, such that elements with key value 0 are put at the beginning of ids and are followed by elements with key value 1 and so forth. The start position of the elements for each distinct key value is stored in an array heads. Thus, heads and ids resemble pstart and edges of the adjacency array graph representation, respectively. In addition, the key values of elements are stored in an array keys in the same way as in the linked list-based linear heap, and the positions of elements in ids are stored in an array rids (i.e., rids[ids[i]] = i).
  key value:  0    1    2    3    4    5    6    7    8    9    10
  heads:      0    0    5    6    7    11   11   11   11   11   11

  position:   0    1    2    3    4    5    6    7    8    9    10
  ids:        v6   v7   v9   v10  v11  v8   v5   v1   v2   v3   v4

  element:    v1   v2   v3   v4   v5   v6   v7   v8   v9   v10  v11
  rids:       7    8    9    10   6    0    1    5    2    3    4
  keys:       4    4    4    4    3    1    1    2    1    1    1

Fig. 2.2: An example of array-based linear heap

For example, Figure 2.2 demonstrates the array-based linear heap constructed for the vertices in Figure 1.1, where the key value of a vertex is its degree. That is, Figure 2.2 shows the actual data of the adjacency array-based representation for the doubly linked lists in Figure 2.1. Note that the dashed arrows in Figure 2.2 are for illustration purposes only and are not actually stored. Each index in the array heads corresponds to a key value, and each index in the arrays rids and keys corresponds to an element i, while the indexes of the array ids correspond to neither.
2.2.1 Interface of an Array-Based Linear Heap

The interface of the array-based linear heap, denoted ArrayLinearHeap, is given in Listing 2.2, while the full C++ code is available online as a header file.² Similar to ListLinearHeap, ArrayLinearHeap has member variables n, key_cap, max_key, min_key, heads, keys, ids, and rids. ArrayLinearHeap has similar but more restricted member functions than ListLinearHeap; for example, insert and remove are disabled, and increment and decrement are only allowed to update the key value of an element by 1. The space complexity of ArrayLinearHeap is 3n + key_cap + O(1), the same as that of ListLinearHeap. Also, to use ArrayLinearHeap after constructing it, the member function init needs to be invoked first to properly initialize the member variables before invoking any other member functions.

² https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/ArrayLinearHeap.h

Listing 2.2: Interface of an array-based linear heap

    class ArrayLinearHeap {
    private:
        uint n;        // total number of possible distinct elements
        uint key_cap;  // maximum allowed key value
        uint max_key;  // upper bound of the current maximum key value
        uint min_key;  // lower bound of the current minimum key value
        uint *keys;    // key values of elements
        uint *heads;   // start position of elements with a specific key
        uint *ids;     // element ids
        uint *rids;    // reverse of ids, i.e., rids[ids[i]] = i

    public:
        ArrayLinearHeap(uint _n, uint _key_cap);
        ~ArrayLinearHeap();
        void init(uint _n, uint _key_cap, uint *_ids, uint *_keys);
        uint get_n() { return n; }
        uint get_key_cap() { return key_cap; }
        uint get_key(uint element) { return keys[element]; }
        void increment(uint element);
        void decrement(uint element);
        bool get_max(uint &element, uint &key);
        bool pop_max(uint &element, uint &key);
        bool get_min(uint &element, uint &key);
        bool pop_min(uint &element, uint &key);
    };

Initialize an Array-Based Linear Heap (init). The pseudocode of init is shown in Algorithm 6, where memory allocation is omitted. It needs to sort all elements in ids in non-decreasing key value order, which is conducted by counting sort [24]. After initializing by init, the set of elements with key value key is stored consecutively in the subarray ids[heads[key], ..., heads[key + 1] − 1].

Algorithm 6: init(_n, _key_cap, _ids, _keys)
    /* Initialize max_key, min_key, and keys */
    max_key ← 0; min_key ← key_cap;
    for i ← 0 to n − 1 do
        keys[_ids[i]] ← _keys[i];
        if _keys[i] > max_key then max_key ← _keys[i];
        if _keys[i] < min_key then min_key ← _keys[i];
    /* Initialize ids, rids */
    Create an array cnt of size max_key + 1, with all entries 0;
    for i ← 0 to n − 1 do cnt[_keys[i]] ← cnt[_keys[i]] + 1;
    for i ← 1 to max_key do cnt[i] ← cnt[i] + cnt[i − 1];
    for i ← 0 to n − 1 do
        cnt[_keys[i]] ← cnt[_keys[i]] − 1;
        rids[_ids[i]] ← cnt[_keys[i]];
    for i ← 0 to n − 1 do ids[rids[_ids[i]]] ← _ids[i];
    /* Initialize heads */
    heads[min_key] ← 0;
    for key ← min_key + 1 to max_key + 1 do
        heads[key] ← heads[key − 1];
        while heads[key] < n and keys[ids[heads[key]]] < key do
            heads[key] ← heads[key] + 1;

Update the Key Value of an Element. In ArrayLinearHeap, the key value of an element is only allowed to be updated by 1. The general idea of decrement is as follows. Let key be the key value of element before updating. Then, the goal is to move element from the subarray ids[heads[key], ..., heads[key + 1] − 1] to the subarray ids[heads[key − 1], ..., heads[key] − 1]. To do so, element is first moved to (by swapping with) position heads[key], where the original position of element in ids is located by rids[element]. Then, the start position of elements in ids with key value key (i.e., heads[key]) is increased by one; consequently, element is now at the end of the subarray ids[heads[key − 1], ..., heads[key] − 1]. Note that min_key may also be updated in decrement. The pseudocode of decrement is shown in Algorithm 7, which returns the updated key value of element. The pseudocode of increment is similar and is omitted.

Algorithm 7: decrement(element)
    key ← keys[element];
    if heads[key] ≠ rids[element] then
        Swap the content of ids for positions heads[key] and rids[element];
        rids[ids[rids[element]]] ← rids[element];
        rids[ids[heads[key]]] ← heads[key];
    if min_key = key then
        min_key ← min_key − 1;
        heads[min_key] ← heads[min_key + 1];
    heads[key] ← heads[key] + 1; keys[element] ← keys[element] − 1;
    return keys[element];

For example, to decrement the key value of v4 by 1 for the data structure in Figure 2.2, v4 is first swapped with v1 in the array ids, and then heads[4] is increased by 1 to become 8; note that rids and keys are updated accordingly.

Pop/Get Min/Max from an Array-Based Linear Heap. This is similar to that of ListLinearHeap, and we omit the details.
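The array layout and the swap-based decrement described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' header: the names ArrayLinearHeapSketch and figure_2_2_heap are hypothetical, elements are assumed to be exactly 0, ..., n − 1, no key is decremented below 0, and heads is filled by a prefix sum over key counts (with one extra trailing entry) instead of the while loop of Algorithm 6. Only init and decrement are covered.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative sketch; the book's ArrayLinearHeap.h differs in details.
struct ArrayLinearHeapSketch {
    uint32_t min_key, max_key;
    std::vector<uint32_t> keys, heads, ids, rids;

    ArrayLinearHeapSketch(const std::vector<uint32_t> &init_keys, uint32_t key_cap)
        : min_key(key_cap), max_key(0), keys(init_keys), heads(key_cap + 2, 0),
          ids(init_keys.size()), rids(init_keys.size()) {
        // Counting sort: after the prefix sum, heads[k] is the number of
        // elements with key < k, i.e., the start of the block for key k.
        for (uint32_t k : keys) {
            ++heads[k + 1];
            min_key = std::min(min_key, k);
            max_key = std::max(max_key, k);
        }
        for (uint32_t k = 1; k < heads.size(); ++k) heads[k] += heads[k - 1];
        std::vector<uint32_t> next(heads.begin(), heads.end() - 1);
        for (uint32_t e = 0; e < keys.size(); ++e) {
            rids[e] = next[keys[e]]++;  // position of element e in ids
            ids[rids[e]] = e;
        }
    }

    // Algorithm 7: move element to the front of its block, then shift the
    // block boundary so that element joins the block of key - 1.
    uint32_t decrement(uint32_t element) {
        const uint32_t key = keys[element];
        const uint32_t front = heads[key], pos = rids[element];
        if (front != pos) {
            std::swap(ids[front], ids[pos]);
            rids[ids[pos]] = pos;
            rids[ids[front]] = front;
        }
        if (min_key == key) {
            --min_key;
            heads[min_key] = heads[min_key + 1];
        }
        ++heads[key];
        return --keys[element];
    }
};

// The heap of Figure 2.2, with v1..v11 stored as elements 0..10.
ArrayLinearHeapSketch figure_2_2_heap() {
    return ArrayLinearHeapSketch({4, 4, 4, 4, 3, 1, 1, 2, 1, 1, 1}, 10);
}
```

With the keys of Figure 2.2, the constructor reproduces the ids and rids arrays shown there. Calling decrement on element 3 (v4) swaps it with element 0 (v1) in ids and raises heads[4] from 7 to 8, as in the worked example above.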
2.2.2 Time Complexity of ArrayLinearHeap

The time complexities of the member functions of ArrayLinearHeap are the same as those of ListLinearHeap. Note that the counting sort in init runs in O(_n + _key_cap) time. Moreover, similar to the proof of Theorem 2.1, the following theorem can be proved for ArrayLinearHeap.

Theorem 2.2. After the initialization by init, a sequence of x decrement, increment, get_min, pop_min, get_max, and pop_max operations takes O(_key_cap + x) time.

Note that ArrayLinearHeap is more restricted than ListLinearHeap and has a similar performance to ListLinearHeap (see Chapter 3).
2.3 Lazy-Update Linear Heap

The practical performance of ListLinearHeap can be improved by using the lazy updating strategy, if only get_max, pop_max, and decrement are allowed (or similarly, only get_min, pop_min, and increment are allowed) [15]. For simplicity of presentation, we present the data structure for the former case in this section. The interface of the lazy-update linear heap, denoted LazyLinearHeap, is given in Listing 2.3, while the full C++ code is available online as a header file.³ Most parts of LazyLinearHeap are similar to those of ListLinearHeap. There are two major differences [15].

Firstly, LazyLinearHeap is stored in three rather than four arrays. That is, each adjacency list in LazyLinearHeap can be represented by a singly linked list rather than a doubly linked list, since no arbitrary deletion of an element from LazyLinearHeap is allowed. Thus, pres is not needed in LazyLinearHeap, and the space complexity of LazyLinearHeap becomes 2n + key_cap + O(1).

Secondly, LazyLinearHeap does not eagerly move an element into the proper adjacency list when its key value is decremented, as is done in ListLinearHeap. Specifically, decrement merely updates keys[element] (see Listing 2.3). Thus, for elements in the singly linked list pointed to by heads[key], their key values can be exactly key and can also be smaller than key.

To obtain the element with the maximum key value from LazyLinearHeap, each element is checked and maintained when it is about to be chosen as the element with the maximum key value. The pseudocode of pop_max is shown in Algorithm 8. It iteratively retrieves and removes the first element in the singly linked list corresponding to the maximum key value max_key. If the key value of element equals max_key, then it indeed has the maximum key value and is returned. Otherwise, the key value of element must be smaller than max_key, and thus element is inserted into the proper singly linked list.

³ https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/LazyLinearHeap.h
Listing 2.3: Interface of a lazy-update linear heap

    class LazyLinearHeap {
    private:
        uint n;        // total number of possible distinct elements
        uint key_cap;  // maximum allowed key value
        uint max_key;  // upper bound of the current maximum key value
        uint *keys;    // key values of elements
        uint *heads;   // the first element in a singly linked list
        uint *nexts;   // next element in a singly linked list

    public:
        LazyLinearHeap(uint _n, uint _key_cap);
        ~LazyLinearHeap();
        void init(uint _n, uint _key_cap, uint *_elems, uint *_keys);
        uint get_n() { return n; }
        uint get_key_cap() { return key_cap; }
        uint get_key(uint element) { return keys[element]; }
        uint decrement(uint element, uint dec) { return keys[element] -= dec; }
        bool get_max(uint &element, uint &key);
        bool pop_max(uint &element, uint &key);
    };
Algorithm 8: pop_max(element, key)
    while true do
        while max_key > 0 and heads[max_key] = null do
            max_key ← max_key − 1;
        if heads[max_key] = null then
            return false;
        element ← heads[max_key];
        heads[max_key] ← nexts[element]; /* Remove element */
        if keys[element] = max_key then
            /* element has the maximum key value */
            key ← max_key;
            return true;
        else
            /* Insert element into the proper singly linked list */
            nexts[element] ← heads[keys[element]];
            heads[keys[element]] ← element;
Analysis. The efficiency of LazyLinearHeap compared with ListLinearHeap is proved in the following lemma.
Lemma 2.2. For an arbitrary sequence of decrement and pop_max operations, the running time of LazyLinearHeap is no larger than that of ListLinearHeap.

Proof. It is easy to verify that each move of an element from one singly linked list to another singly linked list in pop_max (Algorithm 8) in LazyLinearHeap corresponds to a set of one or more decrement(element, dec) operations in ListLinearHeap, each of which moves the element between lists there. Thus, the lemma holds. □

Note that, as shown in the proof of Lemma 2.2, LazyLinearHeap in practice can be much faster than ListLinearHeap, since several decrements of the same element may be paid for by a single move.
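The lazy strategy and the argument of Lemma 2.2 can be seen in a small sketch. The code below is illustrative only and is not the authors' LazyLinearHeap.h: the names LazyLinearHeapSketch, kNil, and lazy_demo are hypothetical, initialization is folded into the constructor, and get_max is omitted.

```cpp
#include <cstdint>
#include <vector>

const uint32_t kNil = UINT32_MAX;  // assumption: plays the role of "null"

// Illustrative sketch; elements are assumed to be exactly 0..n-1.
struct LazyLinearHeapSketch {
    uint32_t max_key;
    std::vector<uint32_t> keys, heads, nexts;

    LazyLinearHeapSketch(const std::vector<uint32_t> &init_keys, uint32_t key_cap)
        : max_key(0), keys(init_keys), heads(key_cap + 1, kNil),
          nexts(init_keys.size(), kNil) {
        for (uint32_t e = 0; e < keys.size(); ++e) {
            nexts[e] = heads[keys[e]];  // insert e at the front of its list
            heads[keys[e]] = e;
            if (keys[e] > max_key) max_key = keys[e];
        }
    }

    // Lazy decrement: only the stored key changes; the element stays in the
    // list of its old key until pop_max encounters it.
    uint32_t decrement(uint32_t element, uint32_t dec) { return keys[element] -= dec; }

    // Algorithm 8: pop list heads until one carries an up-to-date key.
    bool pop_max(uint32_t &element, uint32_t &key) {
        while (true) {
            while (max_key > 0 && heads[max_key] == kNil) --max_key;
            if (heads[max_key] == kNil) return false;
            element = heads[max_key];
            heads[max_key] = nexts[element];  // unlink from the current list
            if (keys[element] == max_key) {
                key = max_key;
                return true;
            }
            nexts[element] = heads[keys[element]];  // stale: relink by true key
            heads[keys[element]] = element;
        }
    }
};

// Element 0 is decremented twice but relinked only once, inside pop_max.
std::vector<uint32_t> lazy_demo() {
    LazyLinearHeapSketch heap({3, 3, 1}, 3);
    heap.decrement(0, 1);
    heap.decrement(0, 1);
    std::vector<uint32_t> popped;
    uint32_t element = 0, key = 0;
    while (heap.pop_max(element, key)) popped.push_back(element);
    return popped;
}
```

In lazy_demo, a ListLinearHeap would move element 0 twice (from key 3 to 2, then from 2 to 1), whereas the lazy variant moves it once, directly from the list of key 3 to the list of key 1. This is the saving that the proof of Lemma 2.2 accounts for.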
Chapter 3
Minimum Degree-Based Core Decomposition
In this chapter, we discuss efficient techniques for computing the minimum degree-based graph decomposition (aka core decomposition). Preliminaries are given in Section 3.1. A linear-time algorithm is presented in Section 3.2, while h-index-based local algorithms that can be naturally made parallel are presented in Section 3.3.
3.1 Preliminaries

Definition 3.1 ([81]). Given a graph G and an integer k, the k-core of G is the maximal subgraph g of G such that the minimum degree of g is at least k; that is, every vertex in g is connected to at least k other vertices in g.

Fig. 3.1: An example graph and its k-cores (a graph on vertices v1, ..., v8 with the subgraphs g1, g2, and g3 marked)

For example, for the graph in Figure 3.1, g1, g2, and g3 are its 1-core, 2-core, and 3-core, respectively. The k-core of G is unique and is a vertex-induced subgraph of G; thus, we may also refer to the k-core by its set of vertices in the following. Note that, here, we do not consider the connectedness of the k-core, which will be discussed in Section 3.2.3.

Core Decomposition. The problem of core decomposition is to compute the k-cores of an input graph G, for all possible values of k.

© Springer Nature Switzerland AG 2018. L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0_3
Definition 3.2. Given a graph G, the core number of a vertex u in G, denoted core(u), is the largest k such that the k-core of G contains u. Note that the total size of k-cores for all possible k values could be much larger than the size of the input graph. As a result, rather than reporting k-cores for all possible k values, we focus on computing the core numbers of vertices (i.e., core decomposition) and building a hierarchical structure for the k-cores of all possible k values.
3.1.1 Degeneracy and Arboricity of a Graph

The maximum value among the core numbers of all vertices in a graph is closely related to the notions of degeneracy and arboricity, both of which measure the sparsity of a graph.

Definition 3.3 ([53]). A graph G is k-degenerate if every subgraph g of G has a vertex with degree at most k in g. The degeneracy of G, denoted δ(G), is the smallest value of k for which G is k-degenerate. That is, the degeneracy δ(G) of G equals the maximum value among the minimum vertex degrees of all subgraphs of G.

Definition 3.4 ([20]). The arboricity of a graph G, denoted α(G), is the minimum number of edge-disjoint spanning forests into which G can be decomposed, or equivalently, the minimum number of forests needed to cover all edges of G.

Relationships Between Core Number, Degeneracy, and Arboricity. The relationships between core number, degeneracy, and arboricity are established by the following two lemmas.

Lemma 3.1. The degeneracy of G equals the maximum value among core numbers of vertices in G (i.e., $\delta(G) = \max_{u \in V} \mathrm{core}(u)$).

Proof. Firstly, it is easy to see that G contains no (δ(G) + 1)-core, since the minimum vertex degree of a (δ(G) + 1)-core is at least δ(G) + 1; thus, $\max_{u \in V} \mathrm{core}(u) \le \delta(G)$. Secondly, G must contain a δ(G)-core, since it has at least one subgraph whose minimum vertex degree is δ(G); thus, $\max_{u \in V} \mathrm{core}(u) \ge \delta(G)$. The lemma follows. □

Lemma 3.2. α(G) ≤ δ(G) < 2α(G).

Proof. Firstly, as every subgraph of G has a vertex with degree at most δ(G), G can be decomposed into δ(G) edge-disjoint forests as follows. We initially create δ(G) empty forests and then add edges into the forests by iteratively removing the minimum degree vertex from the graph. When removing a vertex u from the graph, its number of adjacent edges in the current graph must be no larger than δ(G), according to the definition of degeneracy; we direct each of the adjacent edges of u to point from u to the other end-point and add each of the adjacent edges of u into a different forest. As each vertex is removed only once, it has at most one out-going edge in each of the δ(G) forests. Thus, there is no cycle in the forests, and we have α(G) ≤ δ(G).

Secondly, consider the δ(G)-core g of G; it must exist, and the minimum vertex degree of g is δ(G). Then, g can be decomposed into α(G) edge-disjoint forests, since it is a subgraph of G. Thus, the number of edges of g is at most α(G) × (|V(g)| − 1). As the minimum vertex degree of a graph is no larger than its average vertex degree, we have

$$\delta(G) \le \frac{2 \times \alpha(G) \times (|V(g)| - 1)}{|V(g)|} < 2\alpha(G).$$ □

Upper Bounds of α(G) and δ(G). For an arbitrary graph G, it is easy to verify that $\alpha(G) \le d_{\max}(G)$ and $\delta(G) \le d_{\max}(G)$. Moreover, they can also be bounded by n and m as follows.

Lemma 3.3 ([20]). $\alpha(G) \le \left\lceil \frac{\sqrt{2m+n}}{2} \right\rceil$.

Lemma 3.4. $\delta(G) \le \left\lfloor \sqrt{2m+n} \right\rfloor$.

Proof. This lemma directly follows from Lemmas 3.2 and 3.3. That is:

$$\delta(G) < 2\alpha(G) \le 2 \times \left\lceil \frac{\sqrt{2m+n}}{2} \right\rceil \le 2 \times \frac{\sqrt{2m+n} + 1}{2} = \sqrt{2m+n} + 1.$$

Thus, we have $\delta(G) \le \left\lfloor \sqrt{2m+n} \right\rfloor$ since δ(G) is an integer. □
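As a quick sanity check of Lemmas 3.1–3.4, the bounds can be evaluated by hand on a small graph. The complete graph $K_4$ below is a hypothetical example chosen for illustration; it does not appear in the text.

```latex
% Worked check on K_4 (assumed example): n = 4, m = 6.
\begin{align*}
\delta(K_4) &= 3
  && \text{every vertex has degree 3, and $K_4$ is its own 3-core}\\
\alpha(K_4) &= 2
  && \text{a forest has at most $n-1=3$ edges, and the paths }
     v_1v_2v_3v_4,\; v_3v_1v_4v_2 \text{ cover all 6 edges}\\
\alpha(K_4) \le \delta(K_4) < 2\alpha(K_4)
  &\;\Longleftrightarrow\; 2 \le 3 < 4
  && \text{Lemma 3.2}\\
\left\lceil \tfrac{\sqrt{2m+n}}{2} \right\rceil
  &= \left\lceil \tfrac{\sqrt{16}}{2} \right\rceil = 2 \ge \alpha(K_4)
  && \text{Lemma 3.3}\\
\left\lfloor \sqrt{2m+n} \right\rfloor
  &= 4 \ge \delta(K_4)
  && \text{Lemma 3.4}
\end{align*}
```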
3.2 Linear-Time Core Decomposition In this section, firstly, a linear-time algorithm for computing the core numbers of all vertices (i.e., core decomposition) is presented in Section 3.2.1. Then, it is adapted to compute the k-core of a graph for a specific k value in Section 3.2.2. Finally, algorithms for constructing core hierarchies are presented in Section 3.2.3.
3.2.1 The Peeling Algorithm

A linear-time algorithm, denoted Peel, is developed in [7, 59] for computing the core numbers of all vertices. The algorithm is usually referred to as the peeling algorithm in the literature, since it iteratively peels/removes the minimum degree vertex from the remaining graph. The pseudocode of Peel is given in Algorithm 1, which is self-explanatory. One thing to notice is that rather than actually removing a vertex from the representation of the graph G, the algorithm only marks it as deleted; that is, in Algorithm 1, a vertex is marked as deleted if and only if it is in the array seq. As a result, the representation of the input graph G remains unchanged during the execution of Algorithm 1.

Algorithm 1: Peel: compute core numbers of all vertices [7, 59]
Input: A graph G = (V, E)
Output: core(v) for each vertex v ∈ V
 1  for each v ∈ V do
 2      Let d(v) be the degree of v in G;
 3  max_core ← 0; seq ← ∅;
 4  for i ← 1 to n do
 5      u ← arg min_{v ∈ V \ seq} d(v);
 6      Add u to the tail of seq;
 7      if d(u) > max_core then max_core ← d(u);
 8      core(u) ← max_core;
 9      for each neighbor v of u that is not in seq do
10          d(v) ← d(v) − 1;
Besides the core numbers core(·) of vertices, Algorithm 1 also computes a total ordering of vertices, which is stored in the array seq. This ordering is referred to as a smallest-first ordering or degeneracy ordering, defined as follows.

Definition 3.5 ([59]). Given a graph G, a permutation $(v_1, v_2, \ldots, v_n)$ of all vertices of G is a degeneracy ordering of G if every vertex $v_i$ has the minimum degree in the subgraph of G induced by $\{v_i, \ldots, v_n\}$.

Obviously, seq computed by Algorithm 1 is a degeneracy ordering. The degeneracy ordering has two important properties, as proved in the following two lemmas.

Lemma 3.5. If the input graph G is oriented according to the degeneracy ordering (i.e., an undirected edge (u, v) is directed from u to v if u appears before v in seq), then the maximum out-degree in the resulting graph is bounded by δ(G).

Proof. This lemma directly follows from the definition of degeneracy and the definition of degeneracy ordering. □

Lemma 3.6. For any given k such that 1 ≤ k ≤ δ(G), the k-core of G is the subgraph of G induced by vertices in $\mathrm{seq}[p_k, \ldots, n-1]$, where $p_k$ is the position/index of the first vertex in the array seq whose core number is at least k.

Proof. Let g be the subgraph of G induced by vertices in $\mathrm{seq}[p_k, \ldots, n-1]$. Firstly, the k-core of G contains none of the vertices in $\mathrm{seq}[0, \ldots, p_k - 1]$, since their core numbers are all smaller than k. Secondly, the minimum degree of g is at least k, according to the definition of core number and the definition of degeneracy ordering. Thus, g is the k-core of G, and the lemma holds. □
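The peeling idea can be sketched in C++ as follows. This is a minimal sketch under stated assumptions, not the book's implementation: the graph is assumed to be given as adjacency lists `adj` of a simple undirected graph, and the minimum-degree vertex is found with a bucket-sorted array rather than the linear heap classes of Section 2.1. In this bucket variant a neighbor's degree is decremented only while it exceeds the degree of the removed vertex, so `d[u]` at removal time is directly the core number of u. The names `PeelResult` and `peel` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

struct PeelResult {
    std::vector<int> core;  // core[v]: core number of vertex v
    std::vector<int> seq;   // vertices in removal order (a degeneracy ordering)
};

// adj[v] lists the neighbors of v in a simple undirected graph (assumption).
PeelResult peel(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> d(n), pos(n), vert(n);
    int max_deg = 0;
    for (int v = 0; v < n; ++v) {
        d[v] = static_cast<int>(adj[v].size());
        max_deg = std::max(max_deg, d[v]);
    }
    // bin[k] = index in vert where the bucket of degree-k vertices starts.
    std::vector<int> bin(max_deg + 2, 0);
    for (int v = 0; v < n; ++v) ++bin[d[v] + 1];
    for (int k = 1; k <= max_deg + 1; ++k) bin[k] += bin[k - 1];
    std::vector<int> fill(bin);  // next free slot of each bucket
    for (int v = 0; v < n; ++v) {
        pos[v] = fill[d[v]]++;
        vert[pos[v]] = v;
    }
    // vert is sorted by current degree; position i holds the next minimum.
    for (int i = 0; i < n; ++i) {
        const int u = vert[i];
        for (int v : adj[u]) {
            if (d[v] > d[u]) {
                assert(bin[d[v]] > i);  // larger-degree buckets lie right of i
                // Swap v to the front of its bucket, then shift the bucket
                // boundary so that v falls into the bucket of degree d[v]-1.
                const int w = vert[bin[d[v]]];
                if (w != v) {
                    std::swap(pos[v], pos[w]);
                    vert[pos[v]] = v;
                    vert[pos[w]] = w;
                }
                ++bin[d[v]];
                --d[v];
            }
        }
    }
    return {d, vert};
}
```

Each edge is inspected a constant number of times and every bucket update takes constant time, so the sketch runs in O(n + m) time.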
Example 3.1. Consider the graph in Figure 3.1. The degeneracy ordering of vertices is $\mathrm{seq} = (v_6, v_7, v_8, v_5, v_1, v_2, v_3, v_4)$, whose core numbers are 1, 1, 1, 2, 3, 3, 3, 3, respectively. Thus, the 3-core is the subgraph induced by vertices $\{v_1, v_2, v_3, v_4\}$, since $v_1$ is the first vertex in seq whose core number is at least 3.

Complexity Analysis and Implementation Details. One thing to notice is that the degrees d(·) of vertices keep changing during the execution of Algorithm 1, specifically at Line 10. As a result, a naive implementation of Line 5 of Algorithm 1, which iterates through the vertices in V \ seq to find the vertex with the minimum degree, would result in a time complexity of $O(n^2)$. The good news is that with the help of the linear heap data structures presented in Section 2.1, Algorithm 1 can be implemented to run in O(m) time; ideas similar to ListLinearHeap and ArrayLinearHeap are used in [59] and [7], respectively, to achieve the linear time complexity. Specifically, the data structure is initialized with all vertices of V where the key value of a vertex is its degree d(·), Line 5 is conducted by invoking the member function pop_min, and Line 10 is achieved by invoking the member function decrement. The time complexity follows from Theorem 2.1 (or Theorem 2.2) and the facts that: (1) the maximum possible key value in the data structure is n − 1, and (2) there are in total n invocations of pop_min and m invocations of decrement. The space complexity of Algorithm 1 is 2m + O(n), where the graph representation takes space 2m + O(n) and the linear heap data structure takes space O(n).

Graph G             Peel+ListLinearHeap   Peel+ArrayLinearHeap
as-Skitter                  0.522                 0.688
soc-LiveJournal1            5.107                 5.615
uk-2005                    14.281                27.229
it-2004                    16.507                31.840
twitter-2010              180                   175

Table 3.1: Processing time, in seconds, of core decomposition by Peel coupled with ListLinearHeap or ArrayLinearHeap

Empirical Evaluation. The processing time of core decomposition by the Peel algorithm coupled with ListLinearHeap or ArrayLinearHeap over the five real graphs described in Section 1.1.2 is shown in Table 3.1. This is to compare the efficiency of ListLinearHeap with the efficiency of ArrayLinearHeap. We can see that occasionally one may outperform the other, but in general they have similar performance. Thus, in practice it is a good choice to adopt ListLinearHeap for the linear heap data structure, since ListLinearHeap is more flexible than ArrayLinearHeap.

Figure 3.2 demonstrates the size distribution of the k-cores for all possible k values over four of the real graphs. Here, besides the number of vertices and edges in the entire k-core which may be disconnected, we also show the number of vertices
and edges in the giant component (i.e., the largest connected component) of the k-core. In general, the size of the k-core decreases rapidly when k increases. For the two medium-sized graphs as-Skitter and soc-LiveJournal1, the δ(G)-core (i.e., the non-empty k-core for the largest k) contains fewer than a thousand vertices, while for the two large graphs, the δ(G)-core contains fewer than ten thousand vertices, which is much smaller than the number of vertices in the entire graph. Across the four graphs, the k-core usually contains only one connected component (i.e., the number of vertices in the k-core is the same as the number of vertices in the giant component of the k-core); however, the k-core can also be disconnected (e.g., for large k on graphs soc-LiveJournal1 and it-2004).

Fig. 3.2: Number of vertices and edges in (Giant Component of) k-core (varying k). Panels: (a) as-Skitter, (b) soc-LiveJournal1, (c) it-2004, (d) twitter-2010. Each panel plots |V| in k-core, |E| in k-core, |V| in GC of k-core, and |E| in GC of k-core against the core value k (x-axis "Core"), with log-scale axes.
3.2.2 Compute k-Core

The Peel algorithm (i.e., Algorithm 1) could be directly used for computing the k-core of a graph for a user-given k; that is, stop the algorithm once max_core becomes no smaller than k, and then the k-core of G is the subgraph induced by the vertex u being processed together with the vertices not yet in seq. However, maintaining the linear heap data structure, despite taking linear time, incurs a large overhead. As a result, it is desirable to design an algorithm for computing the k-core of a graph for a specific k, without maintaining the linear heap data structure.

The pseudocode of k-core computation is given in Algorithm 2. It maintains the set of vertices whose degrees are smaller than k in a queue Q and iteratively removes vertices of Q from the input graph G. A vertex is initially pushed into Q if its degree in G is smaller than k, and then during the execution, a vertex is pushed into Q if its degree in the current resulting graph becomes exactly k − 1 due to the removal of its neighbors. Note that, in this way, each vertex not in the k-core of G will be pushed into Q exactly once, while vertices in the k-core of G will never be pushed into Q. The time complexity of Algorithm 2 is O(m) since each vertex of V will be popped out from Q at most once, and the space complexity of Algorithm 2 is 2m + O(n).

Algorithm 2: k-core: compute k-core
Input: A graph G = (V, E) and an integer k
Output: k-core of G
 1  Initialize an empty queue Q;
 2  for each v ∈ V do
 3      Let d(v) be the degree of v in G;
 4      if d(v) < k then Push v to Q;
 5  while Q ≠ ∅ do
 6      u ← pop a vertex from Q;    /* Remove u from the graph */
 7      for each neighbor v of u do
 8          d(v) ← d(v) − 1;
 9          if d(v) = k − 1 then Push v to Q;
10  return the subgraph of G induced by vertices with d(·) ≥ k;
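The queue-based procedure of Algorithm 2 translates almost line by line into C++. The sketch below assumes the graph is given as adjacency lists of a simple undirected graph and reports the k-core as a membership flag per vertex; the function name `k_core` and this output format are illustrative choices, not part of the book's code.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Returns in_core, where in_core[v] is true iff v belongs to the k-core.
// adj[v] lists the neighbors of v in a simple undirected graph (assumption).
std::vector<bool> k_core(const std::vector<std::vector<int>>& adj, int k) {
    assert(k >= 0);
    const int n = static_cast<int>(adj.size());
    std::vector<int> d(n);
    std::queue<int> q;
    for (int v = 0; v < n; ++v) {
        d[v] = static_cast<int>(adj[v].size());
        if (d[v] < k) q.push(v);  // v cannot be in the k-core
    }
    while (!q.empty()) {
        const int u = q.front();
        q.pop();  // conceptually remove u from the graph
        for (int v : adj[u]) {
            --d[v];
            // d[v] passes through k-1 at most once, so v is pushed at most once.
            if (d[v] == k - 1) q.push(v);
        }
    }
    std::vector<bool> in_core(n);
    for (int v = 0; v < n; ++v) in_core[v] = (d[v] >= k);
    return in_core;
}
```

As in the pseudocode, degrees of already removed vertices keep being decremented; this is harmless because such a degree is already below k and only moves further away from k − 1.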
3.2.3 Construct Core Hierarchy

In the previous discussions, the connectedness of the k-core is ignored. That is, the computed k-core may be disconnected. In this subsection, we focus on computing connected k-cores for all different k values.

Definition 3.6. Given a graph G, a connected k-core of G is a connected component of the k-core of G; that is, a maximal connected subgraph g of G such that the minimum degree of g is at least k.

For a given k, a graph can have multiple connected k-cores. As discussed in Section 3.1, the total size of k-cores for all possible k values could be much larger than the size of the input graph; this also holds for the connected k-cores. Thus, rather than reporting each individual connected k-core, the goal is to construct a hierarchy for all the connected k-cores of all possible k values such that any connected k-core can be retrieved in time linear in the number of vertices it contains; note that here we only focus on the set of vertices of a connected k-core, since each connected k-core is a vertex-induced subgraph. There are two ways to construct the hierarchy: one is a hierarchy tree [76] and the other is a spanning tree [16]. Before presenting these techniques, we first briefly review the disjoint-set data structure, which will be used for constructing the hierarchies.
3.2.3.1 Disjoint-Set Data Structure A disjoint-set data structure [24] maintains a collection S = {S1 , S2 , . . .} of disjoint dynamic sets. Each set is identified by a representative, which is one of its members and remains unchanged if the set is unchanged. The most popular and efficient way to represent a set in the data structure is by a rooted tree such that each member points to its parent and the root of the tree is the representative [24].
Listing 3.1: Interface of a disjoint-set data structure

    typedef unsigned int ui;

    class UnionFind {
    private:
        ui n;        // total number of elements
        ui *parent;  // parents of elements
        ui *rank;    // ranks of elements

    public:
        UnionFind(ui _n);
        ~UnionFind();
        void init(ui _n);           // allocate memory, and initialize the arrays parent and rank
        ui UF_find(ui i);           // return the representative of the set containing i
        void UF_union(ui i, ui j);  // union the two sets that contain i and j, respectively
    };
The interface of the disjoint-set data structure, denoted UnionFind, is illustrated in Listing 3.1, while the full C++ code is available online as a header file.¹ UnionFind has member variables n storing the total number of elements, and arrays parent and rank storing the parent and rank value of each element, respectively; note that rank is used by the union by rank optimization. In the data structure, we assume that elements are 0, 1, . . . , n − 1. UnionFind has the following three member functions:

• init(_n) initializes the data structure such that each element i forms a singleton set; specifically, parent[i] = i and rank[i] = 0.
• UF_find(i) returns the representative of the unique set containing i.
• UF_union(i, j) unites the two sets that contain i and j, respectively, into a single set that is the union of these two sets. Specifically, the parent of the representative of one set is set as the representative of the other set.

By using the union by rank and path compression optimizations, both UF_find and UF_union can be conducted in constant amortized time in all practical situations [24]. Specifically, the time complexities of UF_find and UF_union are given by the following theorem.

Theorem 3.1 ([24]). By using both union by rank and path compression, the worst-case running time of a sequence of x UF_find and UF_union operations is O(x × a(x)), where a(x) is the inverse of the Ackermann function and is at most 4 for all practical values of x.

In the remainder of the book, we assume that a(x) is constant and ignore it in the time complexity analysis. The space complexity of the disjoint-set data structure is 2n + O(1) since each of the arrays parent and rank has size n.

¹ https://github.com/LijunChang/Cohesive_subgraph_book/blob/master/data_structures/UnionFind.h
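The three member functions described above can be sketched as follows. This is an illustrative version, not the header file referenced in the footnote: it keeps the member function names of Listing 3.1 but, as an assumption made for brevity and memory safety, stores parent and rank in std::vector instead of raw arrays and omits the member n.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef unsigned int ui;

class UnionFind {
private:
    std::vector<ui> parent;  // parent[i] == i iff i is a representative
    std::vector<ui> rank;    // upper bound on the height of the tree rooted at i

public:
    explicit UnionFind(ui n) { init(n); }

    // Every element starts as a singleton set.
    void init(ui n) {
        parent.resize(n);
        rank.assign(n, 0);
        for (ui i = 0; i < n; ++i) parent[i] = i;
    }

    // Find the root, then make every vertex on the path point to it.
    ui UF_find(ui i) {
        assert(i < parent.size());
        ui root = i;
        while (parent[root] != root) root = parent[root];
        while (parent[i] != root) {  // path compression
            const ui next = parent[i];
            parent[i] = root;
            i = next;
        }
        return root;
    }

    // Union by rank: attach the root of smaller rank below the other root.
    void UF_union(ui i, ui j) {
        ui ri = UF_find(i), rj = UF_find(j);
        if (ri == rj) return;
        if (rank[ri] < rank[rj]) std::swap(ri, rj);
        parent[rj] = ri;
        if (rank[ri] == rank[rj]) ++rank[ri];
    }
};
```

The iterative two-pass UF_find avoids recursion, which matters on large graphs where a long parent chain could otherwise exhaust the call stack.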
3.2.3.2 Core Hierarchy Tree

A core hierarchy tree is a rooted and unordered tree, denoted CoreHT, where vertices in it are called nodes to be distinguished from vertices of the graph G. The core hierarchy tree represents the connected k-cores of all possible k values in a compact tree form based on the fact that each connected k-core is entirely contained in a connected (k − 1)-core. Specifically, each node of the core hierarchy tree of G corresponds to a distinct connected k-core of G, has a weight k, and contains the subset of vertices of the connected k-core that is not contained in any (k + 1)-core of G (i.e., the subset of vertices of the connected k-core whose core numbers are exactly k). As a result, the sets of vertices contained in the nodes of the CoreHT form a partition of V.

Formally, the core hierarchy tree is defined recursively as follows. For a given connected graph G, let k be its minimum vertex degree; that is, G itself is a k-core. Let $C_1, C_2, \ldots, C_x$ be the connected (k + 1)-cores of G, and $\mathrm{CoreHT}_1, \mathrm{CoreHT}_2, \ldots, \mathrm{CoreHT}_x$ be their corresponding core hierarchy trees. Then, the root node of the core hierarchy tree of G has a weight k and contains vertices of G that are not in the connected (k + 1)-cores of G (i.e., not in $C_1, C_2, \ldots, C_x$), and its set of children is the set of root nodes of $\mathrm{CoreHT}_1, \mathrm{CoreHT}_2, \ldots, \mathrm{CoreHT}_x$. For example, the core hierarchy tree of the graph in Figure 3.1 is illustrated in Figure 3.3, where weights are shown beside the nodes.

It can be verified that, for a node r in the core hierarchy tree, CoreHT, of G whose weight is k, the set of vertices of G contained in the nodes of the subtree of CoreHT rooted at r is a connected k-core of G. The reverse is also true; that is, for each connected k-core of G, there exists such a node in CoreHT.
For example, for the core hierarchy tree in Figure 3.3, the set of vertices $\{v_1, v_2, v_3, v_4, v_5\}$ contained in the two non-root nodes is a connected 2-core of the graph in Figure 3.1. Consequently, the connected k-cores of a graph for a given k can be obtained in time linear in the number of vertices contained in the connected k-cores, by traversing the core hierarchy tree in a bottom-up fashion. A similar hierarchy structure has been used in [50] for searching influential communities, i.e., a vertex-weighted version of k-core.

Fig. 3.3: The core hierarchy tree for the graph in Figure 3.1. The tree is a path of three nodes: the root contains $\{v_6, v_7, v_8\}$ and has weight 1, its child contains $\{v_5\}$ and has weight 2, and the leaf contains $\{v_1, v_2, v_3, v_4\}$ and has weight 3.

Algorithm 3: CoreHierarchy: compute the core hierarchy tree of a graph
Input: A graph G = (V, E), a degeneracy ordering seq of vertices, and the core numbers core(·) of vertices
Output: A core hierarchy tree CoreHT of G
 1  Initialize an empty CoreHT, and a disjoint-set data structure F for V;
 2  for each vertex u ∈ V do
 3      Add a node r_u, with weight core(u) and containing vertex u, to CoreHT;
 4      Point u to r_u;
 5  for each vertex u in seq in reverse order do
 6      for each neighbor v of u in G that appears later than u in seq do
 7          Let r_v and r_u be the nodes of CoreHT pointed to by the representatives of the sets containing v and u in F, respectively;
 8          if r_v ≠ r_u then
                /* Update the CoreHT */
 9              if the weight of r_v equals the weight of r_u then
10                  Move the content (i.e., vertices and children) of r_v to r_u;
11              else Assign r_u as the parent of r_v in the CoreHT;
                /* Update the disjoint-set data structure F */
12              Union u and v in F, and point the updated representative of the set containing u to r_u;
13  return CoreHT;
For the sake of time efficiency, the core hierarchy tree of a graph G is constructed in a bottom-up fashion, rather than a top-down fashion as in the definition above. The pseudocode is shown in Algorithm 3, denoted CoreHierarchy, which takes as input a degeneracy ordering seq of vertices as well as the core numbers core(·) of vertices. In addition to the core hierarchy tree CoreHT, it also maintains a disjoint-set data structure F for the vertices V of G. There is a one-to-one correspondence between the sets of F and the trees of CoreHT during the execution of the algorithm; moreover, the representative of a set in F also has a pointer to the root node of the corresponding tree in CoreHT.

Initially, each vertex u ∈ V forms a node in CoreHT and forms a set in F (Lines 1–4). Then, vertices are processed in the reverse order according to the degeneracy ordering seq (Line 5). When processing a vertex u, only its neighbors that appear after u in seq (i.e., have been processed) are considered (Line 6). For each such neighbor vertex v of u, the trees containing u and v in CoreHT are merged (Lines 9–11), and correspondingly the sets containing u and v in F are also united (Line 12); note that the purpose of the latter is to efficiently implement the former (specifically, finding the root of a tree in CoreHT in amortized constant time). Two trees in CoreHT are merged in one of the following two ways: (1) if their roots have the same weight, then the two roots are merged into a single root (Lines 9–10); (2) otherwise, the root with the larger weight is added as a child of the other root (Line 11). Based on the disjoint-set data structure, the time complexity of CoreHierarchy is O(m), and the space complexity of CoreHierarchy is 2m + O(n) where the size of CoreHT is O(n).
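The bottom-up construction can be sketched in C++ as follows. The sketch rests on several assumptions that are not in the text: the graph is given as adjacency lists, nodes are identified by indices into a vector, node contents are kept in std::list so that moving them is a constant-time splice, and a compact union-find with path halving (without union by rank) stands in for the disjoint-set data structure F. All names (`CoreHTNode`, `CoreHT`, `core_hierarchy`, `find_root`) are hypothetical.

```cpp
#include <cassert>
#include <list>
#include <numeric>
#include <vector>

struct CoreHTNode {
    int weight = 0;
    std::list<int> vertices;  // graph vertices stored in this node
    std::list<int> children;  // indices of child nodes
};

struct CoreHT {
    std::vector<CoreHTNode> nodes;  // nodes emptied by a merge remain as empty slots
    std::vector<int> roots;         // one root per connected component of G
};

static int find_root(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  // path halving
        x = parent[x];
    }
    return x;
}

CoreHT core_hierarchy(const std::vector<std::vector<int>>& adj,
                      const std::vector<int>& seq,
                      const std::vector<int>& core) {
    const int n = static_cast<int>(adj.size());
    assert(static_cast<int>(seq.size()) == n && static_cast<int>(core.size()) == n);
    CoreHT t;
    t.nodes.resize(n);
    std::vector<int> position(n), parent(n), node_of(n);
    std::iota(parent.begin(), parent.end(), 0);
    std::iota(node_of.begin(), node_of.end(), 0);  // set representative -> root node
    for (int i = 0; i < n; ++i) position[seq[i]] = i;
    for (int u = 0; u < n; ++u) {
        t.nodes[u].weight = core[u];
        t.nodes[u].vertices.push_back(u);
    }
    for (int i = n - 1; i >= 0; --i) {
        const int u = seq[i];
        for (int v : adj[u]) {
            if (position[v] <= i) continue;  // only neighbors later in seq
            const int pu = find_root(parent, u), pv = find_root(parent, v);
            if (pu == pv) continue;          // already in the same tree
            const int ru = node_of[pu], rv = node_of[pv];
            if (t.nodes[rv].weight == t.nodes[ru].weight) {
                // Same weight: move vertices and children of rv into ru.
                t.nodes[ru].vertices.splice(t.nodes[ru].vertices.end(), t.nodes[rv].vertices);
                t.nodes[ru].children.splice(t.nodes[ru].children.end(), t.nodes[rv].children);
            } else {
                t.nodes[ru].children.push_back(rv);  // rv has the larger weight
            }
            parent[pv] = pu;   // unite the two sets
            node_of[pu] = ru;  // the merged set points to the merged root
        }
    }
    for (int v = 0; v < n; ++v)
        if (find_root(parent, v) == v) t.roots.push_back(node_of[v]);
    return t;
}
```

Only child lists are stored, not parent pointers, so that merging two roots of equal weight never has to touch the moved children individually.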
3.2.3.3 Core Spanning Tree

Alternatively, the connected k-cores of G for all different k values can also be obtained from a weighted spanning tree of G, with an idea similar to that in [16] defined for k-edge connected components. The general idea is to first construct an edge-weighted graph $G_w$ of G based on the core decomposition of G. That is, the weight w(u, v) of edge (u, v) is the largest k for which there is a k-core of G containing the edge; it can be verified that w(u, v) equals min{core(u), core(v)}. For example, the edge-weighted graph of G in Figure 3.1 is shown in Figure 3.4. Then, the connected k-cores of G can be obtained from $G_w$ by the following lemma.

Lemma 3.7. The k-core of G is the subgraph of $G_w$ induced by the edges with weight at least k.

Proof. The k-core of G is the subgraph of G induced by those vertices whose core numbers are at least k. It is also exactly the subgraph of $G_w$ induced by those edges whose weights are at least k, according to the definition of edge weights. □
Fig. 3.4: A core spanning tree for the graph in Figure 3.1. Panels: (a) An edge-weighted graph; (b) Maximum spanning tree.
To preserve only the vertex information of each connected k-core, the maximum spanning tree of $G_w$, denoted CoreSPT and shown in Figure 3.4, is actually enough. It can be verified that for a given k, there is a one-to-one correspondence between the sets of vertices contained in the connected components of $G_w$ and the sets of vertices in the connected components of CoreSPT, by considering only edges with weight at least k, which also correspond to the sets of vertices in connected k-cores of G.

Algorithm 4: CoreSpanning: compute a core spanning tree of a graph
Input: A graph G = (V, E), a degeneracy ordering seq of vertices, and core numbers core(·) of vertices
Output: A core spanning tree CoreSPT of G
1  CoreSPT ← (V, ∅);    /* Initialize CoreSPT to be a graph consisting of vertices V and no edges */
2  Initialize a disjoint-set data structure F for V;
3  for each vertex u in seq in reverse order do
4      for each neighbor v of u in G that appears after u in seq do
5          if v and u are not in the same set in F then
6              Union u and v in F;
7              Add edge (u, v) with weight core(u) into CoreSPT;
8  return CoreSPT;

Given a degeneracy ordering seq and the core numbers core(·) of vertices, the pseudocode of the algorithm for computing the core spanning tree CoreSPT of G is shown in Algorithm 4, denoted CoreSpanning. The algorithm follows Kruskal's algorithm for computing the maximum spanning tree of an edge-weighted graph [24], and has a time complexity of O(m). The space complexity of CoreSpanning is 2m + O(n) since CoreSPT is a tree which takes only O(n) space.
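Algorithm 4 can be sketched in C++ along the following lines. The sketch assumes adjacency-list input and returns the tree as an edge list; for brevity it inlines a small union-find with path halving only (no union by rank), which is a simplification of the disjoint-set data structure described in Section 3.2.3.1. The names `WeightedEdge`, `core_spanning`, and `find_root` are hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

struct WeightedEdge {
    int u, v, weight;
};

static int find_root(std::vector<int>& parent, int x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  // path halving
        x = parent[x];
    }
    return x;
}

// seq is a degeneracy ordering and core[v] the core number of v, e.g. as
// produced by a peeling pass; adj[v] lists the neighbors of v (assumptions).
std::vector<WeightedEdge> core_spanning(const std::vector<std::vector<int>>& adj,
                                        const std::vector<int>& seq,
                                        const std::vector<int>& core) {
    const int n = static_cast<int>(adj.size());
    assert(static_cast<int>(seq.size()) == n && static_cast<int>(core.size()) == n);
    std::vector<int> position(n), parent(n);
    std::iota(parent.begin(), parent.end(), 0);
    for (int i = 0; i < n; ++i) position[seq[i]] = i;

    std::vector<WeightedEdge> tree;
    // Reverse degeneracy order visits edges by non-increasing weight
    // min{core(u), core(v)} = core(u), as Kruskal's algorithm requires.
    for (int i = n - 1; i >= 0; --i) {
        const int u = seq[i];
        for (int v : adj[u]) {
            if (position[v] <= i) continue;  // only neighbors later in seq
            const int ru = find_root(parent, u), rv = find_root(parent, v);
            if (ru != rv) {
                parent[rv] = ru;
                tree.push_back({u, v, core[u]});
            }
        }
    }
    return tree;
}
```

No explicit sorting of edges is needed: because core numbers are non-decreasing along seq, scanning seq backwards already yields the edges in the order a maximum-spanning-tree computation expects.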
3.3 Core Decomposition in Other Environments

In this section, h-index-based local algorithms for core decomposition are presented in Section 3.3.1, which are then made parallel in Section 3.3.2 and made I/O-efficient in Section 3.3.3.

3.3.1 h-index-Based Core Decomposition

The Peel algorithm is an inherently sequential algorithm that requires the global graph information and thus is hard to parallelize. Local algorithms based on the idea of the Hirsch index (a.k.a. h-index [38]) have also been designed for computing the core numbers core(·) of vertices in a distributed environment in [54, 62, 69].
3.3.1.1 An h-index-Based Local Algorithm

The h-index is a metric to measure the impact and productivity of researchers by the citation counts [38] and is formally defined as follows.

Definition 3.7. Given a multi-set S of positive numbers, the h-index of S, denoted h-index(S), is the largest integer k such that there are at least k numbers in S that are no smaller than k; that is, the largest integer k such that |{s ∈ S | s ≥ k}| ≥ k.

The h-index is a monotone function; that is, if we decrease a number in S or remove a number from S, then the h-index of S becomes no larger. The core numbers of vertices can be computed via h-index, by the following two lemmas, where $\overline{\mathrm{core}}(u)$ denotes an upper bound of the core number of u (i.e., $\overline{\mathrm{core}}(u) \ge \mathrm{core}(u)$).

Lemma 3.8 ([54]). Given a graph G, let C(u) be the multi-set of core numbers of u's neighbors (i.e., C(u) = {core(v) | v ∈ N(u)}). Then, the h-index of C(u) equals the core number of u in G, that is: core(u) = h-index(C(u)).

Proof. Firstly, we prove that core(u) ≥ h-index(C(u)). Let k be the value of h-index(C(u)). We consider the k-core g of G. According to the definition of h-index, at least k neighbors of u are in g. Thus, u is also in g, and core(u) ≥ k = h-index(C(u)). Secondly, we prove that core(u) ≤ h-index(C(u)). Let k be the value of core(u). Consider the k-core g of G. Then, at least k neighbors of u are in g, each of which has a core number at least k. Thus, h-index(C(u)) ≥ k = core(u). Thus, the lemma holds. □

Lemma 3.9 ([54]). Given a graph G, let $\overline{C}(u)$ be the multi-set of upper bounds of core numbers of u's neighbors (i.e., $\overline{C}(u) = \{\overline{\mathrm{core}}(v) \mid v \in N(u)\}$). Then, the h-index of $\overline{C}(u)$ is an upper bound of u's core number, that is: $\mathrm{core}(u) \le \text{h-index}(\overline{C}(u))$.

Proof. Following the monotonicity property of h-index, we have $\text{h-index}(\overline{C}(u)) \ge \text{h-index}(C(u))$. From Lemma 3.8, we have h-index(C(u)) = core(u). Thus, the lemma holds. □

Following the above two lemmas, the pseudocode of an h-index-based local algorithm for core decomposition is shown in Algorithm 5. The algorithm initializes the upper bounds of core numbers of vertices by their degrees (Line 1). Then, it iteratively recomputes $\text{h-index}(\overline{C}(u))$ and assigns the computed value as an updated upper bound of the core number of u (Line 7). The h-index of $\overline{C}(u)$ is computed by the procedure HIndex. Note that, due to the monotonicity of h-index, the upper bounds can only become smaller after updating. When no upper bound changes during an iteration, the upper bounds are the correct core numbers, as proved by the lemma below.
Algorithm 5: CoreD-Local: compute core numbers of vertices [54]
Input: A graph G = (V, E)
Output: core(v) of each vertex v ∈ V
 1  for each vertex u ∈ V do $\overline{\mathrm{core}}(u)$ ← the degree of u in G;
 2  update ← true;
 3  while update do
 4      update ← false;
 5      for each vertex u ∈ V do
 6          old ← $\overline{\mathrm{core}}(u)$;
 7          $\overline{\mathrm{core}}(u)$ ← HIndex(N(u), $\overline{\mathrm{core}}$);
 8          if $\overline{\mathrm{core}}(u)$ ≠ old then update ← true;
 9  return core(v) ← $\overline{\mathrm{core}}(v)$ for each vertex v ∈ V;

Procedure HIndex(S, $\overline{\mathrm{core}}$)
10  Initialize an array cnt of size |S| + 1, consisting of all zeros;
11  for each vertex u ∈ S do
12      if $\overline{\mathrm{core}}(u)$ > |S| then cnt[|S|] ← cnt[|S|] + 1;
13      else cnt[$\overline{\mathrm{core}}(u)$] ← cnt[$\overline{\mathrm{core}}(u)$] + 1;
14  for i ← |S| down to 1 do
15      if cnt[i] ≥ i then return i;
16      cnt[i − 1] ← cnt[i − 1] + cnt[i];
Lemma 3.10. Given the upper bounds $\overline{\mathrm{core}}$ of the core numbers of vertices in G, if $\text{h-index}(\overline{C}(v))$ equals $\overline{\mathrm{core}}(v)$ for every vertex v in G, then the upper bounds $\overline{\mathrm{core}}$ are the core numbers.

Proof. For a vertex u, let k be the value of $\overline{\mathrm{core}}(u)$ and consider the subgraph g of G induced by the set of vertices whose upper bounds of core numbers are at least k. As $\text{h-index}(\overline{C}(v)) = \overline{\mathrm{core}}(v)$ holds for every vertex v in G, each vertex in g has at least k neighbors in g. Thus, the core number of u is at least k. Moreover, k is an upper bound of the core number of u due to Lemma 3.9. Consequently, $\mathrm{core}(u) = k = \overline{\mathrm{core}}(u)$ and the lemma holds. □

Due to the monotonicity of h-index and the fact that the upper bounds $\overline{\mathrm{core}}(\cdot)$ can only become smaller in the algorithm, Algorithm 5 terminates after a finite number of iterations. Note that Algorithm 5 works for any processing order of vertices at Line 5. Actually, if at Line 5 the vertices are processed according to a degeneracy ordering of V, then the algorithm converges after one iteration; nevertheless, computing a degeneracy ordering of V is no easier than computing the core numbers of vertices.
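The procedure HIndex and the iteration of Algorithm 5 can be sketched in C++ as follows. The sketch assumes adjacency-list input for a simple undirected graph; the array `ub` plays the role of the upper bounds, and the function names `h_index` and `core_local` are illustrative.

```cpp
#include <cassert>
#include <vector>

// h-index of the multi-set { ub[v] : v in nbrs }, via counting in O(|nbrs|) time.
static int h_index(const std::vector<int>& nbrs, const std::vector<int>& ub) {
    const int s = static_cast<int>(nbrs.size());
    std::vector<int> cnt(s + 1, 0);
    for (int v : nbrs) {
        assert(ub[v] >= 0);
        ++cnt[ub[v] > s ? s : ub[v]];  // values above s are capped at s
    }
    for (int i = s; i >= 1; --i) {
        if (cnt[i] >= i) return i;  // at least i values are >= i
        cnt[i - 1] += cnt[i];       // accumulate the suffix count
    }
    return 0;  // empty multi-set, or all values are 0
}

// Repeatedly replace each upper bound by the h-index of the neighbors'
// upper bounds until a full pass changes nothing.
std::vector<int> core_local(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> ub(n);
    for (int u = 0; u < n; ++u) ub[u] = static_cast<int>(adj[u].size());
    bool update = true;
    while (update) {
        update = false;
        for (int u = 0; u < n; ++u) {
            const int h = h_index(adj[u], ub);
            if (h != ub[u]) {
                ub[u] = h;
                update = true;
            }
        }
    }
    return ub;  // by Lemma 3.10, these are now the core numbers
}
```

Because `ub` is updated in place, a vertex processed later in a pass already sees the lowered bounds of vertices processed earlier in the same pass.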
3.3.1.2 An Optimization

Algorithm CoreD-Local (i.e., Algorithm 5) blindly recomputes the h-index for every vertex in each iteration. However, most of these values remain unchanged in most of the iterations; thus, the computation is wasted. In view of this, an optimization technique is proposed in [95]. The general idea is to maintain an additional auxiliary value cnt(v) for each vertex v ∈ V, which will be defined shortly, such that $\text{h-index}(\overline{C}(u)) \ne \overline{\mathrm{core}}(u)$ if and only if $\overline{\mathrm{core}}(u) > \mathrm{cnt}(u)$. In this way, the procedure HIndex only needs to be invoked for those vertices satisfying $\overline{\mathrm{core}}(u) > \mathrm{cnt}(u)$, for which the invocation truly updates the value of $\overline{\mathrm{core}}(u)$. To achieve the above goal, cnt(u) records the number of u's neighbors whose upper bounds of core numbers are no smaller than that of u; that is, $\mathrm{cnt}(u) = |\{v \in N(u) \mid \overline{\mathrm{core}}(v) \ge \overline{\mathrm{core}}(u)\}|$. Then, it is easy to verify the following lemma.

Lemma 3.11. $\overline{\mathrm{core}}(u) = \text{h-index}(\overline{C}(u))$ if and only if $\overline{\mathrm{core}}(u) \le \mathrm{cnt}(u)$.
Algorithm 6: CoreD-Local-opt: compute core numbers of vertices
Input: A graph G = (V, E)
Output: core(v) for each vertex v ∈ V
 1  Q ← ∅;
 2  for each vertex u ∈ V do
 3      $\overline{\mathrm{core}}(u)$ ← the degree of u in G;
 4      cnt(u) ← the number of neighbors of u whose degrees are no smaller than that of u;
 5      if cnt(u) < $\overline{\mathrm{core}}(u)$ then Push u into Q;
 6  while Q ≠ ∅ do
 7      Q′ ← ∅;
 8      while Q ≠ ∅ do
 9          Pop a vertex u from Q;
10          old ← $\overline{\mathrm{core}}(u)$;
11          $\overline{\mathrm{core}}(u)$ ← HIndex(N(u), $\overline{\mathrm{core}}$);
12          cnt(u) ← 0;
13          for each neighbor v of u do
14              if $\overline{\mathrm{core}}(v)$ ≥ $\overline{\mathrm{core}}(u)$ then cnt(u) ← cnt(u) + 1;
15              if v ∉ Q and $\overline{\mathrm{core}}(u)$ < $\overline{\mathrm{core}}(v)$ ≤ old then
16                  if cnt(v) = $\overline{\mathrm{core}}(v)$ then Push v into Q′;
17                  cnt(v) ← cnt(v) − 1;
18      Q ← Q′;
19  return core(v) ← $\overline{\mathrm{core}}(v)$ for each vertex v ∈ V;
Based on the above technique, the optimized version of CoreD-Local is shown in Algorithm 6, where cnt(·) for vertices are maintained at Line 4 and Lines 12–17. A queue Q is used to store the set of vertices violating $\overline{\mathrm{core}}(u) \le \mathrm{cnt}(u)$ and is initialized at Line 5. Then, the procedure HIndex is invoked only for vertices in Q.

Theorem 3.2. The time complexity of Algorithm 6 is O(m × h-index(G)). Here, h-index(G) is the largest value computed by HIndex(N(u), d) among all vertices in G, where d(·) denotes the degrees of vertices.
Proof. According to Lemma 3.11, for each vertex u popped from Q, computing $\text{h-index}(\overline{C}(u))$ truly changes the value of $\overline{\mathrm{core}}(u)$. Thus, a vertex u is popped out from Q at most HIndex(N(u), d) + 1 times. This is because: (1) after popping out a vertex u from Q for the first time, the value of $\overline{\mathrm{core}}(u)$ becomes at most HIndex(N(u), d), and (2) the value of $\overline{\mathrm{core}}(u)$ can only decrease after updating. Moreover, each run of Lines 10–17 for a vertex u takes O(d(u)) time. Thus, the total time complexity is O(m × h-index(G)). □

Empirical Evaluation. The processing time of core decomposition by the Peel algorithm and the CoreD-Local-opt algorithm over five real graphs is shown in Table 3.2. Here, Peel uses the ArrayLinearHeap data structure, but it directly implements the logic of the data structure inside the algorithm rather than constructing an ArrayLinearHeap class and calling its member functions, to make a more precise comparison with CoreD-Local-opt. As a result, Peel runs slightly faster than Peel+ArrayLinearHeap.

Graph G               Peel     CoreD-Local-opt
as-Skitter            0.550          0.645
soc-LiveJournal1      4.232          7.765
uk-2005              26.338         17.535
it-2004              28.647         24.810
twitter-2010        134            369

Table 3.2: Processing time, in seconds, of core decomposition by Peel and CoreD-Local-opt

From Table 3.2, we can see that CoreD-Local-opt also performs well for the problem of core decomposition despite having a higher time complexity, and sometimes even runs faster than Peel. This is because one invocation of the procedure HIndex for a vertex u can reduce the value of $\overline{\mathrm{core}}(u)$ by more than one, while the value never drops below core(u). Thus, $\overline{\mathrm{core}}(u)$ converges to core(u) rapidly for many vertices, and the number of vertices whose $\overline{\mathrm{core}}(\cdot)$ values change in an iteration drops exponentially along with the iterations. For example, Figure 3.5 illustrates the number of updates (i.e., |Q|) in each iteration. As a result, the running time of CoreD-Local-opt in practice is much smaller than O(m × h-index(G)). It will be an interesting research task to give a tighter time complexity analysis for the algorithm or to design a new local algorithm with time complexity O(m) but based on h-index.
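The queue-driven optimization of Algorithm 6 can be sketched in C++ as follows. The sketch assumes adjacency-list input for a simple undirected graph, represents Q and Q′ as vectors, and tracks membership in the current Q with a flag array `in_q`; these representation choices and the names `h_index` and `core_local_opt` are assumptions made for illustration.

```cpp
#include <cassert>
#include <vector>

// h-index of the multi-set { ub[v] : v in nbrs } (procedure HIndex).
static int h_index(const std::vector<int>& nbrs, const std::vector<int>& ub) {
    const int s = static_cast<int>(nbrs.size());
    std::vector<int> cnt(s + 1, 0);
    for (int v : nbrs) ++cnt[ub[v] > s ? s : ub[v]];
    for (int i = s; i >= 1; --i) {
        if (cnt[i] >= i) return i;
        cnt[i - 1] += cnt[i];
    }
    return 0;
}

std::vector<int> core_local_opt(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> ub(n), cnt(n, 0), q, next;
    std::vector<char> in_q(n, 0);  // 1 iff the vertex waits in the current Q
    for (int u = 0; u < n; ++u) ub[u] = static_cast<int>(adj[u].size());
    for (int u = 0; u < n; ++u) {
        for (int v : adj[u]) if (ub[v] >= ub[u]) ++cnt[u];
        if (cnt[u] < ub[u]) { q.push_back(u); in_q[u] = 1; }
    }
    while (!q.empty()) {
        next.clear();
        for (int u : q) {
            in_q[u] = 0;  // u is popped from Q
            const int old = ub[u];
            ub[u] = h_index(adj[u], ub);
            assert(ub[u] <= old);  // upper bounds never increase
            cnt[u] = 0;
            for (int v : adj[u]) {
                if (ub[v] >= ub[u]) ++cnt[u];
                // u used to count towards cnt[v] but no longer does.
                if (!in_q[v] && ub[u] < ub[v] && ub[v] <= old) {
                    if (cnt[v] == ub[v]) next.push_back(v);  // v starts violating
                    --cnt[v];
                }
            }
        }
        q.swap(next);
        for (int v : q) in_q[v] = 1;
    }
    return ub;
}
```

Vertices still waiting in the current Q are skipped during the cnt maintenance because their cnt value is recomputed from scratch when they are popped.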
3.3.2 Parallel/Distributed Core Decomposition

The h-index-based local algorithm CoreD-Local (i.e., Algorithm 5) is embarrassingly parallel. That is, the processing of Lines 6–8 for each vertex at Line 5 is regarded as a task, and these n tasks can be parallelized. The parallelization can be implemented in two different ways, either synchronously or asynchronously. Firstly, if two copies of core(·) are maintained for each vertex, one representing the upper bound in the previous iteration and the other representing the upper bound in the current iteration, and the procedure HIndex only accesses the copy of the upper bound in the previous iteration, then this is a synchronous algorithm [28] and the tasks for different vertices are disjoint. Secondly, if only one copy of core(·) is maintained, then the parallel algorithm is still correct and is an asynchronous algorithm [28]. The behavior of the synchronous algorithm is deterministic and is the same for different runs. However, the behavior of the asynchronous algorithm is nondeterministic, and depends on the order in which the vertices are processed by the different processors. Nevertheless, these two algorithms output the same result when they converge. In practice, the asynchronous algorithm usually runs faster than the synchronous algorithm [28], due to the immediate visibility of the updated core(·) values in the asynchronous algorithm.

[Figure 3.5 panels: (a) as-Skitter, (b) soc-LiveJournal1, (c) it-2004, (d) twitter-2010; x-axis: Iteration, y-axis: #Updates]
Fig. 3.5: Number of updates in different iterations
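The difference between the synchronous and the asynchronous variant can be illustrated with a small sequential sketch in Python. It is not the implementation from [28]: the function names (`h_index`, `core_sync`, `core_async`) and the dictionary-based graph representation are assumptions made for illustration, and the "tasks" are simply executed one after another rather than on several processors.

```python
def h_index(values):
    """Largest h such that at least h of the given values are >= h."""
    h = 0
    for i, val in enumerate(sorted(values, reverse=True), start=1):
        if val >= i:
            h = i
        else:
            break
    return h


def core_sync(adj):
    """Synchronous variant: every update reads the previous iteration's copy."""
    ub = {u: len(nbrs) for u, nbrs in adj.items()}  # upper bounds, start at degrees
    while True:
        prev = dict(ub)  # second copy of the upper bounds
        for u in adj:
            ub[u] = h_index(prev[v] for v in adj[u])
        if ub == prev:
            return ub


def core_async(adj):
    """Asynchronous variant: a single copy, updates are visible immediately."""
    ub = {u: len(nbrs) for u, nbrs in adj.items()}
    changed = True
    while changed:
        changed = False
        for u in adj:
            new = h_index(ub[v] for v in adj[u])
            if new != ub[u]:
                ub[u] = new
                changed = True
    return ub


# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
example = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(core_sync(example), core_async(example))
```

On this example both variants converge to the same values, which matches the statement above that the two algorithms output the same result once they converge.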
3.3.3 I/O-Efficient Core Decomposition

Computing core numbers core(·) of vertices in an I/O-efficient manner is studied in [19, 44, 95]. Here, the graph is assumed to be stored on disk in the adjacency array representation (see Section 1.1.3) by two binary files, pstart and edges; pstart consists of n + 1 integers and is assumed to be loaded and stored in main memory. The sets of neighbors of vertices can be iterated by conducting a sequential I/O of the entire file edges.
Algorithm 7: CoreD-IO: I/O-efficiently compute core numbers of vertices
Input: A graph G = (V, E)
Output: core(v) for each vertex v ∈ V
 1  vmin ← vn; vmax ← v1;
 2  for vertex u ← v1 to vn do
 3      \overline{core}(u) ← the degree of u in G;
 4      Load N(u) from disk;
 5      cnt(u) ← the number of vertices in N(u) whose degrees are no smaller than that of u;
 6      if cnt(u) < \overline{core}(u) then
 7          if u < vmin then vmin ← u;
 8          vmax ← u;
 9  while vmin ≤ vmax do
10      v′min ← vn; v′max ← v1;
11      for vertex u ← vmin to vmax s.t. \overline{core}(u) > cnt(u) do
12          Load N(u) from disk;
13          old ← \overline{core}(u);
14          \overline{core}(u) ← HIndex(N(u), \overline{core});
15          cnt(u) ← 0;
16          for each vertex v ∈ N(u) do
17              if \overline{core}(v) ≥ \overline{core}(u) then cnt(u) ← cnt(u) + 1;
18              if \overline{core}(u) < \overline{core}(v) ≤ old then
19                  if cnt(v) = \overline{core}(v) and not u < v ≤ vmax then
20                      if v < v′min then v′min ← v;
21                      if v > v′max then v′max ← v;
22                  cnt(v) ← cnt(v) − 1;
23      vmin ← v′min; vmax ← v′max;
24  return core(v) ← \overline{core}(v) for each vertex v ∈ V;
The pseudocode of the I/O-efficient core decomposition algorithm in [95], denoted CoreD-IO, is shown in Algorithm 7. It is a variant of CoreD-Local-opt, and assumes that the main memory is of size O(n) and can hold two arrays, \overline{core}(·) and cnt(·), but not the entire graph. The edges of a vertex are loaded from disk to main memory when needed, at Lines 4 and 12. Rather than maintaining in a queue Q all vertices whose upper bounds of core numbers (i.e., \overline{core}(·)) will change, it records in vmin and vmax the vertices with the smallest id and the largest id in Q, respectively. Note that here the vertices are assumed to be comparable with respect to their ids; for example, the vertices are 0, 1, . . . , n − 1. Thus, the algorithm iterates through every vertex in the range between vmin and vmax to update their \overline{core}(·) values, by loading their neighbors from disk to main memory. Note that, in this process, a vertex u, even if it is in the range between vmin and vmax, is skipped if \overline{core}(u) ≤ cnt(u). The experimental results in [95] show that CoreD-IO can process large graphs with up to one billion vertices and 42 billion edges in several hundred minutes by consuming less than 5GB main memory.
3.4 Further Readings

Variants of core decomposition for other data types and settings have also been studied in the literature. For example, core decomposition is studied for weighted graphs in [30], for directed graphs in [31], and for uncertain graphs in [13]. On the other hand, to handle graph updates, streaming algorithms that maintain the core numbers core(·) of vertices under edge insertions and deletions are studied in [51, 74, 75, 101]. A tutorial on the concepts, algorithms, and applications of core decomposition of networks was given at EDBT 2016 [56].
Chapter 4
Average Degree-Based Densest Subgraph Computation
In this chapter, we study average degree-based densest subgraph computation, where average degree is usually referred to as the edge density in the literature. In Section 4.1, we give preliminaries of densest subgraphs. Approximation algorithms and exact algorithms for computing the densest subgraph of a large input graph are discussed in Section 4.2 and in Section 4.3, respectively.
4.1 Preliminaries

Definition 4.1. The edge density of a graph g, denoted ρ(g), is

$$\rho(g) = \frac{|E(g)|}{|V(g)|}.$$

Lemma 4.1. The edge density is half of the average degree, i.e., ρ(g) = d_avg(g)/2.

Proof. The average degree of g is

$$d_{avg}(g) = \frac{\sum_{v \in V(g)} d_g(v)}{|V(g)|} = \frac{2 \times |E(g)|}{|V(g)|} = 2 \times \rho(g).$$
□

[Figure 4.1: a graph on vertices v1, v2, v3, v4]
Fig. 4.1: An example graph

For example, for the graph G in Figure 4.1, its edge density is ρ(G) = 4/4 = 1, and its average degree is d_avg(G) = 2. In the following, we simply refer to edge density as density.
© Springer Nature Switzerland AG 2018 L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0_4
Definition 4.2. Given a graph G, a subgraph g of G is a densest subgraph of G if ρ(g) ≥ ρ(g′) for all subgraphs g′ of G.

Note that the densest subgraph of a graph may not be unique. For example, for the graph in Figure 4.1, both the entire graph and the subgraph induced by vertices {v1, v2, v3} are densest subgraphs with density 1. Nevertheless, it is easy to see that any densest subgraph of G is a vertex-induced subgraph of G. Thus, in the remainder of this chapter, we simply use a vertex subset S to denote the subgraph G[S] of G induced by S and use E(S) to denote the set of edges of the subgraph G[S].

Densest Subgraph Computation. Given an input graph G, the problem of densest subgraph computation is to compute a densest subgraph S∗ of G. In the remainder of this chapter, we use S∗ to denote a densest subgraph of G and use ρ∗ to denote the density of the densest subgraph of G (i.e., ρ∗ = ρ(S∗)).
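The definitions above can be checked on a tiny instance with a brute-force sketch in Python. The edge list below is an assumption: the text only states that the graph of Figure 4.1 has four vertices and four edges and that {v1, v2, v3} has density 1, so a triangle on v1, v2, v3 plus the edge (v3, v4) is used here as one graph consistent with that description. The function names are hypothetical, and the exhaustive enumeration is exponential, so it is only meant for illustration.

```python
from fractions import Fraction
from itertools import combinations


def edge_density(vertices, edges):
    """Density |E(S)| / |S| of the subgraph induced by the given vertices."""
    vs = set(vertices)
    induced = sum(1 for u, v in edges if u in vs and v in vs)
    return Fraction(induced, len(vs))


def densest_subgraphs_bruteforce(vertices, edges):
    """Enumerate all non-empty vertex subsets and keep those of maximum density."""
    best, winners = None, []
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            rho = edge_density(subset, edges)
            if best is None or rho > best:
                best, winners = rho, [set(subset)]
            elif rho == best:
                winners.append(set(subset))
    return best, winners


# Assumed edge set: triangle v1, v2, v3 plus the pendant edge (v3, v4).
V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (2, 3), (3, 4)]

print(edge_density(V, E), 2 * edge_density(V, E))
print(densest_subgraphs_bruteforce(V, E))
```

Under this assumed edge set, the whole graph and the triangle both reach the maximum density, which illustrates that a densest subgraph need not be unique.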
4.1.1 Properties of Densest Subgraph

Densest subgraph has several properties, as proved by the following lemmas.

Lemma 4.2. The minimum degree of S∗ is no smaller than ρ∗, that is:

$$d_{min}(S^*) = \min_{u \in S^*} d_{S^*}(u) \ge \rho^* = \rho(S^*),$$

where d_{S∗}(u) is the degree of u in the subgraph G[S∗].

Proof. This lemma can be proved by contradiction. Suppose that there is a vertex u ∈ S∗ with degree d_{S∗}(u) < ρ∗ = |E(S∗)|/|S∗|. Let S be S∗ \ {u}. Then, the density of S is

$$\rho(S) = \frac{|E(S)|}{|S|} = \frac{|E(S^*)| - d_{S^*}(u)}{|S^*| - 1} > \frac{|E(S^*)| - \frac{|E(S^*)|}{|S^*|}}{|S^*| - 1} = \frac{|E(S^*)|}{|S^*|} = \rho(S^*),$$

which contradicts that S∗ is the densest subgraph of G. Thus, the lemma holds. □

Lemma 4.3. δ(G)/2 ≤ ρ∗ ≤ δ(G) holds for every graph G; recall that δ(G) is the degeneracy of G.

Proof. Firstly, recall that any graph G has a non-empty δ(G)-core, in which each vertex has at least δ(G) neighbors within the subgraph. Then, the average degree of the δ(G)-core is at least δ(G), and thus the density of the δ(G)-core is at least δ(G)/2. Consequently, the density of the densest subgraph is at least δ(G)/2; that is, ρ∗ ≥ δ(G)/2. Secondly, following Lemma 4.2, ρ∗ must be no larger than d_min(S∗), which in turn is no larger than δ(G) by following the definition of degeneracy. Consequently, ρ∗ ≤ δ(G). □
4.2 Approximation Algorithms

In this section, we first present, in Section 4.2.1, a 2-approximation algorithm that runs in linear time for computing the densest subgraph. Then, in Section 4.2.2, we introduce a multi-pass streaming 2(1 + ε)-approximation algorithm that accesses the graph only through multiple sequential passes.
4.2.1 A 2-Approximation Algorithm

Before presenting the approximation algorithm, we first define the notion of θ-approximation algorithm.

Definition 4.3. For the problem of densest subgraph computation, an algorithm is a θ-approximation algorithm if for every graph G, it outputs a subgraph S such that

$$\rho(S) \ge \frac{\rho(S^*)}{\theta},$$

where S∗ is the densest subgraph of G.

It is interesting to observe that an algorithm that simply returns the δ(G)-core of the input graph G is a 2-approximation algorithm, as proved by the following theorem. Recall that the δ(G)-core is non-empty and can be computed in O(m) time (see Chapter 3).

Theorem 4.1. The δ(G)-core of a graph G is a 2-approximation of the densest subgraph of G.

Proof. This directly follows from the proof of Lemma 4.3. □
A Greedy 2-Approximation Algorithm. Rather than directly returning the δ(G)-core of G as an approximation of the densest subgraph of G, the greedy algorithm in [18] returns the subgraph that has the maximum density among n chosen subgraphs of G. The n subgraphs are iteratively obtained from G by removing the vertex with the minimum degree. Equivalently, this process computes a degeneracy ordering of vertices of G (see Definition 3.5), which is a permutation of all vertices of G. Then, each of the n suffixes of the degeneracy ordering represents a subgraph, and the one with the maximum density is returned. The pseudocode of the greedy algorithm, denoted by Densest-Greedy, is shown in Algorithm 1. It first invokes the peeling algorithm Peel, which is presented as Algorithm 1 in Chapter 3, to compute the degeneracy ordering of vertices of G (Line 1). Then, it computes the density of each suffix S of the degeneracy ordering (Line 5), by maintaining the number mS of edges in the subgraph of G induced by the suffix (Lines 9–10). The suffix that has the maximum density is recorded in S̃ (Lines 6–7), which is returned by the algorithm at Line 11.
Algorithm 1: Densest-Greedy: a 2-approximation algorithm for densest subgraph [18]
Input: A graph G = (V, E)
Output: A dense subgraph S̃
 1  Compute the degeneracy ordering of G by invoking Peel (Algorithm 1 in Chapter 3);
 2  S̃ ← ∅;
 3  S ← V; mS ← |E|;
 4  for each vertex u of G according to the degeneracy ordering do
 5      ρ(S) ← mS/|S|;
 6      if S̃ = ∅ or ρ(S) > ρ(S̃) then
 7          S̃ ← S;
 8      S ← S \ {u};
 9      for each neighbor v of u such that v ∈ S do
10          mS ← mS − 1;
11  return S̃;
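The greedy procedure can be sketched in Python as follows. This is a simplified illustration rather than the implementation of [18]: it fuses the peeling and the suffix evaluation into one loop, and it uses a binary heap with lazy deletion instead of the linear-time Peel, so it runs in O(m log n) time instead of O(m). The name `densest_greedy` and the input format (a dictionary mapping each vertex to the set of its neighbors, with mutually comparable vertex ids) are assumptions.

```python
import heapq
from fractions import Fraction


def densest_greedy(adj):
    """Return (vertex set, density) of the densest suffix of a peeling order."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    alive = set(adj)
    heap = [(d, u) for u, d in deg.items()]
    heapq.heapify(heap)
    m_s = sum(deg.values()) // 2          # edges in the current suffix S
    order = []                            # vertices in removal order
    best_density, best_start = None, 0    # only the start of the best suffix is stored
    while alive:
        density = Fraction(m_s, len(alive))
        if best_density is None or density > best_density:
            best_density, best_start = density, len(order)
        while True:                       # pop the minimum-degree vertex, skipping stale entries
            d, u = heapq.heappop(heap)
            if u in alive and d == deg[u]:
                break
        alive.remove(u)
        order.append(u)
        for v in adj[u]:
            if v in alive:
                deg[v] -= 1
                m_s -= 1
                heapq.heappush(heap, (deg[v], v))
    return set(order[best_start:]), best_density


# A 4-clique {0, 1, 2, 3} with a path 0 - 4 - 5 attached.
example = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
print(densest_greedy(example))
```

Storing only the index `best_start` mirrors the remark that the best suffix can be recorded by a single vertex of the ordering instead of copying the whole set.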
Analysis of the Densest-Greedy Algorithm. It is easy to see that the time complexity of Algorithm 1 is O(m). Firstly, recall that the Peel algorithm runs in O(m) time. Secondly, Lines 9–10 run in O(d(u)) time for a vertex u. Thirdly, at Line 7, rather than copying the entire set S into S̃, we only need to store the vertex u at S̃, since each subgraph is a suffix of the degeneracy ordering. Regarding the approximation ratio, it can be directly concluded from Theorem 4.1 that Algorithm 1 is a 2-approximation algorithm, since the δ(G)-core of G is one of the suffixes of the degeneracy ordering. Nevertheless, it can also be proved by the following theorem, where the general idea of the proof is also useful for proving other theorems.

Theorem 4.2 ([18]). Densest-Greedy (Algorithm 1) is a 2-approximation algorithm.

Proof. Let v be the smallest vertex of S∗ according to the degeneracy ordering, and let S_v be the suffix of the degeneracy ordering starting at v. Then, it holds that S∗ ⊆ S_v and

$$d_{avg}(S_v) \ge d_{min}(S_v) = d_{S_v}(v) \ge d_{S^*}(v) \ge \rho(S^*),$$

where the equality follows from the definition of degeneracy ordering (i.e., d_{S_v}(v) = d_min(S_v)), the second inequality follows from the fact that S∗ ⊆ S_v, and the last inequality follows from Lemma 4.2. Following the nature of the algorithm (specifically Lines 6–7), it holds that

$$\rho(\tilde{S}) \ge \rho(S_v) = \frac{d_{avg}(S_v)}{2} \ge \frac{\rho(S^*)}{2},$$

where S̃ is the output of Algorithm 1. Thus, Algorithm 1 is a 2-approximation algorithm. □
4.2.2 A Streaming 2(1 + ε)-Approximation Algorithm

In many application scenarios, the number of iterations of an algorithm is also essential for the efficiency [5], e.g., in an I/O-efficient setting, or in a distributed setting. Typically, in an I/O-efficient setting, the input graph is stored on disk in the form of adjacency lists, and each iteration of an algorithm needs to sequentially access (possibly) the entire graph once. Thus, the fewer the iterations, the fewer the I/Os. However, the number of iterations of the 2-approximation algorithm (i.e., Algorithm 1) is n, which is large. In view of this, a streaming algorithm is proposed in [5], which reduces the number of iterations from n to O(log_{1+ε} n) for an input parameter ε > 0.

Algorithm 2: Densest-Streaming: a streaming and approximation algorithm for densest subgraph [5]
Input: A graph G = (V, E), a parameter ε > 0
Output: A dense subgraph S̃
 1  S̃ ← ∅;
 2  S ← V; mS ← |E|;
 3  for each vertex u ∈ V do dS(u) ← the degree of u in G;
 4  while S ≠ ∅ do
 5      ρ(S) ← mS/|S|;
 6      if S̃ = ∅ or ρ(S) > ρ(S̃) then
 7          S̃ ← S;
 8      Δ(S) ← {v ∈ S | dS(v) ≤ 2 × (1 + ε) × ρ(S)};
 9      for each v ∈ Δ(S) do
10          S ← S \ {v};
11          for each neighbor u of v in S do
12              dS(u) ← dS(u) − 1;
13              mS ← mS − 1;
14  return S̃;
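An in-memory sketch of Algorithm 2 in Python is given below. It only mirrors the batch-removal logic; the disk-resident adjacency lists and the sequential scans are not modeled. The function name `densest_streaming`, the dictionary-of-sets input format, and the extra return value counting the passes are assumptions made for illustration, and densities are computed with floating-point arithmetic.

```python
def densest_streaming(adj, eps):
    """Return (best vertex set, its density, number of passes)."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}   # d_S(u)
    s = set(adj)
    m_s = sum(deg.values()) // 2
    best, best_density = set(), -1.0
    passes = 0
    while s:
        density = m_s / len(s)
        if density > best_density:
            best, best_density = set(s), density
        # Batch of vertices whose degree is at most 2(1 + eps) times the density.
        batch = [v for v in s if deg[v] <= 2 * (1 + eps) * density]
        for v in batch:
            s.remove(v)
            for u in adj[v]:
                if u in s:            # each removed edge is counted exactly once
                    deg[u] -= 1
                    m_s -= 1
        passes += 1
    return best, best_density, passes


# A 4-clique {0, 1, 2, 3} with a path 0 - 4 - 5 attached.
example = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
print(densest_streaming(example, eps=0.1))
```

On this small example the first pass removes the two path vertices together and the second pass removes the whole clique, so the loop finishes after two passes instead of six single-vertex removals.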
The pseudocode of the streaming algorithm is shown in Algorithm 2. The input graph G is assumed to be stored on disk, where all adjacent edges of a vertex are stored consecutively. The algorithm inputs a parameter ε > 0 (e.g., ε = 0.3) for controlling the approximation ratio of the algorithm, which makes a trade-off between the efficiency and the approximation ratio. Similar to Algorithm 1, S̃ is used to store the subgraph with the largest density among all checked subgraphs, and S is used to store the subgraph obtained by iteratively removing vertices from G. S is initialized to be V (Line 2). At Lines 4–13, batches of vertices are iteratively removed from S until S becomes empty. In each iteration, S is firstly used to update S̃ (Lines 5–7), in a similar way to that in Algorithm 1. However, different from Algorithm 1, which removes only one vertex in each iteration, Algorithm 2 removes a batch of vertices from S in each iteration. The set of removed vertices, denoted as Δ(S), is identified as {v ∈ S | dS(v) ≤ 2 × (1 + ε) × ρ(S)}. Lines 9–13 update dS(u) and also the total number of edges in E(S) after removing all vertices of Δ(S) from S; this operation can be conducted by sequentially scanning the adjacency lists of vertices on disk. When S becomes empty, the subgraph S̃ is returned (Line 14).

Analysis of the Densest-Streaming Algorithm. The main cost of Densest-Streaming (i.e., Algorithm 2) is scanning the adjacency lists of vertices on disk. Note that in the worst case all edges of G need to be scanned in each iteration. Thus, the I/O complexity of Algorithm 2 is mainly determined by the number of iterations, which is bounded by the following lemma.

Lemma 4.4 ([5]). Densest-Streaming (i.e., Algorithm 2) terminates in O(log_{1+ε} n) iterations.

Proof. In each iteration, a set Δ(S) = {v ∈ S | dS(v) ≤ 2 × (1 + ε) × ρ(S)} of vertices is removed from S. To bound the number of iterations, the main problem is to bound either |Δ(S)| or |S \ Δ(S)|. It is easy to see that Δ(S) is not empty, since there is at least one vertex in S whose degree is no greater than the average degree of all vertices in S. For each vertex v that is not removed (i.e., v ∈ S \ Δ(S)), it holds that dS(v) > 2 × (1 + ε) × ρ(S). Thus:

$$\begin{aligned}
2 \times |E(S)| &= \sum_{v \in S} d_S(v) \\
&= \sum_{v \in \Delta(S)} d_S(v) + \sum_{v \in S \setminus \Delta(S)} d_S(v) \\
&\ge \sum_{v \in S \setminus \Delta(S)} d_S(v) \\
&> \sum_{v \in S \setminus \Delta(S)} 2 \times (1 + \varepsilon) \times \rho(S) \\
&= |S \setminus \Delta(S)| \times 2 \times (1 + \varepsilon) \times \frac{|E(S)|}{|S|}.
\end{aligned}$$
This results in the following inequality: |S \ Δ (S)|
ρ∗. However, when x = ρ∗, there exist both a minimum s–t cut ({s} ∪ S, {t} ∪ T) in Gx such that S ≠ ∅ and a minimum s–t cut ({s} ∪ S′, {t} ∪ T′) in Gx such that S′ = ∅. This is because the minimum s–t cut in a graph may not be unique, and different algorithms may report different minimum s–t cuts; nevertheless, all minimum s–t cuts have the same value.
Algorithm 3: DensityTest: test whether ρ∗ ≥ x [35]
Input: A graph G = (V, E), a real number x ≥ 0
Output: A vertex subset S, such that if S ≠ ∅ then ρ∗ ≥ ρ(S) ≥ x and otherwise ρ∗ ≤ x
 1  Gx ← G;
 2  Assign a weight 1 to every edge of Gx;
 3  Add a source vertex s ∉ V and a sink vertex t ∉ V to Gx;
 4  for each vertex u ∈ V do
 5      Add edge (s, u) with weight m to Gx;
 6      Add edge (u, t) with weight m + 2x − d(u) to Gx;
 7  Compute a minimum s–t cut in Gx and denote it by (A, B);
 8  return A \ {s};
Following the above discussions, the pseudocode for testing whether the density ρ∗ of a densest subgraph of a graph G is no smaller than x is shown in Algorithm 3, denoted DensityTest. It is self-explanatory. The correctness of DensityTest is proved by the following theorem, which directly follows from Theorem 4.5.

Theorem 4.6. Given an input graph G and an input real number x, let S be the result of invoking DensityTest(G, x). If S ≠ ∅, then ρ∗ ≥ ρ(S) ≥ x; if S = ∅, then ρ∗ ≤ x.

Let T_MC(m, n) be the time complexity of computing a minimum s–t cut in a graph with n vertices and m edges, for a designated source vertex s and a designated sink vertex t. Then, the time complexity of DensityTest is T_MC(m, n). Currently, T_MC(m, n) can be achieved as O(n × m × log(n²/m)) [36].

Example 4.1. Consider the graph G in Figure 4.2(a), which has 6 vertices and 8 edges. The augmented graph Gx of G for x = 1.25 is illustrated in Figure 4.2(b). A minimum s–t cut in Gx is computed as shown in Figure 4.2(c), which is ({s, v1, v2, v3, v4}, {v5, v6, t}); that is, S = {v1, v2, v3, v4}. As S ≠ ∅, it can be concluded that ρ∗ ≥ 1.25. Moreover, it can be easily verified that ρ(S) ≥ 1.25, as shown in Figure 4.2(d).
4.3.2 The Densest-Exact Algorithm

Based on Theorem 4.5, a binary search can be conducted to search for the value of ρ∗. However, as ρ∗ is a real number, it is not easy to obtain the exact value of ρ∗ due to the imprecise representation of real numbers in a computer. The good news is that the difference between the densities of any two subgraphs of G cannot be arbitrarily small, as proved by the following lemma.

Lemma 4.5 ([35]). Given two subgraphs S1 and S2, if ρ(S1) > ρ(S2), then it holds that ρ(S1) − ρ(S2) ≥ 1/(n(n − 1)).

Proof. The following equalities are immediate:

$$\rho(S_1) - \rho(S_2) = \frac{|E(S_1)|}{|S_1|} - \frac{|E(S_2)|}{|S_2|} = \frac{|E(S_1)| \times |S_2| - |E(S_2)| \times |S_1|}{|S_1| \times |S_2|}.$$

Firstly, as |E(S1)| × |S2| − |E(S2)| × |S1| is an integer and ρ(S1) > ρ(S2), it must hold that |E(S1)| × |S2| − |E(S2)| × |S1| ≥ 1. Secondly, as S1 ≠ S2, it must hold that |S1| × |S2| ≤ n(n − 1). Thus, ρ(S1) − ρ(S2) ≥ 1/(n(n − 1)), and the lemma follows. □
Algorithm 4: Densest-Exact: an exact algorithm for densest subgraph [35]
Input: A graph G = (V, E)
Output: A densest subgraph of G
 1  S̃ ← ∅;
 2  l ← 0; h ← m;
 3  while h − l ≥ 1/(n(n − 1)) do
 4      x ← (l + h)/2;
 5      S ← DensityTest(G, x);
 6      if S ≠ ∅ then
 7          S̃ ← S;
 8          l ← x;
 9      else
10          h ← x;
11  return S̃;
Following Lemma 4.5 and based on the procedure DensityTest, the algorithm for computing a densest subgraph of a graph G is shown in Algorithm 4, denoted Densest-Exact. The search range of ρ∗ is denoted by [l, h], and the algorithm stops when h − l < 1/(n(n − 1)) (Line 3). It is easy to see that 0 ≤ ρ∗ ≤ m. Thus, l is initialized as 0 and h is initialized as m (Line 2). The correctness of Densest-Exact is proved by the following theorem.

Theorem 4.7. Densest-Exact(G) correctly computes a densest subgraph of G.

Proof. Firstly, it is easy to see that the invariant ρ(S̃) ≥ l and the invariant l ≤ ρ∗ ≤ h are maintained during the execution of Algorithm 4. Secondly, when the algorithm stops, it holds that h − l < 1/(n(n − 1)). Thus:

$$\rho(\tilde{S}) \ge l > h - \frac{1}{n(n-1)} \ge \rho^* - \frac{1}{n(n-1)}.$$

Moreover, following Lemma 4.5, for every subgraph S of G such that ρ(S) ≠ ρ∗, it must be ρ(S) ≤ ρ∗ − 1/(n(n − 1)). Thus, ρ(S) < ρ(S̃), which means that S̃ is a densest subgraph of G. □

As the number of iterations in Densest-Exact is O(log n) and the time complexity of DensityTest is T_MC(m, n), the time complexity of Densest-Exact is O(log n × T_MC(m, n)), which can be achieved as O(log n × m × n × log(n²/m)). Note that, by using the parametric maximum flow technique in [29], the time complexity can be reduced to O(m × n × log(n²/m)).
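DensityTest and Densest-Exact can be sketched together in Python as follows. Several choices here are assumptions and not part of the source: the minimum cut is obtained with a plain Edmonds-Karp maximum flow (far slower than the O(n × m × log(n²/m)) bound quoted above), the returned cut is the one whose source side is the set of vertices reachable from s in the final residual graph, densities are represented exactly with `Fraction` to avoid the floating-point issue mentioned earlier, and all function names are hypothetical. The input is a dictionary mapping each vertex to the set of its neighbors.

```python
from collections import deque
from fractions import Fraction


def min_cut_source_side(cap, s, t):
    """Edmonds-Karp; returns the vertices reachable from s in the final residual graph."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}
    for u in cap:
        for v in cap[u]:
            res[v].setdefault(u, 0)              # reverse arcs of the residual graph
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:         # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[a][b] for a, b in path)
        for a, b in path:
            res[a][b] -= push
            res[b][a] += push


def density_test(adj, x):
    """Build the augmented graph G_x and return A minus {s} for a minimum s-t cut (A, B)."""
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    s, t = object(), object()                    # fresh source and sink, not in V
    cap = {s: {}, t: {}}
    for u, nbrs in adj.items():
        cap[u] = {v: 1 for v in nbrs}            # weight 1 on every original edge
        cap[s][u] = m
        cap[u][t] = m + 2 * x - len(nbrs)
    return min_cut_source_side(cap, s, t) - {s}


def densest_exact(adj):
    """Binary search on the density, stopping once the range is below 1/(n(n-1))."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    if m == 0:
        return set(adj)                          # every subgraph has density 0
    best, lo, hi = set(), Fraction(0), Fraction(m)
    gap = Fraction(1, n * (n - 1))
    while hi - lo >= gap:
        x = (lo + hi) / 2
        s = density_test(adj, x)
        if s:
            best, lo = s, x
        else:
            hi = x
    return best


# A 4-clique {0, 1, 2, 3} with a path 0 - 4 - 5 attached.
example = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
print(densest_exact(example))
```

Because the reachable-set cut is the minimum cut with the smallest source side, this sketch returns the empty set when x equals ρ∗ exactly, which the binary search handles by lowering h.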
4.3.3 Pruning for Densest Subgraph Computation

As the time complexity of Densest-Exact is at least quadratic, the Densest-Exact algorithm cannot be directly applied to compute the densest subgraph in a large graph. As a result, pruning techniques are required to compute densest subgraphs in large graphs. Actually, the greedy algorithm Densest-Greedy presented in Section 4.2.1 can be used to significantly reduce the graph instance that needs to be processed by Densest-Exact. The general idea is to first run the Densest-Greedy algorithm to obtain a lower bound of the density ρ∗ of the densest subgraph in G. Let S be the subgraph of G returned by Densest-Greedy(G). Then, following Lemma 4.2, the minimum degree of S∗ must be at least ρ(S). Thus, any vertex of degree smaller than ρ(S) can be removed from G without affecting the densest subgraph of G; equivalently, Densest-Exact only needs to be invoked on the ρ(S)-core of G.
                    Original size                  Reduced size
Graphs              n             m                n         m
as-Skitter          1,694,616     11,094,209       915       73,480
soc-LiveJournal1    4,843,953     42,845,684       3,639     661,891
uk-2005             39,252,879    781,439,892      51,784    15,037,470
it-2004             41,290,577    1,027,474,895    4,279     8,593,024
twitter-2010        41,652,230    1,202,513,046    11,619    17,996,107

Table 4.1: Reduced graph for densest subgraph computation

Table 4.1 illustrates the sizes of the reduced graphs obtained by the above pruning technique, for densest subgraph computation on the five representative graph instances presented in Chapter 1. Here, n and m are the number of vertices and the number of edges, respectively, in the original graph and in the reduced graph. It is obvious that the reduced graph is significantly smaller than the original input graph.
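The pruning step can be sketched in Python as a standard iterative removal of low-degree vertices. The function name `prune_by_density` and the dictionary-of-sets input are assumptions; the lower bound `rho_lb` is passed in directly and would, in the setting described above, be the density of the subgraph returned by Densest-Greedy.

```python
from collections import deque
from fractions import Fraction


def prune_by_density(adj, rho_lb):
    """Repeatedly delete vertices whose remaining degree is below rho_lb."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    queue = deque(u for u in adj if deg[u] < rho_lb)
    removed = set(queue)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in removed:
                deg[v] -= 1
                if deg[v] < rho_lb:      # v falls below the bound after losing u
                    removed.add(v)
                    queue.append(v)
    return {u: {v for v in adj[u] if v not in removed}
            for u in adj if u not in removed}


# A 4-clique {0, 1, 2, 3} with a path 0 - 4 - 5 attached; 3/2 is the density of the clique.
example = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
print(prune_by_density(example, Fraction(3, 2)))
```

Only the reduced graph returned by such a step would then be handed to the exact algorithm.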
4.4 Further Readings

The densest subgraph problem can also be solved by algebraic techniques, e.g., linear programming [18]. Besides, variants of the densest subgraph problem for other data types and settings have also been studied in the literature. For example, densest subgraph computation is studied for directed graphs in [45] and for weighted graphs in [35]. Dynamic densest subgraph discovery is studied in [9, 27, 63]. Locally densest subgraph discovery is studied in [70]. Density-friendly graph decomposition is studied in [26, 87]. Computing top-k densest subgraphs with limited overlaps is studied in [6]. Densest subgraph computation in dual networks is studied in [97], and densest subgraph-based local community detection is studied in [96]. A tutorial on the concepts, algorithms, and applications of densest subgraph computation was given at KDD 2015 [34].
Chapter 5
Higher-Order Structure-Based Graph Decomposition
Higher-order structures, also known as motifs or graphlets, have recently been used to successfully locate dense regions that cannot be detected otherwise by edge-centric methods [8, 89, 90]. For example, a study in Science 2016 [8] shows that clustering based on higher-order connectivity patterns can reveal higher-order organization in a number of networks, including information propagation units in neuronal networks and hub structures in transportation networks. A typical structure of higher-order connectivity patterns is a k-clique for a specific k value, which is a complete graph with k vertices. In this chapter, we first present k-clique enumeration algorithms in Section 5.1. Then, in Section 5.2, we discuss higher-order core decomposition, specifically truss decomposition and nucleus decomposition. Finally, in Section 5.3, we introduce higher-order densest subgraph computation, i.e., k-clique densest subgraph computation.
5.1 k-Clique Enumeration k-cliques are building blocks of higher-order graph analysis. In this section, we introduce k-clique enumeration algorithms in Section 5.1.2. Before that, we first present triangle enumeration algorithms in Section 5.1.1, where triangle is a special case of k-clique for k = 3, and moreover triangle enumeration algorithms illustrate the central ideas of k-clique enumeration algorithms.
© Springer Nature Switzerland AG 2018 L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0_5

5.1.1 Triangle Enumeration Algorithms

A triangle is a 3-clique, which consists of three vertices such that there is an edge between every pair of the three vertices; let △u,v,w denote the triangle consisting of vertices u, v, w. The problem of triangle enumeration is to enumerate all triangles in a graph. This problem has been extensively studied, and many algorithms have been proposed. The first triangle enumeration algorithm with the time complexity O(m^{3/2}) was introduced by Itai and Rodeh [43]. The time complexity of O(m^{3/2}) is worst-case optimal and cannot be further improved with respect to m [47], since a complete graph with m edges has Θ(m^{3/2}) triangles. By using the concept of arboricity α(G), Chiba and Nishizeki [20] managed to enumerate all triangles in a sparse graph G in O(α(G) × m) ⊆ O(m^{3/2}) time. Since then, many triangle enumeration algorithms with the time complexity of O(α(G) × m) have been proposed, e.g., see [47, 64, 79]. In the following, we first present the K3 algorithm proposed by Chiba and Nishizeki in [20], and then introduce a general framework for triangle enumeration based on the idea of oriented graph.
5.1.1.1 The K3 Algorithm

The general idea of the K3 algorithm proposed in [20] is to process all vertices of a graph G in non-increasing degree order. When processing a vertex u, it enumerates every length-two path (u, v, w) (i.e., (u, v) ∈ E and (v, w) ∈ E) in G that starts from u and reports a triangle △u,v,w if there is an edge in G between the first vertex (i.e., u) and the last vertex (i.e., w) of the length-two path; thus, all triangles containing u are enumerated. After processing u, the vertex u is removed from the graph to ensure that each triangle is enumerated only once.
Algorithm 1: K3: enumerate all triangles of a graph [20]
Input: An undirected graph G = (V, E)
Output: All triangles in G
 1  Sort vertices of G such that d(v1) ≥ · · · ≥ d(vn);
 2  for each vertex u ← v1, . . . , vn do
 3      for each neighbor v ∈ N(u) do Mark v;
 4      for each neighbor v ∈ N(u) do
 5          for each neighbor w ∈ N(v) do
 6              if w is marked and v < w then
 7                  Output triangle △u,v,w;
 8      for each neighbor v ∈ N(u) do Unmark v;
 9      Remove u from G;
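A compact Python sketch of K3 is given below. It is an illustration under assumptions, not the implementation of [20]: marks are kept in a Python set rather than a boolean array, vertex removal is simulated with a `removed` set (similar in spirit to marking vertices as deleted), and the name `k3_triangles` and the dictionary-of-sets input with comparable vertex ids are hypothetical.

```python
def k3_triangles(adj):
    """Enumerate each triangle once, processing vertices in non-increasing degree order."""
    order = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    removed = set()      # vertices already processed and logically deleted
    marked = set()       # current neighbors of u
    triangles = []
    for u in order:
        nbrs = [v for v in adj[u] if v not in removed]
        marked.update(nbrs)                      # mark the neighbors of u
        for v in nbrs:
            for w in adj[v]:
                # w marked means (u, w) is an edge; v < w avoids reporting twice.
                if w in marked and v < w:
                    triangles.append((u, v, w))
        marked.clear()                           # unmark the neighbors of u
        removed.add(u)                           # remove u from the graph
    return triangles


# A 4-clique {0, 1, 2, 3} with a path 0 - 4 - 5 attached.
example = {0: {1, 2, 3, 4}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {0, 5}, 5: {4}}
print(k3_triangles(example))
```

Since only neighbors that are still present get marked, a marked vertex w is never a removed vertex, so no separate check on w is needed.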
The pseudocode of K3 is shown in Algorithm 1, where vertices are firstly sorted to be in non-increasing degree order (Line 1). In order to efficiently test whether there is an edge between u and w at Line 6, it marks all the neighbors of u at Line 3 such that each subsequent test can be conducted in constant time; that is, there is an edge between u and w in G if and only if w is marked. Note that, although a hash table can also be used to achieve this goal, hash tables incur not only space overhead but also time overhead in practice. At Line 6, before outputting the triangle △u,v,w, it also checks whether the vertex ID of v is smaller than that of w (i.e., v < w); this is to avoid duplicate enumeration, since the triangle △u,v,w may be enumerated from both the length-two path (u, v, w) and the length-two path (u, w, v). After outputting all triangles containing u, all neighbors of u are unmarked (Line 8), and the vertex u and all its associated edges are removed from the graph (Line 9).

Analysis of K3. The correctness of K3 can be easily shown by induction. The time complexity of K3 follows from the following lemma.

Lemma 5.1 ([20]). Given a graph G = (V, E), it holds that
$$\sum_{(u,v) \in E} \min\{d(u), d(v)\} \le 2 \times \alpha(G) \times m,$$

where α(G) is the arboricity of G (see Chapter 3 for the definition of arboricity).

Proof. According to the definition of arboricity α(G), the set of all edges of the graph G can be partitioned into α(G) edge-disjoint spanning forests; let them be F_i for 1 ≤ i ≤ α(G). Then:

$$\sum_{(u,v) \in E} \min\{d(u), d(v)\} = \sum_{1 \le i \le \alpha(G)} \sum_{(u,v) \in F_i} \min\{d(u), d(v)\}.$$

Now, let us consider ∑_{(u,v)∈F_i} min{d(u), d(v)} for a forest F_i. Actually, the edges of the forest F_i can be assigned to vertices of V such that each vertex has at most one edge, as follows. For each tree T in the forest F_i, choose an arbitrary vertex u as the root of T, direct each edge of T to be from the end-point that is closer to u to the other end-point, and associate each edge (u, v) of tree T with the head vertex of the directed edge. Since min{d(u), d(v)} is at most the degree of the associated vertex and each vertex is associated with at most one edge:

$$\sum_{(u,v) \in F_i} \min\{d(u), d(v)\} \le \sum_{v \in V} d(v) = 2m.$$

Therefore, ∑_{(u,v)∈E} min{d(u), d(v)} ≤ 2 × α(G) × m, and the lemma holds. □

Following from Lemma 5.1, the time complexity of K3 is O(α(G) × m); note that sorting the vertices of G in non-increasing degree order at Line 1 can be conducted in O(n) time by counting sort [24]. Moreover, as each edge (u, v) is involved in at most min{d(u), d(v)} triangles and each triangle contains three edges, the following corollary regarding the number of triangles in G also immediately follows from Lemma 5.1.

Corollary 5.1. The number of triangles in G is at most (2/3) × α(G) × m.

Regarding the space complexity of K3, the original implementation in [20] proposes to represent the input graph G with doubly linked adjacency lists and mutual references between the two stubs of an edge, to ensure constant time edge removal. This implementation not only is inefficient in practice due to random access of main memory caused by the adjacency lists but also has large space complexity. Actually, the CSR graph representation presented in Chapter 1 can be used to implement K3. Specifically, to remove a vertex from the graph, we can simply mark the vertex as deleted rather than physically remove it from the graph. Thus, the space complexity can be achieved as 2m + O(n). Note that, by this implementation, the time complexity of K3 still is O(α(G) × m).
5.1.1.2 A General Framework for Triangle Enumeration

Ortmann and Brandes [64] presented a unified framework for triangle enumeration based on the concept of oriented graph.

Oriented Graph. The oriented graph of an undirected graph G is a directed graph, which is defined based on a total order ≺ of the vertices of G; different total orders will be discussed shortly. Given a graph G = (V, E) and a total order ≺ of V, the oriented version of G, denoted G+ = (V, E+), has the same set of vertices as G. The edges of G+ are obtained from G by directing each undirected edge in G according to the total order ≺; that is, for an undirected edge (u, v) ∈ E, it is directed from u to v if u ≺ v and is directed from v to u otherwise. Thus, G+ has m directed edges, and can be obtained from G in O(m) time. For example, Figure 5.1(b) shows the oriented version of the undirected graph in Figure 5.1(a), where the total order is the degree decreasing order.
[Figure 5.1 panels, each on vertices v1 to v11: (a) Example Graph, (b) Oriented Graph]
Fig. 5.1: An example graph and its oriented version In the oriented graph G+ which is a directed graph, the set of out-neighbors of a vertex u is N + (u) = {v ∈ V | (u, v) ∈ E + }, and the set of in-neighbors of u is N − (u) = {v ∈ V | (v, u) ∈ E + }. The out-degree and in-degree of u in G+ are denoted by d + (u) = |N + (u)| and d − (u) = |N − (u)|, respectively. The Framework. The general idea of triangle enumeration based on the oriented graph G+ is that every triangle u,v,w in G has a unique orientation based on the
total order of V. For example, assuming without loss of generality that u ≺ v ≺ w, the triangle △(u, v, w) can be generated in G+ by computing the common out-neighbor w of u and v.
Algorithm 2: TriE: enumerate all triangles in a graph [64]
Input: An undirected graph G = (V, E), and a total order ≺ of V
Output: All triangles in G
1   Construct the oriented graph G+ = (V, E+) of G with respect to ≺;
2   for each vertex u ∈ V do
3       for each out-neighbor v ∈ N+(u) do Mark v;
4       for each out-neighbor v ∈ N+(u) do
5           for each out-neighbor w ∈ N+(v) do
6               if w is marked then
7                   Output triangle △(u, v, w);
8       for each out-neighbor v ∈ N+(u) do Unmark v;
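The steps of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch rather than the implementation of [64]: the function name `trie`, the `decreasing` flag, and the use of dictionaries and sets instead of a CSR representation are assumptions made for brevity, and vertex IDs are assumed to be comparable so that they can break ties.

```python
def trie(edges, decreasing=False):
    """Enumerate each triangle of a simple undirected graph exactly once."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Total order: sort by (degree, vertex ID); reversing the list gives the
    # degree decreasing ordering with ties broken by larger ID first.
    order = sorted(adj, key=lambda u: (len(adj[u]), u), reverse=decreasing)
    rank = {u: i for i, u in enumerate(order)}
    # Line 1: direct every edge from the endpoint that comes first in the order.
    out = {u: [v for v in adj[u] if rank[u] < rank[v]] for u in adj}
    marked = set()
    triangles = []
    for u in order:                        # Line 2
        marked.update(out[u])              # Line 3: mark the out-neighbors of u
        for v in out[u]:                   # Line 4
            for w in out[v]:               # Line 5
                if w in marked:            # Lines 6-7: w is a common out-neighbor
                    triangles.append((u, v, w))
        marked.difference_update(out[u])   # Line 8: unmark
    return triangles
```

Each reported triple (u, v, w) satisfies u ≺ v ≺ w, which is why no triangle is reported twice regardless of the total order chosen.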
Based on the above idea, the general framework for triangle enumeration is shown in Algorithm 2, denoted TriE. It is similar to the K3 algorithm (Algorithm 1). The only difference is that TriE first orients the graph G with respect to the total order ≺ to obtain a directed graph G+ (Line 1) and then only enumerates the out-neighbors of vertices in G+ (Lines 3–5). Note that, in TriE, each triangle in G is enumerated exactly once; specifically, the triangle △(u, v, w) is enumerated only when processing the highest-ranked vertex in the triangle (i.e., u, assuming u ≺ v and u ≺ w) at Line 2. For example, for the oriented graph in Figure 5.1(b), all four triangles that contain v7, namely △(v7, v9, v8), △(v7, v11, v10), △(v7, v4, v5), and △(v7, v4, v6), are enumerated only when processing v7.

Total Orders. Three popularly used total orders are as follows:
1. Degree decreasing ordering. For any two vertices u, v ∈ V, u ≺ v (i.e., u ranks higher than v) if:
   • d(u) > d(v), or
   • d(u) = d(v) and u has a larger vertex ID than v.
2. Degree increasing ordering. For any two vertices u, v ∈ V, u ≺ v (i.e., u ranks higher than v) if:
   • d(u) < d(v), or
   • d(u) = d(v) and u has a smaller vertex ID than v.
3. Smallest-first ordering (aka, degeneracy ordering). For any two vertices u, v ∈ V, u ≺ v (i.e., u ranks higher than v) if u is before v in the degeneracy ordering of V (see Chapter 3 for the definition of degeneracy ordering).

In the above three total orders, vertex IDs are used to break ties; nevertheless, any tie-breaking rule can be applied. It is worth noting that the degree decreasing ordering
mentioned above is exactly the reverse of the degree increasing ordering. The above total orders have nice properties, as proved by the following lemmas.

Lemma 5.2. Assume that G+ is obtained from G based on the degree increasing ordering, then:

\[ \sum_{u \in V} \left(d^+(u)\right)^2 \le 2 \times \alpha(G) \times m. \]

Proof. Firstly:

\[ \sum_{u \in V} \left(d^+(u)\right)^2 = \sum_{u \in V} d^+(u) \times d^+(u) = \sum_{u \in V} \sum_{v \in N^+(u)} d^+(u) = \sum_{(u,v) \in E^+} d^+(u). \]

Secondly, due to the orientation of G+, each edge (u, v) ∈ E+ satisfies d+(u) ≤ d(u) = min{d(u), d(v)}. Thus:

\[ \sum_{u \in V} \left(d^+(u)\right)^2 = \sum_{(u,v) \in E^+} d^+(u) \le \sum_{(u,v) \in E^+} \min\{d(u), d(v)\} \le 2 \times \alpha(G) \times m, \]

where the last inequality follows from Lemma 5.1. □
Lemma 5.3. Assume that G+ is obtained from G based on the degree increasing ordering, then:

\[ \sum_{u \in V} d^-(u) \times d^+(u) \le 2 \times \alpha(G) \times m. \]

Proof. Note that:

\[ \sum_{u \in V} d^-(u) \times d^+(u) = \sum_{u \in V} \sum_{v \in N^+(u)} d^-(u) = \sum_{(u,v) \in E^+} d^-(u) \le \sum_{(u,v) \in E^+} \min\{d(u), d(v)\}, \]

where the last inequality follows from the same argument as in Lemma 5.2 (i.e., d−(u) ≤ d(u) = min{d(u), d(v)} for each edge (u, v) ∈ E+). Thus, the lemma holds by Lemma 5.1. □

Lemma 5.4. Assume that G+ is obtained from G based on the smallest-first ordering, then:

\[ \sum_{u \in V} \left(d^+(u)\right)^2 \le (2\alpha(G) - 1) \times m. \]
Proof. Firstly:

\[ \sum_{u \in V} \left(d^+(u)\right)^2 = \sum_{u \in V} d^+(u) \times d^+(u) = \sum_{u \in V} \sum_{v \in N^+(u)} d^+(u) = \sum_{(u,v) \in E^+} d^+(u) \le \sum_{(u,v) \in E^+} \delta(G), \]
where the last inequality follows from Lemma 3.5 in Chapter 3. Secondly, Lemma 3.2 proves that δ(G) < 2α(G); thus, δ(G) ≤ 2α(G) − 1 since both sides of the inequality are integers. Therefore, the lemma holds. □

Lemma 5.5. Assume that G+ is obtained from G based on the smallest-first ordering, then:

\[ \sum_{u \in V} d^-(u) \times d^+(u) \le (2\alpha(G) - 1) \times m. \]

Proof. Note that:

\[ \sum_{u \in V} d^-(u) \times d^+(u) = \sum_{u \in V} \sum_{v \in N^-(u)} d^+(u) = \sum_{(v,u) \in E^+} d^+(u) \le \sum_{(v,u) \in E^+} \delta(G). \]
Thus, the lemma holds. □

Note that, in both the degree increasing ordering and the smallest-first ordering, ∑_{u∈V} (d−(u))² cannot be bounded by O(α(G) × m). For example, for a star graph with one central vertex that is connected to all other n − 1 vertices by an edge, α(G) = 1 and m = n − 1, while ∑_{u∈V} (d−(u))² = Θ(n²).

Analysis of TriE. It is easy to see that the space complexity of TriE is 2m + O(n). Regarding the time complexity, assume that vertices have already been sorted with respect to the total order; then TriE runs in O(∑_{u∈V} d−(u) × d+(u)) time, which is O(α(G) × m) by Lemma 5.3 or Lemma 5.5. Note that the K3 algorithm (Algorithm 1) is actually similar to invoking TriE with the degree decreasing ordering of vertices. When the time of sorting vertices is also taken into consideration, the total time complexity of TriE is still O(α(G) × m), since sorting vertices into either degree decreasing ordering or degree increasing ordering can be conducted in O(n) time and sorting vertices into the smallest-first ordering can be conducted in O(m) time (see Chapter 3).

It is empirically shown in [64] that, when excluding the time of sorting vertices with respect to the total order, invoking TriE with the smallest-first ordering runs the fastest in practice. However, sorting vertices into the smallest-first ordering takes a significant portion of the total running time in practice, whereas the time of sorting vertices into degree decreasing or increasing ordering is negligible. As a result, the strategy of sorting vertices into degree decreasing ordering or degree increasing ordering is usually adopted in practice for triangle enumeration.

Note that, in the above discussions, only Lemma 5.3 and Lemma 5.5 are used in the time complexity analysis; there are also other variants of TriE, based on Lemma 5.2 and Lemma 5.4, that run in O(α(G) × m) time. Please see [64] for the other variants.
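The star-graph example can be checked numerically. The following is a small Python sketch, with the helper name `orientation_sums` chosen here purely for illustration; it orients a graph by the degree increasing ordering and returns the three sums that appear in Lemmas 5.2 and 5.3 and in the counterexample.

```python
def orientation_sums(edges):
    """Return (sum of d+^2, sum of d- * d+, sum of d-^2) under the
    degree increasing ordering, with ties broken by smaller vertex ID."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    d_out = dict.fromkeys(deg, 0)
    d_in = dict.fromkeys(deg, 0)
    for u, v in edges:
        # The endpoint with the smaller (degree, ID) pair becomes the tail.
        if (deg[u], u) > (deg[v], v):
            u, v = v, u
        d_out[u] += 1
        d_in[v] += 1
    return (sum(d * d for d in d_out.values()),
            sum(d_in[u] * d_out[u] for u in deg),
            sum(d * d for d in d_in.values()))

n = 100
star = [(0, leaf) for leaf in range(1, n)]      # vertex 0 is the center
out_sq, in_out, in_sq = orientation_sums(star)  # (99, 0, 9801) for n = 100
```

For the star, every edge is directed from a leaf to the center, so the first two sums stay within 2 × α(G) × m = 2(n − 1), while the third equals (n − 1)².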
5.1.2 k-Clique Enumeration Algorithms

Both of the triangle enumeration algorithms presented in Section 5.1.1, K3 (Algorithm 1) and TriE (Algorithm 2), can be extended to enumerate all k-cliques in a graph for an arbitrary k ≥ 3, as illustrated in the following two subsections.
5.1.2.1 Extending K3 to k-Clique Enumeration

Chiba and Nishizeki [20] proposed an extension of K3, denoted KClique-ChibaN, to enumerate all k-cliques in a graph. The pseudocode of KClique-ChibaN is shown in Algorithm 3. The general idea is that enumerating all k-cliques in a graph G that contain a specific vertex u is equivalent to enumerating all (k − 1)-cliques in the subgraph G[N(u)] of G induced by the set N(u) of neighbors of u in G (Line 10), which can be conducted recursively. The base case is that, when k = 2, all edges of the graph are reported (Lines 3–5). Like the K3 algorithm, KClique-ChibaN also:
• processes vertices in degree decreasing order to achieve the desired time complexity (Lines 7–8), and
• removes the vertex u from the graph after processing it to ensure that each k-clique is enumerated exactly once (Line 11).
Algorithm 3: KClique-ChibaN: enumerate all k-cliques in a graph [20]
Input: An undirected graph G = (V, E), and an integer k
Output: All k-cliques in G
1   C ← ∅;
2   KClique-Enum(G, k, C);

Procedure KClique-Enum(Gk, k, C)
3   if k = 2 then
4       for each edge (u, v) ∈ E(Gk) do
5           Output clique {u, v} ∪ C;
6   else
7       Sort vertices of Gk such that d_{Gk}(v1) ≥ · · · ≥ d_{Gk}(v_{|V(Gk)|});
8       for each vertex u ← v1, . . . , v_{|V(Gk)|} do
9           Gk−1 ← the subgraph of Gk induced by N_{Gk}(u);
10          KClique-Enum(Gk−1, k − 1, C ∪ {u});
11          Remove u from Gk;
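The recursion of Algorithm 3 can be sketched in Python as follows. This sketch materializes each induced subgraph explicitly, which is simple but not space-efficient; the function name `kclique_chiban` and the dictionary-of-sets input format are assumptions made for illustration, and vertex IDs are assumed to be comparable so that each edge is reported once in the base case.

```python
def kclique_chiban(adj, k):
    """Yield every k-clique (k >= 2) of the graph as a frozenset, once each.
    adj maps each vertex to the set of its neighbors."""
    def enum(g, k, c):
        if k == 2:                                   # Lines 3-5
            for u in g:
                for v in g[u]:
                    if u < v:                        # report each edge once
                        yield c | {u, v}
            return
        # Line 7: degree decreasing order, with degrees measured in g.
        for u in sorted(g, key=lambda x: len(g[x]), reverse=True):
            nbrs = g[u]
            sub = {v: g[v] & nbrs for v in nbrs}     # Line 9: induced by N_g(u)
            yield from enum(sub, k - 1, c | {u})     # Line 10
            for v in nbrs:                           # Line 11: remove u from g
                g[v].discard(u)
            del g[u]
    # Work on a copy so that the caller's adjacency sets are left untouched.
    yield from enum({u: set(nb) for u, nb in adj.items()}, k, frozenset())
```

For instance, on the complete graph with five vertices, `list(kclique_chiban(adj, 3))` contains the ten triangles, each exactly once.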
KClique-ChibaN is a recursive algorithm which needs to access a different graph Gk for each invocation of the procedure KClique-Enum. The naive approach of explicitly storing Gk for each invocation of KClique-Enum would result in a space blowup. In view of this, Chiba and Nishizeki [20] propose to implicitly represent Gk within the graph representation of the original input graph G, by noting the fact
that Gk is a vertex-induced subgraph of G. Specifically, G is represented by doubly linked adjacency lists with mutual references between the two stubs of an edge. The representation maintains the invariant that, when working with the subgraph Gk, the neighbors N_{Gk}(u) of each vertex u ∈ V(Gk) appear consecutively at the beginning of the adjacency list of u. Moreover, all vertices of Gk are assigned the label “k”, so that a scan retrieving N_{Gk}(u) from N(u) can stop once the label of the next vertex in N(u) is not “k”. As a result, the subgraph Gk can be accessed on-the-fly from G in O(|V(Gk)| + |E(Gk)|) time. Details can be found in [20].

The time and space complexities of KClique-ChibaN are bounded in the following theorem, whose proof can be found in [20].

Theorem 5.1 ([20]). KClique-ChibaN enumerates all k-cliques in a graph G in O(k × (α(G))^{k−2} × m) time and O(n + m) space.
5.1.2.2 Extending TriE to k-Clique Enumeration

Recently, an extension of TriE to enumerate all k-cliques in a graph was proposed in [25]; that is, it enumerates the k-cliques of G by working on the oriented version G+ of G. Denote the algorithm by KClique-Oriented. Its pseudocode is shown in Algorithm 4, where the degeneracy ordering is used to orient the input graph G. It is easy to see that KClique-Oriented integrates the ideas of TriE into KClique-ChibaN. Note that, by using the same graph representation techniques as presented for KClique-ChibaN, the subgraph G+_{k−1} can be implicitly represented within the representation of the graph G+ such that G+_{k−1} can be accessed on-the-fly in O(|V(G+_{k−1})| + |E(G+_{k−1})|) time.

Algorithm 4: KClique-Oriented: enumerate all k-cliques in a graph [25]
Input: An undirected graph G = (V, E), and an integer k
Output: All k-cliques in G
1   Compute the degeneracy ordering of V;
2   Construct the oriented graph G+ = (V, E+) of G with respect to the degeneracy ordering;
3   C ← ∅;
4   KClique-EnumO(G+, k, C);

Procedure KClique-EnumO(G+_k, k, C)
5   if k = 2 then
6       for each edge (u, v) ∈ E(G+_k) do
7           Output clique {u, v} ∪ C;
8   else
9       for each vertex u ∈ V(G+_k) do
10          G+_{k−1} ← the subgraph of G+_k induced by N+_{G+_k}(u);
11          KClique-EnumO(G+_{k−1}, k − 1, C ∪ {u});
The time and space complexities of KClique-Oriented are bounded in the following theorem, whose proof can be found in [25].

Theorem 5.2 ([25]). KClique-Oriented enumerates all k-cliques in a graph G in O(k × (δ(G)/2)^{k−2} × m) time and O(n + m) space.

Note that it is proved in Lemma 3.2 that α(G) ≤ δ(G) < 2α(G). Thus, the time complexity of KClique-Oriented is similar to that of KClique-ChibaN. Nevertheless, it is shown in [25] that KClique-Oriented runs faster than KClique-ChibaN in practice.
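Algorithm 4 can be sketched in Python along the same lines. The sketch below is illustrative only and is not the implementation of [25]: the names `degeneracy_order` and `kclique_oriented` are assumptions, the smallest-first ordering is computed by a simple quadratic-time selection rather than the linear-time method of Chapter 3, and induced subgraphs are materialized explicitly.

```python
def degeneracy_order(adj):
    """Smallest-first ordering: repeatedly remove a vertex of minimum degree."""
    deg = {u: len(adj[u]) for u in adj}
    removed, order = set(), []
    while len(order) < len(adj):
        u = min((x for x in adj if x not in removed), key=lambda x: deg[x])
        order.append(u)
        removed.add(u)
        for v in adj[u]:
            if v not in removed:
                deg[v] -= 1
    return order

def kclique_oriented(adj, k):
    """Yield every k-clique (k >= 2) as a frozenset, once each."""
    rank = {u: i for i, u in enumerate(degeneracy_order(adj))}       # Line 1
    out = {u: {v for v in adj[u] if rank[u] < rank[v]} for u in adj}  # Line 2

    def enum(g, k, c):
        if k == 2:                                   # Lines 5-7
            for u in g:
                for v in g[u]:                       # directed edge (u, v)
                    yield c | {u, v}
            return
        for u in g:                                  # Line 9
            nbrs = g[u]
            sub = {v: g[v] & nbrs for v in nbrs}     # Line 10: induced by N+(u)
            yield from enum(sub, k - 1, c | {u})     # Line 11

    yield from enum(out, k, frozenset())
```

Unlike the sketch of KClique-ChibaN, no vertex needs to be removed after it is processed, since the orientation alone guarantees that each clique is found only from its first vertex in the ordering.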
5.2 Higher-Order Core Decomposition

A k-core ensures that every vertex in it participates in at least k edges within the k-core; that is, it considers the involvement between the structures of vertices and edges. This section investigates generalizations of k-core, and correspondingly of core decomposition, obtained by using higher-order structures (specifically, k-cliques) to replace the vertices and edges in the definition of k-core. In the following, Section 5.2.1 presents truss decomposition and Section 5.2.2 presents nucleus decomposition.
5.2.1 Truss Decomposition

An immediate generalization of k-core is k-truss, which considers the involvement between the structures of edges and triangles; that is, each edge in a k-truss participates in at least k triangles within the k-truss. The k-truss was first introduced in [22], based on the observation that triangles play an important role in social cohesion.

Definition 5.1. Given a graph g, the support of an edge (u, v) ∈ E(g), denoted supp_g(u, v), is the number of triangles in g that contain the edge (u, v); that is:
supp_g(u, v) = |{w ∈ V(g) | △(u, v, w) ∈ g}|,
where △(u, v, w) is a triangle consisting of vertices u, v, and w.

Definition 5.2. Given a graph G and an integer k ≥ 2, the k-truss of G is the maximal subgraph g of G such that for any edge (u, v) ∈ E(g), the support of (u, v) in g is at least k (i.e., supp_g(u, v) ≥ k).

Note that, for presentation simplicity, Definition 5.2 requires the support of every edge to be at least k, while the original definition in [22] requires this to be k − 2. It is easy to see that a k-truss is not a vertex-induced subgraph but an edge-induced subgraph.

For example, consider the graph G in Figure 5.2(a). The entire graph is a 1-truss since every edge in G is involved in at least one triangle. The 2-truss of G is shown in Figure 5.2(b), where three vertices and six edges are removed from the
1-truss. The 3-truss of G is shown in Figure 5.2(c), where two edges, (v2, v5) and (v2, v6), are further removed from the 2-truss.

[Figure 5.2 omitted: panels (a) 1-truss, (b) 2-truss, (c) 3-truss, and (d) Truss Decomposition; the legend of panel (d) distinguishes 3-truss edges, 2-truss edges, and 1-truss edges.]

Fig. 5.2: Truss decomposition

Properties of k-Truss. k-truss has the following properties as proved in [22]:
• Each k-truss of a graph G is a subgraph of the (k − 1)-truss of G; for example, in Figure 5.2, the 3-truss is a subgraph of the 2-truss which in turn is a subgraph of the 1-truss.
• Each k-truss of a graph G is a subgraph of the (k + 1)-core of G.
• The edge connectivity of a k-truss is at least k + 1 (see Chapter 6 for the definition of edge connectivity).
• The diameter of a k-truss g is at most (2 × |V(g)| − 2)/(k + 2).

Truss Decomposition. Following the properties of k-truss, it can be verified that the edges of the k-trusses of a graph for all possible k values form a hierarchical structure, called truss hierarchy, similar to the core hierarchy of vertices presented in Section 3.2.3. One important step of constructing the truss hierarchy is computing the truss number for each edge, defined as follows.

Definition 5.3. Given a graph G, the truss number of an edge (u, v) in G, denoted truss(u, v), is the largest k such that the k-truss of G contains (u, v), that is:
truss(u, v) = max{k | the k-truss of G contains (u, v)}.
The problem of truss decomposition is to compute the truss number for every edge in a graph. For example, Figure 5.2(d) shows the truss decomposition of the graph in Figure 5.2(a), where the truss numbers of edges are indicated by the different shapes of edges.
5.2.1.1 The Peeling Algorithm for Truss Decomposition

The truss decomposition of a graph G can be computed by running the Peel algorithm (Algorithm 1 in Chapter 3) on a hyper-graph 𝒢 that is constructed from G. Specifically, each edge of G is a vertex in the hyper-graph 𝒢, and each triangle in G is a hyper-edge in 𝒢. Then, it can be verified that the k-core of 𝒢 corresponds to the k-truss of G, for all possible k values.
Algorithm 5: PeelTruss: compute truss numbers of all edges [92]
Input: A graph G = (V, E)
Output: truss(u, v) for each edge (u, v) ∈ E
1   for each edge (u, v) ∈ E do
2       Compute the number t(u, v) of triangles in G containing (u, v);
3       supp(u, v) ← t(u, v);
4   max_truss ← 0; seq ← ∅;
5   for i ← 1 to m do
6       (u, v) ← arg min_{(u′,v′)∈E\seq} supp(u′, v′);
7       Add (u, v) to the tail of seq;
8       if supp(u, v) > max_truss then max_truss ← supp(u, v);
9       truss(u, v) ← max_truss;
10      for each triangle △(u, v, w) that contains edge (u, v) do
11          supp(v, w) ← supp(v, w) − 1;
12          supp(u, w) ← supp(u, w) − 1;
13      Remove edge (u, v) from the graph;
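A compact Python sketch of PeelTruss is given below. It is meant only to illustrate the control flow: the name `peel_truss` and the canonical edge key are assumptions, and the minimum-support edge is found by a linear scan instead of the linear-heap data structures of Chapter 2, so the sketch does not achieve the O(α(G) × m) bound.

```python
def peel_truss(edges):
    """Return (truss, seq): truss numbers of all edges and the peeling sequence."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def key(u, v):                       # one shared record for (u, v) and (v, u)
        return (u, v) if u < v else (v, u)

    # Lines 1-3: the support of an edge is its number of common neighbors.
    supp = {(u, v): len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v}
    truss, seq, max_truss = {}, [], 0
    remaining = set(supp)
    while remaining:                                    # Line 5
        e = min(remaining, key=supp.__getitem__)        # Line 6 (linear scan)
        u, v = e
        remaining.remove(e)
        seq.append(e)                                   # Line 7
        max_truss = max(max_truss, supp[e])             # Line 8
        truss[e] = max_truss                            # Line 9
        for w in adj[u] & adj[v]:                       # Line 10: live triangles
            supp[key(v, w)] -= 1                        # Line 11
            supp[key(u, w)] -= 1                        # Line 12
        adj[u].discard(v)                               # Line 13
        adj[v].discard(u)
    return truss, seq
```

Because removed edges are deleted from the adjacency sets, the intersection at Line 10 only sees triangles whose three edges are all still present.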
Without explicitly constructing the hyper-graph, the Peel algorithm can also be extended to compute the truss decomposition of a graph, as shown in Algorithm 5, where the extended algorithm for truss decomposition is denoted by PeelTruss. Initially, the support of each edge is set as the number of triangles in the graph containing this edge (Lines 1–3). Then, the edge (u, v) with the smallest support is iteratively removed from the graph (Lines 6–13), as follows:
1. The edge (u, v) is added to the tail of a sequence seq of edges (Line 7), which later can be used to construct the truss hierarchy.
2. The truss number of (u, v) is set in the same fashion as setting core numbers in the Peel algorithm (Lines 8–9).
3. The edge (u, v) and its associated triangles are removed from the graph (Lines 10–13); specifically, the support of any other edge that forms a triangle with (u, v) is decreased by one.

Note that, in PeelTruss, the support supp(u, v) of (u, v) is assumed to be equal to and linked with the support supp(v, u) of (v, u); that is, updating either one of supp(u, v) and supp(v, u) also automatically updates the other.

Analysis and Implementation Details of PeelTruss. The correctness of PeelTruss directly follows from the fact that each k-truss is the subgraph of the (k − 1)-truss obtained by iteratively removing the edges whose supports are smaller than k. The time complexity of PeelTruss is mainly dominated by Line 2, Line 6, and Line 10, which can be bounded as follows:
• Line 2 and Line 10 are related to enumerating triangles containing an edge (u, v), which can be implemented in O(min{d(u), d(v)}) time. Specifically, a hash table H is constructed to store all edges of G during the initialization of the algorithm. Without loss of generality, assume that d(u) ≤ d(v). Then, for every neighbor w ∈ N(u), the algorithm looks up the hash table H to check whether there is an edge between v and w; there is a triangle △(u, v, w) in G if and only if (v, w) is in H.
• Line 6 iteratively retrieves the edge with the smallest support, where the support of an edge may decrease after edge removal at Line 13. This can be achieved in amortized constant time by using a data structure similar to ListLinearHeap or ArrayLinearHeap presented in Chapter 2. That is, the data structure here stores m elements, each corresponding to an edge of G.

As a result, the time complexity of PeelTruss is O(α(G) × m), by noting that ∑_{(u,v)∈E} min{d(u), d(v)} ≤ 2 × α(G) × m (see Lemma 5.1).

Remarks. PeelTruss computes the truss number for all edges in a graph. It can be verified that the k-truss of a graph G is exactly the subgraph of G induced by the set of edges whose truss numbers are at least k.
Thus, PeelTruss can also be used to compute the k-truss of a graph. Nevertheless, computing k-truss for a specific k can be simpler. Specifically, to compute the k-truss of a graph, any edge with support smaller than k, rather than the one with the minimum support, can be used at Line 6 of Algorithm 5; as a result, any queue data structure can be used to maintain the set of all edges whose supports are smaller than k. On the other hand, if the truss hierarchy of a graph is needed, then a postprocessing of the sequence seq of edges computed by PeelTruss can construct the truss hierarchy in O(m) time in a similar way to Algorithm 3 by considering edges rather than vertices; we omit the details. Note that the notion of triangle-connected k-truss is also used in the literature (e.g., see [41]), which essentially further decomposes each connected k-truss into triangle-connected components; specifically, two edges are triangle-connected in a k-truss, if there is a triangle in the k-truss that contains both edges.
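The simpler computation of the k-truss for one specific k, as remarked above, can be sketched as follows. This Python sketch is illustrative and its names (`k_truss`, `queued`) are assumptions; it keeps a plain FIFO queue of edges whose supports have dropped below k and removes them in arbitrary order.

```python
from collections import deque

def k_truss(edges, k):
    """Return the edge set of the k-truss (every edge has support >= k)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def key(u, v):
        return (u, v) if u < v else (v, u)

    supp = {(u, v): len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v}
    queue = deque(e for e in supp if supp[e] < k)   # edges that must be removed
    queued = set(queue)
    while queue:
        u, v = queue.popleft()
        for w in adj[u] & adj[v]:                   # triangles destroyed with (u, v)
            for e in (key(u, w), key(v, w)):
                supp[e] -= 1
                if supp[e] < k and e not in queued:
                    queued.add(e)
                    queue.append(e)
        adj[u].discard(v)
        adj[v].discard(u)
    return {(u, v) for u in adj for v in adj[u] if u < v}
```

No priority structure is needed here because the order in which under-supported edges are removed does not change the final subgraph.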
5.2.2 Nucleus Decomposition

The notion of truss decomposition is further generalized to nucleus decomposition in [78], based on general k-cliques for k ≥ 4. Note that a 1-clique is a vertex, a 2-clique is an edge, and a 3-clique is a triangle. Thus:
• in a k-core, each vertex (i.e., 1-clique) is contained in at least k edges (i.e., 2-cliques), and
• in a k-truss, each edge (i.e., 2-clique) is contained in at least k triangles (i.e., 3-cliques).

Consequently, one generalization would be that each r-clique is contained in at least k s-cliques, for 1 ≤ r < s.

Definition 5.4. Given a graph g, two positive integers 1 ≤ r < s, and an r-clique C^r in g, the s-clique support of C^r in g, denoted supp^s_g(C^r), is the number of s-cliques in g containing C^r.

Definition 5.5. Given a graph g, two positive integers 1 ≤ r < s, and two r-cliques C^r_1 and C^r_2, C^r_1 and C^r_2 are s-clique-connected if there exists a sequence of r-cliques in g, C^r_1 = C_1, C_2, . . . , C_k = C^r_2, such that for each 1 ≤ i < k, there exists an s-clique in g containing both C_i and C_{i+1}.

Definition 5.6. Given a graph G and three positive integers, k, r, and s with r < s, a k-(r, s)-nucleus of G is a maximal union g of r-cliques in G such that:
• for any r-clique C^r in g, its s-clique support in g is at least k (i.e., supp^s_g(C^r) ≥ k), and
• every pair of r-cliques in g are s-clique-connected in g.

It is easy to see that a k-(1, 2)-nucleus is equivalent to a maximal connected k-core, and a k-(2, 3)-nucleus is equivalent to a maximal triangle-connected k-truss.
[Figure 5.3 omitted: panel (a) 2-(2,3)-nucleus and panel (b) 2-(2,4)-nucleus, both on vertices v1 to v9.]

Fig. 5.3: Nucleus decomposition [78]
Example 5.1. The graph in Figure 5.3(a) is a 2-(2, 3)-nucleus, in which each edge (2-clique) is contained in at least two triangles (3-cliques). The graph in Figure 5.3(b) is a 2-(2, 4)-nucleus, in which each edge (2-clique) is contained in at least two 4-cliques.

Nucleus Decomposition. It is easy to see that:
• each k-(r, s)-nucleus is a subgraph of a (k − 1)-(r, s)-nucleus, and
• the sets of r-cliques in two different k-(r, s)-nuclei are disjoint.

As a result, given r and s, a hierarchical structure, called nucleus hierarchy, can be constructed for the set of k-(r, s)-nuclei for all possible k values [76]. However, rather than constructing this hierarchy, which involves many complicated details [76], we focus on describing the algorithm for computing the nucleus number for all r-cliques in a graph.

Definition 5.7. Given a graph G and two positive integers r and s with r < s, the nucleus number of an r-clique C^r in G, denoted nucleus^s(C^r), is the largest k such that a k-(r, s)-nucleus of G contains C^r.
5.2.2.1 The Peeling Algorithm for Nucleus Decomposition

The nucleus numbers of all r-cliques in a graph can also be computed by running the Peel algorithm on a hyper-graph constructed from G, such that each hyper-vertex corresponds to an r-clique of G, and each hyper-edge corresponds to an s-clique of G and contains all the r-cliques in it. It is also useful to view the hyper-graph as a bipartite graph, such that each r-clique and each s-clique correspond to a vertex in the bipartite graph, and there is an edge between an r-clique and an s-clique if the r-clique is a subgraph of the s-clique; for example, Figure 5.4 shows such a bipartite graph. Then, the algorithm iteratively peels (i.e., removes) the r-clique with the minimum number of adjacent edges in the bipartite graph; moreover, when an r-clique is removed, all the s-cliques that are adjacent to the r-clique are also removed.
[Figure 5.4 omitted: a bipartite graph with r-clique 1, ..., r-clique x on one side and s-clique 1, ..., s-clique y on the other side.]

Fig. 5.4: Bipartite graph for nucleus decomposition
The pseudocode for computing the nucleus numbers of all r-cliques in a graph is shown in Algorithm 6, denoted PeelNucleus. It first generates the set of all r-cliques and the set of all s-cliques in G (Lines 1–2), denoted by 𝒞_r and 𝒞_s, respectively; this essentially builds the bipartite graph in Figure 5.4, since the edges of the bipartite graph are implied by containment. For each r-clique C^r, its support supp^s(C^r) is the number of s-cliques containing C^r (Lines 3–4), that is, the number of edges adjacent to C^r in the bipartite graph.
Algorithm 6: PeelNucleus: compute nucleus numbers of all r-cliques [78]
Input: A graph G = (V, E), and integers r and s with r < s
Output: nucleus^s(C^r) for each r-clique C^r in G
1   𝒞_r ← the set of all r-cliques in G;
2   𝒞_s ← the set of all s-cliques in G;
3   for each C^r ∈ 𝒞_r do
4       supp^s(C^r) ← |{C^s ∈ 𝒞_s | C^r ⊂ C^s}|;
5   max_nucleus ← 0;
6   while 𝒞_r ≠ ∅ do
7       C^r ← the r-clique in 𝒞_r with the minimum support;
8       if supp^s(C^r) > max_nucleus then max_nucleus ← supp^s(C^r);
9       nucleus^s(C^r) ← max_nucleus;
10      𝒞_r ← 𝒞_r \ {C^r};
11      for each s-clique C^s ∈ 𝒞_s containing C^r do
12          for each r-clique C^r_1 ∈ 𝒞_r contained in C^s do
13              supp^s(C^r_1) ← supp^s(C^r_1) − 1;
Then, the peeling process iteratively removes from the graph the r-clique C^r with the minimum support (Lines 7–13). The nucleus number of C^r is set accordingly (Lines 8–9). Note that when C^r is removed from the graph, all s-cliques that contain C^r are also destroyed (Lines 11–13).

Time Complexity of PeelNucleus. Recall that the time complexity of generating all s-cliques in a graph is O(s × (α(G))^{s−2} × m) (see Section 5.1.2). Let n_r and n_s be the number of r-cliques and the number of s-cliques in a graph G, respectively. The time and space complexities of PeelNucleus are bounded in the following theorem, whose proof can be found in [78].

Theorem 5.3. The time complexity of PeelNucleus is O(s × (α(G))^{s−2} × m) and the space complexity of PeelNucleus is O(n_r + n_s).

Note that the space complexity of PeelNucleus is usually dominated by n_s, since generally n_s ≫ n_r for small r and s with r < s. The space complexity of PeelNucleus can be reduced by generating s-cliques on-the-fly (at Lines 4 and 11) rather than materializing the s-cliques [76]. In addition, the nucleus hierarchy can also be constructed by bookkeeping additional information during the peeling process [76].
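The peeling process of Algorithm 6 can be sketched in Python for small graphs as follows. This is an illustrative sketch only: the names `cliques_of_size` and `peel_nucleus` are assumptions, cliques are generated by brute force over vertex subsets (not by the algorithms of Section 5.1.2), and the minimum-support r-clique is found by a linear scan. Destroyed s-cliques are explicitly dropped from the bookkeeping so that no support is decremented twice for the same s-clique.

```python
from itertools import combinations

def cliques_of_size(adj, size):
    """Brute force: all vertex subsets of the given size that are pairwise adjacent."""
    return [frozenset(c) for c in combinations(sorted(adj), size)
            if all(b in adj[a] for a, b in combinations(c, 2))]

def peel_nucleus(adj, r, s):
    """Return the nucleus number of every r-clique (as a frozenset)."""
    r_cliques = cliques_of_size(adj, r)                  # Line 1
    s_cliques = cliques_of_size(adj, s)                  # Line 2
    containing = {c: set() for c in r_cliques}           # bipartite graph edges
    for q in s_cliques:
        for c in combinations(sorted(q), r):
            containing[frozenset(c)].add(q)
    supp = {c: len(qs) for c, qs in containing.items()}  # Lines 3-4
    nucleus, max_nucleus = {}, 0
    alive = set(r_cliques)
    while alive:                                         # Line 6
        c = min(alive, key=supp.__getitem__)             # Line 7 (linear scan)
        max_nucleus = max(max_nucleus, supp[c])          # Line 8
        nucleus[c] = max_nucleus                         # Line 9
        alive.remove(c)                                  # Line 10
        for q in list(containing[c]):                    # Line 11
            for c1 in combinations(sorted(q), r):        # Line 12
                c1 = frozenset(c1)
                containing[c1].discard(q)                # the s-clique q is destroyed
                if c1 in alive:
                    supp[c1] -= 1                        # Line 13
    return nucleus
```

With r = 1 and s = 2 the result coincides with core numbers, and with r = 2 and s = 3 it coincides with the truss numbers of Definition 5.3.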
5.3 Higher-Order Densest Subgraph Computation

This section investigates the generalization of edge densest subgraph to k-clique densest subgraph.

Definition 5.8 ([89]). The k-clique density of a graph g for k ≥ 2, denoted ρk(g), is

\[ \rho_k(g) = \frac{c_k(g)}{|V(g)|}, \]

where ck(g) denotes the number of k-cliques in g.

Note that the edge density studied in Chapter 4 is simply ρ2(g), since each edge is a 2-clique.

Definition 5.9 ([89]). Given a graph G, a subgraph g of G is a k-clique densest subgraph of G for a given k if ρk(g) ≥ ρk(g′) for all subgraphs g′ of G.

Similar to the edge densest subgraph, it is easy to see that any maximal k-clique densest subgraph of G is a vertex-induced subgraph of G. Thus, in the following, we simply use a vertex subset S to denote the subgraph G[S] of G induced by S and use ck(S) to denote the number of k-cliques in G[S].

k-Clique Densest Subgraph Computation. Given an input graph G, the problem of k-clique densest subgraph computation (CDS) is to compute a k-clique densest subgraph S∗ of G.
5.3.1 Approximation Algorithms

Both of the approximation algorithms presented in Section 4.2 have been extended in [89] to approximately compute the k-clique densest subgraph with approximation guarantees. For presentation simplicity, we here only discuss the extension of Algorithm 1 to computing a k-approximate solution for CDS; extending Algorithm 2 to obtain a k(1 + ε)-approximation and streaming algorithm follows similar ideas.

The general idea is to first run PeelNucleus (Algorithm 6) with r = 1 and s = k to compute a peeling sequence of V; denote the peeling sequence as seq. Then, the suffix with the maximum k-clique density among all n suffixes of seq is a k-approximation of the k-clique densest subgraph S∗. The pseudocode is shown in Algorithm 7, where fS(u) denotes the number of k-cliques in S containing u, and the variable ck(S) maintained by the algorithm is k times the number of k-cliques in S. This constant factor scales the density of every suffix equally and hence does not affect which suffix is selected.

Analysis of CDS-Greedy. To prove the approximation ratio of CDS-Greedy, a lemma similar to Lemma 4.2 is first proved.
Algorithm 7: CDS-Greedy: a k-approximation algorithm for k-clique densest subgraph [89]
Input: A graph G = (V, E)
Output: A k-clique dense subgraph S̃
1   S̃ ← ∅;
2   S ← V; ck(S) ← 0;
3   for each vertex u ∈ V do
4       Compute the number fS(u) of k-cliques in G containing u;
5       ck(S) ← ck(S) + fS(u);
6   while S ≠ ∅ do
7       ρk(S) ← ck(S)/|S|;
8       if S̃ = ∅ or ρk(S) > ρk(S̃) then
9           S̃ ← S;
10      Obtain the vertex u in S that has the smallest fS(u) value;
11      S ← S \ {u};
12      for each (k − 1)-clique C in S ∩ N(u) do
13          for each vertex v ∈ C do
14              fS(v) ← fS(v) − 1;
15          ck(S) ← ck(S) − k;
16  return S̃;

Lemma 5.6. Let S∗ be the k-clique densest subgraph of G, then:

\[ \min_{u \in S^*} f_{S^*}(u) \ge \rho_k(S^*). \]

Proof. As S∗ is the k-clique densest subgraph of G, any u ∈ S∗ satisfies

\[ \rho_k(S^*) \ge \rho_k(S^* \setminus \{u\}) \iff \frac{c_k(S^*)}{|S^*|} \ge \frac{c_k(S^*) - f_{S^*}(u)}{|S^*| - 1} \iff f_{S^*}(u) \ge \frac{c_k(S^*)}{|S^*|} = \rho_k(S^*). \]

Thus, the lemma holds. □

The approximation ratio of CDS-Greedy is proved by the following theorem.

Theorem 5.4 ([89]). CDS-Greedy is a k-approximation algorithm. That is:

\[ \rho_k(\tilde{S}) \ge \frac{\rho_k(S^*)}{k}, \]

where S̃ is the subgraph returned by CDS-Greedy and S∗ is the k-clique densest subgraph.
Proof. Let v be the first vertex of S∗ in the peeling sequence seq, and let Sv be the suffix of seq starting at v. Then, S∗ ⊆ Sv and f_{Sv}(v) ≥ f_{S∗}(v). As a result:

\[ \rho_k(S_v) = \frac{c_k(S_v)}{|S_v|} = \frac{\sum_{u \in S_v} f_{S_v}(u)}{k \times |S_v|} \ge \frac{\sum_{u \in S_v} f_{S_v}(v)}{k \times |S_v|} \ge \frac{\sum_{u \in S_v} f_{S^*}(v)}{k \times |S_v|} \ge \frac{\sum_{u \in S_v} \rho_k(S^*)}{k \times |S_v|} = \frac{\rho_k(S^*)}{k}, \]

where the first inequality follows from the fact that v is the first vertex removed from Sv (i.e., v has the smallest f_{Sv} value in Sv), the second inequality follows from S∗ ⊆ Sv, and the third inequality follows from Lemma 5.6. As ρk(S̃) ≥ ρk(Sv), the theorem holds. □
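The greedy peeling of Algorithm 7 can be sketched in Python for small graphs as follows. The sketch is illustrative and makes several simplifying assumptions: the name `cds_greedy` is hypothetical, k-cliques are enumerated by brute force, the minimum is found by a linear scan, and the code tracks the true number of k-cliques (rather than k times that number), so the returned density is ρk as in Definition 5.8.

```python
from itertools import combinations

def cds_greedy(adj, k):
    """Greedy peeling; returns (vertex set, k-clique density) of the best suffix."""
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    member = {u: set() for u in adj}     # member[u]: surviving k-cliques through u
    for q in cliques:
        for u in q:
            member[u].add(q)
    S = set(adj)
    count = len(cliques)                 # number of k-cliques in G[S]
    best, best_density = None, -1.0
    while S:                                             # Line 6
        density = count / len(S)                         # Line 7
        if density > best_density:                       # Lines 8-9
            best, best_density = set(S), density
        u = min(S, key=lambda x: len(member[x]))         # Line 10: smallest f_S(u)
        S.remove(u)                                      # Line 11
        for q in list(member[u]):                        # Lines 12-15
            for v in q:
                member[v].discard(q)                     # f_S(v) decreases by one
            count -= 1
    return best, best_density
```

Here `len(member[u])` plays the role of fS(u), and every k-clique through the removed vertex corresponds to one (k − 1)-clique in S ∩ N(u) at Line 12.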
5.3.2 Exact Algorithms

The k-clique densest subgraph problem can also be solved exactly by computing minimum cuts in a series of carefully constructed augmented graphs Gx, in a similar fashion to Goldberg's algorithm in [35]. However, the construction of the augmented graph Gx is different here. Two formulations of the augmented graph are proposed in [89] and [60], respectively, where the latter is simpler. Thus, we only present the formulation of [60] here.

Density Testing. Let ρ∗k be the k-clique density of the k-clique densest subgraph of G. To test whether ρ∗k is larger than a positive real number x, an augmented and weighted directed graph Gx = (V(Gx), E(Gx), w) is constructed from G = (V, E) as follows [60]:
• V(Gx) = {s, t} ∪ V ∪ 𝒞_{k−1}(G), where 𝒞_{k−1}(G) denotes the set of all (k − 1)-cliques in G.
• For each vertex u ∈ V, there is a directed edge of weight fG(u) from s to u, where fG(u) is the number of k-cliques in G containing u.
• For each vertex u ∈ V, there is a directed edge of weight k × x from u to t.
• For each vertex u ∈ V and each (k − 1)-clique C ∈ 𝒞_{k−1}(G) such that u and C form a k-clique, there is a directed edge of weight 1 from u to C.
• For each (k − 1)-clique C ∈ 𝒞_{k−1}(G), there is a directed edge of weight ∞ from C to each vertex u ∈ C.

Similar to Theorem 4.5, the following theorem is proved in [60]. Here, the value of a cut (A, B) in a directed graph is the total weight of all edges from vertices of A to vertices of B.

Theorem 5.5 ([60]). For any positive x, ρ∗k > x if and only if the minimum s–t cut in Gx is of value smaller than k × ck(G).

Proof. Consider any minimum s–t cut (A, B) in Gx. Let A1 = A ∩ V, A2 = A ∩ 𝒞_{k−1}(G), B1 = B ∩ V, and B2 = B ∩ 𝒞_{k−1}(G). Then, A1, A2, {s} form a disjoint
partition of A, and B1 , B2 , {t} form a disjoint partition of B. Let ckj be the number of k-cliques in G that have exactly j vertices in A1 , for 1 ≤ j ≤ k. Then, the value of the cut (A, B) is
∑
u∈B1
k−1
fG (u) + |A1 | × k × x + ∑ j × ckj j=1
where the first part $\sum_{u \in B_1} f_G(u)$ is the total weight of the edges from s to B1, the second part |A1| × k × x is the total weight of the edges from A1 to t, and the third part $\sum_{j=1}^{k-1} j \times c_k^j$ is the total weight of the edges from A1 to B2. Note that there is no edge from A2 to B1 since the weight of any such edge would be ∞, while the value of the minimum s–t cut must be at most ∑u∈V fG(u) = k × ck(G) (i.e., the trivial cut with {s} on one side). Moreover, as $c_k^k = c_k(A_1)$ and
$$\sum_{j=1}^{k} j \times c_k^j = \sum_{u \in A_1} f_G(u) = k \times c_k(G) - \sum_{u \in B_1} f_G(u),$$
the value of the cut (A, B) can be rewritten as k × ck(G) + |A1| × k × x − k × ck(A1). This value is smaller than k × ck(G) if and only if ck(A1)/|A1| > x, that is, if and only if A1 induces a subgraph whose k-clique density is larger than x. Thus, the theorem holds. □
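The construction of Gx and the statement of Theorem 5.5 can be checked numerically on a toy input. The following is a minimal sketch, not the implementation of [60]: the function names (`k_cliques`, `augmented_graph`, `min_cut_value`) are hypothetical, cliques are found by brute force, and the minimum s–t cut is obtained by enumerating every possible source side, which is only feasible for very small graphs. It assumes integer vertex identifiers and k ≥ 2.

```python
from fractions import Fraction
from itertools import combinations

INF = float("inf")

def k_cliques(adj, k):
    """All k-cliques of the graph as sorted tuples (brute force, tiny graphs only)."""
    return [c for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]

def augmented_graph(adj, k, x):
    """Edge weights of G_x; each (k-1)-clique is represented by a tuple of vertices."""
    big, small = k_cliques(adj, k), k_cliques(adj, k - 1)
    cap = {}
    for u in adj:
        cap[("s", u)] = sum(u in c for c in big)   # f_G(u)
        cap[(u, "t")] = k * x
    for c in small:
        for u in adj:
            if u not in c and all(u in adj[w] for w in c):
                cap[(u, c)] = 1                     # u and c form a k-clique
        for u in c:
            cap[(c, u)] = INF
    return cap, small, len(big)

def min_cut_value(adj, cap, small):
    """Minimum s-t cut value, enumerating every source side A (exponential time)."""
    inner = list(adj) + small
    best = INF
    for r in range(len(inner) + 1):
        for chosen in combinations(inner, r):
            side = {"s", *chosen}
            value = sum(w for (u, v), w in cap.items()
                        if u in side and v not in side)
            best = min(best, value)
    return best

# Toy input: the complete graph K4 with k = 3; its triangle density is 4/4 = 1.
adj = {v: {w for w in range(4) if w != v} for v in range(4)}
cap_lo, small, ck = augmented_graph(adj, 3, Fraction(9, 10))
cap_hi, _, _ = augmented_graph(adj, 3, Fraction(11, 10))
below = min_cut_value(adj, cap_lo, small)   # x below the optimum density
above = min_cut_value(adj, cap_hi, small)   # x above the optimum density
```

On this input the cut value falls below k × ck(G) = 12 only for the smaller x, in line with the theorem. Exact rationals are used for x so that the comparison with k × ck(G) is not affected by rounding.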
Algorithm 8: CDS-Exact: an exact algorithm for k-clique densest subgraph [60]
Input: A graph G = (V, E)
Output: A k-clique densest subgraph S∗ of G
 1  S∗ ← ∅;
 2  l ← 0; h ← n^k;
 3  Ck−1(G) ← the set of (k − 1)-cliques in G;
 4  while h − l ≥ 1/(n(n−1)) do
 5      x ← (l + h)/2;
 6      Gx ← AugmentedGraph(G, Ck−1(G), x);
 7      (A, B) ← minimum s–t cut in Gx;
 8      if the value of the cut (A, B) is smaller than k × ck(G) then
 9          S∗ ← A ∩ V;
10          l ← x;
11      else
12          h ← x;
13  return S∗;
The Algorithm. Based on [60], the algorithm for computing the k-clique densest subgraph works in a similar fashion to Algorithm 4. The pseudocode is shown in Algorithm 8, where Line 6 constructs the augmented graph Gx as discussed above.
It conducts a binary search of x in a range [l, h] and stops once h − l is smaller than 1/(n(n−1)). Note that ρk∗ must be no smaller than 0 and no larger than n^k, and the k-clique densities of any two subgraphs of G must be either identical or differ by at least 1/(n(n−1)). Thus, CDS-Exact correctly computes the k-clique densest subgraph in a graph G.
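The granularity 1/(n(n−1)) used as the stopping threshold can be justified by a short calculation. The derivation below is a sketch under the assumption n ≥ 2; the symbols a, b, c, d are introduced here for illustration only, with a and c denoting k-clique counts and b and d denoting vertex counts of two subgraphs.

```latex
% Two subgraphs with k-clique densities a/b and c/d, where a, c are
% non-negative integers and 1 <= b, d <= n.
\[
  \frac{a}{b} \neq \frac{c}{d}
  \;\Longrightarrow\;
  \left|\frac{a}{b} - \frac{c}{d}\right|
  = \frac{|ad - bc|}{bd}
  \ge \frac{1}{bd}.
\]
% Case b != d: one of b, d is at most n - 1, so bd <= n(n-1).
\[
  b \neq d \;\Longrightarrow\; \frac{1}{bd} \ge \frac{1}{n(n-1)}.
\]
% Case b = d: the difference is |a - c| / b with |a - c| >= 1.
\[
  b = d \;\Longrightarrow\;
  \left|\frac{a}{b} - \frac{c}{d}\right|
  = \frac{|a - c|}{b} \ge \frac{1}{n} \ge \frac{1}{n(n-1)}.
\]
```

Hence, once the search interval is narrower than 1/(n(n−1)), it can contain at most one achievable density value.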
5.4 Further Readings

Triangle enumeration/listing in other environments is also studied; for example, in an I/O-efficient manner [21, 40, 65, 39], and in a distributed fashion [37, 67, 68, 83, 86, 103]. I/O-efficient truss decomposition is studied in [92]. Truss hierarchy construction for triangle-connected k-truss is studied in [2, 41]. Fast nucleus hierarchy construction is studied in [76]. Sampling techniques for k-clique densest subgraph computation are investigated in [60]. Biclique-based densest subgraph discovery in bipartite graphs is studied in [77].
Chapter 6
Edge Connectivity-Based Graph Decomposition
In this chapter, we study edge connectivity-based graph decomposition; that is, each subgraph satisfies a certain edge connectivity requirement. In Section 6.1, we give the preliminaries. In Section 6.2, we present deterministic approaches to k-edge connected components (i.e., maximal k-edge connected subgraphs) computation for a user-given k, while a randomized approach is presented in Section 6.3. We discuss how to compute k-edge connected components simultaneously for all different k values (i.e., edge connectivity-based graph decomposition) in Section 6.4.
6.1 Preliminaries

k-Edge Connected Component. We first give the definitions that are necessary to define the problems studied in this chapter.
Fig. 6.1: A graph and its 3-edge connected components [17] (vertices v1 to v13; the three marked subgraphs are g1, g2, and g3)
© Springer Nature Switzerland AG 2018 L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0_6
Definition 6.1 ([32]). A graph G is k-edge connected if the remaining graph is still connected after the removal of any (k − 1) edges from G.

A trivial graph consisting of a single vertex is not considered to be k-edge connected. The edge connectivity of a graph is the largest k for which the graph is k-edge connected. For example, the edge connectivity of the graph in Figure 6.1 is 2, while the three subgraphs g1, g2, and g3 are all 3-edge connected.

Definition 6.2 ([17]). Given a graph G, a subgraph g of G is a k-edge connected component of G if: 1) g is k-edge connected, and 2) any proper super-graph of g in G is not k-edge connected.

For example, for the graph in Figure 6.1, g1 is a 3-edge connected component but not a 2-edge connected component, while the entire graph is a 2-edge connected component.

Properties of k-Edge Connected Components. The definition of k-edge connected component has the following properties [17]:
1. A k-edge connected component is a vertex-induced subgraph;
2. k-edge connected components of a graph are disjoint;
3. The set of k-edge connected components of a graph is unique;
4. Each k-edge connected component of a graph G is a subgraph of a (k − 1)-edge connected component of G.

Graph Cut, Global Min-Cut. We also give the definitions of graph cut and global min-cut, which will be used to illustrate the techniques for computing k-edge connected components.

Definition 6.3 ([32]). Given a graph G = (V, E), a cut C = (S, T) is a partition of V into two non-empty, disjoint subsets, i.e., S ∪ T = V and S ∩ T = ∅. We also denote a cut by the set of edges whose end-points are on different sides of the cut, i.e., {(u, v) ∈ E | u ∈ S, v ∈ T}. The value of a cut is the number of edges in the cut, denoted ω(C) = ω(S, T) = |{(u, v) ∈ E | u ∈ S, v ∈ T}|.

Definition 6.4 ([32]). A cut C = (S, T) is called an s–t cut if s and t are on different sides of the cut, and it is a minimum s–t cut if its value is no larger than the value of any other s–t cut.

The edge connectivity between vertices s and t in a graph G, denoted λG(s, t), is the value of a minimum s–t cut in G. Thus, two vertices s and t are k-edge connected in G if and only if λG(s, t) ≥ k. Note that two vertices being k-edge connected in G does not necessarily mean that there is a k-edge connected component of G containing these two vertices.

Definition 6.5 ([32]). The global min-cut of a graph G is the one that has the minimum value among all cuts of G.
Let λ(G) denote the value of the global min-cut of a graph G = (V, E); that is, $\lambda(G) = \min_{s,t \in V,\, s \neq t} \lambda_G(s,t)$. Then, λ(G) equals the edge connectivity of G, and G is k-edge connected if and only if λ(G) ≥ k. For example, C1 = {(v4, v7), (v5, v7), (v5, v12)} and C2 = {(v5, v12), (v9, v11)} are cuts of the graph in Figure 6.1. C1 is a minimum v1–v8 cut and C2 is a global min-cut.
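The definitions of cut value, global min-cut, and k-edge connectivity can be made concrete with a brute-force sketch. It enumerates every cut and is therefore exponential in the number of vertices; it is meant only to illustrate Definitions 6.1, 6.3, and 6.5, not as an algorithm from this chapter. The function names and the example graph (two copies of K4 joined by two edges) are hypothetical, since the edge list of Figure 6.1 is not reproduced here.

```python
from itertools import combinations

def cut_value(edges, S):
    """omega(S, T): number of edges with exactly one end-point in S."""
    return sum((u in S) != (v in S) for u, v in edges)

def global_min_cut(vertices, edges):
    """Return (lambda(G), S) by exhaustive enumeration of all cuts (S, V \\ S)."""
    vertices = list(vertices)
    first, rest = vertices[0], vertices[1:]
    best = None
    # Keep `first` in S so that each cut is enumerated once; r < len(rest)
    # guarantees that the other side T is non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            S = {first, *extra}
            value = cut_value(edges, S)
            if best is None or value < best[0]:
                best = (value, S)
    return best

def is_k_edge_connected(vertices, edges, k):
    """G is k-edge connected iff lambda(G) >= k (single vertices excluded)."""
    vertices = list(vertices)
    if len(vertices) < 2:
        return False
    return global_min_cut(vertices, edges)[0] >= k

# Hypothetical example: two K4's on {0..3} and {4..7}, joined by two edges.
edges = [*combinations(range(4), 2), *combinations(range(4, 8), 2), (3, 4), (2, 5)]
value, side = global_min_cut(range(8), edges)
```

Here the only cut of minimum value separates the two copies of K4, so the example graph is 2-edge connected but not 3-edge connected.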
6.2 Deterministic k-Edge Connected Components Computation

Given a graph G = (V, E) and an integer k, deterministic techniques are proposed in [17, 102] for efficiently computing the set of k-edge connected components of G.
6.2.1 A Graph Partition-Based Framework

A graph partition-based framework, denoted KECC, is developed in [17] for efficiently computing all k-edge connected components of a graph; it captures the cut-based framework in [102] as a special case. KECC recursively partitions a non-k-edge connected subgraph g into several (possibly more than two) subgraphs, by invoking Partition, which removes from g all edges of one or several cuts of value smaller than k. More specifically, initially there is only one connected graph, which is G, and whenever there are connected subgraphs that are not k-edge connected, each of them is partitioned into a set of smaller subgraphs. The algorithm terminates when all connected subgraphs are k-edge connected. The pseudocode of KECC is shown in Algorithm 1, where Partition is a connectivity-aware graph partition procedure that will be introduced in the following two subsections; here, the input graph G is assumed to be connected. Both Gk and the queue Q contain a set of graphs (specifically, vertex-induced subgraphs of G); that is, each element of Gk and Q is a graph.
Algorithm 1: KECC: compute k-edge connected components [17]
Input: A graph G = (V, E) and an integer k
Output: k-edge connected components of G
1  Initialize a queue Q consisting of a single graph G;
2  while Q is not empty do
3      Pop a graph g from Q;
4      Gk ← Partition(g, k);
5      if Gk consists of only one graph g then
6          Output g as a k-edge connected component of G;
7      else
8          Push all graphs of Gk into Q;
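The queue-driven structure of Algorithm 1 can be sketched as follows. This is an illustrative sketch, not the C++ implementation of [17]: `partition_two_way` is a hypothetical brute-force stand-in for Partition that searches all cuts for one of value smaller than k (exponential time), graphs are passed as vertex sets plus an edge list, and two choices are assumptions of this sketch: the queue is seeded with the connected components of the input rather than assuming G is connected, and single-vertex results are discarded because a single vertex is not considered k-edge connected.

```python
from collections import deque
from itertools import combinations

def induced_edges(edges, vs):
    return [(u, v) for u, v in edges if u in vs and v in vs]

def components(vs, edges):
    """Connected components (as frozensets) of the graph (vs, edges)."""
    adj = {v: set() for v in vs}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for root in vs:
        if root in seen:
            continue
        comp, stack = {root}, [root]
        while stack:
            for w in adj[stack.pop()] - comp:
                comp.add(w)
                stack.append(w)
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def partition_two_way(vs, edges, k):
    """Naive stand-in for Partition: find any cut of value < k and remove it."""
    vs = sorted(vs)
    for r in range(1, len(vs)):
        for side in combinations(vs[1:], r - 1):
            S = {vs[0], *side}
            crossing = [(u, v) for u, v in edges if (u in S) != (v in S)]
            if len(crossing) < k:
                kept = [e for e in edges if e not in crossing]
                return components(vs, kept)
    return [frozenset(vs)]          # no small cut: vs cannot be partitioned

def kecc(vertices, edges, k, partition=partition_two_way):
    """Algorithm 1: partition queued subgraphs until none can be split."""
    result = []
    queue = deque(components(set(vertices), edges))
    while queue:
        vs = queue.popleft()
        parts = partition(vs, induced_edges(edges, vs), k)
        if len(parts) == 1:
            if len(vs) > 1:         # a single vertex is not k-edge connected
                result.append(set(vs))
        else:
            queue.extend(parts)
    return result

# Hypothetical example: two K4's joined by two edges, plus a pendant vertex 8.
edges = [*combinations(range(4), 2), *combinations(range(4, 8), 2),
         (3, 4), (2, 5), (7, 8)]
found = kecc(range(9), edges, 3)
```

Any procedure with the same signature can be passed as `partition`, which mirrors how the framework is instantiated with different partition algorithms.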
The set of all graphs that were pushed into the queue Q can be organized into a tree structure as depicted in Figure 6.2, called partition tree, where each ellipse represents a connected graph. The root represents the input graph G, and the children of a graph g represent the set of connected subgraphs obtained by partitioning g (through Partition(g, k)). Therefore, all k-edge connected components of G are represented by leaf nodes of the tree. Algorithm 1 generates all k-edge connected components of G by conducting a Breadth-First Search traversal on the partition tree.
Fig. 6.2: A graph partition tree (root G with children g1, ..., gi, ..., gx; g1 has children g1,1, ..., g1,j, ..., g1,y; gx has children gx,1, ..., gx,z; h marks the height of the tree)

It is easy to see that Algorithm 1 correctly computes the set of k-edge connected components of a graph as long as Partition satisfies the following two properties [17]:
• Atomicity Property. Each of the connected subgraphs returned by Partition(g, k) contains either all vertices of a k-edge connected component of g or none of its vertices, or equivalently, Partition(g, k) only removes edges of cuts of value smaller than k from g.
• Cutability Property. If a graph g is not k-edge connected, then Partition(g, k) partitions it into at least two subgraphs.
A partition procedure satisfying the above two properties is called a connectivity-aware partition. In the literature, connectivity-aware two-way partition and connectivity-aware multiway partition have been proposed.
6.2.2 Connectivity-Aware Two-Way Partition

One natural way to implement Partition(g, k) is to test whether the edge connectivity of g is at least k. If it is, then g itself is a k-edge connected component. Otherwise, a cut C of value smaller than k (i.e., certifying that λ(g) < k) will be obtained, and removing all edges of C from g partitions g into two subgraphs. The pseudocode is shown in Algorithm 2.
Algorithm 2: Partition-TwoWay(g, k)
1  if g is k-edge connected then
2      return {g};
3  else
4      Compute a cut C of value smaller than k;
5      Let g1, g2 be the two connected subgraphs obtained by removing C from g;
6      return {g1, g2};
Example 6.1. To compute the 3-edge connected components of the graph G in Figure 6.1, Partition-TwoWay verifies that G is not 3-edge connected, and C = {(v5, v12), (v9, v11)} is a cut of value 2. Thus, two subgraphs g1 ∪ g2 and g3 are obtained; g3 is a 3-edge connected component, while g1 ∪ g2 is not. In the second iteration, Partition-TwoWay continues to partition g1 ∪ g2. C = {(v4, v7), (v5, v7)} is a cut of value 2, and two subgraphs g1 and g2 are obtained, both of which are 3-edge connected components. Finally, the 3-edge connected components of G are computed as g1, g2, and g3.

There are two approaches to testing the edge connectivity of a graph (or computing the global min-cut of a graph): the maximum adjacency search-based approach and the dominating set-based approach.

Maximum Adjacency Search-Based Partition. An algorithm based on maximum adjacency search is proposed in [85] to compute the global min-cut of a graph. This algorithm can be used to test the edge connectivity of a graph, since a graph is k-edge connected if and only if the value of its global min-cut is at least k.

MAS-Based Global Min-Cut Computation. The main ingredient of the algorithm proposed in [85] is the procedure MAS that computes the minimum s–t cut for a pair of vertices s and t in O(m + n log n) time. However, rather than computing the maximum flow between s and t, MAS computes the minimum s–t cut by Maximum Adjacency Search (also referred to as maximum cardinality search). Essentially, given a graph g, MAS computes two vertices s and t such that the set of edges adjacent to t is a minimum s–t cut in g. It is worth noting that: (1) s and t are determined by MAS rather than being given in the input, and (2) as a result, MAS cannot be used to compute the minimum cut for an arbitrary pair of given vertices.
Algorithm 3: MAS: compute a minimum cut [85]
Input: A graph g = (V(g), E(g))
Output: A minimum s–t cut (S, T)
1  L ← {an arbitrary vertex of V(g)};
2  while |L| ≠ |V(g)| do
3      Let u be the most tightly connected vertex to L, i.e., u = arg max_{v∈V(g)\L} ω(L, v);
4      Add u to the tail of L;
5  Let s and t be the two vertices most recently added to L, where s is added to L prior to t;
6  return the cut (L\{t}, {t}) as a minimum s–t cut;
The pseudocode of MAS is shown in Algorithm 3. It computes an ordering of all vertices of V(g), denoted by a list L, as follows. Initially, L is a singleton list consisting of an arbitrary vertex of V(g) (Line 1). As long as there are vertices of V(g) that are not included in L, the vertex u ∈ V(g)\L that is most tightly connected to L is added to the tail of L (Lines 3–4); that is, u = arg max_{v∈V(g)\L} ω(L, v), where ω(L, v) is the number of edges between v and the vertices of L and is regarded as the weight of vertex v. After all vertices of V(g) have been added to L, let t be the last vertex in L and s be the vertex immediately prior to t in L (Line 5). Then, the set of adjacent edges of t in g is a minimum s–t cut, as proven by the lemma below.

Lemma 6.1 ([85]). In Algorithm 3, let s and t be the two vertices (in that order) most recently added to L; then (L\{t}, {t}) is a minimum s–t cut.

To find the global min-cut of a graph, whenever a minimum s–t cut is found by MAS, s and t are contracted into a super-vertex, and the resulting graph is given as an input to MAS for another run. Let G be the input to the first run of MAS. After n − 1 runs, the resulting graph consists of a single super-vertex, and n − 1 cuts of G will be identified; note that each cut of the resulting graph obtained by contracting vertices of G is also a cut of G. Then, the cut with the minimum value among the n − 1 identified cuts is a global min-cut of G, based on the following rationale. Consider the minimum s1–t1 cut (S1, T1) returned by the first run of MAS that takes G as input. There are two cases: either s1 and t1 are on different sides of the global min-cut of G or they are on the same side. In the former case, (S1, T1) must be a global min-cut of G, while in the latter case, s1 and t1 can be contracted into a super-vertex without affecting the global min-cut of G.
The time complexity of MAS is O(m + n log n) if a Fibonacci heap is used for maintaining the weights ω (L, v) of vertices and for finding the vertex with the largest weight at Line 3 [85]. For unweighted graphs, the linear heap data structure presented in Chapter 2 can be used to replace the Fibonacci heap to reduce the time complexity of MAS to O(m) [17]. Thus, the time complexity of the MAS-based global min-cut computation algorithm is O(n × m).
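The MAS procedure and the contraction-based global min-cut computation described above can be sketched as follows. This is a simplified sketch under stated assumptions: the graph is stored as a dictionary of `Counter` objects so that parallel edges created by contraction are kept as multiplicities, and the most tightly connected vertex is found by a linear scan instead of the Fibonacci heap or linear heap mentioned in the text, so one run costs O(n²) rather than O(m + n log n) or O(m). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mas(adj):
    """One run of maximum adjacency search on a multigraph with >= 2 vertices.

    Returns (s, t, value): the last two vertices of the list L and
    omega(L \\ {t}, t), the value of the minimum s-t cut (L \\ {t}, {t}).
    """
    start = next(iter(adj))
    order = [start]
    weight = {v: adj[start].get(v, 0) for v in adj if v != start}
    value = 0
    while weight:
        u = max(weight, key=weight.get)      # most tightly connected vertex
        value = weight.pop(u)
        order.append(u)
        for w, mult in adj[u].items():
            if w in weight:
                weight[w] += mult
    return order[-2], order[-1], value

def contract(adj, s, t):
    """Merge t into s, keeping parallel edges and dropping self-loops."""
    for w, mult in adj.pop(t).items():
        if w == s:
            continue
        adj[s][w] += mult
        adj[w][s] += mult
        del adj[w][t]
    adj[s].pop(t, None)

def global_min_cut(vertices, edges):
    """Run MAS n - 1 times, contracting s and t after every run."""
    adj = {v: Counter() for v in vertices}
    for u, v in edges:
        adj[u][v] += 1
        adj[v][u] += 1
    members = {v: {v} for v in adj}          # original vertices per super-vertex
    best_value, best_side = None, None
    while len(adj) > 1:
        s, t, value = mas(adj)
        if best_value is None or value < best_value:
            best_value, best_side = value, set(members[t])
        members[s] |= members.pop(t)
        contract(adj, s, t)
    return best_value, best_side

# Hypothetical example: two K4's joined by two edges.
edges = [*combinations(range(4), 2), *combinations(range(4, 8), 2), (3, 4), (2, 5)]
value, side = global_min_cut(range(8), edges)
```

Stopping as soon as a run of `mas` reports a value smaller than k, instead of running to completion, gives the early-terminating variant used for two-way partitioning.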
Algorithm 4: Partition-MAS(g, k) [102]
1  g′ ← g;
2  while g′ contains at least two vertices do
3      Let (S, T) be the minimum s–t cut obtained by MAS(g′);
4      if ω(S, T) < k then return {g[S], g[T]};
5      else Let g′ be the resulting graph of contracting s and t into a super-vertex;
6  return {g};
MAS-Based Two-Way Partition. A variant of the above MAS-based global min-cut computation algorithm is adopted in [98, 66, 102] to partition a graph if it is not k-edge connected. Let Partition-MAS denote the MAS-based two-way partition; its pseudocode is shown in Algorithm 4. Once a cut of value smaller than k is found by one run of MAS, the procedure terminates by removing all edges of the cut from the graph, which partitions the graph into two subgraphs (Line 4). Otherwise, if the procedure does not terminate early, it concludes that the graph is k-edge connected (Line 6). The time complexity of Partition-MAS is O(n × m). To reduce the running time of Partition-MAS in practice, a set of pruning techniques is developed in [102]. Nevertheless, the worst-case running time is still Ω(n × m): given a graph G that is k-edge connected, Partition-MAS needs to run MAS (via Line 3 of Algorithm 4) n − 1 times to determine that G is k-edge connected, while each run of MAS takes Ω(m) time. As a result, Partition-MAS is inefficient for processing large graphs.

Dominating Set-Based Partition. An algorithm based on dominating sets is proposed in [58] to compute the global min-cut of a graph. The idea is totally different from that of [85] and is based on dominating sets and Lemma 6.2 below.

Definition 6.6. Given a graph G = (V, E), a vertex subset D ⊆ V is a dominating set of G if for any vertex u ∈ V, either u is in D or u is adjacent to a vertex of D.

Lemma 6.2 ([58]). Given a graph g, let (S, T) be an arbitrary cut of g such that ω(S, T) < δ(g), where δ(g) denotes the minimum degree of g. Then, any dominating set D of g contains vertices from both S and T (i.e., D ∩ S ≠ ∅ and D ∩ T ≠ ∅).

Based on Lemma 6.2, the general idea in [58] for computing the global min-cut of a graph G = (V, E) is as follows. It grows a vertex subset D ⊆ V, which initially consists of an arbitrary vertex of G, until D is a dominating set of G. Whenever D is not a dominating set of G, it chooses a vertex u from V\D, computes the minimum cut between u and D (i.e., a cut (S, T) such that D ⊆ S and u ∈ T), and adds u to D. Then, the cut of the minimum value among all the identified cuts is a global min-cut of G. This is because, considering a global min-cut (S∗, T∗) of G, by Lemma 6.2 there must be a step, before D becomes a dominating set, at which D is a subset of S∗ and the next vertex u to be added to D is from T∗; moreover, the value of the cut computed at this step must be equal to the value of (S∗, T∗).
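Definition 6.6 and Lemma 6.2 can be checked exhaustively on a small input. The sketch below is only a sanity check, not part of the algorithm of [58]: it enumerates all cuts of value smaller than the minimum degree and all dominating sets, and confirms that every dominating set has a vertex on both sides of every such cut. The function names and the example graph are hypothetical.

```python
from itertools import combinations

def is_dominating(adj, D):
    """Definition 6.6: every vertex is in D or adjacent to a vertex of D."""
    return all(u in D or adj[u] & D for u in adj)

def subsets(items, lo, hi):
    """All subsets of `items` whose size is between lo and hi (inclusive)."""
    for r in range(lo, hi + 1):
        for c in combinations(items, r):
            yield set(c)

def check_lemma(adj):
    """Return (holds, number of cuts of value < delta(g)) for Lemma 6.2."""
    vs = sorted(adj)
    delta = min(len(adj[v]) for v in vs)
    # A cut is represented by one side S; both S and its complement are listed.
    small_cuts = [S for S in subsets(vs, 1, len(vs) - 1)
                  if sum(len(adj[u] - S) for u in S) < delta]
    dominating = [D for D in subsets(vs, 1, len(vs)) if is_dominating(adj, D)]
    holds = all(D & S and D - S for S in small_cuts for D in dominating)
    return holds, len(small_cuts)

# Hypothetical example: two K4's joined by two edges (minimum degree 3).
edges = [*combinations(range(4), 2), *combinations(range(4, 8), 2), (3, 4), (2, 5)]
adj = {v: set() for v in range(8)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
holds, n_small_cuts = check_lemma(adj)
```

In this example the only cut of value below the minimum degree separates the two copies of K4, and no dominating set fits entirely inside one copy.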
Algorithm 5: Partition-DS(g, k) [17]
 1  if the minimum degree of g is smaller than k then
 2      Let u be a vertex of degree smaller than k;
 3      return {g\u, u};
 4  D ← {arg max_{v∈V(g)} d(v)}; /* D contains the maximum-degree vertex */
 5  F ← {u ∈ V(g)\D | ∃v ∈ D, s.t., (u, v) ∈ E(g)};
 6  while D ∪ F ≠ V(g) do
 7      Let u be the vertex with the maximum degree in g[V(g)\D];
 8      Compute a minimum cut (S, T) of g such that D ⊆ S and u ∈ T;
 9      if the value of (S, T) is smaller than k then
10          return {g[S], g[T]};
11      D ← D ∪ {u};
12      F ← {u ∈ V(g)\D | ∃v ∈ D, s.t., (u, v) ∈ E(g)};
13  return {g};
Given a graph that is not k-edge connected, one of the cuts identified during the above process must have value smaller than k; thus, its removal can partition the graph into two subgraphs. Denote the dominating set-based two-way partition by Partition-DS. Its pseudocode is shown in Algorithm 5. If the minimum degree of g is less than k, then g is not k-edge connected (Lines 1–3). Otherwise, the algorithm maintains two disjoint vertex subsets D, F of V(g), where F consists of all vertices adjacent to but not in D (Lines 5 and 12). D is initialized to contain the vertex of g with the maximum degree (Line 4). Whenever D ∪ F does not contain all vertices of g (i.e., D is not a dominating set of g), the algorithm picks the vertex u from V(g)\D that has the maximum degree in the subgraph of g induced by V(g)\D (Line 7) and computes a minimum cut (S, T) between u and D (Line 8). If the value of (S, T) is smaller than k, then g is not k-edge connected and the removal of all edges of the cut from g partitions g into two subgraphs (Lines 9–10); otherwise, u is added to D (Line 11). Finally, if the algorithm does not terminate before D becomes a dominating set of g, then g is k-edge connected (Line 13).

Note that Algorithm 5 is also correct if an arbitrary vertex is chosen from V(g) and from V(g)\D, respectively, at Lines 4 and 7. Selecting the maximum degree vertex at Lines 4 and 7 ensures that the time complexity of Algorithm 5 is O(min{n × m, k × n²}); note that, for a graph G whose minimum degree is no smaller than k, it holds that k × n ≤ 2 × m. The general idea to prove the time complexity is as follows [58]. We define an event that a vertex is covered, which happens the first time the vertex is added to D ∪ F; then, each vertex of V(g) is covered at most once. Now, consider the average cost of covering a vertex, where the main cost is computing a minimum cut at Line 8; note that a single computation of a minimum cut at Line 8 may cover multiple vertices. Consider a vertex u obtained at Line 7, and let i be the number of neighbors of u in D ∪ F. Then, i edge-disjoint paths from u to D can be found in total O(n) time. To compute k edge-disjoint paths from u to D, we still need to compute k − i more paths, which can be conducted in O((k − i) × m′) time where m′ ≤ m is the total number of edges in the induced subgraph g[V(g)\D]. Let d′(u) be the degree of u in g[V(g)\D]. Then, at least d′(u) − i vertices will be covered at this step, and d′(u) ≥ k and d′(u) ≥ 2 × m′/n since u has the maximum degree in the subgraph. Thus, the average cost of covering a vertex is O(min{m, k × n}). It is shown empirically in [17] that Partition-DS generally runs faster than Partition-MAS, but Partition-DS is still inefficient for processing large graphs.
6.2.3 Connectivity-Aware Multiway Partition

To reduce the height of the partition tree (see Figure 6.2), a connectivity-aware multiway partition algorithm, denoted Partition-MultiWay, is proposed in [17]. It simultaneously finds multiple (i.e., possibly more than two) cuts of value smaller than k, whose removal may partition a non-k-edge connected graph into more than two subgraphs. Partition-MultiWay is also based on MAS (Algorithm 3). But unlike Partition-MAS (Algorithm 4), which terminates after finding one cut C of value smaller than k, Partition-MultiWay continues to recursively invoke MAS on the non-singleton connected subgraph of g′ obtained by removing C; note that the other obtained connected subgraph of g′ must consist of a single super-vertex t due to the nature of MAS.

The main benefit of Partition-MultiWay over Partition-TwoWay is as follows. Let g′1, g′2 be the two corresponding connected subgraphs obtained from g′ by removing edges of the cut obtained at Line 3 of Algorithm 4, where g′2 consists of a single super-vertex t. Then, g′1 and g′2 correspond to g[S] and g[T], respectively, but with some of the vertices being contracted into super-vertices. To further partition g[S], Partition-MultiWay continues to run on g′1, while Partition-TwoWay will later run on g[S]. Note that the number of vertices in g′1 is usually much smaller than that in g[S]; thus, Partition-MultiWay(g′1, k) partitions g[S] much faster than Partition-TwoWay(g[S], k) does.

Optimizations. To improve the efficiency of Partition-MultiWay, optimization techniques are developed in [17]. Before presenting the optimizations, we first introduce some more notation. Given a list L of vertices and a vertex u ∈ L, we use Lu to denote the sublist of L consisting of all vertices preceding u and use pu to denote the vertex immediately preceding u in L; note that u ∉ Lu.

Optimization-I: List Sharing. The first optimization is sharing the list L across different runs of MAS. The main observation is as follows. Let L be the list computed by MAS for a graph G, and let t be the last vertex of L. If ω(Lt, {t}) < k, then one of the connected subgraphs of G obtained by removing edges of the cut (Lt, {t}) is exactly the subgraph G[Lt] of G induced by vertices Lt; thus, a minimum s–t cut of G[Lt] can be directly obtained from the same list L. Moreover, this phenomenon applies recursively, as long as the value of the cut found based on L is smaller than k. This is formally proved by the lemma below.
Lemma 6.3 ([17]). Let L be the list obtained by MAS (Algorithm 3) on an input graph G. Then, for any vertex u ∈ V, (Lu, {u}) is a minimum pu–u cut in the subgraph of G induced by vertices Lu ∪ {u}.

Proof. Let V′ be Lu ∪ {u}. Then, it is easy to verify that for any v ∈ V′:
$$\arg\max_{v' \in V' \setminus L_v} \omega(L_v, v') = v.$$
Recall that Lv includes all vertices preceding v in L, but does not include v. Thus, Lu followed by u will be the list obtained by MAS when taking G[Lu ∪ {u}] as input and taking the first vertex of L as the initial vertex. As a result, (Lu, {u}) is a minimum pu–u cut in G[Lu ∪ {u}] by Lemma 6.1. □

Optimization-II: Early Contraction. The second optimization is early contraction; that is, contracting multiple pairs of vertices in a single run of MAS. The intuition is that given a graph G, a pair of vertices v1 and v2 in G can be safely contracted into a super-vertex without violating the atomicity and cutability properties of a partition algorithm (see Section 6.2.1), as long as they are k-edge connected in G. Note that v1 and v2 may actually belong to different k-edge connected components. Nevertheless, if this is the case, then there must be another cut (S, T) of G of value smaller than k such that every vertex of S is not k-edge connected to any vertex of T when computed in G, and thus v1 and v2 are on the same side of the cut. As a result, this invocation of Partition-MultiWay will separate G by removing edges of the cut (S, T), while v1 and v2 will be separated in a later invocation of Partition-MultiWay.

The benefit of early contraction is that a k-edge connected component of G can be contracted rapidly into a single super-vertex, such that the number of calls to MAS is significantly reduced. Lemma 6.3 implies that pu and u are k-edge connected in G if ω(Lu, u) ≥ k. Thus, each such pair of vertices can be contracted into a super-vertex based on L. Even better, the contraction operation can be applied on-the-fly based on the following lemma.

Lemma 6.4 ([17]). During the execution of MAS, whenever encountering a vertex v with weight no smaller than k, v and the vertex u that was last added to the current list L can be contracted into a super-vertex, and the execution of MAS continues.

Proof. Let L be the list obtained by MAS for a graph G. Firstly, following from Lemma 6.3, for any vertex v with weight no smaller than k (i.e., ω(Lv, v) ≥ k), v and pv are k-edge connected in G, and thus, v and pv can be safely contracted into a super-vertex. Secondly, during the execution of MAS, whenever encountering a vertex v with weight no smaller than k, let u be the last vertex added to the current list Lu ∪ {u}; then the weights of all vertices in the list L between u and v (excluding u) are no smaller than k, due to the nature of MAS. Thus, all vertices between u and v in L can be safely contracted into a single super-vertex; that is, u and v can be safely contracted into a super-vertex. Thus, the lemma holds. □

Following Lemma 6.4, consider the graph in Figure 6.1 with k = 3, when L is initialized as {v1}, the list L obtained after the first run of the optimized MAS is
(v1, v2, v3, {v4, v5}, v7, v6, {v8, v9}, v11, v12, {v10, v13}), where a super-vertex is represented by the set of vertices from which it is contracted, enclosed by a pair of braces. Here, three pairs of vertices are contracted in a single run of MAS.

Optimization-III: k-Core Reduction. The third optimization is k-core reduction. It is easy to see that each vertex in a k-edge connected component has at least k neighbors within the k-edge connected component, and thus belongs to a k-core of the input graph; please refer to Chapter 3 for the definition of k-core. This optimization strategy is to incrementally maintain the k-cores of the subgraphs obtained through Partition-MultiWay. Note that the maintenance of k-core during the entire execution of KECC can be conducted in O(m) total time.

Pseudocode of Optimized Partition-MultiWay. The pseudocode of Partition-MultiWay incorporating the above optimizations is shown in Algorithm 6. Firstly, the input graph g is replaced by its k-core (Line 1) and a copy of g is stored in g′ (Line 2). Then, an optimized MAS (i.e., MAS-Opt) iteratively runs on g′ until g′ contains only one super-vertex. In the procedure MAS-Opt, a queue Q is used to store all vertices with weight at least k, and thus, the early contraction optimization is applied by contracting all vertices of Q with u (Lines 11–16). In addition, the computed list L is shared when the value of the obtained minimum cut is smaller than k (Lines 17–19), and once a cut of value smaller than k is found, all its edges are removed from both g′ and g (Lines 18–19). Note that, in Algorithm 6, the contraction operation is only applied to g′ (Line 16), while the removal of edges of a cut is done for both g′ and g (Lines 18–19).
Algorithm 6: Partition-MultiWay(g, k) [17]
 1  g ← the k-core of g;
 2  g′ ← g;
 3  while g′ contains more than one vertex do
 4      MAS-Opt(g′, k, g); /* g′ and g are modified in MAS-Opt */
 5  return the set of connected components of g;

Procedure MAS-Opt(g′, k, g)
 6  L ← {an arbitrary vertex u of V(g′)};
 7  for each v ∈ V(g′) do ω(v) ← ω(L, v);
 8  while |L| ≠ |V(g′)| do
 9      u ← arg max_{v∈V(g′)\L} ω(v);
10      Add u to the tail of L, and initialize a queue Q with u;
11      while Q ≠ ∅ do
12          v ← pop a vertex from Q;
13          for each (v, w) ∈ E(g′) with w ∉ L do
14              ω(w) ← ω(w) + 1;
15              if ω(w) = k then Push w into Q;
16          if u ≠ v then Contract vertices u and v in g′;
17  while |L| > 1 and the value of the cut C implied by the last two vertices in L is less than k do
18      Remove the last vertex of L from both g′ and L;
19      Remove all edges of C from g;
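The list-sharing step (Lines 17–19, justified by Lemma 6.3) can be illustrated in isolation. The sketch below is a deliberately reduced version under stated assumptions: it works on a simple graph without contraction or k-core reduction, records for every vertex u the weight ω(Lu, u) at the moment u joins L, and then peels vertices off the tail of L while that recorded weight is below k. The names `mas_order` and `peel_small_cuts` and the example graph are hypothetical; this is not the MAS-Opt procedure itself.

```python
def mas_order(adj, start):
    """Return the MAS list L and, for each vertex u, omega(L_u, u) at insertion."""
    order, at_insert = [start], {start: 0}
    weight = {v: (1 if v in adj[start] else 0) for v in adj if v != start}
    while weight:
        u = max(weight, key=weight.get)   # most tightly connected vertex
        at_insert[u] = weight.pop(u)
        order.append(u)
        for w in adj[u]:
            if w in weight:
                weight[w] += 1
    return order, at_insert

def peel_small_cuts(adj, k, start):
    """List sharing: reuse one list L to find several cuts of value < k.

    By Lemma 6.3, the weight recorded for the current last vertex u equals the
    value of the cut (L_u, {u}) in the subgraph induced by L_u and u, so the
    tail can be cut off repeatedly without recomputing L.
    """
    order, at_insert = mas_order(adj, start)
    removed = []
    while len(order) > 1 and at_insert[order[-1]] < k:
        removed.append(order.pop())
    return order, removed

# Hypothetical example: a triangle {0, 1, 2} with a path 2 - 3 - 4 attached.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
remaining, removed = peel_small_cuts(adj, 2, 0)
```

A single list suffices here to separate both path vertices from the triangle; the vertices left in the list are not guaranteed to be k-edge connected, which is why further runs are needed in general.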
#Run   List L
1      v1, v2, v3, {v4, v5}, v7, v6, {v8, v9}, v11, v12, {v10, v13}
2      v1, {v4, v5}, {v2, v3}, v7, {v6, v8, v9}, v11, {v10, v12, v13}
3      v1, {v2, v3, v4, v5}, {v6, v7, v8, v9}, {v10, v11, v12, v13}
4      {v1, v2, v3, v4, v5}

Fig. 6.3: Running example of Partition-MultiWay

Example 6.2. Given the graph in Figure 6.1 and k = 3, the list L obtained by each run of MAS-Opt is shown in Figure 6.3, where super-vertices are denoted by their associated vertices enclosed by braces. Partition-MultiWay partitions the graph into three subgraphs after four runs of MAS-Opt. In the first run of MAS-Opt, v4 and v5, v8 and v9, and v10 and v13 are contracted into three super-vertices, respectively. In the second run of MAS-Opt, v2 and v3, v6 and {v8, v9}, and v12 and {v10, v13} are contracted into three super-vertices, respectively. In the third run of MAS-Opt, {v2, v3} and {v4, v5}, v7 and {v6, v8, v9}, and v11 and {v10, v12, v13} are contracted into three super-vertices, respectively. In the fourth run of MAS-Opt, two cuts of value smaller than 3 are obtained, which partition the graph into three subgraphs.

Time Complexity of Partition-MultiWay. Let l denote the number of invocations of MAS-Opt at Line 4 of Algorithm 6. The time complexity of Partition-MultiWay is O(l × |E(g)|). This is because each run of MAS-Opt takes O(|E(g)|) time, by using the linear heap data structures presented in Chapter 2 to maintain the weights of vertices (at Line 14) and to retrieve the vertex with the largest weight (at Line 9). To contract u and v at Line 16 of Algorithm 6, the adjacent edges of v are added to be adjacent edges of u where parallel edges are allowed. Then, the total time complexity of contraction in MAS-Opt is O(|E(g)|) by noting the fact that each edge is moved (because of contraction) at most once during one run of MAS-Opt.
6.2.4 The KECC Algorithm

We use KECC to denote the version of Algorithm 1 that invokes Partition-MultiWay (i.e., Algorithm 6) to partition a graph (C++ code of KECC can be found online at https://github.com/LijunChang/Cohesive_subgraph_book/tree/master/edge_connectivity_decomposition). Let h denote the height of the partition tree as illustrated in Figure 6.2, and l denote the maximum number of invocations of MAS-Opt in an execution of Partition-MultiWay across all runs of Partition-MultiWay. Then, the time complexity of KECC for a given graph G is O(h × l × m). Let KECC-MAS denote the version of Algorithm 1 that invokes Partition-MAS (i.e., Algorithm 4) for partitioning a graph, and let KECC-DS denote the version of Algorithm 1 that invokes Partition-DS (i.e., Algorithm 5) for partitioning a graph.
Then, the time complexity of both KECC-MAS and KECC-DS is O(n² × m). Empirical studies on real-world graphs conducted in [17] show that KECC-DS generally runs faster than KECC-MAS, and that both are significantly outperformed by KECC. Moreover, h and l, as used in the time complexity of KECC, are usually bounded by small constants for real-world graphs. Therefore, KECC is able to process large graphs efficiently.

Bounding h. The height h of the partition tree (see Figure 6.2) reflects the "depth" to which the k-edge connected components are nested to form the input graph G, and this depth of nestedness is reduced by one after each iteration of Partition-MultiWay. Intuitively, for a given non-k-edge connected graph G, the more subgraphs Partition-MultiWay returns, the lower the height h. The minimum number of subgraphs returned by Partition-MultiWay is guaranteed by the number of sibling subgraphs, defined as follows. Informally, the sibling subgraphs have the property that each of them, except one, is connected to the remaining part of its parent graph by fewer than k edges. For example, for the graph G in Figure 6.1 and k = 3, (g1 ∪ g2) and g3 form the two sibling subgraphs; each of them is connected to the remaining part of the parent graph by 2 edges. More rigorously, the sibling subgraphs have the property that each of them is connected to any vertex in the remaining part of the parent graph by fewer than k edge-disjoint paths. Usually, Partition-MultiWay returns more subgraphs than the number of sibling subgraphs. For example, the graph G in Figure 6.1 has only the two sibling subgraphs (g1 ∪ g2) and g3 discussed above. However, in practice, if g3 is disconnected from G (i.e., a cut between g3 and g1 ∪ g2 is found) before any pair of vertices with one vertex from g1 and the other from g2 is contracted, then g1 ∪ g2 will be partitioned into g1 and g2 in the same run of Partition-MultiWay.
Therefore, three subgraphs g1, g2, and g3 will be returned by Partition-MultiWay, and h = 1 in this case. This phenomenon is observed in the experimental studies of [17], in which the largest h is 5 across graphs whose numbers of vertices vary from 4 thousand to 2 million.

Bounding l. Although l can be as large as |V(g)| in the worst case, l is usually small for real-world graphs [17]. It is characterized in the following two cases, depending on whether g is k-edge connected or not. First, consider a non-k-edge connected graph g, and let {g1 = (V1, E1), . . . , gx = (Vx, Ex)} be the set of k-edge connected components of g. Then, we have $l \leq \max_{j=1}^{x} |V_j|$, for the following reasons. Consider an arbitrary subgraph $g_i$, and let $g_i^y$ denote the subgraph corresponding to $g_i$ after y runs of MAS-Opt. Then, the number of vertices in $g_i^{y+1}$ is strictly smaller than that in $g_i^y$. This is because $g_i^y$ is k-edge connected, and thus the early contraction optimization will contract at least one pair of vertices. Thus, after each run of MAS-Opt, the number of vertices in $g_i$ is reduced by at least one. Moreover, after $\max_{j=1}^{x} |V_j|$ runs of MAS-Opt, if two subgraphs $g_i$ and $g_j$ are not already contracted into a single super-vertex, then they will be or have already been partitioned into two subgraphs without any extra run of MAS-Opt.
Secondly, consider a k-edge connected graph g. Then, each vertex of g has degree at least k. In Partition-MultiWay, each run of MAS-Opt computes an ordering L of the vertices in g and contracts each vertex v having $\omega(L_v, v) \geq k$ with its predecessor vertex in L (see the early contraction optimization). Recall that $\omega(L_v, v)$ is the number of edges between v and the vertices prior to v in L. Let k(L) denote the number of vertices having $\omega(L_v, v) \geq k$ in L. Then, after each run of MAS-Opt which computes an ordering L, the number of vertices is reduced by k(L). In particular, for a pair of vertices u and v that have no fewer than k parallel edges between them, MAS-Opt is guaranteed to merge them into a super-vertex. Consider a random ordering L and a vertex v with degree no less than 2k; the probability that v will be merged with other vertices is at least 1/2. Let $V_{2k}$ denote the subset of vertices in g whose degrees are no less than 2k. The expected number of vertices removed by one run of MAS-Opt is then at least $|V_{2k}|/2$. After vertices are contracted into super-vertices, the degrees of super-vertices tend to become larger. Therefore, the number of vertices decreases more rapidly after each run of MAS-Opt. For the special case of a graph in which every vertex has degree no less than 2k, the number l of runs is expected to be bounded by log |V(g)|. The same phenomenon also applies to each k-edge connected component of g if g is not k-edge connected. The experimental studies in [17] show that, across all the tested graphs, the values of l are no more than 21, regardless of the sizes of the graphs and the sizes and/or numbers of k-edge connected components of a graph.
6.3 Randomized k-Edge Connected Components Computation

Given a graph G = (V, E) and an integer k, a randomized algorithm, denoted RandomContract, is proposed in [3] to find the set of k-edge connected components of G with high probability. The general framework of RandomContract is similar to that of KECC (Algorithm 1). That is, it iteratively partitions a non-k-edge connected subgraph into multiple subgraphs. But unlike KECC, which uses the deterministic algorithm Partition-MultiWay to partition a graph, RandomContract uses a randomized algorithm, Partition-Random. Given a non-k-edge connected subgraph g, Partition-MultiWay(g, k) is guaranteed to partition g into at least two subgraphs, while Partition-Random(g, k) partitions g into at least two subgraphs only with high probability. That is, Partition-MultiWay satisfies both the atomicity and cutability properties, while Partition-Random satisfies only the atomicity property. To increase the success probability, RandomContract invokes Partition-Random on a subgraph g multiple times if Partition-Random fails to partition g. As a result, RandomContract is a Monte Carlo algorithm [61]: it has a fixed time complexity but may not output the correct result.

The RandomContract Algorithm. The pseudocode of RandomContract is shown in Algorithm 7. It performs the random contraction algorithm, Partition-Random, for τ iterations; choosing an appropriate value for τ will be discussed shortly.
In each iteration, it takes the set 𝒢_{i−1} of subgraphs returned from the previous iteration as input and tries to partition each subgraph g in 𝒢_{i−1} into multiple subgraphs by invoking Partition-Random(g, k). The pseudocode of the procedure Partition-Random is also shown in Algorithm 7. Similar to Partition-MultiWay, it contracts vertex pairs into super-vertices (Lines 13–14). But unlike Partition-MultiWay, which contracts a pair of vertices only if they are k-edge connected in the current graph, Partition-Random contracts a random pair of adjacent vertices. During the random contraction process, whenever there is a vertex whose weighted degree (i.e., the sum of the weights of its adjacent edges) is smaller than k, its adjacent edges form a cut of value smaller than k; the vertex and its adjacent edges are then removed from the graph, which partitions the graph into multiple subgraphs (Lines 10–11).

Implementation Details and Time Complexity. To efficiently implement the contraction operation, the graph g′ (at Line 7) is represented as an edge-weighted graph in [3], where the weight of (u, v) represents the number of parallel edges between u and v. Moreover, the adjacent edges of each vertex u are stored in a hash structure h_u built specifically for u. As a result, to contract u and v, it merges the two hash structures by moving all entries of the smaller of h_u and h_v into the other one.

Algorithm 7: RandomContract: compute k-edge connected components [3]
Input: A graph G = (V, E) and an integer k
Output: k-edge connected components of G
 1  𝒢_0 ← {G};
 2  for i ← 1 to τ do
 3      𝒢_i ← ∅;
 4      for each graph g of 𝒢_{i−1} do
 5          𝒢_i ← 𝒢_i ∪ Partition-Random(g, k);
 6  return 𝒢_τ;

Procedure Partition-Random(g, k)
 7  𝒢′ ← ∅; g′ ← g;
 8  while g′ is not empty do
 9      if ∃u ∈ V(g′) such that d_{g′}(u) < k then
10          𝒢′ ← 𝒢′ ∪ {the subgraph of g induced by vertices that are contracted to form u in g′};
11          Remove u from g′;
12      else
13          Choose an edge (v, w) in g′ at random;
14          Contract v and w in g′;
15  return 𝒢′;
In this way, it is proved in [3] that the worst-case time complexity of Partition-Random is O(m log n), and in typical situations, its expected time complexity is O(m). Thus, the worst-case time complexity of RandomContract is O(τ × m × log n) and the expected time complexity is O(τ × m).

Forced Contraction. To improve both the processing time and the success probability, an optimization technique, called forced contraction, is developed in [3]. That is, once there is an edge (u, v) whose weight becomes no smaller than k, u and v are contracted immediately; this is because the weighted degrees of u and v will be no smaller than k, and no cut of value smaller than k can separate u from v. It is interesting to observe that forced contraction is a special case of the early contraction optimization used in Partition-MultiWay.

Analysis on the Number τ of Iterations. As RandomContract is a randomized algorithm, the more iterations it runs, the more likely it is to return the correct result. However, more iterations also mean a higher running time. It is proved in [3] that, with forced contraction, O(log² n) iterations suffice for Partition-Random to identify, with high probability, a cut of value smaller than k in a graph that is not k-edge connected. Empirical studies in [3] show that 50 iterations are sufficient to obtain the correct set of k-edge connected components in most cases in practice.
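The random contraction procedure described above can be sketched in Python as follows. This is a minimal illustration under assumptions that are not taken from [3]: the graph is a dict that maps each vertex to a dict of neighbor multiplicities (standing in for the hash structures h_u), a vertex of small weighted degree is found by a plain scan rather than an efficient index, a random edge is drawn by first picking an endpoint with probability proportional to its weighted degree, and forced contraction is omitted. The names partition_random and random_contract are hypothetical.

```python
import random


def partition_random(adj, k, rng):
    """One run of random contraction; returns a list of vertex lists.

    adj[u][v] is the number of parallel edges between u and v (symmetric,
    no self-loops). Assumes k >= 1. The input is not modified.
    """
    h = {u: dict(nbrs) for u, nbrs in adj.items()}   # working copy g'
    members = {u: [u] for u in h}                    # vertices behind each super-vertex
    parts = []
    while h:
        low = next((u for u in h if sum(h[u].values()) < k), None)
        if low is not None:
            # The edges around `low` form a cut of value < k: split it off.
            parts.append(sorted(members.pop(low)))
            for x in h.pop(low):
                del h[x][low]
            continue
        # Every weighted degree is >= k >= 1 here, so at least one edge exists.
        verts = list(h)
        u = rng.choices(verts, weights=[sum(h[x].values()) for x in verts])[0]
        nbrs = list(h[u])
        v = rng.choices(nbrs, weights=[h[u][x] for x in nbrs])[0]
        if len(h[u]) < len(h[v]):
            u, v = v, u                              # move the smaller table into the larger
        del h[u][v]
        for x, w in h.pop(v).items():                # contract v into u
            if x == u:
                continue                             # edges between u and v vanish
            del h[x][v]
            h[u][x] = h[u].get(x, 0) + w
            h[x][u] = h[x].get(u, 0) + w
        members[u].extend(members.pop(v))
    return parts


def random_contract(adj, k, tau, seed=0):
    """Apply partition_random to every current subgraph, tau times."""
    rng = random.Random(seed)
    graphs = [adj]
    for _ in range(tau):
        nxt = []
        for g in graphs:
            for part in partition_random(g, k, rng):
                keep = set(part)
                nxt.append({u: {v: w for v, w in g[u].items() if v in keep}
                            for u in part})
        graphs = nxt
    return sorted(sorted(g) for g in graphs)
```

Because a super-vertex is split off only when the cut around it has value smaller than k, vertices of the same k-edge connected component are never separated (atomicity); whether a non-k-edge connected subgraph is actually split in a given run depends on the random choices.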
6.4 Edge Connectivity-Based Decomposition Given a graph G, this section investigates the problem of edge connectivity-based decomposition, which computes the set of k-edge connected components of G for all different values of k. However, rather than directly storing the k-edge connected components for all different k values, whose total size can be much larger than the size of the input graph, we compute a steiner connectivity for each edge in G. Definition 6.7 ([16]). The steiner connectivity between two vertices u and v in a graph G, denoted sc(u, v), is the maximum value k such that there is a k-edge connected component of G that contains both u and v. Definition 6.8 ([16]). Given a graph G = (V, E), the connectivity graph of G is a weighted undirected graph Gw = (V, E, w) with the same set of vertices and edges as G. But, each edge (u, v) in Gw carries a weight w(u, v) that equals the steiner connectivity sc(u, v) between u and v.
Fig. 6.4: A connectivity graph [16]
For example, Figure 6.4 shows the connectivity graph of the underlying unweighted graph, where the Steiner connectivity between each pair of adjacent vertices is shown beside the corresponding edge.

Obtain k-Edge Connected Components from the Connectivity Graph. Given the connectivity graph G_w of a graph G and an integer k, all k-edge connected components of G can be retrieved from G_w in linear time, based on the lemma below.

Lemma 6.5 ([16]). Each k-edge connected component of G corresponds to a connected component of $G_w^{\geq k}$, where $G_w^{\geq k}$ denotes the subgraph of G_w induced by edges with weight at least k.

Proof. According to the definition of Steiner connectivity, for each pair of adjacent vertices u and v with sc(u, v) ≥ k, there is a k-edge connected component of G containing both u and v. In addition, we know that the k-edge connected components of a graph are disjoint. Thus, for each connected component of $G_w^{\geq k}$, the subgraph of G induced by the vertices of the connected component is a k-edge connected component of G, and the lemma holds. □

For example, for the graph in Figure 6.4, the subgraph induced by vertices {v1, . . . , v5} is a 4-edge connected component, and the subgraph induced by {v1, . . . , v5, v6, . . . , v9} is a 3-edge connected component. Following Lemma 6.5, after the connectivity graph is constructed, all k-edge connected components for different k values can be efficiently obtained. Thus, in the following, we focus on computing the Steiner connectivities for all pairs of adjacent vertices. Note that, if needed, a compact hierarchy (of size O(n)) of the k-edge connected components for all different k values can also be constructed from the connectivity graph in O(m) time, in a way similar to that of Section 3.2.3.
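The retrieval step of Lemma 6.5 can be illustrated with a short Python sketch. It assumes the connectivity graph is given as a dict from edges (u, v) to sc(u, v), which is a representation chosen here for illustration, and it uses a union-find structure for brevity, so it runs in near-linear rather than strictly linear time (a breadth-first search over the edges of weight at least k would give the linear bound). The function name is hypothetical.

```python
def components_from_connectivity_graph(vertices, sc, k):
    """Vertex sets of the k-edge connected components that contain an edge.

    sc maps each edge (u, v) of the connectivity graph to its weight sc(u, v).
    """
    parent = {v: v for v in vertices}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    # Keep only edges of weight >= k and merge their endpoints.
    for (u, v), weight in sc.items():
        if weight >= k:
            parent[find(u)] = find(v)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), []).append(v)
    # Vertices with no incident edge of weight >= k are not reported.
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

The same weighted edge list answers queries for every k, which is why storing one Steiner connectivity per edge is enough to recover all k-edge connected components.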
6.4.1 A Bottom-Up Approach It is easy to see that the connectivity graph of G can be constructed by computing all k-edge connected components of G for all different k values from small (i.e., 2) to large (i.e., kmax (G)). Here, kmax (G) denotes the largest value of k such that G contains a non-empty k-edge connected component. That is, sc(u, v) is initialized to be 1 for each edge (u, v) in G. Then, for each k varying from 2 to kmax (G), the k-edge connected components of G are computed by invoking KECC as presented in Section 6.2, and the sc(u, v) for each edge (u, v) in k-edge connected components of G is reassigned as k. When the algorithm terminates, the assigned sc(·, ·) values are correct steiner connectivities for all edges. This is referred to as a bottom-up approach [16]. Lemma 6.6 ([14]). kmax (G) ≤ δ (G) holds for every graph G, where δ (G) is the degeneracy of G (see Section 3.1.1).
Proof. We prove the lemma by contradiction. Assume kmax(G) > δ(G), and let g be a (kmax(G))-edge connected component of G. Then, the minimum vertex degree of g is at least kmax(G), which contradicts that the degeneracy of G is δ(G) (< kmax(G)). Thus, the lemma holds. □

Following Lemma 6.6, the time complexity of the above approach is O(δ(G) × T_KECC(G)), where T_KECC(G) is the time complexity of a k-edge connected components computation algorithm.

Computation Sharing-Based Algorithm. To improve the performance in practice, a computation sharing technique is developed in [16]. It follows from the property that each k-edge connected component of G is a subgraph of a (k − 1)-edge connected component of G. Therefore, all edges removed in computing the (k − 1)-edge connected components of G can also be safely removed when computing the k-edge connected components of G. That is, when computing the k-edge connected components of G, the set of (k − 1)-edge connected components of G, instead of G, can be taken as the input.

Algorithm 8: ConnGraph-BU: construct the connectivity graph [16]
Input: A graph G = (V, E)
Output: sc(u, v) for each pair of adjacent vertices (u, v) in G
 1  k ← 1; 𝒢_k ← {G};
 2  Assign sc(u, v) to be 1 for all edges of G;
 3  while 𝒢_k ≠ ∅ do
 4      k ← k + 1; 𝒢_k ← ∅;
 5      for each graph g of size at least 2 in 𝒢_{k−1} do
 6          𝒢_k ← 𝒢_k ∪ KECC(g, k);
 7      Assign sc(u, v) to be k for all edges of 𝒢_k;
The pseudocode of the bottom-up computation sharing algorithm is shown in Algorithm 8, denoted ConnGraph-BU. Let 𝒢_k denote the set of k-edge connected components of G. Initially, 𝒢_1 consists of only G (Line 1), and sc(u, v) is initialized as 1 for all edges (Line 2). Then, it iteratively computes the k-edge connected components of G by increasing k starting from 2 until there is no edge left (Lines 3–7). For a specific k, the set of subgraphs in 𝒢_{k−1}, rather than G, is given as input to KECC (Line 6), and sc(u, v) for each edge (u, v) in 𝒢_k is reassigned as k (Line 7). Note that, since no (kmax(G) + 1)-edge connected component of G contains an edge, Algorithm 8 terminates after at most kmax(G) iterations of the while loop. The time complexity of Algorithm 8 remains O(kmax(G) × T_KECC(G)).

Example 6.3. Consider the unweighted version of the graph in Figure 6.4. 𝒢_2 = {G}; thus, no pair of adjacent vertices has Steiner connectivity 1. 𝒢_3 = {g1 ∪ g2, g3}; (v5, v12) and (v9, v11) are removed, thus sc(v5, v12) = 2 and sc(v9, v11) = 2. In computing 𝒢_4, all edges in G except those in g1 are removed; therefore, these newly removed edges (u, v) have sc(u, v) = 3. Finally, the edges (u′, v′) in g1 are all removed when computing 𝒢_5 and have sc(u′, v′) = 4. The computed Steiner connectivities of pairs of adjacent vertices are also shown in Figure 6.4.
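The computation sharing in Algorithm 8 can be sketched in Python as below. This is only an illustration on small simple graphs: kecc_bruteforce is a hypothetical, exponential-time stand-in for KECC that splits a graph along any cut with fewer than k edges found by enumerating vertex subsets, and edges are assumed to be (u, v) tuples without duplicates. Only conn_graph_bu mirrors the structure of Algorithm 8.

```python
from itertools import combinations


def kecc_bruteforce(verts, edges, k):
    """Stand-in for KECC (exponential time, tiny graphs only).

    Recursively splits along any cut with fewer than k crossing edges and
    returns the vertex sets of the maximal k-edge connected subgraphs.
    """
    verts = sorted(verts)
    if len(verts) <= 1:
        return [set(verts)]
    first, rest = verts[0], verts[1:]
    for r in range(len(rest)):                   # proper subsets containing `first`
        for extra in combinations(rest, r):
            side = {first, *extra}
            crossing = sum((u in side) != (v in side) for u, v in edges)
            if crossing < k:
                out = []
                for part in (side, set(verts) - side):
                    sub = [(u, v) for u, v in edges if u in part and v in part]
                    out += kecc_bruteforce(part, sub, k)
                return out
    return [set(verts)]


def conn_graph_bu(verts, edges):
    """Bottom-up construction of sc(u, v) for every edge, as in Algorithm 8."""
    sc = {e: 1 for e in edges}
    graphs = [(set(verts), list(edges))]         # the set G_{k-1}
    k = 1
    while graphs:
        k += 1
        nxt = []
        for vs, es in graphs:
            # Sharing: search inside each (k-1)-edge connected component only.
            for comp in kecc_bruteforce(vs, es, k):
                sub = [(u, v) for u, v in es if u in comp and v in comp]
                if sub:                          # components without edges are dropped
                    nxt.append((comp, sub))
                    for e in sub:
                        sc[e] = k
        graphs = nxt
    return sc
```

An edge keeps the largest k for which it survives inside some k-edge connected component, which is exactly its Steiner connectivity.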
6.4.2 A Top-Down Approach

As an alternative to the bottom-up approach in Algorithm 8, it is also possible to compute the k-edge connected components of G by varying k from kmax(G) down to 1. We refer to this as a top-down approach. A computation sharing technique can also be developed for the top-down approach, based on the lemma below.

Lemma 6.7. For a graph G, if every (k + 1)-edge connected component of G is contracted into a super-vertex, then the set of edges in the k-edge connected components of the resulting graph is exactly the set of edges whose Steiner connectivities are k.

Proof. Let G′ denote the graph obtained from G by contracting each (k + 1)-edge connected component into a super-vertex. Firstly, it is easy to see that every cut of G′ is also a cut of G. Thus, each k-edge connected component of G is still k-edge connected after each (k + 1)-edge connected component is contracted into a super-vertex. Secondly, it is easy to see that every k-edge connected component of G′ is also k-edge connected in G. Thus, the lemma holds. □
Algorithm 9: ConnGraph-TD: construct the connectivity graph
Input: A graph G = (V, E)
Output: sc(u, v) for each pair of adjacent vertices (u, v) in G
 1  Compute the degeneracy δ(G) of G;
 2  k ← δ(G);
 3  while k > 0 do
 4      𝒢_k ← KECC(G, k);
 5      Assign sc(u, v) to be k for all edges of 𝒢_k;
 6      Contract each connected subgraph of 𝒢_k into a single super-vertex in G;
 7      k ← k − 1;
The pseudocode of the top-down approach with computation sharing is shown in Algorithm 9, denoted ConnGraph-TD. It is self-explanatory. It is easy to see that the time complexity of Algorithm 9 is O(δ (G) × TKECC (G)), by noting that the degeneracy δ (G) of G can be computed in O(m) time (see Chapter 3).
6.4.3 A Divide-and-Conquer Approach

A divide-and-conquer approach is proposed in [14] to compute the edge connectivity-based graph decomposition in O((log δ(G)) × T_KECC(G)) time. The general idea is that, instead of computing k-edge connected components for k varying sequentially from kmax(G) to 2 or the other way around, it divides the interval [2, kmax(G)] of possible k values. Specifically, let $M = \lfloor (2 + k_{\max}(G))/2 \rfloor$. It first computes the M-edge connected components of the input graph G, based on which it obtains two graphs G1 and G2. Here, G1 is the computed set of M-edge connected components, and G2 is obtained from G by contracting every M-edge connected component into a super-vertex. Then, it recursively solves the problem for k values in the interval [M + 1, kmax(G)] on G1, and the problem for k values in the interval [2, M − 1] on G2. The pseudocode is shown in Algorithm 10, which is self-explanatory.

Algorithm 10: ConnGraph-DC: construct the connectivity graph [14]
Input: A graph g, and an interval [L, H] of Steiner connectivity values with L ≤ H
Output: sc(u, v) for each edge (u, v) in g
 1  M ← ⌊(L + H)/2⌋;
 2  g1 ← KECC(g, M); /* Compute k-edge connected components of g for a given k = M */
 3  if M = H then
 4      Assign sc(u, v) to be H for each edge (u, v) in g1;
 5  else
 6      ConnGraph-DC(g1, [M + 1, H]);
 7  if M = L then
 8      Assign sc(u, v) to be (L − 1) for each edge (u, v) removed during the computation at Line 2;
 9  else
10      g2 ← the graph obtained from g by contracting each subgraph that is a connected component of g1 into a super-vertex;
11      ConnGraph-DC(g2, [L, M − 1]);
The correctness and the time complexity of Algorithm 10 are proved by the following two theorems, respectively.

Theorem 6.1 ([14]). Invoking Algorithm 10 with graph G and interval [2, δ(G)] correctly computes the Steiner connectivities of all edges of G.

Proof. Firstly, we prove that for each invocation of Algorithm 10 with inputs g and [L, H], all connected components of g are (L − 1)-edge connected and g has no (H + 1)-edge connected component. This property trivially holds for the first invocation of Algorithm 10. Now, we show that if this property holds for the inputs g and [L, H], then it also holds for the subroutines invoked at Line 6 and Line 11. For Line 6, each connected component of g1 is an M-edge connected component of g and is thus M-edge connected; moreover, g1 has no (H + 1)-edge connected component since it is a subgraph of g. Thus, the property holds when going into the subroutine at Line 6. For Line 11, every connected component of g2 is (L − 1)-edge connected, since it is obtained from a connected component of g by contracting each M-edge connected component into a super-vertex, where M > L. Also, there is no M-edge connected component in g2, since every M-edge connected component is contracted into a super-vertex. Hence, the property holds when going into the subroutine at Line 11.

Now, we prove the theorem by induction on the length of the interval [L, H]; that is, len = H − L + 1. For the base case len = 1 (i.e., L = H), we have M = L = H. We assign sc(u, v) to be M for each edge (u, v) in the M-edge connected components of g, and assign sc(u, v) to be M − 1 for each edge removed during the computation of the M-edge connected components of g. Hence, the algorithm is correct, since every connected component of g is (L − 1)-edge connected and there is no (H + 1)-edge connected component in g. Now, assume that the algorithm is correct for H − L + 1 ≤ r with r ≥ 1; we prove that the algorithm is also correct for H − L + 1 = r + 1. Let M = ⌊(L + H)/2⌋; we have L ≤ M < H. After computing the M-edge connected components of g, we partition the set of edges of g into two sets: the set of edges in the M-edge connected components of g and the set of edges not in the M-edge connected components of g, which correspond to the set of edges in g1 and the set of edges in g2, respectively. It is easy to verify that for each edge (u, v) in g, (u, v) is in g1 and sc(u, v) equals the Steiner connectivity of (u, v) in g1 if sc(u, v) ≥ M, and (u, v) is in g2 and sc(u, v) equals the Steiner connectivity of (u, v) in g2 if sc(u, v) < M; these two cases are correctly computed at Line 6 and Line 11, respectively. Thus, the theorem holds. □

Theorem 6.2 ([14]). The time complexity of Algorithm 10 with graph G and interval [2, δ(G)] is O((log δ(G)) × T_KECC(G)).

Proof. We prove the theorem by induction on the length of the interval [L, H] in the input of Algorithm 10.
Obviously, when L = H, the time complexity of Algorithm 10 is O(T_KECC(g)) = O((log(H − L + 2)) × T_KECC(g)); thus, the theorem holds when L = H. Now, assume that the theorem holds for all intervals of length at most len (i.e., H − L + 1 ≤ len); we prove that it also holds for intervals of length len + 1. Given an arbitrary interval [L, H] with H − L = len, Line 2 takes time O(T_KECC(g)), and from the induction hypothesis we have that the recursion of Lines 3–6 takes time O((log(H − M + 1)) × T_KECC(g1)) and the recursion of Lines 7–11 takes time O((log(M − L + 1)) × T_KECC(g2)). Without loss of generality, we assume that H − M = M − L. Then, the total time complexity of Algorithm 10 is O(T_KECC(g) + (log(H − M + 1)) × (T_KECC(g1) + T_KECC(g2))) = O((log(H − M + 1) + 1) × T_KECC(g)) = O((log(H − L + 2)) × T_KECC(g)), where the first equality holds since E(g1) ∪ E(g2) = E(g) and E(g1) ∩ E(g2) = ∅; here, we assume that T_KECC(g) is linear or super-linear in |E(g)|. Thus, the time complexity of Algorithm 10 with G and interval [2, δ(G)] is O((log δ(G)) × T_KECC(G)), and the theorem holds. □

Example 6.4. Consider the graph G in Figure 6.1, which is also shown in the top part of Figure 6.5. Here, δ(G) = 4. We compute the Steiner connectivities for all edges of G by invoking Algorithm 10 with G and interval [L, H] = [2, 4]. That is, we compute the M-edge connected components of G for M = ⌊(2 + 4)/2⌋ = 3 and obtain the subgraph induced by S1 = {v1, v2, . . . , v9} and the subgraph induced by S2 = {v10, . . . , v13}. As L < M = 3 < H, we continue the computation for the graphs G1 and G2 with intervals [4, 4] and [2, 2], respectively. The graph G1 consists of the two 3-edge connected components of G and is shown in the lower right part of Figure 6.5. We compute the 4-edge connected components of G1 and obtain the subgraph induced by vertices {v1, v2, . . . , v5}. Thus, all edges among vertices {v1, v2, . . . , v5} have Steiner connectivity 4, and the other edges of G1 have Steiner connectivity 3. The graph G2 is obtained by contracting each of S1 and S2 into a super-vertex and is shown in the lower left part of Figure 6.5; in G2, there are two parallel edges between s1 and s2, corresponding to edges (v9, v11) and (v5, v12), respectively. As G2 is 2-edge connected, the Steiner connectivities of (v9, v11) and (v5, v12) are 2. The result is the same as that computed by Algorithm 8 (see Example 6.3).
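Fig. 6.5: Running example of Algorithm 10 [14]. Top: the graph G with [L, H] = [2, 4], on which the 3-edge connected components are computed. Lower left: G2 with [L, H] = [2, 2], consisting of super-vertices s1 and s2 joined by the two parallel edges (v9, v11) and (v5, v12). Lower right: G1 with [L, H] = [4, 4], consisting of the subgraphs induced by {v1, . . . , v9} and {v10, . . . , v13}.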
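The recursion of Algorithm 10 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation of [14]: kecc_naive is a hypothetical stand-in for KECC that repeatedly splits along a global minimum cut found by the Stoer-Wagner algorithm (much slower than the KECC of Section 6.2), edges carry an identifier so that they can be traced through contractions, and the caller supplies an upper bound hi ≥ max(2, kmax(G)), such as the degeneracy of a graph whose degeneracy is at least 2.

```python
def stoer_wagner(verts, w):
    """Global minimum cut of a multigraph with at least two vertices.

    w[u][v] is the number of parallel edges between u and v (symmetric).
    Returns (cut value, set of vertices on one side of a minimum cut).
    """
    w = {u: dict(w[u]) for u in verts}           # working copy, merged in place
    groups = {u: {u} for u in verts}             # original vertices behind each node
    best_val, best_side = float("inf"), None
    while len(w) > 1:
        start = next(iter(w))
        conn = {u: w[start].get(u, 0) for u in w if u != start}
        prev, last, cut_of_phase = None, start, 0
        while conn:                              # maximum adjacency search
            nxt = max(conn, key=conn.get)
            cut_of_phase = conn.pop(nxt)
            prev, last = last, nxt
            for x, wx in w[nxt].items():
                if x in conn:
                    conn[x] += wx
        if cut_of_phase < best_val:
            best_val, best_side = cut_of_phase, set(groups[last])
        for x, wx in w.pop(last).items():        # merge `last` into `prev`
            del w[x][last]
            if x != prev:
                w[prev][x] = w[prev].get(x, 0) + wx
                w[x][prev] = w[x].get(prev, 0) + wx
        groups[prev] |= groups.pop(last)
    return best_val, best_side


def kecc_naive(verts, edges, k):
    """Stand-in for KECC: vertex sets of maximal k-edge connected subgraphs.

    edges is a list of (edge_id, u, v) with u != v; parallel edges are allowed.
    """
    verts = list(verts)
    if len(verts) <= 1:
        return [set(verts)]
    w = {u: {} for u in verts}
    for _, u, v in edges:
        w[u][v] = w[u].get(v, 0) + 1
        w[v][u] = w[v].get(u, 0) + 1
    value, side = stoer_wagner(verts, w)
    if value >= k:
        return [set(verts)]
    out = []
    for part in (side, set(verts) - side):
        sub = [e for e in edges if e[1] in part and e[2] in part]
        out += kecc_naive(part, sub, k)
    return out


def conn_graph_dc(verts, edges, lo, hi, sc):
    """Algorithm 10 on interval [lo, hi]; fills sc[edge_id]."""
    if not edges:
        return
    mid = (lo + hi) // 2                         # Line 1
    comps = kecc_naive(verts, edges, mid)        # Line 2
    comp_of = {v: i for i, c in enumerate(comps) for v in c}
    inside = [e for e in edges if comp_of[e[1]] == comp_of[e[2]]]     # edges of g1
    removed = [e for e in edges if comp_of[e[1]] != comp_of[e[2]]]
    if mid == hi:                                # Lines 3-4
        for eid, _, _ in inside:
            sc[eid] = hi
    else:                                        # Line 6
        conn_graph_dc(verts, inside, mid + 1, hi, sc)
    if mid == lo:                                # Lines 7-8
        for eid, _, _ in removed:
            sc[eid] = lo - 1
    else:                                        # Lines 10-11: contract components
        g2_edges = [(eid, comp_of[u], comp_of[v]) for eid, u, v in removed]
        conn_graph_dc(range(len(comps)), g2_edges, lo, mid - 1, sc)


def steiner_connectivities(verts, edges, hi):
    """sc(u, v) for every edge (u, v); requires hi >= max(2, kmax(G))."""
    sc = {}
    tagged = [(i, u, v) for i, (u, v) in enumerate(edges)]
    conn_graph_dc(list(verts), tagged, 2, hi, sc)
    return {edges[i]: k for i, k in sc.items()}
```

Every edge is handed to exactly one of the two recursive calls, either inside g1 or as an edge of the contracted graph g2, which is the disjointness used in the proof of Theorem 6.2.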
6.5 Further Readings

Edge connectivity-based decomposition has also been studied in the I/O-efficient setting [100], where techniques for reducing memory consumption are proposed. More recently, efficient techniques for computing all k-vertex connected components have been proposed in [52, 94].
References
1. James Abello, Mauricio G. C. Resende, and Sandra Sudarsky. Massive quasi-clique detection. In Proc. of LATIN’02, pages 598–612, 2002. 2. Esra Akbas and Peixiang Zhao. Truss-based community search: a truss-equivalence based indexing approach. PVLDB, 10(11):1298–1309, 2017. 3. Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. Linear-time enumeration of maximal kedge-connected subgraphs in large networks by random contraction. In Proc. CIKM’13, pages 909–918, 2013. 4. Albert Angel, Nick Koudas, Nikos Sarkas, and Divesh Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. PVLDB, 5(6):574–585, 2012. 5. Bahman Bahmani, Ravi Kumar, and Sergei Vassilvitskii. Densest subgraph in streaming and mapreduce. PVLDB, 5(5):454–465, 2012. 6. Oana Denisa Balalau, Francesco Bonchi, T.-H. Hubert Chan, Francesco Gullo, and Mauro Sozio. Finding subgraphs with maximum total density and limited overlap. In Proc. of WSDM’15, pages 379–388, 2015. 7. Vladimir Batagelj and Matjaz Zaversnik. An o(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049, 2003. 8. Austin R. Benson, David F. Gleich, and Jure Leskovec. Higher-order organization of complex networks. Science, 353(6295):163–166, 2016. 9. Sayan Bhattacharya, Monika Henzinger, Danupon Nanongkai, and Charalampos E. Tsourakakis. Space- and time-efficient algorithm for maintaining dense subgraphs on onepass dynamic streams. In Proc. of STOC’15, pages 173–182, 2015. 10. Fei Bi, Lijun Chang, Xuemin Lin, and Wenjie Zhang. An optimal and progressive approach to online search of top-k influential communities. PVLDB, 11(9):1056–1068, 2018. 11. S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and D.-U. Hwang. Complex networks: Structure and dynamics. Physics Reports, 424(4–5):175–308, 2006. 12. Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc. of WWW’04, pages 595–601, 2004. 13. 
Francesco Bonchi, Francesco Gullo, Andreas Kaltenbrunner, and Yana Volkovich. Core decomposition of uncertain graphs. In Proc. of KDD’14, pages 1316–1325, 2014. 14. Lijun Chang. A near-optimal algorithm for edge connectivity-based hierarchical graph decomposition. CoRR, abs/1711.09189, 2017. 15. Lijun Chang, Wei Li, Xuemin Lin, Lu Qin, and Wenjie Zhang. pSCAN: Fast and exact structural graph clustering. In Proc. of ICDE’16, pages 253–264, 2016. 16. Lijun Chang, Xuemin Lin, Lu Qin, Jeffrey Xu Yu, and Wenjie Zhang. Index-based optimal algorithms for computing Steiner components with maximum connectivity. In Proc. of SIGMOD’15, 2015. © Springer Nature Switzerland AG 2018 L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0
99
100
References
17. Lijun Chang, Jeffrey Xu Yu, Lu Qin, Xuemin Lin, Chengfei Liu, and Weifa Liang. Efficiently computing k-edge connected components via graph decomposition. In Proc. SIGMOD’13, pages 205–216, 2013. 18. Moses Charikar. Greedy approximation algorithms for finding dense components in a graph. In Proc. APPROX’00, pages 84–95, 2000. ¨ 19. James Cheng, Yiping Ke, Shumo Chu, and M. Tamer Ozsu. Efficient core decomposition in massive networks. In Proc. of ICDE’11, pages 51–62, 2011. 20. Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM J. Comput., 14(1):210–223, 1985. 21. Shumo Chu and James Cheng. Triangle listing in massive networks and its applications. In Proc. of KDD’11, pages 672–680, 2011. 22. Jonathan Cohen. Trusses: Cohesive subgraphs for social network analysis, 2008. 23. Alessio Conte, Donatella Firmani, Caterina Mordente, Maurizio Patrignani, and Riccardo Torlone. Fast enumeration of large k-plexes. In Proc. of KDD’17, pages 115–124, 2017. 24. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (3. ed.). MIT Press, 2009. 25. Maximilien Danisch, Oana Balalau, and Mauro Sozio. Listing k-cliques in sparse real-world graphs. In Proc. of WWW’18, 2018. 26. Maximilien Danisch, T.-H. Hubert Chan, and Mauro Sozio. Large scale density-friendly graph decomposition via convex programming. In Proc. of WWW’17, pages 233–242, 2017. 27. Alessandro Epasto, Silvio Lattanzi, and Mauro Sozio. Efficient densest subgraph computation in evolving graphs. In Proc. of WWW’15, pages 300–310, 2015. 28. A. Erdem Sariyuce, C. Seshadhri, and A. Pinar. Parallel Local Algorithms for Core, Truss, and Nucleus Decompositions. ArXiv e-prints, 2017. 29. Giorgio Gallo, Michael D. Grigoriadis, and Robert Endre Tarjan. A fast parametric maximum flow algorithm and applications. SIAM J. Comput., 18(1):30–55, 1989. 30. Christos Giatsidis, Dimitrios M. Thilikos, and Michalis Vazirgiannis. 
Evaluating cooperation in communities with the k-core structure. In Proc. of ASONAM’11, pages 87–93, 2011. 31. Christos Giatsidis, Dimitrios M. Thilikos, and Michalis Vazirgiannis. D-cores: measuring collaboration of directed graphs based on degeneracy. Knowl. Inf. Syst., 35(2):311–343, 2013. 32. Alan Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985. 33. David Gibson, Ravi Kumar, and Andrew Tomkins. Discovering large dense subgraphs in massive graphs. In Proc. of VLDB’05, pages 721–732, 2005. 34. Aristides Gionis and Charalampos E. Tsourakakis. Dense subgraph discovery: KDD 2015 tutorial. In Proc. of KDD’15, pages 2313–2314, 2015. 35. A. V. Goldberg. Finding a maximum density subgraph. Technical report, Berkeley, CA, USA, 1984. 36. Andrew V. Goldberg and Robert Endre Tarjan. A new approach to the maximum-flow problem. J. ACM, 35(4):921–940, 1988. 37. Oded Green, Llu´ıs-Miquel Mungu´ıa, and David A. Bader. Load balanced clustering coefficients. In Proc. of PPAA’14, pages 3–10, 2014. 38. J. E. Hirsch. An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102(46):16569–16572, 2005. 39. Xiaocheng Hu, Miao Qiao, and Yufei Tao. I/o-efficient join dependency testing, LoomisWhitney join, and triangle enumeration. J. Comput. Syst. Sci., 82(8):1300–1315, 2016. 40. Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. Massive graph triangulation. In Proc. of SIGMOD’13, pages 325–336, 2013. 41. Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, and Jeffrey Xu Yu. Querying k-truss community in large and dynamic graphs. In Proc. of SIGMOD’14, pages 1311–1322, 2014. 42. Xin Huang, Laks V. S. Lakshmanan, and Jianliang Xu. Community search over big graphs: Models, algorithms, and opportunities. In Proc. of ICDE’17, pages 1451–1454, 2017.
43. Alon Itai and Michael Rodeh. Finding a minimum circuit in a graph. SIAM J. Comput., 7(4):413–423, 1978.
44. Wissam Khaouid, Marina Barsky, Venkatesh Srinivasan, and Alex Thomo. K-core decomposition of large networks on a single PC. PVLDB, 9(1):13–23, 2015.
45. Samir Khuller and Barna Saha. On finding dense subgraphs. In Proc. of ICALP’09, pages 597–608, 2009.
46. M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. E. Stanley, and H. A. Makse. Identification of influential spreaders in complex networks. Nature Physics, 6:888–893, 2010.
47. Matthieu Latapy. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comput. Sci., 407(1–3):458–473, 2008.
48. Victor E. Lee, Ning Ruan, Ruoming Jin, and Charu C. Aggarwal. A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pages 303–336. 2010.
49. Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
50. Rong-Hua Li, Lu Qin, Jeffrey Xu Yu, and Rui Mao. Influential community search in large networks. PVLDB, 8(5):509–520, 2015.
51. Rong-Hua Li, Jeffrey Xu Yu, and Rui Mao. Efficient core maintenance in large dynamic graphs. IEEE Trans. Knowl. Data Eng., 26(10):2453–2465, 2014.
52. Yuan Li, Yuhai Zhao, Guoren Wang, Feida Zhu, Yubao Wu, and Shengle Shi. Effective k-vertex connected component detection in large-scale networks. In Proc. of DASFAA’17, pages 404–421, 2017.
53. Don R. Lick and Arthur T. White. k-degenerate graphs. Canadian Journal of Mathematics, 22:1082–1096, 1970.
54. Linyuan Lü, Tao Zhou, Qian-Ming Zhang, and H. E. Stanley. The h-index of a network node and its relation to degree and coreness. Nature Communications, 7:10168, 2016.
55. F. D. Malliaros, M.-E. G. Rossi, and M. Vazirgiannis. Locating influential nodes in complex networks. Scientific Reports, 6, 2016.
56. Fragkiskos D. Malliaros, Apostolos N. Papadopoulos, and Michalis Vazirgiannis. Core decomposition in graphs: Concepts, algorithms and applications. In Proc. of EDBT’16, pages 720–721, 2016.
57. Fragkiskos D. Malliaros and Michalis Vazirgiannis. Graph-based text representations: Boosting text mining, NLP and information retrieval with graphs. In Proc. of EMNLP’17, 2017.
58. David W. Matula. Determining edge connectivity in O(nm). In Proc. of FOCS’87, 1987.
59. David W. Matula and Leland L. Beck. Smallest-last ordering and clustering and graph coloring algorithms. J. ACM, 30(3):417–427, 1983.
60. Michael Mitzenmacher, Jakub Pachocki, Richard Peng, Charalampos E. Tsourakakis, and Shen Chen Xu. Scalable large near-clique detection in large-scale networks via sampling. In Proc. of KDD’15, pages 815–824, 2015.
61. Michael Mitzenmacher and Eli Upfal. Probability and computing - randomized algorithms and probabilistic analysis. Cambridge University Press, 2005.
62. Alberto Montresor, Francesco De Pellegrini, and Daniele Miorandi. Distributed k-core decomposition. In Proc. of PODC’11, pages 207–208, 2011.
63. Muhammad Anis Uddin Nasir, Aristides Gionis, Gianmarco De Francisci Morales, and Sarunas Girdzijauskas. Fully dynamic algorithm for top-k densest subgraphs. In Proc. of CIKM’17, pages 1817–1826, 2017.
64. Mark Ortmann and Ulrik Brandes. Triangle listing algorithms: Back from the diversion. In Proc. of ALENEX’14, pages 1–8, 2014.
65. Rasmus Pagh and Francesco Silvestri. The input/output complexity of triangle enumeration. In Proc. of PODS’14, pages 224–233, 2014.
66. Apostolos N. Papadopoulos, Apostolos Lyritsis, and Yannis Manolopoulos. Skygraph: an algorithm for important subgraph discovery in relational graphs. Data Min. Knowl. Discov., 17(1), August 2008.
67. Ha-Myung Park and Chin-Wan Chung. An efficient mapreduce algorithm for counting triangles in a very large graph. In Proc. of CIKM’13, pages 539–548, 2013.
68. Ha-Myung Park, Francesco Silvestri, U. Kang, and Rasmus Pagh. Mapreduce triangle enumeration with guarantees. In Proc. of CIKM’14, pages 1739–1748, 2014.
69. Francesco De Pellegrini, Alberto Montresor, and Daniele Miorandi. Distributed k-core decomposition. IEEE Transactions on Parallel & Distributed Systems, 24:288–300, 2013.
70. Lu Qin, Rong-Hua Li, Lijun Chang, and Chengqi Zhang. Locally densest subgraph discovery. In Proc. of KDD’15, pages 965–974, 2015.
71. Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proc. of AAAI’15, 2015.
72. François Rousseau and Michalis Vazirgiannis. Main core retention on graph-of-words for single-document keyword extraction. In Proc. of ECIR’15, pages 382–393, 2015.
73. Yousef Saad. Iterative methods for sparse linear systems. SIAM, 2003.
74. Ahmet Erdem Sariyüce, Bugra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Ümit V. Çatalyürek. Streaming algorithms for k-core decomposition. PVLDB, 6(6):433–444, 2013.
75. Ahmet Erdem Sariyüce, Bugra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Ümit V. Çatalyürek. Incremental k-core decomposition: algorithms and evaluation. VLDB J., 25(3):425–447, 2016.
76. Ahmet Erdem Sariyüce and Ali Pinar. Fast hierarchy construction for dense subgraphs. PVLDB, 10(3):97–108, 2016.
77. Ahmet Erdem Sariyüce and Ali Pinar. Peeling bipartite networks for dense subgraph discovery. In Proc. of WSDM’18, pages 504–512, 2018.
78. Ahmet Erdem Sariyüce, C. Seshadhri, Ali Pinar, and Ümit V. Çatalyürek. Finding the hierarchy of dense subgraphs using nucleus decompositions. In Proc. of WWW’15, pages 927–937, 2015.
79. Thomas Schank and Dorothea Wagner. Finding, counting and listing all triangles in large graphs, an experimental study. In Proc. of WEA’05, pages 606–609, 2005.
80. Pablo San Segundo, Alvaro Lopez, and Panos M. Pardalos. A new exact maximum clique algorithm for large and massive sparse graphs. Computers & Operations Research, 66:81–94, 2016.
81. Stephen B. Seidman. Network structure and minimum degree. Social Networks, 5(3):269–287, 1983.
82. Stephen B. Seidman and Brian L. Foster. A graph-theoretic generalization of the clique concept. The Journal of Mathematical Sociology, 6(1):139–154, 1978.
83. Julian Shun and Kanat Tangwongsan. Multicore triangle computations without tuning. In Proc. of ICDE’15, pages 149–160, 2015.
84. Mauro Sozio and Aristides Gionis. The community-search problem and how to plan a successful cocktail party. In Proc. of KDD’10, pages 939–948, 2010.
85. Mechthild Stoer and Frank Wagner. A simple min-cut algorithm. J. ACM, 44(4), 1997.
86. Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In Proc. of WWW’11, pages 607–614, 2011.
87. Nikolaj Tatti and Aristides Gionis. Density-friendly graph decomposition. In Proc. of WWW’15, pages 1089–1099, 2015.
88. Antoine J.-P. Tixier, Fragkiskos D. Malliaros, and Michalis Vazirgiannis. A graph degeneracy-based approach to keyword extraction. In Proc. of EMNLP’16, pages 1860–1870, 2016.
89. Charalampos E. Tsourakakis. The k-clique densest subgraph problem. In Proc. of WWW’15, pages 1122–1132, 2015.
90. Charalampos E. Tsourakakis, Jakub Pachocki, and Michael Mitzenmacher. Scalable motif-aware graph clustering. In Proc. of WWW’17, pages 1451–1460, 2017.
91. Michalis Vazirgiannis. Graph of words: Boosting text mining tasks with graphs. In Proc. of WWW’17, page 1181, 2017.
92. Jia Wang and James Cheng. Truss decomposition in massive networks. PVLDB, 5(9):812–823, 2012.
93. Jim Webber. The top 5 use cases of graph databases (white paper), 2015.
94. Dong Wen, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang. Enumerating k-vertex connected components in large graphs. CoRR, abs/1703.08668, 2017.
95. Dong Wen, Lu Qin, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. I/O efficient core graph decomposition at web scale. In Proc. of ICDE’16, 2016.
96. Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local community detection: On free rider effect and its elimination. PVLDB, 8(7):798–809, 2015.
97. Yubao Wu, Ruoming Jin, Xiaofeng Zhu, and Xiang Zhang. Finding dense and connected subgraphs in dual networks. In Proc. of ICDE’15, pages 915–926, 2015.
98. Xifeng Yan, X. Jasmine Zhou, and Jiawei Han. Mining closed relational graphs with connectivity constraints. In Proc. of KDD’05, 2005.
99. Long Yuan, Lu Qin, Xuemin Lin, Lijun Chang, and Wenjie Zhang. I/O efficient ECC graph decomposition via graph reduction. PVLDB, 9(7):516–527, 2016.
100. Long Yuan, Lu Qin, Xuemin Lin, Lijun Chang, and Wenjie Zhang. I/O efficient ECC graph decomposition via graph reduction. VLDB J., 26(2):275–300, 2017.
101. Yikai Zhang, Jeffrey Xu Yu, Ying Zhang, and Lu Qin. A fast order-based approach for core maintenance. In Proc. of ICDE’17, pages 337–348, 2017.
102. Rui Zhou, Chengfei Liu, Jeffrey Xu Yu, Weifa Liang, Baichen Chen, and Jianxin Li. Finding maximal k-edge-connected subgraphs from a large graph. In Proc. of EDBT’12, 2012.
103. Yuanyuan Zhu, Hao Zhang, Lu Qin, and Hong Cheng. Efficient mapreduce algorithms for triangle listing in billion-scale graphs. Distributed and Parallel Databases, 35(2):149–176, 2017.
Index
A
Adjacency array representation, 5, 6, 15, 37
Adjacency matrix, 4–5
Arboricity, 22–23, 56, 57
Array-based linear heap
  heads and ids, 15
  interface of, 15–17
  time complexity, 18
ArrayLinearHeap, 15, 16, 18, 25, 36, 67
Asynchronous algorithm, 37
Average degree-based densest subgraph computation
  definition, 42
  edge density, 41
  exact algorithm
    Densest-Exact algorithm, 50–52
    density testing, 47–50
    pruning, 52
  properties, 42
  streaming 2(1+ε)-approximation algorithm, 45–47
  2-approximation algorithm, 43–45
B
Bottom-up approach, 93–95
Brandes, Ulrik, 58
Breadth-First Search, 80
C
Chiba, Norishige, 56, 62
Cohesive subgraph computation, 7–8
Community search, 8
Complexity analysis, 6
Compressed sparse row (CSR), 5, 57
Connected k-core, 21, 26, 27, 29–32, 68
Connectivity-aware multiway partition
  optimization, 85–87
  pseudocode, 87–88
  time complexity, 88
Connectivity-aware two-way partition
  maximum adjacency search, 81–85
  pseudocode, 80, 81
ConnGraph-BU, 94
Core decomposition, 21–39, 55, 64–70
CoreD-Local-opt algorithm, 34–36
Core hierarchy tree, 27–31
Core number, 22–25, 29–35, 37–39, 66
Core spanning tree, 31–32
D
Degeneracy, 3, 4, 22–25, 30–32, 34, 42–44, 59, 63, 93–95
Degeneracy ordering, 24, 25, 30–32, 34, 43, 44, 59, 63
Degree decreasing ordering, 59, 61
Degree increasing ordering, 59, 61
Densest-Exact algorithm, 50–52
Densest-Greedy algorithm, 43–44, 52
Densest-Streaming algorithm, 45–47
Densest subgraph, 7, 41–53, 55, 71–75
Density testing, 47–50, 73
Disjoint-set data structure, 28–27
Divide-and-conquer approach, 95–98
Dominating set, 83, 84
E
Edge connectivity-based graph decomposition
  bottom-up approach, 93–95
  definitions, 77–78
  divide-and-conquer approach, 95–98
  graph cut and global min-cut, 78–79
© Springer Nature Switzerland AG 2018 L. Chang, L. Qin, Cohesive Subgraph Computation over Large Sparse Graphs, Springer Series in the Data Sciences, https://doi.org/10.1007/978-3-030-03599-0
Edge connectivity-based graph decomposition (cont.)
  k-edge connected components
    connectivity-aware multiway partition, 85–88
    connectivity-aware two-way partition, 80–85
    graph partition-based framework, 79–80
    KECC algorithm, 88–90
    randomized algorithm, 90–92
  properties, 78
  Steiner connectivity, 92, 93
  3-edge connected components, 77, 78
  top-down approach, 95
Edge density, 41, 71
Exact algorithm
  Densest-Exact algorithm, 50–52
  density testing, 47–50
  k-clique densest subgraph computation, 73–75
  pruning, 52
F
Forced contraction, 92
G
Global min-cut, 78, 79, 81–83
Goldberg, A.V., 73
Goldberg’s algorithm, 73
Graph cut, 78
Graph memory allocation, 5, 6
Graph terminologies, 1–2
GreedyDensest, 43, 44, 52
H
Higher-order structures
  k-clique densest subgraph computation
    approximation algorithms, 71–73
    exact algorithms, 73–75
  k-clique enumeration
    K3 algorithm, 56–58
    KClique-ChibaN, 62–63
    oriented graph, 58–61
    TriE, 59, 61, 63–64
  nucleus decomposition, 68–70
  truss decomposition, 64–66
    peeling algorithm, 66–67
h-index-based core decomposition
  local algorithm, 33–34
  optimization algorithm, 34–36
I
Influential nodes, 8
I/O-efficient core decomposition, 32, 37–39
Itai, Alon, 56
K
K3 algorithm, 56–58
KClique-ChibaN, 62–63
k-clique densest subgraph computation (CDS)
  approximation algorithms, 71–73
  exact algorithms, 73–75
k-clique density, 71, 73
k-clique enumeration algorithms
  K3 algorithm, 56–58
  KClique-ChibaN, 62–63
  oriented graph, 58–61
  TriE, 59, 61, 63–64
k-core, 7, 8, 21–27, 29–33, 64, 66, 68, 87
KECC algorithm, 79, 88–90
k-edge connected components
  connectivity-aware multiway partition, 85–88
  connectivity-aware two-way partition, 80–85
  graph partition-based framework, 79–80
  KECC algorithm, 88–90
  randomized algorithm, 90–92
Koblenz Network Collection (KONECT), 3
Kruskal’s algorithm, 32
k-truss, 7, 8, 64–68, 75
L
Laboratory for Web Algorithmics (LAW), 3
Large sparse graphs, 4–6
LazyLinearHeap, 18–20
Lazy-update linear heap, 18–20
Linear heap data structures
  array-based linear heap
    heads and ids, 15
    interface of, 15–17
    time complexity, 18
  lazy-update linear heap, 18–20
  ListLinearHeap (see Linked list-based linear heap)
Linear-time algorithm
  connected k-core, 27
  core hierarchy tree, 29–31
  core spanning tree, 31–32
  disjoint-set data structure, 28–27
  k-core computation, 26–27
  peeling algorithm, 23–26
Linked list-based linear heap
  actual storage, 10
  conceptual view, 10
  interface, 10–11
    init, 11–12
    insert/remove, 12
    key value of element, 12, 13
    Pop/Get Min/Max, 13
  pres and nexts, 9
  time complexity, 13–14
ListLinearHeap, 10, 11, 13–20, 25, 67
M
Maximum adjacency search (MAS)
  connectivity-aware multiway partition, 86–88
  global min-cut computation, 81–82
  two-way partition, 83–85
Minimum degree-based core decomposition
  definition, 21–22
  degeneracy and arboricity, 22–23
  h-index-based core decomposition, 32–36
  I/O-efficient core decomposition, 37–39
  linear-time algorithm
    connected k-core, 27
    core hierarchy tree, 29–31
    core spanning tree, 31–32
    disjoint-set data structure, 28–27
    k-core computation, 26–27
    peeling algorithm, 23–26
  parallel/distributed core decomposition, 36–37
N
Network Repository, 3
Nishizeki, Takao, 56, 62
Nucleus decomposition, 55, 64, 68–70
Nucleus number, 69, 70
O
Oriented graph, 56, 58, 59, 63
Ortmann, Mark, 58
P
Parallel/disjoint core decomposition, 21, 32, 36–37, 88, 90, 91, 98
Parallel/distributed core decomposition, 36–37
Partition-MultiWay, 85–88
Partition tree, 80
Path compression optimization, 28–29
Peeling algorithm
  linear-time algorithm, 23–26
  nucleus decomposition, 69–70
  truss decomposition, 66–67
PeelNucleus, 70
Power-law distribution, 4
Pruning for densest subgraph, 52
R
RandomContract algorithm, 90–92
Real graph datasets, 3–4
Real-time story identification, 8
Real-world graphs, 3–4
Rodeh, Michael, 56
S
Smallest-first ordering, 24, 59–61
Spam detection, 8
Stanford Network Analysis Project (SNAP), 3
Steiner connectivity, 92–94, 97
Synchronous algorithm, 37
T
Time complexity, 13–14, 91
  array-based linear heap, 18
  linked list-based linear heap, 13–14
θ-approximation algorithm, 43
Top-down approach, 95
Triangle enumeration, 55–62, 75
TriE, 59, 61, 63–64
Truss decomposition, 64–66
  peeling algorithm, 66–67
Truss number, 65–67
2-approximation algorithm, 43–45
U
Union by rank optimization, 28–29
Unweighted undirected graph, 2