Plug-and-Play Visual Subgraph Query Interfaces (ISBN 3031161610, 9783031161612)

This book details recent developments in the emerging area of plug-and-play (PnP) visual subgraph query interfaces (VQIs).


English, 181 pages, 2023


Table of contents :
Foreword by the Series Editor
Preface
Contents
About the Authors
1 The Future is Democratized Graphs
1.1 Querying Graphs
1.2 Subgraph Query Formulation Process
1.3 Graph Query Languages
1.4 Toward Graph Databases for All!
1.5 Visual Subgraph Query Interfaces (VQIs)
1.6 Limitations of Existing VQI
1.7 Plug-and-Play (PnP) Interfaces—Democratizing Subgraph Querying
1.8 Overview of This Book
1.9 Scope
2 Background
2.1 Graph Terminology
2.1.1 Subgraph Isomorphism-Related Terminology
2.1.2 Maximum (Connected) Common Subgraph
2.1.3 k-Truss
2.1.4 Types of Graph Collection
2.2 Cognitive Load
2.3 Usability
2.4 Conclusions
3 The World of Visual Graph Query Interfaces—An Overview
3.1 Visual Subgraph Query Formulation (VQF) Approaches
3.2 Visual Subgraph Query Interfaces (VQI)
3.2.1 First Generation VQI
3.2.2 Second Generation VQI
3.2.3 Third Generation VQI
3.3 Comparative Analysis
3.4 Conclusions
4 Plug-and-Play Visual Subgraph Query Interfaces
4.1 Assumptions Made by Existing VQI
4.2 Limitations of Existing VQI
4.3 Design Principles of Plug-and-Play VQI
4.4 Plug-and-Play (PnP) Interface
4.4.1 PnP Template
4.4.2 Plug
4.4.3 PnP Engine
4.4.4 Play Mode
4.5 Benefits of PnP Interfaces
4.6 Conclusions
5 The Building Block of PnP Interfaces: Canned Patterns
5.1 Characteristics of Canned Patterns
5.2 Quantifying Coverage
5.3 Quantifying Diversity
5.4 Quantifying Cognitive Load
5.5 Conclusions
6 Pattern Selection for Graph Databases
6.1 Closure Graph
6.2 Canned Pattern Selection Problem
6.3 The CATAPULT Framework
6.4 Cluster Summary Graph (CSG) Generation
6.4.1 Small Graph Clustering
6.4.2 Generation of CSGs
6.4.3 Handling Larger Graph Databases
6.5 Selection of Canned Patterns
6.6 Selection of Basic Patterns
6.7 Performance Study
6.7.1 Experimental Setup
6.7.2 Experimental Results
6.8 AURORA—A PnP Interface for Graph Databases
6.8.1 VQI Structure
6.8.2 Pattern-at-a-time Query Formulation
6.8.3 User Experience and Feedback
6.9 Conclusions
7 Pattern Selection for Large Networks
7.1 The CPS Problem
7.2 Categories of Canned Patterns
7.2.1 Topologies of Real-World Queries
7.2.2 Topologies of Canned Patterns
7.3 Candidate Pattern Generation
7.3.1 Truss-Based Graph Decomposition
7.3.2 Patterns from a TIR Graph
7.3.3 Patterns from a TOR Graph
7.4 Selection of Canned Patterns
7.4.1 Theoretical Analysis
7.4.2 Quantifying Coverage and Similarity
7.4.3 CPS-Randomized Greedy Algorithm
7.5 Performance Study
7.5.1 Experimental Setup
7.5.2 User Study
7.5.3 Automated Performance Study
7.6 PLAYPEN—A PnP Interface for Large Networks
7.6.1 Pattern-at-a-Time Query Formulation
7.6.2 User Experience and Feedback
7.7 Conclusions
8 Maintenance of Patterns
8.1 The CPM Problem
8.1.1 Problem Definition
8.1.2 Design Challenges
8.1.3 Scaffolding Strategy
8.1.4 Selective Maintenance Strategy
8.2 The MIDAS Framework
8.3 Maintenance of Clusters and CSGs
8.3.1 Closure Property of FCT
8.3.2 Maintenance of FCT
8.3.3 Maintenance of Graph Clusters
8.3.4 Maintenance of CSG Set
8.4 Candidate Pattern Generation
8.4.1 FCT-Index and IFE-Index
8.4.2 Pruning-Based Candidate Generation
8.5 Canned Pattern Maintenance
8.5.1 Pattern Score
8.5.2 Swap-based Pattern Maintenance
8.6 Maintenance of Basic Patterns
8.7 Performance Study
8.7.1 Experimental Setup
8.7.2 User Study
8.7.3 Experimental Results
8.8 MIDAS in AURORA
8.9 Conclusions
9 The Road Ahead
9.1 Summary
9.1.1 Plug-and-Play (PnP) Interfaces
9.1.2 Canned Patterns—The Building Block of PnP Interfaces
9.1.3 Pattern Selection for Graph Databases
9.1.4 Pattern Selection for Large Networks
9.1.5 Pattern Maintenance
9.1.6 Usability Results
9.2 Future Directions
References
Index

Synthesis Lectures on Data Management

Sourav S. Bhowmick · Byron Choi

Plug-and-Play Visual Subgraph Query Interfaces

Synthesis Lectures on Data Management Series Editor H. V. Jagadish, University of Michigan, Ann Arbor, MI, USA

This series publishes lectures on data management. Topics include query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia data management, data mining, and related subjects.

Sourav S. Bhowmick Byron Choi •

Plug-and-Play Visual Subgraph Query Interfaces

123

Sourav S. Bhowmick School of Computer Science and Engineering Nanyang Technological University Singapore, Singapore

Byron Choi Hong Kong Baptist University Hong Kong S.A.R., China

ISSN 2153-5418 ISSN 2153-5426 (electronic) Synthesis Lectures on Data Management ISBN 978-3-031-16161-2 ISBN 978-3-031-16162-9 (eBook) https://doi.org/10.1007/978-3-031-16162-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword by the Series Editor

Graph data has been gaining importance over the years. More and more situations have data best represented in graph form, ranging from knowledge bases to social networks. This has led to considerable interest in querying graph data. Not surprisingly, there is a great deal of work in this area and many clever ways suggested in the literature. Any query language must have an underlying logic and formalism. The most natural way to express a query is through a textual statement of this underlying logic. A prime example of this is SQL, whose core is a text representation of relational algebra. However, users often find it difficult to express their query needs in the text that correctly states their desire in terms of the logic of the stored data. Hence, there has been a search to define more usable query interfaces. Visual query interfaces have been particularly important in this regard. With structured data, visual query interfaces are designed with the structure internalized. For example, if we assume that a relational database comprises tables connected by means of primarykey/foreignkey joins, we can define a visual query language that relies on this structure: for instance, a user could specify values for some attributes in a relation and some attributes in a joined relation. However, graph data tends to be much richer. For example, it is meaningful for a user to query for a node with high degree—a query with no obvious parallel in the relational world. This makes it challenging to design a visual query interface for graph data. This monograph presents what the authors call a “Plug and Play” interface for graph data. The idea is to have a (large) library of templates that cover most common cases, and a simple specification mechanism that appropriately instantiates selected templates in a manner specific to a particular graph database. The end result is a customized, easy-to-use interface. This Synthesis Lectures series includes many works of importance for the field of databases, with their importance being derived from diverse dimensions. Some books are important because they present an excellent survey of the state of the art in a topic of great interest; others are important because they cover a particularly novel research direction that a research group is pursuing, whose holistic presentation in a book format permits the authors to argue for their research vision in a manner not possible in focused research


papers with tight page limits. This particular book is a great example of the latter: it advances the frontiers of our field and promotes discussion of a new approach. Please enjoy reading it, then discuss, criticize, and praise, as you see fit.

Ann Arbor, MI, USA

H. V. Jagadish

Preface

If the user can’t use it, it doesn’t work. Susan Dray, President, Dray & Associates, Inc.

Law is not primarily for lawyers or judges—it applies to everyone. However, most people are unable to comprehend legal language on their own. Though technically they can access the law, they are not in any position to vindicate their own rights or defend themselves against legal challenges. One of the barriers to public access to justice is the challenge posed by legal terminology. For example, the archaic terms "in camera" and "subpoena" are not only difficult for the public to understand but may also create misunderstandings. For instance, "in camera" may be interpreted by someone as appearing in the courthouse through Zoom! Certainly, it is much more intuitive to replace these terms with "in private" and "order to attend court", respectively. Such simplification can potentially make an ordinary person's experience with the law more palatable, thereby enabling greater access to justice. The law impacts everyone and hence should be comprehensible by everyone.

Data management tools, like the law, are no longer primarily for database experts and administrators. They should be accessible to everyone in an increasingly data-democratized and data-driven world. However, query languages—the primary means to access data residing in databases—prevent diverse end users who are not proficient in these languages from taking advantage of these tools for their tasks. That is, query languages are like legal terminology that can only be understood and written by database professionals and experts. Since data impacts almost all aspects of life nowadays, it should be easily accessed and searched by end users with diverse skills and backgrounds. Visual query interfaces are designed to alleviate the access challenge by enabling end users to access and search data through the interactive construction of queries without resorting to any query languages.

Given the ubiquity of graphs to model data in a wide variety of domains (e.g., biology, chemistry, ecology, social science, and journalism), this book reports recent work in building visual query interfaces to democratize access to graph data. The subgraph search query, which is typically represented as a connected graph, is one of the most popular query paradigms for accessing graph data. Since graphs are intuitive to draw, increasingly graph data management tools from academia and industry are exposing visual subgraph query interfaces (VQIs) to enable an end user to draw a


subgraph search query interactively instead of formulating it textually using a graph query language. However, these classical VQIs suffer from several limitations such as high creation and maintenance cost, lack of superior support for visual subgraph query formulation, and poor portability across application domains and data sources that hinder their democratization. This book presents the paradigm of plug-and-play VQI, as it stands today, that addresses these limitations. In particular, a broad goal of this book is to draw on well-founded principles of human-computer interaction (HCI) and cognitive psychology to enhance the usability and reach of subgraph query formulation frameworks. Note that it is reasonable to expect this picture to evolve with time. Our discussion is divided into four parts, moving from “softer” aspects of visual interfaces (e.g., usability, cognitive load) to “harder” aspects of realizing them (e.g., algorithms and data structures) in order to build plug-and-play visual subgraph query interfaces. First, we review, as accurately as possible, a spectrum of classical visual interfaces to enable subgraph query formulation. We discuss their advantages and limitations w.r.t. usability and their impact on the democratization of subgraph querying tools to wider communities. Second, we introduce the novel paradigm of plug-and-play visual subgraph query interface (i.e., PnP interface). In particular, we describe its architecture, how it can address the limitations of classical VQIs, and the challenges that need to be addressed in order to realize it in practice. Third, we review frameworks that construct and maintain PnP interfaces. Specifically, we introduce recent visual subgraph query formulation frameworks that depart from the traditional mantra of “manual” VQI construction by exploring a paradigm that automatically generates and maintains a VQI for a given graph data source in a data-driven manner without resorting to any coding. A user can simply plug a PnP interface on top of his or her graph data source and play by formulating subgraph queries visually without resorting to any graph query languages. In particular, a pervasive desire of this review is to emphasize the role of cognitive load-aware “representative objects” in a VQI that facilitates top-down and bottom-up query formulation effortlessly. The last topic consists of several open problems in this young field. The list presented should by no means be considered exhaustive and is centered around challenges and issues currently in vogue. Nevertheless, readers can benefit by exploring the research directions given in this part. The book is suitable for use in advanced undergraduate and graduate-level courses on graph data management. It has sufficient material that can be covered as part of a semesterlong course, thereby leaving plenty of room for an instructor to choose topics. An undergraduate course in algorithms, graph theory, database technology, and basic HCI should suffice as a prerequisite for most of the chapters. A good knowledge of C++/Java programming language is sufficient to code the algorithms described herein. We have also made the code base of some of the frameworks available through GitHub links. For completeness, we have provided background information on several topics in Chap. 2:


fundamental graph and subgraph query terminology and concepts related to HCI and cognitive psychology. The knowledgeable reader may omit this chapter and perhaps refer back to it while reading later chapters of the book. We hope that this book will serve as a catalyst in helping this burgeoning area of plug-and-play query interfaces that lie at the intersection of data management, HCI, and cognitive psychology to grow and have a practical impact.

Singapore, Singapore
Hong Kong S.A.R., China
July 2022

Sourav S. Bhowmick Byron Choi

Acknowledgments

It is a great pleasure for us to acknowledge the assistance and contributions of a large number of individuals to this effort. First, we would like to thank our publishers Morgan & Claypool and Springer Nature for their support. In particular, we would like to acknowledge the efforts, help, and patience of Diane Cerra and Christine Kiilerich, our primary contacts for this edition. The majority of the work reported in this book grew out of the DAta-driven Visual INterface Construction EngIne (DAVINCI) project at the Nanyang Technological University (NTU), Singapore. In this project, our broad goal is to explore the paradigm of data-driven visual query interface construction to enable effective top-down and bottom-up searches. Specifically, some of the chapters are published in ACM SIGMOD and VLDB, two premium data management venues. Details related to the DAVINCI project can be found at https://personal.ntu.edu.sg/assourav/research/hint/index.html. Dr. Huey-Eng Chua of NTU, who was a key collaborator for this project, deserves the first thank you. She continuously provided high-quality management of this project by working with all stakeholders. This project would not have been successful without her contributions. In addition, we would also like to express our gratitude to all the group members and collaborators, past and present. In particular, Kai Huang (NTU & Fudan University), Zifeng Yuan (NTU & Fudan University), Zekun Ye (NTU & Fudan University), Prof. Curtis Dyreson (Utah State University), Prof. Shuigeng Zhou (Fudan University), and Prof. Wook-Shin Han (POSTECH) made substantial contributions to the broader aspect of our research on PnP interfaces. Quite a few people have helped us with the initial vetting of the text for this book. It is our pleasure to acknowledge them all here. We would like to thank Springer Nature for carefully proofreading the complete book in a short span of time and suggesting the changes which have been incorporated. We would like to acknowledge our parents and family members who gave us incredible support throughout the years. They were the major force behind our continuous strive for breaking out from the comfort zone of computer science to explore problems that are at the intersection of two or more disparate areas and along the way appreciate the


importance of softer aspects of technology. It has been and continues to be a great learning experience for us. Special thanks go to Professor H. V. Jagadish (UMich, USA) for giving us the opportunity to author this book. Finally, we would like to thank the MOE Singapore AcRF Tier 1 and Tier 2 for the generous financial support provided for the DAVINCI project. We would also like to thank the School of Computer Science and Engineering at the Nanyang Technological University for allowing the use of their resources to help complete the book. The work done at the Department of Computer Science at HKBU is partially supported by HKRGC GRF 12201119 and 12201518.

July 2022

Sourav S. Bhowmick Byron Choi

Contents

1 The Future is Democratized Graphs ... 1
1.1 Querying Graphs ... 1
1.2 Subgraph Query Formulation Process ... 2
1.3 Graph Query Languages ... 4
1.4 Toward Graph Databases for All! ... 5
1.5 Visual Subgraph Query Interfaces (VQIs) ... 7
1.6 Limitations of Existing VQI ... 8
1.7 Plug-and-Play (PnP) Interfaces—Democratizing Subgraph Querying ... 10
1.8 Overview of This Book ... 11
1.9 Scope ... 13
References ... 13

2 Background ... 15
2.1 Graph Terminology ... 15
2.1.1 Subgraph Isomorphism-Related Terminology ... 15
2.1.2 Maximum (Connected) Common Subgraph ... 16
2.1.3 k-Truss ... 17
2.1.4 Types of Graph Collection ... 17
2.2 Cognitive Load ... 18
2.3 Usability ... 19
2.4 Conclusions ... 19
References ... 20

3 The World of Visual Graph Query Interfaces—An Overview ... 21
3.1 Visual Subgraph Query Formulation (VQF) Approaches ... 22
3.2 Visual Subgraph Query Interfaces (VQI) ... 23
3.2.1 First Generation VQI ... 23
3.2.2 Second Generation VQI ... 25
3.2.3 Third Generation VQI ... 25
3.3 Comparative Analysis ... 27
3.4 Conclusions ... 28
References ... 28

4 Plug-and-Play Visual Subgraph Query Interfaces ... 29
4.1 Assumptions Made by Existing VQI ... 29
4.2 Limitations of Existing VQI ... 30
4.3 Design Principles of Plug-and-Play VQI ... 31
4.4 Plug-and-Play (PnP) Interface ... 32
4.4.1 PnP Template ... 33
4.4.2 Plug ... 34
4.4.3 PnP Engine ... 35
4.4.4 Play Mode ... 36
4.5 Benefits of PnP Interfaces ... 36
4.6 Conclusions ... 37
Reference ... 38

5 The Building Block of PnP Interfaces: Canned Patterns ... 39
5.1 Characteristics of Canned Patterns ... 39
5.2 Quantifying Coverage ... 42
5.3 Quantifying Diversity ... 43
5.4 Quantifying Cognitive Load ... 43
5.5 Conclusions ... 47
References ... 47

6 Pattern Selection for Graph Databases ... 49
6.1 Closure Graph ... 51
6.2 Canned Pattern Selection Problem ... 52
6.3 The CATAPULT Framework ... 53
6.4 Cluster Summary Graph (CSG) Generation ... 55
6.4.1 Small Graph Clustering ... 55
6.4.2 Generation of CSGs ... 60
6.4.3 Handling Larger Graph Databases ... 60
6.5 Selection of Canned Patterns ... 62
6.6 Selection of Basic Patterns ... 67
6.7 Performance Study ... 68
6.7.1 Experimental Setup ... 68
6.7.2 Experimental Results ... 69
6.8 AURORA—A PnP Interface for Graph Databases ... 78
6.8.1 VQI Structure ... 78
6.8.2 Pattern-at-a-time Query Formulation ... 79
6.8.3 User Experience and Feedback ... 79
6.9 Conclusions ... 80
References ... 81

7 Pattern Selection for Large Networks ... 83
7.1 The CPS Problem ... 85
7.2 Categories of Canned Patterns ... 86
7.2.1 Topologies of Real-World Queries ... 86
7.2.2 Topologies of Canned Patterns ... 88
7.3 Candidate Pattern Generation ... 90
7.3.1 Truss-Based Graph Decomposition ... 91
7.3.2 Patterns from a TIR Graph ... 93
7.3.3 Patterns from a TOR Graph ... 98
7.4 Selection of Canned Patterns ... 100
7.4.1 Theoretical Analysis ... 101
7.4.2 Quantifying Coverage and Similarity ... 103
7.4.3 CPS-Randomized Greedy Algorithm ... 104
7.5 Performance Study ... 107
7.5.1 Experimental Setup ... 107
7.5.2 User Study ... 108
7.5.3 Automated Performance Study ... 112
7.6 PLAYPEN—A PnP Interface for Large Networks ... 117
7.6.1 Pattern-at-a-Time Query Formulation ... 117
7.6.2 User Experience and Feedback ... 118
7.7 Conclusions ... 119
References ... 120

8 Maintenance of Patterns ... 123
8.1 The CPM Problem ... 125
8.1.1 Problem Definition ... 125
8.1.2 Design Challenges ... 126
8.1.3 Scaffolding Strategy ... 127
8.1.4 Selective Maintenance Strategy ... 128
8.2 The MIDAS Framework ... 131
8.3 Maintenance of Clusters and CSGs ... 132
8.3.1 Closure Property of FCT ... 132
8.3.2 Maintenance of FCT ... 133
8.3.3 Maintenance of Graph Clusters ... 135
8.3.4 Maintenance of CSG Set ... 136
8.4 Candidate Pattern Generation ... 136
8.4.1 FCT-Index and IFE-Index ... 136
8.4.2 Pruning-Based Candidate Generation ... 139
8.5 Canned Pattern Maintenance ... 141
8.5.1 Pattern Score ... 141
8.5.2 Swap-based Pattern Maintenance ... 143
8.6 Maintenance of Basic Patterns ... 146
8.7 Performance Study ... 147
8.7.1 Experimental Setup ... 147
8.7.2 User Study ... 148
8.7.3 Experimental Results ... 151
8.8 MIDAS in AURORA ... 156
8.9 Conclusions ... 157
References ... 158

9 The Road Ahead ... 159
9.1 Summary ... 159
9.1.1 Plug-and-Play (PnP) Interfaces ... 159
9.1.2 Canned Patterns—The Building Block of PnP Interfaces ... 160
9.1.3 Pattern Selection for Graph Databases ... 160
9.1.4 Pattern Selection for Large Networks ... 161
9.1.5 Pattern Maintenance ... 162
9.1.6 Usability Results ... 162
9.2 Future Directions ... 163
References ... 165

References ... 167
Index ... 169

About the Authors

Sourav S. Bhowmick is an Associate Professor at the School of Computer Science and Engineering (SCSE), Nanyang Technological University, Singapore. His core research expertise is in data management, human–data interaction, and data analytics. His research has appeared in premium venues such as ACM SIGMOD, VLDB, ACM WWW, ACM MM, ACM SIGIR, VLDB Journal, Bioinformatics, and Biophysical Journal. He is a co-recipient of Best Paper Awards in ACM CIKM 2004, ACM BCB 2011, and VLDB 2021 for work on mining structural evolution of tree-structured data, generating functional summaries, and scalable attributed network embedding, respectively. He is a co-recipient of the 2021 ACM SIGMOD Research Highlights Award. Sourav is serving as a member of the SIGMOD Executive Committee, a regular member of the PVLDB Advisory Board, and a co-lead in the committee for Diversity and Inclusion in Database Conference Venues. He is a co-recipient of the VLDB Service Award in 2018 from the VLDB Endowment. He was inducted into Distinguished Members of the ACM in 2020. Byron Choi is the Associate Head and an Associate Professor at the Department of Computer Science, Hong Kong Baptist University (HKBU). His research interests include graph-structured databases, database usability, database security, and time series analysis. Byron’s publications have appeared in premium venues such as TKDE, VLDBJ, SIGMOD, PVLDB/VLDB, and ICDE. He has served as a program committee member or reviewer of premium conferences and journals, including PVLDB, VLDBJ, ICDE, TKDE, and TOIS. He was awarded a distinguished program committee (PC) member from ACM SIGMOD 2021 and the best reviewers award from ACM CIKM 2021. He received the distinguished reviewer award from PVLDB 2019. He has served as the director of a Croucher Foundation Advanced Study Institute (ASI), titled “Frontiers in Big Data Graph Research”, in 2015. He was a recipient of the HKBU President’s Award for Outstanding Young Researcher in 2016.


1

The Future is Democratized Graphs

Graphs (a.k.a. networks) are ubiquitous nowadays in many application domains (e.g., retail and eCommerce, transportation and logistics, healthcare, pharmaceuticals, and life sciences) as they provide powerful abstractions to model complex structures and relationships. Consequently, graph data management tools are expected to play a pivotal role in diverse applications such as customer analytics, fraud detection and prevention, supply chain management, and scientific data analysis. Markets and Markets anticipates that the global graph database market will grow from USD 1.9 billion in 2021 to USD 5.1 billion by 2026 (Markets and Markets 2022). Given such growth opportunities, it is paramount for graph data management tools to be user-friendly, efficient, and scalable in order to support the growing demands of diverse end users and applications.

1.1

Querying Graphs

Querying graphs is a key component in any graph data management tool. Although keyword-based search (Wang and Aggarwal 2010) is the simplest paradigm to query graphs, such queries have limited flexibility as they disallow the specification of structural constraints on graphs. Consequently, the most common and important query primitive for graphs is subgraph search (also referred to as subgraph or graph query), where we want to retrieve one or more subgraphs in a graph G that exactly or approximately match a user-specified query graph Q. Exact subgraph search strictly searches for isomorphic subgraphs in G that match Q. These queries are typically referred to as subgraph matching (Sun and Luo 2020) or subgraph enumeration (Afrati et al. 2013) queries based on whether Q is a labeled or unlabeled query graph, respectively. On the other hand, a similar or approximate search allows the topology of the query graph to be mismatched to a certain degree. These approaches utilize edit distance (Bunke and Kim 1998), common connected subgraphs (Shang et al. 2010), or graph homomorphism (Fan et al. 2010; Song et al. 2018) to retrieve similar query results.
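
To make the distinction concrete, the following is a minimal sketch (ours, not from the book) of exact subgraph search in Python, assuming the networkx library: it enumerates label-preserving subgraph isomorphisms of a small query graph Q in a toy labeled data graph G. The graphs and gene-name labels are purely illustrative.

    import networkx as nx
    from networkx.algorithms import isomorphism

    # Toy labeled data graph G; vertex labels stand in for, e.g., gene names.
    G = nx.Graph()
    G.add_nodes_from([(1, {"label": "BID"}), (2, {"label": "BCL2"}),
                      (3, {"label": "CASP3"}), (4, {"label": "APAF1"})])
    G.add_edges_from([(1, 2), (2, 3), (3, 4), (2, 4)])

    # Query graph Q: a labeled triangle.
    Q = nx.Graph()
    Q.add_nodes_from([("a", {"label": "BCL2"}), ("b", {"label": "CASP3"}),
                      ("c", {"label": "APAF1"})])
    Q.add_edges_from([("a", "b"), ("b", "c"), ("a", "c")])

    # Enumerate subgraphs of G isomorphic to Q whose vertex labels agree.
    matcher = isomorphism.GraphMatcher(
        G, Q, node_match=isomorphism.categorical_node_match("label", None))
    for mapping in matcher.subgraph_isomorphisms_iter():
        print(mapping)  # {2: 'a', 3: 'b', 4: 'c'}: data vertex -> query node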


A common thread across all these different query types is the formulation of a graph topology which is matched against the underlying graph data based on the constraints specified in a query graph. How does a user create such a topology? Although considerable efforts have been invested toward efficient and scalable processing of graphs, surprisingly, to the best of our knowledge, there is no systematic study that answers this question. At first glance, it may seem that this is a straightforward problem. Indeed, most existing subgraph querying research assumes the existence of a query graph and invests its efforts on efficient query processing algorithms and data structures. However, as we shall see later, it is not necessarily easy for diverse end users to formulate query graphs that return meaningful results. Although the importance of efficient and scalable graph processing techniques is undeniable for practical usage, the issue of subgraph query formulation is of significance too since it precedes query processing. Understanding the challenges faced by end users during query formulation paves the way for designing platforms that facilitate the query construction process in practice. A palatable subgraph query formulation experience galvanizes diverse end users to exploit powerful graph query processing engines for their tasks, thereby catalyzing the growth and democratization of graph data management tools to wider communities. A graph query processor, otherwise, has no practical usage to them if they fail to formulate subgraph queries on it to express their search goals! If users cannot formulate queries, a powerful query engine is of no use to them.

1.2

Subgraph Query Formulation Process

Given the lack of study on the subgraph query formulation process, we resort to Pirolli and Card’s sensemaking model (Pirolli and Card 2005) that distinguishes between information processing tasks that are either top-down (from theory to data) or bottom-up (from data to theory) to intuitively describe it. A user may formulate a query directly based on a pattern “in their head” (i.e., top-down formulation) or based on the data presented to him/her by the system (i.e., bottom-up formulation). In the context of subgraph search, in top-down search, one is expected to possess a precise knowledge about the attributes in the underlying graph repository as well as topologies-of-interest to formulate meaningful queries. On the other hand, bottom-up formulation refers to search when a user does not have an upfront knowledge of the complete query structure. That is, she may have some concepts or keywords in the head but is unaware of how they form a connected query graph structure that may result in a meaningful query. Hence, she may get acquainted to the key substructures that exist in the dataset through representative objects (e.g., query results, visualization of data, and recommendations) to galvanize subgraph query formulation. This is consistent with

Pirolli and Card's sensemaking model, where bottom-up processes are "data-driven" tasks triggered by "noticing something of interest in data" (Pirolli and Card 2005). Intuitively, the bottom-up search paradigm is crucial for the democratization of graph data management systems, as many end users may not always have a pattern in their head to formulate queries. Note that this issue is not unique to graphs, as a similar phenomenon underscoring the importance of bottom-up search has also been observed recently for sequence data (e.g., time series) (Lee et al. 2020). The following example scenario motivates the need to support bottom-up search in the context of subgraph search.

Fig. 1.1 Apoptotic pathway of (a) C. elegans and (b) human in the BioGRID database; (c) a subgraph search query on (b); (d) a query result

Example 1.1 A key task for exploring new drugs at preclinical and clinical trials during drug development is the identification of suitable animal models with disease-related biological processes that are highly similar to those in human (Perlman 2016). Bob, a biologist, is interested in finding out if C. elegans is a suitable animal model for studying apoptosis in human. Given the knowledge of protein-protein interaction (PPI) of genes related to apoptosis in C. elegans (Fig. 1.1a), he first needs to identify the corresponding homologs¹ in its human counterpart (Fig. 1.1b). This can be done through a literature search or using databases such as OrthoList (Shaye and Greenwald 2011). Bob found four relevant genes (i.e., EGL-1, CED-3, CED-4, and CED-9) in C. elegans. These genes are shown as color-coded vertices in Fig. 1.1a. Next, he finds the corresponding homolog genes (BID, CASP3, APAF1, and BCL2, respectively) in human. Using these homolog genes, Bob wishes to search the human PPI (Fig. 1.1b) to determine if C. elegans is an appropriate organism for studying the apoptosis process in human.

Bob, however, is unsure about how these genes interact with each other in human. As a biologist, he is aware that there may not be a strict correspondence between the two models due to evolution. Consequently, there may not exist an exact match of the subgraph structure involving EGL-1, CED-3, CED-4, and CED-9 in the PPI network of human. He is aware that the pairs of genes (e.g., (BCL2, CASP3)) should not be far apart in the human PPI, as too large a distance between them indicates that the interaction is unlikely to be conserved. However, he does not know the distance or connectivity details between these pairs in the human PPI to formulate the topology of his query graph. Clearly, any guidance to facilitate the formulation of the query will be highly beneficial to Bob. However, as we shall see later, subgraph query formulation frameworks that can facilitate such bottom-up search have received scant attention in both academia and industry for decades.

¹ Homolog genes, which have conserved interactions across multiple organisms, form the basis for translating knowledge of biological processes from one organism to another (Yu et al. 2004).

1.3

Graph Query Languages

Query languages unlock the power of databases by making it possible for a user to ask complex queries over the underlying data in a declarative manner. For instance, in the above example, a graph query language can be used by Bob to express his query on the human PPI network. In the world of relational databases, SQL has been the standard for years. However, there is no standard query language for graphs yet. A soup of languages has been proposed in the last decade for graphs with varying expressive power from academia and industry (e.g., Gremlin, SPARQL, Cypher, G-Core, PGQL, and GQL). Unfortunately, enforcing end users to access the content of graph databases using any one of these query languages raises serious usability challenges that may hinder the democratization of graph data management tools. This concern is articulated in a recent Markets and Markets report (Markets and Markets 2022) as follows: Developers have to write their queries using Java as there is no Standard Query Language (SQL) to retrieve data from graph databases, which means employing expensive programmers or developers have to use SPARQL or one of the other query languages that have been developed to support graph databases; however, it would mean learning a new skill. This results in the lack of standardization and programming ease for graph database systems.

This is also echoed by a recent survey (Sahu et al. 2017) which revealed that graph query languages and usability are considered as some of the top challenges for graph processing. Example 1.2 Reconsider the search problem faced by Bob in Example 1.1. After undertaking a literature review and exhaustive discussions with his peers including data scientists, Bob was able to construct the query graph on a piece of paper as shown in Fig. 1.1c. This query is an example of a type of graph homomorphism query (Song et al. 2018). Note that the label on an edge represents lower and upper bounds on the path length that a pair of data


vertices that matches the corresponding pair of query nodes must satisfy. A matching result of this query on the human PPI is shown in Fig. 1.1d. Bob now faces the challenge of formulating it on the graph database, which demands the query to be expressed in a proprietary graph query language. Bob is a non-programmer and unfamiliar with query languages. He is also unaware of any member in his team who is familiar with query languages. How can Bob use the graph database for his task? A proponent of query languages may argue that Bob’s challenge can be largely mitigated through education. That is, if we educate end users on programming skills at a low cost (e.g., online learning), then they can themselves formulate queries using query languages without resorting to employing (expensive) programmers. Even if an end user is not taught a specific graph query language, conventional wisdom indicates that programming is highly transferable from one language (e.g., SQL, Python) to another (e.g., Cypher, SPARQL). Hence, sufficient background in general programming or even SQL may ease the formulation challenge posed by graph query languages. Unfortunately, this is not a viable option for two key reasons. First, recent research in software engineering reveals that it can be difficult for even an experienced programmer to learn another language because of misconceptions about the new language that occur due to incorrect mapping of ideas from languages he/she knows (Shrestha et al. 2022). In fact, Shrestha et al. (2022) reported that this can be explained using interference theory in psychology and neuroscience (Underwood 1957). Specifically, old knowledge can either facilitate learning new knowledge or interfere with it. Learning new languages can be difficult even for programming experts!

Second, not every end user needs to be or wants to be a programmer. This is more so in various application domains where writing a query is a secondary task that one has to perform in order to undertake a primary task (Examples 1.1 and 1.2). A user such as Bob may not be willing to invest time to learn a programming language to undertake his primary task of studying apoptosis in humans for drug development. If we force such end users to take a programming course that they dislike or are disinterested in, what goals are we trying to achieve?

1.4

Toward Graph Databases for All!

If we do not wish to force end users to learn graph query languages in order to access graph databases, what alternatives do we have for formulating subgraph search queries? We advocate that there are at least two alternative strategies, described below, that are more likely to get us closer to the goal of "graph databases for all". These alternatives have the potential
to arouse more positive and palatable experiences with end users than the learning of graph query languages. Natural language-based graph querying: Natural language (NL) interfaces to relational databases have been studied for several decades (Affolter et al. 2019). Such interfaces enable users easy access to data, without the need to learn SQL. Given a logically complex English language sentence as query input, the goal of the majority of this work is to translate them to SQL. Despite significant progress made on this topic over the years, a recent study reported that a robust NL-based interface for relational databases that can handle queries of different complexities is still elusive Kim et al. (2020). Intuitively, a similar NL interface for graph-structured data can enable end users to pose subgraph search queries without using a graph query language. For example, reconsider Bob’s query in Example 1.2. If there is a natural language-based interface to interact with PPI networks, then Bob may simply ask the following query on the human PPI network: “Find subgraphs that involve BID, CASP 3, APAF1, and BCL2 genes”. Unfortunately, there is scant research on NL interfaces for querying graphs (Zheng et al. 2017) and their deployment in practice. Accurate translation of an NL-based graph query to a graph query language is a challenging task due to inherent ambiguity in the natural language expression. For instance, it may be unnatural for an end user to explicitly specify all the relationships between nodes of a query graph in a natural language. Consequently, translating such an ambiguous NL query to a graph query language that accurately captures a user’s search intention is challenging and an open problem. Visual graph querying: Visual interfaces have a great impact on the democratization of computing technologies. We are surrounded every day as consumers with examples of such interfaces. Smartphones, which despite being a complex technology, are very intuitive to use through their user-friendly visual interfaces. Almost all e-commerce companies provide form-based visual interfaces instead of programming language interfaces for consumers to browse and purchase products. Imagine the challenges consumers would have faced if they have to write SQL queries to purchase products! We regularly use the Windows Explorer instead of command-line instructions for managing files in our computers. Since graphs are more intuitive to draw than to compose them in textual format, a userfriendly visual subgraph query interface (VQI) can enable an end user to draw a subgraph query interactively instead of formulating it textually using a graph query language. Indeed, several industrial graph querying frameworks in a variety of domains provide such interfaces (detailed in Chap. 3). These VQIs typically utilize direct-manipulation interfaces that are appealing to “novices as they can learn basic functionality quickly, are easy to remember for intermittent users by retaining operational concepts, and can be rapid for frequent users” (Shneiderman and Plaisant 2010). A visually constructed subgraph search query may either be internally transformed to a graph query language which is then executed on the underlying graph repository (Pienta et al. 2016) or its visual formulation and execution are interleaved Bhowmick et al. (2018). Figure 1.2 depicts a screenshot of a direct-manipulation-


based VQI to formulate Bob’s query by drawing it on a query canvas (left) and retrieving result matches (right) by interleaving query formulation and query processing using the approach in Song et al. (2018). Given the increasing popularity of visual subgraph query interfaces in practice, in this book, we shall focus on VQI as the alternative vehicle to formulate subgraph search queries.

1.5

Visual Subgraph Query Interfaces (VQIs)

Visual subgraph query interfaces for graph-structured data (VQIs) have been used in academia and industry for more than a decade (detailed in Chap. 3). The key components in the majority of these VQIs are the Attribute Panel, Query Panel, Pattern Panel, and Results Panel. The Attribute and Pattern panels are optional components containing attributes of nodes or edges and small connected graphs (i.e., patterns), respectively. The Query Panel is used by a user to draw a query and the Results Panel is for visualizing the query results. The details of visual design and the contents of some of these components are typically created manually by “hard coding” them during the implementation of a VQI. For example, in Fig. 1.2, Panel 2 is the Query Panel, Panel 3 is the Attribute Panel, and Panels 4 & 5 constitute the Results Panel. Observe that in this VQI there is no Pattern Panel. In general, the displayed patterns in a Pattern Panel facilitate pattern-at-a-time query construction where a user may drag-and-drop a pattern (i.e., a small subgraph) in the Query Panel instead of constructing the edges in it iteratively using edge-at-a-time mode, thereby reducing the number of steps and time taken to visually formulate a query graph (Huang et al. 2019; Yuan et al. 2021). Example 1.3 Reconsider Bob’s query in Fig. 1.1c. He can visually construct the query using the VQI in Fig. 1.2 as follows (Song et al. 2018).

Fig. 1.2 Visual construction of Bob’s query on human PPI network derived from the BioGRID database


• Load the human PPI dataset (20,955 vertices and 292,022 edges) by selecting it from Panel 1.
• Construct the query in Panel 2 using edge-at-a-time mode. Relevant labels (in Panel 3) are selected, then dragged and dropped on Panel 2. Edge connections are made using mouse clicks. For example, to construct (bid, bcl2), we first select bid from Panel 3 and drag it to Panel 2. Next, bcl2 is selected. Similarly, a drag-and-drop operation brings it to Panel 2. Then, these two vertices are connected using mouse clicks and the corresponding edge bound [1,2] is specified in a combo box. Subsequent edges in the query are constructed in the same way.
• Click the Run icon in Panel 4 to execute the query.

Observe that Bob can now construct his query with ease without learning a graph query language or the need to hire an expensive programmer to pose queries on his behalf.
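
To spell out what the formulated query means, the sketch below is an illustration (assuming the networkx library; it is not the matching algorithm of Song et al. (2018)) of checking whether one candidate assignment of data vertices respects the [lower, upper] bound attached to each query edge. The toy PPI graph, the helper name satisfies_bounds, and all bounds other than the [1,2] on (bid, bcl2) mentioned above are hypothetical.

    import networkx as nx

    # Toy stand-in for the human PPI network (illustrative, not BioGRID data).
    human_ppi = nx.Graph([("BID", "BCL2"), ("BCL2", "CASP3"), ("CASP3", "APAF1")])

    def satisfies_bounds(ppi, assignment, bounds):
        """assignment: query node -> data vertex; bounds: query edge -> (lower, upper)
        limits on the path length between the matched data vertices."""
        for (u, v), (lo, hi) in bounds.items():
            try:
                d = nx.shortest_path_length(ppi, assignment[u], assignment[v])
            except nx.NetworkXNoPath:
                return False  # unreachable pairs violate every bound
            if not lo <= d <= hi:
                return False
        return True

    # Illustrative subset of the bounds in Fig. 1.1c and one candidate assignment.
    bounds = {("bid", "bcl2"): (1, 2), ("bcl2", "casp3"): (1, 2)}
    assignment = {"bid": "BID", "bcl2": "BCL2", "casp3": "CASP3"}
    print(satisfies_bounds(human_ppi, assignment, bounds))  # True on this toy graph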

1.6

Limitations of Existing VQI

VQIs provide a palatable alternative to construct a subgraph search query without learning the syntax and semantics of graph query languages. Despite this benefit, existing VQIs suffer from several notable limitations that hinder the democratization of graph data management tools. We briefly articulate these limitations here. The reader may refer to Chap. 4 for details. First, existing VQIs are primarily designed to support the top-down search process (recall from Sect. 1.2). That is, a user must have the query graph in his head in order to draw it using a VQI. This is primarily due to the lack of exposition of relevant functionalities on a VQI that can facilitate bottom-up search. The Attribute Panel typically lists down a (sub)set of labels in the underlying data without revealing the topological structures associated with these labels. On the other hand, some VQIs may expose a small set of patterns in the Pattern Panel which may seem to facilitate bottom-up search. In practice, however, these patterns comprise well-known and popular substructures (e.g., benzene ring, triangle, and rectangle). A domain-specific user is typically aware of these patterns even if they are not displayed on a VQI. Consequently, the patterns are primarily used to expedite visual query graph construction through pattern-at-a-time mode and are not geared toward facilitating bottom-up search. For example, suppose the VQI in Fig. 1.2 is augmented with a Pattern Panel containing a triangle structure. Observe that this structure can expedite the topology construction for Bob’s query. However, it does not provide any meaningful, data-specific information to an end user to trigger a bottom-up search since triangles are building blocks of any large networks (Milo et al. 2022). Certainly, the exposition of patterns on a VQI is not the only functionality to trigger the bottom-up search. Result subgraphs that match a partially constructed subgraph query may potentially facilitate bottom-up enquiry. Specifically, a user may explore the result matches and then further refine the partially formulated query based on information gleaned from


them. However, such query results-driven bottom-up search does not necessarily alleviate the subgraph query formulation challenge using existing VQIs. An end user still needs to formulate a meaningful partial substructure to begin with, which can itself be challenging as mentioned above. Furthermore, visualizing these result subgraphs to facilitate bottomup search is a non-trivial challenge as the regions of the underlying network containing these matches may look like a “hairball” on a visual interface. Consequently, it is cognitively challenging to browse them to figure out topological patterns that one can use to complete the query formulation task. Even for a database of small- or medium-sized data graphs (e.g., chemical compounds), it is tedious to manually browse many results to seek for topologies-of-interest. Observe that any strategy to summarize these regions to make them palatable for visualization also increases the loss of topological information that may hinder query formulation. Second, we can broadly classify the panels in a VQI into two categories. The data panel comprises of the Attribute and Pattern Panels as the contents of these panels are datadependent. The remaining panels in a VQI can be considered as user panel since the contents depend on a user’s action (e.g., user query). Unfortunately, data panels in the majority of existing VQIs are “static” in nature as they are manually hard coded by programmers during the implementation of the VQIs. That is, the contents of a data panel are not data-driven. They are neither automatically generated from the underlying graph repository nor maintained with the evolution of the repository. As we shall see later, this has an adverse impact on visual query formulation. Third, many existing VQIs lack portability. A VQI designed for a specific graph repository (e.g., chemical compounds) may not be seamlessly plugged into another graph repository in a different domain (e.g., social networks). For example, the VQI of PubChem2 cannot be used to query any other graph data sources. As the contents of the data panel are domain-dependent and often manually created, a VQI needs to be reconstructed when the domain changes in order to accommodate new domain-specific patterns and labels. This inevitably increases the barrier to exploit an existing VQI on a new data repository. An end user has to first build a VQI by programming various components on top of her data-of-interest and connect it to a graph query engine before she can utilize it for query formulation. Consequently, a user either has to be a skilled programmer or needs to hire one before VQIs can be exploited. However, this is the very scenario we are trying to avoid in order to democratize graph data management tools. Classical visual subgraph query interfaces are typically not data-driven, unportable, and poor facilitators of the bottom-up search process, thereby hindering the democratization of graph data management tools.

2 https://pubchem.ncbi.nlm.nih.gov/#draw=true.

1.7 Plug-and-Play (PnP) Interfaces—Democratizing Subgraph Querying

The preceding subsection reported that existing classical VQIs do not provide sufficient features to aid bottom-up search, are static in nature when the underlying graph repository evolves, and have limited portability across different domains and sources. As we shall see later in this book, these limitations adversely impact various usability criteria of a VQI such as flexibility, robustness, efficiency, and satisfaction. In this book, we review the paradigm of the plug-and-play (PnP) visual subgraph query interface (i.e., PnP interface for brevity), which has been proposed in the literature (Bhowmick et al. 2020, 2016, 2022; Huang et al. 2019, 2021; Yuan et al. 2022, 2021)3 to alleviate these limitations, thereby paving the way for the democratization of graph data management tools.

How will querying be experienced by diverse end users when access to graph data is democratized? Given a graph data-of-interest in a specific application domain, one should be able to create a VQI on top of it according to his/her requirements effortlessly without resorting to coding. Then, one should be able to undertake both top-down and bottom-up search effectively and efficiently using the VQI without the need to formulate queries using a graph query language. The VQI should be able to be plugged on top of any existing graph query engine with ease for efficient execution of the formulated query. Over time, a user should be able to generate a new or modified VQI, if necessary, effortlessly as her requirements, data source, or domain change. A PnP interface is designed to provide such an experience.

A PnP interface goes against the traditional mantra of VQI construction by taking a fundamentally different approach. It is like a plug-and-play device that can be plugged into any kind of socket (i.e., graph data) and used (Fig. 1.3). Intuitively, a PnP interface is created using a PnP template where the underlying graph repository acts as a socket and user-specified requirements represent a plug. At the core is the PnP engine, which takes the socket and plug as input and dynamically populates the various data panels (e.g., node/edge attributes, patterns) of the template in a data-driven manner. Specifically, the generation of the contents of the data panels is grounded not only on the plug and data-specific characteristics but also on the principles of HCI and cognitive psychology (e.g., cognitive load (Sweller 1988)) that impact visual query formulation, thereby positively influencing several usability criteria of VQIs. Once the data panels are populated, the PnP template is transformed into a PnP interface and enters the "play" mode. In this mode, a user can use the PnP interface for efficient and effective visual subgraph query formulation and can install it on top of any graph query engine for execution. The PnP engine also enables data-driven maintenance of the data panels as the underlying data repository evolves. An end user can change the socket (i.e., graph repository) or the plug (i.e., requirements) as necessary to automatically generate or update a PnP interface effortlessly, giving end users the freedom to easily and quickly construct and maintain a VQI for any data source without resorting to coding.

3 In the literature, a plug-and-play VQI is also referred to as a data-driven VQI.


Fig. 1.3 Plug-and-play visual graph query interface (PnP interface)

Observe that the paradigm of the PnP interface is orthogonal to the way graph data is stored (i.e., locally or in the cloud).

PnP interfaces pave the way for democratizing access to graphs by providing superior support for visual subgraph query formulation, effortless construction and maintenance of a VQI for any data source without resorting to coding, and portability of a VQI across diverse graph data sources and querying applications.

1.8 Overview of This Book

This book gives a comprehensive introduction to the topic of plug-and-play visual subgraph query interfaces. A hallmark of this book is to emphasize research efforts that aim to bridge three traditionally orthogonal topics or fields, namely subgraph querying, human-computer interaction (HCI), and cognitive psychology. Specifically, a key component we cover is the review of techniques and strategies that make the design, construction, and maintenance of PnP interfaces grounded on the principles of HCI and cognitive psychology to facilitate the democratization of subgraph querying frameworks. To this end, we organize the discussions in this book as follows.

• In Chap. 2, we present the elements that serve as background for the remaining chapters of the book. In particular, we focus on concepts and terminologies of graphs that are necessary to understand the construction and maintenance of PnP interfaces. We also introduce softer aspects of technology such as usability and cognitive load that play pivotal roles in designing effective PnP interfaces.


• In Chap. 3, we give an overview of traditional visual subgraph query interfaces from industry and academia that dominate the current landscape. Specifically, we categorize the VQIs into three generations based on the features they support to facilitate different query formulation approaches and discuss their advantages and limitations. We also undertake a qualitative comparative analysis of these VQIs w.r.t. their usability and cognitive load.

• In Chap. 4, we formally introduce the notion of plug-and-play visual subgraph query interface (PnP interface). We first elaborate on the limitations of classical VQIs that motivate the need for PnP interfaces. We set out the key design principles for realizing a PnP interface to tackle these limitations, highlight the benefits, and present a generic and extensible architecture of PnP interfaces grounded on these principles. In the subsequent chapters, we elaborate on efficiently realizing various components of the architecture.

• In Chap. 5, we introduce and describe the characteristics of canned patterns, which are the building blocks of a PnP interface. Informally, canned patterns are connected subgraphs of size larger than a minimum threshold that are exposed on the Pattern Panel of a PnP interface in a data-driven manner to facilitate efficient top-down and bottom-up search. Specifically, we describe how to quantify the characteristics of these patterns by considering not only their data-specific features but also the cognitive load they may impose on end users during visual query formulation.

• In Chaps. 6 and 7, we describe end-to-end frameworks for data-driven selection of canned patterns in a PnP interface based on their characteristics from underlying graph repositories. Specifically, we review work that draws upon the literature in HCI and cognitive psychology to handle two categories of graphs: a large collection of small- or medium-sized data graphs and a large network. We describe how these platforms support two state-of-the-art PnP interfaces.

• Real-world graphs are dynamic in nature. However, as remarked earlier, traditional VQIs do not automatically maintain the various data panels of a VQI as the underlying data evolve. In Chap. 8, we review a seminal platform that effectively and efficiently maintains the components (e.g., patterns) of a PnP interface when the underlying collection of small- or medium-sized data graphs evolves.

• Finally, in Chap. 9, we summarize the contributions of this book and list interesting open research problems in this burgeoning topic of plug-and-play visual query interfaces.

Note that this book serves as a prequel to our previous one on blending visual graph query formulation and processing (Bhowmick et al. 2018), where we assume a VQI (traditional or plug-and-play) is available for subgraph query formulation.

1.9 Scope

In summary, the scope of this book is as follows.

• We focus on the data-driven construction and maintenance of VQIs that facilitate visual subgraph query formulation activities by diverse end users. Hence, panels for visualizing and exploring the results of a query graph (i.e., Results Panel) in a VQI are beyond the scope of this book. By avoiding them, we do not mean to understate their importance in any way, especially in the context of facilitating bottom-up search; they are simply not the subject of this book, whose focus is on subgraph query formulation.

• Since our focus is on the VQI, we do not discuss the subgraph query processing engine in this book as it is orthogonal to a VQI. In particular, a PnP interface can easily be plugged on top of any graph query processor for executing queries formulated through it.

• In subsequent chapters, we assume that the data graphs are stored in a single machine. That is, we do not discuss the construction and maintenance of PnP interfaces in a distributed graph environment. Recall that the paradigm of plug-and-play VQI is orthogonal to the way data graphs are stored. Existing single machine-based techniques for the construction and maintenance of PnP interfaces that we review in this book can be extended to a distributed environment, which is earmarked as future work.

References

K. Affolter, K. Stockinger, A. Bernstein. A Comparative Survey of Recent Natural Language Interfaces for Databases. The VLDB Journal, 28(5): 793-819, 2019.
F. N. Afrati, D. Fotakis, J. D. Ullman. Enumerating subgraph instances using map-reduce. In ICDE, 2013.
S. S. Bhowmick, et al. AURORA: Data-driven Construction of Visual Graph Query Interfaces for Graph Databases. In SIGMOD, 2020.
S. S. Bhowmick, B. Choi, C. E. Dyreson. Data-driven Visual Graph Query Interface Construction and Maintenance: Challenges and Opportunities. PVLDB, 9(12), 2016.
S. S. Bhowmick, B. Choi. Data-driven Visual Query Interfaces for Graphs: Past, Present, and (Near) Future. In SIGMOD, 2022.
S. S. Bhowmick, B. Choi, C. Li. Human Interaction with Graphs: A Visual Querying Perspective. Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2018.
W. Fan, J. Li, S. Ma, H. Wang, Y. Wu. Graph Homomorphism Revisited for Graph Matching. In PVLDB, 2010.
Graph Database Market. MarketsandMarkets. https://www.marketsandmarkets.com/MarketReports/graph-database-market-126230231.html?gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff1pUb5PI2peZmHQa-AvoPd2MRWXyPwGfEKYFu6I86Z-SgGyQ2a8G88aAmgmEALw_wcB, Last accessed 31st March, 2022.
H. Bunke, K. Shearer. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19(3):255-259, 1998.
K. Huang, et al. CATAPULT: Data-driven selection of canned patterns for efficient visual graph query formulation. In SIGMOD, 2019.


K. Huang, et al. MIDAS: Towards Efficient and Effective Maintenance of Canned Patterns in Visual Graph Query Interfaces. In SIGMOD, 2021.
H. Kim, et al. Natural Language to SQL: Where Are We Today? In PVLDB, 13(10), 2020.
D. J. L. Lee, et al. You Can't Always Sketch What You Want: Understanding Sensemaking in Visual Query Systems. IEEE Trans. Vis. Comput. Graph., 26(1): 1267-1277, 2020.
R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594): 824-827, 2002.
R. L. Perlman. Mouse Models of Human Disease: An Evolutionary Perspective. Evol. Med. Public Health, 2016(1):170-176, 2016.
R. Pienta, A. Tamersoy, A. Endert, S. Navathe, H. Tong, D. H. Chau. VISAGE: Interactive Visual Graph Querying. In AVI, 2016.
P. Pirolli, S. Card. The Sensemaking Process and Leverage Points for Analyst Technology as Identified through Cognitive Task Analysis. In Proc. of Int. Conf. on Intelligence Analysis, 2005.
S. Sahu, et al. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. PVLDB, 11(4), 2017.
H. Shang, X. Lin, Y. Zhang, J. X. Yu, W. Wang. Connected Substructure Similarity Search. In SIGMOD, 2010.
D. D. Shaye, I. Greenwald. OrthoList: A compendium of C. elegans genes with human orthologs. PLoS ONE, 6(5):e20085, 2011.
B. Shneiderman, C. Plaisant. Designing the User Interface: Strategies for Effective Human-Computer Interaction (5th edition). Addison-Wesley, Boston, MA, 2010.
N. Shrestha, C. Botta, T. Barik, C. Parnin. Here We Go Again: Why Is It Difficult for Developers to Learn Another Programming Language? Communications of the ACM, 65(3), March 2022.
Y. Song, H. E. Chua, S. S. Bhowmick, B. Choi, S. Zhou. BOOMER: Blending Visual Formulation and Processing of p-Homomorphic Queries on Large Networks. In SIGMOD, 2018.
S. Sun, Q. Luo. In-memory subgraph matching: An in-depth study. In SIGMOD, 2020.
J. Sweller. Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2):257-285, 1988.
B. J. Underwood. Interference and Forgetting. Psychol. Rev., 64(1), 1957.
H. Wang, C. C. Aggarwal. A Survey of Algorithms for Keyword Search on Graph Data. Managing and Mining Graph Data, pp. 249-273, 2010.
H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J. D. J. Han, N. Bertin, S. Chung, M. Vidal, M. Gerstein. Annotation transfer between genomes: Protein-protein interologs and protein-DNA regulogs. Genome Res., 14(6):1107-1118, 2004.
Z. Yuan, H.-E. Chua, S. S. Bhowmick, Z. Ye, B. Choi, W.-S. Han. PLAYPEN: Plug-and-Play Visual Graph Query Interfaces for Top-down and Bottom-Up Search on Large Networks. In SIGMOD, 2022.
Z. Yuan, H.-E. Chua, S. S. Bhowmick, Z. Ye, W.-S. Han, B. Choi. Towards Plug-and-Play Visual Graph Query Interfaces: Data-driven Canned Pattern Selection for Large Networks. Proc. VLDB Endow., 14(11): 1979-1991, 2021.
W. Zheng, H. Cheng, L. Zou, J. X. Yu, K. Zhao. Natural Language Question/Answering: Let Users Talk With The Knowledge Graph. In CIKM, 2017.

2 Background

This chapter provides an overview of key concepts that serve as background for the rest of the book. First, we discuss relevant terminology related to graphs and subgraph queries. This is followed by a brief discussion of cognitive load and usability, which play pivotal roles in the design of visual subgraph query interfaces. While these two concepts are studied exhaustively in the cognitive psychology, information visualization, and HCI communities, they have not received adequate attention from the graph data management community. Table 2.1 shows the key symbols related to graphs that we shall be using throughout this book.

2.1 Graph Terminology

We denote a graph as G = (V, E), where V is a set of vertices and E ⊆ V × V is a set of (directed or undirected) edges. Vertices and edges can have labels as attributes. Let l be the mapping function of G for labels of vertices or edges. That is, l(v) and l(u, v) are the labels of a vertex v ∈ V and an edge (u, v) ∈ E, respectively. The degree of a vertex v ∈ V is denoted as deg(v). The size of G is defined as |G| = |E|. In this book, we refer to a small, connected graph as a pattern. For ease of presentation, we assume that data graphs and visual subgraph queries (i.e., query graphs) are undirected simple graphs (i.e., with neither self-loops nor multiple edges) with labeled vertices. Also, a subgraph query (query graph) is a connected graph with at least one edge. Note that the techniques discussed in this book can be easily extended to directed graphs.
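To make the notation concrete, the following minimal sketch (an illustration only; the book itself is implementation-agnostic) represents such a labeled graph with the Python networkx library, storing vertex labels under an assumed attribute name "label":

import networkx as nx

# Undirected simple graph G = (V, E) with vertex labels as node attributes.
G = nx.Graph()
G.add_nodes_from([(1, {"label": "C"}), (2, {"label": "C"}), (3, {"label": "O"})])
G.add_edges_from([(1, 2), (2, 3), (1, 3)])

print(G.degree(2))           # deg(v) for v = 2
print(G.number_of_edges())   # |G| = |E|, the size of G
print(G.nodes[3]["label"])   # l(v) for v = 3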

2.1.1 Subgraph Isomorphism-Related Terminology

Table 2.1 Key symbols related to graphs used in this book

Symbol            Definition
D                 A graph database
G = (V, E)        A (sub)graph or pattern
q, Q              A query graph (fragment) or a subgraph query
G1 ⊆ G2           Subgraph isomorphism from G1 to G2
ω(G1, G2)         Maximum (connected) common subgraph similarity
l(·)              Label of a vertex or edge
deg(v)            Node degree
t(e)              Edge trussness of e
kmax              Maximum trussness

Given two vertex-labeled graphs G1 = (V1, E1) and G2 = (V2, E2), the problem of subgraph isomorphism is to find a 1-1 mapping from V1 to V2 such that each vertex in V1 is mapped to a distinct vertex in V2 with the same label, and each edge in E1 is mapped to an edge in E2.
Definition 2.1 (Subgraph Isomorphism) Given two graphs G1 = (V1, E1) and G2 = (V2, E2), there exists a subgraph isomorphism from G1 to G2, denoted by G1 ⊆ G2, if there exists an injective function f : V1 → V2 such that (1) ∀u ∈ V1, φ1(u) = φ2(f(u)), and (2) ∀(u, v) ∈ E1, (f(u), f(v)) ∈ E2 and ψ1(u, v) = ψ2(f(u), f(v)), where φi and ψi denote the vertex and edge labeling functions of Gi, respectively.

A graph G1 = (V1, E1) is a subgraph of another graph G2 = (V2, E2) (or G2 is a supergraph of G1) if there exists a subgraph isomorphism from G1 to G2, i.e., G1 ⊆ G2 (or G2 ⊇ G1). We may also simply say that G2 contains G1. The graph G1 is called a proper subgraph of G2, denoted as G1 ⊂ G2, if G1 ⊆ G2 and G1 ≠ G2.

Lastly, let G1 ⊆ G2 and |G1| = |G2| − 1. Multiple subgraph isomorphic embeddings of G1 in G2 may exist. Therefore, the subgraph isomorphism of G1 and G2 can also be viewed as a relation between G1 and G2, where each record is an embedding of G1 in G2. Furthermore, we may use graph isomorphism of G1 and G2 if G1 ⊆ G2 and |V1| = |V2|, and graph automorphism of G1, which is a graph isomorphism of G1 to itself.
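As an illustration of Definition 2.1 (and not part of the book's frameworks), the sketch below checks subgraph isomorphism between two tiny vertex-labeled graphs using networkx's VF2 matcher. The non-induced notion used in the definition corresponds to subgraph_is_monomorphic(); the attribute name "label" and the example graphs are assumptions.

import networkx as nx
from networkx.algorithms import isomorphism

# G1: a single labeled edge C-O; G2: a labeled triangle C-O-C.
G1 = nx.Graph()
G1.add_nodes_from([(0, {"label": "C"}), (1, {"label": "O"})])
G1.add_edge(0, 1)

G2 = nx.Graph()
G2.add_nodes_from([(0, {"label": "C"}), (1, {"label": "O"}), (2, {"label": "C"})])
G2.add_edges_from([(0, 1), (1, 2), (0, 2)])

gm = isomorphism.GraphMatcher(
    G2, G1, node_match=isomorphism.categorical_node_match("label", None))

# Each mapping reported by the matcher is one embedding of G1 in G2.
print(gm.subgraph_is_monomorphic())   # True, i.e., G1 ⊆ G2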

2.1.2 Maximum (Connected) Common Subgraph

Graph similarity can be assessed using feature-based or structure-based measures. The former cannot capture the global structural information of graphs. Hence, structure-based measures such as those based on the maximum common subgraph (MCS) (Bunke 1997) and the maximum connected common subgraph (MCCS) (Shang et al. 2010) are considered superior alternatives.


Given two graphs G1 and G2, G is a common subgraph of G1 and G2 if G ⊆ G1 and G ⊆ G2. G is an MCS if there exists no other common subgraph of G1 and G2 larger than G. Since the MCS does not require a common subgraph to be connected, it is possible to obtain a poor similarity match where vertices in one graph are mapped to vertices in the other that are positioned very distant from each other (Shang et al. 2010). The maximum connected common subgraph (MCCS) addresses this by imposing the additional constraint that an MCS must be connected.

Given two graphs G1 and G2, let GMCS (resp. GMCCS) be the MCS (resp. MCCS) of G1 and G2. The maximum common subgraph similarity (resp. maximum connected common subgraph similarity) between G1 and G2 is defined as ωMCS(G1, G2) = |GMCS| / min(|G1|, |G2|) (resp. ωMCCS(G1, G2) = |GMCCS| / min(|G1|, |G2|)), where min(·) is the minimum operator. It is known that MCS and MCCS computation are both NP-hard (Shang et al. 2010).
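Purely to illustrate the definition (exact MCCS computation is NP-hard, and the frameworks discussed later in this book do not compute it this way), the following brute-force sketch evaluates ωMCCS for two tiny labeled graphs by enumerating connected edge subgraphs of the smaller graph; it assumes the same networkx conventions as above and is exponential in the graph size.

from itertools import combinations
import networkx as nx
from networkx.algorithms import isomorphism

def mccs_similarity(g1, g2):
    """Brute-force MCCS-based similarity for tiny graphs (illustration only)."""
    if g1.number_of_edges() > g2.number_of_edges():
        g1, g2 = g2, g1                       # ensure g1 is the smaller graph (|G| = |E|)
    node_match = isomorphism.categorical_node_match("label", None)
    # Candidate connected subgraphs of g1, from the largest edge count downward.
    for k in range(g1.number_of_edges(), 0, -1):
        for edges in combinations(g1.edges(), k):
            cand = g1.edge_subgraph(edges)
            if not nx.is_connected(cand):
                continue
            gm = isomorphism.GraphMatcher(g2, cand, node_match=node_match)
            if gm.subgraph_is_monomorphic():  # cand is a common subgraph of g1 and g2
                return k / min(g1.number_of_edges(), g2.number_of_edges())
    return 0.0                                # no connected common subgraph with an edge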

2.1.3 k-Truss

A triangle is a cycle of length 3 in G. The support of an edge e = (u, v) ∈ E (denoted by sup(e)) is the number of triangles in G containing u and v. Given G, the k-truss of G is the largest subgraph G′ = (V′, E′) of G in which every edge e ∈ E′ is contained in at least k − 2 triangles within the subgraph (Wang and Cheng 2012). A 2-truss is simply G itself. We define the trussness of an edge e as t(e) = max{k | e ∈ ETk}, where Tk = (VTk, ETk) is the k-truss in G. Further, kmax denotes the maximum trussness.
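The edge trussness t(e) and kmax can be obtained by repeatedly invoking networkx's k_truss routine, which follows the same "at least k − 2 triangles" definition; the sketch below is illustrative only and assumes a simple undirected graph G.

import networkx as nx

def edge_trussness(G):
    """Return ({edge: t(e)}, k_max) for a simple undirected graph G."""
    trussness = {frozenset(e): 2 for e in G.edges()}   # the 2-truss is G itself
    k = 3
    while True:
        Tk = nx.k_truss(G, k)       # largest subgraph whose edges lie in >= k-2 triangles
        if Tk.number_of_edges() == 0:
            break
        for e in Tk.edges():
            trussness[frozenset(e)] = k
        k += 1
    return trussness, k - 1         # k_max is the largest k with a non-empty k-truss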

2.1.4 Types of Graph Collection

There are mainly two streams of research on subgraph querying in the literature, based on the type of data graph collection. One stream handles a large number of small- or medium-sized data graphs such as chemical compounds. Several existing techniques belong to this stream (Shang et al. 2010; Yan et al. 2004, 2005). We refer to such a collection of data graphs as a graph database, denoted by D. We assign a unique index (i.e., id) to each data graph in D. A data graph G with index i is denoted as Gi. We assume that each data graph in D is connected. The other stream handles a single large data graph (a.k.a. network), denoted by G, such as a citation network, social network, or protein-protein interaction network (Fan et al. 2010; Han et al. 2013). In this book, we discuss PnP interfaces that are devised for these two streams of data graphs.

2.2 Cognitive Load

In cognitive psychology, cognitive load refers to the amount of information that working memory can hold at one time. According to cognitive load theory, humans process information using limited working memory (Sweller 1988). The theory was initially developed in the arena of education and instructional design. Its fundamental tenet is that the quality of instructional design will be enhanced if the limitations of working memory are taken into account. There are three types of cognitive load, namely intrinsic, extraneous, and germane. Intrinsic load is associated with the inherent difficulty of the instruction or task (e.g., finding the shortest path in a large graph). Extraneous cognitive load refers to the way information or tasks are presented to a learner, while germane cognitive load refers to the work put into creating a permanent store of knowledge (a schema). In the literature, there are three types of measures to assess cognitive load: subjective feedback, performance-based measures (accuracy, response time), and physiological measures such as brain activity, pupil dilation, and changes to heart rate (De Waard 1996).

In this book, we consider the intrinsic cognitive load of interpreting a pattern during visual query formulation as an important factor in designing PnP interfaces. That is, the quality of visual subgraph query interfaces can be enhanced if we incorporate in their design the impact of the cognitive load of their constituents on end users. To this end, we use subjective feedback from users and the response time for the visual query formulation task as proxies to assess cognitive load.

We draw inspiration from several research efforts since the turn of the century that attempt to understand the cognitive load imposed by network visualization models for performing different tasks (Huang et al. 2009; Yoghourdjian et al. 2018, 2021). For example, Huang et al. (2009) explored the cognitive load imposed by node-link diagrams by varying the complexity of the visual representation of network data and of the performed tasks. Their study confirmed the existence of cognitive load on end users as the complexity of the node-link diagram grows. More recently, Yoghourdjian et al. (2021) investigated the perceptual limitations of node-link diagrams for performing a connectivity-based task (e.g., finding the shortest path between a pair of nodes in a node-link diagram). They reported that for small-world graphs with 50 (resp. 100) or more nodes with a density of 6 (resp. 2), participants were unable to correctly find the shortest path in more than half of the trials. In this study, density is measured as the ratio between the number of edges and the number of nodes in a graph. In particular, this work is the first that reported physiological measures of cognitive load and revealed that these measures initially increase with task hardness but then decrease, possibly because the participants give up performing the task. In the sequel, the data-driven construction of PnP interfaces shall exploit these recent results to create cognitive load-conscious VQIs.

2.3 Usability

Usability is a quality attribute that assesses how easy user interfaces are to use (Benyon and Turner 2005). Specifically, it refers to the "quality of the interaction in terms of parameters such as time taken to perform tasks, number of errors made, and the time taken to become a competent user" (Benyon and Turner 2005). The criteria for measuring usability are as follows (Dix et al. 1998):

• Learnability: the ease with which new users can interact effectively and achieve maximal performance;
• Flexibility: the multiplicity of ways in which the user and system exchange information;
• Robustness: the level of support provided to the user in determining successful achievement and assessment of goals;
• Efficiency: the speed with which a user can perform tasks once he/she has learned the system;
• Memorability: how easily the user remembers the system's functions after not using it for a period;
• Errors: the number of errors made by users, their severity, and whether users can recover from them easily;
• Satisfaction: how enjoyable and pleasant it is to work with the system.

Observe that criteria such as efficiency, errors, and satisfaction are influenced by the cognitive load of a VQI on end users.

2.4 Conclusions

This chapter introduces the key concepts that play a pivotal role in the design of PnP interfaces. These concepts come from diverse domains, ranging from cognitive psychology to data management. PnP interfaces aim to unify them to facilitate the effortless, data-driven construction of user-friendly, personalized VQIs for end users in diverse domains where graphs play a central role in data representation. A key point of our discussion in this book is that, despite decades of research on usability and human factors, many of the existing VQIs for subgraph querying are oblivious to these results. In the next chapter, we shall elaborate on these limitations of existing VQIs, paving the way for the paradigm of the PnP interface.


References

D. Benyon, P. Turner. Designing Interactive Systems: A Comprehensive Guide to HCI and Interaction Design. 2nd edn. Pearson Education Ltd., Edinburgh, 2005.
H. Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8):689-694, 1997.
D. De Waard. The measurement of drivers' mental workload. Groningen University, Traffic Research Center, Netherlands, 1996.
A. Dix, J. Finlay, G. Abowd, R. Beale. Human-Computer Interaction, 2nd edn, Pearson Education Ltd., Harlow, 1998.
W. Fan, J. Li, S. Ma, H. Wang, Y. Wu. Graph Homomorphism Revisited for Graph Matching. In PVLDB, 2010.
W.-S. Han, J. Lee, J.-H. Lee. TurboISO: Towards Ultrafast and Robust Subgraph Isomorphism Search in Large Graph Databases. In SIGMOD, 2013.
W. Huang, P. Eades, S.-H. Hong. Measuring Effectiveness of Graph Visualizations: A Cognitive Load Perspective. Information Visualization, 8(3), 2009.
H. Shang, X. Lin, Y. Zhang, J. X. Yu, W. Wang. Connected Substructure Similarity Search. In SIGMOD, 2010.
J. Sweller. Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2):257-285, 1988.
J. Wang, J. Cheng. Truss decomposition in massive networks. PVLDB, 5(9): 812-823, 2012.
X. Yan, P. S. Yu, J. Han. Graph Indexing: A Frequent Structure-Based Approach. In SIGMOD, 2004.
X. Yan, P. S. Yu, J. Han. Substructure similarity search in graph databases. In SIGMOD, 2005.
V. Yoghourdjian, D. Archambault, S. Diehl, T. Dwyer, K. Klein, H. C. Purchase, H.-Y. Wu. Exploring the Limits of Complexity: A Survey of Empirical Studies on Graph Visualization. Visual Informatics, 2(4), 2018.
V. Yoghourdjian, Y. Yang, T. Dwyer, L. Lee, M. Wybrow, K. Marriott. Scalability of Network Visualisation from a Cognitive Load Perspective. IEEE Trans. Vis. Comput. Graph., 27(2): 1677-1687, 2021.

3 The World of Visual Graph Query Interfaces—An Overview

Several visual interfaces such as Refinery (Kairam et al. 2015) and Apolo (Chau et al. 2011) facilitate the exploration of large networks using keyword-based search on node and edge attributes. These systems enable bottom-up search through query results, where the output (results) serves in part as input (queries). They visualize the most relevant items returned by the system w.r.t. a query, along with their connections, using a graph view. An end user is expected to browse these items and the associated attributes, which may spur insights and further exploration. However, these systems are limited to keyword-based queries and do not support subgraph queries where a query topology needs to be formulated. This significantly limits their practical usage for querying graph data. Consequently, bottom-up search is limited to browsing intermediate result graphs, as patterns are not relevant here. Recall from Sect. 1.6 that such bottom-up search may have a considerable cognitive impact on end users in the context of subgraph querying.

This chapter focuses on visual interfaces that are designed for subgraph querying. It gives an overview of existing visual subgraph query interfaces (VQIs), discussing the state of the art in industry and academia. We begin our discussion by classifying the different approaches for visual subgraph query formulation that are prevalent today. Next, we present a set of representative VQIs that have been proposed in the last two decades for supporting these approaches. The majority of these VQIs are designed for constructing connected labeled query graphs visually. Specifically, we categorize the VQIs into three generations based on the features they support to facilitate different subgraph query formulation approaches, give representative examples, and discuss their advantages and limitations. Finally, we conclude by undertaking a comparative analysis of the three generations of VQIs w.r.t. the usability criteria introduced in the preceding chapter, emphasizing their limitations that are yet to be addressed. In the next chapter, we shall introduce the notion of a plug-and-play (PnP) interface to address these limitations.


3.1 Visual Subgraph Query Formulation (VQF) Approaches

Visual query interfaces utilize the results of decades of research by the HCI community related to various theoretical models of visual tasks, menu design, and human factors. Recall from Chap. 1 that the majority of existing VQIs typically utilize direct-manipulation interfaces (Shneiderman and Plaisant 2010). There are two popular approaches for formulating visual subgraph queries using these VQIs, based on the top-down and bottom-up query formulation processes introduced in Sect. 1.2. In the edge-at-a-time approach, a query is incrementally constructed by adding one edge at a time. Note that it may be time-consuming to formulate a query with a large number of edges using this approach. Such repetitive steps can be frustrating and prone to mistakes, as an end user may forget to add one or more edges. Hence, in the pattern-at-a-time approach, one may compose a visual query by dragging and dropping patterns (e.g., benzene, chlorobenzene, triangle patterns) that are made available on the VQI, in addition to the construction of edges. Observe that this approach is more efficient than the former as it typically takes less time to construct a query (Huang et al. 2019; Yuan et al. 2021). Such patterns, if selected judiciously, may also facilitate bottom-up search as an end user can browse them to trigger subgraph query formulation.

Example 3.1 Consider the subgraph query in Fig. 3.1 from BSBM1 (Query Q12). Suppose Wei, a non-programmer, wishes to formulate it using a VQI containing a set of patterns (a subset of them is shown). Specifically, he may drag-and-drop p2 and p3 on the Query Canvas, merge the yellow vertex of p3 with the center vertex of p2, add a vertex, and connect it with the gray vertex of p2. Finally, Wei can assign appropriate vertex labels. Observe that it requires five steps to construct the topology. On the other hand, if Wei takes an edge-at-a-time approach, it would require 23 steps to construct it. Clearly, these patterns enable more efficient (i.e., fewer steps or less time) formulation of the query.

Fig. 3.1 Q12 in BSBM and patterns

1 http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/20080912/index.html.
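To make the pattern-at-a-time steps of Example 3.1 concrete, the following sketch mimics the merge operation on two hypothetical patterns (a star standing in for p2 and a triangle for p3); the vertex identifiers, pattern shapes, and the use of networkx are illustrative assumptions, not the actual BSBM patterns.

import networkx as nx

p2 = nx.star_graph(4)        # stand-in for p2: center vertex 0 with four leaves
p3 = nx.cycle_graph(3)       # stand-in for p3: a triangle on vertices 0, 1, 2

# Rename p3's vertices so that only the vertex being merged coincides with p2:
# p3's vertex 0 (the "yellow" vertex) is mapped onto p2's center vertex 0.
p3 = nx.relabel_nodes(p3, {1: "a", 2: "b"})

query = nx.compose(p2, p3)   # one drag-and-drop + merge step
query.add_edge(1, "c")       # add a new vertex and connect it to a p2 vertex (the "gray" one)

print(query.number_of_edges())   # the topology is built in a handful of steps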


It is worth noting that Wei may not necessarily have the complete query structure "in his head" during query formulation. He may find p3 interesting while browsing the pattern set, which may initiate his bottom-up search leading to the query. Clearly, without the existence of a pattern set, such a bottom-up search would be infeasible in practice. Wei would have to resort to the exploration of the result matches of his partial query fragment to complete the formulation, which is cognitively challenging for large networks.

The pattern-at-a-time approach can be either static or dynamic in nature. In the former case, the set of patterns is fixed and displayed in the VQI prior to any query formulation (e.g., Example 3.1). In the latter case, the patterns are dynamically suggested during visual query formulation by utilizing the knowledge of partially formulated query fragments. Observe that this dynamic approach may not facilitate bottom-up search effectively as an end user needs to first visually formulate a query substructure before any suggestions can be exposed on the VQI. Furthermore, generating suggestions on-the-fly can be expensive and time-consuming, especially for a large graph repository, and as a result may adversely impact the time taken by an end user to finish the query formulation task.

Note that an end user may formulate a subgraph query using examples, which is known as Query By Example (QBE), instead of formulating a query from scratch. In this approach, a user provides an example of what the query results should look like so that the underlying query engine can automatically create a subgraph query that finds other similar results in the graph repository. Nevertheless, the example can be visually formulated in edge-at-a-time or pattern-at-a-time (e.g., copy and paste a subgraph from a data graph) mode.

In this book, we refer to the time taken by an end user to construct a query graph visually in a VQI as the query formulation time (QFT). As mentioned earlier, the QFT of the pattern-at-a-time approach is typically lower than that of the edge-at-a-time approach, as the latter requires more steps to construct a query graph using a direct-manipulation interface.

3.2 Visual Subgraph Query Interfaces (VQI)

In this section, we present a representative set of VQIs that realize the aforementioned VQF approaches, highlighting their advantages and limitations w.r.t. visual subgraph query construction. We organize our discussion by classifying these interfaces into three generations based on the type of VQF supported.

3.2.1 First Generation VQI

The most rudimentary form of VQI for visual subgraph query formulation contains a drawing panel where a user can iteratively create a node, label it manually, and connect a pair of them with a link. However, such visual formulation does not really facilitate a user to create labeled query graphs.


Fig. 3.2 The VQI of GBlender (Jin et al. 2010, 2012)

One is expected to possess precise knowledge about the attributes in the underlying graph repository as well as the topologies-of-interest to formulate meaningful queries. The first generation VQIs partially alleviate this problem by exposing a set of node/edge attributes in the underlying graph repository. These VQIs follow the edge-at-a-time approach to construct a query graph.

GRAPHITE (Chau et al. 2008) is a representative example of the first generation VQI. A user can assign an attribute (i.e., label) to a node by double-clicking on it and picking a value from a pop-up dialog. Note that in GRAPHITE the set of attributes associated with a node is available on-demand (i.e., when a user double-clicks on a node).

An alternative strategy is to make the set of attributes in the underlying data repository available a priori in a separate panel (Attribute Panel). A user simply selects an attribute and drags it to the Query Panel to create a node with the corresponding label. GBlender (Jin et al. 2010, 2012) utilizes such a VQI for query formulation (Fig. 3.2). Specifically, a user can drag an attribute from Panel 2 and drop it in Panel 3 to visually construct a node. Left- and right-clicking on a pair of nodes creates an edge between them. The advantage of this strategy is that an end user does not need to recall a node/edge label that occurs in the graph repository, as she can simply scan or search the list to choose an appropriate one for her query. This is particularly beneficial when the keyword a user has in her head does not match the attributes in the graph repository. Furthermore, it gives an opportunity to implicitly construct a node whenever an attribute is dropped into the Query Panel. In GRAPHITE, the creation of a node and label specification are carried out in two distinct steps.

3.2.2 Second Generation VQI

Although first generation VQIs facilitate the labeling of nodes or edges in a query graph, they demand that end users have precise knowledge of the query topologies to accurately formulate meaningful queries. As stated in Chap. 1, this assumption is often impractical, especially in the context of bottom-up search. While it is possible for end users to be aware of popular graph structures such as triangles, rectangles, wedges, or k-cycles that occur in a wide variety of data graphs, it is extremely challenging for them to be knowledgeable of larger and more complex substructures that occur in a specific graph data-of-interest. Consequently, it is challenging for end users to formulate a variety of queries using first generation VQIs.

The second generation VQIs aim to address this limitation by exposing a Pattern Panel in a VQI in addition to the Attribute Panel. Specifically, a Pattern Panel contains a small set of small-size patterns that a user can drag-and-drop on a Query Panel to formulate a query. Consequently, these VQIs support the pattern-at-a-time query formulation mode. These patterns are typically exposed in the form of thumbnails on a VQI so that they are legible in small multiples to support rapid browsing. DrugBank,2 PubChem,3 and eMolecules4 are examples of representative industrial-strength second generation VQIs. Intuitively, the patterns enable faster completion of a non-trivial query formulation task compared to first generation VQIs as they allow a user to construct a set of edges with a single drag-and-drop action. Observe that these patterns are selected manually and hard coded by the VQI developers (possibly in collaboration with domain experts) during the creation of a VQI. Hence, they remain static unless explicitly modified by the developers.

2 https://go.drugbank.com/structures/search/small_molecule_drugs/structure.
3 https://pubchem.ncbi.nlm.nih.gov/#draw=true.
4 https://www.emolecules.com/.

3.2.3 Third Generation VQI

Observe that in all the representative second generation VQIs mentioned above, the set of patterns consists of fewer than 15 patterns, and each of them is small in size with no more than 8 edges. Hence, they provide limited choice to end users for formulating a variety of queries visually. Furthermore, these patterns are static and do not evolve with the different queries formulated by a user. The third generation VQIs aim to mitigate this problem by suggesting query-specific patterns that are generated on-the-fly during query formulation.

The AutoG (Yi et al. 2017) framework adopts this dynamic pattern suggestion strategy for graph databases (i.e., a large collection of small- or medium-sized data graphs). Specifically, it automatically generates a small list of subgraphs as suggestions by considering the potential query result size as well as structural diversity. Figure 3.3 shows an example of subgraph suggestions during query formulation in AutoG. Observe that given the query fragment C-C-C constructed by a user, it suggests a small set of subgraphs containing this fragment.


Fig. 3.3 Subgraph suggestions during query formulation in AutoG (Yi et al. 2017)

The new graph fragments that can be added to the current query graph are shown in the bottom panel in blue. A user can simply click on any one of the suggestions to select the relevant subgraph and continue with the query formulation task.

In contrast to the second generation VQIs, users of a third generation VQI are not limited by the small set of static patterns exposed on a VQI for query formulation. In theory, a third generation VQI may provide query-specific suggestions of diverse patterns that may potentially aid a user during query formulation. Consequently, it has the potential to facilitate bottom-up search if an end user has an initial query fragment in mind to kickstart query formulation. Note that it is extremely difficult to speculate a priori about the subgraph structure that a user wishes to construct during query formulation. Hence, it is computationally challenging to generate accurate suggestions involving larger subgraphs efficiently (i.e., in less than a second), since the candidate space to explore for suggestion generation on a large network is huge. Indeed, to the best of our knowledge, the design of an effective third generation VQI for large networks is still an open problem. The latency associated with suggestion generation may eclipse the task efficiency gained due to the pattern-at-a-time query formulation mode.


3.3 Comparative Analysis

Table 3.1 Comparison between three generations of VQIs

Criteria          Gen 1    Gen 2     Gen 3
Learnability      High     High      High
Flexibility       Low      Medium    Medium
Robustness        Low      Medium    Medium
Efficiency        Low      Medium    Low-High
Memorability      High     High      High
Errors            High     Medium    Medium
Satisfaction      Low      Medium    Low-Medium
Cognitive load    High     Medium    Medium-High

In this section, we compare the three generations of VQIs w.r.t. their usability. Specifically, we undertake a qualitative comparison based on the usability criteria and cognitive load introduced in the preceding chapter. Table 3.1 summarizes the results. All VQIs across different generations are easy to learn and remember compared to any graph query language, emphasizing the benefits of VQIs as articulated in Chap. 1.

The first generation VQIs have low flexibility since one can only formulate queries using the edge-at-a-time mode. These VQIs are typically designed for top-down search and do not effectively facilitate bottom-up search. This adversely impacts the robustness, efficiency, errors, and satisfaction dimensions, as it may either be too difficult to formulate a query when a user does not have a clear pattern in head, or it may take longer to finish a query formulation task (i.e., longer QFT). Consequently, they may impose a high intrinsic cognitive load on the end users.

The second generation VQIs alleviate this problem partially. End users can formulate queries using the pattern-at-a-time mode in addition to the edge-at-a-time mode, thereby reducing the time to formulate a query graph. The patterns on a VQI may trigger the bottom-up search. Hence, they have higher efficiency and robustness and impose a relatively lower cognitive load for the query formulation task compared to first generation VQIs. Enabling the pattern-at-a-time mode also reduces potential errors due to repeated construction of edges in a query graph.

The third generation VQIs inherit the advantages of the second generation VQIs. However, due to the query-specific, dynamic nature of pattern suggestion, the QFT is highly influenced by the time taken to generate suggestions as well as the accuracy of these suggestions. Hence, this may adversely impact efficiency, satisfaction, and the cognitive load associated with VQF. An end user may simply abandon a VQI if the suggestion generation is inefficient.


Visual subgraph query interfaces democratize access to graph data by enabling end users to query without learning a query language. Several domain-specific, industrial-strength interfaces do not even allow users to formulate queries using a graph query language, highlighting the unpopularity of the latter in practice.

3.4 Conclusions

In this chapter, we presented a brief overview of the current world of visual subgraph query interfaces (VQIs). In particular, we focused on how VQI design has evolved from the edge-at-a-time to the pattern-at-a-time query formulation approach and qualitatively compared the three generations of VQIs w.r.t. usability and cognitive load. We observe that several industrial VQIs (e.g., PubChem, DrugBank, eMolecules) do not expose alternative interfaces to formulate textual queries using a graph query language, highlighting the reluctance of end users to use such programming languages. Clearly, VQIs play a pivotal role in democratizing access to graph data sources.

References

D. H. Chau, C. Faloutsos, H. Tong, et al. GRAPHITE: A Visual Query System for Large Graphs. In ICDM Workshop, 2008.
D. H. Chau, A. Kittur, J. I. Hong, C. Faloutsos. Apolo: Making sense of large network data by combining rich user interaction and machine learning. In CHI, 2011.
K. Huang, et al. CATAPULT: Data-driven selection of canned patterns for efficient visual graph query formulation. In SIGMOD, 2019.
C. Jin, S. S. Bhowmick, et al. GBlender: Towards Blending Visual Query Formulation and Query Processing in Graph Databases. In SIGMOD, 2010.
C. Jin, S. S. Bhowmick, B. Choi, S. Zhou. PRAGUE: A Practical Framework for Blending Visual Subgraph Query Formulation and Query Processing. In ICDE, 2012.
S. Kairam, N. H. Riche, S. M. Drucker, R. Fernandez, J. Heer. Refinery: Visual Exploration of Large, Heterogeneous Networks through Associative Browsing. Comput. Graph. Forum, 34(3), 2015.
P. Yi, B. Choi, S. S. Bhowmick, J. Xu. AutoG: A visual query autocompletion framework for graph databases. VLDB J., 26(3):347-372, 2017.
B. Shneiderman, C. Plaisant. Designing the User Interface: Strategies for Effective Human-Computer Interaction (5th edition). Addison-Wesley, Boston, MA, 2010.
Z. Yuan, H.-E. Chua, S. S. Bhowmick, Z. Ye, W.-S. Han, B. Choi. Towards Plug-and-Play Visual Graph Query Interfaces: Data-driven Canned Pattern Selection for Large Networks. Proc. VLDB Endow., 14(11): 1979-1991, 2021.

4 Plug-and-Play Visual Subgraph Query Interfaces

The preceding chapter introduced a representative set of state-of-the-art visual subgraph query interfaces (VQIs) in the industrial and academic worlds. Although these VQIs enable subgraph query formulation without the need for end users to be familiar with the syntax and semantics of a graph query language, they suffer from several drawbacks that adversely impact not only their ease of deployment on increasingly diverse graph-structured data sources but also their ability to promote top-down and bottom-up search. In this chapter, we formally introduce the notion of the plug-and-play (PnP) visual subgraph query interface (PnP interface for brevity) to mitigate these drawbacks. We begin by summarizing the assumptions made by existing VQIs that lead to their limitations in Sects. 4.1 and 4.2. Next, in Sect. 4.3, we articulate the key design principles for realizing a PnP interface that can address these limitations. We formally introduce the architecture of a PnP interface in Sect. 4.4 based on these principles. We conclude the chapter by highlighting the benefits brought by a PnP interface.

4.1 Assumptions Made by Existing VQI

Our analysis of existing classical VQIs reveals the following assumptions that are implicitly made by their designers:

• Support efficient top-down search. The VQIs are designed primarily to enable subgraph query formulation without learning a graph query language or a programming language. Specifically, all three generations of VQIs are geared toward supporting top-down search, where an end user has a clear pattern in head for query graph construction. Although second generation VQIs expose a small set of patterns, in practice all these patterns are often well-known in a specific application domain (e.g., benzene ring, triangle). This is evident from the Pattern Panels of DrugBank,1 PubChem,2 and eMolecules.3 Similarly, only a small subset of attribute labels may be available on a VQI (e.g., eMolecules). One can expect that typical end users of these VQIs are often cognizant of these patterns even if they are not exhibited on a VQI. Consequently, these patterns only serve to expedite visual query construction in practice but not to promote bottom-up search. Similarly, although third generation VQIs do not have a fixed set of patterns, they expose a limited collection of them (e.g., an edge) dynamically due to the computational cost of suggestion generation during query formulation.

• Manual creation and maintenance. The contents of various panels of a VQI have to be manually implemented and maintained by programmers.

• Tight integration with underlying data source. The implementation of a VQI has to be tightly integrated with the underlying graph data source it is deployed on. For example, the VQIs of DrugBank and PubChem are both tightly integrated with the corresponding data sources. That is, each VQI has to be independently created and maintained by individual owners of the data source despite having a similar structure and domain.

1 https://go.drugbank.com/structures/search/small_molecule_drugs/structure.
2 https://pubchem.ncbi.nlm.nih.gov/#draw=true.
3 https://www.emolecules.com/.

Enabling a user to construct the topology of a query graph that returns meaningful results, without demanding comprehensive knowledge of the graph repository, is key to realizing effective bottom-up query formulation.

4.2 Limitations of Existing VQI

Despite the significant progress made toward constructing user-friendly VQIs, the aforementioned assumptions result in four key drawbacks that hinder progress.

• Lack of diverse content. As stated above, the contents of the Attribute and Pattern Panels in existing VQIs are created manually by "hard coding" them during their implementation. Consequently, the sets of attributes and patterns are limited to those selected in advance by domain experts. A user may find the pre-selected patterns useless in formulating some queries. A similar problem may also arise in the Attribute Panel, where the labels of nodes are manually selected instead of automatically generated from the underlying data.

• Lack of flexible support for bottom-up search. The lack of diverse content in a VQI hinders bottom-up search. In practice, since end users are often familiar with the small set of displayed patterns on a VQI, it fails to facilitate query graph construction when they do not have a clear pattern in their head to query. In other words, the lack of diverse content restricts existing VQIs to supporting only top-down search or query result-driven bottom-up search, thereby limiting the flexibility of bottom-up search.

• Static content. The first and second generation VQIs are static. That is, their contents remain static even when the underlying data evolves. As a result, some patterns (resp. labels) may become obsolete as (sub)graphs containing such patterns (resp. labels) may no longer exist in the underlying data. Similarly, some new patterns (resp. labels), which are not currently displayed in a VQI, may emerge due to updates to the graph repository. In practice, it is prohibitively expensive to manually maintain the VQIs. Consequently, the sets of attributes and patterns on a VQI may not necessarily reflect the current state of the graph repository to facilitate graph search. Although third generation VQIs may extract attributes and patterns on-demand based on the current state of the repository, as mentioned earlier, they still provide limited support for patterns and bottom-up search.

• Lack of portability. Lastly, classical VQIs lack portability as a VQI cannot be seamlessly integrated with another graph repository in a different domain (e.g., from protein-protein interaction networks to social networks). As the contents of the Attribute and Pattern Panels are domain-dependent and manually created, the VQI needs to be reconstructed when the domain changes in order to accommodate new domain-specific patterns and labels. This significantly increases the cost of democratizing VQIs across different domains and sources.

4.3 Design Principles of Plug-and-Play VQI

A plug-and-play (PnP) interface takes a fundamentally different approach to alleviate the aforementioned limitations of classical VQIs. It is designed to give end users the freedom to easily and quickly construct and maintain a VQI in a data-driven manner for any graph data source, without resorting to coding, by simply "plugging" it on the data. In this section, we introduce the design principles behind PnP interfaces. In the next section, we shall introduce the PnP interface formally. Specifically, PnP interfaces jettison all the aforementioned assumptions of classical VQIs (Sect. 4.1).

1. Work with independent data sources. A PnP interface should be able to work with any data source or application domain involving graph-structured data. That is, it should not be tightly integrated into a specific source. This will offer sufficient benefits to developers and end users by making VQI generation effortless with the growing number of sources.

2. Facilitate efficient top-down and bottom-up search. Classical VQIs assume that end users have a clear idea of the topologies of their query graphs. Hence, as mentioned earlier, they are primarily designed to support top-down searches. A key goal of PnP interfaces is to relax this assumption by facilitating both top-down and bottom-up searches effectively.


3. Useful attribute and pattern selection. Recall that the small sets of attributes and patterns in existing VQIs provide limited opportunities for end users to undertake bottom-up search. Hence, end users should be given a more flexible and diverse set of these elements in a VQI to facilitate effective search. For instance, larger-sized patterns that are specific to a graph repository can be exhibited on a VQI. This, however, makes the manual selection of these components during the implementation of a VQI impractical. Hence, they should be automatically generated from the underlying graph repository based on end users' requirements. In particular, there are numerous patterns with different topologies that can be selected. Many of them may not be useful to end users in supporting top-down and bottom-up searches effectively. Hence, it is paramount to select patterns that are potentially "useful" for subgraph query formulation.

4. Cognitive load-aware pattern selection. The aforementioned goal of exposing larger patterns to end users may lead to a higher cognitive load, as they need to visually interpret a pattern (i.e., edge relationships) quickly to determine if it is useful for a query. Hence, any PnP interface needs to select subgraphs that are not only potentially useful but also impose a low cognitive load on users.

5. Independence from query logs. Although query logs can provide rich information on the topologies of past queries posed on a specific data source, in practice such information is often not publicly available (Yuan et al. 2021). Hence, a PnP interface should be able to select attributes and patterns from a specific source without demanding its query logs as input. In case query logs are available, it should be easily extensible to incorporate them.

For more than three decades, the visual subgraph query interface construction process has traditionally been manual in nature, hard coded by programmers. The paradigm of the PnP interface brings a shift in this traditional thinking by making construction and maintenance data-driven.

4.4 Plug-and-Play (PnP) Interface

We now present an overview of the architecture of PnP interfaces that realizes the aforementioned design principles to address the limitations of existing classical VQIs. A PnP interface consists of three components: a PnP template, a plug, and a PnP engine (Fig. 4.1). A PnP template provides the skeleton structure of a VQI independent of any domain or source. A plug is specified by an end user to instantiate different panels of the PnP template for a given graph data source based on his/her requirements. Based on the plug specification, the PnP engine is responsible for instantiating the relevant panels of a PnP template with data (e.g., attributes, patterns). Once a PnP template is populated with relevant data, it is ready for the "play" mode, where an end user can use it like any other VQI to formulate subgraph queries.


Fig. 4.1 Architecture of a PnP interface

We now elaborate on these components.

4.4.1 PnP Template

A PnP template is a visual interface that provides the interface structure of a PnP interface. Recall from Chap. 1 that a VQI for graphs typically contains two categories of panels, data panels and user panels. The contents of data panels are generated from the underlying graph repository either manually or automatically. On the other hand, the contents of user panels are provided by end users. Hence, a PnP template contains at least four panels, a configuration panel, an attribute panel, a pattern panel, and a query panel. Note that the configuration and query panels are user panels whereas the attribute and pattern panels are data panels. None of the panels in a PnP template are instantiated with any data. Hence, it can be used on top of any data source. Figure 4.2 depicts an example of a PnP template. Observe that all panels are empty. A PnP template provides the necessary scaffold for an end user to create VQIs with a uniform structure for any number of graph data sources, thereby reducing the effort of learning and navigating a new VQI for each new data source, even within the same application domain (e.g., PubChem and DrugBank). Observe that a PnP template does not contain a results panel. This is because, as mentioned in Sect. 1.8, this book focuses on subgraph query formulation. It is straightforward to extend the template to incorporate a panel for visualizing query results.


Fig. 4.2 A PnP template

4.4.2 Plug

A plug is a high-level specification of the properties of the panel containing patterns in a VQI. Given the specification and the PnP template, the PnP engine (discussed below) is responsible for selecting the patterns satisfying it from the selected graph repository. Formally, it is defined as follows.

Definition 4.1 (Plug) Given a graph repository R and a PnP template I, a plug is specified as b = (ηmin, ηmax, γ) where ηmin > 0 (resp. ηmax) is the minimum (resp. maximum) size of a pattern and γ > 0 is the number of patterns to be displayed on I.

Essentially, a plug is a collection of attribute-value pairs, provided by an end user, that specifies the high-level content of the pattern set in a VQI. For example, Fig. 4.3 shows a screenshot of the specification b = (3, 10, 15). Accordingly, the minimum and maximum sizes of patterns in I are 3 and 10, respectively, and the total number of patterns to be displayed in the VQI is 15. Observe that there can be multiple plugs for a graph repository. Similarly, the same plug can be used for different repositories. Hence, different plug-and-play interfaces can be constructed by using different plug specifications. A plug should possess the following properties.

• Data independence—A plug should not depend upon specific properties of a network (i.e., socket). The specification of the plug enables this by not admitting any network-specific information. Observe that this property is important for plug-and-play interfaces as a plug can be used on different network data across different application domains.


Fig. 4.3 Plug specification

• Able to select patterns with the required specifications—The PnP engine (discussed below) of a PnP interface should select patterns exactly as specified by the plug. Observe that, in contrast to classical VQIs where the content of data panels is typically static and identical for all end users, the plug in a PnP interface allows end users to personalize the content of the Pattern Panel dynamically according to their preferences.

Furthermore, the plug specification is extensible. For instance, it can be extended to add constraints on the pattern size distribution. By default, in this book, we follow a uniform distribution in generating patterns to ensure that the sizes of patterns are evenly distributed. However, it can easily be modified to accommodate a different size distribution by allowing the number of patterns per size to vary in the range [1, k], where k < γ: b = (ηmin, ηmax, dist, γ), where dist is the desired pattern size distribution.
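To make the plug specification concrete, the following is a minimal Python sketch of how a plug could be represented programmatically. The class name Plug, its field names, and the optional dist field are illustrative assumptions; the book only fixes the tuple b = (ηmin, ηmax, γ) and its extension with a size distribution.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Plug:
    """High-level specification of the pattern panel: b = (eta_min, eta_max, gamma)."""
    eta_min: int                            # minimum pattern size
    eta_max: int                            # maximum pattern size
    gamma: int                              # number of patterns to display
    dist: Optional[Dict[int, int]] = None   # optional size distribution: size -> #patterns

    def __post_init__(self):
        # A plug admits no network-specific information; only these constraints are checked.
        if not (0 < self.eta_min <= self.eta_max):
            raise ValueError("require 0 < eta_min <= eta_max")
        if self.gamma <= 0:
            raise ValueError("gamma must be positive")
        if self.dist is not None and sum(self.dist.values()) != self.gamma:
            raise ValueError("size distribution must account for exactly gamma patterns")


# Example: the plug b = (3, 10, 15) from Fig. 4.3.
b = Plug(eta_min=3, eta_max=10, gamma=15)
```

Because the specification admits no repository-specific information, the same Plug value can be reused, unchanged, across different graph data sources.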

4.4.3 PnP Engine

The PnP engine is at the core of a PnP interface. Given the plug specification b over a data repository R and a PnP template I, it is responsible for populating the Attribute and Pattern Panels in I with attributes and patterns, respectively, that satisfy b. To this end, the formal procedure to realize the PnP engine is outlined in Algorithm 4.1. Line 2 extracts the labels associated with nodes and links in R by traversing it. Given b and R, the SelectPatterns procedure in Line 3 is responsible for selecting the patterns for the Pattern Panel. Line 5 realizes the maintenance of attributes and patterns in a VQI as R evolves. Lastly, the Display procedure is invoked to display the attributes and patterns in their respective panels in I.


Algorithm 4.1 The PnP engine.

Require: Graph repository R, plug b = (ηmin, ηmax, γ), PnP template I, maintainStatus;
Ensure: Populated PnP template I;
1: if maintainStatus is 'False' then
2:   A ← SelectAttributes(R)
3:   P ← SelectPatterns(R, b)
4: else
5:   (A, P) ← Maintain(I)
6: end if
7: I ← Display(I, P, A)
8: return I

Observe that a PnP engine does not enforce ingestion of query logs, thereby enabling the realization of a PnP interface in the absence of query logs. Selecting and maintaining attributes from R is straightforward. Hence, the core steps in Algorithm 4.1 are the selection and maintenance of patterns (Lines 3 and 5). Essentially, this step is the building block of PnP interfaces. In subsequent chapters, we shall discuss it in detail.
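As a rough illustration of how Algorithm 4.1 could be wired together in code, the sketch below mirrors its control flow in Python. SelectAttributes, SelectPatterns, Maintain, and Display are passed in as callables because their concrete realizations are repository-specific and are covered in later chapters; this is a sketch of the control flow only, not the authors' implementation.

```python
def pnp_engine(repository, plug, template, maintain_status=False,
               select_attributes=None, select_patterns=None,
               maintain=None, display=None):
    """Skeleton of Algorithm 4.1: populate a PnP template from a graph repository.

    The four callables mirror the procedures named in the algorithm; they are
    injected here because their implementations depend on the repository and
    the pattern-selection technique being used.
    """
    if not maintain_status:
        attributes = select_attributes(repository)        # Line 2
        patterns = select_patterns(repository, plug)       # Line 3
    else:
        attributes, patterns = maintain(template)          # Line 5
    return display(template, patterns, attributes)         # Line 7
```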

4.4.4 Play Mode

Once the PnP template is populated by the PnP engine with attributes and patterns in their respective panels for the selected data source R, it transforms into a PnP interface and moves to the “play” mode. In this mode, an end user is ready to use the VQI for formulating subgraph queries on R. To this end, a PnP interface can be installed on top of any graph query engine. Since a PnP interface is independent of any query engine, a visually constructed subgraph query can easily be passed to the underlying graph query engine for evaluation. In the sequel, we shall refer to a PnP template and a PnP interface interchangeably when the context is clear.

4.5 Benefits of PnP Interfaces

In this section, we discuss the benefits of the proposed PnP interface. Reconsider the design principles in Sect. 4.3 that aim to address the drawbacks of classical VQIs. A PnP interface enables the creation and maintenance of data-driven VQIs that can work with independent data sources. Specifically, the design of the PnP template and the PnP engine is independent of any data source, paving the way for realizing a PnP interface on any graph data source. In existing classical VQIs, the contents of the data panel are manually hard coded.


Table 4.1 Comparison between classical VQIs and the PnP interface

Criteria       | Gen 1 | Gen 2  | Gen 3       | PnP
Learnability   | High  | High   | High        | High
Flexibility    | Low   | Medium | Medium      | High
Robustness     | Low   | Medium | Medium      | Medium
Efficiency     | Low   | Medium | Low-High    | High
Memorability   | High  | High   | High        | High
Errors         | High  | Medium | Medium      | Low
Satisfaction   | Low   | Medium | Low-Medium  | High
Cognitive load | High  | Medium | Medium-High | Low

In contrast, in a PnP interface, the PnP engine selects the content in a data-driven manner, thereby facilitating data-driven construction and maintenance of a VQI. Furthermore, the SelectPatterns procedure in Algorithm 4.1 can be designed to select useful and cognitive load-aware patterns. The plug enables us to personalize the selection of these patterns according to user-defined size and number. Consequently, a PnP interface paves the way for a judicious selection of patterns that may spur effective bottom-up search beyond what is possible using classical VQIs. To elaborate further, we compare the potential of PnP interfaces with classical VQIs. Specifically, we revisit the qualitative comparison based on the usability criteria and cognitive load in Sect. 3.3. Table 4.1 summarizes the comparison. Note that the ratings for PnP interfaces in the table are relative to classical VQIs and not absolute values. Similar to classical VQIs, a PnP interface is easy to learn, especially because it provides a uniform structure, feel, and look independent of data sources. Once a user learns to use it for one dataset, she can then use it for all. Compared to classical VQIs, a PnP interface facilitates both top-down and bottom-up search, primarily due to the exposition of a more diverse set of cognitive load-aware patterns. This positively impacts the robustness, efficiency, errors, and satisfaction dimensions compared to classical VQIs. Consequently, a PnP interface imposes a lower intrinsic cognitive load on end users. In the sequel, we shall report the benefits of PnP interfaces w.r.t. efficiency, satisfaction, and cognitive load using user studies.

4.6 Conclusions

In this chapter, we formally introduce the PnP interface, a novel paradigm of data-driven construction and maintenance of visual subgraph query interfaces. A PnP interface is like a plug-and-play device that can be plugged into any kind of socket (i.e., graph data) and used. It is dynamically built from a high-level specification of pattern properties known as the plug.


This makes it highly portable, as a VQI can automatically be constructed for any application centered around graph data. It is worth noting that PnP interfaces go against the traditional mantra of VQI construction. As discussed in the preceding chapter, VQIs for graphs are traditionally manually constructed for each data source or domain. We argue that as more and more graph data sources become prevalent in a wide variety of domains, the plug-and-play approach minimizes the cost of development and maintenance of VQIs. The PnP interface paves the way for a single framework to automatically construct a subgraph query formulation interface for any domain or source involving graphs.

Reference

Z. Yuan, H.-E. Chua, Sourav S. Bhowmick, Z. Ye, W.-S. Han, B. Choi. Towards Plug-and-Play Visual Graph Query Interfaces: Data-driven Canned Pattern Selection for Large Networks. Proc. VLDB Endow., 14(11): 1979–1991, 2021.

5 The Building Block of PnP Interfaces: Canned Patterns

It is clear from the preceding chapter that the PnP engine plays a pivotal role in the realization of a PnP interface to facilitate top-down and bottom-up subgraph query formulation. The computationally challenging components of a PnP engine are the selection and maintenance of patterns (Lines 3 and 5 in Algorithm 4.1). We focus on these steps in the sequel. Intuitively, we can classify the patterns into two types, basic and canned. A basic pattern is a small pattern of size at most z (typically, z ≤ 3), such as an edge, a 2-path, or a triangle. End users are typically aware of these generic topologies as they are either building blocks of any graph-structured data or well known in a specific domain. On the other hand, a canned pattern is a subgraph of size larger than z. These larger patterns are highly desirable as they often reveal structures that are unique to the underlying graph data source, thereby furnishing representative objects that trigger efficient visual subgraph query formulation even when end users do not have a specific pattern in their head. In this chapter, we describe the desirable characteristics of these canned patterns. We begin by introducing them in Sect. 5.1. Next, we describe how to quantify them in Sects. 5.2–5.4. In the subsequent chapters, we shall describe the selection and maintenance of canned patterns in a PnP interface based on these characteristics. Table 5.1 describes notations related to canned patterns used in this book.

5.1 Characteristics of Canned Patterns

Given a graph repository and a visual subgraph query interface I, intuitively, the goal of the canned patterns P in I is to aid an end user to visually formulate queries quickly even when she may not have a specific pattern in her head to trigger the search. Hence, canned patterns not only expedite subgraph query construction due to the pattern-at-a-time formulation mode but also facilitate bottom-up search without browsing results of partially constructed subgraph queries. An ideal set of canned patterns yields a subset of patterns P′ ⊆ P that require minimal edits (e.g., node label modification, node/edge addition, and deletion) for the construction of a wide variety of query graphs.


Table 5.1 Key symbols related to canned patterns

Symbol                 | Definition
p                      | A canned pattern
P, P′                  | A set of canned patterns
scov(p, D), scov(P, D) | Subgraph coverage of p or P in database D
lcov(P, D)             | Label coverage of a pattern set P in database D
cov(p)                 | Coverage of a pattern p in a large network
div(p, P \ p)          | Diversity of a pattern p ∈ P
sim(pi, pj)            | Similarity of patterns pi and pj
cog(p)                 | Cognitive load of a pattern p
f_cov(P)               | Coverage of a pattern set P
f_div(P)               | Diversity of a pattern set P
f_cog(P)               | Cognitive load of a pattern set P

However, it is impractical to display a large number of canned patterns in a VQI, as doing so would not only make the interface overly complex but also force a user to browse a long list of patterns in order to determine which ones are best suited for her query. Hence, the number of canned patterns should be small, and the patterns should satisfy the following desirable characteristics (Huang et al. 2019; Yuan et al. 2021).

• High coverage. Intuitively, a canned pattern set should cover a large number of data graphs or regions of a graph. This ensures that a large number of subgraph queries can be visually constructed on these graphs by utilizing the pattern set. Note that a pattern set with high coverage does not prevent a user from visually formulating infrequent queries (i.e., queries with very few matches in the underlying graph repository). One may combine several patterns or edit one or more of them in the VQI to formulate a subgraph query that is infrequent or frequent in the underlying data source. Since high-coverage patterns occur in many data graphs or in many regions of a network, a variety of users with diverse interests in a variety of regions of data graphs can potentially use them to formulate diverse queries efficiently.

• High diversity. Every pattern p should ideally be diverse from every other pattern in P. This enables P to potentially serve a larger variety of queries. This is intuitive given the limited space available in a VQI for displaying canned patterns in practice. Populating a PnP interface with highly similar patterns does not add much value to query formulation but consumes precious space. A user can easily select one pattern p from a set of highly similar patterns and formulate any of the remaining similar patterns by modifying p using very few steps.


Fig. 5.1 Cognitive load and diversity of patterns

For example, consider the patterns in Fig. 5.1b and c. It is judicious to avoid selecting both simultaneously for display as they are structurally very similar. A user can easily make minor edits to one of the patterns to create the other. In other words, exposing highly similar canned patterns to end users in PnP interfaces does not bring substantial benefits w.r.t. visual subgraph query formulation.

• Low cognitive load. Canned patterns are consumed by humans, not machines. Hence, it is paramount to consider the impact of diverse patterns on the human cognitive system during visual query formulation. Specifically, a user may inspect a canned pattern and decipher its edge relationships in order to determine if it is useful for her query. Research in cognitive psychology and network visualization has consistently emphasized the adverse cognitive impact of graph size, density, and edge crossings on various graph-specific tasks (e.g., finding the shortest path in a node-link diagram) (Yoghourdjian et al. 2018; Yoghourdjian et al. 2021; Huang et al. 2009). Consequently, a relatively complex canned pattern with high density or multiple edge crossings may demand substantial cognitive effort from an end user to decide if it can aid in her query formulation. Hence, it is desirable for the patterns to impose a low cognitive load on an end user. For example, intuitively a user may take more time to visually inspect the pattern in Fig. 5.1a to determine if it is useful for formulating her query compared to the one in Fig. 5.1b or Fig. 5.1c. This is because the former pattern is denser and has multiple edge crossings, unlike the latter two. Hence, patterns such as the one in Fig. 5.1a should be avoided in a VQI. Indeed, canned patterns displayed in several industrial VQIs (e.g., PubChem, DrugBank) have a low cognitive load.

Most data management tools and techniques are designed for “specialists” and are oblivious to the cognitive impact on end users. Hence, a proponent of such traditional tools and techniques may argue that cognitive load may not be important for specialists or domain experts. We argue that in a data-democratized world, the design of any graph querying framework should not be only for “specialists”. It is paramount to be cognizant of the psychological impact of graph data management frameworks on end users.


The chasm between the potential of visual query interfaces and their deployment in practice will persist if we ignore factors related to cognitive psychology in our solution design. Note that advocating for patterns with a low cognitive load does not prevent an end user from visually formulating query graphs with complex topologies. One can easily formulate large and complex queries by combining two or more patterns (with possible edits).

In the next sections, we describe how to quantify the aforementioned characteristics in turn. We classify the quantification based on the two types of graph repository introduced in Chap. 2 (i.e., a graph database containing a large collection of small- or medium-sized data graphs, and a large network).

Cognitive load-aware canned patterns play a pivotal role in visual subgraph query interfaces by facilitating not only efficient query formulation but also bottom-up search.

5.2 Quantifying Coverage

Coverage of patterns in graph databases. We consider two types of coverage, namely subgraph coverage and label coverage, for patterns and vertex/edge labels, respectively. The subgraph coverage of a pattern p ∈ P is defined as $scov(p, D) = \frac{|G_p|}{|D|}$ where $G_p \subseteq D$ is the set of data graphs containing p. Consequently, the subgraph coverage of a set of canned patterns P is given as $scov(P, D) = \frac{|\bigcup_{p \in P} G_p|}{|D|}$. On the other hand, labeled data graphs may contain a variety of different vertex/edge labels. (In graphs where only vertices are labeled, an edge label can be considered as a concatenation of the labels of its end vertices; see Chap. 2.) Let L(e, D) be the set of graphs in D containing edges having the same label as e, and $L(E_P, D) = \bigcup_{e_i \in E_P} L(e_i, D)$. Then the label coverage of P w.r.t. D is given as $lcov(P, D) = \frac{|L(E_P, D)|}{|D|}$.

Coverage of patterns in large networks. Let $S(p) = \{s_1, \cdots, s_n\}$ be a bag of subgraphs in G isomorphic to p (i.e., embeddings of p), where the vertex labels in G = (V, E) and $p = (V_p, E_p)$ are assumed to be the same and $s_i = (V_i, E_i)$. We say an edge $e \in E_i$ is covered by p. The coverage of p is given as $cov(p) = |\bigcup_{i \in [1, |S(p)|]} E_i| / |E|$. Similarly, $cov(P) = |E^\dagger| / |E|$ (i.e., $f_{cov}(P)$), where every $e \in E^\dagger$ is covered by at least one p ∈ P. Since |E| is constant for a given G, coverage can be rewritten as $cov(p) = |\bigcup_{i \in [1, |S(p)|]} E_i|$ and $cov(P) = |E^\dagger|$. Note that for large networks we do not consider label coverage. This is because for such a repository we are interested in unlabeled canned patterns instead of labeled canned patterns, for reasons detailed in Chap. 7.
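As an illustration of the two database-level coverage measures, the sketch below computes scov and lcov for a database of vertex-labeled networkx graphs (each node is assumed to carry a "label" attribute). It relies on networkx's VF2 matcher; note that subgraph_is_isomorphic tests node-induced subgraph isomorphism with label matching, which is used here as a reasonable stand-in for the containment relation behind G_p, not necessarily the exact semantics used in the book.

```python
import networkx as nx
from networkx.algorithms import isomorphism as iso


def contains(g, p):
    """True if the pattern p occurs in the data graph g (label-preserving)."""
    gm = iso.GraphMatcher(g, p, node_match=iso.categorical_node_match("label", None))
    return gm.subgraph_is_isomorphic()


def scov(patterns, db):
    """Subgraph coverage: fraction of data graphs containing at least one pattern."""
    covered = {i for i, g in enumerate(db) if any(contains(g, p) for p in patterns)}
    return len(covered) / len(db)


def edge_labels(g):
    """Edge labels as unordered pairs of end-vertex labels (vertex-labeled graphs)."""
    return {frozenset((g.nodes[u]["label"], g.nodes[v]["label"])) for u, v in g.edges}


def lcov(patterns, db):
    """Label coverage: fraction of data graphs sharing an edge label with the pattern set."""
    pattern_labels = set().union(*(edge_labels(p) for p in patterns))
    covered = [g for g in db if edge_labels(g) & pattern_labels]
    return len(covered) / len(db)
```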


5.3 Quantifying Diversity

Diversity of patterns in graph databases. Given the patterns p, p1, and p2, we say p1 is more diverse from (resp. similar to) p compared to p2 if GED(p1, p) > GED(p2, p) (resp. GED(p1, p) < GED(p2, p)), where GED(·) is the graph edit distance (Riesen et al. 2007). Consequently, the diversity of p is $div(p, P \setminus p) = \min_{p_i \in P \setminus p} GED(p, p_i)$. Similarly, the diversity of P is given as $f_{div}(P) = \min_{p \in P} div(p, P \setminus p)$.

Diversity of patterns in large networks. The diversity of p w.r.t. P in a large network G is the inverse of the similarity of p. In particular, the similarity of a set of canned patterns P is denoted as $f_{sim}(P) = \sum_{(p_i, p_j) \in P \times P} sim(p_i, p_j)$, where $sim(p_i, p_j)$ is the similarity between patterns $p_i$ and $p_j$ (detailed in Sect. 7.4). Note that any superior and efficient network similarity technique can be adapted to compute $sim(p_i, p_j)$.
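The following sketch mirrors these definitions, using networkx's exact graph_edit_distance as the GED black box (any faster exact or approximate GED routine could be substituted, as noted above); f_sim takes the pairwise similarity function as an argument since its concrete form is deferred to Sect. 7.4.

```python
import itertools
import networkx as nx


def div_pattern(p, others):
    """div(p, P \\ p): GED to the closest other pattern (larger means more diverse).

    graph_edit_distance is exact and can be slow for larger patterns; a node_match
    argument could be added for label-aware editing."""
    return min(nx.graph_edit_distance(p, q) for q in others)


def f_div(patterns):
    """Diversity of a pattern set: the smallest pairwise diversity."""
    return min(div_pattern(p, [q for q in patterns if q is not p]) for p in patterns)


def f_sim(patterns, sim):
    """Similarity of a pattern set in the large-network setting.

    The book sums sim over P x P; distinct unordered pairs are used here, which
    differs only by a constant factor and the self-similarity terms."""
    return sum(sim(p, q) for p, q in itertools.combinations(patterns, 2))
```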

5.4 Quantifying Cognitive Load

Quantifying the coverage and diversity of patterns with crisp mathematical definitions is realizable as these concepts depend only on the content and structure of the underlying data graphs. However, recall from Chap. 2 that the cognitive load of a pattern is influenced both by the pattern-of-interest and by an end user's cognitive system. Consequently, it is significantly harder to quantify. Fortunately, our goal is not to measure precisely the cognitive load of a pattern on an end user. Instead, in a PnP interface, we are interested in comparing a pair of candidate patterns based on their estimated cognitive load scores and choosing the one with the lower score as a displayed pattern on the VQI. Hence, any cognitive load formula should be efficient to compute so that a PnP interface does not invest considerable time computing it when selecting canned patterns. Our cognitive load computation techniques are influenced by research on the impact of various connectivity-related tasks in node-link diagrams on cognitive load (Huang et al. 2009; Yoghourdjian et al. 2018, 2021). In particular, the recent study in Yoghourdjian et al. (2021) not only utilized network topology features but also physiological measures (i.e., brain electrical activity, heart rate, and pupil size) to understand the relationship between cognitive load and connectivity-based tasks (i.e., finding shortest paths). The study reported the following results. First, people have significant difficulty in finding shortest paths in high-density node-link diagrams of scale-free networks with more than 50 nodes, and even in sparse graphs with more than 100 nodes. That is, the usefulness of node-link diagrams for visualizing graphs rapidly deteriorates as the number of nodes and edges increases. Second, global network features such as the number of crossings (i.e., the number of node-link crossings and the number of link-link crossings) have a greater impact on the cognitive load of individuals than features of the shortest path such as straightness or length. Third, physiological measures of load initially increase with task hardness but decrease after that, possibly due to participants of the study giving up on the task. We exploit these results in the context of canned patterns.


At first glance, it may seem that since a canned pattern typically has significantly fewer than 50 nodes, it should have a negligible cognitive impact on end users. However, observe that the aforementioned studies assume that a node-link diagram is displayed on a standard monitor. For instance, in the study by Yoghourdjian et al. (2021), the node-link diagrams are displayed in a 1920 × 1080 pixel area on a 22-inch HP monitor. In contrast, in a VQI, patterns are typically displayed as thumbnails in small multiples to support rapid browsing of a list of such patterns. Consequently, the pixel area for a pattern is significantly smaller. Packing even a dozen nodes and edges into such a small area may impose a significant cognitive load on end users, especially for rapid browsing during query formulation.

Cognitive load of patterns in graph databases. We quantify the cognitive load of a pattern p in a graph database as $cog(p) = |E_p| \times \rho_p$, where $\rho_p = \frac{2|E_p|}{|V_p|(|V_p|-1)}$ is the density of p. Then the cognitive load of a pattern set P is given as $f_{cog}(P) = \max_{p \in P} cog(p)$. This measure follows from the intuition that the cognitive load increases with density, as larger and denser graphs overload the human perception and cognitive systems, resulting in poor performance on relatively complex connectivity-based tasks such as identifying the relationship between different vertices (Huang et al. 2009; Yoghourdjian et al. 2018; Yoghourdjian et al. 2021). Observe that in this measure we ignore the crossing number. This is a design choice that is primarily driven by the nature of publicly accessible graph databases. The majority of these large databases (e.g., AIDS, PubChem, and eMolecule) contain information about chemical compounds or molecules, which typically do not result in many crossings due to the chemical properties of interacting atoms. In case a graph database has data graphs with crossings, we can use the measure defined below in the context of large networks.

We justify the choice of the above measure for estimating cognitive load with a user study. We evaluate three putative measures (F1 to F3) for determining the cognitive load. Given a pattern $p = (V_p, E_p)$, $F1 = |E_p| \times \frac{2|E_p|}{|V_p|(|V_p|-1)}$, $F2 = \sum_{v \in V_p} deg(v) = 2|E_p|$ where deg(v) is the degree of vertex v, and $F3 = \frac{2|E_p|}{|V_p|}$. Observe that F2 is a degree-based measure. We recruited 15 unpaid volunteers; each participant is given a pattern p and query Q pair, one at a time, on a VQI and asked to determine if p ⊆ Q (i.e., whether p is useful for formulating Q) by clicking a yes/no button. We use a performance-based measure (i.e., response time) to assess cognitive load. Specifically, the time taken from viewing p to clicking the button is recorded. Two datasets are used for the study. For each dataset, 6 queries (size in the range [18–39]) and 6 patterns of different topologies and cognitive loads are given to each participant in random order. In particular, |V| and |E| of a pattern vary in the ranges [4–13] and [3–13], respectively. Note that not all p ⊆ Q, and an incorrect decision on a (p, Q) pair from a participant is ignored (97.2% of decisions are correct). This is to minimize cases where a participant clicks the button without checking. For each dataset, the patterns are ranked in increasing response time for each participant, where the smallest rank indicates the shortest time taken and implies the lowest cognitive effort needed to perform the task. Then, for a pattern $p_i$, an average rank is obtained by averaging the ranks assigned to $p_i$.
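The three putative measures can be transcribed directly as follows; cog_db corresponds to F1, the measure adopted for graph databases, and f_cog(P) takes the maximum over the set.

```python
import networkx as nx


def density(p):
    """rho_p = 2|E_p| / (|V_p|(|V_p| - 1)) for an undirected pattern p."""
    n, m = p.number_of_nodes(), p.number_of_edges()
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def cog_db(p):
    """F1: cog(p) = |E_p| * rho_p, the measure adopted for graph databases."""
    return p.number_of_edges() * density(p)


def f2(p):
    """F2: sum of vertex degrees, which equals 2|E_p|."""
    return 2 * p.number_of_edges()


def f3(p):
    """F3: average degree, 2|E_p| / |V_p|."""
    return 2 * p.number_of_edges() / p.number_of_nodes()


def f_cog_db(patterns):
    """Cognitive load of a pattern set: the maximum over its members."""
    return max(cog_db(p) for p in patterns)


# Example: a triangle has cog = 3 edges x density 1.0 = 3.0.
print(cog_db(nx.complete_graph(3)))
```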

Fig. 5.2 Comparison of cog measures for graph databases (Kendall tau of the Actual ranking vs. F1, F2, and F3 on AIDS and PubChem)

Finally, the overall rank (Actual rank) of the pattern set for a given dataset is obtained by ordering the patterns in increasing average rank. Note that we do not use the average time taken to compute the overall rank, as rank reversal may occur due to outliers (e.g., an extremely long time taken by a participant). Further, the patterns are given another ranking in increasing F1 (corr. F2 and F3). Figure 5.2 plots the Kendall rank correlation coefficient of the actual ranks w.r.t. the ranks obtained using F1, F2, and F3. Observe that F1 (avg. 0.8) is a more effective measure compared to F2 (avg. 0.28) and F3 (avg. 0.78).

Cognitive load of patterns in large networks. Edge crossings occur frequently in large networks and hence cannot be ignored in this context. In fact, Huang and colleagues examined the effect of edge crossings on the mental load of users and found that cognitive load displays a relationship with edge crossings that resembles the logistic curve (Huang and Huang 2010), $f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$, where L is the curve's maximum value, $x_0$ is the x value of the sigmoid's midpoint, and k is the logistic growth rate (Zeide 1993).

Lemma 5.1 The crossing number (i.e., number of edge crossings) of any simple graph G = (V, E) with at least 3 vertices satisfies cr ≥ |E| − 3|V| + 6.

Proof (Sketch) Consider a graph G = (V, E) with cr crossings. Since each crossing can be eliminated by removing an edge from G, a graph with |E| − cr edges and |V| vertices containing no crossings (i.e., a planar graph) can be obtained. Since |E| ≤ 3|V| − 6 for a planar graph (by Euler's formula), we have |E| − cr ≤ 3|V| − 6 for |V| ≥ 3. Rewriting the inequality, we get cr ≥ |E| − 3|V| + 6.

Hence, the cognitive load of a pattern p is computed based on its size ($sz_p = |E_p|$), density ($d_p = \frac{2|E_p|}{|V_p|(|V_p|-1)}$), and edge crossings ($cr_p$). If p is planar, then $cr_p = 0$. Otherwise, $cr_p = |E_p| - 3|V_p| + 6$. We then model the normalized cognitive load function according to the logistic curve:

$$cog(p) = \frac{1}{1 + e^{-0.5 \times (sz_p + d_p + cr_p - 10)}} \qquad (5.1)$$

Parameters of cog(p) are set empirically to ensure a uniform distribution within the range [0, 1]. Finally, the cognitive load of a pattern set P (i.e., $f_{cog}(P)$) is given as $\sum_{p \in P} cog(p)$.
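Equation (5.1) can be transcribed as follows, using the crossing-number bound from Lemma 5.1 together with networkx's planarity test; the constants 0.5 and 10 are the empirically set parameters from Eq. (5.1).

```python
import math
import networkx as nx


def cog_network(p):
    """Normalized cognitive load of a pattern p in a large network, Eq. (5.1)."""
    n, m = p.number_of_nodes(), p.number_of_edges()
    sz = m                                          # size: number of edges
    d = 2 * m / (n * (n - 1)) if n > 1 else 0.0     # density
    is_planar, _ = nx.check_planarity(p)
    cr = 0 if is_planar else m - 3 * n + 6          # crossing-number bound (Lemma 5.1)
    return 1.0 / (1.0 + math.exp(-0.5 * (sz + d + cr - 10)))


def f_cog_network(patterns):
    """Cognitive load of a pattern set: sum of member loads."""
    return sum(cog_network(p) for p in patterns)


# Example: a 5-clique (non-planar, 10 edges) versus a 5-cycle (planar, 5 edges).
print(cog_network(nx.complete_graph(5)))  # higher load
print(cog_network(nx.cycle_graph(5)))     # lower load
```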


Fig. 5.3 Graphs used for assessing cognitive load

We now justify the choice of the above cognitive load measure with a user study. Specifically, we compare several ways of measuring the cognitive load of a pattern p in large networks, namely

$$f_{cog1} = \frac{1}{3} \sum_{x \in \{sz_p, d_p, cr_p\}} (1 - e^{-x}) \qquad (5.2)$$

$$f_{cog2} = \frac{1}{1 + e^{-0.5 \times (sz_p + d_p + cr_p - 10)}} \qquad (5.3)$$

$$f_{cog3} = sz_p + d_p + cr_p \qquad (5.4)$$

$$f_{cog4} = sz_p \times d_p \qquad (5.5)$$

$$f_{cog5} = cr_p \qquad (5.6)$$

Observe that $f_{cog4}$ is the measure used for graph databases. 20 volunteers were asked to rank the visual representations of six graphs (Fig. 5.3) of different sizes and topologies in terms of the cognitive effort required to interpret the edge relationships in these graphs. A “ground truth” ranking for these graphs is obtained based on the average ranks assigned by the volunteers. Then, the graphs are ranked according to the five cognitive load measures and compared against the ground truth using Kendall's $\tau = \frac{n_c - n_d}{0.5 \times n(n-1)}$, where n is the number of observations and $n_c$ and $n_d$ are the numbers of concordant and discordant pairs, respectively. The measures $f_{cog2}$ and $f_{cog3}$ achieve the highest τ = 1. We select $f_{cog2}$ as the cognitive load measure since


it is in the range of [0, 1] and facilitates the easy formulation of a non-negative and nonmonotone submodular pattern score function that we exploit for efficient canned pattern selection (detailed in Chap. 7).

5.5 Conclusions

In this chapter, we describe the desirable characteristics of canned patterns (i.e., patterns with a size larger than 3), which are the building blocks of any PnP interface. These patterns pave the way for enabling end users to perform bottom-up subgraph query formulation in addition to the top-down formulation. Given that the canned patterns are consumed by humans during query formulation and not machines, we emphasize the importance of low cognitive load as one of the desirable features along with the traditional characteristics such as high coverage and diversity. We describe how to quantify these three characteristics for the two categories of graph-structured data that we focus on in this book. In the sequel, we shall describe how these measures are exploited to automatically select high-quality canned patterns for display on a PnP interface.

References

K. Huang, et al. CATAPULT: Data-driven Selection of Canned Patterns for Efficient Visual Graph Query Formulation. In SIGMOD, 2019.
W. Huang, P. Eades, S.-H. Hong. Measuring Effectiveness of Graph Visualizations: A Cognitive Load Perspective. Information Visualization, 8(3), 2009.
W. Huang, M. Huang. Exploring the Relative Importance of Crossing Number and Crossing Angle. In VINCI, 2010.
K. Riesen, M. Neuhaus, H. Bunke. Bipartite Graph Matching for Computing the Edit Distance of Graphs. In GbRPR, 2007.
V. Yoghourdjian, D. Archambault, S. Diehl, T. Dwyer, K. Klein, H. C. Purchase, H.-Y. Wu. Exploring the Limits of Complexity: A Survey of Empirical Studies on Graph Visualization. Visual Informatics, 2(4), 2018.
V. Yoghourdjian, Y. Yang, T. Dwyer, L. Lee, M. Wybrow, K. Marriott. Scalability of Network Visualisation from a Cognitive Load Perspective. IEEE Trans. Vis. Comput. Graph., 27(2): 1677–1687, 2021.
Z. Yuan, H.-E. Chua, Sourav S. Bhowmick, Z. Ye, W.-S. Han, B. Choi. Towards Plug-and-Play Visual Graph Query Interfaces: Data-driven Canned Pattern Selection for Large Networks. Proc. VLDB Endow., 14(11): 1979–1991, 2021.
B. Zeide. Analysis of Growth Equations. For. Sci., 39(3): 594–616, 1993.

6 Pattern Selection for Graph Databases

In the preceding chapters, we emphasize patterns (basic and canned) as the basic building blocks of PnP interfaces. In this chapter, we present a state-of-the-art technique for selecting them to aid the construction of PnP interfaces for a large collection of small- or medium-sized data graphs (i.e., a graph database). Recall that such datasets are prevalent nowadays in a variety of domains such as cheminformatics, bioinformatics, drug discovery, and computer vision. To this end, we first present a framework called CATAPULT (Huang et al. 2019) (Canned pAttern selecTion for fAst graPh qUery formuLaTion) that is at the core of addressing this problem. Then, we describe how it is utilized as the building block of AURORA, a novel PnP interface for graph databases (Bhowmick et al. 2020).

Briefly, CATAPULT comprises three key components, namely small graph clustering, a cluster summary graph (CSG) generator, and a canned pattern selector (Fig. 6.1). Given a data graph collection, the small graph clustering module clusters the underlying data graphs using features such as frequent subtrees and maximum (connected) common subgraphs. Then, each cluster is summarized as a cluster summary graph (CSG) by “integrating” all data graphs in that cluster. Finally, the canned pattern selector greedily generates candidate patterns from the summarized CSGs in lieu of the underlying data graphs, as the number of CSGs is significantly smaller than the number of data graphs. In particular, a pattern score that is sensitive to the coverage, diversity, and cognitive load of patterns is employed to select suitable canned patterns within the plug specification for display on the VQI. CATAPULT also has a sampler component (eager and lazy samplers) to tackle very large repositories. Specifically, it judiciously samples data graphs from which canned patterns are selected. Our experimental study with real-world visual subgraph query interfaces reveals that CATAPULT can reduce the number of steps taken to formulate a query by up to 85.7% and, as a result, make query formulation more efficient. In summary, this chapter makes the following contributions.


Fig. 6.1 The CATAPULT framework

Table 6.1 Key symbols for this chapter

Symbol          | Definition
Gc = (Vc, Ec)   | Closure graph
GED(·)          | Graph edit distance operator
S ∈ S           | A cluster summary graph (CSG) in a set of CSGs S
CW              | Cluster weight
ELW             | Edge label weight
Ci ∈ C          | A graph cluster in a set of graph clusters
N               | Cluster size threshold
k               | Number of seeds for k-means clustering
L               | Candidate pattern library
ccov(·)         | Cluster coverage of a pattern set
B               | A set of basic patterns
we              | Weight of an edge during canned pattern selection
sp              | Pattern score of a pattern p

1. We describe CATAPULT, an end-to-end canned pattern selection framework that can be utilized to populate a set of canned patterns in any visual subgraph query interface, independent of domains and data sources.
2. We formally propose the canned pattern selection problem and present a technique to mine canned patterns from a graph database.
3. We describe a simple technique to select a set of basic patterns from the graph database.
4. Using real-world data graph repositories and visual subgraph query interfaces, we show the superiority and applicability of our CATAPULT framework compared to state-of-the-art canned pattern selection techniques.
5. We present an end-to-end PnP interface framework called AURORA that exploits CATAPULT and the selected basic patterns to realize the vision of PnP interfaces for graph databases.


The rest of the chapter is organized as follows. We begin with background knowledge in Sect. 6.1 that is necessary to understand the CATAPULT framework. We formally define the canned pattern selection problem in Sect. 6.2. We present the details of the CATAPULT framework in Sects. 6.4–6.5. We describe the procedure to select basic patterns in Sect. 6.6. Section 6.7 details the experimental results. We describe AURORA in Sect. 6.8. The last section concludes the chapter. The key notations specific to this chapter are given in Table 6.1.

6.1 Closure Graph

In this section, we introduce the notion of a closure graph (He et al. 2006), which we shall utilize for canned pattern selection. Given a database of small- or medium-sized data graphs D, recall from Chap. 2 that we assign a unique index (i.e., id) to each data graph. A data graph G with index i is denoted as Gi. A closure graph is a generalized graph generated by performing a union on the structures of a set of graphs (He et al. 2006). We first review two related concepts, namely graph extension and graph mapping.

Graph extension allows the “integration” of graphs of varying sizes into a single graph, referred to as the extended graph (denoted by G* = (V*, E*)), by inserting dummy vertices or edges with a special label ε such that every vertex and edge is represented in G*. For instance, consider the set of graphs in Fig. 6.2a. G1 is extended to G1* in Fig. 6.2b by adding a dummy vertex ε and an edge that connects vertex C with it. Given two extended graphs G1* and G2*, a generalized graph (Fig. 6.2c) can be obtained by mapping their vertices and edges using the approach in He et al. (2006). Formally, given two extended graphs G1* = (V1*, E1*) and G2* = (V2*, E2*), the graph mapping between G1* and G2* is given as a bijection φ: G1* → G2* where (i) for every v ∈ V1*, φ(v) ∈ V2* and at least one of v and φ(v) is not dummy, (ii) l1(v) = l2(φ(v)) if both v and φ(v) are not dummy, and (iii) for every e = (v1, v2) ∈ E1*, φ(e) = (φ(v1), φ(v2)) ∈ E2* and at least one of e and φ(e) is not dummy.

Given two extended graphs G1* and G2* and a mapping φ between them, a vertex and an edge closure can be obtained by performing an element-wise union of the attribute values of each vertex and each edge in the two graphs, respectively. Then the closure graph of G1* and G2* is a labeled graph Gc = (Vc, Ec) where Vc is the vertex closure of V1* and V2* and Ec is the edge closure of E1* and E2*. The labels take the form A{i1, ..., in} for vertices and {j1, ..., jm} for edges, where A is a vertex name, {i1, ..., in} is the set of indices of graphs containing vertex A, and {j1, ..., jm} is the set of indices of graphs containing an edge between the connected vertex pair. For example, in Fig. 6.2c, vertex C{1} in G1* is mapped to vertex C{2} in G2*. In the closure graph (Fig. 6.2d), these vertices are replaced by a vertex C{1,2}. Note that these closures may contain attribute values ε corresponding to a dummy vertex or edge. We remove dummy labels from closure graphs. For example, in Fig. 6.2c, vertex ε{ε} is mapped to vertex N{2}. In the closure graph (Fig. 6.2d), these vertices are replaced by a vertex N{2}.
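To make the construction concrete, the sketch below computes the closure of two vertex-labeled graphs given a vertex mapping φ. The mapping itself, which is the hard part and is computed by the approach of He et al. (2006), is simply taken as input, and the bookkeeping of dummy vertices is simplified by tagging each closure vertex and edge with the indices of the data graphs that contain it (so dummy labels never need to be materialized).

```python
import networkx as nx


def closure_graph(g1, idx1, g2, idx2, phi):
    """Closure of two vertex-labeled graphs under a vertex mapping phi: V(g1) -> V(g2).

    Vertices of g1 not in phi (and vertices of g2 not hit by phi) correspond to dummy
    counterparts in the other graph. Each closure vertex/edge records the indices of
    the data graphs in which it occurs, mirroring the A{i1,...,in} labels in the text.
    """
    gc = nx.Graph()
    for v in g1.nodes:                       # closure vertices coming from g1
        key = ("pair", v) if v in phi else ("g1", v)
        gc.add_node(key, label=g1.nodes[v]["label"], graphs={idx1})
        if v in phi:                          # mapped to a non-dummy vertex of g2
            gc.nodes[key]["graphs"].add(idx2)
    for u in g2.nodes:                        # g2 vertices matched only to dummies
        if u not in set(phi.values()):
            gc.add_node(("g2", u), label=g2.nodes[u]["label"], graphs={idx2})

    inv = {u: v for v, u in phi.items()}      # inverse mapping V(g2) -> V(g1)

    def key1(v):                              # closure vertex for a g1 vertex
        return ("pair", v) if v in phi else ("g1", v)

    def key2(u):                              # closure vertex for a g2 vertex
        return key1(inv[u]) if u in inv else ("g2", u)

    # Edge closure: union of the two edge sets, tagging each edge with its sources.
    for a, b in g1.edges:
        gc.add_edge(key1(a), key1(b), graphs={idx1})
    for a, b in g2.edges:
        u, v = key2(a), key2(b)
        if gc.has_edge(u, v):
            gc[u][v]["graphs"].add(idx2)
        else:
            gc.add_edge(u, v, graphs={idx2})
    return gc
```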


Fig. 6.2 An example of closure graph

6.2 Canned Pattern Selection Problem

Intuitively, given a graph database D, a visual subgraph query interface I (i.e., a PnP template), and a plug b, the canned pattern selection problem aims to select a set of patterns P satisfying b from D to display on I, by maximizing the coverage and diversity and minimizing the cognitive load of P (recall from Chap. 5).

Definition 6.1 Given a graph database D, a visual subgraph query interface I, and a plug b = (ηmin, ηmax, γ), the goal of the canned pattern selection problem is to find a set of canned patterns P from D that satisfies the following:

max scov(P, D)    max lcov(P, D)    max div(p, P \ p)    min cog(p)
s.t. p ∈ P, p ⊆ G, G ∈ D


where |P| = γ and $\frac{\gamma}{\eta_{max} - \eta_{min} + 1}$ is the maximum number of patterns for each k-sized pattern, k ∈ [ηmin, ηmax], ηmin > 2.

Remark The canned pattern selection problem is NP-hard (Huang et al. 2019). Observe that it aims to find patterns of size greater than 2. Patterns of smaller size (i.e., z = 2) are basic patterns (e.g., a labeled edge,¹ a 2-path) in our VQI and are computed after the generation of canned patterns. Specifically, in our VQI, we select the top-m basic patterns based on their support, as detailed in Sect. 6.6.

¹ A labeled edge is an edge with labeled vertices.

6.3 The CATAPULT Framework

It is prohibitively expensive to iteratively evaluate subgraphs in each data graph G ∈ D w.r.t. coverage, diversity, and cognitive load in order to compute P. At first glance, it may seem that frequent subgraphs (Ramraj and Prabhakar 2015) can be utilized as canned patterns as these subgraphs may have high coverage. However, subgraph queries are not necessarily frequent in nature, as users may frequently pose infrequent subgraph queries (Bhowmick et al. 2016). Algorithm 6.2, which is a manifestation of the framework in Fig. 6.1, outlines the CATAPULT approach to tackle this problem. It first clusters D based on feature and topological similarities (Lines 1–2) and then constructs a cluster summary graph (CSG) S ∈ S for the data graphs in each cluster by utilizing the notion of closure graph (He et al. 2006) (Line 3). Next, it maintains two types of weights related to each cluster and to labeled edges, namely the cluster weight (CW) and the edge label weight (ELW) (Lines 4–5). The former is the ratio of the number of graphs in a cluster to that in the database and is a measure of the importance of a cluster and its CSG. A pattern that is derived from a CSG with a large cluster weight is more likely to achieve a higher coverage compared to one derived from a CSG with a small cluster weight. The latter measures the global occurrence of a labeled edge in the database. Lastly, CATAPULT automatically selects the canned patterns P (Line 6) from the CSGs w.r.t. the plug b by first generating weighted CSGs using ELW and then selecting patterns from them by considering subgraph coverage (based on CW), label coverage, diversity, and cognitive load. Intuitively, these features are utilized as follows.

Subgraph Coverage. Coverage of a canned pattern set P′ increases if every subsequent pattern pi added to P′ is derived from graphs in D that are not yet covered by any existing pattern pj ∈ P′. For example, consider P′ and a graph database D partitioned into mutually exclusive graph clusters C = {C1, C2, C3, C4, C5, C6} where |Ci| = 10 for every Ci ∈ C (i.e., every cluster contains 10 graphs) and D = ⋃ C. Then the cluster weight (CW) is cwi = |Ci| / |D| = 10/60 (identical for all 6 clusters). Suppose P′ covers C1, C2, C4 in D. Consider two candidate patterns p1′ and p2′ covering C1, C2, C3 and C2, C3, C5, respectively.


Algorithm 6.2 CATAPULT
Require: Graph database D, plug b = (ηmin, ηmax, γ);
Ensure: Canned pattern set P;
1: C_coarse ← CoarseClustering(D) /* Algorithm 6.3 */
2: C_fine ← FineClustering(C_coarse) /* Algorithm 6.4 */
3: S ← ClusterSummaryGraphSet(C_fine)
4: ELW ← GetEdgeLabelWeight(D)
5: CW ← GetGraphClusterWeights(C_fine)
6: P ← FindCannedPatternSet(ELW, CW, S, b) /* Algorithm 6.5 */

CATAPULT preferentially selects p2′ to be added to P′ since it would increase the subgraph coverage to 5 clusters² as opposed to 4 if p1′ is chosen.

Label Coverage. Intuitively, we add patterns to P that result in a higher label coverage. For example, let the sets of unique edge labels of P′ and D be {(0, 1), (0, 2), (1, 3)} and {(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3), (2, 3), (3, 4)}, respectively, where (l(vi), l(vj)) denotes the label of edge (vi, vj). Consider two candidate patterns p1′ and p2′ having unique edge labels {(0, 1), (0, 3), (0, 4), (1, 3), (1, 4)} and {(0, 1), (0, 2), (0, 3), (1, 2), (1, 3)}, respectively. Then CATAPULT preferentially selects p1′, as P′ then covers 6 unique labels as opposed to 5 if p2′ is selected.

Diversity. Given a pattern set P′ containing three patterns p1, p2, p3 and two candidate patterns p1′ and p2′, let div(P′, p1′) = min{GED(p1′, p1), GED(p1′, p2), GED(p1′, p3)} = 5 and div(P′, p2′) = min{GED(p2′, p1), GED(p2′, p2), GED(p2′, p3)} = 7. Then, p2′ has a greater pattern set diversity compared to p1′ and is preferentially selected. Note that we treat GED computation as a black box; any state-of-the-art algorithm can be used to speed up this part.

Cognitive Load. Given two candidate patterns p1′ and p2′, p1′ is preferred over p2′ if p1′ has a lower cognitive load than p2′.

Lastly, to tackle very large graph databases, we extend the framework by sampling data graphs judiciously and then generating canned patterns from the samples (Sect. 6.4.3).

Remark Recall that patterns with high coverage do not necessarily support efficient visual subgraph query formulation. Hence, we reemphasize that the goal of CATAPULT is to automatically select canned patterns that have a low cognitive load and are sufficiently diverse to expedite the formulation of a variety of visual subgraph queries by a variety of users. Since users may formulate both frequent and infrequent queries in practice, CATAPULT does not make any restrictive assumption on the type of queries it supports or their result size. Hence, it selects patterns (frequent and infrequent) that can frequently assist in formulating queries.

2 Since all clusters contain the same number of graphs, we can count the number of clusters instead of the number of graphs.


Furthermore, CATAPULT is query log-oblivious as such log data may be unavailable especially in “cold start” cases. For instance, query logs may be unavailable for some remote public data source (e.g., AIDS) and cannot be exploited when a user downloads it to formulate queries over it. Additionally, there has to be a sufficient volume of such log data to be effective in canned pattern selection. Nevertheless, our canned pattern selection step (Line 6) can be extended to incorporate the frequency of patterns in past subgraph queries. Data management and analytics techniques have traditionally shied away from considering the impact of their solutions on the cognitive psychology of their end users. CATAPULT is the first platform in the literature which aims to select a collection of patterns by considering what their cognitive impact may be on their end users.
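Returning to the Subgraph Coverage illustration above, the preference CATAPULT applies amounts to a greedy marginal-gain rule over cluster coverage. A minimal sketch is shown below, where the cluster coverage of each candidate is represented simply as a set of cluster identifiers; this is an illustrative simplification of the weighted selection performed by Algorithm 6.5.

```python
def best_candidate(candidates, covered):
    """Pick the candidate pattern whose cluster coverage adds the most new clusters.

    `candidates` maps a candidate id to the set of clusters it covers;
    `covered` is the set of clusters already covered by the selected patterns.
    """
    return max(candidates, key=lambda c: len(candidates[c] - covered))


# The example from the text: P' covers C1, C2, C4; p1' covers {C1, C2, C3}, p2' covers {C2, C3, C5}.
covered = {"C1", "C2", "C4"}
candidates = {"p1'": {"C1", "C2", "C3"}, "p2'": {"C2", "C3", "C5"}}
print(best_candidate(candidates, covered))  # p2' (it raises coverage to 5 clusters instead of 4)
```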

6.4 Cluster Summary Graph (CSG) Generation

A large collection of small- or medium-sized data graphs is likely to contain groups of graphs having similar topology. These groups can be obtained via clustering and each group can be represented by a cluster summary graph (CSG). Subsequently, we aim to design techniques that enable us to select relevant canned patterns from these CSGs instead of directly computing them from D, which is computationally untenable. Here we describe the CSG generation process. In the next section, we shall elaborate on how canned patterns can be selected from these CSGs.

6.4.1 Small Graph Clustering

We aim to partition D into a set of graph clusters C = {C1, C2, ..., Cd}, where Ci ⊆ D and Ci ∩ Cj = ∅ for all i ≠ j, such that C maximizes a clustering property objective function f: C → R, i.e., find any C in $\arg\max_C f(C) = \{C \mid \forall C': f(C') \leq f(C)\}$. Unfortunately, the majority of graph clustering approaches focus on identifying “related” vertices in a single large graph (Schaeffer 2007). There is scant research on clustering a set of small- or medium-sized graphs (i.e., small graph clustering) (Günter and Bunke 2002; Schäfer and Mutzel 2016), and existing techniques can be categorized into two classes: feature vector-based and graph structure-based. The former uses graph properties or subgraph occurrences as a feature vector in a standard clustering algorithm. In contrast, the latter uses graph structures such as MCS or MCCS directly, resulting in clusters that are more intuitive and interpretable. However, these techniques are expensive. Hence, in CATAPULT, we explore a hybrid technique that integrates these two approaches to achieve high-quality clusters in a reasonable time.


Algorithm 6.3 CoarseClustering
Require: Graph database D;
Ensure: A set of graph clusters C;
1: T_all ← GenerateFrequentSubtrees(D);
2: T_sel ← SelectFrequentSubtrees(T_all);
3: for G_i ∈ D do
4:   Initialize R_i as a |T_sel|-dimensional zero vector;
5:   for Subtree T_j ∈ T_sel do
6:     if G_i contains T_j then
7:       Update the jth position of vector R_i to 1;
8:     end if
9:   end for
10: end for
11: C ← Clustering(R, D) /* R = {R_1, ..., R_|D|} */;

Given D, CATAPULT first assigns each data graph G ∈ D to an appropriate graph cluster C based on the distance between the frequent subtree³-based feature vector of G and the feature vector representative of C (i.e., the clustering property), where |C| = k (coarse clustering, Line 1 of Algorithm 6.2). Then, if |C| is larger than a threshold N, C is further decomposed into smaller clusters such that each sub-cluster has a size less than N and intra-cluster graphs have smaller topological distances, measured using MCCS (i.e., the clustering property), compared to inter-cluster data graphs (fine clustering, Line 2). Observe that coarse clustering is a feature vector-based approach whereas fine clustering is based on graph topology. Also note that the reason we further decompose “larger” clusters using fine clustering is that it reduces the size of CSGs and their generation cost in the subsequent step by operating on a smaller collection of similar data graphs. We now elaborate on coarse and fine clustering.

Coarse clustering. Coarse clustering (Algorithm 6.3) leverages frequent subtrees as feature vectors, which are connected acyclic subgraphs with support greater than or equal to a threshold value. In CATAPULT, frequent subtrees are generated (Line 1) using the approach in Chi et al. (2003) and are represented as canonical strings in two steps: (1) canonical tree generation via normalization and (2) conversion of the tree to the canonical string. Normalization of a labeled rooted tree is a bottom-up procedure based on the tree isomorphism algorithm in Aho et al. (1974). Given the original tree, it is performed level-by-level, bottom-up, using orders among subtrees at each level until the canonical form is obtained. An example of normalization is given in Fig. 6.3. Note that subtrees that are “equal” (e.g., branch {B, D, E}) are combined in the intermediate steps. The canonical string is obtained by scanning the canonical tree top-down, level-by-level, in a breadth-first manner. The symbols $ and # are used to partition families of siblings and to mark the end of the canonical string, respectively. Hence, the canonical string of the tree in Fig. 6.3 is A$1B1B1B$1C1D$1D$1F1G$1E$1E#, assuming that all edges have a label of 1.

³ Compared to frequent graphs, frequent subtrees describe the crucial topology of graphs but demand a lower computational cost.
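As a simplified illustration of tree normalization, the recursive encoding below produces a canonical string for a labeled rooted tree by sorting each vertex's child subtrees. It captures the idea of ordering subtrees bottom-up, but it is not the exact breadth-first $/# encoding of Chi et al. (2003) used by CATAPULT.

```python
def canonical_string(tree, root, label):
    """Canonical encoding of a labeled rooted tree.

    tree:  dict mapping a vertex to the list of its children
    root:  the root vertex
    label: dict mapping a vertex to its label
    Two isomorphic labeled rooted trees yield the same string.
    """
    child_codes = sorted(canonical_string(tree, c, label) for c in tree.get(root, []))
    return label[root] + "(" + ",".join(child_codes) + ")"


# Two different drawings of the same tree produce identical canonical strings.
t1 = {"r": ["a", "b"], "a": [], "b": []}
t2 = {"r": ["b", "a"], "a": [], "b": []}
lab = {"r": "A", "a": "B", "b": "C"}
assert canonical_string(t1, "r", lab) == canonical_string(t2, "r", lab)  # 'A(B(),C())'
```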


Fig. 6.3 Canonical form

Observe that a set of frequent subtrees may contain subtrees that are highly similar to others. Hence, the selection of the frequent subtree set (Line 2) can be further optimized via maximization of the uncapacitated facility location function⁴ (Jain and Vazirani 2001). Here, frequent subtrees are facilities and the facility cost is the similarity of a subtree to the other subtrees in the set. This function is a monotone submodular function, and a near-optimal solution can be achieved by applying greedy search, where the selected subtrees achieve at least 1 − 1/e ≈ 63% discriminative power for clustering (Guestrin et al. 2005). We model the problem of selecting a greedy set of frequent subtrees as features using the minimization of the dissimilarity between subtrees. This problem can be recast as a maximization of the similarity between subtrees as follows. Given two subtrees i and j represented as canonical strings, the subtree similarity of i and j is defined as $\sigma_{subtree}(i, j) = \frac{|lcs(i, j)|}{\max(|i|, |j|)}$ where |i| is the size of i, lcs(i, j) is the longest common subtree between i and j, and max(·) is the maximum operator. Formally, the submodular function is defined as $q(T_{sel}) = \sum_{i \in T_{all}} \max_{j \in T_{sel}} (\sigma_{subtree}(i, j))$ where $T_{all}$ is the set of all frequent subtrees and $T_{sel}$ is the set of near-optimal frequent subtrees. Next, CATAPULT iterates through each graph $G_i$ ∈ D to determine its feature vector. For clarity, the feature vector is a $|T_{sel}|$-dimensional vector. The jth position of the vector is one if $G_i$ contains the subtree $T_j \in T_{sel}$, and zero otherwise (Lines 3–10). Finally, clustering (Line 11) is performed using $T_{sel}$ as feature vectors. In particular, CATAPULT uses k-means clustering with k seeds selected using the k-means++ algorithm (Arthur and Vassilvitskii 2007). We set $k = \frac{|D|}{N}$ where N is the maximum cluster size.

4 Given the cost for opening facilities and the cost for connecting cities to facilities, the uncapacitated facility location problem seeks a solution that minimizes the cost of connecting each city to an open facility.
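The greedy maximization of q(T_sel) can be sketched as follows; the similarity σ is passed in as a function, and the toy σ used in the example is a longest-common-prefix ratio over canonical strings purely to keep the sketch runnable (the book's σ_subtree uses the longest common subtree).

```python
def q(selected, all_subtrees, sigma):
    """Facility-location objective: each subtree's best similarity to T_sel, summed."""
    if not selected:
        return 0.0
    return sum(max(sigma(i, j) for j in selected) for i in all_subtrees)


def greedy_select_subtrees(all_subtrees, budget, sigma):
    """Greedily pick `budget` subtrees maximizing q; since q is monotone submodular,
    the greedy choice carries the standard (1 - 1/e) approximation guarantee."""
    selected = []
    while len(selected) < budget:
        remaining = [t for t in all_subtrees if t not in selected]
        best = max(remaining, key=lambda t: q(selected + [t], all_subtrees, sigma))
        selected.append(best)
    return selected


# Toy similarity: ratio of the longest common prefix of two canonical strings.
def sigma(i, j):
    k = 0
    while k < min(len(i), len(j)) and i[k] == j[k]:
        k += 1
    return k / max(len(i), len(j))


trees = ["A(B(),C())", "A(B(),B())", "D(E())", "D(F())"]
print(greedy_select_subtrees(trees, budget=2, sigma=sigma))  # picks two dissimilar subtrees
```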


Algorithm 6.4 FineClustering
Require: A set of graph clusters C;
Ensure: A set of graph clusters C with size ≤ N;
1: Initialize N with the default maximum cluster size;
2: C_large ← GetLargeClusters(C, N); /* Contains C_i where |C_i| > N */
3: C ← C \ C_large;
4: C_new ← ∅;
5: while |C_large| > 0 do
6:   C_first ← RemoveFirstCluster(C_large);
7:   Seed1 ← SelectRandomGraph(C_first);
8:   C′ ← InsertIntoSet(C′, Seed1);
9:   for Graph G ∈ C_first \ {Seed1} do
10:     ω_G ← GetSimilarity(G, Seed1);
11:   end for
12:   Seed2 ← SelectDissimilarGraph(C_first \ {Seed1}, ω);
13:   C† ← InsertIntoSet(C†, Seed2);
14:   for G ∈ C_first \ {Seed1, Seed2} do
15:     ω′_G ← GetSimilarity(G, Seed2);
16:     if ω_G > ω′_G then
17:       C′ ← InsertIntoSet(C′, G);
18:     else
19:       C† ← InsertIntoSet(C†, G);
20:     end if
21:   end for
22:   C_large, C_new ← ClusterUpdate(C_large, C_new, C′, N);
23:   C_large, C_new ← ClusterUpdate(C_large, C_new, C†, N);
24: end while
25: C ← C ∪ C_new;

Fine clustering. In fine clustering (Algorithm 6.4), CATAPULT further breaks down large clusters into smaller ones. It organizes a given set of graphs C_first into two new clusters (C′ and C†) according to the MCCS similarity between a graph G ∈ C_first and the seed graphs Seed1 and Seed2 corresponding to C′ and C†, respectively. The first seed graph Seed1 is selected randomly from C_first (Line 7) whereas the second one, Seed2, is selected such that it is most dissimilar to Seed1 (Lines 9–12). After the clustering process, CATAPULT updates C_large and C_new by checking the sizes of the newly generated clusters. A new cluster is inserted into C_large if its size is larger than N; otherwise, it is inserted into C_new (Lines 22–23). We adopt the McGregor algorithm (McGregor 1982) to compute MCCS.

Example 6.2 Given a graph dataset D = {G1, ..., G17}, let N = 6 and the number of k-means clusters be K = 3. Suppose the clusters are C1 = {G1, G2, G3, G6, G9, G12}, C2 = {G4, G7, G8, G13, G14, G15, G17}, and C3 = {G5, G10, G11, G16} after the coarse clustering phase. Then, the size of C2 exceeds N and it is subjected to fine clustering. A seed (e.g., G8) is randomly selected from C2 and is inserted into the first new cluster C′. Then, MCCS similarities are computed between G8 and the remaining graphs in C2. The graph having the greatest MCCS dissimilarity is selected as the second seed (e.g., G15) and inserted into the second new cluster C†.


For the remaining graphs (i.e., C_2R = C_2 \ {G_8, G_15}), MCCS similarities are computed between each graph G_i ∈ C_2R and G_8 (denoted ω_Gi), and between G_i and G_15 (denoted ω′_Gi). A comparison is made between ω_Gi and ω′_Gi: a larger ω_Gi implies that G_i is more similar to G_8 than to G_15, and G_i is inserted into the first new cluster. Fine clustering splits C_2 into two new clusters C_2′ = {G_7, G_8, G_13} and C_2† = {G_4, G_14, G_15, G_17}. Hence, at the end of the small graph clustering phase, four clusters, namely C_1, C_2′, C_2†, and C_3, are generated.

Remark The worst-case time complexity of small graph clustering is exponential due to the k-means algorithm (Arthur and Vassilvitskii 2006). Note that the CATAPULT framework is orthogonal to the choice of feature vector-based clustering approach, as k-means can be replaced with an alternative clustering algorithm. Furthermore, the small graph clustering step is a one-time cost and is only invoked when D is a new dataset. Hence, such a trade-off is appropriate.

Lemma 6.3 Small graph clustering achieves a (1/2 + (α − 1/2)·min_fr)-approximation of the optimum clustering, where min_fr is the support of the frequent subtrees and α is the probability that correct clustering occurs given that the MCCS of a pair of graphs (in the same cluster under the optimum clustering) contains frequent subtrees of D.

Proof (Sketch) Given a dataset of graphs D = {G_1, G_2, ..., G_n}, we denote the optimum clustering of D into k disjoint, non-empty clusters as C_OPT = {C_1, C_2, ..., C_k}. We further denote the clustering obtained by small graph clustering as C′ = {C′_1, C′_2, ..., C′_k′}.

The misclassification error distance of C′ with respect to C_OPT is |D′|/|D|, where D′ is the set of misclassified graphs with respect to C_OPT (Meilă 2006). We assume that fine clustering based on MCCS produces C_OPT. In the worst case, small graph clustering performs coarse clustering only (frequent subtree-based feature vector clustering), namely when the sizes of all generated clusters are less than or equal to N. Let A denote the correct classification of G_i and B denote the event that the frequent subtree set of D contains the MCCS of two graphs G_i and G_x. Observe that the probability of B, denoted Pr(B), is equivalent to min_fr (Lemma 6.5). Then, Pr(A) = Pr(A ∩ (B ∪ B̄)) = Pr(A ∩ B) + Pr(A ∩ B̄) = Pr(A|B)Pr(B) + Pr(A|B̄)Pr(B̄) = Pr(A|B)·min_fr + (1/2)(1 − min_fr), where Pr(A|B̄) = 0.5 since, in the worst case, there is a random chance of correct classification given B̄. Pr(A|B) is the probability of correct classification given B, and this is likely to occur when C_m = {m | max_{j∈G_C} MCCS(G_i, j), m ∈ G_C} coincides with C_m = {m | max_{l∈G_C} subtree(G_i, l), m ∈ G_C}, where subtree(G_i, l) is the similarity between the frequent subtree vectors of G_i and l. Hence, small graph clustering achieves a (1/2 + (α − 1/2)·min_fr)-approximation of C_OPT, where Pr(A|B) = α.


6.4.2 Generation of CSGs

Once the set of graph clusters C is generated, CATAPULT summarizes each cluster C_i ∈ C into a closure graph, which we refer to as its cluster summary graph (CSG). In particular, it iterates through each cluster C_i ∈ C and performs graph closure (He et al. 2006) by considering a pair of data graphs at a time. Briefly, a graph extension of the pair of data graphs is obtained by mapping their vertices and edges, and the closure graph is found by performing edge closure on the extended graph (see footnote 5). The CSG S for a cluster C_i is obtained when all data graphs in the cluster have been integrated into the closure graph.

Lemma 6.4 The time and space complexities of the CSG generation process are O(|D||V_max|d²log(|V_max|)) and O(|D|(|E_max| + |V_max|)), respectively, where d is the maximum degree of vertices and G_max = (V_max, E_max) is the largest graph in D.

Proof (Sketch) The worst-case time complexity to form a closure graph is O(|V_max|d²log(|V_max|)) (He et al. 2006), where d is the maximum degree of vertices. In the worst case, there are |D| − N − k − 1 clusters, and the resulting time complexity is O((|D| − N − k − 1)|V_max|d²log(|V_max|)). For a large dataset, |D| ≫ N and |D| ≫ k. Hence, the worst-case time complexity reduces to O(|D||V_max|d²log(|V_max|)). The worst-case space complexity for storing the graph clusters C is O(|D|(|E_max| + |V_max|)), whereas that for storing the set of closure graphs is O(|C|(|E_max| + |V_max|)). In addition, the worst-case space complexity for generating a graph closure is O(|V_max| + |E_max|) (He et al. 2006). Taken together, the worst-case space complexity for CSG generation is O(|D|(|E_max| + |V_max|)) since |D| ≫ |C|.

5 CATAPULT skips the vertex closure step since edge labels are needed subsequently.

6.4.3 Handling Larger Graph Databases

Small graph clustering can be computationally expensive for large D. To alleviate this challenge, CATAPULT follows a two-level sampling approach (eager sampling and lazy sampling) as depicted in Fig. 6.1. Intuitively, eager sampling is performed prior to the small graph clustering phase, whereas lazy sampling is performed after the coarse clustering phase. As we shall see in Sect. 6.6, this sampling approach achieves a good balance between the quality of canned patterns and the runtime performance of CATAPULT.

Eager sampling. Eager sampling refers to random sampling from D. First, it generates a random sample of data graphs from D. Given an error bound ε and a maximum probability ρ that the error exceeds ε, the size of the random sample is determined by |S_eager| ≥ (1/(2ε²)) ln(2/ρ) (Toivonen 1996).
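This bound reduces to simple arithmetic. The following minimal sketch (ours, not part of CATAPULT; the class name is hypothetical) computes |S_eager| directly from ε and ρ and reproduces the sample sizes quoted in this chapter.

```java
public class EagerSampleSize {
    // |S_eager| >= (1 / (2 * eps^2)) * ln(2 / rho)   (Toivonen 1996)
    static long size(double eps, double rho) {
        return (long) Math.ceil(Math.log(2.0 / rho) / (2.0 * eps * eps));
    }

    public static void main(String[] args) {
        // rho = 0.01 throughout; eps = 0.02 gives 6623 (the worked example below),
        // while eps = 0.04 gives 1656 and eps = 0.01 gives 26492 (the values reported in Exp 3).
        System.out.println(size(0.02, 0.01));
        System.out.println(size(0.04, 0.01));
        System.out.println(size(0.01, 0.01));
    }
}
```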


For example, given D and sampling parameters ρ = 0.01 and ε = 0.02, |S_eager| = (1/(2(0.02)²)) ln(2/0.01) ≈ 6623. Observe that |S_eager| is independent of |D|. Then, the sample S_eager is used to find a frequent subtree set (recall the coarse clustering phase). For a subtree t, the probability that the error e(t, S_eager) > ε is at most ρ. Note that e(t, S_eager) = |fr(t) − fr(t, S_eager)|, where fr(t) and fr(t, S_eager) are the frequencies of t in D and in S_eager, respectively. Hence, by setting low_fr < min_fr, the potential frequent subtrees found with the lower support low_fr on the sample are unlikely to miss any frequent subtrees found with support min_fr on the original dataset. CATAPULT then counts the support of this potential frequent subtree set using the original support threshold (i.e., min_fr) to retrieve the final set of frequent subtrees. The frequent subtree set is then used to build feature vectors in the coarse clustering phase.

Lemma 6.5 (Toivonen 1996) Given a frequent subtree set X, a random sample S_eager, and a probability parameter ϕ, the probability that x ∈ X is missed is at most ϕ when low_fr < min_fr − √((1/(2|S_eager|)) ln(1/ϕ)), where low_fr and min_fr are the lower support threshold and the original support threshold, respectively. Note that min_fr is a user-specified support value for the frequent subtree set and low_fr is a lower support threshold for the frequent subtree set due to sampling.

Lazy sampling. After coarse clustering, some clusters may still be too large for efficient processing. CATAPULT performs stratified random sampling of large clusters to further reduce their sizes (referred to as lazy sampling). For example, suppose that after coarse clustering of a dataset of 50K data graphs, cluster C_1 contains 1000 data graphs. Let the sampling parameters be p = 0.5, Z_{α/2} = Z_0.95 = 1.65, and e = 0.03. Then |S_lazy| = (1.65² × 0.5²/0.03²) × 1000/50000 ≈ 15.13 (Lemma 6.6), and C_1 can be further reduced by taking a sample of 15 graphs. Note that fine clustering still needs to be performed if |S_lazy(C)| > N.

Lemma 6.6 Given a set of data graphs D containing |C| clusters, the size of the random sample set S_lazy(C) required to estimate a cluster C ∈ C is

|S_lazy(C)| = |S_sample| × |C| / Σ_{C_i ∈ C} |C_i|     (6.1)

where S_sample is the sample for D. Here |S_sample| = Z²pq/e², where Z² is the abscissa of the normal curve that cuts off an area α at the tails (1 − α is the desired confidence level, e.g., 95%), e is the desired level of precision, p is the estimated proportion of a graph being sampled in D, and q = 1 − p.

Proof (Sketch) Each graph cluster C can be considered as a stratum. Under proportional stratified sampling (McKay et al. 1979), the sample size of a stratum is |S_C| = |S_G||N_C|/|N_G|, where |S_C| and |S_G| are the sample sizes for cluster C and the set of graphs G, respectively, and |N_C| and |N_G| are the population sizes of C and G, respectively. Further, for a large population, a representative sample size can be obtained as |S_sample| = Z²pq/e², where Z² is the abscissa of the normal curve that cuts off an area α at the tails (1 − α is the desired confidence level, e.g., 95%), e is the desired level of precision, p is the estimated proportion of a graph being sampled in the dataset, and q = 1 − p (Cochran 1991).

Algorithm 6.5 FindCannedPatternSet
Require: Edge label weights ELW, cluster weights CW, a set of CSGs S, plug b = (η_min, η_max, γ);
Ensure: A set of canned patterns P;
1:  P ← ∅;
2:  S ← GetWeightedGraph(S, ELW);
3:  while |P| < γ do
4:    P_c ← ∅;
5:    R ← GetPatternSizeRange(b, P);
6:    for S ∈ S do
7:      for pattern size η ∈ R do
8:        L ← ∅;
9:        for iteration i = 0 to x do   /* x = max. no. of random walks */
10:         PCP ← GeneratePCP(S, η);
11:         L ← L ∪ {PCP};
12:       end for
13:       FCP ← GenerateFCP(L);
14:       P_c ← P_c ∪ {FCP};
15:     end for
16:   end for
17:   s ← GetPatternScore(P_c, P, CW);
18:   p_best ← GetBestPattern(s, P_c);
19:   P ← P ∪ {p_best};
20:   CW ← UpdateClusterWeight(CW, p_best, S);
21:   ELW ← UpdateEdgeLabelWeight(ELW, p_best);
22: end while

6.5 Selection of Canned Patterns

Given a set of CSGs S, CATAPULT follows a greedy iterative approach for selecting canned patterns for a VQI. In each iteration, candidate patterns are generated from each CSG S ∈ S and the "best" pattern for that iteration is added to the partial canned pattern set P. Weights related to coverage are assigned to the CSGs to ensure that in each subsequent iteration, candidate patterns are derived from CSGs that are not yet covered by P. CATAPULT performs random walks on these weighted CSGs and leverages the statistics obtained from the walks to propose a candidate canned pattern (final candidate pattern) for each size in the range [η_min, η_max] (i.e., plug b). A pattern score based on coverage, diversity, and cognitive load is computed for each candidate pattern and utilized to select the next best pattern to be added to P. The weights of the CSGs are then updated based on the selected pattern. These steps are repeated until either the required number of canned patterns is discovered or no new pattern can be found. We now describe the algorithm (Algorithm 6.5) in detail.


Weighted CSG construction (Line 2). The CSG S of a cluster C is a summarized representation of the data graphs contained in C. Each edge e in S is assigned a weight w_e based on its label coverage in the dataset (i.e., global occurrence) and in the cluster (i.e., local occurrence) as follows: w_e = lcov(e, D) × lcov(e, C), where lcov(e, X) = |L(e, X)|/|X|.

Weighted random walk for candidate pattern generation (Lines 6–16). We adopt a random walk-based approach for candidate generation because each random walk starts afresh in each iteration and has the potential to cover different regions of the CSGs, thus producing diverse candidate patterns. Given a weighted CSG S, CATAPULT performs random walks to generate a variety of potential candidate patterns (PCPs) from which a final candidate pattern (FCP) is derived. These PCPs collectively form a candidate library L. We elaborate on the generation of PCPs and the FCP. Each random walk that generates a PCP starts with a seed edge (i.e., the edge with the largest weight). In every iteration, the PCP is "grown" by adding an adjacent edge until the required number of edges is reached or no more edges can be added. An adjacent edge is selected as follows (Geerts 2004): (a) Find all adjacent edges (referred to as candidate adjacent edges (CAEs)) of the partial PCP. (b) Multiply the weights of all CAEs by the least common multiple (LCM) of the denominators of the CAE weights, so that the CAEs have integer weights. (c) Replace each CAE (u, v) with integer weight k by k copies of the CAE (u, v) with weight 1. (d) Randomly select a CAE (a sketch of an equivalent weighted selection is given below). At the end of each random walk, the PCP is added to the candidate library L. Generation of the FCP starts with the first edge, i.e., the most frequent edge in L. Similar to a PCP, the FCP is "grown" one edge at a time. To ensure that the FCP is a connected subgraph, the next edge selected is the most frequent edge in L that is also connected to the previously added edge.

Pattern score computation (Line 17). The pattern score, which is used to select the best candidate (i.e., the candidate with the highest score), combines scov(·), lcov(·), div(·), and cog(·). Observe that the scov computation is extremely expensive when |D| is large. Hence, we estimate scov in terms of the cluster coverage ccov. That is, scov(P, D) ≈ ccov(P, cw, C), where C is the set of clusters of D and cw is the cluster weight vector such that cw_i = |C_i|/|D| and ccov(P, cw, C) = Σ_{i∈C} cw_i × I_i, where I_i = 1 if the CSG of C_i contains a subgraph isomorphic to some p ∈ P and I_i = 0 otherwise. Hence, given D with clusters C, a pattern p = (V_p, E_p), and a canned pattern set P, the pattern score of p is

s_p = ccov(p, cw, C) × lcov(p, D) × div(p, P \ p) / cog(p)     (6.2)

Notice that, as recommended by Tofallis (2014), we combine scov, lcov, div, and cog using a multiplicative utility function since we do not have prior knowledge of the trade-off rate (i.e., x units of criterion A is equivalent to y units of criterion B) between these criteria. Also, s_p increases when ccov or div increases, or when cog decreases. Hence, given two candidate patterns p_1 and p_2, p_1 is considered superior to p_2 if s_p1 > s_p2.
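Returning to the weighted random walk, steps (b)–(d) above amount to choosing a candidate adjacent edge with probability proportional to its weight: replicating an edge of integer weight k into k unit-weight copies and picking uniformly is exactly proportional selection. The sketch below (our own illustration; names are hypothetical) shows this equivalent roulette-wheel selection.

```java
import java.util.*;

public class WeightedEdgeChoice {
    // Pick an index with probability proportional to weights[i].
    // Equivalent to the LCM-replication scheme: an edge of weight w is
    // chosen with probability w / (sum of all CAE weights).
    static int pickProportional(double[] weights, Random rnd) {
        double total = 0;
        for (double w : weights) total += w;
        double r = rnd.nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) return i;
        }
        return weights.length - 1;   // guard against floating-point drift
    }

    public static void main(String[] args) {
        // Example: three candidate adjacent edges with weights 0.47, 0.33, 0.20.
        double[] caeWeights = {0.47, 0.33, 0.20};
        int chosen = pickProportional(caeWeights, new Random(42));
        System.out.println("Selected CAE index: " + chosen);
    }
}
```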


Recall that GED is used to compute the pattern set diversity, and it is known to be computationally expensive (Riesen et al. 2007). Hence, CATAPULT uses a pruning step based on a lower bound of GED to reduce the number of exact GED computations.

Definition 6.7 Given two graphs G_A = (V_A, E_A) and G_B = (V_B, E_B), the lower bound of GED is GED_l(G_A, G_B) = Δ|V| + Δ|E|, where L(V_A) is the set of labels of vertices in V_A, Δ|V| = ||V_A| − |V_B|| + min(|V_A|, |V_B|) − |L(V_A) ∩ L(V_B)|, and Δ|E| = ||E_A| − |E_B|| (a small illustrative sketch appears below).

Observe that the lower bound computes the exact number of vertex modifications (Δ|V| in Definition 6.7) and the minimum number of edge modifications (Δ|E|) that are necessary.

Lemma 6.8 Given two graphs G_A = (V_A, E_A) and G_B = (V_B, E_B), the worst-case time complexity of computing the lower bound of GED is O(|V_A|log|V_A|), where |V_A| ≥ |V_B|.

Proof (Sketch) The worst-case time complexity of computing the lower bound of GED is due to the identification of common vertex labels in L(V_A) and L(V_B). This can be done by sorting both label lists and then comparing them, which yields a complexity of O(|V_A|log|V_A|).

The lower bound can be exploited to compute GEDs as follows: (a) Compute the lower bound GED_l of the candidate pattern p_c with each canned pattern p in P. (b) Order the canned patterns in P in increasing GED_l and store the ordered list as Y. (c) Iterate through Y. In each iteration, (1) compute GED(p, p_c), where p ∈ Y, and (2) update GED_min if GED(p, p_c) < GED_min.

Updating weights (Lines 20–21). In this step, the cluster weights and edge label weights (recall Sect. 6.3) are updated after each selection of a new canned pattern p by utilizing the multiplicative weights update method (Arora et al. 2012) as follows: (a) Cluster weight update: if the CSG of a cluster C contains a subgraph isomorphic to p, then the new cluster weight of C is w′_C = (1 − n) × w_C, where w_C is the original cluster weight. We set n = 0.5 following Arora et al. (2012). (b) Edge label weight update: if an edge e has a label corresponding to that of an edge in p, then the new edge label weight of e is w′_e = (1 − n) × w_e, where w_e is the original edge label weight.

Remark Observe that the above algorithm results in a higher chance of covering frequently occurring edge labels since there are more edges containing these labels and a walk is likely to pass through one or more such edges. These edges are likely to occur in frequent queries. In contrast, edge labels with low frequency are more likely to occur in infrequent queries. Hence, CATAPULT balances the number of frequent and infrequent patterns by using the aforementioned weighted random walk approach.
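Returning to the GED lower bound of Definition 6.7, the following minimal sketch (ours; it assumes the label intersection is taken over multisets and takes only the vertex-label multisets and edge counts as input) illustrates the computation.

```java
import java.util.*;

public class GedLowerBound {
    // GED_l(GA, GB) = deltaV + deltaE, where
    //   deltaV = ||VA| - |VB|| + min(|VA|, |VB|) - |L(VA) ∩ L(VB)|   (multiset intersection)
    //   deltaE = ||EA| - |EB||
    static int lowerBound(List<String> labelsA, int edgesA, List<String> labelsB, int edgesB) {
        Map<String, Integer> countB = new HashMap<>();
        for (String l : labelsB) countB.merge(l, 1, Integer::sum);
        int common = 0;
        for (String l : labelsA) {
            Integer c = countB.get(l);
            if (c != null && c > 0) { common++; countB.put(l, c - 1); }
        }
        int deltaV = Math.abs(labelsA.size() - labelsB.size())
                   + Math.min(labelsA.size(), labelsB.size()) - common;
        int deltaE = Math.abs(edgesA - edgesB);
        return deltaV + deltaE;
    }

    public static void main(String[] args) {
        // Toy graphs: GA has vertex labels {C, C, N, O} and 4 edges,
        // GB has vertex labels {C, N, S} and 3 edges.
        List<String> a = Arrays.asList("C", "C", "N", "O");
        List<String> b = Arrays.asList("C", "N", "S");
        System.out.println(lowerBound(a, 4, b, 3));   // deltaV = 1 + 3 - 2 = 2, deltaE = 1, total 3
    }
}
```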


Furthermore, CATAPULT follows a uniform distribution in generating P to ensure that the sizes of canned patterns are evenly distributed (recall from Chap. 4). That is, the maximum number of canned patterns per size is γ/(η_max − η_min + 1). However, the GetPatternSizeRange procedure (Line 5) can be modified to generate the range of required sizes based on a pattern distribution dist for each while-loop iteration.

Example 6.9 Figure 6.4 illustrates the canned pattern selection process. Let γ = 9, η_min = 3, and η_max = 5. CATAPULT first generates the weight w_e for each CSG (Fig. 6.4a). For example, w_(C,O) = lcov((C,O), D) × lcov((C,O), S_C1) = 0.99. Then, random walks are performed to derive a library of PCPs for each pattern size. Figure 6.4b illustrates an instance of a random walk that generates a PCP of size 3 for the weighted CSG S_C1, starting at the seed edge (C, O) (largest weight). A set of CAEs is obtained by converting edge weights to integers. For instance, the CAE (C, N) is replicated into 47 copies of (C, N). A CAE (e.g., (C, N)) is then randomly selected and added to the partial PCP. This process is repeated until the PCP is fully constructed. The constructed PCP (e.g., {(C, O), (C, N), (C, S)}) is then added to the library. Next, CATAPULT proceeds to identify the FCP from the PCP library based on the frequency of labeled edges occurring in the library (a small sketch of this growth step is given below).

Fig. 6.4 Canned pattern selection, consisting of (a) generating a weighted CSG; (b) PCP generation; and (c) FCP generation
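A minimal sketch of the FCP growth step (our own illustration; edges are represented as vertex-id pairs and the frequencies are assumed to be aggregated over the PCP library L):

```java
import java.util.*;

public class FcpGrowth {
    // Grow the final candidate pattern (FCP): start from the most frequent edge in the
    // PCP library and repeatedly append the most frequent edge that touches the
    // previously added edge, until eta edges have been added or no edge qualifies.
    static List<int[]> growFcp(Map<List<Integer>, Integer> edgeFreq, int eta) {
        List<int[]> fcp = new ArrayList<>();
        Set<List<Integer>> used = new HashSet<>();
        List<Integer> prev = null;
        for (int step = 0; step < eta; step++) {
            List<Integer> best = null;
            for (Map.Entry<List<Integer>, Integer> e : edgeFreq.entrySet()) {
                if (used.contains(e.getKey())) continue;
                // the next edge must share a vertex with the previously added edge
                if (prev != null && Collections.disjoint(e.getKey(), prev)) continue;
                if (best == null || e.getValue() > edgeFreq.get(best)) best = e.getKey();
            }
            if (best == null) break;
            used.add(best);
            fcp.add(new int[]{best.get(0), best.get(1)});
            prev = best;
        }
        return fcp;
    }

    public static void main(String[] args) {
        // Edge frequencies aggregated over 100 random walks (toy numbers).
        Map<List<Integer>, Integer> freq = new HashMap<>();
        freq.put(Arrays.asList(1, 2), 90);   // most frequent edge -> first edge of the FCP
        freq.put(Arrays.asList(2, 3), 70);
        freq.put(Arrays.asList(3, 4), 60);
        freq.put(Arrays.asList(5, 6), 80);   // frequent but never adjacent to the previous edge
        for (int[] e : growFcp(freq, 3)) System.out.println(e[0] + "-" + e[1]);
        // prints 1-2, 2-3, 3-4
    }
}
```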


Figure 6.4c illustrates the steps of finding an FCP of size 3 from S_C1. Based on 100 random walks, the most frequent edge is identified as (C, O), which forms the first edge of the FCP. The second edge of the FCP, (C, N), is the most frequent edge in the library that is connected to (C, O). CATAPULT continues to identify the next edge until the FCP is constructed. Note that in every iteration, each CSG "proposes" an FCP for each pattern size. The pattern score of each FCP is computed. The FCP with the largest pattern score (e.g., P_1 = {(C, O), (C, N), (C, S)}) is then selected as the best candidate pattern and added to the set. The cluster weights are updated by first identifying all CSGs containing subgraphs that are isomorphic to P_1 and multiplying their cluster weights by 0.5. The weights of (C, O), (C, N), and (C, S) in the edge label occurrences are updated as well by multiplying their initial weights by 0.5. CATAPULT repeats these steps to select subsequent canned patterns.

Theorem 6.10 The worst-case time and space complexities of canned pattern selection (Algorithm 6.5) are O(|V_Smax|!|V_Smax||S| + |P|(|V_P(max)|³ + xη²_max|S||E_Smax|)) and O(|S|(|E_Smax| + η²_max) + |D||E_max|), respectively, where S_max is the largest CSG in the set of CSGs S and x is the number of random walk iterations.

Proof (Sketch) Finding the weights of edges in the closure graphs requires O(|S||E_Smax|) time in the worst case, where S_max ∈ S is the largest closure graph. Generating PCPs requires O(xη²_max|S||P||E_Smax|) time, where x is the number of random walk iterations. CATAPULT utilizes edge occurrences from the random walks to identify the FCP. For every PCP library, computing edge occurrences requires O(xη_max) time, while FCP generation takes O(η_max|E_Smax|) time. Computing the pattern score requires a subgraph isomorphism test for each closure graph to find ccov (O(|V_Smax|!|V_Smax|), Cordella et al. 2004) and |P| graph edit distance computations (O(|V_P(max)|³), Riesen et al. 2007, where P_(max) is the largest pattern in P) to find div, yielding O(|V_Smax|!|V_Smax||S| + |P||V_P(max)|³) worst-case time complexity for each FCP. Updating the cluster weights and edge label occurrences requires O(|V_Smax|!|V_Smax||S|) and O(η_max) time, respectively. Taken together, the pattern mining and selection phase has a worst-case time complexity of O(|V_Smax|!|V_Smax||S| + |P|(|V_P(max)|³ + xη²_max|S||E_Smax|)).

Space complexity: There are |P| canned patterns. Since we expect canned patterns to be subgraphs of D, their sizes should be less than O(|V_max| + |E_max|). Hence, the storage space needed for candidate patterns is O(|P|(|V_max| + |E_max|)). In addition, CATAPULT allocates weights to each closure graph, which requires O(|S||E_Smax|) space. In the worst case, maintaining the ELW requires O(|D||E_max|) space, assuming that every edge in each graph in D has a unique label. For each PCP library, O(xη_max) space is needed, where x is the number of random walk instances. During each iteration, there are η_max − η_min + 1 candidate canned patterns per closure graph. Hence, in the worst case, Algorithm 6.5 has space complexity O(|P|(|V_max| + |E_max|) + |S||E_Smax| + |D||E_max| + η²_max|S|) since xη_max ≪ |D||E_max| in a large graph repository. This can be further reduced to O(|D||E_max| + |S|(|E_Smax| + η²_max)) since |D| ≫ |P| and, in the worst case, G_max is a strongly connected graph with |E_max| > |V_max|.

Remark Our data-driven approach for selecting canned patterns enables an end user to customize her interface by specifying a plug. Also, notice that the worst-case time complexity of canned pattern selection is mainly due to the subgraph isomorphism test necessary for checking cluster coverage. In this work, we use the VF2 algorithm (Cordella et al. 2004). Note that CATAPULT is orthogonal to this choice as VF2 can be substituted with any superior subgraph isomorphism algorithm.

6.6 Selection of Basic Patterns

Recall from Sect. 6.2 that we consider an edge and a 2-edge as basic patterns in our framework. We generate the basic patterns after the selection of the canned patterns. Specifically, m (default 5, but configurable) basic patterns (denoted B) are selected to facilitate query formulation. The following steps are used to select the basic patterns for each pattern type; we use the edge pattern to explain them (a minimal sketch is given at the end of this section):

1. Rank the edges in decreasing order of support in the dataset. The ranked list is denoted as E_r.
2. Compute the number of steps (step_e(e_1)) required to draw e_1 using the edge-at-a-time approach, where e_1 ∈ E_r is the edge with the highest support.
3. Compute the minimum number of steps (minStep_p(e_1)) required to draw e_1 using the pattern-at-a-time approach.
4. Select e_1 as a basic pattern if step_e(e_1) < minStep_p(e_1) and |B_edge| < α, where B_edge is the set of basic edge patterns and α is the maximum number of such patterns allowed in the canned pattern panel of a VQI. Remove e_1 from E_r.
5. Repeat Steps 2 to 4 until |B_edge| = α.

Example 6.11 Consider the edge (C, N), which has the highest support in the AIDS dataset. Observe that step_e(C, N) = 3. Consider the construction of (C, N) using P2 in Fig. 6.17. In this case, step_p(C, N) = 9, since we have to drag-and-drop P2 onto Panel 4 and then remove 8 edges to obtain (C, N). The minimum number of steps using the pattern-at-a-time approach is simply the minimum of all step_p(C, N) values when the entire set of basic and canned patterns is considered. Suppose minStep_p(C, N) = 5; then we select (C, N) as part of B_edge for display on Panel 5 since step_e(C, N) < minStep_p(C, N).

Note that the types (e.g., edge, 2-edge) of basic patterns and the number of basic patterns of each type are configurable by a user. In addition, the list of 2-edge candidate basic patterns can be found using any state-of-the-art frequent pattern mining technique. The 2-edge basic patterns can then be selected by following steps similar to those for edge patterns described above.
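A compact rendering of the selection loop (our own illustration; the two step-counting routines step_e and minStep_p are assumed to be supplied externally, and all names are hypothetical):

```java
import java.util.*;
import java.util.function.ToIntFunction;

public class BasicEdgeSelection {
    // Walk the support-ranked edge list and keep an edge as a basic pattern only if
    // drawing it edge-at-a-time is cheaper than composing it from existing patterns.
    static <E> List<E> selectBasicEdges(List<E> rankedBySupport,
                                        ToIntFunction<E> stepEdgeAtATime,       // step_e(e)
                                        ToIntFunction<E> minStepPatternAtATime, // minStep_p(e)
                                        int alpha) {
        List<E> basic = new ArrayList<>();
        for (E e : rankedBySupport) {
            if (basic.size() >= alpha) break;
            if (stepEdgeAtATime.applyAsInt(e) < minStepPatternAtATime.applyAsInt(e)) {
                basic.add(e);
            }
        }
        return basic;
    }

    public static void main(String[] args) {
        // Toy example mirroring Example 6.11: (C, N) has step_e = 3 and minStep_p = 5,
        // so it is kept; an edge that is cheaper to obtain from a canned pattern is skipped.
        Map<String, Integer> stepE = new HashMap<>();
        Map<String, Integer> minStepP = new HashMap<>();
        stepE.put("(C,N)", 3);  minStepP.put("(C,N)", 5);
        stepE.put("(C,C)", 3);  minStepP.put("(C,C)", 2);
        stepE.put("(C,O)", 3);  minStepP.put("(C,O)", 4);
        List<String> ranked = Arrays.asList("(C,N)", "(C,C)", "(C,O)");
        System.out.println(selectBasicEdges(ranked, stepE::get, minStepP::get, 2));
        // prints [(C,N), (C,O)]
    }
}
```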


6.7 Performance Study

CATAPULT is implemented in Java (JDK 1.8) and its VQI is implemented in JavaScript v1.6; the code is available at https://github.com/MIDAS2020/CATAPULT. Small graph clustering is realized in C++ using the Boost library. We now investigate the performance of CATAPULT and report the key results. All experiments are performed on a 64-bit Windows desktop with an Intel Xeon E5-1630 CPU (3.70 GHz) and 32 GB of main memory.

6.7.1 Experimental Setup

Datasets. We use the following datasets. (a) The AIDS antiviral dataset (https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data) has 40,000 (40K) data graphs. We use it and a subset containing 10K graphs, referred to as AIDS40k and AIDS10k, respectively. (b) The PubChem dataset (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/) consists of 23,238 (23K); 250,000 (250K); 500,000 (500K); and 1 million (1M) chemical compound graphs. Unless otherwise stated, PubChem refers to the 23K dataset. (c) The eMolecule dataset (https://www.emolecules.com/info/plus/download-database) consists of 10K chemical compounds (referred to as eMol).

Competitors. We compare our data-driven approach with two commercial visual subgraph query interfaces (PubChem and eMol) in which the canned patterns are manually selected. We also compare CATAPULT with a frequent subgraph-based canned pattern selection strategy as a baseline.

Query set. We generate subgraph queries by randomly selecting connected subgraphs from the dataset. For each dataset, 1000 subgraph queries with sizes in the range [4–40] are randomly generated.

Parameter settings. Unless specified otherwise, we set η_min = 3, η_max = 12, N = 20, k = |D|/N, and |P| = γ = 30.

Performance measures. We use the following performance measures:

• Clustering time: Time taken to perform clustering in the small graph clustering phase.
• Pattern generation time (PGT): Time taken to select the canned pattern set P (Algorithm 6.5).
• CSG compactness (denoted ξ_t): Given a graph cluster C and a threshold t, the CSG compactness of a CSG S_C = (V_SC, E_SC) of C is ξ_t = |E_t|/|E_SC|, where every edge e ∈ E_t ⊆ E_SC is contained in at least t × |C| graphs in C. Intuitively, it measures the compactness of the CSG of a cluster.


• Missed percentage (MP): Percentage of the query set containing no canned patterns, i.e., MP = (|Q_M|/|Q|) × 100%, where Q is the query set and Q_M ⊆ Q is the set of queries that contain no subgraph isomorphic to any p ∈ P. That is, these queries cannot be formulated using P.
• Reduction ratio (denoted μ): Given a subgraph query Q, μ is the ratio of the number of steps saved when P is used to construct Q to the total number of steps needed to construct it using the edge-at-a-time mode (step_total). That is, μ = (step_total − step_P)/step_total, where step_P is the minimum number of steps required to construct Q when P is used (a toy computation is sketched after this list). Note that a step refers to the addition of a vertex/edge/pattern or the relabeling of a vertex. We assume a canned pattern p ∈ P can be used in Q iff p ⊆ Q. Further, when multiple patterns are used to construct Q, for simplicity we assume that their corresponding isomorphic subgraphs in Q do not overlap. Then, given Q and P, the problem of finding a collection of canned patterns P_Q ⊆ P that maximally covers Q can be modeled as a maximum weighted independent set problem (Sakai et al. 2003), where each pattern p ∈ P_Q is contained in Q and the weight of p is the number of vertices in it. The maximum weighted independent set is P_Q, and each pattern p ∈ P_Q is treated as a single step. Note that P_Q is strictly a bag, as it may contain multiple instances of p if there are multiple non-overlapping subgraphs in Q that are isomorphic to p. Hence, step_P = |P_Q| + |V_Q \ V_PQ| + |E_Q \ E_PQ|.
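As a toy illustration of the reduction ratio (our own numbers; the maximum weighted independent set computation that finds P_Q is abstracted away, and we assume the edge-at-a-time construction costs one step per vertex and per edge):

```java
public class ReductionRatio {
    // mu = (step_total - step_P) / step_total, with
    // step_P = |P_Q| + |V_Q \ V_PQ| + |E_Q \ E_PQ| (each used pattern counts as one step).
    static double mu(int patternsUsed, int uncoveredVertices, int uncoveredEdges, int stepTotal) {
        int stepP = patternsUsed + uncoveredVertices + uncoveredEdges;
        return (stepTotal - stepP) / (double) stepTotal;
    }

    public static void main(String[] args) {
        // Hypothetical query with |V_Q| = 10 and |E_Q| = 12 (step_total = 22 under the
        // assumption above). Two non-overlapping canned patterns cover 8 vertices and
        // 9 edges, so step_P = 2 + 2 + 3 = 7 and mu = (22 - 7) / 22, roughly 0.68.
        System.out.println(mu(2, 10 - 8, 12 - 9, 10 + 12));
    }
}
```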

6.7.2 Experimental Results

Exp 1: Small graph clustering. First, we evaluate the effect of our small graph clustering strategy in terms of clustering time and CSG compactness. Figure 6.5 reports the performance on AIDS10k and AIDS40k for the following scenarios: (a) coarse clustering only (CC), (b) MCCS-based fine clustering only (mccsFC), (c) MCS-based fine clustering only (mcsFC), and (d) coarse and fine clustering (i.e., hybrid) with MCCS (mccsH) and MCS (mcsH). As expected, CC is generally faster but produces CSGs with low compactness. Occasional clusters with a large number of data graphs can result in CSGs with poor compactness due to the large variability in the topology of the data graphs in a cluster; in our experiments, the largest cluster produced in AIDS40k contains 14,000 graphs. In contrast, MCCS-based fine clustering (mccsFC) produces more compact CSGs but is much slower. Interestingly, our proposed hybrid approach mccsH produces the most compact CSGs for both datasets at a reasonable clustering time. This justifies the need for a hybrid strategy for small graph clustering. In subsequent experiments, we use mccsH.

Exp 2: Sampling versus no sampling. Next, we evaluate the effect of sampling on the AIDS dataset in terms of PGT, MP, and μ. For eager sampling, we set ρ = 0.01 and ε = 0.02, whereas for lazy sampling, p = 0.5, Z_{α/2} = Z_0.95, and e = 0.03. From Fig. 6.6, we observe that there are no significant differences for either AIDS10k or AIDS40k in terms of μ and MP, but PGT differs by up to two orders of magnitude. We also examine the effect of sampling on the quality of graph clusters. Figure 6.7 shows that CSG compactness does not change significantly, whereas the clustering time increases by up to one fold without sampling. Hence, the sampling approaches in CATAPULT reduce running time significantly without noticeably affecting the quality of the selected canned patterns. In subsequent experiments, we use these sampling parameters.

Fig. 6.5 Small graph clustering phase: clustering time and CSG compactness (ξ_0.4, ξ_0.5, ξ_0.6) of CC, mccsFC, mcsFC, mccsH, and mcsH on AIDS10K and AIDS40K

Fig. 6.6 Effect of sampling: PGT, μ, and MP. 10kS (resp. 40kS) and 10knoS (resp. 40knoS) denote sampling and no sampling of AIDS10k (resp. AIDS40k), respectively

Fig. 6.7 Effect of sampling on the clustering phase: clustering time and CSG compactness (ξ_0.4, ξ_0.5, ξ_0.6)

Exp 3: Effect of sampling parameters. Several parameters affect the sampling sizes. Figure 6.8 plots the effect of varying ρ and e on the reduction ratio μ_S = (step_P(noSamp) − step_P(samp_x))/step_P(noSamp), where step_P(noSamp) and step_P(samp_x) are the numbers of steps required to construct a subgraph query when no sampling and when sampling based on parameter x are used, respectively. Observe that the changes in μ_S and MP are insignificant when the parameters are varied by 0.5-fold and 2-fold, respectively. PGT, however, varies by almost an order of magnitude when ε changes from 0.04 to 0.01 due to the increase of |S_eager| from 1656 to 26,492. Hence, varying the sampling parameters does not significantly affect the quality of canned patterns.

Fig. 6.8 Effect of varying sampling parameters on canned pattern generation (AIDS40k)

Exp 4: Size of |P|. Next, we examine the effect of varying |P|. We observe that varying |P| does not have a significant effect on μ (Fig. 6.9). As expected, PGT increases as |P| increases, and this effect is most noticeable on the larger dataset (AIDS40k). In addition, improved coverage of the query set is observed as the number of canned patterns increases. In particular, MP displays a downward trend, with a ∼50% reduction when |P| is increased from 10 to 40. The average cog of patterns in P for all datasets varies in the range [1.65–1.97], highlighting the low cognitive load of the patterns.

Exp 5: Varying pattern size. In this set of experiments, we examine the effect of varying the pattern size. We first set η_max = 12 and vary η_min in the range [3–9]. From Fig. 6.10, we observe that increasing η_min results in increasing MP and, correspondingly, decreasing average μ. This is because the probability of a query graph Q containing a large canned pattern is comparatively lower than that of a small canned pattern. As expected, PGT decreases as η_min increases since fewer PCPs are generated in Algorithm 6.5. Next, we set η_min = 3 and vary η_max in the range [5–12]. We observe that varying η_max has little impact on MP (Fig. 6.11) compared to varying η_min. In particular, when η_max is varied, MP_max − MP_min varies in the range [3.5–4.3]. In contrast, when η_min is varied, MP_max − MP_min varies in the range [84.2–89.9]. Due to the relatively small effect on MP, maximum and average μ remain relatively constant when η_max is varied. PGT increases as η_max increases due to the generation of a larger number of PCPs.

Fig. 6.9 Effect of varying |P|: maximum μ, average μ, MP, and PGT on AIDS10k, AIDS40k, PubChem, and eMol

Fig. 6.10 Effect of varying η_min: maximum μ, average μ, MP, and PGT

Fig. 6.11 Effect of varying η_max: maximum μ, average μ, MP, and PGT

Fig. 6.12 Effects of varying plug on div

Finally, we examine the effect of varying η_min, η_max, and |P| on div and cog for AIDS10k. We observe that increasing |P| results in decreasing div (Fig. 6.12). This is expected, as it is more likely to find a similar graph in a large pool than in a small one. When η_min increases, div increases, as there tends to be greater diversity between larger patterns. In contrast, cog remains relatively constant (cog ∈ [1.59–2.36]). Results are qualitatively similar on the other datasets.

Exp 6: Comparison with commercial VQIs. We compare CATAPULT with PubChem and eMol. Canned patterns on the VQIs that are of size 3 or larger are extracted for our study. Specifically, the PubChem VQI has 12 patterns with sizes in the range [3–8], of which 11 contain no vertex labels (referred to as unlabeled patterns). On the other hand, the eMol VQI has 6 unlabeled patterns with sizes in the range [3–8]. Hence, we generate 12 and 6 patterns in the size range [3–8] with CATAPULT for comparison with PubChem and eMol, respectively.

Fig. 6.13 User study: query formulation time (sec) and steps taken for queries Q1–Q5 (CATAPULT versus PubChem and eMolecule)

Fig. 6.14 Effect of varying |P| on coverage: scov and lcov of P versus the top-|P| frequent edges (PubChem and AIDS40k)

Query sets for PubChem and eMol are generated according to Sect. 6.7.1. Furthermore, we redefine the reduction ratio as μ_G = (step_P(VQI) − step_P(CATAPULT))/step_P(VQI), where step_P(X) is the number of steps required to construct a subgraph query when the pattern set P obtained from X is used. Since the majority of canned patterns in PubChem and eMol are unlabeled graphs, in order to compute MP and μ we perform a vertex relabeling step before computing these measures. Specifically, we map unlabeled canned patterns to labeled subgraph queries. The queries are first relabeled such that all vertices have the same label (e.g., C), and all vertices in the unlabeled canned patterns are assigned this label as well. Note that the vertex relabeling step is favorable to the performance of these two VQIs as it underestimates the number of steps for step_P(VQI). In the PubChem and eMol VQIs, when unlabeled patterns are used, a user undertakes one of the following steps to label the vertices: (1) 2-step labeling: select a vertex label (step 1), then click an unlabeled vertex to assign the label (step 2); (2) 1-step labeling: click on an unlabeled vertex to assign the label. Note that 2-step labeling is used if no vertex label is currently selected or if the previously selected label does not match the label of the current vertex, whereas 1-step labeling is used otherwise. For example, a user first specifies a vertex label (e.g., C) by choosing it from the VQI and then clicks on a vertex v_1 to which she wishes to assign the label. Suppose the remaining vertices also have the same label C. Then she simply clicks on the remaining vertices in turn to assign the label. Hence, a total of two additional steps are needed to label vertex v_1 and one additional step for each of the remaining ones. Consequently, step_P(VQI) is adjusted to step_P(VQI) + |V_P^l|,


where |V_P^l| is the total number of vertices in the unlabeled canned patterns used to construct the subgraph query, assuming the optimistic case of 1-step labeling.

First, we observe that the average cog of canned patterns is lowest for CATAPULT (cog(eMol) = 2.05 versus cog(CATAPULT) = 1.83; cog(PubChem) = 2.53 versus cog(CATAPULT) = 2.01). The average diversity (div) of CATAPULT-derived patterns is high, with values of 9 (eMol) and 7.4 (PubChem). Second, patterns from CATAPULT have superior μ_G compared to the eMol VQI, with maximum and average μ_G of 0.86 and 0.18, respectively. There are also fewer subgraph queries that cannot be formulated using CATAPULT's canned patterns (MP_CATAPULT = 14.4 versus MP_eMolVQI = 29.4). Third, PubChem has an extremely low MP (MP_PubChemVQI = 0.2 versus MP_CATAPULT = 18.6) due to the lack of vertex labels (which relaxes vertex mapping) and the topological variety of its patterns. However, CATAPULT-generated patterns still perform better than PubChem's, with maximum and average μ_G of 0.79 and 0.03, respectively. Note that this is despite the fact that the vertex relabeling step is favorable to PubChem. In summary, CATAPULT has the best performance.

Exp 7: Comparison with a frequent subgraph-based technique. We use AIDS10k for this experiment. We generate frequent subgraphs (denoted F) using GASTON (Nijssen and Kok 2005). In particular, we set |F| = 30, where every frequent subgraph has size in the range [3–12] and the maximum number of patterns per size is |F|/(12 − 3 + 1). We vary the support threshold in the range {4%, 8%, 12%}. Similar to Exp 6, we redefine the reduction ratio as μ_F = (step_F − step_P)/step_F. Note that for this experiment we cannot simply use the randomly generated query set, because such a query set may be unduly biased toward containing frequent subgraphs, as they occur more often. In real-world applications, user queries can be frequent or infrequent subgraphs. Hence, we generate a query set Q_x, where x is the fraction of queries that are infrequent, and set |Q_x| = 50. Figure 6.15 plots the results. Observe that when x = 0, the queries are all frequent. Naturally, CATAPULT performs worse in this case as its canned patterns contain a mixture of frequent and infrequent patterns. However, when x > 0, CATAPULT's performance improves, and it outperforms the frequent subgraph-based technique when x = 0.3. Specifically, MP for CATAPULT remains relatively constant, whereas it increases linearly for F (4% and 12%). Furthermore, the div of CATAPULT is higher than that of F (7.4 vs. 1.74). In summary, patterns generated by CATAPULT are superior to frequent subgraph-based patterns.

Fig. 6.15 CATAPULT versus frequent subgraph patterns: MP and μ_F for query sets Q0–Q0.4


Table 6.2 Queries used for the user study. CID is the unique identifier of the PubChem repository

Query   PubChem CID (PubChem)    PubChem CID (eMol)
Q1      7809 (|E|=18)            57491213 (|E|=12)
Q2      769013 (|E|=29)          98037 (|E|=17)
Q3      169132 (|E|=34)          52426 (|E|=23)
Q4      22749902 (|E|=39)        17081 (|E|=33)
Q5      63559561 (|E|=40)        10097586 (|E|=35)

Exp 8: User study. We conducted a user study to investigate the impact of data-driven canned pattern selection on query formulation time (QFT). We compare CATAPULT with PubChem and eMolecule (referred to as P(VQI)) on a set of user-formulated queries. For each dataset, we select 5 queries (Table 6.2) with sizes in the range [12–40] from the respective repositories. These queries span a variety of structures (cycles, carbon chains, etc.) and contain different vertex labels (i.e., L(V_Q) = {C, Cl, H, N, O, S}). Twenty-five unpaid volunteers (aged 20 to 30) took part in the study; 56% and 20% of the volunteers have taken undergraduate chemistry/chemical engineering and biology courses, respectively. They were trained to use the three VQIs. For every query, they were given some time to determine the steps needed to formulate it visually and were instructed to use canned patterns as much as possible. Every query was formulated 5 times by 5 different participants. Recall that the vertices of the P(VQI) patterns are unlabeled. Hence, as in Exp 6, these vertices are assigned a common label that is not in L(V_Q), and participants have to relabel them to the correct vertex labels during query formulation. The QFT and the number of steps taken are recorded. Figure 6.13 reports the average readings for each query. Note that QFTs include the search time for relevant patterns. Clearly, the canned patterns generated by CATAPULT facilitate more efficient query formulation (shorter QFT and fewer steps) than those obtained from the other VQIs. In particular, CATAPULT's patterns achieve up to 78% (resp. 81%) and 74% (resp. 75%) reduction in QFT and the number of steps, respectively, for PubChem (resp. eMol) queries.

Exp 9: Coverage. We examine the coverage of P using scov and lcov and compare it with the coverage of the top-|P| frequent edges. Figure 6.14 plots the results for AIDS40k and PubChem; results are qualitatively similar on the other datasets. We observe that scov increases as |P| increases. This highlights that additional canned patterns added to the set are topologically distinct from existing ones, resulting in an increase in coverage. Naturally, the top-|P| frequent edges have higher scov than CATAPULT's canned patterns due to their small size and greater chance of occurring in a data graph. For smaller |P|, CATAPULT's canned patterns tend to have slightly higher lcov than the top-|P| frequent edges, since canned patterns are larger and there is a higher likelihood of them containing more unique edges than the frequent edges. However, as |P| increases, this effect is reversed. This is because CATAPULT's canned patterns have a relatively stable set of unique edges generated by our random walk-based algorithm, which tends to favor paths with greater support, whereas the distinct edge labels of the top-|P| frequent edges grow as |P| increases. In particular, the scov (resp. lcov) of the top-|P| frequent edges and of CATAPULT's canned patterns vary in the range [0.98–1] (resp. [0.98–1]) and [0.91–0.98] (resp. [0.98–1]), respectively. Recall that the edge-at-a-time mode is inefficient as it may require more steps than the pattern-at-a-time mode. Hence, CATAPULT's patterns not only have good coverage (scov ∼ 94% on average for all datasets) but also support efficient query formulation.

Fig. 6.16 Scalability study (PubChem): clustering time, PGT, MP, and μ_DS for dataset sizes 23K, 250K, 500K, and 1M

Exp 10: Scalability. We examine the scalability of CATAPULT using PubChem with dataset sizes in the range {23K, 250K, 500K, 1M}. Similar to Exp 6, 12 canned patterns with sizes in the range [3–8] are extracted using CATAPULT. Figure 6.16 reports the results. As expected, the clustering time and PGT increase as |D| increases. In particular, the increase is about an order of magnitude when |D| grows from 23K to 1 million data graphs. The larger datasets also result in lower MP and a negative average relative reduction ratio μ_DS, where μ_DS = (step_P(PubChem_DS) − step_P(PubChem_23K))/step_P(PubChem_DS), DS ∈ {23K, 250K, 500K, 1M}, and step_P(PubChem_x) is the number of steps required to construct a subgraph query based on the canned patterns generated by CATAPULT for the PubChem dataset of size x. A negative μ_DS implies that, on average, step_P(PubChem_DS) < step_P(PubChem_23K) when DS > 23K. That is, the quality of canned patterns improves with the dataset size. Interestingly, the improvements in MP and μ_DS show an anti-monotonic trend, with the best results obtained at |D| = 250K, where μ_DS and MP improve by 21.2% and 43%, respectively, compared to |D| = 23K. Comparatively, PGT and clustering time are 8.43 and 2.93 times slower. This is likely due to two competing effects: (a) a larger dataset improves the quality of canned patterns, but (b) sampling degrades the pattern quality. Specifically, when the dataset size increases from 23K to 250K, the former effect is dominant, but subsequently the latter has a greater impact. Hence, we can generate high-quality canned patterns without processing the entire dataset (e.g., 250K instead of 1M).


6.8 AURORA—A PnP Interface for Graph Databases

CATAPULT and the basic pattern generation technique are key components of AURORA (dAta-driven qUery inteRface cOnstruction for gRaph dAtabases), a novel data-driven visual graph query interface construction system for graph databases (Bhowmick et al. 2020); the code of AURORA is available at https://github.com/MIDAS2020/AURORA. In this section, we describe the VQI structure of AURORA and emphasize the components created by CATAPULT. A short video of AURORA is available at https://www.youtube.com/watch?v=tUz3PU4k-0o&t=6s.

6.8.1 VQI Structure

Figure 6.17 depicts a screenshot of the AURORA VQI, which is generated based on the PnP template in Fig. 4.2. It consists of the following panels:

1. Panel 1: Contains buttons that allow users to select a dataset, select canned patterns according to a plug, and load previously generated canned patterns.
2. Panel 2: Contains a list of distinct vertex labels in the selected dataset (Attribute panel).
3. Panel 3: This panel generates statistics related to query formulation. Specifically, it records the time and number of steps taken, and the patterns used to formulate a query.
4. Panel 4: This panel is used for subgraph query formulation (Query Canvas).
5. Panel 5: Displays the basic patterns.
6. Panel 6: Displays the canned patterns.

Fig. 6.17 The PnP interface constructed by AURORA


Recall that when a user invokes the PnP template, she first chooses her dataset of interest and specifies the plug using Panel 1. Then AURORA automatically populates Panels 2, 5, and 6. In particular, the canned patterns in Panel 6 are generated by CATAPULT. They can be arranged according to a user's preference: she may select one of the options available in a drop-down box on the top right corner of the panel. The current version of AURORA supports the following options:

1. Group-by-size: Canned patterns are grouped according to their size and the groupings are ordered in increasing pattern size. Users can view patterns of a particular size by selecting the corresponding group from a tabbed panel. Panel 6 in Fig. 6.17 depicts an example of canned patterns of size 9.
2. Single page: All canned patterns are displayed on a single scrollable page.
3. x per page: Canned patterns are ordered in increasing pattern size and divided into groups of x patterns for display.

Panel 5 in AURORA provides five basic patterns based on the approach discussed in Sect. 6.6. In particular, these patterns consist of three labeled edges and two 2-edge patterns. For example, the selection of (C, N) is described in Example 6.11.

6.8.2 Pattern-at-a-time Query Formulation

The list of all distinct vertex labels in a dataset (Panel 2), the basic patterns (Panel 5), and the canned patterns (Panel 6) allow users to formulate all possible queries on a graph database. We use the query in Panel 4 to illustrate how a user can visually construct it in AURORA.

1. Step 1: Select the canned pattern P2 and drag-and-drop it in Panel 4.
2. Step 2: Select the basic pattern B1 and drag-and-drop it in Panel 4.
3. Step 3: Use the mouse to connect P2 and B1 by dragging the vertex C of B1 onto the vertex C of P2.
4. Step 4: Select the basic pattern B4 and drag-and-drop it in Panel 4.
5. Step 5: Use the mouse to connect B1 and B4 by dragging the vertex N of B1 onto the vertex N of B4.

6.8.3 User Experience and Feedback

We interviewed participants of the user study (Exp 8) about their experience using the VQIs of AURORA, PubChem, and eMolecule. A summary of their feedback is listed in Table 6.3. Several users highlighted that they found it tedious to use the canned pattern sets of PubChem and eMolecule when there is a frequent need to change vertex labels during query formulation. Some participants observed that the canned pattern set of eMolecule contains only cycles; they felt that the canned patterns of AURORA are more diverse and more useful for formulating queries. Lastly, several users complimented the ease of use of AURORA. They commented that they can use the single VQI of AURORA to query multiple data sources such as PubChem and eMolecule, whereas if they use eMolecule and PubChem directly, they have to learn how to use each VQI. This highlights the advantage of portability brought by the paradigm of data-driven visual subgraph query interface construction. In summary, we believe that participants find our VQI more efficient and user-friendly than the VQIs of PubChem and eMolecule.

Table 6.3 Examples of user feedback

1. "The PubChem VQI is quite cluttered and makes it difficult to find a vertex label. Vertex labels in the AURORA VQI are arranged alphabetically in a list and easier to scroll through. I also like the vertex label search function."
2. "I need to change the vertex labels all the time when I use the VQIs of eMolecule and PubChem. It is more convenient to use the AURORA VQI since I do not have to change labels as often."
3. "The features to rotate the patterns (AURORA VQI) and to zoom in and out (AURORA and eMolecule VQIs) are useful when I construct a query graph."
4. "Patterns in the eMolecule VQI are all cyclical shapes (like triangles, squares, and pentagons). The canned patterns in the AURORA VQI are more varied and help me more in drawing a query."
5. "The small canvas size in eMolecule makes it difficult to draw larger queries."
6. "I can learn the AURORA VQI easily once and then query several graph stores. On the other hand, I have to learn how to use the PubChem and eMolecule VQIs to query the respective data stores. Certainly, AURORA is more appealing to use!"

6.9 Conclusions

In this chapter, we take a concrete step toward realizing the vision of PnP interface construction for graph databases. We focus on the problem of automatic selection of patterns, which is a key step in a PnP engine and at the core of building PnP interfaces. In particular, we present a framework called CATAPULT to this end. We propose a small graph clustering strategy to summarize topologically similar data graphs into CSGs and a random walk-based strategy to select canned patterns with high coverage, high diversity, and low cognitive load from them. Our experimental study demonstrates the superiority of the CATAPULT framework to manually selected canned patterns in traditional visual subgraph query interfaces. CATAPULT serves as the core engine to power AURORA, a PnP interface for graph databases.


In Chap. 8, we shall describe the efficient maintenance of canned patterns as the graph database evolves.

References

A.V. Aho, J.E. Hopcroft, J.D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
S. Arora, E. Hazan, S. Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory Comput. 8(1), 2012.
D. Arthur, S. Vassilvitskii. How slow is the k-means method? In SCG, 2006.
D. Arthur, S. Vassilvitskii. k-means++: The advantages of careful seeding. In SIAM, 2007.
S. S. Bhowmick, et al. AURORA: Data-driven construction of visual graph query interfaces for graph databases. In SIGMOD, 2020.
S. S. Bhowmick, B. Choi, C. E. Dyreson. Data-driven visual graph query interface construction and maintenance: challenges and opportunities. PVLDB 9(12), 2016.
Y. Chi, et al. Indexing and mining free trees. In ICDM, 2003.
W.G. Cochran. Sampling Techniques. Third edition. Wiley, New York, USA, 1991.
L.P. Cordella, P. Foggia, C. Sansone. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell., 26(10):1367–1372, 2004.
F. Geerts, et al. Relational link-based ranking. In VLDB, 2004.
C. Guestrin, A. Krause, A.P. Singh. Near-optimal sensor placements in Gaussian processes. In ICML, 2005.
S. Günter, H. Bunke. Self-organizing map for clustering in the graph domain. Pattern Recogn. Lett., 23(4):405–417, 2002.
H. He, A.K. Singh. Closure-tree: an index structure for graph queries. In ICDE, 2006.
K. Huang, et al. CATAPULT: data-driven selection of canned patterns for efficient visual graph query formulation. In SIGMOD, 2019.
K. Jain, V.V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM, 48(2):274–296, 2001.
J.J. McGregor. Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience, 12(1):23–34, 1982.
M.D. McKay, R.J. Beckman, W.J. Conover. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics, 21(2):239–245, 1979.
M. Meilă. The uniqueness of a good optimum for k-means. In ICML, 2006.
S. Nijssen, J.N. Kok. The Gaston tool for frequent subgraph mining. Electron. Notes Theor. Comput. Sci., 127(1):77–87, 2005.
T. Ramraj, R. Prabhakar. Frequent subgraph mining algorithms - a survey. Procedia Computer Science, 47:197–204, 2015.
K. Riesen, M. Neuhaus, H. Bunke. Bipartite graph matching for computing the edit distance of graphs. In GbRPR, 2007.
S. Sakai, M. Togasaki, K. Yamazaki. A note on greedy algorithms for the maximum weighted independent set problem. Discrete Appl. Math., 126(2–3):313–322, 2003.
S.E. Schaeffer. Graph clustering. Comput. Sci. Rev., 1(1):27–64, 2007.
T. Schäfer, P. Mutzel. StruClus: structural clustering of large-scale graph databases. CoRR abs/1609.09000, 2016.
C. Tofallis. Add or multiply? A tutorial on ranking and choosing with multiple criteria. INFORMS Trans. on Education, 14(3):109–119, 2014.
H. Toivonen. Sampling large databases for association rules. In VLDB, 1996.

7 Pattern Selection for Large Networks

In the preceding chapter, we discussed the selection of canned patterns (i.e., the CPS problem) from a large collection of small- or medium-sized data graphs (i.e., graph databases). In this chapter, we present a framework called TATTOO (daTa-driven cAnned paTtern selecTiOn from netwOrks) that takes a data-driven approach to the CPS problem for large networks. Given a network G and a user-specified plug b, TATTOO automatically selects canned patterns from G that satisfy b. At first glance, a reader may wonder why the CATAPULT framework for graph databases cannot be utilized to address the CPS problem for large networks. In TATTOO, we focus on selecting unlabeled canned patterns from large networks, in contrast to labeled patterns, for reasons discussed in Sect. 7.1. Note that there is an exponential number of them in a large network. Furthermore, CATAPULT first partitions a collection of data graphs into a set of clusters and summarizes each cluster into a CSG. Then, it selects the canned patterns from these CSGs using a weighted random walk-based approach. This clustering-based approach is prohibitively expensive for large networks, as detailed later. Lastly, CATAPULT does not exploit characteristics of real-world subgraph queries for selecting canned patterns, primarily due to the lack of such publicly available data for graph databases. In contrast, TATTOO utilizes topological characteristics of real-world queries to guide the solution design. Specifically, it exploits a recent analysis of real-world query logs (Bonifati et al. 2017) to classify the topologies of canned patterns into categories that are consistent with the topologies of real-world queries. This enables us to reach a middle ground where TATTOO is not restricted by the availability of query logs yet exploits the topological characteristics of real-world queries to guide the selection process.

Figure 7.1 depicts an overview of the TATTOO framework. It realizes an efficient candidate canned pattern generation technique based on the classified topologies to identify potentially useful patterns. Canned patterns are selected from these candidates for display on the VQI based on a pattern set score that is sensitive to the coverage, diversity, and cognitive load of the patterns. Specifically, it leverages recent progress in the algorithm community to propose a selection algorithm that guarantees a 1/e-approximation (Buchbinder et al. 2014).


Fig. 7.1 Overview of TATTOO

Specifically, it leverages recent progress in the algorithm community to propose a selection algorithm that guarantees a 1/e-approximation (Buchbinder et al. 2014). Experiments with several real-world large networks and users reveal that TATTOO can select canned patterns within a few minutes. Importantly, these patterns can reduce the number of steps taken to formulate a subgraph query and the query formulation time by up to 9.7X and 18X, respectively, compared to several baseline strategies. Unlike the preceding chapter, note that in this chapter we do not automatically select basic patterns for large networks. In large networks, a basic pattern has a size z ≤ 2, with the exception of the 3-cycle and 4-cycle. Such small-size patterns are basic building blocks of networks (Wang and Chen 2003; Milo et al. 2002) (e.g., edge, 2-path, triangle, 4-cycle). Hence, they appear in most large networks and are provided by default for all datasets. Observe that this is feasible for large networks as our goal here is to expose unlabeled patterns on a VQI instead of labeled patterns. In the sequel, we refer to basic patterns in a large network as default patterns in order to distinguish them from the automatically selected basic patterns in Chap. 6. In summary, this chapter makes the following contributions:
• We describe TATTOO, an end-to-end canned pattern selection framework for any plug-and-play visual subgraph query interface for large networks, independent of domains and data sources.
• We formally introduce the CPS problem for large networks (Sect. 7.1) and present a categorization of potentially useful canned patterns in Sect. 7.2.
• We present an efficient solution to select canned patterns for a VQI (Sects. 7.3–7.4). Specifically, we present a candidate pattern generation framework that is grounded on topologies of real-world subgraph queries. Furthermore, we utilize the recent technique in Buchbinder et al. (2014) from the algorithm community to select canned patterns with good theoretical quality guarantees.
• Using real-world networks, we show the superiority of TATTOO to several baselines (Sect. 7.5).
• In Sect. 7.6, we present an end-to-end PnP interface system called PLAYPEN that exploits TATTOO to realize the vision of PnP interfaces for large networks.


Table 7.1 Key symbols for this chapter

Symbol                Definition
C_k = (V_ck, E_ck)    k-chord pattern (k-CP)
CCP_i(k1, k2)         Composite chord pattern (CCP) of k1-CP and k2-CP
G_T                   TIR graph
G_O                   TOR graph
t(e)                  Edge trussness of an edge e
sup(e)                Support of an edge e
freq(·)               Frequency of a pattern
P_CP                  A set of k-CP patterns
NB_cc(k′, e)          k′-CCP node neighbourhood of an edge e
EB_cc(k′, e)          Edge neighborhood of an edge e
S_i                   Skeleton structure of CCP_i
G_R = (V_R, E_R)      Remainder graph
S_k                   k-star pattern
k                     k ∈ [2, k_max] of a k-truss
s(P)                  Pattern score of pattern set P

The key notations specific to this chapter are given in Table 7.1.

7.1 The CPS Problem

Given a data graph or network G = (V, E), a visual subgraph query interface I (i.e., a PnP template), and a user-specified plug b, the goal of the canned pattern selection (CPS) problem is to select a set of unlabeled patterns P for display on I, which satisfies the specifications in b and optimizes the coverage, diversity, and cognitive load of P. Observe that the CPS problem for large networks differs from the one introduced in Chap. 6 in two key ways. First, we focus on a single large network instead of a large collection of small- or medium-sized data graphs. Second, we select unlabeled patterns instead of labeled ones. In large networks, a subgraph query may not always contain labels on its vertices or edges. Specifically, unlabeled query graphs are formulated in the subgraph enumeration problem (Afrati et al. 2013), whereas query graphs are labeled in the subgraph matching problem (Sun and Luo 2020). Hence, by selecting unlabeled patterns, TATTOO facilitates the visual formulation of both these categories of queries. Furthermore, there is a soup of data models for large networks such as simple graphs, data graphs, and different variants of property graphs. Unlabeled patterns enable us to work with all these different data models. In particular, one may simply drag-and-drop specific vertex/edge labels from the Attribute panel of a VQI to add labels to the vertices/edges of a pattern regardless of the type of the model.


We now formally define the CPS problem addressed in this chapter. Recall from Chap. 5 the definitions of coverage, diversity, and cognitive load of patterns in the context of large networks.

Definition 7.1 (CPS Problem) Given a network G, a VQI I, and a plug b = (η_min, η_max, γ), the goal of the canned pattern selection (CPS) problem is to find a set of unlabeled canned patterns P from G that satisfies

    max ( f_cov(P), −f_sim(P), −f_cog(P) )  subject to  |P| = γ, P ∈ U        (7.1)

where η_min > 2; P is the solution; U is the feasible set of canned pattern sets in G; and f_cov(P), f_sim(P), and f_cog(P) are the coverage, similarity, and cognitive load of P, respectively.

Remark Observe that CPS is a multi-objective optimization problem as our goal is to maximize the coverage and diversity (i.e., minimize the similarity) of canned patterns while minimizing their cognitive load. Hence, we address it by converting CPS into a single-objective optimization problem using a pattern score. It is shown to be NP-hard in Huang et al. (2019) by reducing it from the classical maximum coverage problem.

By selecting unlabeled patterns instead of labeled ones for display on a PnP interface, TATTOO paves the way for a single effective solution for building such interfaces on a variety of graph data models that occur in practice.

7.2 Categories of Canned Patterns

In theory, numerous different patterns can be selected from a given network. Which of these are “useful” for subgraph query formulation in practice? In this section, we provide an answer to this question.

7.2.1 Topologies of Real-World Queries

Although basic building blocks of networks (Milo et al. 2002; Wang and Chen 2003) are presented as default patterns in our VQI, as remarked in Chap. 4, they are insufficient as they do not expose to a user more domain-specific and larger patterns in the underlying data. Such larger substructures not only facilitate more efficient construction of subgraph queries but also guide users for bottom-up search by exposing substructures that are network-specific.


Fig. 7.2 Examples of real-world query topologies

Note that patterns such as wedge, triangle, and rectangle do not effectively trigger bottom-up search as they appear in almost all real-world networks. However, which topologies of these substructures should be considered for canned patterns? Ideally, real-world subgraph query logs can provide guidance to resolve this challenge. However, as remarked in Chap. 6, such data may be unavailable. Fortunately, unlike graph databases, there is a recent study (Bonifati et al. 2017) that analyzed a large volume of real-world SPARQL query logs, which we can exploit to guide canned pattern selection. This study revealed that the topologies of many real-world subgraph queries map to chains, trees, stars, cycles, petals, and flowers¹ (Bonifati et al. 2017). Figure 7.2 depicts examples of these topologies in real-world subgraph queries extracted from BigRDFBench (Saleem et al. 2018), BSBM,² Rapid,³ and DBPedia (Ell et al. 2011). Consequently, canned patterns in any VQI should facilitate the efficient construction of these topologies.

1 A petal is a graph consisting of a source node s, a target node t, and a set of at least 2 node-disjoint paths from s to t. A flower is a graph consisting of a node x with three types of attachments: chains (stamens), trees that are not chains (the stems), and petals. A flower set is a graph in which every connected component is a flower.

2 http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/20080912/index.html.
3 https://research.csc.ncsu.edu/coul/RAPID/RAPIDAnalytics/.

7.2.2 Topologies of Canned Patterns

We consider the following types of topological structures of canned patterns in order to facilitate the construction of the above query substructures.

Path and cycle patterns. A subgraph query may contain paths of different lengths (i.e., chains) and/or cycles. Figure 7.2 depicts some examples. Hence, our canned patterns should expose representative k-paths and k-cycles in the underlying data. Given a graph G = (V, E), a k-path, denoted as P_k = (V_k, E_k), is a walk of length k containing a sequence of vertices v_1, v_2, ..., v_k, v_{k+1} where E_k ⊆ E, V_k ⊆ V such that all vertices in V_k are distinct. A k-cycle is simply a closed (k − 1)-path where k ≥ 3.

Star and asterism patterns. Intuitively, a star is a connected subgraph containing a vertex r where the remaining vertices are connected only to r (i.e., neighbors of r). A k-star is a single-level, rooted tree S_k = (V, E) where V = {r} ∪ L, r is the root vertex, and L is the set of leaves such that ∀e = {u, v} ∈ E, u = r, v ∈ L and |V| = k + 1. We refer to the root as the center vertex. Note that k ≥ ℓ, where ℓ is the minimum value of k for which the single-level rooted tree is considered a star. Real-world queries may contain multiple k-stars that are combined together. For instance, the query topology in Fig. 7.2e is a combination of a 6-star and a 7-star obtained by merging them on a pair of edges. Hence, our canned pattern topology also involves stars that form an asterism pattern by merging them on a pair of edges. Formally, given n stars S = {S_k1, ..., S_kn} and n − 1 merged edges E_m = {e_m1, ..., e_m(n−1)} where S_ki = (V_i, E_i) and e_mi ∈ E_i, let R = {r_1, ..., r_n} be the center vertices such that r_i ∈ V_i. The asterism pattern of S is defined as A_S = (V, E), where V and E are obtained by taking the union of the vertex and edge sets of the stars in S and merging, for 1 ≤ i < n, the edge e_i = (r_i, v_i) of S_ki with the edge e_{i+1} = (r_{i+1}, v_{i+1}) of S_k(i+1) into the single merged edge e_mi = (r_i, r_{i+1}).

Chord patterns. Real-world queries may also contain denser, "k-truss-like" substructures such as petals and flowers (Fig. 7.2). To expose such structures, we derive patterns from the k-trusses of the underlying network. Given a k-truss (V_k, E_k) where k > 2, the k-chord pattern (k-CP) C_k = (V_ck, E_ck) associated with every edge e = (u, v) ∈ E_k where u, v ∈ V_k is defined as V_ck = {u, v} ∪ V′_ck and E_ck = {(u, v)} ∪ E′_ck, where V′_ck = {w_i : 1 ≤ i ≤ k − 2} and E′_ck = {(u, w_i), (w_i, v) : 1 ≤ i ≤ k − 2}. A k-CP can be considered as a building block of k-trusses since it is found with respect to each edge in a given k-truss. Examples of k-CPs (4-CP and 5-CP) are illustrated in Fig. 7.3. We refer to the edge in a k-chord pattern that is involved in (k − 2) triangles as a truss edge and to the remaining edges as non-truss edges. For example, in Fig. 7.3, edges (A1, B1) and (A2, B2) are truss edges, whereas (A1, C1) and (B2, D2) are non-truss edges. Correspondingly, the vertices of a truss edge (e.g., A1, B1, A2, B2) are referred to as truss vertices. Observe that we can formulate a simple petal query in two steps by selecting the 4-CP pattern and deleting the truss edge. To select larger canned patterns with greater structural diversity, we combine k-CPs to yield additional composite chord patterns (CCP) that occur in the underlying network. Observe that combining a set of k-CPs in different ways results in different patterns, as demonstrated in Fig. 7.3. However, considering all such combinations is an overkill as they are not only expensive to compute but may also generate patterns with higher density (higher cognitive load) or patterns larger than η_max. Hence, we focus on the CCPs generated by merging a single edge of two k-CPs, as this not only reduces the complexity of CCP generation but also produces CCPs with lower density.
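To make the above topology definitions concrete, the following minimal Python sketch (not taken from TATTOO; all function and vertex names are illustrative) builds k-paths, k-cycles, k-stars, and k-chord patterns as plain edge lists.

```python
# Illustrative sketch of the canned pattern topologies of Sect. 7.2.2.

def k_path(k):
    # k-path: k edges over k+1 distinct vertices 0-1-...-k
    return [(i, i + 1) for i in range(k)]

def k_cycle(k):
    # k-cycle: a closed (k-1)-path, k >= 3
    assert k >= 3
    return [(i, (i + 1) % k) for i in range(k)]

def k_star(k, center="r"):
    # k-star: single-level rooted tree with k leaves attached to the center
    return [(center, ("leaf", i)) for i in range(k)]

def k_chord_pattern(k):
    # k-CP: truss edge (u, v) plus k-2 chord vertices w_1..w_{k-2},
    # each adjacent to both u and v, so (u, v) lies in k-2 triangles
    assert k >= 3
    edges = [("u", "v")]
    for i in range(1, k - 1):
        w = ("w", i)
        edges += [("u", w), (w, "v")]
    return edges

if __name__ == "__main__":
    print(k_path(4))           # a chain of length 4
    print(k_cycle(6))          # a 6-cycle
    print(k_star(5))           # a 5-star
    print(k_chord_pattern(4))  # a 4-CP: (u, v) in 2 triangles
```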

Fig. 7.3 k-chord and composite chord patterns. Gray nodes are truss nodes and oval-shaped nodes are combined nodes


Table 7.2 No. of steps for constructing queries

ID | Edge-at-a-time | Default patterns | Canned patterns
(a) | 5 | 1 | 1 [2-path]
(b) | 11 | 3 [2 2-path + 1 edge] | 1 [5-path]; 2 [6-cycle − 1 edge]
(c) | 17 | 6 [2 2-path + 1 square − 1 edge + 1 edge + 1 merge] | 5 [4-path + 2 2-path + 2 merge]; 5 [4-star + 1 2-path + 1 node + 2 edge]
(d) | 15 | 5 [3 2-path + 1 edge + 1 merge] | 3 [3-star + 4-path + 1 merge]; 4 [2 4-path + 2 merge]
(e) | 25 | 11 [5 2-path + 1 node + 2 edge + 3 merge] | 1 [A6,7]; 3 [5-star + 6-star + 1 edge]
(f) | 27 | 13 [6 2-path + 1 node + 1 edge + 5 merge] | 5 [11-star + 2 edge + 2 node]; 5 [9-star + 2 2-path + 2 merge]
(g) | 18 | 8 [4 2-path + 1 edge + 3 merge] | 3 [5-cycle + 4-star + 1 merge]; 4 [6-path + 2-path + 1 edge + 1 merge]
(h) | 18 | 6 [3 2-path + 3 edge] | 4 [3-star + A3,3 + 1 edge + 1 merge]
(i) | 23 | 10 [square + 3 2-path + 1 node + 2 edge + 3 merge] | 5 [4-CP + 6-star + 1 node + 1 edge + 1 merge]; 5 [CCPno(4,4) − 2 edge + 5-star + 1 merge]
(j) | 31 | 11 [square + 3 2-path + 1-path + 5 edge + 1 merge] | 6 [5-path + 6-cycle + 4-star + 1 edge + 2 merge]; 7 [4-CP − 1 edge + 7-path + 4-star + 1 edge + 2 merge]

Unique small graph patterns. Lastly, we find small connected subgraphs that do not fall under the above categories but occur multiple times in the underlying network. Table 7.2 reports the number of steps taken by various modes of query construction for the query topologies in Fig. 7.2. Observe that query construction using canned patterns often takes fewer steps compared to construction using only default patterns, emphasizing the need for patterns beyond the default ones. One can also formulate a specific query following multiple alternatives, i.e., using multiple sets of patterns (canned and default). This gives users the flexibility to formulate a query using these patterns in many ways, all of which often take fewer steps compared to the edge-at-a-time or default pattern-based modes.

7.3 Candidate Pattern Generation

In the preceding section, we classified the topologies of canned patterns broadly into "k-truss-like" and "non-k-truss-like" structures. In this section, we describe how candidate canned patterns conforming to these topological categories are extracted from the underlying network G. We begin by providing an overview of the TATTOO algorithm. Algorithm 7.6 outlines the procedure. It first decomposes G into truss-infested and truss-oblivious regions (Line 1) and then generates "k-truss-like" and "non-k-truss-like" candidate patterns from these regions, respectively (Lines 2–8). Finally, it selects the canned pattern set from these candidate patterns based on the plug specification (Line 9).


Algorithm 7.6 The TATTOO algorithm.

Require: Data graph G, plug b = (η_min, η_max, γ)
Ensure: Canned pattern set P
1: G_T, G_O ← GraphDecomposition(G)
2: P_CP, freq(P_CP) ← GenChordPatterns(G_T, t(e)) /* Alg. 7.7 */
3: P_CCP, freq(P_CCP) ← GenCombChordPatterns(G_T, t(e)) /* Alg. 7.8 */
4: P_s, freq(P_s) ← GenStarPatterns(G_O) /* Alg. 7.9 */
5: G_R ← RemoveStarPatternEdges(G_O, P_s)
6: P_r, freq(P_r) ← GenSmallPatterns(G_R, b) /* Alg. 7.10 */
7: P_cand ← P_CP ∪ P_CCP ∪ P_s ∪ P_r
8: freq(P_cand) ← freq(P_CP) ∪ freq(P_CCP) ∪ freq(P_s) ∪ freq(P_r)
9: P ← SelectCannedPatterns(P_cand, freq(P_cand), b)

We discuss the decomposition of G and candidate pattern generation in turn. In the next section, we shall elaborate on the selection of canned patterns from the candidate patterns.

7.3.1 Truss-Based Graph Decomposition

In order to extract "non-k-truss-like" and "k-truss-like" structures as candidate patterns, we first decompose a network G into sparse (containing non-trusses) and dense (containing trusses) regions. The latter region is referred to as the truss-infested region (TIR graph) and the former as the truss-oblivious region (TOR graph); they are denoted by G_T and G_O, respectively. Table 7.3 reports the sizes of G_T and G_O in several real-world networks, measured as the percentage of the total number of edges. We observe that G_T basically consists of relatively large connected subgraphs that comprise multiple k-trusses.

Table 7.3 TIR and TOR graphs in real networks

Data  Name              |V|     |E|     %(G_T)  %(G_O)
BK    loc-Brightkite    58K     214K    67.3    32.7
GO    loc-Gowalla       197K    950K    78.2    21.8
DB    com-DBLP          317K    1.05M   93      7
AM    com-Amazon        335K    926K    77.2    22.8
RP    RoadNet-PA        1.09M   1.54M   12.7    87.3
YT    com-Youtube       1.13M   2.99M   46.8    53.2
RT    RoadNet-TX        1.38M   1.92M   12.5    87.5
SK    as-Skitter        1.7M    11M     79.1    20.9
RC    RoadNet-CA        1.97M   2.77M   12.6    87.4
LJ    com-LiveJournal   4M      34.7M   83.2    16.8

Fig. 7.4 Visualization of portions of TOR graphs of HepPh, Amazon, and Skitter
On the other hand, G_O mainly consists of chains (i.e., paths), stars, cycles, and small connected components. Figure 7.4 depicts examples of some of these structures in three networks. Furthermore, although some networks have a small G_O (e.g., com-DBLP), there are networks where G_O is large (e.g., RoadNet-CA), encompassing up to 87.5% of the total number of edges. Consequently, by decomposing a network into G_T and G_O, we can improve efficiency by limiting the search for k-truss-like patterns to G_T instead of the entire network and extract non-truss-like patterns from G_O. Additionally, generating candidate patterns of the aforementioned topological categories from both the TIR and TOR graphs enables us to select a holistic collection of patterns having higher coverage and diversity. The cognitive load of the pattern set is often reduced when patterns from both regions are considered due to the sparse structure of the TOR. TATTOO utilizes the state-of-the-art truss decomposition approach in Wang and Cheng (2012) to decompose G into G_T and G_O. Briefly, this approach identifies the k-trusses (k ∈ [2, k_max]) in G iteratively by removing edges with support less than k − 2 from G. Hence, our graph decomposition algorithm adapts it to assign the 2-truss as G_O and the remaining k-trusses as G_T. Note that the choice of truss decomposition technique is orthogonal to our framework. Any superior technique can be used. We keep track of the edge trussness (denoted as t(e)) in G_T. Since the goal is to select canned patterns with maximum size η_max, the upper bound of edge trussness is set to this value. The algorithm first identifies the support of each edge. Then, regions of the data graph are iteratively extracted by removing edges with the lowest support, starting from the sparsest (i.e., sup(e) = 0) to the densest. In particular, TATTOO considers all edges with sup(e) = 0 as sparse regions, and these edges form the TOR graph G_O. The remaining edges form the TIR graph G_T. In summary, the above approach makes the following two simple modifications to the truss decomposition technique in Wang and Cheng (2012): (1) instead of storing each k-truss as a separate graph, it stores the 2-truss as G_O and combines the remaining k-trusses into a single graph G_T; (2) it assigns a trussness value t(e) to every edge in G_T and G_O. The worst-case time and space complexities of this algorithm are O(|E|^1.5) and O(|V| + |E|), respectively (Wang and Cheng 2012).
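The following minimal Python sketch (an illustration, not the book's implementation) shows the basic TIR/TOR split described above: after counting, for every edge, the triangles it participates in, edges with support 0 form the TOR graph G_O and the remaining edges form the TIR graph G_T. The full iterative edge-trussness computation of Wang and Cheng (2012) is omitted here.

```python
from collections import defaultdict

def decompose(edges):
    # Build adjacency sets for an undirected simple graph.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # support(e) = number of triangles containing e (common neighbours of its endpoints)
    sup = {e: len(adj[e[0]] & adj[e[1]]) for e in edges}
    g_o = [e for e in edges if sup[e] == 0]   # truss-oblivious region (TOR)
    g_t = [e for e in edges if sup[e] > 0]    # truss-infested region (TIR)
    return g_t, g_o, sup

if __name__ == "__main__":
    # a triangle with a pendant 2-path hanging off it
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
    g_t, g_o, sup = decompose(edges)
    print(g_t)  # the triangle edges
    print(g_o)  # the path edges (c, d) and (d, e)
```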

7.3.2 Patterns from a TIR Graph

Next, we generate k-CPs and CCPs as candidate patterns from a TIR graph. For each pattern, we also compute its frequency as it will be used subsequently to measure its coverage. We discuss them in turn.

Generation of k-chord patterns. Algorithm 7.7 describes the generation of the k-CPs. In particular, we can find k-CPs with respect to each edge in a given k-truss. For instance, every edge in a 4-truss and a 5-truss is part of at least 2 and 3 triangles, respectively. Observe that the 2-chord pattern of an edge e is simply the edge itself. Hence, TATTOO generates k-CPs for k ≥ 3. The frequency of a k-CP is measured by the frequency of the pattern occurring in G_T, which is essentially the number of edges having trussness greater than or equal to k (Lines 13 to 20). Formally, given a TIR graph G_T = (V_T, E_T) and a k-chord pattern C_k = (V_ck, E_ck), the frequency of C_k is defined as freq(C_k) = |{e ∈ E_T | t(e) ≥ k}|. The set of k-CPs of G_T is then simply the set of patterns C_k whose frequency is greater than 0. We first generate the k-chord patterns in G_T and then compute their frequencies using edge trussness.

Lemma 7.2 The worst-case time and space complexities of k-CP generation are O(k_max |E_T|^1.5) and O(|V_T| + |E_T|), respectively.

Proof (Sketch) In Algorithm 7.7, the worst-case time complexity is due to Lines 13 to 20, which compute the trussness of each edge e ∈ E_T (O(√|E_T|) per edge; Wang and Cheng 2012), update freq(C_k), and store C_k in the candidate pattern set. Hence, the worst-case time complexity is O(|E_T|^1.5 + |E_T| k_max) since the upper bound of k is k_max. Algorithm 7.7 uses O(|E_T| + |V_T|) and O(|E_T|) space to hold G_T and t(e), respectively. Further, all possible k-chord patterns (3 ≤ k ≤ k_max) and their frequencies have to be stored in the worst case (O(k_max)). Hence, the worst-case space complexity is O(|E_T| + |V_T|) since |E_T| + |V_T|

≫ k_max for a large graph in practice. □

Generation of composite chord patterns. Next, we generate the CCPs. Specifically, we generate the following categories of CCPs based on different ways of merging truss and non-truss edges.

Definition 7.3 Let C_k1 = (V_ck1, E_ck1) and C_k2 = (V_ck2, E_ck2) be two k-chord patterns where s, t ∈ V_ck1 and u, v ∈ V_ck2 are truss vertices. Then, we can generate the following categories of composite chord patterns of C_k1 and C_k2 by merging C_k1 and C_k2 as follows:
1. CCP_tn(k1, k2): merge the truss edge of C_k1 with a non-truss edge of C_k2.
2. CCP_nt(k1, k2): merge the truss edge of C_k2 with a non-truss edge of C_k1.


Algorithm 7.7 GenChordPatterns.

Require: TIR graph G_T = (V_T, E_T), trussness t(e) of all edges
Ensure: Set of k-chord patterns P_CP = {C_k | 3 ≤ k ≤ k_max} and frequency freq(P_CP)
1: for k = 3 to k_max do /* generate k-chord patterns */
2:   C_k = (V_ck, E_ck) ← φ
3:   V_ck ← {u, v}
4:   E_ck ← {(u, v)}
5:   i ← k
6:   while i ≥ 3 do
7:     V_ck ← V_ck ∪ {w_{i−2}}
8:     E_ck ← E_ck ∪ {(u, w_{i−2}), (w_{i−2}, v)}
9:     i ← i − 1
10:  end while
11:  freq(C_k) ← 0
12: end for
13: for each e ∈ E_T do /* compute frequencies using edge trussness */
14:   k ← t(e)
15:   while k ≥ 3 do
16:     freq(C_k) ← freq(C_k) + 1
17:     P_CP ← P_CP ∪ {C_k}
18:     k ← k − 1
19:   end while
20: end for
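A minimal Python sketch (illustrative only) of the frequency computation performed in Lines 13–20 of Algorithm 7.7: given the trussness t(e) of every TIR edge, freq(C_k) = |{e : t(e) ≥ k}| for every k in [3, k_max].

```python
from collections import Counter

def kcp_frequencies(edge_trussness, k_max):
    # edge_trussness: dict mapping an edge to its trussness t(e)
    freq = Counter()
    for t in edge_trussness.values():
        # edge e contributes to every C_k with 3 <= k <= min(t(e), k_max)
        for k in range(3, min(t, k_max) + 1):
            freq[k] += 1
    return freq

if __name__ == "__main__":
    t = {("a", "b"): 5, ("b", "c"): 4, ("a", "c"): 3}
    print(kcp_frequencies(t, k_max=15))  # Counter({3: 3, 4: 2, 5: 1})
```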

3. CCP_no(k1, k2): merge a non-truss edge of C_k1 with a non-truss edge of C_k2 such that there is an overlapping truss vertex.
4. CCP_nn(k1, k2): merge a non-truss edge of C_k1 with a non-truss edge of C_k2 such that there is no overlapping truss vertex.

Figure 7.3 depicts examples of these four categories of CCPs. When the context is clear, we shall simply refer to a CCP as CCP_i. A keen reader may observe that it is possible to create another CCP by merging the truss edge of C_k1 with the truss edge of C_k2. However, this CCP is in fact a k-CP where k = k1 + k2 − 2. For instance, when C_4 and C_5 in Fig. 7.3 are merged on their truss edges, the resultant pattern is a 7-CP. Also, combining two 3-CPs always yields a 4-CP (Lemma 7.4). Since k-CPs have already been handled earlier, these combinations are ignored.

Lemma 7.4 Two 3-CPs always yield a CCP that is a 4-CP.

Proof (Sketch) The simple 3-chord pattern C_3 = (V_c3, E_c3) is simply a triangle. Hence, ∀e = (u, v) ∈ E_c3, there is a vertex w that is adjacent to both u and v. That is, all edges in C_3 have a similar structure. Hence, all different types of single-edge mergers between two C_3 produce a pattern with a merged edge e_m = (x, y) such that the vertices x and y have two common adjacent vertices w_1 and w_2, which is essentially a C_4 whose truss edge corresponds to the merged edge of the two C_3 (Fig. 7.6). □


Fig. 7.5 k-CCP node and edge neighborhoods

Fig. 7.6 Combination of two 3-chord patterns

We now elaborate on how the CCPs and their frequencies are computed efficiently in TATTOO. We introduce two terminologies related to node and edge neighborhoods of a CCP to facilitate the exposition. Given an edge e = (u, v) in a k-truss, the k′-CCP node neighborhood of e (denoted as NB_cc(k′, e)) is the set of vertices W adjacent to both u and v such that ∀w ∈ W, t((u, w)) ≥ k′ and t((w, v)) ≥ k′ where k′ ≤ k. The k′-CCP edge neighborhood of e (denoted as EB_cc(k′, e)) is the set of edges S adjacent to e such that ∀(u, x_1), (x_2, v) ∈ S, x_1, x_2 ∈ NB_cc(k′, e) where k′ ≤ k. Figure 7.5 illustrates examples of k′-CCP node and edge neighborhoods. For instance, NB_cc(4, e) consists of v_3 since t((v_1, v_3)) ≥ 4 and t((v_2, v_3)) ≥ 4.

Lemma 7.5 Given a truss edge e, there is at least one k-chord pattern C_k on e if |NB_cc(k, e)| ≥ (k − 2).

Proof (Sketch) Observe that a k-chord pattern on an edge e = (u, v) implies that k − 2 triangles in the graph contain e. Since NB_cc(k, e) is the set of nodes W adjacent to u and v such that ∀w ∈ W, t((u, w)) ≥ k and t((w, v)) ≥ k, |NB_cc(k, e)| is equivalent to the number of triangles around e. Hence, when |NB_cc(k, e)| ≥ (k − 2), a k-chord pattern must exist on e. □
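The k′-CCP node neighborhood can be computed directly from the adjacency sets and the edge trussness, as in the following minimal sketch (illustrative; it mirrors the NB_cc(4, e) example of Fig. 7.5 under assumed trussness values).

```python
from collections import defaultdict

def nb_cc(k_prime, e, adj, trussness):
    # NBcc(k', e): common neighbours w of e = (u, v) with
    # t((u, w)) >= k' and t((w, v)) >= k'.
    u, v = e
    t = lambda a, b: trussness.get((a, b), trussness.get((b, a), 0))
    return {w for w in adj[u] & adj[v]
            if t(u, w) >= k_prime and t(w, v) >= k_prime}

if __name__ == "__main__":
    # assumed trussness values for a small example
    edges = {("v1", "v2"): 4, ("v1", "v3"): 4, ("v2", "v3"): 4,
             ("v1", "v4"): 3, ("v2", "v4"): 3}
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    print(nb_cc(4, ("v1", "v2"), adj, edges))  # {'v3'}
    print(nb_cc(3, ("v1", "v2"), adj, edges))  # {'v3', 'v4'} (order may vary)
```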


Frequencies of CCP_tn(k1, k2) and CCP_nt(k1, k2). Consider two different k-CPs. CCP_tn and CCP_nt involve the merger of a truss edge belonging to one k-CP with a non-truss edge belonging to another k-CP. Given two k-CPs C_k1 and C_k2, let e_1 and e_2 be the truss edges of C_k1 and C_k2, respectively. Intuitively, a pattern is a CCP_tn(k1, k2) if it contains an embedding of C_k1 and of C_k2 whereby there is an edge e_m in the pattern that belongs to the two embeddings such that e_m is a truss edge of C_k1's embedding and a non-truss edge of C_k2's embedding, respectively. In other words, C_k1 and C_k2 can form a CCP (CCP_tn(k1, k2)) by merging a truss edge e_1 from C_k1 with a non-truss edge from C_k2 if the following conditions are satisfied:
(a) Condition 1: There is a C_k1 pattern on e_1 containing e_2.
(b) Condition 2: There is a C_k2 pattern on e_2 where e_2 ≠ e_1.
Note that due to Lemma 7.5, Condition 1 holds if |NB_cc(k2, e_2) \ {u, v}| ≥ (k2 − 2) where e_1 = (u, v). Further, if |(NB_cc(k1, e_1) ∪ NB_cc(k2, e_2)) \ {u, v}| ≥ (k1 − 2) + (k2 − 2), then the pattern CCP_tn(k1, k2) must exist. Hence, TATTOO checks the conditions iteratively on decreasing k2 and skips the checks for k2′ < k2 if the conditions are satisfied for k2. The frequency of CCP_tn(k1, k2) is simply the number of such e_1 edges. For CCP_nt(k1, k2), the approach is the same with C_k1 and C_k2 swapped.

Frequencies of CCP_nn(k1, k2) and CCP_no(k1, k2). Recall (Definition 7.3) that a single-edge merge can also involve the merger of two non-truss edges, each from a different k-CP. Each non-truss edge contains a truss vertex. There are two ways in which two non-truss edges can merge, as shown in Fig. 7.7b and c. In the former (resp. latter), the vertex pairs (w_1, w_2) (resp. (w_2, u_1)) and (u_1, u_2) (resp. (w_1, u_2)) are merged. Hence, a pattern is a CCP_nn if it contains at least one embedding of the structure shown in Fig. 7.7b, which we refer to as the skeleton structure of CCP_nn (denoted as S_nn). Hence, we can search for the S_nn of a CCP_nn in a TIR graph to compute its occurrence and frequency. Specifically, a CCP_nn can be obtained if the following are satisfied:
(a) Condition 1: There is a C_k1 pattern on its truss edge e_1 = (u_1u_2, v) which contains e_2 = (u_1u_2, w_1w_2).
(b) Condition 2: There is a C_k2 pattern on its truss edge e_3 = (w_1w_2, x) which contains e_2.
Note that Condition 1 holds if |NB_cc(k1, e_1) \ {u_1u_2, w_1w_2}| ≥ (k1 − 3) (Lemma 7.5). Similarly, Condition 2 holds if |NB_cc(k2, e_3) \ {u_1u_2, w_1w_2}| ≥ (k2 − 3). Further, if |(NB_cc(k1, e_1) \ {u_1u_2, w_1w_2}) ∪ (NB_cc(k2, e_3) \ {u_1u_2, w_1w_2})| ≥ (k1 − 3) + (k2 − 3), then the pattern CCP_nn must exist. The frequency of a CCP_nn is simply the number of skeleton structures S_nn in the TIR graph.

Fig. 7.7 a A G T ; b Skeleton structure of CC Pnn ; c Skeleton structure of CC Pno . e1 and e3 are truss edges


Algorithm 7.8 GenCombChordPatterns.

Require: TIR graph G_T = (V_T, E_T), trussness t(e) of all edges
Ensure: Composite chord patterns P = CCP_tn ∪ CCP_nn ∪ CCP_no and frequency freq(P), where CCP_tn = {CCP_tn(k1, k2) | 3 < k1 ≤ k_max, 3 ≤ k2 ≤ k_max}, CCP_nn = {CCP_nn(k1, k2) | 3 < k1 ≤ k_max, 3 ≤ k2 ≤ k_max}, and CCP_no = {CCP_no(k1, k2) | 3 < k1 ≤ k_max, 3 ≤ k2 ≤ k_max}
1: CCP_tn ← φ, CCP_nn ← φ, CCP_no ← φ
2: for e_1 ∈ E_T do
3:   k1 ← t(e_1)
4:   Compute NB_cc(k1, e_1) /* compute k-CCP node neighbourhood */
5:   Compute EB_cc(k1, e_1) /* compute k-CCP edge neighbourhood */
6:   while k1 ≥ 4 do /* find composite chord patterns */
7:     for e_2 ∈ EB_cc(k1, e_1) do
8:       k2 ← Min(t(e_2), k_max − k1)
9:       CCP_tn, freq(CCP_tn) ← GetTN(G_T, e_1, k1, e_2, k2)
10:      CCP_nn, freq(CCP_nn) ← GetNN(G_T, e_1, k1, NB_cc(k1, e_1), EB_cc(k1, e_1), e_2, k2, NN)
11:      CCP_no, freq(CCP_no) ← GetNN(G_T, e_1, k1, NB_cc(k1, e_1), EB_cc(k1, e_1), e_2, k2, NO)
12:    end for
13:    k1 ← k1 − 1
14:  end while
15: end for

A CCP_no is very similar to a CCP_nn except that the truss vertices of the merged edges are not combined during the merger. Figure 7.7c illustrates the skeleton structure of a CCP_no (S_no), which occurs in all CCP_no. The frequency of a CCP_no is the number of skeleton structures S_no. Observe that freq(CCP_nn(k1, k2)) = freq(CCP_no(k2, k1)) since k1 and k2 can be swapped. The same is true for CCP_tn and CCP_nt. Hence, when combining two k-CPs, we only consider the case where k1 ≥ k2.

Algorithm. Putting the above strategies together (outlined in Algorithm 7.8), the CCPs are computed as follows. For each edge in G_T, we compute the k1-CCP node and edge neighborhoods (Lines 4–5). Next, we compute the four types of CCPs (Lines 6–13) based on the aforementioned strategies. Note that the smallest CCP generated is a CCP(3, 4) due to Lemma 7.4. Also, we only compute CCP_tn(k1, k2) instead of both CCP_tn(k1, k2) and CCP_nt(k1, k2) as CCP_nt(k1, k2) is covered when k2 and k1 are swapped.

Theorem 7.6 The worst-case time and space complexities of the CCP generation technique are O(k_max^2 |E_T| |EB_max|^2) and O(k_max |E_T| + |V_T|), respectively.

Proof (Sketch) In Algorithm 7.8, for each edge e ∈ E_T, there are k1 × |EB_cc(k1, e_1)| iterations that invoke the procedures GetTN (O(k_max)) and GetNN (O(k_max |EB_max|)), where EB_max is the k-CCP edge neighborhood with the largest size. The worst-case time complexity is O(k_max^2 |E_T| × |EB_max|^2) since k_max is the upper bound of k1. Algorithm 7.8 requires O(|V_T| + |E_T|) and O(k_max |E_T|) space to store G_T and the NB_cc sets, respectively. In the worst case, all possible combinations of CCP_tn(k1, k2), CCP_no(k1, k2), and CCP_nn(k1, k2), and their respective frequencies, are stored (O(k_max(k_max − 3)/2)). The worst-case space complexity is O(k_max |E_T| + |V_T|) since k_max |E_T| + |V_T| ≫ k_max(k_max − 3)/2 for a large graph in practice. □


requires O(|VT | + |E T |) and O(kmax |E T |) to store G T and NB, respectively. In the worst case, all possible combinations of CC Ptn(k1 ,k2 ) , CC Pno(k1 ,k2 ) , and CC Pnn(k1 ,k2 ) , and their respective frequency are stored (O(kmax kmax2 −3 )). The worst-case space complexity is O(kmax |E T | + |VT |) since kmax |E T | + |VT | kmax kmax2 −3 for a large graph in practice. 

7.3.3 Patterns from a TOR Graph

Generation of candidates from a TOR graph consists of two phases: star pattern extraction and small pattern extraction. The former extracts star and asterism patterns. Subsequently, the edges involved in these patterns are removed from G_O, resulting in a further decomposition of the TOR graph. The resultant graph is referred to as the remainder graph (G_R). Then, the second phase extracts paths, cycles, and small connected subgraphs from G_R.

Extraction of star and asterism patterns. The frequencies of these patterns can be derived directly from their definitions (Sect. 7.2.2). Specifically, freq(S_k) = |{v | v ∈ V_O, deg(v) = k}| and freq(A_S) = freq({e_m1, ..., e_m(n−1)}), where e_mi = (r_i, r_{i+1}) ∈ E_O, k_i, k_{i+1} ≥ ℓ, deg(r_i) = k_i, and deg(r_{i+1}) = k_{i+1}. Algorithm 7.9 outlines the procedure. The star and asterism patterns are extracted in Lines 2 to 22 and Lines 6 to 20, respectively. Briefly, asterism patterns are found using breadth-first search (BFS). A vector of vertices is used to keep track of the star centers in an asterism pattern. We "grow" the pattern by adding a neighboring vertex z of the current star center being considered only if deg(z) ≥ ℓ and the size of the grown pattern is at most η_max.

Lemma 7.7 The worst-case time and space complexities of star and asterism pattern extraction are O(|V_O|^2) and O(|E_O| + |V_O|), respectively.

Proof (Sketch) In the worst case, finding the star and asterism patterns requires performing a BFS from each vertex in V_O. In the worst case, the graph is strongly connected and every other vertex in V_O is visited during the BFS. Hence, the worst-case time complexity is O(|V_O|^2). Algorithm 7.9 requires O(|V_O| + |E_O|) space for storing G_O. In the worst case, there are deg_max − ℓ + 1 and ((deg_max − ℓ + 1)/2)(1 + (deg_max − ℓ + 1)) possible S_k and A_S, respectively. Since deg_max occurs when every node v ∈ V_O is connected to every other node in V_O, deg_max has worst-case complexity O(|V_O|). Hence, the storage of the S_k and A_S requires O(|V_O|) and O(|V_O|^2), respectively, and Algorithm 7.9 requires O(|V_O| + |E_O|) space in the worst case. □
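The star and two-star asterism frequencies above can be read off the TOR degree distribution, as in this minimal sketch (illustrative only; the BFS-based growth of larger asterisms in Algorithm 7.9 is omitted, and only asterisms of two adjacent star centers are counted).

```python
from collections import Counter, defaultdict

def star_frequencies(edges, l=5):
    # freq(S_k) = |{v : deg(v) = k}| for k >= l (minimum star size);
    # a two-star asterism is counted for every edge joining two such centres.
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    star_freq = Counter(k for k in deg.values() if k >= l)
    asterism_freq = Counter()
    for u, v in edges:
        if deg[u] >= l and deg[v] >= l:
            asterism_freq[tuple(sorted((deg[u], deg[v])))] += 1
    return star_freq, asterism_freq

if __name__ == "__main__":
    # two adjacent star centres of degree 5 ("a") and 6 ("b")
    edges = [("a", x) for x in "cdef"] + [("a", "b")] + [("b", y) for y in "hijkl"]
    print(star_frequencies(edges, l=5))
    # (Counter({5: 1, 6: 1}), Counter({(5, 6): 1}))
```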


Algorithm 7.9 GenStarPatterns.

Require: TOR graph G_O = (V_O, E_O)
Ensure: Stars and asterisms P_s and frequency freq(P_s)
1: P_s ← φ
2: for v ∈ V_O do
3:   if deg(v) ≥ ℓ then
4:     P_s ← P_s ∪ {S_deg(v)}
5:     freq(S_deg(v)) ← freq(S_deg(v)) + 1
6:     Q ← φ /* Q is a queue */
7:     SC ← InsertLast(SC, v) /* v is appended to SC, a vector of nodes */
8:     Q ← Enqueue(Q, SC)
9:     while Q ≠ φ do
10:      SC_curr ← Dequeue(Q)
11:      u ← GetLast(SC_curr) /* retrieve the last element in SC_curr */
12:      for z ∈ Neighbours(u) do
13:        if z ∉ SC_curr and deg(z) ≥ ℓ and Size(SC_curr) + Size(S_deg(z)) − 1 ≤ η_max then
14:          SC_curr ← InsertLast(SC_curr, z)
15:          P_s ← P_s ∪ {A_SC_curr}
16:          freq(A_SC_curr) ← freq(A_SC_curr) + 1
17:          Q ← Enqueue(Q, SC_curr)
18:        end if
19:      end for
20:    end while
21:  end if
22: end for

Extraction of small patterns. The remainder graph G_R is primarily composed of small connected components such as paths, cycles, and subgraphs with unique topologies. Algorithm 7.10 outlines the extraction of these small patterns, where we denote a k-cycle as Y_k and a subgraph with unique topology as U. We refer to small subgraph patterns as connected components in G_R that are neither k-paths nor k-cycles. Recall that the 1-path, 2-path, 3-cycle, and 4-cycle are basic building blocks of real-world networks (Milo et al. 2002). In TATTOO, we consider them as default patterns, and they are not part of the candidate canned pattern set. Hence, we extract all k-paths for k > 2 (Lines 19–21) and k-cycles for k > 4 (Lines 22–24) along with their frequencies. After that, small connected subgraphs and their corresponding frequencies are extracted.

Lemma 7.8 The worst-case time and space complexities of finding small patterns are O(η_max |V_R| η_max!) and O(|E_R| + |V_R|), respectively.

Proof (Sketch) In Algorithm 7.10, the worst-case time complexity is due to the graph isomorphism check (O(η_max! η_max); Cordella et al. 2004) on Line 26, which is within a for-loop with a maximum of |V_R| iterations. Hence, the worst-case time complexity is O(η_max |V_R| η_max!). Algorithm 7.10 requires O(|V_R| + |E_R|) space for storing G_R. Since every k-path (P_k), k-cycle (Y_k), and subgraph with unique topology (U) consists of multiple nodes, the number of possible P_k, Y_k, and U is less than |V_R|, and the storage required is O(|V_R|). Hence, the worst-case space complexity is O(|V_R| + |E_R|). □


Algorithm 7.10 GenSmallPatterns.

Require: Remainder graph G_R = (V_R, E_R), plug b = (η_min, η_max, γ)
Ensure: Small patterns P_r = P ∪ Y ∪ U and frequency freq(P_r), where P = {P_k | k ≥ 3} and Y = {Y_k | k ≥ 5}
1: P_r ← φ, P ← φ, Y ← φ, U ← φ
2: U_maxID ← 0 /* maximum ID of U */
3: for v ∈ V_R do
4:   set v as unvisited
5: end for
6: for v ∈ V_R do
7:   if v is not visited then
8:     find the component C = (V_C, E_C) containing v
9:     n_deg1 ← 0 /* num of nodes with deg = 1 */
10:    n_deg2 ← 0 /* num of nodes with deg = 2 */
11:    for u ∈ V_C do
12:      if deg(u) = 1 then
13:        n_deg1 ← n_deg1 + 1
14:      else if deg(u) = 2 then
15:        n_deg2 ← n_deg2 + 1
16:      end if
17:      set u as visited
18:    end for
19:    if n_deg1 = 2 and n_deg2 = |V_C| − 2 and |V_C| ≠ 2 and |V_C| ≠ 3 then
20:      P ← P ∪ {P_{|V_C|−1}}
21:      freq(P_{|V_C|−1}) ← freq(P_{|V_C|−1}) + 1
22:    else if n_deg1 = 0 and n_deg2 = |V_C| and |V_C| ≠ 3 and |V_C| ≠ 4 then
23:      Y ← Y ∪ {Y_{|V_C|}}
24:      freq(Y_{|V_C|}) ← freq(Y_{|V_C|}) + 1
25:    else if |E_C| ≥ b.η_min and |E_C| ≤ b.η_max then
26:      if IsIsomorphic(C, U) = true then
27:        U_currID ← GetID(C, U)
28:        freq(U_currID) ← freq(U_currID) + 1
29:      else
30:        U ← U ∪ {(C, U_maxID)}
31:        freq(U_maxID) ← 1
32:        U_maxID ← U_maxID + 1
33:      end if
34:    end if
35:  end if
36: end for
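A runnable sketch (illustrative) of the per-component degree tests used in Algorithm 7.10; the exclusion of default sizes (1-path, 2-path, 3-cycle, 4-cycle) and the isomorphism-based grouping of unique patterns are omitted for brevity.

```python
from collections import Counter

def classify_component(vertices, edges):
    # A component is a (|V|-1)-path if exactly two vertices have degree 1 and
    # the rest degree 2, a |V|-cycle if all vertices have degree 2, and a
    # "unique small pattern" otherwise.
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    n = len(vertices)
    n_deg1 = sum(1 for v in vertices if deg[v] == 1)
    n_deg2 = sum(1 for v in vertices if deg[v] == 2)
    if n_deg1 == 2 and n_deg2 == n - 2:
        return f"{n - 1}-path"
    if n_deg1 == 0 and n_deg2 == n:
        return f"{n}-cycle"
    return "unique small pattern"

if __name__ == "__main__":
    print(classify_component(list("abcd"),
                             [("a", "b"), ("b", "c"), ("c", "d")]))          # 3-path
    print(classify_component(list("abcde"),
                             [("a", "b"), ("b", "c"), ("c", "d"),
                              ("d", "e"), ("e", "a")]))                      # 5-cycle
```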

Remark Exponential time complexity of the small pattern extraction phase is due to the isomorphism check. The time cost is small in practice due to the small size of candidate patterns, and their number is typically small in G R .

7.4 Selection of Canned Patterns

In this section, we describe the algorithm to select canned pattern set P from the generated candidate patterns. We begin by presenting the theoretical underpinning that influences the design of our algorithm.

7.4.1 Theoretical Analysis

Due to the hardness of the CPS problem, we design an approximation algorithm to address it. We draw on insights from a related problem, the team formation problem (TFP) (Bhowmik et al. 2014; Chen and Lin 2004), which aims to hire a team of individuals T from a group of experts S for a specific project where T ⊆ S. Bhowmik et al. (2014) proposed that several aspects should be considered in TFP, namely skill coverage (skill), social compatibility (social), teaming cost (team), and miscellaneous aspects such as redundant skills avoidance (red) and inclusion of selected experts (exp). The formulation of TFP is given as s(T′) = α_skill f_skill(T′) − α_social f_social(T′) − α_team f_team(T′) − α_red f_red(T′) + α_exp f_exp(T′), where α_skill, α_social, α_team, α_red, and α_exp are non-negative coefficients that represent the relative importance of each aspect of team formation (Bhowmik et al. 2014). The goal is to find a team T′ ⊆ S for which the non-negative and non-monotone function s(T′) is maximized. According to Bhowmik et al. (2014), this formulation can be posed as an unconstrained submodular function maximization problem, which is NP-hard for arbitrary submodular functions. Selecting a set of canned patterns in CPS is akin to hiring a team of individuals in TFP, where f_skill, f_red, and f_team correspond to f_cov, f_sim, and f_cog, respectively. Hence, CPS can be formulated in the form s(P′) = α_fcov f_cov(P′) − α_fsim f_sim(P′) − α_fcog f_cog(P′) (Definition 7.9), where P′ is the set of candidate patterns that yields an optimized s(P′).

Definition 7.9 (Pattern Set Score) Given a pattern set P′, the score of P′ is s(P′) = (1 / (3|P′|)) (f_cov(P′) − f_sim(P′) − f_cog(P′) + 2|P′|), where f_cov, f_sim, and f_cog are the coverage, similarity, and cognitive load of P′, respectively.

Definition 7.10 (Good Candidate Pattern) Given a pattern set P′ and two candidate patterns p_1 and p_2, p_1 is considered a good candidate pattern if s(P′ ∪ {p_1}) > s(P′ ∪ {p_2}) and is added to P′ instead of p_2.

Note that Definition 7.10 can be utilized for determining the inclusion of a candidate pattern in P. Next, we analyze the properties of f_cov, f_sim, f_cog, and the pattern score.

Lemma 7.11 The coverage of a pattern set P, f_cov(P), is submodular.

Proof Submodular functions satisfy the property of diminishing marginal returns. That is, given a set of n elements N, a function f(.) is submodular if for every A ⊆ B ⊆ N and j ∉ B, f(A ∪ {j}) − f(A) ≥ f(B ∪ {j}) − f(B). Given a graph G and canned pattern sets P_A and P_B where P_A ⊆ P_B, let the coverage of P_A and P_B be f_cov(P_A) and f_cov(P_B), respectively. Observe that P_B consists of P_A and additional patterns (i.e., P′ = P_B \ P_A).


For each canned pattern p ∈ P′, we let s = min(|f_cov(p)|, |f_cov(P_A)|) and let K denote the overlapping set f_cov(p) ∩ f_cov(P_A). The coverage of p falls under one of four possible scenarios, namely (1) K = f_cov(p) if s = |f_cov(p)|, (2) K = f_cov(P_A) if s = |f_cov(P_A)|, (3) K is an empty set, and (4) otherwise (i.e., 0 < |f_cov(p) ∩ f_cov(P_A)| < s). In the case where the coverage of every p falls under scenario 1, then f_cov(P_A) = f_cov(P_B). Should any p fall under scenario 2, 3, or 4, then f_cov(P_A) ⊂ f_cov(P_B). Hence, f_cov(P_A) ⊆ f_cov(P_B). Consider a canned pattern p′ ∉ P_B; let t = min(|f_cov(p′)|, |f_cov(P_A)|). Suppose f_cov(p′) ∩ f_cov(P_A) = f_cov(p′) where |f_cov(p′)| < |f_cov(P_A)| (Scenario 1); then f_cov(P_A ∪ {p′}) − f_cov(P_A) is an empty set. Note that we use the minus and set-minus operators interchangeably in this proof. Since f_cov(P_A) ⊆ f_cov(P_B), f_cov(P_B ∪ {p′}) = f_cov(P_B). Hence, f_cov(P_A ∪ {p′}) − f_cov(P_A) = f_cov(P_B ∪ {p′}) − f_cov(P_B).
Now, consider f_cov(p′) ∩ f_cov(P_A) = f_cov(P_A) where |f_cov(p′)| > |f_cov(P_A)| (Scenario 2). f_cov(P_A ∪ {p′}) − f_cov(P_A) = f_cov(p′) − f_cov(P_A) where f_cov(P_A) ⊂ f_cov(p′). Let L and M be f_cov(p′) \ f_cov(P_A) and f_cov(P_B) \ f_cov(P_A), respectively. Observe that, similar to the previous observation, it is possible for (1) L to be fully contained in M if |L| < |M|, (2) M to be fully contained in L if |M| < |L|, (3) L ∩ M to be empty, or (4) otherwise (i.e., 0 < |L ∩ M| < t where t = min(|L|, |M|)). Hence, |L ∩ M| ∈ [0, t]. When |L ∩ M| = 0, f_cov(P_A ∪ {p′}) − f_cov(P_A) = f_cov(P_B ∪ {p′}) − f_cov(P_B). Otherwise, there are some common graphs covered by L and M, resulting in f_cov(P_B ∪ {p′}) − f_cov(P_B) = L \ (L ∩ M). Hence, |f_cov(P_A ∪ {p′}) − f_cov(P_A)| > |f_cov(P_B ∪ {p′}) − f_cov(P_B)|. Taken together, for Scenario 2, |f_cov(P_A ∪ {p′}) − f_cov(P_A)| ≥ |f_cov(P_B ∪ {p′}) − f_cov(P_B)|.
Scenario 3 is similar to Scenario 2 where L is f_cov(p′) instead of f_cov(p′) \ f_cov(P_A): f_cov(P_A ∪ {p′}) − f_cov(P_A) = L and f_cov(P_B ∪ {p′}) − f_cov(P_B) = L \ (L ∩ M). Since |L ∩ M| ∈ [0, t], |f_cov(P_A ∪ {p′}) − f_cov(P_A)| ≥ |f_cov(P_B ∪ {p′}) − f_cov(P_B)|. Scenario 4 is the same as Scenario 3 except that L = f_cov(p′) \ (f_cov(P_A) ∩ f_cov(p′)). Observe that |f_cov(P_A ∪ {p′}) − f_cov(P_A)| ≥ |f_cov(P_B ∪ {p′}) − f_cov(P_B)| due to |L ∩ M| ∈ [0, t].
Hence, in all cases, |f_cov(P_A ∪ {p′}) − f_cov(P_A)| ≥ |f_cov(P_B ∪ {p′}) − f_cov(P_B)| applies and f_cov(.) is submodular. □

Lemma 7.12 The similarity (resp. cognitive load) of a pattern set P, f_sim(P) (resp. f_cog(P)), is supermodular.

Proof We begin by stating the first-order difference. Given a submodular function f(.), for every P_A ⊆ P_B ⊆ D and every p ⊂ D such that p ∉ P_A, P_B, the first-order difference states that f(P_A ∪ {p}) − f(P_A) ≥ f(P_B ∪ {p}) − f(P_B).


Given a graph G, a canned pattern p ∉ P_B, and canned pattern sets P_A and P_B where P_A ⊆ P_B, let the similarity of P_A and P_B be f_sim(P_A) and f_sim(P_B), respectively. Then f_sim(P_B ∪ {p}) − f_sim(P_B) = Σ_{p_i ∈ P_B} sim(p, p_i) and f_sim(P_A ∪ {p}) − f_sim(P_A) = Σ_{p_i ∈ P_A} sim(p, p_i). Since sim(p_i, p_j) ≥ 0 ∀p_i, p_j ⊂ G, P_A ⊆ P_B, and by the definition of the first-order difference, f_sim(.) is supermodular. The proof is similar for f_cog(.). □

Theorem 7.13 The pattern set score s(P′) in Definition 7.9 is a non-negative and non-monotone submodular function.

Proof (Sketch) Consider a partial pattern set P′ and a candidate pattern p. Suppose p does not improve the set coverage of P′ and adds a high cost in terms of cognitive load and diversity. Then, s(P′) > s(P′ ∪ {p}). Hence, the score function s(.) is non-monotone. Since f_cov(P′), f_sim(P′), f_cog(P′) ∈ [0, |P′|], f_cov(P′) − f_sim(P′) − f_cog(P′) is in the range [−2|P′|, |P′|]. Hence, (1 / (3|P′|))(f_cov(P′) − f_sim(P′) − f_cog(P′) + 2|P′|) (Definition 7.9) is in the range [0, 1] and is non-negative. Since supermodular functions are negations of submodular functions and a non-negative weighted sum of submodular functions preserves submodularity (Fujishige 2005), s(P′) is submodular. Note that adding a constant (i.e., 2/3) does not change the submodular property (Bhowmik et al. 2014) and ensures that s(P′) is non-negative. The scaling factor α_fcov = α_fsim = α_fcog = 1/(3|P′|) further bounds s(P′) within the range [0, 1]. □

Similar to s(T′) in TFP, s(P′) in CPS is non-negative and non-monotone. However, unlike TFP, CPS imposes a cardinality constraint where |P| is at most γ. Thus, CPS can instead be posed as the maximization of a submodular function subject to a cardinality constraint (Buchbinder et al. 2014).

7.4.2 Quantifying Coverage and Similarity

Next, we quantify the coverage and similarity measures used in the pattern score s(P′). Note that we have already quantified the cognitive load of a pattern in large networks in Chap. 5.

Coverage. Recall from Chap. 5 that we can compute the coverage of a pattern p as cov_p = |∪_{i ∈ [1, |S(p)|]} E_i|. Since the edge sets of G_T = (V_T, E_T) and G_O = (V_O, E_O) are mutually exclusive, we further modify cov_p to include a weight factor that accounts for the effects exerted by the sizes of G_T and G_O. Specifically, cov_p = |∪_{i ∈ [1, |S(p)|]} E_i| × |G_x| / |E|, where G_x ∈ {G_T, G_O} for patterns obtained from G_x. However, the exact computation of coverage for each candidate pattern is prohibitively expensive. Hence, we approximate cov_p as follows: covub(p) = |E_p| × freq(p) × |G_x| / |E|. Observe that covub(p) is in fact the upper bound of cov_p when no isomorphic instances of p in G overlap. Any superior upper bound that can be computed efficiently can be incorporated. Unlike cov_p, the computation of covub(p) requires only freq(p), which is significantly more efficient.


The order of pattern extraction in G_O (e.g., extracting stars and asterisms before small patterns) may affect the frequency of the extracted patterns. Hence, a normalization of covub is performed for each class of patterns (k-CP, CCP, star, asterism, and small pattern) as follows:

    covub′(p) = (covub(p) − Min(covub(P_t)) + 1) / (Max(covub(P_t)) − Min(covub(P_t)) + 1)        (7.2)

where t ∈ {k-CP, CCP, star, asterism, small} represents a class of pattern. Specifically, we compute k-CPs and CCPs in G_T. Stars, asterisms, and small patterns are computed in G_O. The normalized covub is in [0, 1].

Similarity. Given a partial pattern set P′ and two candidate patterns p_1 and p_2, TATTOO selects p_1 preferentially to add to P′ if max_{p∈P′} sim(p_1, p) < max_{p∈P′} sim(p_2, p). To this end, we utilize NetSimile, a size-independent graph similarity approach based on the distance between feature vectors (Berlingerio et al. 2013). It is scalable, with runtime complexity linear in the number of edges. Briefly, for every node, NetSimile extracts seven features, namely the degree, clustering coefficient, average degree of neighbors, average clustering coefficient of neighbors, number of edges in the ego-network, number of outgoing edges of the ego-network, and number of neighbors of the ego-network. Aggregator functions (i.e., median, mean, standard deviation, skewness, and kurtosis) are then applied to each local feature to generate the "signature" vector for a graph. The similarity score of two given graphs is the normalized Canberra distance (in the range [0, 1]) between their "signature" vectors. Nevertheless, any superior and efficient network similarity technique can be adopted.
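A minimal sketch (illustrative; the pattern names and numbers are made up) of the coverage upper bound and the per-class normalization of Eq. (7.2).

```python
def covub(num_edges, freq, region_edges, total_edges):
    # covub(p) = |E_p| * freq(p) * |G_x| / |E|, where G_x is the region
    # (TIR or TOR) the pattern was extracted from.
    return num_edges * freq * region_edges / total_edges

def normalise_class(raw):
    # raw: {pattern_id: covub value} for one pattern class
    # (k-CP, CCP, star, asterism, or small pattern), cf. Eq. (7.2).
    lo, hi = min(raw.values()), max(raw.values())
    return {p: (c - lo + 1) / (hi - lo + 1) for p, c in raw.items()}

if __name__ == "__main__":
    raw = {"C4": covub(5, 120, 800, 1000),   # 4-CP: 5 edges, freq 120
           "C5": covub(7, 40, 800, 1000)}    # 5-CP: 7 edges, freq 40
    print(normalise_class(raw))  # values in (0, 1]; C4 normalised to 1.0
```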

7.4.3 CPS-Randomized Greedy Algorithm

The canned pattern selection algorithm is as follows. First, it retrieves the default pattern set (1-path, 2-path, 3-cycle, and 4-cycle). Next, it prunes candidate patterns whose sizes do not satisfy the plug specification or that are "nearly unique" (i.e., freq(p) < δ where δ is a pre-defined threshold). Note that the latter patterns have very low occurrences in G and are unlikely to be as useful for query construction in their entirety.⁴ Then, it selects P from the remaining candidates. Recall from Sect. 7.4.1 that the CPS problem can be cast as the maximization of a submodular function subject to a cardinality constraint. Recently, the algorithm community has proposed a technique with a quality guarantee to address this problem (Buchbinder et al. 2014). We exploit this approach, referred to as CPS-Randomized Greedy (CPS-R-Greedy, Algorithm 7.11), in our CPS problem.

4 In case a user is interested in patterns with low coverage, δ can be set to 0 along with a reduction in α_fcov f_cov(P′) in s(P′) (Definition 7.9).


In particular, CPS-R-Greedy extends the discrete greedy algorithm (Nemhauser et al. 1978) using a randomized approach. At every step, a random candidate pattern is chosen from a set of "reasonably good" candidates (Lines 19–22). Intuitively, these candidates should have very few edge crossings, good coverage, and be different from the patterns already in P. These candidates are identified as follows. For every candidate pattern p, we compute the pattern set score (Definition 7.9) assuming p is added to the canned pattern set. A "good" candidate p improves on the score of the set when it is added (Definition 7.10). Note that covub, cog, and sim change as P changes. Hence, we recompute them at every iteration. Then, we randomly select a "good" candidate and assign it to P. The algorithm terminates either when the set contains the desired number of patterns or when there are no more good candidates. The following quality guarantee can be derived from Buchbinder et al. (2014).

Theorem 7.14 CPS-R-Greedy achieves a 1/e-approximation of CPS.

Proof (Buchbinder et al. 2014) Let A_i be an event fixing all the random decisions of the greedy algorithm for every iteration i, and let 𝒜_i be the set of all possible A_i events. We denote s(P_{i−1} ∪ {p_i}) − s(P_{i−1}) as s_{p_i}(P_{i−1}). Further, let the desired size of P be γ, 1 ≤ i ≤ γ, and A_{i−1} ∈ 𝒜_{i−1}. Unless otherwise stated, all the probabilities, expectations, and random quantities are implicitly conditioned on A_{i−1}. Consider a set M_i containing the patterns of OPT \ P_{i−1} plus enough dummy patterns to make the size of M_i exactly γ. Note that E[s_{p_i}(P_{i−1})] = γ^{−1} · Σ_{p ∈ M_i} s_p(P_{i−1}) ≥ γ^{−1} · Σ_{p ∈ OPT\P_{i−1}} s_p(P_{i−1}) ≥ (s(OPT ∪ P_{i−1}) − s(P_{i−1})) / γ (Buchbinder et al. 2014), where the first inequality follows from the definition of M_i (i.e., the set of "good" candidate patterns) and the second from the submodularity of s(.). Unfixing the event A_{i−1} and taking an expectation over all possible such events, E[s_{p_i}(P_{i−1})] ≥ (E[s(OPT ∪ P_{i−1})] − E[s(P_{i−1})]) / γ ≥ ((1 − 1/γ)^{i−1} · s(OPT) − E[s(P_{i−1})]) / γ, where the second inequality is due to the observation that for every 0 ≤ i ≤ γ, E[s(OPT ∪ P_i)] ≥ (1 − 1/γ)^i · s(OPT) (Buchbinder et al. 2014).
We now prove by induction that E[s(P_i)] ≥ (i/γ) · (1 − 1/γ)^{i−1} · s(OPT). Note that this is true for i = 0 since s(P_0) ≥ 0 = (0/γ) · (1 − 1/γ)^{−1} · s(OPT). Further, we assume that the claim holds for every i′ < i. Now, we prove it for i > 0. E[s(P_i)] = E[s(P_{i−1})] + E[s_{p_i}(P_{i−1})] ≥ E[s(P_{i−1})] + ((1 − 1/γ)^{i−1} · s(OPT) − E[s(P_{i−1})]) / γ = (1 − 1/γ) · E[s(P_{i−1})] + γ^{−1} (1 − 1/γ)^{i−1} · s(OPT) ≥ (1 − 1/γ) · [((i−1)/γ) · (1 − 1/γ)^{i−2} · s(OPT)] + γ^{−1} (1 − 1/γ)^{i−1} · s(OPT) = (i/γ) · (1 − 1/γ)^{i−1} · s(OPT). Hence, E[s(P_γ)] ≥ (γ/γ) · (1 − 1/γ)^{γ−1} · s(OPT) ≥ e^{−1} · s(OPT). That is, Algorithm 7.11 achieves a 1/e-approximation of CPS. □

Theorem 7.15 CPS-R-Greedy has worst-case time and space complexities of O(|P_cand| γ |V_max| |V_max|!) and O(|P_cand| (|V_max| + |E_max|)), respectively, where |V_max| and |E_max| are the number of vertices and edges in the largest candidate pattern.


Proof (Sketch) Let G_max = (V_max, E_max) be the largest candidate pattern in P_all. In the worst case, the time complexity of Algorithm 7.11 is O(|P_all| γ |V_max|! |V_max|) since there are |P_all| candidate patterns and the while-loop in Algorithm 7.11 iterates at most γ times. For each iteration, the score function requires the computation of coverage, cognitive load, and redundancy, which require O(|V_max|! |V_max|), O(|V_max| + |E_max|), and O(|V_max| + |V_max| log(|V_max|)) (Berlingerio et al. 2013), respectively. Note that |V_max| log(|V_max|) ≈ |E_max| in real-world graphs (Berlingerio et al. 2013). The space complexity is due to the storage of all candidate patterns. Hence, Algorithm 7.11 has a space complexity of O(|P_all| (|V_max| + |E_max|)). □

Example 7.16 Consider a VQI I and a plug b = (3, 11, 6). Suppose there are four default patterns and five candidate patterns (i.e., P_cand) as depicted in Fig. 7.8. Let δ = 10. The algorithm first removes p_4 since freq(p_4) < δ. Then, each of the remaining patterns in P_cand is considered in turn to be added to P by exploiting the CPS-R-Greedy technique. It first considers adding p_1 to P and computes the resulting coverage (f_covub(P ∪ {p_1})), cognitive load (f_cog(P ∪ {p_1})), and similarity (f_sim(P ∪ {p_1})). The pattern set score of P ∪ {p_1} is then computed using Definition 7.9. The scores of the other candidate patterns are computed similarly. Suppose the scores are 0.72, 0.63, 0.54, and 0.68 for p_1, p_2, p_3, and p_5, respectively. Then, in the first iteration, p_1 is selected (and removed from subsequent iterations), and the current best score s_best is updated to 0.72. In the next (i.e., final) iteration, the candidates are again considered in turn to be added to P and the corresponding pattern set scores are computed. However, unlike the first iteration, only those candidates whose scores are greater than s_best are considered. Let the scores of p_2, p_3, and p_5 be 0.81, 0.7, and 0.77, respectively. Then, a candidate will be randomly selected from p_2 or p_5. Suppose p_2 is chosen; then the final pattern set is {d_1, d_2, d_3, d_4, p_1, p_2}.

Fig. 7.8 Default patterns and candidate patterns
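The following minimal Python sketch (not the book's C++ implementation) mirrors the randomized greedy loop of Algorithm 7.11 and Example 7.16. The callbacks f_cov, f_sim, and f_cog stand in for the coverage, similarity, and cognitive-load measures of Chap. 5 and Sect. 7.4.2, and, unlike Algorithm 7.11, this simplified version also randomizes the very first pick among the good candidates.

```python
import random

def pattern_set_score(P, f_cov, f_sim, f_cog):
    # Definition 7.9: s(P) = (f_cov - f_sim - f_cog + 2|P|) / (3|P|)
    return (f_cov(P) - f_sim(P) - f_cog(P) + 2 * len(P)) / (3 * len(P))

def cps_r_greedy(candidates, gamma, f_cov, f_sim, f_cog, seed=0):
    rng = random.Random(seed)
    P, remaining, s_best = [], list(candidates), 0.0
    while len(P) < gamma and remaining:
        scored = [(pattern_set_score(P + [p], f_cov, f_sim, f_cog), p)
                  for p in remaining]
        # "good" candidates improve on the current best score (Definition 7.10)
        good = [(s, p) for s, p in scored if s > s_best]
        if not good:
            break                                  # no more good candidates
        s_best, p_best = rng.choice(good)          # random pick among good ones
        P.append(p_best)
        remaining.remove(p_best)
    return P
```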


Algorithm 7.11 CPS-R-Greedy.

Require: Candidate pattern set P_all and its frequency freq(P_all), plug b = (η_min, η_max, γ)
Ensure: Canned pattern set P
1: s_best ← 0
2: while γ > 0 do
3:   p_best ← φ
4:   s_map ← φ
5:   C ← φ /* list of good candidates */
6:   for p ∈ P_all do
7:     f_covub(p ∪ P) ← GetCoverage(p, P, freq(P ∪ {p}))
8:     f_cog(p ∪ P) ← GetCognitiveLoad(p, P)
9:     f_sim(p ∪ P) ← GetSimilarity(p, P)
10:    s ← (1/3)(f_covub(p ∪ P) − f_sim(p ∪ P) − f_cog(p ∪ P) + 2)
11:    if s > s_best and |P| = 0 then
12:      s_best ← s
13:      p_best ← p
14:    else if s > s_best then
15:      s_map ← UpdateScore(s_map, s, p)
16:      C ← C ∪ {p}
17:    end if
18:  end for
19:  if |C| > 0 then
20:    p_best ← RandomChoose(C)
21:    s_best ← GetScore(s_map, p_best)
22:  end if
23:  if p_best ≠ φ then
24:    P ← P ∪ {p_best}
25:    P_all ← P_all \ {p_best}
26:    γ ← γ − 1
27:  else
28:    break
29:  end if
30: end while

7.5 Performance Study

TATTOO is implemented in C++ with GCC 4.2.1 compiler. We now report the key performance results of TATTOO. All experiments are performed on a 64-bit Windows 10 desktop with Intel(R) Core(TM) i7-4770K CPU (3.50 GHz) and 16 GB RAM.

7.5.1 Experimental Setup

Datasets. We evaluate TATTOO's performance using 10 large networks (Table 7.3) from SNAP⁵ containing up to 34.7 million edges.

Algorithms. We compare TATTOO with the following baselines: (a) CATAPULT: We assign the same label to all nodes of a network and partition it into a collection of small- or medium-sized data graphs using METIS (Karypis and Kumar 1997). Then the algorithm discussed in the preceding chapter is used to select canned patterns. (b) Use graphlets, frequent subgraphs, random patterns, default patterns, and edge-at-a-time (i.e., pattern oblivious): x-node graphlets where x ∈ [2, 5] are generated using the approach in Chen et al. (2016). Random patterns are generated by randomly selecting subgraphs of specific sizes from a network. The number of candidates per size follows a distribution. Frequent subgraphs are generated using Peregrine (Jamshidi et al. 2020) (downloaded from https://github.com/pdclab/peregrine). These subgraphs are considered as candidates from which the canned patterns are selected using our algorithm in Sect. 7.4.3.

Query sets and VQI. We use different query sets for the user study and the automated performance study. We shall elaborate on them in the respective sections. The VQI used for the user study is depicted in Fig. 7.23.

Parameter settings. Unless specified otherwise, we set η_min = 3, η_max = 15, γ = 30, δ = 3, and ℓ = 5.

Performance measures. We measure the performance of TATTOO using the following:
• Runtime: Execution time of TATTOO.
• Memory requirement (MR): Peak memory usage when executing TATTOO.
• Reduction ratio (denoted as μ): Given a subgraph query Q, μ = (step_total − step_P) / step_total, where step_P is the minimum number of steps required to construct Q when P is used and step_total is the total number of steps needed when the edge-at-a-time approach is used. Note that the number of steps excludes vertex label assignments, which is a constant for a given Q regardless of the approach. For simplicity, in the automated performance study we follow the same assumptions described in Sect. 6.7.1: (1) a canned pattern p ∈ P can be used in Q iff p ⊆ Q; (2) when multiple patterns are used to construct Q, their corresponding isomorphic subgraphs in Q do not overlap. In the user study, we jettison these assumptions by allowing users to modify the canned patterns, and no restrictions are imposed (i.e., step_P does not need to be minimum). Smaller values of div imply better pattern diversity. For ease of comparison, the diversity plots are based on the inverse of div.

5 http://snap.stanford.edu/data/index.html.

7.5.2 User Study

We undertake a user study to demonstrate the benefits of using our framework from a user's perspective. Twenty-seven unpaid volunteers (aged 20 to 35), students and researchers from different majors, took part in the user study. None of them had used our VQI prior to the study. First, we presented a 10-min scripted tutorial of our VQI describing how to visually formulate queries. Then, we allowed the subjects to play with the tool for 15 min.


For each dataset, 5 subgraph queries with sizes in the range [10, 28] are selected. These queries mimic the topology of real-world queries containing the various structures described in Sect. 7.2.2. To describe the queries to the participants, we provided printed visual subgraph queries. A subject then draws the given query using a mouse in our VQI. The users are asked to make maximum use of the patterns to this end. Each query was formulated 5 times by different participants. We ensure the same query set is constructed in a random order (the order of the queries and the approaches are randomized) to counterbalance learning effects. The canned patterns on the VQI are grouped by size and displayed using the ForceAtlas2 layout (Jacomy et al. 2014) on different pages according to their sizes. This multi-page-based organization yields faster average query formulation time and fewer steps compared to other alternatives (Yuan et al. 2021).

Learning effect. Since the same query set is used repeatedly for each approach, there may be a learning effect where volunteers start to commit the query set to memory if the study is conducted in a fixed order. In particular, approaches that are tested later in the study may gain an unfair advantage over earlier approaches. We investigate this further with an experiment. Ten participants (Uf) were asked to construct 5 queries on the Amazon and YouTube datasets in a fixed order, while another ten participants (Ur, where Uf ∩ Ur = ∅) were asked to construct the same query set in a random order (the order of the queries and the approaches are randomized) to minimize learning effects. Figure 7.9 reports the average time taken for query formulation. Interestingly, formulation time is generally faster using TATTOO's canned patterns compared to graphlets regardless of the order of the queries and approaches. We further examined the difference in average formulation time (i.e., tGraphlet − tTATTOO) across queries and datasets for these two orders. In particular, the average time difference for Amazon (resp. YouTube) is 5.5 s (resp. 5.5 s) for the fixed order and 5.9 s (resp. 3.7 s) for the random order. This is possibly due to the learning effect. Hence, in subsequent experiments we follow the randomized order to minimize the learning effect.

Fig. 7.9 Learning effect on query formulation time

Visual mapping time. In order to use canned patterns for query formulation, a user needs to browse the pattern set and visually map patterns to her query. We refer to this as visual mapping time (VMT). For each pattern used, we record the pattern mapping time (PMT) as the duration from the time the mouse cursor enters the Pattern Panel to the time the user selects the pattern and drags it to the Query Canvas. The VMT of a query is its average PMT. Intuitively, a longer VMT implies a greater cognitive load on a user. Figure 7.10 shows the VMT of TATTOO patterns, graphlets, frequent subgraphs, and random patterns on the AM and YT datasets. On average, TATTOO patterns consume the least VMT. We also investigate the effect of various VQI canned pattern layout options (i.e., single page (SP); group by size (GS); 4 per page (4P); 8 per page (8P); 16 per page (16P); sort by cognitive load (SL); sort by diversity (SD); and sort by coverage (SC)) on VMT for the AM and YT datasets. Note that SP arranges the patterns in the order in which they are identified. The plug was set to b = (4, 15, 30, ⌈30/12⌉). Figure 7.11 shows that the average VMT for layout options with multiple pages tends to be shorter (up to 33.1%) than for those with a single page (i.e., SP, SL, SD, and SC). When the patterns are organized in pages, an increased number of pages reduces the need for a user to scroll and browse the patterns on a particular page but increases the need to toggle between pages to identify useful patterns. Compared to the single-page options, the multi-page options (GS, 4P, 8P, 16P) achieve superior performance primarily due to the former effect. Hence, the multi-page-based organization is used for our user study.

Fig. 7.10 Visual mapping time of canned patterns

Fig. 7.11 Effect of canned pattern layout on VMT

Query formulation time (QFT) and number of steps. Figures 7.12 and 7.13 plot the average QFT and the average number of steps taken, respectively, for AM and YT. Note that the QFT includes the VMT, and the steps include the addition/deletion of nodes and edges and the merging of nodes. As expected, the edge-at-a-time approach took the most steps.

Fig. 7.12 Query formulation time in user study

Fig. 7.13 Query construction steps in user study

Fig. 7.14 P-values of user study

Fig. 7.15 Effect of varying |P| on QFT and steps. Query size is indicated in round brackets

Paired t-tests show that the superior performance of TATTOO is statistically significant (p < 0.05) for 79.4% of the comparisons (Fig. 7.14). In particular, it takes up to 18X, 9.3X, 6.7X, 8X, 9X, and 9X fewer steps compared to edge-at-a-time, default patterns, random patterns, graphlets, frequent patterns, and CATAPULT-generated patterns, respectively. For QFT, TATTOO is up to 9.7X, 8.6X, 9X, 6.6X, 7.1X, and 7.4X faster, respectively. The results are qualitatively similar on other datasets. Note that we can run CATAPULT only on AM, for reasons discussed later.

Effect of |P|. The number of patterns on a VQI may also impact a user cognitively, as a larger |P| means a user needs to browse more patterns to select relevant ones. Hence, we investigate the effect of |P| on QFT and the number of steps (Fig. 7.15). Interestingly, QFT and steps are reduced by an average of 12% and 22% (maximum reduction of 77% and 80%), respectively, when |P| is increased from 5 to 30. Increasing |P| exposes more patterns that can be leveraged for query formulation, reducing query formulation steps. Further, it results in two opposing effects: (1) longer time needed to browse and select appropriate patterns (longer VMT) and (2) potentially more and larger patterns available for query construction, resulting in fewer construction steps and shorter QFT. The latter effect dominates. Our experimental studies consistently demonstrate the positive impact on the visual subgraph query formulation process when canned patterns with a lower estimated cognitive load on end users are selected for a PnP interface.

7.5.3 Automated Performance Study

In this section, we evaluate TATTOO from the following perspectives. First, we compare the runtime of TATTOO and the quality of its patterns with those of the baseline approaches (Exp 1, 2). Second, we present results that support our design decisions (Exp 3, 4, 5). To this end, we generate 1000 queries (sizes in [4, 30]) for each dataset, where 500 are randomly generated and the remaining (evenly distributed) are path-like, tree-like, star-like, cycle-like, and flower-like queries.


Exp 1: Runtime. First, we evaluate the generation time of the different pattern types in the canned pattern sets. Figure 7.16 (top) shows the results. In particular, the generation of chord-like patterns requires significantly more time (up to 146% more for LJ) than other pattern types. This is primarily due to the checks for different types of edge mergers required for CCPs. Figure 7.16 (bottom) reports the time taken by the various phases of TATTOO as well as the runtime of CATAPULT. TATTOO selects canned patterns efficiently, within a few minutes. Observe that the time cost of the small pattern extraction phase is small in practice. In general, pattern selection is the most expensive phase and requires a couple of minutes or less. Results are qualitatively similar for other datasets. Figure 7.17 plots the memory requirement of TATTOO. It is largely dependent on the size of the dataset, with the largest dataset LJ incurring the greatest memory cost. Lastly, observe that TATTOO is 735X faster than CATAPULT, which is not designed for large networks. Except for AM, the other datasets either cannot be processed by METIS or fail to generate patterns in a reasonable time (within 12 h) due to too many possible matches of unlabeled graphs, which require expensive graph edit distance computation. In the sequel, we shall omit discussions on CATAPULT.


Exp 2: Comparison with graphlets, frequent subgraphs, and random patterns. Next, we compare TATTOO's patterns with those of graphlets (30 patterns derived from graphlets). Figure 7.18 reports the results. Observe that TATTOO's patterns are superior to graphlets in all aspects. The results are qualitatively similar for other datasets. Note that coverage is not examined since it is 100% in all cases, as all queries can be constructed using a 2-node graphlet.

Here ⌈γ/(ηmax − ηmin + 1)⌉ is the maximum number of patterns for each k-sized pattern, where k ∈ [ηmin, ηmax].

Remark. The CPM problem is NP-hard (Huang et al. 2021). Observe that CPM is a multi-objective optimization problem in which an infinite number of Pareto-optimal solutions may exist, making it hard to decide on a single suitable solution (Marler and Arora 2010). Furthermore, it is rare to find a feasible solution that optimizes all objective functions simultaneously. We address this by converting CPM into a single-objective optimization problem using a multiplicative score function (Tofallis 2014).

8.1.2 Design Challenges

At first glance, it may seem that we can build MIDAS on top of CATAPULT without any modification to the latter. However, recall that CATAPULT utilizes frequent subtrees as feature vectors for coarse clustering in the small graph clustering phase. Observe that the evolution of D may impact the content of the graph clusters generated by this phase. Unfortunately, frequent subtrees make efficient maintenance of these clusters a challenging task due to the lack of a closure property. If we utilize frequent subtrees, we need to mine them again from scratch on D ⊕ ΔD, which is time-consuming. Note that the closure property of a data structure plays a pivotal role in designing efficient maintenance strategies (Bifet and Gavaldà 2011). Hence, we need a data structure with a closure property for the CPM problem.
Second, batch updates to D may result in different degrees of evolution of the graph clusters. Naturally, this may impact the structures of the CSGs from which patterns are selected. However, as remarked earlier, not all modifications to D demand a refresh of the existing canned pattern set P, as the updated version should not sacrifice the characteristics of canned patterns w.r.t. coverage, diversity, and cognitive load. Hence, we need to maintain P opportunely.


8.1.3 Scaffolding Strategy

We tackle the first challenge using scaffolding. Intuitively, scaffolding adds or modifies elements in an existing framework to support its extension, much like the use of scaffolds in the building industry. In particular, we adapt the existing CATAPULT framework by replacing the frequent subtrees (FS) with frequent closed trees (FCT) (Bifet and Gavaldà 2011). Given D and a threshold supmin, let f be a subtree in D and sup(f) be the support of f. The subtree f is a frequent closed tree (FCT) if sup(f) ≥ supmin and there exists no f′ ∈ D such that f′ is a proper supertree of f and sup(f′) = sup(f).

Example 8.2 Consider a graph database containing G1 to G9 in Fig. 8.1. Let supmin = 3/9. The tree f4 in Fig. 8.2b is an FCT since sup(f4) = 4/9 and none of its supertrees (e.g., G6, a supertree of f4, has support 2/9) has the same support as it. Similarly, the edge f1 in Fig. 8.2b is also an FCT.

Fig. 8.1 A sample graph database

Note that the set of FCTs forms the basis from which all FS can be generated (Bifet and Gavaldà 2011). Hence, it is closely related to FS. Furthermore, there are in general fewer closed trees than frequent ones (Bifet and Gavaldà 2011). Consequently, FCTs significantly reduce the number of frequent structures being considered. More importantly, the closure property of FCTs facilitates efficient incremental maintenance as the underlying database evolves.

Lemma 8.3 If a subtree f is closed in either D or ΔD, it must be closed in D ⊕ ΔD.

Proof (Sketch) Let sup(f, X) be the support of f in a database X and f′ be a proper supertree of f. Suppose f is closed in D. Then sup(f, D) > sup(f′, D), since sup(f, D) ≥ sup(f′, D) and sup(f, D) ≠ sup(f′, D). In addition, sup(f, ΔD) ≥ sup(f′, ΔD). Therefore, if graphs ΔD are added to D, sup(f, D ⊕ ΔD) = (sup(f, D) × |D| + sup(f, ΔD) × |ΔD|)/(|D| + |ΔD|) > (sup(f′, D) × |D| + sup(f′, ΔD) × |ΔD|)/(|D| + |ΔD|) = sup(f′, D ⊕ ΔD); that is, f is closed in D ⊕ ΔD. If graphs ΔD are removed from D, since f is closed in D, we have sup(f, D) − sup(f′, D) > 0 ⇒ sup(f, D) × |D| − sup(f′, D) × |D| > 0 ⇒ sup(f, ΔD)|ΔD| + sup(f, D \ ΔD)|D \ ΔD| > sup(f′, ΔD)|ΔD| + sup(f′, D \ ΔD)|D \ ΔD| ⇒ sup(f, D \ ΔD)|D \ ΔD| − sup(f′, D \ ΔD)|D \ ΔD| > sup(f′, ΔD)|ΔD| − sup(f, ΔD)|ΔD|. Since sup(f′, ΔD)|ΔD| − sup(f, ΔD)|ΔD| ≤ 0, we can derive that sup(f, D \ ΔD)|D \ ΔD| − sup(f′, D \ ΔD)|D \ ΔD| > 0; that is, f is closed in D ⊕ ΔD. □

For example, consider the sample graph database in Fig. 8.1. Suppose ΔD contains G10 to G12 and supmin = 3/9. Then f10 (resp. f7) in Fig. 8.2b is infrequent (resp. closed) in D containing G1 to G9 and in D ⊕ ΔD, although it is frequent (resp. not closed, since f7's proper supertree f2 has the same support as it) in ΔD. Hence, without scanning D ⊕ ΔD and testing subgraph isomorphism, we cannot determine whether the frequent subtrees generated from D or ΔD are frequent in D ⊕ ΔD. In contrast, we can conclude that the closed subtrees generated from D or ΔD are closed in D ⊕ ΔD (Lemma 8.3). This advantage is captured by the closure property of FCTs (detailed in Sect. 8.3.1), which greatly alleviates the computational demand of maintaining graph clusters. In addition, similar graphs have similar FCTs (Li et al. 2011). Finally, we add two indices, namely the frequent closed tree index (FCT-Index) and the infrequent edge index (IFE-Index), to facilitate pruning of unpromising candidate patterns and fast estimation of the pattern score. In the sequel, we shall refer to this extension of CATAPULT as CATAPULT++.

8.1.4 Selective Maintenance Strategy

To address the second challenge, MIDAS considers two types of modifications to D that are identified by exploiting changes to graphlet frequencies in D. Graphlets are small network patterns, and their frequencies have been found to characterize the topology of a network (Pržulj 2007). Intuitively, D can be logically viewed as a single network consisting of many disconnected subgraphs. Then, modifications to graphlet frequencies in D may provide an indication of the degree of topological change in D, as graphlets characterize network topology. Consequently, we focus on the degree of modification to graphlet frequencies to determine the strategy for maintaining the canned patterns. Specifically, we identify the type of modification by comparing the Euclidean distance between the graphlet frequency distributions (denoted as ψ) of D and D ⊕ ΔD, denoted as dist(ψD, ψD⊕ΔD). Note that the larger the distance, the more likely it is that D has undergone significant changes.

Fig. 8.2 Frequent closed trees, frequent and infrequent edges, FCT-Index, and IFE-Index

The rationale for using graphlet frequencies to determine the maintenance strategy is based on the observation that any canned pattern p ∈ P consists of one or more graphlets and edges (Lemma 8.4). Observe that size-3 patterns are essentially 3-node and 4-node graphlets, and larger patterns are grown from them. Hence, changes to graphlet frequency distributions may impact the current set of canned patterns P. To elaborate further, let the graphlets in D be g1, g2, ..., gk and their frequencies be f1, f2, ..., fk, where f1 ≥ f2 ≥ ... ≥ fk. After database modification, let the frequencies in D ⊕ ΔD be f1′, f2′, ..., fk′. It is indeed possible that for i < j, fi ≥ fj but fi′ < fj′. Since canned patterns are generated using a random walk-based approach, the probability that a particular candidate pattern is selected as a canned pattern is highly dependent on the frequencies of its edges and graphlets. Hence, a canned pattern containing graphlets whose frequencies have drastically reduced after database modification may no longer be relevant for D ⊕ ΔD and needs to be updated.

Lemma 8.4 Any canned pattern pi ∈ P contains one or more graphlets and edges.

Proof (Sketch) Let pi denote a canned pattern with i edges. Observe that the edge is the basic building block of a pattern and pi can be constructed using i edges. Consider p3, which has only 3 possible pattern topologies (i.e., triangle, 3-star, 3-path), where 3-star and 3-path are 4-node graphlets and triangle is a 3-node graphlet. For p4, there are 5 possible topologies, of which 2 are 4-node graphlets and the remaining are 5-node graphlets. Observe that these 5 topologies can be "grown" by adding an appropriate edge to a triangle, 3-star, or 3-path. That is, p4 contains subgraphs that are isomorphic to graphlets in p3. Similarly, larger pi can be constructed by "growing" graphlets with edges. Hence, each canned pattern p contains one or more graphlets and edges. □

Based on the above discussion, we classify the degree of modification into the following two types (a minimal sketch of this test is given after the list).

• Major modification (Type 1): This occurs when the graphlet frequency distributions undergo significant changes. A modification is deemed major if dist(ψD, ψD⊕ΔD) ≥ ε, where ε is the evolution ratio threshold.
• Minor modification (Type 2): In a minor modification, changes to D do not impact the current set of canned patterns P. That is, none of the patterns in P needs to be replaced. A modification is considered minor if dist(ψD, ψD⊕ΔD) < ε.
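A minimal sketch of this test, assuming the graphlet frequency distributions of D and D ⊕ ΔD are available as equal-length vectors of normalized counts; the class and method names are illustrative and not part of MIDAS.

public class ModificationClassifier {
    // Euclidean distance between two graphlet frequency distributions.
    static double dist(double[] psiOld, double[] psiNew) {
        double sum = 0.0;
        for (int i = 0; i < psiOld.length; i++) {
            double d = psiOld[i] - psiNew[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Type 1 (major) if the distance reaches the evolution ratio threshold, otherwise Type 2 (minor).
    static boolean isMajor(double[] psiOld, double[] psiNew, double epsilon) {
        return dist(psiOld, psiNew) >= epsilon;
    }

    public static void main(String[] args) {
        double[] before = {0.40, 0.30, 0.20, 0.10};   // hypothetical distributions
        double[] after  = {0.25, 0.35, 0.25, 0.15};
        System.out.println(isMajor(before, after, 0.1) ? "major (Type 1)" : "minor (Type 2)");
    }
}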


Algorithm 8.12 The MIDAS Algorithm.

Require: D, ΔD, b = (ηmin, ηmax, γ), initial canned pattern set P, existing clusters C, existing CSG set S, existing FCT set F, FCT support threshold supmin, evolution ratio threshold ε;
Ensure: Updated canned pattern set P′;
1: (C+, C) ← AssignToCluster(C, ΔD)
2: (C−, C) ← RemoveFromCluster(C, ΔD)
3: ψD ← GetGraphletDistribution(D)
4: ψD⊕ΔD ← GetGraphletDistribution(D ⊕ ΔD)
5: F ← MaintainFCT(F, ΔD, supmin)
6: S ← MaintainClusterSet(C+, C, S)
7: S ← MaintainCSGSet(S, C, C+, C−)
8: if Distance(ψD, ψD⊕ΔD) ≥ ε then
9:   (IFCT, IIFE) ← GetIndices(D, supmin)
10:  P′ ← MajorModification(C+, C−, S, b, P, IFCT, IIFE)
11: end if
12: (IFCT, IIFE) ← MaintainIndices(D, ΔD, P, P′, supmin, IFCT, IIFE)

Traditionally, the patterns in a classical visual subgraph query interface are rarely maintained as the underlying data evolves. MIDAS is the first data-driven framework that automatically maintains these patterns; its underlying argument is that the lack of maintenance adversely impacts the effectiveness of such interfaces in facilitating visual query formulation.

8.2 The MIDAS Framework

We now formalize the MIDAS framework. Algorithm 8.12 outlines the entire framework. First, it assigns all newly added graphs to existing clusters in D (Line 1) and removes all graphs marked for deletion (Line 2). The affected clusters are denoted as C+ and C−, respectively. Note that for cluster assignment (Line 1), MIDAS first computes the Euclidean distance between the FCT feature vector of a newly added graph G and that of the centroid of every cluster, and then assigns G to the cluster that yields the smallest distance. Then, it calculates the graphlet frequency distributions of D and D ⊕ ΔD (Lines 3 and 4). Next, it performs FCT maintenance (Line 5) (Sect. 8.3.2). The modified clusters and CSGs are maintained in Lines 6 (Sect. 8.3.3) and 7 (Sect. 8.3.4), respectively. In Line 8, MIDAS computes the Euclidean distance between the graphlet distributions of D and D ⊕ ΔD to determine the type of modification and the corresponding action. For a major modification (Lines 9-12), MIDAS generates candidate patterns from the CSGs of newly generated and modified clusters (Sect. 8.4). Finally, the existing canned patterns P are updated using a multi-scan swapping strategy (Sect. 8.5). In the case of a minor modification (i.e., Type 2), no pattern maintenance is required. However, observe that we do maintain the underlying clusters, CSGs, and indices (Lines 1-7 and 12) to ensure that they are consistent with D ⊕ ΔD. In subsequent sections, we shall elaborate on these steps in detail. Observe that our framework is query log-oblivious for the reasons discussed in Chap. 6. Nevertheless, MIDAS can be easily extended to accommodate query logs by considering the weight of a pattern based on its frequency in the log during multi-scan swapping.
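The cluster assignment in Line 1 can be pictured as a nearest-centroid step over FCT feature vectors. The following sketch assumes that each cluster exposes a centroid vector of FCT embedding counts and that the new graph's feature vector has the same dimensionality; all names are illustrative.

import java.util.*;

public class ClusterAssignment {
    static double euclidean(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Returns the index of the cluster whose centroid is closest to the graph's FCT feature vector.
    static int assign(double[] graphFeatures, List<double[]> centroids) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double d = euclidean(graphFeatures, centroids.get(i));
            if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<double[]> centroids = Arrays.asList(new double[]{2, 0, 1}, new double[]{0, 3, 1});
        System.out.println(assign(new double[]{1, 0, 1}, centroids));   // prints 0
    }
}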

8.3 Maintenance of Clusters and CSGs

In this section, we present how existing graph clusters and CSGs are maintained in the presence of ΔD. We begin by introducing the closure property of FCTs and how it is utilized to maintain FCTs in CATAPULT++.

8.3.1 Closure Property of FCT

According to Bifet and Gavaldà (2011), a subgraph is maximal in D if it is common and it is not a subgraph of any other common subgraph of the graphs in D. The intersection of a set of graphs D, denoted as G1 ∩ · · · ∩ Gn, is the set of all maximal subgraphs in D. The closure of a CT f for D is the intersection of all graphs in D containing f (denoted as ΔD(f)). The following propositions and corollaries established in Bifet and Gavaldà (2011) for closed trees (CT) are also applicable to FCTs since the latter are essentially a subset of CTs (Sect. 8.1.3).

Proposition 8.5 Adding (resp. deleting) a graph G containing a CT f to (resp. from) a graph dataset D does not modify the number of CTs for D.

Proposition 8.6 Let D1 and D2 be two graph datasets. A tree f is closed for D1 ∪ D2 if and only if it is in the intersection of its closures ΔD1(f) and ΔD2(f).

Corollary 8.7 Let D1 and D2 be two graph datasets. A tree f is closed for D1 ∪ D2 if and only if (1) f is a CT for D1, or (2) f is a CT for D2, or (3) f is a subtree of a CT in D1 and of a CT in D2, and it is in ΔD1∪D2({f}).

Proposition 8.8 A tree f is closed if f is in the intersection of all its closed supertrees.

As we shall see later, Corollary 8.7 and Proposition 8.8 can be exploited as checking conditions for closure when graphs are added to D and removed from D, respectively.

8.3.2 Maintenance of FCT

In CATAPULT++, FCTs are represented using the canonical form of frequent trees in CATAPULT, where canonical trees are first generated via normalization and then converted to canonical strings. We now describe the maintenance of FCTs, beginning with a brief description of how they are generated in CATAPULT++. We generate a set of closed trees (CT) by leveraging the TreeNat approach in Balcázar et al. (2010). Briefly, TreeNat uses a recursive framework to identify the set of CTs (denoted as F) in D. At each iteration, the support of all new subtrees F′ that are extensible from f in one step is checked. Recursive calls to TreeNat are made for all subtrees f′ ∈ F′ where sup(f′) ≥ supmin. Note that f is added to F only if there is no f′ such that sup(f) = sup(f′). In addition, checks are done on F to identify trees f′′ ∈ F that are subtrees of f with sup(f′′) = sup(f); the existence of f violates the definition of CT for such trees (Balcázar et al. 2010), and hence they are removed from F. In addition, F has to be maintained as the dataset evolves.

Algorithm 8.13 outlines the steps taken in MIDAS to maintain F. First, it relaxes the condition for FCTs by using a lower minimum support threshold supmin/2 (Line 1). Note that this reduces the chance of missing closed trees that may become frequent after the modification to D. In the case where existing data graphs are removed from D (i.e., Δ−), the relevant FCTs (FΔ−) are found using the TreeNat approach (Balcázar et al. 2010) (Line 3). This is then integrated with F using the approach in Bifet and Gavaldà (2008) (referred to as the CTMiningDelete procedure) in Line 4. Briefly, the CTMiningDelete procedure identifies the integrated set of CTs by checking every CT common to F and FΔ− in size-ascending order to determine whether its subtrees remain closed after the deletion operation. Note that this can be achieved by leveraging Proposition 8.8. A similar process occurs when new graphs are added to D (i.e., Δ+). The relevant CTs (FΔ+) are integrated with F using the CTMiningAdd procedure of Bifet and Gavaldà (2008) (Line 8). Similar to CTMiningDelete, CTMiningAdd checks every CT common to F and FΔ+ in size-ascending order to determine whether it remains closed in F. For each CT that remains closed, its support and the supports of all its subtrees are updated (Proposition 8.5). In addition, the subtrees that are closed in FΔ+ but not in F are added to the set of CTs in accordance with Corollary 8.7. Finally, the threshold supmin is restored to its original value (Line 10) and CTs t in F with sup(t) < supmin are pruned (Line 11) to obtain the final set of FCTs.

Lemma 8.9 Reducing supmin by half prevents missing frequent closed trees after the modification to D.

Proof (Sketch) Given supmin, if f is frequent in D ⊕ Δ+ where |D| = n1 and |Δ+| = n2, then freq(f, D ⊕ Δ+) ≥ (n1 + n2) × supmin, where freq(f, X) is the frequency of f in X. If f is not frequent in D, then freq(f, D) < n1 × supmin/2. Since freq(f, Δ+) = freq(f, D ⊕ Δ+) − freq(f, D) and freq(f, D ⊕ Δ+) − freq(f, D) ≥ freq(f, D ⊕ Δ+) − n1 × supmin/2, we have freq(f, Δ+) ≥ n2 × supmin + (n1/2) × supmin.


Algorithm 8.13 MaintainFCT Algorithm.

Require: ΔD, existing FCT set F, supmin;
Ensure: Updated FCT set F;
1: supmin ← supmin/2
2: if |Δ−| ≠ 0 then
3:   FΔ− ← TreeNat(Δ−, supmin)
4:   F ← CTMiningDelete(F, FΔ−, supmin)
5: end if
6: if |Δ+| ≠ 0 then
7:   FΔ+ ← TreeNat(Δ+, supmin)
8:   F ← CTMiningAdd(F, FΔ+, supmin)
9: end if
10: supmin ← supmin × 2
11: F ← PruneFCT(F, supmin)

This can be further written as freq(f, Δ+) > n2 × supmin/2. Note that the RHS implies that f is frequent in Δ+. Further, if f is not frequent in Δ+, then, similarly, f is frequent in D. Hence, f will not be missed. Alternatively, if f is frequent in D \ Δ−, let |D \ Δ−| = n2. Then any closed tree f frequent in D ⊕ Δ− satisfies freq(f, D ⊕ Δ−) ≥ n2 × supmin. Since freq(f, D)/n1 = (freq(f, D \ Δ−) + freq(f, Δ−))/n1, we have freq(f, D)/n1 ≥ freq(f, D \ Δ−)/n1. Further, freq(f, D \ Δ−)/n1 ≥ (n2 × supmin)/n1. We can assume that the number of removed graphs is less than or equal to half of |D| (i.e., n1 − n2 ≤ n2), because otherwise MIDAS would perform FCT mining of D \ Δ−. Hence, freq(f, D)/n1 ≥ (n2 × supmin)/(2 × n2), which in turn gives freq(f, D)/n1 ≥ supmin/2. Hence, f will not be missed. □

Lemma 8.10 The worst-case time and space complexities of FCT maintenance are O(|D||Emax|) and O(|D|), respectively, where Gmax = (Vmax, Emax) is the largest graph in D.

Proof (Sketch) The time complexity of FCT maintenance is due to TreeNat, which requires O(|D||Emax|) time in the worst case when every edge in every graph of D is traversed and these graphs have the same number of edges (i.e., |Emax|). In the FCT maintenance procedure, O(|D|) space is required for storing D. In practice, |F| is small. Coupled with the small size of individual FCTs, the space requirement for storing F is expected to be much smaller than that of D. Hence, in the worst case, the space complexity of FCT maintenance is O(|D|). □

Example 8.11 Consider the graph database D in Example 8.2. Figure 8.2b shows the FCTs (f1 to f5). Suppose ΔD involves the addition of G10 to G12 (Fig. 8.1) to D. The FCTs are maintained as follows: (1) relax supmin to 0.17; (2) identify FΔ+, which consists of f1, f2, G10, and five other CTs. The supports of f1, f2, and G10 are 3/3, 2/3, and 1/3, respectively, whereas those of the remaining CTs are all 1/3. Observe that only f1 and f2 are CTs common to F and FΔ+; (3) compute the supports of f1 and f2 and their subgraphs for D ⊕ ΔD (i.e., updated to 12/12 and 8/12, respectively). The support of the subgraph of f2 (i.e., edge (C, S)) is updated to 8/12 as well. The edge (C, S) is not considered a CT as f2, its supertree, has the same support. After the update, there is no change in the FCT set. However, (C, S) is now a frequent edge. Now consider a new batch update involving the deletion of G4 and G6. FΔ− is found to consist of f2, G6, and three other CTs. f2 and G6 have supports of 2/2 and 1/2, respectively, whereas those of the remaining CTs are all 1/2. Only f2 is a CT common to F and FΔ−. Hence, its support is updated to 6/10, whereas those of its subgraphs (C, O) and (C, S) are updated to 10/10 and 6/10, respectively. In particular, (C, O), which corresponds to f1, continues to be an FCT after the update.

8.3.3 Maintenance of Graph Clusters

The clusters are maintained as follows using Algorithm 8.12: (1) assign each newly added graph to an appropriate cluster (Line 1); (2) remove graphs marked for deletion from existing clusters (Line 2); (3) perform fine clustering (recall from Chap. 6) on clusters that exceed the maximum cluster size (Line 6). Observe that fine clustering results in new clusters. In a major modification, numerous graph additions and removals on a given cluster C may yield a CSG that is distinct from the CSG derived from the original C. These CSGs in turn may yield new candidate patterns and should be considered during candidate pattern generation (Sect. 8.4).

Lemma 8.12 The worst-case time and space complexities of maintaining clusters are O(Σ_{i=1}^{|Δ+|−N} (|Δ+| − i) × (|Vmax| + 1)!/(|Vmax| − |Vi| + 1)!) and O((|C+| + |C−|)(|Vmax| + |Emax|)), respectively, where Gmax = (Vmax, Emax) is the largest modified graph and N is the maximum cluster size.

Proof (Sketch) The worst-case time complexity of cluster maintenance is due to the fine clustering step. In the worst case, all the clusters containing newly added graphs exceed the maximum cluster size and fine clustering is performed. The worst-case time complexity is O(Σ_{i=1}^{|Δ+|−N} (|Δ+| − i) × (|Vmax| + 1)!/(|Vmax| − |Vi| + 1)!), where Gmax = (Vmax, Emax) is the largest modified graph and N is the maximum cluster size. The proof follows that in Chap. 6. Since all modified graphs have to be stored, the worst-case space complexity is O((|C+| + |C−|)(|Vmax| + |Emax|)). □

8.3.4 Maintenance of CSG Set

Given graph insertions and deletions (Δ+ and Δ−), MIDAS takes the following steps to update the CSGs (a sketch of these two rules is given after Lemma 8.13):
1. For every G+ = (V+, E+) in Δ+, retrieve the CSG S = (VS, ES) associated with the cluster that G+ is assigned to and update S by adding the id of G+ to the labels of all edges e ∈ E+ ∩ ES. Further, ∀e ∈ E+ \ ES, the edge e together with its label l(e) is added to ES, where l(e) is the id of G+.
2. For every G− = (V−, E−) in Δ−, retrieve the CSG S = (VS, ES) associated with the cluster that G− is removed from. If the frequency of edge e ∈ E− in the graph cluster associated with S is 1, update S by removing e. Otherwise, update l(e) by removing the id of G−.

Lemma 8.13 The worst-case time and space complexities of maintaining CSGs are O(|Emax| × (|Δ+| + |Δ−|)) and O((|Δ+| + |Δ−|)(|Emax| + |Vmax|)), respectively.

Proof (Sketch) In the worst case, all modified graphs are of the same maximum size. CSG maintenance requires the processing of each edge in each modified graph. Hence, the worst-case time complexity is O(|Emax| × (|Δ+| + |Δ−|)), where Gmax = (Vmax, Emax) is the largest modified graph. In terms of space, MIDAS has to store all modified graphs. Hence, the worst-case space complexity is O((|Δ+| + |Δ−|)(|Emax| + |Vmax|)). □
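A sketch of the two update rules above, assuming a CSG keeps, for each edge, the set of ids of the cluster members that contain it; the types and names are illustrative, not MIDAS's implementation.

import java.util.*;

public class CsgMaintenance {
    // For each edge of the CSG: the ids of the data graphs (in the cluster) containing it.
    private final Map<String, Set<Integer>> edgeLabels = new HashMap<>();

    // Rule 1: a graph with id gid and the given edge set is added to the cluster.
    void onInsert(int gid, Collection<String> edges) {
        for (String e : edges)
            edgeLabels.computeIfAbsent(e, k -> new HashSet<>()).add(gid);
    }

    // Rule 2: a graph is removed; an edge whose label set becomes empty is dropped from the CSG.
    void onDelete(int gid, Collection<String> edges) {
        for (String e : edges) {
            Set<Integer> ids = edgeLabels.get(e);
            if (ids == null) continue;
            ids.remove(gid);
            if (ids.isEmpty()) edgeLabels.remove(e);
        }
    }
}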

8.4 Candidate Pattern Generation

For a Type 1 (i.e., major) modification, MIDAS proceeds to generate candidate canned patterns and then replaces existing "stale" patterns in P with these candidates according to a swap-based strategy. These two steps are encompassed by the MajorModification procedure in Algorithm 8.12 (Line 10). In this section, we elaborate on the candidate pattern generation process; the swap-based strategy is detailed in Sect. 8.5. We begin by introducing two indices, the frequent closed tree index (FCT-Index) and the infrequent edge index (IFE-Index), that facilitate these steps.

8.4.1 FCT-Index and IFE-Index

Intuitively, the FCT-Index enables us to efficiently keep track of the existence of specific FCTs and frequent edges in data graphs and canned patterns, whereas the IFE-Index keeps track of infrequent edges. In particular, the FCT-Index is constructed from the canonical forms of FCTs and frequent edges. Figure 8.2c depicts the canonical forms of the FCTs and frequent edges in Fig. 8.2b. The canonical string is obtained by performing a top-down, level-by-level breadth-first scan of the canonical tree as described in Chap. 6. Recall that the symbol $ is used to separate families of siblings (e.g., O and S in f2).

Definition 8.14 (FCT-Index) Given a set of FCTs F and a set of frequent edges Efreq in D, the FCT-Index IFCT constructed on F ∪ Efreq consists of the following components:
• A trie T = (VT, ET), where v ∈ VT corresponds to a token of the canonical string of the FCTs and frequent edges. An edge e = (u, v) ∈ ET exists if the corresponding tokens of u ∈ VT and v ∈ VT are adjacent in the canonical strings.
• ∀v† ∈ VT, where v† is the terminating token of a canonical string, there exist a graph pointer and a pattern pointer. The graph pointer (resp. pattern pointer) of v† points to an array containing the number of embeddings of the corresponding FCT or frequent edge in each data graph (resp. pattern) over D (resp. P).

Observe that the two array structures can be represented by an |F ∪ Efreq| × |D| matrix and an |F ∪ Efreq| × |P| matrix, respectively. We refer to the former as the trie-graph matrix (TG-matrix) and the latter as the trie-pattern matrix (TP-matrix). We illustrate the construction of the FCT-Index using Fig. 8.2. First, the FCT set F = {f1, f2, f3, f4, f5, f6} and the frequent edges Efreq = {f7, f8, f9} are selected from D in Fig. 8.1. Then, the canonical strings of every FCT and frequent edge are inserted into a trie as shown in Fig. 8.2d. Finally, for every node in the trie representing a terminating token, a graph pointer (resp. pattern pointer) pointing to the corresponding row in the TG-matrix (resp. TP-matrix) is created. For instance, in the TP-matrix, pattern p3 in Fig. 8.2a has two embeddings of f1 and one embedding each of f3, f4, and f9.

Definition 8.15 (IFE-Index) Given D containing a set of infrequent edges Einf, the IFE-Index IIFE constructed on Einf consists of an |Einf| × |D| edge-graph matrix (EG-matrix) and an |Einf| × |P| edge-pattern matrix (EP-matrix) that store the number of embeddings of all infrequent edges over D and over the canned patterns P, respectively.

An example of the IFE-Index is given in Fig. 8.2e, where the infrequent edge f11 = (C, N) is found in G2 and G5. Observe that the aforementioned matrices are sparse. Hence, MIDAS stores only non-zero entries to reduce space usage. That is, given a sparse matrix, let x(i,j) be the value of the entry in the i-th row and j-th column. ∀x(i,j) > 0, MIDAS stores i, j, and x(i,j) in vectors arow, acolumn, and avalue, respectively. Note that insertion and deletion occur as a tuple (i, j, x(i,j)).

Lemma 8.16 The time and space complexities of index construction are O(|D| × |Vmax|!|Vmax|) and O(|D|(|F| + |Einf| + |Efreq|) + (n × m)), respectively, where Gmax = (Vmax, Emax) is the largest graph in D, m is the maximum depth of the trie, and n is the number of unique vertices in the trie.


Proof (Sketch) The worst-case time complexity is due to the subgraph isomorphism checks of every FCT in D and P, and it is O(|D| × |F| × |Vmax|!|Vmax|), where Gmax = (Vmax, Emax) is the largest graph in D, since P ⊂ D, |D| ≫ |P|, and FCTs are expected to be much smaller in size than Gmax. In practice, there are very few FCTs compared to |D| (see Sect. 6.6). Hence, the worst-case time complexity can be represented as O(|D| × |Vmax|!|Vmax|). The matrices require (|D| + |P|) × (|F| + |Einf| + |Efreq|) space, whereas the pointers require |F| space. In the worst case, F and Efreq are two non-intersecting sets and every vertex label in them is unique, resulting in a worst-case space complexity of 2 × |Efreq| + |F|(|Vfmax| + |Efmax|), where fmax is the largest FCT in F. Hence, the worst-case space complexity of index construction is |D|(|F| + |Einf| + |Efreq|) since |D| ≫ |P|, |D| ≫ |Vfmax|, and |D| ≫ |Efmax|. □

Remark. The exponential time complexity is due to the subgraph isomorphism checks for FCTs in D and P. We use the VF2 algorithm (Cordella et al. 2004) to this end. In practice, as we shall see in Sect. 6.6, the cost is low due to the small size of FCTs. This also applies to Lemmas 6.10 and 8.24.

Index Maintenance. Given an updated set of FCTs and frequent edges, the trie is updated by inserting new vertices and edges and removing deleted vertices and edges. For every new FCT and frequent edge, corresponding graph and pattern pointers are added and initially set to null. The matrices in the FCT- and IFE-indices are maintained as follows:
1. When new FCTs or frequent edges (resp. infrequent edges) are added, new rows are added to the TG- and TP- (resp. EG- and EP-) matrices.
2. When existing FCTs or frequent edges (resp. infrequent edges) are removed, the corresponding rows are removed from the TG- and TP- (resp. EG- and EP-) matrices.
3. When new graphs (resp. patterns) are added, new columns are added to the TG- (resp. TP-) and EG- (resp. EP-) matrices.
4. When existing graphs (resp. patterns) are removed, the corresponding columns are removed from the TG- (resp. TP-) and EG- (resp. EP-) matrices.
Note that the indices are maintained after database modification as well as when the canned pattern set is updated.

Lemma 8.17 The worst-case time and space complexities of maintaining the indices are O(|D ⊕ ΔD||Emax|) and O(|D ⊕ ΔD| × (|FD⊕ΔD| + |Efreq|)), respectively, where Gmax = (Vmax, Emax) is the largest graph of D ⊕ ΔD.

Proof (Sketch) The worst-case time complexity occurs when graph modification happens. In this case, it takes O(|D ⊕ ΔD||Emax|) time to retrieve the set of FCTs (Bifet and Gavaldà 2008), where Gmax = (Vmax, Emax) is the largest graph of D ⊕ ΔD. Updating the trie requires O(L) time, where L is the length of the key. Since for a large dataset |D ⊕ ΔD| ≫ L, the worst-case time complexity is O(|D ⊕ ΔD||Emax|). The trie, the pointers, and the matrices have to be stored. In the worst case, the size of the trie is O(|FD⊕ΔD| × fmax + |Efreq|), where the set of FCTs FD⊕ΔD and the frequent edges Efreq are distinct and fmax is the maximum size of an FCT. Correspondingly, the size of the pointers is O(|FD⊕ΔD| + |Efreq|), and the TG-matrix requires the largest storage of |D ⊕ ΔD| × (|FD⊕ΔD| + |Efreq|), assuming that each entry in the matrix is non-zero. Hence, the worst-case space complexity is O(|D ⊕ ΔD| × (|FD⊕ΔD| + |Efreq|)). □
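The sparse (row, column, value) representation introduced in Sect. 8.4.1, together with the row and column maintenance rules above, can be sketched as a small triplet store; this is an illustrative structure, not MIDAS's actual index implementation.

import java.util.*;

public class SparseCountMatrix {
    // Only non-zero entries are stored, keyed by (row, column); rows index features
    // (FCTs, frequent or infrequent edges) and columns index data graphs or patterns.
    private final Map<Long, Integer> entries = new HashMap<>();

    private static long key(int row, int col) { return (((long) row) << 32) | (col & 0xffffffffL); }

    // Insert or overwrite the number of embeddings of feature row in graph/pattern col.
    void put(int row, int col, int count) {
        if (count == 0) entries.remove(key(row, col));
        else entries.put(key(row, col), count);
    }

    int get(int row, int col) {
        return entries.getOrDefault(key(row, col), 0);
    }

    // Rule 2: drop a feature when an FCT or edge is no longer tracked.
    void removeRow(int row) {
        entries.keySet().removeIf(k -> (int) (k >> 32) == row);
    }

    // Rule 4: drop a graph or pattern when it is removed from D or P.
    void removeColumn(int col) {
        entries.keySet().removeIf(k -> (int) (k & 0xffffffffL) == col);
    }
}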

8.4.2 Pruning-Based Candidate Generation

Recall from Chap. 6 that the candidate generation step in CATAPULT does not exploit any pruning technique to filter unsuitable candidates early. Since in the CPM problem we can exploit knowledge of the existing canned pattern set P, can we eliminate "unpromising" candidates early? To this end, MIDAS exploits a novel coverage-based pruning strategy to guide the FCP generation process toward candidates that are deemed to have a greater potential of replacing some existing patterns in P (referred to as pattern swapping). Intuitively, a new pattern p′ is a promising FCP if it covers a large number of data graphs that are not covered by P (i.e., it has high marginal subgraph coverage), since p′ is then likely to improve upon the pattern score. A swapping threshold (κ) sets the minimum marginal subgraph coverage that is desired. The value of κ is updated based on the swap-based strategy. We use coverage-based pruning as it is monotonic. That is, given patterns p and p′, if p contains p′, then scov(p′) ≥ scov(p). Note that during canned pattern maintenance (Sect. 8.5), the candidate patterns are further assessed w.r.t. the pattern score, which is derived from cognitive load and diversity. In particular, we deliberately refrain from integrating cognitive load-based pruning here, as this gives us the flexibility to incorporate any alternative cognitive load measure in the pattern maintenance phase. Note that such a measure may not be monotonic.

Definition 8.18 (Promising FCP) Given D, P, and a swapping threshold κ, pc is a promising FCP if ∃p ∈ P such that |Gscov(pc) \ ∪p∈P Gscov(p)| ≥ (1 + κ)|Gscov(p) \ ∪p′∈P,p′≠p Gscov(p′)|, where κ ∈ [0, 1] and Gscov(x) ⊆ D is the set of graphs containing x.

MIDAS seeks to generate promising FCPs efficiently by terminating the generation process early if pc is unlikely to have high subgraph coverage. Since an FCP is constructed iteratively by adding the most frequently traversed edge that is connected to the partially constructed FCP (denoted as pc), MIDAS can perform early termination by considering the marginal subgraph coverage of the next edge e that is to be added to pc. It terminates FCP generation if e satisfies the following criterion (i.e., low marginal subgraph coverage):

|Gscov(e) \ ∪p∈P Gscov(p)| < (1 + κ) min_{p∈P} |Gscov(p) \ ∪p′∈P,p′≠p Gscov(p′)|     (8.2)

Fig. 8.3 CCP generation
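A sketch of this early-termination test, assuming the coverage sets Gscov(·) are materialized as sets of graph ids (in practice they are read off the TG- and EG-matrices); the class and helper names are illustrative.

import java.util.*;

public class CoveragePruning {
    // Marginal coverage: graphs covered by x but by none of the other sets.
    static Set<Integer> marginal(Set<Integer> coverOfX, Collection<Set<Integer>> others) {
        Set<Integer> result = new HashSet<>(coverOfX);
        for (Set<Integer> o : others) result.removeAll(o);
        return result;
    }

    // True if extending the current FCP with edge e should be terminated early (Eq. 8.2).
    static boolean prune(Set<Integer> coverOfEdge, List<Set<Integer>> coverOfPatterns, double kappa) {
        if (coverOfPatterns.isEmpty()) return false;   // nothing to compare against
        int edgeMargin = marginal(coverOfEdge, coverOfPatterns).size();
        int minPatternMargin = Integer.MAX_VALUE;
        for (int i = 0; i < coverOfPatterns.size(); i++) {
            List<Set<Integer>> rest = new ArrayList<>(coverOfPatterns);
            rest.remove(i);   // all patterns except the i-th one
            minPatternMargin = Math.min(minPatternMargin, marginal(coverOfPatterns.get(i), rest).size());
        }
        return edgeMargin < (1.0 + kappa) * minPatternMargin;
    }
}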

In particular, we utilize the FCT-Index and IFE-Index to compute Gscov(e). If e is a frequent edge, Gscov(e) is computed using the TG-matrix of the FCT-Index; otherwise, it is computed using the EG-matrix of the IFE-Index. The subsequent generation of CCPs and FCPs is similar to the CATAPULT framework.

Example 8.19 Reconsider Example 8.2. Suppose |D ⊕ ΔD| = 1000, γ = 9, ηmin = 3, and ηmax = 5. Let C1 = {G1, G2, G6, G8, G9, G12} be a cluster. The weighted CSG of C1 (Fig. 8.3a) is generated by computing the weight we for each CSG edge. Then, MIDAS generates a library of PCPs for each pattern size by performing random walks on the weighted CSGs. Next, it identifies the FCPs from the PCP library. Figure 8.3b depicts the generation of a size-4 FCP from SC1. Construction of the FCP starts from (C, O), the most frequent edge (based on 100 random walks). At each step, MIDAS checks whether the current FCP ought to be pruned by considering the condition imposed by Eq. 8.2. In Fig. 8.3b, early termination of FCP generation occurs after adding e2 since it satisfies Eq. 8.2.

Lemma 8.20 The worst-case time and space complexities of finding CCPs and FCPs are O(|VSmax|!|VSmax||S| + |P|(|VPmax|³ + xηmax²|S||ESmax|)) and O(|S|(|ESmax| + ηmax²) + |D||Emax|), respectively, where Smax is the largest CSG in the set of CSGs S whose clusters have evolved, x is the number of random walk iterations, Pmax is the largest pattern, and Gmax = (Vmax, Emax) is the largest data graph.

Proof (Sketch) In the worst case, no FCPs are pruned. Finding the weights of edges in the closure graphs requires O(|S||ESmax|) time in the worst case, where Smax ∈ S is the largest closure graph. Generating PCPs requires O(xηmax²|S||P||ESmax|) time, where x is the number of random walk iterations. Similar to CATAPULT, MIDAS utilizes edge occurrence from the random walk to identify the FCP. For every PCP library, computing edge occurrence requires O(xηmax) time, while FCP generation takes O(ηmax|ESmax|) time. Computing the pattern score requires a subgraph isomorphism test for each closure graph to find βp (O(|VSmax|!|VSmax|), Cordella et al. (2004)) and |P| graph edit distance computations (O(|VPmax|³), Riesen et al. (2007), where Pmax is the largest pattern in P) to find λp, yielding O(|VSmax|!|VSmax||S| + |P||VPmax|³) worst-case time complexity for each FCP. Updating the cluster weights and edge label occurrences requires O(|VSmax|!|VSmax||S|) and O(ηmax) time, respectively. Taken together, the pattern mining and selection phase has a worst-case time complexity of O(|VSmax|!|VSmax||S| + |P|(|VPmax|³ + xηmax²|S||ESmax|)).
Space complexity: There are |P| canned patterns. Since we expect canned patterns to be subgraphs of D, their sizes should be less than O(|Vmax| + |Emax|). Hence, the storage space needed for the candidate patterns is O(|P|(|Vmax| + |Emax|)). In addition, MIDAS allocates weights to each closure graph, which requires O(|S||ESmax|) space. In the worst case, maintaining the edge label weights requires O(|D||Emax|) space, assuming that every edge in each graph of D has a unique label. For each PCP library, O(xηmax) space is needed, where x is the number of random walk instances. During each iteration, there are ηmax − ηmin + 1 candidate canned patterns per closure graph. Hence, in the worst case, generating CCPs and FCPs requires O(|P|(|Vmax| + |Emax|) + |S||ESmax| + |D||Emax| + ηmax²|S|) space, since xηmax ≪ |D||Emax| in a typical large graph dataset. This can be further reduced to O(|D||Emax| + |S|(|ESmax| + ηmax²)) since |D| ≫ |P| and, in the worst case, Gmax is a strongly connected graph where |Emax| > |Vmax|. □

8.5 Canned Pattern Maintenance

In this section, we present the algorithm for maintaining the canned pattern set. We begin by adapting the pattern score utilized in CATAPULT to suit the CPM problem.

8.5.1 Pattern Score

The pattern score s_p (Eq. 6.2) of CATAPULT is modified by (1) replacing cluster coverage (ccov) with subgraph coverage (scov) and (2) using a tighter lower bound of GED in computing the diversity div(p, P \ p). We denote the modified score as s′_P.

Cluster Coverage versus Subgraph Coverage. In the CPM problem, cluster sizes may change due to ΔD. Since ccov (Eq. 6.2) is sensitive to cluster weights, we replace f_ccov(p) with scov_p = |G_p|/|D|. That is, s′_P = f_scov(P) × f_lcov(P) × f_div(P)/f_cog(P). Similarly, s_p = scov(p, D) × lcov(p, D) × div(p, P \ p)/cog(p), where p ∈ P. Note that scov computation can be prohibitively expensive for a large D. We address this by generating a sampled database Ds ⊂ D using the lazy sampling technique (Chap. 6) and then computing scov over Ds. In addition, we leverage IFCT and IIFE for computing scov. Observe that if a pattern p is contained in a graph G, then the corresponding column entries for p in the TP-matrix must be smaller than or equal to those of G in the TG-matrix. Hence, the pairs (p, G), where p may be contained in G, can be found by exploiting this property. In Fig. 8.2d, p3 contains two embeddings of f1 and one embedding each of f3, f4, and f9 (TP-matrix). From the TG-matrix, G8 and G9 have corresponding cell entries that are greater than or equal to those of p3. Hence, only 2 subgraph isomorphism checks (i.e., (p3, G8) and (p3, G9)) instead of 9 are performed for p3.

Tighter Bound GED. Observe that the diversity of a pattern is computed using a lower bound of GED (denoted as GEDl) to reduce the number of exact GED computations. In MIDAS, we leverage a pattern-feature matrix (PF-matrix) to further tighten GEDl. Given an FCP pc = (V, E), each row of the matrix represents an edge e ∈ E, whereas each column represents a subtree feature instance (i.e., an FCT, a frequent edge, or an infrequent edge). Since an FCP may contain multiple embeddings of a subtree feature f, these embeddings are represented as multiple columns in the PF-matrix. This is in contrast to the EG-matrix and EP-matrix, where every column corresponds to a graph or a pattern instead of their embeddings. Hence, an entry x(i,j) of the PF-matrix is 1 if the pattern contains the j-th feature instance (denoted as fj = (Vf, Ef)) and ei is an edge of that embedding; otherwise, it is 0. The PF-matrix of canned pattern p3 in Fig. 8.2a is given in Fig. 8.4. p3 contains two embeddings of f1 (i.e., f1(1) and f1(2) in the PF-matrix) and one embedding each of f3, f4, and f9. We denote the embedding set of p3 as Bp3. For example, x(2,5) and x(3,5) are 1, as p3 contains an embedding of f4 and edges e2 and e3 are in f4. Observe that if a graph G1 contains p3, then Bp3 ⊆ BG1. Consider the case of another graph G2 with one embedding of f1 and one embedding each of f3, f4, and f9. G2 does not contain p3 since Bp3 ⊈ BG2. Suppose an edge e1 of p3 is "relaxed" (i.e., e1 is not taken into consideration when p3 is being matched to another graph); then the relaxed embedding set B′p3 ⊆ BG2. That is, G2 contains p3 when e1 is "relaxed". In general, when matching two given graphs Gi = (Vi, Ei) and Gj = (Vj, Ej), where |Ei ∩ Ej| > 0 and |Ej| > |Ei|, Gi can be matched to Gj by progressively relaxing more and more edges. The upper bound for the number of matching edges is |Ei| − n, where n is the number of relaxed edges. Hence, GEDl can be tightened further as GEDl′ = GEDl + n.

Fig. 8.4 Pattern-feature matrix

Lemma 8.21 (Tighter lower bound for GED) Given two graphs GA = (VA, EA) and GB = (VB, EB), the tighter lower bound of GED is given as GEDl′(GA, GB) = |ΔV| + |ΔE|, where L(VA) is the set of labels of vertices in VA, |ΔV| = ||VA| − |VB|| + min(|VA|, |VB|) − |L(VA) ∩ L(VB)|, |ΔE| = ||EA| − |EB|| + n, and n is the number of relaxed edges.

Proof (Sketch) Given two graphs GA = (VA, EA) and GB = (VB, EB), the lower bound of GED consists of the minimum number of node modifications (MV) and edge modifications (ME) required to transform GA into GB. Observe that MV consists of (1) adding (resp. removing) ||VA| − |VB|| nodes if GA contains fewer (resp. more) nodes than GB and (2) modifying the node labels in GA to make sure they are the same as those in GB. The latter requires min(|VA|, |VB|) − |L(VA) ∩ L(VB)| steps. In the lower bound GED (denoted as GEDl) as defined in Chap. 6, ME consists only of adding (resp. removing) ||EA| − |EB|| edges if GA contains fewer (resp. more) edges than GB. That is, GEDl(GA, GB) = ||VA| − |VB|| + min(|VA|, |VB|) − |L(VA) ∩ L(VB)| + ||EA| − |EB||. Observe that this definition of the lower bound does not consider the "rewiring" of edges, where an existing edge (vs, vt) is modified to become (vs, vt′). In the tighter lower bound (denoted as GEDl′(GA, GB)), such edges are considered. That is, by iteratively removing edges (i.e., relaxed edges) that require "rewiring", GA progressively achieves a match to GB. Let n be the number of relaxed edges. Then GEDl′(GA, GB) = GEDl(GA, GB) + n. Hence, GEDl′(GA, GB) is tighter than GEDl(GA, GB) when n > 0. □
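A small sketch of the tightened bound in Lemma 8.21, assuming the vertex and edge counts, the vertex label sets, and the number of relaxed edges n have already been extracted from the two graphs; the class and method names are illustrative.

import java.util.*;

public class GedLowerBound {
    // GEDl'(GA, GB) = ||VA|-|VB|| + min(|VA|,|VB|) - |L(VA) ∩ L(VB)| + ||EA|-|EB|| + n
    static int tighterLowerBound(int vA, int vB, Set<String> labelsA, Set<String> labelsB,
                                 int eA, int eB, int relaxedEdges) {
        Set<String> common = new HashSet<>(labelsA);
        common.retainAll(labelsB);                         // |L(VA) ∩ L(VB)|
        int deltaV = Math.abs(vA - vB) + Math.min(vA, vB) - common.size();
        int deltaE = Math.abs(eA - eB) + relaxedEdges;
        return deltaV + deltaE;
    }

    public static void main(String[] args) {
        // hypothetical graphs: 4 vs 5 vertices, labels {A,B,C} vs {A,B,D}, 4 vs 6 edges, n = 1
        Set<String> la = new HashSet<>(Arrays.asList("A", "B", "C"));
        Set<String> lb = new HashSet<>(Arrays.asList("A", "B", "D"));
        System.out.println(tighterLowerBound(4, 5, la, lb, 4, 6, 1));   // prints 6
    }
}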

8.5.2 Swap-based Pattern Maintenance

Observe that the maximum coverage (MC) problem, which seeks to identify a collection of sets such that the maximum number of elements is covered, is a sub-problem of the CPM problem. However, greedy solutions typically find the maximum cover from scratch and hence cannot be effectively exploited in our problem setting. Recent works (Saha and Getoor 2009; Yuan et al. 2015) that address the MC problem in a streaming scenario use swap-based updating techniques instead. Specifically, they ensure that with each swap, the new cover set outperforms the cover set prior to the swap. However, these techniques are oblivious to diversity and cognitive load. Hence, we cannot adopt them directly. MIDAS realizes a multi-scan swapping strategy (Algorithm 8.14) that allows a progressive gain of coverage without sacrificing diversity and cognitive load. We begin by introducing the loss and benefit scores to facilitate the exposition.

Definition 8.22 (Loss and Benefit Scores) Given P and D, the loss score of a pattern p ∈ P is defined as S_L(p, P, D) = Σ_{p′∈P} scov(p′, D) − Σ_{p′∈P\{p}} scov(p′, D). The benefit score of a pattern pc ∉ P is defined as S_B(pc, P, D) = Σ_{p′∈P∪{pc}} scov(p′, D) − Σ_{p′∈P} scov(p′, D).


Algorithm 8.14 PatternMaintenance Algorithm.

Require: Existing canned patterns P, set of FCPs Pc, plug b, indices IFCT, IIFE;
Ensure: Updated set of canned patterns P′;
1: S_Pc ← GetPatternScoreSet(Pc, IFCT, IIFE)
2: PQ_Pc ← RankPattern(Pc, S_Pc)
3: S_P ← GetPatternScoreSet(P, IFCT, IIFE)
4: PQ_P ← RankPattern(P, S_P)
5: Stop ← false
6: PopNext ← true
7: while |PQ_Pc| > 0 and Stop = false do
8:   pc ← PopFromPriorityQueue(PQ_Pc)
9:   if PopNext = true then
10:    p ← PopFromPriorityQueue(PQ_P)
11:  end if
12:  S_B ← GetBenefitScore(pc, P, IFCT, IIFE)
13:  S_L ← GetLossScore(p, P, IFCT, IIFE)
14:  s_pc ← GetPatternScore(pc, IFCT, IIFE)
15:  s_p ← GetPatternScore(p, IFCT, IIFE)
16:  if MetSwappingCriteria(P, pc, p) = true and IsSamePatternSizeDistribution(P, pc, p) = true then
17:    P ← Swap(P, pc, p)
18:    PopNext ← true
19:  else
20:    PopNext ← false
21:    if s_pc < (1 + λ)s_p then
22:      Stop ← true
23:    end if
24:  end if
25: end while
26: P′ ← P

MIDAS swaps an existing p ∈ P with a proposed FCP pc if there is no significant change in the pattern size distribution between P and (P \ {p}) ∪ {pc}, and the following swapping criteria are satisfied (a minimal sketch of this check is given after the list):

• SW1: S_B(pc, P, D) ≥ (1 + κ)S_L(p, P, D),
• SW2: s_pc ≥ (1 + λ)s_p,
• SW3: f_div((P \ {p}) ∪ {pc}) ≥ f_div(P),
• SW4: f_cog(P) ≥ f_cog((P \ {p}) ∪ {pc}), and
• SW5: f_lcov((P \ {p}) ∪ {pc}) ≥ f_lcov(P),

where κ and λ are swapping thresholds. Note that κ here is the same as that in Eq. 8.2, and we use the Kolmogorov-Smirnov test to assess whether the pattern size distributions are similar. SW3-SW5 maintain the quality of the updated canned pattern set (i.e., they ensure optimization of s′_P). Additional requirements from users, such as f_div((P \ {p}) ∪ {pc}) ≥ (1 + α1)f_div(P), f_cog(P)(1 + α2) ≥ f_cog((P \ {p}) ∪ {pc}), and f_lcov((P \ {p}) ∪ {pc}) ≥ (1 + α3)f_lcov(P), where αi > 0, can be easily handled.
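A minimal sketch of the swap check, assuming all the aggregate scores of P and of (P \ {p}) ∪ {pc} have already been computed and omitting the pattern-size-distribution test; the names and signature are illustrative.

public class SwapCriteria {
    // SW1-SW5: benefit/loss, pattern scores, diversity, cognitive load, and label coverage.
    static boolean shouldSwap(double benefit, double loss,          // SW1
                              double sCandidate, double sExisting,  // SW2
                              double divNew, double divOld,         // SW3
                              double cogNew, double cogOld,         // SW4
                              double lcovNew, double lcovOld,       // SW5
                              double kappa, double lambda) {
        return benefit >= (1.0 + kappa) * loss
            && sCandidate >= (1.0 + lambda) * sExisting
            && divNew >= divOld
            && cogOld >= cogNew
            && lcovNew >= lcovOld;
    }
}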


MIDAS ranks all the FCPs in decreasing order of s_p and stores them in a priority queue PQ_Pc (Lines 1 and 2). The existing canned patterns are ranked in increasing order of s_p and stored in another priority queue PQ_P (Lines 3 and 4). Then, it pops the FCP with the highest s_p and compares it with the pattern with the lowest s_p in PQ_P (Lines 7-25). A swap occurs only if the swapping criteria are met and there is no significant change in the pattern size distribution between P and (P \ {p}) ∪ {pc} (Lines 16-19). Swapping is repeated until either PQ_Pc becomes empty or the second swapping criterion (SW2) is not met (Lines 21-23). Observe that the comparison based on SW2 can be used to terminate the swapping process. Finally, the swapped patterns are displayed on the VQI in a single update. MIDAS leverages the SWAPα approach in Yang et al. (2016) for setting κ. Although the following lemma in Yang et al. (2016) is proposed for a different problem (i.e., diversified top-k subgraph matching), we can exploit it as each canned pattern can be cast as an embedding of a query graph. Lastly, we set λ the same as κ for the reasons discussed in Sect. 6.6. Note that MIDAS can be easily configured to allow user-specified swapping thresholds (κ and λ) by specifying them as inputs to the algorithms.

Lemma 8.23 (Yang et al. (2016)) Given an initial result set P, let κt be the value of κ used for the t-th scan of the multi-scan swap algorithm and σt be the lower bound for the approximation ratio of the result set in the t-th scan. At the t-th scan of SWAPα, if σt−1 < 0.5, then by setting κt = 1 − 2σt−1, the approximation ratio of the result set after the scan is lower-bounded by σt = 0.25(1/(1 − σt−1)).

Remark. Lemma 8.23 dictates that f_scov(P)/f_scov(P_OPT) = 0.25(1/(1 − σt−1)) at the t-th scan if σt−1 < 0.5 and κt = 1 − 2σt−1. That is, the coverage of P is lower-bounded by 0.25 times the subgraph coverage of the optimal pattern set, and this coverage tends toward 0.5 f_scov(P_OPT). The diversity, cognitive load, and label coverage of P are at least as good as those of the original pattern set due to SW1-SW5.

Lemma 8.24 The worst-case time and space complexities of Algorithm 8.14 are O(γ|Ds||V_max|!|V_max| + |P||V_Pmax|³) and O(γ(|V_Pmax| + |E_Pmax|) + |D ⊕ ΔD| × (|F_{D⊕ΔD}| + |E_freq|) + (η_min + η_max)|C_ψ|(η_max − η_min + 1)/2), respectively, where G_max = (V_max, E_max) is the largest data graph, P_max = (V_Pmax, E_Pmax) is the largest canned pattern, and Ds ⊆ D.

Proof (Sketch) Computing the pattern score requires γ|Ds| subgraph isomorphism tests to find scov (O(|V_max|!|V_max|), Cordella et al. (2004)), where G_max = (V_max, E_max) is the largest data graph, and |P| graph edit distance computations (O(|V_Pmax|³), Riesen et al. (2007)), where P_max = (V_Pmax, E_Pmax) is the largest canned pattern. Hence, the worst-case time complexity is O(|V_max|!|V_max||Ds|γ + |P||V_Pmax|³). Swap-based pattern maintenance has to store the existing canned pattern set, the set of FCPs, and the indices I_FCT and I_IFE. In the worst case, storing the canned patterns requires O(γ(|V_Pmax| + |E_Pmax|)) space.


Storage of the PCPs takes O(|C_ψ| Σ_{i=η_min}^{η_max} i) space in the worst case, when each cluster in C_ψ yields an FCP for every size in the range [η_min, η_max]. The largest storage cost among the indices is due to the TG matrix (|D ⊕ ΔD| × (|F_{D⊕ΔD}| + |E_freq|)), assuming that each entry in the matrix is non-zero. Hence, the worst-case space complexity is O(γ(|V_Pmax| + |E_Pmax|) + |D ⊕ ΔD| × (|F_{D⊕ΔD}| + |E_freq|) + (η_min + η_max)|C_ψ|(η_max − η_min + 1)/2).

Fig. 8.5 Swap-based pattern maintenance

Example 8.25 Let γ = 6, η_min = 3, η_max = 4, and κ = λ = 0.3. Suppose P has 6 patterns (Fig. 8.5) and 20 FCPs are generated (i.e., |Pc| = 20). The FCPs are stored in a priority queue PQ_Pc = [pc5, pc11, pc8, pc17, ...], while the canned patterns are stored in a priority queue PQ_P = [p6, p5, p2, p1, p3, p4]. Suppose S_B, S_L, s_p6, and s_pc5 are found to be 0.8, 0.7, 0.61, and 0.85, respectively. Since S_B < (1 + κ)S_L and s_pc5 > (1 + λ)s_p6, p6 is not swapped with pc5 and the scan continues. Next, S_B, S_L, and s_pc11 are found to be 0.8, 0.6, and 0.79, respectively. Hence, S_B > (1 + κ)S_L and the remaining swapping criteria are satisfied, so MIDAS swaps p6 with pc11. In the next iteration, S_B, S_L, s_p5, and s_pc8 are found to be 0.7, 0.65, 0.63, and 0.73, respectively. Since S_B < (1 + κ)S_L, p5 is not swapped with pc8. The scan also terminates since s_pc8 < (1 + λ)s_p5 (pc8 is similar to p4). Consequently, the set of canned patterns after maintenance is {p1, p2, p3, p4, p5, pc11}.
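As a sanity check of Example 8.25, the fragment below (an illustrative sketch, not MIDAS code) replays the SW1 benefit/loss test and the SW2-based stop test on the reported scores.

```python
# Illustrative replay of the decisions in Example 8.25 (kappa = lambda = 0.3).
KAPPA = LAMBDA = 0.3

def sw1_holds(s_benefit, s_loss):
    return s_benefit >= (1 + KAPPA) * s_loss     # SW1

def stop_scan(s_pc, s_p):
    return s_pc < (1 + LAMBDA) * s_p             # Line 21 of the swap algorithm

# Iteration 1: S_B=0.8, S_L=0.7, s_p6=0.61, s_pc5=0.85
print(sw1_holds(0.8, 0.7), stop_scan(0.85, 0.61))   # False, False -> no swap, keep scanning
# Iteration 2: S_B=0.8, S_L=0.6 -> SW1 holds, p6 is swapped with pc11
print(sw1_holds(0.8, 0.6))                          # True
# Iteration 3: S_B=0.7, S_L=0.65, s_p5=0.63, s_pc8=0.73 -> no swap and the scan stops
print(sw1_holds(0.7, 0.65), stop_scan(0.73, 0.63))  # False, True
```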

8.6 Maintenance of Basic Patterns

Recall from Chap. 6 that basic patterns consist of edges and 2-edges (denoted as B). In this section, we briefly describe their maintenance. In particular, MIDAS stores the support level of each edge and 2-edge in the dataset. Note that the support values of a 2-edge (e1(u, v), e2(v, w)) can be represented as a matrix whose row and column correspond to e1 and e2, respectively. Similar to I_IFE and I_FCT, MIDAS only stores the non-zero entries of the matrix to reduce space consumption.


The basic edge patterns can be easily updated by taking the following two steps (a small sketch of Step 1 follows the list):

1. Update the support level of each edge e following the graph modification. That is, for every G+ = (V+, E+) ∈ Δ+ (resp. G− = (V−, E−) ∈ Δ−), increment (resp. decrement) the support of e by 1 if e ∈ E+ (resp. e ∈ E−).
2. Execute Steps 1–5 in Sect. 6.6.

The steps for updating the basic 2-edge patterns are the same.
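A minimal sketch of Step 1, assuming edge supports are kept in a dictionary keyed by label pairs and 2-edge supports in a sparse dictionary keyed by pairs of labeled edges; the data structures and function names are our assumptions rather than MIDAS internals.

```python
from collections import defaultdict

# Hypothetical sparse support stores: only non-zero entries are kept.
edge_support = defaultdict(int)      # key: (label(u), label(v))
two_edge_support = defaultdict(int)  # key: ((lu, lv), (lv, lw)) sharing vertex v

def update_edge_support(graph_edges, delta):
    """graph_edges: iterable of labeled edges (lu, lv); delta = +1 for a graph
    in the added batch, -1 for a graph in the deleted batch."""
    for e in graph_edges:
        edge_support[e] += delta
        if edge_support[e] == 0:
            del edge_support[e]                 # keep the store sparse

def update_two_edge_support(graph_two_edges, delta):
    """graph_two_edges: iterable of ((lu, lv), (lv, lw)) pairs."""
    for e1, e2 in graph_two_edges:
        two_edge_support[(e1, e2)] += delta
        if two_edge_support[(e1, e2)] == 0:
            del two_edge_support[(e1, e2)]

# Example: one added graph with edges C-C and C-O meeting at a C vertex.
update_edge_support([("C", "C"), ("C", "O")], +1)
update_two_edge_support([(("C", "C"), ("C", "O"))], +1)
```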

8.7 Performance Study

MIDAS is implemented in Java (JDK 1.8); the code is available at https://github.com/MIDAS2020/Midas. In this section, we investigate the performance of MIDAS and report the key findings. All experiments are performed on a 64-bit Windows desktop with an Intel Core i7-4790K CPU (4 GHz) and 32 GB of main memory.

8.7.1 Experimental Setup

Datasets. We use the following datasets: (a) the AIDS antiviral dataset with 40,000 (40K) data graphs; (b) the PubChem dataset containing chemical compound graphs (unless otherwise stated, PubChem refers to the 23K dataset; other variants used are 250K, 500K, and 1 million); and (c) the eMolecule dataset consisting of 10K chemical compounds (i.e., eMol). We use variants of various datasets, denoted as <Y><X> where Y and X refer to the dataset name and the number of graphs used, respectively (e.g., AIDS25K refers to the AIDS dataset with 25K data graphs).

Baselines. We compare MIDAS against (1) maintenance from scratch using CATAPULT (denoted as CATAPULT), (2) maintenance from scratch using CATAPULT++ (denoted as CATAPULT++), (3) random swapping instead of multi-scan swap (denoted as Random), and (4) the canned pattern set from CATAPULT with no maintenance (denoted as NoMaintain). The canned pattern set derived by an approach X is denoted as P_X.

Query set. The query set is generated by randomly selecting connected subgraphs from the dataset. For each dataset, 1000 subgraph queries with sizes in the range [4–40] are generated. We balance the query set such that queries from Δ+ are represented. In particular, when |Δ+| > 0, 500 queries are derived from Δ+ and the rest from D \ Δ−. Otherwise, all queries are obtained from D ⊕ ΔD. We denote a batch addition (resp. deletion) of graphs as +Y% (resp. −Y%), where Y = |M|/|D| × 100% and M is the set of graphs randomly added (resp. removed).
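The chapter does not detail how the connected subgraphs are sampled; a simple frontier-growing sketch is shown below (networkx-based; the expansion strategy and size interpretation are our assumptions, not the authors' generator).

```python
import random
import networkx as nx

def random_connected_subgraph(G, size):
    """Sample a connected subgraph of G with `size` vertices by growing a
    frontier from a random seed vertex; returns an induced subgraph."""
    seed = random.choice(list(G.nodes))
    chosen = {seed}
    frontier = set(G.neighbors(seed))
    while len(chosen) < size and frontier:
        v = random.choice(list(frontier))
        chosen.add(v)
        frontier |= set(G.neighbors(v))
        frontier -= chosen
    return G.subgraph(chosen).copy()

# Example: 1000 queries of size 4-40 sampled from one data graph G.
# queries = [random_connected_subgraph(G, random.randint(4, 40)) for _ in range(1000)]
```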



Parameter settings. Unless specified otherwise, we set τ = 10/|D|, η_min = 3, η_max = 12, |P| = γ = 30, sup_min = 0.5, ε = 0.1, and κ = λ = 0.1. We use the default settings in Chap. 6 for CATAPULT.

Performance measures. We use the following measures to assess the performance of MIDAS:

• Pattern maintenance time (PMT): Time taken to maintain the canned pattern set P (Algorithm 8.12).
• Missed percentage (MP): Percentage of the query set containing no canned patterns, i.e., MP = |Q_M|/|Q| × 100%, where Q is the query set and Q_M ⊆ Q is the set of queries that contain no subgraph isomorphic to any p ∈ P.
• Reduction ratio (μ): Given a subgraph query Q, μ = (step_X − step_MIDAS)/step_X, where step_X and step_MIDAS are the minimum number of steps required to construct Q when P is derived from approach X and MIDAS, respectively. Note that μ > 0 implies that P derived from X requires more steps than MIDAS.

For simplicity, in the automated performance study we assume that (1) a canned pattern p ∈ P can be used in Q iff p ⊆ Q, and (2) when multiple patterns are used to construct Q, their corresponding isomorphic subgraphs in Q do not overlap. In the user study, we jettison these assumptions by allowing users to modify the canned patterns, and no restrictions are imposed.
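To make the two measures concrete, here is a small sketch computing MP and μ from per-query bookkeeping; the inputs are assumed to be produced by the evaluation harness, and the helper names are ours.

```python
def missed_percentage(num_queries_without_usable_pattern, total_queries):
    """MP = |Q_M| / |Q| * 100, where Q_M are queries in which no canned
    pattern of P occurs as a subgraph."""
    return 100.0 * num_queries_without_usable_pattern / total_queries

def reduction_ratio(steps_baseline, steps_midas):
    """mu = (step_X - step_MIDAS) / step_X; positive values mean the baseline
    pattern set needs more construction steps than MIDAS's."""
    return (steps_baseline - steps_midas) / steps_baseline

print(missed_percentage(12, 1000))        # e.g., 1.2% of queries are missed
print(round(reduction_ratio(25, 20), 2))  # e.g., 0.2 -> 20% fewer steps with MIDAS
```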

8.7.2 User Study

The most pertinent question related to MIDAS is whether canned pattern maintenance expedites visual query formulation. We perform a user study that focuses on this question. 25 unpaid volunteers (ages 20 to 39) took part. These volunteers are students or researchers from different majors, and according to a pre-study survey they displayed a range of familiarity and expertise with subgraph queries. We use the VQI of AURORA (Chap. 6). We first presented a 10-min scripted tutorial of the VQI describing how to visually formulate queries, and then allowed the subjects to play with the VQI for 15 min. For PubChem23K, AIDS25K, and eMol5K, we added 6K, 10K, and 3K data graphs, respectively. Then, for each dataset, 3 sets of 5 subgraph queries with sizes in the range [19–45] are selected. Query set 1 (QS 1) consists of 5 queries derived from D; Query set 2 (QS 2) consists of 2 queries derived from D and 3 derived from Δ+; Query set 3 (QS 3) consists of 5 queries derived from Δ+. We measured the query formulation time (QFT), the number of steps required to formulate a query, and the visual mapping time (VMT), which is the time required to browse and select a canned pattern for use. Unless specified otherwise, we set |P| = 30.

To describe the queries to the users, we provided printed visual subgraph queries. A subject then draws the given query using a mouse in our VQI. The users are asked to make maximum use of the patterns to this end. Each query was formulated 5 times by different participants. We ensure the same query set is constructed in a random order (the order of the queries and the approaches is randomized) to mitigate the effects of learning and fatigue.

Figures 8.6, 8.7, and 8.8 report the results. MIDAS took the least QFT and steps on average for all datasets. For AIDS25K (resp. eMol5K), query formulation using MIDAS is up to 20% (resp. 26%) faster and required up to 18% (resp. 27%) fewer steps compared to NoMaintain. The VMT of all approaches varies in the range [3.9–5.2] for AIDS25K, whereas the VMT of MIDAS ([4.2–4.7]) is lower than that of the other approaches ([4.9–5.4]) for eMol5K. For PubChem, query formulation using MIDAS is up to 29.5% faster and required up to 22.9% fewer steps compared to NoMaintain (Fig. 8.8). The VMT of MIDAS is in the range [6.4–8.5] and is comparable to the other approaches ([6.6–9.4]).

Fig. 8.6 User study on AIDS25K (QFT, steps, and VMT for QS 1–3)

Fig. 8.7 User study on eMol5K (QFT, steps, and VMT for QS 1–3)

Fig. 8.8 User study on PubChem (QFT, steps, and VMT for QS 1–3)

Fig. 8.9 Effect of varying |P| on PubChem (QFT, steps, VMT, and patterns used)

Fig. 8.10 User study with user-specified queries (QFT, steps, and VMT on AIDS, PubChem, and eMol)

In addition, we investigated the effect of varying the number of canned patterns (|P| = 10 vs. |P| = 30) on the VQI. Figure 8.9 shows a general trend of reduction in the QFT and the number of steps as |P| increases, which is intuitive. Further, Fig. 8.9 illustrates the effect of stale patterns (NoMaintain) versus canned patterns maintained randomly, using MIDAS, or using CATAPULT from scratch. The maintained pattern sets yielded more patterns that can be used to formulate the required queries. The increased usage of canned patterns in turn resulted in fewer steps and shorter query formulation times.

Lastly, we let users come up with their own queries. Specifically, they can formulate queries of any size and topology. On average, each user constructed 5 queries from each dataset with query sizes in the range [18–42]. Figure 8.10 reports the results. As expected, MIDAS took the least QFT, steps, and VMT on average for all datasets. It is interesting to observe that MIDAS is superior to CATAPULT. In the latter, at each iteration, the "best" pattern is added greedily to the pattern set. The order in which patterns are added impacts the overall quality of the pattern set, as CATAPULT does not guarantee that at each iteration the best candidate is the optimal one. Unlike MIDAS, there is no requirement that a candidate is added only if the resultant pattern set has better quality than the old one.

Fig. 8.11 Comparison with commercial VQIs (QFT, steps, VMT, and p-values for PubChem and eMol)

Comparison with commercial VQIs. Next, we compare P_MIDAS against the pattern sets obtained from "static" commercial VQIs, namely PubChem (https://pubchem.ncbi.nlm.nih.gov/edit3/index.html) and eMol (https://reaxys.emolecules.com). We extract 13 and 6 canned patterns of size 3 or larger from the PubChem and eMol VQIs, respectively. Note that some of these patterns contain no vertex labels (i.e., unlabeled patterns). We transform these unlabeled patterns into labeled ones by assigning a common label that is not found in our set of 5 queries, to ensure that participants relabel the vertices to the correct ones during formulation. Further, we set |P_MIDAS| to 13 and 6 when comparing against the PubChem and eMol VQIs, respectively. From Fig. 8.11, we observe that MIDAS is faster in terms of QFT (up to 42%) and requires fewer steps (up to 50%) than the commercial VQIs. The superior performance is statistically significant (p < 0.05).
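The chapter reports p-values without naming the statistical test; the following sketch assumes an unpaired Mann-Whitney U test on per-query QFTs, with entirely hypothetical measurements, purely to illustrate how such a significance check could be run.

```python
# Sketch: significance of QFT differences between MIDAS and a commercial VQI.
# The test choice (Mann-Whitney U) is our assumption; the book only reports p < 0.05.
from scipy.stats import mannwhitneyu

qft_midas = [62, 71, 58, 80, 66]         # hypothetical per-query formulation times (s)
qft_commercial = [95, 110, 88, 120, 99]

stat, p_value = mannwhitneyu(qft_midas, qft_commercial, alternative="less")
print(p_value < 0.05)  # True would indicate MIDAS is significantly faster
```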

8.7.3 Experimental Results

Exp 1: Setting the values of ε, κ, and λ. In this set of experiments, we vary the evolution ratio ε and the swapping thresholds on AIDS25K with a batch addition of 5K graphs. Figure 8.12 plots the results. The PMT and clustering time of MIDAS remain relatively constant when ε ≤ 0.1. A dip in these times when ε = 0.2 is due to fewer clusters requiring maintenance compared to smaller values of ε. Importantly, compared to CATAPULT++, MIDAS is up to two orders of magnitude faster in terms of PMT due to the shorter time required to maintain the clusters. This highlights the efficiency of cluster and CSG maintenance using MIDAS versus regeneration of clusters and CSGs using CATAPULT++. In particular, we set ε to 0.1 since the variations of scov, lcov, and div between P_MIDAS and P_CATAPULT++ are less than 1%, and there is an improvement of 24% in terms of cog (cog ∈ [1.8, 2.3]). Next, we vary the swapping thresholds (i.e., κ, λ ∈ {0.05, 0.1, 0.2, 0.4}). We assess the performance based on PMT and the pattern generation time (PGT), which is the time required to generate candidate patterns and swap them with existing patterns. In particular, MIDAS is almost one order of magnitude faster than CATAPULT++ due to more efficient cluster and CSG maintenance, since its PGT is similar to that of CATAPULT++. Observe that the effect of varying κ is similar to that of λ. Hence, we set κ = λ = 0.1.

Fig. 8.12 Effect of varying ε, λ, and κ (PMT, PGT, and cluster maintenance time)

Exp 2: Cost of indices and FCT. Next, we examine the cost of using the FCT and the indices I_FCT and I_IFE. As expected, the costs of mining the FCTs and constructing the indices increase as the dataset size increases (Fig. 8.13, top left). In particular, I_FCT requires a longer construction time and more memory than I_IFE due to additional data structures. The total memory requirement of the indices is 49 MB for PubChem1M and is well within the limits of any modern machine. The maintenance time of the indices increases with the dataset size. In comparison, the FCT maintenance time increases as the size of the graph modification increases. In particular, for PubChem1M, maintenance of the indices and the FCT requires around 3 and 16 min, respectively. Note that |FCT|/|D| of PubChem100K, PubChem500K, and PubChem1M are 0.01, 0.001, and 0.0001%, respectively. The results are qualitatively similar for the other datasets. Hence, constructing and maintaining the FCT and the indices are fast and consume a small amount of memory.

Fig. 8.13 Cost of indices and FCT (PubChem) (mining/construction time, memory, and maintenance time)

Fig. 8.14 Comparison with no maintenance on AIDS25K (scov, div, and MP under batch modifications)

Exp 3: Comparison with baselines. We first compare MIDAS with NoMaintain on AIDS25K (Fig. 8.14). Observe that the MP of P_MIDAS outperforms that of P_NoMaintain by 61% on average. Further, P_MIDAS exhibits greater pattern diversity and scov than P_NoMaintain. Next, we compare MIDAS with CATAPULT, CATAPULT++, and Random on AIDS25K (Fig. 8.15) and PubChem15K (Fig. 8.16). In terms of execution time, MIDAS is comparable with Random (the fastest approach) and is up to an order of magnitude faster than CATAPULT. In general, MIDAS yields canned pattern sets of comparable or better quality (div, scov, lcov, cog) than CATAPULT and CATAPULT++. Note that lcov on AIDS25K (resp. PubChem15K) is approximately 1 (resp. 0.97) for all approaches, and the average cog of MIDAS, CATAPULT, and CATAPULT++ is 2.1, 2.2, and 2.5 (resp. 1.8, 2.3, and 2.6), respectively. As for μ, P_MIDAS outperforms P_CATAPULT, P_CATAPULT++, and P_Random. Furthermore, P_CATAPULT and P_CATAPULT++ have higher average MP compared to P_MIDAS. This highlights that MIDAS can efficiently maintain a set of canned patterns and ensure its relevance (lowest average MP, highest average scov) across a range of graph modifications without significant loss in pattern set quality. In comparison with random swapping, MIDAS's multi-scan swap approach has a smaller MP (Fig. 8.15, middle left, MIDAS vs. Random) and a lower μ (Fig. 8.15, middle right, MIDAS vs. Random). This justifies the multi-scan swap approach of MIDAS.

Fig. 8.15 Baseline comparison on AIDS25K (scov, div, MP, μ, and PMT under batch modifications)

Fig. 8.16 Baseline comparison on PubChem15K (scov, div, MP, μ, and PMT under batch modifications)

Exp 4: Scalability. We examine the scalability of MIDAS on PubChem with the datasets DS = {200K, 450K, 950K}, where 50K data graphs are added to each (Fig. 8.17). The canned pattern quality varies in the range of [0.94–0.98], [0.94–0.97], [0.13–0.21], and [1.8–3.3] for scov, lcov, div, and cog, respectively. As expected, PMT and PGT increase as the dataset size increases. In this set of experiments, we define μ = (step_X − step_200K)/step_X, where step_X is the minimum number of steps required to construct Q when P is derived from dataset X. In particular, μ is −27.7, −6.5, and −25.9 for the 250K, 500K, and 1M datasets, respectively. Note that a negative μ indicates greater step reduction. Further, cluster maintenance of MIDAS is faster (∼2.3 min) compared to generation of the clusters from scratch using CATAPULT (25 h) for the PubChem 1M dataset (i.e., 642X). Similarly, there is a speed-up of 83X in terms of PMT for MIDAS (18 min) compared to CATAPULT.

Fig. 8.17 Scalability study on PubChem (PMT, PGT, and cluster maintenance time)

Exp 5: Effect of swapping criteria. Furthermore, we examine the effect of turning on (denoted by On_i) or off (denoted by Off_i) each individual swapping criterion SW_i on AIDS25K. Figure 8.18 reports the results. Note that Default refers to enabling all swapping criteria as described in Sect. 8.5. As expected, the pattern quality of a characteristic improves only when its associated swapping criterion is turned on. When a swapping criterion is disabled, there is no quality guarantee and the resultant quality can be either better or worse. For example, div of On_3 increases compared to that of Off_3 (0.2 vs. 0.125). In comparison, the pattern score (0.575 vs. 0.659) and scov (0.947 vs. 0.979) deteriorate.

Fig. 8.18 Effect of swapping criteria (pattern score, scov, div, and cog with individual criteria on/off)

Exp 6: Effect of alternative distance measures for graphlet frequency distribution. Lastly, we examine the effect of using alternative distance measures (i.e., Manhattan distance and Cosine distance) on AIDS25K (denoted as D). In particular, we constructed three new datasets D1, D2, and D3 where 10, 20, and 30% distinct new graphs are added to D, respectively. Note that the vertices of these newly added graphs contain labels that are either absent or of low occurrence in D. Hence, the newly added graphs are expected to be different from the graphs in D. As a result, we have the following ground truth: dist(D, D3) > dist(D, D2) > dist(D, D1), where dist(D, Dx) is the distance between the graphlet frequency distributions of D and Dx. The distance measures are assessed based on how well they align with this ground truth. All three distance measures rank the pairwise graphlet frequency distribution distances consistently with the expected ground truth (Fig. 8.19). The execution time (i.e., < 0.01 s) is similar for all three measures. Hence, we randomly selected a distance measure (i.e., Euclidean distance).
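For illustration, the sketch below compares hypothetical normalized graphlet frequency distributions under the three distance measures discussed above (scipy-based; the example vectors are invented and merely mimic the D1–D3 setup).

```python
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean

# Hypothetical normalized graphlet frequency distributions of D and D1..D3.
d  = np.array([0.40, 0.25, 0.20, 0.10, 0.05])
d1 = np.array([0.38, 0.26, 0.21, 0.10, 0.05])
d2 = np.array([0.35, 0.27, 0.22, 0.11, 0.05])
d3 = np.array([0.30, 0.28, 0.24, 0.12, 0.06])

for name, dist in [("Euclidean", euclidean), ("Manhattan", cityblock), ("Cosine", cosine)]:
    # All three measures should rank dist(D,D3) > dist(D,D2) > dist(D,D1).
    print(name, [round(dist(d, dx), 4) for dx in (d1, d2, d3)])
```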


Fig. 8.19 Effect of alternative distance measures

8.8 MIDAS in AURORA

MIDAS is integrated with AURORA (recall from Chap. 6) to support efficient maintenance of canned pattern sets. We illustrate this with a case study. Consider the AURORA VQI in Fig. 8.20 used to query a chemical compound database (e.g., PubChem). Suppose John, a chemist, constructs a query graph of boronic acid (Panel 2). In Chap. 6, we have seen that canned patterns can expedite such visual query formulation. Chemical compounds are discovered at an exponential rate (Llanos et al. 2019). Suppose that the patterns are maintained using MIDAS. Figure 8.21a and b show samples of the canned patterns before and after maintenance, respectively. There are 28 common patterns (e.g., p1 and p2) between the two sets. In particular, some patterns (e.g., p3) have been swapped with candidate patterns (e.g., p3') relevant to boronic esters. It took 296 s to maintain the pattern set. After maintenance, the canned pattern set improved marginally in scov (1%) and maintained the same diversity and cognitive load.

John would have taken 20 steps (around 102 s) with the initial set of canned patterns. In particular, he may use p4 and p1; remove an H and its associated edge from p4; add seven vertices (3 H, 1 C, 1 B, and 2 O); and add ten edges. In comparison, John now requires only 14 steps (around 70 s) with the maintained set of canned patterns. Specifically, he uses p4, p1, and p3'; removes an H and its associated edge from p4; and adds three H vertices and seven edges. That is, the refreshed pattern set led to a more efficient formulation compared to its stale version. Also, the existence of the new pattern p3' may trigger a bottom-up search for boronic ester-based compounds that may not be possible if the stale VQI is used.

Fig. 8.20 PnP interface generated by AURORA

Fig. 8.21 Canned pattern sets

8.9 Conclusions

Real-world graph data repositories are seldom static. However, patterns in existing manual visual subgraph query interfaces are rarely updated when the underlying data evolve. The lack of maintenance of patterns may adversely impact efficient visual query formulation. PnP interfaces pave the way for the automatic maintenance of these patterns. To this end, in this chapter we present an efficient pattern maintenance framework for graph databases. It takes a data-driven approach to automatically and opportunely maintain the patterns of a PnP interface. Our maintenance strategy ensures that the updated patterns enjoy high coverage and diversity without imposing a high cognitive load on the users. Our experimental study emphasizes the benefits of maintaining patterns.


References

J.L. Balcázar, A. Bifet, A. Lozano. Mining frequent closed rooted trees. Machine Learning, 78(1-2):1, 2010.
A. Bifet, R. Gavaldà. Mining adaptively frequent closed unlabeled rooted trees in data streams. In SIGKDD, 2008.
A. Bifet, R. Gavaldà. Mining frequent closed trees in evolving data streams. Intell. Data Anal., 15(1):29-48, 2011.
L.P. Cordella, P. Foggia, C. Sansone. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans. Pattern Anal. Mach. Intell., 26(10):1367-1372, 2004.
K. Huang, et al. MIDAS: Towards Efficient and Effective Maintenance of Canned Patterns in Visual Graph Query Interfaces. In SIGMOD, 2021.
G. Li, M. Semerci, B. Yener, M.J. Zaki. Graph classification via topological and label attributes. In MLG, 2011.
E.J. Llanos, J. Leal, W. Luu, D.H. Jost, P.F. Stadler, G. Restrepo. Exploration of the chemical space and its three historical regimes. PNAS, 116(26):12660-12665, 2019.
R.T. Marler, J.S. Arora. The weighted sum method for multi-objective optimization: new insights. Struct. Multidiscip. O., 41(6):853-862, 2010.
N. Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177-e183, 2007.
K. Riesen, M. Neuhaus, H. Bunke. Bipartite graph matching for computing the edit distance of graphs. In GbRPR, 2007.
B. Saha, L. Getoor. On maximum coverage in the streaming model & application to multi-topic blog-watch. In SDM, 2009.
C. Tofallis. Add or multiply? A tutorial on ranking and choosing with multiple criteria. INFORMS Trans. on Education, 14(3):109-119, 2014.
Z. Yang, A.W.C. Fu, R. Liu. Diversified top-k subgraph querying in a large graph. In SIGMOD, 2016.
D. Yuan, P. Mitra, H. Yu, C.L. Giles. Updating graph indices with a one-pass algorithm. In SIGMOD, 2015.
L. Zou, L. Chen, J.X. Yu, Y. Lu. A novel spectral coding in a large graph database. In EDBT, 181-192, 2008.

9 The Road Ahead

In this chapter, we summarize the contributions of this book and establish several lines of inquiry associated with PnP interfaces for future research.

9.1 Summary

The contributions of this book are summarized as follows:

9.1.1 Plug-and-Play (PnP) Interfaces

In Chap. 3, we review existing VQIs for graph data that are prevalent in the academic and industrial world and emphasize the role patterns (i.e., small connected subgraphs) play in facilitating visual subgraph query construction in the pattern-at-a-time mode. We also highlight the limitations of existing VQIs w.r.t. usability and cognitive load. Specifically, they do not provide sufficient features to aid flexible and efficient visual query formulation, and they are static in nature when the underlying graph repository evolves. Furthermore, the manual construction of a VQI limits its portability across different domains and sources, as one has to reimplement and customize the VQI for each of them. These issues adversely impact various usability criteria such as flexibility, robustness, efficiency, and satisfaction.

In Chap. 4, we introduce the notion of a Plug-and-Play (PnP) visual subgraph query interface to alleviate these limitations, and we highlight its features and advantages in comparison to existing manual VQIs as well as the challenges to realize it. A PnP interface takes a fundamentally different approach to VQI construction compared to existing manual approaches. Given a graph repository and a plug, it automatically populates and maintains the various data panels (i.e., attribute, pattern) of the VQI from the underlying data, consistent with the requirements specified in the plug. Intuitively, the underlying graph repository acts as a socket, and the PnP template with user-specified requirements represents a plug. Hence, PnP interfaces allow an end user to change the socket (i.e., graph repository) or the plug (i.e., requirements) as necessary to automatically generate a VQI for her query formulation tasks effortlessly. Such a data-driven paradigm brings several benefits such as superior support for visual subgraph query construction, significant reduction in the manual cost of constructing and maintaining a VQI for any graph data source, and portability of the interface across a diverse variety of graph data sources and querying applications.

9.1.2 Canned Patterns—The Building Block of PnP Interfaces

In Chap. 5, we introduce two types of patterns in a VQI, basic and canned, for facilitating top-down and bottom-up searches in the pattern-at-a-time mode. Basic patterns are subgraphs with a size typically less than three, whereas canned patterns are larger connected subgraphs. We describe the key characteristics of canned patterns, which are the building blocks of PnP interfaces. First, a canned pattern set in a PnP interface should ideally cover as large a portion of the graph data as possible (i.e., high coverage). Then, a large number of subgraph queries on the underlying graph repository can be constructed by utilizing the pattern set. Second, in order to make efficient use of the limited display space in a VQI, the patterns should be structurally diverse to serve a variety of queries. This also facilitates bottom-up search, where a user gets a bird's-eye view of the diverse substructures in the underlying graph repository. Third, the displayed canned patterns should impose a low cognitive load on end users to facilitate cognitively efficient browsing and selection of relevant patterns during query formulation. In the context of visual query formulation, the cognitive load on a user is associated with visually interpreting a displayed canned pattern's edge relationships to determine if it is useful for a query. Consequently, a topologically complex pattern may demand substantial cognitive effort (i.e., increase intrinsic cognitive load) from an end user to decide if it can assist in her query formulation. In this chapter, we review measures proposed in the literature to quantify coverage, diversity, and cognitive load of canned patterns.

9.1.3 Pattern Selection for Graph Databases

Chapter 6 presents a data-driven framework for selecting basic and canned patterns from a graph database containing a large collection of small- or medium-sized graphs (e.g., chemical compounds, protein structures). In particular, the selection of canned patterns is a computationally challenging problem. Given a graph database D and a user-specified plug b, the goal is to automatically select canned patterns for the PnP interface from D that satisfy b. The chapter presents a framework called CATAPULT to address it, which comprises the following three steps. First, it partitions D into a set of clusters. Then, it summarizes each cluster into a cluster summary graph (CSG) by performing graph closure iteratively on pairs of data graphs in the cluster. A closure graph integrates graphs of varying sizes into a single graph by inserting dummy vertices or edges with a special label such that every vertex and edge is represented in it. Finally, it follows a greedy iterative approach based on weighted random walks for selecting canned patterns from the CSGs based on the aforementioned characteristics. Specifically, it exploits a pattern score that incorporates coverage, diversity, and cognitive load to associate a score with each candidate pattern. The candidate pattern with the largest pattern score within a user-specified size range is greedily selected as the best pattern to be added to the canned pattern set of the PnP interface. The selection process continues until either the required number (a user-defined value) of canned patterns is discovered or no new pattern can be found. CATAPULT is query-log-oblivious, primarily due to the lack of a publicly available query log repository for small- or medium-sized data graphs. We also present a PnP interface for graph databases called AURORA that is powered by CATAPULT.
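A highly simplified sketch of the greedy selection loop described above is given below. It assumes a `pattern_score` callable that stands in for the coverage/diversity/cognitive-load score of Chap. 6; the candidate representation and the toy score are our inventions, not CATAPULT's.

```python
from collections import namedtuple

# Hypothetical candidate representation; real CATAPULT candidates come from CSG walks.
Candidate = namedtuple("Candidate", ["name", "num_edges", "base_score"])

def select_canned_patterns(candidates, pattern_score, k, size_range):
    """Greedy loop: repeatedly add the highest-scoring candidate whose size lies
    in size_range, until k patterns are chosen or no candidate helps."""
    selected = []
    pool = [c for c in candidates if size_range[0] <= c.num_edges <= size_range[1]]
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: pattern_score(c, selected))
        if pattern_score(best, selected) <= 0:   # no remaining candidate improves the set
            break
        selected.append(best)
        pool.remove(best)
    return selected

# Toy score: a candidate's base score, discounted if a same-size pattern was already
# picked (a crude stand-in for the coverage/diversity/cognitive-load trade-off).
def toy_score(c, selected):
    penalty = 0.2 * sum(1 for s in selected if s.num_edges == c.num_edges)
    return c.base_score - penalty

cands = [Candidate("star", 5, 0.9), Candidate("chain", 5, 0.8), Candidate("cycle", 6, 0.7)]
print([c.name for c in select_canned_patterns(cands, toy_score, k=2, size_range=(4, 12))])
```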

9.1.4 Pattern Selection for Large Networks

The CATAPULT approach for selecting canned patterns from a collection of small- or medium-sized data graphs is not efficient for large networks, as a clustering-based approach is prohibitively expensive. In Chap. 7, we describe a framework called TATTOO to address this problem for large networks. It exploits a recent analysis of real-world query logs (Bonifati et al. 2017) to classify the topologies of canned patterns into categories that are consistent with the topologies of real-world queries (e.g., star, chain, petals, and flower). Such classification enables it to bypass the stumbling block of unavailable query logs, yet exploit the topological characteristics of real-world queries to guide the selection process. Since real-world query logs contain triangle-like and non-triangle-like substructures, it first decomposes the input network into a dense truss-infested region (G_T) and a sparse truss-oblivious region (G_O) by leveraging the notion of k-truss. Then, candidate patterns from G_T and G_O are discovered based on the classified topologies to identify potentially useful patterns. Lastly, canned patterns are selected from these candidates for display on the VQI based on a pattern set score that is sensitive to the coverage, diversity, and cognitive load of patterns. Specifically, the selection algorithm guarantees a 1/e-approximation. The basic patterns for large networks are provided as defaults for all datasets as they are the building blocks of real-world networks. The chapter culminates with a brief description of the PLAYPEN system, which exploits TATTOO to realize the world's first PnP interface for large networks.
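A minimal networkx-based sketch of the truss-based split mentioned above is shown here; treating the k-truss edges as the truss-infested region and the remaining edges as the truss-oblivious region is our simplification of TATTOO's decomposition, and the choice k = 4 is an assumption.

```python
import networkx as nx

def truss_split(G, k=4):
    """Split G into a truss-infested region (edges in the k-truss) and a
    truss-oblivious region (all remaining edges)."""
    tir = nx.k_truss(G, k)   # subgraph whose edges are each supported by >= k-2 triangles
    tor_edges = [(u, v) for u, v in G.edges() if not tir.has_edge(u, v)]
    return tir, G.edge_subgraph(tor_edges).copy()

# Example on a small built-in network.
G = nx.karate_club_graph()
tir, tor = truss_split(G, k=4)
print(tir.number_of_edges(), tor.number_of_edges())
```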

9.1.5 Pattern Maintenance

Since real-world graph data often evolve over time, in Chap. 8 we describe MIDAS, a framework that addresses the canned pattern maintenance (CPM) problem for a large collection of small- or medium-sized data graphs. It is built on top of CATAPULT and seeks to update the existing canned patterns in a VQI such that the updated set continues to have high coverage, high diversity, and low cognitive load. In particular, MIDAS guarantees that the quality of the updated pattern set is at least the same as or better than that of the original canned patterns. MIDAS maintains the canned patterns based on batch updates instead of unit updates. This is because (a) a unit update involves a single data graph and is unlikely to impact the set of canned patterns in a VQI, and (b) several real-world databases of small- or medium-sized data graphs are updated periodically (e.g., daily). In particular, it exploits the degree of change to the graphlet frequency distribution of D to selectively maintain the pattern set. It also replaces frequent subtrees with frequent closed trees (FCTs) as feature vectors for clustering in CATAPULT. As FCTs exhibit the closure property, this paves the way for efficient maintenance of the clusters. First, MIDAS assigns all newly added graphs to existing clusters of D and removes all graphs marked for deletion. Then, it calculates the graphlet frequency distributions of D and of the updated version of D. Next, it performs FCT maintenance by first retrieving the existing FCTs and the changes to D and then maintaining them according to these changes. The modified clusters and CSGs are maintained after that. MIDAS computes the Euclidean distance between the graphlet distributions of D and the updated D to determine the type of modification and the corresponding action. For a major modification, it generates candidate patterns from the CSGs of newly generated and modified clusters. The existing canned patterns are then updated using a multi-scan swapping strategy that guarantees a progressive gain of coverage without sacrificing diversity or cognitive load. In the case of a minor modification, no pattern maintenance is required; only the underlying clusters and CSGs are maintained to ensure that they are consistent with the updated D.

9.1.6 Usability Results

We summarize the usability evaluation of PnP interfaces w.r.t. their manual counterparts. We discuss it along two dimensions, performance measures and preference measures. The former are quantifiable measures (i.e., they can be communicated with numbers), whereas the latter give an indication of a "user's opinion about the interface which is not directly observable" (McCracken and Wolfe 2004) (through questionnaires and interviews). PnP interfaces are more efficient (lower query formulation time and fewer steps) compared to several industrial-strength classical VQIs. They also provide a superior experience (preference measures). Hence, they outperform traditional VQIs in several usability criteria such as efficiency, satisfaction, and flexibility. Note that the usability evaluations in the existing literature are conducted on a small number of end users.

9.2 Future Directions

While good progress has already been made, research on PnP interfaces has just begun, and there are many opportunities for continued research. Here, we present a non-exhaustive list of open problems in this area. Some of these topics were introduced in a vision paper (Bhowmick et al. 2016). Our grand vision is a pervasive desire to continue stimulating a change in our traditional thinking by shifting the generation of visual query interfaces from a manual to a data-driven mode.

Maintenance of PnP interfaces for large networks. The research on maintaining a canned pattern set as the underlying graph repository evolves is still in its nascent stage. It has focused on a maintenance strategy for a large collection of small- or medium-sized data graphs. Efficient maintenance of VQIs for large networks is still an open problem. Note that the solution to this needs a rethink, as the evolution characteristics of large networks differ fundamentally from those of a collection of data graphs. In the latter case, the repositories are typically updated periodically, whereas large networks often evolve continuously.

PnP interfaces for massive networks. All research related to PnP interface construction and maintenance has focused either on a large set of small- or medium-sized data graphs or on networks with millions of nodes. Both these types of data are assumed to reside on a single commodity machine. A natural extension to this paradigm is to support similar problems on massive graphs (comprising hundreds of millions to billions of nodes), which may demand a distributed framework and novel construction and maintenance algorithms built on top of it. Specifically, the PnP engine in Fig. 4.1 has to be extended to address this issue.

Toward aesthetics-aware PnP interfaces. An issue that is paramount to an end user but widely ignored by the data management community is the aesthetics of the layout of a VQI. People prefer attractive interfaces (De Angeli et al. 2006). The visual appearance of a VQI (i.e., its aesthetics) impacts its usability as it influences the way users interact with it (i.e., the aesthetic-usability effect1). Specifically, the various panels in a VQI and their characteristics (size, color) influence the complexity of a VQI. A visual pattern can be considered complex if its components are difficult to identify and separate from each other (Oliva et al. 2004). Several studies in HCI and psychology have found a strong relationship between aesthetic preferences and visual complexity (Berlyne 1974; Reinecke et al. 2013). According to Berlyne's aesthetic theory (Berlyne 1974), the relationship between them follows an inverted U-shaped curve where stimuli of a moderate degree of visual complexity are considered pleasant, but both less and more complex stimuli are considered unpleasant. Note that visual complexity impacts the extraneous cognitive load on end users. Consequently, there are attempts to measure aesthetics automatically. For instance, Miniukovich and De Angeli (2014, 2015) proposed an array of aesthetics metrics to quantify visual complexity such as visual clutter, color variability, contour congestion, and layout quality.

The key components that influence visual complexity in a VQI are the Attribute, Pattern, and Results Panels. Manually constructed VQIs work out all the aesthetics issues associated with these panels manually, resulting in designs that may not always be aesthetically pleasing. Although cognitive load has been considered for canned pattern selection and maintenance in existing work on data-driven VQIs, it has so far been exploited only for selecting individual patterns. The cognitive load imposed by the layout choices of canned patterns and node/edge attribute labels in a VQI has not been explored yet. Furthermore, aesthetically pleasing and cognitive load-aware presentation of query results in the Results Panel is largely unexplored. Although not a focus of this book, if a result subgraph containing matches to a user query looks like a hairball in the Results Panel, then it is challenging for an end user to explore it to trigger a bottom-up search. In summary, the layouts of existing VQIs are not automatically generated by considering the various aesthetics metrics that impact cognitive load. Hence, how can we extend PnP interface construction techniques to be aesthetics-aware? The PnP engine in Fig. 4.1 has to be extended to incorporate a component responsible for addressing aesthetics-aware visual layout design, and the Display method in Algorithm 4.1 (Line 7) needs to be extended to this end. Note that this problem can be reformulated as an optimization problem where the goal is to find an "optimal" layout that minimizes the query formulation task complexity and the visual complexity/cognitive load (measured using aesthetics metrics) of the interface.

Large-scale user study. The user studies reported in this book were carried out in an academic environment and are small in scale. To understand the full potential of PnP interfaces and how they impact real-world end users during top-down and bottom-up searches, it is paramount to undertake comprehensive and large-scale user studies across different application domains and real queries. These studies may reveal novel and interesting challenges that may drive further research on this paradigm.

Beyond Graphs. While this book focuses on PnP interfaces for graphs, it is easy to see that this paradigm is potentially relevant for other data types where visual querying is prevalent. For example, sketch-based visual query interfaces for time series data (i.e., data series) (Correl and Gleicher 2020; Lee et al. 2016; Mannino and Abouzied 2018) provide freehand sketching as an efficient means for query formulation. They enable an end user to convey complex free-form and scaleless patterns of interest, which are then matched against the underlying time series data to identify regions of interest using some notion of similarity. Lee et al. (2020) observe that "sketching a pattern for querying is often ineffective on its own. This is due to the fact that sketching makes the assumption that users know the pattern that they want to sketch and are able to sketch it precisely. However this is typically not the case in practice." Hence, the data-driven construction of VQIs for time series data has the potential to mitigate this challenge by exposing representative objects on a VQI to facilitate both top-down and bottom-up searches. SENSOR (Yan et al. 2022) is a preliminary step to this end.

Beyond VQIs. Lastly, the canned pattern selection and maintenance algorithms reviewed in this book have potential use cases beyond data-driven VQIs. For example, given that these patterns have high coverage and diversity, and low cognitive load, they can be potentially useful for efficiently generating graph summaries that are visualization-friendly (Khan et al. 2017). Due to the cognitive load-consciousness of these patterns in comparison to the topological summaries generated by classical graph summarization techniques, they are potentially more palatable for end users to visualize.

1 https://www.nngroup.com/articles/aesthetic-usability-effect/.

References

D. Berlyne. Studies in the New Experimental Aesthetics. Hemisphere Pub. Corp., Washington D.C., 1974.
S. S. Bhowmick, B. Choi, C. E. Dyreson. Data-driven Visual Graph Query Interface Construction and Maintenance: Challenges and Opportunities. PVLDB, 9(12), 2016.
A. Bonifati, W. Martens, T. Timm. An Analytical Study of Large SPARQL Query Logs. In VLDB, 2017.
M. Correl, M. Gleicher. The Semantics of Sketch: Flexibility in Visual Query Systems for Time Series Data. In IEEE Conference on Visual Analytics Science and Technology (VAST), 2016.
A. De Angeli, A. Sutcliffe, J. Hartmann. Interaction, Usability and Aesthetics: What Influences Users' Preferences? In Proc. of Conference on Designing Interactive Systems, 2006.
A. Khan, S. S. Bhowmick, F. Bonchi. Summarizing Static and Dynamic Big Graphs. PVLDB, 10(12):1981-1984, 2017.
D.J.L. Lee, et al. You Can't Always Sketch What You Want: Understanding Sensemaking in Visual Query Systems. IEEE Trans. Vis. Comput. Graph., 26(1):1267-1277, 2020.
M. Mannino, A. Abouzied. Expressive Time Series Querying with Hand-drawn Scale-free Sketches. In CHI, 2018.
D. D. McCracken, R. J. Wolfe. User-Centered Website Development: A Human-Computer Interaction Approach. Pearson Education Inc., New Jersey, 2004.
A. Miniukovich, A. De Angeli. Computation of Interface Aesthetics. In SIGCHI, 2015.
A. Miniukovich, A. De Angeli. Quantification of Interface Visual Complexity. In Working Conference on Advanced Visual Interfaces, 2014.
A. Oliva, M. L. Mack, M. Shrestha, A. Peeper. Identifying the Perceptual Dimensions of Visual Complexity of Scenes. In Proc. of the 26th Annual Meeting of the Cognitive Science Society, 2004.
K. Reinecke, T. Yeh, L. Miratrix, R. Mardiko, Y. Zhao, J. Liu, K. Z. Gajos. Predicting Users' First Impressions of Website Aesthetics with a Quantification of Perceived Visual Complexity and Colorfulness. In SIGCHI, 2013.
L. Yan, N. Xu, G. Li, S. S. Bhowmick, B. Choi, J. Xu. SENSOR: Data-driven Construction of Sketch-based Visual Query Interfaces for Time Series Data. PVLDB, 15(12), 2022.

