Modeling and Management of Fuzzy Semantic RDF Data (Studies in Computational Intelligence, 1057). ISBN: 3031116682, 9783031116681

This book systematically presents the latest research findings in fuzzy RDF data modeling and management. Fuzziness widely …


English. Pages: 221 [217]. Year: 2022.



Studies in Computational Intelligence 1057

Zongmin Ma Guanfeng Li Ruizhe Ma

Modeling and Management of Fuzzy Semantic RDF Data

Studies in Computational Intelligence Volume 1057

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Computational Intelligence” (SCI) publishes new developments and advances in the various areas of computational intelligence—quickly and with a high quality. The intent is to cover the theory, applications, and design methods of computational intelligence, as embedded in the fields of engineering, computer science, physics and life sciences, as well as the methodologies behind them. The series contains monographs, lecture notes and edited volumes in computational intelligence spanning the areas of neural networks, connectionist systems, genetic algorithms, evolutionary computation, artificial intelligence, cellular automata, self-organizing systems, soft computing, fuzzy systems, and hybrid intelligent systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. Indexed by SCOPUS, DBLP, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.

Zongmin Ma · Guanfeng Li · Ruizhe Ma

Modeling and Management of Fuzzy Semantic RDF Data

Zongmin Ma College of Computer Science and Technology Nanjing University of Aeronautics and Astronautics Nanjing, China

Guanfeng Li College of Information Engineering Ningxia University Yinchuan, China

Ruizhe Ma Department of Computer Science University of Massachusetts Lowell Lowell, MA, USA

ISSN 1860-949X  ISSN 1860-9503 (electronic)
Studies in Computational Intelligence
ISBN 978-3-031-11668-1  ISBN 978-3-031-11669-8 (eBook)
https://doi.org/10.1007/978-3-031-11669-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

In the era of big data, we have witnessed a tremendous increase in the amount of data available. In this context, it has become crucial to develop a common framework for massive data sharing across applications, enterprises, and communities. For this purpose, data should be provided with semantic meaning (through metadata), which enables machines to consume, understand, and reason about the structure and purpose of data. The Resource Description Framework (RDF) recommended by the W3C (World Wide Web Consortium) has quickly gained popularity since its emergence and has become the de facto standard for semantic information representation and exchange. Nowadays, the RDF metadata model is finding increasing usage in a wide range of massive data management scenarios (e.g., knowledge graphs).

With the widespread acceptance of RDF in diverse applications, a considerable amount of RDF data is proliferating and becoming available. RDF and related standards allow intelligent understanding and processing of big data. This creates a new set of data processing requirements involving RDF, such as the need to construct and manage RDF data. For the purpose of RDF construction, various data resources, including traditional databases, XML (Extensible Markup Language) and JSON (JavaScript Object Notation) documents, texts, tabular data such as CSV (comma-separated values) and TSV (tab-separated values), NoSQL (not only SQL) databases, and so on, have been used for automatically constructing RDF models.

RDF data management typically involves two primary technical issues: scalable storage and efficient queries. For more effective queries, it is necessary to index RDF data. All these issues are closely related: indexing of RDF data is built on RDF storage, and efficient querying of RDF data is supported by the indexing structure. Efficient and scalable management of massive RDF data is of increasing importance.
With the wide and in-depth utilization of RDF in diverse application domains, particularities of information management in concrete applications emerge, which can challenge the traditional RDF technologies. In data- and knowledge-intensive applications, one of these challenges can be generalized as the need to deal with uncertain information in RDF data management. In the real world, human knowledge and natural language have a great deal of imprecision and vagueness. With the increasing amount of RDF data that is becoming available, efficient and scalable management of massive RDF data with uncertainty is of crucial importance.

Fuzzy set theory, which has been one of the key means of implementing machine intelligence, has been used in a large number and a wide variety of applications. In order to bridge the gap between human-understandable soft logic and machine-readable hard logic, fuzzy logic cannot be ignored. Fuzzy logic has been introduced into diverse data models for fuzzy data processing. The emergence of the big data era has put essential requirements on dealing with both semantic and fuzzy phenomena. Currently, research on fuzzy logic in RDF knowledge graphs is attracting increasing attention, but the achievements are still few and scattered.

This book goes into great depth concerning the fast-growing topic of technologies and approaches to fuzzy RDF data modeling and management. It covers the representation of fuzzy RDF, the persistence of fuzzy RDF, and the query of fuzzy RDF. Concerning the representation of fuzzy RDF, the multi-granularity fuzziness in the RDF graph and RDF schema is identified, and a set of algebraic operations is defined for the fuzzy RDF model. Concerning the persistence of fuzzy RDF, several storage frameworks built on diverse database models are proposed: the traditional relational and object-oriented database models as well as the emerging NoSQL databases such as the HBase database and the Neo4j database. Concerning the query of fuzzy RDF, fuzzy graph pattern matching and a fuzzy extension mechanism for the SPARQL (SPARQL Protocol and RDF Query Language) query language are investigated. Methods for exact pattern match queries, approximate fuzzy RDF subgraph match queries, and fuzzy quantified queries over fuzzy RDF graphs are proposed.
In addition, an extension of the SPARQL language to query fuzzy RDF graphs is developed.

This book aims to provide a single record of current studies in the field of fuzzy semantic data management with RDF. The objective of this book is to systematically present the state-of-the-art information to researchers, practitioners, and graduate students who need to intelligently deal with big data with uncertainty and, at the same time, to serve the data and knowledge engineering professionals faced with nontraditional applications that make the application of conventional approaches difficult or impossible. Researchers, graduate students, and information technology professionals interested in RDF and fuzzy data processing will find this book a starting point and a reference for their study, research, and development.

We would like to acknowledge all of the researchers in the area of fuzzy data and knowledge engineering. Based on both their publications and many discussions with some of them, their influence on this book is profound. The materials in this book are the outgrowth of research conducted by the authors in recent years. The initial research work was supported by the National Natural Science Foundation of China (62176121, 62066038, 61772269, and 61370075). We are grateful for the financial support from the National Natural Science Foundation of China through several research grant funds. Additionally, the assistance and facilities of the authors' universities are highly appreciated. Special thanks go to Janusz Kacprzyk, the series editor of Studies in Computational Intelligence, and Thomas Ditzinger, the senior editor of Applied Sciences and Engineering of Springer-Verlag, for their advice and help in proposing, preparing, and publishing this book. This book would not have been completed without their support.

Nanjing, China
Yinchuan, China
Lowell, USA
June 2022

Zongmin Ma
Guanfeng Li
Ruizhe Ma

Contents

1 RDF Data and Management .......... 1
  1.1 Introduction .......... 1
  1.2 RDF Data Model .......... 2
    1.2.1 RDF Basic Definitions .......... 3
    1.2.2 RDF Data Model .......... 4
    1.2.3 RDF Semantics .......... 7
  1.3 RDF Query Language SPARQL .......... 10
    1.3.1 The W3C Syntax of SPARQL .......... 10
    1.3.2 The Algebraic Syntax of SPARQL Graph Patterns .......... 12
    1.3.3 Semantics of SPARQL .......... 13
  1.4 RDF Data Store .......... 16
    1.4.1 RDF Stores in Traditional Databases .......... 17
    1.4.2 RDF Stores in Not Only SQL Databases .......... 21
  1.5 Summary .......... 27
  References .......... 27

2 Fuzzy Sets and Fuzzy Database Modeling .......... 33
  2.1 Introduction .......... 33
  2.2 Imperfect Information and Fuzzy Sets .......... 34
    2.2.1 Imperfect Information .......... 35
    2.2.2 Fuzzy Sets .......... 36
    2.2.3 Fuzzy Graph .......... 39
  2.3 Fuzzy Relational Database Models .......... 40
  2.4 Fuzzy Object-Oriented Database Models .......... 43
    2.4.1 Fuzzy Objects .......... 45
    2.4.2 Fuzzy Classes .......... 45
    2.4.3 Fuzzy Object-Class Relationships .......... 46
    2.4.4 Fuzzy Inheritance Hierarchies .......... 51
  2.5 Fuzzy XML Model .......... 53
    2.5.1 Fuzziness in XML Documents .......... 54
    2.5.2 Fuzzy XML Representation Models and Formalizations .......... 55
  2.6 Summary .......... 65
  References .......... 66

3 Fuzzy RDF Modeling .......... 71
  3.1 Introduction .......... 71
  3.2 Fuzzy RDF Graph .......... 73
    3.2.1 Fuzzy Information in RDF Graph .......... 73
    3.2.2 Fuzzy RDF Data Model .......... 74
  3.3 Fuzzy RDF Schema .......... 81
  3.4 Similarity Matching of Fuzzy RDF Graphs .......... 82
    3.4.1 Matching Semantics .......... 83
    3.4.2 Matching Approach .......... 85
  3.5 Algebraic Operations in Fuzzy RDF Graphs .......... 90
    3.5.1 Algebraic Operations .......... 91
    3.5.2 Equivalences .......... 98
    3.5.3 Relationship of SPARQL and the Algebraic Operations .......... 99
  3.6 Summary .......... 105
  References .......... 106

4 Persistence of Fuzzy RDF and Fuzzy RDF Schema .......... 109
  4.1 Introduction .......... 109
  4.2 Fuzzy RDF Mapping to Relational Databases .......... 111
    4.2.1 Fuzzy Triple Stores Model .......... 111
    4.2.2 Fuzzy Horizontal Stores .......... 112
  4.3 Fuzzy RDF Mapping to Object-Oriented Databases .......... 118
    4.3.1 Mapping of Fuzzy Classes .......... 119
    4.3.2 Mapping of Fuzzy Properties .......... 121
    4.3.3 Mapping of Datatypes .......... 123
    4.3.4 Mapping of Fuzzy Instances .......... 125
    4.3.5 Implementation .......... 126
  4.4 Fuzzy RDF Mapping to HBase Databases .......... 127
    4.4.1 Fuzzy RDF Storage in Fuzzy HBase .......... 128
    4.4.2 FHBase-Based RDF Queries .......... 132
    4.4.3 Design and Implementation .......... 140
  4.5 Fuzzy RDF Graph Mapping to Property Graph .......... 141
    4.5.1 Preliminaries .......... 142
    4.5.2 Transform Fuzzy RDF Graph to Property Graph .......... 144
    4.5.3 Query Fuzzy RDF Graph in Neo4j .......... 146
  4.6 Summary .......... 147
  References .......... 148

5 Fuzzy RDF Queries .......... 151
  5.1 Introduction .......... 151
  5.2 Exact Pattern Match Query Over Fuzzy RDF Graph .......... 153
    5.2.1 Graph Pattern Matching Problem .......... 154
    5.2.2 RDF Graph Pattern .......... 155
    5.2.3 Fuzzy Graph Pattern Matching .......... 156
    5.2.4 Query Evaluation Algorithms .......... 160
  5.3 Approximate Fuzzy RDF Subgraph Match Query .......... 166
    5.3.1 Problem Definition .......... 167
    5.3.2 The Matching Algorithm .......... 175
  5.4 Fuzzy Quantified Query Over Fuzzy RDF Graph .......... 183
    5.4.1 Linguistic Quantifier and Fuzzy Quantified Statement .......... 184
    5.4.2 Fuzzy Quantified Graph Patterns Matching .......... 186
    5.4.3 Fuzzy Quantified Graph Patterns Matching .......... 192
  5.5 Extended SPARQL for Fuzzy RDF Query .......... 197
    5.5.1 The Fuzzy Query Language .......... 198
    5.5.2 Implementation Issues .......... 202
  5.6 Summary .......... 204
  References .......... 205

Index .......... 209

Chapter 1

RDF Data and Management

1.1 Introduction

Recent years have witnessed a tremendous increase in the amount of data available on the Web (Hassanzadeh et al., 2012). At the same time, Web 2.0 applications have introduced new forms of data and have radically changed the nature of the modern Web. In these applications, the Web has been transformed from a publish-only environment into a vibrant forum for information exchange (Hassanzadeh et al., 2012). The main purpose of the Semantic Web, proposed by W3C founder Tim Berners-Lee in his description of the future of the Web (Berners-Lee et al., 2001), is to provide a common framework for data sharing across applications, enterprises, and communities. By giving data semantic meaning (through metadata), this framework enables machines to consume, understand, and reason about the structure and purpose of data.

The core of the Semantic Web is built on the Resource Description Framework (RDF) data model (Manola & Miller, 2004). RDF provides a flexible and concise model for representing metadata of resources on the Web. RDF can represent structured as well as unstructured data and is quickly becoming the de facto standard for representation and exchange of information [1] (Duan et al., 2011). Nowadays, the RDF data model is finding increasing use in a wide range of Web data-management scenarios, and its use is now wider than the Semantic Web. Governments (e.g., from the United States [2] and the United Kingdom [3]) and large companies and organizations (e.g., the New York Times [4], the BBC [5], and Best Buy [6]) have started using RDF as a business data model and representation format, either for semantic data integration, search-engine optimization, and better product search, or to represent data from information extraction. Yago (Suchanek et al., 2008) and DBpedia (Bizer et al., 2009) extract facts from Wikipedia automatically and store them in RDF format to support structural queries over Wikipedia; biologists encode their experiments and results using RDF to communicate among themselves, leading to RDF data collections such as Bio2RDF (bio2rdf.org) and UniProt RDF (dev.isb-sib.ch/projects/uniprot-rdf). Furthermore, in the Linked Open Data (LOD) cloud (Bizer et al., 2009), Web data from a diverse set of domains like Wikipedia, films, geographic locations, and scientific data are linked to provide one large RDF data cloud.

With the increasing amount of RDF data which is becoming available, efficient and scalable management of RDF data is of crucial importance. As a new data model, the RDF data-representation format largely determines how to store and index RDF data and furthermore influences how to query RDF data. Management of RDF data typically involves two primary technical challenges: scalable storage and efficient queries. Among these two issues, RDF data storage provides the infrastructure for RDF data management. Many proposals for RDF queries have been developed based on diverse query policies (Ali et al., 2021), such as fuzzy queries (Ma et al., 2016a, 2016b, 2016c), approximate queries (Yan et al., 2017), keyword queries (Ma et al., 2018), natural language queries (Hu et al., 2017), and so on. With the RDF format gaining widespread acceptance, much work is being done in RDF data management, and a number of research efforts have been undertaken to address these issues.

[1] http://www.w3.org/RDF
[2] http://www.data.gov/
[3] http://www.data.gov.uk/
[4] http://data.nytimes.com/
[5] http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup2010_dynamic_sem.html
[6] http://www.chiefmartec.com/2009/12/best-buy-jump-starts-data-webmarketing.html
Some RDF data-management systems have started to emerge, such as Sesame (Broekstra et al., 2002), Jena-TDB (Wilkinson et al., 2003), Virtuoso (Erling & Mikhailov, 2007, 2009), 4store (Harris et al., 2009), BigOWLIM (Bishop et al., 2011), SPARQLcity/SPARQLverse [7], MarkLogic [8], Clark & Parsia/Stardog [9], and Oracle Spatial and Graph with Oracle Database 12c [10]. BigOWLIM was renamed to OWLIM-SE and later to GraphDB. In addition, some research prototypes have been developed, e.g., RDF-3X (Neumann & Weikum, 2008, 2010), SW-Store (Abadi et al., 2007, 2009), and RDFox [11].

[7] http://sparqlcity.com/
[8] http://www.marklogic.com/
[9] http://clarkparsia.com/
[10] http://www.oracle.com/us/products/database/options/spatial/overview/index.html
[11] http://www.cs.ox.ac.uk/isg/tools/RDFox/

1.2 RDF Data Model

The purpose of the Semantic Web is to add semantic support to the existing Web, so that machines can understand the meaning of information and thus realize the intelligent processing of Web information. This requires that machines be provided with data describing Web data, that is, metadata. A universal metadata model, RDF, came into being. RDF is a framework for metadata and the cornerstone of the Semantic Web. It provides interoperability between applications using machine-understandable Web data.

1.2.1 RDF Basic Definitions

RDF is a W3C Recommendation that has rapidly gained popularity. RDF provides a means of expressing and exchanging semantic metadata (i.e., data that specify semantic information about data). By representing and processing metadata about information sources, RDF defines a model for describing relationships among resources in terms of uniquely identified attributes and values. In the RDF data model, the universe is modelled as a set of resources, where a resource is anything that has a uniform resource identifier (URI), including all information on the Web, virtual concepts, or things in the real world, such as movies, screenwriters, directors, countries, etc. A resource can be described using a set of RDF statements in the form of (subject, predicate, object) triples. Here the subject is the resource being described, the predicate is the property being described with respect to the resource, and the object is the value of the property. An RDF data set consists of these statements. For example, the natural-language expression "The director of the movie Dinner is Barry Levinson" can be expressed by an RDF statement with:

• a subject: http://www.example.org/director/BarryLevinson
• a predicate: http://www.example.org/dc/elements/direct
• and an object: http://www.example.org/film/Dinner

Here, URIs are used to identify the subject, predicate, and object of the statement. Note that both subjects and objects can be anonymous, known as blank nodes. RDF uses these triples to describe resources and attach additional semantic information to the resources.

It is possible to annotate RDF data with semantic metadata using RDFS (RDF Schema) or OWL, both of which are W3C standards. This annotation primarily enables reasoning over the RDF data (called entailment), which we do not consider in this book. However, as we will see below, it also impacts data organization in some cases, and the metadata can be used for semantic query optimization.

We illustrate the fundamental concepts with simple examples using RDFS, which allows the definition of classes and class hierarchies. RDFS has built-in class definitions, the more important ones being rdfs:Class and rdfs:subClassOf, which are used to define a class and a subclass, respectively. To specify that an individual resource is an element of a class, a special property, rdf:type, is used. For example, if we wanted to define a class called Movies and two subclasses ActionMovies and Dramas, this would be accomplished in the following way:


Movies rdf:type rdfs:Class .
ActionMovies rdfs:subClassOf Movies .
Dramas rdfs:subClassOf Movies .

In addition, the RDF specification includes a built-in vocabulary with a normative semantics (RDFS). This vocabulary deals with inheritance of classes and properties, as well as typing, among other features (Brickley & Guha, 2004).
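Such class and instance statements are themselves just (subject, predicate, object) triples, so subclass-aware membership can be computed by a simple graph traversal. The following is a minimal sketch in Python, not an RDF library: the helper names and the extra Dinner instance triple are assumptions added for illustration.

```python
# Triples modeled as plain 3-tuples of strings; names are illustrative only.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

triples = {
    ("Movies", TYPE, "rdfs:Class"),
    ("ActionMovies", SUBCLASS, "Movies"),
    ("Dramas", SUBCLASS, "Movies"),
    ("Dinner", TYPE, "Dramas"),  # assumed instance triple, for illustration
}

def superclasses(cls):
    """All classes reachable from `cls` via rdfs:subClassOf (transitive)."""
    seen, stack = set(), [cls]
    while stack:
        c = stack.pop()
        for s, p, o in triples:
            if s == c and p == SUBCLASS and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen

def is_instance_of(resource, cls):
    """True if `resource` has rdf:type `cls`, directly or via a subclass chain."""
    direct = {o for s, p, o in triples if s == resource and p == TYPE}
    return cls in direct or any(cls in superclasses(d) for d in direct)
```

With the triples above, `is_instance_of("Dinner", "Movies")` holds because Dinner is typed as Dramas and Dramas is a subclass of Movies; this is exactly the inference that RDFS entailment licenses.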

1.2.2 RDF Data Model

1.2.2.1 RDF Graphs

In this section, we introduce an abstract version of the RDF data model, which is both a fragment faithfully following the original specification and an abstraction suitable for formal analysis. The abstract syntax of the RDF model is a set of triples. Formally, an RDF triple is defined as (s, p, o) ∈ (U ∪ B) × U × (U ∪ L ∪ B), where U, B, and L are infinite sets of URIs, blank nodes, and RDF literals, respectively. In a triple (s, p, o), s is called the subject, p the predicate (or property), and o the object. The interpretation of a triple statement is that subject s has property p with value o. Thus, an RDF triple can be seen as representing an atomic "fact" or a "claim". Note that any object in one triple, say o_i in (s_i, p_i, o_i), can play the role of a subject in another triple, say (o_i, p_j, o_j). Therefore, RDF data is a directed, labelled graph data format for representing Web resources.

There are many syntaxes available for writing and serializing RDF data, such as N-Triples [12], RDF/XML [13], RDFa [14], JSON-LD [15], Notation 3 (N3) [16], Turtle [17], and so on. This section mainly introduces three common RDF representation methods: N-Triples, RDF/XML, and graph-based representation grammar. Suppose there are the following three statements in the RDF data:

Statement 1: Barry Levinson is the director of Dinner.
Statement 2: Barry Levinson's age is 77.
Statement 3: Barry Levinson's nationality is USA.

[12] http://www.w3.org/TR/2004/REC-rdf-testcases-20040210/#ntriples
[13] http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
[14] https://www.w3.org/TR/rdfa-primer/
[15] https://json-ld.org/
[16] http://www.w3.org/TeamSubmission/n3/
[17] http://www.w3.org/TeamSubmission/turtle/

(a) N-Triples representation grammar

N-Triples aims to express RDF in a concise and intuitive syntax and provides shortcuts to commonly used RDF functions. This grammar is based on the definition of a statement. A statement consists of three parts: subject, predicate, and object. Therefore, each RDF statement is expressed as a triple. The above three statements are expressed in N-Triples grammar as follows:

<http://www.example.org/film/Dinner> <http://www.example.org/dc/elements/direct> <http://www.example.org/director/BarryLevinson> .
<http://www.example.org/director/BarryLevinson> <http://www.example.org/dc/elements/age> "77" .
<http://www.example.org/director/BarryLevinson> <http://www.example.org/dc/elements/nationality> "USA" .
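Every N-Triples statement has the same line shape: three terms followed by a terminating dot. A simplified parser can be sketched as below; this is an illustration, not the full W3C N-Triples grammar, since it ignores blank nodes, language tags, datatype suffixes, and escape sequences.

```python
import re

# One URI term <...> or one plain literal term "..." (simplified).
TERM = r'(?:<([^>]*)>|"([^"]*)")'
LINE = re.compile(r'\s*' + TERM + r'\s+' + TERM + r'\s+' + TERM + r'\s*\.\s*$')

def parse_ntriple(line):
    """Return (subject, predicate, object) from one simplified N-Triples line."""
    m = LINE.match(line)
    if m is None:
        raise ValueError("not a recognised triple: %r" % (line,))
    g = m.groups()  # six groups: (uri, literal) for each of the three terms
    return tuple(g[i] if g[i] is not None else g[i + 1] for i in (0, 2, 4))

triple = parse_ntriple(
    '<http://www.example.org/director/BarryLevinson> '
    '<http://www.example.org/dc/elements/age> "77" .'
)
# triple == ('http://www.example.org/director/BarryLevinson',
#            'http://www.example.org/dc/elements/age', '77')
```

The regular expression mirrors the grammar directly: a term is either a URI in angle brackets or a quoted literal, and a statement is exactly three terms plus a dot.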

As shown above, http://www.example.org/film/Dinner and http://www.example.org/director/BarryLevinson are two subjects; http://www.example.org/dc/elements/direct, http://www.example.org/dc/elements/age, and http://www.example.org/dc/elements/nationality are three predicates. "77" and "USA" are property values, written in quotation marks. If RDF data is expressed in this syntax, an RDF data set consists of many such triples.

(b) XML-based representation grammar

Since the RDF triple model is abstract in nature, W3C introduced a standard called RDF/XML. As the name suggests, RDF/XML involves encoding RDF in XML format. The core principle behind RDF/XML is to use XML tools to create and parse RDF serializations. RDF/XML is the recommended syntax for applications to exchange RDF information. To express RDF in XML syntax, several tags need to be defined. The rdf:RDF root tag represents the beginning of an RDF document, and the content of the tag is a series of descriptions. The rdf:Description tag represents a description of a resource; as an attribute of this tag, rdf:about indicates the ID of the resource to be described. The subtags of the rdf:Description tag are the attribute names of the resource to be described, and the content of the subtags is the corresponding attribute value. The rdf:resource attribute indicates that the value of an attribute is a resource, giving the ID of the referenced resource. The three statements in the above example are expressed in XML-based syntax as follows:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://www.example.org/dc/elements/">
  <rdf:Description rdf:about="http://www.example.org/film/Dinner">
    <dc:direct rdf:resource="http://www.example.org/director/BarryLevinson"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://www.example.org/director/BarryLevinson">
    <dc:age>77</dc:age>
    <dc:nationality>USA</dc:nationality>
  </rdf:Description>
</rdf:RDF>



[Figure: a graph with a node for http://www.example.org/film/Dinner linked by a direct edge to a node for its director, which in turn has age and nationality edges to literal-valued nodes]

Fig. 1.1 Graphic-based RDF syntax representation

(c) Graph-based representation grammar

As the name RDF graph already hints, the RDF data model is essentially a graph-based data model, albeit with special features, such as vertices that can act as edge labels, and no fundamental distinction between schemas and instances, which can be represented in one and the same graph (Ma et al., 2016a, 2016b, 2016c). In a finite set of RDF triples, any object of one triple can play the role of a subject in another triple, which amounts to chaining two labeled edges in a graph-based structure. As such, RDF triple datasets can naturally be represented as a directed labeled RDF graph, with each vertex corresponding to a subject (or object) and each edge representing a predicate. Figure 1.1 shows the corresponding RDF graph of the above example. In Fig. 1.1, resources are represented by nodes, and the three attributes are used as directed edges. The literal values "77" and "USA" are also represented by nodes. Note that in RDF, a qualified name (QName) can be used to simplify a URIref. Unless stated otherwise, in the later chapters of this book all URIrefs involved are in simplified form.

1.2.2.2 RDF Schema

The RDF specification includes a set of reserved words, the RDFS vocabulary [RDF Schema (Brickley & Guha, 2004)], which is designed to describe relationships between resources and properties like attributes of resources (traditional attribute-value pairs). Roughly speaking, this vocabulary can be conceptually divided into the following groups:
(a) A set of properties, which are binary relations between subject resources and object resources: rdfs:subPropertyOf (denoted by sp in this book), rdfs:subClassOf (sc), rdfs:domain (dom), rdfs:range (range) and rdf:type (type).
(b) A set of classes, which denote sets of resources. Elements of a class are known as instances of that class. To state that a resource is an instance of a class, the reserved word type may be used.


(c) Other functionalities, like a system of classes and properties to describe lists, and a system for doing reification.
(d) Utility vocabulary used to document, comment, etc. [the complete vocabulary can be found in Brickley and Guha (2004)].
The groups in (b), (c) and (d) have a light semantics, essentially describing their internal relationships in the ontological design of the system of classes of RDFS. Their semantics is defined by a set of "axiomatic triples" (Hayes, 2004) which express the relationships among these reserved words. All axiomatic triples are "structural", in the sense that they do not refer to external data. Much of this semantics corresponds to what in standard languages is captured via typing. On the contrary, group (a) is formed by predicates whose intended meaning is non-trivial and which are designed to relate individual pieces of data external to the vocabulary of the language. Their semantics is defined by rules which involve variables (to be instantiated by actual data). For example, rdfs:subClassOf (sc) is a reflexive and transitive binary property; when combined with rdf:type (type), it specifies that the type of an individual (a class) can be lifted to that of a superclass. Group (a) forms the core of the RDF language and, from a theoretical point of view, it has been shown to be a very stable core to work with [the detailed arguments supporting this claim are given in Munoz et al. (2007)]. Thus, throughout this chapter we focus on the fragment of RDFS given by the set of keywords {sp, sc, type, dom, range}.
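The deductive reading of sc and type can be made concrete with a small sketch (our own illustration, not taken from the cited works): a naive fixpoint computation in Python, where the strings "sc" and "type" stand for the RDFS keywords and the writer data is a hypothetical toy example.

```python
# Naive forward chaining over the sc (rdfs:subClassOf) and type rules:
# (c1 sc c2), (c2 sc c3) => (c1 sc c3)        transitivity of sc
# (x type c1), (c1 sc c2) => (x type c2)      type lifting to a superclass

def rdfs_closure(triples):
    """Apply the two rules until no new triple can be derived."""
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s1, p1, o1) in closure:
            for (s2, p2, o2) in closure:
                if p1 == p2 == "sc" and o1 == s2:
                    new.add((s1, "sc", o2))
                if p1 == "type" and p2 == "sc" and o1 == s2:
                    new.add((s1, "type", o2))
        if not new <= closure:
            closure |= new
            changed = True
    return closure

triples = {("Novelist", "sc", "Writer"), ("Writer", "sc", "Person"),
           ("Henry", "type", "Novelist")}
inferred = rdfs_closure(triples)
print(("Henry", "type", "Person") in inferred)  # True: type lifted along the sc chain
```

The sketch ignores reflexivity of sc and the sp, dom, and range rules, which would be handled analogously.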

1.2.3 RDF Semantics

In this section, we present the formalization of the semantics of RDF. The normative semantics for RDF graphs is given in Hayes (2004), and the mathematical formalization in Marin (2004) follows the standard classical treatment in logic, with the notions of model, interpretation, entailment, and so on. Model theory assumes that the language refers to a 'world', and describes the minimal conditions that a world must satisfy in order to assign an appropriate meaning to every expression in the language. A particular world is called an interpretation, so that model theory might be better called 'interpretation theory'. The idea is to provide an abstract, mathematical account of the properties that any such interpretation must have, making as few assumptions as possible about its actual nature or intrinsic structure, thereby retaining as much generality as possible. All interpretations are relative to a set of names, called the vocabulary of the interpretation, so that one should speak, strictly, of an interpretation of an RDF vocabulary, rather than of RDF itself. Some interpretations may assign special meanings to the symbols in a particular vocabulary. Interpretations which share the special meaning of a particular vocabulary will be named for that vocabulary, e.g. 'rdf-interpretations', 'rdfs-interpretations', etc. An interpretation with no particular extra


conditions on a vocabulary (including the RDF vocabulary itself) will be called a simple interpretation, or simply an interpretation. Next, we present the simplification of the normative semantics proposed in Munoz et al. (2007). An RDF interpretation is a tuple I = (Res, Prop, Class, PExt, CExt, Int), where
(1) Res is a nonempty set of resources, called the domain or universe of I;
(2) Prop is a set of property names (not necessarily disjoint from Res);
(3) Class ⊆ Res is a distinguished subset of Res whose elements denote classes of resources;
(4) PExt: Prop → 2^(Res × Res) is a mapping that assigns an extension to each property name;
(5) CExt: Class → 2^Res is a mapping that assigns a set of resources to every resource denoting a class;
(6) Int: U → Res ∪ Prop, the interpretation mapping, assigns a resource or a property name to each element of U.
Intuitively, a ground triple (s, p, o) in a graph G is true under the interpretation I if p is interpreted as a property name, s and o are interpreted as resources, and the interpretation of the pair (s, o) belongs to the extension of the property assigned to p. Formally, we say that I satisfies the ground triple (s, p, o) if Int(p) ∈ Prop and (Int(s), Int(o)) ∈ PExt(Int(p)). An interpretation must also satisfy additional conditions induced by the usage of the RDFS vocabulary. For example, an interpretation satisfying the triple (c1, sc, c2) must interpret c1 and c2 as classes of resources, and must assign to c1 a subset of the set assigned to c2. More formally, we say that I satisfies (c1, sc, c2) if Int(c1), Int(c2) ∈ Class and CExt(Int(c1)) ⊆ CExt(Int(c2)). Blank nodes work as existential variables. Intuitively, a triple (x, p, o), where x is a blank node, would be true under I if there exists a resource s such that (s, p, o) is true under I.
An arbitrary element can be chosen when interpreting a blank node, with the restriction that all occurrences of the same blank node in an RDF graph must be replaced by the same value. To formally deal with blank nodes, an extension of the interpretation mapping Int is used. Let A: B → Res be a function from blank nodes to resources. Then Int_A: U ∪ B → Res is defined as the extension of the function Int: Int_A(x) = A(x) for x ∈ B, and Int_A(x) = Int(x) for x ∈ U. We next formalize the notion of model for an RDF graph (Hayes, 2004; Munoz et al., 2007). We say that the RDF interpretation I = (Res, Prop, Class, PExt, CExt, Int) is a model of (is an interpretation for) an RDF graph G, denoted by I |= G, if the following conditions hold.
Simple Interpretation:
• there exists a function A: B → Res such that for each (s, p, o) ∈ G, it holds that Int(p) ∈ Prop and (Int_A(s), Int_A(o)) ∈ PExt(Int(p)).
Properties and Classes:
• Int(sp), Int(sc), Int(type), Int(dom), Int(range) ∈ Prop,
• if (x, y) ∈ PExt(Int(dom)) ∪ PExt(Int(range)), then x ∈ Prop and y ∈ Class.
Sub-property:
• PExt(Int(sp)) is transitive and reflexive over Prop,
• if (x, y) ∈ PExt(Int(sp)), then x, y ∈ Prop and PExt(x) ⊆ PExt(y).


[Figure: an RDF graph whose nodes include Writer, Novelist, Henry, a blank node X, and boule de suif, connected by labeled edges such as type, sp (between writes and creates), writes, and issuing time (with value 1880)]

Fig. 1.2 Example of an RDF graph

Sub-class:
• PExt(Int(sc)) is transitive and reflexive over Class,
• if (x, y) ∈ PExt(Int(sc)), then x, y ∈ Class and CExt(x) ⊆ CExt(y).
Typing:
• (x, y) ∈ PExt(Int(type)) if and only if y ∈ Class and x ∈ CExt(y),
• if (x, y) ∈ PExt(Int(dom)) and (u, v) ∈ PExt(x), then u ∈ CExt(y),
• if (x, y) ∈ PExt(Int(range)) and (u, v) ∈ PExt(x), then v ∈ CExt(y).

Example 1.1 Figure 1.2 shows an RDF graph storing information about writers. All the triples in the graph are composed of elements in U, except for the triples containing the blank node X. Consider now the interpretation I = (Res, Prop, Class, PExt, CExt, Int) defined as follows:

• Res = {Writer, Henry, Novelist, creates, writes, boule de suif, 1880}.
• Prop = {creates, writes, issuing time, type, sp, sc, dom, range}.
• Class = {Writer, Novelist}.
• PExt is such that:
  – PExt(writes) = PExt(creates) = {(Henry, boule de suif)},
  – PExt(issuing time) = {(boule de suif, 1880)},
  – PExt(type) = {(Henry, Writer), (Henry, Novelist)},
  – PExt(sp) = {(writes, creates)} ∪ {(x, x) | x ∈ Prop},
  – PExt(sc) = {(Novelist, Writer), (Novelist, Novelist), (Writer, Writer)},
  – PExt(dom) = PExt(range) = ∅.
• CExt is such that CExt(Novelist) = CExt(Writer) = {Henry}.
• Int is the identity mapping over Res ∪ Prop.
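The satisfaction condition for ground triples is mechanical enough to check by machine. The following sketch (our own; Int is the identity mapping, as in the example) hard-codes the interpretation of Example 1.1 and tests whether I satisfies a ground triple (s, p, o):

```python
# I satisfies (s, p, o) iff Int(p) ∈ Prop and (Int(s), Int(o)) ∈ PExt(Int(p)).
# Since Int is the identity here, the check reduces to set lookups.

Prop = {"creates", "writes", "issuing time", "type", "sp", "sc", "dom", "range"}
PExt = {
    "writes": {("Henry", "boule de suif")},
    "creates": {("Henry", "boule de suif")},
    "issuing time": {("boule de suif", "1880")},
    "type": {("Henry", "Writer"), ("Henry", "Novelist")},
    "sp": {("writes", "creates")} | {(x, x) for x in
           {"creates", "writes", "issuing time", "type", "sp", "sc", "dom", "range"}},
    "sc": {("Novelist", "Writer"), ("Novelist", "Novelist"), ("Writer", "Writer")},
    "dom": set(),
    "range": set(),
}

def satisfies(s, p, o):
    """Simple-interpretation condition for a ground triple."""
    return p in Prop and (s, o) in PExt.get(p, set())

print(satisfies("Henry", "writes", "boule de suif"))   # True
print(satisfies("Henry", "creates", "boule de suif"))  # True, since writes sp creates
print(satisfies("boule de suif", "writes", "Henry"))   # False
```

Checking the full model conditions (transitivity of sp and sc, the typing rules, and the existential treatment of blank nodes) would follow the same pattern, one Boolean test per bullet above.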


1.3 RDF Query Language SPARQL

In 2004, the RDF Data Access Working Group, part of the W3C Semantic Web Activity, released a first public working draft of a query language for RDF, called SPARQL (Prud'hommeaux & Seaborne, 2008). The name SPARQL is a recursive acronym that stands for SPARQL Protocol and RDF Query Language. Since then, SPARQL has been rapidly adopted as the standard for querying Semantic Web data, and in January 2008 SPARQL became a W3C Recommendation. In this section, we give a detailed description of the syntax and semantics of SPARQL. RDF is a directed labeled graph data format and, thus, SPARQL is essentially a graph-matching query language. We start by focusing on the syntax of SPARQL as given in the W3C specification, then introduce an algebraic syntax for the language and compare it with the official syntax, and finally formalize the semantics of SPARQL.

1.3.1 The W3C Syntax of SPARQL

The syntax and semantics of SPARQL are specified by the RDF Data Access Working Group (Prud'hommeaux & Seaborne, 2008). SPARQL is a language designed to query data in the form of sets of triples, namely RDF graphs. The basic engine of the language is a pattern matching facility, which uses graph pattern matching functionalities (sets of triples can also be viewed as graphs). From a syntactic point of view, the SPARQL language is similar to SQL, and its overall structure consists of three main blocks. The pattern matching part includes several interesting features of pattern matching over graphs, like optional parts, union of patterns, nesting, filtering values of possible matchings, and the possibility of choosing the data source to be matched by a pattern. The solution modifiers, once the output of the pattern has been computed (in the form of a table of values of variables), allow one to modify these values by applying classical operators like projection, distinct, order and limit. Finally, the output of a SPARQL query can be of different types: yes/no queries, selections of values of the variables which match the patterns, construction of new RDF data from these values, and descriptions of resources. In order to present the language, we follow the grammar given in Fig. 1.3, which specifies the basic structure of the SPARQL Query Grammar (Prud'hommeaux & Seaborne, 2008). There are several basic concepts used in the definition of the syntax of SPARQL, many of which are taken from the RDF specification with some minor modifications. For denoting resources, SPARQL uses IRIs instead of the URIs of RDF. Anything represented by a literal could also be represented by an IRI, but it is often more convenient or intuitive to use literals. In what follows, we explain each component of the language in more detail. Of course, for ultimate details the reader should consult the W3C Recommendation (Prud'hommeaux & Seaborne, 2008).


Fig. 1.3 A fragment of the SPARQL query grammar (Prud’hommeaux & Seaborne, 2008)

As shown in Fig. 1.3, a SPARQL Query is given by a Prologue followed by any of the four types of SPARQL queries: SelectQuery, ConstructQuery, DescribeQuery or AskQuery. The Prologue contains the declaration of variables, namespaces, and abbreviations to be used in the query. The SELECT clause in a SelectQuery selects a group of variables, or all of them using, as in SQL, the wildcard *. In this type of query, one can eliminate duplicate solutions using DISTINCT. In a ConstructQuery, the CONSTRUCT form, and more specifically the ConstructTemplate form, is used to construct an RDF graph from the obtained solutions. In a DescribeQuery, the DESCRIBE form is not normative (only informative); it is intended to describe the specified variables or IRIs, i.e., it returns all the triples in the dataset involving these resources. In an AskQuery, the ASK form has no parameters but the dataset to be queried and a WHERE clause; it returns TRUE if the solution set is not empty, and FALSE otherwise. In a SPARQL query, the DatasetClause allows one to specify one graph (the DefaultGraphClause) or a set of named graphs, i.e., a set of pairs of identifiers and graphs, which are the data sources to be used when computing the answer to the query. Moreover, the WHERE clause is used to indicate how the information from the data sources is to be filtered, and it can be considered the central component of the query language. It specifies the pattern to be matched against the data sources. In particular, it includes sets of triples with some of the IRIs or blank elements replaced by variables, called "triple blocks" (TB in the grammar), an operator for collecting triples and blocks (denoted by {A. B}, and with no fixed arity), an operator UNION for specifying alternatives, an operator OPTIONAL to provide optional matchings, and an operator FILTER that allows filtering the results of patterns under certain basic constraints.


Example 1.2 (Pérez et al., 2006a, 2006b) Consider the following query: "Give the name and the mailbox of each person who has a mailbox with domain .cl". This query can be expressed in SPARQL as follows:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ex: <http://example.org/>
SELECT ?name ?mbox
FROM <myDataSource.rdf>
WHERE { ?x foaf:name ?name.
        ?x foaf:mbox ?mbox.
        ?mbox ex:domain ".cl" }

The first two lines in this example form the Prologue of the query, which specifies the namespaces to be used. In this case, one is the well-known FOAF ontology, and the other one is an example namespace. The keywords foaf and ex are abbreviations for the namespaces, which are used in the body of the query. The SELECT keyword indicates that the query returns a table with two columns, corresponding to the values obtained from the matching of the variables ?name and ?mbox against the graph pointed to in the FROM clause (myDataSource.rdf), and according to the pattern described in the WHERE clause. It should be noticed that a string starting with the symbol ? denotes a variable in SPARQL. In the above query, the WHERE clause is composed of a pattern with three triples: ?x foaf:name ?name, ?x foaf:mbox ?mbox and ?mbox ex:domain ".cl", where ".cl" is a literal. This pattern indicates that one is looking for the elements ?x, ?name and ?mbox in the RDF graph myDataSource.rdf such that the foaf:name of ?x is ?name, the foaf:mbox of ?x is ?mbox and the ex:domain of ?mbox is ".cl". Thus, an expression of the form {A. B} in SPARQL denotes the conjunction of A and B, as this expression holds if both A and B hold.

1.3.2 The Algebraic Syntax of SPARQL Graph Patterns

RDF is a directed labeled graph data format and, thus, SPARQL is essentially a graph-matching query language. In this section, we present the algebraic syntax of the core fragment of SPARQL graph patterns proposed in Arenas et al. (2009) and Pérez et al. (2006a, 2006b, 2009), which has been shown to be equivalent in expressive power to the core fragment of SPARQL. This formalization is used in this chapter to give a formal semantics to SPARQL. The official syntax of SPARQL (Prud'hommeaux & Seaborne, 2008) considers the operators OPTIONAL, UNION, FILTER, and concatenation via a point symbol (.) to construct graph pattern expressions. The syntax also uses {} to group patterns, and some implicit rules of precedence and association. For example, the


point symbol (.) has precedence over OPTIONAL, and OPTIONAL is left associative. In order to avoid ambiguities in the parsing of expressions, Pérez and Arenas et al. present a syntax of SPARQL graph patterns in a more traditional algebraic formalism, using the binary operators AND (.), UNION (UNION), OPT (OPTIONAL), and FILTER (FILTER), and fully parenthesizing expressions to make the precedence and association of operators explicit. Assume the existence of a set of variables V disjoint from U. A SPARQL graph pattern expression is defined recursively as follows:
(a) A tuple from (U ∪ V) × (U ∪ V) × (U ∪ V) is a graph pattern (a triple pattern).
(b) If P1 and P2 are graph patterns, then the expressions (P1 AND P2), (P1 OPT P2), and (P1 UNION P2) are graph patterns (conjunction graph pattern, optional graph pattern, and union graph pattern, respectively).
(c) If P is a graph pattern and R is a SPARQL built-in condition, then the expression (P FILTER R) is a graph pattern (a filter graph pattern).
A SPARQL built-in condition is constructed using elements of the set U ∪ V and constants, logical connectives (¬, ∧, ∨), inequality symbols (<, ≤, ≥, >), the equality symbol (=), unary predicates like bound, isBlank, and isIRI, plus other features [see Prud'hommeaux and Seaborne (2008) for a complete list]. In this chapter, we restrict ourselves to the fragment where the built-in condition is a Boolean combination of terms constructed by using = and bound, that is:
(a) If ?X, ?Y ∈ V and c ∈ U, then bound(?X), ?X = c and ?X = ?Y are built-in conditions.
(b) If R1 and R2 are built-in conditions, then (¬R1), (R1 ∨ R2) and (R1 ∧ R2) are built-in conditions.
In the rest of the book, we use var(*) to denote the set of variables occurring in *, where * is a SPARQL graph pattern P or a built-in condition R. We conclude the definition of the algebraic framework by describing the formal syntax of the SELECT query result form. A SELECT SPARQL query is simply a tuple (W, P), where P is a SPARQL graph pattern expression and W is a set of variables such that W ⊆ var(P).

1.3.3 Semantics of SPARQL

In this section, we present a streamlined version of the core fragment of SPARQL with a precise algebraic syntax and a formal compositional semantics based on Pérez et al. (2006a, 2006b). The definition of a formal semantics for SPARQL has played a key role in the standardization process of this query language. Although, taken one by one, the features of SPARQL are intuitive and simple to describe and understand, the combination of them makes SPARQL a complex language. Reaching a consensus in the W3C standardization process about a formal semantics for SPARQL was not an easy task. The initial efforts to define SPARQL were driven


by use cases, mostly by specifying the expected output for particular example queries. In fact, the interpretations of examples and the exact outcomes of cases not covered in the initial drafts of the SPARQL specification were a matter of long discussions on the W3C mailing lists. Pérez et al. (2006a, 2006b) presented one of the first formalizations of a semantics for a fragment of the language. Currently, the official specification of SPARQL (Prud'hommeaux & Seaborne, 2008), endorsed by the W3C, formalizes a semantics based on Pérez et al. (2006a, 2006b). The semantics of SPARQL is formalized by using partial mappings between variables in the patterns and actual values in the RDF graph being queried. To define the semantics of SPARQL graph pattern expressions, we need to introduce some terminology. A mapping μ from V to U is a partial function μ: V → U. The domain of μ, denoted by dom(μ), is the subset of V where μ is defined. The empty mapping μ∅ is the mapping such that dom(μ∅) = ∅. Given a triple pattern t and a mapping μ such that var(t) ⊆ dom(μ), μ(t) is the triple obtained by replacing the variables in t according to μ. Similarly, given a basic graph pattern P and a mapping μ such that var(P) ⊆ dom(μ), we have μ(P) = ∪_{t∈P} {μ(t)}, i.e. μ(P) is the set of triples obtained by replacing the variables in the triples of P according to μ. To define the semantics of more complex patterns, we need some more notions. Two mappings μ1 and μ2 are compatible when for all ?X ∈ dom(μ1) ∩ dom(μ2) it is the case that μ1(?X) = μ2(?X), i.e. when μ1 ∪ μ2 is also a mapping. Intuitively, μ1 and μ2 are compatible if μ1 can be extended with μ2 to obtain a new mapping, and vice versa. Note that two mappings with disjoint domains are always compatible, and that the empty mapping μ∅ (i.e. the mapping with empty domain) is compatible with any other mapping. Let Ω1 and Ω2 be sets of mappings. The join of, the union of and the difference between Ω1 and Ω2 are defined as:

Ω1 ⨝ Ω2 = {μ1 ∪ μ2 | μ1 ∈ Ω1, μ2 ∈ Ω2 and μ1, μ2 are compatible mappings},
Ω1 ∪ Ω2 = {μ | μ ∈ Ω1 or μ ∈ Ω2},
Ω1 \ Ω2 = {μ ∈ Ω1 | for all μ′ ∈ Ω2, μ and μ′ are not compatible}.

Based on the previous operators, the left outer-join is defined as:

Ω1 ⟕ Ω2 = (Ω1 ⨝ Ω2) ∪ (Ω1 \ Ω2).

Intuitively, Ω1 ⨝ Ω2 is the set of mappings that result from extending mappings in Ω1 with their compatible mappings in Ω2, and Ω1 \ Ω2 is the set of mappings in Ω1 that cannot be extended with any mapping in Ω2. The operation Ω1 ∪ Ω2 is the usual set-theoretical union. A mapping μ is in Ω1 ⟕ Ω2 if it is the extension of a mapping of Ω1 with a compatible mapping of Ω2, or if it belongs to Ω1 and cannot be extended with any mapping of Ω2. These operations resemble relational algebra operations over sets of mappings (partial functions). We are ready to define the semantics of graph pattern expressions as a function [[·]]G which takes a pattern expression and returns a set of mappings. We follow the approach in Gutierrez et al. (2011), defining the semantics as the set of mappings that


matches the graph G. For the sake of readability, the semantics of filter expressions is presented in a separate definition. Let G be an RDF graph and P be a graph pattern. The evaluation of P over G, denoted by [[P]]G, is defined recursively as follows (Arenas et al., 2009):
(a) if P is a triple pattern t, then [[P]]G = {μ | dom(μ) = var(t) and μ(t) ∈ G}.
(b) if P is (P1 AND P2), then [[P]]G = [[P1]]G ⨝ [[P2]]G.
(c) if P is (P1 OPT P2), then [[P]]G = [[P1]]G ⟕ [[P2]]G.
(d) if P is (P1 UNION P2), then [[P]]G = [[P1]]G ∪ [[P2]]G.
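The operators on sets of mappings translate almost verbatim into code. The sketch below (our own; mappings are plain Python dicts from variable names to values) implements compatibility, the join ⨝, the difference \, and the left outer-join ⟕:

```python
# Mappings are dicts such as {"?x": "a"}; a set of mappings is a list of dicts.

def compatible(m1, m2):
    """m1 and m2 agree on every variable they share."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(o1, o2):
    """Ω1 ⨝ Ω2: merge every pair of compatible mappings."""
    return [{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)]

def diff(o1, o2):
    """Ω1 \\ Ω2: mappings of Ω1 compatible with no mapping of Ω2."""
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]

def left_outer_join(o1, o2):
    """Ω1 ⟕ Ω2 = (Ω1 ⨝ Ω2) ∪ (Ω1 \\ Ω2)."""
    return join(o1, o2) + diff(o1, o2)

o1 = [{"?x": "a", "?y": "b"}, {"?x": "c"}]
o2 = [{"?y": "b", "?z": "d"}]
print(join(o1, o2))
# both mappings of o1 are compatible with the mapping in o2
# ({"?x": "c"} has a disjoint domain), so both get extended
```

Note that the difference keeps only mappings incompatible with every mapping of the other side, exactly as in the definition above.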

The idea behind the OPT operator is to allow for optional matching of patterns. Consider a pattern expression (P1 OPT P2) and let μ1 be a mapping in [[P1]]G. If there exists a mapping μ2 ∈ [[P2]]G such that μ1 and μ2 are compatible, then (μ1 ∪ μ2) ∈ [[(P1 OPT P2)]]G. But if no such mapping μ2 exists, then μ1 ∈ [[(P1 OPT P2)]]G. Thus, the operator OPT allows information to be added to a mapping μ if the information is available, instead of just rejecting μ whenever some part of the pattern does not match. The semantics of filter expressions goes as follows. Given a mapping μ and a built-in condition R, we say that μ satisfies R, denoted by μ |= R, if:
(a) R is bound(?X) and ?X ∈ dom(μ);
(b) R is ?X = c, ?X ∈ dom(μ) and μ(?X) = c;
(c) R is ?X = ?Y, ?X ∈ dom(μ), ?Y ∈ dom(μ) and μ(?X) = μ(?Y);
(d) R is (¬R1), R1 is a built-in condition, and it is not the case that μ |= R1;
(e) R is (R1 ∨ R2), R1 and R2 are built-in conditions, and μ |= R1 or μ |= R2;
(f) R is (R1 ∧ R2), R1 and R2 are built-in conditions, μ |= R1 and μ |= R2.

Let G be an RDF graph and (P FILTER R) be a filter expression. The evaluation of the filter expression over G is defined as [[(P FILTER R)]]G = {μ ∈ [[P]]G | μ |= R}. Several algebraic properties of graph patterns are proved by Pérez et al. (2006a, 2006b). A simple property is that AND and UNION are associative and commutative, which permits us to avoid parentheses when writing sequences of AND operators or UNION operators. The official W3C Recommendation (Prud'hommeaux & Seaborne, 2008) defines four query forms, namely SELECT, ASK, CONSTRUCT, and DESCRIBE queries. These query forms use the mappings obtained after the evaluation of a graph pattern to construct result sets or RDF graphs. The query forms are: (1) SELECT, which performs a projection over a set of variables in the evaluation of a graph pattern, (2) CONSTRUCT, which returns an RDF graph constructed by substituting variables in a template, (3) ASK, which returns a truth value indicating whether the evaluation of a graph pattern produces at least one mapping, and (4) DESCRIBE, which returns an RDF graph that describes the resources found. In this chapter, we only consider the SELECT query form; we refer the reader to Pérez et al. (2006a, 2006b) for a formalization of the remaining query forms. To formally define the semantics of SELECT SPARQL queries, we need the following notion. Given a mapping μ: V → U and a set of variables W ⊆ V, the


restriction of μ to W, denoted by μ|W, is a mapping such that dom(μ|W) = dom(μ) ∩ W and μ|W(?X) = μ(?X) for every ?X ∈ dom(μ) ∩ W.

Definition 1.1 A SPARQL SELECT query is a tuple (W, P), where P is a graph pattern and W is a set of variables such that W ⊆ var(P). The answer of (W, P) over an RDF graph G, denoted by [[(W, P)]]G, is the set of mappings: [[(W, P)]]G = {μ|W | μ ∈ [[P]]G}.
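Definition 1.1, together with the recursive evaluation rules above, can be prototyped in a few lines. The following sketch (our own; patterns are nested tuples, variables are strings beginning with "?", and FILTER is omitted for brevity) evaluates AND, UNION, OPT, and the SELECT projection over a toy graph:

```python
# A minimal evaluator for the graph-pattern algebra.
# A graph is a set of (s, p, o) triples; a set of mappings is a list of dicts.

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def compatible(m1, m2):
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def eval_pattern(pattern, g):
    op = pattern[0]
    if op == "triple":
        _, s, pr, o = pattern
        out = []
        for (gs, gp, go) in g:
            m, ok = {}, True
            for pat, val in ((s, gs), (pr, gp), (o, go)):
                if is_var(pat):
                    if pat in m and m[pat] != val:
                        ok = False
                    m[pat] = val
                elif pat != val:
                    ok = False
            if ok:
                out.append(m)
        return out
    o1, o2 = eval_pattern(pattern[1], g), eval_pattern(pattern[2], g)
    if op == "AND":
        return [{**a, **b} for a in o1 for b in o2 if compatible(a, b)]
    if op == "UNION":
        return o1 + o2
    if op == "OPT":  # left outer-join: extend when possible, keep otherwise
        joined = [{**a, **b} for a in o1 for b in o2 if compatible(a, b)]
        alone = [a for a in o1 if not any(compatible(a, b) for b in o2)]
        return joined + alone
    raise ValueError(op)

def select(w, pattern, g):
    """[[(W, P)]]G: restrict each resulting mapping to the variables in W."""
    return [{v: m[v] for v in w if v in m} for m in eval_pattern(pattern, g)]

g = {("Henry", "writes", "boule de suif"),
     ("boule de suif", "issuing time", "1880")}
p = ("OPT", ("triple", "?x", "writes", "?y"),
            ("triple", "?y", "issuing time", "?t"))
print(select(["?x", "?t"], p, g))  # [{'?x': 'Henry', '?t': '1880'}]
```

Adding FILTER would amount to one more recursive case that keeps exactly the mappings satisfying the built-in condition, per the definition above.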

1.4 RDF Data Store

RDF plays an important role in representing Web resources in a natural and flexible way. As the amount of RDF data keeps growing, efficient and scalable management of RDF data is of increasing importance. RDF data management has attracted attention in both the database and Semantic Web communities, and much work has been devoted to proposing different solutions to store RDF data efficiently. In this section, we focus on RDF data storage and present a full up-to-date overview of the current state of the art in RDF data storage based on the work by Ma et al. (2016a, 2016b, 2016c). The various approaches are classified according to their storage strategy, including RDF data stores in traditional databases and RDF data stores in NoSQL databases. Figure 1.4 illustrates this classification for RDF data stores. Note that two different levels of RDF data storage can be distinguished: logical storage and physical storage. This chapter mainly focusses on the logical storage of RDF data. Traditionally, databases are classified into relational databases and object-oriented databases. In addition, NoSQL databases have only recently emerged as a commonly used infrastructure for handling big data. So, the two top categories of RDF data stores in Fig. 1.4 are traditional database stores and NoSQL database stores, respectively. For the traditional database stores, corresponding to the two kinds of traditional database models, the two categories of RDF data stores are relational stores and object-oriented stores, which apply relational database and

Fig. 1.4 Classification of resource description framework (RDF) data stores: traditional database stores (relational stores, subdivided into vertical, horizontal, and type stores, and object-oriented stores) and NoSQL database stores (key-value stores, column-family stores, document stores, and graph database stores)


object-oriented database models, respectively. Depending on the concrete data models adopted, the NoSQL database stores are categorized into key-value stores, column-family stores, document stores, and graph databases. Note that in the relational stores of RDF data, several different relational schemas can be designed, depending on how RDF triples are distributed to an appropriate relational schema. This results in three major categories of RDF relational stores: vertical stores, horizontal stores, and type stores. They are formally illustrated in the following section.

1.4.1 RDF Stores in Traditional Databases

A number of attempts have been made to use traditional databases to store RDF data, and various storage schemes for RDF data have been proposed. Some ideas and techniques developed earlier for object-oriented databases, for example, have already been adapted to the RDF setting. RDF data were stored in an object-oriented database by mapping both triples and resources to objects in Bönström et al. (2003). An object-oriented database model was proposed for the storage of RDF documents (Chao, 2007a, 2007b), but the RDF documents were encoded in XML (eXtensible Markup Language). Relational database management systems (RDBMSs) are currently the most widely used databases. It has been shown that RDBMSs are very efficient, scalable, and successful in hosting various types of data, including some new types of data such as XML data, temporal/spatial data, media data, and complex objects. Currently, most mature RDF systems use RDBMSs to store RDF data, mapping RDF triples onto relational table structures and using the RDBMS for storage and retrieval. According to the table structures designed, the storage of RDF data can be divided into three methods (Luo et al., 2012; Sakr & Al-Naymat, 2009), namely vertical stores, horizontal stores, and property stores.

1. Vertical stores

Vertical stores (also called triple stores; e.g. Broekstra et al., 2002; Harris & Gibbins, 2003; Harris & Shadbolt, 2005; Neumann & Weikum, 2008, 2010; Weiss et al., 2008) use a single relational table to store a set of RDF statements, in which the relational schema contains three columns for subject, property, and object. Formally, each triple, say (s, p, o), occurs in the relational table as a row, that is, as the tuple <s, p, o>. Here subject s is placed in the subject column of the row, predicate p in the property column, and object o in the object column.
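A minimal sketch of this layout (our own, using an in-memory SQLite database; the table name, column names, and ex: data are illustrative choices, not a fixed standard): the triples table holds one row per triple, and a query with two connected triple patterns is answered by a self-join of the table with itself, as discussed next.

```python
# Vertical (triple) store: one three-column table, queried via self-joins.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:Dinner", "ex:direct", "ex:BarryLevinson"),
    ("ex:BarryLevinson", "ex:age", "77"),
    ("ex:BarryLevinson", "ex:nationality", "USA"),
])

# SPARQL-style intent:  SELECT ?d ?a WHERE { ?f ex:direct ?d . ?d ex:age ?a }
# rewritten as a single self-join on the triples table:
rows = con.execute("""
    SELECT t2.subject, t2.object
    FROM triples t1 JOIN triples t2 ON t1.object = t2.subject
    WHERE t1.property = 'ex:direct' AND t2.property = 'ex:age'
""").fetchall()
print(rows)  # [('ex:BarryLevinson', '77')]
```

Each additional triple pattern in a query adds one more self-join, which is precisely why query performance degrades on large basic graph patterns.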
When performing an RDF query, a query rewriting mechanism converts the given SPARQL query into a corresponding SQL statement, and the relational database answers the SQL statement. Although this method has good versatility, the query performance is poor, as many self-join operations need to be performed when a query is executed. Moreover, because vertical stores quickly encounter scalability limitations, several approaches have been proposed to deal with


these limitations by using extensive sets of indices or by using selectivity estimation information to optimize the join ordering. Sesame, a generic architecture for storing and querying RDF and RDF Schema, is introduced by Broekstra et al. (2002). An important feature of the Sesame architecture is its abstraction from the details of any particular repository used for the actual storage; therefore, Sesame can be ported to a large variety of different repositories. The implementation of Sesame (Broekstra et al., 2002) uses both PostgreSQL and MySQL as database platforms. An RDF storage scheme called Hexastore is proposed by Weiss et al. (2008). This scheme enhances the vertical partitioning idea and takes it to its logical conclusion: RDF data are indexed in six possible ways, one for each possible ordering of the three RDF elements. Each instance of an RDF element is associated with two vectors; each such vector gathers elements of one of the other types, along with lists of the third resource type attached to each vector element. Hence, a sextuple indexing scheme emerges. This format enables quick and scalable general-purpose query processing; it confers significant advantages (up to five orders of magnitude) over previous approaches for RDF data management, at the price of a worst-case fivefold increase in index space. Note that Hexastore focusses on exhaustive indexing of pairs of positions in triples, such as SP, SO, …, OP. Differing from Hexastore, RDF-3X focusses on exhaustive indexing of all permutations of the three positions, such as SPO, SOP, …, OPS, and TripleT (Wolff et al., 2015) focusses on exhaustive indexing of all single positions, S, P, and O.

2. Horizontal stores

The second approach for storing RDF data is called horizontal stores (e.g. Abadi et al., 2007, 2009). Under the horizontal representation, RDF data can be stored directly in a single table.
This table has one column for each predicate occurring in the RDF graph and one row for each subject. Formally, for a triple (s, p, o), object o is placed in column p of row s. Note that for two triples, say (si, pi, oi) and (sj, pj, oj), one may have si = sj and either pi = pj or pi ≠ pj. If si = sj and pi ≠ pj, then oi and oj are placed in different columns pi and pj of the same row. However, if si = sj and pi = pj, then oi and oj are placed in the same column of the same row, and a set of values {oi, oj} results. Of course, it is possible that si ≠ sj and pi = pj; then oi and oj are placed in the same column of different rows si and sj. In the context of massive RDF data, it is very common that two triples (si, pi, oi) and (sj, pj, oj) have different subjects and different predicates, so that oi and oj are placed in different columns of different rows. As a result, row si has a null value in the pj column and row sj has a null value in the pi column. This leads to a sparse table with many null values. To solve these problems of null values and sets of values, efforts have been made to partition a single table vertically into a set of property tables using predicates. Each predicate has a table over the schema (subject, object), in which a binary relation between a subject and an object with respect to the given predicate is represented. Formally, for two triples, say (si, pi, oi) and (sj, pj, oj), if pi = pj, the two tuples (si, oi) and (sj, oj) are in the same table. However, if pi ≠ pj, the tuples (si, oi) and (sj, oj)


occur in two different tables. It is clear that the number of relational tables is the same as the number of predicates in the RDF data sets. SW-Store was proposed by Abadi et al. (2007, 2009) as an RDF data store that vertically partitions RDF data (by predicates) into a set of property tables, maps them onto a column-oriented database, and builds a subject–object index on each property table. Note that the implementation of SW-Store relies on the C-Store column-store database (Stonebraker et al., 2005) to store tables as collections of columns rather than as collections of rows. Current relational database systems, for example, Oracle, DB2, SQL Server, and Postgres, are standard row-oriented databases in which entire tuples are stored consecutively. In addition, the results of an independent evaluation of SW-Store are reported by Sidirourgos et al. (2008). Extending the SW-Store approach, an approach called SPOVC is proposed by Mulay and Kumar (2012). The main techniques used in this approach are horizontal partitioning of logical indices and special indices for values and classes. The SPOVC approach uses five indices, namely, subject, predicate, object, value, and class, on top of column-oriented databases.

3. Property stores

The third approach for storing RDF data is called property stores (e.g. Levandoski & Mokbel, 2009; Matono et al., 2005; Sintek & Kiesel, 2006), in which one relational table is created for each RDF data type, and a relational table contains the properties as n-ary table columns for the same subject. Actually, property stores are type-oriented stores (Bornea et al., 2013). The basic idea of this approach is to divide one wide table into multiple smaller tables so that each table contains related predicates as its columns. Formally, for two triples, say (si, pi, oi) and (sj, pj, oj), suppose that pi and pj are related. Then these two triples occur in the same table, with one row for each subject.
Furthermore, when si = sj and pi ≠ pj, oi and oj are placed in different columns pi and pj of the same row; when si = sj and pi = pj, oi and oj are placed in the same column of the same row, and a set of values {oi, oj} results; when si ≠ sj and pi = pj, oi and oj are placed in the same column of different rows si and sj. It is not difficult to see that designing a schema for property tables depends on identifying related predicates. Jena is an open-source toolkit for Semantic Web programmers (McBride, 2002). It implements persistence for RDF graphs using an SQL database through a JDBC connection. Jena has evolved from its first version, Jena 1, to a second version, Jena 2. In Jena, the grouping of predicates is defined by applications (Wilkinson et al., 2003; Wilkinson, 2006). Applications typically have access patterns in which certain subjects or properties are accessed together. In particular, the application programmer must specify which predicates are multi-valued. For each such multi-valued predicate p, a new relational table is created, with a schema consisting of subject and p. Jena also supports so-called property-class tables, in which a new table is created for each value of the rdf:type predicate. The remaining predicates that are not in any defined group are stored independently.
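The contrast between a vertical triple table and per-predicate property tables can be illustrated with a small, self-contained sketch; the data, table names, and the two-predicate query below are invented for illustration only and do not come from any cited system:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Vertical (triple) store: one three-column table holds every triple.
cur.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
data = [("Alice", "type", "Person"), ("Alice", "age", "30"),
        ("Bob", "type", "Person"), ("Bob", "name", "Bob Smith")]
cur.executemany("INSERT INTO triples VALUES (?, ?, ?)", data)

# The SPARQL pattern { ?x type Person . ?x age ?a } becomes a self-join.
rows = cur.execute(
    "SELECT t1.s, t2.o FROM triples t1 JOIN triples t2 ON t1.s = t2.s "
    "WHERE t1.p = 'type' AND t1.o = 'Person' AND t2.p = 'age'").fetchall()
print(rows)  # [('Alice', '30')]

# Property tables (one binary table per predicate): the same query joins
# two small tables instead of joining the triple table with itself.
cur.execute("CREATE TABLE p_type (s TEXT, o TEXT)")
cur.execute("CREATE TABLE p_age (s TEXT, o TEXT)")
for s, p, o in data:
    if p in ("type", "age"):
        cur.execute(f"INSERT INTO p_{p} VALUES (?, ?)", (s, o))
rows = cur.execute(
    "SELECT p_type.s, p_age.o FROM p_type JOIN p_age ON p_type.s = p_age.s "
    "WHERE p_type.o = 'Person'").fetchall()
print(rows)  # [('Alice', '30')]
```

The vertical version joins the triple table with itself once per additional triple pattern, which is why query plans over vertical stores degenerate into chains of self-joins; the property-table version touches only the two small tables involved.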


Using a dynamic relation model, a system called FlexTable is proposed to store RDF data (Wang et al., 2010). In FlexTable, all triples of an instance are coalesced into one tuple, and all tuples are stored in a relational schema. To partition all the triples into several tables, first, a schema-evolution method based on a lattice structure is proposed to evolve the schema automatically when new triples are inserted; second, a page layout with an interpreted storage format is proposed to reduce the physical adjustment cost during schema evolution. For each subject s in the RDF graph G, the set of predicates {p | (∃o)((s, p, o) ∈ G)} is obtained, which is called the signature of s (Sintek & Kiesel, 2006). The predicates in this set are considered related predicates. Then, for each signature, a corresponding predicate relational table called the signature table is created; its relational schema contains the subject and the set of predicates. Based mainly on the concepts of signatures and signature tables, which are organized in a lattice-like structure, RDFBroker, an RDF store, was introduced by Sintek and Kiesel (2006). Note that the approach by Sintek and Kiesel (2006) actually creates many small tables. To improve query-evaluation performance, various criteria are proposed for merging small tables into larger ones, but this introduces null values when a predicate is absent. Based on RDF document structure, a storage scheme is proposed by Matono and Kojima (2012), which stores RDF graphs without decomposition. Considering that adjacent triples have a strong relationship and can describe the same resource, a set of adjacent triples that refer to the same resource is defined as an RDF paragraph. The table layout is then constructed based on RDF paragraphs (Luo et al., 2012). Levandoski and Mokbel (2009) proposed a data-centric schema-creation approach for storing RDF data in relational databases.
With the aim of maximizing the size of each group of predicates while minimizing the number of null values that occur in the tables, association-rule mining is used to automatically determine the predicates that often occur together. According to the support threshold, which measures the strength of correlation between properties in the RDF data, properties that are grouped together in the same cluster may constitute a single n-ary table, and properties that are not grouped in any cluster may be stored in binary tables. Finally, in the partitioning phase, the formed clusters are checked, and a trade-off is made between storing as many RDF properties as possible in clusters and keeping null storage to a minimum, based on the null threshold. The approach by Levandoski and Mokbel (2009) thus provides a tailored schema for each RDF data set, balancing n-ary and binary tables. Each triple of the form (subject, predicate, object) is classified into categories by Matono et al. (2005) according to the type of predicate, and then subgraphs are constructed for each category. The graph is decomposed into five subgraphs according to the type of predicate: class inheritance graphs, property inheritance graphs, type graphs, domain-range graphs, and generic graphs. Each subgraph is stored by applicable techniques in distinct relational tables. More precisely, all classes and properties are extracted from RDF schema data, and all resources are also extracted from RDF data. Each extracted item is assigned an identifier and a path expression and stored in the corresponding relational table. Because the proposed


scheme retains schema information and path expressions for each resource, the path-based relational RDF database (Matono et al., 2005) can process path-based queries efficiently and store RDF instance data without schema information. Among the three approaches for storing RDF data in relational databases, vertical stores use a fixed relational schema, and new triples can be inserted without considering RDF data types. Therefore, vertical stores can handle dynamic schemas of RDF data. However, vertical stores generally involve a number of self-join operations for querying, and therefore efficient querying requires specialized techniques. To overcome the problem of self-joins in vertical stores, horizontal stores using a single relational table have been proposed. However, it commonly occurs that in the single relational table containing all predicates as columns, a subject occurs only with certain predicates, which leads to a sparse relational table with many null values. In addition, a subject may have multiple objects for the same predicate; such a predicate is called a multi-valued predicate. As a result, the relational table in a horizontal store contains multi-valued attributes. Finally, when new triples are inserted, new predicates result in changes to the relational schema, so dynamic schemas of RDF data cannot be handled. To solve the problem of null values as well as that of multi-valued attributes, horizontal stores using a set of relational tables have been proposed, in which each predicate corresponds to a relational table. However, horizontal stores using a set of relational tables generally involve many join operations for querying. In addition, when new triples are inserted, new predicates result in new relational tables, and dynamic schemas of RDF data cannot be handled. A vertical store in (p, s, o) shape would equal the sequential concatenation of all the tables in a horizontal store that uses a set of relational tables.
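The signature concept used by RDFBroker (Sintek & Kiesel, 2006), described earlier, can be sketched in a few lines; the triples below are invented for illustration:

```python
from collections import defaultdict

triples = [("Alice", "type", "Person"), ("Alice", "age", "30"),
           ("Bob", "type", "Person"), ("Bob", "age", "41"),
           ("W3C", "type", "Organization")]

# Signature of a subject s: {p | (∃o)((s, p, o) ∈ G)}.
signatures = defaultdict(set)
for s, p, o in triples:
    signatures[s].add(p)

# Subjects sharing a signature share one "signature table" whose schema
# is the subject plus the predicates in the signature.
tables = defaultdict(list)
for s, sig in signatures.items():
    tables[frozenset(sig)].append(s)

for sig, subjects in sorted(tables.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(sig), subjects)
# ['age', 'type'] ['Alice', 'Bob']
# ['type'] ['W3C']
```

Alice and Bob end up in one table over (subject, type, age), while W3C gets its own table; merging such small tables, as discussed above, would reintroduce null values for the absent predicates.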
The type-store approach is actually a trade-off between the two kinds of horizontal stores. Compared with horizontal stores using a single relational table, type stores contain fewer null values (horizontal stores using multiple relational tables contain none); compared with horizontal stores using multiple relational tables, type stores involve fewer join operations (horizontal stores using a single relational table involve none). It should be noted that, like horizontal stores using a single relational table, type stores may contain multi-valued attributes, and new predicates result in changes to relational schemas when new triples are inserted. Some major features of relational RDF data stores are summarized in Table 1.1.

Table 1.1 Major features of relational resource description framework data stores

| Store type | Join operations | Multi-valued attributes | Null values | Relational schema | Number of relation(s) |
| Vertical stores | More self-joins | No | No | Fixed | Fixed |
| Horizontal stores (one table for all predicates) | No | Yes | Yes and many | Dynamic | Fixed |
| Horizontal stores (one table for each predicate) | More joins | No | No | Dynamic | Dynamic |
| Type stores | Fewer joins | Yes | Yes and fewer | Dynamic | Dynamic |

1.4.2 RDF Stores in Not Only SQL Databases

1. Graph model

RDF data have the characteristics of a graph structure, so some work studies the storage of RDF data from the perspective of the graph model. Bönström et al. (2003) first proposed treating RDF data as a graph, rather than as XML-formatted data or as a plain collection of triples, because the graph model retains more of the semantic information in RDF data. They believe that the advantages of using graphs to store RDF data are: (i) the structures of the RDF model and the graph model can be mapped to each other directly, so that no conversion of RDF data is needed when the data are stored; and (ii) when RDF data are queried, the graph model avoids restructuring the data. Angles and Gutierrez (2005) discussed the problem of using graph databases to store RDF data and compared the RDF model with relational, object-oriented, and semantic models. They also studied the suitability of graph database query languages for RDF data and the applicability of RDF query languages to graph data. The results show that most RDF query languages provide little support for some basic graph queries; even SPARQL does not support path queries or node-distance queries over the graph structure, although such queries are very important in practical applications. Udrea et al. proposed the GRIN algorithm to answer SPARQL queries. The core of GRIN is to construct a GRIN index similar to the M-tree structure (Ciaccia et al., 1997). Using distance constraints, GRIN can quickly determine and prune the parts of the RDF graph that do not meet the query conditions, which improves overall query performance. Wu et al. (2008) proposed using a hypergraph data model to store RDF data and designed a persistent storage strategy based on the graph structure. Yan et al. (2009) proposed dividing the RDF graph into several subgraphs and adding indexes such as Bloom filters to determine quickly, during query processing, whether the data sought are in a given subgraph. Graph partitioning is used to reduce self-joins over triples, but after updates the graph still needs to be re-partitioned. Zou et al. (2011) proposed the gStore system to store RDF data and answer SPARQL queries. By an encoding method, each entity node in the RDF graph, together with its neighbouring attributes and attribute values, is encoded into a node with a bitstring, yielding a label graph G*.
When querying, the query graph Q is likewise encoded into a query label graph Q*, and a subgraph-matching method is then used to find the subgraphs of the label graph G* that satisfy Q*. Both property graphs (PG) and RDF can be used to represent graph-model data, but the two models are not directly compatible. To this end, Hartig (2014) proposed a formal definition of the property graph model and introduced clear definitions for the conversions between the PG and RDF models. On the one hand, by implementing the RDF-to-PG conversion definition, PG-based systems can enable users to load RDF data and enable them to


use the graph traversal language Gremlin or the declarative graph query language Cypher in a compatible and system-independent manner. On the other hand, the PG-to-RDF conversion enables RDF data-management systems to process property graph content using SPARQL. Recently, De Virgilio (2017) proposed an automatic conversion method from RDF to a graph storage system. The conversion uses the integrity constraints defined on the source to properly construct a target database and attempts to reduce the number of visits required to answer a query in the database. This is done by storing together node data that may appear together in query results; a system implementing the conversion has also been developed. At present, the representative graph database products for RDF data mainly include Neo4j (http://neo4j.org/) and Dydra (http://www.dydra.com). Neo4j is currently a relatively mature and high-performance open-source graph database. Neo4j can traverse nodes and edges at the same speed, and its traversal speed is independent of the amount of data that constitutes the graph. However, it does not support distributed storage, and existing research on using Neo4j to store RDF and support SPARQL queries is very limited, being confined to some engineering applications. The Thinkerpop team developed LinkedDataSail (https://github.com/thinkerpop/gremlin/wiki/linkeddatasail), which provides an interface for processing RDF data in a graph database. Using this interface, Neo4j can support SPARQL queries and can be used as a triple database. Martella (https://github.com/claudiomartella/dbpedia4neo) stored the DBpedia data set in Neo4j and then extended SPARQL queries and other graph algorithms on this basis. Dydra is a cloud-based graph database. Using Dydra, RDF data are directly stored as a property graph, directly representing the relationships in the underlying RDF data, and can be accessed and updated through an industry-standard query language designed specifically for graph processing.
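The bitstring-based pruning used by systems such as gStore can be sketched as a Bloom-filter-style signature test; the hash scheme and 32-bit width below are illustrative assumptions, not the actual gStore encoding:

```python
# Each node's adjacent (predicate, value) pairs are folded into a
# fixed-width bitstring; a query node can only match a data node whose
# bitstring covers the query node's bits.
def signature(adjacent, width=32):
    """Fold each adjacent (predicate, value) pair into one bit of a mask."""
    sig = 0
    for item in adjacent:
        sig |= 1 << (hash(item) % width)
    return sig

def may_match(query_sig, data_sig):
    # This test never prunes a true match, but hash collisions can let
    # false candidates through, so candidates must still be verified by
    # actual subgraph matching.
    return query_sig & data_sig == query_sig

data_sig = signature([("type", "Person"), ("age", "30"), ("name", "Alice")])
query_sig = signature([("type", "Person")])
assert may_match(query_sig, data_sig)  # kept as a candidate
```

Because the query node's adjacent pairs are a subset of the data node's, its bits are necessarily covered; nodes whose bitstrings lack a required bit are pruned without any graph traversal.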
These works apply graph databases to store RDF data and use graph algorithms to solve query problems.

2. NoSQL data-management systems

NoSQL data-management systems have emerged as a commonly used infrastructure for handling big data outside the RDF space. The various NoSQL data stores were divided into four major categories by Grolinger et al. (2013): key-value stores, column-family stores, document stores, and graph databases. Key-value stores have a simple data model based on key-value pairs. Most column-family stores are derived from Google BigTable (Chang et al., 2008), in which the data are stored in a column-oriented way. In BigTable, the data set consists of several rows. Each row is addressed by a primary key and is composed of a set of column families. Note that different rows can have different column families. Representative column-family stores include Apache HBase (https://hbase.apache.org/), which directly implements the Google BigTable concepts. According to Grolinger et al. (2013), there is one type of column-family store, exemplified by Amazon SimpleDB (Stein & Zachrias, 2010) and DynamoDB (DeCandia


et al., 2007), in which each row contains only a set of column name-value pairs, without column families. In addition, Cassandra (Lakshman & Malik, 2010) provides the additional functionality of super columns, which are formed by grouping various columns together. Document stores provide another derivative of the key-value store data model, using keys to locate documents inside the data store. Most document stores represent documents using JSON (JavaScript Object Notation) or some format derived from it. Typically, CouchDB (https://couchdb.apache.org/) and the Couchbase server (https://www.couchbase.com/products/server) use the JSON format for data storage, whereas MongoDB (https://www.mongodb.com/) stores data in BSON (Binary JSON). Graph databases use graphs as their data model; a graph represents a set of objects, known as vertices or nodes, and the links (or edges) that interconnect these vertices. Illustrative representations of these NoSQL models were presented by Grolinger et al. (2013) and are shown in Fig. 1.5. Actually, massive RDF data management merits the use of big-data infrastructure because of the scalability and high performance of cloud data management. A number of efforts have been made to develop RDF data-management systems based on NoSQL systems. SimpleDB by Amazon was used as a back end to store RDF data quickly and reliably for massively parallel access (Stein & Zachrias, 2010). Cloud-based key-value stores (e.g. BigTable) were used by Gueret et al. (2011), and a robust query engine was developed over these key-value stores. In addition, there is a new RDF store called SPARQLcity (https://www.hugedomains.com/domain_profile.cfm?d=sparqlcity.com), a Hadoop-based graph-analytics engine for performing rich business analytics on RDF data with SPARQL. SPARQLcity is the first just-in-time-compiled engine for SPARQL query execution.
However, given that NoSQL systems offer either no support or only high-latency support (MapReduce) for effective join processing, SPARQL queries with many joins, which are the mainstay of SPARQL workloads, run into serious problems on such systems. The normal NoSQL APIs (application programming interfaces), which are centred on individual key lookups (whether one looks up a value, a column family, or a document), simply have too high a latency if one has to join tens of thousands (or billions) of RDF triples. Several NoSQL systems for RDF data were investigated by Cudre-Mauroux et al. (2013), including document stores (e.g. CouchDB), key-value/column stores (e.g. Cassandra (https://cassandra.apache.org/) and HBase), and query compilation for Hadoop (e.g. Hive (https://hive.apache.org/)). Major characteristics of these four NoSQL systems are described by Cudre-Mauroux et al. (2013). First, Apache HBase is an open-source, horizontally scalable, row-consistent, low-latency, random-access data store. HBase uses HDFS as a storage back end


Fig. 1.5 Different types of NoSQL data model (Grolinger et al., 2013): (a) key-value store; (b) column-family store; (c) document store; (d) graph database.

and Apache ZooKeeper (https://zookeeper.apache.org/) to provide support for coordination tasks and fault tolerance. HBase is a column-oriented distributed NoSQL database system. Its data model is a sparse, multi-dimensional sorted map. Here, columns are grouped into column families, and timestamps add an additional dimension to each cell. HBase is well integrated with Hadoop, which is a large-scale MapReduce computational framework. The second HBase implementation uses Apache Hive, a SQL-like data-warehousing tool that enables querying using MapReduce. Third, Couchbase (https://www.couchbase.com/) is a document-oriented, schema-less distributed NoSQL database system with native support for JSON documents. Couchbase is intended to run mostly in memory and on as many nodes as needed to hold the whole data set in RAM (random-access memory). It has a built-in object-managed cache to speed up random reads and writes. Updates to documents are first made in the in-memory cache and only later written to disk under an eventual-consistency paradigm. Finally, Apache Cassandra is a NoSQL database


management system originally developed by Facebook (Lakshman & Malik, 2010), which provides decentralized data storage and fault tolerance based on replication and failover. Among the NoSQL systems available, HBase has been the most widely used. To manage distributed RDF data, HBase and MySQL Cluster were used by Franke et al. (2011) to store RDF data, and an empirical comparison of the two approaches was conducted on a cluster of commodity machines. The Hexastore (Weiss et al., 2008) schema was applied to HBase to store verbose RDF data (Sun & Jin, 2010). Also based on HBase, two distributed triple stores called H2RDF and H2RDF+ were developed to optimize distributed joins using MapReduce (Papailiou et al., 2012, 2013). The main differences between H2RDF and H2RDF+ lie in the join algorithms and the number of maintained indices (three vs. six). Combining the Jena framework with the storage provided by HBase, Khadilkar et al. (2012) developed several versions of a triple store. A scalable technique was created (Przyjaciel-Zablocki et al., 2012) for performing indexed nested-loop joins, which combines the power of the MapReduce paradigm with the random-access pattern provided by HBase. Using a combination of MapReduce and HBase, a storage schema called RDFChain was proposed by Choi et al. (2013) to support scalable storage and efficient retrieval of a large set of RDF data. Focussing on large provenance data sets, which record the history of an in-silico experiment, large collections of provenance graphs were serialized as RDF graphs in an Apache HBase database (Chebotko et al., 2013). On this basis, storage, indexing, and query techniques for RDF data in HBase were proposed, which are better suited to provenance data sets than to generic RDF graphs. Another important category of NoSQL RDF storage is the concept of graph databases (Angles & Gutierrez, 2008).
In this category, the focus is on the structure of RDF data: the data are viewed as a classical graph in which subjects and objects form the nodes and triples specify directed, labelled edges. Note that such RDF graphs may contain cycles. Angles and Gutierrez (2005) surveyed graph database models and query languages and further proposed that RQL should incorporate graph database query-language primitives. Observing that the standard graph database model (essentially labelled graphs) is different from the triple-based RDF model, Libkin et al. (2013) introduced a triple-based model called Trial, which combines the usual idea of triple stores used in many RDF implementations with that of graphs with data. Actually, one of the major problems encountered in modelling RDF data as classical graphs is that an edge, even a labelled one, cannot represent the ternary relation given by an RDF triple. It is natural to use hypergraphs for this purpose, with edges connecting three nodes instead of the classical two-node edges. Hypergraphs are naturally represented by bipartite graphs (Hayes & Gutierrez, 2004), where the concept of an RDF bipartite graph is introduced as an intermediate model for RDF data. With a focus on distributed and Web-scale RDF data management, a memory-based graph engine called Trinity.RDF is introduced (Zeng et al., 2013). Trinity.RDF models RDF data in its native graph form, in which entities (i.e. subjects and objects of RDF triples) are represented as graph nodes and relationships (i.e. predicates of


RDF triples) are represented as graph edges. Each RDF entity is represented as a graph node with a unique id and stored as a key-value pair in the Trinity memory cloud. Formally, such a key-value pair (node-id, adjacency-list) consists of the node-id as the key and the node's adjacency list as the value. The adjacency list is divided into two lists: one for neighbors with incoming edges and the other for neighbors with outgoing edges. Each element in the adjacency list is a (predicate, node-id) pair, which records the id of the neighbor and the predicate on the edge.
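A minimal in-memory sketch of this adjacency-list layout, with plain Python dictionaries standing in for the Trinity memory cloud and invented node ids, might look as follows:

```python
# Key-value store: node-id -> (in-adjacency list, out-adjacency list),
# where each adjacency entry is a (predicate, neighbor-id) pair.
store = {}

def add_triple(s, p, o):
    store.setdefault(s, ([], []))[1].append((p, o))  # outgoing edge of s
    store.setdefault(o, ([], []))[0].append((p, s))  # incoming edge of o

add_triple("Alice", "knows", "Bob")
add_triple("Bob", "worksAt", "W3C")

in_edges, out_edges = store["Bob"]
print(in_edges)   # [('knows', 'Alice')]
print(out_edges)  # [('worksAt', 'W3C')]
```

Storing both directions of each edge doubles the write cost but lets graph exploration proceed from either end of a triple pattern without a global index lookup.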

1.5 Summary

RDF is increasingly being adopted for modeling data in various application domains and has become a cornerstone for publishing, exchanging, sharing, and interrelating data on the Web. The goal of this chapter is to give an overview of the basics of RDF data and its management. We started by providing a formal definition of RDF that includes the features distinguishing this model from other graph data models. We then moved to the fundamental issue of querying RDF data, studying the RDF query language SPARQL, which has been a W3C Recommendation since January 2008, and providing an algebraic syntax and a compositional semantics for this language. We furthermore focussed on RDF data storage and presented an up-to-date overview of the current state of the art in RDF data-storage strategies, including RDF data stores in traditional databases and RDF data stores in NoSQL databases. However, traditional database models and RDF have limitations, mainly in what they can express about the fuzzy information that is commonly found in many application domains. To provide the means to handle and manage such information, a large number of fuzzy extensions to database models and to RDF have been proposed. In particular, Zadeh's fuzzy set theory (Zadeh, 1965) has been identified as a successful technique for modelling fuzzy information in many application areas, especially in databases and RDF. In the next chapter, we briefly introduce fuzzy set theory and fuzzy database models.

References

Abadi, D. J., Marcus, A., Madden, S., & Hollenbach, K. (2007). Scalable semantic web data management using vertical partitioning. In Proceedings of the 33rd International Conference on Very Large Data Bases (pp. 411–422).
Abadi, D. J., Marcus, A., Madden, S., & Hollenbach, K. (2009). SW-store: A vertically partitioned DBMS for semantic web data management. VLDB Journal, 18(2), 385–406.
Ali, W., Saleem, M., Yao, B., Hogan, A., & Ngomo, A. C. N. (2021). A survey of RDF stores & SPARQL engines for querying knowledge graphs. The VLDB Journal, 1–26.
Angles, R., & Gutierrez, C. (2005). Querying RDF data from a graph database perspective. In Proceedings of the Second European Semantic Web Conference (pp. 346–360).


Angles, R., & Gutierrez, C. (2008). Survey of graph database models. ACM Computing Surveys, 40, 1:1–1:39.
Arenas, M., Gutierrez, C., & Pérez, J. (2009). Foundations of RDF databases. In Reasoning Web International Summer School (pp. 158–204). Springer.
Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34–43.
Bishop, B., Kiryakov, A., Ognyanoff, D., Peikov, I., Tashev, Z., & Velkov, R. (2011). OWLIM: A family of scalable semantic repositories. Semantic Web, 2(1), 1–10.
Bizer, C., Heath, T., & Berners-Lee, T. (2009). Linked data—The story so far. International Journal of Semantic Web and Information Systems, 5(3), 1–22.
Bönström, V., Hinze, A., & Schweppe, H. (2003). Storing RDF as a graph. In Proceedings of the First Conference on Latin American Web Congress (pp. 27–36).
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., & Bhattacharjee, B. (2013). Building an efficient RDF store over a relational database. In Proceedings of the 2013 ACM International Conference on Management of Data (pp. 121–132).
Brickley, D., & Guha, R. V. (2004). RDF vocabulary description language 1.0: RDF schema. W3C Recommendation.
Broekstra, J., Kampman, A., & van Harmelen, F. (2002). Sesame: A generic architecture for storing and querying RDF and RDF schema. In Proceedings of the 2002 International Semantic Web Conference (pp. 54–68).
Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., & Gruber, R. E. (2008). BigTable: A distributed storage system for structured data. ACM Transactions on Computer Systems, 26(2), 4:1–4:26.
Chao, C.-M. (2007a). An object-oriented approach for storing and retrieving RDF/RDFS documents. Tamkang Journal of Science and Engineering, 10(3), 275–286.
Chao, C.-M. (2007b). An object-oriented approach to storage and retrieval of RDF/XML documents. In Proceedings of the 19th International Conference on Software Engineering & Knowledge Engineering (pp. 586–591).
Chebotko, A., Abraham, J., Brazier, P., Piazza, A., Kashlev, A., & Lu, S. (2013). Storing, indexing and querying large provenance data sets as RDF graphs in Apache HBase. In Proceedings of the IEEE Ninth World Congress on Services (pp. 1–8).
Choi, P., Jung, J., & Lee, K.-H. (2013). RDFChain: Chain centric storage for scalable join processing of RDF graphs using MapReduce and HBase. In Proceedings of the 2013 International Semantic Web Conference (pp. 249–252).
Ciaccia, P., Patella, M., & Zezula, P. (1997). M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 1997 International Conference on Very Large Data Bases (pp. 426–435).
Cudre-Mauroux, P., Enchev, I., Fundatureanu, S., Groth, P., Haque, A., Harth, A., Keppmann, F. L., Miranker, D. P., Sequeda, J. F., & Wylot, M. (2013). NoSQL databases for RDF: An empirical evaluation. In Proceedings of the 12th International Semantic Web Conference (pp. 310–325).
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., & Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. In Proceedings of the 21st ACM Symposium on Operating Systems Principles (pp. 205–220).
De Virgilio, R. (2017). Smart RDF data storage in graph databases. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) (pp. 872–881). IEEE.
Duan, S., Kementsietsidis, A., Srinivas, K., & Udrea, O. (2011). Apples and oranges: A comparison of RDF benchmarks and real RDF datasets. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (pp. 145–156).
Erling, O., & Mikhailov, I. (2007). RDF support in the Virtuoso DBMS. In Proceedings of the 1st Conference on Social Semantic Web (pp. 59–68).
Erling, O., & Mikhailov, I. (2009). Virtuoso: RDF support in a native RDBMS. In R. De Virgilio, F. Giunchiglia, & L. Tanca (Eds.), Semantic Web Information Management (pp. 501–519). Springer.


Franke, C., Morin, S., Chebotko, A., Abraham, J., & Brazier, P. (2011). Distributed semantic web data management in HBase and MySQL Cluster. In Proceedings of the 2011 IEEE International Conference on Cloud Computing (pp. 105–112).
Grolinger, K., Higashino, W. A., Tiwari, A., & Capretz, M. A. M. (2013). Data management in cloud environments: NoSQL and NewSQL data stores. Journal of Cloud Computing: Advances, Systems and Applications, 2, 22.
Gueret, C., Kotoulas, S., & Groth, P. (2011). TripleCloud: An infrastructure for exploratory querying over web-scale RDF data. In Proceedings of the 2011 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology—Workshops (pp. 245–248).
Gutierrez, C., Hurtado, C. A., Mendelzon, A. O., & Pérez, J. (2011). Foundations of semantic web databases. Journal of Computer and System Sciences, 77(3), 520–541.
Harris, S., & Gibbins, N. (2003). 3store: Efficient bulk RDF storage. In Proceedings of the First International Workshop on Practical and Scalable Semantic Systems.
Harris, S., Lamb, N., & Shadbolt, N. (2009). 4store: The design and implementation of a clustered RDF store. In Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (pp. 94–109).
Harris, S., & Shadbolt, N. (2005). SPARQL query processing with conventional relational database systems. In Proceedings of the International Workshop on Scalable Semantic Web Knowledge Base Systems (pp. 235–244).
Hartig, O. (2014). Reconciliation of RDF and property graphs. arXiv preprint arXiv:1409.3288.
Hassanzadeh, O., Kementsietsidis, A., & Velegrakis, Y. (2012). Data management issues on the semantic web. In Proceedings of the 2012 IEEE International Conference on Data Engineering (pp. 1204–1206).
Hayes, J., & Gutierrez, C. (2004). Bipartite graphs as intermediate model for RDF. In Proceedings of the 2004 International Semantic Web Conference (pp. 47–61).
Hayes, P. (2004). RDF Semantics. W3C Recommendation. http://www.w3.org/TR/rdf-mt/
Hu, X., Dang, D., Yao, Y., & Ye, L. (2017). Natural language aggregate query over RDF data. Information Sciences, 454–455, 363–381.
Khadilkar, V., Kantarcioglu, M., Thuraisingham, B. M., & Castagna, P. (2012). Jena-HBase: A distributed, scalable and efficient RDF triple store. In Proceedings of the 2012 International Semantic Web Conference.
Lakshman, A., & Malik, P. (2010). Cassandra: A decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2), 35–40.
Levandoski, J. J., & Mokbel, M. F. (2009). RDF data-centric storage. In Proceedings of the 2009 IEEE International Conference on Web Services (pp. 911–918).
Libkin, L., Reutter, J. L., & Vrgoc, D. (2013). Trial for RDF: Adapting graph query languages for RDF data. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 201–212).
Luo, Y., Picalausa, F., Fletcher, G. H. L., Hidders, J., & Vansummeren, S. (2012). Storing and indexing massive RDF datasets. In R. De Virgilio, F. Guerra, & Y. Velegrakis (Eds.), Semantic Search Over the Web (pp. 31–60). Springer.
Ma, R., Jia, X., Cheng, J., & Angryk, R. A. (2016a). SPARQL queries on RDF with fuzzy constraints and preferences. Journal of Intelligent & Fuzzy Systems, 30(1), 183–195.
Ma, Z., Capretz, M. A., & Yan, L. (2016b). Storing massive resource description framework (RDF) data: A survey. The Knowledge Engineering Review, 31(4), 391–413.
Ma, Z., Lin, X., Yan, L., & Zhao, Z. (2018). RDF keyword search by query computation. Journal of Database Management, 29(4), 1–27.
Ma, Z. M., Capretz, M. A. M., & Yan, L. (2016c). Storing massive resource description framework (RDF) data: A survey. Knowledge Engineering Review, 31(4), 391–413.
Manola, F., & Miller, E. (2004). RDF Primer. W3C Recommendation. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/


Marin, D. (2004). A formalization of RDF (Applications de la logique à la sémantique du Web). Tech. rep. TR/DCC-2006-8, École Polytechnique–Universidad de Chile, Dept. of Computer Science, Universidad de Chile. http://www.dcc.uchile.cl/cgutierr/ftp/draltan.pdf
Matono, A., Amagasa, T., Yoshikawa, M., & Uemura, S. (2005). A path-based relational RDF database. In Proceedings of the 16th Australasian Database Conference (pp. 95–103).
Matono, A., & Kojima, I. (2012). Paragraph tables: A storage scheme based on RDF document structure. In Proceedings of the 23rd International Conference on Database and Expert Systems Applications (pp. 231–247).
McBride, B. (2002). Jena: A semantic web toolkit. IEEE Internet Computing, 6(6), 55–59.
Morsey, M., Lehmann, J., Auer, S., & Ngomo, A. C. N. (2011). DBpedia SPARQL benchmark—Performance assessment with real queries on real data. In Proceedings of the 10th International Semantic Web Conference (pp. 454–469).
Morsey, M., Lehmann, J., Auer, S., & Ngomo, A. C. N. (2012). Usage-centric benchmarking of RDF triple stores. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (pp. 2134–2140).
Mulay, K., & Kumar, P. S. (2012). SPOVC: A scalable RDF store using horizontal partitioning and column-oriented DBMS. In Proceedings of the 4th International Workshop on Semantic Web Information Management.
Munoz, S., Pérez, J., & Gutiérrez, C. (2007). Minimal deductive systems for RDF. In Proceedings of the European Semantic Web Conference. Springer.
Neumann, T., & Weikum, G. (2008). RDF-3X: A RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647–659.
Neumann, T., & Weikum, G. (2010). The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 91–113.
Papailiou, N., Konstantinou, I., Tsoumakos, D., Karras, P., & Koziris, N. (2013). H2RDF+: High-performance distributed joins over large-scale RDF graphs. In Proceedings of the 2013 IEEE International Conference on Big Data (pp. 255–263).
Papailiou, N., Konstantinou, I., Tsoumakos, D., & Koziris, N. (2012). H2RDF: Adaptive query processing on RDF data in the cloud. In Proceedings of the 21st World Wide Web Conference (pp. 397–400).
Pérez, J., Arenas, M., & Gutierrez, C. (2006a). Semantics and complexity of SPARQL. In Proceedings of the International Semantic Web Conference. Springer.
Pérez, J., Arenas, M., & Gutierrez, C. (2006b). Semantics of SPARQL. Technical Report TR/DCC-2006-17, Universidad de Chile.
Pérez, J., Arenas, M., & Gutierrez, C. (2009). Semantics and complexity of SPARQL. ACM Transactions on Database Systems, 34(3), 1–45.
Prud’hommeaux, E., & Seaborne, A. (2008). SPARQL Query Language for RDF. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query/
Przyjaciel-Zablocki, M., Schatzle, A., Hornung, T., Dorner, C., & Lausen, G. (2012). Cascading map-side joins over HBase for scalable join processing. CoRR.
Sakr, S., & Al-Naymat, G. (2009). Relational processing of RDF queries: A survey. SIGMOD Record, 38(4), 23–28.
Sidirourgos, L., Goncalves, R., Kersten, M. L., Nes, N., & Manegold, S. (2008). Column-store support for RDF data management: Not all swans are white. Proceedings of the VLDB Endowment, 1(2), 1553–1563.
Sintek, M., & Kiesel, M. (2006). RDFBroker: A signature-based high-performance RDF store. In Proceedings of the 3rd European Semantic Web Conference (pp. 363–377).
Stein, R., & Zachrias, V. (2010). RDF on cloud number nine. In Proceedings of the 4th Workshop on New Forms of Reasoning for the Semantic Web: Scalable & Dynamic (pp. 11–23).
Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., Rasin, A., Tran, N., & Zdonik, S. (2005). C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (pp. 553–564).


Sun, J. L., & Jin, Q. (2010). Scalable RDF store based on HBase and MapReduce. In Proceedings of the 3rd International Conference on Advanced Computer Theory and Engineering (pp. V1-633–V1-636).
Wang, Y., Du, X. Y., Lu, J. H., & Wang, X. F. (2010). FlexTable: Using a dynamic relation model to store RDF data. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications (pp. 580–594).
Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: Sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment, 1(1), 1008–1019.
Wilkinson, K. (2006). Jena property table implementation. Technical Report HPL-2006-140, HP Labs.
Wilkinson, K., Sayers, C., Kuno, H. A., & Reynolds, D. (2003). Efficient RDF storage and retrieval in Jena2. In Semantic Web and Databases Workshop (pp. 131–150).
Wolff, B. G. J., Fletcher, G. H. L., & Lu, J. J. (2015). An extensible framework for query optimization on TripleT-based RDF stores. In Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference (pp. 190–196).
Wu, G., Li, J., & Wang, K. (2008). System II: A hypergraph-based native RDF repository. In Proceedings of the 17th International Conference on World Wide Web (pp. 1035–1036).
Yan, L., Ma, R., Li, D., & Cheng, J. (2017). RDF approximate queries based on semantic similarity. Computing, 99(5), 481–491.
Yan, Y., Wang, C., Zhou, A., Qian, W., Ma, L., & Pan, Y. (2009). Efficient indices using graph partitioning in RDF triple stores. In Proceedings of the 2009 IEEE 25th International Conference on Data Engineering (pp. 1263–1266).
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8(3), 338–353.
Zeng, K., Yang, J. C., Wang, H. X., Shao, B., & Wang, Z. Y. (2013). A distributed graph engine for web scale RDF data. Proceedings of the VLDB Endowment, 6(4), 265–276.
Zou, L., Mo, J., Chen, L., Özsu, M. T., & Zhao, D. (2011). gStore: Answering SPARQL queries via subgraph matching. Proceedings of the VLDB Endowment, 4(8), 482–493.

Chapter 2

Fuzzy Sets and Fuzzy Database Modeling

2.1 Introduction

Information is often imprecise and uncertain in many real-world applications, and many sources can contribute to the imprecision and uncertainty of data. It has therefore been pointed out that we need to learn how to manage data that is imprecise or uncertain (Dalvi & Suciu, 2007). Unfortunately, classical data models such as relational databases, object-oriented databases, and the RDF data model introduced in Chap. 1 often suffer from an inability to represent and manipulate imprecise and uncertain information. Since the early 1980s, Zadeh’s fuzzy logic (Zadeh, 1965) has been introduced into various database models to enhance the classical models so that uncertain and imprecise information can be represented and manipulated. Over the past 40 years, a significant body of research in fuzzy database modeling has developed, and tremendous gains have been made in this area. Various fuzzy database models have been proposed, and some major issues related to these models have been investigated. Among them, many fruitful results have been achieved in fuzzy relational database modeling (Chen, 1999; Galindo, 2008; Petry, 1996). Furthermore, to model complex objects with uncertainty, much work has been devoted to fuzzy object-oriented database models (de Caluwe, 1998; Ma, 2005a). The fuzzy object-oriented database model is a fuzzy extension of the classical object-oriented database model obtained by introducing the related notions of fuzzy classes, fuzzy generalization/specialization, and fuzzy inheritance relationships (Yan et al., 2012). In addition, with the wide use of XML as the de facto standard for data representation and exchange on the Web, fuzzy XML (Yan et al., 2014; Ma & Yan, 2016) has been attracting more attention. Recent years have witnessed many new application perspectives such as Big Data and artificial intelligence (AI). As a result, some new data models are emerging beyond the traditional ones.
Clearly, it is not enough for the existing fuzzy data models and their extensions to represent the necessary data semantics. It is essential


to invent new fuzzy data models, such as semi-structured and graph data models. Being one kind of special graph data model, RDF, recommended by the W3C, is finding more and more uses in a wide range of semantic data management scenarios. To represent and deal with fuzziness in RDF data, a few efforts have proposed fuzzy RDF models. The elementary construct of the RDF model is a triple of the form (subject, predicate, object), which encodes the binary relation predicate between subject and object, representing a single knowledge fact. The most common fuzzy RDF model attaches a membership degree to each triple (Manolis & Tzitzikas, 2011; Straccia, 2009). Such fuzzy RDF triples represent fuzziness at triple-level granularity, so it is hard to know exactly the fuzziness of a triple’s individual components. To tackle this, a kind of fuzzy RDF model is proposed in Ma et al. (2018), in which fuzziness can appear in a triple’s components. Based on such a fuzzy RDF model with a fine granularity of fuzziness, a few recent efforts investigate fuzzy RDF graph matching (Li et al., 2019a, 2019b, 2019c) and fuzzy RDF graph storage (Fan et al., 2019, 2020; Ma et al., 2018). In this chapter, we mainly introduce several fuzzy database models, including the fuzzy XML model and the fuzzy relational and fuzzy object-oriented database models. These models can be used for mapping to and from fuzzy RDF models in order to realize fuzzy data management in many areas, such as database and Web-based application domains. Before that, we briefly introduce some notions of fuzzy set theory.

2.2 Imperfect Information and Fuzzy Sets

In real-world applications, information is usually vague or ambiguous. Some data are inherently fuzzy since their values are subjective (Ma & Yan, 2008). Consider values representing the degree of satisfaction with a film: clearly, different individuals may have different satisfaction degrees. Uncertainty exists extensively in data- and knowledge-intensive applications, in which fuzzy information processing plays a crucial role. Fuzzy set theory, introduced by Zadeh (1965), is applied to capture the concept of uncertainty in real life. A fuzzy set is defined mathematically by assigning to each possible individual in the universe of discourse a value representing its grade of membership, which corresponds to the degree to which that individual is similar or compatible with the concept represented by the fuzzy set. Fuzzy sets have been extensively used to enhance various database models for managing fuzzy data. Therefore, in this section, we briefly introduce some notions of fuzzy sets and fuzzy graph theory.


2.2.1 Imperfect Information

There are different categories of data quality (or the lack thereof) to be handled. Some efforts try to identify and distinguish different types and sources of imperfect information. According to Parsons (1996), imperfect information can be imprecise, vague, inconsistent, incomplete, and/or uncertain. Bosc and Prade (1993) identify five basic kinds of imperfection: inconsistency, imprecision, vagueness, uncertainty, and ambiguity. In the following, we explain the meanings of these kinds of imperfect information.

(a) Inconsistency stands for a kind of semantic conflict, which means that the same aspect of a real-world entity is irreconcilably represented more than once in data resource(s). For example, the height of one person is recorded as several values with different scales (say, 1.78 m, 178.40 cm, and 5.85 ft).
(b) Imprecision means that we must make a choice from a given range of values without knowing which one should be chosen. This range is basically represented by an interval or a set of values. For example, we do not know the exact height of one person but know that it must be one of several values (say, 1.77 m, 1.78 m, and 1.79 m).
(c) Vagueness has a similar semantics to imprecision but is generally represented with linguistic terms. For example, “between 20 and 30 years old” and “young” for the attribute Age are imprecise and vague values, respectively.
(d) Incompleteness means information for which some data are missing. We have, for example, no idea at all how tall one person is. Generally, incomplete information can be described by null values (Cross, 1996; Cross & Firat, 2000).
(e) Uncertainty means that we apportion some (maybe not all) of our belief to a value or a group of values, which is related to the degree of truth. For example, a possibility degree of 95% is assigned to the height value (say, 1.78 m) of one person.
Note that this book concentrates on subjective uncertainty described with possibility theory rather than stochastic uncertainty described with probability theory.
(f) Ambiguity means that some elements of a model lack complete semantics, which can lead to several possible interpretations. For example, a length value 3 without the necessary semantics may be interpreted as a time length, a distance, and so on. If it is a time length, it may be interpreted as 3 days, 3 h, 3 min, or 3 s.

In general, several different kinds of imperfect information can co-exist with respect to the same piece of information. In addition, imprecise values generally denote a set of values of the form {ai1, ai2, …, aim} or an interval [ai1, ai2] for a discrete or continuous universe of discourse, respectively, meaning that exactly one of the values is the true value for a single-valued attribute, or that at least one of the values is the true value for a multivalued attribute. So, imprecise information here has two interpretations: disjunctive information and conjunctive information. Null values, which were originally called incomplete information, have several possible interpretations: (a) “existing but unknown”, (b) “nonexisting” or


“inapplicable”, (c) “no information”, and (d) “open null value” (Gottlob & Zicari, 1988), which means that the value may not exist, may be exactly one unknown value, or may be several unknown values. An imprecise value can be considered a particular case of the null value with the semantics of “existing but unknown” (i.e., an applicable null value), where the range of values that an imprecise value takes is restricted to a given set or interval of values, while the range of values that an applicable null value takes corresponds to the whole universe of discourse. The notion of a partial value is illustrated as follows (Grant, 1979). A partial value on a universe of discourse U corresponds to a finite set of possible values in which exactly one of the values in the set is the true value, denoted by {a1, a2, …, am} for discrete U or [a1, an] for continuous U, in which {a1, a2, …, am} ⊆ U or [a1, an] ⊆ U. Let η be a partial value; then sub(η) and sup(η) are used to represent the minimum and maximum in the set. Note that crisp data can also be viewed as special cases of partial values: a crisp datum on a discrete universe of discourse can be represented in the form {p}, and a crisp datum on a continuous universe of discourse can be represented in the form [p, p]. Moreover, a partial value not containing any element is called an empty partial value, denoted by ⊥. In fact, the symbol ⊥ means inapplicable missing data (Codd, 1986, 1987). Null values, partial values, and crisp values are thus represented with a uniform format.
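This uniform treatment of null, partial, and crisp values can be sketched in code. The following is an illustrative sketch of our own (the class and method names are assumptions, not from the book), with crisp values and ⊥ as the boundary cases of a partial value:

```python
# Illustrative sketch: a uniform representation of partial values,
# crisp values, and the empty partial value ⊥ on a discrete universe.
class PartialValue:
    """A finite set of candidate values, exactly one of which is true."""

    def __init__(self, candidates):
        self.candidates = set(candidates)

    def is_empty(self):
        # ⊥: an inapplicable missing datum (no candidate exists)
        return not self.candidates

    def is_crisp(self):
        # crisp data as the special case {p}
        return len(self.candidates) == 1

    def sub(self):
        # minimum candidate, as sub(η) in the text
        return min(self.candidates)

    def sup(self):
        # maximum candidate, as sup(η) in the text
        return max(self.candidates)


height = PartialValue([1.77, 1.78, 1.79])  # imprecise height in metres
crisp = PartialValue([1.78])               # crisp value {1.78}
bottom = PartialValue([])                  # the empty partial value ⊥
```

Here the imprecise height from the example above is just a three-element partial value, while the crisp reading degenerates to a singleton.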

2.2.2 Fuzzy Sets

Fuzzy sets were originally presented by Zadeh (1965). Since then, fuzzy sets have been infiltrating almost all set-theory-based branches of pure and applied mathematics. This has resulted in a vast number of real applications across a broad realm of domains and disciplines. Over the years, many of the existing approaches to dealing with imprecision and uncertainty have been based on the theory of fuzzy sets. Let U be a universe of discourse. A fuzzy value on U is characterized by a fuzzy set F in U. A membership function μF: U → [0, 1] is defined for the fuzzy set F, where μF(u), for each u ∈ U, denotes the degree of membership of u in the fuzzy set F. For example, μF(u) = 0.8 means that u is “likely” to be an element of F to a degree of 0.8. For ease of representation, a fuzzy set F over universe U is organized into a set of ordered pairs: F = {μF(u1)/u1, μF(u2)/u2, …, μF(un)/un}. When the membership function μF(u) above is interpreted as a measure of the possibility that a variable X has the value u, where X takes values


in U, a fuzzy value is described by a possibility distribution πX (Zadeh, 1978):

πX = {πX(u1)/u1, πX(u2)/u2, …, πX(un)/un}

Here, πX(ui), ui ∈ U, denotes the possibility that ui is true. In addition, fuzzy data can be represented by similarity relations on domain elements (Buckles & Petry, 1982), in which the fuzziness comes from the similarity relations between two values in a universe of discourse, not from the status of an object itself. Similarity relations are thus used to describe the degree of similarity of two values from the same universe of discourse. A similarity relation Sim on the universe of discourse U is a mapping U × U → [0, 1] such that:
(i) for ∀x ∈ U, Sim(x, x) = 1 (reflexivity);
(ii) for ∀x, y ∈ U, Sim(x, y) = Sim(y, x) (symmetry); and
(iii) for ∀x, y, z ∈ U, Sim(x, z) ≥ maxy(min(Sim(x, y), Sim(y, z))) (transitivity).

Moreover, the following notions related to fuzzy sets can be defined.

Support: The set of the elements that have non-zero degrees of membership in F is called the support of F, denoted by supp(F) = {u | u ∈ U and μF(u) > 0}.
Kernel: The set of the elements that completely belong to F is called the kernel of F, denoted by ker(F) = {u | u ∈ U and μF(u) = 1}.
Cut: The set of the elements whose degrees of membership in F are greater than (greater than or equal to) α, where 0 ≤ α < 1 (0 < α ≤ 1), is called the strong (weak) α-cut of F, denoted by Fα+ = {u | u ∈ U and μF(u) > α} and Fα = {u | u ∈ U and μF(u) ≥ α}, respectively.

In addition, to manipulate fuzzy sets and possibility distributions, several common set operations are defined. The usual set operations (such as union, intersection, and complementation) have been extended to deal with fuzzy sets (Zadeh, 1965). Let A and B be fuzzy sets on the same universe of discourse U with membership functions μA and μB, respectively. Then we have the following.

Union.
The union of fuzzy sets A and B, denoted A ∪ B, is a fuzzy set on U with the membership function μA∪B: U → [0, 1], where ∀u ∈ U, μA∪B(u) = max(μA(u), μB(u)).


Intersection. The intersection of fuzzy sets A and B, denoted A ∩ B, is a fuzzy set on U with the membership function μA∩B: U → [0, 1], where ∀u ∈ U, μA∩B(u) = min(μA(u), μB(u)).

Complementation. The complementation of fuzzy set A, denoted by Ā, is a fuzzy set on U with the membership function μĀ: U → [0, 1], where ∀u ∈ U, μĀ(u) = 1 − μA(u).

Based on these definitions, the difference of the fuzzy sets B and A can be defined as B − A = B ∩ Ā. Also, most of the properties that hold for classical set operations, such as De Morgan’s laws, have been shown to hold for fuzzy sets. The only law of ordinary set theory that is no longer true is the law of the excluded middle, i.e., A ∩ Ā ≠ ∅ and A ∪ Ā ≠ U. Let A, B, and C be fuzzy sets in a universe of discourse U. Then the operations on fuzzy sets satisfy the following laws:

• Commutativity laws: A ∪ B = B ∪ A, A ∩ B = B ∩ A
• Associativity laws: (A ∪ B) ∪ C = A ∪ (B ∪ C), (A ∩ B) ∩ C = A ∩ (B ∩ C)
• Distributivity laws: (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C), (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)
• Absorption laws: (A ∪ B) ∩ A = A, (A ∩ B) ∪ A = A
• Idempotency laws: A ∪ A = A, A ∩ A = A
• De Morgan laws: the complement of A ∪ B is Ā ∩ B̄, and the complement of A ∩ B is Ā ∪ B̄

Given two fuzzy sets A and B in U, B is a fuzzy subset of A, denoted by B ⊆ A, if μB(u) ≤ μA(u) for all u ∈ U. Two fuzzy sets A and B are said to be equal if A ⊆ B and B ⊆ A. Let U = U1 × U2 × … × Un be the Cartesian product of n universes and A1, A2, …, An be fuzzy sets in U1, U2, …, Un, respectively. The Cartesian product A1 × A2 × … × An is defined to be a fuzzy subset of U1 × U2 × … × Un, where μA1×…×An(u1, …, un) = min(μA1(u1), …, μAn(un)) and ui ∈ Ui, i = 1, …, n.
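Over a finite universe, these operations and the support, kernel, and α-cut notions admit a direct implementation. The following is an illustrative sketch of our own (fuzzy sets are represented as plain dicts mapping u → μ(u); the function names are assumptions), and it also exhibits the failure of the law of the excluded middle:

```python
# Fuzzy sets over a finite universe as dicts u -> membership degree.
def union(A, B):
    # max combination, defined over the union of the carriers
    return {u: max(A.get(u, 0.0), B.get(u, 0.0)) for u in set(A) | set(B)}

def intersection(A, B):
    # min combination
    return {u: min(A.get(u, 0.0), B.get(u, 0.0)) for u in set(A) | set(B)}

def complement(A, universe):
    # 1 - membership, over an explicit universe of discourse
    return {u: 1.0 - A.get(u, 0.0) for u in universe}

def support(A):
    return {u for u, m in A.items() if m > 0}

def kernel(A):
    return {u for u, m in A.items() if m == 1.0}

def alpha_cut(A, alpha, strong=False):
    # strong cut uses >, weak cut uses >=
    return {u for u, m in A.items() if (m > alpha if strong else m >= alpha)}


A = {"a": 0.8, "b": 0.3}
B = {"b": 0.6, "c": 1.0}
U = {"a", "b", "c"}

# Excluded middle fails: A intersected with its complement is non-empty.
middle = intersection(A, complement(A, U))
```

With these data, `middle` is {"a": 0.2, "b": 0.3, "c": 0.0}, whose support {"a", "b"} is non-empty, illustrating A ∩ Ā ≠ ∅.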

2.2 Imperfect Information and Fuzzy Sets

39

2.2.3 Fuzzy Graph

A graph represents a particular relationship between the elements of a set V. It gives an idea about the extent of the relationship between any two elements of V. We can model this with a weighted graph if proper weights are known. But in most situations the weights may not be known, and the relationships are ‘fuzzy’ in a natural sense; hence a fuzzy relation can deal with the situation in a better way. As an example, if V represents certain locations and a network of roads is to be constructed between the elements of V, then the costs of constructing the links are fuzzy. But the costs can be compared, to some extent, using the terrain and local factors, and can be modeled as fuzzy relations. Thus, fuzzy graph models are more helpful and realistic in natural situations. Kaufman (1973) gave the first definition of a fuzzy graph, but it was Rosenfeld (1975) and Yeh and Bang (1975b) who laid the foundations of fuzzy graph theory. Rosenfeld introduced fuzzy analogs of several basic graph-theoretic concepts, including subgraphs, paths, connectedness, cliques, bridges, cut vertices, forests, and trees. Yeh and Bang (1975a) independently introduced many connectivity concepts, including vertex and edge connectivity in fuzzy graphs, and applied fuzzy graphs for the first time to the clustering of data. In this section, we discuss the fundamentals of fuzzy graph theory (Sunitha, 2001) and provide formal definitions, basic concepts, and properties of fuzzy graphs.

Let V be a non-empty set. A fuzzy graph G = (σ, μ) is a pair of functions σ: V → [0, 1] and μ: V × V → [0, 1] which satisfies μ(u, v) ≤ σ(u) ∧ σ(v) for all (u, v) ∈ V × V, where ∧ denotes the minimum. The fuzzy set σ is called the fuzzy vertex set of G and μ the fuzzy edge set of G. Clearly, μ is a fuzzy relation on σ. We denote the underlying (crisp) graph of G by G* = (σ*, μ*), where σ* is the (nonempty) set of nodes {u ∈ V | σ(u) > 0} and μ* is the edge set E = {(u, v) ∈ V × V | μ(u, v) > 0}.
Note that a crisp graph (V, E) is a special case of a fuzzy graph with each vertex and edge of (V, E) having degree of membership 1. We need not consider loops, and we assume that μ is reflexive. Also, the underlying set V is assumed to be finite, and μ can be chosen in any manner so as to satisfy the definition of a fuzzy graph.

Example 2.1 Let V = {a, b, c}. Define the fuzzy set σ on V as σ(a) = 0.5, σ(b) = 1, and σ(c) = 0.8. Define a fuzzy set μ of E such that μ(ab) = 0.5, μ(bc) = 0.7, and μ(ac) = 0.1. Then μ(x, y) ≤ σ(x) ∧ σ(y) for all x, y ∈ V. Thus, G = (σ, μ) is a fuzzy graph. If we redefine μ(ab) = 0.6, then it is no longer a fuzzy graph.

Let G = (σ, μ) be a fuzzy graph. Then a fuzzy graph G' = (σ', μ') is called a partial fuzzy subgraph of G if σ' ⊆ σ and μ' ⊆ μ. Similarly, the fuzzy graph G' = (σ', μ') is called the fuzzy subgraph of G induced by P if P ⊆ V, σ'(u) = σ(u) for every u ∈ P, and μ'(e) = μ(e) for every edge e with both endpoints in P. We write

to denote the fuzzy subgraph induced by P. Example 2.2 Let G1 = (σ, μ), where σ * = {a, b, c} and μ* = {ab, bc} with σ (a) = 0.4, σ (b) = 0.8, σ (c) = 0.5, μ(ab) = 0.3 and μ(bc) = 0.2. Then clearly G1 is a


partial fuzzy subgraph of the fuzzy graph in Example 2.1. Also, if P = {a, b} and H = (σ, μ), where σ(a) = 0.5, σ(b) = 1, and μ(ab) = 0.5, then H is the fuzzy subgraph of G in Example 2.1 induced by P.

Let G = (σ, μ) be a fuzzy graph. Then a partial fuzzy subgraph G' = (σ', μ') of G is said to span G if σ' = σ and μ' ⊆ μ; that is, if σ'(u) = σ(u) for every u ∈ V and μ'(e) ≤ μ(e) for every e ∈ E. In this case, we call G' = (σ', μ') a spanning fuzzy subgraph of G. In fact, a fuzzy subgraph G' = (σ', μ') of a fuzzy graph G = (σ, μ) induced by a subset P of V is a particular partial fuzzy subgraph of G: take σ'(u) = σ(u) for all u ∈ P and 0 for all u ∉ P, and similarly take μ'(u1, u2) = μ(u1, u2) if (u1, u2) is an edge involving only elements of P, and 0 otherwise.

Let G = (σ, μ) and G' = (σ', μ') be fuzzy graphs with underlying vertex sets V and V'. A homomorphism of fuzzy graphs (Holub & Melichar, 1998) h: G → G' is a map h: V → V' which satisfies σ(x) ≤ σ'(h(x)) for all x ∈ V and μ(x, y) ≤ μ'(h(x), h(y)) for all x, y ∈ V.

We mainly discuss the concepts of fuzzy paths and fuzzy bridges in this subsection. Most of the results are due to the works of Sunitha and Vijayakumar (2005) and Mathew et al. (2018). Let G = (σ, μ) be a fuzzy graph. If μ(x, y) > 0, then x and y are called neighbors, and x and y lie on the edge e = (x, y). A path P in a fuzzy graph G = (σ, μ) is a sequence of distinct vertices v0, v1, v2, …, vn such that μ(vi−1, vi) > 0, 1 ≤ i ≤ n. Here n is called the length of the path, and the consecutive pairs (vi−1, vi) are called the arcs of the path. The diameter of x, y ∈ V, written diam(x, y), is the length of the longest path joining x to y. The strength of a path P is defined to be ∧ni=1 μ(vi−1, vi); in words, the strength of a path is the weight of its weakest edge. We denote the strength of a path P by d(P).
The strength of connectedness between two vertices x and y is defined as the maximum of the strengths of all paths between x and y and is denoted by μ∞ (x, y). A strongest path joining any two vertices x, y has strength μ∞ (x, y). Two vertices that are joined by a path are called connected. It follows that this notion of connectedness is an equivalence relation. The equivalence classes of vertices under this equivalence relation are called connected components of the given fuzzy graph. They are just its maximal connected partial fuzzy subgraphs.
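The path-strength and connectedness notions above have a direct computational reading. The following is an illustrative sketch of our own (assuming the data of Example 2.1; μ∞ is computed with a Floyd–Warshall-style max–min closure, and the function names are invented):

```python
def path_strength(mu, path):
    """Strength of a path: the weight of its weakest edge (min over arcs)."""
    return min(mu[frozenset(e)] for e in zip(path, path[1:]))

def connectedness(vertices, mu):
    """mu_inf[x][y]: the maximum over all paths of the minimum edge weight."""
    s = {x: {y: mu.get(frozenset((x, y)), 0.0) for y in vertices}
         for x in vertices}
    for k in vertices:                 # max-min transitive closure
        for x in vertices:
            for y in vertices:
                s[x][y] = max(s[x][y], min(s[x][k], s[k][y]))
    return s


# Data of Example 2.1 (undirected edges stored as frozensets).
sigma = {"a": 0.5, "b": 1.0, "c": 0.8}
mu = {frozenset(("a", "b")): 0.5,
      frozenset(("b", "c")): 0.7,
      frozenset(("a", "c")): 0.1}

# Fuzzy-graph condition: every edge weight <= min of its endpoint weights.
assert all(w <= min(sigma[u] for u in e) for e, w in mu.items())

s = connectedness(list(sigma), mu)
# The path a-b-c has strength min(0.5, 0.7) = 0.5, beating the direct
# edge ac of weight 0.1, so mu_inf(a, c) = 0.5.
```

The closure illustrates why μ∞(a, c) exceeds μ(a, c): a detour through b is a stronger path than the direct edge.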

2.3 Fuzzy Relational Database Models

In order to manage fuzzy data in databases, fuzzy set theory has been extensively applied to extend various database models, resulting in numerous contributions, mainly with respect to the popular relational model or some related form of it. In general, several basic approaches can be distinguished: (i) fuzzy relational databases based on possibility distributions (Chaudhry et al., 1999; Prade & Testemale, 1984; Umano & Fukami, 1994); (ii) those based on the use of a similarity relation (Buckles & Petry, 1982), proximity relation (De et al., 2001; Shenoi & Melton, 1999), resemblance relation (Rundensteiner & Bic, 1992), or fuzzy


relation (Raju & Majumdar, 1988); and (iii) extensions that combine possibility distributions and similarity (proximity or resemblance) relations (Chen et al., 1992; Ma & Mili, 2002; Ma et al., 2000). Some major questions have been discussed and answered in the literature on fuzzy relational databases, including representations and models, semantic measures and data redundancies, query and data processing, data dependencies and normalization, and implementation. For a comprehensive review of what has been done in the development of fuzzy relational databases, please refer to Chen (1999), Ma and Yan (2008), Ma (2005b), Petry (1996), and Yazici and George (1999). In this section, we briefly introduce some basic notions of fuzzy relational databases based on possibility distributions.

A relation is a two-dimensional table whose rows are called tuples and whose columns correspond to attributes. So, a relation is a set of tuples, and a tuple consists of attribute values. A relation has a relational schema, which is a set of attributes. Each attribute corresponds to a range of values that the attribute can take, and this range is called the domain of the attribute. Basically, a fuzzy relational database (FRDB) is based on the notions of fuzzy relational schema, fuzzy relational instance, tuple, key, and constraints, which are introduced briefly as follows:

• A fuzzy relational database consists of a set of fuzzy relational schemas and a set of fuzzy relational instances (i.e., simply fuzzy relations).
• The set of fuzzy relational schemas specifies the structure of the data held in a database. A fuzzy relational schema consists of a fixed set of attributes with associated domains. The information of a domain is implied in the form of schemas, attributes, keys, and referential integrity constraints.
• The set of fuzzy relations, which is considered to be an instance of the set of fuzzy relational schemas, reflects the real state of a database. Formally, a fuzzy relation is a two-dimensional array of rows and columns, where each column represents an attribute and each row represents a tuple.
• Each tuple in a table denotes an individual in the real world identified uniquely by the primary key, and a foreign key is used to ensure the data integrity of a table. A column (or columns) in a table that makes a row distinguishable from other rows in the same table is called the primary key. A column (or columns) in a table that draws its values from a primary key column in another table is called the foreign key. As is generally assumed in the literature, we assume that the primary key attributes are always crisp and all fuzzy relations are in the third normal form.
• An integrity constraint in a schema is a predicate over relations expressing a constraint; by far the most used integrity constraint is the referential integrity constraint. A referential integrity constraint involves two sets of attributes S1 and S2 in two relations R1 and R2, such that one of the sets (say S1) is a key for one of the relations (called the primary key). The other set is called a foreign key if R2[S2] is a subset of R1[S1]. Referential integrity constraints are the glue that holds the relations in a database together.

In summary, in a fuzzy relational database, the structure of the data is represented by a set of fuzzy relational schemas, and data are stored in fuzzy relations (i.e., tables). Each table contains rows (i.e., tuples) and columns (i.e., attributes). Each tuple is identified uniquely by the primary key. The relationships among relations are represented by the referential integrity constraints, i.e., foreign keys. Moreover, two types of fuzziness are considered in fuzzy relational databases: one is the fuzziness of attribute values (i.e., attribute values may be fuzzy), which may be represented by possibility distributions; the other is the fuzziness of a tuple being a member of the corresponding relation, which is represented by a membership degree associated with the tuple.

Formally, a fuzzy relational database FRDB = (FS, FR) consists of a set of fuzzy relational schemas FS and a set of fuzzy relations FR, where:

• Each fuzzy relational schema in FS can be represented formally as FR (A1/D1, A2/D2, …, An/Dn, μFR/DFR), which denotes that a fuzzy relation FR has attributes A1, A2, …, An and μFR with associated data types D1, D2, …, Dn and DFR. Here, μFR is an additional attribute representing the membership degree of a tuple to the fuzzy relation.
• Each fuzzy relation FR on a fuzzy relational schema FR (A1/D1, A2/D2, …, An/Dn, μFR/DFR) is a subset of the Cartesian product Dom (A1) × Dom (A2) × … × Dom (An) × Dom (μFR), where Dom (Ai) may be a fuzzy subset or even a set of fuzzy subsets and Dom (μFR) = (0, 1]. Here, Dom (Ai) denotes the domain of attribute Ai, and each element of the domain satisfies the constraint of the datatype Di. Formally, each tuple in FR has the form t = (a1, a2, …, an, μFR), where the value ai of an attribute Ai may be represented by a possibility distribution πAi, and μFR ∈ (0, 1].

Moreover, a resemblance relation Res on Dom (Ai) is a mapping Dom (Ai) × Dom (Ai) → [0, 1] such that

(i) for all x in Dom (Ai), Res (x, x) = 1 (reflexivity);
(ii) for all x, y in Dom (Ai), Res (x, y) = Res (y, x) (symmetry).

To provide some intuition on fuzzy relational databases, we show an example.
The following gives a fuzzy relational database modeling part of the reality at a company, including fuzzy relational schemas in Table 2.1 and fuzzy relations in Table 2.2. The details are as follows:

• The underlined attribute stands for the primary key (PK). A foreign key (FK) is followed by the parenthesized relation it references, called the referenced relation. A relation can have several candidate keys, from which one primary key, denoted PK, is chosen.
• An 'f' next to an attribute means that the attribute is fuzzy.
• In Table 2.1, there are the inheritance relationships Chief-Leader "is-a" Leader and Young-Employee "is-a" Employee. There is a 1-many relationship between Department and Young-Employee. The relation Supervise is a relationship relation, and there is a many-many relationship between Chief-Leader and Young-Employee.
• Note that a relation is different from a relationship. A relation is essentially a table, while a relationship is a way to correlate, join, or associate two tables.


Table 2.1 The fuzzy relational schemas of a fuzzy relational database

Relation name  | Attribute and datatype                                                                                    | Foreign key and referenced relation
Leader         | leaID (String), lNumber (String), μFR (Real)                                                              | no
Employee       | empID (String), eNumber (String), μFR (Real)                                                              | no
Chief-Leader   | leaID (String), clName (String), f_clAge (Integer), μFR (Real)                                            | leaID (Leader (leaID))
Young-Employee | empID (String), yeName (String), f_yeAge (Integer), f_yeSalary (Integer), dep_ID (String), μFR (Real)     | empID (Employee (empID)); dep_ID (Department (depID))
Supervise      | supID (String), lea_ID (String), emp_ID (String), μFR (Real)                                              | lea_ID (Chief-Leader (leaID)); emp_ID (Young-Employee (empID))
Department     | depID (String), dName (String), μFR (Real)                                                                | no
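To make the schemas concrete, the following sketch represents a fuzzy relation in Python: each tuple carries possibility distributions (dicts mapping a value to its possibility) for fuzzy attributes and the membership degree μFR as its last component, using the Young-Employee data of Table 2.2. The function name and the tuple encoding are illustrative, not part of the model.

```python
# A fuzzy attribute value is either a crisp value or a possibility
# distribution, represented here as a dict {value: possibility}.
young_employee = [
    # (empID, yeName, f_yeAge, f_yeSalary, dep_ID, mu_FR)
    ("E001", "John", {24: 0.7, 25: 0.9}, {2000: 0.3, 3000: 0.4}, "D001", 0.75),
    ("E002", "Mary", 23, {4000: 0.5, 4500: 0.7, 5000: 1.0}, "D003", 0.85),
]

def select_by_membership(relation, threshold):
    """Keep only the tuples that belong to the relation with degree >= threshold."""
    return [t for t in relation if t[-1] >= threshold]

confident = select_by_membership(young_employee, 0.8)
print([t[0] for t in confident])  # only Mary's tuple has mu_FR >= 0.8
```

A query with threshold 0.8 keeps only tuples whose membership in the relation is at least 0.8, here Mary's tuple.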

2.4 Fuzzy Object-Oriented Database Models

Some real-world applications (e.g., CAD/CAM, multimedia, and GIS) characteristically require the modeling and manipulation of complex objects and semantic relationships, and it has been shown that the object-oriented paradigm lends itself extremely well to these requirements. Since the classical relational database model and its fuzzy extensions do not satisfy the need of modeling complex objects with imprecision and uncertainty, much research has concentrated on fuzzy object-oriented database models in order to deal with complex objects and uncertain data together. Zicari and Milano (1990) first introduced incomplete information, namely null values, where incomplete schemas and incomplete objects can be distinguished. From then on, the incorporation of imprecise and uncertain information in object-oriented databases has increasingly received attention. A fuzzy object-oriented database model was defined in Bordogna and Pasi (2001) based on the extension of a graph-based object model. Based on similarity relationships, uncertainty management issues in the object-oriented database model were discussed in George et al. (1996). Based on possibility theory, vagueness and uncertainty were represented in class hierarchies in Dubois et al. (1991). In more detail, also based on possibility distribution theory, Ma et al. (2004) introduced fuzzy object-oriented database models, in which some major notions such as objects, classes, object-class relationships, and subclass/superclass relationships were extended under a fuzzy information environment. Moreover, other fuzzy extensions of object-oriented databases were developed. In Marín et al. (2000, 2001), fuzzy types were added to fuzzy object-oriented databases to manage vague structures. The fuzzy relationships and fuzzy behavior in fuzzy object-oriented database models were discussed in Cross (2001) and Gyseghem and Caluwe (1995).
Several intelligent fuzzy object-oriented database architectures were proposed in Koyuncu and Yazici (2003), Ndouse (1997),


Table 2.2 The fuzzy relations of a fuzzy relational database

Leader
leaID | lNumber | μFR
L001  | 001     | 0.7
L002  | 002     | 0.9
L003  | 003     | 0.8

Employee
empID | eNumber | μFR
E001  | 001     | 0.8
E002  | 002     | 0.9

Chief-Leader
leaID | clName | f_clAge          | μFR
L001  | Chris  | {35/0.8, 39/0.9} | 0.65
L003  | Billy  | 37               | 0.7

Young-Employee
empID | yeName | f_yeAge          | f_yeSalary                    | dep_ID | μFR
E001  | John   | {24/0.7, 25/0.9} | {2000/0.3, 3000/0.4}          | D001   | 0.75
E002  | Mary   | 23               | {4000/0.5, 4500/0.7, 5000/1.0} | D003   | 0.85

Department
depID | dName   | μFR
D001  | HR      | 0.8
D002  | Finance | 0.9
D003  | Sales   | 0.7

Supervise
supID | lea_ID | emp_ID | μFR
S001  | L001   | E001   | 0.78
S002  | L001   | E002   | 0.8
S003  | L003   | E002   | 0.9

Ozgur et al. (2009). Other efforts on how to model fuzziness and uncertainty in object-oriented database models were made in Lee et al. (1999), Majumdar et al. (2002), and Umano et al. (1998). Fuzzy and probabilistic object bases (Cao & Rossiter, 2003; Nam et al., 2007), fuzzy deductive object-oriented databases (Yazici & Koyuncu, 1997), and fuzzy object-relational databases (Cubero et al., 2004) were also developed. In addition, an object-oriented database modeling technique based on level-2 fuzzy sets was proposed in de Tré and de Caluwe (2003), where the authors also discussed how the Object Data Management Group (ODMG) data model can be generalized to handle fuzzy data in a more advantageous way. Also, other efforts have been devoted to the establishment of a consistent framework for a fuzzy object-oriented database model based on the standard for the ODMG object data model (Cross et al., 1997). More recently, how to manage fuzziness on conventional object-oriented platforms was introduced in Berzal et al. (2007). Yan and Ma (2012) proposed an approach for comparing entities with fuzzy data types in fuzzy object-oriented databases. Yan et al. (2012) investigated the algebraic operations in fuzzy object-oriented databases, discussed fuzzy querying strategies, and gave the form of SQL-like fuzzy querying for fuzzy object-oriented databases. In this section, the basic notions of fuzzy object-oriented database (FOODB) models, including fuzzy objects, fuzzy classes, fuzzy inheritance, and algebraic operations, are introduced.

2.4.1 Fuzzy Objects

Objects model real-world entities or abstract concepts. Objects have properties that may be attributes of the object itself or relationships (also known as associations) between the object and one or more other objects. An object may be fuzzy because of a lack of information. For example, an object representing a part in preliminary design may be made of stainless steel, moulded steel, or alloy steel, each of which may be connected with a possibility, say, 0.7, 0.5, and 0.9, respectively. Formally, objects that have at least one attribute whose value is a fuzzy set are fuzzy objects.
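As a minimal sketch (the class and attribute names are illustrative, not from the model itself), the part example above can be encoded as an object whose material attribute is a possibility distribution:

```python
class FuzzyObject:
    """An object whose attribute values may be possibility distributions (dicts)."""
    def __init__(self, **attributes):
        self.attributes = attributes

    def is_fuzzy(self):
        # Fuzzy object: at least one attribute value is a fuzzy set.
        return any(isinstance(v, dict) for v in self.attributes.values())

# A part in preliminary design: its material is not yet certain.
part = FuzzyObject(
    part_id="P001",
    material={"stainless steel": 0.7, "moulded steel": 0.5, "alloy steel": 0.9},
)
print(part.is_fuzzy())  # True: the material attribute is a possibility distribution
```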

2.4.2 Fuzzy Classes

The fuzzy classes in fuzzy object-oriented databases are similar to the notion of fuzzy classes in fuzzy UML data models. Objects having the same properties are gathered into classes that are organized into hierarchies. Theoretically, a class can be considered from two different viewpoints (Dubois et al., 1991): (a) an extensional class, where the class is defined by the list of its object instances, and (b) an intensional class, where the class is defined by a set of attributes and their admissible values. In addition, a subclass defined from its superclass by means of the inheritance mechanism in the object-oriented database (OODB) can be seen as a special case of (b) above. A class may therefore be fuzzy for several reasons. First, some objects are fuzzy ones with similar properties, and a class defined by these objects may be fuzzy; these objects belong to the class with a membership degree in [0, 1]. Second, when a class is intensionally defined, the domain of an attribute may be fuzzy, so a fuzzy class is formed. For example, a class Old equipment is fuzzy because the domain of its attribute Using period is a set of fuzzy values such as long, very long, and about 20 years. Third, the subclass produced from a fuzzy class by means of specialization, and the superclass produced from some classes (at least one of which is fuzzy) by means of generalization, are also fuzzy.


The main difference between fuzzy classes and crisp classes is that the boundaries of fuzzy classes are imprecise. The imprecision in the class boundaries is caused by the imprecision of the values in the attribute domains. In the FOODB, classes are fuzzy because their attribute domains are fuzzy. Since a class or an object may be fuzzy, the issue arises that an object belongs to a class only fuzzily. Similarly, because of class fuzziness, a class is a subclass of another class with a membership degree in [0, 1]. In the OODB, the above-mentioned relationships are certain. Therefore, the evaluation of fuzzy object-class relationships and fuzzy inheritance hierarchies is the core of information modeling in the FOODB.

2.4.3 Fuzzy Object-Class Relationships

In the FOODB, the following four situations can be distinguished for object-class relationships.

(a) Crisp class and crisp object. This situation is the same as in the OODB, where the object either certainly belongs or certainly does not belong to the class. For example, for the class Vehicle, the object Car belongs to it and the object Computer does not.
(b) Crisp class and fuzzy object. Although the class is precisely defined and has a precise boundary, an object is fuzzy since its attribute value(s) may be fuzzy. In this situation, the object may be related to the class with a degree in [0, 1]. For example, the object whose position attribute may be graduate, research assistant, or research assistant professor is related to the class Faculty.
(c) Fuzzy class and crisp object. As in case (b), the object may belong to the class with a membership degree in [0, 1]. For example, a Ph.D. student may belong to the class Young student.
(d) Fuzzy class and fuzzy object. In this situation, the object also belongs to the class with a membership degree in [0, 1].

The object-class relationships in (b), (c), and (d) above are called fuzzy object-class relationships. In fact, the situation in (a) can be seen as a special case of fuzzy object-class relationships, where the membership degree of the object to the class is one. Clearly, estimating the membership of an object to a class is crucial for fuzzy object-class relationships when classes are instantiated. In the OODB, determining if an object belongs to a class depends on whether its attribute values are respectively included in the corresponding attribute domains of the class. Similarly, in order to calculate the membership degree of an object to a class in a fuzzy object-class relationship, it is necessary to evaluate the degrees to which the attribute domains of the class include the attribute values of the object.
However, it should be noted that in a fuzzy object-class relationship, the inclusion degree of object values with respect to the class domains alone is not sufficient for the evaluation of the membership degree of an object to the class. The attributes play different roles in the definition and identification of a class: some may be dominant and some not. Therefore, a weight w is assigned to each attribute of the class by the designer according to its importance. The membership degree of an object to the class in a fuzzy object-class relationship should then be calculated using the inclusion degrees of object values with respect to the class domains together with the weights of attributes.

Let C be a class with attributes {A1, A2, …, An}, o be an object on attribute set {A1, A2, …, An}, and o (Ai) denote the attribute value of o on Ai (1 ≤ i ≤ n). In C, each attribute Ai is connected with a domain denoted dom (Ai). The inclusion degree of o (Ai) with respect to dom (Ai) is denoted ID (dom (Ai), o (Ai)). In the following, we investigate the evaluation of ID (dom (Ai), o (Ai)). As we know, dom (Ai) is a set of crisp values in the OODB and may be a set of fuzzy subsets in fuzzy databases. Therefore, in a uniform OODB for crisp and fuzzy information modeling, dom (Ai) should be the union of these two components, dom (Ai) = cdom (Ai) ∪ fdom (Ai), where cdom (Ai) and fdom (Ai) respectively denote the sets of crisp values and fuzzy subsets. On the other hand, o (Ai) may be a crisp value or a fuzzy value. The following cases can be identified for evaluating ID (dom (Ai), o (Ai)).

Case 1: o (Ai) is a fuzzy value. Let fdom (Ai) = {f1, f2, …, fm}, where fj (1 ≤ j ≤ m) is a fuzzy value, and cdom (Ai) = {c1, c2, …, ck}, where cl (1 ≤ l ≤ k) is a crisp value. Then

ID (dom (Ai), o (Ai)) = max (ID (cdom (Ai), o (Ai)), ID (fdom (Ai), o (Ai)))
                      = max (SID ({1.0/c1, 1.0/c2, …, 1.0/ck}, o (Ai)), max_j (SID (fj, o (Ai)))),

where SID (x, y) is used to calculate the degree to which fuzzy value x includes fuzzy value y.

Case 2: o (Ai) is a crisp value. Then ID (dom (Ai), o (Ai)) = 1 if o (Ai) ∈ cdom (Ai); otherwise ID (dom (Ai), o (Ai)) = ID (fdom (Ai), {1.0/o (Ai)}).

Consider a fuzzy class Young students with attributes Age and Height, and two objects o1 and o2.
Assume cdom (Age) = {5, 6, …, 20}, fdom (Age) = {{1.0/20, 1.0/21, 0.7/22, 0.5/23}, {0.4/22, 0.6/23, 0.8/24, 1.0/25, 0.9/26, 0.8/27, 0.6/28}, {0.6/27, 0.8/28, 0.9/29, 1.0/30, 0.9/31, 0.6/32, 0.4/33, 0.2/34}}, and dom (Height) = cdom (Height) = [60, 210]. Let o1 (Age) = 15, o2 (Age) = {0.6/25, 0.8/26, 1.0/27, 0.9/28, 0.7/29, 0.5/30, 0.3/31}, and o2 (Height) = 182. According to the definition above, we have

ID (dom (Age), o1 (Age)) = 1,
ID (dom (Height), o2 (Height)) = 1,
ID (cdom (Age), o2 (Age)) = SID ({1.0/5, 1.0/6, …, 1.0/19, 1.0/20}, o2 (Age)) = 0, and
ID (fdom (Age), o2 (Age)) = max (SID ({1.0/20, 1.0/21, 0.7/22, 0.5/23}, o2 (Age)),
    SID ({0.4/22, 0.6/23, 0.8/24, 1.0/25, 0.9/26, 0.8/27, 0.6/28}, o2 (Age)),
    SID ({0.6/27, 0.8/28, 0.9/29, 1.0/30, 0.9/31, 0.6/32, 0.4/33, 0.2/34}, o2 (Age)))
  = max (0, 0.58, 0.60) = 0.60.

Therefore, ID (dom (Age), o2 (Age)) = max (ID (cdom (Age), o2 (Age)), ID (fdom (Age), o2 (Age))) = 0.60.

Now, we define the formula to calculate the membership degree of the object o to the class C as follows, where w (Ai (C)) denotes the weight of attribute Ai to class C:

μC (o) = (Σ_{i=1}^{n} ID (dom (Ai), o (Ai)) × w (Ai (C))) / (Σ_{i=1}^{n} w (Ai (C))).

Consider the fuzzy class Young students and object o2 above. Assume w (Age (Young students)) = 0.9 and w (Height (Young students)) = 0.2. Then μYoung students (o2) = (0.9 × 0.6 + 0.2 × 1.0)/(0.9 + 0.2) = 0.67.

In the above determination that an object belongs to a class fuzzily, it is assumed that the object and the class have the same attributes, namely, class C has attributes {A1, A2, …, An} and object o is on {A1, A2, …, An} also. Such an object-class relationship is called a direct object-class relationship. As we know, there exist subclass/superclass relationships in the OODB, where a subclass inherits some attributes and methods of the superclass, overrides some attributes and methods of the superclass, and defines some new attributes and methods. Any object belonging to the subclass must belong to the superclass since a subclass is the specialization of the superclass. So we have one kind of special object-class relationship: the relationship between a superclass and an object of its subclass. Such an object-class relationship is called an indirect object-class relationship. Since the object and the class in an indirect object-class relationship have different attributes, in the following we present how to calculate the membership degree of an object to the class in an indirect object-class relationship.
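The computation above can be sketched in Python. SID is not pinned down in this section; the min-based inclusion measure SID (x, y) = Σu min (μx(u), μy(u)) / Σu μy(u) is one common choice and is assumed here because it reproduces the example's degrees 0.58 and 0.60 (all function names are illustrative):

```python
def sid(x, y):
    """Degree to which fuzzy value x includes fuzzy value y.
    Fuzzy values are dicts {element: membership}; min-based inclusion measure."""
    denom = sum(y.values())
    return sum(min(x.get(u, 0.0), mu) for u, mu in y.items()) / denom if denom else 0.0

def inclusion_degree(cdom, fdom, value):
    """ID(dom(Ai), o(Ai)) for a crisp value or a fuzzy value (a dict)."""
    crisp_as_fuzzy = {c: 1.0 for c in cdom}      # {1.0/c1, ..., 1.0/ck}
    if not isinstance(value, dict):              # Case 2: o(Ai) is crisp
        if value in cdom:
            return 1.0
        return max((sid(f, {value: 1.0}) for f in fdom), default=0.0)
    # Case 1: o(Ai) is fuzzy
    return max([sid(crisp_as_fuzzy, value)] + [sid(f, value) for f in fdom])

def class_membership(inclusion_degrees, weights):
    """mu_C(o) = sum(ID * w) / sum(w)."""
    return sum(i * w for i, w in zip(inclusion_degrees, weights)) / sum(weights)

# The Young students example.
cdom_age = set(range(5, 21))
fdom_age = [
    {20: 1.0, 21: 1.0, 22: 0.7, 23: 0.5},
    {22: 0.4, 23: 0.6, 24: 0.8, 25: 1.0, 26: 0.9, 27: 0.8, 28: 0.6},
    {27: 0.6, 28: 0.8, 29: 0.9, 30: 1.0, 31: 0.9, 32: 0.6, 33: 0.4, 34: 0.2},
]
o2_age = {25: 0.6, 26: 0.8, 27: 1.0, 28: 0.9, 29: 0.7, 30: 0.5, 31: 0.3}

id_age = inclusion_degree(cdom_age, fdom_age, o2_age)       # ~0.60, as in the text
id_height = inclusion_degree(set(range(60, 211)), [], 182)  # 1.0: crisp value in cdom
mu = class_membership([id_age, id_height], [0.9, 0.2])
print(round(id_age, 2), round(mu, 3))
```

Note that the text rounds ID (dom (Age), o2 (Age)) to 0.60 before taking the weighted average, which gives exactly 0.67; with the unrounded inclusion degree the result is about 0.676.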
Let C be a class with attributes {A1, A2, …, Ak, Ak+1, …, Am} and o be an object on attributes {A1, A2, …, Ak, A'k+1, …, A'm, Am+1, …, An}. Here attributes A'k+1, …, A'm are overridden from Ak+1, …, Am, and attributes Am+1, …, An are special. Then we have

μC (o) = (Σ_{i=1}^{k} ID (dom (Ai), o (Ai)) × w (Ai (C)) + Σ_{j=k+1}^{m} ID (dom (Aj), o (A'j)) × w (Aj (C))) / (Σ_{i=1}^{m} w (Ai (C))).

Based on the direct object-class relationship and the indirect object-class relationship, we now focus on an arbitrary object-class relationship. Let C be a class with attributes {A1, A2, …, Ak, Ak+1, …, Am, Am+1, …, An} and o be an object on attributes {A1, A2, …, Ak, A'k+1, …, A'm, Bm+1, …, Bp}. Here attributes A'k+1, …, A'm are overridden from Ak+1, …, Am, or Ak+1, …, Am are overridden from A'k+1, …, A'm. Attributes Am+1, …, An and Bm+1, …, Bp are special in {A1, A2, …, Ak, Ak+1, …, Am, Am+1, …, An} and {A1, A2, …, Ak, A'k+1, …, A'm, Bm+1, …, Bp}, respectively. Then we have

μC (o) = (Σ_{i=1}^{k} ID (dom (Ai), o (Ai)) × w (Ai (C)) + Σ_{j=k+1}^{m} ID (dom (Aj), o (A'j)) × w (Aj (C))) / (Σ_{i=1}^{n} w (Ai (C))).

Since an object may belong to a class with a membership degree in [0, 1] in a fuzzy object-class relationship, it is possible that an object that is in a direct object-class relationship and an indirect object-class relationship simultaneously belongs to the subclass and the superclass with different membership degrees. This situation occurs in fuzzy inheritance hierarchies, which will be investigated in the next section. Also, for two classes that do not have a subclass/superclass relationship, it is possible that an object belongs to these two classes with different membership degrees simultaneously. This situation only arises in fuzzy object-oriented databases. In the OODB, an object may or may not belong to a given class definitely; if it belongs to a given class, it can only belong to it uniquely (except for the case of subclass/superclass). The situation where an object belongs to different classes with different membership degrees simultaneously in fuzzy object-class relationships is called multiple membership of object. Now let us focus on how to handle the multiple membership of object in fuzzy object-class relationships.

Let C1 and C2 be (fuzzy) classes and α be a given threshold. Assume there exists an object o. If μC1 (o) ≥ α and μC2 (o) ≥ α, the conflict of multiple membership of object occurs, namely, o belongs to multiple classes simultaneously. Which one of C1 and C2 is the class of object o then depends on the following cases.

Case 1: There exists a direct object-class relationship between object o and one class of C1 and C2. Then the class in the direct object-class relationship is the class of object o.

Case 2: There is no direct object-class relationship but only an indirect object-class relationship between object o and one class of C1 and C2, say C1, and there exists a subclass C1' of C1 such that object o and C1' are in a direct object-class relationship. Then class C1' is the class of object o.


Case 3: There is neither a direct object-class relationship nor an indirect object-class relationship between object o and classes C1 and C2. Or there exists only an indirect object-class relationship between object o and one class of C1 and C2, say C1, but there is no subclass C1' of C1 such that object o and C1' are in a direct object-class relationship. Then class C1 is considered the class of object o if μC1 (o) > μC2 (o); otherwise class C2 is considered the class of object o.

It can be seen that in Case 1 and Case 2, the class in the direct object-class relationship is always the class of object o, and the object and the class have the same attributes. In Case 3, however, object o and the class that is considered the class of object o, say C1, have different attributes. It should be pointed out that class C1 and object o are each definitely defined, viewed from their structures. For the situation in Case 3, the attributes of C1 do not affect the attributes of o, and the attributes of o do not affect the attributes of C1 either. There should be a class C such that C and o are in a direct object-class relationship, but class C is not available so far. That C1 is considered the class of object o, compared with C2, only means that C1 is more similar to C than C2 is. Class C is the class of object o once C is available.

Consider three fuzzy classes C1 with {A, B}, C2 with {A, B, D}, and C3 with {A, F}. There exists a fuzzy object o on {A, B', E}. Here, B' is overridden from B and D ≠ E ≠ F. According to the definitions above, we have

μC1 (o) = (ID (dom (A), o (A)) × w (A (C1)) + ID (dom (B), o (B')) × w (B (C1))) / (w (A (C1)) + w (B (C1))),
μC2 (o) = (ID (dom (A), o (A)) × w (A (C2)) + ID (dom (B), o (B')) × w (B (C2))) / (w (A (C2)) + w (B (C2)) + w (D (C2))),
μC3 (o) = (ID (dom (A), o (A)) × w (A (C3))) / (w (A (C3)) + w (F (C3))).

Assume w (A (C1)) = w (A (C2)) = w (A (C3)), w (B (C1)) = w (B (C2)), and w (B (C2)) + w (D (C2)) = w (F (C3)). Also assume μC1 (o) ≥ α, μC2 (o) ≥ α, and μC3 (o) ≥ α, where α is a given threshold. Then object o belongs to classes C1, C2 and C3 simultaneously, and the conflict of multiple membership of object occurs. It can be seen that the relationship between o and C1 is an indirect object-class relationship, but the relationship between o and C2, which is the subclass of class C1, is not a direct object-class relationship, so class C2 is not the class of object o. It can also be seen that μC1 (o) ≥ μC2 (o) ≥ μC3 (o). So C1 is considered the class of object o. In fact, there should be a new class C with {A, B', E}, which would be in a direct object-class relationship with o. That μC1 (o) ≥ μC2 (o) ≥ μC3 (o) only means that C1 with {A, B} is more similar to C with {A, B', E} than C2 with {A, B, D} and C3 with {A, F} are. While class C is not available, class C1 is considered the class of object o.
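The three-case conflict rule above can be sketched as a selection function. Each candidate class is described by its membership degree and its relationship to the object; the function name and the tuple encoding are illustrative only:

```python
def resolve_class(candidates, alpha):
    """Pick the class of object o among candidate classes.

    candidates: list of (name, mu, relation), where relation is one of
      "direct"              -- direct object-class relationship with o (Case 1)
      "indirect+direct-sub" -- indirect, but some subclass of it is in a direct
                               relationship with o; that subclass's name is
                               what the candidate carries (Case 2)
      "indirect" / "none"   -- otherwise (Case 3)
    Only candidates with mu >= alpha are in conflict at all.
    """
    live = [c for c in candidates if c[1] >= alpha]
    if not live:
        return None
    for name, mu, rel in live:                  # Case 1: direct wins outright
        if rel == "direct":
            return name
    for name, mu, rel in live:                  # Case 2: direct via a subclass
        if rel == "indirect+direct-sub":
            return name
    return max(live, key=lambda c: c[1])[0]     # Case 3: highest membership degree

# The C1/C2/C3 example: no direct relationship anywhere, so the largest mu wins.
chosen = resolve_class([("C1", 0.8, "indirect"), ("C2", 0.7, "none"),
                        ("C3", 0.6, "none")], alpha=0.5)
print(chosen)  # C1
```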

2.4.4 Fuzzy Inheritance Hierarchies

In the OODB, a new class, called a subclass, is produced from another class, called a superclass, by means of inheriting some attributes and methods of the superclass, overriding some attributes and methods of the superclass, and defining some new attributes and methods. Since a subclass is the specialization of the superclass, any object belonging to the subclass must belong to the superclass. This characteristic can be used to determine whether two classes have a subclass/superclass relationship. In the FOODB, however, classes may be fuzzy, and a class produced from a fuzzy class must be fuzzy. If the former is still called the subclass and the latter the superclass, the subclass/superclass relationship is fuzzy; in other words, a class is then a subclass of another class with a membership degree in [0, 1]. Correspondingly, the method used in the OODB for the determination of subclass/superclass relationships is modified as follows: (a) for any (fuzzy) object, the membership degree with which it belongs to the subclass is less than or equal to the membership degree with which it belongs to the superclass, and (b) the membership degree with which it belongs to the subclass is greater than or equal to the given threshold. The subclass is then a subclass of the superclass with a membership degree that is the minimum of the membership degrees with which these objects belong to the subclass.

Let C1 and C2 be (fuzzy) classes and β be a given threshold. We say C2 is a subclass of C1 if (∀o)(β ≤ μC2 (o) ≤ μC1 (o)). The membership degree with which C2 is a subclass of C1 is then min_{o: μC2 (o) ≥ β} (μC2 (o)).

It can be seen that by utilizing the inclusion degrees of objects to the classes, we can assess fuzzy subclass/superclass relationships in the FOODB. Clearly, such an assessment is indirect; if no object is available, this method cannot be used.
In fact, the idea used in evaluating the membership degree of an object to a class can also be used to determine the relationship between a fuzzy subclass and its superclass. We can calculate the inclusion degree of a (fuzzy) subclass with respect to the (fuzzy) superclass according to the inclusion degrees of the attribute domains of the subclass with respect to the attribute domains of the superclass, as well as the weights of attributes. In the following, we give the method for evaluating the inclusion degree of fuzzy attribute domains.


Let C1 and C2 be (fuzzy) classes with attributes {A1, A2, …, Ak, Ak+1, …, Am} and {A1, A2, …, Ak, A'k+1, …, A'm, Am+1, …, An}, respectively. It can be seen that in C2, attributes A1, A2, …, Ak are directly inherited from A1, A2, …, Ak in C1, attributes A'k+1, …, A'm are overridden from Ak+1, …, Am in C1, and attributes Am+1, …, An are special. For each attribute in C1 or C2, say Ai, there is a domain, denoted dom (Ai). As shown above, dom (Ai) should be dom (Ai) = cdom (Ai) ∪ fdom (Ai), where cdom (Ai) and fdom (Ai) denote the sets of crisp values and fuzzy subsets, respectively. Let Ai and Aj be attributes of C1 and C2, respectively. The inclusion degree of dom (Aj) with respect to dom (Ai) is denoted by ID (dom (Ai), dom (Aj)). Then we identify the following cases and investigate the evaluation of ID (dom (Ai), dom (Aj)):

(a) when i ≠ j and 1 ≤ i, j ≤ k, ID (dom (Ai), dom (Aj)) = 0;
(b) when i = j and 1 ≤ i, j ≤ k, ID (dom (Ai), dom (Aj)) = 1; and
(c) when i = j and k + 1 ≤ i, j ≤ m, ID (dom (Ai), dom (Aj)) = ID (dom (Ai), dom (A'i)) = max (ID (dom (Ai), cdom (A'i)), ID (dom (Ai), fdom (A'i))).

Now we respectively define ID (dom (Ai), cdom (A'i)) and ID (dom (Ai), fdom (A'i)). Let fdom (A'i) = {f1, f2, …, fm}, where fj (1 ≤ j ≤ m) is a fuzzy value, and cdom (A'i) = {c1, c2, …, ck}, where cl (1 ≤ l ≤ k) is a crisp value. We can consider {c1, c2, …, ck} as a special fuzzy value {1.0/c1, 1.0/c2, …, 1.0/ck}. Then we have the following:

ID (dom (Ai), cdom (A'i)) = ID (dom (Ai), {1.0/c1, 1.0/c2, …, 1.0/ck}),
ID (dom (Ai), fdom (A'i)) = max_j (ID (dom (Ai), fj)).

Based on the inclusion degrees of the attribute domains of the subclass with respect to the attribute domains of its superclass, as well as the weights of attributes, we can define the formula to calculate the degree to which a fuzzy class is a subclass of another fuzzy class. Let C1 and C2 be (fuzzy) classes with attributes {A1, A2, …, Ak, Ak+1, …, Am} and {A1, A2, …, Ak, A'k+1, …, A'm, Am+1, …, An}, respectively, and let w (A) denote the weight of attribute A. Then the degree to which C2 is a subclass of C1, written μ (C1, C2), is defined as follows:

μ (C1, C2) = (Σ_{i=1}^{m} ID (dom (Ai (C1)), dom (Ai (C2))) × w (Ai)) / (Σ_{i=1}^{m} w (Ai)).
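A sketch of this subclass-degree computation, under the same min-based inclusion-measure assumption used earlier for SID; the class names and domains below are hypothetical:

```python
def sid(x, y):
    """Degree to which fuzzy set x includes fuzzy set y (dicts {element: membership})."""
    denom = sum(y.values())
    return sum(min(x.get(u, 0.0), mu) for u, mu in y.items()) / denom if denom else 0.0

def subclass_degree(dom1, dom2, weights):
    """mu(C1, C2): weighted average, over the attributes of C1, of the inclusion
    degrees of C2's attribute domains in C1's attribute domains.
    dom1, dom2: {attr: fuzzy-set dict}. Directly inherited attributes share the
    same domain, so their inclusion degree is 1; C2's special attributes are
    ignored because the sum runs over C1's attributes only."""
    total = sum(weights[a] for a in dom1)
    acc = 0.0
    for a in dom1:
        if dom1[a] == dom2.get(a):          # directly inherited: ID = 1
            acc += 1.0 * weights[a]
        elif a in dom2:                     # overridden: inclusion of subclass domain
            acc += sid(dom1[a], dom2[a]) * weights[a]
    return acc / total

c1_dom = {"Age": {u: 1.0 for u in range(0, 31)}}            # hypothetical superclass domain
c2_dom = {"Age": {20: 1.0, 25: 1.0, 30: 0.8, 35: 0.5},      # overridden, slightly wider
          "Grade": {1: 1.0, 2: 1.0}}                        # special attribute of C2
mu = subclass_degree(c1_dom, c2_dom, {"Age": 1.0})
print(round(mu, 2))  # 0.85: C2's Age domain is mostly, not fully, included
```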

In subclass/superclass hierarchies, a critical issue is multiple inheritance of classes. Ambiguity arises when more than one of the superclasses has a common attribute and the subclass does not declare explicitly the class from which the attribute is inherited. Let class C be a subclass of classes C1 and C2. Assume that the attribute Ai in C1, denoted by Ai (C1), is common to the attribute Ai in C2, denoted by Ai (C2). If dom (Ai (C1)) and dom (Ai (C2)) are identical, there does not exist a conflict in the multiple inheritance hierarchy and C inherits attribute Ai directly. If dom (Ai (C1)) and dom (Ai (C2)) are not identical, however, a conflict occurs. Which one of Ai (C1) and Ai (C2) is inherited by C then depends on the following rule:

If ID (dom (Ai (C1)), dom (Ai (C2))) × w (Ai (C1)) > ID (dom (Ai (C2)), dom (Ai (C1))) × w (Ai (C2)), then Ai (C1) is inherited by C; else Ai (C2) is inherited by C.

Note that in a fuzzy multiple inheritance hierarchy, the subclass has different degrees with respect to different superclasses, unlike the situation in classical object-oriented database systems.
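Given precomputed mutual inclusion degrees and attribute weights, the conflict rule can be written directly (an illustrative sketch; the function name is not from the book):

```python
def inherit_from(id_1_includes_2, id_2_includes_1, w1, w2):
    """Decide which superclass's attribute Ai the subclass C inherits.

    id_1_includes_2 = ID(dom(Ai(C1)), dom(Ai(C2)))
    id_2_includes_1 = ID(dom(Ai(C2)), dom(Ai(C1)))
    w1, w2          = w(Ai(C1)), w(Ai(C2))
    """
    return "C1" if id_1_includes_2 * w1 > id_2_includes_1 * w2 else "C2"

print(inherit_from(0.9, 0.4, 0.8, 0.8))  # C1: 0.72 > 0.32
```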

2.5 Fuzzy XML Model

With the wide utilization of the Web and the availability of huge amounts of electronic data, information representation and exchange over the Web have become important, and the eXtensible Markup Language (XML) has become the de facto standard (Bray et al., 2000). XML and related standards are technologies that allow the easy development of applications that exchange data over the Web, such as e-commerce (EC) and supply chain management (SCM). Unfortunately, although it is the current standard for data representation and exchange over the Web, XML is not able to represent and process imprecise and uncertain data. In fact, fuzziness in EC and SCM has received considerable attention, and fuzzy set theory has been used to implement web-based business intelligence. Therefore, topics related to the modeling of fuzzy data are very interesting in the XML data context.

Regarding modeling fuzzy information in XML, Turowski and Weng (2002) extended XML DTDs with fuzzy information to satisfy the needs of information exchange. Lee and Fanjiang (2003) studied how to model imprecise requirements with XML DTDs and developed a fuzzy object-oriented modeling technique based on an XML schema. Ma and Yan (2007) and Ma (2005a, 2005b) proposed a fuzzy XML model for representing fuzzy information in XML documents. Tseng et al. (2005) presented an XML method to represent fuzzy systems for facilitating collaboration in fuzzy applications. Moreover, aiming at modeling fuzzy information in XML Schemas, Gaurav and Alhajj (2006) incorporated fuzziness in an XML document by extending the XML Schema associated with the document and mapped fuzzy relational data into fuzzy XML. Oliboni and Pozzani (2008) proposed an XML Schema definition for representing different aspects of fuzzy information. Kianmehr et al. (2010) described a fuzzy XML schema model for representing a fuzzy relational database.
In addition, XML with incomplete information (Abiteboul et al., 2006) and probabilistic data in XML (Nierman & Jagadish, 2002; Senellart & Abiteboul, 2007) have also been investigated.


2 Fuzzy Sets and Fuzzy Database Modeling

2.5.1 Fuzziness in XML Documents

The fuzziness in an XML document is similar to the fuzziness in a relational database. Two kinds of fuzziness may occur in a fuzzy XML document: one is fuzziness in elements, and the other is fuzziness in the attribute values of elements.

1. Fuzziness in elements: this kind of fuzziness uses membership degrees associated with elements. The membership degree associated with an element represents the possibility of this element (including itself and the sub-elements rooted at it) belonging to its parent element. Let us interpret what a membership degree associated with an element means, given that the element can nest under other elements, and more than one of these elements may have an associated membership degree. The existential membership degree associated with an element should be the possibility that the state of the world includes this element and the sub-tree rooted at it. For an element with the sub-tree rooted at it, each node in the sub-tree is not treated as independent but as dependent upon its root-to-node chain. Each possibility in the source XML document is assigned conditioned on the fact that the parent element certainly exists. In other words, this possibility is a relative one, based upon the assumption that the possibility that the parent element exists is exactly 1.0. In order to calculate the absolute possibility, we must consider the relative possibility of the parent element. In general, the absolute possibility of an element e can be obtained by multiplying the relative possibilities found in the source XML document along the path from e to the root. Not every relative possibility is necessarily given in the source XML document; a missing relative possibility is regarded as 1.0 by default. Consider a chain X → Y → Z from the root node X.
Assume that the source XML document contains the relative possibilities Poss(Z|Y), Poss(Y|X), and Poss(X), associated with the nodes Z, Y, and X, respectively. Then we have

Poss(Y) = Poss(Y|X) × Poss(X)
Poss(Z) = Poss(Z|Y) × Poss(Y|X) × Poss(X)

Here, Poss(Z|Y), Poss(Y|X) and Poss(X) can be obtained from the source XML document.

2. Fuzziness in attribute values of elements: this kind of fuzziness uses possibility distributions to represent the values of attributes. Furthermore, attributes are classified into two types:

(a) Single-valued attributes: some data items are known to have a single unique value, e.g., the age of a person in years is a unique integer, and if such a value is as yet unknown, we can use the following possibility distribution: {21/0.4, 23/0.5, 25/0.8, 26/0.9, 27/0.6, 28/0.5, 29/0.3}. This is called a disjunctive possibility distribution.
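As a quick illustration, the chain computation can be sketched in Python; the element names X, Y, Z and the degrees are illustrative, and the default of 1.0 for a missing relative possibility follows the text:

```python
# Compute the absolute possibility of an element by multiplying the
# relative (conditional) possibilities found along its root-to-node
# chain. A relative possibility missing from the source document
# defaults to 1.0.

def absolute_possibility(chain, relative):
    """chain: list of node names from the root downwards.
    relative: dict node name -> relative possibility
    (Poss(node | parent); for the root, Poss(root))."""
    poss = 1.0
    for node in chain:
        poss *= relative.get(node, 1.0)  # default relative possibility is 1.0
    return poss

# Chain X -> Y -> Z with Poss(X) = 1.0, Poss(Y|X) = 0.8, Poss(Z|Y) = 0.5
rel = {"X": 1.0, "Y": 0.8, "Z": 0.5}
print(absolute_possibility(["X", "Y"], rel))       # Poss(Y) = 0.8
print(absolute_possibility(["X", "Y", "Z"], rel))  # Poss(Z) = 0.4
```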


(b) Multi-valued attributes: XML restricts attributes to a single value, but it is often the case that a data item is known to have multiple values; these values may be completely unknown and can be specified with a possibility distribution. For example, the e-mail address of a person may be multiple character strings because he or she has several e-mail addresses available simultaneously. In case we do not have complete knowledge of the e-mail address of Tom Smith, we may say that the e-mail address may be "[email protected]" with possibility 0.60 and "[email protected]" with possibility 0.85. This is called a conjunctive possibility distribution.

For ease of understanding, we interpret the above two kinds of fuzziness with a simple fuzzy XML document d1 in Fig. 2.1. In Fig. 2.1, we talk about the universities in an area of a given city, say, Detroit, Michigan, in the USA.

(a) Wayne State University is located in downtown Detroit, and thus the possibility that it is included in the universities in Detroit is 1.0. For this reason, the pair <Val Poss = 1.0> … </Val> is omitted (see Lines 50–51).

(b) Oakland University, however, is located in a nearby county of Michigan, named Oakland. Whether Oakland University is included in the universities in Detroit depends on how the area of Detroit is defined, as the Greater Detroit Area or only the City of Detroit. Assume that this is unknown and the possibility that Oakland University is included in the universities in Detroit is assigned 0.8 (see Line 3). Cases (a) and (b) are fuzziness in elements. The degree associated with such an element represents the possibility that a university is included in the universities in Detroit.

(c) For the student Tom Smith, his age may be as yet unknown, i.e., he has a fuzzy value in the attribute age. Since age is known to have a single unique value, we can use a disjunctive possibility distribution to represent this value (see Lines 23–35).
(d) The e-mail address of Tom Smith may be multiple character strings because he has several e-mail addresses simultaneously. If we do not know his exact e-mail addresses, we use a conjunctive possibility distribution to represent this information and may say that the e-mail address may be "[email protected]" with possibility 0.6 and "[email protected]" with possibility 0.45 (see Lines 37–45). Note that cases (c) and (d) are fuzziness in attribute values of elements. In an XML document, it is often the case that some attribute values may be completely unknown and can be specified with possibility distributions.
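The two kinds of possibility distribution can be represented, for instance, as tagged dictionaries mapping candidate values to possibilities; the age degrees are those of the example above, while the address labels are hypothetical placeholders (the concrete addresses in the text are redacted):

```python
# A possibility distribution as a mapping value -> possibility in
# [0, 1], tagged with its type: "disjunctive" for single-valued
# attributes (the candidates exclude each other) and "conjunctive"
# for multi-valued attributes (several values may hold at once).

def make_dist(kind, pairs):
    assert kind in ("disjunctive", "conjunctive")
    assert all(0.0 <= p <= 1.0 for p in pairs.values())
    return {"type": kind, "values": dict(pairs)}

# Unknown single-valued age of Tom Smith (disjunctive):
age = make_dist("disjunctive",
                {21: 0.4, 23: 0.5, 25: 0.8, 26: 0.9,
                 27: 0.6, 28: 0.5, 29: 0.3})

# Multi-valued e-mail addresses, incompletely known (conjunctive);
# "address-1"/"address-2" are hypothetical placeholder labels:
email = make_dist("conjunctive", {"address-1": 0.6, "address-2": 0.85})

# The most possible age candidate:
print(max(age["values"], key=age["values"].get))  # 26
```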

2.5.2 Fuzzy XML Representation Models and Formalizations

In the following, we introduce fuzzy XML representation models, including the representation of fuzzy data in the XML document, and two fuzzy XML document structures: fuzzy DTD and fuzzy XML Schema.


Fig. 2.1 A fragment of a fuzzy XML document d1


2.5.2.1 Representation of Fuzzy Data in XML Document

In order to represent fuzzy data in XML documents, several fuzzy constructs (such as Poss, Val and Dist) are introduced, as shown in the previous part. It is not difficult to see from the example given above that a possibility attribute, denoted Poss, should be introduced first, which takes a value in [0, 1]. This possibility attribute is applied together with a fuzzy construct called Val to specify the possibility of a given element existing in the fuzzy XML document (see Line 3 in Fig. 2.1). A possibility distribution can also be used to express fuzzy element values. For this purpose, another fuzzy construct called Dist is introduced to specify a possibility distribution; based on the pair of constructs Dist and Val, the possibility distribution for an element can be expressed. Typically, a Dist element has multiple Val elements as children, each with an associated possibility. Since we have two types of possibility distribution, the Dist construct should indicate whether a possibility distribution is disjunctive or conjunctive (see Lines 24–34 and Lines 38–44 in Fig. 2.1).

Again consider Fig. 2.1. Lines 24–34 are the disjunctive Dist construct for the age of student "Tom Smith". Lines 38–44 are the conjunctive Dist construct for the e-mail of student "Tom Smith". It should be noted, however, that the possibility distributions in Lines 24–34 and Lines 38–44 are all for leaf nodes in the ancestor–descendant chain. In fact, we can also have possibility distributions and values over non-leaf nodes. Observe the disjunctive Dist construct in Lines 6–19, which expresses the two possible statuses for the employee with ID 85431095. In these two employee values, Lines 7–12 are with possibility 0.8, and Lines 13–18 are with possibility 0.6.

The structure of an XML document can be described by a Document Type Definition (DTD) or an XML Schema (Antoniou & van Harmelen, 2004).
A DTD, which defines the valid elements, their attributes, and the nesting structures of these elements in instance documents, is used to assert the set of "rules" to which each instance document of a given document type must conform. XML Schemas provide a much more powerful means than DTDs for defining an XML document's structure and limitations. It has been shown that the XML document must be extended for fuzzy data modeling, and as a result several fuzzy constructs have been introduced.
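A minimal sketch of reading such constructs with Python's standard xml.etree library; the Val/Dist element names and the Poss and type attributes follow the constructs described in the text, while the surrounding student fragment is illustrative:

```python
import xml.etree.ElementTree as ET

# Illustrative fuzzy XML fragment: a disjunctive possibility
# distribution over the age of a student, using the Val/Dist
# constructs (Poss defaults to 1.0 when omitted).
doc = """
<student>
  <name>Tom Smith</name>
  <age>
    <Dist type="disjunctive">
      <Val Poss="0.8">25</Val>
      <Val Poss="0.9">26</Val>
      <Val Poss="0.6">27</Val>
    </Dist>
  </age>
</student>
"""

root = ET.fromstring(doc)
dist = root.find("./age/Dist")
# Collect each Val's text together with its possibility attribute.
pairs = {v.text: float(v.get("Poss", "1.0")) for v in dist.findall("Val")}
print(dist.get("type"), pairs)  # disjunctive {'25': 0.8, '26': 0.9, '27': 0.6}
```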

2.5.2.2 DTD Modification

In order to accommodate these fuzzy constructs, the DTD of the source XML document should clearly be modified correspondingly. In this section, we focus on DTD modification (i.e., fuzzy DTD) for representing the structure of the fuzziness in XML documents as introduced in Sect. 2.5.1. First, we define the basic elements in a fuzzy DTD as follows:


//element1 contains element2, and the number of appearances of element2 is restricted by the cardinalities: ? denotes 0 or 1 time; * denotes 0 or n times; + denotes 1 or n times; no cardinality operator means exactly once

//element1 contains element2, element3, … in order

//element1 contains either element2 or element3, …

//#PCDATA, which is the only atomic type for elements, denotes that element1 may have any content

//element1 is an empty element

Moreover, the attributes of an element elementi can be represented as follows:

Here AttName is the name of the attribute, AttType is the type of the attribute, and ValType is the value type, which can be #REQUIRED, #IMPLIED, #FIXED "value", or "value" (Antoniou & van Harmelen, 2004). Then, we define the Val and Dist elements as follows:

//basic_definition represents any case of the basic element definitions above

Finally, based on the Val and Dist elements, we modify the basic element definitions above so that all of the elements can use possibility distributions (Dist). In summary, the basic elements can be classified into two types, i.e., leaf elements and non-leaf elements:

• For a leaf element which only contains #PCDATA, say leafElement, its definition is modified so that the element may instead take a Dist as content.

That is, a leaf element may be fuzzy and take a value represented by a possibility distribution.

• For a non-leaf element which contains other elements, say nonleafElement, its definition is modified analogously.


That is, a non-leaf element may be crisp, e.g., student in Fig. 2.1, and the non-leaf element student can then be defined as usual. Also, a non-leaf element may be fuzzy and take a value represented by a possibility distribution. We differentiate two cases: in the first case, the element takes a value connected with a possibility degree, e.g., university in Fig. 2.1, which can be defined as follows:

In the second case, the element takes a set of values, each connected with a possibility degree, e.g., the age of student in Fig. 2.1, which can be defined as follows:

Based on the above modified fuzzy DTD definitions, Fig. 2.2 gives the fuzzy DTD D1 w.r.t. the fuzzy XML document d1 in Fig. 2.1.

2.5.2.3 Fuzzy XML Schema

In the following, we define the XML Schema modification (i.e., fuzzy XML Schema) for representing the structure of the fuzziness in XML documents as introduced in Sect. 2.5.1. First, we define the Val element as follows:





Then we define the Dist element as follows:




Fig. 2.2 The fuzzy DTD D1 w.r.t. the fuzzy XML document d1 in Fig. 2.1

Now we modify the element definition in the classical Schema so that all of the elements can use possibility distributions (Dist). For a sub-element that only contains leaf elements, its definition in the Schema is as follows.






For an element that contains leaf elements without any fuzziness, its definition in the Schema is as follows.

For an element that contains leaf elements with fuzziness, its definition in the Schema is as follows.



For a sub-element that does not contain any leaf elements, its definition in the Schema is as follows.





For an element that does not contain leaf elements and does not involve any fuzziness, its definition in the Schema is as follows.



For a sub-element that does not contain leaf elements but a fuzzy value, its definition in the Schema is as follows.



For a sub-element that does not contain leaf elements but a set of fuzzy values, its definition in the Schema is as follows.




The fuzzy XML Schema w.r.t. the fuzzy XML document in Fig. 2.1 is shown as follows:






























2.5.2.4 Formalization of Fuzzy XML Models

Similar to a classical XML document, a fuzzy XML document can be intuitively seen as a syntax tree. Figure 2.3 shows a fragment of the fuzzy XML document d1 in Fig. 2.1 and its tree representation. Based on the tree representation of the fuzzy XML document, in the following we define the formalization of fuzzy XML models given in Ma et al. (2010) and Zhang et al. (2013). It can be seen from Fig. 2.2 that a fuzzy DTD is made up of element type definitions, and each element may have associated attributes. Each element type definition has the form E → (α, A), where E is the defined element type (e.g., university and student), α is the content model, such as university (UName, Val+), and A are the attributes of E. For the sake of simplicity, we assume that the symbol T denotes the atomic types of elements and attributes such as #PCDATA and CDATA, E denotes the set of elements, including the basic elements (e.g., university and student) and the special elements (e.g., Val and Dist), A denotes the set of attributes, and S = T ∪ E. A fuzzy DTD D is a pair (P, r), where P is a set of element type definitions, and r ∈ E is the root element type, which uniquely identifies a fuzzy DTD. Each element type definition has the form E → (α, A), constructed according to the following syntax:


Fig. 2.3 A fragment of the fuzzy XML document and its tree representation

α ::= S | empty | (α1 | α2) | (α1, α2) | α? | α* | α+ | any
A ::= empty | (AN, AT, VT)

Here:

1. S = T ∪ E; empty denotes the empty string; "|" denotes union, and "," denotes concatenation; α can be extended with the cardinality operators "?", "*", and "+", where "?" denotes 0 or 1 time, "*" denotes 0 or n times, and "+" denotes 1 or n times; the construct any stands for any sequence of element types defined in the fuzzy DTD;
2. AN ∈ A denotes the attribute names of the element E; AT denotes the attribute types; and VT is the value type of an attribute, which can be #REQUIRED, #IMPLIED, #FIXED "value", "value", or a disjunctive/conjunctive possibility distribution.

The formal definition of fuzzy XML Schemas can be given analogously, following the procedure above. A formal definition of fuzzy XML documents as tuples over a fuzzy DTD can likewise be given.

Definition 3.4 (Fuzzy RDF graph isomorphism). Given two fuzzy RDF graphs G1 = (V1, E1, Σ1, L1, μ1, ρ1) and G2 = (V2, E2, Σ2, L2, μ2, ρ2), an isomorphism from G1 to G2 is a bijective function h: V1 → V2 such that:


3 Fuzzy RDF Modeling

1. ∀u ∈ V1, h(u) ∈ V2, L1(u) = L2(h(u)) and μ1(u) = μ2(h(u));
2. ∀(u, v) ∈ E1, (h(u), h(v)) ∈ E2, L1(u, v) = L2(h(u), h(v)) and ρ1(u, v) = ρ2(h(u), h(v)).

If such a function h exists, then G1 is isomorphic to G2, denoted G1 ≅ G2. Given two fuzzy RDF graphs Q and G, Q is sub-graph isomorphic to G if Q is isomorphic to at least one sub-graph G′ of G; G′ is then called a matching of Q in G.

Theorem 3.1 Isomorphism between fuzzy RDF graphs is an equivalence relation.

Proof:

1. Reflexivity: Consider the identity map h: V → V such that ∀v ∈ V, h(v) = v. This h is a bijective map satisfying ∀v ∈ V, μ(v) = μ(h(v)) and ∀(vi, vj) ∈ E, ρ(vi, vj) = ρ(h(vi), h(vj)). Hence h is an isomorphism of the fuzzy graph to itself, and reflexivity is satisfied.

2. Symmetry: Given two fuzzy RDF graphs G1 and G2, let h: V1 → V2 be an isomorphism from G1 to G2. Then h is a bijective map with h(v1) = v2, v1 ∈ V1, satisfying μ1(v1) = μ2(h(v1)), ∀v1 ∈ V1, and ρ1(v1i, v1j) = ρ2(h(v1i), h(v1j)), ∀(v1i, v1j) ∈ E1. As h is bijective, h⁻¹(v2) = v1, ∀v2 ∈ V2. Using these equalities, we obtain μ1(h⁻¹(v2)) = μ2(v2), ∀v2 ∈ V2, and ρ1(h⁻¹(v2i), h⁻¹(v2j)) = ρ2(v2i, v2j), ∀(v2i, v2j) ∈ E2. Thus, we get a one-to-one, onto map h⁻¹: V2 → V1, which is an isomorphism from G2 to G1, i.e., G1 ≅ G2 ⇒ G2 ≅ G1.

3. Transitivity: Given three fuzzy RDF graphs G1, G2 and G3, suppose h1: V1 → V2 and h2: V2 → V3 are isomorphisms of G1 onto G2 and of G2 onto G3, respectively. As h1 is a bijective map with h1(v1) = v2, v1 ∈ V1, it satisfies μ1(v1) = μ2(h1(v1)), ∀v1 ∈ V1, and ρ1(v1i, v1j) = ρ2(h1(v1i), h1(v1j)), ∀(v1i, v1j) ∈ E1; i.e., μ1(v1) = μ2(v2), ∀v1 ∈ V1, and ρ1(v1i, v1j) = ρ2(v2i, v2j), ∀(v1i, v1j) ∈ E1. In the same way, as h2 is a bijective map with h2(v2) = v3, v2 ∈ V2, it satisfies μ2(v2) = μ3(h2(v2)), ∀v2 ∈ V2, and ρ2(v2i, v2j) = ρ3(h2(v2i), h2(v2j)), ∀(v2i, v2j) ∈ E2. From what has been discussed above we have

μ1(v1) = μ2(v2) = μ3(h2(v2)) = μ3(h2(h1(v1))), ∀v1 ∈ V1;
ρ1(v1i, v1j) = ρ2(v2i, v2j) = ρ3(h2(v2i), h2(v2j)) = ρ3(h2(h1(v1i)), h2(h1(v1j))), ∀(v1i, v1j) ∈ E1.

Hence h2 ∘ h1 is an isomorphism between G1 and G3, i.e., transitivity is satisfied.
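For tiny graphs, the conditions of Definition 3.4 can be checked by brute force over all bijections between the vertex sets; the dictionary encoding of a fuzzy RDF graph below is an assumption, and edge labels are omitted for brevity:

```python
from itertools import permutations

# Brute-force fuzzy RDF graph isomorphism check: try every bijection h
# and test that vertex labels L, vertex degrees mu, and edge degrees
# rho all agree. A graph here is a dict with keys "mu" (vertex ->
# degree), "rho" ((u, v) -> degree), and "L" (vertex -> label).

def isomorphic(g1, g2):
    v1, v2 = sorted(g1["mu"]), sorted(g2["mu"])
    if len(v1) != len(v2) or len(g1["rho"]) != len(g2["rho"]):
        return False
    for perm in permutations(v2):
        h = dict(zip(v1, perm))
        vertices_ok = all(g1["L"][u] == g2["L"][h[u]] and
                          g1["mu"][u] == g2["mu"][h[u]] for u in v1)
        edges_ok = all((h[u], h[v]) in g2["rho"] and
                       g1["rho"][(u, v)] == g2["rho"][(h[u], h[v])]
                       for (u, v) in g1["rho"])
        if vertices_ok and edges_ok:
            return True
    return False

g1 = {"mu": {"a": 0.9, "b": 0.7}, "rho": {("a", "b"): 0.6},
      "L": {"a": "Actor", "b": "Film"}}
g2 = {"mu": {"x": 0.9, "y": 0.7}, "rho": {("x", "y"): 0.6},
      "L": {"x": "Actor", "y": "Film"}}
print(isomorphic(g1, g2))  # True
```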


In conclusion, isomorphism between fuzzy RDF graphs is an equivalence relation.

The notion of graph pattern provides a simple yet intuitive specification of the structural and semantic requirements of interest in the input graph. The graph pattern, as the basic operational unit, is central to the semantics of many operations in fuzzy RDF algebra. Essentially, a fuzzy graph pattern is a directed crisp graph with predicates on vertices and edges, and regular expressions that denote paths over edges.

Definition 3.5 (Fuzzy RDF graph pattern). A fuzzy RDF graph pattern is a 5-tuple P = (VP, EP, FV, FE, RE) where:

1. VP is a finite set of vertices.
2. EP is a finite set of directed edges.
3. FV is a function defined on VP such that for a given vertex u ∈ VP, FV(u) is the predicate applied on the value of a label of vertex u. This predicate is a Boolean combination of atomic predicates such that each predicate compares a constant c specified in the pattern with the value Vi using a given operator θ (e.g., <, ≤, >, ≥, =, ≠). Let cj be a constant and θj a comparison operator; FV(u) is the combination of atomic predicates of the form Vi θj cj by the logical connectives (∧, ∨, ¬).
4. FE is the counterpart of FV for edges, which is a condition (or predicate) on the labels of the edges. This predicate is a conjunction of atomic formulas, each of which compares a constant c specified in the pattern with the actual value of a label of the edge e using a given comparison operator θ drawn from the same set of operators.
5. RE: EP → re(E) is a function defined on EP such that for each (u, v) in EP, re(E) is a path regular expression (Fan et al., 2011), in which E is a set composed of the URIs and blank nodes of the data graph G, variables, and the wildcard *, and it can be constructed inductively as R ::= e | R1 · R2 | R1|R2 | R+.
Here e denotes either an edge labeled by e or a wildcard symbol matching any label in Σ, R1 · R2 denotes a concatenation of expressions, R1|R2 denotes disjunction, i.e., an alternative of expressions, and R+ denotes one or more occurrences of R.

For example, a pattern graph P for the RDF graph shown in Fig. 3.1 is given in Fig. 3.2. This pattern models information concerning an actor (?p) who was born in country1. The box office of the film (?film) in which the actor starred is more than 30 million dollars (?b > $30 million) and its genre is tragedy.

Fig. 3.2 Fuzzy pattern graph
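A vertex predicate FV(u) is just such a Boolean combination of comparisons; a sketch mirroring the pattern's condition "?b > $30 million ∧ genre = tragedy" (the attribute names and data values are illustrative):

```python
# A vertex predicate F_V(u) as a Boolean combination of atomic
# comparisons "value theta constant" (cf. Definition 3.5).

def f_v(film):
    # ?b > 30 million AND genre = tragedy
    return film["box_office"] > 30_000_000 and film["genre"] == "tragedy"

films = [
    {"title": "film1", "box_office": 45_000_000, "genre": "tragedy"},
    {"title": "film2", "box_office": 25_000_000, "genre": "tragedy"},
    {"title": "film3", "box_office": 60_000_000, "genre": "comedy"},
]
print([f["title"] for f in films if f_v(f)])  # ['film1']
```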


3.2.2.3 Fuzzy RDF Semantics

Depending on the meaning we want to give to a certain RDF graph, we will consider different kinds of fuzzy interpretations, e.g., simple, RDF, RDFS, D, etc. For each of them there will be some special semantic conditions. Intuitively, a fuzzy interpretation represents a possible configuration of the world, such that we can verify whether or not what is said in a graph G is true within the framework of fuzzy logic. This leads us to think of an RDF graph as a claim about the possible world, thus providing some information. As described in Hayes (2004), any interpretation is relative to a certain vocabulary, so we will in general speak of a fuzzy interpretation of the vocabulary V. A triple (μs/s, μp/p, μo/o) can be thought of as stating that a certain binary predicate associated with μp/p holds for the couple (μs/s, μo/o). A fuzzy interpretation gives us this association, and a fuzzy RDF graph is true under it if none of its triples states something false within the framework of fuzzy logic.

Definition 3.6 (Fuzzy simple interpretation). Given a vocabulary V, a fuzzy simple interpretation If of V is a 7-tuple If = (V, Ir, Ip, Lv, Is, Il, Iext). Here:

1. V is a fuzzy set of vocabulary, and each element x ∈ V has a degree μx ∈ [0, 1] giving an estimation of the belonging of x to V,
2. Ir is a non-empty fuzzy set of resources, called the domain or universe of If,
3. Ip is a finite fuzzy set of all property objects of If,
4. Lv is a fuzzy set of literal values,
5. Is: URIref → Ir ∪ Ip is a function, and μIs(x) ∈ [0, 1] indicates the degree to which an object mapped from an x ∈ URIref via Is belongs to Ir ∪ Ip,
6. Il: LT → Ir is a function from typed literals to Ir, and μIl(x) ∈ [0, 1] indicates the degree to which a typed literal x mapped via Il belongs to Ir,
7. Iext is a function from Ip to the powerset of Ir × Ir, and μ(xz)y ∈ [0, 1] indicates the membership degree to which a pair (Is(x), Is(z)) belongs to the set Iext(Is(y)), where y and z are elements of the fuzzy set V.

Given a fuzzy triple (μs/s, μp/p, μo/o), if min(μs, μp, μo, μIs(s), μIs(p), μIs(o)) ≥ α (α is a given threshold), then If(μs/s, μp/p, μo/o) = true; otherwise, If(μs/s, μp/p, μo/o) = false. Given a set of triples S, If(S) = false if If(μs/s, μp/p, μo/o) = false for some triple (μs/s, μp/p, μo/o) in S; otherwise If(S) = true. If satisfies S, written If |≈ S, if If(S) = true; in this case, we say If is a fuzzy simple interpretation of S. A fuzzy simple interpretation, instead of associating a value in {0, 1} with each element of the corresponding set, accepts any value in the closed unit interval [0, 1].

Definition 3.7 (Fuzzy simple entailment). Let S be a set of fuzzy RDF graphs and G a fuzzy RDF graph. Then S fuzzy simply entails G if and only if every fuzzy simple interpretation If that satisfies every H ∈ S also satisfies G: {If |≈ G | ∀H ∈ S, If |≈ H}. In that case we write S |≈f G.
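The truth condition for a single fuzzy triple reduces to a minimum-threshold test; a sketch with illustrative degrees:

```python
# Truth of a fuzzy triple under a fuzzy simple interpretation: the
# triple (mu_s/s, mu_p/p, mu_o/o) is true when the minimum of the
# triple degrees and the interpretation degrees reaches the threshold
# alpha. All degrees below are illustrative.

def satisfies(triple_degrees, interp_degrees, alpha):
    return min(*triple_degrees, *interp_degrees) >= alpha

mu_s, mu_p, mu_o = 0.9, 1.0, 0.8     # degrees attached to s, p, o
is_s, is_p, is_o = 0.95, 1.0, 0.85   # degrees of Is(s), Is(p), Is(o)

print(satisfies((mu_s, mu_p, mu_o), (is_s, is_p, is_o), alpha=0.7))  # True
print(satisfies((mu_s, mu_p, mu_o), (is_s, is_p, is_o), alpha=0.9))  # False
```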


3.3 Fuzzy RDF Schema

On the basis of RDF, RDF Schema (RDFS) is used to describe the RDF vocabulary. RDF and RDF Schema together implement data exchange at the semantic level of any vocabulary between different machines. Here, RDF is the data model part, and RDF Schema is the semantic interpretation part with the additional ability to describe resources. While in RDF the main construct is the extension, the RDF Schema semantics is stated in terms of classes (Hayes, 2004). As a class is a resource with a class extension, which represents a set of domain elements, the definition of class relies on the definition of extension. Since an extension is a set of couples, and a fuzzy extension is a fuzzy set of couples, fuzzy class extensions in RDF Schema are fuzzy sets of the domain's elements. RDF Schema has a larger vocabulary than RDF, composed of URIs in the rdfs: namespace. The semantics is conveniently expressed in terms of classes: a class is a resource with a class extension, which is a subset of the resources. As a consequence of this definition, a class can have itself as a member. The relation between a class and a class member is given using the RDF vocabulary property rdf:type, and the set of all classes is IC.

With RDFS, classes, properties, and relationships between classes and properties can be declared. The modeling primitives rdfs:Class and rdf:Property, for example, are applied to define classes and properties, respectively, which are generalizations of rdfs:Resource. In addition, the modeling primitive rdf:type is applied to state that a resource is an instance of a class. In particular, class inheritance and property inheritance can be described by rdfs:subClassOf and rdfs:subPropertyOf, respectively. Furthermore, RDFS provides rdfs:domain and rdfs:range to constrain the domain and range of properties, respectively.
In the following, we define a fuzzy RDF Schema (Fan et al., 2019) for modeling primitives, which can organize fuzzy RDF vocabularies into hierarchies. The formal fuzzy RDFS is given as follows:

Definition 3.8 (Fuzzy RDF Schema graph). A fuzzy RDF(S) data graph GF is represented by a 7-tuple GF = (V, E, Σ, L, μ, ρ, A). Here:

1. V is a finite set of vertices.
2. E ⊂ Vi × Vj is a set of directed edges, where Vi, Vj ⊂ V.
3. Σ = {IC, OP, LP, D} is a set of labels, where IC is a set of class resource labels, OP is a set of object property resource labels, LP is a set of datatype property resource labels, and D is a set of datatype labels.
4. L = {LV, LE} is a pair of labeling functions: LV: V → Σ assigns labels to vertices, and LE: E → Σ assigns labels to edges.
5. μ: V → [0, 1] is a fuzzy subset of V.
6. ρ: E → [0, 1] is a fuzzy relation on the fuzzy subset μ. Note that ∀vi, vj ∈ V, ρ(vi, vj) ≤ μ(vi) ∧ μ(vj), where ∧ stands for minimum.
7. A is a set of axioms as shown in Table 3.2.


Table 3.2 The fuzzy RDFS triples and their corresponding axioms

Fuzzy RDFS triples | Fuzzy RDFS axioms
(L(vi) ∈ Σ.C, ρ(vi, vj)/rdfs:subClassOf, L(vj) ∈ Σ.C) | Fuzzy class axiom: ρ(vi, vj)/rdfs:subClassOf (L(vi), L(vj))
(L(vi) ∈ Σ.C, LE(vi, vj) ∈ Σ.LP, μj/L(vj) ∈ Σ.C) | Fuzzy property axiom: DatatypeProperty (LE(vi, vj), domain(L(vi)), range(μj/L(vj)))
(L(vi) ∈ Σ.C, ρ(vi, vj)/LE(vi, vj) ∈ Σ.OP, L(vj) ∈ Σ.C) | Fuzzy property axiom: ObjectProperty (ρ(vi, vj)/LE(vi, vj), domain(L(vi)), range(L(vj)))
(L(vi) ∈ Σ.T, ρ(vi, vj)/type, L(vj) ∈ Σ.C) | Individual axiom: Individual (L(vi), ρ(vi, vj)/type (L(vj)), … value (LE ∈ Σ.LP, μ1/L(n1) ∈ Σ.D), … value (ρ1/LE′ ∈ Σ.OP, L(n1′) ∈ Σ.C), …)

In Definition 3.8, the fuzzy RDFS data graph GF is a directed labeled graph, in which each vertex and each directed edge is assigned a label. The set of axioms A denotes the semantics of the fuzzy RDFS data. In this case, the labels contain the semantic information that can be used in the set of axioms. Let vi ∈ V and vj ∈ V be a subject vertex and an object vertex of the graph GF, and let their labels be L(vi) ∈ Σ.C and L(vj) ∈ Σ.C, respectively. If the edge label LE(vi, vj) is rdfs:subClassOf and the label value is ρ(vi, vj), the class axiom can be represented as ρ(vi, vj)/rdfs:subClassOf (L(vi), L(vj)). In a similar way, the extended fuzzy RDFS graph model can describe not only instance information but also structure information, and the inferred semantic data can be derived from the graph. Table 3.2 shows the fuzzy RDFS triples and their corresponding axioms. In addition, Definition 3.8 explicitly classifies the set of labels Σ into four categories: class resource labels, object property resource labels, datatype property resource labels, and datatype labels. Along the same lines, a crisp RDFS graph is simply a special case of a fuzzy RDFS data graph with fuzzy values of 0 or 1 on all vertices (resp. edges).
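Condition 6 of Definition 3.8, ρ(vi, vj) ≤ μ(vi) ∧ μ(vj), can be validated mechanically; the dictionary encoding and vertex names below are illustrative:

```python
# Validate the structural constraint of a fuzzy RDF(S) data graph:
# every edge degree rho(vi, vj) must not exceed the minimum of the
# membership degrees mu(vi) and mu(vj) of its endpoints.

def check_fuzzy_graph(mu, rho):
    """mu: dict vertex -> membership degree in [0, 1];
    rho: dict (vi, vj) -> edge degree in [0, 1]."""
    for (vi, vj), degree in rho.items():
        if degree > min(mu[vi], mu[vj]):
            return False
    return True

mu = {"Actor": 1.0, "pid2": 0.9, "addid2": 0.6}
rho = {("pid2", "Actor"): 0.9, ("pid2", "addid2"): 0.5}
print(check_fuzzy_graph(mu, rho))  # True

rho_bad = {("pid2", "addid2"): 0.8}   # 0.8 > min(0.9, 0.6) violates 6.
print(check_fuzzy_graph(mu, rho_bad))  # False
```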

3.4 Similarity Matching of Fuzzy RDF Graphs

Data matching is the process of bringing data from different data sources together and comparing them to find out whether they represent the same real-world object in a given domain (Dorneles et al., 2011). Fuzzy RDF data matching is a fundamental problem in the integration of fuzzy RDF data. Based on the fuzzy RDF data model, we propose an approach for fuzzy RDF graph matching in this section. The method computes multiple measures of similarity among graph elements: syntactic,


semantic and structural. These measures are composed in a principled manner for graph matching. In particular, an iterative similarity function is introduced with the consideration of structural information of fuzzy RDF graph.

3.4.1 Matching Semantics

RDF data have a natural representation in the form of labeled directed graphs, in which vertices represent resources and values (also called literals), and edges represent semantic relationships between resources. The RDF data matching problem has therefore often been addressed as a graph matching problem.

Definition 3.9 (Fuzzy RDF graph matching). Given two fuzzy RDF graphs GS and GT from a given domain, the matching problem is to identify all correspondences between graphs GS and GT representing the same real-world object.

The match result is typically represented by a set of correspondences, sometimes called a mapping. A correspondence c = (id, Es, Et, m) interrelates two elements Es and Et from graphs GS and GT. An optional similarity degree m ∈ [0, 1] indicates the similarity or strength of the correspondence between the two elements.

Definition 3.10 (Similarity function). Let GS and GT be two datasets. A similarity function is defined as Fs(s, t) → [0, 1], where (s, t) ∈ GS × GT, i.e., the function computes a normalized value for every pair (s, t). The higher the score value, the more similar s and t are. The advantage of using similarity functions is that score values lie in a finite interval.

Definition 3.11 (Predecessor set and successor set). Let G be a fuzzy RDF graph. For any vertex v ∈ V of graph G, pre(v) = {v′ | (v, v′) ∈ E} is the predecessor set (i.e., forward neighbors) of v, and succ(v) = {v′ | (v′, v) ∈ E} is the successor set (i.e., backward neighbors) of v.

For example, Fig. 3.3a, b illustrate two fragments of fuzzy RDF data graphs with some fuzzy elements and some crisp ones. The edge "pid2-has_address-addid2" associated with the membership degree 0.5 in the target graph of Fig. 3.3 represents the fact that the person labeled pid2 has the address labeled addid2, and the possibility of this fact is 0.5. Note that opaque labels exist, as shown in Fig. 3.3b.
The resource "_:" is distinct from the others, and its resource name is opaque. According to the RDF specification (Manola et al., 2004), a blank vertex can be assigned an identifier prefixed with "_:". To accommodate dislocated matching (Zhu et al., 2014), in which some vertices of an RDF graph can serve as starting/ending vertices, we add the following restrictions on fuzzy RDF graphs:


Fig. 3.3 The fuzzy RDF graphs. a Source graph; b target graph

1. There is one and only one vertex in the RDF graph, called the home vertex and denoted by v̂, which indicates the virtual beginning/end of all paths in the RDF graph. We specify that the label of the home vertex is "_:H", i.e., L(v̂) = "_:H".
2. There are paths from the home vertex to any other vertex in the fuzzy RDF graph. That is, for each vertex v ∈ V except v̂, we add two edges (v, v̂) and (v̂, v). Thus, a path can begin (or end) with vertex v at all the locations where v occurs. Moreover, we associate ρ(v, v̂) = ρ(v̂, v) = μ(v), i.e., we regard the fuzzy degree associated with each vertex, which represents the possibility that the vertex exists in the graph, as the fuzzy degree of the edges between the home vertex and that vertex.

The matching procedure takes as input two fuzzy RDF graphs and outputs a set of correspondences between the two graphs. Figure 3.4 illustrates an overview of the framework, which has three main stages. First, a vertex-to-vertex similarity score is computed using different similarity functions. Label similarity functions adopt different computation strategies to compute multiple types of vertex label similarity scores. The structural similarity function iteratively computes similarity scores for every vertex pair by aggregating the similarity scores of edges and immediate neighbor vertices. Then, we obtain the overall similarity by combining label similarity scores and structural similarity scores. Finally, we select the potential correspondences based on the similarity scores and include them in the alignment.
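The home-vertex restriction described above can be sketched as follows; the dictionary encoding of the graph is an assumption:

```python
# Augment a fuzzy RDF graph with the virtual home vertex "_:H": for
# every other vertex v, add edges (v, _:H) and (_:H, v) whose fuzzy
# degree is mu(v), so that every path can virtually begin or end at v.

def add_home_vertex(mu, rho, home="_:H"):
    """mu: dict vertex -> membership degree; rho: dict edge -> degree.
    Returns augmented copies (mu2, rho2)."""
    mu2, rho2 = dict(mu), dict(rho)
    mu2[home] = 1.0
    for v, degree in mu.items():
        rho2[(home, v)] = degree
        rho2[(v, home)] = degree
    return mu2, rho2

mu = {"pid2": 0.9, "addid2": 0.5}
mu2, rho2 = add_home_vertex(mu, {("pid2", "addid2"): 0.5})
print(rho2[("_:H", "pid2")], rho2[("addid2", "_:H")])  # 0.9 0.5
```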

3.4 Similarity Matching of Fuzzy RDF Graphs


Fig. 3.4 The framework of fuzzy RDF graph matching

3.4.2 Matching Approach

3.4.2.1 Label Similarity Function

1. Syntactic Similarity

Intuitively, the label denoting an element typically captures the most distinctive characteristic of that element in the RDF graph model. The syntactic similarity assigns a normalized similarity value to every pair (s, t) by applying the Levenshtein distance (Levenshtein, 1966) to the name labels of s and t. Formally, the syntactic similarity sim_sy(s, t) between two name labels s and t is defined as follows:

sim_sy(s, t) = 1 − LD(s.label, t.label) / max(|s.label|, |t.label|)   (3.1)

Here s.label and t.label denote the name labels of s and t, respectively, max(|s.label|, |t.label|) is the length of the longer of the two name strings, and LD(w1, w2) is the Levenshtein distance between two words w1 and w2.

2. Semantic Similarity

For semantic similarity, we use the WordNet::Similarity package (Pedersen et al., 2004) to obtain the semantic relatedness between element labels based on their linguistic correlations. In our work, we use the Jaccard similarity (Jaro, 1989) measure. In many cases, the element labels whose relatedness is being measured are phrases or short sentences, e.g., “house-number” and “room number” in Fig. 3.3. In these cases, we need a measure that computes a similarity degree for element labels expressed as a sentence or phrase. To this end, we use a simple measure from natural language processing. The method takes two phrases or short sentences as input and computes a semantic similarity as output in four steps:

Step 1: Tokenize the names of both labels. We tokenize each label into a set of words. Denote by s.tok_j the j-th token in the name label of s.
Step 2: Search the synset of each token using WordNet. Denote by syn(w) the WordNet synset of a word w.


Step 3: Compute the semantic similarity. We use the Jaccard similarity to calculate the semantic similarity on the synsets of each pair of tokens.
Step 4: Return the average-max semantic similarity as the result. The formula is shown as follows:

sim_se(s, t) = (1/|s.tok|) Σ_i max_j Jaccard(syn(s.tok_i), syn(t.tok_j))   (3.2)

Here |s.tok| is the number of tokens in the name of s, Jaccard denotes the Jaccard similarity between two sets, and syn(w) denotes the WordNet synset of a token w.
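As an illustration, Formulas (3.1) and (3.2) can be sketched as follows; this is not the book's implementation, and the `syn` lookup is a stub standing in for a WordNet query:

```python
def levenshtein(w1, w2):
    # classic dynamic-programming edit distance LD(w1, w2)
    m, n = len(w1), len(w2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if w1[i - 1] == w2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def sim_sy(s_label, t_label):
    # Formula (3.1): normalized Levenshtein similarity of two name labels
    if not s_label and not t_label:
        return 1.0
    return 1.0 - levenshtein(s_label, t_label) / max(len(s_label), len(t_label))

def jaccard(a, b):
    # Jaccard similarity of two sets
    return len(a & b) / len(a | b) if (a | b) else 0.0

def sim_se(s_tokens, t_tokens, syn):
    # Formula (3.2): average over s-tokens of the best Jaccard score
    # between synsets; `syn` maps a token to its set of synonyms
    total = 0.0
    for ts in s_tokens:
        total += max(jaccard(syn(ts), syn(tt)) for tt in t_tokens)
    return total / len(s_tokens)
```

Here `syn` is assumed to return a plain set of synonym strings; in practice it would be backed by WordNet via the WordNet::Similarity package.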

3.4.2.2 Structural Similarity Function

Inspired by the work in Nejati et al. (2011), we propose a structural similarity function for matching RDF graphs, in which we further take edge information into consideration.

1. Structural Similarity Function

Let vs be a vertex of the fuzzy RDF graph Gs (i.e., vs ∈ Gs), let vs′ ∈ pre(vs), and let es ∈ Es be a directed edge from vertex vs to vertex vs′. Our matching method iteratively computes a similarity degree for every vertex pair (vs, vt) of two fuzzy RDF graphs by aggregating the similarity degrees between the immediate neighbors of vs and vt. By neighbors, we mean either successors or predecessors, depending on which bisimilarity notion is being used. The method iterates until either the similarity degrees between all vertex pairs stabilize or a maximum number of iterations is reached. If the similarity of edge es differs from that of et, or the fuzzy membership ρs deviates far from the fuzzy membership ρt, the similarity degree of vs′ and vt′ will have less effect on the similarity degree of vs and vt.

For proper aggregation of similarity degrees between vertices of two fuzzy RDF graphs, we further take edge label similarity and edge fuzzy membership into consideration. We compare edge labels using the label similarity functions defined in Sect. 3.4.2.1. These similarity functions assign a similarity degree sim_e(es, et) to every edge label pair (es, et). We use a measure of similarity of fuzzy membership degrees (Pappis & Karacapilidis, 1993), which is based on the difference as well as the sum of the corresponding grades of membership. Let ρs and ρt be the fuzzy membership degrees of edges es and et, respectively. The similarity of the fuzzy values ρs and ρt is defined by

sim_ρ(ρs, ρt) = 1 − ((ρs ∨ ρt) − (ρs ∧ ρt)) / (ρs + ρt)   (3.3)

Here (ρs ∨ ρt) = max(ρs, ρt) and (ρs ∧ ρt) = min(ρs, ρt).

We now describe the computation of forward similarity. For every vertex pair (vs, vt), the similarity degree Sim_i(vs, vt) is computed from (i) the similarity degree between vs and vt after step i − 1, i.e., Sim_{i−1}(vs, vt); (ii) the similarity degrees between the forward neighbors of vs and those of vt after step i − 1, i.e., Sim_{i−1}(vs′, vt′); (iii) the similarity degrees between the edge labels relating vs and vt to their forward neighbors, i.e., sim_e(es, et); and (iv) the similarity degree between the fuzzy values of edges es and et.

To find the best match for vs among the forward neighbors of vt, we need to maximize the value sim_e(es, et) × sim_ρ(ρs, ρt) × Sim_{i−1}(vs′, vt′). The similarity degrees between the forward neighbors of vs and their best matches among the forward neighbors of vt after the ith iteration are computed by

sim_i(vs, vt) = (1/|pre(vs)|) Σ_{vs′ ∈ pre(vs)} max_{vt′ ∈ pre(vt)} (sim_e(es, et) × sim_ρ(ρs, ρt) × Sim_{i−1}(vs′, vt′))   (3.4)

And the similarity degrees between the forward neighbors of vt and their best matches among the forward neighbors of vs after iteration i are computed by

sim_i(vt, vs) = (1/|pre(vt)|) Σ_{vt′ ∈ pre(vt)} max_{vs′ ∈ pre(vs)} (sim_e(et, es) × sim_ρ(ρt, ρs) × Sim_{i−1}(vt′, vs′))   (3.5)

Note that this sim measure is asymmetric, i.e., sim_i(vs, vt) ≠ sim_i(vt, vs). In conclusion, we define the forward similarity degree of vertex pair (vs, vt) after the ith iteration as follows:

Sim_i(vs, vt) = ((sim_i(vs, vt) + sim_i(vt, vs))/2 + Sim_{i−1}(vs, vt))/2   (3.6)

For calculating backward similarity degrees, we apply the above formulas to vertices vs and vt, but consider their backward neighbors instead of their forward neighbors.

2. Iterative Computation

To calculate Sim(vs, vt) from forward neighbors, we present a method that iteratively applies Formula (3.6). The computation has two phases: the initialization phase, which assigns Sim_0(vs, vt) for every vertex pair (vs, vt), and the iteration phase, which updates the degree Sim_i(vs, vt) using Sim_{i−1}(vs, vt) according to Formula (3.6) for i ≥ 1. The principle of the method is that the similarity between two vertices depends on the similarities between their adjacent vertices. We summarize the procedure of the iterative computation of forward similarity in Algorithm 3.1.


Algorithm 3.1 Structural similarity algorithm
Input: two fuzzy RDF graphs GS and GT, and a constant ε
Output: matching similarity Sim
1:  for each vs ∈ VS, vt ∈ VT do
2:    ρ(v̂s, vs) ← μ(vs) and ρ(v̂t, vt) ← μ(vt)
3:  for each es ∈ VS × {v̂s}, et ∈ VT × {v̂t} do
4:    sim(es, et) ← 1
5:  for each vs ∈ VS and vt ∈ VT do
6:    if (vs = v̂s) and (vt = v̂t) then
7:      Sim0(vs, vt) ← 1
8:    else
9:      Sim0(vs, vt) ← 0
10: repeat
11:   i ← i + 1
12:   for each vs ∈ VS and vt ∈ VT do
13:     Simi(vs, vt) ← ((simi(vs, vt) + simi(vt, vs))/2 + Simi−1(vs, vt))/2
14: until |Simi(vs, vt) − Simi−1(vs, vt)| < ε
15: return Sim

Algorithm 3.1 iteratively applies Formula (3.6). We begin by assigning the fuzzy degree μ(v) associated with each vertex v ∈ V except v̂ to the edges (v, v̂) and (v̂, v) in lines 1–2. Then we initialize the similarity sim_0(ês, êt) for each edge pair (ês, êt) formed by the home vertices v̂s or v̂t connected with the other real vertices in lines 3–4. The similarity sim_0(ês, êt) is set to 1.0 since these edges have no labels. At the same time, we initialize the similarities for vertex pairs in lines 5–9. For the home vertex pair (v̂s, v̂t), we set Sim_0(v̂s, v̂t) = 1.0 in line 7. For every other vertex pair (vs, vt), we assign the initial similarity degree Sim_0(vs, vt) = 0 in line 9. We then use the iterative method to calculate the matching similarity in lines 10–14. In each iteration, we update the similarity degree for each vertex pair (vs, vt) using the similarities of their neighbors in the previous iteration. The value Sim_i(vs, vt) is non-decreasing as i increases, and the method iterates until either the similarity degrees between all vertex pairs stabilize or a maximum number of iterations is reached. Note that the similarities between home vertices and real vertices are not refreshed during the iteration. Finally, we return the matching similarity Sim as the result in line 15.

3. Convergence

Lemma 3.1: The iterative Formula (3.6) is bounded in the interval [0, 1].

Proof: This follows directly from the definitions of the iterative formula. For all vs ∈ VS, vt ∈ VT, i ≥ 1, Sim_i(vs, vt) ∈ [0, 1], i.e., the iterative formula is bounded.

Lemma 3.2: The iterative Formula (3.6) is monotone non-decreasing.


Proof: This can be proved by mathematical induction.

Basis: We show that the monotonicity holds for i = 1. If v̂s ∈ pre(vs) and v̂t ∈ pre(vt), we have sim_1(vs, vt) ∈ [0, 1]; otherwise, sim_1(vs, vt) = 0. Similarly, we have sim_1(vt, vs) ∈ [0, 1]. According to the iterative Formula (3.6), Sim_1(vs, vt) ∈ [0, 1]. Since the initial similarity Sim_0(vs, vt) = 0, we have Sim_0(vs, vt) ≤ Sim_1(vs, vt). Therefore, the monotone non-decreasing property holds for i = 1.

Inductive step: Assume Sim_{k−1}(vs, vt) ≤ Sim_k(vs, vt) holds for i = k. According to the above iterative formula definitions, we have

sim_k(vs, vt) = (1/|pre(vs)|) Σ_{vs′ ∈ pre(vs)} max_{vt′ ∈ pre(vt)} (sim_e(es, et) × sim_ρ(ρs, ρt) × Sim_{k−1}(vs′, vt′))
             ≤ (1/|pre(vs)|) Σ_{vs′ ∈ pre(vs)} max_{vt′ ∈ pre(vt)} (sim_e(es, et) × sim_ρ(ρs, ρt) × Sim_k(vs′, vt′)) = sim_{k+1}(vs, vt).

Thus, we have Sim_k(vs, vt) ≤ Sim_{k+1}(vs, vt); that is, the monotone non-decreasing property holds for i = k + 1. Since both the basis and the inductive step have been established, by mathematical induction, the monotone non-decreasing property holds for all i ≥ 1.

Theorem 3.2: Iterative Formula (3.6) is convergent.

Proof: This follows directly from Lemma 3.1 and Lemma 3.2, since a bounded monotone non-decreasing sequence converges.
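The structural computation above can be sketched as follows. This is an assumed simplification: each graph is given by neighbor lists `[(neighbor, edge_label, rho)]` keyed by vertex, the home vertex is labeled `"_:H"`, and `sim_e` is a pluggable edge-label similarity function.

```python
HOME = "_:H"  # label of the home vertex (Sect. 3.4.1)

def sim_rho(rho_s, rho_t):
    # Formula (3.3): similarity of two fuzzy membership degrees
    if rho_s + rho_t == 0:
        return 1.0
    return 1.0 - (max(rho_s, rho_t) - min(rho_s, rho_t)) / (rho_s + rho_t)

def one_side(pre_a, pre_b, a, b, Sim, sim_e, flip=False):
    # Formulas (3.4)/(3.5): average, over a's neighbors, of the best
    # weighted match among b's neighbors
    if not pre_a[a]:
        return 0.0
    total = 0.0
    for na, la, ra in pre_a[a]:
        best = 0.0
        for nb, lb, rb in pre_b[b]:
            pair = (nb, na) if flip else (na, nb)
            best = max(best, sim_e(la, lb) * sim_rho(ra, rb) * Sim[pair])
        total += best
    return total / len(pre_a[a])

def structural_similarity(pre_s, pre_t, sim_e, eps=1e-4, max_iter=100):
    # Algorithm 3.1: the home-vertex pair starts at 1 and is never
    # refreshed; every other pair starts at 0 and follows Formula (3.6)
    Sim = {(vs, vt): (1.0 if vs == HOME and vt == HOME else 0.0)
           for vs in pre_s for vt in pre_t}
    for _ in range(max_iter):
        new = dict(Sim)
        for vs, vt in Sim:
            if vs == HOME or vt == HOME:
                continue
            s1 = one_side(pre_s, pre_t, vs, vt, Sim, sim_e)        # (3.4)
            s2 = one_side(pre_t, pre_s, vt, vs, Sim, sim_e, True)  # (3.5)
            new[(vs, vt)] = ((s1 + s2) / 2 + Sim[(vs, vt)]) / 2    # (3.6)
        if max(abs(new[k] - Sim[k]) for k in Sim) < eps:
            return new
        Sim = new
    return Sim
```

On two one-vertex graphs connected to their home vertices, the pair similarity rises monotonically toward 1, illustrating Lemmas 3.1 and 3.2.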

3.4.2.3 Combining Similarities and Alignment Extraction

In order to obtain the overall similarity degrees between vertices, we need to aggregate the results of the different similarity functions. There are several approaches to this, including linear averages, nonlinear averages and machine learning techniques. In our work, we use a simple approach based on linear averages. First, we obtain the label similarity (SimL) by taking an average of the syntactic (sim_sy) and semantic (sim_se) similarities. Then, the total similarity (Sim) is calculated by combining the label similarity and the structural similarity (SimS). To more accurately distinguish between similarity scores that are close to the median, we apply a non-linear sigmoid function (Ehrig & Sure, 2004) to each similarity score. The idea behind using a sigmoid function is quite simple: it reinforces similarity scores higher than 0.5 and weakens those lower than 0.5. That is to say, the sigmoid function yields high values for the best matches and lower ones for the worse matches. This treatment is meant to clearly separate two zones: the positive and the negative correspondences. In this way, the general formula for this combination task can be given as follows:


Sim(vs, vt) = ω·sig(SimL(vs, vt)) + (1 − ω)·sig(SimS(vs, vt))   (3.7)

Here ω is a pre-defined weight and sig(x) = 1/(1 + e^{−α(x−0.5)}), α being a parameter for the slope. To obtain a correspondence relation between the input fuzzy RDF graphs, we set a threshold δ for translating the overall similarity degrees into a binary relation. Pairs of data with a similarity degree above the threshold are included in the alignment and the rest are left out. However, the choice of the threshold δ is difficult: an increment of δ results in increased matching quality (i.e., a low number of false positives), but simultaneously reduces the matching coverage (i.e., a higher number of false negatives). Similarly, a smaller δ decreases the matching quality along with a higher matching coverage. In practice, we accept a small decrease in matching quality if it brings about a comparable increase in matching coverage, because it is easier to remove incorrect matches than to find the missing ones in the process of data matching.
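A minimal sketch of this combination step, with assumed values for α, ω and the threshold δ (the book does not fix them):

```python
import math

def sig(x, alpha=12.0):
    # sigmoid centred at 0.5; alpha (an assumed value) controls the slope,
    # reinforcing scores above 0.5 and weakening those below
    return 1.0 / (1.0 + math.exp(-alpha * (x - 0.5)))

def overall_sim(sim_label, sim_struct, omega=0.5):
    # Formula (3.7): weighted combination of label and structural similarity
    return omega * sig(sim_label) + (1 - omega) * sig(sim_struct)

def extract_alignment(scores, delta):
    # keep only the vertex pairs whose overall similarity exceeds delta
    return {pair for pair, s in scores.items() if s > delta}
```

Raising `delta` trades coverage for quality, as discussed above.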

3.5 Algebraic Operations in Fuzzy RDF Graphs

As evidenced by database management systems, a formal algebra is essential for applying database-style optimization to query processing. Similarly, alongside the fuzzy RDF model introduced in the previous sections, fuzzy RDF algebraic operations should be defined to support fuzzy RDF queries. In this section, we introduce a general algebraic framework for supporting imprecise and uncertain RDF queries (Ma et al., 2018). The algebra serves as a target language for translation from a declarative user-oriented query language for fuzzy RDF. It is user-friendly and can provide a concise representation of query execution. In the following, we introduce several common fuzzy RDF algebraic operators for SPARQL graph patterns, for example union, selection, left join and projection, because these operations can be directly applied in the UNION, FILTER, OPTIONAL and SELECT expressions of SPARQL, respectively. In addition, we need to investigate some additional operations to deal with the RDF graph model, because SPARQL queries run on the RDF graph data model. To satisfy both general and RDF-specific properties, we design our algebra with three main categories of operations: graph-set operations, pattern-matching operations and construction operations. Graph-set operations take a collection of graphs (a collection of vertices or edges being the extreme case) and perform set-theoretical operations, although some may not have exactly the same semantics as their relational counterparts. Pattern-matching operations are oriented to structural selection and extraction by employing pattern graphs. Construction operations are designed to facilitate the result graph construction for RDF queries by providing a means for creating and inserting new vertices/edges and manipulating the extracted structures.


3.5.1 Algebraic Operations

1. Set Operations

Set operations take a set of graphs as input and then perform set-theoretical operations on them. Here we identify four standard fuzzy set-graph operations: fuzzy union (∪), fuzzy intersection (∩), fuzzy Cartesian product (×) and fuzzy difference (−).

Fuzzy union: Let G1 = (V1, E1, Σ1, L1, μ1, ρ1) and G2 = (V2, E2, Σ2, L2, μ2, ρ2) be two fuzzy RDF sub-graphs of G. The fuzzy union of G1 and G2 is defined as follows.

G1 ∪ G2 = (Vr, Er, Σr, Lr, μr, ρr)

Here Vr = V1 ∪ V2, Er = E1 ∪ E2, Σr = Σ1 ∪ Σ2, and Lr = L1 ∪ L2 are the classic set-theoretical unions, and μr and ρr are the membership degrees of the fuzzy union result (Sunitha, 2001):

μr(v) = μ1(v), ∀v ∈ V1 − V2
μr(v) = μ2(v), ∀v ∈ V2 − V1
μr(v) = μ1(v) ∨ μ2(v), ∀v ∈ V1 ∩ V2

ρr(vi, vj) = ρ1(vi, vj), ∀(vi, vj) ∈ E1 − E2
ρr(vi, vj) = ρ2(vi, vj), ∀(vi, vj) ∈ E2 − E1
ρr(vi, vj) = ρ1(vi, vj) ∨ ρ2(vi, vj), ∀(vi, vj) ∈ E1 ∩ E2

where a ∨ b denotes the maximum of a and b (i.e., a ∨ b = max(a, b)). For example, we apply the fuzzy union operation to the fuzzy RDF graphs shown in Figs. 3.5a and 3.1; the result of the union operation is shown in Fig. 3.5b.

Fuzzy intersection: Let G1 = (V1, E1, Σ1, L1, μ1, ρ1) and G2 = (V2, E2, Σ2, L2, μ2, ρ2) be two fuzzy RDF sub-graphs of G. The fuzzy intersection of G1 and G2 is defined as follows.

G1 ∩ G2 = (Vr, Er, Σr, Lr, μr, ρr)

Here Vr = V1 ∩ V2, Er = E1 ∩ E2, Σr = Σ1 ∩ Σ2, and Lr = L1 ∩ L2 are the classic set-theoretical intersections, and μr(v) = μ1(v) ∧ μ2(v), ∀v ∈ V1 ∩ V2 and ρr(vi, vj) = ρ1(vi, vj) ∧ ρ2(vi, vj), ∀(vi, vj) ∈ E1 ∩ E2 are the membership degrees of the fuzzy intersection result (Sunitha, 2001), where a ∧ b denotes the minimum of a and b, i.e., a ∧ b = min(a, b). For example, we apply a fuzzy intersection operation to the fuzzy RDF graphs in Figs. 3.5a and 3.1; the result of the intersection operation is shown in Fig. 3.6.
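Under the simplifying assumption that a fuzzy RDF graph is represented by two dictionaries of membership degrees (one for vertices, one for edges), fuzzy union and intersection can be sketched as:

```python
def fuzzy_union(g1, g2):
    # fuzzy union: all vertices/edges, taking the max degree where shared
    (mu1, rho1), (mu2, rho2) = g1, g2
    mu = {v: max(mu1.get(v, 0.0), mu2.get(v, 0.0)) for v in mu1.keys() | mu2.keys()}
    rho = {e: max(rho1.get(e, 0.0), rho2.get(e, 0.0)) for e in rho1.keys() | rho2.keys()}
    return mu, rho

def fuzzy_intersection(g1, g2):
    # fuzzy intersection: common vertices/edges with the min degree
    (mu1, rho1), (mu2, rho2) = g1, g2
    mu = {v: min(mu1[v], mu2[v]) for v in mu1.keys() & mu2.keys()}
    rho = {e: min(rho1[e], rho2[e]) for e in rho1.keys() & rho2.keys()}
    return mu, rho
```

This representation drops labels and alphabets (Σ, L) purely for the sake of the example.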


Fig. 3.5 Fuzzy union operation

Fig. 3.6 Fuzzy intersection operation

Fuzzy Cartesian product: Let G1 = (V1, E1, Σ1, L1, μ1, ρ1) and G2 = (V2, E2, Σ2, L2, μ2, ρ2) be two fuzzy RDF sub-graphs of G. Then the fuzzy Cartesian product (Sunitha, 2001) of G1 and G2 is defined as follows.

G1 × G2 = (Vr, Er, Σr, Lr, μr, ρr)


Fig. 3.7 Fuzzy Cartesian product operation

Here Vr = V1 × V2, Er = {(u, u2)(u, v2) | u ∈ V1, u2v2 ∈ E2} ∪ {(u1, w)(v1, w) | w ∈ V2, u1v1 ∈ E1}, μr(u1, u2) = (μ1 × μ2)(u1, u2) = μ1(u1) ∧ μ2(u2), ∀(u1, u2) ∈ Vr, and

ρr((u, u2)(u, v2)) = μ1(u) ∧ ρ2(u2v2), ∀u ∈ V1, ∀u2v2 ∈ E2
ρr((u1, w)(v1, w)) = μ2(w) ∧ ρ1(u1v1), ∀w ∈ V2, ∀u1v1 ∈ E1

In the above definitions, an edge between two vertices u and v is denoted by uv rather than (u, v), because in the Cartesian product of two graphs a vertex is itself an ordered pair. For example, for the two simple fuzzy RDF graphs G and G′ in Fig. 3.7a, b, Fig. 3.7c shows the result of the fuzzy Cartesian product of G and G′.

Fuzzy difference: Let G1 = (V1, E1, Σ1, L1, μ1, ρ1) and G2 = (V2, E2, Σ2, L2, μ2, ρ2) be two fuzzy RDF sub-graphs of G. Then the fuzzy difference (Sunitha, 2001) of G1 and G2 is defined as follows.

G1 − G2 = (Vr, Er, Σr, Lr, μr, ρr)

Here Er = E1 − E2 with the classic set-theoretical difference, Vr consists precisely of those vertices which are induced by the set of edges in Er, μr(v) = μ1(v), ∀v ∈ Vr, and ρr(vi, vj) = ρ1(vi, vj), ∀(vi, vj) ∈ E1 − E2. Actually, the fuzzy difference of G1 and G2 defines a new fuzzy RDF graph formed by removing the edges of G2 from the edges of G1. Note that G1 − G2 is different from G2 − G1.
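With the same simplified (mu, rho) dictionary representation as an assumption, the fuzzy difference can be sketched as:

```python
def fuzzy_difference(g1, g2):
    # G1 - G2: remove the edges of G2 from G1; the vertex set is the one
    # induced by the surviving edges, with degrees taken from G1
    (mu1, rho1), (_, rho2) = g1, g2
    rho = {e: d for e, d in rho1.items() if e not in rho2}
    mu = {v: mu1[v] for e in rho for v in e}
    return mu, rho
```

Note the asymmetry: swapping the arguments generally yields a different result graph.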


Fig. 3.8 Fuzzy difference operation

For example, Fig. 3.8 shows the result of the fuzzy difference operation G1 − G2, where graph G1 is the fuzzy RDF graph in Fig. 3.1, and graph G2 is the fuzzy RDF graph of Fig. 3.5a.

Theorem 3.3: Let G1 = (V1, E1, Σ1, L1, μ1, ρ1) and G2 = (V2, E2, Σ2, L2, μ2, ρ2) be two fuzzy RDF graphs. Then we have the following.

1. G1 ∪ G2 is a fuzzy RDF graph.
2. G1 ∩ G2 is a fuzzy RDF graph.
3. G1 × G2 is a fuzzy RDF graph.
4. G1 − G2 is a fuzzy RDF graph.

Let G1 = (V1, E1, Σ1, L1, μ1, ρ1), G2 = (V2, E2, Σ2, L2, μ2, ρ2), …, Gi = (Vi, Ei, Σi, Li, μi, ρi) be i fuzzy RDF graphs. Then we say G = (V, E, Σ, L, μ, ρ) is the reconstruction of G1, G2, …, Gi if G comes from G1, G2, …, Gi based on Theorem 3.3.

2. Fuzzy Selection Operation

The fuzzy selection operation filters fuzzy graphs using a graph pattern. It accepts a set of fuzzy graphs and a fuzzy graph pattern as input. The output is a fuzzy collection composed of all subgraphs that match the given graph pattern, reflecting not only the content but also the structure of the matched graphs.

Fuzzy selection: Let G = (V, E, Σ, L, μ, ρ) be a fuzzy RDF data graph. For a given RDF graph pattern P = (VP, EP, FV, FE, RE), we have the definition of fuzzy selection as follows.

σP(G) = {⟨g, δP(g)⟩ | g ∈ ∈(P, G), δP(g) > 0}


Fig. 3.9 The result graph of fuzzy selection operation

Here g is a subgraph of G, the function ∈(P, G) is used for matching the fuzzy RDF graph pattern P with G, and δP(g) is the satisfaction degree. In case of duplicates (the same graph appearing with several satisfaction degrees), the highest satisfaction degree is kept.

For example, Fig. 3.9 shows the answer of σP(G), where P is the RDF graph pattern of Fig. 3.2 and G is the fuzzy data graph of Fig. 3.1. From the graph, the box office of the film labeled Film1 is over 3.5 billion and its genre is tragedy. Two people, labeled pid1 and pid2 respectively, are the stars of the film, and they were born in country1. Furthermore, the path going from pid1/pid2 to country1 satisfies the regular expression RE = “* · locateIn+”. Thus, there are two answers (Fig. 3.9a, b) matching the graph pattern P in the fuzzy data graph G. As the satisfaction degree is the minimum of the satisfaction degrees induced by Definition 3.4, we have δP(g1) = 0.7 in Fig. 3.9a and δP(g2) = 0.3 in Fig. 3.9b, respectively.

3. Fuzzy Projection Operation

Selection and projection are orthogonal operations in relational algebra. With RDF graphs, selection and projection are not so obviously orthogonal. However, they have different semantics that correspond to two return semantics for matching a pattern P against a fuzzy RDF graph G, and they are generalizations of their respective relational counterparts. The fuzzy projection in our data model takes a collection of fuzzy graphs as input, and an RDF graph pattern P and a projection list PL as parameters. A projection list is a list of object (vertex and edge) labels appearing in the pattern P, possibly adorned with *. The output of the projection includes all the objects appearing in the projection list, and the (partial) hierarchical relationships among the retained objects in the original input graph structure are preserved. Note that if the projection list is empty, just the matching graphs are returned.
This implies that the fuzzy projection may be regarded as eliminating the objects other than those specified from the fuzzy RDF data graph. The projection operation is defined as follows.

Fig. 3.10 The result graph of fuzzy projection operation

Fuzzy projection: Let G = (V, E, Σ, L, μ, ρ) be a fuzzy RDF data graph, ∏ a fuzzy projection function, and P an RDF graph pattern. Then the fuzzy projection can be defined as follows.

π_{P,PL}(G) = {⟨g, δT(g)⟩ | g ∈ ∏(P, PL, G), δT(g) > 0}

The result of the projection operation is a fuzzy set of graphs, and δT(g) is the satisfaction degree. The fuzzy projection operation returns a fuzzy set composed of all subgraphs of G that match the fuzzy graph pattern P. For example, we apply the same pattern graph of Fig. 3.2 and a projection to the fuzzy RDF graph of Fig. 3.1. Then we obtain the result of the projection operation shown in Fig. 3.10. The satisfaction degree δT(g) is 0.3. The difference in the output structures of the selection and projection operations is obvious.

4. Fuzzy Join Operation

The fuzzy join operation joins data graphs on a pattern. As in relational algebra, a join can be expressed as a Cartesian product followed by a fuzzy selection. The selection condition compares a property of the first graph with one of the second graph. In a valued join, the join condition is a predicate on vertex labels of the constituent graphs. In a structural join, the constituent graphs can be concatenated by edges or unification.

Fuzzy join: Let G1 and G2 be two fuzzy RDF graphs and P be an RDF graph pattern. Then the fuzzy join operation is defined as follows.

G1 ⨝P G2 = {g | g = σP(G1 × G2)}

Here P is to be matched against G1 × G2, and at least one predicate f in the FV of P is L(v1) = L(v2), where v1 matches vertices in G1 and v2 matches vertices in G2. That is, L(v1) refers to a vertex label in G1 and L(v2) to one in G2.

The left join of the above expressions is defined as G1 ⟕P G2, which has the following semantics: P1 and P2 are the two parts in P that are matched against G1


and G2 respectively: if no matching graph G2′ obtained from σP2(G2) satisfies the join condition L(v1) = L(v2), then output just σP1(G1); otherwise, output σP(G1 × G2).

5. Construction Operations

Querying a fuzzy RDF graph implies not only extracting interesting content from the input model but also constructing an output model by inserting new vertices/edges into or deleting vertices/edges from the extracted graph. Construction operations are designed to facilitate the result graph construction for RDF queries.

The vertex deletion operation removes identified vertices from a graph. A delete specification is used to identify vertices, and it indicates by vertex label which vertices to delete.

Vertex deletion: Formally, the delete operation takes a fuzzy data graph G = (V, E, Σ, L, μ, ρ) as input and a delete specification DS as parameter. A delete specification is a set of vertex labels appearing in G. It generates a fuzzy graph defined as follows:

K(G, DS) = {g | g = (V′, E′, Σ, L, μ, ρ)}

Here V′ = {v | v ∈ V and L(v) ∉ DS} and E′ is the restriction of E over V′ × V′.

Edge deletion follows the same idea as vertex deletion. It removes relationships from an RDF graph.

Edge deletion: The edge deletion operation takes as input a fuzzy graph G and a set of edge labels ES, and it returns a fuzzy graph defined as follows:

λ(G, ES) = {g | g = (V, E′, Σ, L, μ, ρ)}

Here E′ = {e | e ∈ E and L(e) ∉ ES}. Note that vertex deletion is very similar to projection. In fact, it can be viewed as projection with a complemented projection list, specifying vertices to be eliminated rather than vertices to be retained.

The vertex insertion operation adds a new vertex to the fuzzy RDF data graph. The type of the new vertex is a resource, blank node or literal, and the label of the new vertex is a URI if the vertex represents a resource or a string if the vertex represents a literal.

Vertex insertion: Let G be a fuzzy RDF graph, IS be an insert specification, which is a set of vertex labels, and δ be the fuzzy degree of the inserted vertices. The vertex insertion operation returns a fuzzy graph including the inserted vertices.

Φ(G, IS) = {g | g = (V′, E, Σ′, L, μ, ρ)}

Here V′ = V ∪ {v′ | L(v′) ∈ IS and μ(v′) = δ} and Σ′ = Σ ∪ IS.

The edge insertion operation adds a new property edge to connect a subject and an object in the RDF data graph.


Edge insertion: Let G be a fuzzy RDF graph, ES be a set of edge labels, and δ be the fuzzy degree of the inserted edges. The edge insertion operation returns a fuzzy graph including the inserted edges.

φ(G, ES) = {g | g = (V, E′, Σ′, L, μ, ρ)}

Here E′ = E ∪ {e′ | L(e′) ∈ ES and ρ(e′) = δ} and Σ′ = Σ ∪ ES.
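A sketch of the construction operations, under the assumption that a graph is a pair of dictionaries and edges are keyed by (subject, label, object) triples:

```python
def delete_vertices(g, ds):
    # K(G, DS): drop every vertex whose label is in DS and restrict
    # the edge set to the surviving vertices
    mu, rho = g
    mu2 = {v: d for v, d in mu.items() if v not in ds}
    rho2 = {(u, l, w): d for (u, l, w), d in rho.items()
            if u not in ds and w not in ds}
    return mu2, rho2

def delete_edges(g, es):
    # lambda(G, ES): drop every edge whose label is in ES
    mu, rho = g
    return mu, {(u, l, w): d for (u, l, w), d in rho.items() if l not in es}

def insert_vertex(g, label, delta):
    # Phi(G, IS) for one label: add a new vertex with fuzzy degree delta
    mu, rho = g
    mu2 = dict(mu)
    mu2[label] = delta
    return mu2, rho

def insert_edge(g, u, label, w, delta):
    # phi(G, ES) for one edge: connect subject u and object w
    mu, rho = g
    rho2 = dict(rho)
    rho2[(u, label, w)] = delta
    return mu, rho2
```

As noted above, `delete_vertices` behaves like a projection with a complemented projection list.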

3.5.2 Equivalences

Equivalence laws can be applied to rewrite algebra expressions into a form that satisfies certain needs. In this section, we present some algebraic equivalences based on data graph isomorphism. Algebraic laws are important for query optimization. Since our RDF graph algebra shares some operations with relational algebra, the related properties and laws defined in relational algebra carry over. We focus here on graph pattern properties that are unique to our algebra. First, we define an equivalence relationship between graph patterns.

Definition 3.12 (Equivalence of graph patterns): Let P1 and P2 be two graph pattern expressions. If for any valuation ξ of P1 and P2 over G it holds that ξ(P1) = ξ(P2), then the two graph pattern expressions P1 and P2 are equivalent, denoted by P1 ≡ P2.

There are some properties for the fuzzy RDF algebra.

Proposition 3.1 (Commutativity of ∪, ∩, ×, ⨝): If the operator is one of ∪, ∩, ×, and ⨝, then

1. G1 ∪ G2 = G2 ∪ G1
2. G1 ∩ G2 = G2 ∩ G1
3. G1 × G2 = G2 × G1
4. G1 ⨝ G2 = G2 ⨝ G1

Proposition 3.2 (Associativity of ∪, ∩, ×, ⨝): If the operator is one of ∪, ∩, ×, and ⨝, then

1. (G1 ∪ G2) ∪ G3 = G1 ∪ (G2 ∪ G3)
2. (G1 ∩ G2) ∩ G3 = G1 ∩ (G2 ∩ G3)
3. (G1 × G2) × G3 = G1 × (G2 × G3)
4. (G1 ⨝ G2) ⨝ G3 = G1 ⨝ (G2 ⨝ G3)

Proposition 3.3 (Commutativity of σ with ∪, −, ⨝, ⟕): Let G1, G2 be fuzzy data graphs and P be an RDF graph pattern. Then we have

1. σP(G1 ∪ G2) = σP(G1) ∪ σP(G2)
2. σP(G1 − G2) = σP(G1) − σP(G2)
3. σP(G1 ⨝ G2) = σP(G1) ⨝ G2


4. σP(G1 ⟕ G2) = σP(G1) ⟕ G2

Proposition 3.4 (Commutativity of π with ∪): Let G1, G2 be fuzzy data graphs, and P, P1 and P2 be RDF graph patterns. Then we have

1. πP(G1 ∪ G2) = πP(G1) ∪ πP(G2)
2. πP1(πP2(G)) = πP1∩P2(G)

Proposition 3.5 (Distributivity of ∪ with ⨝, ⟕): Let G1, G2 and G3 be fuzzy data graphs. Then we have

1. (G1 ⨝ (G2 ∪ G3)) ≡ ((G1 ⨝ G2) ∪ (G1 ⨝ G3))
2. (G1 ⟕ (G2 ∪ G3)) ≡ ((G1 ⟕ G2) ∪ (G1 ⟕ G3))
3. ((G1 ∪ G2) ⟕ G3) ≡ ((G1 ⟕ G3) ∪ (G2 ⟕ G3))

Proposition 3.6 (Decomposition and elimination): Let G be a fuzzy data graph, and P1 and P2 be RDF graph patterns. Then we have

1. σP1∧P2(G) = σP1(σP2(G))
2. σP1∨P2(G) = σP1(G) ∪ σP2(G)
3. σP1(σP2(G)) = σP2(σP1(G))

The list above is not comprehensive by any means. Further study of other algebraic properties of RDF graph patterns is part of our current research focus. We believe that studying these algebraic properties can yield fruitful results that can further be applied in tasks like caching RDF query results, view management and query result reuse.
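Proposition 3.6(1) can be illustrated with a toy sketch in which selection is modeled as filtering a fuzzy answer set by a predicate (an assumed simplification of graph-pattern matching):

```python
def select(answers, pred):
    # fuzzy selection sketch: keep (subgraph, degree) pairs with a
    # positive degree that satisfy the pattern predicate
    return [(g, d) for g, d in answers if pred(g) and d > 0]

# toy fuzzy answer set: each element is (subgraph-as-dict, satisfaction degree)
answers = [({"genre": "tragedy"}, 0.7), ({"genre": "comedy"}, 0.3)]
p1 = lambda g: g["genre"] == "tragedy"
p2 = lambda g: "genre" in g
p1_and_p2 = lambda g: p1(g) and p2(g)
```

Selecting with the conjunction of both predicates yields the same result as selecting with them in cascade, which is exactly the decomposition law.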

3.5.3 Relationship of SPARQL and the Algebraic Operations

To meet the needs of practical applications, modeling fuzzy RDF alone is not enough; querying fuzzy RDF is also necessary. This section investigates fuzzy RDF query processing according to the definitions of the fuzzy RDF algebraic operations presented above. We begin with a description of the characteristics of the SPARQL query language in the fuzzy RDF setting and then explain the translation of SPARQL queries into equivalent RDF algebraic expressions.

1. SPARQL Query in the Fuzzy RDF

SPARQL (Prud’hommeaux & Seaborne, 2008) is a proposal of a protocol and query language designed for easy access to RDF datasets. It defines a query language with an SQL-like syntax, including joins and the capability to retrieve and combine data from several graphs, where a simple query is based on graph patterns, and query processing consists of binding variables to generate pattern solutions. SPARQL comes with a powerful graph matching facility, whose basic constructs are so-called triple patterns. On top of that, SPARQL provides a number of advanced functions


for constructing more expressive queries, for stating additional filtering conditions, and for formatting the final output. The overall structure of the query language resembles SQL with its three major parts, denoted by the upper-case keywords SELECT, FROM, and WHERE.

1. The keyword SELECT determines the result specification, including solution modifiers. The statements after SELECT refer to the remainder of the query: the listed names are identifiers of variables for which return values are to be retrieved. In contrast to SQL, SPARQL allows several forms of returning the data: a table using SELECT, a graph using DESCRIBE or CONSTRUCT, or a TRUE/FALSE answer using ASK.
2. The keyword FROM specifies a dataset of one default graph and zero or more named graphs to be queried.
3. The keyword WHERE initiates the actual query, which is composed of a graph pattern. Informally speaking, this clause is given by a pattern that corresponds to an RDF graph where some resources have been replaced by variables. More complex patterns are also allowed, which are formed by using some algebraic operators. This pattern is used as a filter on the values of the dataset to be returned.

A classical SPARQL query suffers from a lack of query flexibility. The given query condition and the contents of the RDF repositories are all crisp. In this context, an answer will either definitely or definitely not satisfy the condition. In fuzzy RDF repositories, however, an answer may satisfy the query condition with a certain possibility and a certain membership degree even if the condition is crisp, due to the fact that the datasets are vague (or imprecise). Therefore, just like the definition of the fuzzy selection operation given above, one needs to compute an appropriate trustworthiness for the query results when fuzzy data are transformed through SPARQL queries. Thus, we introduce one additional expression, “WITH ⟨threshold⟩”.
The optional parameter [WITH <threshold>] indicates the condition that must be satisfied as the minimum membership degree threshold in [0, 1]. Users choose an appropriate value of <threshold> to express their requirements. Therefore, a canonical SPARQL statement is of the form SELECT—FROM—WHERE—[WITH <threshold>]. Utilizing such SPARQL, one can get the answers that satisfy both the given query condition and the given threshold. Therefore, depending on the different thresholds, which are values in [0, 1], the same query over the same fuzzy RDF may have different answers. Queries over fuzzy RDF databases are thus concerned with the numerous possible choices of threshold. Note that the item WITH <threshold> can be omitted, in which case the default threshold is exactly 1.

2. Translating SPARQL Pattern into Fuzzy RDF Algebraic Formalism

A principal motivation in designing the fuzzy RDF graph model is to use it as a basis for an efficient implementation of a high-level RDF query language. As the standard query language for RDF, SPARQL allows us to build complex group graph patterns. Group patterns can be used to restrict the scope of query conditions to certain parts of the pattern. Moreover, it is possible to define sub-patterns as being optional, or to provide

Table 3.3 The translation rules of a SPARQL query pattern into RDF graph algebra expressions

Original SPARQL syntax     Algebraic syntax
{t}                        (t)
{P1} OPTIONAL {P2}         P1 ⟕ P2
{P1} UNION {P2}            P1 ∪ P2
{P1 . P2}                  P1 ⨝ P2
{P FILTER R}               σR(P)

multiple alternative patterns. In this section, we begin with the expressive power of the fuzzy RDF algebra w.r.t. the core fragment of the SPARQL query language. Then, we show that every SPARQL query pattern can be translated into the fuzzy RDF algebraic terminology introduced above, and provide the procedure that performs this translation. Our fuzzy RDF algebra is designed with SPARQL's expressive power in mind. SPARQL pattern expressions from the WHERE clause can easily be translated into fuzzy RDF algebraic expressions. The reverse translation is not always possible, as there are fuzzy RDF algebra expressions (e.g., expressions with construction operations) that are not expressible in SPARQL. Before providing the procedure that performs this translation, we discuss the translation rules from SPARQL patterns into RDF algebra expressions. We do not recall the complete surface syntax of SPARQL here but simply introduce the underlying algebraic operations using our notation. Let G be an RDF graph over an RDF dataset D, let t denote a triple pattern, let P, P1, P2 be basic SPARQL graph patterns, let R be a filter condition, and let S be a set of variables. Table 3.3 shows the translation rules between SPARQL query patterns and RDF algebraic expressions. A SPARQL query pattern is either a basic graph pattern or a group graph pattern consisting of triple blocks and FILTER, OPTIONAL, and UNION graph patterns; some of these contain other graph patterns. The above translation is applied to a single SPARQL group graph pattern. Nested group graph pattern blocks in the WHERE clause can be handled quite easily, leading to the following result:

Theorem 3.4 Fuzzy RDF algebra expressions can express SPARQL query patterns.

Proof: SPARQL individual triple patterns can be expressed by "triple pattern matching" expressions. Basic graph patterns in SPARQL imply a join on common variables among individual triple patterns.
The UNION, FILTER, OPTIONAL, and SELECT expressions can be directly mapped to the "union", "selection", "leftjoin", and "projection" operators of the fuzzy RDF algebra. These identified pattern expressions are processed in their nesting sequence, inside out, and can then be combined by a cascade of "join" operators, in the same way that natural join is defined in relational algebra. Besides the conversion rules as such, it is of course also necessary to define how to transform SPARQL queries into expressions of this algebra in the first place. Based on the above translation rules and Theorem 3.4, we can transform any SPARQL pattern into an algebra expression. For the sake of readability, we assume that the translation


of triple blocks (basic graph patterns) is given (this translation is straightforward). In Algorithm 3.2, we show a transformation function Translate(G) of patterns in the SPARQL syntax into the algebraic formalism presented in Sect. 3.5.

Algorithm 3.2 Transformation of SPARQL pattern syntax into fuzzy RDF algebraic expression

Translate (group graph pattern G)
Input: a SPARQL pattern G
Output: an algebraic expression A
 1: A = φ; F = φ
 2: for each syntactic form g in G do
 3:   if g is triple pattern t then
 4:     A = (A ⨝ (t))
 5:   if g is OPTIONAL {P} then
 6:     A = (A ⟕ Translate(P))
 7:   if g is {P1} UNION … UNION {Pn} then
 8:     if n > 1 then
 9:       A′ = (Translate(P1) ∪ … ∪ Translate(Pn))
10:     else
11:       A′ = Translate(P1)
12:     A = (A ⨝ A′)
13:   if g is FILTER {R} then
14:     F = F ∧ {R}
15: end for
16: if F ≠ φ then
17:   A = σF(A)
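Algorithm 3.2 can be transcribed almost line for line. The sketch below is our own illustration: group graph patterns are encoded as tagged tuples, e.g. ("triple", t), ("optional", subG), ("union", [G1, …, Gn]), ("filter", R), and the output is a nested string in which JOIN, LEFTJOIN, UNION, and SIGMA stand in for ⨝, ⟕, ∪, and σ.

```python
# Illustrative transcription of Algorithm 3.2; the tagged-tuple
# encoding of group graph patterns is our assumption, not the book's.

def translate(group):
    A = "{}"          # empty pattern (phi)
    F = []            # accumulated filter conditions (phi)
    for tag, payload in group:
        if tag == "triple":                       # lines 3-4
            A = f"({A} JOIN ({payload}))"
        elif tag == "optional":                   # lines 5-6
            A = f"({A} LEFTJOIN {translate(payload)})"
        elif tag == "union":                      # lines 7-12
            parts = [translate(p) for p in payload]
            A2 = parts[0] if len(parts) == 1 else "(" + " UNION ".join(parts) + ")"
            A = f"({A} JOIN {A2})"
        elif tag == "filter":                     # lines 13-14
            F.append(payload)
    if F:                                         # lines 16-17
        A = f"SIGMA[{' AND '.join(F)}]({A})"
    return A

g = [("triple", "?x p ?y"),
     ("filter", "?y > 5"),
     ("optional", [("triple", "?x q ?z")])]
print(translate(g))
```

Running this on the small pattern g yields a selection over a left join of two joined triple patterns, mirroring the cascade of operators the algorithm builds.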

Algorithm 3.2 consists of three phases. In the first phase (Line 1), the set A, which stores the pattern, and the set F, which stores the filtering conditions, are initialized to be empty. In the second phase (Lines 2–15), translation is performed to obtain the algebraic expressions of all syntactic forms g in the group graph pattern G. In each iteration, if the sub-pattern g is a triple pattern or triple block, a join operation is performed for collecting triples and blocks (Lines 3–4). Then, for each sub-pattern g with OPTIONAL, a left join operation is performed to provide optional matching (Lines 5–6). Next, all occurrences of UNION are expressed using the binary union operator for specifying alternatives (Lines 7–12); in the case of a longer chain of alternatives, the patterns are processed two at a time in accordance with the association rules for UNION. Finally, if g is a FILTER operator and R is a SPARQL built-in condition, a conjunction is performed to combine the filter condition R with F as basic constraints (Lines 13–14). This procedure is repeated until all sub-patterns in G have been translated. In the third phase, if F is not empty, it is combined with A via the selection operator of the fuzzy RDF algebra (Lines 16–17).

In Algorithm 3.2, we focus on the core fragment of SPARQL query patterns and, thus, impose the following restrictions on graph patterns and the translation process. First, we mainly focus on the procedure that performs the translation of SPARQL patterns; that is, we do not take into account the solution modifiers and the output of a SPARQL query. Second, we do not consider blank vertices.


We make this simplification here to concentrate on the pattern matching part of the language. Third, we concentrate on the set semantics of graph patterns.

Proposition 3.7 Algorithm 3.2 is correct and complete for translating SPARQL patterns into RDF algebraic expressions.

This proposition can be proved inductively. First, the set of algebraic expressions is complete for the empty set; at each step, the SPARQL graph pattern G is completely extended for the current syntactic form g; and the number of syntactic forms in a SPARQL graph pattern is finite. The algorithm proceeds recursively until all syntactic forms have been completely translated into algebraic expressions. The procedure ends having an algebraic expression for each syntactic form in G.

In essence, a SPARQL query is a result constructor wrapped around a set of variable bindings generated by the graph pattern. Therefore, the final step of the translation generates the operators for the result type. The official W3C Recommendation (Prud'hommeaux & Seaborne, 2008) defines four different types of queries on top of pattern expressions, namely SELECT, ASK, CONSTRUCT, and DESCRIBE queries. Depending on the result type of the query, the translation creates the appropriate operator and connects it to the algebraic expression generated so far, i.e., it constructs a dataflow to the algebraic expression representing the graph pattern. We restrict our discussion to SELECT queries in Example 9. The various expressive features of a SELECT query can be successively replaced by an expression using fuzzy RDF algebra operators. In the following, we show how a fuzzy RDF algebraic expression is used to represent a SPARQL query. For convenience, we first use natural language to express the fuzzy RDF queries.
Then, we provide the SPARQL query statement written according to the official SPARQL syntax along with its equivalent RDF algebraic expression. For example, suppose that we want to query the name of a movie and the name of its starring actor. The movie's box office is more than $30 million. The starring actor was born in a place located in "region1" and, optionally (i.e., if available), we also want his or her partner. At the same time, the trustworthiness of the query result must be more than 0.2. Consider the SPARQL query written according to the official SPARQL syntax:

PREFIX ex:
SELECT ?x, ?z, ?p
FROM G
WHERE {
  ?x ex:boxOffice ?y FILTER (?y > "$30 million")
  ?x ex:starring ?z .
  ?z ex:bornIn ?c .
  ?c ex:locateIn ?r FILTER (?r = "region1")
  OPTIONAL { ?z ex:marriedTo ?p }
}
WITH 0.2


Following the grammar of SPARQL, the above pattern (the WHERE clause) is parsed as a single group graph pattern that contains the syntactic forms triple block, filter, triple block, filter, and optional graph pattern, in that order. The final optional graph pattern contains a group graph pattern with a single triple block. The translation procedure in Algorithm 3.2 starts with A = {} and F = {}. Then we consider all the syntactic forms in the pattern to obtain:

A = (({} ⨝ Translate(t1) ⨝ Translate(t2)) ⟕ Translate(gp1))
F = ((?y > "$30 million") ∧ (?r = "region1"))

Here t1 is ?x ex:boxOffice ?y; t2 is ?x ex:starring ?z . ?z ex:bornIn ?c . ?c ex:locateIn ?r; and gp1 is { ?z ex:marriedTo ?p }. The translations Translate(t1) and Translate(t2) are simply {(?x ex:boxOffice ?y)} and {(?x ex:starring ?z), (?z ex:bornIn ?c), (?c ex:locateIn ?r)}, respectively. To compute Translate(gp1), the algorithm proceeds recursively and outputs the pattern:

A′ = ({} ⨝ (?z ex:marriedTo ?p))

Finally, the graph pattern of the query in the algebraic syntax is:

P = σF(A)

Here A = (({} ⨝ {(?x ex:boxOffice ?y)} ⨝ {(?x ex:starring ?z), (?z ex:bornIn ?c), (?c ex:locateIn ?r)}) ⟕ ({} ⨝ {(?z ex:marriedTo ?p)})) and F = ((?y > "$30 million") ∧ (?r = "region1")). Assume that the input RDF graph G is given in Fig. 1. Then the above SPARQL query evaluated on the fuzzy RDF graph G is equivalent to the RDF algebraic expression:

πP,LS(G)

Here P = σF(A) is the pattern graph, LS = {?x, ?z, ?p} is the projection list, and G is the input RDF graph. It is easily verified that the answers are πP,LS(G) = {, }. Similar translations are also feasible for other SPARQL query types. The main challenge of translating a SPARQL query into an algebraic expression lies in the core fragment of the query pattern, which is common to all query types. We will briefly introduce the translation process corresponding to the different SPARQL query types.
A CONSTRUCT query can copy existing triples from a dataset, or can create new triples. In the former case, the result graph can be directly retrieved from the data source by the selection operation. In the latter case, an intermediate graph is first extracted by the selection operation, and then the required triples are extracted by the projection operation. Finally, construction operations are designed to facilitate result graph construction for RDF queries by providing a means for creating and inserting new vertices/edges and manipulating the


extracted structures. Of course, this process may be repeated using multiple construction operations until completion, with the specific number of construction operations and their complexity determined by the size of the query problem. ASK asks a query processor whether a given graph pattern describes a set of triples in a given dataset or not, and the processor returns a boolean true or false depending on whether there is a result graph. We can use a selection operation to extract a result graph from a specific data source, based on a given graph pattern. A DESCRIBE query takes each of the resources identified in a solution, together with any resources directly named by IRI, and assembles a single RDF graph by taking a "description" which can come from any available information, including the target RDF dataset. It is worth noting that, according to the SPARQL 1.1 specification, the description is determined by the SPARQL query processor, which has led to inconsistent implementations of DESCRIBE queries. In our solution, similar to the CONSTRUCT query, the query pattern is used to create a result set, and selection and projection operations return an RDF graph describing a set of IRIs and the resources that are bound to the given variable names, i.e., all the triples in the dataset involving these resources. Finally, the result RDF graph is obtained through the construction operation.
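Whatever the query type, evaluating the translated pattern ultimately reduces to joining fuzzy solution sets on their common variables. The toy sketch below illustrates that step; using min() to combine membership degrees is a common fuzzy-conjunction choice and our assumption here, since the book's own join semantics is defined in Sect. 3.5.1, and all data are invented.

```python
# Illustrative join of two fuzzy solution sets on common variables.
# Each solution is (binding dict, membership degree); degrees are
# combined with min(), a common fuzzy-conjunction choice (assumption).

def join(sols1, sols2):
    out = []
    for b1, m1 in sols1:
        for b2, m2 in sols2:
            shared = set(b1) & set(b2)
            if all(b1[v] == b2[v] for v in shared):
                out.append(({**b1, **b2}, min(m1, m2)))
    return out

s1 = [({"?x": "m1", "?y": 40}, 0.8)]                  # e.g. ?x boxOffice ?y
s2 = [({"?x": "m1", "?z": "actorA"}, 0.6),
      ({"?x": "m2", "?z": "actorB"}, 0.9)]            # e.g. ?x starring ?z
print(join(s1, s2))
# only bindings agreeing on ?x survive, with degree min(0.8, 0.6)
```

A WITH threshold would then simply discard joined solutions whose combined degree falls below it.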

3.6 Summary

The incorporation of fuzzy information in data models has been an important topic in the database community because such information extensively exists in real-world applications, in which fuzzy data play an important role in nature. The classical RDF model cannot satisfy the need for modeling and processing fuzzy information. Therefore, topics related to the modeling of fuzzy data are considered very interesting in the RDF data context. In this chapter, we address the need for considering fuzzy data as part of the RDF data model. We propose a fuzzy RDF graph data model to manipulate fuzzy information in RDF. We extend the ability of RDF to represent fuzzy information without changing the current RDF standard. We also introduce a fuzzy algebra based on the fuzzy RDF data model, which incorporates fuzzy information into query answering. This algebra consists of a family of operations that make it possible to express the data content and the structure of a fuzzy RDF graph. Moreover, we discuss how to use our algebra to capture queries expressed in the popular SPARQL query language. We investigate translation theorems and give the form of fuzzy querying with SPARQL.

In order to meet the needs of practical applications, just providing the modeling technology of fuzzy RDF is not enough; fuzzy RDF data management is also very necessary. RDF data management, and especially fuzzy RDF data management, typically faces two primary technical challenges: scalable storage and efficient querying of RDF data. Among these two issues, RDF data storage is the infrastructure of RDF data management. It is likewise true that fuzzy RDF data storage is very important


for fuzzy RDF data management. How to store RDF with imprecise or uncertain information has raised certain concerns, as will be introduced in the following chapter.

References

Chen, L., Gupta, A., & Kurul, M. E. (2005). A semantic-aware RDF query algebra. In Proceedings of the International Conference on Management of Data, Hyderabad, India.
Dividino, R., Sizov, S., Staab, S., & Schueler, B. (2009). Querying for provenance, trust, uncertainty and other meta knowledge in RDF. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 7(3), 204–219.
Dorneles, C. F., Gonçalves, R., & dos Santos Mello, R. (2011). Approximate data instance matching: A survey. Knowledge and Information Systems, 27(1), 1–21.
Ehrig, M., & Sure, Y. (2004). Ontology mapping—An integrated approach. In European Semantic Web Symposium (pp. 76–91). Springer.
Fan, T., Yan, L., & Ma, Z. (2019). Mapping fuzzy RDF(S) into fuzzy object-oriented databases. International Journal of Intelligent Systems, 34(10), 2607–2632.
Fan, W., Li, J., Ma, S., Tang, N., & Wu, Y. (2011). Adding regular expressions to graph reachability and pattern queries. In Proceedings of the 27th IEEE International Conference on Data Engineering, Hannover, Germany (pp. 39–50).
Frasincar, F., Houben, G. J., Vdovjak, R., & Barna, P. (2002). RAL: An algebra for querying RDF. In Proceedings of the Third International Conference on Web Information Systems Engineering (pp. 173–181).
Fukushige, Y. (2005). Representing probabilistic relations in RDF. In Proceedings of the International Semantic Web Conference, Galway, Ireland (pp. 106–107).
Grant, J., & Beckett, D. (2002). RDF test cases. http://www.w3.org/TR/2002/WD-rdf-testcases-20021112/
Hartig, O. (2009). Querying trust in RDF data with tSPARQL. In Proceedings of the 6th European Semantic Web Conference on the Semantic Web: Research and Applications, Heraklion, Crete, Greece (pp. 5–20).
Hayes, P. (2004). RDF semantics, W3C Recommendation. http://www.w3.org/TR/rdf-mt/
Huang, H., & Liu, C. (2009). Query evaluation on probabilistic RDF databases. In Proceedings of the 10th International Conference on Web Information Systems Engineering, Poznań, Poland (pp. 307–320).
Jaro, M. A. (1989). Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414–420.
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.
Lopes, N., Polleres, A., Straccia, U., & Zimmermann, A. (2010). AnQL: SPARQLing up annotated RDFS. In Proceedings of the 9th International Semantic Web Conference, Shanghai, China (pp. 518–533).
Ma, Z., Li, G., & Yan, L. (2018). Fuzzy data modeling and algebraic operations in RDF. Fuzzy Sets and Systems, 351, 41–63.
Ma, Z. M., Liu, J., & Yan, L. (2010). Fuzzy data modeling and algebraic operations in XML. International Journal of Intelligent Systems, 25(9), 925–947.
Manola, F., Miller, E., & McBride, B. (2004). RDF primer. W3C Recommendation, 10(1–107), 6.
Mazzieri, M., & Dragoni, A. F. (2008). A fuzzy semantics for the Resource Description Framework. In Uncertainty Reasoning for the Semantic Web I: ISWC International Workshops, URSW 2005–2007 (pp. 244–261). Springer.
Nejati, S., Sabetzadeh, M., Chechik, M., Easterbrook, S., & Zave, P. (2011). Matching and merging of variant feature specifications. IEEE Transactions on Software Engineering, 38(6), 1355–1375.
Pappis, C. P., & Karacapilidis, N. I. (1993). A comparative assessment of measures of similarity of fuzzy values. Fuzzy Sets and Systems, 56(2), 171–174.
Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity—Measuring the relatedness of concepts. In AAAI (Vol. 4, pp. 25–29).
Piattini, M., Galindo, J., & Urrutia, A. (2006). Fuzzy databases: Modeling, design and implementation.
Prud'hommeaux, E., & Seaborne, A. (2008). SPARQL query language for RDF. W3C Recommendation. http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
Robertson, E. L. (2004). Triadic relations: An algebra for the semantic web. In Proceedings of the Second International Workshop on Semantic Web and Databases, Toronto, Canada (pp. 91–108).
Straccia, U. (2009). A minimal deductive system for general fuzzy RDF. In Proceedings of the Third International Conference on Web Reasoning and Rule Systems, Chantilly, VA, USA (pp. 166–181).
Sunitha, M. S. (2001). Studies on fuzzy graphs. PhD thesis, Cochin University of Science and Technology, India.
Tappolet, J., & Bernstein, A. (2009). Applied temporal RDF: Efficient temporal querying of RDF data with SPARQL. In Proceedings of the 6th European Semantic Web Conference on the Semantic Web: Research and Applications, Heraklion, Crete, Greece (pp. 308–322).
Udrea, O., Recupero, D. R., & Subrahmanian, V. S. (2010). Annotated RDF. ACM Transactions on Computational Logic, 11(2), 1–41.
Udrea, O., Subrahmanian, V. S., & Majkic, Z. (2006). Probabilistic RDF. In 2006 IEEE International Conference on Information Reuse and Integration, Waikoloa Village, HI (pp. 172–177).
Zadeh, L. A. (1978). Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1), 3–28.
Zhu, X., Song, S., Lian, X., Wang, J., & Zou, L. (2014). Matching heterogeneous event data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (pp. 1211–1222).
Zimmermann, A., Lopes, N., Polleres, A., & Straccia, U. (2011). A general framework for representing, reasoning and querying with annotated semantic web data. Journal of Web Semantics, 11(3), 72–95.

Chapter 4

Persistence of Fuzzy RDF and Fuzzy RDF Schema

4.1 Introduction

RDF represents an emerging data model that provides the means to describe resources in a semi-structured manner for real-world applications. In practice, RDF is gaining widespread momentum and usage in different domains, such as the Semantic Web, Linked Data, Open Data, social networks, digital libraries, bioinformatics, and business intelligence. With the wide application of RDF, the scale of available RDF data is increasing dramatically. At this point, scalable storage and efficient querying of RDF data are becoming increasingly crucial, and the former is the infrastructure for RDF data management (Ma et al., 2016). We can identify three major types of RDF data store: memory-based storage, traditional database-based storage, and NoSQL database-based storage (Harris & Gibbins, 2003). While memory-based storage [e.g., BitMat (Atre et al., 2009), BRAHMS (Janik & Kochut, 2005), and RDFox (Nenov et al., 2015)] offers the fastest RDF processing, it can only store the most necessary RDF structural data due to memory usage restrictions. A more common RDF storage method is based on traditional databases, such as relational databases and object-oriented databases. In the context of relational databases, we can further identify three approaches. With the first one, called vertical stores or triple stores [e.g., RDFPeers (Cai & Frank, 2004), 3store (Harris & Gibbins, 2003), RDF-3X (Neumann & Weikum, 2008, 2010a, 2010b), and Hexastore (Weiss et al., 2008)], each RDF triple is stored as a tuple in a relational table with the relational schema (subject, predicate, object), in which each column corresponds to an element of the RDF triple. The disadvantage of this approach is that too many self-join operations must be applied when querying RDF data stored in the relational table.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. Z. Ma et al., Modeling and Management of Fuzzy Semantic RDF Data, Studies in Computational Intelligence 1057, https://doi.org/10.1007/978-3-031-11669-8_4

The second approach, called horizontal stores [e.g., SW-Store (Abadi et al., 2009) and C-Store (Weiss et al., 2018)], divides RDF triples vertically based on their predicates; the triples with the same predicate are then stored in one relational table. Such


a predicate-oriented relational table does not contain null values or multivalued attributes. But this approach involves complicated join operations among different relational tables to retrieve RDF data stored in multiple relational tables. With the third approach, called property stores (Chong et al., 2005; Sintek & Kiesel, 2006; Wilkinson et al., 2003), for the same or similar subjects, multiple attributes are designed in the form of n-ary table columns; each row then stores the same or similar subject and its corresponding attribute values. This approach can reduce join operations but faces the problems of null values and multivalued attributes. Although RDF storage based on relational databases provides a convenient way to manage RDF data, relational databases cannot well support the storage of massive RDF data. Therefore, to store large-scale RDF data, some distributed storage architectures have been developed specifically for massive RDF data in distributed RDF data management systems. More recently, NoSQL databases such as CouchDB, HBase (Sun & Jin, 2010), and graph databases (Peng et al., 2016) have been used to store and manage large-scale RDF data (Cudré-Mauroux et al., 2013). Note that all the aforementioned works assume that the underlying RDF data are reliable and precise. However, information is often imprecise and uncertain in many real-world applications, and many sources can contribute to the imprecision and uncertainty of data or information. Therefore, the study of reengineering fuzzy RDF into fuzzy database models has received attention. Fuzzy databases such as fuzzy relational databases and fuzzy object-oriented databases (Quasthoff & Meinel, 2011) can store a large set of semantic information, and reengineering fuzzy RDF into fuzzy database models may satisfy the needs of storing fuzzy RDF data in fuzzy databases.
Currently, there have been many efforts in storing crisp RDF data based on various databases but few in storing fuzzy RDF data (Bornea et al., 2013; Chen et al., 2006). To the best of our knowledge, there are only two efforts on the storage of fuzzy RDF. Ma and Yan (2018) investigated the formal mapping from the fuzzy RDF model to fuzzy relational databases, which is based on the fuzzy relational database model and supports the storage of fuzzy RDF triples. Considering the storage of fuzzy RDFS in addition to fuzzy RDF triples, Fan et al. (2019) presented an approach for reengineering fuzzy RDF(S) into fuzzy object-oriented database models. As in the situation of crisp RDF storage in databases, fuzzy relational databases and fuzzy object-oriented databases cannot effectively support large-scale fuzzy RDF data management. In this chapter, we introduce the issue of reengineering fuzzy RDF into fuzzy database models, including the fuzzy relational database model, the fuzzy object-oriented database model, and HBase databases.


4.2 Fuzzy RDF Mapping to Relational Databases

Because of their success in data storage and management, and because the subject–predicate–object triple form of RDF data can easily be mapped to the relational table model, relational databases are used by many researchers as RDF stores. Depending on the table structure to which the RDF triples are mapped in the relational database, the corresponding storage methods differ. To reengineer fuzzy RDF into the fuzzy relational database model, Ma and Yan (2018) investigated the formal mapping from the fuzzy RDF model to fuzzy relational databases. In this section, we investigate the strategies and approaches for mapping fuzzy RDF data to fuzzy relational databases based on the research work of Ma and Yan (2018). It is important to note that, for the sake of simplicity, the fuzzy RDF model in this section differs from the model defined in the previous section. That is, the fuzzy RDF model in this section only considers the fuzziness of triples and does not consider element-level fuzziness.

4.2.1 Fuzzy Triple Stores Model

In traditional RDF management, one straightforward way to maintain RDF triples is to store triple statements in a table-like structure. In particular, in this approach, the input RDF data are maintained as a linearized list of triples, stored as ternary tuples. Hence, an initial idea for fuzzy RDF data storage is to use a single relational table in which all fuzzy RDF triples are stored directly. Each fuzzy RDF triple then becomes a tuple of the relational database. Corresponding to the four components of each fuzzy RDF triple, the schema of the relational table includes three common columns for subject, predicate, and object, as well as one additional column for the membership degree. This kind of approach for storing fuzzy RDF data in relational databases is called fuzzy triple stores in this book. Formally, let a fuzzy relational schema be of the form (subject, predicate, object, μ). Then a fuzzy RDF triple, say (s, p, (o, λ)), is directly mapped into a tuple t = ⟨s, p, o, λ⟩ of the fuzzy relation. Here, t[subject] = s, t[predicate] = p, t[object] = o and t[μ] = λ, where t[A] represents t's value on attribute (column) A. For the fuzzy RDF triples and fuzzy RDF graph presented in Fig. 4.1, the relational representation of fuzzy triple stores is shown in Table 4.1. It can be seen from the example that fuzzy triple stores use a fixed relational schema. As a result, new triples can be inserted without considering RDF data types, so fuzzy triple stores can handle dynamic schemas of fuzzy RDF data. It can also be seen from the example that, for one subject, its objects with respect to different properties appear in different tuples. Therefore, fuzzy triple stores generally involve a number of self-join operations for querying.
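The mapping just described can be sketched directly in SQL. The (subject, predicate, object, μ) schema follows the text; the SQLite embedding, the sample rows drawn from Table 4.1, and the use of min() to combine degrees in the self-join are our illustrative choices.

```python
# Minimal sketch of a fuzzy triple store: one relation
# (subject, predicate, object, mu), one tuple per fuzzy triple.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT, mu REAL)"
)
fuzzy_triples = [
    ("CharlesFlint", "Born", "1850", 0.7),
    ("CharlesFlint", "Founder", "IBM", 1.0),
    ("LarryPage", "Founder", "Google", 1.0),
]
con.executemany("INSERT INTO triples VALUES (?, ?, ?, ?)", fuzzy_triples)

# Querying several properties of one subject needs a self-join, the
# drawback noted in the text: founders and their birth years, with the
# combined degree taken as min() of the two triple degrees (assumption).
rows = con.execute("""
    SELECT t1.subject, t2.object, MIN(t1.mu, t2.mu)
    FROM triples t1 JOIN triples t2 ON t1.subject = t2.subject
    WHERE t1.predicate = 'Founder' AND t2.predicate = 'Born'
""").fetchall()
print(rows)   # [('CharlesFlint', '1850', 0.7)]
```

Each extra property in the query adds another self-join over the same table, which is exactly why this layout scales poorly for complex queries.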


Fig. 4.1 RDF triples and fuzzy graph view. a Fuzzy RDF data, b fuzzy RDF graph

4.2.2 Fuzzy Horizontal Stores

In order to overcome the problem of self-joins in fuzzy triple stores, a single relational table containing all the different predicates as columns may be applicable. In this relational table, each unique predicate of the RDF triples directly represents a subject–object relation, in which the predicate serves as a column name and the object is a value of this column. Triples with the same subject become one tuple of the relational database. Note that several triples with the same subject may have the same predicate and different objects. In Fig. 4.1, for example, we have three triples (IBM, industry, Software, 1.0), (IBM, industry, Hardware, 1.0), and (IBM, industry, Services, 0.9). They are mapped into one tuple, whose value on attribute "industry" is a fuzzy set represented by {(Software, 1.0), (Hardware, 1.0), (Services, 0.9)}. The approach of storing fuzzy RDF data in a single relational table is called fuzzy horizontal stores in this book. Formally, for a given set T of fuzzy triples, suppose that n different predicates, say p1, p2, …, pn, are included. Then we have a fuzzy relational schema of the form (subject, p1, p2, …, pn). Any triple (si, pi, (oi, λi)) ∈ T should correspond to a tuple ti in the fuzzy relation. If no tuple has value si on attribute subject in the fuzzy relation, ti is a new tuple and is inserted into the fuzzy relation. At this point ti[subject] = si, ti[pi] = {(oi, λi)}, and the values of ti on the other attributes are null values. If there exists a tuple ti which has value si on attribute subject in the fuzzy relation (i.e., ti[subject] = si), we need to further determine whether ti[pi] is a null


Table 4.1 Triple stores of fuzzy RDF data

Subject        Predicate  Object         μ
Charles Flint  Born       1850           0.7
Charles Flint  Died       1934           0.8
Charles Flint  Founder    IBM            1.0
Larry Page     Born       1973           0.9
Larry Page     Founder    Google         1.0
Larry Page     Board      Google         0.8
Larry Page     Home       Palo Alto      0.6
Android        Developer  Google         0.9
Android        Version    4.1            0.7
Android        Kernel     Linux          0.8
Android        Preceded   4.0            0.8
Android        Graphics   OpenGL         0.8
Google         Industry   Software       0.9
Google         Industry   Internet       1.0
Google         Employees  54,604         0.8
Google         HQ         Mountain View  0.7
IBM            Industry   Software       1.0
IBM            Industry   Hardware       1.0
IBM            Industry   Services       0.9
IBM            Employees  433,362        0.8
IBM            HQ         Armonk         0.8
value or not. If ti[pi] is a null value, then ti[pi] = {(oi, λi)}; otherwise ti[pi] = ti[pi] ∪ {(oi, λi)}. For the fuzzy RDF triples and fuzzy RDF graph presented in Fig. 4.1, the relational representation of fuzzy horizontal stores is shown in Table 4.2. There are five different subjects and 13 unique predicates, so the single relational table contains five tuples and 13 predicate columns (attributes). It can be seen from the example that fuzzy horizontal stores use a single relational table which contains all the different predicates as columns. When new triples are inserted, new predicates result in changes to the relational schema, so dynamic schemas of RDF data cannot be handled. In addition, it is a common case that, in the single relational table containing all predicates as columns, a subject occurs only with some of the predicates, which leads to a sparse relational table with many null values. To solve the problem of too many null values in fuzzy horizontal stores, we propose two variations of fuzzy horizontal stores in the following, called fuzzy column stores and fuzzy type stores in this book. The basic idea of these two kinds of stores is to vertically partition the single table of fuzzy horizontal stores into a set of tables by the predicates. Each table contains one predicate (in fuzzy column stores) or several predicates (in fuzzy type stores).
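The insertion rule just described can be sketched as follows; the dictionary-of-dictionaries representation of the horizontal table (one row per subject, one fuzzy set per predicate cell) is our own illustrative encoding.

```python
# Sketch of the fuzzy horizontal-store insertion rule from the text:
# one row per subject; a cell holds a fuzzy set of (object, mu) pairs,
# merged when several triples share subject and predicate.

def insert(table, triple):
    s, p, o, mu = triple
    row = table.setdefault(s, {})      # new tuple if subject is unseen
    cell = row.setdefault(p, set())    # null cell -> fresh fuzzy set
    cell.add((o, mu))                  # t[p] = t[p] U {(o, mu)}
    return table

table = {}
for t in [("IBM", "industry", "Software", 1.0),
          ("IBM", "industry", "Hardware", 1.0),
          ("IBM", "industry", "Services", 0.9),
          ("IBM", "HQ", "Armonk", 0.8)]:
    insert(table, t)
print(sorted(table["IBM"]["industry"]))
```

The four triples collapse into a single row for IBM, with the three industry triples merged into one fuzzy-set-valued cell, just as in the example above.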


4 Persistence of Fuzzy RDF and Fuzzy RDF Schema

Fig. 4.2 Column stores of fuzzy RDF data
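As a minimal sketch (function and variable names are illustrative), the per-predicate layout of Fig. 4.2 — one relational table per unique predicate, each tuple holding subject, object and membership degree — can be emulated with dictionaries:

```python
def column_store(triples):
    """Partition fuzzy triples (s, p, o, mu) into one table per predicate."""
    tables = {}
    for s, p, o, mu in triples:
        # each triple becomes one tuple of the table named after its predicate
        tables.setdefault(p, []).append((s, o, mu))
    return tables

tables = column_store([
    ("Charles Flint", "Founder", "IBM", 1.0),
    ("Larry Page", "Founder", "Google", 1.0),  # same predicate -> same table
    ("Larry Page", "Board", "Google", 0.8),    # new predicate -> new table
])
```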

1. Fuzzy column stores
Fuzzy column stores use a set of relational tables, with one relational table per unique predicate. Each fuzzy RDF triple becomes a tuple of the corresponding relational table, representing a binary relation between a subject and an object with respect to the given predicate. All relational tables have a similar schema: a column for the subject, a column named after the predicate that holds the objects, and an additional column for the membership degree. Two fuzzy RDF triples that have the same predicate but different subjects thus become two different tuples of the same relational table. Generally speaking, for a triple (s, p, (o, λ)), we have a fuzzy relational table over the schema (subject, p, μ), in which o is placed in the p column of row s and λ is placed in the μ column of row s. For any two triples, say (s_i, p_i, (o_i, λ_i)) and (s_j, p_j, (o_j, λ_j)), if p_i = p_j, we have two tuples ⟨s_i, o_i, λ_i⟩ and ⟨s_j, o_j, λ_j⟩ in the same fuzzy relational table; if p_i ≠ p_j, the tuples ⟨s_i, o_i, λ_i⟩ and ⟨s_j, o_j, λ_j⟩ occur in two different fuzzy relational tables. Clearly, the number of fuzzy relational tables equals the number of unique predicates in the RDF dataset. For the fuzzy RDF triples and fuzzy RDF graph presented in Fig. 4.1, the relational representation of fuzzy column stores is shown in Fig. 4.2. There are 13 unique predicates, so the set contains 13 relational tables. In fuzzy column stores, each fuzzy relational table contains only one predicate as a column, so fuzzy column stores solve the null-value problem. But it should be noted that fuzzy column stores use a set of relational tables. As a

Table 4.2 Horizontal stores of fuzzy RDF data (one row per subject; each cell is a fuzzy set of (object, degree) pairs; omitted cells are null)

Charles Flint | Born = {(1850,0.7)}, Died = {(1934,0.8)}, Founder = {(IBM,1.0)}
Larry Page | Born = {(1973,0.9)}, Founder = {(Google,1.0)}, Board = {(Google,0.8)}, Home = {(Palo Alto,0.6)}
Android | Developer = {(Google,0.9)}, Version = {(4.1,0.7)}, Kernel = {(Linux,0.8)}, Preceded = {(4.0,0.8)}, Graphics = {(OpenGL,0.8)}
Google | Industry = {(Software,0.9), (Internet,1.0)}, Employees = {(54,604,0.8)}, HQ = {(Mountain View,0.7)}
IBM | Industry = {(Software,1.0), (Hardware,1.0), (Services,0.9)}, Employees = {(433,362,0.8)}, HQ = {(Armonk,0.8)}

4.2 Fuzzy RDF Mapping to Relational Databases



result, the description of a subject through its properties and objects is partitioned across multiple relational tables, which generally requires many join operations for querying. In addition, when new triples are inserted, new predicates result in new relational tables, so dynamic schemas of RDF data cannot be handled either. To solve the problem of too many join operations for querying, we introduce fuzzy type stores in the following.
2. Fuzzy type stores
Fuzzy type stores also use a set of relational tables. However, instead of vertically partitioning the single table of fuzzy horizontal stores into one table per predicate, each relational table in fuzzy type stores contains several predicates as its columns. The predicates included in a relational table generally have the same data type, so fuzzy RDF triples whose predicates have the same data type arise in the same relational table. Here, fuzzy triples with the same subject become one tuple of the corresponding relational table. Triples that share a subject and a predicate but have different objects are mapped into one tuple whose value on the predicate attribute is a fuzzy set; the different objects contained in this fuzzy set constitute its support. Formally, suppose that we have a set T of fuzzy triples having m unique predicates with the same data type, say p_1, p_2, …, p_m. For these fuzzy triples, we have a fuzzy relational schema of the form (subject, p_1, p_2, …, p_m). Any two triples (s_i, p_i, (o_i, λ_i)) ∈ T and (s_j, p_j, (o_j, λ_j)) ∈ T arise in the same relational table, with one row for each subject.
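As a rough sketch (function, table and group names are illustrative, not from the book), the type-store construction can be emulated with nested dictionaries, where objects for a repeated (subject, predicate) pair are merged into one fuzzy set:

```python
def type_store(triples, groups):
    """Build type-store tables from fuzzy triples (s, p, o, mu).

    groups maps each predicate to the name of the table holding all
    predicates of the same data type.
    """
    tables = {}
    for s, p, o, mu in triples:
        table = tables.setdefault(groups[p], {})
        row = table.setdefault(s, {})            # one row per subject
        row.setdefault(p, set()).add((o, mu))    # same (s, p): one fuzzy set

    return tables

groups = {"Industry": "companies", "Employees": "companies", "Version": "systems"}
tables = type_store([
    ("Google", "Industry", "Software", 0.9),
    ("Google", "Industry", "Internet", 1.0),  # same subject and predicate
    ("Android", "Version", "4.1", 0.7),
], groups)
```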
Furthermore, when s_i = s_j and p_i ≠ p_j, o_i and o_j are placed in different columns p_i and p_j of the same row, in the forms {(o_i, λ_i)} and {(o_j, λ_j)}, respectively; when s_i = s_j and p_i = p_j, o_i and o_j are placed in the same column of the same row, in the form {(o_i, λ_i), (o_j, λ_j)}; when s_i ≠ s_j and p_i = p_j, o_i and o_j are placed in the same column of the different rows s_i and s_j, in the forms {(o_i, λ_i)} and {(o_j, λ_j)}, respectively. For the fuzzy RDF triples and fuzzy RDF graph in Fig. 4.1, the predicates fall into three data types: people, companies and operating systems. Their relational representation as fuzzy type stores is shown in Fig. 4.3. The approach of fuzzy type stores is actually a trade-off between fuzzy horizontal stores and fuzzy column stores. Fuzzy horizontal stores use a single relational table, which generally contains many null values but involves no join operations. Fuzzy column stores use a set of relational tables, which contain no null values but involve too many join operations. Fuzzy type stores contain fewer null values than fuzzy horizontal stores and involve fewer join operations than fuzzy column stores. In summary, the strategies and approaches for storing fuzzy RDF data in fuzzy relational databases presented above include fuzzy triple stores, fuzzy horizontal stores, fuzzy column stores and fuzzy type stores. Since fuzzy RDF data are stored in fuzzy relational databases and SPARQL (SPARQL Protocol and RDF Query Language, the RDF query language recommended by the W3C) cannot be applied directly, a consequent issue emerges: how to query fuzzy RDF data stored in fuzzy relational databases. A possible way is to translate SPARQL queries for RDF



Fig. 4.3 Type stores of fuzzy RDF data

data to SQL (Structured Query Language, the standard query language for relational databases) queries for relational databases. Let us look at the fuzzy RDF data in Fig. 4.1. Suppose that we have a SPARQL query:

SELECT DISTINCT ?p ?company
WHERE {
  { ?p founder ?company. } UNION { ?p board ?company. }
  ?company industry "Software".
}

This SPARQL query returns (Charles Flint, IBM) and (Larry Page, Google). Now we store the fuzzy RDF data of Fig. 4.1 in fuzzy relational databases, say the fuzzy column stores of Fig. 4.2. Suppose that these tables are named after their predicates, giving t_born, t_died, t_founder, t_board, t_home, t_version, t_developer, t_kernel, t_preceded, t_graphics, t_industry, t_employees and t_HQ. Then the SPARQL query above is translated to a corresponding SQL query:

SELECT DISTINCT *
FROM (((SELECT subject, founder AS company FROM t_founder) AS t_1
       LEFT OUTER JOIN
       (SELECT subject AS company, board FROM t_board) AS t_2 ON (false))
      UNION
      ((SELECT subject AS company, board FROM t_board) AS t_3
       LEFT OUTER JOIN
       (SELECT subject, founder AS company FROM t_founder) AS t_4 ON (false))
     ) AS t_5
INNER JOIN



(SELECT subject AS company FROM t_industry AS t_6
 WHERE industry = 'Software') AS t_7
ON (t_5.company = t_7.company)

This SQL query returns ⟨Charles Flint, IBM⟩ and ⟨Larry Page, Google⟩. The two queries produce the same result, so it is feasible to store fuzzy RDF data in fuzzy relational databases.
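To check the translation idea end to end, the following sketch loads a few of the Fig. 4.1 column-store tables into an in-memory SQLite database and runs a simplified but logically equivalent form of the translated query (membership-degree columns are stored but not filtered; this is an illustration, not the book's prototype):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t_founder(subject TEXT, founder TEXT, mu REAL);
CREATE TABLE t_board(subject TEXT, board TEXT, mu REAL);
CREATE TABLE t_industry(subject TEXT, industry TEXT, mu REAL);
INSERT INTO t_founder VALUES ('Charles Flint','IBM',1.0), ('Larry Page','Google',1.0);
INSERT INTO t_board VALUES ('Larry Page','Google',0.8);
INSERT INTO t_industry VALUES ('Google','Software',0.9), ('Google','Internet',1.0),
                              ('IBM','Software',1.0), ('IBM','Hardware',1.0);
""")

# founder UNION board, then join on the shared ?company variable
rows = con.execute("""
SELECT DISTINCT u.p, u.company
FROM (SELECT subject AS p, founder AS company FROM t_founder
      UNION
      SELECT subject AS p, board AS company FROM t_board) AS u
INNER JOIN t_industry ON u.company = t_industry.subject
WHERE t_industry.industry = 'Software'
""").fetchall()
```

The result set matches the SPARQL answer: (Charles Flint, IBM) and (Larry Page, Google).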

4.3 Fuzzy RDF Mapping to Object-Oriented Databases

The classical relational database model and its fuzzy extension do not satisfy the need of modeling complex objects with imprecision and uncertainty. In order to model uncertain data and complex-valued attributes as well as complex relationships among objects, current efforts have concentrated on the fuzzy object-oriented databases introduced in Chap. 2. Therefore, reengineering fuzzy RDF into the fuzzy object-oriented database model may satisfy the need of storing fuzzy RDF data in fuzzy databases and help with the interoperability between the fuzzy object-oriented database model and fuzzy RDF. Based on a similar idea to Sect. 4.2, in the following we introduce how to reengineer fuzzy RDF into the fuzzy object-oriented database model, provide a set of rules for mapping fuzzy RDF into the fuzzy object-oriented database model, and implement a prototype to demonstrate our approach. Note that we apply the fuzzy object-oriented databases instead of the fuzzy relational databases because the fuzzy object-oriented database model can represent complex objects and relationships with fuzziness more effectively. More importantly, the fuzzy object-oriented databases are very suitable for storing important concepts of the fuzzy RDFS such as fuzzy classes, instances, properties, and fuzzy class/property hierarchies. To deal with uncertainties in the RDF Schema layer, Fan et al. (2019) extended the definition of the fuzzy RDF graph model, which explicitly classifies the set of labels Σ into five categories. That is, Σ = {C, OP, LP, D, T} is a set of labels, where C is a set of class resource labels, OP is a set of object property resource labels, LP is a set of datatype property resource labels, D is a set of datatype labels, and T is a set of instance resource labels. In particular, we investigate how to formally map the fuzzy RDF model to the fuzzy object-oriented database model in this subsection.
We develop mapping rules and implement a prototype system to demonstrate the feasibility of our approach. In the fuzzy RDF graph, its label elements include class resource labels, property resource labels, datatype labels, and instance resource labels. The elements on the fuzzy RDF semantic layer can identify the types of resources that the vertices and edges in the fuzzy RDF graph model correspond to. It can be seen that they are very similar to the elements of the fuzzy object-oriented database. The interpretation of the semantic layer of the fuzzy RDF graph model mainly includes four aspects:


(a) fuzzy classes and their relationships,
(b) fuzzy properties and their relationships,
(c) datatypes, and
(d) fuzzy instances.

Among them, the first three aspects are elements of the fuzzy RDF Schema layer, and the fuzzy instances are elements of the fuzzy RDF layer. All four aspects have corresponding elements in the fuzzy object-oriented database model, so it is feasible to map the fuzzy RDF model to the fuzzy object-oriented database model. Over the two layers of the fuzzy RDF model, namely the fuzzy RDF Schema layer and the fuzzy RDF layer, and based on the four aspects above, we propose mapping rules and algorithms that map the fuzzy RDF model to the fuzzy object-oriented database model. To concisely describe the mapping algorithms, we introduce several primary functions as follows:
(a) EnQueue(Q, Element): inserts an element Element at the rear end of the queue Q.
(b) isEmpty(Q): determines whether the queue Q is empty; returns true if empty, otherwise false.
(c) DeQueue(Q): removes and returns an element from the front end of the queue Q.
(d) hasChildNodes(Node): determines whether a node Node has a son node; returns true if yes, otherwise false.
(e) subClassOnNextLevel(C): returns all son nodes of class node C.
(f) subPropertyOnNextLevel(P): returns all son nodes of property node P.

4.3.1 Mapping of Fuzzy Classes

In the fuzzy RDF model, classes are elements of the RDFS layer. A fuzzy class differs from a classic class because the behavior and state of the objects contained in the fuzzy class are uncertain; that is, the properties of fuzzy classes are fuzzy ones. In addition, the inheritance relationships between fuzzy classes are also uncertain, meaning that two (fuzzy) classes have a subclass-superclass relationship with a membership degree. For the fuzzy classes in the fuzzy RDF, we need to map not only the classes themselves but also their relationships. For this purpose, we propose two mapping rules in the following. Here we use a function Γ that maps the elements of the fuzzy RDF model to the corresponding elements of the fuzzy object-oriented databases.
Rule 1: L_v(v_i) ∈ Σ.C ⇒ Γ(v_i) = fc_i ∈ FC_FS
When a vertex label of the fuzzy RDF graph model is a class label, it is mapped to a class in the fuzzy object-oriented database model and named after the label.
Rule 2: L_v(v_i) ∈ Σ.C ∧ L_v(v_j) ∈ Σ.C ∧ L_E(v_i × v_j) = subClassOf ⇒ Class fc_i is-a fc_j/μ type-is ft_k ∈ FT.



Fig. 4.4 Fuzzy classes

When an edge label of the fuzzy RDF graph model is subClassOf, the fuzzy class corresponding to the start point is a subclass of the fuzzy class corresponding to the end point, and the label value is the membership degree of the subclass to the superclass. Let us look at the fuzzy RDF subgraph model shown in Fig. 4.4. It shows three vertex labels, the class labels Person, Student and Staff, and two edge labels, both named subClassOf, whose label values are 0.8 and 0.9, respectively. They are mapped to the classes Person, Student and Staff in the fuzzy object-oriented database model, and the keyword is-a in the type expression denotes that the classes Student and Staff are subclasses of the class Person. The corresponding mapping structure is as follows:

Class Person
  type-is Union Student/0.8, Staff/0.9
End
Class Student is-a Person/0.8
  type-is Record …
End
Class Staff is-a Person/0.9
  type-is Record …
End

Note that Rules 1 and 2 above do not take the order of mapping into account. If a fuzzy subclass is mapped while its superclass has not yet been mapped, an error will occur. In the following, we organize the fuzzy classes into a hierarchical structure and present an algorithm for mapping fuzzy class hierarchies, shown as Algorithm 4.1.



Algorithm 4.1. Mapping algorithm of fuzzy classes
Input: FRDFS Model
Output: FOODB Model
1. EnQueue(Q, "rdfs:Class")
2. while (!isEmpty(Q))
3.   C = DeQueue(Q)
4.   if (C != "rdfs:Class")
5.     Map the fuzzy class C according to Rule 1 and Rule 2
6.   if (hasChildNodes(C))
7.     for each C_i ∈ subClassOnNextLevel(C) (i = 1, …, n)
8.       EnQueue(Q, C_i)

The root node "rdfs:Class" is the parent node of all fuzzy class nodes once the fuzzy classes are organized in a hierarchical structure. Algorithm 4.1 uses a breadth-first traversal. With the algorithm, the root node "rdfs:Class" first enters the queue Q. Since the root node is just an abstract class, it is not mapped; instead, it is checked for son nodes and, if any exist, all of them are enqueued. While the queue Q is not empty, a fuzzy class node is dequeued from the queue Q and mapped according to Rules 1 and 2. The node is then checked for son nodes, and if it has any, all of them are enqueued. In this way, each node in the queue is processed until the queue is empty. Finally, all fuzzy classes in the fuzzy RDF graph are mapped in order according to their hierarchical relationships.
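A runnable sketch of Algorithm 4.1: a breadth-first traversal of the fuzzy class hierarchy rooted at "rdfs:Class", mapping every class before any of its subclasses. The hierarchy dictionary and the recorded mapping order stand in for applying Rules 1 and 2:

```python
from collections import deque

def map_fuzzy_classes(hierarchy):
    """hierarchy: parent -> list of child class nodes. Returns mapping order."""
    mapped = []
    queue = deque(["rdfs:Class"])     # EnQueue(Q, "rdfs:Class")
    while queue:                      # while (!isEmpty(Q))
        c = queue.popleft()           # C = DeQueue(Q)
        if c != "rdfs:Class":         # the root is abstract and is not mapped
            mapped.append(c)          # apply Rule 1 and Rule 2 here
        for child in hierarchy.get(c, []):  # hasChildNodes / subClassOnNextLevel
            queue.append(child)
    return mapped

order = map_fuzzy_classes({
    "rdfs:Class": ["Person"],
    "Person": ["Student", "Staff"],   # Student/0.8 and Staff/0.9 from Fig. 4.4
})
```

Because the traversal is level by level, a superclass is always mapped before its subclasses, avoiding the ordering error noted above.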

4.3.2 Mapping of Fuzzy Properties

In the classic RDF, the property modeling primitive is "rdf:Property", an element of the RDF Schema. The range of a property can be an RDF literal or a resource described by a URI. Therefore, we can identify two types of properties in the fuzzy RDF: properties whose values are RDF literals and properties whose values are RDF resources. The former are called fuzzy datatype properties, which describe uncertain property values, and the latter are called fuzzy object properties, which describe relationships that are themselves uncertain. In the following, we present two mapping rules: Rule 3 maps fuzzy datatype properties and Rule 4 maps fuzzy object properties.
1. Mapping of fuzzy datatype properties
Rule 3: L_v(v_i) ∈ Σ.C ∧ L_v(v_j) ∈ Σ.D ∧ L_E(v_i × v_j) ∈ Σ.LP ⇒ Γ(v_i) = fc_i ∈ FC_FS ∧ Γ(v_j) = fb_j ∈ FB ∧ Γ(L_E(v_i × v_j)) = fa_i(fc_i) ∈ FA_FS
When an edge label of the fuzzy RDF graph model is a datatype property label, the start point label is a class label and the end point label is a datatype label. In this case, the edge is mapped into an attribute of the fuzzy class corresponding to its start point and is named after the datatype property label; the end point is mapped into a simple/complex type or a fuzzy-type-based simple/complex type in the fuzzy



Fig. 4.5 Fuzzy datatype property sex

object-oriented database according to the mapping rules for datatypes given in Sect. 4.3.3. Let us look at the fuzzy RDF subgraph model shown in Fig. 4.5. It shows two vertex labels, the class label Student and the datatype label &xsd;string, and one edge label, the datatype property label named sex. The vertices are mapped to the class Student and the simple type string in the fuzzy object-oriented database model, respectively, and the edge is mapped to the fuzzy attribute Sex of the class Student. Here the keyword FUZZY is applied in the type expression to denote that the attribute Sex of the class Student is a fuzzy attribute. The corresponding mapping structure is as follows:

Class Student is-a Person
  type-is Record
    FUZZY Sex: string
    …
End

2. Mapping of fuzzy object properties
In the RDF graph model, for an edge label that is an object property label, the start point label and the end point label are both class labels. The edge then describes a nonhierarchical binary relationship between the instances of the two classes corresponding to the start point and the end point. Note that a fuzzy object property differs from a crisp object property because the relationship it describes is uncertain rather than determined; consequently, a membership degree is used to describe the uncertain binary relationship. In the fuzzy object-oriented database, a binary relationship is defined by attribute elements with path traversal. Therefore, the fuzzy object properties of the fuzzy RDF graph model can be mapped into attributes in the fuzzy object-oriented database model. In the fuzzy object-oriented database model, cardinality constraints are a common kind of binary relationship, generally divided into one-to-one (1:1), one-to-many (1:n or n:1) and many-to-many (m:n) cardinality constraints.
Here, for a relationship in a class declaration, its cardinality constraint can explicitly appear in the cardinality description of the type expression. In the fuzzy RDF graph model, however, there are no explicitly defined primitives for cardinality constraints. Rule 4 gives a mapping rule for the fuzzy object properties on the premise that the cardinality constraints are already known.
Rule 4: L_v(v_i) ∈ Σ.C ∧ L_v(v_j) ∈ Σ.C ∧ L_E(v_i × v_j) ∈ Σ.OP ⇒ Γ(v_i) = fc_i ∈ FC_FS ∧ Γ(v_j) = fc_j ∈ FC_FS ∧ Γ(L_E(v_i × v_j)) = fa_i(fc_i) ∈ FA_FS.



When the edge label of the fuzzy RDF graph is an object property label, the start point label and the end point label are both class labels. In this case, the edge is mapped into an attribute of the fuzzy class corresponding to the start point and is named after the object property label; the end point is mapped into the corresponding fuzzy class according to Rule 1. Note that the above mapping of fuzzy properties does not consider the relationship between a fuzzy property and its fuzzy subproperties. Such a mapping process need not consider the mapping order if time efficiency is ignored. In the following, we organize the fuzzy properties into a tree of height 2 whose root node is "rdf:Property", and present an algorithm for mapping fuzzy properties, shown as Algorithm 4.2.
Algorithm 4.2. Mapping algorithm of fuzzy properties
Input: FRDFS Model
Output: FOODB Model
1. begin:
2. EnQueue(Q, "rdf:Property")
3. P = DeQueue(Q)
4. if (hasChildNodes(P))
5.   for each P_i ∈ subPropertyOnNextLevel(P) (i = 1, …, n)
6.     EnQueue(Q, P_i)
7.   end for
8. while (!isEmpty(Q))
9.   if (L_v(P = DeQueue(Q)) ∈ Σ.LP)
10.    Map the fuzzy datatype property P according to Rule 3
11.  else
12.    Map the fuzzy object property P according to Rule 4
13. end

The root node “rdf:Property” is the parent node of all fuzzy property nodes after the fuzzy properties are organized in a hierarchical structure. Algorithm 4.2 uses a breadth-first traversal. With this algorithm, the root node “rdf:Property” first enters the queue Q. The root node is just an abstract property, so it is dequeued rather than mapped. Then it is judged if it has a son node, and if so, all its son nodes are enqueued. If the queue Q is not empty, a fuzzy property node is sequentially dequeued from the queue Q, and it is further determined if the node is a fuzzy datatype property node or a fuzzy object property node. Then the fuzzy property is mapped according to Rule 3 or Rule 4. In a similar way, each node in the queue is handled until the queue is empty. Finally, all fuzzy properties in the fuzzy RDF graph are completely mapped.
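A sketch of Algorithm 4.2 (the property names and the datatype-property label set are illustrative): since the fuzzy properties form a tree of height 2 under "rdf:Property", the traversal reduces to dequeuing the direct son nodes and dispatching each one to Rule 3 or Rule 4:

```python
from collections import deque

def map_fuzzy_properties(children, datatype_props):
    """children: direct son nodes of "rdf:Property".

    datatype_props: the labels in Sigma.LP; everything else is treated as an
    object property (Sigma.OP). Returns (property, applied rule) pairs.
    """
    queue = deque(children)           # all sons of the abstract root enqueued
    mapped = []
    while queue:                      # while (!isEmpty(Q))
        p = queue.popleft()
        rule = "Rule 3" if p in datatype_props else "Rule 4"
        mapped.append((p, rule))      # map P according to Rule 3 or Rule 4
    return mapped

result = map_fuzzy_properties(["sex", "read"], datatype_props={"sex"})
```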

4.3.3 Mapping of Datatypes

In the classic RDF, only one datatype, rdf:XMLLiteral, is predefined, and users are recommended to use the basic datatypes defined in XML Schema. In the fuzzy RDF, the basic datatypes are not fuzzy and we can still use the basic datatypes defined



Table 4.3 Mapping of XSD datatypes into fuzzy object-oriented database datatypes

Datatype      | XSD datatype | FOODB datatype
Numerical     | xsd:decimal  | Decimal
              | xsd:integer  | Integer
              | xsd:short    | Short
              | xsd:long     | Long
              | xsd:float    | Float
              | xsd:double   | Double
Enumeration   | xsd:enum     | Enum
String        | xsd:string   | String
Boolean       | xsd:boolean  | Boolean
Date and time | xsd:date     | Date
              | xsd:time     | Time

by XML Schema, such as integer, float, string, date, time and so on. The major basic datatypes in XML Schema and their corresponding datatypes in the fuzzy object-oriented database model are shown in Table 4.3. The datatypes used in the fuzzy RDF thus have corresponding datatypes in the fuzzy object-oriented databases; for example, as shown in Table 4.3, the XSD datatype xsd:string is mapped to the datatype String in the fuzzy object-oriented databases. Note that XML Schema supports custom complexType definitions. In that case, the fuzzy object-oriented databases need to provide a type generator supporting the definition of structured literals so that a complexType can be mapped accordingly. Suppose that XML Schema defines the following complexType element Degree:

<xsd:complexType name="Degree">
  <xsd:sequence>
    <xsd:element name="school_name" type="xsd:string"/>
    <xsd:element name="degree_type" type="xsd:string"/>
    <xsd:element name="degree_year" type="xsd:short"/>
  </xsd:sequence>
</xsd:complexType>

Then the complexType element Degree is mapped to the structured literal Degree in the fuzzy object-oriented databases as follows:

struct Degree {
  string school_name;



  string degree_type;
  short degree_year;
};
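Table 4.3 can be expressed directly as a lookup, sketched here in Python (the function name is ours); an XSD type missing from the table signals a custom complexType that needs the type generator described above:

```python
# Table 4.3 as a dictionary: XSD datatype -> FOODB datatype.
XSD_TO_FOODB = {
    "xsd:decimal": "Decimal", "xsd:integer": "Integer", "xsd:short": "Short",
    "xsd:long": "Long", "xsd:float": "Float", "xsd:double": "Double",
    "xsd:enum": "Enum", "xsd:string": "String", "xsd:boolean": "Boolean",
    "xsd:date": "Date", "xsd:time": "Time",
}

def map_datatype(xsd_type):
    """Return the FOODB datatype, or None for a custom complexType."""
    return XSD_TO_FOODB.get(xsd_type)
```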

4.3.4 Mapping of Fuzzy Instances

In the fuzzy RDF graph model, a fuzzy instance is described through the fuzzy property values of its fuzzy class. When the edge label is "type", the labels of the start point and the end point are an instance label and a class label, respectively; the edge then indicates that the start point is an instance of the class corresponding to the end point. Rule 5 gives the rule for mapping fuzzy RDF instances to FOODB instances.
Rule 5: L_v(v_i) ∈ Σ.T ∧ L_v(v_j) ∈ Σ.C ∧ L_E(v_i × v_j) = type ⇒ (Γ(v_i) = fo_i ∈ FO_FS) ∧ (Object fo_i belong-to fc_j/μ_j has-value [fa_1: fb_1, …, fa_k: fb_k]).
When the edge label is "type", the start point is mapped into an instance of the fuzzy object-oriented database model, namely an instance of the fuzzy class corresponding to the end point. All vertices and edges associated with the start point are mapped into the corresponding attributes of the instance in the fuzzy object-oriented database model. In the fuzzy object-oriented database model, an object is uniquely identified by an identifier OID and is named after the label of the start point. In the fuzzy RDF data subgraph shown in Fig. 4.6, for example, there are two edges labeled "type". One start point carries the instance label student1 and its end point the class label Student; the membership degree of the edge is 0.9, indicating that the object student1 belongs to the class Student with a membership degree of 0.9. The other start point carries the instance label book1 and its end point the class label Book; the membership degree of the edge is 0.85, indicating that the object book1 belongs to the class Book with a membership degree of 0.85. For the fuzzy instances of the fuzzy RDF shown in Fig.
4.6, the corresponding mapping structure is as follows:

Object student1 belong-to Student/0.9 has-value
  FUZZY Name: 1.0/Bob,
  FUZZY Sex: 0.85/male,
  FUZZY Age: 0.9/20,
  FUZZY Read: 0.75/book1
End
Object book1 belong-to Book/0.85 has-value
  FUZZY Title: 0.8/A Semantic Web Primer,
  FUZZY Author: 0.85/Antoniou,
  FUZZY Category: 0.9/Science Information
End

Their corresponding database instances in the fuzzy object-oriented databases are shown in Fig. 4.7.
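A small sketch of Rule 5 in Python (the function and field names are ours, not the book's prototype): a "type"-labeled edge turns the start vertex into an object of the class at the end vertex, carrying the edge's membership degree, and the vertex's remaining edges become fuzzy attribute values:

```python
def map_instance(instance, cls, degree, edges):
    """Apply Rule 5 to one 'type' edge.

    edges: list of (property, value, membership degree) attached to the
    instance vertex; each becomes a fuzzy attribute value.
    """
    return {
        "oid": instance,                          # named after the start-point label
        "belong-to": (cls, degree),               # class membership with degree
        "has-value": {p: (mu, v) for p, v, mu in edges},
    }

student1 = map_instance("student1", "Student", 0.9,
                        [("Name", "Bob", 1.0), ("Read", "book1", 0.75)])
```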



Fig. 4.6 Fuzzy instances

Fig. 4.7 Fuzzy database instance

4.3.5 Implementation

Based on the mapping rules proposed in Sect. 4.3, we implement a prototype called FRDF2FOODB, which can map the fuzzy RDF model to the fuzzy object-oriented



Fig. 4.8 The overall architecture of FRDF2FOODB

database model. In the following, we briefly explain the implementation of the prototype, which consists of three main modules: parsing module, mapping module, and output module. Figure 4.8 shows the overall architecture of the FRDF2FOODB. The functions of the three main modules of the FRDF2FOODB are described below: 1. Parsing module: The Parsing module parses the input fuzzy RDF model, which is described in the form of triples, into classes, properties, instances, etc., and stores the parsed results, which are the input of the mapping module. 2. Mapping module: The mapping module maps the fuzzy RDF classes, properties, instances and other elements, which are obtained by the parsing module, into the corresponding fuzzy object-oriented database classes and instances according to the mapping rules proposed in Sect. 4.3. 3. Output module: The output module is actually an interface module, displaying the input fuzzy RDF model, and the resulting fuzzy object-oriented database model after mapping the fuzzy RDF model. Also, this module displays the specific storage of the RDF in the fuzzy object-oriented databases after the mapping is completed.

4.4 Fuzzy RDF Mapping to HBase Databases

With the explosive growth of RDF data, several efforts have addressed massive RDF data storage. Several proposals store RDF data in Hadoop (Farhan Husain et al., 2009; Myung et al., 2010; Rohloff & Schantz, 2010). The drawback of Hadoop-based RDF stores is that RDF data are stored directly in HDFS, resulting in a lack of efficient index structures. HBase, a column-oriented NoSQL database,



implements a global, distributed index by sorting the row keys of HBase tables in dictionary order. Several works have proposed storing RDF data in HBase. Sun and Jin (2010) presented an approach for storing RDF data in six HBase tables: S_PO, P_SO, O_SP, PS_O, SO_P, and PO_S. The row key of table S_PO is the subject of the RDF triple, and the column is the tuple (predicate, object); similarly, the RDF data are stored repeatedly in the other HBase tables according to the different organizational forms of the RDF triple elements. Papailiou et al. (2012) presented a fully distributed RDF store, H2RDF, which reduces the number of HBase tables in Sun and Jin (2010) from six to three (i.e., SP_O, PO_S, and OS_P). The row key of table SP_O is the tuple (subject, predicate), and the column is the object of the RDF triple. Abraham et al. (2010) also used three HBase tables, Ts, Tp, and To, to store RDF data; these tables take the subject, predicate, and object as the row key, respectively, and the other terms as column values. As with crisp RDF storage in databases, the fuzzy relational databases and fuzzy object-oriented databases cannot effectively support large-scale fuzzy RDF data management. To manage large-scale fuzzy RDF data efficiently and effectively, some work has investigated the storage of fuzzy RDF data in NoSQL databases. Since HBase databases provide highly reliable underlying storage and high-performance computing power, Fan et al. (2020) proposed a fuzzy RDF storage schema based on fuzzy HBase databases. Following the distributed fuzzy RDF(S) storage approach proposed by Fan et al. (2020), in this section we present a distributed fuzzy RDF storage approach based on HBase databases. This approach makes use of the index function of HBase databases.
In addition, according to different organizational forms of the fuzzy triple patterns, we propose a set of FHBase-based query algorithms to deal with the query of fuzzy triples from different fuzzy HBase tables. On the basis, we implement a prototype system to demonstrate the feasibility of our approach.

4.4.1 Fuzzy RDF Storage in Fuzzy HBase

The fuzzy RDF graph model covers both the fuzzy RDF schema layer and the fuzzy RDF instance layer. The former mainly describes two kinds of information, fuzzy classes and fuzzy properties, in fuzzy RDF ontology data, and the latter mainly describes the specific information of fuzzy RDF instance data. To improve the query efficiency of the fuzzy RDF store, we store the fuzzy RDFS data separately to ensure retrieval efficiency. As a result, we design two FHBase tables to store the fuzzy RDFS data and another two FHBase tables to store the fuzzy RDF instance data.


4.4.1.1 Storage of Fuzzy RDFS

The fuzzy RDFS data describe the information about fuzzy classes and fuzzy properties in fuzzy RDF ontology data. The information related to fuzzy classes includes the corresponding fuzzy class of each fuzzy instance, the fuzzy properties of each fuzzy class, the subclass-superclass relationships between fuzzy classes, and so forth. The information related to fuzzy properties includes the relationships between fuzzy properties, such as inheritance relationships and equivalence relationships, as well as the domains and ranges of each fuzzy property. To store the fuzzy classes and fuzzy properties of fuzzy RDFS data, we design two FHBase tables, named FClassRelation and FPropertyRelation, in the following. The specific table structures and storage examples of FClassRelation and FPropertyRelation are shown in Tables 4.4 and 4.5, respectively. Note that, for simplicity of discussion, the timestamp is omitted. The FHBase table FClassRelation shown in Table 4.4 takes the fuzzy class name as the row key and the class relationship as the column family name. Because the relationships between classes involve fuzziness, and a method to calculate the membership degree of a fuzzy subclass/superclass relationship was developed in (Ma et al., 2004), the

Table 4.4 The specific table structure of FHBase table FClassRelation

Row key | Column family: EquivalentClass | Column family: SubClass
FC1 | EquivalentClass: ρ1/equivalentclass = FC2 | SubClass: ρ3/subclass = FC3; SubClass: ρ4/subclass = FC5
FC2 | EquivalentClass: ρ2/equivalentclass = FC1 | SubClass: ρ5/subclass = FC4
FC3 | | SubClass: ρ7/subclass = FC4
FC4 | EquivalentClass: ρ6/equivalentclass = FC3 |
FC5 | |

Table 4.5 The specific table structure of FHBase table FPropertyRelation Row key

Column family: EquivalentProperty

Column family: SubProperty

Column family: Domain

FP1

EquivalentProperty: ρ 1 /equivalentproperty = FP3

SubProperty: ρ 3 /subproperty = FP4 SubProperty: ρ 4 /subproperty = FP5

Domain: domain = FC2

Domain: domain = FC3

FP2 FP3

EquivalentProperty: ρ 2 /equivalentproperty = FP1

Domain: domain = FC1

FP4

Domain: domain = FC5

FP5

Domain: domain = FC4


4 Persistence of Fuzzy RDF and Fuzzy RDF Schema

The column name is formed as follows: the membership degree, a number in [0, 1], followed by the notation "/" and the relationship name, which is the column family name in lowercase. When the degree is 1, the expression is shortened to the relationship name alone. The cell value is the name of the other fuzzy class in the relationship. Likewise, the FHBase table FPropertyRelation shown in Table 4.5 takes the fuzzy property name as the row key and the property relationship as the column family name. Its column names follow the same convention, with the number representing the membership degree between the fuzzy properties, and the cell value is usually the name of the other fuzzy property. In particular, when the column family name is "Domain" and the column name is "domain", the cell value is the domain of the fuzzy property given by the row key, i.e., a fuzzy class name.
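As a small illustration of this naming convention, encoding and decoding such column names can be sketched in Python; the function names are ours, not part of the book's design:

```python
def encode_column(degree, relation):
    """Build a column name such as '0.8/subclass'; a degree of 1 is elided."""
    return relation if degree == 1 else f"{degree}/{relation}"

def decode_column(column):
    """Split a column name back into (membership degree, relation name)."""
    if "/" in column:
        degree, relation = column.split("/", 1)
        return float(degree), relation
    # No "/" means the degree-1 shorthand was used.
    return 1.0, column
```

The same convention is reused by the instance tables described below, with the property or type name in place of the relationship name.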

4.4.1.2 Storage of Fuzzy RDF Instance Data

Fuzzy RDF instance data describe the fuzzy property values of the fuzzy classes defined in fuzzy RDFS. To store fuzzy RDF instance data correctly and to support efficient queries for different triple pattern forms, we design two FHBase tables named FHTS_PO and FHTO_PS. Both tables take "Object Property," "Datatype Property," and "Type" as column family names; the former takes the subject of a fuzzy RDF triple as the row key, while the latter takes the object. The specific table structures and storage examples of FHTS_PO and FHTO_PS are shown in Tables 4.6 and 4.7, respectively. The FHBase table FHTS_PO shown in Table 4.6 takes the subject of a fuzzy RDF triple as the row key and stores the triples of different properties in different column families. When the predicate of a fuzzy RDF triple is an object property, for example, the triple is stored in a cell of the column family named "Object Property." The category of a predicate can be obtained from the axioms of the fuzzy RDF graph data model. The column name and cell value differ from case to case. First, when the column family name is "Object Property," the column name consists of the membership degree of the predicate, a number in [0, 1], followed by the notation "/" and the property name of the fuzzy RDF triple, and the cell value is the corresponding object. Second, when the column family name is "Datatype Property," the column is named after the datatype property name, and the cell value consists of the membership degree of the object, a number in [0, 1], followed by the notation "/" and the object name of the fuzzy RDF triple.
In particular, when the column family name is "Type," the column name consists of "type" preceded by a number in [0, 1] and


Table 4.6 The specific table structure of FHBase table FHTS_PO

| Row key | Column family: Object Property | Column family: Datatype Property | Column family: Type |
|---|---|---|---|
| S1 | ρ11/OP1 = O111; ρ12/OP1 = O111; ρ12/OP1 = O112; ρ11/OP3 = O111 | LP1 = μ11/L111; LP1 = μ12/L111; LP1 = μ12/L112; LP2 = μ2/L121 | ρ3/type = FC1 |
| S2 | ρ2/OP2 = O221 | | ρ4/type = FC1; ρ5/type = FC4 |
| S3 | | | ρ6/type = FC2 |
| S4 | | LP3 = μ3/L531 | |
| S5 | | | ρ7/type = FC3 |

Table 4.7 The specific table structure of FHBase table FHTO_PS

| Row key | Column family: Object Property | Column family: Datatype Property | Column family: Type |
|---|---|---|---|
| O111 | ρ11/OP1 = S1; ρ11/OP3 = S1; ρ12/OP1 = S1 | | |
| O112 | ρ12/OP1 = S1 | | |
| O221 | ρ2/OP2 = S2 | | |
| μ11/L111 | | LP1 = S1 | |
| μ12/L111 | | LP1 = S1 | |
| μ12/L112 | | LP1 = S1 | |
| μ2/L121 | | LP2 = S1 | |
| μ3/L531 | | LP3 = S5 | |
| FC1 | | | ρ3/type = S1; ρ4/type = S2 |
| FC2 | | | ρ6/type = S4 |
| FC3 | | | ρ7/type = S5 |
| FC4 | | | ρ5/type = S6 |


the notation "/", where the number represents the membership degree of the instance given by the row key belonging to the class, and the cell value is the corresponding fuzzy class name. Likewise, the FHBase table FHTO_PS shown in Table 4.7 takes the object of a fuzzy RDF triple as the row key, and it can be derived from the table FHTS_PO: the row key of FHTO_PS is the cell value of FHTS_PO and, conversely, its cell value is the row key of FHTS_PO, while both tables share the same column families and columns. In particular, when the column family name is "Type," the cell values are the instances that belong, with some membership degree, to the class given by the row key.
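Since FHTO_PS is fully determined by FHTS_PO, its derivation can be sketched with an in-memory dict model of the two tables; modeling each table as {row key: {column family: {column: [cell, ...]}}} is our simplifying assumption, a real FHBase store would use HBase scans and puts:

```python
def invert_fhts_po(fhts_po):
    """Derive FHTO_PS from FHTS_PO by swapping row key and cell value."""
    fhto_ps = {}
    for row_key, families in fhts_po.items():
        for family, columns in families.items():
            for column, cells in columns.items():
                for cell in cells:
                    # The FHTO_PS row key is the FHTS_PO cell value and vice versa;
                    # column families and columns are shared by both tables.
                    fhto_ps.setdefault(cell, {}).setdefault(family, {}).setdefault(column, []).append(row_key)
    return fhto_ps
```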

4.4.2 FHBase-Based RDF Queries

On the basis of the storage model for fuzzy RDFS and fuzzy RDF instance data proposed in Sect. 4.4.1, in this section we investigate query processing over the fuzzy HBase store.

4.4.2.1 Triple Matching Algorithm

The aim of a SPARQL query is to get the triples that satisfy all the conditions in the WHERE clause. In RDF querying over a classical HBase database, the SPARQL query is first parsed into a set of triple patterns, and the triple matching algorithm proposed by Abraham et al. (2010) is then used to decide whether a given triple matches a triple pattern. The inputs of the matching algorithm are a triple pattern and a triple to be judged; it returns true if the triple matches the pattern and false otherwise. Note that this algorithm handles only classical RDF triples and does not consider fuzzy RDF triples. Here, we present a more general triple matching algorithm, MatchFTP-T, which supports both crisp and fuzzy triple matching.

Algorithm 4.3: MatchFTP-T
Input: Fuzzy triple pattern tp = (ps, ρ/pp, μ/po) and fuzzy triple t = (s, ρi/p, μi/o)
Output: true or false
1. if (tp.ps is var || tp.ps == t.s) && (tp.pp is var || tp.pp == t.p) && (tp.ρ is var || tp.ρ == t.ρi) && (tp.po is var || tp.po == t.o) && (tp.μ is var || tp.μ == t.μi) then
2.   if (tp.ps == tp.pp && t.s != t.p) || (tp.ps == tp.po && t.s != t.o) || (tp.pp == tp.po && t.p != t.o) then
3.     return false
4.   endif
5.   return true
6. endif
7. return false
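A minimal executable sketch of MatchFTP-T follows, with fuzzy triples and patterns encoded as 5-tuples (s, p, ρ, o, μ) and variables marked by a leading "?"; this encoding is our assumption for illustration:

```python
def is_var(x):
    """A pattern element starting with '?' is a variable."""
    return isinstance(x, str) and x.startswith("?")

def match_ftp_t(tp, t):
    """Return True iff fuzzy triple t matches fuzzy triple pattern tp."""
    # Conditions (a)/(b): every non-variable pattern element must match
    # the corresponding triple element exactly.
    for p_elem, t_elem in zip(tp, t):
        if not is_var(p_elem) and p_elem != t_elem:
            return False
    ps, pp, _, po, _ = tp
    s, p, _, o, _ = t
    # Condition (c): a repeated variable must bind the same term everywhere.
    if is_var(ps) and ps == pp and s != p:
        return False
    if is_var(ps) and ps == po and s != o:
        return False
    if is_var(pp) and pp == po and p != o:
        return False
    return True
```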


Similar to the classical triple matching algorithm proposed in Abraham et al. (2010), tp matches t when three conditions are satisfied: (a) a variable in tp can match any URI or literal in t, (b) a URI or literal in tp must match itself in t exactly, and (c) if a variable in tp occurs more than once, then the fuzzy triples that match tp must bind the same term at all of its occurrences.

4.4.2.2

Triple Pattern Query Algorithm

Given that the fuzzy RDF data are stored in the FHDB, to get all fuzzy triples satisfying the parsed fuzzy triple patterns we need to query the appropriate fuzzy HBase tables and judge whether the retrieved fuzzy triples match the given fuzzy triple pattern. Each fuzzy triple (s, ρ/p, μ/o) has five elements: subject, predicate, object, membership degree of the predicate, and membership degree of the object. Note that when the predicate is an object property, the membership degree of the object is 1, i.e., the object is crisp; similarly, when the predicate is a datatype property, the membership degree of the predicate is 1. As a result, unlike the eight organizational forms of the classic triple pattern shown in Table 4.8, the fuzzy triple pattern has 32 forms, as shown in Table 4.9. Whichever form is queried, the procedure is closely tied to the storage schema of fuzzy RDFS and fuzzy RDF data proposed

Table 4.8 Organizational forms of the classic triple pattern

| (S, P, O) | (S, P, ?O) | (S, ?P, O) | (S, ?P, ?O) |
|---|---|---|---|
| (?S, P, O) | (?S, P, ?O) | (?S, ?P, O) | (?S, ?P, ?O) |

Table 4.9 Organizational forms of the fuzzy triple pattern

| Type | Fuzzy triple pattern |
|---|---|
| Type 1: subject and predicate are known | (S, ρ/P, O) (S, ρ/P, ?O) (S, ?ρ/P, O) (S, ?ρ/P, ?O) (S, P, μ/O) (S, P, μ/?O) (S, P, ?μ/O) (S, P, ?μ/?O) |
| Type 2: subject and object are known | (S, ρ/?P, O) (S, ?ρ/?P, O) (S, ?P, μ/O) (S, ?P, ?μ/O) |
| Type 3: predicate and object are known | (?S, ρ/P, O) (?S, ?ρ/P, O) (?S, P, μ/O) (?S, P, ?μ/O) |
| Type 4: subject is known | (S, ρ/?P, ?O) (S, ?ρ/?P, ?O) (S, ?P, μ/?O) (S, ?P, ?μ/?O) |
| Type 5: predicate is known | (?S, ρ/P, ?O) (?S, ?ρ/P, ?O) (?S, P, μ/?O) (?S, P, ?μ/?O) |
| Type 6: object is known | (?S, ρ/?P, O) (?S, ?ρ/?P, O) (?S, ?P, μ/O) (?S, ?P, ?μ/O) |
| Type 7: all are unknown | (?S, ρ/?P, ?O) (?S, ?ρ/?P, ?O) (?S, ?P, μ/?O) (?S, ?P, ?μ/?O) |


in Sect. 4.4.1. When dealing with different fuzzy triple pattern matches, we need to select different FHBase tables and algorithms according to the elements that are known in the fuzzy triple pattern. On the basis of the storage schema of fuzzy RDF and the organizational forms of the fuzzy triple pattern given above, we propose the following query algorithms. The function SPLIT(expression) used below returns the element after the notation "/" of the expression (i.e., the predicate or object of the fuzzy triple). Each algorithm handles the query according to whether the predicate of the fuzzy triple pattern is an object property or a datatype property.

1. Query algorithm Query_FS_PO

When the given fuzzy triple pattern has one of the forms of Type 1, that is, the subject and predicate are known, the fuzzy HBase table to be queried is FHTS_PO, and we propose the query algorithm Query_FS_PO.

Algorithm 4.4: Query_FS_PO
Input: Fuzzy triple pattern tp = (S, ρ/P, O) or (S, ρ/P, ?O) or (S, ?ρ/P, O) or (S, ?ρ/P, ?O) or (S, P, μ/O) or (S, P, μ/?O) or (S, P, ?μ/O) or (S, P, ?μ/?O)
Output: Result which matches tp
1. Initialize Result
2. selectRowkey ← S
3. if tp = (S, ρ/P, O) or tp = (S, ρ/P, ?O) or tp = (S, ?ρ/P, O) or tp = (S, ?ρ/P, ?O)
4.   selectColumnfamily ← Object Property
5.   selectColumn ← P
6.   select O1, O2, …, On from FHTS_PO where rowkey = selectRowkey and columnfamily = selectColumnfamily and SPLIT(column) = selectColumn
7.   foreach t = (S, ρi/P, Oi) do
8.     if MatchFTP_T(tp, t) then
9.       Result.add(t)
10.    end if
11.  end for
12. end if
13. else if tp = (S, P, μ/O) or tp = (S, P, μ/?O) or tp = (S, P, ?μ/O) or tp = (S, P, ?μ/?O)
14.  selectColumnfamily ← Datatype Property
15.  selectColumn ← P
16.  select μ1/O1, μ2/O1, μ1/O2, …, μs/On, μt/On from FHTS_PO where rowkey = selectRowkey and columnfamily = selectColumnfamily and column = selectColumn
17.  foreach t = (S, P, μi/Oj) do
18.    if MatchFTP_T(tp, t) then
19.      Result.add(t)
20.    end if
21.  end for
22. end if
23. return Result
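The select-and-filter pattern of Algorithm 4.4 can be sketched in Python for the object-property branch, over an in-memory dict model of FHTS_PO ({row: {family: {column: [cells]}}}); the dict model and the 5-tuple/"?" variable encoding are our assumptions for illustration:

```python
def split(expr):
    """Mirror of SPLIT: return the element after the notation '/'."""
    return expr.split("/", 1)[1] if "/" in expr else expr

def query_fs_po(fhts_po, tp):
    """tp = (S, P, rho, O, mu); S and P are known, '?x' marks a variable."""
    s, p = tp[0], tp[1]
    result = []
    columns = fhts_po.get(s, {}).get("Object Property", {})
    for column, cells in columns.items():
        if split(column) != p:          # columns are named "rho/P"
            continue
        rho = float(column.split("/", 1)[0]) if "/" in column else 1.0
        for o in cells:
            t = (s, p, rho, o, 1.0)     # object property: the object degree is 1
            # Filter each candidate the way MatchFTP-T would.
            if all((isinstance(pe, str) and pe.startswith("?")) or pe == te
                   for pe, te in zip(tp, t)):
                result.append(t)
    return result
```

The datatype-property branch differs only in that the degree is carried on the cell value rather than the column name.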

Algorithm 4.4 starts with some initialization work, such as initializing the result set to be returned and setting the row key to look for to the given subject. When


the predicate is an object property, the column family and column to look for are "Object Property" and the given predicate. Next, the algorithm queries the table FHTS_PO and uses the index function of the FHBase table to get all cell values according to the determined row key S, column family "Object Property", and column P. This step yields the candidate fuzzy triples; the MatchFTP_T algorithm is then called to filter the fuzzy triples that match and add them to the result set. Finally, Algorithm 4.4 returns the result set which matches the given fuzzy triple pattern.

2. Query algorithm Query_FSO_P

When the given fuzzy triple pattern has one of the forms of Type 2, that is, the subject and object are known, the fuzzy HBase table to be queried is FHTS_PO, and we propose the query algorithm Query_FSO_P.

Algorithm 4.5: Query_FSO_P
Input: Fuzzy triple pattern tp = (S, ρ/?P, O) or (S, ?ρ/?P, O) or (S, ?P, μ/O) or (S, ?P, ?μ/O)
Output: Result which matches tp
1. Initialize Result
2. selectRowkey ← S
3. if tp = (S, ρ/?P, O) or tp = (S, ?ρ/?P, O)
4.   selectColumnfamily ← Object Property
5.   selectCell ← O
6.   select ρ1/P1, ρ2/P1, ρ1/P2, …, ρs/Pn, ρt/Pn from FHTS_PO where rowkey = selectRowkey and columnfamily = selectColumnfamily and cell = selectCell
7.   foreach t = (S, ρi/Pj, O) do
8.     if MatchFTP_T(tp, t) then
9.       Result.add(t)
10.    end if
11.  end for
12. end if
13. else if tp = (S, ?P, μ/O) or tp = (S, ?P, ?μ/O)
14.  selectColumnfamily ← Datatype Property
15.  selectCell ← O
16.  select P1, P2, …, Pn from FHTS_PO where rowkey = selectRowkey and columnfamily = selectColumnfamily and SPLIT(cell) = selectCell
17.  foreach t = (S, Pi, μj/O) do
18.    if MatchFTP_T(tp, t) then
19.      Result.add(t)
20.    end if
21.  end for
22. end if
23. return Result

Algorithm 4.5 first initializes the result set to be returned and sets the row key to look for to the given subject. Because the object of the given fuzzy triple pattern is known, when the predicate is an object property, the column family and cell value to look for are "Object Property" and the given object. Next,


the algorithm queries the table FHTS_PO and uses the index function of the FHBase table to get all column values according to the determined row key S, column family "Object Property", and cell value O. This step yields the candidate fuzzy triples; the MatchFTP_T algorithm is then called to filter the fuzzy triples that match and add them to the result set. Finally, Algorithm 4.5 returns the result set which matches the given fuzzy triple pattern. Algorithm 4.5 performs similar operations when the predicate is a datatype property.

3. Query algorithm Query_FS_OP

When the given fuzzy triple pattern has one of the forms of Type 4, that is, only the subject is known, the fuzzy HBase table to be queried is FHTS_PO, and we propose the query algorithm Query_FS_OP.

Algorithm 4.6: Query_FS_OP
Input: Fuzzy triple pattern tp = (S, ρ/?P, ?O) or (S, ?ρ/?P, ?O) or (S, ?P, μ/?O) or (S, ?P, ?μ/?O)
Output: Result which matches tp
1. Initialize Result
2. selectRowkey ← S
3. if tp = (S, ρ/?P, ?O) or tp = (S, ?ρ/?P, ?O)
4.   selectColumnfamily ← Object Property
5.   select ρs/Pi, Oj from FHTS_PO where rowkey = selectRowkey and columnfamily = selectColumnfamily
6.   foreach t = (S, ρs/Pi, Oj) do
7.     if MatchFTP_T(tp, t) then
8.       Result.add(t)
9.     end if
10.  end for
11. end if
12. else if tp = (S, ?P, μ/?O) or tp = (S, ?P, ?μ/?O)
13.  selectColumnfamily ← Datatype Property
14.  select Pi, μs/Oj from FHTS_PO where rowkey = selectRowkey and columnfamily = selectColumnfamily
15.  foreach t = (S, Pi, μs/Oj) do
16.    if MatchFTP_T(tp, t) then
17.      Result.add(t)
18.    end if
19.  end for
20. end if
21. return Result

Algorithm 4.6 initializes the result set and determines the row key in the same way as Algorithms 4.4 and 4.5. Because the predicate and object of the fuzzy triple patterns processed by Algorithm 4.6 are both unknown, when the predicate is an object property only the column family to look for, "Object Property," is determined. Next, the algorithm queries the table FHTS_PO and uses the index function of the FHBase table to get all column values and corresponding cell values according to the determined row


key S and column family "Object Property." Then the MatchFTP_T algorithm is called to filter the candidate fuzzy triples that match and add them to the result set. Finally, Algorithm 4.6 returns the matched result set.

4. Query algorithm Query_FOP_S

When the given fuzzy triple pattern has one of the forms of Type 3, that is, the predicate and object are known, the fuzzy HBase table to be queried is FHTO_PS, and we propose the query algorithm Query_FOP_S.

Algorithm 4.7: Query_FOP_S
Input: Fuzzy triple pattern tp = (?S, ρ/P, O) or (?S, ?ρ/P, O) or (?S, P, μ/O) or (?S, P, ?μ/O)
Output: Result which matches tp
1. Initialize Result
2. selectRowkey ← O
3. if tp = (?S, ρ/P, O) or tp = (?S, ?ρ/P, O)
4.   selectColumnfamily ← Object Property
5.   selectColumn ← P
6.   select S1, S2, …, Sn from FHTO_PS where rowkey = selectRowkey and columnfamily = selectColumnfamily and SPLIT(column) = selectColumn
7.   foreach t = (Si, ρj/P, O) do
8.     if MatchFTP_T(tp, t) then
9.       Result.add(t)
10.    end if
11.  end for
12. end if
13. else if tp = (?S, P, μ/O) or tp = (?S, P, ?μ/O)
14.  selectColumnfamily ← Datatype Property
15.  selectColumn ← P
16.  select S1, S2, …, Sn from FHTO_PS where rowkey = selectRowkey and columnfamily = selectColumnfamily and column = selectColumn
17.  foreach t = (Si, P, μj/O) do
18.    if MatchFTP_T(tp, t) then
19.      Result.add(t)
20.    end if
21.  end for
22. end if
23. return Result

Algorithm 4.7 first initializes the result set and sets the row key to look for to the given object. Differing from the above query algorithms, Algorithm 4.7 queries the table FHTO_PS rather than FHTS_PO. When the predicate is an object property, the column family and column to look for are "Object Property" and the given predicate. Next, the algorithm queries the table FHTO_PS and uses the index function of the FHBase table to get all cell values according to the determined row key O, column family "Object Property", and column P. Then the MatchFTP_T algorithm is called to filter the candidate fuzzy triples that match and add them to the result set. Finally, Algorithm 4.7 returns the matched result


set. Algorithm 4.7 performs similar operations when the predicate is a datatype property.

5. Query algorithm Query_FP_SO

When the given fuzzy triple pattern has one of the forms of Type 5, that is, only the predicate is known, we propose the query algorithm Query_FP_SO.

Algorithm 4.8: Query_FP_SO
Input: Fuzzy triple pattern tp = (?S, ρ/P, ?O) or (?S, ?ρ/P, ?O) or (?S, P, μ/?O) or (?S, P, ?μ/?O)
Output: Result which matches tp
1. Initialize Instances
2. Initialize S
3. add to S all ri from FPropertyRelation where rowkey = P and columnfamily = Domain and column = domain
4. foreach ri in S do
5.   select rii from FClassRelation where rowkey = ri and (columnfamily = EquivalentClass and column = 1.0/equivalentclass) or (columnfamily = SubClass and column = 1.0/subclass)
6.   if rii not in S then
7.     S.add(rii)
8.   end if
9. end for
10. foreach ri in S do
11.  add to Instances all Ii from FHTO_PS where rowkey = ri and columnfamily = Type and column = 1.0/type
12. end for
13. Initialize Result
14. foreach I in Instances do
15.  selectRowkey ← I
16.  call the query algorithm Query_FS_PO
17. end for
18. return Result
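The schema-expansion steps (lines 1-12) of Algorithm 4.8 can be sketched as follows; modeling the three tables as plain dicts and performing a single expansion pass are our simplifying assumptions:

```python
def candidate_subjects(p, fproperty_relation, fclass_relation, fhto_ps):
    """Collect candidate subject instances for a pattern whose predicate p is known."""
    # Step 1: the domains of P from FPropertyRelation.
    classes = set(fproperty_relation.get(p, {}).get("Domain", {}).get("domain", []))
    # Step 2: add the crisp (degree-1) equivalent classes and subclasses
    # of every domain class, looked up in FClassRelation.
    for c in list(classes):
        row = fclass_relation.get(c, {})
        classes.update(row.get("EquivalentClass", {}).get("1.0/equivalentclass", []))
        classes.update(row.get("SubClass", {}).get("1.0/subclass", []))
    # Step 3: the degree-1 instances of the collected classes from FHTO_PS.
    instances = set()
    for c in classes:
        instances.update(fhto_ps.get(c, {}).get("Type", {}).get("1.0/type", []))
    return instances
```

Each returned instance then plays the role of the known subject in a Query_FS_PO call.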

The fuzzy triple patterns processed by Algorithm 4.8 have only the predicate known. Algorithm 4.8 first queries the table FPropertyRelation according to the given predicate P to get the domains of P and adds them to the set S. Second, it gets the crisp equivalent classes and subclasses of all fuzzy classes in S by querying the table FClassRelation and adds them to S. It then gets the instances of each fuzzy class in S by querying the table FHTO_PS and adds them to the set Instances. Next, for each instance in Instances, which plays the role of the subject in the fuzzy triple pattern, Algorithm 4.4 (Query_FS_PO) is called to get the fuzzy triples that match and add them to the result set. Finally, Algorithm 4.8 returns the matched result set.

6. Query algorithm Query_FO_PS

When the given fuzzy triple pattern has one of the forms of Type 6, that is, only the object is known, the fuzzy HBase table


that needs to be queried is FHTO_PS, and we propose the query algorithm Query_FO_PS.

Algorithm 4.9: Query_FO_PS
Input: Fuzzy triple pattern tp = (?S, ρ/?P, O) or (?S, ?ρ/?P, O) or (?S, ?P, μ/O) or (?S, ?P, ?μ/O)
Output: Result which matches tp
1. Initialize Result
2. selectRowkey ← O
3. if tp = (?S, ρ/?P, O) or tp = (?S, ?ρ/?P, O)
4.   selectColumnfamily ← Object Property
5.   select ρs/Pi, Sj from FHTO_PS where rowkey = selectRowkey and columnfamily = selectColumnfamily
6.   foreach t = (Sj, ρs/Pi, O) do
7.     if MatchFTP_T(tp, t) then
8.       Result.add(t)
9.     end if
10.  end for
11. end if
12. else if tp = (?S, ?P, μ/O) or tp = (?S, ?P, ?μ/O)
13.  selectColumnfamily ← Datatype Property
14.  select Pi, Sj from FHTO_PS where rowkey = selectRowkey and columnfamily = selectColumnfamily
15.  foreach t = (Sj, Pi, μs/O) do
16.    if MatchFTP_T(tp, t) then
17.      Result.add(t)
18.    end if
19.  end for
20. end if
21. return Result

Algorithm 4.9 starts with some initialization work, such as initializing the result set to be returned and setting the row key to look for to the given object. Because the subject and predicate of the fuzzy triple patterns processed by Algorithm 4.9 are both unknown, when the predicate is an object property only the column family to look for, "Object Property," is determined. Next, the algorithm queries the table FHTO_PS and uses the index function of the FHBase table to get all column values and corresponding cell values according to the determined row key O and column family "Object Property." Then the MatchFTP_T algorithm is called to filter the candidate fuzzy triples that match and add them to the result set. Finally, Algorithm 4.9 returns the matched result set. Note that Algorithm 4.9 performs similar operations when the predicate is a datatype property.

7. Query algorithm Query_FSPO

When the given fuzzy triple pattern has one of the forms of Type 7, that is, the subject, predicate, and object are all unknown, we need to take all the fuzzy triples in the FHDB as the candidate result set, which means we can query either the table FHTS_PO or FHTO_PS. We propose the query algorithm Query_FSPO.


Algorithm 4.10: Query_FSPO
Input: Fuzzy triple pattern tp = (?S, ρ/?P, ?O) or (?S, ?ρ/?P, ?O) or (?S, ?P, μ/?O) or (?S, ?P, ?μ/?O)
Output: Result which matches tp
1. Initialize Result
2. if tp = (?S, ρ/?P, ?O) or tp = (?S, ?ρ/?P, ?O)
3.   selectColumnfamily ← Object Property
4.   select Si, ρs/Pj, Ok from FHTS_PO or FHTO_PS where columnfamily = selectColumnfamily
5.   foreach t = (Si, ρs/Pj, Ok) do
6.     if MatchFTP_T(tp, t) then
7.       Result.add(t)
8.     end if
9.   end for
10. end if
11. else if tp = (?S, ?P, μ/?O) or tp = (?S, ?P, ?μ/?O)
12.  selectColumnfamily ← Datatype Property
13.  select Si, Pj, Ok from FHTS_PO or FHTO_PS where columnfamily = selectColumnfamily
14.  foreach t = (Si, Pj, μs/Ok) do
15.    if MatchFTP_T(tp, t) then
16.      Result.add(t)
17.    end if
18.  end for
19. end if
20. return Result

Differing from all the query algorithms above, the subject, predicate, and object of the fuzzy triple patterns processed by Algorithm 4.10 are all unknown, so Algorithm 4.10 retrieves all fuzzy triples in the FHDB as candidates by querying the table FHTS_PO or FHTO_PS. Specifically, Algorithm 4.10 first initializes the result set to be returned. When the predicate is an object property, only the column family "Object Property" is determined. Next, the table FHTS_PO or FHTO_PS is queried to get all fuzzy triples, which are added to the candidate result set. Then the MatchFTP_T algorithm is called to filter the eligible fuzzy triples, which are added to the result set. Finally, Algorithm 4.10 returns the matched result set. Of course, Algorithm 4.10 performs similar operations when the predicate is a datatype property.
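The division of labor among Algorithms 4.4-4.10 depends only on which of S, P, and O are known, so pattern dispatch can be sketched as follows; the 5-tuple encoding with "?"-prefixed variables is our assumption, and the returned names echo the algorithms above:

```python
def is_var(x):
    return isinstance(x, str) and x.startswith("?")

def dispatch(tp):
    """Map a fuzzy triple pattern (S, P, rho, O, mu) to its query algorithm."""
    s_known = not is_var(tp[0])
    p_known = not is_var(tp[1])
    o_known = not is_var(tp[3])
    if s_known and p_known:
        return "Query_FS_PO"      # Type 1
    if s_known and o_known:
        return "Query_FSO_P"      # Type 2
    if p_known and o_known:
        return "Query_FOP_S"      # Type 3
    if s_known:
        return "Query_FS_OP"      # Type 4
    if p_known:
        return "Query_FP_SO"      # Type 5
    if o_known:
        return "Query_FO_PS"      # Type 6
    return "Query_FSPO"           # Type 7
```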

4.4.3 Design and Implementation

On the basis of the storage and query methods proposed in Sects. 4.4.1 and 4.4.2, we design and implement a prototype called FRDF2FHBase, which stores fuzzy RDF data in the FHDB and supports basic fuzzy triple pattern queries. In the following, we briefly discuss the implementation of FRDF2FHBase.


Fig. 4.9 The overall architecture of FRDF2FHBase

The prototype FRDF2FHBase consists of four main modules: a data loading module, a data storage module, an FHBase-based query module, and a parsing module. The overall architecture of FRDF2FHBase is shown in Fig. 4.9.
1. Data loading module: loads fuzzy RDF data described in the form of triples; the data are divided into fuzzy RDFS data and fuzzy RDF instance data.
2. Data storage module: stores the fuzzy RDF data in a target FHDB according to the storage model proposed in Sect. 4.4.1.
3. FHBase-based query module: processes the input f-SPARQL queries; it parses each f-SPARQL query into a set of fuzzy triple patterns and returns the candidate result set satisfying the parsed fuzzy triple patterns according to the FHBase-based RDF(S) query algorithms proposed in Sect. 4.4.2.
4. Parsing module: processes the candidate result set obtained by the FHBase-based query module; it uses a greedy multiple connection join strategy for f-SPARQL BGP processing and returns the final result set.
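One plausible reading of the greedy multiple connection join used by the parsing module is to always merge the smallest binding sets first; the book does not spell out the strategy, so the ordering heuristic and the dict-based binding representation below are our assumptions:

```python
def join(left, right):
    """Nested-loop join of two lists of variable bindings (dicts)."""
    out = []
    for a in left:
        for b in right:
            # Compatible bindings agree on every shared variable.
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                merged = dict(a)
                merged.update(b)
                out.append(merged)
    return out

def greedy_join(binding_sets):
    """Fold-join the per-pattern candidate sets, smallest first (greedy)."""
    sets = sorted(binding_sets, key=len)
    result = sets[0]
    for s in sets[1:]:
        result = join(result, s)
    return result
```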

4.5 Fuzzy RDF Graph Mapping to Property Graph

Although RDF triples can be stored in a relational database, an RDF model can fundamentally be viewed as a special case of a graph model, so using a graph-shaped database is more appropriate. Bonstrom et al. (2003) show that the


advantages of storing RDF data in a graph structure are: (i) graph structures map directly to RDF models, avoiding the need to convert RDF data to fit the storage structure, and (ii) querying the semantic information of RDF data does not require reconstructing RDF graphs. The graph model conforms to the semantic level of the RDF model and can preserve the semantic information of the RDF data to the utmost extent. In addition, many graph-theoretic algorithms can be applied to optimize the inferential querying of RDF data. There has been some related work on graph storage of RDF data. Zou et al. (2014) proposed gStore, a method for storing and processing RDF data using a graph model, which converts RDF graphs into data signature graphs and uses vertex signature tree (VS*-tree) indexes to reduce maintenance overhead. Hartig (2014) proposed a formal definition of the property graph model and introduced transformations between property graphs and RDF*. Libkin et al. (2018) introduced a triple-based model called Trial, which combines the concept of triple storage in RDF with the concept of graph data, and illustrated the difference between the triple-based RDF graph model and the standard graph database model. De Virgilio (2017) proposed a method of converting RDF data into a graph database using an ontology and related constraint rules. To realize distributed management of Web-scale RDF data, Zeng et al. (2013) proposed Trinity RDF, a distributed graph engine that stores RDF data in the form of native graphs instead of triples or bitmap matrices. However, none of the above works considers the storage and querying of fuzzy RDF graph data. To solve this problem, an effective method is to establish a mapping between fuzzy RDF graphs and property graphs. In this section, we discuss a methodology for losslessly transforming a fuzzy RDF graph into a property graph.
The main idea is to represent each ordinary RDF triple as a property graph edge, with the fuzzy degree of the triple expressed as an attribute of the edge. Specifically, our goal is to convert a fuzzy RDF graph G into a property graph GP, and further to map a SPARQL query on G to a Cypher query over GP.

4.5.1 Preliminaries

1. SPARQL Query in the Fuzzy RDF

In this section, we use the SPARQL extension of Sect. 3.5.3 and add an optional query clause "WITH ⟨threshold⟩" to the standard SPARQL statement to indicate the minimum membership threshold that the query result should satisfy. The user chooses an appropriate threshold value to express his or her needs. In this way, the classic SPARQL query statement takes the form SELECT—FROM—WHERE—[WITH ⟨threshold⟩]. With this kind of SPARQL query, users can obtain query results that meet the query conditions and the preset threshold at the same time, so the query process over a fuzzy RDF database involves choosing a threshold. It should be emphasized that


if ⟨threshold⟩ takes its default value of 1, the item WITH ⟨threshold⟩ can be omitted. Suppose that we want to find an action movie whose director is American, with trustworthiness greater than 0.6. According to the extended SPARQL syntax, the SPARQL SELECT statement that meets these query conditions is expressed as follows.

PREFIX le: <http://fuzzyRDFexample.org/>
SELECT ?x
WHERE {
  ?x le:Genre "action" .
  ?x le:Director ?z .
  ?z le:birthPlace "American"
}
WITH ⟨0.6⟩

Here, "WITH ⟨0.6⟩" is the threshold expression, which specifies the lowest possibility of a matching subgraph, and the symbol "?x" represents the film that we want to retrieve.

2. Property Graph Model

Assume that the set D of data types contains the string type S, that is, S ∈ D; D may also include collection types. For each data type D, dom(D) represents the value space of type D, that is, the set of all possible values of D, and dom(S) represents the set of all strings. The property graph is formally defined as follows: a property graph GP is a 6-tuple ⟨VP, EP, src, tgt, lbl, P⟩, where ⟨VP, EP, src, tgt, lbl⟩ is a directed labeled multigraph. Here VP and EP are the sets of vertices and edges, respectively; the function src: EP → VP assigns each edge its start (head) vertex; the function tgt: EP → VP assigns each edge its termination (tail) vertex; and lbl: EP → dom(S) assigns each edge a label. The function P: VP ∪ EP → 2^P associates every vertex v ∈ VP and edge e ∈ EP with a set of ⟨key, value⟩ pairs called properties. Neo4j is a management system for crisp property graph databases, whose primitives are vertices, relationships, and attributes. Different kinds of vertices are identified by labels, which can be IRI, Literal, or Blank. Vertices can have zero or more attributes, which exist as key-value pairs. An IRI vertex has two attributes, namely kind and IRI.
The vertex of kind Blank has one attribute, namely kind. The vertex of kind Literal has four attributes: kind, value, datatype, and language. The attributes of the same vertex are stored in a linked list. A relationship consists of a start vertex and an end vertex. As with vertices, relationships can also have multiple attributes and labels. Figure 4.10 shows an example of a simple property graph that contains two vertices and a relationship between them. The relationship labeled "partner" starts at the vertex "Pratt" and ends at the vertex "Statham". In addition, the boxes associated with graph elements (vertices and edges) represent the


4 Persistence of Fuzzy RDF and Fuzzy RDF Schema

Fig. 4.10 A property graph with two vertices

attributes of these elements. For example, the vertex of Chris Pratt has two attributes, which represent the name and year of birth of the famous actor. The partnership has only one attribute, which indicates the certainty with which Statham is Pratt's partner. Cypher is the standard query language of the Neo4j graph database; like SQL in a relational database, it queries the property graph in a crisp way. A query is composed of a START clause followed by a MATCH and a RETURN clause, where START indicates a starting vertex of the matching subgraph, MATCH describes all edges of the matching subgraph, and WHERE describes an attribute expression on the vertices and edges of the subgraph as a filter condition. An example of a Cypher query that uses these clauses to find the mutual partners of the actor named James Guan is:

START a = node:actor(name = "James Guan")
MATCH (a)-[:partner]->(b)-[:partner]->(c), (a)-[:partner]->(c)
RETURN b, c
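The property graph 6-tuple and the mutual-partner pattern above can both be illustrated with a small self-contained Python sketch. The names PropertyGraph and mutual_partners are ours, the third vertex and the fuzzy degree 0.8 are assumed for illustration, and this is not a Neo4j API:

```python
from dataclasses import dataclass, field

@dataclass
class PropertyGraph:
    """G^P = (V^P, E^P, src, tgt, lbl, P): a directed labeled multigraph
    whose vertices and edges carry key-value property sets."""
    vertices: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    src: dict = field(default_factory=dict)    # src: E^P -> V^P (head)
    tgt: dict = field(default_factory=dict)    # tgt: E^P -> V^P (tail)
    lbl: dict = field(default_factory=dict)    # lbl: E^P -> dom(S)
    props: dict = field(default_factory=dict)  # P: V^P ∪ E^P -> properties

    def add_vertex(self, v, **properties):
        self.vertices.add(v)
        self.props[v] = properties

    def add_edge(self, e, s, t, label, **properties):
        self.edges.add(e)
        self.src[e], self.tgt[e], self.lbl[e] = s, t, label
        self.props[e] = properties

    def out_edges(self, v, label):
        return [e for e in self.edges
                if self.src[e] == v and self.lbl[e] == label]

def mutual_partners(g, a):
    """Emulates: MATCH (a)-[:partner]->(b)-[:partner]->(c),
                       (a)-[:partner]->(c) RETURN b, c."""
    direct = {g.tgt[e] for e in g.out_edges(a, "partner")}
    return sorted((b, c) for b in direct for c in direct
                  if c in {g.tgt[e] for e in g.out_edges(b, "partner")})

# Fig. 4.10, extended with an assumed third vertex so the query matches:
g = PropertyGraph()
g.add_vertex("Pratt", name="Chris Pratt", born=1979)
g.add_vertex("Statham", name="Jason Statham")
g.add_vertex("Guan", name="James Guan")
g.add_edge("e1", "Pratt", "Statham", "partner", fdegree=0.8)
g.add_edge("e2", "Guan", "Pratt", "partner")
g.add_edge("e3", "Guan", "Statham", "partner")
print(mutual_partners(g, "Guan"))  # -> [('Pratt', 'Statham')]
```

The dictionaries mirror the src, tgt, lbl, and P functions of the 6-tuple directly, so each formal condition of the definition corresponds to one field.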

4.5.2 Transform Fuzzy RDF Graph to Property Graph

In order to adapt the fuzzy RDF data model to the property graph data model, we use the following rules to convert each triple in the RDF dataset into a property graph: (i) any subject or object vertex in RDF becomes a vertex with a unique integer ID in the property graph; (ii) an object property in RDF is designated as an adjacent edge in the property graph, where the source and the target of the edge are vertex IDs, and the edge is identified by an integer ID; (iii) a datatype property in RDF is specified as vertex attributes in the property graph; (iv) fuzzy degree information is converted into vertex and edge attributes. A basic requirement for the conversion is that every possible IRI must be explicitly mapped to a distinct string; since IRIs are themselves strings, this requirement can be met. Therefore, we define an injective IRI-to-string function im: I → dom(S). Given these preliminaries, the conversion rules are defined as follows. Let G = (V, E, Σ, L, μ, ρ) be a PG-convertible RDF graph, and let V = {x ∈ (I ∪ B ∪ L) | ⟨s, p, o⟩ ∈ G and x ∈ {s, o}} be the set of vertex elements. The property graph corresponding to graph G can be expressed as GP = ⟨VP, EP, src, tgt, lbl, P⟩:


• VP contains |V| vertices, and each vertex represents a different RDF item in V. In other words, there is a function v: V → VP such that each x ∈ V is mapped to a different vertex v(x) ∈ VP.
(i) If the RDF item is an IRI, then P(v(u)) = {⟨"kind", "IRI"⟩, ⟨"IRI", im(u)⟩}, where u ∈ I, v(u) ∈ VP, and im is the IRI-to-string mapping mentioned above.
(ii) If the RDF item is a blank vertex, then P(v(b)) = {⟨"kind", "blank vertex"⟩}, where b ∈ B and v(b) ∈ VP.
(iii) If the RDF item is a literal, then P(v(l)) = {⟨"kind", "literal"⟩, ⟨"literal", vm−1(l)⟩, ⟨"datatype", im(dtype(l))⟩} ∪ lang, where l ∈ L, v(l) ∈ VP, vm−1 is the inverse of the value-to-literal bijective mapping, and lang = {⟨"language", lang(l)⟩} if l ∈ dom(lang), and lang = φ otherwise.
(iv) For a fuzzy RDF item x ∈ (I ∪ B ∪ L), the property set is defined as P(v(x)) = {⟨"fdegree", vm(μ(x))⟩}, where v(x) ∈ VP.

• EP contains |E| edges, and each edge corresponds to an RDF triple t ∈ G. Therefore, a bijective function e: E → EP is defined such that each triple t = ⟨s, p, o⟩ ∈ G is mapped to an edge e(t) ∈ EP.
(i) The edge label of e(t) is im(p), and the two adjacent vertices of edge e(t) are v(s) and v(o), respectively; formally: src(e(t)) = v(s), lbl(e(t)) = im(p), and tgt(e(t)) = v(o).
(ii) Moreover, the property set P(e(t)) is defined as P(e(t)) = {⟨"fdegree", vm(ρ(t))⟩}.
This conversion represents any fuzzy RDF triple as an edge in the property graph, whose attributes capture the relationship and the fuzziness of the RDF triple. The two adjacent vertices of this edge correspond to the subject and object of the RDF triple. Each vertex introduces two attributes: (i) kind indicates whether the corresponding item is an IRI, Literal, or Blank, and (ii) value indicates the corresponding value. Note that if the kind is Literal, another attribute, namely datatype, should be introduced to describe the type of the value. For the sake of clarity, we give an example to illustrate the global steps of our proposed approach. Figure 4.11 shows a fuzzy RDF graph in which vertices represent entity resources such as actors, movies, etc., while edges represent the relationships between them. For readability, each vertex in the graph uses the name of the entity resource or the literal instead of the URI itself. The label on a vertex is associated with a fuzzy degree to indicate the likelihood of the vertex being so labeled. For instance, the genre of the movie Guardian of the Galaxy 2 is labeled "action" with possibility 0.91. The fuzzy RDF graph G in this example is PG-convertible, and the given conversion rules can be used to translate it into a property graph. The generated property graph GP is shown in Fig.
4.12, which contains the following elements: VP = {v1, v2, …, v7}, EP = {e1, e2, …, e7}, src(e1) = v1, lbl(e1) = "Rating", tgt(e1) = v3, src(e2) = v1, tgt(e2) = v2, lbl(e2) = "Genre", src(e3) = v1, tgt(e3)


Fig. 4.11 A fuzzy RDF graph inspired by IMDB

Fig. 4.12 A property graph converted from the fuzzy RDF

= v4, lbl(e3) = "Starring", src(e4) = v1, tgt(e4) = v5, lbl(e4) = "Director", src(e5) = v5, tgt(e5) = v4, lbl(e5) = "partner", src(e6) = v5, tgt(e6) = v6, lbl(e6) = "birthPlace", src(e7) = v5, tgt(e7) = v7, lbl(e7) = "Age", P(v1) = {⟨"kind", "IRI"⟩, ⟨"IRI", "Guardian of the Galaxy 2"⟩}, P(v2) = {⟨"kind", "Literal"⟩, ⟨"value", action⟩, ⟨"datatype", string⟩, ⟨"fdegree", 0.91⟩}, P(v3) = {⟨"kind", "Literal"⟩, ⟨"value", 8.4⟩, ⟨"datatype", float⟩, ⟨"fdegree", 0.8⟩}, P(v4) = {⟨"kind", "IRI"⟩, ⟨"IRI", "Vin Diesel"⟩}, P(v5) = {⟨"kind", "IRI"⟩, ⟨"IRI", "James Guan"⟩}, P(v6) = {⟨"kind", "Literal"⟩, ⟨"value", American⟩, ⟨"datatype", string⟩}, P(v7) = {⟨"kind", "Literal"⟩, ⟨"value", 43⟩, ⟨"datatype", int⟩}, P(e1) = φ, P(e2) = φ, P(e3) = {⟨"fdegree", 0.7⟩}, P(e4) = φ, P(e5) = {⟨"fdegree", 0.4⟩}, P(e6) = {⟨"fdegree", 0.8⟩}, P(e7) = φ.
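Conversion rules (i)–(iv) can be prototyped directly. The sketch below converts two of the Fig. 4.11 triples; the function and variable names are ours, the IRI-to-string mapping im is simulated by the identity on strings, and datatype names are simplified to Python type names:

```python
def rdf_to_pg(triples):
    """Apply conversion rules (i)-(iv). Each input triple is a tuple
    (subject, predicate, object, object_degree, edge_degree),
    with None for crisp elements."""
    vmap, vprops, edges = {}, {}, []

    def vertex(term, degree=None):
        if term not in vmap:
            vid = f"v{len(vmap) + 1}"           # rule (i): unique vertex ID
            vmap[term] = vid
            if isinstance(term, str) and term.startswith("http"):
                vprops[vid] = {"kind": "IRI", "IRI": term}
            else:                               # rule (iii): literal vertex
                vprops[vid] = {"kind": "Literal", "value": term,
                               "datatype": type(term).__name__}
        if degree is not None:                  # rule (iv): fuzzy degree
            vprops[vmap[term]]["fdegree"] = degree
        return vmap[term]

    for i, (s, p, o, od, ed) in enumerate(triples, 1):
        edge = {"id": f"e{i}", "src": vertex(s), "tgt": vertex(o, od),
                "lbl": p}                       # rule (ii): one edge per triple
        if ed is not None:                      # rule (iv): edge degree
            edge["fdegree"] = ed
        edges.append(edge)
    return vprops, edges

# Two triples of Fig. 4.11 (Genre degree 0.91 on the literal vertex,
# Starring degree 0.7 on the edge, as in Fig. 4.12):
vprops, edges = rdf_to_pg([
    ("http://ex.org/GuardianOfTheGalaxy2", "Genre", "action", 0.91, None),
    ("http://ex.org/GuardianOfTheGalaxy2", "Starring",
     "http://ex.org/VinDiesel", None, 0.7),
])
```

Running it reproduces the shape of Fig. 4.12: the movie IRI becomes v1, the literal "action" becomes v2 with an fdegree property, and the Starring edge carries fdegree 0.7.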

4.5.3 Query Fuzzy RDF Graph in Neo4j

Since the fuzzy RDF data is stored in the Neo4j database, SPARQL cannot be applied directly. The question that follows is how to query the fuzzy RDF data stored in the Neo4j database. There are two possible ways: one is to convert a SPARQL query into a Cypher query; the other is to use Cypher directly. The former keeps the SPARQL language extracting information


from Neo4j through a supported plug-in. The plug-in was developed as a wrapper for the Neo4j graph database. It is tailor-made to reuse the advanced features of Neo4j to efficiently store, index, and query graph structures using the core API of Neo4j. The latter way focuses on the use of Cypher queries. Similar to SPARQL, this approach considers that all entities and relations stored in the database form triples of the [entity]-(relationship/predicate)-[entity] pattern; the first element of the triple is also called the "subject". In a graph database, a directed edge connecting two vertices (that is, the relationship is directional) is used to indicate the "subject" of a particular triple. In addition, the Cypher query language also supports grouping (GROUP BY), filtering (WHERE), and sorting (ORDER BY) operations, which are like those of SQL. RDF graphs are usually queried by specifying a graph pattern using the standard SPARQL query language, which returns matching subgraphs. There are several ways to express pattern matching queries in Cypher. The most straightforward method is to start with a vertex in the matching pattern graph, and then match all edges of the matching subgraph in a MATCH statement of the Cypher query. In this research, we just focus on Cypher's basic query approach and its advantages in handling fuzzy RDF data. Cypher queries also enable users to implement some query functions that cannot be implemented in SPARQL. For instance, in property path queries, Cypher allows users to use more powerful path expressions than those provided by SPARQL. Let us consider a Cypher query with the same functionality as the SPARQL query in the previous example. The query also specifies a threshold δt (δt = 0.6), which is used to return matching items with possibility greater than δt. The Cypher query statement in this example is presented as follows.
START v1 = node:nodes(IRI = "Guardian of the Galaxy 2")
MATCH (v1)-[:Genre]->(v2 {value: "action"})
WHERE v2.fdegree > 0.6
MATCH (v1)-[:Director]->(v5)-[e:birthPlace]->(v6 {value: "American"})
WHERE e.fdegree > 0.6
RETURN v1

When translating the threshold expression into the corresponding Cypher, we define the format of the conditional expression as fdegree > δt, which means that the overall possibility of the matching answer must satisfy the fuzzy degree δt ∈ [0, 1]. In the example, the Cypher equivalent of the threshold expression "WITH ⟨0.6⟩" is fdegree > 0.6. When the query contains multiple triple patterns, we must aggregate the results of each pattern.
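The final aggregation step can be sketched as taking the minimum fdegree over all patterns matched by a candidate answer and comparing it with δt. The helper name is ours, and min is an assumed aggregation choice, consistent with the t-norm interpretation used elsewhere in this book:

```python
def satisfies(match_degrees, threshold):
    """Aggregate the fdegrees of all triple patterns matched by one
    candidate answer with min, then compare with delta_t (WITH clause)."""
    overall = min(match_degrees) if match_degrees else 1.0  # crisp match -> 1
    return overall > threshold

# Degrees of the two fuzzy patterns of the example query, taken from
# Fig. 4.12 (Genre vertex: 0.91, birthPlace edge: 0.8):
print(satisfies([0.91, 0.8], 0.6))  # True: min = 0.8 > 0.6
print(satisfies([0.91, 0.4], 0.6))  # False once one pattern drops to 0.4
```

With this semantics, pushing fdegree > δt into each per-pattern WHERE clause, as the Cypher query above does, is equivalent to filtering on the min-aggregated degree of the whole answer.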

4.6 Summary

With the rapid development of the Internet, the requirement of managing information based on the Web has attracted much attention from both academia and industry. RDF is widely regarded as the next step in the evolution of the World Wide Web, and


has been the de-facto standard. This creates a new set of data management requirements involving RDF. On the other hand, fuzzy sets and possibility theory have been extensively applied to deal with information imprecision and uncertainty in practical applications, and reengineering fuzzy RDF into fuzzy database models is receiving more attention as a means of managing fuzzy RDF data. In this chapter, we proposed approaches for reengineering fuzzy RDF into fuzzy database models, including fuzzy relational database models, fuzzy object-oriented database models, and HBase database models. Moreover, we investigated the storage and querying of fuzzy RDF graphs, represented as labeled directed graph structures, in the property graph database storage model. We manage these data with the Neo4j graph DBMS in order to support expressive querying services over the stored data. The two-way mappings between the fuzzy database models and the fuzzy RDF models play an important role in establishing the overall management system of fuzzy RDF data. Moreover, for processing fuzzy RDF data intelligently, fuzzy RDF querying is also very necessary. How to query RDF with imprecise or uncertain information has raised certain concerns, as will be introduced in the following chapter.

References

Abadi, D. J., Marcus, A., Madden, S. R., & Hollenbach, K. (2009). SW-Store: A vertically partitioned DBMS for semantic web data management. The VLDB Journal, 18(2), 385–406.
Abraham, J., Brazier, P., Chebotko, A., Navarro, J., & Piazza, A. (2010). Distributed storage and querying techniques for a semantic web of scientific workflow provenance. In 2010 IEEE International Conference on Services Computing (pp. 178–185).
Atre, M., Srinivasan, J., & Hendler, J. A. (2009). BitMat: A main memory RDF triple store. Tetherless World Constellation, Rensselaer Polytechnic Institute, Troy, NY.
Bönström, V., Hinze, A., & Schweppe, H. (2003). Storing RDF as a graph (detailed view). In Proceedings of the First Latin American Web Congress (pp. 27–36).
Bornea, M. A., Dolby, J., Kementsietsidis, A., Srinivas, K., Dantressangle, P., Udrea, O., & Bhattacharjee, B. (2013). Building an efficient RDF store over a relational database. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 121–132).
Cai, M., & Frank, M. (2004). RDFPeers: A scalable distributed RDF repository based on a structured peer-to-peer network. In Proceedings of the 13th International Conference on World Wide Web (pp. 650–657).
Chen, H., Wu, Z., Wang, H., & Mao, Y. (2006). RDF/RDFS-based relational database integration. In 22nd International Conference on Data Engineering (ICDE'06) (pp. 94–94).
Chong, E. I., Das, S., Eadon, G., & Srinivasan, J. (2005). An efficient SQL-based RDF querying scheme. In Proceedings of the 31st International Conference on Very Large Data Bases (pp. 1216–1227).
Cudré-Mauroux, P., Enchev, I., Fundatureanu, S., Groth, P., Haque, A., Harth, A., … & Wylot, M. (2013). NoSQL databases for RDF: An empirical evaluation. In International Semantic Web Conference (pp. 310–325). Springer.
De Virgilio, R. (2017). Smart RDF data storage in graph databases. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (pp. 872–881).
Fan, T., Yan, L., & Ma, Z. (2019). Mapping fuzzy RDF(S) into fuzzy object-oriented databases. International Journal of Intelligent Systems, 34(10), 2607–2632.


Fan, T., Yan, L., & Ma, Z. (2020). Storing and querying fuzzy RDF(S) in HBase databases. International Journal of Intelligent Systems, 35(4), 751–780.
Farhan Husain, M., Doshi, P., Khan, L., & Thuraisingham, B. (2009). Storage and retrieval of large RDF graph using Hadoop and MapReduce. In IEEE International Conference on Cloud Computing (pp. 680–686). Springer.
Harris, S., & Gibbins, N. (2003). 3store: Efficient bulk RDF storage. In R. Volz, S. Decker, & I. F. Cruz (Eds.), Proceedings of the First International Workshop on Practical and Scalable Semantic Systems (pp. 1–15). CEUR-WS.org.
Hartig, O. (2014). Reconciliation of RDF* and property graphs. Technical report, University of Waterloo. http://arxiv.org/abs/1409.3288
Janik, M., & Kochut, K. (2005). BRAHMS: A workbench RDF store and high-performance memory system for semantic association discovery. In International Semantic Web Conference (pp. 431–445). Springer.
Libkin, L., Reutter, J. L., Soto, A., & Vrgoč, D. (2018). TriAL: A navigational algebra for RDF triplestores. ACM Transactions on Database Systems (TODS), 43(1), 1–46.
Ma, Z., & Yan, L. (2018). Modeling fuzzy data with RDF and fuzzy relational database models. International Journal of Intelligent Systems, 33(7), 1534–1554.
Ma, Z., Capretz, M. A., & Yan, L. (2016). Storing massive resource description framework (RDF) data: A survey. The Knowledge Engineering Review, 31(4), 391–413.
Ma, Z. M., Zhang, W. J., & Ma, W. Y. (2004). Extending object-oriented databases for fuzzy information modeling. Information Systems, 29(5), 421–435.
Myung, J., Yeon, J., & Lee, S. G. (2010). SPARQL basic graph pattern processing with iterative MapReduce. In Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud (pp. 1–6).
Nenov, Y., Piro, R., Motik, B., Horrocks, I., Wu, Z., & Banerjee, J. (2015). RDFox: A highly-scalable RDF store. In International Semantic Web Conference (pp. 3–20). Springer.
Neumann, T., & Weikum, G. (2008). RDF-3X: A RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1), 647–659.
Neumann, T., & Weikum, G. (2010a). The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 91–113.
Neumann, T., & Weikum, G. (2010b). x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases. Proceedings of the VLDB Endowment, 3(1–2), 256–263.
Papailiou, N., Konstantinou, I., Tsoumakos, D., & Koziris, N. (2012). H2RDF: Adaptive query processing on RDF data in the cloud. In Proceedings of the 21st International Conference on World Wide Web (pp. 397–400).
Peng, P., Zou, L., Özsu, M. T., Chen, L., & Zhao, D. (2016). Processing SPARQL queries over distributed RDF graphs. The VLDB Journal, 25(2), 243–268.
Quasthoff, M., & Meinel, C. (2011). Supporting object-oriented programming of semantic-web software. IEEE Transactions on Systems, Man, and Cybernetics Part C (Applications and Reviews), 42(1), 15–24.
Rohloff, K., & Schantz, R. E. (2010). High-performance, massively scalable distributed systems using the MapReduce software framework: The SHARD triple-store. In Programming Support Innovations for Emerging Distributed Applications (pp. 1–5).
Sintek, M., & Kiesel, M. (2006). RDFBroker: A signature-based high-performance RDF store. In European Semantic Web Conference (pp. 363–377). Springer.
Sun, J., & Jin, Q. (2010). Scalable RDF store based on HBase and MapReduce. In 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE) (Vol. 1, pp. V1-633).
Weiss, C., Karras, P., & Bernstein, A. (2008). Hexastore: Sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment, 1(1), 1008–1019.
Wilkinson, K., Sayers, C., Kuno, H. A., & Reynolds, D. (2003). Efficient RDF storage and retrieval in Jena2. In SWDB (Vol. 3, pp. 131–150).


Zeng, K., Yang, J., Wang, H., Shao, B., & Wang, Z. (2013). A distributed graph engine for web scale RDF data. Proceedings of the VLDB Endowment, 6(4), 265–276.
Zou, L., Özsu, M. T., Chen, L., Shen, X., Huang, R., & Zhao, D. (2014). gStore: A graph-based SPARQL query engine. The VLDB Journal, 23(4), 565–590.

Chapter 5

Fuzzy RDF Queries

5.1 Introduction

The Resource Description Framework (RDF) has been widely applied to represent and exchange domain information because of its machine-readable nature. With a huge amount of RDF data available, retrieving RDF data is essential, so many RDF query approaches have been developed. The RDF data retrieval task can usually be approached in two ways: the first is to solve the problem with the query language of the RDF database system; the second is to use graph pattern matching algorithms to implement queries, since RDF data can be represented as graphs. However, in many real applications, RDF data are often noisy, incomplete, and inaccurate. Traditional approaches generally cannot handle imprecise and uncertain information, and this seriously prevents a large number of common users from obtaining information from RDF datasets. Therefore, in this chapter, we focus on fuzzy RDF queries. We present methods of pattern match query, approximate fuzzy RDF subgraph matching query, and fuzzy quantified query over fuzzy RDF graphs, and investigate the problem of fuzzy RDF query based on extended SPARQL. In classical RDF graph pattern matching, the task is to find inside a given graph G some specific smaller graph Q, called the pattern. A naive approach is to compare all possible subgraphs in G and their label bindings with the pattern graph Q, i.e., to obtain all the candidate subgraphs with existing techniques, and then check the dominating relationship and return the true answers. Although there have been many studies (Neumann & Weikum, 2008; Zou & Özsu, 2017) on RDF subgraph matching, none of these works considers the problem that the RDF graph could contain fuzzy information in some applications.
Moreover, these methods are not efficient in response time because of the need to perform subgraph isomorphism checks on Q and G, producing a large number of unnecessary intermediate results; subgraph isomorphism checking has been shown to be NP-complete (Ullmann, 1976). Therefore, a threshold-based RDF subgraph pattern matching query method is introduced in Sect. 5.2.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
Z. Ma et al., Modeling and Management of Fuzzy Semantic RDF Data, Studies in Computational Intelligence 1057, https://doi.org/10.1007/978-3-031-11669-8_5

Based


on the traditional subgraph isomorphism matching method, the fuzzy RDF subgraph matching problem is solved efficiently. Specifically, we want to retrieve all qualified matches of a query pattern in the fuzzy RDF graph. To alleviate the time-consuming exhaustive search, the other method resorts to an approximate matching strategy, which relaxes the rigid structure and label matching constraints of subgraph isomorphism and other traditional graph similarity measures. The various approaches (Costabello, 2014; Virgilio et al., 2015) to approximate matching on RDF graph data rely on heuristics, based on similarity or distance metrics, and on specific indexing structures to improve performance. However, existing inexact graph matching algorithms ignore many features of RDF graphs. For example, these algorithms only take the similarity of vertices and edges in the RDF graph into account but do not consider the structure among the vertices and edges. More importantly, these algorithms disregard the semantic relationships between resources, and cannot process and manage fuzzy information about the RDF graph in the matching process. Inspired by the method of joining path query graphs introduced in (Virgilio et al., 2015; Moustafa et al., 2014; Zhao & Han, 2010), we choose the path instead of the vertex as the basic matching unit and propose a new path-based solution to efficiently answer subgraph pattern queries over fuzzy RDF graphs. We introduce this path-based approximate RDF subgraph pattern matching method in Sect. 5.3. It has been widely recognized that classical querying suffers from a lack of flexibility due to crisp querying conditions and querying objects. Flexible queries play important roles in intelligent information retrieval and have become the main means to realize flexible data querying. Bosc and Pivert (1992) point out that a query is flexible if a qualitative distinction between the selected entities is allowed.
This case arises when the query conditions are crisp but the databases being queried contain imperfect information. As a special kind of flexible query, fuzzy quantified queries have long been recognized for their ability to express different types of imprecise and flexible information needs in a relational database context. However, in the specific RDF/SPARQL setting, the current approaches from the literature that deal with quantified queries consider only crisp quantifiers (Bry et al., 2010; Fan et al., 2016) over crisp RDF data. In Sect. 5.4, we intend to integrate linguistic quantifiers into subgraph patterns addressed to a fuzzy RDF graph database and use a graph pattern matching approach to evaluate fuzzy quantified queries. This extension allows one to express fuzzy preferences on values present in the graph as well as on the structure of the data graph, which has not been proposed in any previous fuzzy RDF graph pattern matching work. SPARQL (Prudhommeaux, 2008), the official W3C recommendation as an RDF query language, plays the same role for the RDF data model as SQL does for the relational data model. In a SPARQL query, the WHERE clause consists of triple patterns that contain either variables or literals. Actually, each SPARQL query can be represented by a graph pattern. As a result, any SPARQL query can be equivalently transformed into a subgraph query problem, which locates the subgraph of the RDF data graph matching the query graph. Nevertheless, SPARQL requires accurate knowledge about the graph structure and contents. As users are not always clear about the contents and the data distribution of the database, such a strict query often leads to the Few Answers


Problem: the user query is too selective and the number of answers is insufficient. More importantly, classical SPARQL lacks some expressiveness and usability capabilities, as it follows a crisp (Boolean) querying of RDF data for which the response is either true or false. As a result, it lacks the ability to deal with flexibility aspects (including queries with user preferences or vagueness), which are significant in real-world applications. Therefore, we extend the SPARQL language in Sect. 5.5 for querying fuzzy RDF data.

5.2 Exact Pattern Match Query Over Fuzzy RDF Graph

Traditional specialized pattern graph matching models are usually defined in terms of subgraph isomorphism and its extensions (e.g., edit distance), which identify subgraphs that are exactly or approximately isomorphic to pattern graphs. A comparison of various specialized algorithms for graph pattern matching has been done recently (Lee et al., 2012). The exact RDF graph matching algorithms (Carroll, 2002; Wang et al., 2005) are not efficient in terms of response time, and the complexity of the underlying problem has been proved to be NP-complete (Ullmann, 1976). Existing RDF matching algorithms based on inexact graph matching (Costabello, 2014; Virgilio et al., 2015; Zhang et al., 2012) ignore many features of RDF graphs. For example, most of the algorithms (Costabello, 2014; Zhang et al., 2012) disregard fuzzy data and the semantic relationships between vertices, which in turn results in the loss of some potential answers. Worse still, the traditional approaches are incapable of recognizing and evaluating the fuzzy information in the matching process, which further results in the inability to obtain all the satisfactory answers. Therefore, traditional graph querying techniques are not able to capture good-quality matches in this context. Moreover, the existing techniques (Ma et al., 2011) for processing twig patterns over fuzzy XML trees cannot be effectively applied to handle graph pattern matching over an RDF graph, because a graph does not have the nice property that every two vertices are connected along a unique path. In this section, we study pattern matching in the context of large fuzzy RDF graphs. Specifically, we want to retrieve all qualified matches of a query pattern in the fuzzy RDF graph. We carefully define the syntax and semantics of an extension of the query pattern graph that makes it possible to express and interpret such queries.
We define fuzzy graph patterns that allow: (i) querying a fuzzy RDF data model, and (ii) expressing preferences on data through fuzzy conditions and on the structure of the data graph with regular expressions as edge constraints. In addition, in order to answer subgraph pattern queries efficiently over a fuzzy RDF data graph, we propose an approach for evaluating RDF graph patterns.


5.2.1 Graph Pattern Matching Problem

The basic graph pattern matching problem is to find matches in a graph for a specified pattern. We first introduce graph pattern matching on precise graphs based on subgraph isomorphism, and then proceed to discuss fuzzy graph pattern matching. Subgraph isomorphism is a graph matching technique that finds all subgraphs of G that are isomorphic to Q (see (Gallagher, 2006) for a survey). Given a query pattern graph Q = (Vq, Eq) with n vertices {u1, …, un} and a precise data graph G = (V, E), a pattern match query based on subgraph isomorphism retrieves all matches of Q in G. For a given Q and an n-vertex set m = {v1, …, vn} in G, m is a match of Q in G if (1) the n vertices {v1, …, vn} in G have the same vertex labels as the corresponding vertices {u1, …, un} in Q; and (2) for any edge (ui, uj) in Q, there exists a corresponding edge (vi, vj) in G such that edge (vi, vj) has the same edge label as edge (ui, uj). This makes graph pattern matching NP-complete and hence hinders its scalability in finding exact matches. Moreover, a bijective function is often too restrictive to identify patterns in emerging applications. Graph matching in our scenario is essentially finding a homomorphism (Hahn & Tardif, 1997) from the pattern graph Q to elements of the data graph G. The traditional notion is, however, often too restrictive for graph matching in emerging applications. So, we introduce PRDF homomorphism (Alkhateeb et al., 2009) for checking whether an RDF graph pattern is a consequence of an RDF graph. The notion extends graph homomorphism to deal with vertices connected by regular expression patterns, which can be mapped to vertices connected by paths rather than through edge-to-edge mappings. Here PRDF homomorphism is used for answering fuzzy RDF graph patterns.

Definition 5.1 (PRDF homomorphism (Alkhateeb et al., 2009)) Let G be an RDF graph, and Q be an RDF graph pattern.
A PRDF homomorphism from Q into G is a map φ from Vq into V such that for every triple ⟨s, R, o⟩ ∈ Q, either (i) the empty word ε ∈ L(R) and φ(s) = φ(o); or (ii) there exist triples ⟨n0, p1, n1⟩, …, ⟨nk−1, pk, nk⟩ in G such that n0 = φ(s), nk = φ(o), and p1 · … · pk ∈ L(φ(R)). Here ε is the empty word, R is a regular expression pattern, L(R) is the language denoted by the regular expression R, and L(φ(R)) is the set of words over the edge labels admitted by φ(R). For fuzzy graph pattern matching, in this chapter, we focus on threshold-based RDF pattern matching (T-RPM) over a large fuzzy RDF graph whose vertices and edges are fuzzy. Specifically, given a large fuzzy RDF data graph G, a query pattern graph Q, and a user-specified satisfaction threshold δq ∈ [0, 1], a T-RPM query retrieves all vertex sets M = {v1, …, vn} in G (i.e., n vertices in G) such that the satisfaction degree of M in G is at least δq. That is, we want to retrieve fuzzy subgraphs which contain the pattern graph and have high existence possibilities.
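A naive T-RPM evaluation under these definitions can be sketched as brute-force homomorphism search with min-aggregated degrees. This is illustrative only (the names t_rpm, degree, etc. are ours, edge constraints are plain labels rather than regular expressions, and real evaluators use the optimizations discussed later):

```python
from itertools import product

def t_rpm(pattern_edges, data_edges, degree, delta_q):
    """Brute-force threshold-based RDF pattern matching (T-RPM).
    pattern_edges / data_edges: lists of (u, label, v) triples;
    degree: fuzzy degrees of data vertices/edges (default 1 = crisp).
    Returns every mapping of pattern vertices to data vertices whose
    min-aggregated satisfaction degree is at least delta_q."""
    pvars = sorted({x for (u, _, v) in pattern_edges for x in (u, v)})
    dverts = sorted({x for (u, _, v) in data_edges for x in (u, v)})
    answers = []
    for assignment in product(dverts, repeat=len(pvars)):
        m = dict(zip(pvars, assignment))
        degs = []
        for (u, lbl, v) in pattern_edges:
            e = (m[u], lbl, m[v])
            if e not in data_edges:       # structural condition fails
                break
            degs.append(min(degree.get(m[u], 1), degree.get(m[v], 1),
                            degree.get(e, 1)))
        else:                             # every pattern edge matched
            if degs and min(degs) >= delta_q:
                answers.append(m)
    return answers

# A two-triple fuzzy graph (degrees assumed for illustration):
data = [("film1", "Genre", "action"), ("film1", "Starring", "actor1")]
deg = {("film1", "Genre", "action"): 0.91,
       ("film1", "Starring", "actor1"): 0.7}
print(t_rpm([("?f", "Starring", "?a")], data, deg, 0.6))
```

Raising δq from 0.6 to 0.8 in this example prunes the only candidate, since its aggregated degree is 0.7; this is exactly the threshold semantics stated above.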


Naively, this problem could be solved by directly performing traditional subgraph pattern matching over the RDF graph. However, there are three key questions for answering subgraph queries efficiently over a fuzzy RDF data graph:
• How to effectively build a pattern graph that satisfies the user's query requirements?
• How to efficiently find possible answers matching the pattern graph in fuzzy RDF graphs?
• How to decide the satisfaction degree of matches?
In order to deal with these challenging problems, we carefully design the corresponding solutions. As far as the first question is concerned, the query graph pattern specifies the structural and semantic requirements that a subgraph of G must satisfy. In order to satisfy the user's query requirements, we assign predicate conditions to vertices to express user preferences, with regular expressions as edge constraints to express the graph structure. For the second question, we define a graph pattern matching algorithm based on a revised notion of graph homomorphism. This forms the basis for the algorithms discussed in Sect. 5.5, which further speed up fuzzy subgraph pattern matching. Lastly, the satisfaction degree of a match M in G is the aggregation of the satisfaction degrees of the set of matching vertices.

5.2.2 RDF Graph Pattern

The notion of graph pattern provides a simple yet intuitive specification of the structural and semantic requirements of interest in the input graph. The graph pattern, as the basic operational unit, is central to the semantics of many operations in fuzzy RDF. Essentially, a fuzzy graph pattern is a directed crisp graph with predicates on query vertices and with regular expressions, denoting paths over relationships, as edge labels. In the following, we assume the existence of an infinite set VAR of variables such that VAR ∩ (U ∪ L) = ∅. By convention, we prefix the elements of VAR with a question mark symbol.

Definition 5.2 (Fuzzy graph pattern) A fuzzy graph pattern is a labeled directed graph defined as Q = (Vq, Eq, FV, RE), where
(i) Vq is a finite set of vertices.
(ii) Eq ⊆ Vq × Vq is a finite set of directed edges, where (ui, uj) denotes an edge from vertex ui to uj.
(iii) FV is a function on Vq such that for each vertex u ∈ Vq, FV(u) can be a constant value c, a variable ?x, or a fuzzy condition C of the form "?x op c", "?x op ?y", or "?x is Fterm". Here ?x, ?y ∈ VAR, c ∈ (U ∪ L), op is a fuzzy or crisp comparator (e.g., <, >, =, ≠), and Fterm is a predefined or user-defined fuzzy term like high, long, young, and so on. One can extend fuzzy conditions to support fuzzy conjunction ∧ (resp. disjunction ∨), usually interpreted by the


5 Fuzzy RDF Queries

Fig. 5.1 Fuzzy pattern graph

triangular norm minimum (resp. maximum). To simplify the discussion, we focus on fuzzy conditions in the simple form given above.
(iv) RE is a function defined on E q such that for each edge (ui, uj) in E q, RE(ui, uj) = re(ui, uj) is a path regular expression, constructed inductively as R ::= ε | e | R1 · R2 | R1|R2 | R+. Here ε denotes the empty pattern, e denotes either an edge label or a wildcard symbol * matching any label in U, R1 · R2 denotes a concatenation of expressions, R1|R2 denotes a disjunction (an alternative of expressions), and R+ denotes one or more occurrences of R.

Essentially, the predicate F V(u) of a vertex u specifies a search condition. As will be seen shortly, an edge (ui, uj) in a pattern Q is mapped to a path p in a data graph G, and a regular expression is used to constrain the edges on the path. This differs from the traditional notion of graph pattern matching defined in terms of subgraph isomorphism. Note that traditional graph patterns (Lee et al., 2012) are a special case of the patterns defined above, when (i) a vertex carries its label as its only attribute, and (ii) edges have no regular expression constraints.

Example 5.1 We want to find a high-rating action movie with more than 20 million at the box office; moreover, the film stars American actors. The query graph Q in Fig. 5.1 is a possible way to express this information need. Here ?b, ?film and ?p are three variables, the expression ?b > "20 million" is a crisp comparison condition, the expression "?r is high" is a fuzzy condition, and RE = birthPlace · locateIn+ is a regular expression. This pattern "models" information concerning high-rating (?r is high) action films (?film). The box office of the film is over 20 million (?b > "20 million"). Moreover, the actors (?p) starring in the film are American.
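To make Definition 5.2 and Example 5.1 concrete, the pattern of Fig. 5.1 can be encoded as a small data structure. The following Python sketch is illustrative only; the class and the edge labels (rating, boxOffice, genre, starring) are assumptions about Fig. 5.1, not taken from the chapter.

```python
from dataclasses import dataclass

@dataclass
class FuzzyPattern:
    """A fuzzy graph pattern Q = (Vq, Eq, FV, RE) per Definition 5.2."""
    vertices: set   # Vq
    fv: dict        # FV: vertex -> constant, variable, or fuzzy condition
    re: dict        # RE: (ui, uj) -> path regular expression

# The pattern of Example 5.1: a high-rating action film, box office over
# 20 million, starring American-born actors.
q = FuzzyPattern(
    vertices={"?film", "?r", "?b", "?p", "action", "America"},
    fv={"?film": "?film",            # plain variable
        "?r": "?r is high",          # fuzzy condition with fuzzy term 'high'
        "?b": '?b > "20 million"',   # crisp comparison condition
        "?p": "?p",
        "action": "action",          # constant vertices
        "America": "America"},
    re={("?film", "?r"): "rating",
        ("?film", "?b"): "boxOffice",
        ("?film", "action"): "genre",
        ("?film", "?p"): "starring",
        ("?p", "America"): "birthPlace . locateIn+"},  # regex-labeled edge
)
```

Only the edge from ?p to America carries a non-trivial regular expression; the other edges are ordinary single labels, i.e., the special case e of the grammar above.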

5.2.3 Fuzzy Graph Pattern Matching

The notion of a graph pattern Q specifies the topological and content-based constraints chosen by the user. Next, we introduce the notion of fuzzy RDF graph pattern matching, which generalizes subgraph homomorphism with the evaluation of the RDF graph pattern. Intuitively, given a fuzzy RDF data graph G, the semantics of a graph pattern Q defines a set of matchings, where each matching (from variables of Q to URIs and literals of G) maps the pattern to a homomorphic subgraph of G.

5.2 Exact Pattern Match Query Over Fuzzy RDF Graph


Definition 5.3 (Fuzzy graph pattern matching) A fuzzy graph pattern Q = (V q, E q, F V, RE) matches a fuzzy RDF data graph G = (V, E, Σ, L, μ, ρ) with a satisfaction degree threshold δ t if there exists an injective mapping φ: Q → G, a total mapping from the vertices and regular expression edges of Q to the vertices and paths of G, such that:

(i) (matching vertices) Every vertex u ∈ V q has an image vertex φ(u) ∈ V under the injective function. More precisely, if u is a constant vertex (F V(u) ∈ U ∪ L), then φ(u) is a matching vertex associated with a satisfaction degree δ u = μ(φ(u)), and their labels match (i.e., L(u) = L(φ(u))); if u is a variable vertex (F V(u) ∈ VAR), then φ(u) ∈ (U ∪ L) ranges over the matches of the variable vertex u, i.e., the vertices φ(u) induced by all matches φ of Q in G, each associated with a satisfaction degree δ u = μ(φ(u)).
(ii) (checking conditions on vertices) For the fuzzy condition C of a vertex u ∈ V q, φ(u) satisfies the fuzzy condition with a satisfaction degree δ co defined as follows, according to the form of C:

• if C is of the form "?x op c", then φ(?x) satisfies the condition C to the degree δ co = μop(φ(?x), c). Here μop is the membership function of the fuzzy or crisp comparator. In particular, crisp comparison operators have a Boolean semantics: if the condition evaluates to true, the satisfaction degree is 1, otherwise 0.
• if C is of the form "?x op ?y", then φ(?x) and φ(?y) satisfy the condition C to the degree δ co = μop(φ(?x), φ(?y)).
• if C is of the form "?x is Fterm", then φ(?x) satisfies the condition C to the degree δ co = μFterm(φ(?x)). Here μFterm is the fuzzy membership function of the fuzzy term Fterm.
• if C is of the form C1 ∧ C2 or C1 ∨ C2, we use the usual interpretation of the fuzzy operator involved (minimum for the conjunction, maximum for the disjunction).

(iii) (matching edges) For each edge (ui, uj) ∈ E q, there exist two vertices φ(ui) and φ(uj) of V matching the vertices ui and uj respectively, and there is a path p in G from φ(ui) to φ(uj) such that p matches the regular expression re(ui, uj) with a satisfaction degree δ re(p), defined as follows, according to the form of re (in the following, R, R1 and R2 are regular expressions):

• re is of the form ε. If p is empty then δ re(p) = 1, else δ re(p) = 0.
• re is of the form e with e ∈ U (resp. "*"). If p is an edge e' from vertex φ(ui) to φ(uj), where e' = e (resp. e' ∈ U), then δ re(p) = ρ(φ(ui), φ(uj)), else δ re(p) = 0.
• re is of the form R1 · R2. We denote by P the set of all pairs of paths (p1, p2) such that p is of the form p1 p2. One has δ re(p) = maxP(min(δR1(p1), δR2(p2))).
• re is of the form R1|R2. One has δ re(p) = max(δR1(p), δR2(p)).


• re is of the form R+. Let P be the set of all tuples of paths (p1, …, pn) (n > 0) such that p is of the form p1 … pn. One has δ re(p) = maxP(min(δR(p1), …, δR(pn))).

(iv) (aggregating satisfaction degrees) The satisfaction degree of the overall query, denoted by δQ(G), is the aggregation of the satisfaction degrees of the elementary matchings and conditions from (i), (ii) and (iii). As for the aggregation function, different choices suit different applications. The minimum, for instance, is a cautious choice; it assumes a set of triples is only as satisfactory as its least satisfied triple. The median, a more optimistic choice, is another reasonable aggregation function. In this chapter, we use the minimum as the satisfaction degree aggregation function. Note that the satisfaction degree δQ(G) must be greater than δ t. If there is no matching, then δQ(G) = 0, i.e., G does not match Q.

Intuitively, when the graph pattern Q is evaluated on a data graph G, the result is a binary relation M ⊆ V q × V such that: (a) for each u ∈ V q, there exists v ∈ V such that (u, v) ∈ M; (b) for each edge (ui, uj) in E q with (ui, vi) ∈ M, there exists a nonempty path p from vi to vj in G such that (i) the vertex label L(vi) of vi satisfies the predicate condition specified by F V(ui); (ii) the path p is constrained by the regular expression re(ui, uj); and (iii) (uj, vj) is also in M. From this one can see that pattern queries are defined in terms of an extension of graph simulation (Henzinger et al., 1995), by (i) imposing query conditions on the labels of vertices; (ii) mapping an edge in a pattern to a nonempty path in a data graph; and (iii) constraining the edges on the path with a regular expression. This also differs from the traditional notion of graph pattern matching defined in terms of subgraph isomorphism (Gallagher, 2006).
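The max-min semantics of δ re can be sketched recursively. In this illustrative Python sketch (the encodings, names, and numeric edge degrees are ours, not the chapter's), a path is a list of (edge label, edge degree) pairs and a regular expression is a small syntax tree:

```python
# Satisfaction degree of a path w.r.t. a regular expression, following the
# "max over decompositions / min within a decomposition" rules above.
# An expression is a nested tuple: ('eps',), ('lab', e) with e a label or '*',
# ('cat', R1, R2), ('alt', R1, R2), ('plus', R).

def delta_re(re, path):
    kind = re[0]
    if kind == 'eps':                        # empty pattern matches empty path
        return 1.0 if not path else 0.0
    if kind == 'lab':                        # single edge, label or wildcard
        if len(path) == 1 and (re[1] == '*' or path[0][0] == re[1]):
            return path[0][1]                # rho of the matching edge
        return 0.0
    if kind == 'cat':                        # max over all splits p = p1 p2
        return max((min(delta_re(re[1], path[:i]), delta_re(re[2], path[i:]))
                    for i in range(len(path) + 1)), default=0.0)
    if kind == 'alt':                        # max of the two alternatives
        return max(delta_re(re[1], path), delta_re(re[2], path))
    if kind == 'plus':                       # one or more repetitions of R
        best = delta_re(re[1], path)
        for i in range(1, len(path)):        # split off a nonempty prefix
            best = max(best, min(delta_re(re[1], path[:i]),
                                 delta_re(re, path[i:])))
        return best
    raise ValueError(kind)

# The regular expression birthPlace . locateIn+ of Example 5.1:
expr = ('cat', ('lab', 'birthPlace'), ('plus', ('lab', 'locateIn')))
p = [('birthPlace', 0.9), ('locateIn', 0.75)]
print(delta_re(expr, p))  # min(0.9, 0.75) = 0.75
```

The recursion terminates because every recursive call either shrinks the expression or, in the R+ case, strictly shortens the path.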
Let us now come to the definition of the matching result. Since our primary focus is on fuzzy RDF graph matching, the above definition must be extended with the satisfaction degree: query evaluation returns a set of pairs rather than a set of mappings. Given a fuzzy RDF graph G, a query pattern graph Q, and a satisfaction degree threshold δ t (0 ≤ δ t ≤ 1), a graph pattern matching query returns the set of mapping pairs M = {(m, δ m) | m: V q → V ∧ δ m ≥ δ t}, where m is a mapping from variables of Q to URIs and literals of G and δ m denotes the satisfaction degree associated with the mapping. Note that a match M is a relation rather than a function. Hence, for each u in V q there may exist multiple vertices v in V such that (u, v) is in M, i.e., each vertex in Q is mapped to a nonempty set of vertices in G. We therefore refer to the relation M, grouped by vertices in V q, as a match in G for Q. There may be multiple matches in a graph G for a pattern Q. Nevertheless, below we show that there exists a unique maximum match in G for Q, that is, a unique match QM(G) in G for Q such that for any match M in G for Q, M ⊆ QM(G).


Proposition 5.1 For any data graph G and any graph pattern query Q, there is a unique maximum match QM(G) in G for Q.

Proof (1) By Definition 5.3, there exists a match that covers all the vertices in V q and is maximum: it is the union of all matches in G for Q. (2) We then show uniqueness by contradiction: if there existed two distinct maximum matches M1 and M2, then M3 = M1 ∪ M2 would be a match larger than both M1 and M2. By (1) and (2), Proposition 5.1 follows.

The task of the graph pattern matching problem is to find the set M of subgraphs of G that "match" the pattern Q. Problem formulations often require that Q represent a single connected graph and, therefore, that M be connected as well. A graph is connected if there exists some path between every pair of its vertices. To better illustrate the meaning of the maximum match, we introduce the result graph. A result graph Gr = (Vr, Er) is a graph representation of the maximum match QM(G) in G for Q, where (i) Vr is the set of vertices of G in M, and (ii) there is an edge er = (vi, vj) ∈ Er if and only if there is an edge (ui, uj) ∈ E q such that (ui, vi) ∈ M and (uj, vj) ∈ M. We use the following example to illustrate result graphs.

Example 5.2 Let us consider the fuzzy graph pattern Q of Example 5.1 and evaluate it against the fuzzy RDF data graph G of Fig. 5.2. The query also specifies a threshold δ t (δ t = 0.25 in the example), to indicate that only matches with a satisfaction degree larger than δ t should be returned. The matching process is as follows. Intuitively, this pattern retrieves a list of films in G, and the matching value of ?film is potentially Diner, Iron Man 2 and Chef. The actors starring in the three films are the American actors Mickey Rourke, Steve Gullenberg and Robert Downey Jr., respectively.
The three paths p1 = Jon Favreau—birthPlace—New York—locateIn—America, p2 = Steve Gullenberg—birthPlace—Florida—locateIn—America and p3 = Robert Downey Jr.—birthPlace—New York—locateIn—America match the regular expression RE, with satisfaction degrees δ re(p1) = 0.3, δ re(p2) = 0.4 and δ re(p3) = 0.75, respectively. However, the genre of the film Diner is comedy, so it is not an action movie, and the box office of the film Chef is 11 million, which does not satisfy the condition ?b > 20 million. So Iron Man 2 is the only movie that is an action movie, with satisfaction degree δ u("action") = 0.85, and whose box office is over 20 million, with satisfaction degree δ co("29 million") = 0.7. Suppose that μhigh(7.1) = 0.65; then the satisfaction degree of the condition ?r is high is 0.65, the minimum of the degrees induced by μhigh(7.1) and δ u(7.1). Moreover, the vertex labeled Iron Man 2 and the vertex labeled Jon Favreau in G match the vertices ?film and ?p in the pattern graph with satisfaction degree 1, respectively. The matching result graph is depicted in Fig. 5.3. As the overall satisfaction degree is the minimum of the degrees described above, we have δQ(G) = 0.3, which satisfies the minimum satisfaction degree threshold constraint.
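As a quick numeric check of the min-aggregation in Example 5.2 (the figures are those given above):

```python
# Elementary satisfaction degrees collected for Iron Man 2 in Example 5.2:
degrees = [
    0.85,  # genre matches 'action' (delta_u)
    0.7,   # box office "29 million" satisfies ?b > 20 million (delta_co)
    0.65,  # ?r is high, from mu_high(7.1)
    0.3,   # path p1 matching birthPlace . locateIn+ (delta_re)
]
delta_Q = min(degrees)   # minimum as the aggregation function
print(delta_Q)           # 0.3, above the threshold delta_t = 0.25
```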


Fig. 5.2 A fuzzy RDF data graph G inspired by IMDB

Fig. 5.3 Subgraphs of G matching Q

5.2.4 Query Evaluation Algorithms

In this section, we discuss implementation issues related to our proposal. We describe how to incrementally build a possible map set of queries by means of a backtracking algorithm, following an approach similar to that of Alkhateeb et al. (2009). In particular, we need to produce answers together with their satisfaction degrees.


RDF graph pattern matching in our scenario is essentially finding a homomorphism from the query graph Q to the data graph G. A feasible method for evaluating RDF graph patterns, i.e., enumerating all RDF homomorphisms from the pattern graph into the data graph, is based on a backtracking technique that generates each possible map from the current one by traversing the parse tree in a depth-first manner and using the intermediate results to avoid unnecessary computations. Specifically, the method consists of four algorithms. Algorithm 5.4 (Reach) computes reachable paths that match regular expressions. It is used by Algorithm 5.3 (Eva), which, given a fuzzy RDF graph and a triple pattern, calculates the set of maps that satisfy the triple pattern. The results of Algorithm 5.3 are used by Algorithm 5.2 (Candidates), which returns all possible candidate images in the data graph for the current vertex satisfying the partial map. Algorithm 5.1 describes the overall framework of candidate retrieval.

5.2.4.1 Pattern-Match Algorithm

Algorithm 5.1 illustrates a general framework for a pattern match query Q over a fuzzy RDF graph G; it is a recursive version of the basic backtracking algorithm (Golomb & Baumert, 1965). The input of this algorithm is an RDF graph pattern Q, an RDF graph G, and a partial map μp, which is a set of pairs {(<u, v>, δ)} such that u is a term of Q, v is the image of u in G and δ is the satisfaction degree associated with the mapping. If we call this algorithm with (Q, G, μ∅), where μ∅ is the map with the empty domain, then at the end of the algorithm we have all homomorphisms from the pattern graph Q into the fuzzy RDF graph G. The algorithm performs as follows:

Algorithm 5.1: Pattern-Match(Q, G, μp)
Input: an RDF graph pattern Q, a fuzzy RDF graph G, and a partial map μp.
Output: extends the partial map to a set of RDF homomorphisms.
1. if |μp| == |V q| then
2.   return μp;
3. pick a vertex u of V q;
4. for each <v, μ> ∈ Candidates(μp, u, G, Q) do
5.   Pattern-Match(Q, G, μp ▷◁ {(<u, v>, δ)} ▷◁ μ);

The procedure first checks in line 1 whether all terms of the pattern graph Q have been mapped. If so, the recursion stops and the complete solution is returned in line 2. Otherwise, the procedure chooses a term u ∈ V q to extend the partial homomorphism in line 3. After that, Pattern-Match takes each candidate v of the current term u ∈ V q together with the possible map μ, puts v into the mapping pairs, and tries to generate the possible candidates of the next term in lines 4–5. This is done recursively in a depth-first manner through the call of Pattern-Match (note that μp, {(<u, v>, δ)}, and μ are compatible, since the candidate set is calculated with respect to μp). At the end of the algorithm, we have a


tree with alternating levels: one level with a term from Q, i.e., a vertex from Q, and one level with the possible images of that term in G. The input to each vertex of each level is the current map. Each possible path in the tree from the root to a leaf labeled by a term of G represents a possible homomorphism.
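The control flow of Algorithm 5.1 is the classic backtracking scheme. The following Python sketch illustrates it on a drastically simplified setting: Candidates is replaced by a scan over single labeled data edges (no regular expressions, no fuzzy conditions), and the degree of a map is the minimum of the matched edge memberships. All names are ours.

```python
# Backtracking enumeration of homomorphisms from a pattern into a data graph,
# in the spirit of Algorithm 5.1. Pattern vertices starting with '?' are
# variables; other pattern vertices are constants that must match literally.

def image_ok(u, v, partial):
    if u.startswith('?'):
        return partial.get(u, v) == v        # variable: consistent with map
    return u == v                            # constant: labels must be equal

def pattern_match(pattern_edges, data_edges, partial=None, degree=1.0, results=None):
    if partial is None:
        partial, results = {}, []
    if not pattern_edges:                    # every edge matched: a solution
        results.append((dict(partial), degree))
        return results
    (u1, lab, u2), rest = pattern_edges[0], pattern_edges[1:]
    for (v1, l, v2, rho) in data_edges:      # candidate images of the edge
        if l != lab or not image_ok(u1, v1, partial) or not image_ok(u2, v2, partial):
            continue
        saved = dict(partial)
        for u, v in ((u1, v1), (u2, v2)):
            if u.startswith('?'):
                partial[u] = v
        pattern_match(rest, data_edges, partial, min(degree, rho), results)
        partial.clear(); partial.update(saved)   # backtrack
    return results

# A toy data graph loosely inspired by Example 5.2 (degrees invented):
G = [("IronMan2", "genre", "action", 0.85),
     ("IronMan2", "starring", "RDJ", 1.0),
     ("Diner", "genre", "comedy", 0.9)]
Q = [("?film", "genre", "action"), ("?film", "starring", "?p")]
print(pattern_match(Q, G))
# [({'?film': 'IronMan2', '?p': 'RDJ'}, 0.85)]
```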

5.2.4.2 Candidates Algorithm

Algorithm 5.2 calculates all possible candidate maps in G for the current term u satisfying the partial map μp. It returns all pairs <v, μ> such that v is a possible image of u, and μ is the possible map from the terms of each regular expression pattern Ri appearing in a triple with u and one of the terms of V q already mapped in μp. That is, if there is no term of V q involved in a triple with u, then the possible candidate images of u are all v in G such that u can be mapped to v. Otherwise, there exists a set of terms x1, …, xk ∈ V q involved in a triple with u, which are already mapped in μp. In this case, the maps of xi and v must satisfy μ(Ri), where Ri is the regular expression pattern appearing in the predicate position of the triple between xi and u. The order in which the two mapped vertices must satisfy μ(Ri) depends on the order in which u and xi appear in the triple: if the triple is <u, Ri, xi>, then <v, μ(Ri), μp(xi)> must hold in G; otherwise <μp(xi), μ(Ri), v> must hold in G. Here μ maps the terms appearing in the regular expression patterns of Q into the terms appearing along the paths in G with respect to μp, that is, μ is a possible map such that μ and μp are compatible. At the beginning, collection Ts stores the triples of Q in which u appears in the subject position and the other term is already mapped in μp (line 1), and To symmetrically stores the triples in which u appears in the object position (line 2). If both Ts and To are empty, we calculate the candidate matching information according to the type of u in lines 4–10: if u is a simple variable and u is not mapped in μp, the candidates are all v in G such that u can be mapped to v (line 5); otherwise, the candidate is μp(u) (line 6). If u is a constant or a conditional expression, a candidate matching result is obtained according to the matching operation (line 9).
After that, the algorithm checks whether the edges between u and already matched query vertices of Q have corresponding paths between v and already matched data vertices of G in lines 12–15. It calls Eva to check whether the maps of xi and v satisfy μ(Ri), and obtains a temporary candidate set; at the same time, it removes the processed triple from Ts or To in lines 13 and 15. Next, the algorithm refines the candidates against the remaining triples of Ts and To, updating the status information in lines 17 and 19. Finally, the candidates are returned in line 20. The results of Algorithm 5.2 are used to calculate the RDF homomorphisms of a graph pattern Q into an RDF graph G by successive joins in Algorithm 5.1.
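The compatibility check and the merge operation on maps, used in lines 13, 17 and 19 above and again inside Eva and Reach, can be sketched as follows (an illustrative Python fragment; the dictionary union below plays the role of the join on compatible maps):

```python
def compatible(mu1, mu2):
    """Two maps are compatible iff they agree on every shared term."""
    return all(mu2.get(x, v) == v for x, v in mu1.items())

def merge(mu1, mu2):
    """Union of two compatible maps (the join operator on maps)."""
    assert compatible(mu1, mu2)
    return {**mu1, **mu2}

m1 = {"?x": "a", "?y": "b"}
m2 = {"?y": "b", "?z": "c"}
print(merge(m1, m2))   # {'?x': 'a', '?y': 'b', '?z': 'c'}
```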


Algorithm 5.2: Candidates(μp, u, G, Q)
Input: A fuzzy RDF graph G, a vertex u from graph pattern Q and a map μp.
Output: The set of pairs <v, μ> such that v is a possible image of u in G, and μ extends μp to the vertex u.
1. Ts ← {<u, Ri, xi> | <u, Ri, xi> ∈ Q and xi ∈ dom(μp)};
2. To ← {<xi, Ri, u> | <xi, Ri, u> ∈ Q and xi ∈ dom(μp)};
3. if Ts == ∅ and To == ∅ then
4.   if u is a variable then
5.     if u ∉ dom(μp) then c ← {<v, μp> | v ∈ V};
6.     else if μp(u) ∈ V then c ← {<μp(u), μp>};
7.     else c ← ∅;
8.   else
9.     if u ∈ V then c ← {<u, μp>};
10.    else c ← ∅;
11. else
12.   if Ts ≠ ∅ then
13.     pick <u, Ri, xi> ∈ Ts; t ← {<s, μ> | (s, o, μ) ∈ Eva(μp, u, Ri, μp(xi), G)}; Ts ← Ts \ {<u, Ri, xi>};
14.   else
15.     pick <xi, Ri, u> ∈ To; t ← {<o, μ> | (s, o, μ) ∈ Eva(μp, μp(xi), Ri, u, G)}; To ← To \ {<xi, Ri, u>};
16. for each <u, Ri, xi> ∈ Ts do
17.   c ← {<v, μ'> | <v, μ1> ∈ t, (v, o, μ2) ∈ Eva(μp, u, Ri, μp(xi), G), μ1, μ2 are compatible, and μ' ← merge(μ1, μ2)}; t ← c;
18. for each <xi, Ri, u> ∈ To do
19.   c ← {<v, μ'> | <v, μ1> ∈ t, (s, v, μ2) ∈ Eva(μp, μp(xi), Ri, u, G), μ1, μ2 are compatible, and μ' ← merge(μ1, μ2)}; t ← c;
20. return c;

5.2.4.3 Evaluate Algorithm

Algorithm 5.3 calculates the set of maps μ such that the triple <ui, R, uj> is satisfied in G with the map μ (we say that μ satisfies <ui, R, uj> in G). The results of Algorithm 5.3 are used to calculate the candidate homomorphisms in Algorithm 5.2. The algorithm first checks ui. If ui is a constant, i.e., a URI or a literal, the result set is obtained by calling the function Reach in line 2, with ui itself as the start vertex. Otherwise, ui is a variable, and the result set is computed by calling Reach once for each vertex s of G in line 4, passing the map pair (ui ← s) as an argument. Algorithm 5.3 then checks uj along the same lines. If uj is a constant, the result set is restricted to the triples (s, uj, μ) in lines 5–6, where s ∈ V. Otherwise, the result set is the set of triples (s, o, μ') in line 8, where μ' ← μ ▷◁ (uj ← o). Finally, the map results are returned in line 9.


Algorithm 5.3: Eva(μp, ui, R, uj, G)
Input: A fuzzy RDF graph G, a graph pattern triple <ui, R, uj>, and a partial map μp.
Output: The set of maps μ satisfying the triple <ui, R, uj> in G.
1. if ui is a constant then
2.   S ← Reach(G, R, ui, ∅);
3. else
4.   S ← ∪s∈V Reach(G, R, s, {(ui ← s)});
5. if uj is a constant then
6.   S ← {(s, o, μ) ∈ S | o = uj};
7. else
8.   S ← {(s, o, μ') | (s, o, μ) ∈ S, (μ, (uj ← o)) are compatible, and μ' ← μ ▷◁ (uj ← o)};
9. return {μ | (s, o, μ) ∈ S};

5.2.4.4 Reach Algorithm

Regular path queries have been studied and used for querying databases and semistructured data. Liu et al. (2004) presented the algorithm Reach, which includes complete algorithms and data structures for directly and efficiently solving existential and universal parametric regular path queries. Given a graph G, a regular expression R, and a start vertex v0 in G, the authors consider a graph to be a set G of labeled edges of the form <v1, el, v2>, with source and target vertices v1 and v2 respectively and edge label el. They calculate Reach(G, R, v0, μi), called the reach set, which is the set of triples <v, s, μ> such that some path from v0 to v in G matches some path from s0 to s in R under map μ. The principle of the algorithm is based on the following two rules:

Rule 1: if <v0, el, v> ∈ G, <s0, tl, s> ∈ R and μ ∈ match(tl, el), then <v, s, μ> ∈ Reach(G, R, v0, μi);
Rule 2: if <v, s, μ> ∈ Reach(G, R, v0, μi), <v, el, v1> ∈ G, <s, tl, s1> ∈ R, μ1 ∈ match(tl, el) and μ2 = merge(μ, μ1), then <v1, s1, μ2> ∈ Reach(G, R, v0, μi).

Here, match(tl, el) is the set of minimal substitutions μ such that el matches tl under μ. In order to realize reachability queries over fuzzy RDF regular paths, we propose a path reachability algorithm based on this method. Algorithm 5.4 describes the detailed process, which computes all triples <v0, v, μ> such that there is some path from v0 to vertex v that matches some path from s0 to a final state of A under map μ, with a satisfaction degree δ. In Algorithm 5.4, H is the set of triples already considered for the reach set, W is the worklist of triples yet to consider, and E is the matching result; we compute Reach(G, R, v0, μi) by repeatedly adding triples according to the two rules above. We use an adjacency list to store the adjacency information of each vertex of the fuzzy RDF graph, i.e., a list of triples (vertex ID, edge label, edge membership degree) ordered by vertex ID. We use nested arrays, hash tables, or combinations of them for H and W, as well as for S.


Algorithm 5.4: Reach(G, R, v0, μi)
Input: A fuzzy RDF graph G, a regular expression R, and a start vertex v0 in G.
Output: {<v0, vk, μ>}
1. Construct the NDFA A of R, with state set S, start state s0 and final state set F;
2. Initialize reach set H, worklist W, query result E and the satisfaction degree δ;
3. for <v0, el, v1> ∈ G
4.   for <s0, tl, s1> ∈ A
5.     for μ in match(tl, el)
6.       W ← W ∪ {<v1, s1, μ>};
7. while there exists <v, s, μ> in W
8.   H ← H ∪ {<v, s, μ>}; W ← W − {<v, s, μ>};
9.   for <v, el, v1> ∈ G
10.    for <s, tl, s1> ∈ A
11.      if μ1 ∈ match(tl, el) then
12.        μ2 ← (μ ▷◁ μ1); δ ← min(δ(v), δ(v1), δ(el));
13.        if <v1, s1, μ2> ∉ H then
14.          W ← W ∪ {<v1, s1, μ2>};
15.        if s1 ∈ F then
16.          E ← E ∪ {<v0, v1, μ2>};
17. return E;

This algorithm calculates the set of triples <v0, vk, μ>, where vk is a vertex of G and μ is a map from the terms of R into the terms of G such that there exists a sequence T = (v0, …, vk) of vertices of G and a path label ω ∈ L(R), with T being a path labeled ω in G according to μ. We convert the regular expression pattern straightforwardly into a nondeterministic finite automaton, denoted NDFA (Holub & Melichar, 1998), in line 1. An automaton is a set A of labeled transitions of the form <s1, tl, s2>, with source and target states s1 and s2 respectively and transition label tl, together with a finite state set S, a start state s0, and a final state set F ⊆ S. To construct an NDFA that generates a language equivalent to a given regular expression, we proceed as described in (Aho & Hopcroft, 1974). We then initialize the reach set H, the worklist W and the query result E in line 2. We compute the possible maps by adding the triples yet to consider into the worklist W according to Rule 1 in lines 4–6. Given an edge label el and a transition label tl, match(tl, el) in line 5, which takes a set of symbols as an implicit argument, is the set of minimal substitutions μ such that el matches tl under μ. Each triple taken from the worklist is added to the set of triples already considered for the reach set, and the worklist is updated, in lines 7–8. We map a pair <v, s> to the set of triples <v1, s1, μ2> such that <v, el, v1> is in G, <s, tl, s1> is in A and μ1 ∈ match(tl, el), according to Rule 2 in lines 9–12. When a mapping is dynamically constructed, we add it to the array of mappings if it is not already present; to check this efficiently, we can maintain a nested array structure representing all previously constructed mappings. We simply check whether el matches tl under each of the extensions in line 11. In case of a match, we merge the mapping with the previously constructed mappings, and we calculate the satisfaction degree after the connection in line 12. If the extended mapping is


not in the set H, we add the result to the worklist W in lines 13–14. If s1 is a final state (s1 ∈ F), we add the matching result to E in lines 15–16. Finally, E is returned in line 17.
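Ignoring variable substitutions (match/merge), the core of Algorithm 5.4 is a worklist-driven product construction between the graph and the automaton, with degrees combined by minimum along a run and by maximum across runs. The following Python sketch is an illustration under those simplifying assumptions; the NFA encoding, the example data and all names are ours:

```python
from collections import deque

def reach(graph, nfa, v0):
    """Fuzzy reach set from v0: {target vertex: best degree of an accepted path}.

    graph: iterable of (v1, label, v2, rho); nfa: (transitions, s0, finals)
    with transitions an iterable of (s1, label, s2); '*' matches any label.
    """
    transitions, s0, finals = nfa
    best = {}                                  # (v, s) -> best degree so far
    work = deque([(v0, s0, 1.0)])
    results = {}
    while work:
        v, s, d = work.popleft()
        for (v1, el, v2, rho) in graph:
            if v1 != v:
                continue
            for (s1, tl, s2) in transitions:
                if s1 != s or (tl != '*' and tl != el):
                    continue
                d2 = min(d, rho)               # min along the run (Rule 2)
                if d2 > best.get((v2, s2), 0.0):   # re-explore on improvement
                    best[(v2, s2)] = d2
                    work.append((v2, s2, d2))
                    if s2 in finals:           # max across accepting runs
                        results[v2] = max(results.get(v2, 0.0), d2)
    return results

# NFA for birthPlace . locateIn+ and a fragment of the graph of Example 5.2
# (edge degrees invented for the illustration):
nfa = ([("q0", "birthPlace", "q1"), ("q1", "locateIn", "q2"),
        ("q2", "locateIn", "q2")], "q0", {"q2"})
graph = [("JonFavreau", "birthPlace", "NewYork", 0.3),
         ("NewYork", "locateIn", "America", 0.9)]
print(reach(graph, nfa, "JonFavreau"))   # {'America': 0.3}
```

The loop terminates because a (vertex, state) pair is re-enqueued only when its degree strictly improves, and the degrees are minima over a finite set of edge memberships.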

5.2.4.5 Correctness and Complexity

Proposition 5.2 Algorithm 5.1 is correct and complete for enumerating all RDF homomorphisms from a given pattern graph into a fuzzy RDF graph.

Proof We prove this by induction. At the beginning of the algorithm, the set of all homomorphisms is trivially complete for the empty domain. Because Algorithm 5.4 is complete (Alkhateeb et al., 2009) and the number of vertices is finite, the partial homomorphism μp is completely extended for the current vertex at each step. Finally, the procedure ends with a homomorphism mapping for each vertex in Q.

The Reach algorithm considers each triple in W and H, iterates over all outgoing edges of v and outgoing transitions of s, and computes a match and possibly a merge, taking time O(predicatesize) and O(vars(Ri)), respectively, in each iteration. The factor maps arises because only substitutions that are the third component of a triple in W and H, i.e., that match some path from v0 in G with some path from s0 in R, are considered. So the Reach algorithm has worst-case running time O(|G| × |R| × maps × (predicatesize + vars(Ri))). For each triple in Q, the Reach algorithm is called by the Evaluate algorithm once if ui is a constant; otherwise it is called for each vertex in G, multiplied by the number of variables of Q in the subject position. So the Eva algorithm has overall time complexity O((vars(Q) × subj(G) + const(Q)) × |G| × |R| × maps × (predicatesize + vars(Ri))), where vars(Q) and const(Q) are the numbers of variables and constants appearing in the subject position of a triple of Q. This result shows an exponential complexity O(pred(G)^vars(R)). However, vars(R) can be treated as a constant, since it is usually very small with regard to the data graph. Hence, the complexity of query evaluation is O(|G|²).

5.3 Approximate Fuzzy RDF Subgraph Match Query

At the core of many advanced RDF graph operations lies a common and critical graph matching primitive. In particular, as one of the most important topics in this area, efficiently finding all occurrences of a subgraph pattern has received considerable attention (Lian & Chen, 2011; Moustafa et al., 2014). Subgraph pattern matching is meaningful and useful in many applications. For example, answering SPARQL queries over an RDF database is actually equivalent to conducting subgraph isomorphism matching over graphs, in which users need to pose a query with strict conditions over the database. Nevertheless, as users are often unclear about the contents and the data distribution of the database, such a strict query often leads to the Few Answers


Problem (Yan et al., 2017): the user query is too selective and returns too few answers. In the worst case, some queries yield no matching results at all. More importantly, classical SPARQL querying assumes that RDF data are certain and accurate and does not consider fuzzy information in the querying process. This motivates us to investigate fuzzy subgraph matching techniques suitable for query answering, which relax the rigid structural and label matching constraints of subgraph isomorphism and other traditional graph similarity measures.

In order to efficiently answer subgraph pattern queries over the fuzzy RDF data graph, inspired by the path-join methods introduced in (Virgilio et al., 2015; Moustafa et al., 2014; Zhao & Han, 2010), we choose the path instead of the vertex as the basic matching unit and propose a new path-based solution. The process of fuzzy RDF subgraph pattern matching is as follows: the pattern graph is first decomposed into a set of paths that start from a root vertex and end at a destination vertex, then these paths are matched against the data graph, and the candidate paths that best match the query paths are finally reconstructed to generate the answer. At the same time, we calculate the path match membership (referring to an absolute possibility of a match) and then aggregate the memberships into an overall match membership above a given threshold during the query evaluation process.

5.3.1 Problem Definition

5.3.1.1 Path in Fuzzy RDF Graph

In the context of an RDF graph, different paths denote different semantic relationships between vertices. For an RDF graph, a root vertex is a vertex with indegree (number of incoming edges) zero, while a destination vertex is a vertex with outdegree (number of outgoing edges) zero. A path whose starting vertex is a root is called an absolute path. If there is no root vertex in the RDF graph, the starting vertex of a path is the vertex with the largest difference between outdegree and indegree; we call such vertices hubs.

Definition 5.4 (path) Assume G = (V, E, Σ, L, μ, ρ) is a fuzzy RDF graph. A directed path p in G is defined as a finite sequence of distinct vertices p = v1, v2, …, vn such that vi ∈ V and (vi, vi+1) ∈ E for i ∈ [1, n − 1].

Because RDF graphs have a structure in which not only vertices but also edges have labels, a path expression of the RDF graph can be described as a vertex-edge alternating sequence. We use the path-label PL(p) to denote the sequence of all vertex/edge labels along the path p, i.e., PL(p) = L(v1), L(e1), L(v2), L(e2), …, L(vn−1), L(en−1), L(vn), where L is the function that assigns labels to vertices and edges.


In our work, path expressions can be extracted from an RDF graph G by breadth-first traversal starting from the roots. For each step, the absolute path expressions from all roots to the current vertex, and the vertex itself, are output and stored in the relational tables path and resource, respectively (Matono et al., 2005).

Definition 5.5 (path subsumption) Given two paths p and p' in an RDF graph, p = v1, e1, v2, e2, v3, …, em−1, vm and p' = v'1, e'1, v'2, e'2, v'3, …, e'n−1, v'n with m ≥ n, if for every v'k, e'k ∈ p' there exist vi, ei ∈ p such that vi = v'k and ei = e'k, we say that p' is subsumed by p, denoted p' ⊆ p. Subsumption is important to decrease the number of paths that need to be considered.

Example 5.3 Let us consider the fuzzy RDF graph G in Fig. 5.4a. This graph has two root vertices (mid1 and mid2) and three destination vertices (country1, country2 and country3). Three examples of paths of G are: p1 = mid1—Title—Movies1, of length 1, p2 = mid2—Director—pid3—bornIn—City3, of length 2, and p3 = mid2—Director—pid3—bornIn—City3—locateIn—Country3, of length 3. Among them, p2 is subsumed by p3, namely p2 ⊆ p3.

Definition 5.6 (path join) Assume G = (V, E, Σ, L, μ, ρ) is a fuzzy RDF graph, and p = v1, v2, …, vm and q = v'1, v'2, …, v'n are two fuzzy directed paths in G. The join of p and q, denoted p ▷◁ q, is defined as the induced graph on the vertex set {v1, v2, …, vm} ∪ {v'1, v'2, …, v'n}, where the vertices of {v1, v2, …, vm} ∩ {v'1, v'2, …, v'n} ≠ ∅ are the intersection points between the paths p and q.

To preserve the structural information, the intersection points between paths are represented as join predicates that must be satisfied when the paths are joined into a full graph. For each pair of overlapping paths, the join predicates of p and q are defined as JoinPredicate(p, q) = {L(vi) = L(vj) | vi ∈ p, vj ∈ q}, i.e., p and q are joinable if they share at least one common vertex.
Example 5.4 Let us now consider the query graph Q in Fig. 5.4b, which has two root vertices (mid1 and mid2) and two destination vertices (City2 and tragedy). We decompose Q into three paths q1, q2, and q3 that start from a root vertex and end at a destination vertex:

q1 = mid2—Director—pid3—marriedTo—pid2—bornIn—City2
q2 = mid1—Starring—pid2—bornIn—City2
q3 = mid1—Genre—tragedy

The intersection points between the paths q1 and q2 are pid2 and City2, and the join predicates are JoinPredicate(q1, q2) = {(q1.pid2 = q2.pid2), (q1.City2 = q2.City2)}. In the same way, the intersection point between the paths q2 and q3 is mid1, and the join predicate is JoinPredicate(q2, q3) = {(q2.mid1 = q3.mid1)}.
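The join predicates of Example 5.4 can likewise be computed from the vertex labels the paths share. A minimal sketch, again assuming the hypothetical alternating-sequence encoding of paths (vertices at even positions):

```python
def join_predicates(p, q):
    """Intersection points of Definition 5.6: the labels of the vertices
    that paths p and q have in common."""
    vertices = lambda path: set(path[0::2])   # vertices sit at even positions
    return vertices(p) & vertices(q)

# The three query paths of Example 5.4:
q1 = ["mid2", "Director", "pid3", "marriedTo", "pid2", "bornIn", "City2"]
q2 = ["mid1", "Starring", "pid2", "bornIn", "City2"]
q3 = ["mid1", "Genre", "tragedy"]
```

`join_predicates(q1, q2)` yields {pid2, City2} and `join_predicates(q2, q3)` yields {mid1}, as stated in the example.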

5.3 Approximate Fuzzy RDF Subgraph Match Query


Fig. 5.4 An example of data and query graph

5.3.1.2 Subgraph Pattern Matching Over RDF Graph

A subgraph query identifies the occurrences of a query subgraph in the fuzzy RDF database graph. A query graph Q = (VQ, EQ, LQ) is an RDF graph in which each vertex v ∈ VQ is labeled with a label LQ(v) ∈ Σ. The query graph specifies the structural and semantic requirements that a subgraph of G must satisfy. Abstractly, a subgraph query takes a query graph Q as input, retrieves the data graphs that contain (or are similar to) the query graph, and returns the retrieved graphs or new graphs composed from them. We formally define subgraph matching over the fuzzy RDF database graph below.


Fig. 5.5 Query processing phase schematic diagram

Given a fuzzy RDF data graph G, a query graph Q with |VQ| ≤ |V|, and a user-specified satisfaction threshold δth ∈ [0, 1], a subgraph matching query comprises element (vertex and edge) matching, structure matching, and a match membership (the absolute possibility of a match). Its answer is a set of subgraphs M such that (1) each subgraph m ∈ M is similar to the query graph Q, and (2) its matching membership satisfies δm > δth.

Naively, this problem can be solved by directly performing traditional subgraph pattern matching over the RDF graph. However, two key issues must be addressed: how to effectively search for candidate subgraphs in the RDF graph, and how to effectively calculate the satisfaction degree of a match. We design corresponding solutions to both problems. For the first issue, Zhao and Han (2010) showed that paths are more suitable than trees and graphs as indexing patterns in large graphs: although trees and graphs preserve more structural information, their potentially massive size and expensive pruning cost can outweigh their advantage for search-space pruning. We therefore choose paths as the graph index during query processing. For the second issue, the membership of a match M on G is an aggregation of the memberships of a set of matching paths, namely the paths that contain all vertices in VM with correct labels as well as all edges in EM. In the remainder of this section, we show how to measure path similarity by calculating the path edit distance and how to calculate the satisfaction degree of a given match directly. This forms the basis for the algorithms discussed in Sect. 5.3.2, which further speed up fuzzy subgraph pattern matching.

5.3.1.3 Measuring Path Similarity

In order to compare the data paths to an input query path and decide which data path is most similar to it, we need a distance measure for paths. Similar to the string matching problem, where edit operations define the string edit distance (Wagner and Fischer, 1974), we define a path edit distance based on the idea of altering a path by means of edit operations until it equals the query path.

Definition 5.7 (Edit operation). Given an RDF path p, a basic path edit operation ω(p) on p is any of the following:

L(v) → σ, v ∈ V, σ ∈ ΣV: substituting an RDF entity or literal (i.e., replacing the label L(v) of vertex v by σ).
L(e) → σ', e ∈ E, σ' ∈ ΣE: substituting an RDF property (i.e., replacing the label L(e) of edge e by σ').
v → ε, v ∈ V: deleting an RDF instance or literal (i.e., deleting the vertex v from p).
e → ε, e ∈ E: deleting an RDF property (i.e., deleting the edge e from p).
ε → v, v ∈ V: inserting an RDF instance or literal (i.e., adding a vertex v into p).
ε → e = (v1, v2), v1, v2 ∈ V: inserting an RDF property between two existing vertices v1, v2 of p.

Here ε denotes an empty RDF entity, literal, or property. The six operations in Definition 5.7 are sufficient to transform any path p into any other path p'; it is therefore always possible to find a sequence of basic edit operations that performs the transformation.

Definition 5.8 (Edited path). Given an RDF path p and a sequence T = (ω1, ω2, …, ωn) of edit operations, the edited path T(p) is the path T(p) = ωn(…ω2(ω1(p))…).

To model the fact that certain edit operations are more likely than others, each basic path edit operation ωi is assigned a cost c(ωi). The cost c(ωi) varies with the type of edit operation and the nature of the RDF element involved (Gao et al., 2010).
For example, modifying a vertex label matters less than inserting a vertex, because the latter increases the semantic distance between paths. Clearly, how to determine the similarity of path components and how to define the costs of edit operations are the key issues. To keep the problem simple, in our work we fix the costs of the basic edit operations of insertion, deletion, and label modification to 1, 0.5, and 0, respectively. The total cost of the transformation of p into T(p) is c(T) = Σ_{i=1}^{n} c(ωi); in other words, the cost of an edited path is the sum of the costs of all edit operations in the sequence T. There is usually more than one sequence of edit operations that transforms a path p into another path T(p); for our path edit distance measure, we are interested in the sequence with the least cost.


Definition 5.9 (Path edit distance). Given two paths p and p', the path edit distance between p and p' is defined as dist(p, p') = min_{Ti} {c(Ti) | Ti is a sequence of path edit operations that transforms p into p'}.

According to this definition, the smaller the path edit distance between a data path and an input query path, the more similar they are. Intuitively, we calculate the graph similarity distance by computing alignments on the paths. It follows that a matching answer of Q over a data graph G is a set of matches of all the paths of Q that forms a connected component of G (Virgilio et al., 2015).
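Definition 5.9 can be computed with the standard dynamic program for weighted edit distance. The sketch below is an illustration over label sequences, with the cost parameters defaulted to the values fixed above (insertion 1, deletion 0.5, modification 0); the sequence encoding is an assumption of this sketch.

```python
def path_edit_distance(p, q, c_ins=1.0, c_del=0.5, c_sub=0.0):
    """Weighted edit distance of Definition 5.9 over label sequences,
    transforming path p into path q. The chapter fixes the costs of
    insertion, deletion, and label modification to 1, 0.5, and 0."""
    m, n = len(p), len(q)
    # d[i][j] = minimum cost of turning p[:i] into q[:j]
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * c_del          # delete the remaining elements of p
    for j in range(1, n + 1):
        d[0][j] = j * c_ins          # insert the missing elements of q
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if p[i - 1] == q[j - 1] else c_sub
            d[i][j] = min(d[i - 1][j] + c_del,      # delete p[i-1]
                          d[i][j - 1] + c_ins,      # insert q[j-1]
                          d[i - 1][j - 1] + sub)    # keep or relabel
    return d[m][n]
```

With the chapter's costs, identical paths are at distance 0, one extra element in the query costs 1 (insertion), and one extra element in the data path costs 0.5 (deletion).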

5.3.1.4 Calculating Matching Membership

In a classical RDF database, the answer to a query Q is definitely either true or false. In a fuzzy RDF database, however, the system computes the answers and, for each answer, a membership score representing its possibility. In a fuzzy RDF graph, the existential possibility associated with an element (a vertex or an edge) reflects the state of the world for that element alone; on the surface, each such possibility is a relative one, based on the assumption that the element exists independently. However, each element in the RDF graph depends on the graph structure, so the existential possibility of a substructure (such as a path or a subgraph) composed of basic elements must depend on the relative possibilities of those elements. For example, the existential possibility of a path is related to the relative possibility of each element (vertex and edge) in the path; we therefore regard this possibility as an absolute possibility. To calculate the absolute possibility (the overall membership) of a match, we must consider all the relative possibilities in the match; in general, it can be computed by aggregating them.

In a fuzzy RDF graph, we define three kinds of fuzzy structures, namely the triple structure, the path structure, and the graph structure. Their fuzziness memberships are defined as follows.

(i) The fuzziness membership of a single triple. In an RDF graph, every triple describes a directed edge labeled with p from the vertex labeled with s to the vertex labeled with o. The interpretation of each triple is that subject s has property p with value o. Thus, an RDF triple can be seen as a relationship from the subject vertex to the object vertex.
Hence, the absolute possibility of a triple can be computed by aggregating the possibilities of s, p and o. We introduce a membership aggregation function to calculate the fuzziness memberships for RDF triples.


Definition 5.10 Assume G = (V, E, Σ, L, μ, ρ) is a fuzzy RDF graph and t = (vi, ρ, vi+1) is a fuzzy RDF triple in G. The fuzzy membership δt of triple t is defined as

δt = tm(μ(vi), ρ(vi, vi+1), μ(vi+1))

where tm is an application-specific membership aggregation function. It should be pointed out that applications are free to choose a function that fits their use cases. The minimum, for instance, is a cautious choice: it assumes that the possibility of a triple is simply the possibility of its least possible item. The median is another reasonable aggregation function. In our work, we choose Zadeh's logical product (minimum) t-norm (Zou et al., 2014) for aggregating the relative possibilities.

(ii) The fuzziness membership of a single path. The concept of a fuzzy relationship plays a fundamental role in modeling a fuzzy graph. Let V be a set of vertices; a fuzzy relationship on V is a mapping ρ: V × V → [0, 1], where ρ(x, y) indicates the degree of relationship between x and y. The fuzzy relation ρ may be viewed as a fuzzy subset of V × V, which can be used to represent the relationship between vertices. An important operation on fuzzy relations is composition. In general, fuzzy relationship composition derives a new relationship from two existing ones.

Definition 5.11 (Zimmermann, 1996). Let V be a set of vertices. For i ∈ {1, 2, 3}, let μi be a function from Vi into [0, 1], and for i ∈ {1, 2}, let ρi be a function from Vi × Vi+1 into [0, 1], i.e., ρ1 and ρ2 are fuzzy relations on μ1 × μ2 and μ2 × μ3, respectively. The composition of ρ1 and ρ2, denoted ρ1 ◦ ρ2, is defined for all (u1, u3) ∈ V1 × V3 as (ρ1 ◦ ρ2)(u1, u3) = sup_{u2 ∈ V2} {ρ1(u1, u2) ∧ ρ2(u2, u3)}, where ∧ is the minimum.

To define the composition of more than two relationships, we apply the binary compose operator (◦) n − 1 times.
Starting with the first relationship ρ1, we compose succeeding relationships along the ordered chain of relationships with the binary operator; the result of one binary compose step is used as input for the next, until the last relationship ρn of the chain has been processed. Let P = (ρ1, ρ2, …, ρn) be an ordered chain of relationships. Then the n-ary composition of P is defined as compose(P) = (…(ρ1 ◦ ρ2) ◦ …) ◦ ρn. In this way, new relationships are generated indirectly via one or more intermediate relationships.

As discussed above, different paths of the RDF graph denote different relationships between vertices in a fuzzy RDF graph. A path of the RDF graph can be composed of several RDF triples as a vertex-edge alternating sequence. Hence, the absolute possibility of a matching path can be computed by aggregating the possibilities of the set of triples comprising it.
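The sup-min composition of Definition 5.11 can be sketched directly over finite relations. The dictionary encoding of relations used here is a hypothetical one chosen for illustration.

```python
def compose(rho1, rho2):
    """Sup-min composition of Definition 5.11: (rho1 ∘ rho2)(u1, u3) =
    sup over u2 of min(rho1(u1, u2), rho2(u2, u3)).
    Relations are dicts mapping vertex pairs to degrees in [0, 1]."""
    result = {}
    for (u1, u2), d1 in rho1.items():
        for (v2, u3), d2 in rho2.items():
            if u2 == v2:                       # chains through a shared middle vertex
                key = (u1, u3)
                result[key] = max(result.get(key, 0.0), min(d1, d2))
    return result

# Two small fuzzy relations composed through the middle vertices "b" and "c":
rho1 = {("a", "b"): 0.8, ("a", "c"): 0.4}
rho2 = {("b", "d"): 0.6, ("c", "d"): 0.9}
composed = compose(rho1, rho2)
```

Here the degree of ("a", "d") is max(min(0.8, 0.6), min(0.4, 0.9)) = 0.6, i.e., the supremum over both intermediate vertices.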


Definition 5.12 Assume G = (V, E, Σ, L, μ, ρ) is a fuzzy RDF graph and p = v1, v2, …, vn is a fuzzy directed path in G. The fuzzy membership δp of path p is defined as

δp = tm(δt1, δt2, …, δtn)

where tm is an application-specific fuzziness membership aggregation function and δti is the membership of the i-th triple of p. As with the triple membership aggregation function in Definition 5.10, we choose the minimum t-norm for aggregating the relative possibilities. It is clear that δp = ρ(v1, v2) ∧ ρ(v2, v3) ∧ … ∧ ρ(vn−1, vn) ∧ μ(v1) ∧ μ(v2) ∧ … ∧ μ(vn), i.e., it is the minimum fuzzy value over the edges and vertices of the fuzzy path.

(iii) The fuzziness membership of a graph. The fuzziness membership of an RDF subgraph can be computed by aggregating the possibilities of the set of paths comprising the subgraph. Hence, we introduce a membership aggregation function for RDF subgraphs.

Definition 5.13 Assume G = (V, E, Σ, L, μ, ρ) is a fuzzy RDF graph and G' ⊆ G is a fuzzy subgraph joined from the set of paths P = (p1, p2, …, pn), i.e., G' = (p1 ⋈ p2 ⋈ … ⋈ pn). A membership aggregation function for fuzzy RDF subgraphs is a function tm that assigns each fuzzy RDF graph G' an aggregated fuzziness value δG' representing the fuzziness of G', defined as

δG' = tm(δp1, δp2, …, δpn)

where tm is an application-specific membership aggregation function. In our work, we choose the minimum, i.e., the overall membership value δG' is the minimum of the membership degrees of the paths.
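With the minimum t-norm chosen throughout, Definitions 5.10, 5.12, and 5.13 collapse into three one-line aggregations. A minimal sketch (the function names are illustrative):

```python
def triple_membership(mu_s, rho_so, mu_o):
    """Definition 5.10 with Zadeh's minimum t-norm as the aggregation tm:
    the triple is only as possible as its least possible part."""
    return min(mu_s, rho_so, mu_o)

def path_membership(triple_memberships):
    """Definition 5.12: aggregate the memberships of the triples in a path."""
    return min(triple_memberships)

def graph_membership(path_memberships):
    """Definition 5.13: aggregate the memberships of the joined paths."""
    return min(path_memberships)
```

For instance, a triple whose subject and object are certain (membership 1) but whose edge has membership 0.3 receives δt = 0.3, and a subgraph joined from paths with memberships 0.3, 0.3, and 0.95 receives δG' = 0.3.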

5.3.1.5 Motivating Example

In this subsection, we illustrate path-based query processing with a simple motivating example. Assume that a user wants to find the excellent actors/actresses of Country1 who are married to the director of Movie2, where the genre of the movie the actor/actress starred in is tragedy. The query graph Q in Fig. 5.4b is one possible way to express this information need. The query also specifies a minimum threshold δth (δth = 0.25 in the example) to indicate that only matches with possibility larger than δth should be returned.


The query processing phase first decomposes the query graph Q into a set of paths that start from a root and end at a destination. In our example, we decompose Q into the three paths described in Example 5.4. The query method then extracts all the paths of the data graph G in Fig. 5.4a that align with these query paths, taking advantage of a special index structure built off-line. In our example, the following data paths of G are extracted:

p1 = mid2—Director—pid3—marriedTo—pid2—bornIn—City2 [0.3]
p2 = mid1—Starring—pid2—bornIn—City2 [0.3]
p3 = mid1—Genre—tragedy [0.95]

At the same time, the absolute matching membership of a match can be computed by aggregating the relative memberships in the match. For instance, the matching membership of path p2 is computed by taking the minimum of the three vertex label memberships (1, 1, 1) and the two edge memberships (0.85, 0.3), resulting in a match possibility of 0.3, which is above our cutoff of 0.25. Similarly, the memberships of the other two path matches, p1 and p3, are 0.3 and 0.95, respectively, and they also satisfy the minimum threshold constraint.

Finally, the candidate paths are suitably joined to generate the answer to the query. The process starts from the matches of one path and progressively adds matches of joining paths, based on the join predicates of the joining paths. In our example, the join predicates between the paths q1 and q2 are JoinPredicate(q1, q2) = {(p1.pid2 = p2.pid2), (p1.City2 = p2.City2)}, as described in Example 5.4. Therefore, paths p1 and p2 are joined by merging pid2 and City2. In the same way, paths p2 and p3 are joined by merging vertex p2.mid1 and vertex p3.mid1. Thus, we obtain the induced subgraph A of G in Fig. 5.4a enclosed by dashed lines, which is a possible match for the query Q. The matching membership of that potential answer is computed by taking the minimum of the three path memberships (0.3, 0.3, 0.95).
The induced graph contains a match for the query Q with possibility 0.3, which satisfies the minimum threshold constraint, and so it is the only answer to our query. We thus tackle the problem of querying the RDF graph by finding the combinations of paths of the data graph that best align with the paths of the query graph.
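The membership arithmetic of this example can be checked in a few lines; the individual vertex and edge memberships are the ones the text reads off Fig. 5.4a.

```python
# Membership of path p2 = mid1-Starring-pid2-bornIn-City2: the minimum of
# the vertex memberships (1, 1, 1) and the edge memberships (0.85, 0.3).
delta_p2 = min(1.0, 0.85, 1.0, 0.3, 1.0)

# Membership of the full match: the minimum of the three path memberships.
delta_match = min(0.3, delta_p2, 0.95)

# Only matches above the user-specified threshold are returned.
delta_th = 0.25
accepted = delta_match > delta_th
```

This reproduces the example's figures: δp2 = 0.3 and an overall match possibility of 0.3, above the cutoff of 0.25.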

5.3.2 The Matching Algorithm

In this section, we introduce the approximate subgraph matching algorithm. We first give an overview of the algorithm in Sect. 5.3.2.1, then describe the graph matching processing algorithm in detail in Sect. 5.3.2.2, and finally analyze its complexity in Sect. 5.3.2.3.


5.3.2.1 Overview of Our Approach

Based on the above analysis, we propose a path-based solution to fuzzy RDF subgraph matching. The approach is composed of two main phases:

1. Data Preprocessing: Graph traversal is very time-consuming if it is performed on every user interaction. We therefore build an indexing structure containing information about the vertices and edges of the fuzzy RDF data graph; the graph indexing is executed only once, independently of user interactions. Based on the observation that paths are more suitable than trees and graphs as indexing patterns in large graphs (Zhao & Han, 2010), we propose a novel graph indexing method, context-aware path indexing, to capture information about the graph paths and their membership degrees, enabling efficient retrieval of candidate matches. An optimization strategy of starting only from the root vertices is then considered. To extract the set of all paths that reach a given vertex v, we explore G from the roots using breadth-first search. At each step, the corresponding path expressions from all roots to the current vertex, and the vertex itself, are output and stored in the path table and the resource table, respectively. The resource table is used to locate the destination vertex of each candidate path, so that, given a vertex v in the query graph, we can easily determine its candidate paths. The path table enables us to skip the expensive graph traversal at runtime. To increase the efficiency of path-based query processing, we introduce reverse path expressions and build a B+ tree index on the path table. In addition, we precompute and store the underlying membership degree of each path by applying the corresponding aggregation functions specified in Sect. 5.3.1.4. Thus, the path index contains all reverse absolute arc-path expressions from the current vertex to all roots in the fuzzy RDF graph, each with an aggregated membership degree δ.

2. Query Processing: This is the subgraph matching phase, which consists of three sub-phases, namely path decomposition, finding candidate paths, and joining candidate paths. Figure 5.5 illustrates the general framework for a pattern match query Q over a fuzzy RDF graph G. We briefly present each step in the following.

• Path Decomposition. In this step, we partition the query graph into a set of paths Q = {q1, q2, …, qk} by a decomposition algorithm. To facilitate the reconstruction of answer subgraphs, we employ a k-partite intersection graph to preserve the structural information of the graph query. In the k-partite intersection graph, a vertex corresponds to a query path q, while an edge (qi, qj)


means that the paths qi and qj share at least one common vertex, i.e., qi and qj are joinable and there is at least one intersection point between them. Moreover, intersection points between the paths are expressed as join predicates, which have to be satisfied when combining (reconstructing) path matches into a full query match.

• Finding Path Candidates. For each query path q ∈ Q, we first check the fuzziness membership (the membership degree must be greater than or equal to the user-specified threshold δth) to obtain a set of qualified candidate matches among the indexed paths of the data graph G. We then use the path edit distance dist(q, p) between query path q and data path p to further filter the remaining match set. Using the latter, the system generates from G all paths that are good candidates for the query paths.

• Combination. In this step, we obtain the full graph matches by reconstructing candidate paths using a graph exploration algorithm, which performs message passing in the k-partite intersection graph, where each partition corresponds to a path in the query decomposition. The result is a set of approximate subgraphs included in G, generated by joining the candidate paths that match the paths in the decomposition. In the end, the actual matching answers are ranked according to path edit distance, and the user can explore these subgraphs to get more information about the vertices.

5.3.2.2 Graph Matching Processing

In this section, we discuss how graph matches are processed. Given a query graph Q, we first study how Q can be split into a set of paths, among which paths with good selectivity are selected as candidates. Q is then reconstructed by joining the selected candidate paths until every edge in Q has been examined at least once. We discuss each step of the query processing in the following subsections.

1. Query Path Decomposition

Given a query graph Q, the main task of query decomposition is to split Q into a set of possibly overlapping paths, denoted P, that cover the entire query, by traversing the whole query graph. Since finding a least-cost path decomposition based on the number of operations involved in producing the final result is too costly, we use a simple path decomposition method to reduce the query search space and improve efficiency. The idea is simple: the set of all paths from a vertex s to another vertex t is the intersection of the set of all paths starting at s and the set of all paths ending at t. The task of path decomposition is thus to split the query into a set of possibly overlapping paths, each of length L or less, that cover the entire query and whose matches can be obtained from the path index.


The principle for decomposing query graph Q into a set of paths P is to start the exploration of Q from a root using breadth-first search and extract all paths that start from the root and end at a destination, whose matches can be obtained from the path index of the data graph G. To preserve the structural information of the query, the elements of P are organized as a k-partite intersection graph. We are now ready to implement the function that lists all paths between a pair of vertices. The implementation is straightforward: consider the reversed graph of Q, find the paths beginning at the given vertex, and return the reverse of each path. The code below finds all paths between every pair of root and destination.

In Algorithm 5.5, we begin by initializing the set of paths in line 1 and extracting the root vertices of Q in line 2. We then call the function findpath for each root vertex to obtain the paths and add them to P in lines 3–4. Finally, we establish a k-partite intersection graph to keep the structural information of the query graph Q in lines 5–9, in which we obtain the intersection points and join predicates between paths q and q'.

Function findpath shows the main algorithm of query path decomposition, which operates in three stages. In the first stage, we initialize all variables: PathSet, used to store the decomposed path set, is initialized to empty in line 1; π[v] stores the parent vertex of v, and the parent of the root vertex s is set to NIL in line 2; the queue Queue stores visited vertices in line 3. In the second stage, the breadth-first search develops a spanning tree (a breadth-first search tree) with the source vertex s as its root. The parent or predecessor of any other vertex in the tree is the vertex from which it was first discovered; for each vertex v, the parent of v is stored in π[v]. After initialization, the source vertex is discovered.
Line 4 initializes Queue to contain just the root vertex s. Lines 6–9 remove the vertex u from the queue, insert each new vertex v adjacent to u into the queue, and build the search tree. While creating the search tree, we check whether each vertex adjacent to u has already been visited; if not, we insert it into the queue in line 10. The breadth-first search traversal terminates when the queue is empty, i.e., when every vertex has been fully explored. In the last stage, we obtain the paths from the source vertex s to the destination vertices t in lines 11–17. The breadth-first search builds a search tree containing all vertices reachable from s; the set of edges in the tree contains (π[v], v) for all v with π[v] ≠ NIL. If a destination vertex v at the bottom of the tree is reachable from s, then there is a unique reverse path of tree edges from v to s. We return the path set PathSet in line 18.


Algorithm 5.5: Decomposition
Input: The query graph Q.
Output: The query path set P and k-partite intersection graph.
1: P ← { };
2: S ← FS(Q);
3: foreach s ∈ S do
4:   P ← P ∪ findpath(Q, s);
5: foreach q ∈ P do
6:   foreach q' ∈ P − {q} do
7:     if ((JoinPredicate(q, q') ← L(q) ∩ L(q')) != null) then
8:       J(q) ← J(q) ∪ {q'};
9:       Pathset.put((q, J(q), JoinPredicate(q, q')));

Function findpath(Q, s)
1: PathSet ← { };
2: π[s] ← NIL;
3: Queue ← { };
4: ENQUEUE(Queue, s);
5: while (!Queue.isEmpty())
6:   u ← DEQUEUE(Queue);
7:   foreach ((v ← getUnvisitedAdjacentVertex(u)) != null) do
8:     v.visited ← true;
9:     π[v] ← u;
10:    ENQUEUE(Queue, v);
11: T ← FD(Q);
12: foreach t ∈ T do
13:   p ← { };
14:   while (t != NIL)
15:     p.add(t);
16:     t ← π[t];
17:   PathSet ← PathSet ∪ p;
18: return PathSet;
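A runnable Python sketch of function findpath, under the assumptions that the query graph is stored as an adjacency map from each vertex to its (edge label, child) pairs and that destinations are vertices without outgoing edges; as in the pseudocode, the BFS tree yields one root-to-destination path per destination by walking the parent pointers π back to the root.

```python
from collections import deque

def find_paths(graph, root):
    """BFS sketch of function findpath: build a breadth-first search tree
    from `root`, then read one root-to-destination path off the parent
    pointers for every destination (a vertex with no outgoing edges).
    `graph` maps each vertex to a list of (edge_label, child) pairs."""
    parent = {root: None}                 # π[v]: (predecessor, edge label)
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for edge, v in graph.get(u, []):
            if v not in parent:           # visit each vertex only once
                parent[v] = (u, edge)
                queue.append(v)
    paths = []
    destinations = [v for v in parent if not graph.get(v)]
    for t in destinations:
        path, v = [t], t
        while parent[v] is not None:      # walk the tree edges back to root
            u, edge = parent[v]
            path += [edge, u]
            v = u
        paths.append(path[::-1])          # reverse into root-to-destination order
    return paths
```

On the root mid1 of the query graph of Example 5.4, this returns the paths q2 = mid1—Starring—pid2—bornIn—City2 and q3 = mid1—Genre—tragedy.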

2. Finding Path Candidates

After the query graph Q has been decomposed, the next step is to find candidate matches for every query path. This step must solve two problems: how to get all possible paths from G for a query path q, and how to decide whether a found path is a good approximation of q. Extracting from G the paths that are similar to the query paths is important. Every query path q ∈ PathSet has two specific labeled vertices: the root vertex, denoted s(q) ∈ V(Q), and the destination vertex, denoted t(q) ∈ V(Q). If the destination vertex t is specified, we can find its correspondents in the data graph G by accessing the extended labels L(v) of every v ∈ V(G), using a label similarity that can discover the common meaning of the given labels of two


vertices. Thus, every destination vertex t(q) has a set of similar vertices from G, denoted Mt(q) = {v | v ∈ V(G), L(v) = L(t)}. The goal here is clear: for every query path q, using the vertices in Mt(q), we search the indexed paths of the RDF graph G and discover a set of candidate paths, denoted CandidateSet(q), which represent approximations of the query path q. To reconstruct the query Q efficiently, we build a set for every element q ∈ PathSet and group into it all the paths p of the data graph G whose destination vertex matches the destination vertex of q; thus, each path in a set maps to its counterpart path q of PathSet.

To build the answer subgraphs, the approximate candidate paths participating in this construction must be computed. Since many false positives arise while examining candidate paths, we need to prune them first. For every path q ∈ PathSet, we access the path index to get its candidate match set CandidateSet(q) by keeping only those paths that satisfy the following context criteria:

(i) We compute the path edit distance dist(q, p) between query path q and data path p. The main goal of the path edit distance is to decide whether a given path is a good approximation of q; the smaller the path edit distance between a query and a data path, the more similar they are. For a path p ∈ G, if dist(q, p) is the smallest, p is a candidate for the corresponding path q.

(ii) We obtain the fuzzy satisfaction degree δp of path p. The fuzziness membership δp of the single path p must be greater than or equal to δth, the user-specified fuzziness membership threshold.

Algorithm 5.6: Find Path Candidates
Input: The query path set PathSet, the data graph G.
Output: The candidate set CandidateSet.
1: foreach q ∈ PathSet do
2:   t ← L(q);
3:   cn ← φ;
4:   C ← getpaths(G, t);
5:   foreach p ∈ C do
6:     if δ(p) ≥ δth then
7:       m ← dist(p, q);
8:       cn.enqueue(p, m);
9:   CandidateSet ← CandidateSet ∪ {(q, cn)};
10: return CandidateSet;

Given a path q ∈ PathSet, we perform the above criteria tests to efficiently obtain the final list of candidates CandidateSet(q) from G, and extraneous paths in the data graph are automatically ignored. We are thus able to compute a set of ranked tuples containing the candidate paths of q and their path edit distances, with the tuples in each set ordered by path edit distance, lower distances first.
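A compact Python rendering of Algorithm 5.6, under the assumptions that the path index is a map from a destination label to (path, δ) pairs and that `dist` is a caller-supplied path-edit-distance function; the priority queue of candidates is realized with a heap ordered by distance.

```python
import heapq

def find_candidates(path_set, index, delta_th, dist):
    """Sketch of Algorithm 5.6: for each query path q, fetch the indexed
    data paths ending at q's destination label, drop those below the
    membership threshold, and rank the rest by path edit distance.
    `index` maps a destination label to a list of (path, delta) pairs."""
    candidates = {}
    for q in path_set:
        t = q[-1]                              # destination label of q
        ranked = []
        for p, delta in index.get(t, []):
            if delta >= delta_th:              # fuzziness-membership pruning
                heapq.heappush(ranked, (dist(q, p), delta, p))
        candidates[tuple(q)] = ranked          # priority queue by distance
    return candidates
```

As in the text, each candidate set behaves as a priority queue in which the path with the smallest edit distance comes first, and indexed paths whose membership falls below δth never enter it.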


Given the query path set PathSet (i.e., also the k-partite intersection graph) and a data graph G, we retrieve and select the paths of G ending at the destinations of the paths of PathSet, as shown in Algorithm 5.6. For each q ∈ PathSet, we first extract the destination vertex t of q in line 2. We then select, via the function getpaths in line 4, all possible paths p from the index of G that match t; this prevents a sequential scan of all paths in a large graph. After obtaining the possible path set C, we prune the false positives in line 6. At the same time, we compute the path edit distance of each p from q and insert p into the set cn in ascending order in lines 7–8. Finally, we insert cn into CandidateSet in line 9. The set CandidateSet is implemented as a map whose key is a path q from P and whose value is a set of all the paths p ending at the destination of q; each such set is implemented as a priority queue of paths, where the priority is determined by the path edit distance associated with each path.

3. Full Query Matches

The last step of the algorithm selects the most relevant paths and generates the full query matches by joining the paths with the lowest path edit distance from each set. The join order is determined by exploring the k-partite intersection graph, where vertices represent the retrieved paths and edges between paths mean that they have vertices in common. The join condition is that the number of join predicates between paths p and p' equals the number of join predicates between paths q and q', where q and q' are the paths corresponding to the sets in which p and p' were included, respectively. At the same time, a join operator that operates on fuzzy solution matches has to consider the fuzzy membership values while combining solutions. The fuzzy membership value of a combined solution is an aggregation of the membership values associated with the individual paths used in the combination.
In other words, the absolute possibility of a match can be computed by aggregating the relative possibilities in the match. To determine the fuzzy membership values of solution matches, we choose the minimum as our application-specific fuzzy membership aggregation function. Algorithm 5.7 outlines the combining procedure: it starts from the matches of one path and progressively adds matches of joining paths, based on the k-partite intersection graph K-partiteIntersectionGraph. Once we have obtained the set CandidateSet, our graph search algorithm joins the most promising paths from CandidateSet. We initialize the result set to an empty set in line 1; if there are no results after a joining process ends, we output the empty set. While we have not yet generated k answers and the set CandidateSet is not empty (line 2), we obtain the top-k answers by selecting and combining the paths in increasing order of path edit distance from each set of CandidateSet. First, we initialize the answer set and the fuzzy membership value in line 3. Then, in line 4, we choose the vertex q of K-partiteIntersectionGraph with the largest number of overlapping vertices (join predicates) with the existing paths. We select the set cn corresponding to q and dequeue the top path p from cn in lines 5–6. The path q is added to the set V of visited matching paths in line 7. In lines 8–9, we add p to the answer ans and


5 Fuzzy RDF Queries

compute the fuzzy membership value δm of the answer. We obtain the full answer in line 10 by a breadth-first search traversal, as shown in detail in the function BFS-visit. Finally, we include the full answer ans in the set ApproximateAnswersSet in line 11. By using this strategy, if we are not able to find k approximate answers for the query graph Q, the process is stopped.

Algorithm 5.7: Full query matches
Input: The candidate set CandidateSet, the k-partite intersection graph, and the number k of answers required.
Output: The top-k approximate answers set of query Q.
1: ApproximateAnswersSet ← { };
2: while (|ApproximateAnswersSet| < k) and (not empty CandidateSet)
3:   ans ← { }; δm ← 1;
4:   q = maxCardinality(K-partiteIntersectionGraph);
5:   cn ← CandidateSet.get(q);
6:   p ← cn.dequeueTop( );
7:   V ← {q};
8:   ans ← ans ∪ {p};
9:   δm ← min(δm, δp);
10:  BFS-visit(p, ans, CandidateSet, K-partiteIntersectionGraph, q, V);
11:  ApproximateAnswersSet.put(ans, δm);
12: return ApproximateAnswersSet;

Function BFS-visit(p, ans, CandidateSet, K-partiteIntersectionGraph, q, V)
1: foreach (q, q') ∈ K-partiteIntersectionGraph do
2:   if q' ∉ V then
3:     cn ← CandidateSet.get(q');
4:     p' ← cn.dequeueTop( );
5:     if (|L(p) ∩ L(p')| == |L(q) ∩ L(q')|)
6:       ans ← ans ∪ {p'}; δm ← min(δm, δp');
7:       π[q'] = q;
8:       BFS-visit(p', ans, CandidateSet, K-partiteIntersectionGraph, q', V);
9:       V ← V ∪ {q'};
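As a runnable illustration of Algorithms 5.6 and 5.7, the following Python sketch builds the candidate queues and joins the most promising paths with min-based fuzzy aggregation. The path edit distance, the path index `paths_by_dest`, and the omission of the join-predicate cardinality check are simplifying assumptions, not the authors' implementation.

```python
import heapq
from collections import deque

def path_edit_distance(q, p):
    # Toy stand-in for the path edit distance: the number of positions where
    # the label sequences differ, plus the length gap (an assumption, not
    # the exact definition used in the book).
    return sum(a != b for a, b in zip(q, p)) + abs(len(q) - len(p))

def build_candidate_set(path_set, paths_by_dest, max_dist=2):
    """Sketch of Algorithm 5.6: for each query path q, retrieve the data
    paths ending in q's destination (via the hypothetical index
    `paths_by_dest`) and keep them in a priority queue ordered by
    ascending path edit distance."""
    candidate_set = {}
    for q in path_set:
        t = q[-1]                                  # destination vertex of q
        cn = []
        for p in paths_by_dest.get(t, []):         # index lookup, no full scan
            d = path_edit_distance(q, p)
            if d <= max_dist:                      # prune false positives
                heapq.heappush(cn, (d, p))
        candidate_set[q] = cn
    return candidate_set

def full_query_matches(candidate_set, intersection_graph, membership, k):
    """Simplified sketch of Algorithm 5.7: repeatedly join the most
    promising path from each candidate queue, propagating the fuzzy
    degree with min(). The join-predicate cardinality check of the
    original is omitted for brevity."""
    answers = []
    while len(answers) < k and any(candidate_set.values()):
        start = max(intersection_graph,
                    key=lambda q: len(intersection_graph[q]))
        ans, delta, visited = [], 1.0, {start}
        frontier = deque([start])
        while frontier:                    # BFS over the intersection graph
            q = frontier.popleft()
            if not candidate_set[q]:
                break
            _, p = heapq.heappop(candidate_set[q])
            ans.append(p)
            delta = min(delta, membership[p])      # min-based aggregation
            for q2 in intersection_graph[q]:
                if q2 not in visited:
                    visited.add(q2)
                    frontier.append(q2)
        if not ans:
            break
        answers.append((ans, delta))
    return answers
```

Here paths are tuples of labels, and `membership` maps each data path to its fuzzy degree; the min-based aggregation mirrors lines 9 of Algorithm 5.7 and line 6 of BFS-visit.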

5.3.2.3 Algorithm Complexity

We now analyze the complexity of each step of our algorithm. In the data preprocessing, we need to construct a path indexing structure by traversing the fuzzy RDF graph G. For each vertex v, we exploit an optimized implementation of the breadth-first search traversal from the root vertex s to collect path information. Supposing the largest vertex degree in G is d, it is straightforward to show that the time complexity of the data preprocessing phase is O(|E| + |V| × d), where |E| is the number of relations and |V| is the number of vertices in the RDF graph G. The core procedure of the path decomposition step is, in essence, a breadth-first search. The while-loop in the breadth-first search is executed at most |VQ| times, because every vertex is enqueued at most once. So, the complexity is O(|VQ|). The for-loop inside the while-loop is executed at most |EQ| times since



Q is a directed graph: every vertex is dequeued at most once, and we examine (u, v) only when u is dequeued, so each directed edge is examined at most once. Hence the complexity is O(|EQ|), and the total running time for this sub-step is O(|VQ| + |EQ|), where |VQ| and |EQ| are the numbers of vertices and edges of the query graph Q, respectively. The complexity of the finding-path-candidates step is |P| × O(D), where |P| is the number of query paths in the set P and D is the number of paths retrieved by the index, which, in the worst case, is proportional to the size of the data. That is, we have to execute D insertions into CandidateSet at most |P| times. In the full query matches step, the join sub-step is the most time-consuming. It iterates at most k times, where k is the number of returned answers. In each iteration, there is a call of the function BFS-visit, which explores the k-partite intersection graph. In the worst case, it has a cost in O(h × D), since it checks each data path in G at most h times, where h is the depth of K-partiteIntersectionGraph. Therefore, the complexity of this sub-step is O(k × h × D), since the function BFS-visit, which costs O(h × D), is called k times.
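A minimal sketch of the breadth-first traversal underlying the preprocessing phase, assuming a simple adjacency-list representation (the real path index stores richer path information per vertex):

```python
from collections import deque

def collect_paths(graph, root):
    """BFS from `root` recording, for each reached vertex, one label path
    from the root. `graph` maps a vertex to (edge_label, neighbour) pairs.
    Each vertex is enqueued at most once and each directed edge is examined
    at most once, matching the O(|V| + |E|)-style bounds discussed above."""
    paths = {root: ()}
    frontier = deque([root])
    while frontier:
        u = frontier.popleft()
        for label, v in graph.get(u, []):
            if v not in paths:
                paths[v] = paths[u] + (label,)
                frontier.append(v)
    return paths
```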

5.4 Fuzzy Quantified Query Over Fuzzy RDF Graph

Fuzzy queries to databases have been used successfully in several domains, such as decision-making support and linguistic summarization. In particular, fuzzy quantified queries have proved useful in a relational database context for expressing different types of imprecise information needs (Bosc et al., 1995). This line of work highlights an advantage of fuzzy queries: they provide a better representation of user requirements by expressing imprecise conditions through linguistic terms. In this section, we introduce fuzzy quantifiers (Zadeh, 1983) into fuzzy RDF database queries. Such quantifiers can be used to express an intermediate attitude between conjunction ("all of the criteria must be satisfied") and disjunction ("at least one criterion must be satisfied"). They model linguistic expressions such as "most of" and "about a third", and are notably used to construct fuzzy predicates (with quantifications).

Fuzzy quantified queries have received significant attention in the database community for several decades. Bouchon-Meunier and Moyse (2012) gave an overview of linguistic summarization, presenting the main streams of symbolic representation and management of numerical data, which can be crisp or fuzzy. They pointed out that fuzzy approaches bring solutions to the imprecision of quantification and to the subjective qualification of data. Delgado et al. (2014) presented an overview of the existing approaches for evaluating and managing statements involving quantification. In a graph database context, there have been some recent proposals for incorporating quantified statements into user queries (see Bry et al., 2010; Blau et al., 2002; Yager, 2014; Pivert et al., 2016c). SPARQLog (Bry et al., 2010) extended SPARQL with first-order logic (FO) rules, including existential and universal quantification over vertex variables.
QGRAPH (Blau et al., 2002) annotated vertices and edges with a counting range (where a count of 0 denotes a negated edge) to specify the



number of matches that must exist in a database. Yager (2014) briefly mentioned the possibility of using fuzzy quantified structure queries in a social network database context and suggested interpreting them using an OWA operator; however, he did not propose any formal language for expressing such queries. Pivert et al. (2016c) considered a particular type of fuzzy quantified structural query in the general context of fuzzy graph databases and showed how it could be expressed in FUDGE, an extension of the CYPHER query language. Castelltort and Laurent (2016) proposed an approach aimed at summarizing a (crisp) graph database by means of fuzzy quantified statements; they considered a crisp interpretation of this concept and recalled how the corresponding query can be expressed in CYPHER. A limitation of this approach was that only the quantifier was fuzzy. More recently, Fan et al. (2016) introduced quantified graph patterns (QGPs), an extension of classical graph patterns using simple counting quantifiers on edges. The authors also showed that quantified matching in the absence of negation does not significantly increase the cost of query processing. However, quantified graph patterns can only express numeric and ratio aggregates, and negation, besides existential and universal quantification. None of these works considers fuzzy quantified pattern matching in a fuzzy RDF graph database.

In the following, we integrate linguistic quantifiers into subgraph patterns addressed to a fuzzy RDF graph database and use a graph pattern matching approach to evaluate fuzzy quantified queries. In a fuzzy RDF graph database context, fuzzy quantified queries have an even higher potential, since they can exploit the structure of the RDF graph besides the label values attached to the vertices or edges.
In the present section, we define the syntax and semantics of an extension of the query pattern graph that makes it possible to express and interpret fuzzy quantified queries. In addition, in order to answer subgraph pattern queries efficiently over a fuzzy RDF data graph, we present a novel approach for evaluating fuzzy quantified graph patterns.

5.4.1 Linguistic Quantifier and Fuzzy Quantified Statement

Linguistic summaries have been studied for many years; they make it possible to sum up large volumes of data in a very intuitive manner and have been studied over several types of data. However, few works have addressed graph databases. In this section, we recall important notions about linguistic quantifiers and fuzzy quantified statements. Linguistic quantifiers modelled by means of fuzzy sets are then used to define the so-called fuzzy quantified statements.

1. Linguistic quantifier

The notion of a fuzzy or linguistic quantifier (Zadeh, 1983) describes an intermediate attitude between the universal quantifier ∀ and the existential quantifier ∃. Depending on whether they represent imprecise quantities or proportions, quantifiers are classified as absolute or relative quantifiers, respectively.



(i) Absolute quantifiers express a quantity over the total number of elements of a particular set, stating whether this number is, for example, "much more than 10", "around 5", "a great number of", and so forth.

(ii) Relative quantifiers express measurements over the number of elements that fulfill a certain condition, relative to the total number of possible elements (i.e., the proportion of elements). This type of quantifier is used in expressions such as "most", "little of", "at least half of", and so forth. Consequently, the truth of a relative quantifier depends on two quantities: to evaluate it, we need the number of elements fulfilling the condition, considered with respect to the total number of elements that could fulfill it (including those that do fulfill it and those that do not).

Essentially, linguistic quantifiers are fuzzy proportions or fuzzy cardinalities.

Definition 5.14 (Linguistic quantifier). A linguistic quantifier named Q is defined by a fuzzy set with a membership function μQ whose domain depends on whether it is absolute or relative:

Qabs: R → [0, 1]
Qrel: [0, 1] → [0, 1]

where the domain of Qrel is [0, 1] because the ratio a/b ∈ [0, 1], a being the number of elements fulfilling a certain condition and b the total number of existing elements. The value μQ(x) expresses the extent to which the proportion x (resp. the cardinality x) agrees with the quantifier. Therefore, linguistic quantifiers can be considered as fuzzy conditions defined on cardinalities or proportions.

Example 5.5 "Around 7" is an absolute fuzzy quantifier, defined for example as a triangular and symmetric function (Fig. 5.6a) with m = 7 and margin = 6. "Most" is a relative fuzzy quantifier, defined as shown in Fig. 5.6b.
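The two quantifiers of Example 5.5 can be written as ordinary membership functions; the linear shape and the breakpoints of "most" below are assumptions consistent with the text, not values read off Fig. 5.6.

```python
def around_7(x, m=7, margin=6):
    """Absolute quantifier "around 7": triangular, symmetric membership
    function with mode m and half-width `margin` (cf. Fig. 5.6a)."""
    return max(0.0, 1.0 - abs(x - m) / margin)

def most(p):
    """Relative quantifier "most": 0 up to a proportion of 0.25, then
    rising to 1 at a proportion of 1.0; the linear shape is an assumption
    (cf. Fig. 5.6b)."""
    if p <= 0.25:
        return 0.0
    return (p - 0.25) / 0.75
```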
According to this linguistic quantifier, a proportion of less than 25% cannot be considered in agreement with "most" (since μmost(p) is 0 for p ≤ 0.25). For a proportion between 25 and 100%, the closer the proportion is to 100%, the more it agrees with "most".

2. Fuzzy Quantified Statements and Interpretation

Quantified statements are a meta-description of the information in the database and can be used to express relational knowledge about the data. The areas of applicability of linguistically quantified statements are very wide. For example, in information retrieval, quantified statements are used to model natural language statements for querying a database in a more flexible manner. There are two types of quantified statements (Delgado et al., 2014):

Type I: "Q of X are f1"
Type II: "Q of f2 X are f1"



Fig. 5.6 Graphical representation of fuzzy quantifier

where Q is a linguistic quantifier, X is a finite crisp set, and f1 and f2 are fuzzy conditions (linguistic concepts) defined over the domain of X, possibly involving fuzzy predicates, fuzzy operators, and the connectives ∧, ∨, and so on. A quantified statement of Type I means that, among the elements of the set X, a quantity Q satisfies the fuzzy predicate f1. Such a statement can be more or less true, and many approaches can be used to interpret it. Note that Type II generalizes Type I by considering that the set to which the quantifier applies is itself fuzzy. An example of a Type I statement is "Most of the students are young", and of a Type II statement "Most of the good students are young", where X is a finite set of students, the quantifier is "most", f1 is the property "young", and f2 represents the property "good". Also associated with a linguistically quantified statement is a truth value in [0, 1], called the satisfaction degree of the statement. The process of calculating the satisfaction degree of a quantified statement is usually known as an evaluation method. The problem is to find the truth value μ(Q of X are f1) or μ(Q of f2 X are f1), respectively, knowing the truth value of (x is f1) for every x ∈ X; this is done using Zadeh's calculus of linguistically quantified propositions (Zadeh, 1983). For other quality criteria, see the literature (Delgado et al., 2014).
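A minimal sketch of Zadeh's sigma-count evaluation for both statement types, using min for the conjunction in Type II. The quantifier and predicate membership functions in the usage example are illustrative assumptions.

```python
def truth_type1(mu_q, X, mu_f1):
    """Type I, "Q of X are f1", with a relative quantifier Q: the
    sigma-count of f1 divided by |X|, passed through mu_q."""
    proportion = sum(mu_f1(x) for x in X) / len(X)
    return mu_q(proportion)

def truth_type2(mu_q, X, mu_f1, mu_f2):
    """Type II, "Q of f2 X are f1": the relative sigma-count of f1
    within the fuzzy set f2, using min as the conjunction."""
    num = sum(min(mu_f1(x), mu_f2(x)) for x in X)
    den = sum(mu_f2(x) for x in X)
    return mu_q(num / den) if den > 0 else 0.0

# Illustrative use: "Most of the students are young", with toy
# membership functions for "most" and "young" (assumptions).
mu_most = lambda p: 0.0 if p <= 0.25 else (p - 0.25) / 0.75
mu_young = lambda age: 1.0 if age <= 25 else max(0.0, 1.0 - (age - 25) / 10)
ages = [20, 23, 28, 40]
degree = truth_type1(mu_most, ages, mu_young)
```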

5.4.2 Fuzzy Quantified Graph Patterns Matching

The problem addressed in this section is finding the answers to a fuzzy quantified statement over a fuzzy RDF graph G. The key challenge is how to represent the query intention of the fuzzy quantified statement in a structural way. The underlying RDF repository is graph-structured data, but a fuzzy quantified statement is unstructured. To enable query processing, we need a graph representation of fuzzy quantified statements.



1. Fuzzy Quantified Graph Patterns

Because a fuzzy quantified statement is unstructured and G is graph-structured, we have to bridge the gap between the two kinds of data. A graph pattern provides a simple yet intuitive specification of the structural and semantic requirements of interest in the input graph. Therefore, we propose fuzzy Quantified Graph Patterns (f-QGPs), extending conventional graph patterns to express fuzzy quantified matching conditions. In a general setting, the fuzzy quantified statements considered are of the form "Q of f2 X are f1" over fuzzy RDF graph databases, where Q is a quantifier (relative or absolute) represented by a fuzzy set, f2 is the fuzzy condition "to be connected (according to the pattern P(x)) to a vertex x0", X is the set of vertices in the graph, and f1 denotes a fuzzy (possibly compound) condition. Thus, the graph pattern bridges the gap between the user's unstructured query intention and the structured fuzzy RDF data G.

Definition 5.15 (Fuzzy quantified graph pattern). A fuzzy quantified graph pattern is a labeled directed graph defined as Q(x0) = (Vq, Eq, Lq, Fq), where

(i) Vq and Eq are the set of pattern vertices and the set of directed pattern edges, respectively, as defined for data graphs.

(ii) x0 is a vertex in Vq, referred to as the query focus of Q(x0), for search intent (Bendersky et al., 2010).

(iii) Lq is a function that assigns a vertex label Lq(v) (resp. an edge label Lq(e)) to each pattern vertex v ∈ Vq (resp. edge e ∈ Eq). A label can be a variable, a constant, or a condition. The predicates in a condition C can be defined as a combination of atomic formulas of the form "?x op c", "?x op ?y", and "?x is Fterm", where ?x, ?y ∈ variable, c ∈ (U ∪ L), op is a fuzzy or crisp comparator, and Fterm is a predefined or user-defined fuzzy term like young (see Fig. 5.7b). Fuzzy conditions can be extended to support fuzzy conjunction ∧ (resp. disjunction ∨), usually interpreted by the triangular norm minimum (resp. maximum).

(iv) Fq is a function such that, for a given triple pattern tp ∈ Q(x0), Fq(tp) is of the form Quant(p), where Quant is a linguistic quantifier and p is the predicate of the triple pattern. We refer to Quant(p) as the quantifier of the triple pattern. This mechanism makes it possible to attach a linguistic quantifier to a triple.

Example 5.6 An example of a fuzzy quantified statement is: "Most of the recent films that actor x starred in are directed by young directors". The query, denoted by Q(?actor), that aims to retrieve every actor (?actor) such that most of the recent films (?film) that he/she starred in are directed by young directors (?director) may be expressed as the fuzzy quantified graph pattern shown in Fig. 5.8, where ?actor is its query



Fig. 5.7 Graphical representation of fuzzy linguistic term

Fig. 5.8 Fuzzy quantified graph pattern

focus, indicating potential actors, i.e., the variable ?actor should be returned in the result set; ?film and ?d are two variables, and "?y is recent" and "?a is young" are fuzzy condition expressions. Here the edge Starring(?actor, ?film) carries the linguistic quantifier "most", for condition (iv) above. In this query, ?actor corresponds to x0, ?film corresponds to X, and sub-pattern f1 (