Synthesis Lectures on Data, Semantics, and Knowledge
Heiko Paulheim · Petar Ristoski · Jan Portisch
Embedding Knowledge Graphs with RDF2vec
Synthesis Lectures on Data, Semantics, and Knowledge

Series Editors
Ying Ding, The University of Texas at Austin, Austin, USA
Paul Groth, Amsterdam, Noord-Holland, The Netherlands
This series focuses on the pivotal role that data on the web and the emergent technologies that surround it play both in the evolution of the World Wide Web as well as applications in domains requiring data integration and semantic analysis. The large-scale availability of both structured and unstructured data on the Web has enabled radically new technologies to develop. It has impacted developments in a variety of areas including machine learning, deep learning, semantic search, and natural language processing. Knowledge and semantics are a critical foundation for the sharing, utilization, and organization of this data. The series aims both to provide pathways into the field of research and an understanding of the principles underlying these technologies for an audience of scientists, engineers, and practitioners.
Heiko Paulheim University of Mannheim Mannheim, Germany
Petar Ristoski eBay (United States) San Jose, CA, USA
Jan Portisch SAP SE Walldorf, Germany
ISSN 2691-2023  ISSN 2691-2031 (electronic)
Synthesis Lectures on Data, Semantics, and Knowledge
ISBN 978-3-031-30386-9  ISBN 978-3-031-30387-6 (eBook)
https://doi.org/10.1007/978-3-031-30387-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
Knowledge graphs are an important ingredient in today's artificial intelligence systems. They provide a means to encode arbitrary knowledge to be processed in those AI systems, allowing an interpretation of that knowledge both for humans and machines. Today, there are large-scale open knowledge graphs, like Wikidata or DBpedia, as well as privately owned knowledge graphs in organizations, e.g., the Google knowledge graph used in the Google search engine.

Knowledge graph embedding is a technique which projects entities and relations in a knowledge graph into a continuous vector space. Many other components of AI systems, especially machine learning components, can work with those continuous representations better than operating on the graph itself, and often yield superior result quality compared to those trying to extract non-continuous features from a graph.

RDF2vec is a knowledge graph embedding approach which was invented in the scope of the Mine@LOD project¹ and has evolved since then, which has led to numerous variants of the original approach. There exist different implementations of the approach. Moreover, the Web page rdf2vec.org² collects far more than 60 applications of RDF2vec to a large variety of problems in a number of domains, ranging from NLP applications like information retrieval to improving computer security by utilizing a knowledge graph of security threats.

1 https://gepris.dfg.de/gepris/projekt/238007641.
2 http://www.rdf2vec.org/.

With this book, we want to give a gentle introduction to the idea of knowledge graph embeddings with RDF2vec. We discuss the different variants that exist, including their advantages and disadvantages, and give examples for using RDF2vec in practice.

Heiko would like to thank all the researchers in his team at the University of Mannheim, i.e., Andreea Iana, Antonis Klironomos, Franz Krause, Martin Böckling, Michael Schlechtinger, Nicolas Heist, Sven Hertling, and Tobias Weller, as well as Rita Sousa from Universidade de Lisboa, who worked on a few interesting extensions for RDF2vec during her research stay in Mannheim. Moreover, he would like to thank all students who worked with RDF2vec and provided valuable input and feedback, i.e., Alexander Lütke, Niclas Heilig, Michael Voit, Angelos Loucas, Rouven Grenz, and Siraj Sheikh Afham Uddin. Finally,
my partner Tine for bearing many unsolicited dinner table monologues on graphs, vectors, and stuff, and my daughter Antonia for luring me away from graphs, vectors, and stuff every once in a while.

Jan would like to thank all researchers from the University of Mannheim who participated in lengthy discussions on RDF2vec and graph embeddings, particularly Sven Hertling, Nicolas Heist, and Andreea Iana. In addition, Jan is grateful for interesting exchanges at SAP, especially with Michael Hladik, Guilherme Costa, and Michael Monych. Lastly, Jan would like to thank his partner, Isabella, and his best friend, Sophia, for their continued support in his private life.

Petar would like to thank Heiko Paulheim, Christian Bizer, and Simone Paolo Ponzetto for making this work possible in the first place, by forming an ideal environment for conducting research at the University of Mannheim; Michael Cochez for the insightful discussions and collaboration on extending RDF2vec in many directions; and Anna Lisa Gentile for the collaboration on applying RDF2vec in several research projects during his time at IBM Research. Finally, he would like to thank his wife, Könül, and his daughter, Ada, for the non-latent, high-dimensional support and love.

Moreover, the authors would like to thank the developers of pyRDF2vec and the Python KG extension, which we used for examples in this book, especially Gilles Vandewiele for quick responses on all issues around pyRDF2vec, as well as GEval, which has been used countless times in evaluations. We would also like to thank all people involved in the experiments shown in this book: Ahmad Al Taweel, Andreea Iana, Michael Cochez, and all people who got their hands dirty with RDF2vec. Finally, we would like to thank the series editors, Paul Groth and Ying Ding, for inviting us to create this book and providing us with this unique opportunity, and Ambrose Berkumans, Ben Ingraham, Charles Glaser, and Susanne Filler at Springer Nature for their support throughout the production of this book.

Mannheim, Germany
Walldorf, Germany
San Jose, USA
February 2023

Heiko Paulheim
Jan Portisch
Petar Ristoski
Contents
1 Introduction
   1.1 What is a Knowledge Graph?
      1.1.1 A Short Bit of History
      1.1.2 Definitions
      1.1.3 General-Purpose Knowledge Graphs
   1.2 Feature Extraction from Knowledge Graphs
   1.3 Node Classification in RDF
   1.4 Conclusion
   References

2 From Word Embeddings to Knowledge Graph Embeddings
   2.1 Word Embeddings with word2vec
   2.2 Representing Graphs as Sequences
   2.3 Learning Representations from Graph Walks
   2.4 Software Libraries
   2.5 Node Classification with RDF2vec
   2.6 Conclusion
   References

3 Benchmarking Knowledge Graph Embeddings
   3.1 Node Classification with Internal Labels—SW4ML
   3.2 Machine Learning with External Labels—GEval
   3.3 Benchmarking Expressivity of Embeddings—DLCC
      3.3.1 DLCC Gold Standard based on DBpedia
      3.3.2 DLCC Gold Standard based on Synthetic Data
   3.4 Conclusion
   References

4 Tweaking RDF2vec
   4.1 Introducing Edge Weights
      4.1.1 Graph Internal Weighting Approaches
      4.1.2 Graph External Weighting Approaches
   4.2 Order-Aware RDF2vec
      4.2.1 Motivation and Definition
      4.2.2 Evaluation
      4.2.3 Order-Aware RDF2vec in Action
   4.3 Alternative Walk Strategies
      4.3.1 Entity Walks and Property Walks
      4.3.2 Further Walk Extraction Strategies
   4.4 RDF2vec with Materialized Knowledge Graphs
      4.4.1 Idea
      4.4.2 Experiments
      4.4.3 RDF2vec on Materialized Graphs in Action
   4.5 Conclusion
   References

5 RDF2vec at Scale
   5.1 Using Pre-trained Embeddings
      5.1.1 The KGvec2Go Service
      5.1.2 KGvec2Go in Action
   5.2 Training Partial RDF2vec Models with RDF2vec Light
      5.2.1 Approach
      5.2.2 RDF2vec Light in Action
   5.3 Conclusion
   References

6 Link Prediction in Knowledge Graphs (and its Relation to RDF2vec)
   6.1 A Brief Survey on the Knowledge Graph Embedding Landscape
   6.2 Knowledge Graph Embedding for Data Mining
      6.2.1 Data Mining is Based on Similarity
      6.2.2 How RDF2vec Projects Similar Instances Close to Each Other
      6.2.3 Using RDF2vec for Link Prediction
      6.2.4 Link Prediction with RDF2vec in Action
   6.3 Knowledge Graph Embedding Methods for Link Prediction
      6.3.1 Link Prediction is Based on Vector Operations
      6.3.2 Usage for Data Mining
      6.3.3 Comparing the Two Notions of Similarity
      6.3.4 Link Prediction Embeddings for Data Mining in Action
   6.4 Experiments
      6.4.1 Experiments on Data Mining Tasks
      6.4.2 Experiments on Link Prediction Tasks
   6.5 Conclusion
   References

7 Example Applications Beyond Node Classification
   7.1 Recommender Systems with RDF2vec
      7.1.1 An RDF2vec-Based Movie Recommender in Less than 20 Lines of Code
      7.1.2 Combining Knowledge Graph Embeddings with Other Information
   7.2 Ontology Matching
      7.2.1 Ontology Matching by Embedding Input Ontologies
      7.2.2 Ontology Matching by Embedding External Knowledge Graphs
   7.3 Further Use Cases
      7.3.1 Knowledge Graph Refinement
      7.3.2 Natural Language Processing
      7.3.3 Information Retrieval
      7.3.4 Applications in the Biomedical Domain
   7.4 Conclusion
   References

8 Future Directions for RDF2vec
   8.1 Incorporating Information in Literals
   8.2 Exploiting Complex Patterns
   8.3 Exploiting Ontologies
   8.4 Dynamic and Temporal Knowledge Graphs
   8.5 Extension to other Knowledge Graph Representations
   8.6 Standards and Protocols
   8.7 Embeddings and Explainability
   References

Appendix A: Datasets and Code Examples
1 Introduction
Abstract
In this chapter, the basic concept of a knowledge graph is introduced. We discuss why knowledge graphs are important for machine learning and data mining tasks, show classic feature extraction or propositionalization techniques, which are the historical predecessors of knowledge graph embeddings, and show how these techniques are used for basic node classification tasks.
1.1 What is a Knowledge Graph?
The term knowledge graph (or KG for short) was popularized by Google in 2012, when they announced in a blog post that their search would in the future be based on structured knowledge representations, not only on string similarity and keyword overlap, as done until then.¹ Generally, a knowledge graph is a knowledge representation mechanism in which things in the world (e.g., persons, places, or events) are represented as nodes, while their relations (e.g., a person taking part in an event, an event happening at a place) are represented as labeled edges between those nodes.

1 https://blog.google/products/search/introducing-knowledge-graph-things-not/.
1.1.1 A Short Bit of History
While Google popularized the term, the idea of knowledge graphs is much older than that. Earlier works usually used terms like knowledge base or semantic network, among others (Ji et al. 2021). Although the exact origin of the term knowledge graph is not fully known,
Hogan et al. (2021) have traced the term back to a paper from the 1970s (Schneider 1973). In the Semantic Web community and the Linked Open Data (Bizer et al. 2011) movement, researchers have been producing datasets that follow the idea of a knowledge graph for decades. In addition to open knowledge graphs created by the research community and the already mentioned knowledge graph used by Google, other major companies nowadays also use knowledge graphs as a central means to represent corporate knowledge. Notable examples include, but are not limited to, eBay, Facebook, IBM, and Microsoft (Noy et al. 2019).
1.1.2 Definitions
While a lot of researchers and practitioners claim to use knowledge graphs, the field has long lacked a common definition of the term knowledge graph. Ehrlinger and Wöß (2016) have collected a few of the most common definitions of knowledge graphs. In particular, they list the following definitions:

1. A knowledge graph (1) mainly describes real-world entities and their interrelations, organized in a graph, (2) defines possible classes and relations of entities in a schema, (3) allows for potentially interrelating arbitrary entities with each other, and (4) covers various topical domains. (Paulheim 2017)
2. Knowledge graphs are large networks of entities, their semantic types, properties, and relationships between entities. (Journal of Web Semantics 2014)
3. Knowledge graphs could be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. (Semantic Web Company 2014)
4. A Knowledge Graph [is] an RDF graph. An RDF graph consists of a set of RDF triples where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject s ∈ U ∪ B, a predicate p ∈ U, and an object o ∈ U ∪ B ∪ L. An RDF term is either a URI u ∈ U, a blank node b ∈ B, or a literal l ∈ L. (Färber et al. 2018)
5. Knowledge, in the form of facts, [which] are interrelated, and hence, recently this extracted knowledge has been referred to as a knowledge graph. (Pujara et al. 2013)

In addition, they synthesize their own definition, i.e.:

6. A knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge. (Ehrlinger and Wöß 2016)

In the course of this book, we will use a very minimalistic definition of a knowledge graph. We consider a knowledge graph a graph G = (V, E) consisting of a set of entities V (i.e., vertices in the graph) and a set of labeled edges E ⊆ V × R × (V ∪ L), where R defines the set of possible relation types (which can be considered edge labels), and L is a set of
literals (e.g., numbers or string values). Moreover, each entity in V can have one or more classes assigned, where C defines the set of possible classes. Further ontological constructs, such as defining a class hierarchy or describing relations with domains and ranges, are not considered here. While most of the definitions above focus more on the contents of the knowledge graph, we, in this book, look at knowledge graphs from a more technical perspective, since the methods discussed in this book are not bound to a particular domain. Our definition is therefore purely technical and does not constrain the contents of the knowledge graph in any way.
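To make this minimalistic definition a bit more tangible, the short sketch below (not part of the book's code material; the entities and relations are made up for illustration) represents such a graph as a set of subject-predicate-object triples in Python and derives V, R, and L from it, treating plain date strings as literals for simplicity.

# A knowledge graph as a set of (subject, predicate, object) triples.
# Entities and relations are made up for illustration only.
triples = {
    ("John", "likes", "Pizza"),
    ("John", "bornIn", "Berlin"),
    ("Mary", "livesIn", "Berlin"),
    ("Mary", "birthdate", "1996-11-14"),  # an edge pointing to a literal
}

def is_literal(value):
    # Simplification: treat date-like strings as literals, everything else as an entity.
    return value[0].isdigit()

V = {s for s, _, _ in triples} | {o for _, _, o in triples if not is_literal(o)}  # entities
R = {p for _, p, _ in triples}                                                    # relation types
L = {o for _, _, o in triples if is_literal(o)}                                   # literals

print("V =", V)
print("R =", R)
print("L =", L)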
1.1.3 General-Purpose Knowledge Graphs
While the research community has come up with a large number of knowledge graphs, there are a few which are open, large-scale, general-purpose knowledge graphs covering a lot of domains in reasonable depth. They are therefore interesting ingredients to artificial intelligence applications since they are ready to use and contain background knowledge for many different tasks at hand.

One of the earliest attempts to build a general-purpose knowledge graph was Cyc, a project started in the 1980s (Lenat 1995). The project was initiated to build a machine-processable collection of the essence of the world's knowledge, using a proprietary language called CycL. After an investment of more than 2,000 person-years, the project, in the end, encompassed almost 25M axioms and rules² – which is most likely still just a tiny fraction of the world's knowledge. The example of Cyc shows that having knowledge graphs built manually by modeling experts does not really scale (Paulheim 2018). Therefore, modern approaches usually utilize different techniques, such as crowdsourcing and/or heuristic extraction.

Crowdsourcing knowledge graphs was first explored with Freebase (Pellissier-Tanon et al. 2016), with the goal of establishing a large community of volunteers, comparable to Wikipedia. To that end, the schema of Freebase was kept fairly simple to lower the entrance barrier as much as possible. Freebase was acquired by Google in 2010 and shut down in 2014. Wikidata (Vrandečić and Krötzsch 2014) also uses a crowd editing approach. In contrast to Cyc and Freebase, Wikidata also imports entire large datasets, such as several national libraries' bibliographies. Porting the data from Freebase to Wikidata is also a long-standing goal (Pellissier-Tanon et al. 2016).

A more efficient way of knowledge graph creation is the use of structured or semi-structured sources. Wikipedia is a commonly used starting point for knowledge graphs such as DBpedia (Lehmann et al. 2013) and YAGO (Suchanek et al. 2007). In these approaches,

2 https://files.gotocon.com/uploads/slides/conference_13/724/original/AI_GOTO%20Lenat%20keynote%2030%20April%202019%20hc.pdf.
an entity in the knowledge graph is created per page in Wikipedia, and additional axioms are extracted from the respective Wikipedia pages using different means. DBpedia mainly uses infoboxes in Wikipedia. Those are manually mapped to a predefined ontology; both the ontology and the mapping are crowd-sourced using a Wiki and a community of volunteers. Given those mappings, the DBpedia Extraction Framework creates a graph in which each page in Wikipedia becomes an entity, and all values and links in an infobox become attributes and edges in the graph. YAGO uses a similar process but classifies instances based on the category structure and WordNet (Miller 1995) instead of infoboxes. YAGO integrates various language editions of Wikipedia into a single graph and represents temporal facts with meta-level statements, i.e., RDF reification.

CaLiGraph also uses information in categories but aims at converting them into formal axioms using DBpedia as supervision (Heist and Paulheim 2019). Moreover, instances from Wikipedia list pages are considered for populating the knowledge graph (Kuhn et al. 2016, Paulheim and Ponzetto 2013). The result is a knowledge graph that is not only richly populated on the instance level but also has a large number of defining axioms for classes (Heist and Paulheim 2020).

A similar approach to YAGO, i.e., the combination of information in Wikipedia and WordNet, is used by BabelNet (Navigli and Ponzetto 2012). The main purpose of BabelNet is the collection of synonyms and translations in various languages, so that this knowledge graph is particularly well suited for supporting multi-language applications. Similarly, ConceptNet (Speer and Havasi 2012) collects synonyms and translations in various languages, integrating multiple third-party knowledge graphs itself.

DBkWik (Hertling and Paulheim 2018) uses the same codebase as DBpedia, but applies it to a multitude of Wikis. This leads to a graph that has larger coverage and level of detail for many long-tail entities and is highly complementary to DBpedia. However, the absence of a central ontology and mappings, as well as the existence of duplicates across Wikis, which might not be trivial to detect, imposes a number of data integration challenges not present in DBpedia (Hertling and Paulheim 2022).

Another source of structured data is structured annotations in Web pages using techniques such as RDFa, Microdata, and Microformats (Meusel et al. 2014). While the pure collection of those could, in theory, already be considered a knowledge graph, that graph would be rather disconnected and consist of a plethora of small, unconnected components (Paulheim 2015) and would require additional cleanup for compensating irregular use of the underlying schemas and shortcomings in the extraction (Meusel and Paulheim 2015). A consolidated version of this data into a more connected knowledge graph has been published under the name VoldemortKG (Tonon et al. 2016).

The extraction of a knowledge graph from semi-structured sources is considered easier than the extraction from unstructured sources. However, the amount of unstructured data
exceeds the amount of structured data by a large margin.³ Therefore, extracting knowledge from unstructured sources has also been proposed. NELL (Carlson et al. 2010) is an example of extracting a knowledge graph from free text. NELL was originally trained with a few seed examples and continuously runs an iterative coupled learning process. In each iteration, facts are used to learn textual patterns to detect those facts, and patterns learned in previous iterations are used to extract new facts, which serve as training examples in later iterations. To improve the quality, NELL has introduced a feedback loop incorporating occasional human feedback (Pedro and Hruschka 2012). WebIsA (Seitner et al. 2016) also extracts facts from natural language text but focuses on the creation of a large-scale taxonomy. For each extracted fact, rich metadata are collected, including the sources, the original sentences, and the patterns used in the extraction of a particular fact. That metadata is exploited for computing a confidence score for each fact (Hertling and Paulheim 2017).

Table 1.1 depicts an overview of some of the knowledge graphs discussed above. ConceptNet and WebIsA are not included, since they do not distinguish a schema and instance level (i.e., there is no specific distinction between a class and an instance), which does not allow for computing those metrics meaningfully. For Cyc, which is only available as a commercial product today, we used the free version OpenCyc, which was available until 2017.⁴ From those metrics, it can be observed that the KGs differ in size by several orders of magnitude. The sizes range from 50,000 instances (for Voldemort) to 50 million instances (for Wikidata), so the latter is larger by a factor of 1,000. The same holds for assertions. Concerning the linkage degree, YAGO is much more richly linked than the other graphs.

3 Although it is hard to trace down the provenance of that number, many sources state that 80% of all data is unstructured, such as Das and Kumar (2013).
4 It is still available, e.g., at https://github.com/asanchez75/opencyc.
1.2 Feature Extraction from Knowledge Graphs
When using knowledge graphs in the context of intelligent applications, they are often combined with some machine learning or data mining based processing (van Bekkum et al. 2021). The corresponding algorithms, however, mostly expect tabular or propositional data as input, not graphs; hence, information from the graphs is often transformed into a propositional form first, a process called propositionalization or feature extraction (Lavrač et al. 2020, Ristoski and Paulheim 2014a). Particularly for the combination with machine learning algorithms, it is not only important to have entities in a particular propositional form, but that this propositional form also fulfills some additional criteria. In particular, proximity in the feature space should – ideally – reflect the similarity of entities.⁵

5 We will discuss this in detail in Chap. 3.
Table 1.1 Basic metrics of open knowledge graphs (Heist et al. 2020)

|                                     | DBpedia     | YAGO        | Wikidata    | BabelNet    |
| # Instances                         | 5,044,223   | 6,349,359   | 52,252,549  | 7,735,436   |
| # Assertions                        | 854,294,312 | 479,392,870 | 732,420,508 | 178,982,397 |
| Avg. linking degree                 | 21.30       | 48.26       | 6.38        | 0.00        |
| Median ingoing edges                | 0           | 0           | 0           | 0           |
| Median outgoing edges               | 30          | 95          | 10          | 9           |
| # Classes                           | 760         | 819,292     | 2,356,259   | 6,044,564   |
| # Relations                         | 1,355       | 77          | 6,236       | 22          |
| Avg. depth of class tree            | 3.51        | 6.61        | 6.43        | 4.11        |
| Avg. branching factor of class tree | 4.53        | 8.48        | 36.48       | 71.0        |
| Ontology complexity                 | SHOFD       | SHOIF       | SOD         | SO          |

|                                     | Cyc         | NELL        | CaLiGraph   | Voldemort   |
| # Instances                         | 122,441     | 5,120,688   | 7,315,918   | 55,861      |
| # Assertions                        | 2,229,266   | 60,594,443  | 517,099,124 | 693,428     |
| Avg. linking degree                 | 3.34        | 6.72        | 1.48        | 0           |
| Median ingoing edges                | 0           | 0           | 0           | 0           |
| Median outgoing edges               | 3           | 0           | 1           | 5           |
| # Classes                           | 116,821     | 1,187       | 755,963     | 621         |
| # Relations                         | 148         | 440         | 271         | 294         |
| Avg. depth of class tree            | 5.58        | 3.13        | 4.74        | 3.17        |
| Avg. branching factor of class tree | 5.62        | 6.37        | 4.81        | 5.40        |
| Ontology complexity                 | SROIF       | SHOD        | SH          | SHOIFD      |
For example, when building a movie recommendation system, we would like to recommend movies that are similar to the movies someone already watched, and when building a system for entity classification, we want similar entities to be assigned to the same class.
In Paulheim and Fürnkranz (2012), we have introduced a set of basic transformations which can extract propositional features for an entity in a knowledge graph. Those techniques include:

• Types: Create a binary feature for each entity type.
• Literals: Create a feature for each literal value (those may have different types, such as numeric, string, ...).
• Relations: Create a binary or numeric feature for each ingoing and/or outgoing relation.
• Qualified Relations: Create a binary or numeric feature for each combination of an ingoing and/or outgoing relation and the corresponding entity type.
• Relations to individuals: Create a binary feature for each combination of a relation and the individual it connects to.

Implementations of these techniques exist in the original FeGeLOD framework for Weka (Paulheim and Fürnkranz 2012), the RapidMiner Linked Open Data Extension (Ristoski et al. 2015), and the Python kgextension (Bucher et al. 2021). The last technique (relations to individuals), however, is usually not used due to a rapid explosion of the search space.

Figure 1.1 shows a simple knowledge graph describing three persons and their relations among each other, as well as their relation to other objects. If we apply the above techniques for creating propositional representations of the three persons, we would arrive at the representation shown in Table 1.2.
Fig. 1.1 A simple knowledge graph. The dashed line marks the delineation of the schema (upper part) and the instance level (lower part)
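Before turning to the library-based Listing 1.1 below, the following from-scratch sketch illustrates how the relation-based feature types are computed mechanically: it counts relations and qualified relations for a single entity from a plain triple list. The triples and type assignments are made up and only loosely follow Fig. 1.1; this is an illustration of the idea, not the kgextension implementation.

# Illustrative triples and types, loosely following Fig. 1.1 (not an exact copy).
triples = [
    ("John", "likes", "Mary"), ("John", "likes", "Pizza"), ("John", "bornIn", "Berlin"),
    ("Mary", "likes", "Pizza"), ("Mary", "livesIn", "Berlin"),
    ("Julia", "likes", "Sushi"), ("Julia", "bornIn", "Berlin"),
]
types = {"John": "Person", "Mary": "Person", "Julia": "Person",
         "Pizza": "Food", "Sushi": "Food", "Berlin": "City"}

def relation_features(entity):
    """Numeric 'relations' features: how often each outgoing relation occurs."""
    counts = {}
    for s, p, _ in triples:
        if s == entity:
            counts[p] = counts.get(p, 0) + 1
    return counts

def qualified_relation_features(entity):
    """'Qualified relations' features: counts per (relation, type of the object) pair."""
    counts = {}
    for s, p, o in triples:
        if s == entity and o in types:
            key = f"{p}.{types[o]}"
            counts[key] = counts.get(key, 0) + 1
    return counts

print(relation_features("John"))            # {'likes': 2, 'bornIn': 1}
print(qualified_relation_features("John"))  # {'likes.Person': 1, 'likes.Food': 1, 'bornIn.City': 1}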
Table 1.2 A simple propositionalization of the knowledge graph shown in Fig. 1.1. We show the numeric variant of the relations and qualified relations

| Technique                | Attribute        | John       | Mary       | Julia |
| Types                    | Person           | True       | True       | True  |
| Literals                 | birthdate        | 1997-06-08 | 1996-11-14 | NULL  |
| Relations                | likes            | 2          | 1          | 1     |
|                          | birthdate        | 1          | 1          | 0     |
|                          | bornIn           | 1          | 0          | 1     |
|                          | livesIn          | 0          | 1          | 1     |
| Qualified relations      | likes.Person     | 1          | 0          | 0     |
|                          | likes.Food       | 1          | 1          | 1     |
|                          | bornIn.City      | 1          | 0          | 1     |
|                          | livesIn.City     | 0          | 1          | 1     |
| Relations to individuals | bornIn.{Berlin}  | 1          | 0          | 1     |
|                          | bornIn.{Paris}   | 0          | 1          | 0     |
|                          | livesIn.{Berlin} | 0          | 1          | 0     |
|                          | livesIn.{Paris}  | 0          | 0          | 1     |
|                          | likes.{Mary}     | 1          | 0          | 0     |
|                          | likes.{Pizza}    | 1          | 1          | 0     |
|                          | likes.{Sushi}    | 0          | 0          | 1     |
A corresponding code snippet, using the Python knowledge graph extension (Bucher et al. 2021), is shown in Listing 1.1.⁶

There are a few observations we can make here. First, the graph is not precisely evaluated under Open World Semantics, which is the semantics that holds for typical knowledge graphs in RDF (Gandon et al. 2011), and in particular those discussed above, like DBpedia, Wikidata, and the like. For relation features, for example, the features represent if or how often a relation exists in the graph, not necessarily how often it exists in the real world. Since knowledge graphs may be incomplete (Issa et al. 2021), and that incompleteness might not be evenly distributed, some biases might be introduced here. In the example above, we create a feature likes.Person with a value of 1 for John and a value of 0 for Mary and Julia, but this does not necessarily mean that there are no persons that Mary or Julia like, nor that there is not more than one person that John likes.

Second, the number of features is not limited. The more classes and relation types are defined in an ontology and used in a graph, the more features are generated. This may easily lead to very high dimensional feature spaces, which are often suboptimal for downstream processors like prediction algorithms.

6 Code examples, as well as other additional materials, are available online at http://rdf2vec.org/book/.
Listing 1.1 Example for propositionalization with the kgextension package

from kgextension.sparql_helper import LocalEndpoint
import pandas as pd

# Load graph
MyGraph = LocalEndpoint(file_path="./intro_example.ttl")
MyGraph.initialize()

# Create data frame
df = pd.DataFrame({
    'uri': [
        'http://rdf2vec.org/book/example1#John',
        'http://rdf2vec.org/book/example1#Mary',
        'http://rdf2vec.org/book/example1#Julia'
    ]
})

# Create features - example 1: direct types
from kgextension.generator import direct_type_generator
df_types = direct_type_generator(df, "uri", endpoint=MyGraph)

# Create features - example 2: literals
from kgextension.generator import data_properties_generator
df_data_properties = data_properties_generator(df, "uri", endpoint=MyGraph)

# Create features - example 3: relations
from kgextension.generator import unqualified_relation_generator
df_relations = unqualified_relation_generator(df, "uri", endpoint=MyGraph, result_type="count")

# Create features - example 4: qualified relations
from kgextension.generator import qualified_relation_generator
df_qrelations = qualified_relation_generator(df, "uri", endpoint=MyGraph, result_type="count")
For the qualified relations, the number of features often grows exponentially with the number of instances, and the number of features generated can exceed the number of instances by a factor of more than 100, as we have shown in Ristoski and Paulheim (2016). This means that for a dataset with 2,000 instances, the number of features can already exceed 200,000, making it hard to make sense of that data due to the curse of dimensionality (Verleysen and François 2005). While some approaches for post hoc filtering of the generated features have been proposed (Ristoski and Paulheim 2014b), they usually require generating the full feature space first, which may be a way to circumvent the curse of dimensionality, but does not really remedy the problem of scalability.
Table 1.3 Computational complexity of different types of features. T: number of types, P = DP + OP: number of properties, DP: number of datatype properties, OP: number of object properties, I: number of individuals

| Strategy                 | Complexity |
| Types                    | O(T)       |
| Literals                 | O(DP)      |
| Relations                | O(P)       |
| Qualified relations      | O(OP*T)    |
| Relations to individuals | O(OP*I)    |
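To get a feeling for what these bounds mean in practice, the following back-of-the-envelope sketch (not from the book) plugs in figures in the order of magnitude of the Wikidata row of Table 1.1; as a simplification, all relations are treated as object properties.

# Rough upper bounds on the number of generated features, using figures in the
# order of magnitude of the Wikidata row of Table 1.1. Simplification: all
# relations are treated as object properties (OP ~ P).
T = 2_356_259        # number of classes (types)
P = OP = 6_236       # number of relations (properties)
I = 52_252_549       # number of instances (individuals)

print(f"Types:                    {T:,}")
print(f"Relations:                {P:,}")
print(f"Qualified relations:      {OP * T:,}")   # on the order of 10^10
print(f"Relations to individuals: {OP * I:,}")   # on the order of 10^11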
While creating relations to individuals is a possible technique of propositionalization, it can hardly be used in practice, again for the reason of scalability. In this simple example, we already have seven valid combinations of a predicate and an instance, more than twice as many as the entities considered. For a real knowledge graph, this number would grow even faster than that of qualified relations. Therefore, one is often bound to direct types, relations, and qualified relations. The computational complexity (here: the upper bound of the number of features that are generated) for each of the strategies is shown in Table 1.3. When looking back at the numbers in Table 1.1, it becomes obvious that some of the techniques can cause issues in terms of scalability.

When looking at the representation with those groups of features, one might conclude that Mary is more similar to Julia than to John, since Mary and Julia have the same value for five features, while Mary and John only share values of two features. When looking at the graph, on the other hand, one might come to the conclusion that Mary and John are more similar, since both like Pizza, and both are related to Berlin. This example shows that those relations to individuals are very relevant for downstream tasks: in the example of movie recommendation mentioned above, movies are considered similar because they share the same director, actor(s), or genre. However, they are hard to exploit in practice. This observation was one of the key motivating points for developing RDF2vec.
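The similarity comparison in the previous paragraph can be recomputed directly from the relations and qualified relations rows of Table 1.2. The small sketch below (not from the book) simply counts on how many of those eight features two entities have the same value.

# Feature vectors copied from Table 1.2 (relations and qualified relations), in the order:
# likes, birthdate, bornIn, livesIn, likes.Person, likes.Food, bornIn.City, livesIn.City
features = {
    "John":  [2, 1, 1, 0, 1, 1, 1, 0],
    "Mary":  [1, 1, 0, 1, 0, 1, 0, 1],
    "Julia": [1, 0, 1, 1, 0, 1, 1, 1],
}

def shared_values(a, b):
    """Number of features on which two entities have exactly the same value."""
    return sum(x == y for x, y in zip(features[a], features[b]))

print(shared_values("Mary", "Julia"))  # 5 -> Mary looks more similar to Julia ...
print(shared_values("Mary", "John"))   # 2 -> ... than to John in this feature space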
1.3 Node Classification in RDF
Node classification is the task of assigning a label to a node in a graph (not necessarily an RDF graph, but in the context of this book, we will look only at RDF graphs). The label may be the ontological class of the entity, which is often considered in the context of knowledge graph completion (Paulheim 2017), but it can be any other binary or n-ary label. For example, in a content-based recommender system, where each recommendable item is represented as a node in a knowledge graph, the recommendation could be modeled as a node classification task (i.e., given a user u, predict whether or not to recommend item i) (Ristoski et al. 2019, Rosati et al. 2016). A related problem is node regression, where a numerical label (e.g., a rating for an item) is to be predicted.
Table 1.4 Node classification results with different propositionalization techniques

| Generator             | # of features | Accuracy      |
| Direct types          | 12            | 0.545 ± 0.057 |
| Unqualified relations | 19            | 0.505 ± 0.111 |
| Qualified relations   | 163           | 0.530 ± 0.108 |
| Combined              | 194           | 0.525 ± 0.081 |
As an example, we use an excerpt from DBpedia, which contains 200 bands, 100 each from the genres rock and soul. The node classification target is to predict the genre.⁷ Listing 1.2 shows the classification using three of the propositionalization strategies discussed above, using a standard feed-forward neural network as a downstream classifier. The results of node classification using those approaches are shown in Table 1.4.

We can observe that none of the approaches works significantly better than guessing, which, for a balanced binary classification problem like this, would yield an accuracy of 0.5. Obviously, the propositionalization approaches at hand cannot extract any useful features from the graph. At the same time, the number of features extracted is already considerable and, when using all three generators altogether, almost as high as the number of instances.

Nevertheless, Fig. 1.2 indicates that there might be some signals which should be useful for the task at hand (e.g., other genres, record labels, associated artists, etc.). However, those would require the inclusion of information about entities (i.e., individual genres, record labels, artists) in the features, which, as discussed above, none of the current propositionalization techniques do.

Figure 1.3 shows an example decision tree trained on the example dataset. By analyzing some of the paths, it becomes obvious why the models trained with the propositionalization techniques perform that badly. For example, the leftmost leaf node essentially says that bands for which no genre information is given are classified as rock bands, whereas the rightmost leaf node expresses that bands for which at least two genres, one hometown, and one record label are given are classified as soul bands. This indicates that the classifier rather picks up on some statistical artifacts in the knowledge graph (here, rock bands seem to be described in less detail than soul bands), but does not really provide insights beyond that and is not capable of expressing the essence of what makes a band a rock or a soul band.
7 The prediction target, i.e., the target artists' genres, has been removed from the DBpedia excerpt. Details on the dataset construction can be found in Appendix A.1.
Listing 1.2 Example for node classification using classic propositionalization with the kgextension package

# Load data
from kgextension.sparql_helper import LocalEndpoint
MyGraph = LocalEndpoint(file_path="./artists_graph.nt")
MyGraph.initialize()

# Create data frame, split into features and label
import pandas as pd
df = pd.read_csv('./bands_labels.csv', sep="\t")
dfX = df[['Band']]
dfY = df[['Genre']]

# Create features - use three generators
from kgextension.generator import direct_type_generator
from kgextension.generator import unqualified_relation_generator
from kgextension.generator import qualified_relation_generator
dfX = direct_type_generator(dfX, "Band", endpoint=MyGraph)
dfX = unqualified_relation_generator(dfX, "Band", endpoint=MyGraph, result_type="count")
dfX = qualified_relation_generator(dfX, "Band", endpoint=MyGraph, result_type="count")

# Train neural network, evaluate in 10-fold cross validation
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
dfX = dfX.iloc[:, 1:]
clf = MLPClassifier(max_iter=1000)
scores = cross_val_score(clf, dfX, dfY.values.ravel(), cv=10)
scores.mean()
scores.std()
Fig. 1.2 Excerpt of the band node classification dataset. The prediction target labels are shown in grey rectangular boxes. Note that not all bands in the dataset are labeled examples
Fig. 1.3 Example decision tree trained on the node classification example
1.4 Conclusion
In this chapter, we have seen that classic propositionalization has some shortcomings – the search space can get large very quickly, while, at the same time, the expressivity of the features (and, hence, the downstream models) is very limited, in particular since relations to individuals are not reflected very well (and cannot be easily expressed without an exponential explosion of the search space).

The idea of RDF2vec is to address these two shortcomings of classic propositionalization approaches – to create rich representations which can also reflect relations to entities, while, at the same time, limiting the dimensionality of the feature space. With RDF2vec, it is possible to create propositional representations which have a low and controllable number of features (typically, 200-500 features are used), and, at the same time, capture a large amount of the information available for the entities in a knowledge graph.
References

Bizer C, Heath T, Berners-Lee T (2011) Linked data: the story so far. In: Semantic services, interoperability and web applications: emerging concepts, IGI global, pp 205–227
Bucher TC, Jiang X, Meyer O, Waitz S, Hertling S, Paulheim H (2021) Scikit-learn pipelines meet knowledge graphs. In: European semantic web conference. Springer, pp 9–14
Carlson A, Betteridge J, Wang RC, Hruschka Jr ER, Mitchell TM (2010) Coupled semi-supervised learning for information extraction. In: Proceedings of the third ACM international conference on Web search and data mining. ACM, New York, pp 101–110. https://doi.org/10.1145/1718487.1718501
Das TK, Kumar PM (2013) Big data analytics: a framework for unstructured data analysis. Int J Eng Sci Technol 5(1):153
Ehrlinger L, Wöß W (2016) Towards a definition of knowledge graphs. In: SEMANTiCS
Färber M, Bartscherer F, Menne C, Rettinger A (2018) Linked data quality of dbpedia, freebase, opencyc, wikidata, and yago. Semant Web 9(1):77–129
Gandon F, Krummenacher R, Han SK, Toma I (2011) The resource description framework and its schema
Heist N, Paulheim H (2019) Uncovering the semantics of wikipedia categories. In: International semantic web conference. Springer, pp 219–236
Heist N, Paulheim H (2020) Entity extraction from wikipedia list pages. In: Extended semantic web conference
Heist N, Hertling S, Ringler D, Paulheim H (2020) Knowledge graphs on the web – an overview
Hertling S, Paulheim H (2017) Webisalod: providing hypernymy relations extracted from the web as linked open data. In: International semantic web conference. Springer, pp 111–119
Hertling S, Paulheim H (2018) Dbkwik: a consolidated knowledge graph from thousands of wikis. In: 2018 IEEE international conference on big knowledge (ICBK). IEEE, pp 17–24
Hertling S, Paulheim H (2022) Dbkwik++ – multi source matching of knowledge graphs. In: Knowledge graphs and semantic web: 4th iberoamerican conference and third Indo-American conference, KGSWC 2022, Madrid, Spain, November 21–23, 2022, proceedings. Springer, pp 1–15
Hogan A, Blomqvist E, Cochez M, d'Amato C, Melo Gd, Gutierrez C, Kirrane S, Gayo JEL, Navigli R, Neumaier S et al (2021) Knowledge graphs. ACM Comput Surv (CSUR) 54(4):1–37
Issa S, Adekunle O, Hamdi F, Cherfi SSS, Dumontier M, Zaveri A (2021) Knowledge graph completeness: a systematic literature review. IEEE Access 9:31322–31339
Ji S, Pan S, Cambria E, Marttinen P, Philip SY (2021) A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans Neural Netw Learn Syst 33(2):494–514
Journal of Web Semantics (2014) Jws special issue on knowledge graphs. http://www.websemanticsjournal.org/2014/09/cfp-special-issue-on-knowledge-graphs.html
Kuhn P, Mischkewitz S, Ring N, Windheuser F (2016) Type inference on wikipedia list pages. Informatik 2016
Lavrač N, Škrlj B, Robnik-Šikonja M (2020) Propositionalization and embeddings: two sides of the same coin. Mach Learn 109(7):1465–1507
Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C (2013) DBpedia – a large-scale, multilingual knowledge base extracted from wikipedia. Semant Web J 6(2). https://doi.org/10.3233/SW-140134
Lenat DB (1995) CYC: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38. https://doi.org/10.1145/219717.219745
Meusel R, Paulheim H (2015) Heuristics for fixing common errors in deployed schema.org microdata. In: European semantic web conference. Springer, pp 152–168
Meusel R, Petrovski P, Bizer C (2014) The webdatacommons microdata, rdfa and microformat dataset series. In: International semantic web conference. Springer, pp 277–292
Miller GA (1995) Wordnet: a lexical database for english. Commun ACM 38(11):39–41
Navigli R, Ponzetto SP (2012) Babelnet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif Intell 193:217–250
Noy N, Gao Y, Jain A, Narayanan A, Patterson A, Taylor J (2019) Industry-scale knowledge graphs: lessons and challenges: five diverse technology companies show how it's done. Queue 17(2):48–75
Paulheim H (2015) What the adoption of schema.org tells about linked open data. In: Joint proceedings of USEWOD and PROFILES
Paulheim H (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semant Web 8(3):489–508
Paulheim H (2018) How much is a triple? Estimating the cost of knowledge graph creation. In: ISWC 2018 posters and demonstrations, industry and blue sky ideas tracks
Paulheim H, Fürnkranz J (2012) Unsupervised generation of data mining features from linked open data. In: Proceedings of the 2nd international conference on web intelligence, mining and semantics, pp 1–12
Paulheim H, Ponzetto SP (2013) Extending dbpedia with wikipedia list pages. NLP-DBPEDIA@ISWC 13
Pedro SD, Hruschka ER (2012) Conversing learning: active learning and active social interaction for human supervision in never-ending learning systems. In: Ibero-American conference on artificial intelligence. Springer, pp 231–240
Pellissier-Tanon T, Vrandečić D, Schaffert S, Steiner T, Pintscher L (2016) From freebase to wikidata: the great migration. In: Proceedings of the 25th international conference on world wide web, pp 1419–1428
Pujara J, Miao H, Getoor L, Cohen W (2013) Knowledge graph identification. In: International semantic web conference. Springer, pp 542–557
Ristoski P, Paulheim H (2014a) A comparison of propositionalization strategies for creating features from linked open data. In: Linked data for knowledge discovery 6
Ristoski P, Paulheim H (2014b) Feature selection in hierarchical feature spaces. In: International conference on discovery science. Springer, pp 288–300
Ristoski P, Paulheim H (2016) Rdf2vec: Rdf graph embeddings for data mining. In: International semantic web conference. Springer, pp 498–514
Ristoski P, Bizer C, Paulheim H (2015) Mining the web of linked data with rapidminer. J Web Semant 35:142–151
Ristoski P, Rosati J, Di Noia T, De Leone R, Paulheim H (2019) Rdf2vec: Rdf graph embeddings and their applications. Semant Web 10(4):721–752
Rosati J, Ristoski P, Di Noia T, Leone Rd, Paulheim H (2016) Rdf graph embeddings for content-based recommender systems. CEUR Workshop Proc RWTH 1673:23–30
Schneider EW (1973) Course modularization applied: the interface system and its implications for sequence control and data analysis
Seitner J, Bizer C, Eckert K, Faralli S, Meusel R, Paulheim H, Ponzetto SP (2016) A large database of hypernymy relations extracted from the web. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), pp 360–367
Semantic Web Company (2014) From taxonomies over ontologies to knowledge graphs. https://semantic-web.com/from-taxonomies-over-ontologies-to-knowledge-graphs/
Speer R, Havasi C (2012) Representing general relational knowledge in conceptnet 5. In: LREC, pp 3679–3686
Suchanek FM, Kasneci G, Weikum G (2007) YAGO: a core of semantic knowledge unifying wordnet and wikipedia. In: 16th international conference on World Wide Web. ACM, New York, pp 697–706. https://doi.org/10.1145/1242572.1242667
Tonon A, Felder V, Difallah DE, Cudré-Mauroux P (2016) Voldemortkg: mapping schema.org and web entities to linked open data. In: International semantic web conference. Springer, pp 220–228
van Bekkum M, de Boer M, van Harmelen F, Meyer-Vitali A, ten Teije A (2021) Modular design patterns for hybrid learning and reasoning systems. Appl Intell 51(9):6528–6546
Verleysen M, François D (2005) The curse of dimensionality in data mining and time series prediction. In: International work-conference on artificial neural networks. Springer, pp 758–770
Vrandečić D, Krötzsch M (2014) Wikidata: a free collaborative knowledge base. Commun ACM 57(10):78–85. https://doi.org/10.1145/2629489
2 From Word Embeddings to Knowledge Graph Embeddings
Abstract
Word embedding techniques have been developed to assign words to vectors in a vector space. One of the earliest such methods was word2vec, published in 2013 – and embeddings have seen a tremendous uptake in the natural language processing community since then. Since RDF2vec is based on word2vec, we take a closer look at word2vec in this chapter. We explain how word2vec has been developed to represent words as vectors, and we discuss how this approach can be adapted to knowledge graphs by performing random graph walks, yielding the basic version of RDF2vec. We explain the CBOW and SkipGram variants of basic RDF2vec, revisiting the node classification tasks used in Chap. 1.
2.1 Word Embeddings with word2vec
The previous chapter has already introduced the idea of feature extraction from a knowledge graph. Feature extraction has also been used in other fields, such as Natural Language Processing (NLP), e.g., by means of extracting relevant words using POS taggers and/or keyphrase extraction (Scott and Matwin 1999), or image processing, e.g., by extracting shapes and color histograms (Kumar and Bhatia 2014, Nixon and Aguado 2019). In contrast to those approaches, representation learning or feature learning are approaches that input raw data (e.g., graphs, texts, images) into a machine learning pipeline directly, and the first steps of the pipeline create a representation which is suitable for the task at hand (Bengio et al. 2013). There are supervised approaches, which learn a representation that is suitable for a problem at hand, and unsupervised approaches, which learn a representation that is usable across different downstream problems. A typical example of supervised approaches is convolutional neural networks, where each convolution layer creates a more
abstract representation of the input data, and the neurons in those layers act as feature detectors (Schmidhuber 2015). Typical unsupervised approaches are autoencoders, which try to learn a more compact representation of the input data, which can then be used to reconstruct the original data with as little error as possible (Bengio et al. 2013).

Both supervised and unsupervised methods have been applied to knowledge graphs as well. While RDF2vec, which is discussed in this book, is an unsupervised method, the most well-known supervised methods are graph neural networks (Scarselli et al. 2008), which have also been applied to knowledge graphs under the name of relational graph convolutional networks (Schlichtkrull et al. 2018).

NLP is probably one of the fields which have changed the most since the advent of (neural) representation learning. With the increasing popularity of embedding methods, previous approaches for text representation, like bag of words, are barely used anymore. Instead, using vector-based representations of words, as well as smaller and larger chunks of text, has become the dominant representation paradigm in NLP (Khurana et al. 2022).

The general idea underlying word embeddings is that similar words appear in similar contexts, as phrased by John Rupert Firth in a 1957 article: a word is characterized by the company it keeps (Firth 1957). For example, consider the two sentences:

Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976.
and Google was officially founded as a company in January 2006.
Here, both Apple and Google appear with similar context words (e.g., company and founded); hence, it can be assumed that they are somewhat similar.1 Consequently, when representing words, creating similar representations for similar contexts should lead to similar words being represented by similar vectors.

This idea was taken up by Bengio et al. (2000) for training neural networks to predict a word from its context. By using shared weight matrices for each word, they could come up with embedding vectors for words. This approach led to one of the most famous word embedding approaches, called word2vec (Mikolov et al. 2013a). word2vec comes in two variants: continuous bag of words (CBOW) is closely aligned to the neural language model by Bengio et al. (2000) and tries to predict a word from its context, while skip-gram (SG) is organized reversely and tries to predict the context of a word from the word itself. In both cases, a projection layer is used to learn a representation of a word, as shown in Fig. 2.1.

1 Of course, the example is a bit simplistic, and one would in fact not be able to conclude the similarity of the two terms Google and Apple from just those two sentences. However, when using a large corpus of sentences containing both words, one will be able to observe a similar distribution of words in the two terms' contexts.
Fig. 2.1 The two basic architectures of word2vec: (a) CBOW and (b) SG (Ristoski and Paulheim 2016)
The CBOW model predicts target words from context words within a given window. The model architecture is shown in Fig. 2.1a. The input layer is comprised of all the surrounding words, for which the input vectors are retrieved from the input weight matrix, averaged, and projected in the projection layer. Then, using the weights from the output weight matrix, a score for each word in the vocabulary is computed, which is the probability of the word being a target word. Formally, given a sequence of training words w1, w2, w3, ..., wT, and a context window c, the objective of the CBOW model is to maximize the average log probability:

$$\frac{1}{T}\sum_{t=1}^{T} \log p(w_t \mid w_{t-c} \ldots w_{t+c}), \qquad (2.1)$$

where the probability $p(w_t \mid w_{t-c} \ldots w_{t+c})$ is calculated using the softmax function:

$$p(w_t \mid w_{t-c} \ldots w_{t+c}) = \frac{\exp(\bar{v}^{\top} v'_{w_t})}{\sum_{w=1}^{V} \exp(\bar{v}^{\top} v'_{w})}, \qquad (2.2)$$

where $v'_w$ is the output vector of the word w, V is the complete vocabulary of words, and $\bar{v}$ is the averaged input vector of all the context words:

$$\bar{v} = \frac{1}{2c}\sum_{-c \le j \le c,\, j \ne 0} v_{w_{t+j}} \qquad (2.3)$$
The skip-gram model does the inverse of the CBOW model and tries to predict the context words from the target words (Fig. 2.1b). More formally, given a sequence of training words w1, w2, w3, ..., wT, and a context window c, the objective of the skip-gram model is to maximize the following average log probability:

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t), \qquad (2.4)$$

where the probability $p(w_{t+j} \mid w_t)$ is calculated using the softmax function:

$$p(w_o \mid w_i) = \frac{\exp({v'_{w_o}}^{\top} v_{w_i})}{\sum_{w=1}^{V} \exp({v'_{w}}^{\top} v_{w_i})}, \qquad (2.5)$$
where $v_w$ and $v'_w$ are the input and the output vector of the word w, and V is the complete vocabulary of words.

In both cases, calculating the softmax function is computationally inefficient, as the cost of computing it is proportional to the size of the vocabulary. Therefore, two optimization techniques have been proposed, i.e., hierarchical softmax and negative sampling (Mikolov et al. 2013b). Empirical studies have shown that in most cases, negative sampling leads to a better performance than hierarchical softmax (depending on the selected negative samples), but at the cost of a higher runtime.

Recent advances of word embeddings try to capture meaning also in a more fine-grained and contextual way (e.g., distinguishing the Apple company from the fruit named apple in the example above), as done, e.g., in BERT (Devlin et al. 2018), or creating cross-lingual word embeddings (Ruder et al. 2019). Besides their very good performance on many tasks, word embeddings have also become rather popular because, being unsupervised, they can be trained once and then published for further use. Today, there is an abundance of pre-trained word embeddings for many languages, domains, and text genres available online. This makes it easy to develop applications using such word embeddings without requiring extensive computational resources to actually compute the embeddings.
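To make the two variants and the training options tangible, the following minimal sketch trains word2vec with the gensim library, which the RDF2vec implementations discussed later in this book also rely on; the toy corpus and parameter values are chosen for illustration only:

from gensim.models import Word2Vec

# Toy corpus; in practice, word2vec is trained on millions of sentences.
sentences = [
    ["jobs", "wozniak", "wayne", "founded", "apple", "company"],
    ["google", "was", "founded", "as", "a", "company"],
]

# sg=1 selects skip-gram, sg=0 selects CBOW; negative=5 enables negative sampling,
# while hs=1 would use hierarchical softmax instead.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, negative=5)
print(model.wv.most_similar("apple", topn=3))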
2.2 Representing Graphs as Sequences
Techniques for word embeddings work on texts, i.e., sequences of words. In order to apply them to knowledge graphs, those need to be represented as sequences first. In Chap. 1, we have used a very simple definition of knowledge graphs, i.e., representing a knowledge graph as a set of triples. By considering entities E and relations R as "words", the set of triples can also be thought of as a set of three-word "sentences". Hence, applying word embedding methods on those sets would already provide us with a basic embedding vector for each entity. This approach is taken, e.g., by Wembedder, an embedding service for Wikidata entities (Nielsen 2017).

While this approach is very straightforward, it will most often not perform very well. The reason is that the amount of context information of an entity captured by just looking at single triples is very limited. Looking back at the example in Fig. 1.2, the only information in the direct neighborhood of the entity Rustic Overtones is the two associated bands. However, the actually interesting information to classify the entity at hand would rather be the genre and record label of those associated bands. By only considering single triples, however, that information is not encoded in any of the "sentences", and, hence, not picked up by the word embedding method applied to them.
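As a minimal illustration of this triple-as-sentence idea (a sketch under the assumption that gensim is used as the word2vec implementation; the triples are made up and this is not the Wembedder code):

from gensim.models import Word2Vec

# Each triple becomes one three-token "sentence"; the triples are made up for illustration.
triples = [
    ("John", "likes", "Pizza"),
    ("Mary", "livesIn", "Berlin"),
    ("John", "bornIn", "Berlin"),
]
sentences = [list(t) for t in triples]

# Training directly on these short "sentences" yields a vector per entity and relation,
# but each entity's context is limited to its direct triples.
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)
print(model.wv["John"])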
To overcome this issue, we enhance the context of an entity by not only looking at single triples, but at longer sequences S instead. Formally, the set of all two-hop sequences $S_{2hop}$ can be extracted from the set of all triples $E \subseteq V \times R \times V$ as follows2:

$$S_{2hop} := \{(v_1, r_1, v_2, r_2, v_3) : (v_1, r_1, v_2) \in E \land (v_2, r_2, v_3) \in E\} \qquad (2.6)$$

Longer sequences can be defined accordingly.3 Generally, a walk of length n (for an even number n) can be defined as a sequence of entities and relations as follows:

$$walk_n := w_{-\frac{n}{2}}, w_{-\frac{n}{2}+1}, \ldots, w_{-1}, w_0, w_1, \ldots, w_{\frac{n}{2}-1}, w_{\frac{n}{2}} \qquad (2.7)$$

where

$$w_i \in \begin{cases} V & \text{if } i \text{ is even} \\ R & \text{if } i \text{ is odd} \end{cases} \qquad (2.8)$$
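As an illustration of Eq. 2.6, the following minimal sketch enumerates all two-hop sequences from a small, made-up triple set (the triples and helper names are illustrative and not part of any RDF2vec implementation):

from collections import defaultdict

triples = {
    ("John", "likes", "Mary"),
    ("Mary", "livesIn", "Berlin"),
    ("Berlin", "isA", "City"),
}

# Index triples by their subject for fast lookup of the second hop.
by_subject = defaultdict(list)
for s, p, o in triples:
    by_subject[s].append((p, o))

# Enumerate all (v1, r1, v2, r2, v3) with (v1, r1, v2) and (v2, r2, v3) in the triple set.
two_hop = set()
for v1, r1, v2 in triples:
    for r2, v3 in by_subject[v2]:
        two_hop.add((v1, r1, v2, r2, v3))

print(sorted(two_hop))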
It is worth noticing that, given that the set of triples is a mathematical set and hence free from duplicates, E can be fully reconstructed from $S_{2hop}$ (and also from all walk sets longer than two hops), i.e., $S_{2hop}$ is a representation of the knowledge graph that contains all its information.

A full enumeration of sequences, even with a larger number of hops, is possible for smaller knowledge graphs. The number of sequences is $n \cdot d^h$, where n is the number of entities, d is the average node degree, and h is the number of hops. For larger and more densely connected graphs, $d^h$ can quickly become large. For example, using 4-hop sequences for DBpedia, which has a linkage degree of 21.3, as shown in Chap. 1, one would extract more than 200,000 sequences per entity, i.e., more than a trillion sequences in total. In comparison, typical word2vec models are trained on Wikipedia, which has roughly 160 million sentences, i.e., a couple of orders of magnitude less.4

Thus, using all sequences does not scale. Therefore, it is a common practice to sample S by using random walks. Instead of enumerating all walks, a fixed number of random walks is started from each node. Thus, the number of extracted sequences grows only linearly in the number of entities in the graph, independently of the length of the sequences extracted and the degree of the knowledge graph.

2 You may note that literals are not considered here. Like many other knowledge graph embedding approaches, RDF2vec only uses relations between entities and does not utilize literal values.

3 It is important to point out that not all implementations of RDF2vec share the same terminology. The two-hop sequence above would be referred to as a "walk of length 2" (i.e., counting only nodes) by some implementations, while others would consider it a "walk of length 4" (i.e., counting nodes and edges). In this book, we follow the latter terminology.

4 According to https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia, the English Wikipedia has roughly 4 billion words. Assuming 25 words per sentence, as measured by Jatowt and Tanaka (2012), this would correspond to 160 million sentences.
Table 2.1 Subset of the walks extracted for the example in Fig. 1.1. The table only shows the walks starting in the entities John, Mary, and Julia

v1 | r1 | v2 | r2 | v3
John | isA | Person | – | –
John | bornIn | Berlin | isA | City
John | likes | Pizza | isA | Food
John | likes | Mary | isA | Person
John | likes | Mary | livesIn | Berlin
Mary | isA | Person | – | –
Mary | livesIn | Berlin | isA | City
Mary | likes | Pizza | isA | Food
Julia | isA | Person | – | –
Julia | bornIn | Paris | isA | City
Julia | livesIn | Paris | isA | City
Julia | likes | Sushi | isA | Food
Depending on the implementation, the result is either a set or a multiset of sequences. In the first case, the set of extracted walks W is a subset of S. In the second case, walks can also be duplicated – which is particularly the case for low-degree nodes. Hence, multiset-based approaches can introduce a representation bias, increasing the influence of such low-degree nodes. On the other hand, they may also capture the distributions in the graph better.

Table 2.1 shows a set of walks extracted for the example in Fig. 1.1 in Chap. 1. It depicts the walks started in the three nodes John, Mary, and Julia (while in general, walks would be generated for all entities in the graph). When we considered this example with propositionalization methods, we saw that, countering our intuition, Mary was more similar to Julia than to John (cf. Sect. 1.2). On the other hand, considering the walk representation, out of the three walks created from the entity Mary, two are completely identical to the ones created for John (except for the first entity, of course), and in the third one, only one element is different. In contrast, only one walk for Mary is identical to a walk for Julia, and the other two differ in at least one element. We can hence assume that a model built on those walks should be able to capture those fine-grained similarities a lot better.5

5 For the sake of clarity, it should be stated that this argument is a bit simplified, since RDF2vec uses all walks which contain an entity to build that entity's representation, while here, we have only looked at the walks where the entity appears in the v1 position. However, the example still shows that the context of an entity is better captured by the walks than by the propositionalization techniques discussed in Chap. 1.
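To make the walk extraction concrete, the following is a minimal sketch of random walk generation over a toy triple set (the graph, parameters, and helper structures are made up for illustration; actual implementations such as pyRDF2vec offer many more options, e.g., different walk and sampling strategies):

import random
from collections import defaultdict

# Toy graph as a set of triples; made up for illustration.
triples = {
    ("John", "likes", "Mary"),
    ("Mary", "livesIn", "Berlin"),
    ("Berlin", "isA", "City"),
    ("John", "bornIn", "Berlin"),
}

# Index outgoing edges per entity.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))

def random_walk(start, depth):
    # One walk of the form entity, relation, entity, ... with at most `depth` hops.
    walk = [start]
    current = start
    for _ in range(depth):
        if not outgoing[current]:
            break  # dead end: no outgoing edges
        relation, nxt = random.choice(outgoing[current])
        walk.extend([relation, nxt])
        current = nxt
    return walk

# Start a fixed number of walks per entity, so the corpus grows linearly with the graph size.
walks = [random_walk(entity, depth=2) for entity in outgoing for _ in range(5)]
print(walks[:3])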
2.3 Learning Representations from Graph Walks
As discussed above, walks or sequences extracted from knowledge graphs can be considered as yet another way to represent those graphs. If we start enough walks from each entity, we will end up with a set of sequences in which each entity and each relation in the graph appears at least once. Language modeling techniques such as word2vec use such sequences of tokens (i.e., words for natural language, entities and relations for sequences extracted from knowledge graphs) to learn a representation for each token. This means that we can use them to learn representations for entities and relations in knowledge graphs.

Internally, word2vec slides a window of a fixed size over the walks as they are shown in Table 2.1. Each element in each row becomes the focus word one after another (marked as w(t) in Fig. 2.1), with the other elements within the window becoming the context words (marked w(t−2), w(t−1), etc. in Fig. 2.1). In most RDF2vec implementations, the standard value of 5 of the gensim implementation is used for the window size, i.e., for each entity, an inbound and an outbound triple are used to learn the representations, but further away entities are not considered. This value, however, is configurable, and for graphs using more complex representational patterns, a larger window size could yield better vector representations (at the cost of a higher computational effort).

Ultimately, we expect that entities appearing in similar contexts in the knowledge graph will also appear in similar walks (as shown in the example above), which will lead to similar embedding vectors for those entities.6 Figure 2.2 shows the complete big picture of RDF2vec: first, sequences of entities and relations are extracted from the graph, then, those sequences are fed into the word2vec algorithm to create embedding vectors for all entities in the knowledge graph.

There are a few approaches that use the same idea (i.e., extracting walks from graphs, and learning a language model on those walks). Those approaches are often identical or at least very similar to RDF2vec. Examples of such approaches include:

• Walking RDF and OWL (Alshahrani et al. 2017) pursues exactly the same idea as RDF2vec, and the two can be considered identical. It uses random walks and Skip Gram embeddings. The approach has been developed at the same time as RDF2vec.

• KG2vec (Wang et al. 2021b) pursues a similar idea as RDF2vec by first transforming the directed, labeled RDF graph into an undirected, unlabeled graph (using nodes for the relations) and then extracting walks from that transformed graph. Although no direct comparison is available, we assume that the embeddings are comparable.
6 We will revisit – and, to a certain extent, weaken – that assumption when we discuss different variants of RDF2vec, in particular order-aware RDF2vec, in Chap. 4.
Fig. 2.2 Overall workflow of RDF2vec
• Wembedder (Nielsen 2017) is a simplified version of RDF2vec which uses the raw triples of a knowledge graph as input to the word2vec implementation, instead of random walks. It serves pre-computed vectors for Wikidata.

• KG2vec (Soru et al. 2018) (not to be confused with the aforementioned approach also named KG2vec) follows the same idea of using triples as input to a Skip-Gram algorithm.

• Triple2Vec (Fionda and Pirró 2019) follows a similar idea of walk-based embedding generation, but embeds entire triples instead of nodes.

For the sake of completeness, one should also mention node2vec (Grover and Leskovec 2016) and DeepWalk (Perozzi et al. 2014), which pursue a similar approach, but are designed for graphs without edge labels, i.e., graphs with only one type of edge. Therefore, they can be (and have been) applied to knowledge graphs, but do not leverage all the information, since they treat all edges the same.
2.4 Software Libraries
There are two main libraries that can be used for RDF2vec (in addition to a few more implementations, which often support different features, but are less well-maintained and/or documented):

• pyRDF2vec7 (Vandewiele et al. 2022) is a Python-based implementation. It supports many flavors of RDF2vec and comes with an extensible library of walk generation, sampling, and embedding strategies. pyRDF2vec is used for most examples in this book.

• jRDF2vec8 (Portisch et al. 2020b) is a Java-based implementation, which makes all functionality available through a command line interface. It can be integrated into software engineering projects as a Maven dependency and is also available as a Docker image.

7 https://github.com/IBCNServices/pyRDF2Vec.
8 https://github.com/dwslab/jRDF2Vec.
Both pyRDF2vec and jRDF2vec use the gensim library9 (Řehůřek and Sojka 2010) to compute the actual word2vec embeddings. Both libraries differ slightly with respect to the feature set they support. Details are provided in Chap. 4.
2.5 Node Classification with RDF2vec
We will now revisit the node classification example from Sect. 1.3. As a brief recap, the knowledge graph contains a subset of DBpedia consisting of 100 bands from the rock and soul genre each, and the task is to predict that genre. Listing 2.1 shows the code used to classify nodes with RDF2vec.10 With that approach, we reach an accuracy of 0.700 ± 0.071, which is significantly better than the approaches using simple propositionalization (the best was 0.545 ± 0.057). Furthermore, the number of features can be directly controlled and does not depend on the graph topology and complexity.

To understand why this approach works better, we use a 2D plot of the resulting embedding space. For that purpose, we reduce the number of dimensions by computing a Principal Component Analysis (Abdi and Williams 2010). The code for transforming the resulting embeddings of Listing 2.1 into a 2D PCA plot is shown in Listing 2.2. Figure 2.3 shows the PCA plot of both the RDF2vec embeddings, as well as the propositionalization created in Sect. 1.3. We can observe that the RDF2vec embeddings have
Fig. 2.3 The band dataset from Chap. 1, represented using (a) propositionalization and (b) RDF2vec, shown in 2-D PCA projections

9 https://radimrehurek.com/gensim/index.html.
10 We use the pyRDF2vec implementation by Vandewiele et al. (2022) for the code examples throughout this book. For a full list of implementations of RDF2vec, see http://www.rdf2vec.org.
Listing 2.1 Node classification example with RDF2vec

# Load knowledge graph
from pyrdf2vec.graphs import KG
path = "filepath/"
kg = KG('./artists_graph.nt')

# Load ground truth
import pandas as pd
df = pd.read_csv(path + 'bands_labels.csv', sep="\t")
dfX = df[['Band']]
dfY = df[['Genre']]

# Identify entities to create vectors for
entities = list(dict.fromkeys(df['Band'].to_list()))
kgentities = kg._entities

# Define walk strategy
from pyrdf2vec.walkers import RandomWalker
random_walker = RandomWalker(4, 500)
walkers = []
for i in range(1):
    walkers.append(random_walker)

# Learn RDF2vec model
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
transformer = RDF2VecTransformer(
    walkers=walkers,
    embedder=Word2Vec(sg=1, vector_size=50, hs=1, window=5, min_count=0))
embeddings, _ = transformer.fit_transform(kg, entities)

# Evaluate in 10-fold CV
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
dfX = pd.DataFrame(list(map(np.ravel, embeddings)))
clf = MLPClassifier(max_iter=10000)
scores = cross_val_score(clf, dfX, dfY.values.ravel(), cv=10)
scores.mean()
scores.std()
a stronger tendency to create visible clusters: for example, the upper left part of the diagram contains mostly rock bands, while the lower part mostly contains soul bands. Such clusters can be exploited by downstream classifiers. On the other hand, the plot of the propositionalization approach shows no such class separation, indicating that it is harder for the downstream classifier to predict the correct class. This is also reflected in the better classification results for RDF2vec in this case.
Listing 2.2 Visualizing node embeddings in a 2D PCA

# Visualize in 2D PCA
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Compute PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(dfX)
principalDf = pd.DataFrame(
    data=pca_result,
    columns=['principal component 1', 'principal component 2'])
finalDf = pd.concat([principalDf, dfY], axis=1)

# Prepare diagram
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)

# Create color codes
targets = ['Soul', 'Rock']
colors = ['r', 'b']

# Create plot
for target, color in zip(targets, colors):
    indicesToKeep = finalDf['Genre'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c=color)
ax.legend(targets)
ax.grid()
2.6 Conclusion
In this chapter, we have seen a first glance at RDF2vec embeddings and grasped an understanding of how they are computed. We have observed that they create vector representations of nodes that can be used for downstream classification tasks, because the resulting distributions separate classes better than classic propositionalization techniques. At the same time, we are able to limit the dimensionality of the resulting feature space. While this chapter covered only the basic variant of RDF2vec, quite a few variants have been proposed for RDF2vec, which affect both the generation of the walks as well as the computation of the word embedding model. In the subsequent chapters, we will investigate a few of those variants, and discuss their impact on the resulting embeddings.
References

Abdi H, Williams LJ (2010) Principal component analysis. Wiley Interdiscip Rev Comput Stat 2(4):433–459
Alshahrani M, Khan MA, Maddouri O, Kinjo AR, Queralt-Rosinach N, Hoehndorf R (2017) Neuro-symbolic representation learning on biological knowledge graphs. Bioinformatics 33(17):2723–2730
Bengio Y, Ducharme R, Vincent P (2000) A neural probabilistic language model. In: Advances in neural information processing systems 13
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Fionda V, Pirró G (2019) Triple2vec: learning triple embeddings from knowledge graphs. arXiv:1905.11691
Firth JR (1957) A synopsis of linguistic theory, 1930–1955. In: Studies in linguistic analysis
Grover A, Leskovec J (2016) node2vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 855–864
Jatowt A, Tanaka K (2012) Is Wikipedia too difficult? Comparative analysis of readability of Wikipedia, Simple Wikipedia and Britannica. In: Proceedings of the 21st ACM international conference on information and knowledge management, pp 2607–2610
Khurana D, Koli A, Khatter K, Singh S (2022) Natural language processing: state of the art, current trends and challenges. In: Multimedia tools and applications, pp 1–32
Kumar G, Bhatia PK (2014) A detailed review of feature extraction in image processing systems. In: 2014 fourth international conference on advanced computing & communication technologies. IEEE, pp 5–12
Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. arXiv:1301.3781
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
Nielsen FÅ (2017) Wembedder: Wikidata entity embedding web service. arXiv:1710.04099
Nixon M, Aguado A (2019) Feature extraction and image processing for computer vision. Academic Press
Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 701–710
Portisch J, Hladik M, Paulheim H (2020b) Rdf2vec light – a lightweight approach for knowledge graph embeddings. In: International semantic web conference, posters and demonstrations
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, ELRA, Valletta, Malta, pp 45–50. http://is.muni.cz/publication/884893/en
Ristoski P, Paulheim H (2016) Rdf2vec: Rdf graph embeddings for data mining. In: International semantic web conference. Springer, pp 498–514
Ruder S, Vulić I, Søgaard A (2019) A survey of cross-lingual word embedding models. J Artif Intell Res 65:569–631
Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G (2008) The graph neural network model. IEEE Trans Neural Netw 20(1):61–80
Schlichtkrull M, Kipf TN, Bloem P, Berg Rvd, Titov I, Welling M (2018) Modeling relational data with graph convolutional networks. In: European semantic web conference. Springer, pp 593–607
Schmidhuber J (2015) Deep learning. Scholarpedia 10(11):32832
Scott S, Matwin S (1999) Feature engineering for text classification. ICML, Citeseer 99:379–388
Soru T, Ruberto S, Moussallem D, Valdestilhas A, Bigerl A, Marx E, Esteves D (2018) Expeditious generation of knowledge graph embeddings. arXiv:1803.07828
Vandewiele G, Steenwinckel B, Agozzino T, Ongenae F (2022) pyrdf2vec: a python implementation and extension of rdf2vec. https://doi.org/10.48550/ARXIV.2205.02283, https://arxiv.org/abs/2205.02283
Wang Y, Dong L, Jiang X, Ma X, Li Y, Zhang H (2021b) Kg2vec: a node2vec-based vectorization model for knowledge graph. PLoS ONE 16(3):e0248552. https://doi.org/10.1371/journal.pone.0248552
3 Benchmarking Knowledge Graph Embeddings
Abstract
RDF2vec (and other techniques) provide embedding vectors for knowledge graphs. While we have used a simple node classification task so far, this chapter introduces a few datasets and three common benchmarks for embedding methods—i.e., SW4ML, GEval, and DLCC—and shows how to use them for comparing different variants of RDF2vec. The novel DLCC benchmark allows us to take a closer look at what RDF2vec vectors actually represent, and to analyze what proximity in the vector space means for them.
3.1 Node Classification with Internal Labels—SW4ML
In the previous chapter, we have motivated the use of knowledge graph embeddings for learning predictive models on entities in knowledge graphs. In the running example we introduced, the task was to predict the genre of bands represented in a knowledge graph—i.e., the task is node classification.

In the running example in the previous chapters, we used one particular relation in the knowledge graph—the genre relation—and removed it for prediction. This is a common way of creating benchmarks for node classification, and there are a few datasets used for benchmarking embeddings which follow this approach. The SW4ML benchmark,1 introduced in Ristoski et al. (2016), uses four different existing knowledge graphs and holds out a discrete label for certain nodes:

1 http://w3id.org/sw4ml-datasets.
Table 3.1 Characteristics of the SW4ML datasets with internal labels. The numbers on the left-hand side show the statistics of the knowledge graph at hand, i.e., the number and average degree of instances, number of classes, and object properties used in the graph (since RDF2vec only considers relations between resources, but no literals, we only included those in the computation); the numbers on the right-hand side depict the statistics of the classification problem, i.e., number of labeled instances, and number of distinct class labels

Dataset | # Instances | Avg. Degree | # Classes | # Relations | # Instances | # Labels
AIFB | 2,548 | 12.5 | 57 | 72 | 176 | 4
AM | 367,236 | 8.6 | 12 | 44 | 1,000 | 11
MUTAG | 22,372 | 3.6 | 142 | 4 | 340 | 2
BGS | 101,458 | 5.4 | 6 | 41 | 146 | 2
• The AIFB dataset describes the AIFB research institute in terms of its staff, research group, and publications. In Bloehdorn and Sure (2007), the dataset was first used to predict the affiliation (i.e., research group) for people in the dataset. The dataset contains 178 members of research groups; however, the smallest group contains only 4 people and was removed from the dataset, leaving four classes. Furthermore, the employs relation, which is the inverse of the prediction target relation affiliation, has been removed.

• The AM dataset contains information about artifacts in the Amsterdam Museum (de Boer et al. 2012). Each artifact in the dataset is linked to other artifacts and details about its production, material, and content. It also has an artifact category (e.g., painting, sculpture, etc.), which serves as a prediction target. For SW4ML, a stratified random sample of 1,000 instances was drawn from the complete dataset. Moreover, the material relation has been removed, since it highly correlates with the artifact category.

• The MUTAG dataset is distributed as an example dataset for the DL-Learner toolkit.2 It contains information about complex molecules that are potentially carcinogenic, which is given by the isMutagenic property.

• The BGS dataset was created by the British Geological Survey and describes geological measurements in Great Britain.3 It was used in de Vries (2013) to predict the lithogenesis property of named rock units. The dataset contains 146 named rock units with a lithogenesis, of which the two largest classes are used.

2 http://dl-learner.org/.
3 http://data.bgs.ac.uk/.
Table 3.2 Results (accuracy) of different propositionalization and embedding variants on the SW4ML datasets, taken from Ristoski et al. (2019). The table only depicts the results achieved with a linear SVM, which was the best-scoring approach according to the original publication. Results for RDF2vec are reported for CBOW and Skip Gram (SG), and 200 and 500 dimensions, respectively. Results marked with – did not finish within a time limit, or ran out of RAM

Approach/Dataset | AIFB | AM | MUTAG | BGS
Relations | 0.501 | 0.644 | – | 0.720
Relations to individuals | 0.886 | 0.868 | – | 0.858
RDF2vec CBOW 200 | 0.795 | 0.818 | 0.803 | 0.747
RDF2vec SG 200 | 0.874 | 0.872 | 0.779 | 0.753
RDF2vec CBOW 500 | 0.874 | 0.872 | 0.779 | 0.753
RDF2vec SG 500 | 0.896 | 0.782 | 0.781 | 0.882
When looking at the results for the different variants of RDF2vec, the results are less conclusive. In general, the results using 500 dimensions are often superior to those with 200 dimensions. Moreover, the skip-gram result often outperforms the results achieved with CBOW. Nevertheless, there are also cases where these observations do not hold (see, e.g., the good performance of CBOW with 200 dimensions on MUTAG).
3.2 Machine Learning with External Labels—GEval
The original version of the SW4ML benchmark also contained further test cases. Those are based on the existing knowledge graph DBpedia, and use various external variables as prediction targets. GEval supports a wider range of data mining tasks (i.e., not only classification). In general, to evaluate an embedding method with GEval, one needs to first compute embedding vectors for DBpedia. With those vectors, GEval performs different runs with predictions and also performs parameter tuning on the respective prediction operators, as shown in Fig. 3.1. Due to the systematic nature of the benchmark, results for different embedding methods can be directly compared.

In total, GEval comprises six different tasks, many of which have different test cases, making 20 test cases in total (Table 3.3):

• Five classification tasks, evaluated by accuracy. Those tasks use the same ground truth as the regression tasks (see below), where the numeric prediction target is discretized into high/medium/low (for the Cities, AAUP, and Forbes dataset) or high/low (for the Albums and Movies datasets). All five tasks are single-label classification tasks.
Fig. 3.1 Schematic depiction of the GEval framework
• Five regression tasks, evaluated by root mean squared error (RMSE). Those datasets are constructed by acquiring an external target variable for instances in knowledge graphs which is not contained in the knowledge graph per se. Specifically, the ground truth variables for the datasets are: a quality of living indicator for the Cities dataset, obtained from Mercer; average salary of university professors per university, obtained from the AAUP; profitability of companies, obtained from Forbes; average ratings of albums and movies, obtained from Facebook.

• Four clustering tasks (with ground truth clusters), evaluated by accuracy. The clusters are obtained by retrieving entities of different ontology classes from the knowledge graph. The clustering problems range from distinguishing coarser clusters (e.g., cities vs. countries) to finer ones (e.g., basketball teams vs. football teams).

• A document similarity task (where the similarity is assessed by computing the similarity between entities identified in the documents), evaluated by the harmonic mean of Pearson and Spearman correlation coefficients. The dataset is based on the LP50 dataset (Lee et al. 2005). It consists of 50 documents, each of which has been annotated with DBpedia entities using DBpedia Spotlight (Mendes et al. 2011). The task is to predict the similarity of each pair of documents. In the GEval framework, this similarity is computed from the pairwise similarities of the entities in the documents.
Table 3.3 Overview of the evaluation datasets in GEval

Task | Dataset | # entities | Target variable
Classification | Cities | 212 | 3 classes (67/106/39)
Classification | AAUP | 960 | 3 classes (236/527/197)
Classification | Forbes | 1,585 | 3 classes (738/781/66)
Classification | Albums | 1,600 | 2 classes (800/800)
Classification | Movies | 2,000 | 2 classes (1,000/1,000)
Regression | Cities | 212 | Numeric [23, 106]
Regression | AAUP | 960 | Numeric [277, 1009]
Regression | Forbes | 1,585 | Numeric [0.0, 416.6]
Regression | Albums | 1,600 | Numeric [15, 97]
Regression | Movies | 2,000 | Numeric [1, 100]
Clustering | Cities and countries (2k) | 4,344 | 2 clusters (2,000/2,344)
Clustering | Cities and countries | 11,182 | 2 clusters (8,838/2,344)
Clustering | Cities, countries, albums, movies, AAUP, forbes | 6,357 | 5 clusters (2,000/960/1,600/212/1,585)
Clustering | Teams | 4,206 | 2 clusters (4,185/21)
Document similarity | Pairs of 50 documents with entities | 1,225 | Numeric similarity score [1.0, 5.0]
Entity relatedness | 20×20 entity pairs | 400 | Ranking of entities
Semantic analogies | (All) capitals and countries | 4,523 | Entity prediction
Semantic analogies | Capitals and countries | 505 | Entity prediction
Semantic analogies | Cities and states | 2,467 | Entity prediction
Semantic analogies | Countries and currencies | 866 | Entity prediction
• An entity relatedness task (where semantic similarity is used as a proxy for semantic relatedness), evaluated by Kendall's Tau. The dataset is based on the KORE dataset (Hoffart et al. 2012). The dataset consists of 20 seed entities from the YAGO knowledge graph, and 20 related entities each. Those 20 related entities per seed entity have been ranked by humans to capture the strength of relatedness. The task is to rank the entities per seed by relatedness. In the GEval framework, this ranking is computed based on cosine similarity in the embedding vector space.

• Four semantic analogy tasks (e.g., Athens is to Greece as Oslo is to X), which are based on the original datasets on which word2vec was evaluated (Mikolov et al. 2013). The datasets were created by manual annotation. The goal of the evaluation is to predict the fourth element (D) in an analogy A : B = C : D by considering the closest n vectors to B − A + C. If the element is contained in the top n predictions, the answer is considered to be correct, i.e., the evaluation metric is top-n accuracy. In the default setting of the evaluation framework used, n is set to 2.

Table 3.4 shows the results of the two basic variants of RDF2vec on GEval. It shows that the already observed trend of skip-gram being superior to CBOW also holds in this case.
Table 3.4 Results on GEval for RDF2vec

Task | Dataset | CBOW | SG
Classification | Cities | 0.725 | 0.818
Classification | Movies | 0.549 | 0.726
Classification | Albums | 0.536 | 0.586
Classification | AAUP | 0.643 | 0.706
Classification | Forbes | 0.575 | 0.623
Regression | Cities | 18.963 | 15.375
Regression | Movies | 24.238 | 20.215
Regression | Albums | 15.812 | 15.288
Regression | AAUP | 77.250 | 65.985
Regression | Forbes | 39.204 | 36.545
Clustering | Cities and countries (2k) | 0.520 | 0.789
Clustering | Cities and countries | 0.783 | 0.587
Clustering | Cities, albums, movies, AAUP, forbes | 0.547 | 0.829
Clustering | Teams | 0.940 | 0.909
Document similarity | LP50 | 0.283 | 0.237
Entity relatedness | KORE | 0.611 | 0.747
Semantic analogies | Capitals-countries (all) | 0.594 | 0.905
Semantic analogies | Capitals-countries | 0.810 | 0.957
Semantic analogies | Cities-states | 0.507 | 0.609
Semantic analogies | Countries-currencies | 0.338 | 0.574
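Returning to the semantic analogy tasks described above, the following minimal sketch shows how the top-n evaluation could be computed on top of trained entity vectors. It assumes the vectors are available as a gensim KeyedVectors object called kv; the entity names in the example call are placeholders rather than actual benchmark entries.

# Predict D in A : B = C : D via vector arithmetic: look for the vectors closest to B - A + C.
def analogy_correct(kv, a, b, c, d, topn=2):
    predictions = kv.most_similar(positive=[b, c], negative=[a], topn=topn)
    return d in [entity for entity, _ in predictions]

# Example call with placeholder entity names:
# analogy_correct(kv, "dbr:Athens", "dbr:Greece", "dbr:Oslo", "dbr:Norway")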
While GEval is a useful tool to conduct evaluations on a variety of tasks and to get an idea of how well different flavors of embeddings work for different downstream applications, comparing results across papers using GEval is not that easy. Many papers use GEval or subsets of GEval to report on RDF2vec variants and extensions, and we will also show different variants of RDF2vec and discuss the results achieved on those benchmarks in the next chapter. However, as those results are taken from different papers, they are not always fully comparable, as they may use different versions and/or subsets of DBpedia.
3.3 Benchmarking Expressivity of Embeddings—DLCC
When building an application using knowledge graph embeddings, and the task in the application is known, the above benchmarks may give some guidance in choosing an embedding approach. However, they do not say much about what those embedding methods can actually represent and what they cannot represent.
Table 3.5 Overview of the test cases

Test case | DL expression
tc01 | ∃r.⊤
tc02 | ∃r⁻¹.⊤
tc03 | ∃r.⊤ ⊔ ∃r⁻¹.⊤
tc04 | ∃R.{e} ⊔ ∃R⁻¹.{e}
tc05 | ∃R1.(∃R2.{e}) ⊔ ∃R1⁻¹.(∃R2⁻¹.{e})
tc06 | ∃r.{e}
tc07 | ∃r.T
tc08 | ∃r⁻¹.T
tc09 | ≥2 r.⊤
tc10 | ≥2 r⁻¹.⊤
tc11 | ≥2 r.T
tc12 | ≥2 r⁻¹.T
In order to shed more light on that question, the DLCC benchmark (Portisch and Paulheim 2022) has been created. It allows for analyzing the capabilities of knowledge graph embeddings with respect to the underlying patterns in a knowledge graph they can represent. Those patterns are created using description logic class constructors (DLCCs). In total, the benchmark dataset defines 12 different patterns, which are depicted in Table 3.5. Test cases are created (1) using six different domains in DBpedia, and (2) using synthetic knowledge graphs. All test cases are created in different sizes (i.e., numbers of examples) and with a fixed train/test split. Figure 3.2 shows the overall process of the gold standard generation.
3.3.1 DLCC Gold Standard based on DBpedia
We created SPARQL queries for each test case (see Table 3.5) to generate positives, negatives, and hard negatives. The latter are meant to be less easily distinguishable from the positives and are created by variations such as softening the constraints in the class constructor or switching the subject and object in the constraint. For example, for qualified relations, a positive example would be a person playing in a team which is a basketball team (cf. tc07). A simple negative example would be any person not playing in a basketball team, whereas a hard negative example would be any person playing in a team that is not a basketball team. Query examples for every test case in the people domain are provided in Tables 3.6 and 3.7. The framework uses slightly more involved queries to vary the size of the result set and to better randomize results.
Fig. 3.2 Overview of the DLCC gold standard generation approach (Portisch and Paulheim 2022)
In total, we used six different domains: people (P), books (B), cities (C), music albums (A), movies (M), and species (S). This setup yields more than 200 hand-written SPARQL queries, which are used to obtain positives, negatives, and hard negatives; they are available online4 and can be easily extended, e.g., to add an additional domain. For each test case, we created differently sized (50, 500, 5000) balanced test sets.5

Table 3.8 shows the outcome of RDF2vec embeddings on the DLCC DBpedia benchmark. The table depicts the accuracy of the classification task, using DBpedia 2021-09. From those results, we can make a few interesting observations:

1. In all test cases, skip-gram works better than CBOW.
2. In most cases, the hard test cases are in fact harder than their non-hard counterparts.
3. All classification tasks can be solved considerably better than random guessing.6
4 https://github.com/janothan/DL-TC-Generator/tree/master/src/main/resources/queries.
5 The desired size classes can be configured in the framework.
6 Since the classification tasks are balanced, random guessing would yield an accuracy of 0.5.
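To make the classification protocol behind such accuracy numbers concrete, the following is a minimal sketch (not the DLCC framework's own code): it assumes that entity vectors are available in a dictionary named vectors, and that the train/test splits are given as lists of (entity, label) pairs with label 1 for positives and 0 for negatives.

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def evaluate_test_case(vectors, train, test):
    # Look up the embedding vector for each entity in the fixed train/test split.
    X_train = [vectors[e] for e, _ in train]
    y_train = [label for _, label in train]
    X_test = [vectors[e] for e, _ in test]
    y_test = [label for _, label in test]

    # Train a classifier on the training entities and report accuracy on the held-out test entities.
    clf = MLPClassifier(max_iter=10000)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))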
Table 3.6 Exemplary SPARQL queries for test cases 01–06 for class Person (Portisch and Paulheim 2022)

tc01
Query positive: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:child ?y . }
Query negative: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS { ?x dbo:child ?z})}
Query negative (hard): SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?y dbo:child ?x . FILTER(NOT EXISTS { ?x dbo:child ?z})}

tc02
Analogous to tc01 (inverse case)

tc03
Query positive: SELECT DISTINCT(?x) WHERE { { ?x a dbo:Person . ?x dbo:child ?y} UNION { ?x a dbo:Person . ?y dbo:child ?x}}
Query negative: SELECT COUNT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS{ ?x dbo:child ?y} AND NOT EXISTS { ?z dbo:child ?x})}
Query negative (hard): –

tc04
Query positive: SELECT DISTINCT(?x) WHERE { { ?x a dbo:Person . ?x ?y dbr:New_York_City} UNION { ?x a dbo:Person . dbr:New_York_City ?y ?x}}
Query negative: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS{ ?x ?y dbr:New_York_City} AND NOT EXISTS { dbr:New_York_City ?y ?x})}
Query negative (hard): SELECT DISTINCT(?x) WHERE {{ ?x a dbo:Person . ?x ?y1 ?z . ?z ?y2 dbr:New_York_City } UNION { ?x a dbo:Person . ?z ?y1 ?x . dbr:New_York_City ?y2 ?z } FILTER(NOT EXISTS {?x ?r dbr:New_York_City} AND NOT EXISTS {dbr:New_York_City ?s ?x})}

tc05
Analogous to tc04 (inverse case)

tc06
Query positive: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:birthPlace dbr:New_York_City }
Query negative: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS{ ?x dbo:birthPlace dbr:New_York_City })}
Query negative (hard): SELECT DISTINCT(?x) ?r WHERE {{ ?x a dbo:Person . ?x dbo:birthPlace ?y . dbr:New_York_City ?r ?x . FILTER(?y != dbr:New_York_City)} UNION { ?x a dbo:Person . ?x dbo:birthPlace ?y . ?x ?r dbr:New_York_City . FILTER(?y != dbr:New_York_City)}}
The latter observation is particularly interesting. For example, the hard variant of test case tc01 is learned with an accuracy considerably above 0.5; however, as word2vec (and, hence, RDF2vec) does not consider word order (see Sect. 4.2), it should not be able to do so. The reason is that on real-world knowledge graphs, like DBpedia, certain correlations occur. Properties are not distributed independently, but tend to co-occur (e.g., most person entities with a death place also have a birthplace). Therefore, even if a certain pattern cannot be learned directly, it is likely that an embedding algorithm can encode a correlated pattern. The correlation observation also holds for the patterns we aim at investigating.
Table 3.7 Exemplary SPARQL queries for test cases 07–12 for class Person (Portisch and Paulheim 2022)

tc07
Query positive: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:team ?y . ?y a dbo:BasketballTeam }
Query negative: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS{ ?x dbo:team ?y . ?y a dbo:BasketballTeam})}
Query negative (hard): SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:team ?z1 . ?x ?r ?z2 . ?z2 a dbo:BaseballTeam FILTER(NOT EXISTS{ ?x dbo:team ?y . ?y a dbo:BasketballTeam })}

tc08
Analogous to tc07 (inverse case)

tc09
Query positive: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:award ?y1. ?x dbo:award ?y2. FILTER(?y1!=?y2)}
Query negative: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS{ ?x dbo:award ?y1. ?x dbo:award ?y2. FILTER(?y1!=?y2)})}
Query negative (hard): SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:award ?y . FILTER(NOT EXISTS{ ?x dbo:award ?z. FILTER(?y!=?z)})}

tc10
Analogous to tc09 (inverse case)

tc11
Query positive: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:recordLabel ?y1 . ?y1 a dbo:RecordLabel . ?x dbo:recordLabel ?y2 . ?y2 a dbo:RecordLabel . FILTER(?y1!=?y2)}
Query negative: SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . FILTER(NOT EXISTS{ ?x dbo:recordLabel ?y1 . ?y1 a dbo:RecordLabel . ?x dbo:recordLabel ?y2 . ?y2 a dbo:RecordLabel . FILTER(?y1!=?y2)})}
Query negative (hard): SELECT DISTINCT(?x) WHERE { ?x a dbo:Person . ?x dbo:recordLabel ?y1 . ?y1 a dbo:RecordLabel . FILTER(NOT EXISTS{ ?x dbo:recordLabel ?y2 . ?y2 a dbo:RecordLabel . FILTER(?y1!=?y2)})}

tc12
Analogous to tc11 (inverse case)
Figure 3.3 shows an excerpt of DBpedia used in the gold standard. The instance dbr:LeBron_James is a positive example for task tc07 in Table 3.7. At the same time, 95.6% of all entities in DBpedia fulfilling the query for positive examples also fall in the class ∃dbo:position.⊤ (which is a tc01 problem), but only 13.6% of all entities fulfilling the query for trivial negatives do. Hence, on a balanced dataset, this class can be learned with an accuracy of 0.91 by any approach that can learn classes of type tc01. As a comparison to the synthetic dataset shows, the results on the DBpedia test set for tc07 actually overestimate the capability of many embedding approaches of learning classes constructed with a tc07 class constructor. Such correlations are quite frequent in DBpedia.
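To spell out the arithmetic behind the 0.91 (our own sanity check of the numbers above, not part of the original benchmark description): a classifier that predicts "positive" exactly for entities having some dbo:position relation is correct on 95.6% of the positives and on 1 − 0.136 = 86.4% of the negatives, so on a balanced test set its expected accuracy is

$$0.5 \cdot 0.956 + 0.5 \cdot (1 - 0.136) = 0.478 + 0.432 \approx 0.91$$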
3.3.2 DLCC Gold Standard based on Synthetic Data
In order to provide a means to answer the question which embedding algorithm can encode which pattern more precisely, DLCC also provides a collection of synthetic test cases, which do not expose such correlations.
Table 3.8 Results of RDF2vec on the DBpedia DLCC Gold Standard

Test case | SG | CBOW
tc01 | 0.915 | 0.778
tc01 hard | 0.681 | 0.637
tc02 | 0.953 | 0.865
tc02 hard | 0.637 | 0.618
tc03 | 0.949 | 0.846
tc04 | 0.960 | 0.705
tc04 hard | 0.963 | 0.674
tc05 | 0.986 | 0.772
tc06 | 0.957 | 0.698
tc06 hard | 0.863 | 0.604
tc07 | 0.938 | 0.742
tc08 | 0.961 | 0.891
tc09 | 0.902 | 0.773
tc09 hard | 0.785 | 0.659
tc10 | 0.947 | 0.918
tc10 hard | 0.740 | 0.716
tc11 | 0.932 | 0.865
tc11 hard | 0.725 | 0.687
tc12 | 0.955 | 0.888
tc12 hard | 0.714 | 0.712
Fig. 3.3 Example excerpt from DBpedia
The synthetic DLCC gold standard is created as follows: first, an ontology (i.e., a T-Box) is created, based on user-specified metrics (e.g., number of classes and properties, branching factor). For each pattern, classes and properties are selected to form the pattern. Next, positive and negative examples are created. Finally, the A-Box is further populated, whereby additional checks are performed in order to ensure that no extra positive examples are created. Algorithm 1 shows the creation of the DLCC synthetic benchmark.
Algorithm 1 DLCC synthetic benchmark creation (Portisch and Paulheim 2022)

procedure generateClassTree(numClasses, branchingFactor)
    clsURIs ← generateURIs(numClasses)
    root ← randomDraw(clsURIs)
    i ← 0
    workList ← newList()
    result ← newTree()
    currentURI ← root
    for clsURI in clsURIs do
        if clsURI = root then
            continue
        end if
        if i = branchingFactor then
            currentURI ← workList.removeFirst()
            i ← 0
        end if
        result.addLeaf(currentURI, clsURI)
        i ← i + 1
        workList.add(clsURI)
    end for
    return result
end procedure

procedure generateProperties(numProperties, classTree)
    properties ← generateURIs(numProperties)
    for property in properties do
        property.addDomain(drawDomainRange(classTree, 0.25))
        property.addRange(drawDomainRange(classTree, 0.25))
    end for
    return properties
end procedure

procedure drawDomainRange(classTree, p)
    result ← classTree.randomClass()
    while Random.nextDouble > p ∧ ¬(classTree.getChildren(result) == ∅) do
        result ← randomDraw(classTree.getChildren(result))
    end while
    return result
end procedure

procedure populateClasses(numInstances, classTree)
    instances ← generateURIs(numInstances)
    for instance in instances do
        instance.type(classTree.randomClass())
    end for
    return instances
end procedure
Table 3.9 Results on the synthetic DLCC benchmark

Test case | SG | CBOW
tc01 | 0.882 | 0.566
tc02 | 0.742 | 0.769
tc03 | 0.797 | 0.927
tc04 | 1.000 | 0.990
tc05 | 0.892 | 0.889
tc06 | 0.978 | 0.898
tc07 | 0.583 | 0.575
tc08 | 0.563 | 0.555
tc09 | 0.610 | 0.648
tc10 | 0.638 | 0.665
tc11 | 0.633 | 0.668
tc12 | 0.644 | 0.657
Table 3.9 shows the results of RDF2vec on the synthetic benchmark. The picture is much clearer, showing that test cases tc07–tc12 are very hard to learn for RDF2vec. Moreover, the results are much closer to random guessing than the respective ones on the DBpedia-based benchmark, indicating that the latter results are rather due to correlated patterns than to RDF2vec being able to learn the pattern used for the query. On the other hand, tc04 (i.e., entities which have any relation to a particular individual) is the case that is learned particularly well by RDF2vec. We can also observe that a trend that is very clear on the DBpedia-based gold standard, i.e., the superiority of Skip Gram over CBOW, is less clear on the synthetic dataset. This indicates that the choice of a variant of RDF2vec (many more of which we will see in the next chapter) strongly depends on the knowledge graph at hand, and there is no one-size-fits-all solution.
3.4 Conclusion
In this chapter, we have seen three benchmarks for knowledge graph embeddings, i.e., SW4ML, GEval, and DLCC. When looking at variants of RDF2vec in the subsequent chapters, we will evaluate them according to those benchmarks.

One caveat to keep in mind when working with GEval and the DBpedia part of DLCC is that the version and subset of DBpedia have an influence on the results. This can hinder the comparability of results between different papers using those gold standards. Therefore, we strongly encourage authors who use those two benchmarks to specify the exact version and
set of files used. One way to do so is to define a collection on DBpedia Databus (Frey et al. 2022), which enables other researchers to retrieve the exact same set of files. SW4ML and the synthetic part of DLCC, on the other hand, do not have that restriction, since they contain the knowledge graphs used.

While the benchmarks discussed in this chapter target mostly node classification, there are a number of other approaches in the field of link prediction, which come with their own evaluation benchmarks. There are certain commonalities between the two—in fact, the running example in this book tries to predict the genre relation and hence can be seen as a link prediction task as well. We will revisit link prediction in Chap. 6, where we discuss the commonalities and differences between these two families of approaches.
References

Bloehdorn S, Sure Y (2007) Kernel methods for mining instance data in ontologies. The Semantic Web, pp 58–71
de Boer V, Wielemaker J, van Gent J, Hildebrand M, Isaac A, van Ossenbruggen J, Schreiber G (2012) Supporting linked data production for cultural heritage institutes: the Amsterdam museum case study. In: The semantic web: research and applications. Springer, pp 733–747. https://doi.org/10.1007/978-3-642-30284-8_56
de Vries GKD (2013) A fast approximation of the Weisfeiler-Lehman graph kernel for RDF data. In: ECML/PKDD (1), pp 606–621
Frey J, Götz F, Hofer M, Hellmann S (2022) Managing and compiling data dependencies for semantic applications using databus client. In: Research conference on metadata and semantics research. Springer, pp 114–125
Hoffart J, Seufert S, Nguyen DB, Theobald M, Weikum G (2012) KORE: keyphrase overlap relatedness for entity disambiguation. In: Chen X, Lebanon G, Wang H, Zaki MJ (eds) 21st ACM international conference on information and knowledge management, CIKM'12, Maui, HI, USA, October 29 – November 02, 2012, ACM, pp 545–554. https://doi.org/10.1145/2396761.2396832
Lee MD, Pincombe B, Welsh M (2005) An empirical evaluation of models of text document similarity. Proc Ann Meet Cogn Sci Soc 7(7):1254–1529. https://hdl.handle.net/2440/28910
Mendes PN, Jakob M, García-Silva A, Bizer C (2011) DBpedia spotlight: shedding light on the web of documents. In: Ghidini C, Ngomo AN, Lindstaedt SN, Pellegrini T (eds) Proceedings of the 7th international conference on semantic systems, I-SEMANTICS 2011, Graz, Austria, September 7–9, 2011, ACM, ACM International Conference Proceeding Series, pp 1–8. https://doi.org/10.1145/2063518.2063519
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781
Portisch J, Paulheim H (2022) The DLCC node classification benchmark for analyzing knowledge graph embeddings. In: International semantic web conference
Ristoski P, Rosati J, Di Noia T, De Leone R, Paulheim H (2019) Rdf2vec: Rdf graph embeddings and their applications. Semantic Web 10(4):721–752
Ristoski P, Vries GKDd, Paulheim H (2016) A collection of benchmark datasets for systematic evaluations of machine learning on the semantic web. In: International semantic web conference. Springer, pp 186–194
4 Tweaking RDF2vec
Abstract
Depending on the problem at hand, one might think of different tweaks to the RDF2vec algorithm, many of which have been discussed in the past. Those tweaks encompass various steps of the pipeline: reasoners have been used to preprocess the knowledge graph and add implicit knowledge. Different strategies for generating the walks have been proposed, ranging from injecting edge weights to biasing the walks towards higher or lower degree nodes, and changing the structure of the extracted walks completely. Moreover, the embedding creation itself has also been analyzed in the past by using different variants of the word2vec word embedding method. In this chapter, we introduce a few of those approaches and highlight their advantages and shortcomings.
4.1 Introducing Edge Weights
The standard variant of RDF2vec, as introduced in Chap. 2, uses random walks. In their basic form, they treat a graph as unweighted, i.e., given that the walk is currently at an entity e, the next edge is chosen among all outgoing edges with the same uniform probability: if the outdegree of e is d, then each edge is selected with a probability of 1/d. Non-uniform random walks change this property by treating different edges differently, i.e., picking among the outgoing edges with different probabilities. Typically, this is achieved with weights: given that each edge r has a weight w_r, the edge r_i is chosen with probability

P(r_i) = w_{r_i} / Σ_j w_{r_j}          (4.1)
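To make Eq. 4.1 concrete, the following minimal sketch (not taken from any RDF2vec implementation; the edge_weight dictionary is a hypothetical precomputed lookup from edges to weights) draws the next edge of a walk proportionally to its weight:

import random

def weighted_step(outgoing_edges, edge_weight):
    # P(r_i) = w_{r_i} / sum_j w_{r_j}, cf. Eq. 4.1
    weights = [edge_weight[e] for e in outgoing_edges]
    if sum(weights) == 0:
        # degenerate case: fall back to a uniform random choice
        return random.choice(outgoing_edges)
    return random.choices(outgoing_edges, weights=weights, k=1)[0]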
There are two strategies for obtaining edge weights: they can be graph internal, i.e., coming from the graph itself, or graph external, i.e., obtained from an external source
outside of the knowledge graph. Conceptually, graph external edge weights add information to the knowledge graph, while graph internal weights do not.
4.1.1 Graph Internal Weighting Approaches
Using random walks leads to certain properties of the resulting walks. For example, entities that have a very high indegree have a high likelihood of being visited in random walks started from many other entities, and hence be highly represented in the resulting set of walks. As a result, the information about those high-indegree nodes is denser than that for low-indegree nodes, and the resulting embedding vectors for high-indegree nodes will take more information into account and be more expressive than those for low-indegree nodes. This observation in itself is not necessarily bad. Suppose the task at hand contains many high-indegree entities (which are typically prominent concepts in the knowledge graph). In that case, it might even be an advantage to have more expressive embedding vectors for those entities. If, on the other hand, the task at hand has more low-indegree entities (which are often long-tail or less well-known entities), this property of random walks might turn out to be a disadvantage.
4.1.1.1 Metrics for Biasing Walks
To be able to control the representation of entities and relations in the random walks, we have described a total of 12 strategies for using metrics derived from the graph as weights for the edges in Cochez et al. (2017). These weights then bias the random walks on the graph in different directions. To obtain the edge weights, we make use of different statistics computed on the RDF data:

Predicate frequency: for each predicate in the dataset, we count the number of times the predicate occurs (only occurrences in the predicate position are counted).
Object frequency: for each entity in the dataset, we count the number of times it occurs as the object of a triple (i.e., the indegree of the entity).
Predicate-object frequency: for each pair of a predicate and an object in the dataset, we count the number of times there is a statement with this predicate and object (but a different subject).

Besides these statistics, we also use PageRank (Brin and Page 1998) computed for the entities in the knowledge graph. In particular, we use a pre-computed PageRank on DBpedia provided by Thalhammer and Rettinger (2016). This PageRank is computed based on links between the Wikipedia articles representing the respective entities. When using the PageRank computed for DBpedia, not every node has a value assigned, since only entities which have a corresponding Wikipedia page are accounted for in the PageRank computation. Examples of
nodes that do not have a PageRank include DBpedia types or categories, like http://dbpedia.org/ontology/Place and http://dbpedia.org/resource/Category:Central_Europe. Those nodes which are not entities are assigned a fixed PageRank value of 0.2, which was identified as the median in that PageRank dataset by van Erp et al. (2016).
Among the metrics used, there are essentially two types: those assigned to nodes, and those assigned to edges. The predicate frequency and predicate-object frequency, as well as their inverses, can be directly used as weights for edges. Therefore, those weighting methods are referred to as edge-centric. In the case of predicate frequency, each edge with that predicate is assigned the respective frequency as its weight. In the case of predicate-object frequency, each edge that ends in a given object with a given predicate gets assigned the corresponding predicate-object frequency. When computing the inverse metrics, not the absolute frequency is assigned, but its multiplicative inverse.
In contrast, the object frequency, as well as the PageRank metric, assign a numeric score to each node in the graph. Therefore, weighting approaches based on those metrics are referred to as node-centric. To obtain a weight for the edges from node-centric metrics, the node weight is either propagated directly to each ingoing edge, or split. In the latter case, the weight is divided by the number of ingoing edges and then assigned to all of them: for an edge e = <s, p, o> and a node weight w_o for the entity o, the edge weight is computed as w_e = w_o / indegree(o). In the strategies below, methods using this strategy are explicitly marked as split, whereas strategies not marked as such use direct propagation.
Note that uniform weights are equivalent to using object frequency with splitting the weights: since object frequency uses the indegree of an entity as its weight w_o, the splitting formula above yields a constant 1 for each edge. In total, we therefore analyze 12 different weighting schemes, including standard random walks.
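As an illustration of how such statistics could be computed, the following sketch (hypothetical helper functions, not part of pyRDF2vec or jRDF2vec) derives predicate, object, and predicate-object frequencies from an in-memory list of triples, and shows the splitting of a node-centric weight over the ingoing edges of an object:

from collections import Counter

def graph_statistics(triples):
    # triples: iterable of (subject, predicate, object) tuples
    pred_freq = Counter(p for _, p, _ in triples)            # predicate frequency
    obj_freq = Counter(o for _, _, o in triples)             # object frequency (= indegree)
    pred_obj_freq = Counter((p, o) for _, p, o in triples)   # predicate-object frequency
    return pred_freq, obj_freq, pred_obj_freq

def split_edge_weight(edge, node_weight, obj_freq):
    # node-centric metric with splitting: w_e = w_o / indegree(o)
    _, _, o = edge
    return node_weight[o] / obj_freq[o]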
4.1.1.2 Biased Random Walk Strategies
Uniform approach:
1. Uniform Weight = Object Frequency Split Weight—This is the most straightforward approach, also taken by the standard RDF2vec models. At first glance, it also looks like the most neutral strategy. However, the input graph does not have a regular structure in the sense that some entities have a (much) higher indegree than others and hence they are more likely to be visited. Thus, more strongly connected entities will have a higher influence on the resulting embeddings.
Edge-centric approaches:
2. Predicate Frequency Weight—With this strategy, edges with predicates that are more frequently used in the dataset are more often followed. The effect of this is that many uncommon predicates are never followed in our experiments and, as a result of that, many entities are also never visited in the walks. On the other hand, there are a few entities that have a very high indegree, and which thus attract a lot of walks towards them.
3. Inverse Predicate Frequency Weight—This strategy has at first sight a similar effect as the previous, but for other nodes. Those predicates which are rare will be followed more often. However, predicates follow a long-tail distribution, and there are more predicates that are rare than common; thus, the diversity of predicates occurring in the walks is higher. Moreover, despite having a low probability, edges with a common predicate are also followed once in a while, as they occur so often in the dataset.
4. Predicate-Object Frequency Weight—This is similar to the Predicate Frequency Weight, but differentiates between the objects as well. If we have, for example, an outgoing link with label rdf:type and object owl:Thing, then this link will be followed more often than, e.g., the same predicate with object dbpedia-owl:AdministrativeRegion.
5. Inverse Predicate-Object Frequency Weight—The inverse of the previous, with similar features to Inverse Predicate Frequency Weight. We measured that this approach results in walks in which nodes occur most uniformly.
Node-centric object frequency approaches (see also strategy 1):
6. Object Frequency Weight—This weighting essentially ignores the predicate altogether and just ensures that entities that have a high indegree get visited even more often.
7. Inverse Object Frequency Weight—This approach also ignores the predicate, but makes the probability for nodes to be visited more equally distributed. Hence, according to our measurements, entities occur nearly as uniformly in walks as for Inverse Predicate-Object Frequency Weight.
8. Inverse Object Frequency Split Weight—The general statistics for these walks look surprisingly similar to the non-inverted strategy.
Node-centric PageRank-based approaches:
9. PageRank Weight—Similar to Object Frequency Weight, this strategy treats some nodes as more important, and hence there will be resources which are more frequent in the walks than others.
10. Inverse PageRank Weight—One would expect that this approach would have a similar effect as Inverse Object Frequency Weight; however, our experiments show that the inversion does not cause more uniform occurrence of entities as strongly as that strategy.
11. PageRank Split Weight—Both this approach and the next one are somewhat difficult to predict, as they do not only depend on the structure of the graph. Our analysis of the walks shows that nodes are fairly uniformly used in these walks. Furthermore, these strategies result in a high uniformity in the absolute frequency of predicates.
12. Inverse PageRank Split Weight—The generated walks have similar statistics as PageRank Split Weight. The expectation is, however, that this metric tends to include more unimportant nodes in the walks.
4.1.1.3 Evaluation
For each of the twelve sets of sequences created using those metrics, we build one CBOW and one skip-gram model, as described in Chap. 2, and therefore evaluate a total of 24 different configurations. The experiments are performed on the 2016-04 version of DBpedia and use the classification, regression, entity similarity, and document similarity tasks of the GEval benchmark collection (or, more precisely, a historic predecessor thereof). In Cochez et al. (2017), we compare against the propositionalization strategies introduced in Chap. 1, using the respective best performing one as best baseline, as well as against the link prediction methods TransE (Bordes et al. 2013), TransH (Wang et al. 2014), and TransR (Lin et al. 2015b) (cf. Chap. 6). Instead of depicting the absolute performance, we compare average ranks of results across the different classification and regression tasks. The results for classification and regression are depicted in Table 4.1. Moreover, we report results on the entity relatedness and document similarity tasks, as shown in Table 4.2.
From the results, it can be observed that, although not in every case and often not by a large margin, the weighting strategies can actually make a difference. In particular, the PageRank Split Weight metric often performs better than the pure random walks with uniform weights. However, given that the results for many other weighting schemes often underperform plain random walks, the latter remain a good and easy-to-implement choice.
4.1.1.4 Biased Graph Walks in Action
The pyRDF2vec package used for examples in this book supports most of the edge weighting strategies discussed in this section, referred to as samplers.1 Listing 4.1 shows how the node classification example from Sect. 1.3 can be computed with the PageRank split weights, which worked best for the classification tasks according to Cochez et al. (2017). With this code, a slight improvement from 0.700 to 0.750 could be observed in the example at hand.
1 https://pyrdf2vec.readthedocs.io/en/latest/api/pyrdf2vec.samplers.html.
Table 4.1 Classification and regression average rank results for different walk weighting schemes (CBOW and SG models for each of the twelve weighting strategies, compared to the best baseline, TransE, TransH, and TransR; classification learners: NB, KNN, SVM, C4.5; regression learners: LR, KNN, M5). The best-ranked results for each method are marked in bold. The learning models for which the strategies were shown to have significant differences based on the Friedman test with α < 0.05 are marked with *. The single values marked with * are significantly worse than the best strategy at significance level q = 0.05
Table 4.2 Entity relatedness and document similarity results for different walk weighting schemes—Pearson's linear correlation coefficient (r), Spearman's rank correlation (ρ), and their harmonic mean μ

                                                      Entity relatedness   Document similarity
Method                                                ρ                    r        ρ        μ
Uniform weight                              CBOW       0.384               0.562    0.480    0.518
                                            SG         0.564               0.608    0.448    0.516
Edge-centric approaches
Predicate frequency weight                  CBOW      −0.058               0.547    0.454    0.496
                                            SG        −0.123               0.355    0.284    0.316
Inverse predicate frequency weight          CBOW       0.468               0.560    0.395    0.463
                                            SG         0.584               0.653    0.487    0.558
Predicate object frequency weight           CBOW       0.076               0.339    0.302    0.319
                                            SG        −0.043               0.238    0.183    0.207
Inverse predicate object frequency weight   CBOW       0.578               0.549    0.473    0.508
                                            SG         0.610               0.628    0.491    0.551
Node-centric object frequency approaches
Object frequency weight                     CBOW      −0.096               0.372    0.317    0.342
                                            SG        −0.102               0.255    0.190    0.218
Inverse object frequency weight             CBOW       0.429               0.552    0.455    0.499
                                            SG         0.554               0.585    0.452    0.510
Inverse object frequency split weight       CBOW       0.447               0.501    0.405    0.448
                                            SG         0.489               0.469    0.335    0.391
Node-centric PageRank approaches
PageRank weight                             CBOW       0.378               0.530    0.303    0.386
                                            SG         0.456               0.589    0.384    0.465
Inverse PageRank weight                     CBOW       0.411               0.588    0.491    0.535
                                            SG         0.426               0.467    0.390    0.425
PageRank split weight                       CBOW       0.621               0.578    0.426    0.490
                                            SG         0.634               0.658    0.476    0.552
Inverse PageRank split weight               CBOW       0.462               0.525    0.419    0.466
                                            SG         0.487               0.369    0.292    0.326
Related approaches
TransE                                                 0.091               0.550    0.430    0.483
TransH                                                 0.050               0.517    0.414    0.460
TransR                                                 0.058               0.568    0.431    0.490
4.1.2 Graph External Weighting Approaches
Graph internal weights do not add any information that is not already present in the knowledge graphs. In other cases, however, there might be weights coming from outside of the knowledge graph. If they are used as an extra signal, they may actually add information to the embeddings which cannot be gathered from the graph itself.
Listing 4.1 Using edge weights for random walks. The code snippet only shows the random walk definition; the remaining code is equivalent to Listing 2.1

from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec.samplers import PageRankSampler

# depth 4, 500 walks per entity, next edges sampled by split PageRank weights
random_walker = RandomWalker(4, 500, PageRankSampler(split=True))
4.1.2.1 Utilizing Clickstream Data from Wikipedia
In a study on DBpedia, we analyzed the impact of such external weights (Al Taweel and Paulheim 2020). In that study, we utilized transition probabilities between Wikipedia pages, created from user logs in Wikipedia, as a proxy for the importance of an edge in DBpedia.2 Those logs aggregate the number of link transitions (i.e., clicks) from one Wikipedia page to another. This transition probability is used as implicit user feedback, serving as a proxy for a human rating of the importance of an edge in the knowledge graph.3
Figure 4.1 shows an example visualization of those weights. The focus entity is London, with the probabilities of navigating to the corresponding Wikipedia page from other pages depicted on the left, and the probabilities of navigating to other pages on the right. Figure 4.2 shows an excerpt of DBpedia with the absolute click counts as weights. When using those weights in Eq. 4.1, we get the following transition probabilities:

P(Nine_Inch_Nails, band_member, Trent_Reznor) = 8,989 / 14,407 = 0.624
P(Nine_Inch_Nails, band_member, Atticus_Ross) = 4,439 / 14,407 = 0.308
P(Nine_Inch_Nails, genre, Industrial_Rock)    =   979 / 14,407 = 0.068
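The normalization behind these numbers is straightforward; the following sketch (with the click counts hard-coded from the example above) turns absolute click counts into transition probabilities:

clicks = {
    ("Nine_Inch_Nails", "band_member", "Trent_Reznor"): 8989,
    ("Nine_Inch_Nails", "band_member", "Atticus_Ross"): 4439,
    ("Nine_Inch_Nails", "genre", "Industrial_Rock"): 979,
}
total = sum(clicks.values())  # 14,407
transition_probability = {edge: count / total for edge, count in clicks.items()}
# e.g., the probability for the Trent_Reznor edge is 8989 / 14407 = 0.624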
4.1.2.2 Evaluation
As for the experiments in the previous section, classification and regression datasets from GEval were used.4 Different models were trained to analyze the performance. It can be seen in Table 4.3 that the RDF2vec models using the clickstream data often outperform the other models.
2 https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream.
3 The figure also shows non-Wikipedia pages as click sources. Those are ignored when computing the edge weights.
4 The paper reports on a preliminary study and therefore uses only the cities, movies, and albums classification and regression tasks. Therefore, the results are not directly comparable to the results in the previous section.
Fig. 4.1 Clickstream data example, showing the top 10 click sources and targets for the Wikipedia page for London. Source https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
Fig. 4.2 Example for a clickstream weighted graph (Al Taweel and Paulheim 2020)
Although that study is fairly preliminary, the results show that external graph weights can be exploited for RDF2vec models, and are fairly straightforward to incorporate into the overall process. In situations where such meaningful edge weights are available, training a weighted RDF2vec model can hence be a promising direction to pursue.5
5 Since neither jRDF2vec nor pyRDF2vec allows for incorporating external weights, the authors of the study have used their own proprietary implementation of RDF2vec for the study. Therefore, no code example is given here. The implementation used for the experiments can be found at https://github.com/ataweel55/RDF2VEC.
Table 4.3 Results on classification (accuracy) and regression (RMSE) datasets using the Wikipedia clickstream dataset as external edge weights

Task            Dataset   Uniform   Clickstream weighted
Classification  Cities    73.01     78.39
                Movies    81.19     81.44
                Albums    75.80     74.15
Regression      Cities    12.78     13.96
                Movies    16.95     15.66
                Albums    13.56     11.70
4.2 Order-Aware RDF2vec
By definition, word2vec does not respect word order. Therefore, the RDF2vec embeddings also do not respect the order of elements in the extracted graph walks. Order-aware RDF2vec, or RDF2vec_oa for short, is a variant that utilizes the order of elements in the walks.
4.2.1 Motivation and Definition
Most natural languages have a certain degree of flexibility with respect to word order (Comrie 1989). Therefore, basic word embedding methods like word2vec look at the context of a word to determine its embedding vector, but not the order of words. Consider the following two sentences: In 2021, John met Mary in a club in Berlin. In this club in Berlin, Julia and John met in 2021.
From these two sentences, we can figure out that Mary and Julia have something in common since they both met John in a club in Berlin in 2021 (at least if we assume that John refers to the same entity in both sentences). However, looking at the word windows around the two words Mary and Julia, we would encounter the following representation: w−5 – In
w−4 In this
w−3 2021 club
w−2 John in
w−1 met Berlin
w0 Mary Julia
w1 in and
w2 a John
w3 club met
w4 in in
w5 Berlin 2021
When we look at the exact overlap of tokens, we see that there is only one token that overlaps (i.e., w4 ). However, when we consider the tokens as a set, we find that there is almost a perfect match. This is why word2vec classically works in a position-agnostic fashion, i.e., it only looks at which words appear in the context of a target word, but not where.
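The effect can be reproduced with a few lines of code. The following illustrative sketch (not taken from the word2vec implementation) compares the two context windows position by position and as sets:

sentence_1 = "In 2021 John met Mary in a club in Berlin".split()
sentence_2 = "In this club in Berlin Julia and John met in 2021".split()

def context(tokens, target, window=5):
    # all tokens within the window around the target, without the target itself
    i = tokens.index(target)
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

ctx_mary = context(sentence_1, "Mary")
ctx_julia = context(sentence_2, "Julia")

positional_overlap = sum(a == b for a, b in zip(ctx_mary, ctx_julia))  # hardly any overlap
set_overlap = len(set(ctx_mary) & set(ctx_julia))                      # almost all tokens shared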
For capturing the semantics of words, especially in languages with a certain degree of freedom with respect to part of speech ordering, such approaches are fully eligible. If we, on the other hand, look at the sequences extracted from knowledge graphs, the world looks a bit different. Consider, for example, the following sequences of tokens, extracted with graph walks from the graph depicted in Fig. 1.1: w−2 John – Mary
w−1 likes – bornIn
w0 Mary John Paris
w1 bornIn bornIn isA
w2 Paris Berlin City
In that example, all three entities in the w0 position share one common element with the others, i.e., the token bornIn. However, the presence of this token in w1 indicates that w0 is a person (since persons have birthplaces), whereas the presence of the token in w−1 indicates that w0 is a city (since cities are birthplaces). Therefore, an embedding approach that is capable of performing the projection of similar entities together (and the projection of dissimilar entities, like persons and cities, further away from one another) should be capable of detecting such differences.6 This observation is the motivation of order-aware RDF2vec (Portisch and Paulheim 2021). Essentially, it uses the same walk extraction mechanism as RDF2vec, combined with an order-aware variant of word2vec (Ling et al. 2015a). Figure 4.3 shows the order-aware variant of skip-gram in contrast to the standard skip-gram architecture; for CBOW, a similar adaptation of the original word2vec architecture exists.
Fig. 4.3 Classic skip-gram architecture (left) and structured skip-gram architecture (right) (Portisch and Paulheim 2021)
6 Please note that both examples are strongly simplified for the sake of illustration. In practice, the representation of a word or an entity is learned from a multitude of sentences or walks, rather than a single sentence or walk.
It can be seen from the picture that including order awareness slightly increases the model size. While standard skip-gram has one joint weight matrix for all of the outputs (and therefore cannot distinguish words appearing in different positions), the adaptation proposed by Ling et al. (2015a) introduces a different weight matrix for each output position. Therefore, the order-aware variant requires more training parameters. Experiments, however, have shown that RDF2vecoa still scales to large-scale knowledge graphs like DBpedia.
4.2.2 Evaluation
Like other variants, RDF2vec_oa has also been evaluated on the datasets of the GEval gold standard. The results are depicted in Table 4.4, together with the results of e-walks and p-walks (see below). It can be observed that in many cases, the order-aware variant is superior or at least comparable to the classic variant. In particular, RDF2vec_oa has its strengths on the tasks which involve entities from different classes and require a separation between them, i.e., the clustering and the semantic analogies tasks,7 whereas the difference is not as drastic in tasks where only entities of one class are involved (e.g., the classification and regression tasks).
4.2.3 Order-Aware RDF2vec in Action
Listing 4.2 provides a simple example of how the order-aware RDF2vec algorithm can be used on a UNIX system.8 On the artist classification running example, order-aware RDF2vec achieves an accuracy of 0.650 ± 0.077, i.e., a score below the standard RDF2vec variant. One assumption is that due to the relatively constrained example graph, the advantage of respecting order is not that big, and the complexity of the model (which is larger than for standard RDF2vec, as discussed above) has a higher tendency to overfit the data.
7 In this task, it is important to return an entity of the right class, e.g., for solving Berlin is to Germany as Paris is to ?, the result must be from the class Country.
8 The snippet assumes that the jRDF2vec JAR has been downloaded and placed in the same directory. It is further assumed that the compiled wang2vec project has been placed in the same directory. For an extensive user guide with examples, we refer the reader to the GitHub repository: https://github.com/dwslab/jRDF2Vec.
Table 4.4 Result of the 12 RDF2vec variants on 20 tasks (classification, clustering, regression, semantic analogies, entity relatedness, and document similarity; classic RDF2vec, e-RDF2vec, and p-RDF2vec, each trained as cbow, cbow_oa, sg, and sg_oa). The best score for each task is printed in bold. The suffix oa marks the ordered variant of RDF2vec
Listing 4.2 Order-aware RDF2vec example with jRDF2vec

import pandas as pd

# Generate walks
! java -jar ./jrdf2vec.jar -onlyWalks -graph ./artists_graph.nt

# Unzip generated walks
! gunzip ./walks/walk_file_0.txt.gz

# Run the order-aware embedding generation (wang2vec structured skip-gram)
!./word2vec -train ./walks/walk_file_0.txt -type 3 -min-count 0 -cap 1 \
    -output oa100.txt

# Load generated embeddings into a dataframe
dfVectors = pd.read_csv("./oa100.txt", sep=" ", skiprows=[0])
dfVectors.columns = ["entity"] + [f"v{i}" for i in range(0, 101)]
# drop the last (empty) column
dfVectors = dfVectors[["entity"] + [f"v{i}" for i in range(0, 100)]]
4.3 Alternative Walk Strategies
The strategies considered so far extract walks of alternating properties and entities which actually exist as paths in a knowledge graph. In this section, we will look into more possible strategies for extracting walks.
4.3.1 Entity Walks and Property Walks
Entity walks and property walks have been proposed in Portisch and Paulheim (2022). They are essentially subsets of the original graph walks and aim at representing different aspects in knowledge graphs.
4.3.1.1 Motivation and Definition
In Chap. 3, we have discussed that different aspects of knowledge graphs may be interesting to be learned by a knowledge graph embedding method. In the DLCC benchmark, there are classification problems like all instances with a particular relation (tc01) or all instances with a relation to a particular individual (tc04). The idea of entity walks and property walks is to create random walks which capture the subsets of graphs that are required to learn embeddings capturing exactly those classes. While one could argue that standard RDF2vec embeddings trained on the full walks should
Fig. 4.4 Illustration of entity walks and property walks (Portisch and Paulheim 2022)
also be capable of learning such classes, we will later also discuss further implications on the resulting vector spaces.
In Chap. 2, we have defined random walks as a sequence

walk_n := w_{-n/2}, w_{-n/2+1}, ..., w_{-1}, w_0, w_1, ..., w_{n/2-1}, w_{n/2}

where w_i ∈ V if i is even, and w_i ∈ R if i is odd. Given a walk walk_n, the corresponding entity walks (e-walks) and property walks (p-walks) are then defined as subsets of walk_n as follows:

e-walk_n := w_{-n/2}, w_{-n/2+2}, ..., w_{-2}, w_0, w_2, ..., w_{n/2-2}, w_{n/2}          (4.2)
p-walk_n := w_{-n/2+1}, w_{-n/2+3}, ..., w_{-1}, w_0, w_1, ..., w_{n/2-3}, w_{n/2-1}      (4.3)

In an e-walk, all elements are elements of V (i.e., entities), whereas in a p-walk, all elements but w_0 are elements of R, while w_0 ∈ V. Figure 4.4 illustrates the relation between e-walks and p-walks, and how they can be derived from standard walks.9
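As a minimal sketch of that relation (in practice, such walks are generated directly rather than derived, cf. footnote 9), the following illustration assumes a walk given as a list alternating between entities at even positions and properties at odd positions, with the focus entity in the middle:

def split_walk(walk):
    center = len(walk) // 2
    e_walk = [w for i, w in enumerate(walk) if i % 2 == 0]                 # entities only
    p_walk = [w for i, w in enumerate(walk) if i % 2 == 1 or i == center]  # properties plus w_0
    return e_walk, p_walk

walk = ["Trent_Reznor", "member_of", "Nine_Inch_Nails", "genre", "Industrial_Rock"]
e_walk, p_walk = split_walk(walk)
# e_walk: ['Trent_Reznor', 'Nine_Inch_Nails', 'Industrial_Rock']
# p_walk: ['member_of', 'Nine_Inch_Nails', 'genre']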
4.3.1.2 Evaluation
Like other approaches, entity walks and property walks have been evaluated on the GEval benchmark. The results are depicted in Table 4.4. They were achieved by generating 500 walks per node with a depth of 4, i.e., 4 node hops were performed. A dimension of 200 was used for the embedding space. It can be observed that classic RDF2vec performs better in
9 It is important to note that although they can be derived from standard walks, it is usually much faster to generate those walks directly instead of generating standard walks first and then deriving the e-walks and p-walks.
Table 4.5 Five nearest neighbors to Mannheim in RDF2vec (classic), p-RDF2vec, and e-RDF2vec trained on DBpedia (SG) (Portisch and Paulheim 2022)

#   RDF2vec                p-RDF2vec   e-RDF2vec
1   Ludwigshafen           Arnsberg    Ludwigshafen
2   Peter Kurz             Frankfurt   Timeline of Mannheim
3   Timeline of Mannheim   Tehran      Peter Kurz
4   Karlsruhe              Bochum      Adler Mannheim
5   Adler Mannheim         Bremen      Peter Kurze
most of the cases, but e-RDF2vec solves the task of entity relatedness rather well, whereas p-RDF2vec has an advantage on the document relatedness task. In order to understand those results, we need to dig a bit deeper into what these embedding methods actually learn.
p-walks focus on the relations that are going into and out of an entity, without looking at which entity is actually connected. They thus consider the structures that surround an entity, rather than the actual entities, and will embed things that are structurally similar (e.g., places, people, events) more closely to each other than related things in different structures. As a consequence, the distance metric in the resulting vector space will reflect similarity rather than relatedness.
e-walks, on the other hand, focus only on the surrounding entities, not the way they are connected. For example, the contexts for Mannheim and Adler Mannheim (the local ice hockey team in Mannheim) in Fig. 4.4 are quite similar when considering the e-walks. A downstream embedding model will therefore embed related entities (such as a city, a sports team in that city, and their home stadium) more closely, rather than structurally similar, but unrelated entities. As a consequence, the distance metric in the resulting vector space will reflect relatedness rather than similarity.
To illustrate this, Table 4.5 depicts the nearest neighbors to the entity Mannheim in DBpedia, according to RDF2vec, p-RDF2vec, and e-RDF2vec. We can observe that p-RDF2vec lists exclusively cities, while e-RDF2vec has only one city (Ludwigshafen) among other things related to Mannheim—an entity derived from the timeline of Mannheim, Peter Kurz (the mayor of Mannheim), the aforementioned ice hockey team, and a person named Peter Kurze.10
10 That latter entity is unrelated to Mannheim; however, in the DBpedia graph, one of the few statements about this entity is that it is different from Peter Kurz, who, in turn, is related to Mannheim. This leads to a large fraction of multi-hop walks starting in the entity Peter Kurze containing the entity Mannheim and other entities related to Mannheim, making it ultimately end up close to Mannheim in the vector space. This anecdotal example shows that explicit negative information (here: an entity not being related to another entity) is not very well picked up by RDF2vec, and even has a contrary effect.
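Nearest-neighbor lists like the ones in Table 4.5 can be obtained directly from a trained model. A minimal sketch using gensim (the model file name and the exact entity identifier are assumptions and depend on the training setup) could look like this:

from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load("./walks/model.kv", mmap="r")
print(word_vectors.most_similar("http://dbpedia.org/resource/Mannheim", topn=5))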
With that understanding in mind, we can now try to re-interpret the results in Table 4.4. It is clear that entity relatedness benefits from an embedding that focuses on relatedness. Document similarity, on the other hand, can be determined by looking at the classes of entities involved (an article mentioning sports teams is a sports article, an article mentioning politicians is about politics, etc.). Therefore, an embedding that focuses on entity similarity (in the sense of: entities belonging to the same class) has an advantage here. Figure 4.5 shows four scatterplots for different variants of RDF2vec, using classic walks, e-walks, and p-walks, and a separation by three classes, i.e., music works, record labels, and genres. It can be observed that a class-wise separation cannot be achieved using e-walks,
Fig. 4.5 Class-wise scatter plots for the example knowledge graph with RDF2vec classic (a: standard), RDF2vec_oa (b: order-aware), e-walks (c), and p-walks (d)
Listing 4.3 Using e-walks in jRDF2vec

file_to_embed = "./artists_graph.nt"

# Call jRDF2vec - this works only in Jupyter
! java -jar jrdf2vec.jar -graph $file_to_embed -dimension 50 -numberOfWalks 500 \
    -walkGenerationMode EXPERIMENTAL_NODE_WALKS_DUPLICATE_FREE

# Load vectors for further processing in Python
from gensim.models import KeyedVectors
kv_file = "./walks/model.kv"
word_vectors = KeyedVectors.load(kv_file, mmap='r')
while p-walks and especially classic RDF2vec perform better at separating those classes. For the sake of completeness, vectors for RDF2vec_oa are also shown. It is remarkable that the variants are not only capable of separating classes to a different extent, but also differ in which classes they can separate more easily. In the example, standard RDF2vec can separate the genre class quite well from the others, but mixes work and label, while the p-walks variant can separate work more easily, but mixes genre and label.
4.3.1.3 Entity Walks and Property Walks in Action
Entity walks and property walks are available in the jRDF2vec implementation, although still as an experimental feature at the time of writing.11 Listing 4.3 shows the code for using e-walks with jRDF2vec.12 For the node classification example, the accuracy with e-walks is 0.680 ± 0.135, and the accuracy with p-walks is 0.645 ± 0.133. This shows that in this example, the actual entities surrounding an entity at hand help more in the classification task than the more structural information on the relations the entity is involved in.13
4.3.2 Further Walk Extraction Strategies
In Steenwinckel et al. (2021), we have explored various further strategies of creating walks from graphs. In the following, we will briefly discuss those walk types and their intuition.
11 Note that the exact syntax of the code might change once this becomes an official feature.
12 The corresponding walk type option for p-walks would be EXPERIMENTAL_MID_EDGE_WALKS_DUPLICATE_FREE.
13 Caveat: you may not directly compare the accuracies of jRDF2vec and pyRDF2vec, because smaller differences may also be explained by subtly different implementations, and/or different random seeds.
Fig. 4.6 Alternative walk extraction strategies (Steenwinckel et al. 2021)
The family of walks from that paper is visualized in Fig. 4.6. Further information on the different strategies, including pseudocode examples, can be found in Steenwinckel et al. (2021).
4.3.2.1 Anonymous Walks
Anonymous walks try to focus on the structural information or graph topology more than the actual relations and entities. Anonymous walks are created from random walks as follows: given a walk w = v_0 → v_1 → ... → v_n, it is transformed into f(v_0) → f(v_1) → ... → f(v_n) with f(v_i) = min({j | w[j] = v_i}), which corresponds to the first index at which v_i can be found in the walk w (Ivanov and Burnaev 2018). Apart from their different focus (i.e., graph topology instead of the semantic neighborhood), ignoring the labels also allows for a computationally efficient generation of the walks.
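A minimal sketch of the anonymization step (an illustration of the definition above, not taken from an existing library) could look as follows:

def anonymize(walk):
    first_index = {}
    anonymous = []
    for position, element in enumerate(walk):
        # remember the position at which each element was seen first
        if element not in first_index:
            first_index[element] = position
        anonymous.append(first_index[element])
    return anonymous

# two structurally identical walks yield the same anonymous walk
anonymize(["A", "p", "B", "q", "A"])  # [0, 1, 2, 3, 0]
anonymize(["X", "r", "Y", "s", "X"])  # [0, 1, 2, 3, 0]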
4.3.2.2 Walklets
The idea of walklets has been proposed by Perozzi et al. (2017). They represent each entity by the set of entities and edge labels that surround it, not taking the actual relation and distance into account. Formally, a walklet is a walk of length 2, which consists of the starting entity and one other element appearing in a walk: given a walk w = v_0 → v_1 → ... → v_n, we can construct the set of walklets {(v_0, v_i) | 1 ≤ i ≤ n}.
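Following that definition, walklets could be derived from a walk as in the following sketch (illustrative only, not taken from a library):

def walklets(walk):
    # pair the starting element with every later element, discarding order and distance
    v0 = walk[0]
    return {(v0, v) for v in walk[1:]}

walklets(["Nine_Inch_Nails", "band_member", "Trent_Reznor", "residence", "Los_Angeles"])
# {('Nine_Inch_Nails', 'band_member'), ('Nine_Inch_Nails', 'Trent_Reznor'),
#  ('Nine_Inch_Nails', 'residence'), ('Nine_Inch_Nails', 'Los_Angeles')}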
4.3.2.3 Hierarchical Random Walks (HALK)
The idea of hierarchical random walks is to focus on more common entities in the walks and to ignore less commonly known ones, since the latter are expected to contribute less to the information about an entity, and hence should not be reflected in the embedding. Similar to the idea of removing rare words from text corpora in preprocessing for text mining (Allahyari et al. 2017), HALK removes rare entities from random walks. Formally, given a walk w = v_0 → v_1 → ... → v_n, the corresponding HALK walk is obtained by removing all entities from {v_1, ..., v_n} whose frequency is below a threshold t. As argued by Schlötterer et al. (2019), the removal of rare entities from the random walks can increase the quality of the generated embeddings while decreasing memory usage. For use cases where the entities at hand are not long-tail entities, HALK can be a suitable alternative to pure random walks, whereas it might perform less well for tasks involving more long-tail entities.
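The following sketch illustrates the filtering idea on a corpus of walks (using a relative frequency threshold; the concrete thresholding in pyRDF2vec's HALKWalker may differ in detail):

from collections import Counter
from itertools import chain

def halk(walks, threshold):
    counts = Counter(chain.from_iterable(walks))
    total = sum(counts.values())
    # keep the starting entity, drop rare elements from the rest of each walk
    return [[walk[0]] + [e for e in walk[1:] if counts[e] / total >= threshold]
            for walk in walks]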
4.3.2.4 N-Gram Walks
The idea of n-gram walks is to unify walks that have only small differences. It combines two techniques: replacing subsequences of walks, i.e., n-grams of subsequent elements, with a new label, and injecting wildcards into those n-grams, which allows similar, but not identical sequences to be replaced by the same label (Vandewiele et al. 2019). As a result, there is already a generalization step involved in the walk generation, not only in the embedding learning and the downstream processing.
4.3.2.5 Community Hops
Community hops introduce teleportation between nodes which are structurally similar to one another, i.e., nodes that share the same set of properties. Especially for incomplete knowledge graphs following the open world assumption, they, in a very basic sense, heuristically infer more information for entities that are scarcely described. For community hops, communities of similar nodes are first detected in an unsupervised fashion (Fortunato 2010), using the Louvain method (Blondel et al. 2008), which provides a good trade-off between speed and clustering quality. However, there currently exists no implementation that works for larger knowledge graphs like DBpedia.
Table 4.6 Results of alternative walk strategies on the GEval classification and regression tasks, as reported in Steenwinckel et al. (2021)

             Classification (accuracy)                 Regression (RMSE)
Strategy     AAUP   Cities  Forbes  Albums  Movies     AAUP    Cities  Forbes  Albums  Movies
Random       67.94  79.07   63.73   75.24   80.06      64.81   14.22   35.79   11.68   17.58
Anonymous    54.73  55.34   55.16   54.45   59.40      103.41  23.08   38.49   15.67   24.92
Walklets     69.27  79.08   62.28   79.99   78.89      70.32   15.66   36.20   11.66   18.53
HALK         60.08  73.36   60.98   66.89   68.11      86.52   17.73   37.25   14.06   21.72
N-grams      66.96  79.79   63.65   79.38   78.84      69.16   19.45   36.02   11.74   17.81
Table 4.7 Results of alternative walk strategies on the GEval entity relatedness and document similarity tasks, as reported in Steenwinckel et al. (2021)

             Entity relatedness       Document similarity
Strategy     Kendall's τ              r        ρ        μ
Random       0.523                    0.578    0.390    0.466
Anonymous    0.243                    0.321    0.324    0.322
Walklets     0.520                    0.528    0.372    0.437
HALK         0.424                    0.455    0.376    0.412
N-grams      0.483                    0.551    0.353    0.431
4.3.2.6 Evaluation
The different alternative walk strategies have also been evaluated on GEval tasks, namely classification, regression, entity relatedness, and document similarity. The results are depicted in Tables 4.6 and 4.7. As discussed above, the implementation of community hops is not scalable enough to be run on a graph like DBpedia; therefore, they are not contained in the result tables.
From the tables, it can be seen that, while random walks usually perform quite well, Walklets and n-gram walks in particular can be a suitable alternative for some tasks. Since, despite the scaling issues, Steenwinckel et al. (2021) report promising results for community hop-based walks on smaller knowledge graphs, those can also be considered a viable alternative when smaller-scale knowledge graphs are involved.
4.3.2.7 Alternative Walk Strategies in Action
The pyRDF2vec library supports all the walk strategies discussed above. For the node classification problem used as a running example, we obtain an accuracy of 0.540 ± 0.073 for anonymous walks, 0.590 ± 0.073 for Walklets, 0.745 ± 0.065 with HALK, 0.720 ± 0.093 for N-gram Walks, and 0.565 ± 0.055 for community hops, all with the respective standard
Listing 4.4 Using alternative walk strategies in pyRDF2vec. The code snippet only shows the random walk definition; the remaining code is equivalent to Listing 2.1

from pyrdf2vec.walkers import HALKWalker

walker = HALKWalker(4, 250)
configurations in pyRDF2vec. Listing 4.4 shows the usage of alternative walk strategies in pyRDF2vec.
4.4 RDF2vec with Materialized Knowledge Graphs
Since no knowledge graph can contain all information in the world, all knowledge graphs are inherently incomplete (Paulheim 2017). While most missing knowledge in a knowledge graph cannot be added trivially, there are some pieces of knowledge that can easily be added. One example is symmetric relations: if we observe a triple <A, r, B>, and we know that r is a symmetric relation, we can add the triple <B, r, A> if it does not exist in the knowledge graph. For that, the schema or ontology of a knowledge graph has to be used, which defines that information (here: the symmetry of a relation), and a reasoner can perform the task of materialization, also known as computing the closure (Abburu 2012).14
One typical example of a symmetric relation is the spouse relation in DBpedia. There are 18.9 k spouse relations that are present in both directions, whereas 23.9 k only exist in one direction. Hence, the relation is notoriously incomplete, and a knowledge graph completion approach exploiting the symmetry of the spouse relation could directly add 23.9 k missing axioms. Another example is inverse properties: there are 10.6 k relations of the type doctoral advisor, but only 4.6 k of the type doctoral student, although one can be inferred from the other.
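As a sketch of what such a materialization step could look like for symmetric properties (using rdflib; the input file name is a placeholder, and a full reasoner would additionally handle inverse and transitive properties as well as subproperties):

from rdflib import Graph, URIRef

spouse = URIRef("http://dbpedia.org/ontology/spouse")

def materialize_symmetric(graph, symmetric_properties):
    added = 0
    for prop in symmetric_properties:
        # collect the missing reverse triples <B, r, A> for every <A, r, B>, then add them
        missing = [(o, p, s) for s, p, o in graph.triples((None, prop, None))
                   if (o, p, s) not in graph]
        for triple in missing:
            graph.add(triple)
        added += len(missing)
    return added

g = Graph()
g.parse("dbpedia_excerpt.nt", format="nt")   # placeholder input file
print(materialize_symmetric(g, [spouse]), "triples added")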
4.4.1 Idea
One might assume that RDF2vec embeddings would profit from more complete information. Hence, in Iana and Paulheim (2020), we pursued the approach of first materializing such missing inferences by exploiting symmetric, inverse, and transitive relations, and compared the downstream performance of RDF2vec embeddings on the resulting graph to those obtained on the original ones.
14 Note that while this holds in theory, real-world knowledge graphs often pose practical scalability challenges to existing reasoners (Heist and Paulheim 2021).
For the experiments in the paper, DBpedia version 2016-10 was used. Since the original DBpedia ontology provides information about subproperties, but does not define any symmetric, transitive, and inverse properties, we first had to enrich the ontology with such axioms.
4.4.1.1 Enrichment Using Wikidata
The first strategy is utilizing owl:equivalentProperty links to Wikidata (Vrandečić and Krötzsch 2014). We mark a property P in DBpedia as symmetric if its Wikidata equivalent has a symmetric constraint in Wikidata,15 and we mark it as transitive if its Wikidata equivalent is an instance of the Wikidata class transitive property.16 For a pair of properties P and Q in DBpedia, we mark them as inverse if their respective equivalent properties in Wikidata are defined as the inverse of one another.17
4.4.1.2 Enrichment Using DL-Learner
The second strategy is applying DL-Learner (Lehmann 2009) to learn additional symmetry, transitivity, and inverse axioms for enriching the ontology. After inspecting the results of DL-Learner, and to avoid false T-box axioms, we used thresholds of 0.53 for symmetric properties and 0.45 for transitive properties. Since the list of pairs of inverse properties generated by DL-Learner contained quite a few false positives (e.g., dbo:isPartOf being the inverse of dbo:countySeat as the highest scoring result), we manually filtered the top results and kept 14 T-box axioms which we rated as correct.
4.4.1.3 Materializing the Enriched Graphs
In both cases, we identify a number of inverse, transitive, and symmetric properties, as shown in Table 4.8. The symmetric properties identified by the two approaches highly overlap, while the inverse and transitive properties identified differ a lot.
With the enriched ontology, we infer additional A-box axioms on DBpedia. We use two settings, i.e., all subproperties plus (a) all inverse, transitive, and symmetric properties found using mappings to Wikidata, and (b) all plus all inverse, transitive, and symmetric properties found with DL-Learner. The inference of additional A-box axioms was done in iterations. In each iteration, additional A-box axioms were created for symmetric, transitive, inverse, and subproperties. Using this iterative approach, chains of properties could also be respected. For example, from the axioms
15 https://www.wikidata.org/wiki/Q21510862. 16 https://www.wikidata.org/wiki/Q18647515. 17 https://www.wikidata.org/wiki/Property:P1696.
Cerebellar_tonsil isPartOfAnatomicalStructure Cerebellum .
Cerebellum isPartOfAnatomicalStructure Hindbrain .

and the two identified T-box axioms

isPartOf a owl:TransitiveProperty .
isPartOfAnatomicalStructure rdfs:subPropertyOf isPartOf .

the first iteration adds

Cerebellar_tonsil isPartOf Cerebellum .
Cerebellum isPartOf Hindbrain .

whereas the second iteration adds

Cerebellar_tonsil isPartOf Hindbrain .
The materialization process is terminated once no further axioms are added. This happens after two iterations for the dataset enriched with Wikidata, and three iterations for the dataset enriched with DL-Learner. The size of the resulting datasets is shown in Table 4.8. It can be observed that in relation to the original graphs, the number of added triples is not too large, as it is in the order of magnitude of 1–2%. On all three graphs (Original, Enriched Wikidata, and Enriched DL-Learner), we computed RDF2vec embeddings with 500 random graph walks per node of depth 4 and 8, respectively, and trained skip-gram embeddings of dimensionality 200 and 500, using the default parameters of pyRDF2vec otherwise.
4.4.2 Experiments
For the experiments in the paper, we used tasks from the GEval evaluation framework. The results depicted in Table 4.9 report the best values achieved on each task for the original and the enriched knowledge graphs, both for walks of length 4 and 8, and using 500 dimensions for the embedding vectors.
From those experiments, we can make different observations. For the classification task, the results are always better on the original graphs. For the regression tasks, there are a few cases where materialization helps a bit, while in more cases, the results on the original graph are still superior. The results are similar on the entity relatedness task.
Moreover, while results on the original graphs, in many cases, benefit from longer walks, this is not the case for the results on the enriched graphs, where the deterioration is often
Table 4.8 Enriched DBpedia versions used in the experiments. The upper part of the table depicts the number of T-box axioms identified with the two enrichment approaches, and the lower part depicts the number of A-box axioms created by materializing the A-box according to the additional T-box axioms

                              Original     Enriched Wikidata   Enriched DL-Learner
T-box subproperties           75           0                   0
T-box inverse properties      0            8                   14
T-box transitive properties   0            7                   6
T-box symmetric properties    0            3                   7
A-box subproperties           –            122,491             129,490
A-box inverse properties      –            44,826              159,974
A-box transitive properties   –            334,406             415,881
A-box symmetric properties    –            4,115               35,885
No. of added triples          –            505,838             741,230
No. of total triples          50,000,412   50,506,250          50,741,642
Table 4.9 Result of RDF2vec on original and materialized graphs, using axioms from Wikidata (WD) and DL-Learner (DL-L) for the latter

                                          Depth = 4                      Depth = 8
Task                 Dataset              Original  WD      DL-L         Original  WD      DL-L
Classification       AAUP                 0.670     0.641   0.651        0.658     0.603   0.598
                     Cities               0.814     0.811   0.805        0.838     0.815   0.819
                     Forbes               0.606     0.582   0.567        0.611     0.566   0.569
                     Metacritic albums    0.766     0.705   0.701        0.739     0.594   0.611
                     Metacritic movies    0.728     0.674   0.677        0.709     0.631   0.638
Regression           AAUP                 92.310    93.715  92.800       92.002    97.390  95.408
                     Cities               15.696    15.168  14.594       11.874    15.118  15.055
                     Forbes               39.468    38.511  38.482       40.827    39.864  40.647
                     Metacritic albums    12.422    13.713  13.934       12.824    15.114  15.131
                     Metacritic movies    21.911    23.895  23.882       23.126    25.127  24.396
Entity relatedness                        0.678     0.599   0.623        0.633     0.529   0.550
Document similarity                       0.180     0.154   0.217        0.185     0.184   0.213
more drastic after the materialization when longer walks are used. For a closer inspection of this effect, we looked at the entity relatedness task, where we found that only in a few cases, the degree of the entities at hand changed. This hints at the effects (both positive and negative) being mainly caused by information being added to the neighboring entities of the entities at hand, which has a higher likelihood of being reflected in longer walks.
Finally, for document similarity, we see a different picture. Here, the results on the non-materialized graphs are always outperformed by those obtained on the materialized graphs, regardless of whether the embeddings were computed on the shorter or longer walks. One core difference between LP50 and the other datasets is that the entities in the LP50 dataset have by far the largest average degree (2,088, as opposed to only 18 and 19 for the MetacriticMovies and MetacriticAlbums datasets, respectively). Due to the already pretty large degree, it is less likely that the materialization skews the distributions in the random walks too much, and, instead, it may actually add meaningful information. Another possible reason is that the entities in LP50 are very diverse (as opposed to a uniform set of cities, movies, or albums) and that in such a diverse dataset, the effect of materialization is different, as it tends to add heterogeneous rather than homogeneous information to the walks.
Given the observation that materializing implicit information rather harms than helps in most cases, we conducted some further studies on the generated walks. To that end, we computed distributions of all properties occurring in the random graph walks, for both strategies and for both depths of 4 and 8, which are depicted in Fig. 4.7. From those figures, we can observe that the distribution of properties in the walks extracted from the enriched graphs is drastically different from those on the original graphs; the Pearson correlation of the distributions in the enriched and original case is 0.44 for walks of depth 4 and only 0.21 for walks of depth 8. The property distributions among the two enrichment strategies, on the other hand, are very similar, with the respective distributions exposing a Pearson correlation of more than 0.99.
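Such an analysis can be reproduced with a few lines of code. The following sketch (file names are placeholders; it assumes one space-separated walk per line with properties at the odd positions) computes the property distribution of two walk corpora and their Pearson correlation:

from collections import Counter
import pandas as pd

def property_distribution(walk_file):
    counts = Counter()
    with open(walk_file) as f:
        for line in f:
            tokens = line.split()
            counts.update(tokens[1::2])   # properties sit at the odd positions
    total = sum(counts.values())
    return pd.Series({p: c / total for p, c in counts.items()})

original = property_distribution("walks_original.txt")
enriched = property_distribution("walks_enriched.txt")
aligned = pd.concat([original, enriched], axis=1).fillna(0)
print(aligned.corr(method="pearson").iloc[0, 1])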
Fig. 4.7 Distribution of top 10 properties in the generated walks
Another observation from the graphs is that the distribution is much more uneven for the walks extracted from the enriched graphs, with the most frequent properties being present in the walks at a rate of 14–18%, whereas the most frequent property has a rate of about 3% in the original walks. The three most prominent properties in the enriched case—location, country, and locationCountry—altogether occur in about 20% of the walks in the depth 4 setup and even 30% of the walks in the depth 8 setup. This means that information related to locations is over-represented in walks extracted from the enriched graphs. As a consequence, the embeddings tend to focus on location-related information much more. This observation might be a possible explanation for the degradation in results on the music and movies datasets being more drastic than, e.g., on the cities dataset.
Another interpretation of the results is that information in the DBpedia knowledge graph is not missing at random (Newman 2014). An edge from A to B is contained in DBpedia if a human editor of Wikipedia made the decision that it should be included in A's infobox. The fact that a symmetric property exists as <A, r, B>, but not as <B, r, A>, may therefore hint at the fact that humans considered it an important piece of information describing A, but not an important piece of information describing B. This human notion of importance can be interpreted as a signal (just like the human transition probabilities above), which is eradicated when materializing inferred statements. One example of the aforementioned spouse relation that only exists in one direction is

Ayda_Field spouse Robbie_Williams .
Ayda Field is mainly known for being the wife of Robbie Williams, while Robbie Williams is mostly known as a musician. This is encoded by having the relation represented in one direction, but not the other. By adding the reverse edge, we cancel out the information that the original statement is more important than its inverse.
Adding inverse relations may have a similar effect. One example in our dataset is the completion of doctoral advisors and students by exploiting the inverse relationship between the two. For example, the fact

Georg_Joachim_Rheticus doctoralAdvisor Nicolaus_Copernicus .

is contained in DBpedia, while its inverse

Nicolaus_Copernicus doctoralStudent Georg_Joachim_Rheticus .
is not (since Nicolaus Copernicus is mainly known for other achievements). Adding the inverse statement makes the random walks equally focus on the more important statements about Nicolaus Copernicus and the ones considered less relevant. The transitive property adding most axioms to the A-box is the isPartOf relation. For example, chains of geographic containment relations are usually materialized, e.g., two cities
in a country being part of a region, a state, etc., ultimately also being part of that country. For one, this under-emphasizes differences between those cities by adding a statement which makes them more equal. Moreover, there usually is a direct relation (e.g., country) expressing this in a more concise way, so that the information added is also redundant.
4.4.3 RDF2vec on Materialized Graphs in Action
In our running example for artist classification, we use the dataset and the additional ontology axioms found with DL Learner in the previous experiment. Materialization of the additional axioms was done using the Pellet reasoner (Sirin et al. 2007). On the running example, we obtain an accuracy of 0.750±0.100, i.e., a small improvement, not a degradation as in the examples before. For our example graph, the number of triples increases by about 24%, the large majority of which are triples using properties from the DOLCE ontology. Since DBpedia version 3.9, released in 2013, mappings of the DBpedia ontology to DOLCE-Zero (Gangemi et al. 2003; Gangemi and Mika 2003), a subset of the modules of the formal ontology DOLCE, are included in the DBpedia ontology (Paulheim and Gangemi 2015). A closer look at the generated new properties shows that they mostly deal with bands and their members, adding, e.g., an additional has Member edge for each band member and former band member edge, and also for each associated music artist edge. The latter case is rather questionable semantically, and most likely due to an erroneous subproperty. At the same time, it occurs twice as often as the other two combined, showing that a majority of the inferred axioms can be assumed to not be semantically correct. This raises the important question: if the majority of the inferred axioms is wrong, why does their inclusion improve the resulting embeddings? In this case, they emphasize the information about band members and associated artists, since they now exist with multiple properties. Therefore, a random walker has a higher likelihood to transit from a band to its member or a related artist, which is obviously a relevant signal for the task at hand (classifying the genre of a music artist). However, this closer analysis shows that in the case at hand, the positive effect of materialization is rather an artifact than allowing us to conclude that it adds meaningful information which the embedding method can pick up upon. In this case, it rather has the effect of putting stronger weights on certain edges than on others—and just by luck, those are the edges that encode the information which is useful in the downstream task. To summarize the findings in this section: materialization may help but does not necessarily do so. In particular, in cases where information is missing at random, we can assume that materialization may help in improving embeddings for downstream tasks, whereas in cases where information is missing not at random, materializing axioms which are not encoded explicitly may actually cancel out a signal present in the knowledge graph. On the other hand, one may observe positive effects of materialization in certain cases, but they may rather be
artifacts of the knowledge graph and its ontology than true evidence that materialization is helpful.

Table 4.10 Feature comparison of RDF2vec implementations (at the time of writing)

Feature                  | pyRDF2vec | jRDF2vec
Internal graph weights   | ✓         | –
External graph weights   | –         | –
Order-aware RDF2vec      | –         | (✓)
e-walks and p-walks      | –         | ✓
Further walk strategies  | ✓         | –
RDF2vec Light            | ✓         | ✓
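The materialization step described in Sect. 4.4.3 was carried out with Pellet, which is a Java-based reasoner. As a rough Python-based substitute, A-box materialization can be sketched with rdflib and owlrl as follows; this is only an approximation under OWL 2 RL semantics and will not produce exactly the same inferences as Pellet, and the file names are placeholders for the running example.

from rdflib import Graph
import owlrl

# example graph plus the learned ontology axioms (file name is illustrative)
g = Graph()
g.parse("band_graph_with_axioms.ttl")
before = len(g)

# materialize the deductive closure under OWL 2 RL semantics
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

print(f"triples before: {before}, after materialization: {len(g)}")
g.serialize("band_graph_materialized.ttl")

The materialized graph can then be fed into the usual walk extraction and embedding pipeline from the previous chapters.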
4.5
Conclusion
This chapter has shown a larger number of variants of RDF2vec. Many of them can also be, and occasionally have been, combined, e.g., order awareness with e-walks and p-walks. In many cases, one or the other variant works better, and knowledge about the task (e.g., does it need the embedding to focus on relatedness or on similarity?) can help in choosing a suitable variant. Systematic evaluations, such as those enabled by the DLCC benchmark (see Sect. 3.3), may give useful indications on which variant to use for a particular problem. Moreover, while different variants are discussed for the different steps of the RDF2vec embedding pipeline, not all combinations have been fully explored yet.18 In Chap. 2, we have already come across the two most well-known libraries for RDF2vec, i.e., pyRDF2vec and jRDF2vec. Table 4.10 shows an overview of which of the variants discussed in this chapter is supported by which library at the time of writing this book.19
References

Abburu S (2012) A survey on ontology reasoners and comparison. Int J Comput Appl 57(17)
Al Taweel A, Paulheim H (2020) Towards exploiting implicit human feedback for improving rdf2vec embeddings. In: CEUR workshop proceedings, RWTH, vol 2635, pp 1–10
Allahyari M, Pouriyeh S, Assefi M, Safaei S, Trippe ED, Gutierrez JB, Kochut K (2017) A brief survey of text mining: classification, clustering and extraction techniques. arXiv:1707.02919
18 See Chap. 5.
19 Materialization is usually done externally as a preprocessing step and hence not included in the table.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech: Theory Exper 10:P10008
Bordes A, Usunier N, Garcia-Duran A, Weston J, Yakhnenko O (2013) Translating embeddings for modeling multi-relational data. In: Advances in neural information processing systems 26
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1):107–117
Cochez M, Ristoski P, Ponzetto SP, Paulheim H (2017) Biased graph walks for rdf graph embeddings. In: Proceedings of the 7th international conference on web intelligence, mining and semantics, pp 1–12
Comrie B (1989) Language universals and linguistic typology: syntax and morphology. University of Chicago Press
Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
Gangemi A, Guarino N, Masolo C, Oltramari A (2003) Sweetening wordnet with dolce. AI Mag 24(3):13–13
Gangemi A, Mika P (2003) Understanding the semantic web through descriptions and situations. In: OTM confederated international conferences “On the move to meaningful internet systems”. Springer, pp 689–706
Heist N, Paulheim H (2021) The caligraph ontology as a challenge for owl reasoners. In: SemREC 2021: semantic reasoning evaluation challenge 2021, pp 21–31
Iana A, Paulheim H (2020) More is not always better: the negative impact of a-box materialization on rdf2vec knowledge graph embeddings. In: CIKM (Workshops)
Ivanov S, Burnaev E (2018) Anonymous walk embeddings. arXiv:1805.11921
Lehmann J (2009) Dl-learner: learning concepts in description logics. J Mach Learn Res 10:2639–2642
Ling W, Dyer C, Black AW, Trancoso I (2015a) Two/too simple adaptations of word2vec for syntax problems. In: NAACL HLT 2015, ACL, pp 1299–1304
Lin Y, Liu Z, Sun M, Liu Y, Zhu X (2015b) Learning entity and relation embeddings for knowledge graph completion. In: Twenty-ninth AAAI conference on artificial intelligence
Newman DA (2014) Missing data: five practical guidelines. Org Res Methods 17(4):372–411
Paulheim H (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semant Web 8(3):489–508
Paulheim H, Gangemi A (2015) Serving dbpedia with dolce–more than just adding a cherry on top. In: International semantic web conference. Springer, pp 180–196
Perozzi B, Kulkarni V, Chen H, Skiena S (2017) Don’t walk, skip! online learning of multi-scale network embeddings. In: Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017, pp 258–265
Portisch J, Paulheim H (2021) Putting rdf2vec in order. In: International semantic web conference, posters and demonstrations
Portisch J, Paulheim H (2022) Walk this way! entity walks and property walks for rdf2vec. In: Extended semantic web conference 2022, posters and demonstrations
Schlötterer J, Wehking M, Rizi FS, Granitzer M (2019) Investigating extensions to random walk based graph embedding. In: 2019 IEEE international conference on cognitive computing (ICCC), IEEE, pp 81–89
Sirin E, Parsia B, Grau BC, Kalyanpur A, Katz Y (2007) Pellet: a practical owl-dl reasoner. J Web Semant 5(2):51–53
Steenwinckel B, Vandewiele G, Bonte P, Weyns M, Paulheim H, Ristoski P, Turck FD, Ongenae F (2021) Walk extraction strategies for node embeddings with rdf2vec in knowledge graphs. In: International conference on database and expert systems applications. Springer, pp 70–80
Thalhammer A, Rettinger A (2016) PageRank on wikipedia: towards general importance scores for entities. The semantic web: ESWC 2016 satellite events. Springer International Publishing, Crete, Greece, pp 227–240
van Erp M, Mendes P, Paulheim H, Ilievski F, Plu J, Rizzo G, Waitelonis J (2016) Evaluating entity linking: an analysis of current benchmark datasets and a roadmap for doing a better job. In: 10th international conference on language resources and evaluation (LREC)
Vandewiele G, Steenwinckel B, Ongenae F, De Turck F (2019) Inducing a decision tree with discriminative paths to classify entities in a knowledge graph. In: SEPDA2019, the 4th international workshop on semantics-powered data mining and analytics, pp 1–6
Vrandečić D, Krötzsch M (2014) Wikidata: a free collaborative knowledge base. Commun ACM 57(10):78–85. http://dx.doi.org/10.1145/2629489
Wang Z, Zhang J, Feng J, Chen Z (2014) Knowledge graph embedding by translating on hyperplanes. In: Proceedings of the AAAI conference on artificial intelligence, vol 28
5
RDF2vec at Scale
Abstract
On larger knowledge graphs, RDF2vec models can be very expensive to train. In this chapter, we look at two techniques that make RDF2vec easier to use with large knowledge graphs. First, we look at a knowledge graph embedding server called KGvec2go, which serves pre-trained embedding vectors for well-known knowledge graphs such as DBpedia as a service. Second, we look at how we can train partial RDF2vec models only for instances of interest with RDF2vec Light.
5.1
Using Pre-trained Embeddings
Quite a few applications use the same knowledge graphs as background knowledge, in particular general-purpose knowledge graphs such as DBpedia, YAGO, or Wikidata (Heist et al. 2020). Since their usage is so widespread, it can be beneficial to precompute them and provide them as a service, as opposed to large monolithic files for download. This is the main idea of the KGvec2go Web service.1
5.1.1
The KGvec2Go Service
KGvec2go currently serves embedding vectors for four different knowledge graphs: DBpedia (Lehmann et al. 2013) WebIsALOD (Hertling and Paulheim 2017, Seitner et al. 2016), Wiktionary (using the RDF conversion DBnary by Sérasset (2015)), and WordNet (Fellbaum 1998). 1 http://kgvec2go.org/.
Fig. 5.1 KGvec2go user interface (Portisch et al. 2020a): (a) nearest neighbor retrieval, (b) similarity computation
For the walk generation, duplicate-free random walks with four hops (i.e., depth = 8) have been generated. For WordNet and Wiktionary, 500 walks have been calculated per entity. For WebIsALOD and DBpedia, 100 walks have been created in order to account for the comparatively large size of those knowledge graphs. The models were trained with the following configuration: skip-gram vectors, window size = 5, number of iterations = 5, negative sampling for optimization, negative samples = 25. Apart from walk-generation adaptations due to the size of the knowledge graphs, the configuration parameters to train the models have been held constant, and no dataset-specific optimizations have been performed in order to allow for comparability. In addition, a Web API is provided to access the data models in a lightweight way (Portisch et al. 2020a). This allows for easy access to embedding models and for bringing powerful embedding models to devices with restrictions in CPU and RAM, such as smartphones. The server has been implemented in Python using flask2 and gensim (Řehůřek and Sojka 2010) and can be run using Apache HTTP Server. Its code is publicly available on GitHub.3 While the main service of KGvec2go is a REST interface to obtain embedding vectors for entities in JSON format (see the code example below), there are also further services, which exist both as REST APIs and as user interfaces, as shown in Fig. 5.1: the first one allows for retrieving the nearest neighbors of an entity in the RDF2vec embedding space, the second one computes distances between two given entities.
2 https://flask.palletsprojects.com/en/1.1.x/.
3 https://github.com/janothan/kgvec2go-server/.
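As a minimal illustration of this lightweight access mode, the following snippet retrieves two vectors via the get-vector endpoint also used in Listing 5.1 and computes their cosine similarity on the client side. The two entity names are merely examples, and the small helper function is not part of the KGvec2go codebase.

import json
import requests
import numpy as np

def get_vector(entity, kg="dbpedia"):
    # retrieve a single embedding vector from the KGvec2go REST interface
    r = requests.get(f"http://kgvec2go.org/rest/get-vector/{kg}/{entity}")
    x = json.loads(r.text)
    return np.array(x["vector"]) if "vector" in x else None

v1 = get_vector("Berlin")
v2 = get_vector("Paris")
if v1 is not None and v2 is not None:
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print(cosine)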
5.1.2
KGvec2Go in Action
Since the running example in this book is based on entities in DBpedia, we can also retrieve DBpedia vectors from KGvec2go instead of computing the embedding on the graph ourselves. Listing 5.1 shows the corresponding code for performing the classification task with vectors from KGvec2go. There are a few remarks on this example.
First of all, it should be noted that the knowledge graph itself is never accessed directly; the example solely retrieves precomputed vectors. This means that the local processing requirements are much lower, since the machine running the code does not need to store and/or load the knowledge graph.
Second, when retrieving the entities, there is a fallback option in case an entity is not found in DBpedia. In our case, this is due to a version mismatch between the DBpedia knowledge graph used to build the running example and the DBpedia knowledge graph which was used to compute the embeddings served via KGvec2go. In that case, the resulting embeddings are set to 0. In the example, vectors for 25 out of 200 entities are not retrieved.
Third, when executing this code, one may observe that the achieved accuracy is considerably higher than the accuracies observed for the various variants in Chap. 4: while we saw maximum accuracies around 0.75 there, the accuracy with KGvec2go is above 0.85, even with a significant number of embedding vectors set to 0. The reason for this is that the embeddings in KGvec2go were trained on the entire DBpedia knowledge graph, while the results discussed above were obtained only on an excerpt of DBpedia. Hence, the embedding vectors served at KGvec2go were computed on a graph which is richer in information. Moreover, and more importantly, the ground-truth labels in the example graph come from DBpedia itself. They have been excluded when constructing the example graph, but they are contained in the knowledge graph on which the embedding vectors in KGvec2go were trained. Hence, there might also be a certain degree of information leakage.
5.2
Training Partial RDF2vec Models with RDF2vec Light
In many real-world applications, large knowledge graphs are used, but only a small fraction of entities is relevant. As seen in Chap. 1, public knowledge graphs can easily contain millions of entities. The same holds for corporate knowledge graphs. Consider, for example, the development of a movie recommender using a public knowledge graph (see Chap. 7 for a discussion on how to build a recommender system using RDF2vec). If there are, say, ten thousand movies in your database, you will need embedding vectors for exactly those ten thousand movie entities, while the total number of entities in public knowledge graphs might be by at least two orders of magnitude larger. In other words: if you compute embedding vectors for the entire knowledge graph, you compute millions of vectors that are never used.
Listing 5.1 Working with pre-computed vectors from kgvec2go

import json
import numpy as np
import pandas as pd
import requests
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# read input file
df = pd.read_csv('bands_labels.csv', sep="\t")
dfX = df[['Band']]
dfY = df[['Genre']]

# method for retrieving vectors from kgvec2go
def retrieve_kgvec2go_vectors(entities):
    vectors = []
    for e in entities:
        entity = e[e.rindex("/")+1:]
        r = requests.get("http://kgvec2go.org/rest/get-vector/dbpedia/" + entity)
        x = json.loads(r.text)
        # catch the case that an entity is not found in the API
        if 'vector' in x:
            vectors.append(x['vector'])
        else:
            vectors.append(np.zeros(200))
    return vectors

vectors = retrieve_kgvec2go_vectors(dfX['Band'].to_list())
dfXvectors = pd.DataFrame.from_records(vectors)

# use retrieved vectors for classification
clf = MLPClassifier(max_iter=10000)
scores = cross_val_score(clf, dfXvectors, dfY.values.ravel(), cv=10)
Fig. 5.2 Schematic depiction of RDF2vec Light walk generation (Portisch et al. 2020b)
RDF2vec Light (Portisch et al. 2020b) is an approach for computing only embedding vectors for entities of interest. For that purpose, only walks which include those entities are created, usually ignoring larger portions of the knowledge graph, as shown in Fig. 5.2.
5.2.1
Approach
The original RDF2vec approach generates walks for each vertex v ∈ V, where a sentence for a specific vertex v_s always starts with v_s. This walk generation pattern is insufficient for local walks of only a few vertices because the context is not fully reflected: there is a bias towards facts where v_s is the subject of a statement, whereas facts where v_s appears as an object would only occur if another entity of interest has a path that passes v_s.4 Therefore, the walk generation has been adapted: rather than performing random walks where the entity of interest is always at the start of a sequence, it is randomly decided in each iteration whether to go backward, i.e., to one of the node's predecessors, or forward, i.e., to one of the node's successors (line 9 of Algorithm 2). The probability of continuing the walk in the backward or forward direction is proportional to the number of available options to do so (lines 10–14 of Algorithm 2). For example, for a node with an in-degree of 1 and an out-degree of 9, there are 9 options to continue the walk in forward direction and one option to continue it in backward direction; hence, the walk is continued in backward direction with a probability of 1/(1 + 9) = 10%. Predecessors are added at the beginning of the walk (line 11 of Algorithm 2), successors at the end of the walk (line 15 of Algorithm 2). Consequently, in the resulting walks, the entity of interest can be at the beginning, at the end, or in the middle of a sequence, which captures the context of the entity better. This generation process is described in Algorithm 2.
4 Note that this bias occurs only if walks are generated for a subset of V –the traditional RDF2Vec approach is, consequently, balanced.
Algorithm 2 Walk generation algorithm for RDF2vec Light (Portisch et al. 2020b)
Input: G = (V, E): RDF graph, V_I: vertices of interest, d: walk depth, n: number of walks
Output: W_G: set of walks

W_G = ∅
for vertex v ∈ V_I do
  for 1 to n do
    add v to w
    pred = getIngoingEdges(v)
    succ = getOutgoingEdges(v)
    while w.length() < d do
      cand = pred ∪ succ
      elem = pickRandomElementFrom(cand)
      if elem ∈ pred then
        add elem at the beginning of w
        pred = getIngoingEdges(elem)
      else
        add elem at the end of w
        succ = getOutgoingEdges(elem)
      end if
      add w to W_G
    end while
  end for
end for
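To make the procedure more concrete, the following is a minimal Python sketch of this walk generation strategy. It is not taken from any of the RDF2vec implementations; the graph is assumed to be given as two adjacency dictionaries (succ and pred) mapping a node to its (property, neighbor) successors and predecessors, which play the role of getOutgoingEdges and getIngoingEdges.

import random

def generate_light_walks(succ, pred, entities_of_interest, depth, num_walks):
    """Generate RDF2vec-Light-style walks for a set of entities of interest."""
    walks = []
    for v in entities_of_interest:
        for _ in range(num_walks):
            walk = [v]
            pred_cand = list(pred.get(v, []))
            succ_cand = list(succ.get(v, []))
            while len(walk) < depth and (pred_cand or succ_cand):
                # the probability of going backward or forward is proportional
                # to the number of available options in each direction
                elem = random.choice(pred_cand + succ_cand)
                p, node = elem
                if elem in pred_cand:
                    walk = [node, p] + walk        # prepend predecessor
                    pred_cand = list(pred.get(node, []))
                else:
                    walk = walk + [p, node]        # append successor
                    succ_cand = list(succ.get(node, []))
            walks.append(walk)
    return walks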
The original paper of RDF2vec Light reports on experiments with a subset of tasks in the GEval framework (see Chap. 3), namely classification, regression, entity relatedness, and document similarity. The results are depicted in Table 5.1. It can be observed that there is little difference between classic RDF2vec and RDF2vec Light for classification and regression (apart from the cities case), while the performance of RDF2vec Light degrades a lot more for the entity relatedness and document similarity tasks. In order to better understand these results, we have to take a closer look both at the characteristics of the benchmark datasets at hand, as well as the graphs spanned by the random walks for RDF2vec Light. Table 5.2 shows some additional characteristics of the GEval datasets used for the evaluation. It can be observed that those datasets where the RDF2vec Light variant significantly underperforms compared to classic RDF2vec are those where the entities have a high degree. It is known that for head entities, the in-degree of these entities is much larger than the out-degree in DBpedia.5 This means that starting a fixed number of random walks not
5 For example, a city in DBpedia has an average outdegree of about 4, but an average indegree of about 34 (as of October 2022). This proportion is even more skewed for the cities in the GEval dataset, which are head entities.
Table 5.1 Results with RDF2vec Light on GEval (using 200 dimensions) (Portisch et al. 2020b)

Task                | Dataset | Classic CBOW | Classic SG | Light CBOW | Light SG
Classification      | Cities  | 0.494        | 0.771      | 0.716      | 0.738
Classification      | Movies  | 0.588        | 0.739      | 0.739      | 0.746
Classification      | Albums  | 0.592        | 0.757      | 0.733      | 0.764
Classification      | AAUP    | 0.568        | 0.667      | 0.617      | 0.628
Classification      | Forbes  | 0.576        | 0.618      | 0.591      | 0.603
Regression          | Cities  | 99.7         | 28.3       | 54.9       | 44.4
Regression          | Movies  | 23.5         | 19.7       | 19.6       | 19.5
Regression          | Albums  | 14.2         | 11.9       | 12.5       | 12.2
Regression          | AAUP    | 80.3         | 68.0       | 67.7       | 70.1
Regression          | Forbes  | 57.6         | 61.8       | 59.1       | 60.3
Entity relatedness  | KORE    | 0.318        | 0.535      | 0.347      | 0.405
Document similarity | LP50    | 0.237        | 0.437      | 0.381      | 0.328
Table 5.2 Number of entities and average degree of the test datasets in GEval

Dataset           | # Entities | Avg. degree
AAUP              | 960        | 83.3775
Cities            | 212        | 1264.7122
Forbes            | 1585       | 36.1759
Metacritic Albums | 1600       | 19.0762
Metacritic Movies | 2000       | 17.5003
KORE              | 414        | 474.5984
LP50              | 407        | 2087.5274
only from the entities of interest but also from neighboring entities leads to a much better representation of those entities in the random walks. On the other hand, this effect is not as severe for entities that have a rather low degree: here, the amount of information captured for an entity by the classic and the light variant of RDF2vec is comparable. Figure 5.3 visualizes the graphs spanned by the RDF2vec Light walks in the datasets used. It can be observed that the cases where RDF2vec Light works well are those where the spanned graphs are comparably densely connected. This also hints at the fact that for low-degree entities, most of the relevant information is captured by RDF2vec Light and classic RDF2vec alike.
Fig. 5.3 Depiction of the graphs which were assembled using the generated walks. The graphs are rendered using a force layout in Gephi

Listing 5.2 Computing embeddings for all entities in a graph with pyRDF2vec

# Load graph: see listings in chapters 3+4

# get all entities in the graph
kgentities = []
for kge in kg._entities:
    kgentities.append(kge.name)

# learn the embeddings
# define walk strategy and word2vec parameters, see chapters 3+4
# note: depending on the pyRDF2vec version, fit_transform may return a
# tuple of (embeddings, literals)
embeddings = transformer.fit_transform(kg, kgentities)

# reconstruct embeddings for the entities at hand
filtered_embeddings = []
for e in entities:
    filtered_embeddings.append(embeddings[kgentities.index(e)])
5.2.2
RDF2vec Light in Action
In the code examples in this book, we have used the pyRDF2vec implementation, which, by default, creates embeddings only for a given set of entities. If we want to use RDF2vec embeddings computed on the entire graph instead, we have to explicitly request this, as shown in Listing 5.2. We have applied RDF2vec Light and classic RDF2vec to the running band classification example in this book. On average, RDF2vec Light takes about 4 s to compute the vectors for the 200 entities in the classification dataset, while classic RDF2vec needs about 30 s to compute vectors for all 1,786 entities in the example knowledge graph.
Fig. 5.4 Accuracy and runtime of RDF2vec Light and Classic for different dimensionalities
RDF2vec Light does not only save time on creating the embeddings. Since less information needs to be represented, we can also use smaller embedding models (i.e., fewer dimensions) when using RDF2vec Light. To analyze the extent to which this influences the results, we conducted a small experiment with the running example in this book, computing both light and classic RDF2vec embeddings of different dimensionality. The results are depicted in Fig. 5.4. We can make multiple observations here: First, while classic RDF2vec needs more dimensions in order to deliver good results, RDF2vec Light already delivers decent results at a relatively small number of dimensions, i.e., 10. Second, the results are much more stable with RDF2vec Light, indicating that the underlying learning problem is simpler and can be learned more easily. It is noteworthy that RDF2vec Light does not only generate runtime savings when creating the embeddings, but also for downstream applications: since fewer dimensions are required to represent the entities at hand, subsequent processing steps operate on more compact representations, and we can expect those downstream processes to run faster as well. On the other hand, in scenarios where the set of entities is not known upfront or strict runtime requirements exist, using a pre-computed embedding on the entire graph might still be a better option.
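A sketch of how such a dimensionality comparison can be scripted is shown below. It assumes that the knowledge graph kg, the entities, and the labels from the running example are already loaded, and that pyRDF2vec's Word2Vec embedder passes the vector_size parameter through to gensim; walker settings and the list of dimensionalities are illustrative choices, not the book's exact configuration.

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

for dim in [5, 10, 25, 50, 100, 200]:
    transformer = RDF2VecTransformer(
        Word2Vec(vector_size=dim),
        walkers=[RandomWalker(max_depth=4, max_walks=100)],
    )
    # "light": embed only the entities of interest; for "classic",
    # pass all entities of the graph instead (cf. Listing 5.2)
    # note: depending on the pyRDF2vec version, fit_transform may return
    # a tuple of (embeddings, literals)
    embeddings = transformer.fit_transform(kg, entities)
    clf = MLPClassifier(max_iter=10000)
    scores = cross_val_score(clf, embeddings, labels, cv=10)
    print(dim, scores.mean())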
5.3
Conclusion
Although RDF2vec is a heavy-weight processing technique that, on larger knowledge graphs, can lead to high hardware requirements and long runtimes, there are alternatives for dealing with large knowledge graphs. Pre-trained embeddings, provided as a service, as well as lightweight variants only computing embedding vectors for a set of entities of interest are possible alternatives to make RDF2vec work at scale.
References

Fellbaum C (ed) (1998) WordNet: an electronic lexical database. Language, speech, and communication. MIT Press, Cambridge, Massachusetts
Heist N, Hertling S, Ringler D, Paulheim H (2020) Knowledge graphs on the web – an overview
Hertling S, Paulheim H (2017) Webisalod: providing hypernymy relations extracted from the web as linked open data. In: International semantic web conference. Springer, pp 111–119
Lehmann J, Isele R, Jakob M, Jentzsch A, Kontokostas D, Mendes PN, Hellmann S, Morsey M, van Kleef P, Auer S, Bizer C (2013) DBpedia – a large-scale, multilingual knowledge base extracted from wikipedia. Semant Web J 6(2). https://doi.org/10.3233/SW-140134
Portisch J, Hladik M, Paulheim H (2020a) Kgvec2go – knowledge graph embeddings as a service. In: Proceedings of the 12th language resources and evaluation conference, pp 5641–5647
Portisch J, Hladik M, Paulheim H (2020b) Rdf2vec light – a lightweight approach for knowledge graph embeddings. In: International semantic web conference, posters and demonstrations
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks, ELRA, Valletta, Malta, pp 45–50. http://is.muni.cz/publication/884893/en
Seitner J, Bizer C, Eckert K, Faralli S, Meusel R, Paulheim H, Ponzetto SP (2016) A large database of hypernymy relations extracted from the web. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), pp 360–367
Sérasset G (2015) Dbnary: wiktionary as a lemon-based multilingual lexical resource in RDF. Semant Web 6(4):355–361. https://doi.org/10.3233/SW-140147
6
Link Prediction in Knowledge Graphs (and its Relation to RDF2vec)
Abstract
In recent years, a long list of research works has been published which utilize knowledge graph embeddings for link prediction (rather than node classification, which we have considered so far). In this chapter, we give a very brief overview of the main embedding techniques for link prediction and flesh out the main differences between the well-known link prediction technique TransE and RDF2vec. Moreover, we show how RDF2vec can be used for link prediction.
6.1
A Brief Survey on the Knowledge Graph Embedding Landscape
So far, in this book, we have mainly considered RDF2vec and its variants. However, there is a legion of knowledge graph embedding methods. An often-cited survey by Wang et al. (2017) lists already 25 approaches, with new models being proposed almost every month, as depicted in Fig. 6.1. Even more remarkably, two mostly disjoint strands of research have emerged in that vivid area. The first family of research works is the one to which RDF2vec belongs. Those approaches focus on the embedding of entities in the knowledge graph for downstream tasks outside the knowledge graph, which often come from the data mining field—hence, we coin this family of approaches embeddings for data mining. Examples include: the prediction of external variables for entities in a knowledge graph (Ristoski and Paulheim 2016), information retrieval backed by a knowledge graph (Steenwinckel et al. 2020), or the usage of a knowledge graph in content-based recommender systems (Ristoski et al. 2019). In those cases, the optimization goal is to create an embedding space that reflects semantic similarity as good as possible (e.g., in a recommender system, similar items to the ones in
Fig. 6.1 Publications with Knowledge Graph Embedding in their title or abstract over the past ten years, created with dimensions.ai (as of February 1st, 2023)
the user interest should be recommended, as discussed in Chap. 7). The evaluations here are always conducted outside the knowledge graph, based on external ground truth. The second—and more active—family of approaches focuses mostly on link prediction (Han et al. 2018), i.e., the approaches are evaluated in a knowledge graph refinement setting (Paulheim 2017). The optimization goal here is to distinguish correct from incorrect triples in the knowledge graph as accurately as possible.1 The evaluations of this kind of approach are always conducted within the knowledge graph, using the existing knowledge graph assertions as ground truth. In this chapter, we want to discuss the commonalities and differences between the two families. We look at two of the most basic and well-known approaches of both strands, i.e., TransE (Bordes et al. 2013) and RDF2vec (Ristoski and Paulheim 2016), and analyze and compare their optimization goals in a simple example. Moreover, we analyze the performance of approaches from both families in the respective other evaluation setups: we explore the usage of link-prediction-based embeddings for other downstream tasks based on similarity, and we propose a link prediction method based on node embedding techniques such as RDF2vec. From those experiments, we derive a set of insights into the differences between the two families of methods, and a few recommendations on which kind of approach should be used in which setting. As pointed out above, the number of works on knowledge graph embedding is legion, and enumerating them all in this section would go beyond the scope of this chapter (and if done thoroughly, even this book as a whole). However, there have already been quite a few survey articles. The first strand of research works—i.e., knowledge graph embeddings for link prediction—has been covered in different surveys, such as Wang et al. (2017), and, more 1 Ultimately, those approaches compute a confidence score for a triple < s, p, o >. This confidence score can be used for triple scoring, i.e., validating existing triples in a knowledge graph, as well as for link prediction by scoring candidates for non-existing triples.
recently, Dai et al. (2020), Rossi et al. (2021), and Ji et al. (2021). The categorization of approaches in those reviews is similar, as they distinguish different families of approaches: translational distance models (Wang et al. 2017) or geometric models (Rossi et al. 2021) focus on link prediction as a geometric task, i.e., projecting the graph into a vector space so that a translation operation defined for relation r on a head h yields a result close to the tail t. The second family among the link prediction embeddings is formed by semantic matching (Wang et al. 2017) or matrix factorization or tensor decomposition (Rossi et al. 2021) models. Here, a knowledge graph is represented as a three-dimensional tensor, which is decomposed into smaller tensors and/or two-dimensional matrices. The reconstruction operation can then be used for link prediction. The third and youngest family among the link prediction embeddings is based on deep learning and graph neural networks. Here, neural network training approaches, such as convolutional neural networks, capsule networks, or recurrent neural networks, are adapted to work with knowledge graphs, and the embeddings are generated by training a deep neural network. Different architectures exist (based on convolutions, recurrent layers, etc.), and the approaches also differ in the training objective, e.g., performing binary classification into true and false triples, or predicting the relation of a triple, given its subject and object (Rossi et al. 2021).
While most of those approaches only consider graphs with nodes and edges, most knowledge graphs also contain literals, e.g., strings and numeric values. Recently, approaches combining textual information with knowledge graph embeddings using language modeling techniques have also been proposed, using techniques such as word2vec and convolutional neural networks (Xie et al. 2016) or transformer methods (Daza et al. 2021, Wang et al. 2021a). Gesese et al. (2021) provide a survey of approaches that take such literal information into account. It is also one of the few review articles which consider embedding methods from the different research strands.
Link prediction is typically evaluated on a set of standard datasets, and uses a within-KG protocol, where the triples in the knowledge graph are divided into a training, testing, and validation set. Prediction accuracy is then assessed on the validation set. Datasets commonly used for the evaluation are FB15k, which is a subset of Freebase, and WN18, which is derived from WordNet (Bordes et al. 2013). Since it has been remarked that those datasets contain too many simple inferences due to inverse relations, the more challenging variants FB15k237 (Toutanova et al. 2015) and WN18RR (Dettmers et al. 2018) have been proposed. More recently, evaluation sets based on larger knowledge graphs, such as YAGO3-10 (Dettmers et al. 2018), DBpedia50k/DBpedia500k (Shi and Weninger 2018), and Wikidata5M (Wang et al. 2021a), have been introduced.
The second strand of research works, focusing on the embedding for downstream tasks (which are often from the domain of data mining), is not as extensively reviewed, and the number of works in this area is still smaller. One of the more comprehensive evaluations is shown by Cochez et al. (2017b), which is also one of the rare works which includes approaches from both strands in a common evaluation. They show that at least the three methods for link prediction used, namely TransE, TransR, and TransH, perform inferior
Table 6.1 Co-citation likelihood of different embeddings approaches, obtained from Google scholar, July 12th, 2021. An entry (row,column) in the table reads as: this fraction of the papers citing column also cites row (Portisch et al. 2022b)
on downstream tasks, compared to approaches developed specifically for optimizing for entity similarity in the embedding space. A third, yet less closely related strand of research works is node embeddings for homogeneous graphs, such as node2vec and DeepWalk. While knowledge graphs come with different relations and are thus considered heterogeneous, approaches for homogeneous graphs are sometimes used on knowledge graphs as well by first transforming the knowledge graph into an unlabeled graph, usually by ignoring the different types of relations. Since some of the approaches are defined for undirected graphs, but knowledge graphs are directed, those approaches may also ignore the direction of edges. For the evaluation of entity embeddings for data mining, i.e., optimized for capturing entity similarity, there are quite a few use cases at hand. Pellegrino et al. (2020) list a number of tasks, including classification and regression of entities based on external ground truth variables, entity clustering, as well as identifying semantically related entities. Most of the above-mentioned strands exist mainly in their own respective “research bubbles”. Table 6.1 shows a co-citation analysis of the different families of approaches. It shows that the Trans* family, together with other approaches for link prediction, forms its own citation network, and so do the approaches for homogeneous networks, while RDF2vec and KGlove are less clearly separated. Works which explicitly compare approaches from the different research strands are still rare. Zouaq and Martel (2020) analyze the vector spaces of different embedding models with respect to class separation, i.e., they fit the best linear separation between classes in different embedding spaces. According to their findings, RDF2vec achieves a better linear separation than the models tailored to link prediction. In Chen et al. (2020), an in-KG scenario, i.e., the detection and correction of erroneous links, is considered. The authors compare RDF2vec (with an additional classification layer) to TransE and DistMult on the link prediction task. The results are mixed: While RDF2vec outperforms TransE and DistMult in terms of Mean Reciprocal Rank and Precision@1, it
is inferior in Precision@10. Since the results are only validated on one single dataset, the evidence is rather thin.
Most other research works in which approaches from different strands are compared are related to different downstream tasks. In many cases, the results are rather inconclusive, as the following examples illustrate:
• Celebi et al. (2019) and Karim et al. (2019) both analyze drug-drug interaction, using different sets of embedding methods. The finding of Celebi et al. (2019) is that RDF2vec outperforms TransE and TransD, whereas in the experiment in Karim et al. (2019), ComplEx outperforms RDF2vec, KGlove, TransE, and CrossE, and, in particular, TransE outperforms RDF2vec.
• Basu et al. (2020), Chen et al. (2020), and Wang et al. (2021b) all analyze link prediction in different graphs. While Basu et al. (2020) state that RotatE and TransD outperform TransE, DistMult, and ComplEx, which in turn outperform node2vec, Chen et al. (2020) report that DistMult outperforms RDF2vec, which in turn outperforms TransE, and Wang et al. (2021b) report that KG2vec (which can be considered equivalent to RDF2vec) outperforms node2vec, which in turn outperforms TransE.
• Bakhshandegan Moghaddam et al. (2021) compare the performance of RDF2vec, DistMult, TransE, and SimplE on a set of classification and clustering datasets. The results are mixed. For classification, the authors use four different learning algorithms, and the variance induced by the learning algorithms is most often higher than that induced by the embedding method. For clustering, they report that TransE outperforms the other approaches.2
While this is not a comprehensive list, these observations hint at a need both for more task-specific benchmark datasets and for ablation studies analyzing the interplay of embedding methods and other processing steps. Moreover, it is important to gain a deeper understanding of how these approaches behave with respect to different downstream problems and to have more direct comparisons.
6.2
Knowledge Graph Embedding for Data Mining
As we have discussed in the previous chapters, the method under consideration in this book, i.e., RDF2vec, was designed for data mining. The basic idea is that two similar instances are projected to similar vectors. Since, due to their similarity, there is a high likelihood that they share the same label in a downstream prediction task, a learning algorithm can pick up
2 We think that these results must be taken with a grain of salt. To evaluate the clustering quality, the
authors use an intrinsic evaluation metric, i.e., Silhouette score, which is computed in the respective vector space. It is debatable, however, whether Silhouette scores computed in different vector spaces are comparable.
that signal and learn a good predictive model. This was shown in the running example in the previous chapter: similar artists get similar embedding vectors, and artists with the same genres form clusters.
6.2.1
Data Mining is Based on Similarity
Predictive data mining tasks are predicting classes or numerical values for instances, often also referred to as node classification and node regression. Typically, the target is to predict an external variable not contained in the knowledge graph (or, to put it differently: use the background information from the knowledge graph to improve prediction models). An example for node classification is the task of predicting a band’s genre, which has been used as a running example in the previous chapters. An example for node regression would be to predict the popularity of an item (e.g., a book, a music album, a movie) as a numerical value. The idea here would be that two items that share similar features should also receive similar ratings. The same mechanism is also exploited in recommender systems: if two items share similar features, users who consumed one of those items are recommended the other one. Many techniques for predictive data mining rely on similarity in one or the other way. This is more obvious for, e.g., k-nearest neighbors, where the predicted label for an instance is the majority or average of labels of its closest neighbors (i.e., most similar instances), or Naive Bayes, where an instance is predicted to belong to a class if its feature values are most similar to the typical distribution of features for this class (i.e., it is similar to an average member of this class). A similar argument can be made for neural networks, where one can assume a similar output when changing the value of one input neuron (i.e., one feature value) by a small delta. Other classes of approaches (such as Support Vector Machines) use the concept of class separability, which is similar to exploiting similarity: datasets with well separable classes have similar instances (belonging to the same class) close to each other, while dissimilar instances (belonging to different classes) are further away from each other (Tan et al. 2016).
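For instance, a k-nearest-neighbor classifier can be used directly on top of the embedding vectors. The snippet below is a generic sketch, not part of the book's running example: the feature matrix X (one embedding vector per entity) and the labels y are assumed to be given, e.g., from one of the earlier listings, and cosine distance is used as the notion of similarity.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# each row of X is the embedding vector of one entity, y holds the class labels
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
scores = cross_val_score(knn, X, y, cv=10)
print(scores.mean())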
6.2.2
How RDF2vec Projects Similar Instances Close to Each Other
While we have already taken a deep dive into RDF2vec, we want to step back and understand why RDF2vec projects similar instances close to each other. To that end, we use the example depicted in Fig. 6.2. The corresponding triple notation is shown in Fig. 6.3. As discussed above, the first step of RDF2vec is to create random walks on the graph. To that end, RDF2vec starts a fixed number of random walks of a fixed maximum length from each entity. Since the example above is very small, we will, for the sake of illustration, enumerate all walks of length 4 that can be created for the graph. Those walks are depicted
Fig. 6.2 Example graph used for illustration (Portisch et al. 2022b)
Fig. 6.3 Triples of the example knowledge graph
Fig. 6.4 Walks extracted from the example graph
in Fig. 6.4. It is notable that, since the graph has nodes without outgoing edges, some of the walks are actually shorter than 4. In the next step, the walks are used to train a predictive model. Since RDF2vec uses word2vec, it can be trained with the two flavors of word2vec, i.e., CBOW (continuous bag of words) and SG (skip-gram). For the sake of illustration, we will only use the second variant, which predicts the surroundings of a word, given the word itself. Simply speaking, given training examples where the input is the target word (as a one-hot-encoded vector) and the output is the context words (again, one-hot-encoded vectors), a neural network is trained, where the hidden layer is typically of smaller dimensionality than the input. That hidden layer is later used to produce the actual embedding vectors. To create the training examples, a window with a given size is slid over the input sentences. Here, we use a window of size 2, which means that the two words preceding and the two words succeeding a context word are taken into consideration. Table 6.2 shows the training examples generated for three instances. It is noteworthy that some training examples occur multiple times if they are extracted from different walks, which leads the learner to put more emphasis on those examples.
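To illustrate how such training examples are derived, the following sketch generates (target, context) pairs with a window of size 2 from one of the walks in Fig. 6.4. It is a simplified illustration of the skip-gram sampling step, not the actual word2vec implementation.

def skipgram_pairs(walk, window=2):
    # for each position, pair the target token with the tokens in its window
    pairs = []
    for i, target in enumerate(walk):
        for j in range(max(0, i - window), min(len(walk), i + window + 1)):
            if i != j:
                pairs.append((target, walk[j]))
    return pairs

walk = ["France", "capital", "Paris", "locatedIn", "France"]
for target, context in skipgram_pairs(walk):
    print(target, "->", context)

For the token Paris, this yields exactly the context words France, capital, locatedIn, and France shown in the first row of Table 6.2.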
Table 6.2 Training examples for instances Paris, Berlin, Mannheim, Angela Merkel, Donald Trump, and Belgium (upper part) and majority predictions (lower part)

Target word   | w−2      | w−1              | w+1              | w+2
Paris         | France   | capital          | locatedIn        | France
Paris         | –        | –                | locatedIn        | France
Paris         | –        | –                | locatedIn        | France
Paris         | –        | –                | locatedIn        | France
Paris         | France   | capital          | –                | –
Paris         | France   | capital          | –                | –
Berlin        | –        | –                | locatedIn        | Germany
Berlin        | Germany  | capital          | –                | –
Berlin        | –        | –                | locatedIn        | Germany
Berlin        | –        | –                | locatedIn        | Germany
Berlin        | Germany  | capital          | locatedIn        | Germany
Berlin        | Germany  | capital          | –                | –
Mannheim      | –        | –                | locatedIn        | Germany
Mannheim      | –        | –                | locatedIn        | Germany
Mannheim      | –        | –                | locatedIn        | Germany
Angela Merkel | Germany  | headOfGovernment | –                | –
Angela Merkel | Germany  | headOfGovernment | –                | –
Angela Merkel | Germany  | headOfGovernment | –                | –
Donald Trump  | USA      | headOfGovernment | –                | –
Donald Trump  | USA      | headOfGovernment | –                | –
Belgium       | –        | –                | partOf           | EU
Belgium       | –        | –                | capital          | Brussels
Belgium       | Brussels | locatedIn        | –                | –
Belgium       | –        | –                | partOf           | EU
Belgium       | –        | –                | headOfGovernment | Sophie Wilmes
Belgium       | Brussels | locatedIn        | headOfGovernment | Sophie Wilmes
Belgium       | Brussels | locatedIn        | partOf           | EU
Belgium       | Brussels | locatedIn        | capital          | Brussels
Belgium       | Brussels | locatedIn        | –                | –

Majority predictions:
Paris         | France   | capital          | locatedIn        | France
Berlin        | Germany  | capital          | locatedIn        | Germany
Mannheim      | –        | –                | locatedIn        | Germany
Angela Merkel | Germany  | headOfGovernment | –                | –
Donald Trump  | USA      | headOfGovernment | –                | –
Belgium       | Brussels | locatedIn        | partOf           | EU
A model that learns to predict the context given the target word would now learn to predict the majority of the context words for the target word at hand at the output layer, as depicted in the lower part of Table 6.2. Here, we can see that Paris and Berlin share two out of four
Fig.6.5 Projection of the example (left) and computed average relations (right) (Portisch et al. 2022b)
predictions, so do Mannheim and Berlin. Angela Merkel and Berlin share one out of four predictions. Given that the activation function in the word2vec architecture which computes the output from the projection values is continuous, it implies that similar activations on the output layer require similar values on the projection layer. Hence, for a well-fit model, the distance on the projection layer of Paris, Berlin, and Mannheim should be comparatively lower than the distance of the other entities, since they activate similar outputs.3 Figure 6.5 depicts a two-dimensional RDF2vec embedding learned for the example graph.4 We can observe that there are clusters of persons, countries, and cities. The grouping of similar objects also goes further—we can, e.g., observe that European cities in the dataset are embedded closer to each other than to Washington D.C. This is in line with previous observations showing that RDF2vec is particularly well suited in creating clusters also for finer-grained classes (Sofronova et al. 2020). A predictive model could now exploit those similarities, e.g., for type prediction, as proposed by Kejriwal and Szekely (2017) and Sofronova et al. (2020).
6.2.3
Using RDF2vec for Link Prediction
From Fig. 6.5, we can assume that link prediction should, in principle, be possible. For example, the predictions for heads of governments all point in a similar direction. This is 3 Note that there are still weights learned for the individual connections between the projection and
the output layer, which emphasize some connections more strongly than others. Hence, we cannot simplify our argumentation in a way like “with two common context words activated, the entities must be projected twice as close as those with one common context word activated”. 4 Created with PyRDF2vec (Vandewiele et al. 2022), using two dimensions, a walk length of 8, and standard configuration otherwise.
in line with what is known about word2vec, which allows for computing analogies, like the well-known example by Mikolov et al. (2013c):

v(King) − v(Man) + v(Woman) ≈ v(Queen)    (6.1)

RDF2vec does not learn relation embeddings, only entity embeddings.5 Hence, we cannot directly predict links, but we can exploit those analogies. If we want to make a tail prediction like

⟨h, r, ?⟩,    (6.2)

we can identify another pair ⟨h′, r, t′⟩ and exploit the above analogy, i.e.,

t′ − h′ + h ≈ t    (6.3)

To come to a stable prediction, we would use the average, i.e.,

t ≈ ( Σ_{⟨h′,r,t′⟩} (t′ − h′ + h) ) / |⟨h′, r, t′⟩|,    (6.4)

where |⟨h′, r, t′⟩| is the number of triples which have r as predicate. With the same idea, we can also average the relation vectors r for each relation that holds between all its head and tail pairs, i.e.,

r ≈ ( Σ_{⟨h′,r,t′⟩} (t′ − h′) ) / |⟨h′, r, t′⟩|,    (6.5)

and thereby reformulate the above equation to

t ≈ h + r,    (6.6)
which is what we expect from an embedding model for link prediction. Those approximate relation vectors for the example at hand are depicted in Fig. 6.5. We can see that in some (not all) cases, the directions of the vectors are approximately correct: the partOf vector is roughly the difference between EU and Germany, France, and Belgium, and the headOfGovernment vector is approximately the vector between the countries and the politicians cluster. It can also be observed that the vectors for locatedIn and capitalOf point in reverse directions, which makes sense because they form connections between two clusters (countries and cities) in opposite directions.
5 Technically, we can also make RDF2vec learn embeddings for the relations, but they would not
behave the way we need them. See Sect. 7.3 for a discussion of relation predictions of RDF2vec.
Listing 6.1 Using RDF2vec for Link Prediction. The code only shows the essential parts of the prediction; the training is done as in the other RDF2vec examples.

# prerequisites from the earlier listings: the rdflib graph g, the list of
# entity URIs lstentities, and the trained 200-dimensional vectors embeddings

# compute the average relation vector for
# http://dbpedia.org/ontology/bandMember
relations = np.empty([0, 200])
pred = URIRef("http://dbpedia.org/ontology/bandMember")
for s, p, o in g.triples((None, pred, None)):
    sVector = embeddings[lstentities.index(str(s))]
    oVector = embeddings[lstentities.index(str(o))]
    rVector = oVector - sVector
    relations = np.vstack([relations, rVector])
relationVector = relations.mean(axis=0)

# make a prediction:
# band members of http://dbpedia.org/resource/Soulive
entity = "http://dbpedia.org/resource/Soulive"
sVector = embeddings[lstentities.index(entity)]
oVectorPrediction = sVector + relationVector

# retrieve candidate entities in the neighborhood of the prediction
from sklearn.neighbors import NearestNeighbors
dfembeddings = pd.DataFrame.from_records(embeddings)
knn = NearestNeighbors(n_neighbors=10, algorithm='auto', metric='cosine')
knn.fit(dfembeddings)
dfVectorPrediction = pd.DataFrame(oVectorPrediction).transpose()
candidates = knn.kneighbors(dfVectorPrediction, 10, return_distance=False)[0]
6.2.4
Link Prediction with RDF2vec in Action
Listing 6.1 shows how to put the above rationale in action, i.e., use averages of existing links for predicting links. We again use the band classification dataset introduced in Chap. 1, this time to predict band membership. With this code, the top 10 predictions for members of the band Soulive are:
1. Soulive
2. Funk Trek
3. Fred Wesley
4. Joshua Redman
5. Lettuce (band)
6. Rashawn Ross
7. Rustic Overtones
8. Paranoid Social Club
9. The Soul Children
10. Wild Adriatic

The first result, i.e., the triple ⟨Soulive, bandMember, Soulive⟩, is contained in DBpedia by mistake, and it is actually the only triple with Soulive as a subject and bandMember as a predicate. Two further results, Fred Wesley and Rashawn Ross, are musicians who at least contributed to recordings and/or concerts of the band, so they would at least be sensible candidates for the link prediction task. As we have discussed in the previous chapters, standard RDF2vec has the tendency to group related objects, rather than similar ones, together. This in turn means that for link prediction, many candidate links do not fit ontologically (e.g., a band as a member of another band), but are closely related to the actual band members. Therefore, when utilizing such a prediction mechanism, it is usually useful to post-filter the results based on ontological fit (e.g., allowing only those instances which have a type compatible with the range of the predicate at hand).
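A possible post-filtering step could look like the following sketch: it keeps only those candidates whose rdf:type is compatible with the declared range of the predicate, using rdflib on the example graph g. The helper and variable names are illustrative, not part of any RDF2vec implementation; candidate_uris could, for instance, be obtained by mapping the indices returned in Listing 6.1 back to lstentities. Note that only directly asserted types are checked here, without subclass reasoning.

from rdflib import URIRef, RDF, RDFS

def filter_by_range(g, predicate, candidate_uris):
    # collect the declared range classes of the predicate
    ranges = set(g.objects(predicate, RDFS.range))
    if not ranges:
        return candidate_uris  # no range declared, nothing to filter on
    filtered = []
    for c in candidate_uris:
        types = set(g.objects(URIRef(c), RDF.type))
        if types & ranges:
            filtered.append(c)
    return filtered

pred = URIRef("http://dbpedia.org/ontology/bandMember")
filtered = filter_by_range(g, pred, candidate_uris)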
6.3
Knowledge Graph Embedding Methods for Link Prediction
A larger body of work has been devoted to knowledge graph embedding methods for link prediction. Here, the goal is to learn a model which embeds entities and relations in the same vector space.
6.3.1
Link Prediction is Based on Vector Operations
As the main objective is link prediction, most models, more or less, try to find a vector space embedding of entities and relations so that t ≈ h ⊕r
(6.7)
holds for as many triples < h, r , t > as possible. ⊕ can stand for different operations in the vector space; in basic approaches, simple vector addition (+) is used. In our considerations below, we will also use vector addition. In most approaches, negative examples are created by corrupting an existing triple, i.e., replacing the head or tail with another entity from the graph (some approaches also foresee corrupting the relation). Then, a model is learned which tries to tell apart corrupted from non-corrupted triples. The formulation in the original TransE paper (Bordes et al. 2013) defines the loss function L as follows:
Fig. 6.6 Example graph embedded by TransE (Portisch et al. 2022b)
L = Σ_{(h,r,t) ∈ S, (h′,r,t′) ∈ S′} [γ + d(h + r, t) − d(h′ + r, t′)]_+    (6.8)
where γ is some margin, and d is a distance function, usually the L1 or L2 norm. S is the set of statements that are in the knowledge graph, and S′ are the corrupted statements derived from them. In words, the formula states that for a triple ⟨h, r, t⟩, h + r should be closer to t than to the tail t′ of a corrupted triple, and similarly for a corrupted head; however, a difference of γ is accepted. Figure 6.6 shows the example graph from above, as embedded by TransE.6 Looking at the relation vectors, it can be observed that they seem approximately accurate in some cases, e.g., Germany + headOfGovernment ≈ Angela_Merkel, but not everywhere.7 Like in the RDF2vec example above, we can observe that the two vectors for locatedIn and capital point in opposite directions. Also similar to the RDF2vec example, we can see that entities in similar classes form clusters: cities are mostly in the upper part of the space, people in the left, and countries in the lower right part.
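As an illustration of this loss, the following numpy sketch scores a batch of true and corrupted triples with the L1 distance and computes the margin-based loss of Eq. 6.8. The arrays h, r, t, h_c, and t_c are assumed to be already-looked-up embedding vectors; this is a didactic re-implementation, not the code of any particular library.

import numpy as np

def transe_margin_loss(h, r, t, h_c, t_c, gamma=1.0):
    """h, r, t: arrays of shape (batch, dim) for true triples;
    h_c, t_c: heads/tails of the corresponding corrupted triples."""
    d_pos = np.abs(h + r - t).sum(axis=1)       # d(h + r, t), L1 norm
    d_neg = np.abs(h_c + r - t_c).sum(axis=1)   # d(h' + r, t')
    return np.maximum(0.0, gamma + d_pos - d_neg).sum()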
6 Created with PyKEEN (Ali et al. 2021), using 128 epochs, a learning rate of 0.1, the softplus loss function, and default parameters otherwise, as advised by the authors of PyKEEN: https://github.com/pykeen/pykeen/issues/97.
7 This does not mean that TransE does not work. The training data for the very small graph is rather scarce, and two dimensions might not be sufficient to find a good solution here.
6.3.2
Usage for Data Mining
As discussed above, positioning similar entities close in a vector space is an essential requirement for using entity embeddings in data mining tasks. To understand why an approach tailored towards link prediction can also, to a certain extent, cluster similar instances together (although not explicitly designed for this task), we first rephrase the approximate link prediction Eq. 6.8 as

t = h + r + η_{h,r,t},    (6.9)

where η_{h,r,t} can be considered an error term for the triple ⟨h, r, t⟩. Moreover, we define

η_max = max_{⟨h,r,t⟩ ∈ S} |η_{h,r,t}|    (6.10)
Next, we consider two triples ⟨h1, r, t⟩ and ⟨h2, r, t⟩, which share a relation to an object, e.g., in our example, France and Belgium, which both share the relation partOf to EU. In that case,

t = h1 + r + η_{h1,r,t}    (6.11)

and

t = h2 + r + η_{h2,r,t}    (6.12)
hold. From that, we get8

h1 − h2 = η_{h2,r,t} − η_{h1,r,t}
⇒ |h1 − h2| = |η_{h2,r,t} − η_{h1,r,t}| = |η_{h2,r,t} + (−η_{h1,r,t})| ≤ |η_{h2,r,t}| + |−η_{h1,r,t}| = |η_{h2,r,t}| + |η_{h1,r,t}| ≤ 2 · η_max    (6.13)
In other words, η_max also imposes an upper bound on the distance between two entities that share a relation to the same object. As a consequence, the lower the error in relation prediction, the closer are entities that share a common statement. This also carries over to entities sharing the same two-hop connection. Consider two further triples ⟨h1a, ra, h1⟩ and ⟨h2a, ra, h2⟩. In our example, this could be two cities located in the two countries, e.g., Strasbourg and Brussels. In that case, we would have

h1 = h1a + ra + η_{h1a,ra,h1}    (6.14)
h2 = h2a + ra + η_{h2a,ra,h2}    (6.15)
8 Using the triangle inequality for the first inequation.
Substituting this in (6.11) and (6.12) yields

t = h_{1a} + r_a + η_{h_{1a},r_a,h_1} + r + η_{h_1,r,t}    (6.16)

t = h_{2a} + r_a + η_{h_{2a},r_a,h_2} + r + η_{h_2,r,t}    (6.17)

Consequently, using similar transformations as above, we get

h_{1a} − h_{2a} = η_{h_{2a},r_a,h_2} − η_{h_{1a},r_a,h_1} + η_{h_2,r,t} − η_{h_1,r,t}
⇒ |h_{1a} − h_{2a}| ≤ 4 · η_max    (6.18)
Again, η_max constrains the proximity of the two entities h_{1a} and h_{2a}, but only half as strictly as for the case of h_1 and h_2.
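The bound in Eq. 6.13 can also be checked numerically. The following toy sketch uses random stand-in vectors rather than trained embeddings:

import numpy as np

rng = np.random.default_rng(0)
r, t = rng.normal(size=(2, 50))

# two heads that satisfy t = h_i + r + eta_i with small error terms
eta1, eta2 = 0.01 * rng.normal(size=(2, 50))
h1, h2 = t - r - eta1, t - r - eta2

eta_max = max(np.linalg.norm(eta1), np.linalg.norm(eta2))
assert np.linalg.norm(h1 - h2) <= 2 * eta_max  # bound from Eq. 6.13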
6.3.3 Comparing the Two Notions of Similarity
In the examples above, we can see that embeddings for link prediction have a tendency to project similar instances close to each other in the vector space. Here, the notion of similarity is that two entities are similar if they share a relation to another entity, i.e., e1 and e2 are considered similar if there exist two statements <e1, r, t> and <e2, r, t> or <h, r, e1> and <h, r, e2>,9 or, less strongly, if there exists a chain of such statements. More formally, we can write the notion of similarity between two entities in link prediction approaches as

e1 ≈ e2 ← ∃ t, r : r(e1, t) ∧ r(e2, t)    (6.19)

e1 ≈ e2 ← ∃ h, r : r(h, e1) ∧ r(h, e2)    (6.20)
In other words: two entities are similar if they share a common connection to a common third entity. RDF2vec, on the other hand, covers a wider range of such similarities. Looking at Table 6.2, we can observe that two entities sharing a common relation to a common object are also considered similar (Berlin and Mannheim both share the fact that they are located in Germany; hence, their predictions for w+1 and w+2 are similar). However, in RDF2vec, similarity can also come in other notions. For example, Germany and USA are also considered similar, because they both share the relations headOfGovernment and capital, albeit with different objects (i.e., their predictions for w+1 are similar). In contrast, such similarities do not lead to close projections for link prediction embeddings. In fact, in Fig. 6.6, it can be observed that USA and Germany are further away from each other than Germany and other European countries. In other words, the following two notions of similarity also hold for RDF2vec:
9 The argument in Sect. 6.3.2 would also work for shared relations to common heads.
e1 ≈ e2 ← ∃ t1, t2, r : r(e1, t1) ∧ r(e2, t2)    (6.21)

e1 ≈ e2 ← ∃ h1, h2, r : r(h1, e1) ∧ r(h2, e2)    (6.22)
By a similar argument, RDF2vec also positions entities closer that share any relation to another common entity. Although this is not visible in the two-dimensional embedding depicted in Fig. 6.5, RDF2vec would also create vectors with some similarity for Angela Merkel and Berlin, since they both have an (albeit different) relation to Germany (i.e., their predictions for w−2 are similar). Hence, the following notions of similarity can also be observed in RDF2vec:

e1 ≈ e2 ← ∃ t, r1, r2 : r1(e1, t) ∧ r2(e2, t)    (6.23)

e1 ≈ e2 ← ∃ h, r1, r2 : r1(h, e1) ∧ r2(h, e2)    (6.24)
The example with Angela Merkel and Berlin already hints at a slightly different interpretation of proximity in the vector space evoked by RDF2vec: not only similar, but also related entities are positioned close in the vector space. This explains why RDF2vec (at least in its basic formulation), as discussed in the previous chapters, mixes the concepts of similarity and relatedness in its distance function to a certain extent.
6.3.4 Link Prediction Embeddings for Data Mining in Action
As the discussion above indicates, the vectors computed for link prediction can also be used as feature vectors for classifiers, just as those computed by RDF2vec. Listing 6.2 shows the procedure of processing the classification example from the previous chapters with TransE. In this example, we obtain an accuracy of 0.485±0.071, i.e., the classification is not significantly better than random guessing. This is, however, only the case for the default parameters of TransE; with other parameters, e.g., changing the norm from L1 to L2, or with more advanced embedding models, higher scores can be obtained.
6.4 Experiments
To compare the two sets of approaches, we use standard setups for evaluating knowledge graph embedding methods for data mining as well as for link prediction.
6.4.1 Experiments on Data Mining Tasks
For evaluating on data mining tasks, we use the already mentioned GEval benchmark (see Chap. 3).
Listing 6.2 Using TransE for node classification.

from pykeen.triples import TriplesFactory
from pykeen.pipeline import pipeline
import numpy as np
import pandas as pd
import torch

# Load graph
path = "./"
triples = np.loadtxt(path + "artists_graph.tsv", dtype=str, delimiter='\t', quotechar='"')
tf = TriplesFactory.from_labeled_triples(triples=triples)

# Train embedding
results = pipeline(
    training=tf,
    testing=tf,
    model='TransE',
    model_kwargs=dict(embedding_dim=50),
    training_kwargs=dict(num_epochs=128),
    random_seed=1,
)
model = results.model

# Load ground truth
df = pd.read_csv('bands_labels.csv', sep="\t")
dfX = df[['Band']]
dfY = df[['Genre']]

# Assemble data frame for classification
label_to_entity_id = {v: k for k, v in tf.entity_id_to_label.items()}
emb_entities = np.zeros((len(dfX), 50))
for i in range(len(dfX)):
    entity = dfX['Band'][i]
    index = label_to_entity_id[entity]
    input = torch.LongTensor([index])
    emb_entities[i] = model.entity_representations[0](input).detach()[0].numpy()
dfXEmbeddings = pd.DataFrame(emb_entities)

# Run and evaluate classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

clf = MLPClassifier(max_iter=10000)
scores = cross_val_score(clf, dfXEmbeddings, dfY.values.ravel(), cv=10)
scores.mean()
scores.std()
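As mentioned above, switching TransE from the L1 to the L2 norm can change the results considerably. In PyKEEN, this is, to the best of our knowledge, controlled by the scoring_fct_norm keyword of the TransE model; the parameter name should be checked against the PyKEEN version in use. A variation of the pipeline call in Listing 6.2 could then look as follows:

# variation of Listing 6.2: TransE with the L2 norm instead of L1
# (tf is the TriplesFactory built in Listing 6.2; scoring_fct_norm is assumed to be
#  the PyKEEN parameter name for the norm of the scoring function)
results = pipeline(
    training=tf,
    testing=tf,
    model='TransE',
    model_kwargs=dict(embedding_dim=50, scoring_fct_norm=2),
    training_kwargs=dict(num_epochs=128),
    random_seed=1,
)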
All embeddings are trained on DBpedia 2016-10.10 For generating the different embedding vectors, we use the DGL-KE framework (Zheng et al. 2020) in the respective standard settings, and we use the RDF2vec vectors provided by the KGvec2go API (Portisch et al. 2020a), trained with 500 walks of depth 8 per entity, Skip-Gram, and 200 dimensions. We compare RDF2vec, TransE (with L1 and L2 norm) (Bordes et al. 2013), TransR (Lin et al. 2015), RotatE (Sun et al. 2018), DistMult (Yang et al. 2015), RESCAL (Nickel et al. 2011), and ComplEx (Trouillon et al. 2016). To create the embedding vectors with DGL-KE, we use the parameter configurations recommended by the framework, a dimension of 200, and a step maximum of 1,000,000. The RDF2vec_oa vectors were generated with the same configuration, but using the order-aware variant of Skip-Gram (see Chap. 4). For node2vec, DeepWalk, and KGloVe, we use the standard settings and the code provided by the respective authors.11,12,13 For KGloVe, we use the Inverse Predicate Frequency, which has been reported to work well on many tasks in the original paper (Cochez et al. 2017b).

It is noteworthy that the default settings for node2vec and DeepWalk differ in one crucial property: while node2vec interprets the graph as a directed graph by default and only traverses edges in the direction in which they are defined, DeepWalk treats all edges as undirected, i.e., it traverses them in both directions.

From Tables 6.3 and 6.4, we can observe a few expected and a few unexpected results. First, since RDF2vec is tailored towards classic data mining tasks like classification and regression, it is not very surprising that those tasks are solved better by using RDF2vec (and even slightly better by using RDF2vec_oa) vectors. Still, some of the link prediction methods (in particular TransE and RESCAL) perform reasonably well on those tasks. In contrast, KGloVe rarely reaches the performance level of RDF2vec, while the two approaches for unlabeled graphs, i.e., DeepWalk and node2vec, behave differently: while the results of DeepWalk are at the lower end of the spectrum, node2vec is competitive. The latter is remarkable, showing that pure neighborhood information, ignoring the direction and edge labels, can be a strong signal when embedding entities.

Referring back to the different notions of similarity that these families of approaches imply (see above), this behavior can be explained by the tendency of RDF2vec (and also node2vec) to position entities closer in the vector space which are more similar to each other (e.g., two cities that are similar). Since it is likely that some of those dimensions are also correlated with the target variable at hand (in other words: they encode some dimension of similarity that can be used to predict the target variable), classifiers and regressors can pick up on those dimensions and exploit them in their prediction model.

What is also remarkable is the performance on the entity relatedness task.

10 The code for the experiments as well as the resulting embeddings can be found at https://github.com/nheist/KBE-for-Data-Mining.
11 https://github.com/D2KLab/entity2vec.
12 https://github.com/phanein/deepwalk.
13 https://github.com/miselico/globalRDFEmbeddingsISWC.
Table 6.3 Results of the different data mining tasks (1/2). DM denotes approaches originally developed for node representation in data mining, LP denotes approaches originally developed for link prediction (Portisch et al. 2022b)

Dataset | RDF2vec (DM) | RDF2vec_oa (DM) | TransE-L1 (LP) | TransE-L2 (LP) | TransR (LP) | RotatE (LP) | DistMult (LP) | RESCAL (LP) | ComplEx (LP) | node2vec (DM) | DeepWalk (DM) | KGloVe (DM)

Classification (ACC)
AAUP | 0.676 | 0.671 | 0.628 | 0.651 | 0.607 | 0.617 | 0.597 | 0.623 | 0.602 | 0.694 | 0.549 | 0.558
Cities | 0.810 | 0.837 | 0.676 | 0.752 | 0.757 | 0.581 | 0.666 | 0.740 | 0.637 | 0.774 | 0.495 | 0.496
Forbes | 0.610 | 0.626 | 0.550 | 0.601 | 0.561 | 0.526 | 0.601 | 0.563 | 0.578 | 0.618 | 0.49 | 0.502
Albums | 0.774 | 0.787 | 0.637 | 0.746 | 0.728 | 0.550 | 0.666 | 0.678 | 0.693 | 0.789 | 0.543 | 0.548
Movies | 0.739 | 0.736 | 0.603 | 0.728 | 0.715 | 0.567 | 0.668 | 0.693 | 0.655 | 0.763 | 0.555 | 0.563

Clustering (ACC)
Cities and Countries (2K) | 0.758 | 0.931 | 0.982 | 0.994 | 0.962 | 0.510 | 0.957 | 0.991 | 0.955 | 0.939 | 0.557 | 0.623
Cities and Countries | 0.696 | 0.760 | 0.953 | 0.979 | 0.952 | 0.691 | 0.909 | 0.990 | 0.591 | 0.743 | 0.817 | 0.765
Cities, Albums, Movies, AAUP, Forbes | 0.926 | 0.928 | 0.946 | 0.944 | 0.908 | 0.860 | 0.878 | 0.936 | 0.914 | 0.930 | 0.335 | 0.520
Teams | 0.917 | 0.958 | 0.887 | 0.977 | 0.844 | 0.853 | 0.883 | 0.881 | 0.881 | 0.931 | 0.830 | 0.740

Regression (RMSE)
AAUP | 68.745 | 66.505 | 81.503 | 69.728 | 88.751 | 80.177 | 78.337 | 72.880 | 73.665 | 68.007 | 103.23 | 98.794
Cities | 15.601 | 13.486 | 19.694 | 14.455 | 13.558 | 26.846 | 19.785 | 15.137 | 19.809 | 15.363 | 25.323 | 24.151
Forbes | 36.459 | 36.124 | 37.589 | 38.398 | 39.803 | 38.343 | 38.037 | 35.489 | 37.877 | 35.684 | 41.384 | 40.141
Albums | 11.930 | 11.597 | 14.128 | 12.589 | 12.789 | 14.890 | 13.452 | 13.537 | 13.009 | 15.165 | 15.129 | 11.739
Movies | 19.648 | 11.739 | 23.286 | 20.635 | 20.699 | 23.878 | 22.161 | 21.362 | 22.229 | 18.877 | 24.215 | 24.000
Table 6.4 Results of the different data mining tasks (2/2). DM denotes approaches originally developed for node representation in data mining, LP denotes approaches originally developed for link prediction (Portisch et al. 2022b)

Dataset | RDF2vec (DM) | RDF2vec_oa (DM) | TransE-L1 (LP) | TransE-L2 (LP) | TransR (LP) | RotatE (LP) | DistMult (LP) | RESCAL (LP) | ComplEx (LP) | node2vec (DM) | DeepWalk (DM) | KGloVe (DM)

Semantic analogies (precision@k)
(All) capitals and countries | 0.895 | 0.685 | 0.709 | 0.675 | 0.938 | 0.377 | 0.782 | 0.211 | 0.814 | 0.284 | 0.000 | 0.011
Capitals and countries | 0.913 | 0.648 | 0.840 | 0.792 | 0.937 | 0.640 | 0.802 | 0.312 | 0.864 | 0.164 | 0.000 | 0.043
Cities and state | 0.342 | 0.339 | 0.335 | 0.209 | 0.392 | 0.294 | 0.379 | 0.089 | 0.309 | 0.068 | 0.000 | 0.029
Currency (and Countries) | 0.348 | 0.307 | 0.005 | 0.285 | 0.143 | 0.000 | 0.001 | 0.000 | 0.000 | 0.420 | 0.005 | 0.003

Document similarity (harmonic mean)
LP50 | 0.427 | 0.629 | 0.343 | 0.397 | 0.434 | 0.326 | 0.360 | 0.344 | 0.341 | 0.333 | 0.243 | 0.225

Entity relatedness (Kendall Tau)
KORE | 0.779 | 0.504 | 0.002 | –0.081 | 0.139 | –0.039 | 0.147 | 0.087 | 0.115 | 0.525 | 0.129 | 0.421
While RDF2vec embeddings, as well as node2vec, KGloVe, and, to a lesser extent, DeepWalk, reflect entity relatedness to a certain extent, this is not given for any of the link prediction approaches. According to the notions of similarity discussed above, this is reflected in the RDF2vec mechanism: RDF2vec has an incentive to position two entities closer in the vector space if they share relations to a common entity, as shown in Eqs. 6.21–6.24. One example is the relatedness of Apple Inc. and Steve Jobs: here, we can observe the two statements product(Apple Inc., iPhone) and knownFor(Steve Jobs, iPhone) in DBpedia, among others. Those lead to similar vectors in RDF2vec according to Eq. 6.23. A similar argument can be made for node2vec and DeepWalk, and also for KGloVe, which looks at global co-occurrences of entities, i.e., it also favors closer embeddings of related entities.

The same behavior of RDF2vec, i.e., assigning close vectors to related entities, also explains the comparatively bad results of RDF2vec on the first two clustering tasks. Here, the task is to separate cities and countries into two clusters, but since a city is also related to the country it is located in, RDF2vec may position that city and country rather closely together (RDF2vec_oa changes that behavior, as discussed in Chap. 4, and hence produces better results for the clustering problems). Hence, that city has a certain probability of ending up in the same cluster as the country. The latter two clustering tasks are different: the third one contains five clusters (cities, albums, movies, universities, and companies), which are less likely to be strongly related (except universities and companies to cities) and therefore are more likely to be projected into different areas of the vector space. Here, the difference between RDF2vec and the best-performing approaches (i.e., TransE-L1 and TransE-L2) is not that severe. The same behavior can also be observed for the other embedding approaches for data mining, i.e., node2vec, DeepWalk, and KGloVe, which behave similarly in that respect.

The problem of relatedness being mixed with similarity does not occur so strongly for homogeneous sets of entities, as in the classification and regression tasks, where all entities are of the same kind (cities, companies, etc.): here, two companies which are related (e.g., because one is a holding of the other) can also be considered similar to a certain degree (in that case, they are both operating in the same branch). This also explains why the fourth clustering task (where the task is to assign sports teams to clusters by the type of sports) works well for RDF2vec, since here the entities are again homogeneous. At the same time, the test case of clustering teams can also be used to explain why link prediction approaches work well for that kind of task: here, it is likely that two teams in the same sport share a relation to a common entity, i.e., they fulfill Eqs. 6.19 and 6.20. Examples include participation in the same tournaments or common former players.

The semantic analogies task also reveals some interesting findings. First, it should be noted that the relations which form the respective analogies (capital, state, and currency) are contained in the knowledge graph used for the computation. That being said, we can see that most of the link prediction approaches (except for RotatE and RESCAL) perform reasonably well here.
In particular, the first case (capitals and countries) can be solved very well by those approaches, as this is a 1:1 relation, for which link prediction is a fairly simple task. On the other hand, most of the data-mining-centric approaches (i.e., node2vec, DeepWalk, KGloVe) solve this problem rather badly. A possible explanation is that the respective entities belong to the strongly interconnected head entities of the knowledge graph, and also the false solutions are fairly close to each other in the graph (e.g., US Dollar and Euro are interconnected through various short paths). This makes it hard for approaches concentrating on a common neighborhood to produce decent results here.

In contrast, the currency case is solved particularly badly by most of the link prediction approaches. This relation is an n:m relation (there are countries with more than one official, unofficial, or historic currency, and many currencies, like the Euro, are used across many countries). Moreover, looking into DBpedia, this relation contains a lot of mixed usages and is not maintained with very high quality. For example, DBpedia lists 33 entities whose currency is US Dollars14: the list contains historic entities (e.g., West Berlin), errors (e.g., Netherlands), and entities which are not countries (e.g., OPEC), but the United States are not among them. For such kinds of relations that contain a certain amount of noise and heterogeneous information, many link prediction approaches are obviously not well suited.

RDF2vec, in contrast, can deal reasonably well with that case. Here, two effects interplay when solving such tasks: (i) as shown above, relations are encoded by proximity in RDF2vec to a certain extent, i.e., the properties in Eqs. 6.3 and 6.4 allow performing analogy reasoning in the RDF2vec space in general. Moreover, (ii) we have already seen the tendency of RDF2vec to position related entities in relative proximity. Thus, for RDF2vec, it can be assumed that the following holds:

UK ≈ Pound Sterling    (6.25)
USA ≈ US Dollar    (6.26)

Since we can rephrase the first equation as

Pound Sterling − UK ≈ 0    (6.27)

we can conclude that analogy reasoning in RDF2vec would yield

Pound Sterling − UK + USA ≈ US Dollar    (6.28)

Hence, in RDF2vec, two effects, the preservation of relation vectors as well as the proximity of related entities, are helpful for analogy reasoning, and the two effects also work for rather noisy cases. However, for cases that are 1:1 relations in the knowledge graph with rather clean training data available, link prediction approaches are better suited for analogy reasoning.
14 http://dbpedia.org/page/United_States_dollar.
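Analogy reasoning as in Eq. 6.28 can be carried out directly on trained vectors, e.g., with gensim's KeyedVectors. The following sketch assumes a locally stored RDF2vec model; both the file name and the exact entity labels are illustrative assumptions rather than part of the experiments above.

from gensim.models import KeyedVectors

# hypothetical path to pre-trained RDF2vec vectors (e.g., exported from a local training run)
kv = KeyedVectors.load("rdf2vec_dbpedia.kv")

# Pound Sterling - United Kingdom + United States  ~  United States dollar
result = kv.most_similar(
    positive=["dbr:Pound_sterling", "dbr:United_States"],
    negative=["dbr:United_Kingdom"],
    topn=5,
)
print(result)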
6.4.2 Experiments on Link Prediction Tasks
In a second series of experiments, we analyze whether embedding methods developed for similarity computation, like RDF2vec, can also be used for link prediction. We use the two established tasks WN18 and FB15k for a comparative study. While link prediction methods are developed for the task at hand, approaches developed for data mining are not. Although RDF2vec computes vectors for relations, they do not necessarily follow the same notion as relation vectors for link prediction, as discussed above. Hence, we investigate two approaches:

1. We average the difference between head and tail vectors over all pairs connected by a relation r, and use that average as a proxy for a relation vector for prediction, as shown in Eq. 6.4. The predictions are the entities whose embedding vectors are the closest to the approximate prediction. This method is denoted as avg (a short code sketch of this strategy is given below).
2. For predicting the tail of a relation, we train a neural network that predicts an embedding vector for the tail based on the embedding vectors of the head and the relation, as shown in Fig. 6.7. The predictions for a triple <h, r, ?> are the entities whose embedding vectors are closest to the predicted vector for h and r. A similar network is trained to predict h from r and t. This method is denoted as ANN.

We trained the RDF2vec embeddings with 2,000 walks, a depth of 4, a dimension of 200, a window of 5, and 25 epochs in SG mode. For the second prediction approach, the two neural networks each use two hidden layers of size 200, and we use 15 epochs, a batch size of 1,000, and mean squared error as a loss function. KGloVe, node2vec, and DeepWalk do not produce any vectors for relations. Hence, we only use the avg strategy for those approaches, since the experiments with the ANN strategy on RDF2vec were not too promising. The results of the link prediction experiments are shown in Table 6.5.15
Fig. 6.7 Training a neural network for link prediction with RDF2vec (Portisch et al. 2022b)
15 The code for the experiments can be found at https://github.com/janothan/kbc_rdf2vec.
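The avg strategy from item 1 above can be written down in a few lines. The following sketch assumes a list of triples (h, r, t) of entity and relation identifiers and a dictionary vec mapping entities to NumPy vectors; both are placeholders, not the evaluation code referenced in the footnote.

import numpy as np

def relation_offsets(triples, vec):
    # average t - h per relation, used as a proxy relation vector (cf. Eq. 6.4)
    sums, counts = {}, {}
    for h, r, t in triples:
        sums[r] = sums.get(r, 0) + (vec[t] - vec[h])
        counts[r] = counts.get(r, 0) + 1
    return {r: s / counts[r] for r, s in sums.items()}

def predict_tail(h, r, vec, rel_vec, k=10):
    # rank all entities by distance to the approximate prediction h + r
    target = vec[h] + rel_vec[r]
    ranked = sorted(vec, key=lambda e: np.linalg.norm(vec[e] - target))
    return ranked[:k]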
Table 6.5 Results of the link prediction tasks on WN18 and FB15k. Results for TransE and RESCAL from Bordes et al. (2013), results for RotatE from Sun et al. (2018), results for DistMult from Yang et al. (2015), results for TransR from Lin et al. (2015). Portisch et al. (2022b)

Approach | WN18 MR (raw) | WN18 MR (filt) | WN18 Hits@10 (raw) | WN18 Hits@10 (filt) | FB15k MR (raw) | FB15k MR (filt) | FB15k Hits@10 (raw) | FB15k Hits@10 (filt)

Approaches for data mining
RDF2vec (AVG) | 147 | 135 | 64.4 | 71.3 | 399 | 347 | 35.3 | 40.5
RDF2vec (ANN) | 353 | 342 | 49.7 | 55.4 | 349 | 303 | 34.3 | 41.8
RDF2vec_oa (AVG) | 64 | 53 | 65.9 | 73.0 | 168 | 120 | 47.9 | 56.5
RDF2vec_oa (ANN) | 77 | 66 | 66.6 | 75.1 | 443 | 90 | 30.9 | 37.4
node2vec (AVG) | 17 | 10 | 12.3 | 14.2 | 192 | 138 | 44.7 | 53.3
DeepWalk (AVG) | 6112 | 6106 | 5.7 | 6.0 | 2985 | 2939 | 7.2 | 7.7
KGloVe (AVG) | 8247 | 8243 | 1.7 | 1.7 | 2123 | 2077 | 11.1 | 11.1

Approaches for link prediction
TransE | 263 | 251 | 75.4 | 89.2 | 243 | 125 | 34.9 | 47.1
TransR | 232 | 219 | 78.3 | 91.7 | 226 | 78 | 43.8 | 65.5
RotatE | – | 309 | – | 95.9 | – | 40 | – | 88.4
DistMult | – | – | – | 57.7 | – | – | – | 94.2
RESCAL | 1180 | 1163 | 37.2 | 52.8 | 828 | 683 | 28.4 | 44.1
ComplEx | – | – | – | 94.7 | – | – | – | 84.0
approach outperforms DistMult and RESCAL on WN18, and both approaches are about on par with RESCAL on FB15k. Except for node2vec on FB15k, the other data mining approaches fail to produce sensible results. While the results are not overwhelming, they show that the similarity of entities, as RDF2vec models it, is at least a useful signal for implementing a link prediction approach.
6.5 Conclusion
As already discussed above, the notion of similarity which is conveyed by RDF2vec mixes similarity and relatedness. This can be observed, e.g., when querying for the 10 closest concepts to Angela Merkel (the chancellor, i.e., head of government in Germany from 2005 to 2021) in DBpedia in the different spaces, as shown in Table 6.6.
Table 6.6 Closest concepts to Angela Merkel in the different embedding approaches used

RDF2vec | TransE-L1 | TransE-L2
Joachim Gauck | Gerhard Schröder | Gerhard Schröder
Norbert Lammert | James Buchanan | Helmut Kohl
Stanislaw Tillich | Neil Kinnock | Konrad Adenauer
Andreas Voßkuhle | Nicolas Sarkozy | Helmut Schmidt
Berlin | Joachim Gauck | Werner Faymann
German language | Jacques Chirac | Alfred Gusenbauer
Germany | Jürgen Trittin | Kurt Georg Kiesinger
federalState | Sigmar Gabriel | Philipp Scheidemann
Social Democratic Party | Guido Westerwelle | Ludwig Erhard
Deputy | Christian Wulff | Wilhelm Marx

TransR | DistMult | RotatE
Sigmar Gabriel | Gerhard Schröder | Pontine raphe nucleus
Frank-Walter Steinmeier | Milan Truban | Jonathan W. Bailey
Philipp Rösler | Maud Cuney Hare | Zokwang Trading
Gerhard Schröder | Tristan Matthiae | Steven Hill
Joachim Gauck | Gerda Hasselfeldt | Chad Kreuter
Christian Wulff | Faustino Sainz Muñoz | Fred Hibbard
Guido Westerwelle | Joachim Gauck | Mallory Ervin
Helmut Kohl | Carsten Linnemann | Paulinho Kobayashi
Jürgen Trittin | Norbert Blüm | Fullmetal Alchemist and the Broken Angel
Jens Böhrnsen | Neil Hood | Archbishop Dorotheus of Athens

RESCAL | ComplEx | KGloVe
Gerhard Schröder | Gerhard Schröder | Aurora Memorial National Park
Kurt Georg Kiesinger | Diána Mészáros | Lithuanian Wikipedia
Helmut Kohl | Francis M. Bator | Baltic states
Annemarie Huber-Hotz | William B. Bridges | The Monarch (production team)
Wang Zhaoguo | Mette Vestergaard | Leeds Ladies F.C. Lauryn Colman
Franz Vranitzky | Ivan Rosenqvist | Steven Marković
Bogdan Klich | İrsen Küçük | Edward Clouston
Funk This (George Porter Jr. album) | Antonio Capuzzi | A Perfect Match (Ella Fitzgerald album)
Helmut Schmidt | Steven J. McAuliffe | Salty liquorice
Mao Zedong | Jenkin Coles | WMMU-FM

RDF2vec_oa | node2vec | DeepWalk
Joachim Gauck | Sigmar Gabriel | Manuela Schwesig
Norbert Lammert | Guido Westerwelle | Irwin Fridovich
Stanislaw Tillich | Christian Wulff | Holstein Kiel Dominik Schmidt
Andreas Voßkuhle | Jürgen Trittin | Ella Germein
Berlin | Wolfgang Schäuble | Goyang Citizen FC Do Sang-Jin
German language | Joachim Gauck | Sean Cashman
Germany | Philipp Rösler | Chia Chiao
Christian Wulff | Joachim Sauer | Albrix Niigata Goson Sakai
Gerhard Schröder | Franz Müntefering | Roz Kelly
federalState | Frank-Walter Steinmeier | Alberto Penny
The comparison shows a few interesting effects:
• While most of the approaches (except for RotatE, KGloVe, and DeepWalk) provide a clean list of people, RDF2vec brings up a larger variety of results, containing also Germany and Berlin (and also a few results which are not instances, but relations; however, those could be filtered out easily in downstream applications if necessary). This demonstrates the property of RDF2vec of mixing similarity and relatedness. The people in the RDF2vec result set are all related to Angela Merkel: Joachim Gauck was president during her chancellorship, Norbert Lammert was the head of parliament, Stanislaw Tillich was a leading board member in the same party as Merkel, and Andreas Voßkuhle was the head of the highest court during her chancellorship.
• The approaches at hand have different foci in determining similarity. For example, TransE-L1 outputs mostly German politicians (Schröder, Gauck, Trittin, Gabriel, Westerwelle, Wulff) and former presidents of other countries (Buchanan as a former US president, Sarkozy and Chirac as former French presidents), TransE-L2 outputs a list containing many former German chancellors (Schröder, Kohl, Adenauer, Schmidt, Kiesinger, Erhard), and TransR mostly lists German party leaders (Gabriel, Steinmeier, Rösler, Schröder, Wulff, Westerwelle, Kohl, Trittin). Likewise, node2vec produces a list of German politicians, with the exception of Merkel's husband Joachim Sauer.16 In all of those cases, the persons share some property with the query entity Angela Merkel (profession, role, nationality, etc.), but their similarity is usually affected only by one of those properties. In other words: one notion of similarity dominates the others.
• In contrast, the persons in the output list of RDF2vec are related to the query entity in different respects. In particular, they played different roles during Angela Merkel's chancellorship (Gauck was the German president, Lammert was the chairman of the parliament, and Voßkuhle was the chairman of the federal court). Here, there is no dominant property; instead, similarity (or rather: relatedness) is encoded along various properties.

RDF2vec_oa yields results that are slightly closer to the politician lists of the other approaches, while the result list of KGloVe looks more like a random list of entities. A similar observation can be made for DeepWalk, which, with the exception of the first result (which is a German politician), does not produce any results seemingly related to the query concept at hand.

With that observation in mind, we can come up with an initial set of recommendations for choosing embedding approaches:

• Approaches such as RDF2vec work well when dealing with sets of homogeneous entities, like the node classification problem used as a running example in this book. Here, the problem of confusing related entities (like Merkel and Berlin) is negligible, because all entities are of the same kind anyways. In those cases, RDF2vec captures the finer distinctions between the entities better than embeddings for link prediction, and it encodes a larger variety of semantic relations.
• From the approaches for data mining, those which respect order (RDF2vec_oa and node2vec) work better than those which do not (classic RDF2vec, KGloVe, and DeepWalk).17
• For problems where heterogeneous sets of entities are involved, embeddings for link prediction often do a better job in telling different entities apart.

Link prediction is a problem of the latter kind: in embedding spaces where different types are properly separated, link prediction mistakes are much rarer. Given an embedding space where entities of the same type are always closer than entities of a different type, a link prediction approach will always rank all "compatible" entities higher than all incompatible ones. Consider the following example in FB15k: instrument(Gil Scott-Heron, ?). Here, musical instruments are expected in the object position. However, approaches like RDF2vec will suggest plausible candidates such as electric guitar and acoustic guitar, but also guitarist and Jimmy Page (who is a well-known guitarist). While electric guitar, guitarist, and Jimmy Page are semantically related, not all of them are sensible predictions here, and the fact that RDF2vec reflects that semantic relatedness is a drawback in link prediction.

The same argument underlies an observation made by Zouaq and Martel (2020): the authors found that RDF2vec is particularly well suited for distinguishing fine-grained entity classes (as opposed to coarse-grained entity classification). For fine-grained classification (e.g., distinguishing guitar players from piano players), all entities to be classified are already of the same coarse class (e.g., musician), and RDF2vec is very well suited for capturing the finer differences. However, for coarse classifications, such as distinguishing persons from musical instruments, misclassifications by mistaking relatedness for similarity become more salient.

From the observations made in the link prediction task, we can come up with another recommendation:

• For relations that come with rather clean data quality, link prediction approaches work well. However, for more noisy data, RDF2vec has a higher tendency of creating useful embedding vectors. For the moment, this is a hypothesis, which should be hardened, e.g., by performing controlled experiments on artificially noised link prediction tasks.

16 The remaining approaches (RotatE, DistMult, RESCAL, ComplEx, KGloVe, DeepWalk) produce lists of (mostly) persons which, in their majority, share no close link to the query concept Angela Merkel.
17 As discussed above, this comment holds for the default configuration of node2vec and DeepWalk used in this book.
These findings give rise to both a recommendation and some future work. On the one hand, in use cases where relatedness plays a role next to similarity, or in use cases where all entities are of the same type, approaches like RDF2vec may yield better results. On the other hand, for cases with mixed entity types where it is important to separate the types, link prediction embeddings might yield better results.
References

Ali M, Berrendorf M, Hoyt CT, Vermue L, Sharifzadeh S, Tresp V, Lehmann J (2021) Pykeen 1.0: a python library for training and evaluating knowledge graph embeddings. J Mach Learn Res 22(82):1–6
Bakhshandegan Moghaddam F, Draschner C, Lehmann J, Jabeen H (2021) Literal2feature: an automatic scalable rdf graph feature extractor. In: Further with knowledge graphs. IOS Press, pp 74–88. https://dx.doi.org/10.3233/SSW210036
Basu S, Chakraborty S, Hassan A, Siddique S, Anand A (2020) Erlkg: entity representation learning and knowledge graph based association analysis of covid-19 through mining of unstructured biomedical corpora. In: Proceedings of the first workshop on scholarly document processing, pp 127–137. http://dx.doi.org/10.18653/v1/2020.sdp-1.15
Bordes A, Usunier N, Garcia-Duran A, Weston J, Yakhnenko O (2013) Translating embeddings for modeling multi-relational data. Adv Neural Inf Process Syst 26
Celebi R, Uyar H, Yasar E, Gumus O, Dikenelli O, Dumontier M (2019) Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings. BMC Bioinf 20(1):1–14. https://doi.org/10.1186/s12859-019-3284-5
Chen J, Chen X, Horrocks I, B Myklebust E, Jimenez-Ruiz E (2020) Correcting knowledge base assertions. In: Proceedings of the web conference 2020, pp 1537–1547. https://doi.org/10.1145/3366423.3380226
Cochez M, Ristoski P, Ponzetto SP, Paulheim H (2017b) Global rdf vector space embeddings. In: International semantic web conference. Springer, pp 190–207. https://doi.org/10.1007/978-3-319-68288-4_12
Dai Y, Wang S, Xiong NN, Guo W (2020) A survey on knowledge graph embedding: approaches, applications and benchmarks. Electronics 9(5):750. https://doi.org/10.3390/electronics9050750
Daza D, Cochez M, Groth P (2021) Inductive entity representations from text via link prediction. In: Proceedings of the web conference 2021, pp 798–808. https://doi.org/10.1145/3442381.3450141
Dettmers T, Minervini P, Stenetorp P, Riedel S (2018) Convolutional 2d knowledge graph embeddings. In: Thirty-second AAAI conference on artificial intelligence
Gesese GA, Biswas R, Alam M, Sack H (2021) A survey on knowledge graph embeddings with literals: which model links better literal-ly? Semant Web 12(4):617–647. https://dx.doi.org/10.3233/SW-200404
Han X, Cao S, Lv X, Lin Y, Liu Z, Sun M, Li J (2018) OpenKE: an open toolkit for knowledge embedding. In: Proceedings of the 2018 conference on empirical methods in natural language processing: system demonstrations, pp 139–144. http://dx.doi.org/10.18653/v1/D18-2024
Ji S, Pan S, Cambria E, Marttinen P, Philip SY (2021) A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans Neural Netw Learn Syst 33(2):494–514
Karim MR, Cochez M, Jares JB, Uddin M, Beyan O, Decker S (2019) Drug-drug interaction prediction based on knowledge graph embeddings and convolutional-lstm network. In: Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pp 113–123. https://doi.org/10.1145/3307339.3342161
Kejriwal M, Szekely P (2017) Supervised typing of big graphs using semantic embeddings. In: Proceedings of the international workshop on semantic big data, pp 1–6. https://doi.org/10.1145/3066911.3066918
Lin Y, Liu Z, Sun M, Liu Y, Zhu X (2015) Learning entity and relation embeddings for knowledge graph completion. In: Twenty-ninth AAAI conference on artificial intelligence
Mikolov T, Yih Wt, Zweig G (2013c) Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, Association for Computational Linguistics, Atlanta, pp 746–751. https://aclanthology.org/N13-1090
Nickel M, Tresp V, Kriegel HP (2011) A three-way model for collective learning on multi-relational data. In: International conference on machine learning, pp 809–816
Paulheim H (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semant Web 8(3):489–508
Pellegrino MA, Altabba A, Garofalo M, Ristoski P, Cochez M (2020) Geval: a modular and extensible evaluation framework for graph embedding techniques. In: European semantic web conference. Springer, pp 565–582. https://doi.org/10.1007/978-3-030-49461-2_33
Portisch J, Heist N, Paulheim H (2022) Knowledge graph embedding for data mining versus knowledge graph embedding for link prediction: two sides of the same coin? Semant Web 13(3):399–422
Portisch J, Hladik M, Paulheim H (2020a) Kgvec2go: knowledge graph embeddings as a service. In: Proceedings of the 12th language resources and evaluation conference, pp 5641–5647
Ristoski P, Rosati J, Di Noia T, De Leone R, Paulheim H (2019) Rdf2vec: Rdf graph embeddings and their applications. Semant Web 10(4):721–752
Ristoski P, Paulheim H (2016) Rdf2vec: Rdf graph embeddings for data mining. In: International semantic web conference. Springer, pp 498–514
Rossi A, Barbosa D, Firmani D, Matinata A, Merialdo P (2021) Knowledge graph embedding for link prediction: a comparative analysis. ACM Trans Knowl Discov Data (TKDD) 15(2):1–49. https://doi.org/10.1145/3424672
Shi B, Weninger T (2018) Open-world knowledge graph completion. In: Thirty-second AAAI conference on artificial intelligence
Sofronova R, Biswas R, Alam M, Sack H (2020) Entity typing based on rdf2vec using supervised and unsupervised methods. In: European semantic web conference. Springer, pp 203–207. https://doi.org/10.1007/978-3-030-62327-2_35
Steenwinckel B, Vandewiele G, Rausch I, Heyvaert P, Taelman R, Colpaert P, Simoens P, Dimou A, De Turck F, Ongenae F (2020) Facilitating the analysis of covid-19 literature through a knowledge graph. In: International semantic web conference. Springer, pp 344–357. https://doi.org/10.1007/978-3-030-62466-8_22
Sun Z, Deng ZH, Nie JY, Tang J (2018) Rotate: knowledge graph embedding by relational rotation in complex space. In: International conference on learning representations
Tan PN, Steinbach M, Kumar V (2016) Introduction to data mining. Pearson Education, India
Toutanova K, Chen D, Pantel P, Poon H, Choudhury P, Gamon M (2015) Representing text for joint embedding of text and knowledge bases. In: Proceedings of the 2015 conference on empirical methods in natural language processing, pp 1499–1509. http://dx.doi.org/10.18653/v1/D15-1174
Trouillon T, Welbl J, Riedel S, Gaussier É, Bouchard G (2016) Complex embeddings for simple link prediction. In: International conference on machine learning, PMLR, pp 2071–2080
Vandewiele G, Steenwinckel B, Agozzino T, Ongenae F (2022) pyrdf2vec: a python implementation and extension of rdf2vec. https://arxiv.org/abs/2205.02283
Wang Y, Dong L, Jiang X, Ma X, Li Y, Zhang H (2021b) Kg2vec: a node2vec-based vectorization model for knowledge graph. PLoS ONE 16(3):e0248552. https://doi.org/10.1371/journal.pone.0248552
Wang X, Gao T, Zhu Z, Zhang Z, Liu Z, Li J, Tang J (2021a) Kepler: a unified model for knowledge embedding and pre-trained language representation. Trans Assoc Comput Linguist 9:176–194. https://doi.org/10.1162/tacl_a_00360
Wang Q, Mao Z, Wang B, Guo L (2017) Knowledge graph embedding: a survey of approaches and applications. IEEE Trans Knowl Data Eng 29(12):2724–2743. https://doi.org/10.1109/TKDE.2017.2754499
Xie R, Liu Z, Jia J, Luan H, Sun M (2016) Representation learning of knowledge graphs with entity descriptions. In: Proceedings of the AAAI conference on artificial intelligence, vol 30
Yang B, Yih Wt, He X, Gao J, Deng L (2015) Embedding entities and relations for learning and inference in knowledge bases. In: International conference on learning representations
Zouaq A, Martel F (2020) What is the schema of your knowledge graph? Leveraging knowledge graph embeddings and clustering for expressive taxonomy learning. In: Proceedings of the international workshop on semantic big data, pp 1–6. https://doi.org/10.1145/3391274.3393637
7 Example Applications Beyond Node Classification
Abstract
While our running example in this book was node classification, and also the benchmark datasets discussed above use mostly node classification as a target (with a few exceptions), RDF2vec is more versatile. In this chapter, we show examples of works that describe the use of RDF2vec for other purposes, such as recommender systems, relation extraction, ontology learning, or knowledge graph matching.
7.1 Recommender Systems with RDF2vec
As we have seen in the previous chapters, RDF2vec has the property of assigning similar vectors to similar and related entities. One typical application scenario for this feature is the development of content-based recommender systems based on knowledge graphs: by exploiting the proximity of movies to those a user has already watched, we can recommend similar or related movies (Ristoski et al. 2019, Rosati et al. 2016). Figure 7.1 illustrates this idea, visualizing a few movie vectors obtained from KGvec2go (see Chap. 5) in a two-dimensional PCA plot. We can see that there are visible groups of movies of different genres. We can observe a few clusters emerging, e.g., a group of Disney movies in the center and lower part of the plot, the two western-related movies Unforgiven and Wild Wild West being close to each other, etc.
Fig. 7.1 Example scatter plot of movies in the 1k movies dataset, and the two example users used in the example
7.1.1 An RDF2vec-Based Movie Recommender in Less than 20 Lines of Code
With the already mentioned KGvec2go, we can build a simple content-based recommender system based on DBpedia. This recommender system is shown in Listing 7.1. This very basic recommender system first averages the vectors of all movies that a user has watched so far. This average vector can be considered a very basic user model (Bobadilla et al. 2013). In the second step, movies in the vicinity of that average vector are proposed. Table 7.1 shows recommendations for two example users, each with three movies in their history. Especially for user 2, whose history consists of three Disney movies, we can see a strong tendency to recommend mostly Disney movies.1
1 We observe that in this basic version, the movies watched in the past are still contained in the recommendations. Such recommendations would usually be filtered out in a post-processing step in a practically deployed system.
Listing 7.1 A Simple Recommender System based on RDF2vec

import pandas as pd
from sklearn.neighbors import NearestNeighbors

# load file with movies
df = pd.read_csv('1kmovies.csv', sep="\t")
dfX = df[['Movie']]

# retrieve vectors from kgvec2go (see chapter 5)
vectors = retrieve_kgvec2go_vectors(list(dict.fromkeys(dfX['Movie'].to_list())))
dfXvectors = pd.DataFrame.from_records(vectors)

# Prepare model for retrieval
knn = NearestNeighbors(n_neighbors=10, algorithm='auto', metric='cosine')
knn.fit(dfXvectors.to_numpy())

# Perform the actual recommendation
def get_recommendations(user_profile):
    df_uservector = dfXvectors[dfX['Movie'].isin(user_profile)].mean()
    return knn.kneighbors([df_uservector], 10, return_distance=False)

# Create recommendations for example users
user1 = ["http://dbpedia.org/resource/The_Matrix",
         "http://dbpedia.org/resource/Interstellar_(film)",
         "http://dbpedia.org/resource/Blade_Runner"]
user2 = ["http://dbpedia.org/resource/Bambi",
         "http://dbpedia.org/resource/Aladdin_(1992_Disney_film)",
         "http://dbpedia.org/resource/Cinderella_(1950_film)"]

for index in get_recommendations(user1)[0]:
    print(dfX.at[index, 'Movie'])
for index in get_recommendations(user2)[0]:
    print(dfX.at[index, 'Movie'])
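Listing 7.1 relies on the helper retrieve_kgvec2go_vectors introduced in Chap. 5. For the sake of self-containment, a minimal stand-in could look as follows; the REST endpoint and the JSON layout used here are assumptions, so the actual KGvec2go API documentation should be consulted.

import requests

def retrieve_kgvec2go_vectors(entities, dataset="dbpedia"):
    """Fetch one embedding vector per entity from the KGvec2go web API.
    Endpoint and response format are assumptions in this sketch."""
    vectors = []
    for uri in entities:
        label = uri.rsplit("/", 1)[-1]  # e.g. 'The_Matrix' from the DBpedia URI
        resp = requests.get(f"http://kgvec2go.org/rest/get-vector/{dataset}/{label}")
        vectors.append(resp.json().get("vector"))
    return vectors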
The centroids of the two users are also depicted in Fig. 7.1 in red, where user 1’s centroid is in the upper half of the diagram, while user 2’s centroid is in the middle of the other Disney movies. Although this is a projection, it helps interpret the recommendation results depicted in Table 7.1.
Table 7.1 Example recommendations of the simple RDF2vec-based recommender system

User 1
  History: The Matrix, Interstellar, Blade Runner
  Recommendations: Blade Runner, The Matrix, Interstellar, Unforgiven, Blade Runner 2049, The Maltese Falcon (1941), Wild Wild West, Transformers: Revenge of the Fallen, L.A. Confidential, The Shining

User 2
  History: Bambi, Aladdin (1992), Cinderella (1950)
  Recommendations: Cinderella (1950), Aladdin (1992), Bambi, Beauty and the Beast (1991), Tangled, Peter Pan (1953), The Three Caballeros, The Little Mermaid (1989), Beauty and the Beast (2017), Saludos Amigos

7.1.2 Combining Knowledge Graph Embeddings with Other Information
One advantage of creating vector-shaped representations of entities in knowledge graphs is that they can easily be combined with other vector representations. In the movie recommendation case, for example, one could use embedding methods for text (Naseem et al. 2021), images (Voulodimos et al. 2018), or videos (Khan et al. 2022) to create vector representations of textual descriptions, movie posters, or movie trailers. Those can then be easily joined with the vectors from the knowledge graph. To illustrate this principle, we use text embeddings of the Wikipedia pages for the movies in the 1k movies dataset along with the graph embeddings. To that end, we leverage the pretrained wikipedia2vec model (Yamada et al. 2018).2 The overall process is shown in Fig. 7.2. First, the two embedding vectors are obtained in isolation, then they are concatenated to a joint vector.
2 In order to obtain two embeddings that are complementary, we used the embedding model which only uses text, not the link graph, in our example.
Fig. 7.2 Jointly embedding information from the knowledge graph and a textual description
Pure concatenation can already deliver decent results, but there are some issues. The different embedding models may have different dimensionalities and therefore obtain a different weighting in the distance function used (in our example, the graph embeddings have 200 dimensions, while the text embeddings have 100). Moreover, some dimensions may be redundant in case both models capture the same information. Those issues are not so important if the system involves an additional modeling step on top of the embeddings, such as a downstream classifier, but for pure neighborhood search, like in this example, we need to make the embedding space more uniform. In order to combine different embeddings into a better joint embedding space, Thoma et al. (2017) have proposed a fusion step using techniques like PCA (Abdi and Williams 2010), SVD (Wall et al. 2003), or auto-encoders (Hinton et al. 2011). Listing 7.2 shows a hybrid recommender using RDF2vec and wikipedia2vec vectors. The retrieved vectors are concatenated, and two variants are shown, as in Fig. 7.2: (1) using the concatenated vectors directly and (2) performing an SVD dimensionality reduction to combine the two embedding spaces.
Listing 7.2 A Hybrid Recommender System based on RDF2vec and wikipedia2vec

import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.decomposition import TruncatedSVD

# load file with movies
df = pd.read_csv('1kmovies.csv', sep="\t")
dfX = df[['Movie']]

# retrieve vectors from kgvec2go & wikipedia2vec
entities = list(dict.fromkeys(df['Movie'].to_list()))
rdf2vec_vectors = pd.DataFrame.from_records(retrieve_kgvec2go_vectors(entities))
wiki2vec_vectors = pd.DataFrame.from_records(retrieve_wikipedia2vec_vectors(entities))
dfJoinedVectors = pd.concat([rdf2vec_vectors, wiki2vec_vectors], axis=1, join="inner")

# Prepare model for retrieval on concatenated vectors
knn = NearestNeighbors(n_neighbors=10, algorithm='auto', metric='cosine')
knn.fit(dfJoinedVectors.to_numpy())
# ... use model for recommendation (see above)

# Prepare model for retrieval on SVD of concatenated vectors
svd = TruncatedSVD(n_components=100)
svd_result = svd.fit_transform(dfJoinedVectors)
svdDf = pd.DataFrame(data=svd_result)
knn.fit(svdDf.to_numpy())
# ... use model for recommendation (see listing 7.1)
Table 7.2 shows the recommendations both with wikipedia2vec alone as well as with the hybrid recommendation variants. While the wikipedia2vec-based recommender picks up the genre pretty well (it only shows science fiction movies for user 1 and only Disney movies for user 2), the combined recommendations become even more specific. For user 1, for example, only the combined recommendations contain the sequels Blade Runner 2049 and The Matrix Reloaded, where the latter was recommended neither by wikipedia2vec nor by RDF2vec alone.
Table 7.2 Example recommendations for the hybrid recommender

User 1
  History: The Matrix, Interstellar, Blade Runner
  wikipedia2vec only: Interstellar, Inception, The Matrix, Tron: Legacy, Avatar (2009), Minority Report, Guardians of the Galaxy, War of the Worlds (2005), The Martian, I, Robot
  Concatenation: Blade Runner, The Matrix, Interstellar, Blade Runner 2049, Tron: Legacy, Alien, The Shining, Wild Wild West, L.A. Confidential, Unforgiven
  SVD: Blade Runner, The Matrix, Interstellar, Blade Runner 2049, Tron, The Matrix Reloaded, Tron: Legacy, Transformers: Revenge of the Fallen, Alien, The Thing (1982)

User 2
  History: Bambi, Aladdin (1992), Cinderella (1950)
  wikipedia2vec only: Cinderella (1950), Bambi, Pinocchio (1940), Sleeping Beauty (1959), Aladdin (1992), The Little Mermaid (1989), Peter Pan (1953), Alice in Wonderland (1951), Snow White (1937), Dumbo
  Concatenation: Cinderella (1950), Bambi, Aladdin (1992), Beauty and the Beast (1991), Peter Pan (1953), The Little Mermaid (1989), Sleeping Beauty (1959), Pinocchio (1940), Alice in Wonderland (1951), Snow White (1937)
  SVD: Cinderella (1950), Bambi, Aladdin (1992), Peter Pan (1953), Beauty and the Beast (1991), Alice in Wonderland (1951), Sleeping Beauty (1959), The Little Mermaid (1989), Pinocchio (1940), Snow White (1937)
7.2 Ontology Matching
As we have seen so far, RDF2vec is capable of capturing the semantics of entities and projects related entities close to each other in a vector space. This characteristic has also been exploited in a subfield of data integration, namely ontology matching. This field is concerned with identifying similar and/or related concepts in two schemas or ontologies (Euzenat et al. 2007). In this section, we will look into two techniques for using RDF2vec for ontology matching: the first directly embeds the ontologies to be matched, while the second utilizes embeddings of external knowledge graphs.
7.2.1 Ontology Matching by Embedding Input Ontologies
Although they convey much richer and much more formal information, one could see ontologies as a form of knowledge graph: entities, such as classes and relations, are connected to each other through a set of different relations (such as subclass, domain, range, etc.). Since ontologies in languages such as OWL can be expressed in RDF, they can be processed by algorithms such as RDF2vec. A naive approach for ontology matching using RDF2vec could then look as follows: given two input ontologies, compute embedding vectors for all entities in those ontologies. Then, each pair of entities from both ontologies whose distance is below a given threshold is considered a match. This approach, however, is too naive. The main underlying misconception stems from the fact that we are embedding two disconnected knowledge graphs here, not just one.3 Therefore, we are also creating two embedding spaces, whose dimensions might be completely different. There is nothing in the RDF2vec mechanism that would ensure that two identical concepts have the same or even close coordinates in those two different vector spaces. For that reason, what is needed is a translation from one vector space to the other. Once the two vector spaces are aligned, an approach like the one sketched above can be performed. In Portisch et al. (2022a), we have discussed an approach for performing such an alignment of vector spaces by means of rotation. The overall approach is sketched in Fig. 7.3. In that paper, the extension by Dev et al. (2021) of the absolute orientation approach is used, which showed good performance on multilingual word embeddings. It uses a set of anchor points (i.e., entities in both spaces that are known to be identical) to compute the best possible rotation.

3 Due to the fact that there exists a common set of predicates, e.g., rdfs:subClassOf, the resulting walks contain at least a few common elements, so the embedding spaces are not created entirely in isolation from each other. That signal alone, however, is too weak to sensibly align classes and properties defined in the respective ontologies.
Fig. 7.3 Ontology matching by absolute orientation of embedding spaces, from Portisch et al. (2022a)
The calculation of the rotation matrix is based on two vector sets A = {a_1, a_2, ..., a_n} and B = {b_1, b_2, ..., b_n} of the same size n, where a_i, b_i ∈ R^d. In a first step, the means ā = (1/n) Σ_{i=1}^{n} a_i and b̄ = (1/n) Σ_{i=1}^{n} b_i are calculated. Now, ā and b̄ can be used to center A and B: Â ← A − ā and B̂ ← B − b̄. Given the sum of the outer products H = Σ_{i=1}^{n} b̂_i â_i^T, the singular value decomposition of H can be calculated: svd(H) = [U, S, V^T]. The rotation is R = U V^T. Lastly, B̂ can be rotated as follows: B' = B̂ R.
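The rotation described above can be computed in a few lines of NumPy. The following sketch only covers the plain absolute orientation step (centering, SVD of the cross-covariance, rotation), not the full extension by Dev et al. (2021):

import numpy as np

def absolute_orientation(A, B):
    """Rotate embedding space B onto A, where row i of A and B are anchor pairs.
    Minimal sketch of the rotation step; no scaling or further refinements."""
    a_mean, b_mean = A.mean(axis=0), B.mean(axis=0)
    A_c, B_c = A - a_mean, B - b_mean          # center both vector sets
    H = B_c.T @ A_c                            # sum of outer products b_i a_i^T
    U, S, Vt = np.linalg.svd(H)
    R = U @ Vt                                 # rotation matrix
    return B_c @ R, A_c                        # rotated B and centered A

# toy usage: B is a randomly rotated copy of A, so the rotation should recover A
rng = np.random.default_rng(1)
A = rng.normal(size=(100, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix
B_rotated, A_centered = absolute_orientation(A, A @ Q)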
7.2.1.1 Evaluation on Synthetic Data

In the paper, we first perform sandbox experiments on synthetic data. We generate a graph G with 2,500 nodes V. For each node v ∈ V, we draw a random number d using a Poisson distribution f(k; λ) = λ^k e^{−λ} / k! with λ = 4. We then randomly draw d nodes from V \ {v} and add an edge between v and each drawn node to G. We duplicate G as G' and generate an alignment A where each v ∈ V is mapped to its copy v' ∈ V'. We define the matching task such that G and G' shall be matched. The rotation is performed with a fraction α of A, referred to as the anchor alignment A'. In all experiments, we vary α between 0.2 and 0.8 in steps of size 0.2. In order to test the stability of the performed rotation, also referred to herein as training, we evaluate varying values for α. Each experiment is repeated 5 times to account for statistical variance. The matching precision is computed for each experiment on the training dataset A' and on the testing dataset A \ A'. The split between the training and the testing datasets is determined by α. We found that the model is able to map the entire graphs regardless of the size of the training set A' (each run achieved a precision of 100%). In order to test the stability in terms of noise in the anchor alignment A', we distort a share of the training correspondences by randomly matching other than the correct nodes.
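A sketch of the graph generation used in this sandbox setup is given below; it is a straightforward reading of the description above, not the original evaluation code:

import numpy as np

def synthetic_graph(n=2500, lam=4, seed=0):
    """Random graph as in the sandbox experiment: each node v gets a
    Poisson(lam)-distributed number d of edges to nodes drawn from V \\ {v}."""
    rng = np.random.default_rng(seed)
    edges = set()
    for v in range(n):
        d = rng.poisson(lam)
        neighbours = rng.choice([u for u in range(n) if u != v], size=d, replace=False)
        for u in neighbours:
            edges.add((v, int(u)))
    return edges

# the duplicated graph G' uses the same edges; the gold alignment maps each node to its copy
edges = synthetic_graph()
alignment = [(v, v) for v in range(2500)]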
Fig. 7.4 The effect of distortions. (1) alignment noise (left) and (2) size differences (right). Graphs are given for α = 0.2
We vary this level of alignment noise between 0 (no noise introduced) and 0.9 (90% of the alignments are randomly matched) in steps of size 0.1. Figure 7.4 (left) shows the performance with α = 0.2. We observe that the test performance declines with an increasing amount of noise. Interestingly, this relation is not linear. It is visible in Fig. 7.4 (left) that the approach can handle 40% of noise before dropping significantly in terms of test performance. Moreover, in order to test the stability in terms of graph heterogeneity, we randomly remove triples from the target graph G' after setting up the alignment between the source graph G and the target graph G'. We vary the fraction of randomly removed triples in G' between 0 (no triples removed) and 0.9 (90% of the triples removed) in steps of size 0.1. In Fig. 7.4 (right) it can be observed that with a size deviation of 30%, the performance starts to drop rapidly. Comparing the two plots in the figure, it can be seen that the approach handles noise significantly better than size and structure deviations in graphs.
7.2.1.2 Evaluation on Real-World Data
We also test our approach on the OAEI multifarm dataset (Meilicke et al. 2012). This dataset contains multilingual ontologies from the conference domain that have to be matched. Since the absolute orientation approach does not use textual data, we only evaluate the German-English test case. This is sufficient because the other language combinations of the multifarm dataset use structurally identical graphs. With a sampling rate of 20%, our approach achieves micro scores of P = 0.376, R = 0.347, and F1 = 0.361. Compared to the systems participating in the 2021 campaign (Pour et al. 2021), the recall is on par with state-of-the-art systems; the overall lower F1 is caused by a comparatively low precision score. While not outperforming the top OAEI systems in terms of F1, the performance indicates that the approach is able to perform ontology matching and may particularly benefit from the addition of non-structure-based features. What we can observe from those results is that the approach works well for graphs that are structurally similar to one another. Here, the embedding spaces are reasonably similar, so that they can be aligned by a rotation approach. Similar observations have been made by
Hertling et al. (2020) for a linear transformation between RDF2vec spaces for alignment, which also yields decent results for structurally similar ontologies. More complex alignment functions, like training a non-linear transformation using a neural network, would also be possible. For example, Lütke (2019) trains an XGBoost classifier on pairs of entity embeddings from both ontologies to identify matching pairs. While the approach sketched above always uses a set of anchor mappings, it could be turned into a fully automatic approach by first computing a set of trivial correspondences (e.g., by name equivalence), and using those for computing the alignment of the spaces. In such a pipeline, one would first create syntactic matches based on entity names, and refine those with structurally similar entities in the second step.
7.2.2
Ontology Matching by Embedding External Knowledge Graphs
Instead of using embeddings of the input ontologies themselves, another approach for matching knowledge graphs and ontologies relies on external knowledge. Here, one can try to find matching terms in an external knowledge graph and assume that they are identical if those terms are reasonably close. One such approach is the ontology matching system ALOD2vec (Portisch and Paulheim 2018). That approach uses the large knowledge graph WebIsALOD (see Chap. 1). That graph contains several hundred million concepts and hypernym-hyponym pairs (Hertling and Paulheim 2017, Seitner et al. 2016), and it is thus one of the largest structured sources of conceptual knowledge. The matching system has two main steps: first, trivial matches (i.e., concepts of the same name) are created in a simple and efficient string-matching process. For the remaining concepts, links to entities in the WebIsALOD graph are generated, and the corresponding vectors are retrieved. These vectors are then compared in pairwise comparisons, and the pairs with the lowest distance are added as candidates. For matching using WebIsALOD, links of the concepts' labels from the ontology to concepts in the WebIsALOD dataset are established first. To that end, string operations are performed on the label and it is checked whether the label is available in WebIsALOD. If it cannot be found, a token-by-token lookup is performed. Given two entities $e_1$ and $e_2$, the matcher uses their textual labels to link them to concepts $e_1'$ and $e_2'$ in the external dataset. Afterward, the embedding vectors $v_{e_1'}$ and $v_{e_2'}$ of the linked concepts $e_1'$ and $e_2'$ are retrieved via a Web request, and the cosine similarity between those is calculated. Hence: $sim(e_1, e_2) = sim_{cosine}(v_{e_1'}, v_{e_2'})$. If $sim(e_1, e_2) > t$, where $t$ is a threshold between 0 and 1, a correspondence is added to a temporary alignment. In the last step, a one-to-one arity is enforced by applying a Maximum Weight Bipartite (Cruz et al. 2009) filter on the temporary alignment.
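The similarity-and-threshold step can be sketched as follows (the threshold value, the data structures, and the function names are illustrative; the final one-to-one filtering is omitted here):

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def match(entities_1, entities_2, linked_vectors, t=0.85):
    # linked_vectors: dict mapping an ontology entity to the vector of the
    # WebIsALOD concept it has been linked to; unlinked entities are skipped
    alignment = []
    for e1 in entities_1:
        for e2 in entities_2:
            if e1 in linked_vectors and e2 in linked_vectors:
                sim = cosine(linked_vectors[e1], linked_vectors[e2])
                if sim > t:
                    alignment.append((e1, e2, sim))
    return alignment   # a maximum weight bipartite filter would be applied afterwards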
Fig. 7.5 Architecture of the ALOD2vec matcher system, from Portisch and Paulheim (2022a)
Implementation-wise, the matcher uses the linked data interface of WebIsALOD, as well as the service KGvec2go (see Sect. 5.1.1), so that the matching system itself is rather lean, as it does not come with any precomputed vectors (Fig. 7.5). Generally, the ALOD2vec matching system can identify correspondences that are less obvious, like anterior lingual gland and lingual salivary gland being the same concept. At the same time, since RDF2vec also focuses on relatedness (note that the vanilla version of RDF2vec was used in the original ALOD2vec system), the system has a tendency to bring up false correspondences, like facial bone and facial muscle.
7.3
Further Use Cases
The Web page rdf2vec.org4 collects usage examples of RDF2vec from various domains. In the following, we want to briefly outline a few of those use cases.
7.3.1
Knowledge Graph Refinement
Knowledge graph refinement deals with the improvement of existing knowledge graphs, either by completing the knowledge graph by adding missing information or by flagging (and potentially removing) false information (Paulheim 2017).
4 http://www.rdf2vec.org/.
In Chap. 6, we have encountered one area of knowledge graph refinement which has become very popular in the field of embeddings, mostly due to the high availability and popularity of benchmark datasets, i.e., link prediction. Here, the assumption is that a knowledge graph is per se incomplete (due to the open world assumption) and that a model trained on that graph can be used to predict additional links between entities. Another common task in knowledge graph refinement is entity type prediction. Here, it is assumed that the type information in knowledge graphs is incomplete, be it missing entirely or too coarse-grained (Paulheim and Bizer 2013). Some approaches use RDF2vec embeddings for type prediction. Typically, they create a training set from the embedding vectors of typed entities and train a classifier on top to predict the respective types. Such approaches have been proposed, e.g., by Kejriwal and Szekely (2017), Sofronova et al. (2020), and Jain et al. (2021), where the latter also compare different embedding methods. The text embedding method BERT (Devlin et al. 2018) has been reported to beat many state-of-the-art techniques. It is therefore remarkable that Cutrona et al. (2021) report that RDF2vec-based embeddings are superior in entity type prediction, compared to BERT using textual entity descriptions. In Biswas et al. (2022), we have shown that it is possible to combine the best of both worlds, where the best entity typing results are obtained by combining different flavors of RDF2vec with textual embeddings based on BERT. The approach is shown in Fig. 7.6. Yao and Barbosa (2021) solve a related problem, i.e., the identification of wrong type assertions. Like in the recommender example above, they combine RDF2vec and Wikipedia2vec embeddings. The embeddings are combined into a common representation, on which outlier detection methods are applied to find entities that possibly have a wrong type asserted.
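A typical setup of such a type predictor is a standard classifier trained on the embedding vectors of typed entities, as in the following sketch (the classifier choice, dimensionality, and toy data are illustrative and not taken from any of the cited works):

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_type_predictor(entity_vectors, type_labels):
    # entity_vectors: (n, d) array of RDF2vec vectors of entities with known types
    # type_labels: length-n array of their type labels
    return LogisticRegression(max_iter=1000).fit(entity_vectors, type_labels)

# toy usage; in practice, the vectors come from a trained RDF2vec model
X = np.random.rand(100, 200)
y = np.random.choice(["Person", "Place"], 100)
model = train_type_predictor(X, y)
predicted_types = model.predict(np.random.rand(5, 200))   # vectors of untyped entities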
Fig. 7.6 Entity typing by combining RDF2vec graph embeddings with textual entity descriptions (Biswas et al. 2022)
Fig. 7.7 Triple classification with a relation identifier (left) and a vector representation for the relation (right)
Ammar and Celebi (2019) look into general fact validation in knowledge graphs. They use supervised learning on labeled ground-truth statements to determine truth values for statements in a knowledge graph. To that end, they concatenate RDF2vec representations of subjects and objects, and an additional numeric value to encode the relation. A similar approach is pursued by Pister and Atemezing (2019), who concatenate the vectors for subject, predicate, and object. They also test various embedding methods besides RDF2vec (i.e., TransE, TransR, ComplEx, DistMult, and HolE), which are consistently outperformed. These two approaches for fact validation differ in the way they represent relations. Figure 7.7 shows a simplified schematic representation of these two approaches. Conceptually, the difference is that the approach using embedding vectors to represent relations can also learn from the similarity between different relations, while the one using only numeric identifiers (or a one-hot encoding) cannot. Figure 7.8 shows a visualization of a few common relations in DBpedia. It can be observed that RDF2vec captures some semantic similarities between relations: there is a cluster of relations describing locations in the bottom left, kingdom and phylum (which are used for classifying organisms) on the right, relations about space missions around (0,0), and a few relations from the movie domain in the top left corner.
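The two ways of representing the relation can be illustrated with the following feature constructions (a minimal sketch of the two variants in Fig. 7.7; all dimensions are arbitrary):

import numpy as np

def features_with_relation_id(s_vec, o_vec, relation_index):
    # left variant of Fig. 7.7: subject and object embeddings plus a numeric relation identifier
    return np.concatenate([s_vec, o_vec, [relation_index]])

def features_with_relation_vector(s_vec, p_vec, o_vec):
    # right variant of Fig. 7.7: subject, predicate, and object embeddings;
    # similar relations get similar representations
    return np.concatenate([s_vec, p_vec, o_vec])

# toy usage with 200-dimensional embeddings
s, p, o = np.random.rand(200), np.random.rand(200), np.random.rand(200)
x1 = features_with_relation_id(s, o, relation_index=3)
x2 = features_with_relation_vector(s, p, o)
# either vector would be fed into a binary classifier predicting whether the triple holds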
7.3.2
Natural Language Processing
In Sect. 7.3.2.3, we will see how to use RDF2vec for relation extraction. RDF2vec has also been utilized in various other tasks in Natural Language Processing (NLP). Here, we will look into examples of named entity linking and text classification.
7.3.2.1 Named Entity Linking
Named entity linking is the task of linking entity mentions in a text or in structured data, like tables, to a knowledge graph (van Erp et al. 2016).
Fig. 7.8 Vectors for example relations in DBpedia from KGvec2go visualized using 2D PCA
Some approaches rely on the entity mentions being identified in a previous named entity recognition step, while there are also holistic approaches performing both entity recognition and linking in the same process. Embeddings of entities in the target knowledge graph can be exploited to improve the named entity linking, in particular for disambiguation, in case different candidate entities exist. To that end, knowledge graph embeddings are often used. Inan and Dikenelli (2017) incorporate RDF2vec embeddings in two named entity linking tools. Following the notion that co-mentioned entities should be related, RDF2vec embeddings can help select a subset of entities from a candidate set which maximizes the relatedness of the selected candidates. A similar approach is used as a baseline by Vaigh et al. (2020), which proves to be a rather strong baseline. A slightly more sophisticated approach is EARL (Dubey et al. 2018), which uses pre-trained RDF2vec entity embeddings to initialize a complex entity and relation linking and disambiguation model. A related field is entity identification in tables, also known as table annotations. In this field, RDF2vec has been exploited, e.g., by Cutrona et al. (2021). In that case, if multiple
candidates for entities in tables exist, proximity in an embedding space can be used to pick the final set of entities for the annotations (Shigarov et al. 2021). Apart from entity disambiguation, RDF2vec has also been used for candidate set expansion. One such example is shown by Nizzoli et al. (2020), who tackle the task of linking geographic entities and expanding the candidate sets by entities that are close in the embedding space (in addition to other expansion techniques, such as entities with similar spellings or entities which are topologically close in the target knowledge graph).
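The coherence-based disambiguation idea sketched above (selecting, for each mention, the candidate that maximizes the relatedness of the jointly selected entities) can be illustrated with the following brute-force sketch (entity vectors are assumed to be pre-computed RDF2vec embeddings; real systems replace the exhaustive enumeration with heuristics or beam search):

import numpy as np
from itertools import product

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def disambiguate(candidate_sets, vectors):
    # candidate_sets: one list of candidate entity URIs per mention
    # vectors: dict mapping an entity URI to its RDF2vec vector
    best_choice, best_score = None, float("-inf")
    for choice in product(*candidate_sets):
        score = sum(cosine(vectors[a], vectors[b])
                    for i, a in enumerate(choice) for b in choice[i + 1:])
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice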
7.3.2.2 Text Classification
Text classification is nowadays mostly done using text embedding methods like BERT (Devlin et al. 2018). However, some approaches also use knowledge graphs and embeddings of entities therein, in particular when it comes to shorter texts, such as news ticker messages or social media postings. Türker (2019) discusses an approach for using knowledge graph embeddings for short text classification. In that approach, candidate entities are identified in the text at hand. For those, categories5 are retrieved, and their embedding vectors are used as input into a final classifier determining the topic of the text. Benítez-Andrades et al. (2022) address the task of classifying tweets, i.e., short texts on Twitter. They consider both text and entity embeddings and report that the best results are achieved by using a combination of both. Engleitner et al. (2021) use knowledge graph embeddings to enhance a given set of topic tags for a text by retrieving entities which are close in the vector space. Given that an article is tagged with a topic like Mannheim, they would use the closest entities to Mannheim as additional tags.
7.3.2.3 Relation Extraction
Following the vision of the Semantic Web, many companies create their own knowledge graphs, which are collections of connected descriptions of entities and factual information within a specific field of interest (Noy et al. 2019). Keeping these graphs updated is crucial, as new data must be continually added. This process, known as knowledge graph population, involves extracting new entities and relationships between entities from various sources, such as unstructured text and other existing knowledge graphs. Following this motivation, a plethora of work on entity and relation mining from text and knowledge graphs has been published in the literature (Kumar 2017, Pawar et al. 2017).
5 Categories are a special construct in DBpedia. They are derived from the categories in Wikipedia and are a broader construct than types. For example, the entity DBpedia is a member of the categories Semantic Web and Open Data, among others.
In Ristoski et al. (2020), we present a large-scale information extraction system designed to extract relationships between
entities from both Web documents and knowledge graphs using RDF2vec as one of the main components. The system extracts user-defined target relations between a set of KG entities by first conducting a large focused crawl (Chakrabarti et al. 1999) of web documents for relation extraction. To improve performance, a human-in-the-loop component (Wu et al. 2022b) is integrated to generate meaningful training data and reduce the amount of data needed for extraction. The final step uses state-of-the-art deep neural networks and external knowledge graphs for relation mining, specifically identifying relations not explicitly present in the graph. The method for extracting target relations from an external knowledge graph involves three steps: (i) linking entities to a reference knowledge graph, (ii) identifying connections between pairs of entities in the form of graph paths, and (iii) training a neural network to classify these relations based on the input graph paths. For each pair of entities of interest in the graph, the approach generates all shortest paths using a breadth-first search algorithm. These paths are then fed into a deep neural network architecture. The architecture of the neural network is shown in Fig. 7.9. The input of the neural network is the set of shortest paths. The first layer of the network is an RDF2vec embedding layer, where each entity and vertex of the input path is replaced with an n-dimensional graph embedding vector. The embedding layer is followed by several convolution layers with max pooling. The resulting feature vector is fed into a fully connected softmax layer, where the output is the confidence score for each of the relations in the domain. The evaluation was performed in the context of the Semantic Web Challenge 2018,6 where the objective was to augment the Thomson Reuters permid.org open dataset (Ulicny 2015) with facts extracted from internet sources. More precisely, the objective was to identify supplier/customer relations between a pair of organizations. The dataset provided by the challenge organizers consists of 155,571 organizations, including the name and the Thomson Reuters Perm ID. Furthermore, a training dataset of 25,000 customer/supplier-related organization pairs was provided. The training dataset contains only the pairs, without evidence or provenance of how the relations were extracted. The private part of the Thomson Reuters graph is used for the evaluation of the approach. The system was evaluated using the official evaluation system provided by the challenge organizers. The evaluation was performed using standard evaluation metrics, i.e., precision, recall, and F1 score. The approach outperformed all of the competing solutions.
6 http://iswc2018.semanticweb.org/semantic-web-challenge-2018/.
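A strongly simplified sketch of this kind of architecture is given below, using PyTorch with a single convolution layer instead of the several layers of the original system (all sizes, names, and the toy input are illustrative):

import torch
import torch.nn as nn

class PathRelationClassifier(nn.Module):
    def __init__(self, pretrained_vectors, num_relations, num_filters=64, kernel_size=3):
        super().__init__()
        # pretrained_vectors: (vocab_size, dim) tensor of RDF2vec vectors,
        # one row per entity/property occurring in the walks
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        dim = pretrained_vectors.shape[1]
        self.conv = nn.Conv1d(dim, num_filters, kernel_size)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.out = nn.Linear(num_filters, num_relations)

    def forward(self, path_ids):
        # path_ids: (batch, path_length) indices of the entities/properties on a path
        x = self.embedding(path_ids)        # (batch, path_length, dim)
        x = x.transpose(1, 2)               # (batch, dim, path_length) for Conv1d
        x = torch.relu(self.conv(x))        # (batch, num_filters, reduced_length)
        x = self.pool(x).squeeze(-1)        # (batch, num_filters)
        return torch.softmax(self.out(x), dim=-1)   # confidence per target relation

# toy usage: 10,000 graph elements with 200-dimensional vectors, paths of length 6
vectors = torch.randn(10_000, 200)
model = PathRelationClassifier(vectors, num_relations=5)
scores = model(torch.randint(0, 10_000, (8, 6)))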
7.3.3
Information Retrieval
Information retrieval deals with finding relevant pieces of information given a user query. The most common flavor of information retrieval is retrieving text documents given a keyword
Fig. 7.9 Architecture of the neural network for binary relation classification using knowledge graph paths as input, from Ristoski et al. (2020)
query, but other forms of information retrieval also exist. RDF2vec is often used in scenarios where structured data needs to be retrieved. Färber and Lamprecht (2022) introduce the Data Set Knowledge Graph, a knowledge graph of data sets used in research and their metadata. They provide RDF2vec embeddings together with the graph. Such embeddings can be used, e.g., for identifying similar datasets, given a predefined query dataset. Similarly, Steenwinckel et al. (2020) introduce a large knowledge graph of CoViD-19-related scientific publications. They show that RDF2vec embeddings on the graph can be used for similarity queries, as well as for clustering the publications in the graph. Mittal et al. (2019) introduce a knowledge graph of cyber security threats and discuss the use of RDF2vec embeddings on that graph to identify related security threats. Another field of information retrieval where RDF2vec has been used extensively is the retrieval of tables. Here, tables are either searched for by keywords or by input tables. The first such work was Zhang and Balog (2018), which explored different variants of representing queries and tables, entity embeddings (for entities identified in the queries and the tables) being one of them. They report that combining different representation mechanisms, including entity embeddings, yields the best retrieval results.
RDF2vec has been used in the e-commerce domain as well. Liang et al. (2022) use RDF2vec to embed a large e-commerce graph mined from millions of product listings. The embedded graph is used to identify similar colors (e.g., midnight is similar to starlight in the cell phones category) and related concepts (e.g., starlight is related to the brand Apple and the iPhone 14). This model is then used to perform explicit contextual query rewrites, conditioned by additional concepts, such as brands, models, product lines, categories, etc. Similarly, RDF2vec has been used for representational machine learning for product formulation (Ristoski et al. 2022). Products like perfumes and spices have a list of ingredients, where each ingredient has specific characteristics and must be present in a certain amount. In this work, the authors convert each product formula into a directed weighted graph, representing the ingredients and their interactions. The graphs are embedded using RDF2vec, and the resulting embeddings are later used in several neural networks to identify complementary and substitute ingredients, and ultimately to generate novel product formulas.
7.3.4
Applications in the Biomedical Domain
The life sciences and the biomedical domains have been early and intensive adopters of knowledge graphs and semantic web technologies (Abu-Salih 2021, Schmachtenberg et al. 2014). Hence, we have also seen quite a few applications of techniques like RDF2vec in that domain for various purposes. A recurring problem in the domain is the prediction of interactions, e.g., between drugs and diseases, between pairs of proteins, pairs of drugs, etc. Here, quite a few approaches have explored the utilization of RDF2vec. In a basic setting, a biomedical knowledge graph is embedded, and a classifier is trained on pairs of drugs (using pairs of drugs known to interact as positive examples and random pairs of drugs not known to interact as negatives). Such an approach is discussed by Celebi et al. (2019), who report the best results using RDF2vec and a random forest classifier. Karim et al. (2019) and Zhang et al. (2021) use a similar approach (where the latter consider not only the prediction of drug-drug interactions, but also drug-target interactions), but with more complex downstream classifiers, i.e., neural networks combining convolutional and LSTM layers. Similarly, Vlietstra et al. (2022) aim at predicting interactions of genes and diseases (i.e., genes associated with certain diseases). Sousa et al. (2021) take a broader perspective of predicting multiple relations between entities in biomedical knowledge graphs, e.g., protein-protein interaction or protein-function similarity. They state that each of those can be described by a similarity function between the respective entities, and use RDF2vec, among others, to approximate that similarity. While the approaches discussed above target general knowledge graphs of genes, proteins, etc., and want to extract general findings of interactions between those, there are also approaches that try to make predictions on the level of individual patients. Carvalho et al. (2023) use an ontology-enriched variant of the MIMIC III dataset, a database of hospital patient data, to predict patient readmission to intensive care units.
Fig. 7.10 Learning medical diagnosis rules with RDF2vec (Heilig et al. 2022)
Like in the examples above, each patient is represented by a feature vector computed using RDF2vec, and standard classification methods are used to predict the target, i.e., whether or not the patient is readmitted to the ICU. They discuss two approaches, creating different embeddings for different subgraphs or holistic embeddings for the entire knowledge graph, and find that the latter performs better, most likely due to the fact that complex patterns spanning across different subgraphs can also be represented in the embeddings. Heilig et al. (2022) also use a MIMIC dataset of patient data, but join it with a knowledge graph of diagnosis rules. In that paper, the aim is not to make predictions for individual patients directly, but rather to extend diagnosis rules, as shown in Fig. 7.10. Those extensions are learned by embedding both existing diagnosis rules and individual risk factors, and then predicting whether a risk factor should be added to the rule at hand. The refined rules are then reviewed by a medical expert for quality control to ensure that the diagnoses are correct. Moreover, they allow for explaining the diagnoses made, which would not be possible when directly computing a prediction for a patient based on embeddings. Therefore, this work also addresses one of the future challenges of knowledge graph embeddings, i.e., the combination of the predictive power of embeddings and the need for explainable AI systems. We will get back to that challenge in the next chapter.
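The basic pair-classification setting described at the beginning of this section can be sketched as follows (a random forest is used in line with the results reported by Celebi et al. (2019); data structures and parameters are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_interaction_classifier(vectors, positive_pairs, negative_pairs):
    # vectors: dict mapping a drug URI to its RDF2vec vector
    # positive_pairs: pairs of drugs known to interact
    # negative_pairs: random pairs of drugs not known to interact
    X = [np.concatenate([vectors[a], vectors[b]]) for a, b in positive_pairs + negative_pairs]
    y = [1] * len(positive_pairs) + [0] * len(negative_pairs)
    return RandomForestClassifier(n_estimators=200).fit(np.array(X), np.array(y))

# a new candidate pair (d1, d2) is scored with the same concatenated representation:
# clf.predict_proba([np.concatenate([vectors[d1], vectors[d2]])])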
7.4
Conclusion
In this chapter, we have seen applications of RDF2vec from various domains. While some of them exploit the embedding space directly by retrieving neighboring entities, such as the recommender use case, the majority use more elaborate downstream processing. Here, the RDF2vec embeddings are used as an input representation for entities for a classification or regression learner. This also makes such embeddings a good candidate to create joint
representations from different modalities, such as text and knowledge graphs, by simply concatenating the vectors. A common observation is that in most of the applications, the vanilla variant of RDF2vec is utilized, whereas the numerous variants and extensions discussed in Chap. 4 are rarely used, despite their often better performance on benchmark datasets. The exact reason for this is not yet fully clear: it may be that those variants do not exceed the vanilla variant on real-world data (and those results are simply not reported in ablation studies), that the vanilla variant is good enough for the use cases at hand, or that the authors of the reported systems were not aware of the existence of potentially superior variants.
References Abdi H, Williams LJ (2010) Principal component analysis. Wiley Interdiscip Rev Comput Stat 2(4):433–459 Abu-Salih B (2021) Domain-specific knowledge graphs: a survey. J Netw Comput Appl 185:103076 Ammar A, Celebi R (2019) Fact validation with knowledge graph embeddings. In: ISWC (Satellites). pp 125–128 Benítez-Andrades JA, García-Ordás MT, Russo M, Sakor A, Rotger LDF, Vidal ME (2022) Empowering machine learning models with contextual knowledge for enhancing the detection of eating disorders in social media posts. Semant Web J Rev Biswas R, Portisch J, Paulheim H, Sack H, Alam M (2022) Entity type prediction leveraging graph walks and entity descriptions. In: International semantic web conference. Springer, pp 392–410 Bobadilla J, Ortega F, Hernando A, Gutiérrez A (2013) Recommender systems survey. Knowl Based Syst 46:109–132 Carvalho RM, Oliveira D, Pesquita C (2023) Knowledge graph embeddings for icu readmission prediction. BMC Med Inform Decis Mak 23(1):12 Celebi R, Uyar H, Yasar E, Gumus O, Dikenelli O, Dumontier M (2019) Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings. BMC Bioinform 20(1):1–14. https://doi.org/10.1186/s12859-019-3284-5 Chakrabarti S, Van den Berg M, Dom B (1999) Focused crawling: a new approach to topic-specific web resource discovery. Comput Netw 31(11–16):1623–1640 Cruz IF, Antonelli FP, Stroe C (2009) Efficient selection of mappings and automatic quality-driven combination of matching methods. In: Proceedings of the 4th international conference on ontology matching-Volume 551, Citeseer, pp 49–60 Cutrona V, Puleri G, Bianchi F, Palmonari M (2021) Nest: neural soft type constraints to improve entity linking in tables. In: SEMANTiCS. pp 29–43 Dev S, Hassan S, Phillips JM (2021) Closed form word embedding alignment. Knowl Inf Syst 63(3):565–588 Devlin J, Chang MW, Lee K, Toutanova K (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 Dubey M, Banerjee D, Chaudhuri D, Lehmann J (2018) Earl: joint entity and relation linking for question answering over knowledge graphs. In: International semantic web conference. Springer, pp 108–126
Engleitner N, Kreiner W, Schwarz N, Kopetzky T, Ehrlinger L (2021) Knowledge graph embeddings for news article tag recommendation. In: SEMANTiCS posters&demos Euzenat J, Shvaiko P et al (2007) Ontology matching, vol 18. Springer Färber M, Lamprecht D (2022) The data set knowledge graph: creating a linked open data source for data sets. Quant Sci Stud 2(4):1324–1355 Heilig N, Kirchhoff J, Stumpe F, Plepi J, Flek L, Paulheim H (2022) Refining diagnosis paths for medical diagnosis based on an augmented knowledge graph. arXiv preprint arXiv:2204.13329 Hertling S, Paulheim H (2017) Webisalod: providing hypernymy relations extracted from the web as linked open data. In: International semantic web conference. Springer, pp 111–119 Hertling S, Portisch J, Paulheim H (2020) Supervised ontology and instance matching with melt. arXiv preprint arXiv:2009.11102 Hinton GE, Krizhevsky A, Wang SD (2011) Transforming auto-encoders. In: International conference on artificial neural networks. Springer, pp 44–51 Inan E, Dikenelli O (2017) Effect of enriched ontology structures on rdf embedding-based entity linking. In: Research conference on metadata and semantics research. Springer, pp 15–24 Jain N, Kalo JC, Balke WT, Krestel R (2021) Do embeddings actually capture knowledge graph semantics? In: European semantic web conference. Springer, pp 143–159 Karim MR, Cochez M, Jares JB, Uddin M, Beyan O, Decker S (2019) Drug-drug interaction prediction based on knowledge graph embeddings and convolutional-lstm network. In: Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics. pp 113–123. https://doi.org/10.1145/3307339.3342161 Kejriwal M, Szekely P (2017) Supervised typing of big graphs using semantic embeddings. In: Proceedings of the international workshop on semantic big data. pp 1–6. https://doi.org/10.1145/ 3066911.3066918 Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2022) Transformers in vision: a survey. ACM Comput Surv (CSUR) 54(10s):1–41 Kumar S (2017) A survey of deep learning methods for relation extraction. arXiv preprint arXiv:1705.03645 Liang L, Kamath S, Ristoski P, Zhou Q, Wu Z (2022) Fifty shades of pink: understanding color in e-commerce using knowledge graphs. In: Proceedings of the 31st ACM international conference on information & knowledge management. pp 5090–5091 Lütke A (2019) Anygraphmatcher submission to the oaei knowledge graph challenge 2019. OM@ ISWC 2536:86–93 Meilicke C, Garcia-Castro R, Freitas F, Van Hage WR, Montiel-Ponsoda E, De Azevedo RR, Stuckenschmidt H, Šváb-Zamazal O, Svátek V, Tamilin A et al (2012) Multifarm: a benchmark for multilingual ontology matching. J Web Semant 15:62–68 Mittal S, Joshi A, Finin T (2019) Cyber-all-intel: an ai for security related threat intelligence. arXiv preprint arXiv:1905.02895 Naseem U, Razzak I, Khan SK, Prasad M (2021) A comprehensive survey on word representation models: from classical to state-of-the-art word representation language models. Trans Asian Low Resour Lang Inf Process 20(5):1–35 Nizzoli L, Avvenuti M, Tesconi M, Cresci S (2020) Geo-semantic-parsing: Ai-powered geoparsing by traversing semantic knowledge graphs. Decis Support Syst 136:113346 Noy N, Gao Y, Jain A, Narayanan A, Patterson A, Taylor J (2019) Industry-scale knowledge graphs: lessons and challenges: five diverse technology companies show how it’s done. Queue 17(2):48–75 Paulheim H (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. 
Semant Web 8(3):489–508 Paulheim H, Bizer C (2013) Type inference on noisy rdf data. In: International semantic web conference. Springer, pp 510–525
Pawar S, Palshikar GK, Bhattacharyya P (2017) Relation extraction: a survey. arXiv preprint arXiv:1712.05191 Pister A, Atemezing GA (2019) Knowledge graph embedding for triples fact validation. In: ISWC satellites Portisch J, Paulheim H (2022) Alod2vec matcher results for oaei 2021. CEUR Work Proc RWTH 3063:117–123 Portisch J, Costa G, Stefani K, Kreplin K, Hladik M, Paulheim H (2022a) Ontology matching through absolute orientation of embedding spaces. arXiv preprint arXiv:2204.04040 Portisch J, Paulheim H (2018) Alod2vec matcher. OM@ ISWC 2288:132–137 Pour MAN et al (2021) Results of the ontology alignment evaluation initiative 2021. In: OM 2021, CEUR-WS.org, CEUR workshop proceedings, vol 3063, pp 62–108. http://ceur-ws.org/Vol-3063/ oaei21_paper0.pdf Ristoski P, Rosati J, Di Noia T, De Leone R, Paulheim H (2019) Rdf2vec: rdf graph embeddings and their applications. Semant Web 10(4):721–752 Ristoski P, Gentile AL, Alba A, Gruhl D, Welch S (2020) Large-scale relation extraction from web documents and knowledge graphs with human-in-the-loop. J Web Semant 60:100546 Ristoski P, Goodwin RT, Fu J, Segal RB, Lougee R, Lang KC, Harris C, Yeshi T (2022) Representational machine learning for product formulation. US Patent App. 17/030,509 Rosati J, Ristoski P, Di Noia T, Leone Rd, Paulheim H (2016) Rdf graph embeddings for content-based recommender systems. CEUR Work Proc RWTH 1673:23–30 Schmachtenberg M, Bizer C, Paulheim H (2014) Adoption of the linked data best practices in different topical domains. In: International semantic web conference, vol 8796. Springer International, LNCS. https://doi.org/10.1007/978-3-319-11964-9_16 Seitner J, Bizer C, Eckert K, Faralli S, Meusel R, Paulheim H, Ponzetto SP (2016) A large database of hypernymy relations extracted from the web. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016), pp 360–367 Shigarov AO, Dorodnykh NO, Yurin AY, Mikhailov AA, Paramonov VV (2021) From web-tables to a knowledge graph: prospects of an end-to-end solution. In: ITAMS, pp 23–33 Sofronova R, Biswas R, Alam M, Sack H (2020) Entity typing based on rdf2vec using supervised and unsupervised methods. In: European semantic web conference. Springer, pp 203–207. https:// doi.org/10.1007/978-3-030-62327-2_35 Sousa RT, Silva S, Pesquita C (2021) Supervised semantic similarity. bioRxiv Steenwinckel B, Vandewiele G, Rausch I, Heyvaert P, Taelman R, Colpaert P, Simoens P, Dimou A, De Turck F, Ongenae F (2020) Facilitating the analysis of covid-19 literature through a knowledge graph. In: International semantic web conference. Springer, pp 344–357. https://doi.org/10.1007/ 978-3-030-62466-8_22 Thoma S, Rettinger A, Both F (2017) Towards holistic concept representations: embedding relational knowledge, visual attributes, and distributional word semantics. In: International semantic web conference. Springer, pp 694–710 Türker R (2019) Knowledge-based dataless text categorization. In: European semantic web conference. Springer, pp 231–241 Ulicny B (2015) Constructing knowledge graphs with trust. In: 4th international workshop on methods for establishing trust of (open) data, Bentlehem, USA Vaigh CBE, Goasdoué F, Gravier G, Sébillot P (2020) A novel path-based entity relatedness measure for efficient collective entity linking. In: International semantic web conference. 
Springer, pp 164– 182 van Erp M, Mendes P, Paulheim H, Ilievski F, Plu J, Rizzo G, Waitelonis J (2016) Evaluating entity linking: an analysis of current benchmark datasets and a roadmap for doing a better job. In: 10th international conference on language resources and evaluation (LREC)
Vlietstra WJ, Vos R, van Mulligen EM, Jenster GW, Kors JA (2022) Identifying genes targeted by disease-associated non-coding snps with a protein knowledge graph. Plos one 17(7):e0271395 Voulodimos A, Doulamis N, Doulamis A, Protopapadakis E (2018) Deep learning for computer vision: a brief review. Comput Intell Neurosci 2018 Wall ME, Rechtsteiner A, Rocha LM (2003) Singular value decomposition and principal component analysis. In: A practical approach to microarray data analysis. Springer, pp 91–109 Wu X, Xiao L, Sun Y, Zhang J, Ma T, He L (2022b) A survey of human-in-the-loop for machine learning. Futur Gener Comput Syst Yamada I, Asai A, Sakuma J, Shindo H, Takeda H, Takefuji Y, Matsumoto Y (2018) Wikipedia2vec: an efficient toolkit for learning and visualizing the embeddings of words and entities from wikipedia. arXiv preprint arXiv:1812.06280 Yao P, Barbosa D (2021) Typing errors in factual knowledge graphs: Severity and possible ways out. In: Proceedings of the web conference 2021. pp 3305–3313 Zhang S, Balog K (2018) Ad hoc table retrieval using semantic similarity. In: Proceedings of the 2018 world wide web conference, pp 1553–1562 Zhang S, Lin X, Zhang X (2021) Discovering dti and ddi by knowledge graph with mhrw and improved neural network. In: 2021 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 588–593
8
Future Directions for RDF2vec
Abstract
In this chapter, we highlight a few shortcomings of RDF2vec, and we discuss possible future ways to mitigate those. Among the most prominent ones are the handling of literal values (which are currently not used by RDF2vec), the handling of dynamic knowledge graphs, and the generation of explanations for systems using RDF2vec (which are currently black box models).
8.1
Incorporating Information in Literals
Most knowledge graphs do not only contain edges between entities, but also information in literals. The latter may include human-readable labels and textual descriptions of entities, potentially in multiple languages, but also numeric information. In the example in Fig. 8.1, three cities are depicted. If we compute representations for them using RDF2vec, only the information on the relations to other entities is used, but the literal information is discarded. However, given that entity embeddings should reflect the similarity of two entities, it would be beneficial to also exploit literal information. In the example in Fig. 8.1, it would be beneficial to project Paris closer to Berlin than to Bangkok because they are more similar in size. Other examples would include close values in coordinates (i.e., geographically close objects), similarity in descriptions (e.g., two books or movies with similar plots), etc. A recent survey by Gesese et al. (2021) showed that for some embedding approaches, extensions for incorporating literals exist. The authors survey a large number of extensions, mostly of embedding approaches for link prediction (see Chap. 6). The usual approaches adapt the loss function of the embedding algorithm in order to also reflect similarity within literals, or train text and entity embeddings jointly.
Fig. 8.1 Yet another example knowledge graph
Notably, the survey only lists three approaches that can deal with more than one type of literal, while the others are specialized, e.g., on text or on numeric literals. In the case of RDF2vec, no variants incorporating literals have been proposed so far. OWL2vec* (Chen et al. 2021) is perhaps the closest approximation of such a variant, as it builds a joint text corpus of graph walks (as RDF2vec) and information extracted from text literals. The final embeddings are then computed over the joint walk sets, which also incorporate text literals in the information encoded in the entity vectors. Another approach can be taken using pyRDF2vec by extracting literals directly. In the example in Fig. 8.1, one could represent the cities as embedding vectors derived from the entity relations, and, at the same time, extract the population values. A downstream classifier would then be fed the concatenated vector of the entity embeddings and the extracted literal values.1 Besides embedding-based approaches, methods based on graph convolutional networks (Schlichtkrull et al. 2018) have been proposed, where node features are computed from literals, and then used to compute node representations by message passing (Wilcke et al. 2020). The difference between graph convolutional networks and node embedding methods is that graph neural networks are end-to-end approaches, where the node representations cannot easily be reused across tasks. Since many real-world knowledge graphs contain valuable information in literals, literal-aware extensions of RDF2vec will thus be one of the main future directions to extend RDF2vec. There are already benchmark datasets which can be used for such extensions, such as kgbench (Bloem et al. 2021).
1 Such an approach would, however, not allow for similarity search in the vector space directly, but instead require some preprocessing of the concatenated vectors, as discussed in Sect. 7.1.
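The second option can be sketched with the pyRDF2vec API as follows (assuming pyRDF2vec 0.2.x; the endpoint, predicate, and entity URIs are merely illustrative, and missing or non-numeric literal values would need additional handling):

import numpy as np
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

entities = [
    "http://dbpedia.org/resource/Paris",
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/resource/Bangkok",
]
kg = KG(
    "https://dbpedia.org/sparql",
    literals=[["http://dbpedia.org/ontology/populationTotal"]],   # literal chain to extract
)
transformer = RDF2VecTransformer(Word2Vec(epochs=10), walkers=[RandomWalker(4, 100)])
embeddings, literals = transformer.fit_transform(kg, entities)

# concatenate the embedding vectors and the extracted literal values
population = np.array(literals, dtype=float).reshape(len(entities), -1)
X = np.hstack([np.array(embeddings), population])   # input for a downstream learner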
8.2
Exploiting Complex Patterns
In this book, we have seen that certain patterns in knowledge graphs are hard to learn for many embedding methods. Knowledge graph embedding models like TransE and its descendants usually compute fitness functions based on scoring single statements (Wang et al. 2017), which makes it hard for them to model complex patterns involving multiple statements. RDF2vec is slightly less limited: as long as the patterns can be captured in a single walk, and the walk length is long enough, RDF2vec can also model the pattern. We have also seen this when analyzing the results of the DLCC benchmark in Sect. 3.3: some patterns, in particular those involving cardinalities, are hardly picked up by any embedding method. Consider the example shown in Fig. 8.1. RDF2vec can extract walks like these:
Thailand -> capital -> Bangkok -> isA -> City
France -> capital -> Paris -> isA -> City
Germany -> capital -> Berlin -> isA -> City
Thailand -> isA -> AsianCountry
France -> isA -> EuropeanCountry
Germany -> isA -> EuropeanCountry
Based on those walks, Germany and France will fall closely together, because they are involved in similar patterns. Thailand will be a bit further away from the other two countries, because the patterns are more dissimilar. On the other hand, the entities AsianCountry and EuropeanCountry never co-occur in the walks together with the city entities. Therefore, RDF2vec will not project Paris and Berlin any closer than Paris and Bangkok, although the former two are more similar than the latter two. For capturing such effects, one would need to incorporate more complex patterns in the embeddings. However, such extensions are still under development these days. There are some approaches which, e.g., use entailment constraints (Ding et al. 2018), which might pave the way for incorporating complex patterns. Other approaches might involve graph pattern mining as a preprocessing step and later on fuse information on triple and on pattern level, similar to the approach suggested by Martin et al. (2020).
8.3
Exploiting Ontologies
A similar direction is the exploitation of information contained in ontologies. As seen in Chap. 1, some knowledge graphs come with more or less expressive ontologies, and some are enriched with highly formalized top level ontologies (Paulheim and Gangemi 2015). Table 8.1 gives an overview on the ontological constructs used in different open knowledge graphs.2 But even simple ontological constraints, like class hierarchies or domains and ranges of predicates, could be valuable signals to improve the resulting knowledge graph embeddings. However, RDF2vec can currently only take such information into account to a very limited level. In general, when including the schema in the graph which is used for walk generation, such signals can be picked up. In the example above, when including the schema level, we could extract walks like
Thailand -> isA -> AsianCountry -> subclassOf -> Country
France -> isA -> EuropeanCountry -> subclassOf -> Country
Germany -> isA -> EuropeanCountry -> subclassOf -> Country
which could also bring Thailand closer to France and Germany because all of them share the property of being a country. Without the schema information integrated in the walk generation, this similarity would not be learned.
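Including the schema in the walk generation can be as simple as merging the ontology triples into the instance graph before extracting walks and, optionally, materializing entailed triples, as in the following sketch with rdflib and owlrl (the file names are placeholders):

from rdflib import Graph
import owlrl

g = Graph()
g.parse("instance_data.ttl")    # the knowledge graph itself
g.parse("ontology.ttl")         # the schema, e.g., class hierarchy, domains, and ranges

# optionally materialize RDFS entailments (e.g., types inherited via rdfs:subClassOf)
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

# the merged (and materialized) graph is then handed to the walk extraction
g.serialize("graph_for_walks.ttl", format="turtle")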
Table 8.1 Overview on the ontological constructs used in existing knowledge graphs (DBpedia, YAGO, Cyc, NELL, and CaLiGraph); the constructs considered are subclass, domain/range, subPropertyOf, disjoint/complement classes, hasValue, transitive properties, symmetric properties, inverse properties, and functional properties
2 Wikidata is not included here, since it does not use OWL for modeling its ontology. The information
on Cyc is based on the openly available OWL translation of Cyc. Other OWL constructs, such as existential and universal quantifiers, cardinality constraints, and property chains, were not observed in any of the considered knowledge graphs.
This behavior, however, is very limited, and richer ontological definitions are hardly captured. In the field of knowledge graph embedding for link prediction (see Chap. 6), there are a few works which aim at incorporating ontological knowledge. Zhang et al. (2022) distinguish between pre methods (applied before training the embedding), joint methods (applied when training the embedding), and post methods (applied after training the embedding). In their survey, joint methods are the most common approaches, usually incorporating the ontological knowledge in the embedding approach's loss function. This means, however, that they are typically tailored towards one embedding model and not universally applicable to any knowledge graph embedding method, while pre and post methods are more universally applicable, but much more rarely observed. The approach sketched in Sect. 4.4, i.e., materializing the knowledge graph based on the ontology before computing embeddings, would be considered a pre approach in this taxonomy. At the same time, it is currently the only approach exploiting ontological information for RDF2vec, which makes this a promising research direction for improving RDF2vec.
8.4
Dynamic and Temporal Knowledge Graphs
Most approaches in knowledge graph embeddings assume static knowledge graphs. In systems using knowledge graph embeddings, those embeddings are trained once and then used throughout the system's lifecycle. While this approach is feasible in many cases, there are many scenarios where knowledge graphs are not static (Krause et al. 2022). While some frequently used knowledge graphs, such as DBpedia, are released in frequencies which allow for a periodic retraining, others, like Wikidata, are constantly evolving. At the same time, there are applications in which the timeliness of information is more crucial than in others. The recommendation system use case in the previous chapter can be seen as such an example, when the set of recommended articles is constantly growing (as, e.g., in online streaming services). Use cases with an even greater degree of dynamics include systems which employ, e.g., knowledge graphs for the internet of things (Le-Phuoc et al. 2016), digital twins in manufacturing (Banerjee et al. 2017), smart homes (Zhu et al. 2018), and smart cities (Santos et al. 2017), all of which are scenarios where the information in the knowledge graph can be constantly updated at high rates. Embedding methods for dynamic knowledge graphs are still scarce, and only a few approaches exist so far (Daruna et al. 2021, Tay et al. 2017, Wu et al. 2022). The approach described by Krause (2022) is a generic method for handling embeddings of dynamic knowledge graphs: upon changes in the knowledge graph, vectors for nodes affected by a change are "forgotten" and relearned from the neighboring nodes' embeddings. Since RDF2vec is based on word2vec, it can, in principle, leverage all extensions of word2vec, just like the order-aware variant discussed in Chap. 4.
Fig. 8.2 Example for a knowledge graph with temporal annotations
There have been proposals for online learning variants of word2vec as well (Tian et al. 2021), which can be used for evolving knowledge graphs, with one crucial exception: the scenario for online learning of word embeddings is usually streaming a text corpus. In that case, new information is added over time, but never removed. This would correspond to the case that new entities and edges are added to a knowledge graph, but never removed, which would be a very particular kind of dynamic knowledge graph. Another flavor of knowledge graphs is temporal knowledge graphs, where statements have timestamps attached, and different statements hold at different points in time. Some real-world knowledge graphs, like Wikidata or YAGO, provide such temporal annotations for facts. For link prediction methods, there have been some adaptations of existing methods which exploit temporal data, such as TA-TransE and TA-DistMult (García-Durán et al. 2018). As RDF2vec is walk-based, another approach would be to directly respect the temporal aspect in the walk creation. Consider, for example, the graph shown in Fig. 8.2. Here, we could, in principle, extract two random walks:
John -> livesIn -> Bonn -> capitalOf -> Germany
Julia -> livesIn -> Bonn -> capitalOf -> Germany
However, only the first walk conveys information that was true at some point in time, i.e., that John has lived in the capital of Germany. Julia, on the other hand, has never lived in the capital of Germany, since Bonn was not the capital of Germany when she lived there. A recent work (Huan et al. 2023) has considered extracting only coherent random walks (i.e., those whose temporal annotations overlap), which would be an interesting direction for computing RDF2vec embeddings on such knowledge graphs with temporal annotations.
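A simple way to operationalize such coherent walks is to keep only those walks whose validity intervals share a common point in time, as in the following sketch (the interval representation and the concrete years are assumptions for illustration):

from typing import List, Optional, Tuple

# a temporally annotated edge: (subject, predicate, object, valid_from, valid_to);
# None means unbounded on that side
Edge = Tuple[str, str, str, Optional[int], Optional[int]]

def is_temporally_coherent(walk: List[Edge]) -> bool:
    # the walk is coherent if all validity intervals overlap in at least one point in time
    start = max((e[3] for e in walk if e[3] is not None), default=None)
    end = min((e[4] for e in walk if e[4] is not None), default=None)
    return start is None or end is None or start <= end

john = [("John", "livesIn", "Bonn", 1985, 1995), ("Bonn", "capitalOf", "Germany", 1949, 1990)]
julia = [("Julia", "livesIn", "Bonn", 2005, 2015), ("Bonn", "capitalOf", "Germany", 1949, 1990)]
print(is_temporally_coherent(john))    # True: John lived in Bonn while it was the capital
print(is_temporally_coherent(julia))   # False: Bonn was no longer the capital at that time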
8.5
Extension to other Knowledge Graph Representations
The example above uses temporal annotations on the edges; therefore, it goes beyond the pure triple-based data model. Those annotations are typically expressed using either RDF reification (Schreiber and Raimond 2014) or RDF* (Hartig 2017). Such graphs with further
Listing 8.1 Using embeddings in a SPARQL query, using a fictitious builtin for processing embeddings
SELECT ?recommendation WHERE {
  <user47823> watched ?movie .
  embedding:cosine(?movie, ?recommendation)