LNAI 14363
Elena Bellodi Francesca Alessandra Lisi Riccardo Zese (Eds.)
Inductive Logic Programming 32nd International Conference, ILP 2023 Bari, Italy, November 13–15, 2023 Proceedings
Lecture Notes in Computer Science
Lecture Notes in Artificial Intelligence Founding Editor Jörg Siekmann
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Wolfgang Wahlster, DFKI, Berlin, Germany Zhi-Hua Zhou, Nanjing University, Nanjing, China
The series Lecture Notes in Artificial Intelligence (LNAI) was established in 1988 as a topical subseries of LNCS devoted to artificial intelligence. The series publishes state-of-the-art research results at a high level. As with the LNCS mother series, the mission of the series is to serve the international R & D community by providing an invaluable service, mainly focused on the publication of conference and workshop proceedings and postproceedings.
Editors Elena Bellodi Università degli Studi di Ferrara Ferrara, Italy
Francesca Alessandra Lisi Università degli Studi di Bari “Aldo Moro” Bari, Italy
Riccardo Zese Università degli Studi di Ferrara Ferrara, Italy
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Artificial Intelligence ISBN 978-3-031-49298-3 ISBN 978-3-031-49299-0 (eBook) https://doi.org/10.1007/978-3-031-49299-0 LNCS Sublibrary: SL7 – Artificial Intelligence © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 Chapter “Regularization in Probabilistic Inductive Logic Programming” is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further details see license information in the chapter. This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 
This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.
Preface
This volume contains accepted long and short papers of the 32nd International Conference on Inductive Logic Programming (ILP 2023), held in Bari, Italy from Monday 13 to Wednesday 15 of November 2023. Inductive Logic Programming (ILP) is a subfield of machine learning which relies on logic programming as a uniform representation language for expressing examples, background knowledge and hypotheses. Due to its strong representation formalism, based on first-order logic, ILP provides an excellent means for multi-relational learning and data mining. The ILP conference series, started in 1991, is the premier international forum for learning from structured or semi-structured relational data. Originally focusing on the induction of logic programs, over the years it has expanded its research horizon significantly and welcomes contributions to all aspects of learning in logic, multirelational data mining, statistical relational learning, graph and tree mining, learning in other (non-propositional) logic-based knowledge representation frameworks, exploring intersections with statistical learning and other probabilistic approaches. The conference was co-located with: 1. The 13th International Workshop on Approaches and Applications of Inductive Programming (AAIP), 13–15 November 2023; 2. The 1st International Workshop on Cognitive AI (CogAI), 13–15 November 2023. The three international events were organized under the umbrella of the 3rd International Joint Conference on Learning & Reasoning (IJCLR 2023). Each conference/workshop participating in IJCLR solicited paper submissions on the topics of its interest. 
Additionally, IJCLR featured a general “Conference Track”, where authors were invited to submit work that was relevant to the conference but did not necessarily fall within the purview of a particular workshop (ILP, AAIP, CogAI), and a “Journal Track”, which accepted paper submissions at regular cut-off dates since February 2020 for publication in the Special Issue on Learning and Reasoning supported by the Machine Learning Journal. The ILP 2023 conference allowed four types of submissions: a. Long papers, describing original mature work containing appropriate experimental evaluation and/or representing a self-contained theoretical contribution (up to 15 pages); b. Short papers, describing original work in progress presenting preliminary results, brief accounts of original ideas, and other relevant work of potentially high scientific interest but not yet qualifying for the long paper category (6–9 pages); c. Late-breaking abstracts (up to 4 pages), describing ideas and proposals that the author(s) would like to present at the conference; these could include, e.g., original work in progress without conclusive experimental findings, or other relevant work not yet ready for publication. Submissions of late-breaking abstracts were
accepted/rejected on the grounds of relevance. Accepted late-breaking abstracts were published on the conference website. d. Papers relevant to the conference topics and recently published or accepted for publication by a first-class conference or journal. For papers of this category a link to the original work was published on the conference website. There were 18 submissions in total: 12 long papers, 3 short papers and 3 late-breaking abstracts. We accepted 10 long papers, 2 short papers and 1 late-breaking abstract for oral presentation. All papers received at least two single-blind reviews by members of the Program Committee. Conference papers (either short or long) are included in the proceedings whereas the late-breaking abstracts are published on the conference web site. Finally, based on relevance to the conference topics, 11 out of the 12 recently published papers that were submitted were accepted for oral presentation at IJCLR. The contributions included in the proceedings cover a wide range of topics: an approach for learning formulas in Linear Temporal Logic with finite sequences semantics by using ILP and a set of example traces to avoid exhaustive search; a new class of probabilistic logic programs, called Probabilistic Optimizable Answer Set Programs (ASP), that are probabilistic ASP under the credal semantics, where uncertainty is associated to probabilistic intervals, rather than probability values; extensions to state-of-the-art systems, e.g., to LIFTCOVER - for learning the “liftable Probabilistic Logic Programming language” using regularization to penalize large weights and prevent overfitting and to Popper, by introducing a method to make an ILP system more adaptable to tasks with weak learning biases; an investigation of the counterfactual reasoning abilities of ProbLog programs, proposing a procedure to reconstruct a ProbLog program from a counterfactual output; an approach to speed up the detection of minimal unsatisfiable sets (MUS) in the 
SAT problem, by learning a GNN-based predictor of clause membership in MUS, which is then used as a heuristic; and a method for learning Assumption-Based Argumentation frameworks exploiting transformation rules and ASP. The topic of transfer learning is also tackled by two works, one exploiting Statistical Relational Learning algorithms to determine when to perform transfer learning, and another one testing whether boosting in the context of Domain-size-aware Markov Logic Networks provides significant improvements on transfer learning. Finally, the proceedings also contain two reviews, one about the problem of classification of neuro-symbolic learning tasks, and one about ILP applications in the context of robotic systems. Two prizes supported by Springer were awarded to the best paper and the best student paper among long papers. The winners were announced during the conference and published on the conference website at http://ilp2023.unife.it/. Besides the contributed papers, the programme and the proceedings feature two invited speakers:
– Thomas Guyet, full researcher at the Inria Center of Lyon, France, with a talk on “Declarative Sequential Pattern Mining in ASP”;
– Ana Ozaki, associate professor at the University of Oslo & University of Bergen, Norway, with a talk on “Extracting Rules from ML models in Angluin’s Style”.
We would like to sincerely thank all the people who contributed to the success of ILP 2023: the authors, the invited speakers, the members of the organization committee, the members of the program committee, the additional reviewers that have been solicited and the sponsors.
November 2023
Elena Bellodi Francesca Alessandra Lisi Riccardo Zese
Organization
Program Committee
Alexander Artikis (NCSR “Demokritos”, Greece)
Damiano Azzolini (University of Ferrara, Italy)
Elena Bellodi (University of Ferrara, Italy)
Krysia Broda (Imperial College London, UK)
James Cussens (University of Bristol, UK)
Stefano Ferilli (Università degli Studi di Bari “Aldo Moro”, Italy)
Céline Hocquette (University of Oxford, UK)
Katsumi Inoue (NII, Japan)
Cezary Kaliszyk (University of Innsbruck, Austria)
Nikos Katzouris (NCSR Demokritos, Greece)
Dimitar Kazakov (University of York, UK)
Ross King (Chalmers University of Technology, Sweden)
Nada Lavrač (Jožef Stefan Institute, Slovenia)
Francesca Alessandra Lisi (University of Bari “Aldo Moro”, Italy)
Donato Malerba (University of Bari “Aldo Moro”, Italy)
Stephen Muggleton (Imperial College London, UK)
Aline Paes (Universidade Federal Fluminense, Brazil)
Oliver Ray (University of Bristol, UK)
Fabrizio Riguzzi (University of Ferrara, Italy)
Celine Rouveirol (LIPN, Université Paris 13, France)
Ute Schmid (University of Bamberg, Germany)
Ashwin Srinivasan (BITS Pilani, Goa Campus, India)
Alireza Tamaddoni-Nezhad (Imperial College London, UK)
Gerson Zaverucha (Federal University of Rio de Janeiro, Brazil)
Filip Zelezny (Czech Technical University, Czech Republic)
Riccardo Zese (University of Ferrara, Italy)
Invited Talks
Declarative Sequential Pattern Mining in ASP
Thomas Guyet Inria, Hospices Civils de Lyon, Université Claude Bernard Lyon 1, France [email protected] Abstract. In recent decades, several approaches have drawn analogies between pattern mining tasks and constraint programming. The development of modern constraint solvers (SAT, CP, Linear Programming, Answer Set Programming, etc.) has demonstrated the efficiency of these approaches on real-world datasets [1]. More than the efficiency of declarative programming, we argue that the true benefit of this approach lies in its versatility and its ability to integrate expert knowledge. This is especially evident in the case of Answer Set Programming (ASP), originally designed as a tool for knowledge reasoning [4]. In this presentation, we will focus on the task of sequential pattern mining, which involves discovering interesting patterns in a collection of sequences. Firstly, we will introduce the concept of ASP encoding for sequential pattern mining tasks and showcase the advantages of declarative programming in fast-prototyping complex mining tasks. Next, we will highlight the capability of the Clingo solver [2] to efficiently combine reasoning and procedural programming to address the mining of chronicles [3]. Finally, we will conclude this presentation by exploring the potential development of epistemic measures of interestingness at the crossroad of declarative pattern mining and knowledge reasoning. To illustrate the effectiveness of a knowledge-centric approach, we will provide practical examples of these advanced features, focusing specifically on their application in analyzing care pathways to answer pharmaco-epidemiological questions. Keywords: Answer set programming · Satisfiability modulo theories · Timed sequences
References
1. Gebser, M., Guyet, T., Quiniou, R., Romero, J., Schaub, T.: Knowledge-based sequence mining with ASP. In: International Joint Conference on Artificial Intelligence (IJCAI), p. 8 (2016)
2. Gebser, M., Kaminski, R., Kaufmann, B., Ostrowski, M., Schaub, T., Wanko, P.: Theory solving made easy with clingo 5. In: Technical Communications of the 32nd International Conference on Logic Programming (ICLP) (2016)
3. Guyet, T., Besnard, P.: Chronicles: Formalization of a Temporal Model. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-33693-5
4. Guyet, T., Happe, A., Dauxais, Y.: Declarative sequential pattern mining of care pathways. In: Proceedings of the Conference on Artificial Intelligence in Medicine (AIME), pp. 261–266 (2017)
Extracting Rules from Machine Learning Models in Angluin’s Style
Ana Ozaki University of Oslo & University of Bergen, Norway [email protected] Abstract. We first see an overview of recent approaches to extract simpler abstractions of complex neural networks using Angluin’s exact learning framework, from computational learning theory. The aim of constructing such abstractions is to obtain high level information from machine learning models, which can be useful to interpret their behavior, detect harmful biases, among others. We then discuss in more detail algorithms for learning logical theories expressing rules in Angluin’s framework, in particular, those for learning rules in Horn logic. We highlight the benefits and shortcomings of these approaches. Finally, we present promising possible next steps and applications of these approaches for extracting high level information from complex machine learning models. Keywords: Exact learning · Horn logic · Machine learning
Supported by the University of Oslo.
Contents
A Constrained Optimization Approach to Set the Parameters of Probabilistic Answer Set Programs
    Damiano Azzolini
Regularization in Probabilistic Inductive Logic Programming
    Elisabetta Gentili, Alice Bizzarri, Damiano Azzolini, Riccardo Zese, and Fabrizio Riguzzi
Towards ILP-Based LTLf Passive Learning
    Antonio Ielo, Mark Law, Valeria Fionda, Francesco Ricca, Giuseppe De Giacomo, and Alessandra Russo
Learning Strategies of Inductive Logic Programming Using Reinforcement Learning
    Takeru Isobe and Katsumi Inoue
Select First, Transfer Later: Choosing Proper Datasets for Statistical Relational Transfer Learning
    Thais Luca, Aline Paes, and Gerson Zaverucha
GNN Based Extraction of Minimal Unsatisfiable Subsets
    Sota Moriyama, Koji Watanabe, and Katsumi Inoue
What Do Counterfactuals Say About the World? Reconstructing Probabilistic Logic Programs from Answers to “What If?” Queries
    Kilian Rückschloß and Felix Weitkämper
Few-Shot Learning of Diagnostic Rules for Neurodegenerative Diseases Using Inductive Logic Programming
    Dany Varghese, Roman Bauer, and Alireza Tamaddoni-Nezhad
An Experimental Overview of Neural-Symbolic Systems
    Arne Vermeulen, Robin Manhaeve, and Giuseppe Marra
Statistical Relational Structure Learning with Scaled Weight Parameters
    Felix Weitkämper, Dmitriy Ravdin, and Ramona Fabry
A Review of Inductive Logic Programming Applications for Robotic Systems
    Youssef Mahmoud Youssef and Martin E. Müller
Meta-interpretive Learning from Fractal Images
    Daniel Cyrus, James Trewern, and Alireza Tamaddoni-Nezhad
Author Index
A Constrained Optimization Approach to Set the Parameters of Probabilistic Answer Set Programs
Damiano Azzolini
Department of Environmental and Prevention Sciences, University of Ferrara, Ferrara, Italy
[email protected]
Abstract. Probabilistic Answer Set Programming under the credal semantics has emerged as one of the possible formalisms to encode uncertain domains described by an answer set program extended with probabilistic facts. Some problems require associating probability values to probabilistic facts such that the probability of a query is above a certain threshold. To solve this, we propose a new class of programs, called Probabilistic Optimizable Answer Set Programs, together with a practical algorithm based on constrained optimization to solve the task.
Keywords: Probabilistic Answer Set Programming · Parameter Learning · Constrained Optimization
1 Introduction
The field of Probabilistic Answer Set Programming [13] aims to combine the capabilities of Answer Set Programming [11] with the possibility to represent uncertainty through, for example, probabilistic facts [14] or weights [24]. Here, we focus on Probabilistic Answer Set Programming under the credal semantics [13], which has been studied and adopted in several scenarios [3–5,12,27,38]. With this semantics, an answer set program can be extended with ProbLog [14] probabilistic facts associating a probability to some of the atoms. The authors of [6] introduced the class of Probabilistic Optimizable Logic Programs, where the goal is to learn the probabilities of some special probabilistic facts, called optimizable facts, such that an objective function is minimized and constraints on the probabilities of the facts are not violated. This can also be considered as a variation of the parameter learning task, where the probabilities of the probabilistic facts (i.e., the parameters) are learnt through constrained optimization, rather than by considering a set of examples [20] and applying algorithms such as Expectation Maximization to maximize their likelihood. In this paper, we extend that definition to Probabilistic Answer Set Programming under the credal semantics [13], propose the new class of Probabilistic Optimizable Answer Set Programs, and develop a practical algorithm to solve the task of assigning an optimal probability to optimizable facts such that the sum of their probabilities is minimized, the absolute value of the difference of the probabilities of the optimizable facts considered pairwise is less than a certain constant (i.e., the probabilities of the optimizable facts are similar), and the probability of a given query is above a certain threshold. The paper is structured as follows: Sect. 2 discusses the needed background knowledge and related work, Sect. 3 introduces the new class of programs together with an algorithm to solve the task, which is tested in Sect. 4, and Sect. 5 concludes the paper.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 1–15, 2023. https://doi.org/10.1007/978-3-031-49299-0_1
2 Background
The Distribution Semantics (DS) [36] underlies probabilistic logic languages [34] such as ProbLog [14] and LPAD [40]. Uncertainty can be represented through ProbLog probabilistic facts, i.e., facts with an associated probability, that are considered independent. Following the ProbLog syntax, a probabilistic fact is of the form π_i::f_i, where π_i ∈ [0, 1] is the probability associated to the atom f_i. A selection of a truth value (true or false) for every probabilistic fact defines a world w (i.e., a logic program), whose probability P(w) can be computed with the formula
\[ P(w) = \prod_{f_i \in w} \pi_i \cdot \prod_{\neg f_i \in w} (1 - \pi_i). \tag{1} \]
Under the DS, every world is required to have exactly one model [39]. The probability of a query (a conjunction of ground atoms) q, P(q), is given by the sum of the probabilities of the worlds where the query is true. That is,
\[ P(q) = \sum_{w_i \models q} P(w_i). \tag{2} \]
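Equations (1) and (2) can be illustrated with a small enumeration. The following is a minimal sketch, not part of the paper: the fact names f1, f2, f3, their probabilities, and the query are hypothetical, and the query is written as a Python predicate over the world instead of being evaluated on the world's unique model by a logic programming system.

```python
from itertools import product

# Hypothetical probabilistic facts pi_i::f_i (names and values are illustrative).
facts = {"f1": 0.1, "f2": 0.3, "f3": 0.6}


def enumerate_worlds(facts):
    """Yield every selection of truth values for the probabilistic facts."""
    names = list(facts)
    for values in product([False, True], repeat=len(names)):
        yield dict(zip(names, values))


def world_probability(world, facts):
    """Eq. (1): multiply pi_i for selected facts and (1 - pi_i) for the others."""
    p = 1.0
    for name, pi in facts.items():
        p *= pi if world[name] else (1.0 - pi)
    return p


def query_probability(entails, facts):
    """Eq. (2): sum P(w) over the worlds in which the query holds."""
    return sum(
        world_probability(w, facts)
        for w in enumerate_worlds(facts)
        if entails(w)
    )


def q(world):
    """Illustrative query: f1 is true together with f2 or f3."""
    return world["f1"] and (world["f2"] or world["f3"])


total = sum(world_probability(w, facts) for w in enumerate_worlds(facts))
p_q = query_probability(q, facts)
```

The world probabilities sum to 1, and the enumeration is exponential in the number of facts, so this is only practical for very small programs.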
The credal semantics (CS) [13] assigns a meaning to answer set programs [11] extended with probabilistic facts. Under the CS, worlds are answer set programs, so they are allowed to have one or more answer sets (also called stable models) [18]. Under this semantics, the probability of a query q is identified by a range, $P(q) = [\underline{P}(q), \overline{P}(q)]$, where the lower probability $\underline{P}(q)$ is given by the sum of the probabilities of the worlds where the query is true in every answer set and the upper probability $\overline{P}(q)$ is given by the sum of the probabilities of the worlds where the query is true in at least one answer set. In formulas, with $AS(w_i)$ denoting the set of answer sets of the world $w_i$:
\[ \overline{P}(q) = \sum_{w_i \mid \exists m \in AS(w_i),\ m \models q} P(w_i), \tag{3} \]
\[ \underline{P}(q) = \sum_{w_i \mid \forall m \in AS(w_i),\ m \models q} P(w_i). \tag{4} \]
The CS requires every world to have at least one answer set, because otherwise some probability mass is lost: a world without answer sets contributes neither to the lower nor to the upper probability of a query. There are alternatives that handle worlds without answer sets based on a three-valued semantics, such as the ones proposed in [35,37], but we do not consider them here. Moreover, we consider only ground probabilistic facts and assume the disjoint condition, i.e., that probabilistic facts cannot appear as heads of rules.
Example 1. The following program shows a probabilistic answer set program. It contains three probabilistic facts, damaged(1), damaged(2), and damaged(3), with associated probabilities of 0.1, 0.3, and 0.6, respectively. The first probabilistic fact can be interpreted as “element 1 is damaged with probability 0.1”, and similarly for the other two. A disjunctive rule states that a damaged element may break (break/1) or may not break (not_break/1). The last rule is a constraint with two aggregates [1]. The first, #count{X : break(X), damaged(X)} = BD, counts all the elements X such that both break(X) and damaged(X) are true and assigns this value to BD. The second, #count{X : damaged(X)} = D, counts all the elements X such that damaged(X) is true and assigns this value to D. Overall, the constraint imposes that at least 60% of the damaged elements break. The program is
    0.1::damaged(1).
    0.3::damaged(2).
    0.6::damaged(3).
    break(X) ; not_break(X) :- damaged(X).
    :- #count{X : break(X), damaged(X)} = BD,
       #count{X : damaged(X)} = D,
       100*BD < 60*D.
We may be interested in computing the probability of the query q = break(1). Table 1 lists the worlds and the answer sets for the program. Worlds w4, w5, and w6 contribute to both the lower and upper probability, while w7 only contributes to the upper probability, since it has 4 answer sets but the query is true in only 3 of them. We have P(q) = [0.028 + 0.042 + 0.012, 0.028 + 0.042 + 0.012 + 0.018] = [0.082, 0.1].
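The bounds of Example 1 can be reproduced by brute force. The sketch below is an illustration under stated assumptions, not the paper's implementation: instead of calling an ASP solver, the function answer_sets hand-codes one reading of the program (each damaged element either breaks or does not, and candidates with 100*BD < 60*D are discarded). All function names are hypothetical.

```python
from itertools import combinations, product

# Probabilities of damaged(1), damaged(2), damaged(3) from Example 1.
probs = {1: 0.1, 2: 0.3, 3: 0.6}


def answer_sets(damaged):
    """Answer sets of a world, each given as the set of X with break(X) true.

    Hand-coded reading of the program (an assumption of this sketch): every
    damaged element breaks or not, and the constraint removes the candidates
    where 100*BD < 60*D.
    """
    d = len(damaged)
    result = []
    for k in range(d + 1):
        for broken in combinations(sorted(damaged), k):
            if not 100 * len(broken) < 60 * d:
                result.append(frozenset(broken))
    return result


def credal_probability(query_element):
    """Lower and upper probability of break(query_element)."""
    lower = upper = 0.0
    for choice in product([False, True], repeat=len(probs)):
        damaged = {x for x, on in zip(probs, choice) if on}
        p_world = 1.0
        for x, pi in probs.items():
            p_world *= pi if x in damaged else 1.0 - pi
        models = answer_sets(damaged)
        holds = [query_element in m for m in models]
        if models and all(holds):
            lower += p_world  # query true in every answer set
        if any(holds):
            upper += p_world  # query true in at least one answer set
    return lower, upper


lower, upper = credal_probability(1)
```

Under this reading the result agrees, up to floating-point rounding, with the interval [0.082, 0.1] computed in the example.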
The authors of [3] propose the PASTA algorithm to perform inference in probabilistic answer set programs under the credal semantics. It is based on projected answer set enumeration [16]: every probabilistic fact π_i::f_i is converted into a choice rule {f_i}, representing that every probabilistic fact can be selected or not. The probability is ignored in this phase since it does not influence the generation of the answer sets. For a query qr, two rules, q :- qr and nq :- not qr, are added to the program. Then, the algorithm computes the projected answer sets on the (converted) probabilistic facts and the q/0 and nq/0 atoms. From the projected answer sets it is possible to reconstruct the worlds by checking the probabilistic facts present in every answer set, and to determine whether a world contributes to both the lower and upper probability (only one answer set with the probabilistic facts true in the world and q/0) or only to the upper probability (one answer set with the probabilistic facts true in the world and q/0 and one with the same probabilistic facts but with nq/0). For example, world w4 in Table 1 is represented by the answer set {damaged(1), q}, and it contributes to both the lower and upper probability, while world w7 is represented by the answer sets as_q = {damaged(1), damaged(2), damaged(3), q} and as_nq = {damaged(1), damaged(2), damaged(3), nq}. These two answer sets state that, for the world w7, there is at least one answer set where the query is true (as_q) and at least one answer set where the query is false (as_nq).
Table 1. Worlds and answer sets for Example 1. b/1 stands for break/1. Only the break/1 atoms are reported in the answer sets.
id | d(1) | d(2) | d(3) | P(w) | Answer sets
w0 | 0 | 0 | 0 | 0.252 | {{}}
w1 | 0 | 0 | 1 | 0.378 | {{b(3)}}
w2 | 0 | 1 | 0 | 0.108 | {{b(2)}}
w3 | 0 | 1 | 1 | 0.162 | {{b(3), b(2)}}
w4 | 1 | 0 | 0 | 0.028 | {{b(1)}}
w5 | 1 | 0 | 1 | 0.042 | {{b(1), b(3)}}
w6 | 1 | 1 | 0 | 0.012 | {{b(1), b(2)}}
w7 | 1 | 1 | 1 | 0.018 | {{b(2), b(3)}, {b(1), b(3)}, {b(1), b(2)}, {b(1), b(2), b(3)}}
2.1 Probabilistic Optimizable Logic Programs
The authors of [6] introduced the class of Probabilistic Optimizable Logic Programs, composed of a probabilistic logic program under the Distribution Semantics extended with ground optimizable facts of the form optimizable [l_i, u_i]::f_i, where the functor optimizable states that f_i is an optimizable fact whose probability can be set in the range [l_i, u_i], with l_i < u_i and l_i, u_i ∈ [0, 1], an objective function involving the probabilities of the atoms in the program, and a set of numerical constraints. Given a query q, the goal of the Probabilistic Optimizable Problem is to find an optimal value for the probability of the optimizable facts such that the numerical constraints are not violated and the objective function is optimized (minimized). In this paper, we extend that definition to probabilistic answer set programs under the credal semantics.
2.2 Related Work
In the context of Answer Set Programming, several tools consider the integration of numerical constraints within ASP, such as [2,17,25,26]. However, these do not allow probabilistic facts and constraints on the probabilities. Apart from the credal semantics considered in this paper, there are other semantics that can be adopted to represent uncertainty in ASP, such as P-log [9], LPMLN [23], Diff-SAT [32], and PrASP [31]. We extend the work of [6], which is based on Probabilistic Logic Programming. There are other related solutions: the authors of [22] extend ProbLog by proposing SC-ProbLog, which can handle stochastic optimization, the work in [29] presents a constraint programming language based on a generalization of the DS, and the paper [43] introduces stochastic constraint programming, also considered in [8]. Moreover, the authors of [7] proposed an extension of the DS to manage continuous random variables and constraints on these values. All of these consider different types of numerical constraints, but none handles an optimization task over the probabilities of the facts. Another line of research focuses on parameter learning, that is, learning the parameters (probabilities) of the probabilistic facts [10,19,20,30,41], but there the probabilities are learnt from examples, rather than via constrained optimization.
3 Probabilistic Optimizable Answer Set Programs
We extend a probabilistic answer set program with the optimizable facts defined in [6] and discussed in Sect. 2.1, and provide the following definition of probabilistic optimizable answer set program.
Definition 1. A probabilistic optimizable answer set program (POASP) is a tuple (P, O, τ, ε) where P is a probabilistic answer set program, O is a set of optimizable facts, and τ and ε are two probability thresholds.
Before providing a formal definition of the Probabilistic Optimizable Answer Set Problem, let us introduce a motivating example.
Example 2. Consider an integrated circuit where a message goes from a source to a destination through a series of steps. However, the electronic components that represent these steps can be faulty, so the message may not be delivered. Some components can be replaced with others, with possibly higher reliability, but more reliable components (i.e., with a higher probability to transmit the message) also involve a higher cost. The goal is to ensure that the probability that the message reaches the destination is above a certain threshold while adopting components with just the needed reliability (i.e., with the minimum associated probability). This scenario can be modeled with the following POASP.
    0.6::edge(a,b).
    0.2::edge(b,c).
    0.95::edge(d,e).
    optimizable [0.8,0.95]::edge(c,f).
    optimizable [0.8,0.95]::edge(a,d).
    optimizable [0.8,0.95]::edge(e,f).
    node(a). node(b). node(c). node(d). node(e). node(f).
    path(X,X) :- node(X).
    path(X,Y) :- path(X,Z), edge(Z,Y).
    {transmit(A,B)} :- path(A,B), node(A), node(B).
    :- #count{A,B : transmit(A,B), path(A,B)} = RB,
       #count{A,B : path(A,B)} = R,
       100*RB < 95*R.
The three edge/2 probabilistic facts represent connections with a fixed reliability (the probability of the fact), while the three edge/2 optimizable facts represent connections with a tunable reliability that can be set between 0.8 and 0.95 (we keep the same probability range for the three for ease of explanation). Predicate path/2 states that there is a path from a source to a destination if they are connected by a sequence of edges (or they coincide). If there is a path between a source and a destination, the message may or may not be transmitted (transmit/2). The represented graph of connections has six nodes and six edges and is shown in Fig. 1. A constraint states that at least 95% of the paths transmit the message. We may be interested, for example, in setting the probabilities of the three optimizable facts such that the lower probability of the query q = transmit(a, f) is above 0.7 and the absolute value of the difference of the probabilities of the optimizable facts considered pairwise is less than 0.05 (i.e., all the optimizable facts have a similar probability).
Fig. 1. Graph represented by Example 2. Red edges correspond to optimizable facts.
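To see how the probability of transmit(a, f) in Example 2 depends on the optimizable facts, the worlds can be enumerated directly. The sketch below is an illustration only and does not call an ASP solver: it relies on a hand-coded reading of the program, namely that the answer set transmitting every path always satisfies the constraint, and that the query can be left out of some answer set only if transmitting all paths but one still satisfies 100*RB >= 95*R. The names paths and transmit_bounds are hypothetical.

```python
from itertools import product

NODES = "abcdef"
FIXED = {("a", "b"): 0.6, ("b", "c"): 0.2, ("d", "e"): 0.95}
OPTIMIZABLE = [("c", "f"), ("a", "d"), ("e", "f")]


def paths(edges):
    """All pairs (X, Y) such that path(X, Y) holds for the given edges."""
    reach = {(n, n) for n in NODES}
    changed = True
    while changed:
        changed = False
        for (x, z) in list(reach):
            for (u, y) in edges:
                if u == z and (x, y) not in reach:
                    reach.add((x, y))
                    changed = True
    return reach


def transmit_bounds(opt_probs, query=("a", "f")):
    """Lower and upper probability of transmit(query), by enumerating worlds."""
    probs = {**FIXED, **dict(zip(OPTIMIZABLE, opt_probs))}
    edges = list(probs)
    lower = upper = 0.0
    for choice in product([False, True], repeat=len(edges)):
        world = [e for e, on in zip(edges, choice) if on]
        p_world = 1.0
        for e, on in zip(edges, choice):
            p_world *= probs[e] if on else 1.0 - probs[e]
        reach = paths(world)
        if query not in reach:
            continue  # transmit(query) cannot be derived in this world
        r = len(reach)
        upper += p_world  # the answer set transmitting every path has the query
        if 100 * (r - 1) < 95 * r:
            lower += p_world  # dropping the query alone violates the constraint
    return lower, upper


at_lower_bounds = transmit_bounds((0.8, 0.8, 0.8))
at_upper_bounds = transmit_bounds((0.95, 0.95, 0.95))
```

Under this reading, setting all three optimizable probabilities to 0.8 gives a lower probability of about 0.646, which is below the 0.7 threshold mentioned in the example, while setting them all to 0.95 gives about 0.874. The optimization task looks for the cheapest assignment between these two extremes.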
The requirement that all the optimizable facts have similar probabilities, as in Example 2, can be formally expressed as ∀ o_i, o_j ∈ O, o_i ≠ o_j, |P*(o_i) − P*(o_j)| < ε, where P*(o_k) is the optimal probability associated with the optimizable fact o_k. This involves introducing $\binom{n}{2} = \frac{n \cdot (n-1)}{2}$ numerical constraints (i.e., all the combinations of size 2 of optimizable facts), where n is the number of optimizable facts. We are now ready to introduce the definition of the Probabilistic Optimizable Answer Set Problem.
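The pairwise similarity requirement can be sketched as follows. This is a minimal illustration with hypothetical function names and a hypothetical probability assignment. It only enumerates the pairs and checks a candidate assignment, and does not perform any optimization.

```python
from itertools import combinations


def pairwise_constraints(optimizable):
    """Unordered pairs (o_i, o_j); each carries one |P*(o_i) - P*(o_j)| < eps."""
    return list(combinations(optimizable, 2))


def satisfies_similarity(assignment, eps):
    """Check the similarity requirement for a candidate probability assignment."""
    return all(
        abs(assignment[a] - assignment[b]) < eps
        for a, b in pairwise_constraints(list(assignment))
    )


facts = ["edge(c,f)", "edge(a,d)", "edge(e,f)"]
pairs = pairwise_constraints(facts)

# Hypothetical candidate assignments, used only to exercise the check.
close = {"edge(c,f)": 0.80, "edge(a,d)": 0.84, "edge(e,f)": 0.83}
spread = {"edge(c,f)": 0.80, "edge(a,d)": 0.90, "edge(e,f)": 0.83}
close_ok = satisfies_similarity(close, 0.05)
spread_ok = satisfies_similarity(spread, 0.05)
```

With n = 3 optimizable facts there are 3 pairs. A solver that needs linear constraints would typically write each absolute value as two inequalities, which doubles the count.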
Definition 2. Given a POASP (P, O, τ, ε), a query q, and a probability target T (either lower or upper probability), the Probabilistic Optimizable Answer Set Problem requires finding an optimal probability assignment O∗ to the optimizable facts such that the sum of the probabilities of the optimizable facts is minimized, the target probability T of the query is above τ, and the absolute value of the difference of the probabilities of the optimizable facts considered pairwise is less than ε.

To clarify, consider again Example 2 with query q = transmit(a, f). If the target T is the lower probability, the probability threshold τ = 0.7 induces the numerical constraint P(q) > 0.7, and ε = 0.05 induces 6 numerical constraints (two inequalities for each of the three pairs of optimizable facts), namely, P(edge(c, f)) − P(edge(a, d)) < 0.05, P(edge(a, d)) − P(edge(c, f)) < 0.05, P(edge(c, f)) − P(edge(e, f)) < 0.05, P(edge(e, f)) − P(edge(c, f)) < 0.05, P(edge(a, d)) − P(edge(e, f)) < 0.05, and P(edge(e, f)) − P(edge(a, d)) < 0.05. The objective function to minimize is f(edge(a, d), edge(c, f), edge(e, f)) = P(edge(a, d)) + P(edge(c, f)) + P(edge(e, f)). Here, a possible optimal probability assignment is given by P(edge(c, f)) = 0.8, P(edge(a, d)) = 0.8387, and P(edge(e, f)) = 0.8386, yielding f(edge(a, d), edge(c, f), edge(e, f)) = 2.4773.

3.1 Algorithm
Before illustrating the algorithm, let us introduce a smaller example.

Example 3. The following program has one probabilistic fact and two optimizable facts.
0.4::a.
optimizable [0.4,0.8]::b.
optimizable [0.4,0.8]::c.
qr :- a, b.
qr ; nqr :- c.
Suppose that the upper probability of the query qr should be above 0.7 (τ = 0.7) and ε = 0.06. This program has $2^3 = 8$ worlds, listed in Table 2. The worlds w1, w3, w5, w6, and w7 contribute to the upper probability.

To solve the Probabilistic Optimizable Answer Set Problem, we propose the algorithm shown in Algorithm 1. First, probabilistic facts and optimizable facts are replaced by choice rules, and their probabilities and ranges are stored internally (function ConvertFacts). That is, every probabilistic fact Πi::fi is replaced by {fi} and every optimizable fact optimizable [li, ui]::oi is replaced by {oi}. Then, as in PASTA [3], for a query qr, the rules q :- qr and nq :- not qr are added to the program, to track whether the query is true or false in an answer set. After that, the function EnumerateProjectedAnswerSets enumerates the projected answer sets on the atoms representing the probabilistic facts (fi), the atoms representing the optimizable facts (oi), and the q/0 and nq/0 atoms. In Example 3 we get 11 answer sets: {nq}, {c, q}, {c, nq}, {b, nq}, {a, nq}, {b, a, q}, {c, b, nq}, {c, b, q}, {c, a, q}, {c, a, nq}, {c, b, a, q}.

The function ComputeSymbolicEquation extracts a symbolic equation consisting of the sum of the probabilities of the worlds that contribute to the probability target, where the probability of every world is computed with Eq. 1 and optimizable facts are kept symbolic. If the target is the upper probability, it considers all the worlds where the query is true in at least one answer set, while for the lower probability it considers all the worlds where the query is true in every answer set. In Example 3, the worlds that contribute to the upper probability are w1, w3, w5, w6, and w7 of Table 2 and we have the symbolic equation: $f_{up}(b, c) = 0.6 \cdot (1 - P(b)) \cdot P(c) + 0.6 \cdot P(b) \cdot P(c) + 0.4 \cdot (1 - P(b)) \cdot P(c) + 0.4 \cdot P(b) \cdot (1 - P(c)) + 0.4 \cdot P(b) \cdot P(c)$. The obtained equation is often not minimal in terms of the number of multiplications and additions performed. Thus, the function SimplifyEquation simplifies it. A possible simplified form of the previous function is $f^{s}_{up}(b, c) = P(b) \cdot P(c) + 0.4 \cdot P(b) \cdot (1 - P(c)) + P(c) \cdot (0.4 - 0.4 \cdot P(b)) + P(c) \cdot (0.6 - 0.6 \cdot P(b))$.

The objective function is computed with the function Sum, which sums the probabilities of the optimizable facts. Finally, the function Minimize solves the constrained non-linear optimization problem that consists in minimizing the sum of the probabilities of the optimizable facts subject to the numerical constraints requiring that the symbolic equation is above the probability threshold τ and that the absolute value of the difference of the probabilities of the optimizable facts considered pairwise is less than ε. For Example 3 with threshold τ = 0.7 and ε = 0.06, an optimal value for the two optimizable facts could be P(b) = 0.5628 and P(c) = 0.6129, which yields f(b, c) = 1.1757.

Table 2. Worlds, probability (where optimizable facts are kept symbolically), and answer sets for Example 3.

id   a  b  c   P(w)                            AS
w0   0  0  0   0.6 · (1 − P(b)) · (1 − P(c))   {{}}
w1   0  0  1   0.6 · (1 − P(b)) · P(c)         {{c qr}, {c nqr}}
w2   0  1  0   0.6 · P(b) · (1 − P(c))         {{b}}
w3   0  1  1   0.6 · P(b) · P(c)               {{b c qr}, {b c nqr}}
w4   1  0  0   0.4 · (1 − P(b)) · (1 − P(c))   {{a}}
w5   1  0  1   0.4 · (1 − P(b)) · P(c)         {{a c qr}, {a c nqr}}
w6   1  1  0   0.4 · P(b) · (1 − P(c))         {{a b qr}}
w7   1  1  1   0.4 · P(b) · P(c)               {{a b c qr}}
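The symbolic equation for Example 3 can be reproduced numerically with a few lines of Python. The following is only a minimal sketch under stated assumptions, not the PASTA implementation: the function names are hypothetical, the eight worlds are enumerated by brute force, and the test for whether a world contributes to the upper probability is hard-coded from the answer sets in Table 2. It evaluates the unsimplified and the simplified equation and checks whether a given assignment satisfies the two numerical constraints.

```python
from itertools import product

P_A = 0.4  # probability of the probabilistic fact a in Example 3


def query_possible(a, b, c):
    """True if qr holds in at least one answer set of the world (a, b, c).

    Hard-coded from Table 2: qr is forced by `qr :- a, b.` and is one of
    the two alternatives of `qr ; nqr :- c.` (illustrative shortcut).
    """
    return (a and b) or c


def upper_probability(pb, pc):
    """Sum the probabilities of the worlds that contribute to the upper probability."""
    total = 0.0
    for a, b, c in product([False, True], repeat=3):
        if query_possible(a, b, c):
            weight = P_A if a else 1 - P_A
            weight *= pb if b else 1 - pb
            weight *= pc if c else 1 - pc
            total += weight
    return total


def simplified_upper_probability(pb, pc):
    """Simplified form of the symbolic equation as reported in the text."""
    return (pb * pc + 0.4 * pb * (1 - pc)
            + pc * (0.4 - 0.4 * pb) + pc * (0.6 - 0.6 * pb))


def is_feasible(pb, pc, tau=0.7, eps=0.06, low=0.4, high=0.8):
    """Check the range, threshold, and pairwise-difference constraints."""
    in_range = low <= pb <= high and low <= pc <= high
    return in_range and upper_probability(pb, pc) > tau and abs(pb - pc) < eps


# Assignment reported in the text for Example 3.
pb, pc = 0.5628, 0.6129
print(upper_probability(pb, pc), simplified_upper_probability(pb, pc))
print(is_feasible(pb, pc), pb + pc)
```

The sketch only verifies feasibility of a candidate assignment; it does not search for the minimum, which is the job of the Minimize step.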
Algorithm 1. Function OptimizeProbability: computation of the optimal probability values for the optimizable facts O in the POASP P with query qr, threshold τ (constraint P(qr) > τ), algorithm alg, target T (lower or upper probability), and bound ε on the absolute value of the difference between the probabilities of the optimizable facts considered pairwise.
function OptimizeProbability(P, qr, τ, alg, T, ε)
    P ← ConvertFacts(P)                           ▷ Conversion of the prob. and opt. facts.
    P ← P ∪ {q :- qr} ∪ {nq :- not qr}
    as ← EnumerateProjectedAnswerSets(P)
    eq ← ComputeSymbolicEquation(as, T)
    eq ← SimplifyEquation(eq)
    objective ← Sum(O)                            ▷ Function to minimize.
    constraints ← [eq > τ]
    for (o0, o1) ∈ combinations(O, 2) do          ▷ Combinations of size 2 of opt. facts.
        constraints ← constraints ∪ [|o0 − o1| < ε]
    end for
    return Minimize(objective, constraints, alg)
end function
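The constraint-building loop of Algorithm 1 can be sketched in plain Python as follows. This is an illustration under assumptions, not the solver's code: `build_constraints`, `objective`, and `f_up` are hypothetical names, each absolute-value constraint is split into two inequalities, and a constraint is represented as a function g that is satisfied when g(assignment) > 0. The query probability used here is the simplified equation of Example 3.

```python
from itertools import combinations


def f_up(p):
    """Simplified upper-probability equation of Example 3 (from the text)."""
    b, c = p["b"], p["c"]
    return b * c + 0.4 * b * (1 - c) + c * (0.4 - 0.4 * b) + c * (0.6 - 0.6 * b)


def build_constraints(names, query_prob, tau, eps):
    """Return (label, g) pairs; a constraint holds when g(assignment) > 0."""
    constraints = [(f"P(q) > {tau}", lambda p: query_prob(p) - tau)]
    for first, second in combinations(names, 2):
        # |P(first) - P(second)| < eps becomes two inequalities.
        # Default arguments freeze the loop variables inside each lambda.
        constraints.append((f"P({first}) - P({second}) < {eps}",
                            lambda p, x=first, y=second: eps - (p[x] - p[y])))
        constraints.append((f"P({second}) - P({first}) < {eps}",
                            lambda p, x=first, y=second: eps - (p[y] - p[x])))
    return constraints


def objective(assignment):
    """Sum of the probabilities of the optimizable facts."""
    return sum(assignment.values())


constraints = build_constraints(["b", "c"], f_up, tau=0.7, eps=0.06)
candidate = {"b": 0.5628, "c": 0.6129}  # assignment reported for Example 3
for label, g in constraints:
    print(label, g(candidate) > 0)
print(objective(candidate))
```

With n optimizable facts this produces one threshold constraint plus n·(n−1) inequalities, i.e., two per pair. A numerical solver would typically treat the strict inequalities as non-strict ones.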
4 Experiments
We implemented Algorithm 1 inside the PASTA solver (available at https://github.com/damianoazzolini/pasta) with Python3 and leveraged the ASP solver clingo [15] to compute the projected answer sets, SymPy [28] to simplify the symbolic equation, and SciPy [42] for constrained optimization (function minimize). We tested the Sequential Least Squares Programming (SLSQP) [21] and the Constrained Optimization by Linear Approximation (COBYLA) [33] algorithms already available in SciPy. We ran the experiments on a computer with an Intel E5-2630v3 running at 2.40 GHz with 8 GB of RAM. For all of them, we tracked the time required to compute the answer sets and the time required to optimize the objective function subject to the non-linear constraints, both measured with the Python3 built-in function time. Moreover, we measured the total time with the bash command time and reported the real field.

We considered two instances with the structure of Example 2 with 10 and 15 edges. The instance of size 10 has the edges edge(1, 2), edge(1, 3), edge(2, 4), edge(3, 5), edge(4, 6), edge(5, 6), edge(6, 7), edge(6, 8), edge(7, 9), edge(8, 9), while the instance of size 15 has, in addition, edge(9, 10), edge(9, 11), edge(10, 12), edge(11, 13), and edge(12, 13). For each of these two instances we generated n sub-instances, where n is the total number of edges. In each sub-instance, the edges are subdivided into two parts: n_o are optimizable facts with probability ranging from 0.8 to 0.95 and the remaining n_p = n − n_o are probabilistic facts with a random probability between 0.7 and 0.9. For the instance of size 10 we have 10 sub-instances with 1, ..., 10 optimizable facts each (and so 9, ..., 0 probabilistic facts). For example, in the sub-instance with one optimizable fact only edge(1, 2) is optimizable and the other edge/2 facts are probabilistic; in the sub-instance with two optimizable facts also edge(1, 3) is optimizable; and so on. The same holds for the instance of size 15. For all the sub-instances of size 10 the query is path(1, 9), while for all the sub-instances of size 15 the query is path(1, 13).

We tested four scenarios: thresholds (τ) 0.75 and 0.85, both with ε = 0.05 and without considering ε (i.e., without imposing that the absolute value of the difference of the probabilities of the optimizable facts considered pairwise is below a certain constant). Figures 2 and 3 show the value of the objective function for all four scenarios. In all of them, the two algorithms behave similarly, and the introduction of the numerical constraints due to ε does not influence the value of the objective function. For τ = 0.85 and the instance of size 15, the sub-instances of size 1 and 2 do not admit a solution (value -1 in Figs. 2b and 3b).
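The way sub-instances are derived from an instance can be sketched as below. This is a hypothetical generator written for illustration, not the script used for the experiments: the function name, the fixed random seed, and the rounding of the random probabilities to two decimals are assumptions. The fact syntax follows the one used in Example 3.

```python
import random

# Edges of the instance of size 10, as listed in the text.
EDGES_10 = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6),
            (5, 6), (6, 7), (6, 8), (7, 9), (8, 9)]


def make_sub_instance(edges, n_opt, rng):
    """Return the edge/2 facts of the sub-instance with n_opt optimizable facts.

    The first n_opt edges become optimizable facts with range [0.8, 0.95];
    the remaining ones become probabilistic facts with a random
    probability between 0.7 and 0.9.
    """
    facts = []
    for index, (src, dst) in enumerate(edges):
        if index < n_opt:
            facts.append(f"optimizable [0.8,0.95]::edge({src},{dst}).")
        else:
            prob = round(rng.uniform(0.7, 0.9), 2)  # rounding is an assumption
            facts.append(f"{prob}::edge({src},{dst}).")
    return facts


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
for n_opt in range(1, len(EDGES_10) + 1):
    sub_instance = make_sub_instance(EDGES_10, n_opt, rng)
    print(n_opt, sub_instance[:3])
```

Each generated list would still need the node/1 facts, the path/2 and transmit/2 rules, and the constraint of Example 2 to form a complete program.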
[Figure 2: two line plots, (a) τ = 0.75 and (b) τ = 0.85. x-axis: sub-instance size; y-axis: objective function value; series: COBYLA 10, SLSQP 10, COBYLA 15, SLSQP 15.]
Fig. 2. Value of the objective function (sum of the probabilities of the optimizable facts) for SLSQP (dashed curves) and COBYLA (continuous curves) on the instances of size 10 and 15 with threshold τ = 0.75 (left) and τ = 0.85 (right), target upper probability, and ε = 0.05.
[Figure 3: two line plots, (a) τ = 0.75 and (b) τ = 0.85. x-axis: sub-instance size; y-axis: objective function value; series: COBYLA 10, SLSQP 10, COBYLA 15, SLSQP 15.]
Fig. 3. Value of the objective function (sum of the probabilities of the optimizable facts) for SLSQP (dashed curves) and COBYLA (continuous curves) on the instances of size 10 and 15 with threshold τ = 0.75 (left) and τ = 0.85 (right), target upper probability, without the constraint on the difference between the probabilities of optimizable facts.
Figures 4, 5, 6, and 7 report the execution times for all the considered cases. The time required to enumerate the projected answer sets is negligible with respect to the time required to solve the optimization problem. The COBYLA algorithm is slower than SLSQP. This difference is significant for the instance of size 15 in the sub-instances of size 14 and 15 (Figs. 4b, 5b, 6b, and 7b), although it is less evident for the configuration with τ = 0.85 and ε = 0.05 (Fig. 6b). The small difference between the total time and the time required to solve the optimization problem shows that optimization dominates the running time and that the simplification of the symbolic equation is comparatively fast.
[Figure 4: two line plots, (a) Instance 10 and (b) Instance 15. x-axis: sub-instance size; y-axis: execution time (s); series: ASP, Opt., and Total time for COBYLA and for SLSQP.]
Fig. 4. Execution time for SLSQP (dashed curves) and COBYLA (continuous curves) on the instance of size 10 (left) and 15 (right) with threshold τ = 0.75, target upper probability, and ε = 0.05.
[Figure 5: two line plots, (a) Instance 10 and (b) Instance 15. x-axis: sub-instance size; y-axis: execution time (s); series: ASP, Opt., and Total time for COBYLA and for SLSQP.]
Fig. 5. Execution time for SLSQP (dashed curves) and COBYLA (continuous curves) on the instance of size 10 (left) and 15 (right) with threshold τ = 0.75, target upper probability, without the constraint on the difference between the probabilities of optimizable facts.
[Figure 6: two line plots, (a) Instance 10 and (b) Instance 15. x-axis: sub-instance size; y-axis: execution time (s); series: ASP, Opt., and Total time for COBYLA and for SLSQP.]
Fig. 6. Execution time for SLSQP (dashed curves) and COBYLA (continuous curves) on the instance of size 10 (left) and 15 (right) with threshold τ = 0.85, target upper probability, and ε = 0.05.
[Figure 7: two line plots, (a) Instance 10 and (b) Instance 15. x-axis: sub-instance size; y-axis: execution time (s); series: ASP, Opt., and Total time for COBYLA and for SLSQP.]
Fig. 7. Execution time for SLSQP (dashed curves) and COBYLA (continuous curves) on the instance of size 10 (left) and 15 (right) with threshold τ = 0.85, target upper probability, without the constraint on the difference between the probabilities of optimizable facts.
5 Conclusions
In this paper we proposed the class of Probabilistic Optimizable Answer Set Programs, i.e., probabilistic answer set programs under the credal semantics extended with optimizable facts. The goal is to set the probabilities of the optimizable facts such that the probability of a query is above a certain threshold and the probabilities of the optimizable facts are similar. We also developed an algorithm to solve the task based on projected answer set enumeration and non-linear constrained optimization and tested it on some benchmarks. Empirical results show that finding a solution to the optimization task is the bottleneck of the overall pipeline. As future work, we plan to extend the experiments to other benchmarks, larger instances, and different optimization solvers.
References
1. Alviano, M., Faber, W.: Aggregates in answer set programming. KI-Künstliche Intelligenz 32(2), 119–124 (2018). https://doi.org/10.1007/s13218-018-0545-9
2. Arias, J., Carro, M., Salazar, E., Marple, K., Gupta, G.: Constraint answer set programming without grounding. Theory Pract. Logic Program. 18(3–4), 337–354 (2018). https://doi.org/10.1017/S1471068418000285
3. Azzolini, D., Bellodi, E., Riguzzi, F.: Statistical statements in probabilistic logic programming. In: Gottlob, G., Inclezan, D., Maratea, M. (eds.) Logic Programming and Nonmonotonic Reasoning, pp. 43–55. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-15707-3_4
4. Azzolini, D., Bellodi, E., Riguzzi, F.: Approximate inference in probabilistic answer set programming for statistical probabilities. In: Dovier, A., Montanari, A., Orlandini, A. (eds.) AIxIA 2022 - Advances in Artificial Intelligence, pp. 33–46. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27181-6_3
5. Azzolini, D., Bellodi, E., Riguzzi, F.: MAP inference in probabilistic answer set programs. In: Dovier, A., Montanari, A., Orlandini, A. (eds.) AIxIA 2022 - Advances in Artificial Intelligence, pp. 413–426. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-27181-6_29
6. Azzolini, D., Riguzzi, F.: Optimizing probabilities in probabilistic logic programs. Theory Pract. Logic Program. 21(5), 543–556 (2021). https://doi.org/10.1017/S1471068421000260
7. Azzolini, D., Riguzzi, F., Lamma, E.: A semantics for hybrid probabilistic logic programs with function symbols. Artif. Intell. 294, 103452 (2021). https://doi.org/10.1016/j.artint.2021.103452
8. Babaki, B., Guns, T., de Raedt, L.: Stochastic constraint programming with and-or branch-and-bound. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-2017, pp. 539–545 (2017). https://doi.org/10.24963/ijcai.2017/76
9. Baral, C., Gelfond, M., Rushton, N.: Probabilistic reasoning with answer sets. Theory Pract. Logic Program. 9(1), 57–144 (2009). https://doi.org/10.1017/S1471068408003645
10. Bellodi, E., Riguzzi, F.: Expectation maximization over binary decision diagrams for probabilistic logic programs. Intell. Data Anal. 17(2), 343–363 (2013). https://doi.org/10.3233/IDA-130582
11. Brewka, G., Eiter, T., Truszczyński, M.: Answer set programming at a glance. Commun. ACM 54(12), 92–103 (2011). https://doi.org/10.1145/2043174.2043195
12. Cozman, F.G., Mauá, D.D.: The structure and complexity of credal semantics. In: Hommersom, A., Abdallah, S.A. (eds.) PLP 2016. CEUR Workshop Proceedings, vol. 1661, pp. 3–14. CEUR-WS.org (2016)
13. Cozman, F.G., Mauá, D.D.: The joy of probabilistic answer set programming: semantics, complexity, expressivity, inference. Int. J. Approx. Reason. 125, 218–239 (2020). https://doi.org/10.1016/j.ijar.2020.07.004
14. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: a probabilistic prolog and its application in link discovery. In: Veloso, M.M. (ed.) IJCAI 2007, vol. 7, pp. 2462–2467. AAAI Press (2007)
15. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Multi-shot ASP solving with clingo. Theory Pract. Logic Program. 19(1), 27–82 (2019). https://doi.org/10.1017/S1471068418000054
16. Gebser, M., Kaufmann, B., Schaub, T.: Solution enumeration for projected boolean search problems. In: van Hoeve, W.-J., Hooker, J.N. (eds.) CPAIOR 2009. LNCS, vol. 5547, pp. 71–86. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-01929-6_7
17. Gebser, M., Ostrowski, M., Schaub, T.: Constraint answer set solving. In: Hill, P.M., Warren, D.S. (eds.) ICLP 2009. LNCS, vol. 5649, pp. 235–249. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02846-5_22
18. Gelfond, M., Lifschitz, V.: The stable model semantics for logic programming. In: 5th International Conference and Symposium on Logic Programming (ICLP/SLP 1988), vol. 88, pp. 1070–1080. MIT Press, USA (1988)
19. Gutmann, B., Kimmig, A., Kersting, K., De Raedt, L.: Parameter learning in probabilistic databases: a least squares approach. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5211, pp. 473–488. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87479-9_49
20. Gutmann, B., Thon, I., De Raedt, L.: Learning the parameters of probabilistic logic programs from interpretations. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds.) ECML PKDD 2011. LNCS (LNAI), vol. 6911, pp. 581–596. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23780-5_47
21. Kraft, D.: Algorithm 733: TOMP-fortran modules for optimal control calculations. ACM Trans. Math. Softw. 20(3), 262–281 (1994). https://doi.org/10.1145/192115.192124
22. Latour, A.L.D., Babaki, B., Dries, A., Kimmig, A., Van den Broeck, G., Nijssen, S.: Combining stochastic constraint optimization and probabilistic programming. In: Beck, J.C. (ed.) CP 2017. LNCS, vol. 10416, pp. 495–511. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66158-2_32
23. Lee, J., Wang, Y.: Weighted rules under the stable model semantics. In: Baral, C., Delgrande, J.P., Wolter, F. (eds.) Proceedings of the Fifteenth International Conference on Principles of Knowledge Representation and Reasoning, pp. 145–154. AAAI Press (2016)
24. Lee, J., Yang, Z.: LPMLN, weak constraints, and P-log. In: Singh, S., Markovitch, S. (eds.) Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, California, USA, 4–9 February 2017, pp. 1170–1177. AAAI Press (2017)
25. Lierler, Y.: Relating constraint answer set programming languages and algorithms. Artif. Intell. 207, 1–22 (2014). https://doi.org/10.1016/j.artint.2013.10.004
26. Lierler, Y.: Constraint answer set programming: integrational and translational (or smt-based) approaches. Theory Pract. Logic Program. 23(1), 195–225 (2023). https://doi.org/10.1017/S1471068421000478
27. Mauá, D.D., Cozman, F.G.: Complexity results for probabilistic answer set programming. Int. J. Approx. Reason. 118, 133–154 (2020). https://doi.org/10.1016/j.ijar.2019.12.003
28. Meurer, A., et al.: SymPy: symbolic computing in python. PeerJ Comput. Sci. 3, e103 (2017). https://doi.org/10.7717/peerj-cs.103
29. Michels, S., Hommersom, A., Lucas, P.J.F., Velikova, M.: A new probabilistic constraint logic programming language based on a generalised distribution semantics. Artif. Intell. 228, 1–44 (2015). https://doi.org/10.1016/j.artint.2015.06.008
30. Nguembang Fadja, A., Riguzzi, F., Lamma, E.: Learning the parameters of deep probabilistic logic programs. In: Bellodi, E., Schrijvers, T. (eds.) Probabilistic Logic Programming (PLP 2018). CEUR Workshop Proceedings, vol. 2219, pp. 9–14. Sun SITE Central Europe, Aachen (2018)
31. Nickles, M.: A tool for probabilistic reasoning based on logic programming and first-order theories under stable model semantics. In: Michael, L., Kakas, A. (eds.) JELIA 2016. LNCS (LNAI), vol. 10021, pp. 369–384. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48758-8_24
32. Nickles, M.: Differentiable SAT/ASP. In: Bellodi, E., Schrijvers, T. (eds.) Proceedings of the 5th International Workshop on Probabilistic Logic Programming, PLP 2018, co-located with the 28th International Conference on Inductive Logic Programming (ILP 2018), Ferrara, Italy, 1 September 2018, CEUR Workshop Proceedings, vol. 2219, pp. 62–74. CEUR-WS.org (2018)
33. Powell, M.J.D.: A direct search optimization method that models the objective and constraint functions by linear interpolation. In: Gomez, S., Hennart, J.P. (eds.) Advances in Optimization and Numerical Analysis, pp. 51–67. Springer, Dordrecht (1994). https://doi.org/10.1007/978-94-015-8330-5_4
34. Riguzzi, F.: Foundations of Probabilistic Logic Programming: Languages, semantics, inference and learning. River Publishers, Gistrup (2018)
35. Rocha, V.H.N., Gagliardi Cozman, F.: A credal least undefined stable semantics for probabilistic logic programs and probabilistic argumentation. In: Kern-Isberner, G., Lakemeyer, G., Meyer, T. (eds.) Proceedings of the 19th International Conference on Principles of Knowledge Representation and Reasoning, KR 2022, pp. 309–319 (2022). https://doi.org/10.24963/kr.2022/31
36. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: Sterling, L. (ed.) ICLP 1995, pp. 715–729. MIT Press, Cambridge (1995). https://doi.org/10.7551/mitpress/4298.003.0069
37. Totis, P., De Raedt, L., Kimmig, A.: smProbLog: stable model semantics in problog for probabilistic argumentation. Theory Pract. Logic Program. 1–50 (2023). https://doi.org/10.1017/S147106842300008X
38. Tuckey, D., Russo, A., Broda, K.: PASOCS: a parallel approximate solver for probabilistic logic programs under the credal semantics. arXiv abs/2105.10908 (2021). https://doi.org/10.48550/ARXIV.2105.10908
39. Van Gelder, A., Ross, K., Schlipf, J.S.: Unfounded sets and well-founded semantics for general logic programs. In: Proceedings of the Seventh ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS 1988, pp. 221–230. Association for Computing Machinery, New York (1988). https://doi.org/10.1145/308386.308444
40. Vennekens, J., Verbaeten, S., Bruynooghe, M.: Logic programs with annotated disjunctions. In: Demoen, B., Lifschitz, V. (eds.) ICLP 2004. LNCS, vol. 3132, pp. 431–445. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-27775-0_30
41. Vieira de Faria, F.H., Gusmão, A.C., De Bona, G., Mauá, D.D., Cozman, F.G.: Speeding up parameter and rule learning for acyclic probabilistic logic programs. Int. J. Approx. Reason. 106, 32–50 (2019). https://doi.org/10.1016/j.ijar.2018.12.012
42. Virtanen, P., et al.: SciPy 1.0: fundamental algorithms for scientific computing in python. Nat. Methods 17, 261–272 (2020). https://doi.org/10.1038/s41592-019-0686-2
43. Walsh, T.: Stochastic constraint programming. In: Proceedings of the 15th European Conference on Artificial Intelligence, vol. 1, pp. 111–115 (2002)
Regularization in Probabilistic Inductive Logic Programming
Elisabetta Gentili1(B), Alice Bizzarri1, Damiano Azzolini2, Riccardo Zese3, and Fabrizio Riguzzi4
1 Department of Engineering, University of Ferrara, Ferrara, Italy {elisabetta.gentili1,alice.bizzarri}@unife.it
2 Department of Environmental and Prevention Sciences, University of Ferrara, Ferrara, Italy [email protected]
3 Department of Chemical, Pharmaceutical and Agricultural Sciences, University of Ferrara, Ferrara, Italy [email protected]
4 Department of Mathematics and Computer Science, University of Ferrara, Ferrara, Italy [email protected]
Abstract. Probabilistic Logic Programming combines uncertainty and logic-based languages. Liftable Probabilistic Logic Programs have been recently proposed to perform inference in a lifted way. LIFTCOVER is an algorithm used to perform parameter and structure learning of liftable probabilistic logic programs. In particular, it performs parameter learning via Expectation Maximization and LBFGS. In this paper, we present an updated version of LIFTCOVER, called LIFTCOVER+, in which regularization was added to improve the quality of the solutions and LBFGS was replaced by gradient descent. We tested LIFTCOVER+ on the same 12 datasets on which LIFTCOVER was tested and compared the performances in terms of AUC-ROC, AUC-PR, and execution times. Results show that in most cases Expectation Maximization with regularization improves the quality of the solutions.
Keywords: Probabilistic Inductive Logic Programming · Regularization · Statistical Relational Artificial Intelligence
1 Introduction
Probabilistic Logic Programming (PLP) combines uncertainty and logic-based languages [17]. Given its expressiveness, in the last decades PLP, and in particular PLP under the distribution semantics [21], has been widely adopted in domains characterized by uncertainty [5,11,12,19,20]. A probabilistic logic program without function symbols under the distribution semantics defines a probability distribution over normal logic programs, also called instances or worlds.
© The Author(s) 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 16–29, 2023. https://doi.org/10.1007/978-3-031-49299-0_2
The distribution is extended to a joint distribution over worlds and interpretations (or queries) and the probability of a query can be obtained from this distribution [17]. Logic Programs with Annotated Disjunctions (LPADs) [26] are a PLP language under the distribution semantics. In LPADs without function symbols, heads of clauses are disjunctions in which each atom is annotated with a probability. Since learning probabilistic logic programs is expensive, various approaches have been proposed to overcome this problem. Lifted inference [15] was introduced to improve the performance of reasoning in probabilistic relational models by taking into consideration populations of individuals instead of considering each individual separately. Liftable Probabilistic Logic Programs have been recently proposed to perform inference in a lifted way. LIFTCOVER [13] is an algorithm that performs structure and parameter learning (via Expectation-Maximization (LIFTCOVER-EM) or Limited-memory BFGS (LIFTCOVER-LBFGS)) of liftable probabilistic logic programs. Previous results [13] showed that LIFTCOVER-EM often outperformed LIFTCOVER-LBFGS and other state-of-the-art systems. In this paper, we present LIFTCOVER+, an algorithm that extends LIFTCOVER with regularization and gradient descent for parameter learning to improve the quality of the solutions and prevent overfitting. We test LIFTCOVER+ on 12 real-world datasets and compare the results with LIFTCOVER-EM. Empirical results show that LIFTCOVER+ with the regularized Expectation-Maximization algorithm obtains slightly better results than the original LIFTCOVER-EM. The paper is organized as follows: Sect. 2 presents background on PLP; Sect. 3 introduces LIFTCOVER+; Sect. 4 shows the results of the experiments; in Sect. 5 we discuss related work; and in Sect. 6 we draw the conclusions.
2 Background
We consider the liftable PLP language [13], a restriction of probabilistic logic programs so that inference can be performed in a lifted way. Such programs contain clauses with a single annotated atom in the head, and the predicate of this atom is the same for all clauses, i.e., clauses of the form $C_i = h_i : \Pi_i \ \text{:-}\ b_{i1}, \ldots, b_{iu_i}$, where the single atom in the head is built over predicate target/a, with a the arity. The bodies of the clauses contain predicates other than target/a, and the facts and rules defining these predicates have a single atom in the head with probability 1 (they are certain). The predicate target/a is the target of learning and the other predicates are input predicates. In other words, in the liftable PLP language uncertainty appears only in the rules. The goal is to compute the probability of a ground instantiation (or query) q of target/a. To do so, we find the number of ground instantiations of clauses for target/a such that the body is true and the head is equal to q. Let $\{\theta_{i1}, \ldots, \theta_{im_i}\}$ be the $m_i$ instantiations for clause $C_i$, $i = 1, \ldots, n$. Every instantiation $\theta_{ij}$ corresponds to a random variable $X_{ij}$ that is equal to 1 (0) with probability $\Pi_i$ ($1 - \Pi_i$). The query q is true if at least one random
variable for a rule is true, i.e., takes value 1. Equivalently, the query q is false only if none of the random variables is true. Since all the random variables are mutually independent, the probability that q is true can be computed as $P(q) = 1 - \prod_{i=1}^{n} (1 - \Pi_i)^{m_i}$. The fact that the random variables associated with the rules are mutually independent does not limit the capability to represent probability distributions, as shown in [17]. LIFTCOVER [13], shown in Algorithm 1, learns the structure of liftable probabilistic logic programs. Given a set $E^+ = \{e_1, \ldots, e_Q\}$ of positive examples, a set $E^- = \{e_{Q+1}, \ldots, e_R\}$ of negative examples, and a background knowledge B (possibly a normal logic program defining the input predicates), the goal of structure learning is to find a liftable probabilistic logic program T such that the likelihood
\[ L = \prod_{q=1}^{Q} P(e_q) \prod_{r=Q+1}^{R} P(\neg e_r) \]
is maximized. LIFTCOVER solves this problem by first identifying good clauses guided by the log-likelihood (LL) of the data, with a top-down beam search. The refinement operator adds a literal taken from a bottom clause to the body of the current clause. The beam search is repeated a user-defined number of times or until the beam is empty. Then, parameter learning is performed on the full set of clauses found, which is considered as a single theory. LIFTCOVER can use either Expectation-Maximization (EM) or Limited-memory BFGS (LBFGS). LBFGS is used to find the values of the parameters that optimize the likelihood by exploiting the gradient of the log-likelihood with respect to the parameters. The likelihood can be unfolded to
\[ L = \prod_{q=1}^{Q} \left( 1 - \prod_{l=1}^{n} (1 - \Pi_l)^{m_{lq}} \right) \prod_{l=1}^{n} (1 - \Pi_l)^{m_{l-}} \]
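where $m_{iq}$ ($m_{ir}$) is the number of instantiations of $C_i$ whose head is $e_q$ ($e_r$) and whose body is true, and $m_{l-} = \sum_{r=Q+1}^{R} m_{lr}$. Its gradient can be computed as:
\[ \frac{\partial L}{\partial \Pi_i} = \frac{L}{1 - \Pi_i} \left( \sum_{q=1}^{Q} m_{iq} \left( \frac{1}{P(e_q)} - 1 \right) - m_{i-} \right) \qquad (1) \]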
Because the equation $\frac{\partial L}{\partial \Pi_i} = 0$ does not admit a closed-form solution, optimization is needed to find the maximum of L. The clauses with a probability below a user-defined threshold are discarded. In models in which the variables are hidden, the EM algorithm [7] must be used to find the maximum likelihood estimates of parameters. In the Expectation step, the distribution of the unseen variables in each instance is computed given the observed data and the current value of the parameters. In the Maximization step, the new parameters are computed so that the expected likelihood is maximized. The alternation between the Expectation and the Maximization steps continues until the likelihood does not improve anymore.
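The query probability, the unfolded likelihood, and the gradient of Eq. (1) can be written down directly in Python. The following is a small sketch under assumptions, not LIFTCOVER's code: the function names and the data layout (one list of counts $m_{lq}$ per positive example, one aggregated count $m_{l-}$ per clause) are hypothetical, and the toy numbers are invented for illustration.

```python
import math


def query_prob(probs, counts):
    """P(q) = 1 - prod_i (1 - Pi_i)^(m_i)."""
    return 1.0 - math.prod((1.0 - p) ** m for p, m in zip(probs, counts))


def likelihood(probs, pos_counts, neg_totals):
    """Unfolded likelihood: product over positive examples and negative counts.

    pos_counts[q][l] is m_lq; neg_totals[l] is m_l- (summed over negatives).
    """
    value = 1.0
    for counts in pos_counts:
        value *= query_prob(probs, counts)
    for p, m_neg in zip(probs, neg_totals):
        value *= (1.0 - p) ** m_neg
    return value


def gradient(probs, pos_counts, neg_totals):
    """Gradient of the likelihood as in Eq. (1).

    Assumes every positive example has P(e_q) > 0 and every Pi_i < 1.
    """
    lik = likelihood(probs, pos_counts, neg_totals)
    grad = []
    for i, p in enumerate(probs):
        inner = sum(c[i] * (1.0 / query_prob(probs, c) - 1.0) for c in pos_counts)
        grad.append(lik / (1.0 - p) * (inner - neg_totals[i]))
    return grad


# Toy data (invented): two clauses, two positive examples.
probs = [0.3, 0.6]
pos_counts = [[1, 2], [2, 0]]
neg_totals = [1, 3]
print(likelihood(probs, pos_counts, neg_totals))
print(gradient(probs, pos_counts, neg_totals))
```

A quick way to gain confidence in such a gradient is to compare it with a central finite difference of `likelihood`, which is a cheap sanity check before handing it to a gradient-based optimizer.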
To use the EM algorithm, the distributions of the hidden variables given the observed ones, $P(X_{ij} = 1 \mid e)$ and $P(X_{ij} = 1 \mid \neg e)$, have to be computed. Given that $P(X_{ij} = 1, e) = P(e \mid X_{ij} = 1) \cdot P(X_{ij} = 1) = P(X_{ij} = 1) = \Pi_i$ and $P(e \mid X_{ij} = 1) = 1$,
\[ P(X_{ij} = 1 \mid e) = \frac{P(X_{ij} = 1, e)}{P(e)} = \frac{\Pi_i}{1 - \prod_{l=1}^{n} (1 - \Pi_l)^{m_l}} \qquad (2) \]
\[ P(X_{ij} = 0 \mid e) = 1 - \frac{\Pi_i}{1 - \prod_{l=1}^{n} (1 - \Pi_l)^{m_l}} \qquad (3) \]
Since $P(X_{ij} = 1, \neg e) = P(\neg e \mid X_{ij} = 1) \cdot P(X_{ij} = 1) = 0$ and $P(\neg e \mid X_{ij} = 1) = 0$,
\[ P(X_{ij} = 1 \mid \neg e) = 0 \qquad (4) \]
\[ P(X_{ij} = 0 \mid \neg e) = 1 \qquad (5) \]

3 LIFTCOVER+
LIFTCOVER can learn very large sets of clauses that may overfit the data. For this reason, we introduce LIFTCOVER+, a modified version of LIFTCOVER that adds regularization to parameter learning and uses gradient descent instead of LBFGS to optimize the likelihood. Regularization is a well-known technique to prevent overfitting, in which a penalty term is added to the loss function to penalize large weights. In this way, we aim to obtain few clauses with large weights. Clauses with small weights have little influence on the probability of the query and can be removed, thus simplifying the theory. Regularization is usually performed in gradient-based algorithms, but it can be performed in EM as well, in the Maximization phase, where the parameters that maximize the LL are found. For EM, regularization can be Bayesian, L1, or L2. In Bayesian regularization, the parameters are updated assuming a prior distribution that takes the form of a Dirichlet probability density with parameters $[a, b]$. It has the same effect as having observed $a$ extra occurrences of $X_{ij} = 1$ and $b$ extra occurrences of $X_{ij} = 0$. If $b$ is much larger than $a$, this has the effect of shrinking the parameters. L1 and L2 differ in how they penalize the loss function: L1 adds the sum of the absolute values of the parameters to the loss function, while L2 adds the sum of their squares. The L1 objective function [14] is:

$$J_1(\theta) = N_1 \cdot \log\theta + N_0 \cdot \log(1-\theta) - \gamma\theta \tag{6}$$

where $\theta = \pi_i$, $N_0$ and $N_1$ are the expected occurrences of $X_{ij} = 0$ and $X_{ij} = 1$ computed in the Expectation step, and $\gamma$ is the regularization coefficient. The value of $\theta$ that maximizes $J_1$ is computed in the Maximization step by solving the equation $\partial J_1(\theta)/\partial\theta = 0$ [14]. $J_1(\theta)$ is maximum at

$$\theta_1 = \frac{4N_1}{2\left(\gamma + N_0 + N_1 + \sqrt{(N_0+N_1)^2 + \gamma^2 + 2\gamma(N_0-N_1)}\right)} \tag{7}$$
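The closed form in Eq. (7) can be recovered with a few lines of algebra. The derivation below is a sketch that uses only the definition of $J_1$ in Eq. (6); the shorthand $S$ and $D$ is introduced here for readability and does not appear in the original.

```latex
\begin{align*}
\frac{\partial J_1}{\partial \theta}
  = \frac{N_1}{\theta} - \frac{N_0}{1-\theta} - \gamma = 0
  \;&\Longleftrightarrow\;
  \gamma\theta^2 - (N_0 + N_1 + \gamma)\,\theta + N_1 = 0 \\[4pt]
\text{Let } S = N_0 + N_1 + \gamma, \quad
D &= S^2 - 4\gamma N_1 = (N_0+N_1)^2 + \gamma^2 + 2\gamma(N_0 - N_1) \\[4pt]
\theta_1 = \frac{S - \sqrt{D}}{2\gamma}
  &= \frac{S^2 - D}{2\gamma\,(S + \sqrt{D})}
  = \frac{4\gamma N_1}{2\gamma\,(S + \sqrt{D})}
  = \frac{4N_1}{2\,(S + \sqrt{D})}
\end{align*}
```

The smaller root is taken because the quadratic is positive at $\theta = 0$ (value $N_1$) and negative at $\theta = 1$ (value $-N_0$), so for $N_0, N_1 > 0$ it is the only root inside $(0, 1)$.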
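The L2 objective function [14] is:

$$J_2(\theta) = N_1 \cdot \log\theta + N_0 \cdot \log(1-\theta) - \frac{\gamma}{2}\theta^2 \tag{8}$$

and the value of $\theta$ that maximizes $J_2$ is:

$$\theta_2 = \frac{2\sqrt{\frac{3N_0+3N_1+\gamma}{\gamma}}\;\cos\left(\frac{\arccos\left(\sqrt{\frac{\gamma}{3N_0+3N_1+\gamma}}\;\frac{\frac{9N_0}{2}-9N_1+\gamma}{3N_0+3N_1+\gamma}\right)}{3} - \frac{2\pi}{3}\right)}{3} + \frac{1}{3} \tag{9}$$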
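To make the regularized Maximization step concrete, here is a minimal Python sketch of one EM iteration under the equations above. It is an illustration under stated assumptions, not the LIFTCOVER+ implementation: the function names and the toy counts are hypothetical, `theta_bayes` simply encodes the reading that the prior acts as $a$ extra observations of $X_{ij} = 1$ and $b$ extra observations of $X_{ij} = 0$, and the expected counts are accumulated from Eqs. (2)-(5).

```python
import math

def theta_bayes(n0, n1, a, b):
    """Bayesian update: prior counted as a extra X=1 and b extra X=0 (assumption)."""
    return (n1 + a) / (n0 + n1 + a + b)

def theta_l1(n0, n1, gamma):
    """Closed-form maximizer of J1, Eq. (7)."""
    disc = (n0 + n1) ** 2 + gamma ** 2 + 2 * gamma * (n0 - n1)
    return 4 * n1 / (2 * (gamma + n0 + n1 + math.sqrt(disc)))

def theta_l2(n0, n1, gamma):
    """Closed-form maximizer of J2, Eq. (9)."""
    s = 3 * n0 + 3 * n1 + gamma
    arg = math.sqrt(gamma / s) * (9 * n0 / 2 - 9 * n1 + gamma) / s
    arg = max(-1.0, min(1.0, arg))          # guard against rounding just outside [-1, 1]
    angle = math.acos(arg) / 3 - 2 * math.pi / 3
    return 2 * math.sqrt(s / gamma) * math.cos(angle) / 3 + 1 / 3

def expected_counts(pi, m_pos, m_neg):
    """E-step: expected numbers of X_ij = 0 (n0) and X_ij = 1 (n1) per clause."""
    n1 = [0.0] * len(pi)
    n0 = [float(m) for m in m_neg]          # Eq. (5): X_ij = 0 for negative examples
    for m_q in m_pos:
        p_e = 1.0 - math.prod((1.0 - p) ** m for p, m in zip(pi, m_q))
        for i, (p, m) in enumerate(zip(pi, m_q)):
            if m == 0:
                continue                     # clause i has no grounding for this example
            post = p / p_e                   # Eq. (2)
            n1[i] += m * post
            n0[i] += m * (1.0 - post)        # Eq. (3)
    return n0, n1

# Hypothetical toy problem: 2 clauses, 3 positive examples.
pi = [0.3, 0.6]
m_pos = [[1, 0], [2, 1], [0, 3]]
m_neg = [1, 2]
n0, n1 = expected_counts(pi, m_pos, m_neg)
new_pi_l1 = [theta_l1(a0, a1, 0.5) for a0, a1 in zip(n0, n1)]
new_pi_l2 = [theta_l2(a0, a1, 0.5) for a0, a1 in zip(n0, n1)]
print(new_pi_l1, new_pi_l2)
```

With equal counts $N_0 = N_1$ the unregularized estimate $N_1/(N_0+N_1)$ is 0.5, and all three regularized updates return a smaller value (with $b > a$ in the Bayesian case), which is the shrinking effect described above.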
In LIFTCOVER+, LBFGS is replaced by regularized gradient descent. The objective function is the sum of the cross-entropy errors $err_i$ over all the examples:

$$err = \sum_{i=1}^{Q+R}\left(-y_i \log P(e_i) - (1-y_i)\log(1-P(e_i))\right) \tag{10}$$

where $Q + R$ is the total number of examples, $e_i$ is an example, and $y_i$ is its sign, so $y_i$ equals 1 (0) if the example is positive (negative). L1 regularization can then be applied to the loss function to be minimized [14]:

$$err_{L1} = \sum_{i=1}^{Q+R}\left(-y_i \cdot \log P(e_i) - (1-y_i) \cdot \log(1-P(e_i))\right) + \gamma\sum_{i=1}^{k}|\pi_i| \tag{11}$$

where $k$ is the number of parameters and the $\pi_i$ are the probabilities of the clauses. After learning the parameters, all the clauses with a probability below a fixed threshold are removed.
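The L1-regularized objective of Eq. (11) and one plain gradient-descent update can be sketched as follows. This is an assumption-laden illustration rather than the LIFTCOVER+ code: the probability of an example is taken to be $P(e) = 1 - \prod_l (1-\pi_l)^{m_l}$ as in the unfolded likelihood, the names (`loss_l1`, `grad_l1`, `gd_step`) and the toy data are hypothetical, and parameters are clipped to stay inside $(0, 1)$.

```python
import math

def prob(pi, m):
    """P(e) = 1 - prod_l (1 - pi_l)^m_l for an example with grounding counts m."""
    return 1.0 - math.prod((1.0 - p) ** k for p, k in zip(pi, m))

def loss_l1(pi, data, gamma):
    """Eq. (11): summed cross entropy plus gamma * sum_i |pi_i|."""
    err = 0.0
    for m, y in data:
        p_e = prob(pi, m)
        err += -math.log(p_e) if y == 1 else -math.log(1.0 - p_e)
    return err + gamma * sum(abs(p) for p in pi)

def grad_l1(pi, data, gamma):
    """Gradient of loss_l1 with respect to each pi_j (for pi_j != 0)."""
    grad = [gamma * (1.0 if p > 0 else -1.0 if p < 0 else 0.0) for p in pi]
    for m, y in data:
        p_e = prob(pi, m)
        for j, (p, k) in enumerate(zip(pi, m)):
            d = k * (1.0 - p_e) / (1.0 - p)            # dP(e)/dpi_j
            grad[j] += -d / p_e if y == 1 else d / (1.0 - p_e)
    return grad

def gd_step(pi, data, gamma, eta, eps=1e-6):
    """One fixed-learning-rate update, clipped to (0, 1)."""
    g = grad_l1(pi, data, gamma)
    return [min(1.0 - eps, max(eps, p - eta * gj)) for p, gj in zip(pi, g)]

# Hypothetical toy data: (grounding counts per clause, label y).
data = [([1, 0], 1), ([2, 1], 1), ([0, 3], 0), ([1, 1], 0)]
pi = [0.3, 0.6]
before = loss_l1(pi, data, gamma=0.5)
pi_next = gd_step(pi, data, gamma=0.5, eta=0.01)
after = loss_l1(pi_next, data, gamma=0.5)
print(before, after)
```

The toy learning rate and regularization coefficient are chosen only so that a single step visibly lowers the loss; they are not the values used in the experiments.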
4
Experiments
The main goal of the experiments is to assess whether adding regularization to LIFTCOVER+ improves the quality of the solution. All experiments were conducted on a GNU/Linux machine with an Intel Core i3-10320 Quad Core 3.80 GHz CPU. We tested LIFTCOVER+ on 12 real-world datasets: UW-CSE [10] (a dataset that describes the Computer Science Department of the University of Washington, used to predict the fact that a student is advised by a professor), Mondial [22] (a dataset containing information regarding geographical regions of the world, such as population size, political system, and the country border relationship), Carcinogenesis [23] (a classic ILP benchmark dataset for Quantitative Structure-Activity Relationship (QSAR), i.e., predicting the biological activity of chemicals from their physicochemical properties or molecular structure; the goal is to predict the carcinogenicity of compounds from their chemical structure), Mutagenesis [24] (a classic ILP benchmark dataset for QSAR in which the goal is to predict the mutagenicity (a property correlated with carcinogenicity) of compounds from their chemical structure), Bupa (for diagnosing patients with liver disorders), Nba (for predicting the results of NBA basketball games), Pyrimidine and Triazine¹ (QSAR datasets for predicting the inhibition of dihydrofolate reductase by pyrimidines and triazines, respectively), Financial (for predicting the success of loan applications by clients of a bank), Sisya and Sisyb (datasets regarding insurance business clients, used to classify households and persons in relation to private life insurance), and Yeast (for predicting if a yeast gene codes for a protein involved in metabolism) from [25]². Table 1 shows the characteristics of the datasets.

Four different configurations of LIFTCOVER+ were compared: EM with Bayesian regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent with fixed learning rate η = 0.0001 and L1 regularization (GD). Hyper-parameters for Bayesian regularization were set as a = 0 and b equal to 15% of the total number of examples in the dataset. We set γ = 50 for L1 and L2 in EM, and γ = 10 for L1 in gradient descent. The parameters controlling structure learning are the following: NInt is the number of mega-examples on which to build the bottom clauses, NA is the number of bottom clauses to be built for each mega-example, NS is the number of saturation steps (for building the bottom clauses), NI is the maximum number of clause search iterations, NB is the size of the beam, NV is the maximum number of variables in a rule, and WMin is the minimum probability under which the rule is removed. Their values are listed in Table 2.

All the configurations were evaluated in terms of Area Under the Precision-Recall Curve (AUC-PR) and Area Under the Receiver Operating Characteristics Curve (AUC-ROC). Both were computed with the methods reported in [4,8]. LIFTCOVER+ was compared with LIFTCOVER-EM from [13].

¹ https://relational.fit.cvut.cz/
² https://dtai.cs.kuleuven.be/ACE/doc/

Algorithm 1. Function LIFTCOVER
function LIFTCOVER(NB, NI, NInt, NS, NA, NV)
    Beam ← InitialBeam(NInt, NS, NA)                              ▷ Bottom clauses building
    CC ← ∅
    Steps ← 1
    NewBeam ← []
    repeat
        Remove the first couple ((Cl, Literals), LL) from Beam   ▷ Remove the first clause
        Refs ← ClauseRefinements((Cl, Literals), NV)              ▷ Find all refinements Refs of (Cl, Literals)
        for all (Cl′, Literals′) ∈ Refs do
            (LL′, {Cl′}) ← LearnWeights(I, {Cl′})
            NewBeam ← Insert((Cl′, Literals′), LL′, NewBeam, NB)  ▷ The refinement is inserted in the beam in order of likelihood, possibly removing the last clause if the size of the beam NB is exceeded
            CC ← CC ∪ {Cl′}
        end for
        Beam ← NewBeam
        Steps ← Steps + 1
    until Steps > NI or Beam is empty
    (LL, Th) ← LearnWeights(CC)
    Remove from Th the clauses with a weight smaller than WMin
    return Th
end function

Table 1. Characteristics of the datasets for the experiments: number of predicates (P), of tuples (T) (i.e., ground atoms), of positive (PEx) and negative (NEx) examples for target predicate(s), of folds (F). The number of tuples includes the target positive examples.

Dataset      P    T        PEx     NEx     F
Financial    9    92658    34      223     10
Bupa         12   2781     145     200     5
Mondial      11   10985    572     616     5
Mutagen      20   15249    125     126     10
Sisyb        9    354507   3705    9229    10
Sisya        9    358839   10723   6544    10
Pyrimidine   29   2037     20      20      4
Yeast        12   53988    1299    5456    10
Nba          4    1218     15      15      5
Triazine     62   10079    20      20      4
UW-CSE       15   2673     113     20680   5
Carcinogen   36   24533    182     155     1
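The top-down beam search that drives the clause search can be sketched in a few lines of Python. This is a simplified illustration under explicit assumptions, not the LIFTCOVER implementation: clause bodies are tuples of literal names, `score` is a hypothetical stand-in for the log-likelihood returned by weight learning on a single clause, the whole beam is refined at every iteration, and literals are added in a fixed order so that each body is generated once.

```python
def beam_search(bottom_literals, score, beam_size, max_iters):
    """Return every refined clause body seen during the search with its score."""
    beam = [((), tuple(bottom_literals), score(()))]
    candidates = {}
    for _ in range(max_iters):
        if not beam:
            break
        new_beam = []
        for body, remaining, _ in beam:
            # Refinement operator: add one literal from the bottom clause to the body.
            for k, literal in enumerate(remaining):
                refined = body + (literal,)
                ll = score(refined)
                candidates[refined] = ll
                new_beam.append((refined, remaining[k + 1:], ll))
        # Keep only the beam_size best refinements, ordered by score.
        new_beam.sort(key=lambda entry: entry[2], reverse=True)
        beam = new_beam[:beam_size]
    return candidates

# Hypothetical scoring function: reward useful literals, penalize long bodies.
weights = {"a": 1.0, "b": 3.0, "c": 2.0}

def toy_score(body):
    return sum(weights[lit] for lit in body) - 1.5 * len(body) ** 2

result = beam_search(["a", "b", "c"], toy_score, beam_size=2, max_iters=3)
best = max(result, key=result.get)
print(best, result[best])
```

In this sketch the dictionary `result` plays the role of the clause set CC: in the algorithm above, all collected clauses would then be passed together to a final round of parameter learning, and those with a weight below WMin would be dropped.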
Table 2. Parameters controlling structure search for LIFTCOVER+.

Dataset      NB     NI   NInt   NS   NA   NV    WMin
Financial    100    20   16     1    1    4     1e-4
Bupa         100    20   4      1    1    4     1e-4
Mondial      1000   10   1      2    6    5     1e-4
Mutagen      100    10   4      1    1    4     1e-4
Sisyb        100    20   10     1    1    50    1e-4
Sisya        100    20   4      1    1    4     1e-4
Pyrimidine   100    20   4      1    1    100   1e-4
Yeast        100    20   12     1    1    4     1e-4
Nba          100    20   4      1    1    100   1e-4
Triazine     100    20   4      1    1    4     1e-4
UW-CSE       100    60   4      1    4    4     1e-4
Carcinogen   100    60   16     2    1    3     1e-4

Tables 3, 4, and 5 show the performance of the different configurations in terms of AUC-ROC, AUC-PR, and execution time, respectively, averaged over the folds. The results of LIFTCOVER-EM were taken from [13]. The execution times of LIFTCOVER+ were scaled (i.e., multiplied by 3.8/2.4) to make them comparable with those of LIFTCOVER-EM in [13], which were obtained on a machine with an Intel Xeon Haswell E5-2630 v3 (2.40 GHz) CPU. Figures 1, 2, and 3 show the histograms of the above-mentioned data.

LIFTCOVER+ performs slightly better than LIFTCOVER-EM in terms of AUC-PR on 6 datasets out of 12 with EM-Bayes and EM-L1, on 7 datasets with EM-L2, and on 3 datasets with GD. As a matter of fact, the average AUC-PR over all datasets is highest for LIFTCOVER+ with EM and L2 regularization, followed closely by Bayesian regularization. Results obtained with LIFTCOVER+ and GD were considerably worse on the Pyrimidine and Yeast datasets and were lower in almost all other cases. In particular, LIFTCOVER+ was able to significantly improve the performance on the Nba dataset, achieving an AUC-PR of 0.7 against 0.5 reached by LIFTCOVER-EM. Despite that, the Sisyb dataset seems to remain a challenge for LIFTCOVER+ (both with EM and GD).

Regarding AUC-ROC, LIFTCOVER+ beats LIFTCOVER-EM on 4 datasets out of 12 with EM-Bayes and EM-L1, on 3 datasets with EM-L2, and on 2 datasets with GD. In general, GD led to a deterioration of the solution in most cases, probably because the loss function is highly non-convex and GD ends up in local minima, while EM seems more capable of escaping local minima.

In terms of execution times, LIFTCOVER+ is comparable to LIFTCOVER-EM, although it was slower in some cases. This is especially true for GD, which on some datasets (Bupa, Mondial, Mutagenesis, Pyrimidine, Yeast, Triazine, Carcinogenesis) turns out to be slower by one or more orders of magnitude. However, it must be noted that the scaling approach we have used is only a rough approximation, as the architecture of the two processors is different and thus differences in caches and pipelining may have an effect. In the future, we plan to repeat the LIFTCOVER+ experiments on a machine more similar to the one used for LIFTCOVER-EM.
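For readers who want to reproduce the ranking metric, the area under the ROC curve can be computed from scores and labels as the fraction of positive/negative pairs that are ordered correctly, which equals the area obtained by trapezoidal integration of the ROC curve. The sketch below is a generic illustration with hypothetical names and toy data; it is not the evaluation code used for the experiments, and it does not cover AUC-PR, which requires the interpolation discussed in [4].

```python
def auc_roc(scores, labels):
    """AUC-ROC as the probability that a positive outranks a negative (ties count 1/2)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical toy predictions.
mixed = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
constant = auc_roc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
perfect = auc_roc([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0])
print(mixed, constant, perfect)
```

One property worth keeping in mind when reading the tables: any scorer that assigns the same value to every example obtains an AUC-ROC of exactly 0.5, so a value of 0.500 indicates no ranking ability on that dataset.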
5
Related Work
Lifted inference for PLP under the distribution semantics has been surveyed in [18], in which the authors describe and evaluate three different approaches, namely Lifted Probabilistic Logic Programming (LP²), lifted inference with aggregation parfactors, and Weighted First Order Model Counting (WFOMC). The authors of [9], instead, focused their survey on lifted graphical models. LIFTCOVER (and thus LIFTCOVER+) derives from SLIPCOVER [3], an algorithm for learning general PLP by searching the space of clauses and then greedily adding the refined clauses to the theory. Aside from the simplified structure search, LIFTCOVER and LIFTCOVER+ also differ from SLIPCOVER in the approach used for parameter learning. While SLIPCOVER uses EMBLEM [2] to learn the parameters of a probabilistic logic program by applying EM over Binary Decision Diagrams [1], LIFTCOVER and LIFTCOVER+ use EM, LBFGS, and gradient descent.
Table 3. Average AUC-ROC over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).

Dataset      EM-Bayes   EM-L1   EM-L2   GD      LIFTCOVER-EM
Financial    0.521      0.459   0.528   0.389   0.432
Bupa         1.000      1.000   1.000   1.000   1.000
Mondial      0.586      0.547   0.572   0.528   0.663
Mutagen      0.934      0.918   0.939   0.594   0.931
Sisyb        0.500      0.500   0.500   0.500   0.500
Sisya        0.720      0.720   0.720   0.720   0.372
Pyrimidine   0.975      0.880   0.910   0.160   1.000
Yeast        0.783      0.785   0.786   0.530   0.785
Nba          0.725      0.725   0.725   0.675   0.531
Triazine     0.425      0.390   0.430   0.580   0.713
UW-CSE       0.976      0.977   0.975   0.951   0.977
Carcinogen   0.720      0.692   0.687   0.500   0.766
Average      0.739      0.716   0.731   0.594   0.723
Table 4. Average AUC-PR over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).

Dataset      EM-Bayes   EM-L1   EM-L2   GD      LIFTCOVER-EM
Financial    0.155      0.127   0.169   0.124   0.126
Bupa         1.000      1.000   1.000   1.000   1.000
Mondial      0.717      0.739   0.743   0.712   0.763
Mutagen      0.966      0.964   0.971   0.759   0.971
Sisyb        0.286      0.286   0.286   0.286   0.286
Sisya        0.708      0.708   0.708   0.708   0.706
Pyrimidine   0.988      0.913   0.947   0.378   1.000
Yeast        0.499      0.486   0.497   0.242   0.502
Nba          0.789      0.789   0.789   0.743   0.550
Triazine     0.452      0.430   0.463   0.617   0.734
UW-CSE       0.341      0.358   0.339   0.158   0.220
Carcinogen   0.687      0.691   0.722   0.513   0.672
Average      0.632      0.624   0.636   0.520   0.628
Table 5. Average time in seconds over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).

Dataset      EM-Bayes   EM-L1    EM-L2    GD        LIFTCOVER-EM
Financial    0.280      0.275    0.278    2.360     0.235
Bupa         0.246      0.244    0.244    510.206   0.243
Mondial      6.480      4.841    5.149    299.875   5.911
Mutagen      15.073     10.796   15.653   266.196   12.770
Sisyb        0.232      0.231    0.233    0.490     0.226
Sisya        1.117      1.108    1.111    0.656     0.932
Pyrimidine   44.712     23.408   25.766   266.310   54.990
Yeast        60.143     52.503   54.288   994.811   0.502
Nba          0.658      0.705    0.624    0.913     0.599
Triazine     33.648     23.871   30.305   276.304   56.690
UW-CSE       74.163     72.961   74.456   75.656    8.054
Carcinogen   14.483     16.826   13.458   778.596   7.850
Average      20.936     17.314   18.464   289.364   12.417
Fig. 1. Histograms of average AUC-ROC over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).
Fig. 2. Histograms of average AUC-PR over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD).
Fig. 3. Histograms of average time in seconds over the datasets for each configuration: EM with Bayes regularization (EM-Bayes), EM with L1 regularization (EM-L1), EM with L2 regularization (EM-L2), and gradient descent (GD). The scale of the X axis is logarithmic.
Hierarchical PLP (HPLP) [14] is a restriction of the general PLP language in which clauses and predicates are hierarchically organized. HPLPs can be efficiently converted into arithmetic circuits (ACs) or deep neural networks so that inference is much cheaper than for general PLP. Liftable PLP can be seen as a restriction of HPLP. For this reason, LIFTCOVER+ is related to HPLP tools such as PHIL and SLEAHP [14]. PHIL performs parameter learning of hierarchical probabilistic logic programs using gradient descent (DPHIL) or EM (EMPHIL). First, it converts the program into a set of ACs sharing parameters. Then, it applies gradient descent or EM over the ACs, evaluating them bottom-up. On the other hand, SLEAHP learns both the structure and the parameters of HPLPs from data. It generates a large hierarchical logic program from an initial set of bottom clauses generated from a language bias [3]. Then, it applies a regularized version of PHIL to prune the initial large program by removing irrelevant rules, i.e., those for which the parameters are close to 0.

LIFTCOVER+ is also related to PROBFOIL+ [16], an algorithm that performs parameter and structure learning of ProbLog [6] programs with a hill-climbing search in the space of programs. The search consists of a covering loop that adds one rule to the theory at each iteration and stops when a condition based on a global scoring function is satisfied. The rule to add is obtained from a clause search loop that builds the rule by iteratively adding literals to the body, using a local scoring function as the heuristic.
6
Conclusions
In this paper, we have presented LIFTCOVER+, an updated version of LIFTCOVER that performs parameter learning using the EM algorithm or gradient descent with regularization to penalize large weights and prevent overfitting. Experiments were conducted on 12 real-world datasets and results were compared with LIFTCOVER-EM. In summary, we found that using gradient descent does not bring much benefit, having AUC-PR and AUC-ROC comparable to LIFTCOVER-EM, and execution times often much higher. On the other hand, using EM with regularization (and with L2 or Bayesian regularization especially) we obtain a higher AUC-PR on several datasets with roughly equal execution times. Furthermore, when there are no improvements, there is not a significant degradation in the quality of the solutions either. In conclusion, the present findings confirm that adding regularization can help improve the solution in terms of AUC-PR, although some datasets remain hard for LIFTCOVER+. As future work, we plan to employ LIFTCOVER+ to learn theories from Knowledge Graphs (KG) to perform KG completion and triple classification.

Acknowledgements. This work has been partially supported by the Spoke 1 “FutureHPC & BigData” of the Italian Research Center on High-Performance Computing, Big Data and Quantum Computing (ICSC) funded by MUR Missione 4 - Next Generation EU (NGEU), by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No. 952215, and by the “National Group of Computing Science (GNCS-INDAM)”.
References
1. Akers, S.B.: Binary decision diagrams. IEEE Trans. Comput. 27(6), 509–516 (1978)
2. Bellodi, E., Riguzzi, F.: Expectation maximization over binary decision diagrams for probabilistic logic programs. Intell. Data Anal. 17(2), 343–363 (2013)
3. Bellodi, E., Riguzzi, F.: Structure learning of probabilistic logic programs by searching the clause space. Theory Pract. Logic Program. 15(2), 169–212 (2015). https://doi.org/10.1017/S1471068413000689
4. Davis, J., Goadrich, M.: The relationship between precision-recall and ROC curves. In: European Conference on Machine Learning (ECML 2006), pp. 233–240. ACM (2006)
5. De Raedt, L., Kimmig, A.: Probabilistic (logic) programming concepts. Mach. Learn. 100(1), 5–47 (2015). https://doi.org/10.1007/s10994-015-5494-z
6. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: a probabilistic Prolog and its application in link discovery. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI 2007, pp. 2468–2473. Morgan Kaufmann Publishers Inc., San Francisco (2007)
7. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B 39, 1–38 (1977)
8. Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27, 861–874 (2006)
9. Kimmig, A., Mihalkova, L., Getoor, L.: Lifted graphical models: a survey. Mach. Learn. 99, 1–45 (2015)
10. Kok, S., Domingos, P.: Learning the structure of Markov Logic Networks. In: 22nd International Conference on Machine Learning, pp. 441–448. ACM (2005)
11. Mørk, S., Holmes, I.: Evaluating bacterial gene-finding HMM structures as probabilistic logic programs. Bioinformatics 28(5), 636–642 (2012)
12. Fadja, A.N., Riguzzi, F.: Probabilistic logic programming in action. In: Holzinger, A., Goebel, R., Ferri, M., Palade, V. (eds.) Towards Integrative Machine Learning and Knowledge Extraction. LNCS (LNAI), vol. 10344, pp. 89–116. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-69775-8_5
13. Nguembang Fadja, A., Riguzzi, F.: Lifted discriminative learning of probabilistic logic programs. Mach. Learn. 108(7), 1111–1135 (2019)
14. Nguembang Fadja, A., Riguzzi, F., Lamma, E.: Learning hierarchical probabilistic logic programs. Mach. Learn. 110(7), 1637–1693 (2021). https://doi.org/10.1007/s10994-021-06016-4
15. Poole, D.: First-order probabilistic inference. In: Gottlob, G., Walsh, T. (eds.) IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, 9–15 August 2003, pp. 985–991. Morgan Kaufmann Publishers (2003)
16. Raedt, L.D., Dries, A., Thon, I., den Broeck, G.V., Verbeke, M.: Inducing probabilistic relational rules from probabilistic examples. In: Yang, Q., Wooldridge, M. (eds.) 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), pp. 1835–1843. AAAI Press (2015)
17. Riguzzi, F.: Foundations of Probabilistic Logic Programming Languages, Semantics, Inference and Learning, 2nd edn. River Publishers, Gistrup (2023)
18. Riguzzi, F., Bellodi, E., Zese, R., Cota, G., Lamma, E.: A survey of lifted inference approaches for probabilistic logic programming under the distribution semantics. Int. J. Approx. Reason. 80, 313–333 (2017). https://doi.org/10.1016/j.ijar.2016.10.002
19. Riguzzi, F., Lamma, E., Alberti, M., Bellodi, E., Zese, R., Cota, G.: Probabilistic logic programming for natural language processing. In: Chesani, F., Mello, P., Milano, M. (eds.) Workshop on Deep Understanding and Reasoning, URANIA 2016. CEUR Workshop Proceedings, vol. 1802, pp. 30–37. Sun SITE Central Europe (2017)
20. Riguzzi, F., Swift, T.: Probabilistic logic programming under the distribution semantics. In: Kifer, M., Liu, Y.A. (eds.) Declarative Logic Programming: Theory, Systems, and Applications. Association for Computing Machinery and Morgan & Claypool (2018)
21. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: Sterling, L. (ed.) Logic Programming, Proceedings of the Twelfth International Conference on Logic Programming, Tokyo, Japan, 13–16 June 1995, pp. 715–729. MIT Press (1995). https://doi.org/10.7551/mitpress/4298.003.0069
22. Schulte, O., Khosravi, H.: Learning graphical models for relational data via lattice search. Mach. Learn. 88(3), 331–368 (2012)
23. Srinivasan, A., King, R.D., Muggleton, S., Sternberg, M.J.E.: Carcinogenesis predictions using ILP. In: Lavrac, N., Džeroski, S. (eds.) 7th International Workshop on Inductive Logic Programming. Lecture Notes in Computer Science, vol. 1297, pp. 273–287. Springer, Berlin Heidelberg (1997)
24. Srinivasan, A., Muggleton, S., Sternberg, M.J.E., King, R.D.: Theories for mutagenicity: a study in first-order and feature-based induction. Artif. Intell. 85(1–2), 277–299 (1996)
25. Struyf, J., Davis, J., Page, D.: An efficient approximation to lookahead in relational learners. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 775–782. Springer, Heidelberg (2006). https://doi.org/10.1007/11871842_79
26. Vennekens, J., Verbaeten, S., Bruynooghe, M.: Logic programs with annotated disjunctions. In: Demoen, B., Lifschitz, V. (eds.) ICLP 2004. LNCS, vol. 3132, pp. 431–445. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-27775-0_30
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Towards ILP-Based LTLf Passive Learning
Antonio Ielo¹(✉), Mark Law²,⁴, Valeria Fionda¹, Francesco Ricca¹, Giuseppe De Giacomo³, and Alessandra Russo⁴
¹ University of Calabria, Rende, Italy
[email protected]
² ILASP Limited, Grantham, UK
³ University of Oxford, Oxford, UK
⁴ Imperial College, London, UK

Abstract. Inferring an LTLf formula from a set of example traces, also known as passive learning, is a challenging task for model-based techniques. Despite the combinatorial nature of the problem, current state-of-the-art solutions are based on exhaustive search. They use one example at a time to discard a single candidate formula at a time, instead of exploiting the full set of examples to prune the search space. This hinders their applicability when examples involve many atomic propositions or when the target formula is not small. This short paper proposes the first ILP-based approach for learning LTLf formulae from a set of example traces, using a learning from answer sets system called ILASP. It compares it to both pure SAT-based techniques and the exhaustive search method. Preliminary experimental results show that our approach improves on previous SAT-based techniques and that it has the potential to overcome the limitation of an exhaustive search by optimizing over the full set of examples. Further research directions for the ILP-based LTLf passive learning problem are also discussed.

Keywords: Answer Set Programming · Linear temporal logic over finite traces · Learning from answer sets
1
Introduction
Linear Temporal Logic (LTL) [29] provides a concise, expressive, and human-interpretable language to specify and reason about the temporal behavior of systems. Over the years, it has been widely used in formal verification, model checking, and monitoring to ensure the correctness of software and hardware systems. Unlike LTL, whose specifications are interpreted over infinite sequences, LTLf [11,17] deals with properties that are evaluated on finite sequences or traces. LTLf was developed to deal with scenarios in which the standard semantics over infinite traces were not appropriate, such as when reasoning over business processes, which are typically finite [1]. However, developing manual specifications of system behaviors is becoming difficult due to the complexity, dynamic changes, and evolution of current systems. Automated approaches for inferring LTLf specifications from data are becoming increasingly important.

Passive learning of LTLf formulae [28] refers to the problem of inferring an LTLf formula from a set of system execution traces (some of which might be labeled as negative examples). The problem has been extensively studied in the literature [2,16,28,30], motivated by the need to learn human-interpretable models that explain the observed behavior of complex systems. Surprisingly, the problem has proven to be challenging for model-based techniques. In fact, despite its combinatorial nature, the current state of the art [16] consists of exhaustive-search-based systems that do not apply any pruning to reduce the search space, beyond efficiently evaluating formulae over traces. Although these approaches may be acceptable in certain domains, their reliance on exhaustive search poses limitations in domains with numerous atomic propositions or lengthy target formulae. Specifically, the search space of LTLf formulae grows exponentially both in the size of the formula being searched and in the number of atomic propositions, thereby preventing their applicability. Thus, investigating the use of techniques capable of exploiting multiple examples to prune the search space, such as ILP, is an important step towards solutions to the passive learning problem that are applicable to real-world problems.

In this paper we propose an approach, based on inductive logic programming (ILP), for learning LTLf formulae from positive and negative execution traces. Due to the combinatorial complexity of the problem, we make use of ILASP [24], a learning from answer sets system shown to generalize many of the existing ILP methods [23], and to be able to learn specifications expressed in answer set programming [24], a computational environment tailored to solve combinatorial optimization problems. ILP techniques have been applied to the task of learning Declare [1] models [6], a declarative process modeling language whose patterns are defined by LTLf formulae. However, to the best of our knowledge, this is the first ILP-based method for learning LTLf formulae from finite execution traces without imposing syntactic restrictions or limiting the search to a set of patterns. Other ILP-based works tackled how to include LTLf specifications among features used in learning algorithms [32], as well as applications to learning temporal logic programs [20].

Specifically, we propose a novel representation of the problem of passive learning of LTLf formulae in answer set programming [3]. We show how this representation can be adapted to express this problem as a learning from answer sets task and as a SAT-based inference task. We then conduct an experimental evaluation of our ILP-based approach over a set of event logs on cellular networks' attacks (already used in the literature to benchmark systems) and compare its performance with respect to the existing SAT-based methods. Our evaluation shows that our approach improves upon previous SAT- and SMT-based techniques, as implemented by the SySLite [2] system.

The contributions presented in this paper are the first stepping stone towards the development of ILP-based systems for LTLf passive learning that can be used in real-world settings. To this end, our future work will address the scalability of our proposed method to demonstrate its advantage versus exhaustive search methods, and to solve passive learning problems where execution traces refer also to structured properties of the data.

This work was partially supported by MUR under PRIN project PINPOINT Prot. 2020FNEB27, CUP H23C22000280006; PRIN project HypeKG Prot. 2022Y34XNM, CUP H53D23003710006; PNRR MUR project PE0000013-FAIR, Spoke 9 - WP9.1 and WP9.2; Spoke 5 - WP5.1 and PNRR project Tech4You, CUP H23C22000370006, ERC Advanced Grant WhiteMech (No. 834228), and EU ICT-48 2020 project TAILOR (No. 952215).
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 30–45, 2023. https://doi.org/10.1007/978-3-031-49299-0_3
2
Background
In this section, we introduce linear temporal logic over finite traces and present the basic notions and terminologies of Answer Set Programming (ASP) and Learning from Answer Sets (LAS) used throughout the paper.
2.1
Linear Temporal Logic over Finite Traces
Linear temporal logic (LTL) [29] is an extension of propositional logic which allows reasoning about time-related properties in sequences of events by means of temporal operators. Classically, LTL formulae are interpreted over infinite traces. When considering LTL on finite traces, the formalism LTLf [17] keeps the same syntax of standard LTL, but shifts its focus from infinite sequences of events to finite traces. We now define the syntax and semantics of LTLf.

Syntax. Let $P$ be a finite, non-empty set of propositional symbols. An LTLf formula is inductively defined according to the following grammar:

$$\varphi ::= true \mid false \mid p \mid \neg\varphi \mid \varphi \wedge \varphi \mid \varphi \vee \varphi \mid \mathsf{X}\varphi \mid \varphi\,\mathsf{U}\,\varphi$$

where $p \in P$ and $\varphi$ is an LTLf formula. The set $\{\wedge, \vee, \neg\}$ includes the standard conjunction, disjunction and negation operators of classical logic, while $\mathsf{X}$ and $\mathsf{U}$ denote respectively the next and until temporal operators. We assume the standard propositional logic equivalence rewriting that defines logical implication $\varphi \to \psi$ as $\neg\varphi \vee \psi$. Furthermore, we define the following derived temporal operators: (Weak Next) $\mathsf{X}_w\varphi \equiv \neg\mathsf{X}\neg\varphi$; (Eventually) $\mathsf{F}\varphi \equiv true\,\mathsf{U}\,\varphi$; (Release) $\varphi_1\,\mathsf{R}\,\varphi_2 \equiv \neg(\neg\varphi_1\,\mathsf{U}\,\neg\varphi_2)$; and (Always) $\mathsf{G}\varphi \equiv false\,\mathsf{R}\,\varphi$. The size of a formula $\varphi$, denoted by $|\varphi|$, is the total number of symbols (temporal operators, boolean connectives, and propositional symbols) included in $\varphi$. That is, $|\varphi| = 1$ if $\varphi \in P \cup \{true, false\}$; $|\circ\varphi| = 1 + |\varphi|$ if $\circ \in \{\neg, \mathsf{X}, \mathsf{X}_w, \mathsf{F}, \mathsf{G}\}$; and $|\varphi_1 \circ \varphi_2| = 1 + |\varphi_1| + |\varphi_2|$ if $\circ \in \{\wedge, \vee, \to, \mathsf{U}, \mathsf{R}\}$.

Semantics. A finite trace over propositional symbols in $P$ is a sequence $\pi = \pi_0 \cdots \pi_{n-1}$ of states, where each state $\pi_i \subseteq P$ is a set of propositional symbols that hold at time instant $i$. The length of a trace is the number of states over which it is defined, and it is indicated as $|\pi|$.
Towards ILP-Based LTLf Passive Learning
Given a formula ϕ and a trace π, we define that π satisfies ϕ at time instant i, denoted π, i |= ϕ, inductively as follows:
π, i |= p iff p ∈ π_i;
π, i |= ¬ϕ iff π, i ⊭ ϕ;
π, i |= ϕ1 ∧ ϕ2 iff π, i |= ϕ1 and π, i |= ϕ2;
π, i |= ϕ1 ∨ ϕ2 iff π, i |= ϕ1 or π, i |= ϕ2;
π, i |= Xϕ1 iff i < |π| − 1 and π, i + 1 |= ϕ1;
π, i |= ϕ1 U ϕ2 iff ∃j with i ≤ j < |π| s.t. π, j |= ϕ2 and ∀k with i ≤ k < j, π, k |= ϕ1.
Given a trace π and a formula ϕ, we say that π is a model of ϕ if π, 0 |= ϕ, denoted in brief as π |= ϕ. An LTLf formula is said to be in neXt Normal Form (xnf) [8,25] when all its occurrences of the U temporal operator are nested inside some X operator. Note that every LTLf formula can be transformed into an equivalent formula in xnf in linear time by recursively applying the following transformations:
• xnf(ϕ) = ϕ for ϕ ∈ P ∪ {true, false};
• xnf(¬ϕ) = ¬xnf(ϕ);
• xnf(ϕ1 ◦ ϕ2) = xnf(ϕ1) ◦ xnf(ϕ2) for ◦ ∈ {∧, ∨};
• xnf(X ϕ) = X xnf(ϕ);
• xnf(ϕ1 U ϕ2) = xnf(ϕ2) ∨ (xnf(ϕ1) ∧ X(ϕ1 U ϕ2)).
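The satisfaction relation and the xnf(·) rewriting above can be made concrete with a small sketch. The tuple-based formula representation and the function names `holds` and `xnf` are assumptions made here for illustration; they are not part of the paper's encoding. Traces are assumed to be non-empty lists of sets of propositional symbols.

```python
# Illustrative sketch (hypothetical representation, not the paper's encoding).
# Formulae are nested tuples: ("true",), ("false",), ("ap", "a"), ("not", f),
# ("and", f, g), ("or", f, g), ("X", f), ("U", f, g).

def holds(trace, i, f):
    """Return True iff trace, i |= f under the finite-trace semantics."""
    op = f[0]
    if op in ("true", "false"):
        return op == "true"
    if op == "ap":
        return f[1] in trace[i]
    if op == "not":
        return not holds(trace, i, f[1])
    if op == "and":
        return holds(trace, i, f[1]) and holds(trace, i, f[2])
    if op == "or":
        return holds(trace, i, f[1]) or holds(trace, i, f[2])
    if op == "X":
        # Next requires a successor state to exist.
        return i < len(trace) - 1 and holds(trace, i + 1, f[1])
    if op == "U":
        # Some j >= i satisfies the right operand, and the left holds before j.
        return any(
            holds(trace, j, f[2])
            and all(holds(trace, k, f[1]) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")


def xnf(f):
    """Apply the xnf rewriting rules listed above."""
    op = f[0]
    if op in ("true", "false", "ap"):
        return f
    if op == "not":
        return ("not", xnf(f[1]))
    if op in ("and", "or"):
        return (op, xnf(f[1]), xnf(f[2]))
    if op == "X":
        return ("X", xnf(f[1]))
    if op == "U":
        # One-step unfolding: the until itself is pushed under a next.
        return ("or", xnf(f[2]), ("and", xnf(f[1]), ("X", f)))
    raise ValueError(f"unknown operator: {op}")
```

On a given trace, `holds(trace, i, f)` and `holds(trace, i, xnf(f))` should agree at every position, which is a quick way to sanity-check the rewriting.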
2.2
Passive Learning of LTLf Formulae
The problem of passive learning of LTLf formulae, introduced in [28], is the problem of automatically inferring an LTLf formula from observed traces of system behavior, usually partitioned into sets of positive and negative examples, such that the positive traces are models of the formula and the negative traces are not. This can be formally defined as follows:
Definition 1 (PL_LTLf Passive Learning Task). Let P be a set of propositional symbols. A PL_LTLf passive learning task is a tuple PL_LTLf = (P, E+, E−) where E+ is a set of traces over P called positive traces, and E− is a set of traces over P called negative traces, such that E+ ∩ E− = ∅. A solution of a PL_LTLf task is an LTLf formula ϕ, written in P, such that (i) π |= ϕ for every π ∈ E+; and (ii) π ⊭ ϕ for every π ∈ E−.
Note that a PL_LTLf passive learning task always admits a trivial solution, given by the formula
$\phi = \bigvee_{\pi \in E^+} \bigwedge_{i \in [0,\dots,|\pi|-1]} X^i \big( \bigwedge_{p \in \pi_i} p \wedge \bigwedge_{p \notin \pi_i} \neg p \big) \wedge \neg X^{|\pi|}\, true$
where X^i denotes the nested application of the X operator i times. We therefore focus on optimal solutions of a PL_LTLf passive learning task, as defined below.
Definition 2 (Optimal solution of a PL_LTLf Passive Learning Task). Let PL_LTLf = (P, E+, E−) be a PL_LTLf passive learning task. An LTLf formula ϕ, written in P, is an optimal solution of PL_LTLf if and only if ϕ is a solution of PL_LTLf and there is no LTLf formula ϕ′ written in P that is a solution of PL_LTLf and |ϕ′| < |ϕ|.
Solving a PL_LTLf passive learning task means searching for an optimal (i.e. minimal-size) solution with respect to a fixed set of propositional symbols P.
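To illustrate Definitions 1 and 2, the following sketch searches for a minimal-size solution by plain enumeration of formulae of increasing size. This naive search is used here only for illustration and is not the method proposed in the paper; the tuple representation, the restriction to the operators {¬, X, ∧, ∨, U}, and the names `formulas`, `is_solution` and `learn` are assumptions. Traces are assumed to be non-empty.

```python
# Naive enumerative sketch of the passive learning task (illustration only).

def holds(trace, i, f):
    """Finite-trace satisfaction check for tuple-encoded formulae."""
    op = f[0]
    if op in ("true", "false"):
        return op == "true"
    if op == "ap":
        return f[1] in trace[i]
    if op == "not":
        return not holds(trace, i, f[1])
    if op == "and":
        return holds(trace, i, f[1]) and holds(trace, i, f[2])
    if op == "or":
        return holds(trace, i, f[1]) or holds(trace, i, f[2])
    if op == "X":
        return i < len(trace) - 1 and holds(trace, i + 1, f[1])
    # Remaining case: op == "U"
    return any(
        holds(trace, j, f[2]) and all(holds(trace, k, f[1]) for k in range(i, j))
        for j in range(i, len(trace))
    )


def size(f):
    """Number of symbols in f, following the definition of |ϕ|."""
    return 1 + sum(size(g) for g in f[1:] if isinstance(g, tuple))


def formulas(props, n):
    """Yield every formula over props with exactly n symbols."""
    if n == 1:
        yield ("true",)
        yield ("false",)
        for p in sorted(props):
            yield ("ap", p)
        return
    for sub in formulas(props, n - 1):
        yield ("not", sub)
        yield ("X", sub)
    for left in range(1, n - 1):
        for l in formulas(props, left):
            for r in formulas(props, n - 1 - left):
                for op in ("and", "or", "U"):
                    yield (op, l, r)


def is_solution(f, pos, neg):
    """Conditions (i) and (ii) of Definition 1."""
    return all(holds(t, 0, f) for t in pos) and not any(holds(t, 0, f) for t in neg)


def learn(props, pos, neg, max_size=5):
    """Return a solution of minimal size up to max_size, or None."""
    for n in range(1, max_size + 1):
        for f in formulas(props, n):
            if is_solution(f, pos, neg):
                return f
    return None
```

Because candidates are visited in order of increasing size, the first solution found is optimal in the sense of Definition 2, provided one exists within `max_size`.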
A. Ielo et al.
2.3
Answer Set Programming
Answer Set Programming (ASP) [3,15] is a knowledge representation formalism based on the stable model semantics of logic programs, which allows modeling, in a declarative way, problems up to the second level of the polynomial hierarchy. We recall some basic notions of ASP below and assume the reader is familiar with the input language of Clingo [14]. Typically, an ASP program includes four types of rules: normal rules, choice rules, hard constraints and soft constraints. In this paper, we consider ASP programs composed only of normal rules, choice rules, and hard constraints. Given atoms h, h1, ..., hk, b1, ..., bn, c1, ..., cm, a normal rule is of the form h :- b1, ..., bn, not c1, ..., not cm, with h as the head and b1, ..., bn, not c1, ..., not cm (collectively) as the body (“not” represents negation as failure); a constraint is a rule of the form :- b1, ..., bn, not c1, ..., not cm; and a choice rule is a rule of the form l{h1, ..., hk}u :- b1, ..., bn, not c1, ..., not cm, where l{h1, ..., hk}u is called an aggregate. In an aggregate, l and u are integers and hi, for 1 ≤ i ≤ k, are atoms. A choice rule specifies that, when the body is satisfied, at least l and no more than u atoms from the head must evaluate to true. Given an ASP program P, the Herbrand Base of P, denoted HB_P, is the set of ground (variable-free) atoms that can be formed from the predicates and constants that appear in P. Subsets of HB_P are (Herbrand) interpretations of P. Informally, a model of an ASP program P, called an Answer Set of P, is defined in terms of the reduct of P, which is constructed by applying the following transformation steps to the grounding of P.¹
Given a program P and a Herbrand interpretation I ⊆ HB_P, the reduct P^I is constructed from the grounding of P in four steps: first, we remove every rule whose body contains the negation of an atom in I; second, we remove all negative literals from the remaining rules; third, we set ⊥ (note that ⊥ ∉ HB_P) as the head of every constraint, and in every choice rule whose head is not satisfied by I we replace the head with ⊥; finally, we replace any remaining choice rule l{h1, ..., hm}u :- b1, ..., bn with the set of rules {hi :- b1, ..., bn | hi ∈ I ∩ {h1, ..., hm}}. Any I ⊆ HB_P is an answer set of P if it is the minimal model of the reduct P^I. We denote an answer set of a program P by A and the set of answer sets of P by AS(P). A program P is said to be satisfiable (resp. unsatisfiable) if AS(P) is non-empty (resp. empty).
2.4
Learning from Answer Sets
Many ILP systems learn from (positive and negative) examples of atoms which should be true or false, as many ILP systems are targeted at learning Prolog programs, where the main “output” of a program is the answer to a query on a single atom. In this paper, we make use of the Learning from Answer Sets (LAS) paradigm. In ASP, the main “output” of a program is a set of answer sets. Learning from Answer Sets therefore takes as (positive and negative) examples (partial) interpretations, which should or should not (respectively) be answer sets of the learned ASP program. A partial interpretation e is a pair of sets of atoms ⟨e^inc, e^exc⟩, referred to as the inclusions and exclusions respectively. An interpretation I is said to extend e if and only if e^inc ⊆ I and e^exc ∩ I = ∅. A context-dependent partial interpretation (CDPI) is a tuple e = ⟨e_pi, e_ctx⟩, where e_pi is a partial interpretation and e_ctx is an ASP program called a context. A CDPI e is accepted by a program P if and only if there is an answer set of P ∪ e_ctx that extends e_pi.
Many ILP systems (e.g. [19]) use mode declarations as a form of language bias to specify hypothesis spaces. We adopt a similar notion of language bias. A mode bias is defined as a pair of sets of mode declarations ⟨Mh, Mb⟩, where Mh (resp. Mb) are called the head (resp. body) mode declarations. Each mode declaration is a literal whose abstracted arguments are either var(t) or const(t), for some constant t (called a type). Informally, a literal is compatible with a mode declaration m if it can be constructed by replacing every instance of var(t) in m with a variable of type t, and every const(t) with a constant of type t.² Given a mode bias M = ⟨Mh, Mb⟩, a rule R is compatible with M if (i) the head of R is compatible with a mode declaration in Mh; (ii) each body literal of R is compatible with a mode declaration in Mb; and (iii) no variable occurs with two different types. We denote by S_M the set of rules compatible with a given language bias M = ⟨Mh, Mb⟩, and we refer to it as the hypothesis space S_M.
We can now define the notion of context-dependent Learning from Answer Sets. A task consists of an ASP background knowledge B, a hypothesis space, and sets of context-dependent positive and negative partial interpretation examples. The goal is to find a hypothesis H that, when combined with B, has at least one answer set that extends each positive example, and no answer set that extends any negative example.
¹ We use the simplified definitions of the reduct for choice rules presented in [22].
Note that each positive example could be extended by a different answer set of the learned program. This can be formally defined as follows.
Definition 3 (Context-dependent learning from answer sets). A Context-dependent Learning from Answer Sets task, denoted ILP^context_LAS, is a tuple T = ⟨B, S_M, E+, E−⟩ where B is an ASP program, S_M is a set of ASP rules, and E+ and E− are finite sets of CDPIs. A hypothesis H ⊆ S_M is an inductive solution of T if and only if (i) ∀e ∈ E+, B ∪ H accepts e; and (ii) ∀e ∈ E−, B ∪ H does not accept e.
It is common practice in ILP to search for “optimal” hypotheses. Optimality is usually defined in terms of the number of literals in the hypothesis. Given a hypothesis H, the length of the hypothesis, |H|, is the number of literals that appear in H.
Definition 4 (Optimal solution of ILP^context_LAS tasks). Let T be an ILP^context_LAS learning task. A hypothesis H is an optimal inductive solution of T if and only if H is an inductive solution of T, and there is no inductive solution H′ of T such that |H′| < |H|.
² The set of constants of each type is assumed to be given with a task, together with the maximum number of variables in a rule, giving a set of variables V1, ..., Vmax that can occur in a hypothesis. Whenever a variable V of type t occurs in a rule, the atom t(V) is added to the body of the rule to enforce the type.
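The notions of reduct, answer set, and acceptance of a CDPI can be sketched for very small ground programs as follows. This is a brute-force illustration only: it handles normal rules and constraints but not choice rules, and the rule representation as `(head, positive_body, negative_body)` triples and all function names are assumptions made here, not the interface of Clingo or ILASP.

```python
from itertools import combinations

BOTTOM = "#false"  # stands for ⊥, the head given to constraints in the reduct


def least_model(definite_rules):
    """Least model of a ground definite program given as (head, body_set) pairs."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, body in definite_rules:
            if body <= model and head not in model:
                model.add(head)
                changed = True
    return model


def answer_sets(rules):
    """Enumerate answer sets of ground rules (head, positive_body, negative_body)."""
    atoms = set()
    for head, pos, neg in rules:
        atoms.update(pos, neg)
        if head != BOTTOM:
            atoms.add(head)
    atoms = sorted(atoms)
    found = []
    for k in range(len(atoms) + 1):
        for candidate in combinations(atoms, k):
            interp = set(candidate)
            # Reduct: drop rules blocked by interp, strip negative literals.
            reduct = [(h, set(p)) for h, p, n in rules if not (interp & set(n))]
            # interp never contains BOTTOM, so violated constraints fail here.
            if least_model(reduct) == interp:
                found.append(interp)
    return found


def extends(interp, partial):
    """True iff interp extends the partial interpretation (inclusions, exclusions)."""
    inc, exc = partial
    return set(inc) <= interp and not (set(exc) & interp)


def accepts(program, cdpi):
    """True iff some answer set of program plus context extends the example."""
    partial, context = cdpi
    return any(extends(a, partial) for a in answer_sets(program + context))
```

For instance, the program `p :- not q. q :- not p.` has the two answer sets {p} and {q}, so it accepts the example with inclusions {p} and exclusions {q}; adding the context `:- p.` leaves only {q}.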
3
Formalizing LTLf Semantics in ASP
In this section, we present an encoding for evaluating LTLf formulae over traces, obtained by embedding LTLf semantics into a normal logic program. We present the encoding in three parts, addressing how to represent traces, formulae, and the evaluation rules of temporal logic operators in logic programs.
Encoding Traces. We assume traces to be uniquely indexed by integers; in particular, a trace π^i is referred to by the integer i. We encode a trace π^i over P as a set of facts over the predicates trace/3 and trace/2. The atom trace(i, t, a) models that a ∈ π^i_t. In order to model empty states, we introduce the atoms trace(i, t) for 0 ≤ t < |π^i|. Further information about the trace can be encoded by auxiliary predicates which refer to the trace identifier in the first term. In this paper, the only additional information to encode is whether each trace is a positive (π^i ∈ E+) or negative (π^i ∈ E−) example, which is done through the atoms pos(i) and neg(i) respectively. We denote by P(π) the set of facts that encode the trace π. With a slight abuse of notation, we also denote by P(E) the set of facts that encode the set of traces E, that is, P(E) = ⋃_{π^i ∈ E} P(π^i).
Example 1. Consider the trace π^0 = {a} · {a, b} · {}, and assume π^0 ∈ E+. This is encoded by the following facts:
trace(0,0,a). trace(0,1,a). trace(0,1,b).
trace(0,0). trace(0,1). trace(0,2).
pos(0).
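As a sketch of how the facts P(π) could be generated programmatically, the helper below turns a trace into the trace/3, trace/2 and pos/1 or neg/1 facts described above. The function name `encode_trace` and the representation of traces as lists of sets are assumptions made for illustration.

```python
def encode_trace(tid, trace, positive):
    """Return the facts P(π) for the trace with identifier tid.

    trace is a list of sets of propositional symbols; positive tells
    whether the trace belongs to E+ (True) or E− (False).
    """
    facts = []
    # trace(i, t, a): symbol a holds at instant t of trace i.
    for t, state in enumerate(trace):
        facts.extend(f"trace({tid},{t},{a})." for a in sorted(state))
    # trace(i, t): instant t exists, needed to represent empty states.
    facts.extend(f"trace({tid},{t})." for t in range(len(trace)))
    facts.append(f"pos({tid})." if positive else f"neg({tid}).")
    return facts
```

Calling `encode_trace(0, [{"a"}, {"a", "b"}, set()], True)` reproduces the facts of Example 1.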
Encoding Formulae. We encode a formula by reifying its syntax tree, in a similar way to what the authors of [28] do in SAT, by means of the predicates edge/2, order/3, label/2 and node/1. The predicates node/1 and edge/2 model trees in a natural way, where we use natural numbers to identify nodes. The atoms node(x) and edge(y, x) model that x is a node of the tree and that y is its parent. The predicate label/2 models the logic operator (or proposition) associated with each node in the tree. An atom label(x, j) encodes that the node x is labeled with j ∈ O ∪ P, where O is the set of available temporal and propositional logic operators. In this paper, we assume O = {¬, ∨, ∧, X, U, →, F, G}. The atom order(i, lhs, rhs) distinguishes between the left and right children of node i, which is needed for the evaluation of the non-commutative operators {U, →}. We denote by P(ϕ) the set of facts which encode a formula ϕ. Without loss of generality, we assume that the node identified by 1 is the root of a formula’s tree.
Example 2. Consider the formula (Xa) U b. This is encoded by the following facts:
node(1..4).
label(1, until). label(2, next). label(3, a). label(4,b).
edge(1,2). edge(1,4). edge(2,3).
order(1,2,4).
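The reification of a syntax tree into facts can be sketched as below. The nested-tuple input format and the name `encode_formula` are hypothetical; node identifiers are assigned in pre-order, which matches the numbering of Example 2 but is only one possible choice. The sketch emits an order/3 fact for every binary node, which is an assumption made for simplicity.

```python
def encode_formula(formula):
    """Return facts P(ϕ) for a formula given as nested tuples.

    Leaves are strings (propositional symbols); inner nodes are tuples
    (operator_name, child, ...), e.g. ("until", ("next", "a"), "b").
    """
    facts = []
    counter = 0

    def visit(f):
        nonlocal counter
        counter += 1
        node = counter  # pre-order numbering, the root gets identifier 1
        if isinstance(f, str):
            facts.append(f"label({node},{f}).")
            return node
        op, *args = f
        facts.append(f"label({node},{op}).")
        children = [visit(arg) for arg in args]
        for child in children:
            facts.append(f"edge({node},{child}).")
        if len(children) == 2:
            # Record which child is the left and which the right operand.
            facts.append(f"order({node},{children[0]},{children[1]}).")
        return node

    visit(formula)
    return [f"node(1..{counter})."] + facts
```

For the formula (Xa) U b, written as `("until", ("next", "a"), "b")`, the result contains the same facts as Example 2, up to ordering and whitespace.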
Encoding Semantics. We encode the semantics of each supported operator by simulating the recursive application of the xnf(·) transformation by means of normal recursive rules. In particular, each subformula is identified by the node identifier of its root in the syntax tree. The atom holds(i, t, x) models that π^i, t |= ϕ_x, where ϕ_x is the subformula of ϕ rooted in the node identified by integer x. The atom last(i, t) models that t = |π^i| − 1, that is, π^i_t is the last state of π^i. The definition of these rules, which we denote by PLTLf, follows the xnf(·) definitions in Sect. 2.1. The encoding for the operators in {∧, ∨, ¬, U, X, →, F, G}, denoted by the constants and, or, neg, until, next, implies, eventually and always respectively, is as follows:
% Propositional symbols
holds(TID, T, X) :- label(X, A), trace(TID, T, A).
% Next: the child must hold at the following instant, which must exist
holds(TID, T, X) :- label(X, next), edge(X, Y), holds(TID, T+1, Y), not last(TID, T).
% Until: one-step unfolding, as in xnf
holds(TID, T, X) :- label(X, until), order(X, LHS, RHS), holds(TID, T, RHS).
holds(TID, T, X) :- label(X, until), order(X, LHS, RHS), holds(TID, T, LHS), holds(TID, T+1, X).
% Boolean connectives
holds(TID, T, X) :- label(X, and), order(X, A, B), holds(TID, T, A), holds(TID, T, B).
holds(TID, T, X) :- label(X, or), edge(X, A), holds(TID, T, A).
holds(TID, T, X) :- label(X, neg), edge(X, Y), not holds(TID, T, Y), trace(TID, T).
holds(TID, T, X) :- label(X, implies), order(X, LHS, RHS), holds(TID, T, RHS), holds(TID, T, LHS).
holds(TID, T, X) :- label(X, implies), order(X, LHS, RHS), not holds(TID, T, LHS), trace(TID, T).
% Eventually and always, unfolded one step at a time
holds(TID, T, X) :- label(X, eventually), edge(X, Y), holds(TID, T, Y).
holds(TID, T, X) :- label(X, eventually), holds(TID, T+1, X), trace(TID, T).
holds(TID, T, X) :- label(X, always), edge(X, Y), holds(TID, T, Y), last(TID, T).
holds(TID, T, X) :- label(X, always), edge(X, Y), holds(TID, T, Y), holds(TID, T+1, X).
% Last instant of a trace and satisfaction at the root
last(TID, T) :- trace(TID, T), not trace(TID, T+1).
sat(TID) :- holds(TID, 0, 1).
unsat(TID) :- not sat(TID), trace(TID, _).
Listing 1.1. The logic program PLTLf
For values t′ > t of the second term of the atom holds/3, it is possible to represent subsequent instants of each xnf(·) formula. To evaluate xnf formulae, it is sufficient to evaluate the current state and the next state of the trace. Rules of this kind produce a locally stratified program (i.e., the resulting ground instantiation is stratified) [5], since whenever holds(i, t, _) is in the head of a rule, the body of the rule can contain only atoms holds(i, t, _) or holds(i, t + 1, _). Thus PLTLf, when combined with the subprograms encoding traces and formulae (which consist only of facts), has a unique answer set [7]. In particular, by observing that the rules implement the recursive application of xnf(·), which yields a formula equivalent to ϕ, it can be proved that π^i |= ϕ (resp. π^i ⊭ ϕ) if holds(i, 0, 1) is (resp. is not) in the unique answer set of P(π^i) ∪ P(ϕ) ∪ PLTLf.
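The observation that holds(i, t, _) depends only on atoms at instants t and t + 1 suggests a simple evaluation order: fill a table from the last instant backwards, visiting children before parents. The sketch below mirrors the rules of PLTLf in that order. It is a Python illustration, not the ASP program itself; the dictionary-based tree representation and the name `evaluate` are assumptions, as is the requirement that every child has a larger identifier than its parent and that the trace is non-empty.

```python
def evaluate(trace, nodes, root=1):
    """Evaluate a reified formula on a non-empty trace, backwards in time.

    nodes maps a node identifier to (label, list_of_child_identifiers);
    binary nodes list the left child first. Children are assumed to have
    larger identifiers than their parent.
    """
    n = len(trace)
    holds = {}
    for t in range(n - 1, -1, -1):
        last = t == n - 1
        # Descending identifiers: children are evaluated before parents.
        for x in sorted(nodes, reverse=True):
            label, kids = nodes[x]

            def now(y):
                return holds[(t, y)]

            def nxt(y):
                # Mirrors holds(TID, T+1, _), which needs instant T+1 to exist.
                return (not last) and holds[(t + 1, y)]

            if label == "next":
                value = nxt(kids[0])
            elif label == "until":
                value = now(kids[1]) or (now(kids[0]) and nxt(x))
            elif label == "and":
                value = now(kids[0]) and now(kids[1])
            elif label == "or":
                value = now(kids[0]) or now(kids[1])
            elif label == "neg":
                value = not now(kids[0])
            elif label == "implies":
                value = (not now(kids[0])) or now(kids[1])
            elif label == "eventually":
                value = now(kids[0]) or nxt(x)
            elif label == "always":
                value = now(kids[0]) and (last or nxt(x))
            else:
                # Any other label is treated as a propositional symbol.
                value = label in trace[t]
            holds[(t, x)] = value
    return bool(holds[(0, root)])
```

Each table entry is computed once, so the cost is proportional to the trace length times the number of nodes.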
4
LTLf Passive Learning in Plain ASP
A first way to model the passive learning problem is to frame it as an abduction problem in ASP [9], where the set of abducibles corresponds to facts over the predicates node/1, edge/2 and label/2, which reify the syntax tree of an LTLf formula. The goal of the abduction task is to find an LTLf formula ϕ such that e |= ϕ for all e ∈ E+ and e ⊭ ϕ for all e ∈ E−. The following rules, denoted Ptree and Plabel respectively, encode the abduction of an LTLf formula of size n:
node(1..n).
pair(X,Y) :- node(X), node(Y), X < Y.
1 { edge(Y,X): pair(Y,X) } 1 :- node(X), X > 1.
reach(1).
reach(X) :- edge(Y,X), reach(Y).
:- node(X), not reach(X).
id(1,(0,0)).
id(V,(U,V*V+U)) :- edge(U,V).
:- id(I,RI), id(I+1,RJ), RI >= RJ.
:- id(I,RJ), id(I+1,RI), RI = 1.
% The syntax tree of the formula must be connected
reach(1).
reach(T) :- edge(R,T), reach(R).
:- node(X), not reach(X).
:- node(X), not edge(_,X), X > 1.
% Bounded fan-out for logic operators
:- node(X), 3 #count { Z: edge(X,Z) }.
% Exactly one label per node
:- node(X), not label(X,_).
:- label(X,A), label(X,B), A < B.
% Syntax tree admits a BFS-indexing
id(1,(0,0)).
id(V,(U,V*V+U)) :- edge(U,V).
:- id(I,RI), id(I+1, RJ), RI >= RJ.
:- id(I+1,RI), id(I,RJ), RI
Table 1. … 3600 s).
Event Log   SySLite(sygus)   ILASP 2i   Abduction   SySLite(sat)   SySLite(guided sat)
AKA         144.2            62.542     612.31      T.O.           T.O.
AF          1.98             3.32       4.705       360.7          T.O.
BT          2.17             1.437      13.239      659.23         133.19
CWP         22.01            7.739      148.891     T.O.           T.O.
DSP         T.O.             T.O.       T.O.        T.O.           T.O.
EMM         4.64             5.78       7.207       1155.18        T.O.
FI          51.59            19.865     510.189     T.O.           T.O.
GLBA        26.86            13.606     183.542     T.O.           T.O.
HIPPA-1     25.96            31.994     194.694     T.O.           T.O.
HIPPA-2     2.41             1.693      15.962      707.43         153.41
IM          4.04             4.483      5.837       977.77         T.O.
IMSI-1      3.72             3.584      4.877       774.34         T.O.
IMSI-2      6.89             4.782      8.092       1144.58        T.O.
MR          995.85           55.622     T.O.        T.O.           T.O.
NE          2.43             4.301      4.706       480.18         T.O.
NA          2.54             2.545      2.8         1877.79        T.O.
IMSI-3      16.55            11.652     98.643      T.O.           T.O.
RLF         560.44           23.266     T.O.        T.O.           T.O.
6
Evaluation
This section reports an experimental evaluation that aims at assessing both ASP-based approaches presented in the previous section, and at comparing them with existing solutions based on SAT and SMT. In the experimental evaluation we used the event logs pertaining to the passive learning of attack signatures on cellular networks, and partitioned each event log into positive and negative traces. Signatures are formulae of the kind Gϕ, which characterize the positive traces of each log at each time instant. A comprehensive description of each log is available in [2] and its technical report. We compare our ILASP-based solution with our plain ASP solution in order to assess whether ILP can help in this setting, as well as with other SAT-based approaches previously implemented in the literature, referring to their implementation in the SySLite system. Experimental data and full encodings are available on GitHub (https://github.com/ilp2023-27/data).
Execution Environment. All experiments were executed on a machine with an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30 GHz and 512 GB of RAM, using Clingo version 5.4.0, Python 3.10, ILASP 4.2.0 and the version of SySLite available in the authors’ repository. All experiments were run in parallel using GNU Parallel [33]. We report execution time in seconds, with a timeout of 3600 s on each event log. Currently, the most appropriate ILASP version for the task is ILASP 2i, which is the one we use to run the experiments.
Data. The dataset is composed of 18 logs: AKA Bypass (AKA), Authentication Failure (AF), Bank Transaction (BT), Chinese Wall Policy (CWP), Dynamic Separation Policy (DSP), EMM Information (EMM), Financial Institute (FI), GLBA, HIPPA 16450A2 (HIPPA-1), HIPPA 16450A3 (HIPPA-2), Identity Malformed (IM), IMSI Catching (IMSI-1), IMSI Cracking (IMSI-2), Measurement Report (MR), NULL Encryption (NE), Numb Attack (NA), Paging with IMSI (IMSI-3), RLF Report (RLF). We considered each log as an instance of the passive learning problem. Since SySLite algorithms target pure-past formulae, we reverse each trace in the log in order to use our encodings and the samples2ltl tool.
In this way, all approaches are able to learn the same formulae up to dual relabeling of the temporal operators involved. We indicate by SySLite_L, L ∈ {sygus, sat, sat guided}, the different algorithms available in SySLite, which implement SMT- and SAT-based algorithms. In particular, the sygus algorithm is SMT-based, exploiting bit-vector theories for efficient computations. The ILASP 2i column refers to our ILASP encoding, and the Abduction column refers to our (incremental) plain ASP encoding that uses abduction. Since the SySLite tool targets pure-past formulae rooted in the historically operator (the pure-past dual of G), we (i) invert the traces before defining our LAS and ASP encoding; and (ii) add to our encoding the constraint that target formulae must be rooted in G. Since ILASP currently does not support incremental solving (with respect to the definition of the hypothesis space), but rather solves a complexity-wise harder optimization task, we assume the maximum size of the target formula is known beforehand. All systems support the same set of temporal logic operators {X, U, F, G, ∧, ∨, ¬}. We run the different algorithms available in SySLite with a timeout of one hour, along with ILASP and our abductive encoding, on a suite of event logs. The solution based on ILASP compares favorably, as shown in Table 1, with the algorithms implemented in SySLite, and even outperforms them on some event logs. Our plain ASP solution is noticeably slower than both the sygus SMT-based algorithm and ILASP. This suggests that ILP based on ILASP might be a viable approach to scale beyond current model-based techniques without over-relying on pure enumerative approaches.
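The effect of reversing traces can be illustrated with a small sketch: a pure-past operator evaluated at the last instant of a trace agrees with its future-time dual evaluated at the first instant of the reversed trace. The helper names below are hypothetical and the operators are applied to a plain state predicate rather than to arbitrary subformulae; the exact reversal procedure used in the experiments is not detailed here, so this only sketches the duality under those assumptions.

```python
def always(trace, i, pred):
    """G at instant i: pred holds in every state from i onwards."""
    return all(pred(state) for state in trace[i:])


def historically(trace, i, pred):
    """Past dual of G at instant i: pred holds in every state up to i."""
    return all(pred(state) for state in trace[: i + 1])


def next_state(trace, i, pred):
    """X at instant i: a successor exists and satisfies pred."""
    return i < len(trace) - 1 and pred(trace[i + 1])


def yesterday(trace, i, pred):
    """Past dual of X at instant i: a predecessor exists and satisfies pred."""
    return i > 0 and pred(trace[i - 1])


def reverse_trace(trace):
    """Return the trace with its states in reverse order."""
    return list(reversed(trace))
```

For a trace `t` with last index `k`, `historically(t, k, p)` equals `always(reverse_trace(t), 0, p)`, and `yesterday(t, k, p)` equals `next_state(reverse_trace(t), 0, p)`.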
7
Related Works
The seminal work [28] defines two algorithms for the passive learning of LTLf formulae. The first introduces SAT solvers as practical tools for LTLf passive learning, encoding the syntax trees of formulae and their evaluation over traces as a satisfiability problem; the second exploits a decision tree to propositionally combine smaller LTLf formulae, improving scalability but giving up the “optimality” of the solution (in terms of formula size). Another approach [30] improves scalability by targeting the directed fragment of LTLf, which however is less expressive than LTLf, as it cannot express the until temporal operator. Other SAT-based works target equivalent formalisms (such as alternating automata [4]) that can then be translated into LTLf formulae. The SySLite [2] system targets pure-past LTLf formulae of the form Hϕ, where H is the past version of the operator G. It implements different SAT-based algorithms (the ones in [31]) as well as an SMT-based syntax-guided synthesis [31] enumeration, which exploits bit-vector theories for fast evaluation. Recently, an approach based solely on a highly optimized exhaustive search has been proposed [16]; it enumerates formulae of increasing size and prunes them using syntactic and semantic criteria, one trace at a time. A direct comparison with [16] could not be carried out, since the tool does not expose an API to force the learning of specific formulae, e.g. formulae starting with G. A comprehensive empirical comparison in the specific case of learning signatures would therefore require a modification of the tool of [16].
From a theoretical standpoint, regarding computational complexity, the authors of [10] identify multiple fragments of LTL for which the passive learning problem is already NP-complete; regarding sample complexity (i.e., how many examples are required to guarantee that a given formula is learned), it is known [4] that passive learning of arbitrary LTLf formulae can be done with an exponential number of examples under some conditions.
8
Conclusion
In this paper, we presented an ILP approach based on the ILASP system for the passive learning of LTLf formulae. Our approach embeds LTLf semantics into a normal logic program, similarly to previous works based on SAT, and provides it as the background knowledge. We outperform SAT-based techniques as implemented in SySLite and compare favorably against its best-performing SMT-based syntax-guided enumeration algorithm. We also implement an abduction-based algorithm based on ASP, which indicates that our performance gains are due to ILASP’s inductive loop rather than to the use of plain ASP in place of a SAT or SMT encoding. As future work, we plan to improve the scalability of our proposed method and extend it to take into account data attached to events that occur during the system’s execution. A comparison with the approach of [16] is also in our plans, to assess whether there is an advantage over exhaustive search methods. Another extension we are interested in, which would broaden the applicability of passive learning in real-world settings, is to apply our techniques to noisy domains [13,27] (where traces or their labels might contain errors) by exploiting ILASP’s support for example penalties and ASP optimization techniques. It would also be interesting to check whether a compilation-based ASP system [26] can improve the performance of the abductive approach, where we conjecture that the number of symbols generated by evaluating candidate solutions over the event log is one of the causes of performance degradation.
References
1. van der Aalst, W.M.P., Pesic, M., Schonenberg, H.: Declarative workflows: balancing between flexibility and support. Comput. Sci. Res. Dev. 23(2), 99–113 (2009)
2. Arif, M.F., Larraz, D., Echeverria, M., Reynolds, A., Chowdhury, O., Tinelli, C.: SYSLITE: syntax-guided synthesis of PLTL formulas from finite traces. In: FMCAD, pp. 93–103 (2020)
3. Brewka, G., Eiter, T., Truszczynski, M.: Answer set programming at a glance. Commun. ACM 54(12), 92–103 (2011)
4. Camacho, A., McIlraith, S.A.: Learning interpretable models expressed in linear temporal logic. In: ICAPS, pp. 621–630 (2019)
5. Ceri, S., Gottlob, G., Tanca, L.: Logic Programming and Databases. Surveys in Computer Science, Springer, Heidelberg (1990). https://doi.org/10.1007/978-3-642-83952-8
6. Chesani, F., Lamma, E., Mello, P., Montali, M., Riguzzi, F., Storari, S.: Exploiting inductive logic programming techniques for declarative process mining. Trans. Petri Nets Other Model. Concurr. 2, 278–295 (2009)
7. Dantsin, E., Eiter, T., Gottlob, G., Voronkov, A.: Complexity and expressive power of logic programming. ACM Comput. Surv. 33(3), 374–425 (2001)
8. Dodaro, C., Fionda, V., Greco, G.: LTL on weighted finite traces: formal foundations and algorithms. In: IJCAI, pp. 2606–2612 (2022)
9. Eiter, T., Gottlob, G., Leone, N.: Abduction from logic programs: semantics and complexity. Theor. Comput. Sci. 189(1–2), 129–177 (1997)
10. Fijalkow, N., Lagarde, G.: The complexity of learning linear temporal formulas from examples. In: ICGI, pp. 237–250 (2021)
11. Fionda, V., Greco, G.: LTL on finite and process traces: complexity results and a practical reasoner. J. Artif. Intell. Res. 63, 557–623 (2018)
12. Furelos-Blanco, D., Law, M., Jonsson, A., Broda, K., Russo, A.: Induction and exploitation of subgoal automata for reinforcement learning. J. Artif. Intell. Res. 70, 1031–1116 (2021)
13. Gaglione, J., Neider, D., Roy, R., Topcu, U., Xu, Z.: MaxSAT-based temporal logic inference from noisy data. Innov. Syst. Softw. Eng. 18(3), 427–442 (2022)
14. Gebser, M., Kaminski, R., Kaufmann, B., Schaub, T.: Answer Set Solving in Practice. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan & Claypool Publishers, San Rafael (2012)
15. Gelfond, M., Lifschitz, V.: Classical negation in logic programs and disjunctive databases. New Gener. Comput. 9(3/4), 365–386 (1991)
16. Ghiorzi, E., Colledanchise, M., Piquet, G., Bernagozzi, S., Tacchella, A., Natale, L.: Learning linear temporal properties for autonomous robotic systems. IEEE Rob. Autom. Lett. 8(5), 2930–2937 (2023)
17. Giacomo, G.D., Vardi, M.Y.: Linear temporal logic and linear dynamic logic on finite traces. In: IJCAI, pp. 854–860. IJCAI/AAAI (2013)
18. Kaminski, R., Romero, J., Schaub, T., Wanko, P.: How to build your own ASP-based system?! Theory Pract. Log. Program. 23(1), 299–361 (2023)
19. Kazmi, M., Schüller, P., Saygın, Y.: Improving scalability of inductive logic programming via pruning and best-effort optimisation. Expert Syst. Appl. 87, 291–303 (2017)
20. Kolter, R.: Inductive temporal logic programming. Ph.D. thesis, University of Kaiserslautern (2009)
21. Law, M., Russo, A., Broda, K.: The ILASP system for learning answer set programs (2015). https://www.ilasp.com/
22. Law, M., Russo, A., Broda, K.: Simplified reduct for choice rules in ASP. Technical report, Department of Computing (DTR2015-2), Imperial College London (2015)
23. Law, M., Russo, A., Broda, K.: The complexity and generality of learning answer set programs. Artif. Intell. 259, 110–146 (2018)
24. Law, M., Russo, A., Broda, K.: Logic-based learning of answer set programs. In: Reasoning Web, pp. 196–231 (2019)
25. Li, J., Pu, G., Zhang, Y., Vardi, M.Y., Rozier, K.Y.: SAT-based explicit LTLf satisfiability checking. Artif. Intell. 289, 103369 (2020)
26. Mazzotta, G., Ricca, F., Dodaro, C.: Compilation of aggregates in ASP systems. In: AAAI, pp. 5834–5841. AAAI Press (2022)
27. Mrowca, A., Nocker, M., Steinhorst, S., Günnemann, S.: Learning temporal specifications from imperfect traces using Bayesian inference. In: DAC, p. 96. ACM (2019)
28. Neider, D., Gavran, I.: Learning linear temporal properties. In: FMCAD, pp. 1–10 (2018)
29. Pnueli, A.: The temporal logic of programs. In: FOCS, pp. 46–57. IEEE Computer Society (1977)
30. Raha, R., Roy, R., Fijalkow, N., Neider, D.: Scalable anytime algorithms for learning fragments of linear temporal logic. In: TACAS 2022. LNCS, vol. 13243, pp. 263–280. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99524-9_14
31. Reynolds, A., Barbosa, H., Nötzli, A., Barrett, C.W., Tinelli, C.: cvc4sy: smart and fast term enumeration for syntax-guided synthesis. In: CAV, pp. 74–83 (2019)
32. Ribeiro, T., Folschette, M., Magnin, M., Okazaki, K., Kuo-Yen, L., Inoue, K.: Diagnosis of event sequences with LFIT. In: The 31st International Conference on Inductive Logic Programming (ILP) (2022)
33. Tange, O.: GNU parallel: the command-line power tool. Login Usenix Mag. 36(1) (2011)
Learning Strategies of Inductive Logic Programming Using Reinforcement Learning
Takeru Isobe1,2(B) and Katsumi Inoue1,2
1 Graduate Institute for Advanced Studies, SOKENDAI, Tokyo, Japan
2 National Institute of Informatics, Tokyo, Japan
{isobe,inoue}@nii.ac.jp
Abstract. Learning settings are crucial for most Inductive Logic Programming (ILP) systems to learn efficiently. Hypothesis spaces can be huge, and ILP systems may take a long time to output solutions or may even fail to terminate within time limits. Therefore, users must choose suitable learning settings for each ILP task to get the best performance from the system. However, most users struggle to choose appropriate settings for a task they see for the first time. In this paper, we propose a method to make an ILP system more adaptable to tasks with weak learning biases. In particular, we attempt to learn efficient strategies for an ILP system using reinforcement learning (RL). We use Popper, a state-of-the-art ILP system that implements the concept of learning from failures (LFF). We introduce RL-Popper, which divides the hypothesis space into finer subspaces than Popper does. RL is used to learn an efficient search order over the divided spaces. We provide the details of RL-Popper and show some empirical results.
Keywords: Inductive Logic Programming · Meta Learning · Learning From Failures · Reinforcement Learning
1
Introduction
Efficient learning strategies are important for Inductive Logic Programming (ILP) [14] systems to learn a target program within a time limit. Many ILP systems have been developed, and they differ in several aspects such as learning settings, representation languages, learning biases, and search strategies. For most systems, exploring a large hypothesis space is a notoriously challenging part of inductive learning. Although most approaches use fixed, predefined learning strategies, we aim to develop a method that learns learning strategies in multi-task learning.
Meta Learning, i.e., learning to learn, is a general approach in the field of Machine Learning [9,22]. The goal of Meta Learning is to learn good strategies for learning a target task by learning other tasks beforehand. The idea of Meta Learning can be used in ILP because the goal of ILP is to learn a target logic program. In this paper, we propose a method to improve ILP systems using reinforcement learning (RL) [21]. We focus on Popper as the target ILP system and introduce RL-Popper.
Popper [3] is a state-of-the-art ILP system that supports recursion and predicate invention (PI) without the metarules used in meta interpretive learning (MIL) [17]. Popper implements the concept of learning from failures (LFF). LFF uses a generality relation called θ-subsumption and successfully prunes incomplete or inconsistent hypotheses. Because of this effective pruning of the hypothesis space, Popper can perform recursion and PI without metarules. Because the hypothesis space with recursive clauses and invented predicates is larger than the space without them, recursion and PI are optional functions in Popper, so that unnecessary searches can be avoided. For example, users can enable PI using one of the bias settings, “enable_pi”. However, in most cases, we do not know in advance whether PI (or recursion) should be enabled. In general, it is important for ILP systems such as Popper to search efficiently in the hypothesis space with both recursive clauses and invented predicates. We focus on Popper for several reasons. Popper is a state-of-the-art ILP system with the potential to address tasks of various categories. In addition, the LFF process can be used to predict efficient search strategies for the subsequent process.
The rest of this paper is organized as follows. Section 2 explains the related work. In Sect. 3, we provide preliminary information about ILP and Popper. Section 4 introduces our proposed method, RL-Popper. In Sect. 5, we present the experimental results and discussion. Finally, we summarize our contributions, limitations, and future work in Sect. 6.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 46–61, 2023. https://doi.org/10.1007/978-3-031-49299-0_4
2 Related Work
2.1 Learning Strategies of ILP Systems
Systems of meta-interpretive learning (MIL) such as Metagol [17] use metarules, which are templates of the clauses allowed in a hypothesis. Metarules successfully reduce the size of the hypothesis space so that the system can learn recursive hypotheses and perform PI. HEXMIL [11] is an ASP-based approach [12] that uses MIL. ASP-based approaches take advantage of the efficiency of modern ASP solvers but struggle with several challenges such as the grounding bottleneck of ASP or limited support for structured objects. HEXMIL addresses these challenges by efficient abstraction of facts from given background knowledge and examples. Despite the efficiency of MIL approaches, the need for users to specify metarules can be a bottleneck. In this paper, we use an approach that does not need metarules. Unlike MIL approaches, Popper supports PI without metarules using LFF. Although Popper mainly uses a syntactical approach, RL-Popper combines LFF with RL as a statistical approach to learn efficient strategies.

DFOL [8] is one of the differentiable ILP approaches [6,18,24], which aim at finding a target program by using an optimization procedure. DFOL constructs a neural network model using several loss functions to learn efficient logic programs.
T. Isobe and K. Inoue
Although DFOL and other differentiable ILP approaches optimize the strategy and the hypothesis simultaneously, RL-Popper learns efficient strategies independently from the main induction mechanism of a target ILP system.
2.2 Meta Learning in ILP
The idea of Meta Learning [9,22] is used in some research on ILP. Srinivasan et al. introduced a methodology to search for optimal parameters of an ILP system using parameter screening and optimization [20]. In this paper, on the other hand, we attempt to determine the search strategies of an ILP system directly rather than through parameters. Metarules of MIL can be considered as a learning target of Meta Learning. Since the emergence of MIL, deciding which metarules should be used for a given task has been explored [4,5]. Multi-task learning is one of the main learning situations for ILP, and the learning order of multi-task learning can also be considered as a learning target of Meta Learning [13]. Lifelong learning (i.e., continual learning) is also an important application of ILP because of the reusability of learned knowledge. ILP systems supporting lifelong learning need to select only the knowledge relevant to the current task from learned knowledge because too much knowledge easily causes a combinatorial explosion. Cropper et al. addressed this challenge by predicting important knowledge both syntactically and statistically [1]. Unlike these existing Meta Learning approaches in ILP, we attempt to predict efficient search strategies using RL.
3 Preliminary
3.1 Inductive Logic Programming
ILP systems use background knowledge and positive and negative examples to induce rules for the target concept that cover all positive examples and none of the negative examples. Most ILP systems also receive information about learning biases from users. These components have slightly different definitions for each system or approach. In this section, we introduce the learning from entailment (LFE) setting [10,15,19], which is one of the major learning settings of ILP. In LFE, examples are provided as atoms that should be entailed by the induced rules together with the background knowledge.

Definition 1. (Learning from entailment). Let (B, E+, E−) be a tuple where B represents background knowledge (a logic program), E+ represents positive examples of the target concept (atoms), and E− represents negative examples of the target concept (atoms). ILP systems aim to induce a hypothesis H such that

\[
\begin{aligned}
\forall e \in E^{+},\quad & H \cup B \models e \\
\forall e \in E^{-},\quad & H \cup B \not\models e
\end{aligned}
\tag{1}
\]
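As an illustration of Definition 1, the following minimal Python sketch checks whether a hypothesis is complete and consistent with respect to given examples. It is only a sketch under stated assumptions: clauses are function-free and range-restricted (every head variable occurs in the body), entailment of ground atoms is decided by naive bottom-up evaluation, and the family facts, the tuple encoding of atoms, and the helper names `consequences` and `is_solution` are hypothetical choices made for this example rather than part of any ILP system discussed here.

```python
def is_var(term):
    """Variables are capitalised strings, as in Prolog."""
    return isinstance(term, str) and term[:1].isupper()


def match(atom, fact, subst):
    """Extend subst so that atom under subst equals the ground fact, else return None."""
    (pred, args), (fpred, fargs) = atom, fact
    if pred != fpred or len(args) != len(fargs):
        return None
    subst = dict(subst)
    for a, f in zip(args, fargs):
        if is_var(a):
            if subst.setdefault(a, f) != f:
                return None
        elif a != f:
            return None
    return subst


def consequences(program, facts):
    """Naive bottom-up evaluation: all ground atoms entailed by program plus facts."""
    model = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            substs = [{}]
            for atom in body:
                extended = []
                for s in substs:
                    for fact in model:
                        s2 = match(atom, fact, s)
                        if s2 is not None:
                            extended.append(s2)
                substs = extended
            for s in substs:
                new = (head[0], tuple(s[a] if is_var(a) else a for a in head[1]))
                if new not in model:
                    model.add(new)
                    changed = True
    return model


def is_solution(hypothesis, background, pos, neg):
    """Condition (1): every positive example is entailed, no negative example is."""
    model = consequences(hypothesis, background)
    complete = all(e in model for e in pos)
    consistent = not any(e in model for e in neg)
    return complete and consistent


# Hypothetical toy data: m(X, Y) reads "X is the mother of Y", f(X, Y) "X is the father of Y".
background = {("m", ("ann", "bob")), ("f", ("bob", "cid")),
              ("m", ("dee", "eve")), ("m", ("eve", "fay"))}
pos = {("gp", ("ann", "cid")), ("gp", ("dee", "fay"))}
neg = {("gp", ("ann", "bob")), ("gp", ("cid", "ann"))}

h_good = [(("gp", ("A", "B")), [("m", ("A", "C")), ("f", ("C", "B"))]),
          (("gp", ("A", "B")), [("m", ("A", "C")), ("m", ("C", "B"))])]
h_bad = [(("gp", ("A", "B")), [("m", ("A", "B"))])]

print(is_solution(h_good, background, pos, neg))
print(is_solution(h_bad, background, pos, neg))
```

In this toy setting, `h_bad` fails on both counts: it entails a negative example and misses the positive ones, which is exactly the kind of failure that the pruning described in Sect. 3.2 exploits.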
Recursion. Recursion is crucial for ILP systems if the target concepts represent potentially infinite relations [2]. For example, if ILP systems need to learn the concept of ancestor, no non-recursive hypothesis of finite size can express the concept exactly with only the two body predicates mother and father. ILP systems supporting recursion can return recursive hypotheses by using recursive clauses, which have the same predicate in the head and the body. Note that recursive hypotheses must also have base clauses, which are not recursive.

Predicate Invention. Predicate invention (PI) [16] can be a key for ILP systems to induce target concepts under some learning biases because PI can reduce the size of the hypothesis. ILP systems supporting PI can return a hypothesis with invented auxiliary predicates representing new concepts not given in the background knowledge.

Example 1. (Predicate invention). If the target concept gp is the grandparent relationship, an ILP system that does not support PI learns the following hypothesis. Note that the predicate m represents mother and the predicate f represents father.

\[
h_1 = \left\{
\begin{array}{l}
\texttt{gp(A,B) :- m(A,C),m(C,B).} \\
\texttt{gp(A,B) :- m(A,C),f(C,B).} \\
\texttt{gp(A,B) :- f(A,C),m(C,B).} \\
\texttt{gp(A,B) :- f(A,C),f(C,B).}
\end{array}
\right\}
\tag{2}
\]

However, if there is a learning bias that limits the number of clauses to at most three, h1 is prohibited because it has four clauses. In this situation, an ILP system supporting PI can return the following hypothesis. Note that the predicate inv represents the concept of parent.

\[
h_2 = \left\{
\begin{array}{l}
\texttt{inv(A,B) :- m(A,B).} \\
\texttt{inv(A,B) :- f(A,B).} \\
\texttt{gp(A,B) :- inv(A,C),inv(C,B).}
\end{array}
\right\}
\tag{3}
\]

3.2 Popper
Popper [3] implements the concept of learning from failures (LFF). In LFF, to reduce the hypothesis space, the system adds constraints that are learned from failed hypotheses. To define constraints, Popper analyzes the generality of the failed hypothesis using the θ-subsumption order.

Definition 2. (θ-subsumption). Let C and D be clauses. If there exists a substitution θ such that Cθ ⊆ D, then C subsumes D.

Example 2. (θ-subsumption). Assume the grandparent problem given earlier, and suppose the hypothesis space contains the following hypotheses.
\[
\begin{aligned}
h_1 &= \{\ \texttt{gp(A,B) :- m(A,B),f(C,B).}\ \} \\
h_2 &= \{\ \texttt{gp(A,B) :- m(A,B),f(C,D).}\ \} \\
h_3 &= \left\{
\begin{array}{l}
\texttt{gp(A,B) :- f(A,B),f(B,C).} \\
\texttt{gp(A,B) :- f(A,B),m(B,C).}
\end{array}
\right\} \\
h_4 &= \{\ \texttt{gp(A,B) :- f(A,B),f(B,C).}\ \}
\end{aligned}
\tag{4}
\]

Fig. 1. Popper generates programs using a Prolog program generator written in ASP. If a constraint is learned in the constrain step, Popper adds this constraint to the generator to prune the hypothesis space of the subsequent process.
Using θ-subsumption, the following can be claimed. If h1 covers negative examples, h2 also covers negative examples because h2 subsumes h1. If h3 does not cover all positive examples, h4 also does not cover all positive examples because h3 subsumes h4. When C subsumes D and C is not equal to D, C is more general than D because every example entailed by D is also entailed by C. Conversely, D is more specific than C. Learning from failures (LFF) is a θ-subsumption-based approach that never prunes complete and consistent hypotheses.

Definition 3. (Learning from failures). Let C and D be distinct clauses such that C subsumes D. If D entails negative examples and thus turns out to be too general, an LFF system prunes C from the hypothesis space because C is more general than D. Conversely, if C does not entail all positive examples and thus turns out to be too specific, an LFF system prunes D from the hypothesis space because D is more specific than C.
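The subsumption test of Definition 2 can be sketched in a few lines of Python. This is a minimal backtracking check written for illustration, not Popper's implementation: the encoding of a clause as a set of signed literals, the helper names `clause`, `extend`, and `subsumes`, and the convention that the variables of D are treated as fixed terms are all assumptions of this sketch. The two clauses are the single clauses of h1 and h2 from Example 2.

```python
def is_var(term):
    """Variables are capitalised strings, as in Prolog."""
    return isinstance(term, str) and term[:1].isupper()


def clause(head, *body):
    """Encode a clause as a set of signed literals: head positive, body literals negative."""
    return frozenset([(True,) + head] + [(False,) + b for b in body])


def extend(theta, lit_c, lit_d):
    """Try to extend theta so that lit_c under theta equals lit_d; None on failure."""
    if lit_c[:2] != lit_d[:2] or len(lit_c[2]) != len(lit_d[2]):
        return None
    theta = dict(theta)
    for s, t in zip(lit_c[2], lit_d[2]):
        if is_var(s):
            if theta.setdefault(s, t) != t:
                return None
        elif s != t:
            return None
    return theta


def subsumes(c, d):
    """True if some substitution theta maps every literal of c onto a literal of d."""
    def search(remaining, theta):
        if not remaining:
            return True
        first, rest = remaining[0], remaining[1:]
        for lit_d in d:
            theta2 = extend(theta, first, lit_d)
            if theta2 is not None and search(rest, theta2):
                return True
        return False

    return search(sorted(c, key=repr), {})


# gp(A,B) :- m(A,B), f(C,B).   (the clause of h1 in Example 2)
h1 = clause(("gp", ("A", "B")), ("m", ("A", "B")), ("f", ("C", "B")))
# gp(A,B) :- m(A,B), f(C,D).   (the clause of h2 in Example 2)
h2 = clause(("gp", ("A", "B")), ("m", ("A", "B")), ("f", ("C", "D")))

print(subsumes(h2, h1))  # h2 is the more general clause (map D to B)
print(subsumes(h1, h2))
```

The check is exponential in the worst case, which is acceptable for clauses of the size considered in this paper but would need a more careful implementation for large clauses.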
Example 3. (Learning from failures). In Example 2, h1 is a totally incorrect hypothesis. First, the LFF system finds that h1 entails negative examples and prunes h2 from the hypothesis space because h2 is more general than h1. On the other hand, h3 is partly correct but incomplete. The LFF system finds that h3 does not entail all positive examples and prunes h4 from the hypothesis space because h4 is more specific than h3.

Popper implements LFF with a loop of generate, test, and constrain steps. First, Popper generates programs with only one body predicate. Popper then gradually generates larger programs. This process is depicted in Fig. 1.
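The generate, test, and constrain loop can be sketched on a deliberately small propositional stand-in. This is not Popper's ASP-based generator: here a hypothesis is a set of clauses, each clause is a set of required features, an example is a set of features, and generality is approximated by clause-wise set inclusion. The function names and the toy examples are hypothetical and chosen only to show how failed hypotheses prune generalisations and specialisations.

```python
from itertools import combinations


def all_clauses(features):
    """Every non-empty conjunction of features (one candidate clause body each)."""
    return [frozenset(c) for n in range(1, len(features) + 1)
            for c in combinations(features, n)]


def programs_of_size(clauses, size):
    """Generate step: all sets of distinct clauses with `size` body literals in total."""
    for k in range(1, size + 1):
        for prog in combinations(clauses, k):
            if sum(len(c) for c in prog) == size:
                yield frozenset(prog)


def covers(program, example):
    """A program covers an example if some clause body is contained in it."""
    return any(c <= example for c in program)


def more_general(h1, h2):
    """h1 is at least as general as h2: each clause of h2 is subsumed by a clause of h1."""
    return all(any(c1 <= c2 for c1 in h1) for c2 in h2)


def learn_from_failures(features, pos, neg, max_size):
    clauses = all_clauses(features)
    too_general, too_specific = [], []  # learned constraints
    tested = 0
    for size in range(1, max_size + 1):  # small programs first
        for h in programs_of_size(clauses, size):
            # Constrain step applied to generation: skip pruned hypotheses.
            if any(more_general(h, bad) for bad in too_general):
                continue  # generalises an inconsistent hypothesis
            if any(more_general(bad, h) for bad in too_specific):
                continue  # specialises an incomplete hypothesis
            # Test step.
            tested += 1
            complete = all(covers(h, e) for e in pos)
            consistent = not any(covers(h, e) for e in neg)
            if complete and consistent:
                return h, tested
            if not consistent:
                too_general.append(h)
            if not complete:
                too_specific.append(h)
    return None, tested


# Hypothetical toy task over the features a, b, c.
pos = [frozenset("ab"), frozenset("c")]
neg = [frozenset("a"), frozenset("b")]
solution, tested = learn_from_failures("abc", pos, neg, max_size=3)
print(sorted(sorted(c) for c in solution), tested)
```

Because only generalisations of inconsistent hypotheses and specialisations of incomplete hypotheses are skipped, no complete and consistent hypothesis is ever pruned in this sketch, mirroring the property stated for LFF above.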
4 RL-Popper
We introduce a method to learn efficient strategies of LFF using RL.
4.1 Search Strategy
RL-Popper generates small programs at first and gradually generates larger programs, as Popper does. Unlike Popper, RL-Popper divides the hypothesis space into finer subspaces. First, we categorize hypotheses into three types: PI, recursive, and ordinary hypotheses.

Definition 4. (Hypothesis type). We define three hypothesis types, PI, recursive, and ordinary hypothesis, as follows. A PI hypothesis is a hypothesis with invented predicates. A recursive hypothesis is a hypothesis that is not a PI hypothesis and has recursive clauses. A hypothesis that is neither a PI hypothesis nor a recursive hypothesis is an ordinary hypothesis.

Furthermore, we divide these categorized hypotheses into smaller subspaces using the number of different variables included in the hypotheses. Finally, we define search orders over these subspaces as search strategies. Although Popper uses a prefixed strategy to learn programs, RL-Popper changes its strategy based on the policy.
4.2 Learning Setting of RL
We define different search orders of the divided spaces as search strategies. In RL-Popper, the goal of RL is to learn which strategy is efficient in each situation in terms of learning time. We use the REINFORCE algorithm [23], a method of policy gradient reinforcement learning (PGRL). In PGRL, a policy is represented by a policy function π_θ parameterized by θ, and an agent learns the θ that maximizes the objective function. The REINFORCE algorithm can learn from past episodes and directly predicts the efficient action in each situation. In the REINFORCE algorithm, θ is updated by Eq. (5), where α is the learning rate, T is the total time of an episode, and R_t, s_t, and a_t are the reward, the state, and the action at time t, respectively. b(s_t) is called the baseline and is defined as the average reward of s_t.

\[
\theta \leftarrow \theta + \alpha \frac{1}{T} \sum_{t=0}^{T} \bigl(R_t - b(s_t)\bigr) \nabla_{\theta} \log \pi_{\theta}(s_t, a_t)
\tag{5}
\]

Next, we introduce the meaning of the states, the action, the episode, and the reward in RL-Popper.

The states are defined using the current learning process of LFF. The policy network receives the LFF search results as input features from these states. For example, we used the number of hypotheses or the number of constraints generated in each subspace of the hypothesis space in the previous learning process. We can also use the coverage, defined as the number of positive or negative examples actually covered in each subspace. In addition, it is possible to use information about learning biases such as the maximum number of different variables. The input features of the policy network are calculated from the search results when the search of each size is completed.

The action is a choice between search strategies that represent search orders within the subspaces of the hypothesis space. For example, we can define six different orders from the three types of hypothesis: PI, recursive, and ordinary hypothesis. In addition, by considering the number of different variables in each hypothesis, we can define even more search orders. The search order of the next size is determined as the action from these input features by the policy network. The process is depicted in Fig. 2.

The episode is defined as a set of tasks. The reward is calculated for the learning process of each task, and the policy network is updated when an episode is completed. The reward is calculated as a function of learning performance, such as learning time.

Proposition 1. (Soundness). If a solution exists, RL-Popper can always output a solution.

Proof. As mentioned above, constraints made in LFF do not prune hypotheses that are complete and consistent. RL-Popper is also sound and does not prune any complete and consistent hypothesis because RL-Popper only changes the search order and uses the same methods for creating constraints.

Proposition 2. (Optimality). The solution generated by RL-Popper is the smallest, in terms of the number of literals, among the complete and consistent hypotheses.

Proof. Our method is optimal with respect to the number of literals because RL-Popper changes the search order of LFF only within the hypothesis space of the same number of literals. Like Popper, RL-Popper first generates small programs and then gradually generates larger programs.
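To make the update rule (5) and the notion of actions as search orders more concrete, the following Python sketch performs one REINFORCE step for a linear softmax policy over the six orders of the three hypothesis types. It is a simplified illustration and not the policy network used in RL-Popper: the linear policy, the one-dimensional state, the use of the episode's mean reward as the baseline b(s_t), and all function names are assumptions of this sketch.

```python
import math
from itertools import permutations

HYPOTHESIS_TYPES = ("pi", "recursive", "ordinary")
# Six possible search orders over the three hypothesis types (the actions).
ACTIONS = list(permutations(HYPOTHESIS_TYPES))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, state):
    """Action probabilities of a linear softmax policy; theta has one weight row per action."""
    return softmax([sum(w * x for w, x in zip(row, state)) for row in theta])


def reinforce_update(theta, episode, alpha):
    """One step of Eq. (5). episode is a list of (state, action_index, reward) triples."""
    T = len(episode)
    # Assumption: the mean reward of the episode stands in for the baseline b(s_t).
    baseline = sum(r for _, _, r in episode) / T
    grad = [[0.0] * len(row) for row in theta]
    for state, action, reward in episode:
        probs = policy(theta, state)
        advantage = reward - baseline
        for k, row in enumerate(grad):
            # Gradient of log pi(a | s) w.r.t. row k of a linear softmax policy.
            coeff = ((1.0 if k == action else 0.0) - probs[k]) * advantage
            for j, x in enumerate(state):
                row[j] += coeff * x
    return [[w + alpha * g / T for w, g in zip(row, grow)]
            for row, grow in zip(theta, grad)]


# Hypothetical episode: order 0 was rewarded, order 1 was penalised.
theta0 = [[0.0] for _ in ACTIONS]  # a single bias feature per action
episode = [([1.0], 0, 1.0), ([1.0], 1, -1.0)]
theta1 = reinforce_update(theta0, episode, alpha=0.1)
probs = policy(theta1, [1.0])
print(ACTIONS[0], probs[0])
print(ACTIONS[1], probs[1])
```

After the update, the probability of the rewarded order rises above the uniform value and that of the penalised order falls below it. Since the policy only chooses among orders, the pruning itself is unchanged, which is the point made in the proof of Proposition 1.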
Fig. 2. The policy network predicts efficient strategies of Popper from the LFF learning results for each size. The prediction is performed whenever the size of the programs that Popper searches is updated.
5 Experiments
5.1 Experimental Setting
Train and Test Data. We used a set of ten tasks to train a policy network. To evaluate performance outside the training data, we tested RL-Popper on a set of five tasks that do not appear in the training data. For each of the five test tasks, we also created a duplicate, marked with (2), that has weaker learning biases than the original. The Ancestor (3) task has almost the same settings as the Ancestor (2) task but has many more positive examples. These tasks are based on the sample data sets of Popper, which are available on GitHub (https://github.com/logic-and-learning-lab/Popper). Table 1 lists these tasks with an example solution for each. We slightly modified these tasks, mainly in terms of learning biases; Table 2 gives a summary. We trained the policy network by learning the training set of ten tasks 150 times, which includes 100 steps of greedy search and 50 steps of exploratory search. We tested RL-Popper using both training and test tasks without policy updating.

Table 1. Tasks made from examples provided on the Popper GitHub page. Tasks cover several categories, such as list transformation, relationship, or robot strategy. * represents a training task.

| Task | An example solution |
|---|---|
| All elements are even* | alleven(A):- empty(A). alleven(A):- head(A,B), tail(A,C), even(B), alleven(C). |
| Drop the first k elements* | dropk(A,B,C):- one(B), tail(A,C). dropk(A,B,C):- decrement(B,E), dropk(A,E,D), tail(D,C). |
| Find duplicates* | dupl(A,B):- head(A,B). dupl(A,B):- tail(A,D), tail(D,C), element(C,B), dupl(D,B). |
| Grandparent* | inv1(A,B):- m(A,B). inv1(A,B):- f(A,B). kinship_pi(A,B):- inv1(A,C), inv1(C,B). |
| Path* | path(A,B):- edge(A,B). path(A,B):- edge(A,C), path(C,B). |
| Reverse* | rev(A,B):- empty(A), empty(B). rev(A,B):- tail(A,C), head(A,E), append(D,E,B), rev(C,D). |
| Robots (linear)* | robots(A,B):- r(A,F), r(F,E), r(E,D), r(D,C), r(C,B). |
| Robotic Scene Graph* | rsg(A,B):- flat(A,B). rsg(A,B):- down(D,B), up(A,C), rsg(D,C). |
| successor* | scc(A,B):- edge(A,B). scc(A,B):- edge(B,C), scc(A,C). |
| Trains1* | trains(A):- car(A,B), rclosed(B), car(A,C), 3wheels(C), long(B). |
| Drop the last element | dropl(A,B):- tail(A,B). dropl(A,B):- tail(A,D), head(A,C), dropl(D,E), append(C,E,B). |
| Ancestor | inv1(A,B):- f(A,B). inv1(A,B):- m(A,B). ancestor(A,B):- inv1(A,B). ancestor(A,B):- inv1(A,C), ancestor(C,B). |
| Nearly successor | nearlyscc(A,B):- edge(B,A). nearlyscc(A,B):- edge(A,B). nearlyscc(A,B):- edge(A,C), nearlyscc(C,B). nearlyscc(A,B):- edge(C,A), nearlyscc(C,B). |
| Robots (recursion) | robots_rec(A,B):- move_up(A,B), at_top(B). robots_rec(A,B):- move_up(A,C), robots_rec(C,B). |
| Trains2 | trains(A):- car(A,B), ropen(B), load(B,C), triangle(C). trains(A):- car(A,C), car(A,B), rclosed(C), 2wheels(B), ropen(B). |

Table 2. Summary of the learning bias of each task. The first ten rows are training tasks and the remaining rows are test tasks.

| Task | Max vars | Max body | Max clauses | PI | Recursion |
|---|---|---|---|---|---|
| All elements are even | 5 | 5 | – | enable | enable |
| Drop the first k elements | 5 | 4 | – | enable | enable |
| Find duplicates | 4 | 4 | – | enable | enable |
| Grandparent | 3 | 3 | 3 | enable | enable |
| Path | 5 | 5 | – | enable | enable |
| Reverse | 5 | 4 | – | enable | enable |
| Robots | – | – | – | – | enable |
| Robotic Scene Graph | 4 | 4 | – | enable | enable |
| successor | 5 | 5 | – | enable | enable |
| Trains1 | – | – | – | enable | enable |
| Drop the last element | 5 | 4 | – | enable | enable |
| Drop the last element (2) | 5 | 5 | – | enable | enable |
| Ancestor | 3 | 2 | 3 | enable | enable |
| Ancestor (2) | 4 | 3 | 3 | enable | enable |
| Ancestor (3) | 4 | 3 | 3 | enable | enable |
| Nearly successor | 4 | 3 | – | enable | enable |
| Nearly successor (2) | 5 | 4 | – | enable | enable |
| Robots (recursion) | 4 | 3 | – | enable | enable |
| Robots (recursion) (2) | 5 | 4 | – | enable | enable |
| Trains2 | 3 | 5 | – | enable | enable |
| Trains2 (2) | 5 | 5 | – | enable | enable |

Learning Setting of RL. First, we assumed the number of different variables was up to six. Therefore, there are 18 subspaces of the hypothesis space because they are defined as the combination of six options for the number of different variables and three hypothesis types. There are six different orders of the three hypothesis types. In addition, for each of the three types of hypotheses, we assume there are two choices for ordering by the number of different variables: ascending or descending. This gives eight variations for each of the six orders of the three hypothesis types. Finally, we define 48 different search orders by combining these variations.

Next, we create 133 input features of the policy network. For each of the 18 subspaces, the following seven features are defined.
(1) Number of hypotheses generated in the previous process.
(2) Average ratio of positive examples covered in the previous process.
(3) Average ratio of negative examples covered in the previous process.
(4) Number of generalization constraints generated in the previous process.
(5) Number of specialization constraints generated in the previous process.
(6) Number of redundant constraints (1) generated in the previous process. Redundant constraint (1) is generated when PI is not allowed and a hypothesis covers none of the positive examples and has only one clause.
(7) Number of redundant constraints (2) generated in the previous process. Redundant constraint (2) is generated when recursion is allowed and a hypothesis covers none of the positive examples.

These are designed to reflect the process of LFF. In addition, from the learning biases and learning progress, the following seven features are defined, giving 18 × 7 + 7 = 133 features in total.
(1) The maximum number of literals allowed to be used in a hypothesis.
(2) The maximum number of different variables allowed to be used in a hypothesis.
(3) The maximum number of clauses allowed to be used in a hypothesis.
(4) The number of different body predicates allowed to be used.
(5) Flag representing whether recursion is allowed or not.
(6) Flag representing whether PI is allowed or not.
(7) The size that the system is about to explore.

We defined the reward function as follows, where m is the rank of the learning time and N is the number of the current episode.

\[
\mathrm{reward} =
\begin{cases}
\dfrac{\left(50 - \frac{m-0.5}{N} \cdot 100\right)^{2}}{25} & \text{if } \frac{m-0.5}{N} < 0.5 \\[2ex]
-\dfrac{\left(50 - \frac{m-0.5}{N} \cdot 100\right)^{2}}{25} & \text{if } \frac{m-0.5}{N} \geq 0.5
\end{cases}
\tag{6}
\]

The training results depend on the reward function. We use the rankings to ensure that the same reward function can be applied regardless of which measure of learning performance is used. Assuming that all rankings appear with equal probability, the expected value of this reward function is zero. Finally, we adopted ε-greedy search [7] with ε = 1/3 to balance exploration and exploitation. During training, once every three episodes, the agent ignores the policy network throughout the episode.
5.2 Results
Figure 3 shows the training results for the ten training tasks, i.e., the learning time of each task while the policy network is being updated. Note that, to illustrate the process of policy learning, this result contains only the 100 greedy search steps and not the 50 steps of exploratory search. In most tasks, the learning time is relatively short in later episodes.

Fig. 3. The learning time for training tasks with greedy policy

Popper and RL-Popper successfully learned target programs that are syntactically the same as the example solutions in most tasks, except for the Ancestor (2) task. In the Ancestor (2) task, while RL-Popper learned correct programs representing the meaning of ancestor, Popper learned incorrect programs such as

\[
\left\{
\begin{array}{l}
\texttt{ancestor(A,B) :- m(A,B).} \\
\texttt{ancestor(A,B) :- m(A,C), ancestor(C,B).} \\
\texttt{ancestor(A,B) :- f(A,C), m(D,C), ancestor(D,B).}
\end{array}
\right.
\tag{7}
\]

Although this program is complete and consistent with the given examples, it does not correctly represent the meaning of ancestor. This difference in learned programs between RL-Popper and Popper was caused by the difference in search orders. We added the Ancestor (3) task to evaluate the two methods in a situation where both can output correct programs.

To evaluate RL-Popper, we compared it with Popper on two measures of learning performance: the learning time and the number of hypotheses generated. Tests were conducted 20 times for each task, and the results are reported as mean ± standard deviation. Table 3 shows the results for the learning time. Unfortunately, RL-Popper spent more time than Popper on some tasks. The main time-consuming factor is the overhead incurred by dividing the hypothesis space. In RL-Popper, we defined a different generator for each subspace of the hypothesis space. However, despite this overhead, RL-Popper learned programs in a significantly shorter time in several tasks. Although RL-Popper spent more time on “Drop the last element” and “Robots (recursion)”, it spent less time on the versions of these tasks with weaker learning biases (“Drop the last element (2)” and “Robots (recursion) (2)”). These results show that RL-Popper successfully learned strategies to learn efficiently with weak learning biases in several tasks.

Table 3. Results of learning performance in learning time (seconds).

| Train/Test | Task | Popper | RL-Popper |
|---|---|---|---|
| Train | All elements are even | 0.68 ± 0.01 | 3.14 ± 0.04 |
| Train | Drop the first k elements | 6.23 ± 0.97 | 5.19 ± 0.04 |
| Train | Find duplicates | 1.1 ± 0.11 | 2.59 ± 0.03 |
| Train | Grandparent | 0.96 ± 0.72 | 5.65 ± 0.04 |
| Train | Path | 5.36 ± 0.16 | 1.26 ± 0.02 |
| Train | Reverse | 4.49 ± 0.8 | 5.48 ± 0.43 |
| Train | Robots | 2.1 ± 0.45 | 6.31 ± 0.21 |
| Train | Robotic Scene Graph | 5.64 ± 0.68 | 3.08 ± 0.48 |
| Train | successor | 5.36 ± 0.17 | 1.32 ± 0.01 |
| Train | Trains1 | 0.98 ± 0.01 | 3.66 ± 0.05 |
| Test | Drop the last element | 1.91 ± 0.22 | 4.39 ± 0.17 |
| Test | Drop the last element (2) | 38.01 ± 2.01 | 4.52 ± 0.14 |
| Test | Ancestor | 15.34 ± 14.28 | 16.64 ± 11.2 |
| Test | Ancestor (2) | (88.57 ± 124.03) | 151.91 ± 52.5 |
| Test | Ancestor (3) | 99.30 ± 26.79 | 141.64 ± 22.82 |
| Test | Nearly successor | 3.63 ± 0.04 | 6.06 ± 0.09 |
| Test | Nearly successor (2) | 47.53 ± 6.58 | 73.53 ± 3.0 |
| Test | Robots (recursion) | 1.25 ± 0.58 | 2.17 ± 0.04 |
| Test | Robots (recursion) (2) | 31.06 ± 1.3 | 5.71 ± 0.08 |
| Test | Trains2 | 0.35 ± 0.02 | 3.47 ± 0.06 |
| Test | Trains2 (2) | 0.77 ± 0.03 | 11.06 ± 0.12 |

Table 4 shows the number of generated hypotheses. Note that hypotheses generated by the combination of possible hypotheses are not included here. A possible hypothesis is consistent but incomplete and covers at least one positive example. Popper saves each possible hypothesis and tests the combinations of these hypotheses.

Table 4. The number of the hypotheses generated by the hypothesis generator.

| Train/Test | Task | Popper | RL-Popper |
|---|---|---|---|
| Train | All elements are even | 175.2 ± 9.42 | 171.8 ± 1.12 |
| Train | Drop the first k elements | 275.8 ± 90.49 | 40.0 ± 0.0 |
| Train | Find duplicates | 270.75 ± 61.34 | 223.0 ± 0.0 |
| Train | Grandparent | 1708.75 ± 1428.93 | 7225.75 ± 23.33 |
| Train | Path | 17.4 ± 5.27 | 35.6 ± 1.96 |
| Train | Reverse | 880.9 ± 271.48 | 217.1 ± 26.69 |
| Train | Robots | 181.7 ± 65.09 | 152.9 ± 3.86 |
| Train | Robotic Scene Graph | 1958.1 ± 769.63 | 1408.15 ± 35.19 |
| Train | successor | 19.45 ± 6.0 | 38.0 ± 0.0 |
| Train | Trains1 | 321.7 ± 5.67 | 325.0 ± 0.0 |
| Test | Drop the last element | 112.6 ± 29.97 | 187.6 ± 6.96 |
| Test | Drop the last element (2) | 124.25 ± 32.21 | 190.75 ± 3.74 |
| Test | Ancestor | 15734.45 ± 14402.78 | 13184.0 ± 11475.0 |
| Test | Ancestor (2) | (61406.85 ± 17915.93) | 78680.7 ± 21392.98 |
| Test | Ancestor (3) | 101996.1 ± 29708.34 | 65265.35 ± 13658.41 |
| Test | Nearly successor | 3250.1 ± 0.3 | 3250.05 ± 0.22 |
| Test | Nearly successor (2) | 23154.3 ± 4.71 | 23155.15 ± 4.19 |
| Test | Robots (recursion) | 390.0 ± 227.09 | 573.85 ± 8.06 |
| Test | Robots (recursion) (2) | 646.4 ± 280.91 | 924.0 ± 0.0 |
| Test | Trains2 | 297.0 ± 0.0 | 297.0 ± 0.0 |
| Test | Trains2 (2) | 381.0 ± 0.0 | 381.0 ± 0.0 |

In several tasks, RL-Popper spent less time but produced more candidate hypotheses than Popper. This result shows that generating the fewest hypotheses is not always the quickest strategy in these tasks. In “Nearly successor” and “Trains2”, RL-Popper and Popper generated almost the same number of hypotheses. When RL-Popper and Popper find a complete and consistent hypothesis by the combination of possible hypotheses, both methods do not stop generating hypotheses until the hypothesis is found to be optimal. In this case, an efficient strategy cannot affect the number of hypotheses generated even if the strategy helps the learner find the solution more quickly.
5.3 Discussion
The training results show that the policy network is capable of learning efficient search strategies that can be applied to different tasks. This supports the feasibility of Meta Learning of the strategies of an ILP system. From the test results, the strategies learned by RL-Popper tend to work well in the tasks with weak learning biases, i.e., with large hypothesis spaces. In addition, the test results show that the system with learned strategies tends to have higher stability than the original system. However, in RL-Popper, the overhead incurred by the additional process is significant enough that, in some tasks, the learned strategies cannot improve on the original system. Furthermore, the learned strategies are not capable of adapting to all types of tasks because the learning time of some tasks was not improved through training.
6 Conclusion
We constructed a new method of applying RL to improve the performance of LFF. We showed that the REINFORCE algorithm successfully reduced the learning time of training tasks. In Sect. 5, we provided experimental results where RL-Popper learned the target concepts in a shorter time than Popper in several tasks. In particular, in situations where learning biases are weak, RL-Popper tends to be faster in learning time.

When the training tasks are changed, the test results can change. In particular, if the training tasks were more varied than in our experiments, the learned strategies would be adaptable to a wider variety of tasks. Although RL-Popper could learn efficient strategies for most training tasks, the learning time is not improved in other tasks. How to learn strategies that can be shared across a greater number of training tasks than in RL-Popper, by trying different learning settings of RL or other Machine Learning methods, must be addressed in the future. While the learned strategies are effective in some tasks with weak learning biases, RL-Popper is less efficient than Popper in other tasks. The additional process of dividing the hypothesis space causes significant overhead, and this time loss needs to be reduced.

In RL-Popper, the policy network was learned from the relation between the learning time and the process of LFF. This Meta Learning approach can be applied to other ILP systems. Also, other indicators of learning performance can be considered as the objective of Meta Learning. In particular, how learned strategies affect the learning accuracy of ILP should be explored further.

Acknowledgement. This work has been supported in part by JSPS KAKENHI Grant Number JP21H04905 and JST CREST Grant Number JPMJCR22D3.
References
1. Cropper, A.: Forgetting to learn logic programs. In: AAAI, pp. 3676–3683. AAAI Press (2020)
2. Cropper, A., Dumancic, S.: Inductive logic programming at 30: a new introduction. J. Artif. Intell. Res. 74, 765–850 (2022)
3. Cropper, A., Morel, R.: Learning programs by learning from failures. Mach. Learn. 110(4), 801–856 (2021). https://doi.org/10.1007/s10994-020-05934-z
4. Cropper, A., Muggleton, S.H.: Logical minimisation of meta-rules within meta-interpretive learning. In: Davis, J., Ramon, J. (eds.) ILP 2014. LNCS (LNAI), vol. 9046, pp. 62–75. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23708-4_5
5. Cropper, A., Tourret, S.: Logical reduction of metarules. CoRR abs/1907.10952 (2019)
6. Evans, R., Grefenstette, E.: Learning explanatory rules from noisy data. J. Artif. Intell. Res. 61, 1–64 (2018)
7. François-Lavet, V., Henderson, P., Islam, R., Bellemare, M.G., Pineau, J.: An introduction to deep reinforcement learning. CoRR abs/1811.12560 (2018)
8. Gao, K., Inoue, K., Cao, Y., Wang, H.: Learning first-order rules with differentiable logic program semantics. In: IJCAI, pp. 3008–3014. ijcai.org (2022)
9. Hospedales, T.M., Antoniou, A., Micaelli, P., Storkey, A.J.: Meta-learning in neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 5149–5169 (2022)
10. Inoue, K.: Induction as consequence finding. Mach. Learn. 55(2), 109–135 (2004)
11. Kaminski, T., Eiter, T., Inoue, K.: Exploiting answer set programming with external sources for meta-interpretive learning. Theory Pract. Log. Program. 18(3–4), 571–588 (2018)
12. Law, M., Russo, A., Bertino, E., Broda, K., Lobo, J.: FastLAS: scalable inductive logic programming incorporating domain-specific optimisation criteria. In: AAAI, pp. 2877–2885. AAAI Press (2020)
13. Lin, D., Dechter, E., Ellis, K., Tenenbaum, J.B., Muggleton, S.H.: Bias reformulation for one-shot function induction. In: ECAI. Frontiers in Artificial Intelligence and Applications, vol. 263, pp. 525–530. IOS Press (2014)
14. Muggleton, S.H.: Inductive logic programming. New Gener. Comput. 8(4), 295–318 (1991)
15. Muggleton, S.H.: Inverse entailment and Progol. New Gener. Comput. 13(3&4), 245–286 (1995)
16. Muggleton, S.H., Buntine, W.L.: Machine invention of first order predicates by inverting resolution. In: ML, pp. 339–352. Morgan Kaufmann (1988)
17. Muggleton, S.H., Lin, D., Tamaddoni-Nezhad, A.: Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. Mach. Learn. 100(1), 49–73 (2015). https://doi.org/10.1007/s10994-014-5471-y
18. Shindo, H., Nishino, M., Yamamoto, A.: Differentiable inductive logic programming for structured examples. In: AAAI, pp. 5034–5041. AAAI Press (2021)
19. Srinivasan, A.: The Aleph Manual. Machine Learning at the Computing Laboratory, Oxford University, Cambridge (2001)
20. Srinivasan, A., Ramakrishnan, G.: Parameter screening and optimisation for ILP using designed experiments. J. Mach. Learn. Res. 12, 627–662 (2011)
21. Sutton, R.S., Barto, A.G.: Reinforcement Learning - An Introduction. Adaptive Computation and Machine Learning, MIT Press, Cambridge (1998)
22. Thrun, S., Pratt, L.Y. (eds.): Learning to Learn. Springer, Heidelberg (1998)
23. Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 8, 229–256 (1992)
24. Yang, F., Yang, Z., Cohen, W.W.: Differentiable learning of logical rules for knowledge base reasoning. In: NIPS, pp. 2319–2328 (2017)
Select First, Transfer Later: Choosing Proper Datasets for Statistical Relational Transfer Learning Thais Luca1(B) , Aline Paes2 , and Gerson Zaverucha1 1
Department of Systems Engineering and Computer Science, COPPE Universidade Federal do Rio de Janeiro, Rio de Janeiro, RJ, Brazil {tluca,gerson}@cos.ufrj.br 2 Institute of Computing, Universidade Federal Fluminense, Niteroi, RJ, Brazil [email protected]
Abstract. Statistical Relational Learning (SRL) relies on statistical and probabilistic modeling to represent, learn, and reason about domains with complex relational and rich probability structures. Although SRL techniques have succeeded in many real-world applications, they follow the same assumption as most ML techniques by assuming training and testing data have the same distribution and are sampled from the same feature space. Changes between these distributions might require training a new model using new data. Transfer Learning adapts knowledge already learned to other tasks and domains to help create new models, particularly in a low-data regime setting. Many recent works have succeeded in applying Transfer Learning to relational domains. However, most focus on what and how to transfer. When to transfer is still an open research problem as a pre-trained model is not guaranteed to help or improve performance for learning a new model. Besides, testing every possible pair of source and target domains to perform transference is costly. In this work, we focus on when by proposing a method that relies on probabilistic representations of relational databases and distributions learned by models to indicate the most suitable source domain for transferring. To evaluate our approach, we analyze the performances of two transfer learning-based algorithms given the most similar target domain to a source domain according to our proposal. In the experimental results, our method has succeeded as both algorithms reach their best performance when transferring between the most similar pair of source and target domains.
Keywords: Statistical relational learning · Transfer learning · Relational data similarity

1 Introduction
Supported by CAPES, FAPERJ, and CNPq. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 62–76, 2023. https://doi.org/10.1007/978-3-031-49299-0_5
Relational Learning is a Machine Learning (ML) subfield that combines ML and knowledge representation principles to learn how to address a task from
examples [5]. It differs from traditional ML in that it does not rely on tabular inputs, which ignore crucial information on how entities relate to each other. It allows for a more expressive knowledge representation by representing domains with multiple entities and the relationships among them. Also, it does not assume independent and identically distributed (i.i.d.) entities, as traditional methods do. Statistical Relational Learning (SRL) relies on statistical and probabilistic modeling to represent and learn in domains with complex relational and rich probabilistic structure [9]. Exploring such structure allows for solving more complex problems. Besides, SRL has succeeded in several real-world applications, as real data are relational and require handling uncertainty from noise and incomplete information. However, like most ML models, SRL models assume that training and testing data are sampled from the same distribution and belong to the same feature space. Thus, a new model must be trained using new data if distributions differ. Transfer Learning aims at extracting knowledge from one or more source scenarios to boost the learning performance in a new target scenario [38]. It has emerged as the key to reducing the time and data required to train a new model, as it avoids learning a model from scratch in a new specific domain. It is crucial in scenarios where data differ in distribution, as a model can be adapted to predict data for a new task. Transfer learning is appropriate for SRL methods as it overcomes the lack of high-quality data instances and the long training time, which are problems often faced by SRL models [38]. Previous works [2,6,13,16–18,22] have confirmed its efficacy in SRL methods and shown that applying transfer learning to SRL models allows training and testing sets to have different distributions.
However, most work on relational transfer learning focuses on what knowledge should be transferred from one model to the other and how to take advantage of this knowledge when transferring between two different domains. When to transfer is still an open issue, as a pre-trained model is not guaranteed to help or improve performance for learning a new model. Furthermore, brute-force knowledge transfer may harm the target performance, resulting in negative transfer [38]. Also, testing all possible pairs of source and target domains can be costly and infeasible. In this work, we focus on when to transfer and propose a method to identify the most suitable source domain for a given target domain beforehand. Our hypothesis is that transferring from a more related domain will result in better performance. Considering the difference between semantic structures of relational datasets, one way to compare distributions is by capturing underlying probabilities and computing their divergences [19]. To compare such data, we rely on probabilistic representations of relational databases and distributions learned by models. To alleviate the complexity of capturing probabilistic distributions directly from relational domains, we represent the datasets in a simplified feature space by relying on propositionalization [12]. Then, we obtain conditional probability distributions using Naïve Bayes (NB) [31]. We evaluate our method by analyzing the performance of two state-of-the-art transfer learning-based algorithms named TreeBoostler [2] and TransBoostler [16]. Both focus on transferring Boosted Relational Dependency Networks (RDNs) [27], which propose representing conditional probability distributions that might appear in RDNs as a weighted sum of regression models. TreeBoostler recursively tries to transfer nodes from source regression trees to build target regression trees. TransBoostler relies on word embeddings to map predicates in source regression trees to their most similar predicates in the target domain. It uses three similarity metrics to find the best mappings between the two domains. After transference, both TransBoostler and TreeBoostler rely on Theory Revision [37] to improve their inferential capacities. This work uses four real-world relational datasets and compares distributions using the Jensen-Shannon Divergence [21]. Our results show that both algorithms perform best when transferring between the most similar pairs of source and target domains. This paper is organized as follows. Section 2 introduces the necessary background. Section 3 relates how transfer learning is applied to relational learning and how the problem of approximating probability distributions from different domains is investigated in the literature. Section 4 describes our method, how it finds the underlying probability distributions, and how it compares them. Section 5 describes experimental results for different datasets. Finally, Sect. 6 presents our conclusions.
2 Background

This section presents an overview of concepts used to build the contributions of this paper, namely, Transfer Learning and Bottom Clause Propositionalization.

2.1 Transfer Learning
Transfer Learning aims at adapting knowledge already learned to other tasks and domains to help create new models. It has succeeded in many Deep Learning applications as it helps learning in low-data regime settings and reduces training time [34]. Long training time is also a problem faced by SRL models, along with the lack of high-quality data instances. Both problems can be solved by applying transfer learning techniques, making it also suitable for SRL methods [38]. Following the definition presented in [30]: a domain is defined by a pair $\mathcal{D} = \{\mathcal{X}, P(X)\}$, and a task is defined by $\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}$. Given a source domain $\mathcal{D}_S$, a learning task $\mathcal{T}_S$, a target domain $\mathcal{D}_T$, and a learning task $\mathcal{T}_T$, transfer learning aims to help learning a new target function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge learned in $\mathcal{D}_S$ while learning $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$. The first condition implies that $\mathcal{X}_S \neq \mathcal{X}_T$ or $P_S(X) \neq P_T(X)$. The second condition implies that $\mathcal{Y}_S \neq \mathcal{Y}_T$ or $P(Y_S|X_S) \neq P(Y_T|X_T)$. Equal domains and tasks make the learning problem become a traditional ML problem. Conversely, different domains imply different feature spaces or probability distributions between domains. Finally, different learning tasks imply different label spaces or conditional probability distributions between domains.
According to [38], transfer learning has three main research issues: (i) what to transfer, (ii) how to transfer, and (iii) when to transfer. The first question focuses on identifying which part of knowledge, such as part of the weights, can be transferred across different domains or tasks. The second one focuses on how to apply the knowledge already learned to learn a new domain or task. In SRL, for example, this includes how to build mappings of relational knowledge. Finally, the third question asks in which scenarios knowledge should not be transferred. Most works on SRL focus on proposing new approaches to answer the second question. Moreover, both (i) and (ii) assume source and target domains are related. When this assumption does not hold, it might result in negative transfer, i.e., the transfer method decreases the performance of learning a new target domain or task. Most transfer methods focus on producing positive transfer between related tasks, avoiding negative transfer between less related tasks [34]. However, when to transfer, or at least from where to transfer, is still an open research issue, as there is no guarantee of performance improvement by using a pre-trained model to learn a new one.

2.2 Bottom Clause Propositionalization
Propositionalization [12] is a method of Inductive Logic Programming (ILP) [24] that aims at converting relational databases into an attribute-value table, where each row represents a single data instance and each column represents an attribute or feature. Propositionalization algorithms use background knowledge and examples to build distinctive features to differentiate subsets of examples. They can be divided into logic-oriented and database-oriented approaches. Logic-oriented methods aim at building a set of relevant first-order features by distinguishing between first-order objects. In contrast, database-oriented methods focus on exploiting database relations and functions to generate features [8]. Bottom Clause Propositionalization (BCP) [8] is a logic-oriented propositionalization approach that searches for bottom clauses. In First-Order Logic (FOL), relations are represented by logical facts and domains by using constants, variables, and predicates. A term can be a variable, a constant, or a function symbol applied to terms, and a literal can be an atom or a negated atom. A disjunction of literals forms a clause. Bottom clauses were proposed as part of the Progol system [25] and are built from one example, background knowledge, and language bias. Background knowledge is a set of logical facts or defined clauses, and language bias is a set of clauses that describe how these clauses can be built. A bottom clause is the most specific clause within a hypothesis space that covers an example with regard to background knowledge [32]. More formally, a bottom clause $\bot_i$ with regard to an example $i$ and background knowledge $B$ is defined as $B \cup \bot_i \models i$. Progol searches for the bottom clause through a space of declarations, which are modes for the hypothesized clause. Then, it passes through the determination predicates, which are specified relations that can appear in the body of the clause. This process is repeated until the number of cycles through the declarations is reached.
Algorithm 1 presents a procedure that finds bottom
clauses for a set of examples. For more details about terms and how they relate to ILP theory, please refer to [5,8,12]. Concerning the BCP procedure, after creating a set of bottom clauses with Algorithm 1, the next step is to convert each clause into an input vector. To simplify the feature extraction process, BCP uses the set of all body literals that appear in the examples as possible features in a truth table [8,32]. Each literal is converted into a feature whose value is a boolean indicating whether or not the literal exists in the example in the database. This step is presented in Algorithm 2. Other approaches that capture the distributions of relational datasets in a way similar to BCP are 1BC [7] and 1BC2 [14], which are FOL naive Bayesian classifier systems that generate a set of FOL features to be used dynamically as attributes of a Naïve Bayes classifier. 1BC2 uses the same data representation as 1BC, distinguishing structural predicates from properties. It upgrades the attribute-value Naïve Bayes assumption to FOL objects and defines probability distributions over lists, multisets, and sets. The goal is to classify a structured individual and estimate its probability from the probability of the elements of the structure.
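The conversion from bottom clauses to an attribute-value table described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the representation of a bottom clause as a pair (is_positive, body_literals), the function name table_generation, and the toy literals are hypothetical and are not taken from the BCP implementation of [8].

```python
from typing import Dict, List, Tuple

# Assumed representation (for illustration only): a bottom clause is a pair
# (is_positive, body_literals), where body literals are plain strings.
BottomClause = Tuple[bool, List[str]]


def table_generation(
    bottom_clauses: List[BottomClause],
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Turn bottom clauses into binary feature vectors with +1/-1 labels."""
    # Every distinct body literal becomes one column of the table.
    features = sorted({lit for _, body in bottom_clauses for lit in body})
    index: Dict[str, int] = {lit: i for i, lit in enumerate(features)}

    vectors: List[List[int]] = []
    labels: List[int] = []
    for is_positive, body in bottom_clauses:
        row = [0] * len(features)      # start with all features absent
        for lit in body:
            row[index[lit]] = 1        # mark literals present in this clause
        vectors.append(row)
        labels.append(1 if is_positive else -1)
    return features, vectors, labels


# Hypothetical toy clauses, loosely inspired by a movie domain.
clauses = [
    (True, ["actor(A)", "director(B)", "movie(C,A)", "movie(C,B)"]),
    (False, ["actor(A)", "actor(B)"]),
]
features, vectors, labels = table_generation(clauses)
```

Each row of vectors can then be passed to any propositional learner. Sorting the literals is a design choice of this sketch that keeps the column order deterministic across runs.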
3 Related Work
There are many works on transfer learning between relational domains. Another way to learn from structured data is through graph-based methods, including Graph Neural Networks (GNNs), for which some works have also proposed transfer learning techniques [10,15,33]. However, this work only focuses on logic-based representations. Works on relational transfer learning include: LTL [13], which performs transfer learning by relying on a type-based tree construction, where each path in the source domain must have the same number of arguments related to each link path in the target domain; the TAMAR [22] algorithm, which presumes domains have similar relations and performs an exhaustive search to map predicates across source and target domains in Markov Logic Networks (MLNs). It uses the weighted pseudo-log-likelihood to find the most suitable mappings and only maps predicates of the same arity and matching types; and GROOT [6], which uses a genetic algorithm to map predicates while transferring RDNs. It also restricts mappings by predicate arity and type consistency. Finally, TreeBoostler [2] and TransBoostler [16] also propose to transfer RDNs and assume domains share similar relations. The former is similar to TAMAR as it tries to find mappings for predicates recursively, and the target predicate with the best weighted variance value is chosen as the mapping for a source predicate. The latter assumes domains can be related by the context in which the words that form such predicates appear. Thus, it relies on pre-trained word embeddings to guide the mapping of predicates across source and target domains.
Algorithm 1: Bottom Clause Generation [8].
 1  Function BC_GENERATION(E):
 2      E⊥ ← {}
 3      for e ∈ E do
 4          Add e to background knowledge and remove any previously inserted examples
 5          inTerms ← {}, ⊥e ← {}, currentDepth ← 0
 6          Find the first mode declaration with head h which θ-subsumes e
 7          for v/t ∈ θ do
 8              Replace v in h with t if v is of type #
 9              Replace v in h with vk, where k = hash(t), if v is of one of {+,-}
10              Add t to inTerms if v is of type +
11          end
12          Add h to ⊥e
13          for each body mode declaration b and recall value recall do
14              for substitutions θ of arguments + of b to elements of inTerms and recall do
15                  while recall number of iterations not reached do
16                      if query(bθ, backgroundKnowledge) then
17                          for v/t ∈ θ do
18                              Replace v in b with t if v is of type #
19                              Replace v in b with vk, where k = hash(t), if v is not of type #
20                              Add t to inTerms if v is of type -
21                          end
22                          Add bθ to ⊥e if not added
23                      end
24                  end
25              end
26          end
27          Increment currentDepth
28          Go back to line 13 if currentDepth < depth
29          Add negation symbol to the head of ⊥e if e is a negative example
30          Add ⊥e to E⊥
31      end
32      return E⊥
Concerning distance-based transfer, the authors of [29] propose a dimensionality reduction algorithm to minimize the distance between data distributions in a low-dimensional latent space. They argue that, if it is possible to find a latent space in which marginal distributions are close, this space can help to propagate a classification model. Experimental results show the proposed method can improve the performance of traditional ML algorithms for transfer learning. Some works on Transfer Metric Learning [19], which aims at leveraging knowledge from other related domains by revealing the underlying data relationship, highlight the distribution divergence between source and target domains and try to minimize the distance via distribution approximation [1,19]. Other works focus on finding a common subspace where distributions are close to each other [28,36].

Algorithm 2: Attribute-value table generation algorithm [8].
Function TABLE_GENERATION(nLiterals, E⊥):
    Ev ← {}
    for ⊥e ∈ E⊥ do
        Create numerical vector vi of the size of nLiterals
        Fill positions of vi with 0
        Change its values to 1 for each position corresponding to a body literal of ⊥e
        Add vi to Ev
        Associate a label 1 to vi in case of a positive example, and -1 otherwise
    end
    return Ev
4 Finding from Where to Transfer with a Domain Similarity-Based Approach
Applying transfer learning to every possible pair of source and target domains can be costly and infeasible. In this work, we focus on when to transfer. Given a target domain and a set of pre-trained models, we want to find which source domain is the most related to the new target before performing transfer learning. Our goal is to avoid testing all possibilities for transference and, hopefully, to avoid negative transfer. To compare relational domains, the first step is to represent data in a simple feature space in which distributions are easier to compare. To do so, we use a simple and naive approach to generate Conditional Probability Distributions (CPDs) in the target domain using NB [31]. To use NB, we must first represent examples in a single table. An efficient way to compare domains is by relying on underlying probabilities, as investigated in metric transfer learning [19,28] approaches. However, capturing such distributions in relational databases is not trivial, since entities and the relationships among them are represented by a vocabulary and not by i.i.d. features composed of numerical or categorical data. Therefore, we propose to transform such data from a relational domain into a propositional domain, which is the objective of propositionalization [12]. As BCP is a simple and fast-to-implement technique, we envision that features generated by using propositionalization and the data distributions computed naively from these features are good indicators to compute similarities between relational datasets. We propose a naive approach that combines BCP and NB to obtain the data distributions from target datasets. During bottom clause generation, each example is given a bottom clause representation using Progol. In this work, the algorithm only passes through declarations in one cycle. We also set recall to 1,
corresponding to how many times a predicate can appear in the same clause. One cycle and a recall of 1 are enough to generate an attribute-value table, as we neither need an extensive representation of each example nor are we going to learn a hypothesis from the bottom clause. Then, each clause is converted into an input vector (Algorithm 2). After converting all examples, the input can be processed by a propositional learner. Finally, each vector is given as an example to train an NB model and compute CPDs. Given a target domain, the effectiveness of a transfer method might depend on how source and target tasks relate to each other. Most works assume domains are related, but there is no evidence or proof that the chosen pair of source and target datasets is the most suitable for transference. If domains are indeed related and the transfer method can take advantage of it, the performance in the target domain might be improved using transfer. If domains are not related enough, or the transfer method does not explore the relationship well, the performance may fail to improve or even decrease [34]. A common choice in metric transfer learning approaches is to apply the Kullback-Leibler (KL) Divergence [20], presented in Eq. 1, to compare probability distributions. As KL Divergence cannot handle zeros and some probabilities can be very small, close to zero, we directly use the Jensen-Shannon Divergence (JSD) [21] to compare probability distributions between source and target datasets. JSD is smooth, as it compares two distributions $P_S$ and $P_T$ through their mixture:

$$D(P_S \| M_{S,T}) = \sum_{x \in \mathcal{X}} P_S(x) \log \frac{P_S(x)}{M_{S,T}(x)}, \qquad (1)$$

$$JSD(P_S \| P_T) = \frac{D(P_S \| M_{S,T}) + D(P_T \| M_{S,T})}{2}, \qquad (2)$$

where $M_{S,T} = \frac{P_S + P_T}{2}$ and $D(P_S \| M_{S,T})$ is the KL Divergence between distributions, which is presented in Eq. 1. We calculate JSD for each pair of source and target domains. The greater the difference between the two distributions, the greater the JSD. The source data distribution is generated by performing inference in a pre-trained model and capturing the probability of each example being classified as positive. The target data distribution is obtained by NB using BCP. First, the features are generated using BCP. Then, the model is trained so probabilities are captured using the generated discriminant function. Finally, the method returns the most similar source domain. Algorithm 3 presents our proposed method to find the most similar source domain to a given target domain, where fS indicates the method used to compute probabilities in the source domain.
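The divergence computation and the selection step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that the source and target probabilities have already been turned into discrete distributions over a shared support, and the helper histogram (which bins predicted probabilities into equal-width bins) is a hypothetical alignment step that the text does not specify. The names kl, jsd, and most_similar are likewise chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def histogram(probs: Sequence[float], bins: int = 10) -> List[float]:
    """Bin a non-empty list of probabilities in [0, 1] into a normalised
    histogram. Assumed alignment step, not described in the paper."""
    counts = [0] * bins
    for p in probs:
        counts[min(int(p * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl(p: Sequence[float], m: Sequence[float]) -> float:
    """KL divergence D(p || m) as in Eq. 1; terms with p(x) = 0 contribute 0."""
    return sum(px * math.log(px / mx) for px, mx in zip(p, m) if px > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence as in Eq. 2, using the mixture M = (p + q) / 2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2


def most_similar(src_dists: Dict[str, Sequence[float]],
                 tar_dist: Sequence[float]) -> str:
    """Return the source whose distribution has the lowest JSD to the target."""
    div = {name: jsd(dist, tar_dist) for name, dist in src_dists.items()}
    return min(div, key=div.get)


# Hypothetical probabilities, for illustration only.
target = histogram([0.9, 0.8, 0.85, 0.1])
sources = {
    "A": histogram([0.95, 0.75, 0.8, 0.2]),
    "B": histogram([0.1, 0.2, 0.15, 0.3]),
}
best = most_similar(sources, target)
```

Because the mixture is strictly positive wherever either distribution is, the logarithm in kl is always defined, which is the reason the text gives for preferring JSD over a direct KL comparison.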
Algorithm 3: Our proposed method. Inputs are a list of source domains and the new target domain.
Function MOST_SIMILAR(srcDomains, tarDomain):
    div ← {}
    tarProbs ← BCP_NB(tarDomain)
    for srcDomain ∈ srcDomains do
        srcProbs ← fS(srcDomain)
        jsd ← JSD(srcProbs, tarProbs)
        Insert (srcDomain, jsd) in div
    end
    mostSimilar ← get most related domain from div
    return mostSimilar

5 Experimental Results
This section presents the experiments performed to evaluate our proposed method. To do so, we analyze the performance of TransBoostler [16] and TreeBoostler [2], given the most similar target domain to a source domain according to our proposal. We also compare results with RDN-B [26], which learns the target domain from scratch, to evaluate if applying transfer learning impairs performance. For TransBoostler, we take the best results for the pairs of experiments from [16–18] and for the other pairs of experiments presented in this work, regardless of the distance metric used to perform mappings. All algorithms follow the same experimental setup as in [2,16–18]. We set the depth limit of trees to 3, the number of leaves to 8, the number of regression trees to 10, and the maximum number of literals per node to 2. For the theory revision processes of TransBoostler and TreeBoostler, we set the initial potential to -1.8. For training, we sample negative examples at a ratio of 2 negatives to 1 positive. We compare two versions of both algorithms to evaluate results with no theory revision. One version considers only mapping and parameter learning (TransBoostler*), and the complete version applies theory revision (TransBoostler). To evaluate the proposed method, we used four real-world relational datasets commonly used in previous literature [2,6,13,16–18,22]: IMDB, Cora, UWCSE, and Twitter. (1) The IMDB dataset [23] contains information about movies such as movie, actor, director, and genre. The objective is to learn the relation workedunder, which describes which actor has worked for a director. It is divided into five mega-examples, each containing information about four movies. (2) Cora [3] has information about Computer Science publications. It contains information such as author, title, and venue for 1295 citations to 122 papers. The goal is to predict if two venues represent the same conference. It is also divided into five mega-examples.
(3) UWCSE [11] contains information about publications, their authors, project and members, course levels, and more, from the Department of Computer Science and Engineering at the University of Washington (UW-CSE). It is divided into five mega-examples, and the objective is to predict if a student
was advised by a professor by learning the advisedby relation. Finally, (4) Twitter [35] is a dataset of tweets about Belgian soccer matches. It is divided into two independent folds, and the goal is to predict the type of an account, which can be club, fan, or news. Statistics about the datasets are presented in Table 1.

Table 1. Statistics of the datasets used to evaluate our proposal.
Dataset | Number of Constants | Number of Types | Number of Predicates | Number of Positive Examples | Total number of ground literals
IMDB | 297 | 3 | 6 | 382 | 71824
UW-CSE | 914 | 9 | 14 | 113 | 16900
Cora | 2457 | 5 | 10 | 3017 | 152100
Twitter | 273 | 3 | 3 | 282 | 663
We follow [2,16–18] and use conditional log-likelihood (CLL), the area under the ROC curve (AUC ROC), the area under the PR curve (AUC PR) [4], and training time to compare performance. Source data distribution is obtained by training a model using RDN-B. We simulate a transfer learning scenario where models are trained using a reduced set of data. Following the previous literature, we use one fold for training and n-1 folds for testing. All results are averaged over n runs, where n corresponds to the number of mega-examples of each dataset. For every run, a newly learned source model generates the probabilities in the source domain. The same environment used by [16–18] was used for all experiments. After training, the model is used for inference to generate the probability of each example from the test set being classified as positive by the learned model. Target data distribution is obtained from NB using BCP. After obtaining the vector representation for all examples, we can generate the target probability distribution using the learned model. Given the two probability distributions, we use JSD to calculate the divergence between the pairs of datasets. We consider each dataset in turn as the target domain and compare its data distribution with those of the other datasets, which are the candidate sources to transfer from. Results are presented in Table 2. Tables 3, 4, 5, and 6 present the transfer experiments when IMDB, Cora, Twitter, and UWCSE are the target domains, respectively. Results show that, for most experiments, both algorithms perform best when transferring across the most similar pairs of source and target datasets. Given IMDB as the target domain, its most similar source domain is UWCSE, followed by Twitter and Cora. Experiments using TransBoostler and TreeBoostler show that both algorithms perform best when transferring from UWCSE and Twitter, although the results for Cora are very close to those of both experiments.
As can be seen, transference still impairs performance for TransBoostler* and for TreeBoostler when transferring from Cora to IMDB. When transferring to Cora, the most similar source domain is IMDB, which is the dataset with the best results when learning from Cora for TransBoostler. In this case, transfer learning does
not impair performance and improves results for UWCSE → Cora when using TreeBoostler*. However, it still impairs performance for TransBoostler*. When transferring to Twitter, there is a disagreement between Tables 2 and 5. The closest dataset to Twitter according to our method is IMDB, followed by UWCSE and Cora. However, results show that experiment Cora → Twitter has the best performance, followed by UWCSE → Twitter and, finally, IMDB → Twitter. We explain this divergence in detail at the end of this section. Transference still impairs performance for TransBoostler* when transferring to Twitter from IMDB and UWCSE, and there is no improvement when applying theory revision. Finally, the dataset most related to UWCSE is IMDB, followed by Twitter and Cora. In this case, results are very similar for all pairs of experiments, which might indicate that UWCSE benefits from any source domain when using TreeBoostler and TransBoostler. Results show that IMDB and UWCSE seem to be easy datasets to learn, as both algorithms perform best when transferring to them in all experiments. Besides, values for AUC ROC and AUC PR for RDN-B, TreeBoostler, and TransBoostler are the highest when learning IMDB, both from scratch and using transfer. Cora, in contrast, seems to be a problematic dataset to learn. In previous literature [2,6,13,16–18,22], Cora is paired with IMDB as the source domain. In this work, experiments show that transferring from different domains (UWCSE and Twitter) does not increase performance. TreeBoostler has better results than TransBoostler for most experiments, showing that exhaustively searching for mappings can be a better approach than trying to relate predicates by using word embeddings. However, it is a costly transfer learning-based method. For most pairs of experiments, TransBoostler has a good performance and less training time. Finally, it is worth mentioning that BCP is slow.
Due to the number of examples and literals of each dataset, it took hours to generate bottom clauses for each example.

Table 2. Jensen-Shannon divergences between every pair of datasets. Each row represents the dataset as the source domain, i.e., probabilities are generated using RDN-B, while the columns represent datasets as the target domains, i.e., probabilities are generated using BCP+NB.
Source/Target | IMDB | Cora | Twitter | UWCSE
IMDB | 0 | 0.278 | 0.186 | 0.123
Cora | 0.574 | 0 | 0.655 | 0.619
Twitter | 0.258 | 0.337 | 0 | 0.179
UWCSE | 0.257 | 0.318 | 0.388 | 0
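As a small worked example, the ranking of candidate sources for each target can be read off Table 2 programmatically. The sketch below transcribes the divergences from Table 2 into a nested dictionary indexed as jsd_table[source][target]; the variable and function names are chosen only for illustration.

```python
from typing import Dict, List

# Divergences transcribed from Table 2: jsd_table[source][target].
jsd_table: Dict[str, Dict[str, float]] = {
    "IMDB":    {"IMDB": 0.0,   "Cora": 0.278, "Twitter": 0.186, "UWCSE": 0.123},
    "Cora":    {"IMDB": 0.574, "Cora": 0.0,   "Twitter": 0.655, "UWCSE": 0.619},
    "Twitter": {"IMDB": 0.258, "Cora": 0.337, "Twitter": 0.0,   "UWCSE": 0.179},
    "UWCSE":   {"IMDB": 0.257, "Cora": 0.318, "Twitter": 0.388, "UWCSE": 0.0},
}


def rank_sources(target: str) -> List[str]:
    """Order candidate sources for a target from most to least similar."""
    candidates = [s for s in jsd_table if s != target]
    return sorted(candidates, key=lambda s: jsd_table[s][target])


# Ranking of sources for every target domain.
ranking = {target: rank_sources(target) for target in jsd_table}
```

With these values, the first entry of ranking[target] is the source that the proposed method would select for that target.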
Table 3. Performance comparison when given IMDB as the target domain.
Source → Target | Algorithm | CLL | AUC ROC | AUC PR | Runtime (s)
Cora → IMDB | RDN-B | -0.075 | 1.000 | 1.000 | 2.89
Cora → IMDB | TreeBoostler | -0.075 | 0.999 | 0.954 | 4.29
Cora → IMDB | TransBoostler | -0.074 | 1.000 | 1.000 | 4.36
Cora → IMDB | TreeBoostler* | -0.115 | 0.982 | 0.888 | 0.95
Cora → IMDB | TransBoostler* | -0.306 | 0.868 | 0.092 | 1.94
Twitter → IMDB | RDN-B | -0.075 | 1.000 | 1.000 | 2.89
Twitter → IMDB | TreeBoostler | -0.074 | 1.000 | 1.000 | 4.73
Twitter → IMDB | TransBoostler | -0.074 | 1.000 | 1.000 | 6.09
Twitter → IMDB | TreeBoostler* | -0.079 | 1.000 | 1.000 | 1.584
Twitter → IMDB | TransBoostler* | -0.355 | 0.547 | 0.029 | 0.71
UWCSE → IMDB | RDN-B | -0.075 | 1.000 | 1.000 | 2.89
UWCSE → IMDB | TreeBoostler | -0.072 | 1.000 | 1.000 | 4.77
UWCSE → IMDB | TransBoostler | -0.067 | 1.000 | 1.000 | 3.92
UWCSE → IMDB | TreeBoostler* | -0.079 | 1.000 | 1.000 | 1.68
UWCSE → IMDB | TransBoostler* | -0.186 | 0.963 | 0.268 | 0.91

Table 4. Performance comparison when given Cora as the target domain.
Source → Target | Algorithm | CLL | AUC ROC | AUC PR | Runtime (s)
IMDB → Cora | RDN-B | -0.693 | 0.558 | 0.426 | 76.97
IMDB → Cora | TreeBoostler | -0.659 | 0.606 | 0.530 | 45.74
IMDB → Cora | TransBoostler | -0.668 | 0.600 | 0.463 | 54.44
IMDB → Cora | TreeBoostler* | -0.659 | 0.574 | 0.518 | 1.63
IMDB → Cora | TransBoostler* | -0.699 | 0.500 | 0.379 | 2.20
Twitter → Cora | RDN-B | -0.693 | 0.558 | 0.426 | 76.97
Twitter → Cora | TreeBoostler | -0.690 | 0.558 | 0.425 | 132.23
Twitter → Cora | TransBoostler | -0.689 | 0.559 | 0.428 | 20.13
Twitter → Cora | TreeBoostler* | -0.691 | 0.558 | 0.426 | 67.54
Twitter → Cora | TransBoostler* | -0.699 | 0.500 | 0.379 | 1.41
UWCSE → Cora | RDN-B | -0.693 | 0.558 | 0.426 | 76.97
UWCSE → Cora | TreeBoostler | -0.665 | 0.615 | 0.471 | 132.51
UWCSE → Cora | TransBoostler | -0.694 | 0.543 | 0.411 | 10.44
UWCSE → Cora | TreeBoostler* | -0.665 | 0.615 | 0.471 | 67.79
UWCSE → Cora | TransBoostler* | -0.699 | 0.500 | 0.379 | 1.52

Table 5. Performance comparison when given Twitter as the target domain.
Source → Target | Algorithm | CLL | AUC ROC | AUC PR | Runtime (s)
Cora → Twitter | RDN-B | -0.122 | 0.990 | 0.347 | 23.45
Cora → Twitter | TreeBoostler | -0.116 | 0.993 | 0.371 | 54.80
Cora → Twitter | TransBoostler | -0.302 | 0.932 | 0.061 | 38.58
Cora → Twitter | TreeBoostler* | -0.124 | 0.990 | 0.331 | 14.49
Cora → Twitter | TransBoostler* | -0.305 | 0.933 | 0.062 | 1.97
IMDB → Twitter | RDN-B | -0.122 | 0.990 | 0.347 | 23.45
IMDB → Twitter | TreeBoostler | -0.110 | 0.994 | 0.399 | 63.09
IMDB → Twitter | TransBoostler | -0.374 | 0.611 | 0.009 | 46.48
IMDB → Twitter | TreeBoostler* | -0.115 | 0.993 | 0.365 | 14.34
IMDB → Twitter | TransBoostler* | -0.371 | 0.605 | 0.009 | 1.17
UWCSE → Twitter | RDN-B | -0.122 | 0.990 | 0.347 | 23.45
UWCSE → Twitter | TreeBoostler | -0.114 | 0.994 | 0.398 | 53.35
UWCSE → Twitter | TransBoostler | -0.306 | 0.876 | 0.031 | 22.41
UWCSE → Twitter | TreeBoostler* | -0.125 | 0.989 | 0.364 | 14.19
UWCSE → Twitter | TransBoostler* | -0.310 | 0.876 | 0.031 | 1.90

Table 6. Performance comparison when given UWCSE as the target domain.
Source → Target | Algorithm | CLL | AUC ROC | AUC PR | Runtime (s)
Cora → UWCSE | RDN-B | -0.257 | 0.940 | 0.282 | 8.74
Cora → UWCSE | TreeBoostler | -0.238 | 0.941 | 0.302 | 11.98
Cora → UWCSE | TransBoostler | -0.270 | 0.941 | 0.302 | 5.98
Cora → UWCSE | TreeBoostler* | -0.257 | 0.940 | 0.288 | 5.53
Cora → UWCSE | TransBoostler* | -0.394 | 0.533 | 0.030 | 0.74
IMDB → UWCSE | RDN-B | -0.257 | 0.940 | 0.282 | 8.74
IMDB → UWCSE | TreeBoostler | -0.247 | 0.939 | 0.302 | 4.78
IMDB → UWCSE | TransBoostler | -0.253 | 0.939 | 0.298 | 6.79
IMDB → UWCSE | TreeBoostler* | -0.267 | 0.930 | 0.293 | 0.63
IMDB → UWCSE | TransBoostler* | -0.288 | 0.906 | 0.131 | 1.19
Twitter → UWCSE | RDN-B | -0.257 | 0.940 | 0.282 | 8.74
Twitter → UWCSE | TreeBoostler | -0.254 | 0.935 | 0.267 | 11.40
Twitter → UWCSE | TransBoostler | -0.263 | 0.938 | 0.295 | 11.69
Twitter → UWCSE | TreeBoostler* | -0.262 | 0.935 | 0.291 | 5.39
Twitter → UWCSE | TransBoostler* | -0.364 | 0.500 | 0.028 | 0.67

Analysing Twitter Dataset. As can be seen from our results, our method does not work when transferring to Twitter. The likely reason is that Twitter is the only dataset that needs recursion during the learning process. The other datasets are trained without recursion, which might explain the inconsistency when using Twitter as the target domain. To perform a better analysis, we compute the JSD for all experiments having Twitter as the target domain for distributions in which both source and target domains are generated by the same algorithm. Table 7 presents the values of JSD for each pair of experiments. The first row indicates the results when pairs of distributions are generated using RDN-B, while the second row shows results when generating distributions using BCP+NB. Applying recursion adds a bidirectional dependency between classes and attributes; thus, Table 7 shows discrepant values for all experiments using Twitter as the target domain.
Table 7. Jensen-Shannon divergences for Twitter as the target domain when generating probabilities using RDN-B and NB.
Method | Cora → Twitter | IMDB → Twitter | UWCSE → Twitter
RDN-B | 1.630 | 0.452 | 0.801
BCP + NB | 0.055 | 0.648 | 0.179

6 Conclusion
This paper proposes a method for choosing proper transfer learning datasets that relies on probabilistic representations of relational databases and distributions learned by models. The central assumption is that more related domains will result in better performance when applying transfer learning. To compare the source and target distributions, we propose an approach using bottom clause propositionalization and Naïve Bayes to generate the target domain data distribution. To evaluate our method, we use two transfer learning-based algorithms: TransBoostler and TreeBoostler. We also compare results with learning the target domain from scratch to analyze if transfer learning impairs performance. Experimental results show that both algorithms reach their best performance when transferring between the most similar pairs of source and target domains, except when Twitter is the target domain, which might be due to the use of recursion, which is not used for training the other datasets. In general, TreeBoostler has shown better performance than TransBoostler, which might indicate that recursively searching for mappings is a better transfer strategy. However, it is a costly algorithm, and TransBoostler can achieve good results in less training time. For some experiments, transfer learning still impairs performance compared to learning from scratch. How to define a metric to decide whether or not to transfer remains a topic for future investigation. The simplest and most naive solution would be the use of a threshold, for example. Moreover, a deeper investigation into how to represent such datasets is needed. It is also important to investigate if results are due to how algorithms take advantage of previously learned knowledge. Another possible future direction is to run experiments on other datasets commonly used in the literature. We also plan to investigate propositional transfer learning on propositionalized data.
References 1. Ahmadvand, M., Tahmoresnezhad, J.: Metric transfer learning via geometric knowledge embedding. Appl. Intell. 51, 921–934 (2021) 2. Azevedo Santos, R., Paes, A., Zaverucha, G.: Transfer learning by mapping and revising boosted relational dependency networks. Mach. Learn. 109, 1435–1463 (2020) 3. Bilenko, M., Mooney, R.J.: Adaptive duplicate detection using learnable string similarity measures. In: Proceeding of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2003, pp. 39–48. Association for Computing Machinery, New York (2003)
Select First, Transfer Later
75
4. Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 233–240. Association for Computing Machinery, New York (2006) 5. De Raedt, L.: Logical and relational learning. In: Zaverucha, G., da Costa, A.L. (eds.) Advances in Artificial Intelligence - SBIA 2008, pp. 1–1. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-68856-3 6. de Figueiredo, L.F., Paes, A., Zaverucha, G.: Transfer learning for boosted relational dependency networks through genetic algorithm. In: Katzouris, N., Artikis, A. (eds.) ILP 2021. LNCS, vol. 13191, pp. 125–139. Springer, Cham (2022). https:// doi.org/10.1007/978-3-030-97454-1 9 7. Flach, P., Lachiche, N.: 1BC: a first-order bayesian classifier. In: Dˇzeroski, S., Flach, P. (eds.) ILP 1999. LNCS (LNAI), vol. 1634, pp. 92–103. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48751-4 10 8. Fran¸ca, M.V., Zaverucha, G., D’avila Garcez, A.S.: Fast relational learning using bottom clause propositionalization with artificial neural networks. Mach. Learn. 94(1), 81–104 (2014) 9. Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning. MIT press, Cambridge (2007) 10. Han, X., Huang, Z., An, B., Bai, J.: Adaptive transfer learning on graph neural networks, pp. 565–574. Association for Computing Machinery, New York (2021) 11. Khosravi, H., Schulte, O., Hu, J., Gao, T.: Learning compact markov logic networks with decision trees. Mach. Learn. 89(3), 257–277 (2012) 12. Kramer, S., Lavraˇc, N., Flach, P.: Propositionalization Approaches to Relational Data Mining, pp. 262–291. Springer, Heidelberg (2001) 13. Kumaraswamy, R., Odom, P., Kersting, K., Leake, D., Natarajan, S.: Transfer learning via relational type matching. In: 2015 IEEE International Conference on Data Mining, pp. 811–816. IEEE (2015) 14. Lachiche, N., Flach, P.A.: 1BC2: a true first-order bayesian classifier. In: Matwin, S., Sammut, C. 
(eds.) ILP 2002. LNCS (LNAI), vol. 2583, pp. 133–148. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36468-4 9 15. Lee, C.K.: Transfer learning with graph neural networks for optoelectronic properties of conjugated oligomers. J. Chem. Phys. 154(2), 024906 (2021) 16. Luca, T., Paes, A., Zaverucha, G.: Mapping across relational domains for transfer learning with word embeddings-based similarity. In: Katzouris, N., Artikis, A. (eds.) International Conference on Inductive Logic Programming, pp. 167–182. Springer, Heidelberg (2021). https://doi.org/10.1007/978-3-030-97454-1 12 17. Luca, T., Paes, A., Zaverucha, G.: Combining word embeddings-based similarity measures for transfer learning across relational domains. In: International Conference on Inductive Logic Programming. Springer, Heidelberg (2023) 18. Luca, T., Paes, A., Zaverucha, G.: Word embeddings-based transfer learning for boosted relational dependency networks. Mach. Learn. 1–34 (2023) 19. Luo, Y., Wen, Y., Duan, L.Y., Tao, D.: Transfer metric learning: algorithms, applications and outlooks. arXiv preprint arXiv:1810.03944 (2018) 20. MacKay, D.J.: Information Theory, Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003) 21. Men´endez, M., Pardo, J., Pardo, L., Pardo, M.: The Jensen-Shannon divergence. J. Franklin Inst. 334(2), 307–318 (1997) 22. Mihalkova, L., Huynh, T., Mooney, R.J.: Mapping and revising markov logic networks for transfer learning. In: AAAI, vol. 7, pp. 608–614 (2007)
76
T. Luca et al.
23. Mihalkova, L., Mooney, R.J.: Bottom-up learning of markov logic network structure. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 625–632. Association for Computing Machinery, New York (2007) 24. Muggleton, S.: Inductive logic programming. New Gener. Comput. 8, 295–318 (1991) 25. Muggleton, S.H.: Inverse entailment and progol. New Gener. Comput. 13(3&4), 245–286 (1995) 26. Natarajan, S., Khot, T., Kersting, K., Gutmann, B., Shavlik, J.: Gradient-based boosting for statistical relational learning: the relational dependency network case. Mach. Learn. 86(1), 25–56 (2012) 27. Neville, J., Jensen, D.: Relational dependency networks. J. Mach. Learn. Res. 8(3) (2007) 28. Pan, J.: Review of metric learning with transfer learning. In: AIP Conference Proceedings, vol. 1864. AIP Publishing (2017) 29. Pan, S.J., Kwok, J.T., Yang, Q., et al.: Transfer learning via dimensionality reduction. In: AAAI, vol. 8, pp. 677–682 (2008) 30. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009) 31. Rish, I., et al.: An empirical study of the naive bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 3, pp. 41–46 (2001) 32. Tamaddoni-Nezhad, A., Muggleton, S.: The lattice structure and refinement operators for the hypothesis space bounded by a bottom clause. Mach. Learn. 76, 37–72 (2009) 33. Tang, X., Li, Y., Sun, Y., Yao, H., Mitra, P., Wang, S.: Transferring robustness for graph neural network against poisoning attacks, pp. 600–608. Association for Computing Machinery, New York (2020) 34. Torrey, L., Shavlik, J.: Transfer learning. In: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pp. 242– 264. IGI global (2010) 35. Van Haaren, J., Kolobov, A., Davis, J.: Todtler: two-order-deep transfer learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29 (2015) 36. 
Wan, C., Pan, R., Li, J.: Bi-weighting domain adaptation for cross-language text classification. In: Twenty-Second International Joint Conference on Artificial Intelligence. Citeseer (2011) 37. Wrobel, S.: First order theory refinement. Adv. Inductive Logic Program. 32, 14–33 (1996) 38. Yang, Q., Zhang, Y., Dai, W., Pan, S.J.: Transfer Learning. Cambridge University Press, Cambridge (2020)
GNN Based Extraction of Minimal Unsatisfiable Subsets
Sota Moriyama1,3(B), Koji Watanabe2,3, and Katsumi Inoue1,2,3
1 Tokyo Institute of Technology, Tokyo, Japan
2 The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan
3 National Institute of Informatics, Tokyo, Japan
{sotam,inoue,kojiwatanabe}@nii.ac.jp
Abstract. In Boolean Satisfiability (SAT), Minimal Unsatisfiable Subsets (MUSes) are unsatisfiable subsets of constraints that serve as explanations for the unsatisfiability which, as a result, have been used in various applications. Although various systematic algorithms for the extraction of MUSes have been proposed, few heuristic methods have been studied, as the process of designing efficient heuristics requires extensive experience and expertise. In this research, we propose the first trainable heuristic based on Graph Neural Networks (GNNs). We design a new network structure along with loss functions and learning strategies specifically tuned to learn the process of MUS extraction, which we implement in a model called GNN-MUS. Furthermore, we introduce a new algorithm called NeuroMUSX that uses GNN-MUS as a heuristic and combines it with other systematic search methods to make the extraction process more efficient. We conduct experiments to compare our proposed method with existing methods on the MUS Track of the 2011 SAT Competition. From the results, NeuroMUSX is shown to achieve significantly better performance across a wide range of problem instances. In addition, training NeuroMUSX on specific instances of a class improves the algorithm's performance against other problems in the same class, highlighting the advantages of the learnable approach. Overall, these results underscore the potential of using simple GNN architectures to drastically improve the procedures for extracting minimal subsets.
Keywords: Boolean Satisfiability Problem · Unsatisfiability · Graph Neural Networks · Minimal Unsatisfiable Subsets
1 Introduction
The Boolean Satisfiability Problem (SAT) was the first problem proven to be NP-complete [4]. Since then, many powerful solvers utilizing numerous techniques and heuristics have been proposed to rapidly compute models for given SAT instances. Generally, SAT solvers take propositional formulas in Conjunctive Normal Form (CNF) as input and output at least one model if the given instance is satisfiable. For unsatisfiable instances, on the other hand, a solver must prove that no model satisfying the formula exists, a problem that is coNP-complete.
The code for this paper is available at https://github.com/sotam2369/NeuroMUSX.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 77–92, 2023. https://doi.org/10.1007/978-3-031-49299-0_6
In Inductive Logic Programming (ILP), unsatisfiability is essential to guaranteeing the completeness and consistency of a hypothesis [5]. More direct applications of SAT techniques in ILP can be found in [1,14]. In related applications, minimal unsatisfiable subsets have been used for computing abductive explanations as well as minimal abductive explanations [12,15].
In SAT, a remarkable amount of work has been done over the past few decades on algorithms for computing minimal explanations of unsatisfiability [8,10,16]. These explanations are referred to as Minimal Unsatisfiable Subsets (MUSes) and have been used in a variety of practical applications [19]. Deciding whether a set of clauses is a Minimal Unsatisfiable Subset is known to be D^P-complete [21]. Therefore, the problem of finding a MUS is harder than merely determining the unsatisfiability of a problem. However, the process of MUS extraction typically relies on systematic methods [2], and not many heuristic methods have been proposed, as designing heuristics requires extensive experience and expertise.
To overcome such difficulties, we propose the first trainable heuristic based on deep learning techniques. We introduce a novel GNN model called GNN-MUS, which has network structures, loss functions, and learning strategies specifically tuned to learn the process of MUS extraction. We then propose a new algorithm called NeuroMUSX that uses GNN-MUS as a heuristic, combining it with other systematic search methods to make the extraction process more efficient.
One of the main differences between GNN-MUS and existing neuro-based solvers is that GNN-MUS focuses on predicting whether each clause is included in the MUS or not instead of predicting the satisfiability of the given problem. Another important difference is that we express the positive and negative literals included in each clause as weighted edges instead of creating separate nodes for both literal types. To handle graphs with weighted edges, we use network structures based on Graph Attention Networks (GATs) [24], a modern state-of-the-art (SOTA) GNN model that has been shown to perform well on numerous tasks.
To demonstrate the effectiveness of our approach, we compare the results of NeuroMUSX with existing algorithms on the MUS Track of the 2011 SAT Competition (http://www.satcompetition.org/2011/), the most recent SAT Competition that features this track. Prior to testing, we trained NeuroMUSX on a dataset consisting of 20,000 random SAT instances. From the results, we found that using GNNs as a heuristic effectively leads the algorithm to smaller-sized MUSes. This not only helps to reduce the amount of time needed for computing MUSes, but also allows the explanations to be more compact and interpretable. Moreover, to fully exhibit the advantages of the learnable approach, we train NeuroMUSX on a single instance of each problem set before testing the algorithm on the entire set. The results showed that using the specifically trained NeuroMUSX further improves the performance of the extraction, both in terms of computation time and MUS size. These results show the potential of using simple GNN architectures, such as that of GNN-MUS, to find smaller minimal subsets in less time.
The sections of this paper are organized as follows: In Sect. 2, related works regarding neuro-based SAT solvers and MUSes are shown. In Sect. 3, basic explanations of SAT problems, MUS extraction algorithms, and GNNs are given. In Sect. 4, the architecture of GNN-MUS, along with the new algorithm, NeuroMUSX, is explained in detail. In Sect. 5, the results of the experiments done using NeuroMUSX and GNN-MUS are shown. Finally, in Sect. 6, we conclude this paper and touch on some possible future directions.
2 Related Work
2.1 Neuro-Based SAT Solvers
In recent years, many deep-learning-based SAT solvers built on Graph Neural Networks (GNNs) have been proposed [9,22,23,25]. NeuroSAT [23] was one of the first models built to attempt SAT solving with purely deep learning techniques. The main objective of NeuroSAT was to output a single-bit prediction of whether the given problem was satisfiable or not. Additionally, when NeuroSAT predicted the instance as satisfiable, a satisfying model for the instance could be decoded. The training was done in an end-to-end fashion using datasets consisting of random SAT instances and was shown to be effective on them.
However, end-to-end models have the limitation that they do not guarantee a correct output. Therefore, some work focused on using the predictions given by the GNNs as heuristics to improve the performance of existing SAT solvers. For example, NeuroCore [22] periodically replaces the variable activity scores of SAT solvers with predictions of how likely the variables are to appear in an unsatisfiable core. Another example is NeuroGlue [9], a network that predicts glue variables used in the Glucose series of solvers. Instead of the supervised learning used to train the networks in the approaches above, reinforcement learning is another studied approach for learning heuristics for effective SAT solving. An example of this is GQSAT [17], a branching heuristic trained with value-based reinforcement learning.
Moreover, even though end-to-end neuro-based solvers are able to predict unsatisfiability at a high rate, there is no known method to output proofs of unsatisfiability with these solvers. There have been works aiming to find variables included in unsatisfiable cores [22], but to output proofs, these approaches must rely on known solvers.
2.2 Minimal Unsatisfiable Subsets
MUSes, which are minimal explanations of unsatisfiability, have found uses in multiple applications such as bounded model checking, debugging of declarative specifications, and unsatisfiability-based maximum satisfiability [19]. Most MUS extraction algorithms can be classified into three approaches: insertion [8], deletion [8], and dichotomic [10,16]. Insertion-based algorithms follow a bottom-up (constructive) approach, constructing a MUS by iteratively adding clauses to an under-approximation. Deletion-based algorithms, on the other hand, follow a top-down (destructive) approach, constructing a MUS by iteratively removing clauses from an over-approximation. Despite the theoretical interest in dichotomic algorithms, most algorithms are based on either constructive or destructive approaches.
To speed up the process of MUS extraction, many systematic algorithms have been proposed. For example, Clause Set Trimming uses the Unsatisfiable Cores (not guaranteed to be minimal) extracted from the proofs given by SAT solvers as the initial input of the algorithm, which in some cases leads to a large number of clauses being removed [26]. However, these algorithms can still be too costly in practical settings. For this reason, there has been interest in Reduced Unsatisfiable Subsets (RUSes), which serve as heuristic approximations of MUSes, as in AMUSE [20] and AOMUS [7]. Compared with these previous approaches, this work focuses on utilizing GNNs as a heuristic to guide existing algorithms to a better answer in less time.
3 Background
3.1 SAT Problem
In SAT, a formula of propositional logic consisting of variables, negations (¬), conjunctions (∧), and disjunctions (∨) is encoded into CNF. A CNF formula is a conjunction of multiple subformulas called clauses, where each clause is a disjunction of variables or their negations, called literals. Each variable can be assigned a logical value of 0 (false) or 1 (true), and the formula is satisfiable if and only if there exists an assignment where at least one literal in each clause is mapped to true. Given a CNF formula F, a SAT solver is expected to output an assignment for each variable that satisfies the formula if the formula is satisfiable. On the other hand, the solver is expected to output a proof of unsatisfiability if the formula is unsatisfiable. These proofs can then be used to extract an Unsatisfiable Core (not guaranteed to be minimal) of the problem.
3.2 MUS Extraction
MUSes are specific subsets of the original formula, where every proper subset of the MUS is satisfiable. The formal definition of a MUS is as follows.
Definition 1 (MUS). Let c be a clause and F a set of clauses. M ⊆ F is a Minimal Unsatisfiable Subset (MUS) of F iff M is unsatisfiable and ∀c ∈ M, M\{c} is satisfiable.
The main goal of MUS extraction algorithms is to find and output a single MUS for a given formula. To accomplish this, most practical algorithms for MUS extraction use transition clauses, whose definition is given below [2,8].
Definition 2 (Transition Clause). Let F be a set of clauses and c ∈ F a clause. If F is unsatisfiable and F\{c} is satisfiable, then c is a transition clause.
Lemma 1. Let c be a transition clause of CNF Formula F. Then, c is included in every MUS of F.
Proof. F\{c} is satisfiable, so every subset of F\{c} is satisfiable as well. Since a MUS of F is an unsatisfiable subset of F, it cannot be a subset of F\{c}. Hence, any MUS of F contains c.
One of the most basic deletion-based algorithms for MUS extraction is shown in Algorithm 1. Here, SAT(F) returns the satisfiability of the formula F, and GetUnsatisfiableCore(F) extracts the unsatisfiable core from the proofs, given that F is unsatisfiable. This algorithm iteratively removes non-transition clauses from the initial formula until it reaches a MUS. Additionally, it contains an optimization technique called clause set trimming, where the unsatisfiable core is used as the starting formula instead of the original formula.
Algorithm 1: MUSX (Deletion-based MUS Extraction with Clause Set Trimming)
Input: Unsatisfiable CNF Formula F
Output: MUS M
  M ← GetUnsatisfiableCore(F)    // clause set trimming
  for each c_i ∈ M do
    if not SAT(M\{c_i}) then
      M ← M\{c_i}
    end if
  end for
  return M
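The deletion loop of Algorithm 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not part of the paper: clauses are frozensets of DIMACS-style integer literals, the SAT oracle is a brute-force enumeration (`is_sat`, a hypothetical stand-in for a real solver), and clause set trimming is omitted because the brute-force oracle produces no proofs from which a core could be extracted.

```python
from itertools import product


def is_sat(clauses):
    """Brute-force satisfiability check; a stand-in for a real SAT solver."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        # A clause is satisfied if at least one of its literals is true.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def musx(formula):
    """Deletion-based MUS extraction (Algorithm 1 without clause set trimming)."""
    if is_sat(formula):
        raise ValueError("the input formula must be unsatisfiable")
    mus = list(formula)
    for clause in list(mus):
        rest = [c for c in mus if c is not clause]
        # If the set stays unsatisfiable without this clause, the clause is
        # not a transition clause of the current set and can be dropped.
        if not is_sat(rest):
            mus = rest
    return mus


# (x1) AND (NOT x1 OR x2) AND (NOT x2) AND (x3): the last clause is redundant.
example = [frozenset({1}), frozenset({-1, 2}), frozenset({-2}), frozenset({3})]
example_mus = musx(example)
```

Each loop iteration costs one oracle call, which is why starting from a smaller unsatisfiable core reduces the total oracle time.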
3.3 SAT Solving with GNNs and GATs
When it comes to handling SAT with deep-learning techniques, GNNs are considered the norm. In most neuro-based SAT solvers, the given CNF formulas are first converted into bipartite graphs before being passed through the GNN model. The conversion treats all literals (both positive and negative) and all clauses as individual nodes and connects each literal node to the nodes of the clauses that contain it with undirected edges. This procedure makes the graph and the original CNF formula interchangeable, guaranteeing that no information is lost during the conversion. An example of such a graph is shown in Fig. 1a. However, in our research, we do not split the positive and negative literal nodes and instead associate two types of edges with each variable, which correspond to the positive and negative literals appearing in a clause, as shown in Fig. 1b.
Fig. 1. Graph Representations
For the architectures of the models, Message-passing Neural Networks [6] are mainly used. This network allows information to be transferred between the neighbouring literal and clause nodes, leading to an abstract representation of the overall graph. Message-passing Neural Networks are represented in the following manner.
$$x_i^{k+1} = u\left(x_i^k, \sum_{j \in N(i)} h(x_j^k)\right)$$
Here, x_i^k corresponds to the features of node i at layer k, while N(i) corresponds to the neighbour nodes of i. Furthermore, h is a transition function for updating node representations, and u is an update function specifically used in message-passing networks [18].
GATs are another popular architecture used for learning graphs. The GAT layers compute specific weights for each neighbouring node in order to get information from important nodes, which leads to a more general understanding of the graph and better overall accuracy. GAT layers are represented in the following manner.
$$x_i^{k+1} = \sum_{j \in N(i)} e_{ij} \, h(x_i^k, x_j^k)$$
Here, e_ij corresponds to the coefficient computed for the neighbouring node j of node i. There have been some studies using GATs for SAT solving [3], but most architectures rely on Message-passing Neural Networks with some type of recurrent network, such as Long Short-Term Memory (LSTM) cells [11,23].
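The two update rules above can be made concrete with a small pure-Python sketch on scalar node features. This is only an illustration under stated assumptions: the functions `h`, `u`, and `score` are hypothetical placeholders for learned components, and normalizing the coefficients e_ij with a softmax over the neighbours of i is the usual GAT convention rather than something stated in this paper.

```python
import math


def message_passing_step(features, neighbours, h, u):
    """One layer of x_i^{k+1} = u(x_i^k, sum_{j in N(i)} h(x_j^k))."""
    return {
        i: u(x_i, sum(h(features[j]) for j in neighbours[i]))
        for i, x_i in features.items()
    }


def attention_step(features, neighbours, score, h):
    """One layer of x_i^{k+1} = sum_{j in N(i)} e_ij * h(x_i^k, x_j^k).

    Assumes every node has at least one neighbour. The coefficients e_ij
    are obtained with a softmax over the raw scores of the neighbours.
    """
    updated = {}
    for i, x_i in features.items():
        raw = [score(x_i, features[j]) for j in neighbours[i]]
        peak = max(raw)  # subtract the maximum for numerical stability
        weights = [math.exp(r - peak) for r in raw]
        total = sum(weights)
        updated[i] = sum(
            (w / total) * h(x_i, features[j])
            for w, j in zip(weights, neighbours[i])
        )
    return updated


# Tiny bipartite graph: variables v1, v2 and clauses c1, c2.
neighbours = {"v1": ["c1", "c2"], "v2": ["c1"], "c1": ["v1", "v2"], "c2": ["v1"]}
features = {"v1": 1.0, "v2": 3.0, "c1": 0.0, "c2": 0.0}

summed = message_passing_step(features, neighbours,
                              h=lambda x: 2 * x, u=lambda x, m: x + m)
attended = attention_step(features, neighbours,
                          score=lambda a, b: 0.0, h=lambda a, b: b)
```

With a constant score function the attention weights are uniform, so the attention step reduces to averaging the neighbours; a learned score function is what lets a GAT emphasize some neighbours over others.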
4 MUS Prediction with GNNs
4.1 GNN-MUS
Inspired by prior research, in GNN-MUS, input CNF formulas are first converted to a weighted undirected bipartite graph G = (C, V, E, W). Here, C, V, E, and W correspond to the set of clauses, variables, edges, and weights of edges, respectively. An edge is established between a variable node and a clause node when a literal corresponding to the variable is included in the clause. The weight of the edge is set to w_pos if the included literal is positive and w_neg if it is negative. Giving differently weighted edges for positive and negative literals allows the graph and original CNF formula to be interchangeable, as in prior research.
The objective of GNN-MUS is to learn the following two outputs from the graphs: (1) the estimate score of each clause being included in the MUS (1 when included), and (2) the estimate score of the input instance being satisfiable (1 when satisfiable). The output (2) is included solely to compare GNN-MUS's performance with existing neuro-based SAT solvers. However, when training (2), simultaneously training (1) as well is expected to have a positive impact on the accuracy of (2). Thus, we set up loss functions to allow simultaneous training of the two.
The architecture of GNN-MUS is shown in Fig. 2.
Fig. 2. Architecture of GNN-MUS. Dim: Dimension of the features of each node, N: Number of GAT Layers, D_in: Dimension of the initial features, D_H: Dimension of the hidden features
To handle weighted edges, multiple GAT layers are used in GNN-MUS. In the first GAT layer GAT_0, the dimensions of the node features are expanded from D_in to D_H. Next, the graph is passed through N GAT layers GAT_k (k = 1, 2, ..., N), which process the features without changing the dimensions. After being passed through each GAT layer, the LeakyReLU activation function is applied. The output x_i^k is given by the equations below.
$$x_i^k = \mathrm{LeakyReLU}(\mathrm{GAT}_k(x_i^{k-1}))$$
$$\mathrm{LeakyReLU}(x) = \max(0, x) + \alpha \min(0, x)$$
Finally, the graph is passed through two different GAT layers, GAT_mus and GAT_sat, each reducing the dimensions of each node feature to 1. For prediction (1), after the graph is passed through GAT_mus, the sigmoid activation function is applied to output an estimate score (between 0 and 1) for each clause being included in the MUS. For prediction (2), after the graph is passed through GAT_sat, the mean of all nodes is taken after the sigmoid function is applied to output an estimate score (between 0 and 1) of the input being satisfiable. The estimates are denoted ŷ^mus ∈ R^|C| and ŷ^sat ∈ R, respectively.
$$\hat{y}_i^{mus} = \mathrm{Sigmoid}(\mathrm{GAT}_{mus}(x_i^N))$$
$$\hat{y}^{sat} = \frac{1}{m} \sum_{i \in C} \mathrm{Sigmoid}(\mathrm{GAT}_{sat}(x_i^N))$$
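The graph construction described at the start of this section can be sketched in a few lines of Python. The concrete weight values (w_pos = 1.0, w_neg = -1.0), the function names, and the dictionary-based edge representation are assumptions made for illustration; the paper does not specify them. The sketch also assumes that no clause contains a variable in both polarities, since such a clause would need two edges between the same pair of nodes.

```python
def cnf_to_weighted_graph(clauses, w_pos=1.0, w_neg=-1.0):
    """Build G = (C, V, E, W) with one node per clause and per variable.

    Edges are stored as a dict mapping (variable, clause_index) to a weight,
    so the dict holds both E (its keys) and W (its values).
    """
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    edges = {}
    for clause_index, clause in enumerate(clauses):
        for lit in clause:
            edges[(abs(lit), clause_index)] = w_pos if lit > 0 else w_neg
    return list(range(len(clauses))), variables, edges


def graph_to_cnf(num_clauses, edges, w_pos=1.0):
    """Invert the conversion, showing that no information is lost."""
    clauses = [set() for _ in range(num_clauses)]
    for (variable, clause_index), weight in edges.items():
        clauses[clause_index].add(variable if weight == w_pos else -variable)
    return [frozenset(clause) for clause in clauses]


# (x1 OR NOT x2) AND (x2) AND (NOT x1)
formula = [frozenset({1, -2}), frozenset({2}), frozenset({-1})]
clause_nodes, variable_nodes, weighted_edges = cnf_to_weighted_graph(formula)
recovered = graph_to_cnf(len(clause_nodes), weighted_edges)
```

The round trip from formula to graph and back is what the text means by the graph and the CNF formula being interchangeable.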
To train the models, we use a custom loss function based on the Binary Cross Entropy (BCE) loss. The loss function is split into two parts: one for MUS prediction and one for satisfiability prediction. Both use BCELoss to calculate the loss between the prediction and the ground truth. Furthermore, the MUS loss is divided by the number of clauses m in the input formula for normalization. Without this, the model will overfit to problems with the largest number of clauses. The loss function is shown in the equations below.
$$\mathrm{Loss}(\hat{y}, y) = \lambda_1 \frac{1}{m} \sum_{i \in C} \mathrm{BCELoss}(\hat{y}_i^{mus}, y_i^{mus}) + \lambda_2 \, \mathrm{BCELoss}(\hat{y}^{sat}, y^{sat})$$
$$\mathrm{BCELoss}(\hat{y}, y) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$
Here, λ1 and λ2 are hyperparameters for controlling what the model focuses on during training.
4.2 GNN Based Clause Set Trimming
In this section, we introduce a new algorithm called NeuroMUSX that uses the predictions obtained from GNN-MUS as a heuristic, as shown in Algorithm 2. Here, GNN(F, T) returns the MUS prediction given by GNN-MUS for the formula F, with the threshold for deeming clauses as not included in the MUS set to T. MinimumScore(M) returns the smallest estimate score output by GNN-MUS for the current unsatisfiable core.
Algorithm 2: NeuroMUSX (Deletion-based MUS Extraction with GNN-based Clause Set Trimming)
Input: Unsatisfiable CNF Formula F
Output: MUS M
  T ← 0
  F ← GNN(F, T)
  while not SAT(F) do
    M ← UnsatisfiableCore(F)
    T ← MinimumScore(M)
    F ← GNN(F, T)
  end while
  for each c_i ∈ M do
    if not SAT(M\{c_i}) then
      M ← M\{c_i}
    end if
  end for
  return M
Fig. 3. Example of the extraction process of NeuroMUSX
The algorithm builds on MUSX, which is explained in Sect. 3, and uses the predictions given by GNN-MUS to lead the solver to output an unsatisfiable core that contains fewer clauses, or a smaller MUS. This decreases the number of calls to the SAT solver, effectively reducing the computation time. Due to the nature of GNNs, the predicted clause sets are not guaranteed to be unsatisfiable. Therefore, we gradually increase the threshold for the scores while testing the satisfiability with a SAT solver. In this method, the SAT solver may output the same unsatisfiable core if all clauses in the previously output core are intact. To avoid this, we set the threshold so as to remove the clause with the smallest score in the previous core. This forces the solver to output an unsatisfiable core different from the previous one, ultimately guaranteeing that each call yields a different core.
4.3 MUS Extraction Algorithm Based on GNN-MUS
Figure 1b shows the process of MUS prediction using GNN-MUS, with the input being the simplest unsatisfiable formula consisting of 2 variables and 3 clauses. As shown in the example, the input formula is first converted into a graph, which is then passed through GNN-MUS. In the end, predicted scores for each clause of how likely it is to appear in the MUS are output. In this example, all clauses are included in the MUS, and thus GNN-MUS outputs a high score for all clauses. Figure 3 shows the process of MUS extraction using GNN-MUS as a heuristic (NeuroMUSX). The input formula is first converted into a graph, and the predictions for each clause to be included in the MUS are output by GNN-MUS. The predictions are given to the MUS extraction algorithm, which uses them to effectively find better unsatisfiable cores.
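The extraction process illustrated in Fig. 3, that is, Algorithm 2 from Sect. 4.2, can be sketched as follows. Everything concrete here is an assumption for illustration: the clause scores are hand-written numbers standing in for GNN-MUS predictions, `is_sat` is a brute-force oracle rather than a real solver, and because that oracle yields no proofs, the unsatisfiable filtered formula itself is used as a trivial stand-in for UnsatisfiableCore. The filter keeps clauses whose score is strictly above the threshold, which is one reading of GNN(F, T).

```python
from itertools import product


def is_sat(clauses):
    """Brute-force satisfiability check; a stand-in for a real SAT solver."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for bits in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False


def neuro_musx(formula, scores):
    """Sketch of Algorithm 2 with precomputed per-clause scores."""
    def keep_above(threshold):
        # Clauses scoring at or below the threshold are deemed outside the MUS.
        return [c for c in formula if scores[c] > threshold]

    core = list(formula)  # safety default, not part of Algorithm 2
    candidate = keep_above(0.0)
    while not is_sat(candidate):
        core = candidate  # trivial stand-in for UnsatisfiableCore(candidate)
        threshold = min(scores[c] for c in core)
        candidate = keep_above(threshold)
    # Plain deletion loop (as in MUSX) on the last unsatisfiable core.
    for clause in list(core):
        rest = [c for c in core if c is not clause]
        if not is_sat(rest):
            core = rest
    return core


formula = [frozenset({1}), frozenset({-1, 2}), frozenset({-2}),
           frozenset({3}), frozenset({1, 2})]
# Hypothetical scores: high for the three clauses forming the intended MUS.
scores = {formula[0]: 0.9, formula[1]: 0.8, formula[2]: 0.7,
          formula[3]: 0.1, formula[4]: 0.2}
extracted = neuro_musx(formula, scores)
```

In this toy run the thresholding phase already discards the two low-scoring clauses, so the deletion loop starts from three clauses instead of five; with a real solver, the core returned after each unsatisfiable check could be smaller still.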
5 Experiments
5.1 Dataset
To generate a large number of instances for training, we used the random SAT instance generator used for NeuroSAT [23], with each instance containing from 3 to 40 variables on average. The generated dataset consists of 10,000 satisfiable and 10,000 unsatisfiable instances (a total of 20,000) for both training and testing. This dataset will be referred to as SR(U(3, 40)) in the following sections. For the ground truths of each unsatisfiable instance, we used the MUS with the smallest number of clauses, extracted using the algorithm shown in [13]. However, even with formulas containing up to only 40 variables, the computation of the smallest-sized MUSes can be extremely difficult and time-consuming at times. Therefore, we set a time limit for the computation of the smallest MUSes, and if the limit is surpassed, we instead use the MUS computed by MUSX, which is not guaranteed to be the smallest MUS. For the ground truths of each satisfiable instance, we set the MUS label of every clause to 0.
Fig. 4. Size Reduction Score while training GNN-MUS with SR(U(3, 40))
5.2 Training
For training, we used GNN-MUS with N set to 10, D_in set to 2, and D_H set to 64. We trained the model for 1,000 epochs as the losses experimentally converged around that area. The training data was given to the model in batches of 64 to reduce the risk of overfitting. We used the ADAM optimizer with a learning rate of 1 × 10^−3 and LeakyReLU with α set to 1 × 10^−2 for the training process. As the number of clauses in each MUS is typically smaller than half of the total number of clauses, there will be a bias that leads to 0s being predicted more often than not. Therefore, we calculate the total occurrence of all 0s and 1s for every instance and give weights to the BCELoss regarding MUS prediction to prioritize 1s more. For the evaluation of the predictions, we adopted an evaluation metric that describes how much size the GNN was able to reduce, made specifically for the purpose of MUS extraction. The equation for calculating this is shown below.
$$\text{Size Reduction Score} = \max\left(\frac{\mathrm{Size}(F) - \mathrm{Size}(M_{pred})}{\mathrm{Size}(F) - \mathrm{Size}(M)}, 0\right)$$
Here, the function Size returns the number of clauses in the given formula, while M_pred points to the prediction of the MUS given by the GNN. Similar to NeuroMUSX, M_pred is calculated by increasing the threshold until the predicted MUS becomes satisfiable. If the sizes of F and M are the same, the score will be set to 0. The result of this evaluation is shown in Fig. 4.
Table 1. Comparison of the total oracle time for each set of problems. ARI: Abstraction Refinement Intel, SV: Software Verification, PC: Product Configuration, HV: Hardware Verification, DD: Design Debugging, FD: Functional Dependency, APP: Applications, BMC: Bounded Model Checking, EC: Equivalence Checking.
Problem Set | Oracle Time (sec): MUSX PAR-2 | MUSX timeout | NeuroMUSX PAR-2 | NeuroMUSX timeout | NeuroMUSX (Specific) PAR-2 | NeuroMUSX (Specific) timeout | Size Ratio (%): Normal | Specific
PC | 0.0186 | 0/4 | 0.0253 | 0/4 | – | – | 100.0 | –
ATPG | 1.5412 | 0/19 | 1.5843 | 0/19 | – | – | 100.0 | –
EC | 1830.9 | 0/4 | 1894.1 | 0/4 | – | – | 100.0 | –
BMC | 2363.8 | 1/14 | 2362.3 | 1/14 | – | – | 99.54 | –
SV | 2546.6 | 1/13 | 2547.8 | 1/13 | – | – | 96.83 | –
FPGA | 4005.8 | 2/8 | 4008.7 | 2/8 | – | – | 78.78 | –
HV | 11093 | 3/27 | 10129 | 3/27 | 10897 | 3/27 | 99.51 | 99.82
ARI | 11230 | 4/21 | 11160 | 4/21 | 10446 | 4/21 | 83.36 | 81.60
FD | 13057 | 0/50 | 13298 | 0/50 | 13150 | 0/50 | 100.0 | 99.38
APP | 20310 | 9/25 | 18604 | 8/25 | 18521 | 8/25 | 95.61 | 94.74
DD | 26985 | 12/40 | 18209 | 8/40 | 16819 | 7/40 | 69.45 | 64.46
Total | 93423 | 32/225 | 82215 | 27/225 | 69833 | 22/163 | 92.34 | 88.58
5.3 Comparison of MUSX and NeuroMUSX
In this experiment, we compare the total oracle time (the total time the solver was used) of the deletion-based algorithm (MUSX) and our proposed algorithm (NeuroMUSX). Since the time for computing predictions with GNN-MUS is minimal, we decided to ignore this as a factor. For comparison, we used problem instances from the MUS Track of the 2011 SAT Competition, with the timeout set to 1,000 s for both algorithms. We employed the PAR-2 scheme, in which a run that times out is penalized with twice the timeout. Furthermore, the size ratio is calculated by the following formula.
$$\text{Size Ratio} = \frac{\mathrm{Size}(M_{NeuroMUSX})}{\mathrm{Size}(M_{MUSX})} \times 100$$
The results for each set of problems are shown in Table 1. Figure 5a shows the oracle time comparison for each instance, while Fig. 5b shows the comparison of MUS sizes obtained by each algorithm. From the table, we can observe that even though our algorithm did not outperform MUSX on problem sets where MUSX already performed well (such as ATPG, PC, and FPGA), other problem sets saw a considerable decrease in computation time. In particular, when NeuroMUSX was able to guide the extractor to a much smaller unsatisfiable core, it led to a significant decrease in oracle time. Furthermore, using GNN-MUS allowed the algorithm to find a MUS for instances that timed out with the deletion-based algorithm. This emphasizes the effectiveness of using GNN-MUS trained only on random instances.
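The evaluation measures used in this section, together with the Size Reduction Score from Sect. 5.2, can be written down as a short Python sketch. The function names and the convention of marking a timed-out run with `None` are assumptions for illustration, and the PAR-2 score is computed as a sum over instances on the assumption that this matches the totals reported in Table 1.

```python
def par2_score(runtimes, timeout):
    """Total oracle time where each timed-out run (None) counts as 2 * timeout."""
    return sum(2 * timeout if t is None else t for t in runtimes)


def size_ratio(size_neuromusx, size_musx):
    """Size of the NeuroMUSX result as a percentage of the MUSX result."""
    return size_neuromusx / size_musx * 100


def size_reduction_score(size_formula, size_predicted, size_mus):
    """Fraction of the removable clauses that the prediction removed."""
    if size_formula == size_mus:
        return 0.0  # nothing can be removed; the score is defined as 0
    return max((size_formula - size_predicted) / (size_formula - size_mus), 0.0)


# Three hypothetical runs with a 1,000 s timeout; the second one timed out.
total_time = par2_score([10.0, None, 5.5], timeout=1000)
ratio = size_ratio(50, 100)
reduction = size_reduction_score(100, 40, 20)
```

A size ratio below 100 means NeuroMUSX found a smaller MUS than MUSX, and a Size Reduction Score of 1 means the prediction is as small as the ground-truth MUS.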
Fig. 5. Comparison per instance
From the graphs, we can observe that our algorithm significantly decreased the computation time of some instances while avoiding a slowdown on most of the others. Moreover, our algorithm was able to decrease the MUS sizes of some instances without any noticeable penalty. These observations support our claim that using GNN-MUS improves the performance of the algorithm without significant costs.
5.4 Comparison with Specifically Trained NeuroMUSX
To follow up on the previous experiment, we tested how our proposed algorithm performed on an entire problem set when one instance from that set was used for training NeuroMUSX. For this experiment, we adopt a transfer-learning based approach, where we use the GNN-MUS model pre-trained on random instances as the initial model and re-train the model for 100 epochs on the single instance, coupled with 4 random instances. The ground truth MUS for the instance was produced in the same manner as for random instances. For comparison, we focused on the 5 problem sets that took the most time with MUSX as the target sets. The overall results of the comparison are shown in Table 1. From the results, we can observe that the oracle time decreased noticeably, especially for problem sets that had more timeouts with both algorithms. Furthermore, the MUSes extracted by NeuroMUSX are also smaller than those of MUSX, showing the potential for more practical usage.
5.5 Comparison of GNN-MUS and NeuroSAT
To compare the performance of GNN-MUS with models used in prior research (namely NeuroSAT), we implemented the NeuroSAT framework based fully on the information given in the paper that proposed it [23]. To allow for a fair comparison, we used the same dataset, SR(U(3, 40)), to train NeuroSAT. We performed two experiments, one with a dataset consisting of 20,000 instances and the other with 200,000 instances. In the second experiment, we set the value of N, the number of GAT layers in GNN-MUS, to 20. As simultaneously learning to predict the MUSes of each instance had a noticeable positive impact on SAT/UNSAT accuracy, instead of removing it completely, we set λ1 to 1/50 and λ2 to 1 for the loss function. The results of the training are shown in Fig. 6.
Fig. 6. SAT/UNSAT Prediction Accuracy of NeuroSAT and GNN-MUS
From the results, we can observe that GNN-MUS was not able to predict SAT/UNSAT with higher accuracy than NeuroSAT. We attribute this to the fact that the architecture of GNN-MUS is mainly aimed at predicting whether each clause is included in the MUS rather than predicting whether the given problem is satisfiable. On the other hand, we can observe that, given more instances, the accuracy becomes closer to that of NeuroSAT. This leads us to believe that GNN-MUS has the potential to compete with modern end-to-end neuro-based solvers in terms of satisfiability prediction.
5.6 Discussion
The main objective of our proposed algorithm is to use the predictions given by GNN-MUS to guide the algorithm to a better solution. Even though the algorithm performed well on known datasets, this method has the limitation that it needs at least one instance from the target problem set, along with its ground truth MUS, to be able to train. Obtaining the instance may not be difficult, but computing the ground truth may become very time consuming. Even with known algorithms and extractors, obtaining MUSes can be extremely difficult. Therefore, if a MUS cannot be computed for any instance of the problem set, NeuroMUSX cannot be optimized for that specific problem set. One promising approach to overcome this issue is to prepare smaller instances generated from the same problem. Performing transfer learning on these instead will allow the model to learn the characteristics of the specific problem set without much computational effort.
6 Conclusion
In this paper, we proposed a MUS extraction algorithm, NeuroMUSX, that uses GNN-MUS as a heuristic. The results show that even with training on random instances, we were able to reduce the oracle time needed for computing MUSes significantly. Furthermore, by training on only a single instance from the given problem set, we were able to improve the oracle time needed for the entire problem set. This opens up possibilities for using GNNs in frameworks for computing minimal subsets outside of SAT without needing to build complex algorithms suited for the specific cases. Directions for future work include creating better ground truths with information from numerous MUSes rather than only using the smallest MUS as the ground truth, as well as improving the performance of SAT solving. Another direction is to create a framework for iteratively using GNNs in the extraction algorithm rather than relying only on one-shot predictions. A different approach will be to apply deep learning based MUS extraction techniques to ILP.
Acknowledgements. This work has been supported by JSPS KAKENHI Grant Number JP21H04905 and JST CREST Grant Number JPMJCR22D3.
References
1. Ahlgren, J., Yuen, S.Y.: Efficient program synthesis using constraint satisfaction in inductive logic programming. J. Mach. Learn. Res. 14(1), 3649–3682 (2013)
2. Belov, A., Lynce, I., Marques-Silva, J.: Towards efficient MUS extraction. AI Commun. 25(2), 97–116 (2012)
3. Chang, W., Zhang, H., Luo, J.: Predicting propositional satisfiability based on graph attention networks. Int. J. Comput. Intell. Syst. 15(1), 84 (2022)
4. Cook, S.A.: The complexity of theorem-proving procedures. In: Proceedings of the 3rd Annual ACM Symposium on Theory of Computing, pp. 151–158. ACM (1971)
5. Cropper, A., Dumancic, S., Evans, R., Muggleton, S.H.: Inductive logic programming at 30. Mach. Learn. 111(1), 147–172 (2022)
6. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1263–1272. PMLR (2017)
7. Grégoire, É., Mazure, B., Piette, C.: Local-search extraction of MUSes. Constraints Ann. Int. J. 12(3), 325–344 (2007)
8. Grégoire, É., Mazure, B., Piette, C.: On approaches to explaining infeasibility of sets of Boolean clauses. In: IEEE 20th International Conference on Tools with Artificial Intelligence, pp. 74–83. IEEE Computer Society (2008)
9. Han, J.M.: Enhancing SAT solvers with glue variable predictions. CoRR abs/2007.02559 (2020)
10. Hemery, F., Lecoutre, C., Sais, L., Boussemart, F.: Extracting MUCs from constraint networks. In: Proceedings of the 17th European Conference on Artificial Intelligence, vol. 141, pp. 113–117. IOS Press (2006)
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
12. Ignatiev, A., Narodytska, N., Asher, N., Marques-Silva, J.: From contrastive to abductive explanations and back again. In: Baldoni, M., Bandini, S. (eds.) AIxIA 2020. LNCS (LNAI), vol. 12414, pp. 335–355. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-77091-4_21
13. Ignatiev, A., Previti, A., Liffiton, M., Marques-Silva, J.: Smallest MUS extraction with minimal hitting set dualization. In: Pesant, G. (ed.) CP 2015. LNCS, vol. 9255, pp. 173–182. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23219-5_13
14. Inoue, K.: DNF hypotheses in explanatory induction. In: Muggleton, S.H., Tamaddoni-Nezhad, A., Lisi, F.A. (eds.) ILP 2011. LNCS (LNAI), vol. 7207, pp. 173–188. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31951-8_18
15. Izza, Y., Marques-Silva, J.: On explaining random forests with SAT. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence, pp. 2584–2591. IJCAI (2021)
16. Junker, U.: QUICKXPLAIN: preferred explanations and relaxations for overconstrained problems. In: Proceedings of the 19th National Conference on Artificial Intelligence, pp. 167–172. AAAI Press/The MIT Press (2004)
17. Kurin, V., Godil, S., Whiteson, S., Catanzaro, B.: Can Q-learning with graph networks learn a generalizable branching heuristic for a SAT solver? In: Advances in Neural Information Processing Systems, vol. 33, pp. 9608–9621. Curran Associates, Inc. (2020)
18. Lamb, L.C., d’Avila Garcez, A.S., Gori, M., Prates, M.O.R., Avelar, P.H.C., Vardi, M.Y.: Graph neural networks meet neural-symbolic computing: a survey and perspective. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence, pp. 4877–4884. IJCAI (2020)
19. Marques-Silva, J.: Minimal unsatisfiability: models, algorithms and applications (Invited Paper). In: IEEE 40th International Symposium on Multiple-Valued Logic, pp. 9–14. IEEE Computer Society (2010)
20. Oh, Y., Mneimneh, M.N., Andraus, Z.S., Sakallah, K.A., Markov, I.L.: AMUSE: a minimally-unsatisfiable subformula extractor. In: Proceedings of the 41st Design Automation Conference, pp. 518–523. ACM (2004)
21. Papadimitriou, C.H., Wolfe, D.: The complexity of facets resolved. In: 26th Annual Symposium on Foundations of Computer Science, pp. 74–78 (1985)
22. Selsam, D., Bjørner, N.: Guiding high-performance SAT solvers with unsat-core predictions. In: Janota, M., Lynce, I. (eds.) SAT 2019. LNCS, vol. 11628, pp. 336–353. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-24258-9_24
23. Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L., Dill, D.L.: Learning a SAT solver from single-bit supervision. In: 7th International Conference on Learning Representations (2019)
24. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: 6th International Conference on Learning Representations (2018)
25. Wang, W., Hu, Y., Tiwari, M., Khurshid, S., McMillan, K.L., Miikkulainen, R.: NeuroComb: improving SAT solving with graph neural networks. CoRR abs/2110.14053 (2021)
26. Zhang, L., Malik, S.: Validating SAT solvers using an independent resolution-based checker: practical implementations and other applications. In: Design, Automation and Test in Europe Conference and Exposition, pp. 10880–10885. IEEE Computer Society (2003)
What Do Counterfactuals Say About the World? Reconstructing Probabilistic Logic Programs from Answers to “What If?” Queries
Kilian Rückschloß and Felix Weitkämper
Ludwig-Maximilians-Universität München, Oettingenstraße 67, 80538 München, Germany
[email protected]
Abstract. A ProbLog program is a logic program with facts that only hold with a specified probability. Each ProbLog program gives rise to probability estimations for counterfactual statements of the form “A would be true, if we had forced B”. This contribution studies program equivalence with respect to this counterfactual reasoning in the sense of Judea Pearl. Our main result reveals that each well-written ProbLog program with non-trivial probabilities is uniquely determined by its associated counterfactual estimations. More precisely, we give a procedure to reconstruct such a probabilistic logic program from its counterfactual output. As counterfactuals are part of our everyday language, our result indicates that they may also be a good language to express domain knowledge or readable program specifications.
Keywords: Counterfactual Reasoning · Program Equivalence · Program Induction · Structure Learning · Probabilistic Logic Programming
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 93–108, 2023. https://doi.org/10.1007/978-3-031-49299-0_7
Humans are accustomed to reasoning about how events would have unfolded under different circumstances. This leads to counterfactual judgements like: “I would have married Katherina, if I had talked to her in high school” without actually experiencing the alternative reality in which we had talked to Katherina. Note that this capability allows us to make sense of the past, to plan courses of actions, to make emotional and social judgments, as well as to adapt our behavior [4]. Hence, one also wants an artificial intelligence to reason counterfactually.
Next, to illustrate how probabilistic logic programs support counterfactual reasoning, we consider the following version of the sprinkler example from Pearl [6]: It is spring or summer, written szn_spr_sum, with a probability of 0.5. Further, a road passes along a field with a sprinkler on it. In spring or summer, the sprinkler is on, written sprinkler, with probability 0.7. Moreover, it rains, denoted by rain, with probability 0.1 in spring or summer and with probability 0.6 in fall or winter. If it rains or the sprinkler is on, the pavement of the road gets wet, denoted by wet, with a probability of 0.4. When the pavement is wet, the road is slippery, denoted by slippery, with a probability of 0.3. Under the usual reading of ProbLog programs one would model this situation with the following program P.
0.5 :: szn_spr_sum
0.7 :: sprinkler ← szn_spr_sum
0.1 :: rain ← szn_spr_sum
0.6 :: rain ← ¬szn_spr_sum
0.4 :: wet ← rain
0.4 :: wet ← sprinkler
0.3 :: slippery ← wet
Here, we informally read a ProbLog clause 0.4 :: wet ← rain as “The rule wet ← rain holds with a probability of 0.4.” Finally, assume we observe that the sprinkler is on and that the road is slippery. What is the probability of the road being slippery if the sprinkler were switched off? Since the sprinkler is on, we conclude that it is spring or summer. Further, since it is slippery, the rule slippery ← wet is applicable in the world that we observe. However, if the sprinkler is off, rain is the only possibility to realize slippery and we obtain π(slippery|slippery, sprinkler, do(¬sprinkler)) ≈ 0.1 as the probability for the road to be slippery if the sprinkler were off.
The recent WhatIf-solver [5] establishes the above counterfactual reasoning of Pearl [6] for ProbLog programs [3]. In this contribution, however, we adopt the perspective that inductive logic programming aims to reconstruct a (probabilistic) logic program from its input-output-behavior [1, §2.1]. More precisely, we look at the counterfactual reasoning for ProbLog programs from this inductive logic programming perspective and study how a ProbLog program is determined by its associated counterfactual estimations. Our investigation reveals that each well-written ProbLog program P is uniquely determined by its counterfactual output if all occurring probabilities are non-trivial, i.e. neither zero nor one. Finally, in Example 14, we explain how this contrasts with the case in which deterministic causal relationships are at play. As expressing counterfactual knowledge is very natural for humans, our result indicates that counterfactuals could effectively convey domain knowledge or program specifications [1, §1 p.7] for structure learning of probabilistic logic programs.
However, our result also suggests that learning probabilistic logic programs that support counterfactual queries requires great precision in the learning process, as there is only a single program that one has to pin down exactly.
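The counterfactual query from the sprinkler example can be checked by brute force. The following Python sketch is not the WhatIf-solver [5]; under the informal reading that each clause holds independently with its probability, it enumerates which of the seven clauses hold, keeps the selections that agree with the evidence, and re-evaluates them with the sprinkler forced off. The encoding (CLAUSES, ORDER) and all function names are assumptions made for this illustration.

```python
from itertools import product

# Sprinkler program from the text: (probability, effect, causes).
# A cause is a pair (proposition, required truth value).
CLAUSES = [
    (0.5, "szn_spr_sum", []),
    (0.7, "sprinkler", [("szn_spr_sum", True)]),
    (0.1, "rain", [("szn_spr_sum", True)]),
    (0.6, "rain", [("szn_spr_sum", False)]),
    (0.4, "wet", [("rain", True)]),
    (0.4, "wet", [("sprinkler", True)]),
    (0.3, "slippery", [("wet", True)]),
]
# Causes are listed before their effects, so one pass suffices.
ORDER = ["szn_spr_sum", "sprinkler", "rain", "wet", "slippery"]


def solve(active, do=None):
    """Truth values of all propositions when exactly the clauses flagged in
    `active` hold; propositions in `do` are forced to the given value."""
    do = do or {}
    world = {}
    for atom in ORDER:
        if atom in do:
            world[atom] = do[atom]
            continue
        world[atom] = any(
            on and effect == atom and all(world[a] == v for a, v in causes)
            for on, (_, effect, causes) in zip(active, CLAUSES)
        )
    return world


def weight(active):
    """Probability that exactly the flagged clauses hold (independence)."""
    w = 1.0
    for on, (p, _, _) in zip(active, CLAUSES):
        w *= p if on else 1.0 - p
    return w


def counterfactual(query, evidence, do):
    """pi(query | evidence, do(do)) by enumerating all clause selections."""
    num = den = 0.0
    for active in product([False, True], repeat=len(CLAUSES)):
        factual = solve(active)
        if any(factual[a] != v for a, v in evidence.items()):
            continue  # selection contradicts what was observed
        den += weight(active)
        altered = solve(active, do)  # same selection, sprinkler forced
        if all(altered[a] == v for a, v in query.items()):
            num += weight(active)
    return num / den


answer = counterfactual(
    query={"slippery": True},
    evidence={"sprinkler": True, "slippery": True},
    do={"sprinkler": False},
)
print(round(answer, 4))
```

Under these assumptions the query evaluates to 0.04/0.424, i.e. about 0.094, which agrees with the value of roughly 0.1 stated in the sprinkler example.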
2 Preliminaries
Here, we introduce ProbLog programs [3] and we recall how counterfactual queries are processed on them [5]. As the semantics of non-ground ProbLog programs is usually defined by grounding, this paper restricts its attention to the propositional case. Hence, we construct our programs from a set of propositions P, i.e. from a propositional alphabet. A literal l is an expression p or ¬p for a proposition p ∈ P. We call l a positive literal if it is of the form p and a negative literal if it is of the form ¬p. Further, a (logical) clause LC is an expression h ← b1, ..., bn, also denoted head(LC) ← body(LC), where head(LC) := h ∈ P is a proposition called the head and where body(LC) := {b1, ..., bn} is a finite set of literals called the body of LC. Finally, a random fact RF is an expression π(RF) :: u(RF) where π(RF) ∈ [0, 1] is the probability and where u(RF) ∉ P is called the error term of RF.
Example 1. In the alphabet P := {szn_spr_sum, sprinkler, rain, wet, slippery} we have that szn_spr_sum is a positive literal, whereas ¬szn_spr_sum is a negative literal. Further, wet ← rain is a clause and 0.4 :: u5 is a random fact.
Now a logic program simply is a set of logical clauses. A ProbLog program P is given by a set of random facts Facts(P) and a logic program LP(P) in the alphabet P ∪ {u : u error term of a random fact in Facts(P)} where no error term occurs in the head of a clause in LP(P). In this case, we call LP(P) the underlying logic program of P. Further, we define (P-)formulas φ and (P-)structures M : P → {true, false} as usual in propositional logic. Whether a given P-structure M satisfies a formula φ, written M |= φ, is also defined as usual in propositional logic. Finally, for every subset of propositions T ⊆ P we identify each truth value assignment ι : T → {true, false} with the set of literals {p : p ∈ T, p^ι = true} ∪ {¬p : p ∈ T, p^ι = false}.
Example 2. In the alphabet P of Example 1 we can write the following ProbLog program P.
Facts(P):
0.5 :: u1
0.7 :: u2
0.1 :: u3
0.6 :: u4
0.4 :: u5
0.4 :: u6
0.3 :: u7
LP(P):
szn_spr_sum ← u1
sprinkler ← szn_spr_sum, u2
rain ← szn_spr_sum, u3
rain ← ¬szn_spr_sum, u4
wet ← rain, u5
wet ← sprinkler, u6
slippery ← wet, u7
As the semantics of ProbLog programs we choose the FCM-semantics [7], which supports counterfactual reasoning: For a ProbLog program P we define the functional causal models semantics or FCM-semantics FCM(P) to be the system of Boolean equations
\[
\left\{ p := \bigvee_{\substack{LC \in \mathrm{LP}(P) \\ \mathrm{head}(LC) = p}} \left( \bigwedge_{\substack{l \in \mathrm{body}(LC) \\ l \text{ literal in } P}} l \;\wedge \bigwedge_{\substack{u(RF) \in \mathrm{body}(LC) \\ RF \in \mathrm{Facts}(P)}} u(RF) \right) \right\}_{p \in P}.
\]
Here, the error terms u(RF) are interpreted as mutually independent Boolean random variables for every random fact RF ∈ Facts(P), each holding with probability π(RF). Note that the arrows in the program P express a cause-effect relationship rather than merely abbreviating a logical disjunction.
Example 3. The FCM-semantics of the program P in Example 2 is given by
szn_spr_sum := u1
sprinkler := szn_spr_sum ∧ u2
rain := (szn_spr_sum ∧ u3) ∨ (¬szn_spr_sum ∧ u4)
wet := (rain ∧ u5) ∨ (sprinkler ∧ u6)
slippery := wet ∧ u7
where u1, ..., u7 are mutually independent Boolean random variables holding true with probabilities of 0.5, 0.7, 0.1, 0.6, 0.4, 0.4 and 0.3 respectively.
Fix a ProbLog program P such that the FCM-semantics FCM(P) yields a unique solution for every proposition p ∈ P in terms of the mutually independent Boolean random variables u(RF). In this case, the program P defines a distribution on the P-structures M : P → {true, false} which coincides with the classical distribution semantics [8] according to Rückschloß and Weitkämper [7]. Finally, we define the probability of a formula φ by π(φ) := Σ_{M P-structure, M |= φ} π(M).
Example 4. Let us calculate the probability π(sprinkler) that the sprinkler is on in the program P of Example 2.
π(sprinkler) = π(szn_spr_sum ∧ u2) = π(u1 ∧ u2) = 0.5 · 0.7 = 0.35
However, the FCM-semantics does not only support queries about conditional and unconditional probabilities. It allows us to answer two more general causal query types, namely determining the effect of external interventions and counterfactuals [7]: Assume for instance we want to intervene and force a subset of propositions X ⊆ P to attain truth values specified by an assignment x. In this case, we build a modified program P^{do(x)} by erasing all clauses LC ∈ LP(P) with head in X and by adding a fact p ← if p ∈ X is set to true by x. If we now ask for the probability π(φ|do(x)) of a formula φ to hold after setting X to the values x, we query the program P^{do(x)} for the probability of φ. In Example 6, we illustrate how an intervention takes place in a concrete program. Further, we do not only want to either observe or intervene; we also want to know what the probability of an event would have been if we had intervened before observing some evidence. This is especially interesting in the counterfactual case where our evidence contradicts the given intervention.
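The difference between observing and intervening can be made concrete on the program of Example 2. The following Python sketch enumerates the truth values of the error terms u1, ..., u7, solves the equations of Example 3, and evaluates a marginal, a conditional and an interventional probability. The encoding and the function names are assumptions of this sketch and not part of any ProbLog implementation.

```python
from itertools import product

# Program of Example 2: entry k is (pi(u_k), head, other body literals).
CLAUSES = [
    (0.5, "szn_spr_sum", []),
    (0.7, "sprinkler", [("szn_spr_sum", True)]),
    (0.1, "rain", [("szn_spr_sum", True)]),
    (0.6, "rain", [("szn_spr_sum", False)]),
    (0.4, "wet", [("rain", True)]),
    (0.4, "wet", [("sprinkler", True)]),
    (0.3, "slippery", [("wet", True)]),
]
ORDER = ["szn_spr_sum", "sprinkler", "rain", "wet", "slippery"]


def solve(u, do=None):
    """Solve the Boolean equations for error-term values u. Clauses whose
    head is in `do` are ignored and the head is set to the forced value."""
    do = do or {}
    world = {}
    for atom in ORDER:
        if atom in do:
            world[atom] = do[atom]
            continue
        world[atom] = any(
            on and head == atom and all(world[a] == v for a, v in body)
            for on, (_, head, body) in zip(u, CLAUSES)
        )
    return world


def weight(u):
    """Probability of one joint assignment to the independent error terms."""
    w = 1.0
    for on, (p, _, _) in zip(u, CLAUSES):
        w *= p if on else 1.0 - p
    return w


def prob(query, given=None, do=None):
    """pi(query | given, do(do)) by summing over all error-term assignments."""
    given = given or {}
    num = den = 0.0
    for u in product([False, True], repeat=len(CLAUSES)):
        world = solve(u, do)
        if all(world[a] == v for a, v in given.items()):
            den += weight(u)
            if all(world[a] == v for a, v in query.items()):
                num += weight(u)
    return num / den


print(prob({"sprinkler": True}))                          # marginal of Example 4
print(prob({"wet": True}, given={"sprinkler": True}))     # observing the sprinkler
print(prob({"wet": True}, do={"sprinkler": True}))        # forcing the sprinkler
```

Up to floating-point rounding, the three values are 0.35, 0.424 and 0.484. Observing the sprinkler reveals that it is spring or summer and hence lowers the chance of rain, whereas forcing the sprinkler on carries no information about the season.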
Example 5. Consider the query in the introduction and observe that the evidence {slippery, sprinkler} contradicts the intervention do(¬sprinkler), i.e. this is a counterfactual query.
Hence, fix another subset of propositions E ⊆ P and assume we observe the evidence that the propositions in E take values according to the assignment e. We now ask for the probability π(φ|e, do(x)) of the formula φ to hold if we had set the propositions in X to the values specified by x before observing our evidence e. To answer such queries we proceed as Kiesel et al. [5]: First, we generate two copies P^{e/i} of the alphabet P, one to handle the evidence and the other to handle the interventions. Further, we set u^{e/i} := u for every error term u. Note that this yields maps (·)^{e/i} of error terms, literals, clauses, programs etc. We define the twin program of P to be the ProbLog program P^K which consists of the logic program LP(P)^e ∪ LP(P)^i and the random facts Facts(P). Now we intervene in P^K and set the propositions in X^i to the truth values specified by x to obtain the program P^{K,do(x^i)}. Finally, we query the program P^{K,do(x^i)} for the probability π(φ^i|E^e = e) to obtain the desired result for π(φ|e, do(x)).
Example 6. In the program P of Example 2, we want to calculate the probability π(slippery|sprinkler, slippery, do(¬sprinkler)) that it would be slippery if we had switched the sprinkler off before observing the road to be slippery while the sprinkler is on. For an answer we query the program (P)^{K,do(¬sprinkler^i)} below
0.5 :: u1
0.7 :: u2
0.1 :: u3
0.6 :: u4
0.4 :: u5
0.4 :: u6
0.3 :: u7
szn_spr_sum^e ← u1
szn_spr_sum^i ← u1
sprinkler^e ← szn_spr_sum^e, u2
rain^e ← szn_spr_sum^e, u3
rain^i ← szn_spr_sum^i, u3
rain^e ← ¬szn_spr_sum^e, u4
rain^i ← ¬szn_spr_sum^i, u4
wet^e ← rain^e, u5
wet^i ← rain^i, u5
wet^e ← sprinkler^e, u6
wet^i ← sprinkler^i, u6
slippery^e ← wet^e, u7
slippery^i ← wet^i, u7
for the probability π(slippery^i|sprinkler^e, slippery^e) ≈ 0.1.
This procedure automates the counterfactual reasoning of Pearl [6] and is implemented in the WhatIf-solver of Kiesel et al. [5]. In this contribution, we restrict ourselves to ProbLog programs in which no two clauses mention the same error term. These can be represented with ProbLog clauses. A ProbLog clause RC in P is given by an expression π :: e ← c1, ..., cn, also denoted by π(RC) :: effect(RC) ← causes(RC), where effect(RC) := e ∈ P is a proposition called the effect, causes(RC) := {c1, ..., cn} is a finite set of literals called the causes and where 0 ≤ π(RC) ≤ 1 is a number called the
probability of RC. We use a ProbLog clause RC as an abbreviation for the following pair of a random fact and a logical clause.
RF(RC) := (π(RC) :: u(RC))
LC(RC) := (e ← c1, ..., cn, u(RC))
From now on, by abuse of language, a ProbLog program P in P is a finite set of ProbLog clauses. In particular, we identify P with a ProbLog program in the old sense consisting of the logic program LP(P) := {LC(RC) : RC ∈ P} and of the random facts Facts(P) := {RF(RC) : RC ∈ P}.
Example 7. The ProbLog program P in the introduction is an abbreviation for the program P in Example 2.
In the new, more restrictive setting, the class dependency graph Graph(P) of a ProbLog program P is the directed graph on the alphabet P obtained by drawing an edge p1 → p2 if and only if there exists a ProbLog clause RC ∈ P with a cause p1 or ¬p1 and with effect p2. We say that the program P is acyclic if its class dependency graph Graph(P) is a directed acyclic graph.
Remark 1. We identify the class dependency graph of a non-ground program with the class dependency graph of its grounding, which can be considered as a propositional program.
Example 8. The class dependency graph Graph(P) of the program P in the introduction has the edges szn_spr_sum → sprinkler, szn_spr_sum → rain, sprinkler → wet, rain → wet and wet → slippery. Hence, we see that P yields an acyclic program.
If a ProbLog program P is acyclic, we obtain for every set of literals x that the FCM-semantics of the modified twin program P^{K,do(x^i)} yields a unique expression defining every proposition p^i, p ∈ P, in terms of the Boolean error terms. Hence, our counterfactual reasoning is indeed well-defined for acyclic programs. We call a ProbLog program P proper if for each ProbLog clause RC ∈ P its associated probability π(RC) ∉ {0, 1} is non-trivial. Finally, we say that P is in normal form if any two distinct ProbLog clauses RC1/2 ∈ P have distinct effects effect(RC1) ≠ effect(RC2) or distinct causes causes(RC1) ≠ causes(RC2) and if every source s in the class dependency graph Graph(P) gives rise to a clause α :: s ← with empty causes. Note that we obtain π(causes(RC)) ≠ 0 for all clauses RC ∈ P if the program P is acyclic, proper and in normal form.
Example 9. Observe that the ProbLog program P from the introduction is proper and in normal form.
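The three syntactic conditions used from here on (acyclic, proper, normal form) can be checked mechanically. The following Python sketch does so for the sprinkler program; it assumes Python 3.9 or later for the standard-library module graphlib, and the clause encoding and function names are choices made for this illustration only.

```python
from graphlib import CycleError, TopologicalSorter

# ProbLog clauses as (probability, effect, causes); a cause is a pair
# (proposition, required truth value).
CLAUSES = [
    (0.5, "szn_spr_sum", []),
    (0.7, "sprinkler", [("szn_spr_sum", True)]),
    (0.1, "rain", [("szn_spr_sum", True)]),
    (0.6, "rain", [("szn_spr_sum", False)]),
    (0.4, "wet", [("rain", True)]),
    (0.4, "wet", [("sprinkler", True)]),
    (0.3, "slippery", [("wet", True)]),
]


def dependency_edges(clauses):
    """Edges p1 -> p2 for every clause with cause p1 or not-p1 and effect p2."""
    return {(atom, effect) for _, effect, causes in clauses for atom, _ in causes}


def is_acyclic(clauses):
    """True if the dependency graph admits a topological order."""
    predecessors = {}
    for _, effect, causes in clauses:
        predecessors.setdefault(effect, set()).update(a for a, _ in causes)
    try:
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


def is_proper(clauses):
    """True if every clause probability lies strictly between 0 and 1."""
    return all(0.0 < p < 1.0 for p, _, _ in clauses)


def is_normal_form(clauses):
    """True if no two clauses share both effect and causes and every source
    of the dependency graph has a clause with empty causes."""
    keys = [(effect, frozenset(causes)) for _, effect, causes in clauses]
    if len(keys) != len(set(keys)):
        return False
    edges = dependency_edges(clauses)
    nodes = {effect for _, effect, _ in clauses} | {a for a, _ in edges}
    sources = nodes - {target for _, target in edges}
    with_fact = {effect for _, effect, causes in clauses if not causes}
    return sources <= with_fact


print(sorted(dependency_edges(CLAUSES)))
print(is_acyclic(CLAUSES), is_proper(CLAUSES), is_normal_form(CLAUSES))
```

For the sprinkler program all three checks succeed, in line with Examples 8 and 9.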
3 Results
Let us fix an acyclic proper ProbLog program P in normal form. Our aim in this paper is to uncover the information about P that is contained in the counterfactual estimations π(·|e, do(x)). We introduce the notion of a counterfactual reasoning to refer to such a family of counterfactual estimations.
Definition 1 (Counterfactual Reasoning). A counterfactual reasoning π on the alphabet P is a family consisting of a probability distribution π(·|e, do(x)) on P for every two sets of literals e and x.
Obviously, the program P gives rise to a counterfactual reasoning π which we call the counterfactual semantics of P. Further, assume we forgot about the program P and only have access to its counterfactual semantics π. Is it possible to reconstruct the program P from the counterfactual reasoning π? Our starting point for this task is Lemma 1, which helps us to spot the structural information that is stored in the counterfactual reasoning π.
Lemma 1 (Important Identities). Choose a proposition p ∈ P and two supersets pa(p) ⊆ S1/2 ⊆ P \ {p} of the parents pa(p) of p in the class dependency graph Graph(P) not containing p itself. For any two truth value assignments s1 and s2 on S1, respectively S2, satisfying π(s1/2) > 0 and π(p|s1) > 0 we obtain the following identities:
\[
\pi(p \mid s_1) = \pi\Biggl( \bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1}} u(RC) \Biggr) = 1 - \prod_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1}} \bigl(1 - \pi(RC)\bigr) \tag{1}
\]
\[
\pi(p \mid s_1, p, \mathrm{do}(s_2)) = 1 - \frac{1 - \pi(p \mid s_2)}{\pi(p \mid s_1)} \Biggl( 1 - \bigl(1 - \pi(p \mid s_1)\bigr) \Biggl( \prod_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1 \cap s_2}} \bigl(1 - \pi(RC)\bigr) \Biggr)^{-1} \Biggr) \tag{2}
\]
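Proof. First, we note that Eq. (1) follows trivially from acyclicity and the fact that all error terms represent mutually independent random variables. To establish Eq. (2) we calculate:
\begin{align*}
\pi(p \mid s_1, p, \mathrm{do}(s_2))
&\overset{\text{choice of } S_{1/2}}{=}
\pi\Biggl( \bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_2}} u(RC) \Biggm| \bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1}} u(RC) \Biggr) \\
&= 1 - \pi\Biggl( \bigwedge_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_2}} \neg u(RC) \Biggm| \bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1}} u(RC) \Biggr) \\
&= 1 - \frac{\pi\Biggl( \Biggl( \bigvee_{\substack{RC_1 \in P \\ \mathrm{effect}(RC_1) = p \\ \mathrm{causes}(RC_1) \subseteq s_1}} u(RC_1) \Biggr) \wedge \Biggl( \bigwedge_{\substack{RC_2 \in P \\ \mathrm{effect}(RC_2) = p \\ \mathrm{causes}(RC_2) \subseteq s_2}} \neg u(RC_2) \Biggr) \Biggr)}{\pi\Biggl( \bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1}} u(RC) \Biggr)} \\
&\overset{\text{independence of error terms \& (1)}}{=}
1 - \frac{1 - \pi(p \mid s_2)}{\pi(p \mid s_1)} \, \pi\Biggl( \bigvee_{\substack{RC_1 \in P \\ \mathrm{effect}(RC_1) = p \\ \mathrm{causes}(RC_1) \subseteq s_1 \\ \mathrm{causes}(RC_1) \not\subseteq s_2}} u(RC_1) \Biggr)
\end{align*}
From the independence of the error terms we further deduce that
\[
\pi\Biggl( \bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1 \\ \mathrm{causes}(RC) \not\subseteq s_2}} u(RC) \Biggr)
= 1 - \frac{\pi\Biggl( \bigwedge_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1}} \neg u(RC) \Biggr)}{\pi\Biggl( \bigwedge_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1 \cap s_2}} \neg u(RC) \Biggr)}.
\]
Finally, we can apply Eq. (1) to obtain the desired result.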
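The identities of Lemma 1 can be checked numerically on the sprinkler program. The following Python sketch takes p = wet with s1 = {rain, sprinkler} and s2 = {¬rain, sprinkler}, evaluates the closed forms (1) and (2), and compares them with a brute-force enumeration of the error terms. The encoding and the helper names (eq1, eq2, counterfactual) are assumptions of this sketch.

```python
from itertools import product
from math import prod

CLAUSES = [  # (probability, effect, causes as (proposition, truth value))
    (0.5, "szn_spr_sum", []),
    (0.7, "sprinkler", [("szn_spr_sum", True)]),
    (0.1, "rain", [("szn_spr_sum", True)]),
    (0.6, "rain", [("szn_spr_sum", False)]),
    (0.4, "wet", [("rain", True)]),
    (0.4, "wet", [("sprinkler", True)]),
    (0.3, "slippery", [("wet", True)]),
]
ORDER = ["szn_spr_sum", "sprinkler", "rain", "wet", "slippery"]


def solve(u, do=None):
    """Solve the structural equations for error terms u under do."""
    do = do or {}
    world = {}
    for atom in ORDER:
        if atom in do:
            world[atom] = do[atom]
            continue
        world[atom] = any(
            on and effect == atom and all(world[a] == v for a, v in causes)
            for on, (_, effect, causes) in zip(u, CLAUSES)
        )
    return world


def weight(u):
    return prod(p if on else 1.0 - p for on, (p, _, _) in zip(u, CLAUSES))


def counterfactual(query, evidence, do):
    """Brute-force pi(query | evidence, do(do)) over all error terms."""
    num = den = 0.0
    for u in product([False, True], repeat=len(CLAUSES)):
        factual = solve(u)
        if all(factual[a] == v for a, v in evidence.items()):
            den += weight(u)
            altered = solve(u, do)
            if all(altered[a] == v for a, v in query.items()):
                num += weight(u)
    return num / den


def contained(causes, s):
    """causes is a subset of the literal set described by assignment s."""
    return all(s.get(a) == v for a, v in causes)


def eq1(p, s):
    """Right-hand side of Eq. (1)."""
    return 1 - prod(1 - pr for pr, eff, c in CLAUSES if eff == p and contained(c, s))


def eq2(p, s1, s2):
    """Right-hand side of Eq. (2)."""
    common = {a: v for a, v in s1.items() if s2.get(a) == v}
    shared = prod(1 - pr for pr, eff, c in CLAUSES if eff == p and contained(c, common))
    return 1 - (1 - eq1(p, s2)) / eq1(p, s1) * (1 - (1 - eq1(p, s1)) / shared)


s1 = {"rain": True, "sprinkler": True}
s2 = {"rain": False, "sprinkler": True}
print(eq1("wet", s1), counterfactual({"wet": True}, s1, {}))
print(eq2("wet", s1, s2), counterfactual({"wet": True}, {**s1, "wet": True}, s2))
```

Both lines print two values that agree up to floating-point rounding: 0.64 for Eq. (1) and 0.625 for Eq. (2).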
In the next proposition, we summarize the structural information about P that is uncovered by Lemma 1.
Proposition 1 (Structural Information in Counterfactual Semantics). In the situation of Lemma 1, we additionally assume π(p|s2) > 0. In this case, we obtain the following criteria:
i) We find π(p|s1, p, do(s2)) = 1 and π(p|s2, p, do(s1)) = 1 if and only if the assignments s1/2 coincide on the parents pa(p) of p in the class dependency graph Graph(P), i.e. q^{s1} = q^{s2} for every parent q ∈ pa(p) of p.
ii) We find π(p|s1, p, do(s2)) > π(p|s2) if and only if the assignments s1/2 coincide on the causes of a ProbLog clause RC ∈ P with effect p, i.e. if there exists a clause RC ∈ P with effect(RC) = p and with causes(RC)^{s1} = causes(RC)^{s2}.
Proof. Let us begin with proving i) and assume that the assignments s1/2 coincide on the parents pa(p) of p in the class dependency graph. In this case, π(p|s1, p, do(s2)) = 1 and π(p|s2, p, do(s1)) = 1 directly follows from combining the definition of the class dependency graph with Eqs. (1) and (2) of Lemma 1. Conversely, assume that π(p|s1, p, do(s2)) = 1 and π(p|s2, p, do(s1)) = 1. Since we find that π(p|s1/2) > 0, Eq. (2) yields for i ∈ {1, 2} that
\[
1 - \pi(p \mid s_i) = \prod_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1 \cap s_2}} \bigl(1 - \pi(RC)\bigr)
\]
and the desired statement follows from Eq. (1). For ii), the assumption π(p|s1/2) > 0 and Eqs. (1) and (2) again ensure that π(p|s1, p, do(s2)) > π(p|s2) if and only if
\[
\prod_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ \mathrm{causes}(RC) \subseteq s_1 \cap s_2}} \bigl(1 - \pi(RC)\bigr) < 1
\]
and the desired result follows.
Let us now begin by reconstructing the class dependency graph G := Graph(P) from the counterfactual semantics π of P. For a more readable presentation, we introduce the following notions.
Definition 2 (Situations and Frames). A situation for a proposition p ∈ P is a truth value assignment s to the remaining propositions in P \ {p}. For two situations s1/2 with π(s1, p) > 0 we call the number π(p|s1, p, do(s2)) ∈ [0, 1] the change of situations from s1 to s2. Further, the common support of s1 and s2 is defined by supp(s1, s2) := {p ∈ P : p^{s1} = p^{s2}}. Given the class dependency graph Graph(P) of P we call a truth value assignment f to the parents pa(p) of a proposition p ∈ P a frame for p. Finally, for two frames f1/2 with π(f1, p) > 0 we call the number π(p|f1, p, do(f2)) ∈ [0, 1] the reframing from f1 to f2.
Now Proposition 1 yields that the class dependency graph G can be computed with the following procedure.
Procedure 1 (Reconstructing the Class Dependency Graph). For every proposition p ∈ P we compute the smallest set pa(p) ⊆ P \ {p} such that every change of situations satisfies π(p|s1, p, do(s2)) = 1 and π(p|s2, p, do(s1)) = 1 whenever pa(p) ⊆ supp(s1, s2). Finally, G is obtained by drawing an edge from every node in pa(p) to p.
Example 10. Assume we aim to find the parents {wet} of slippery in the class dependency graph of the program P from the introduction presented in Example 8. First, we find out that it is only slippery if it is wet, which is only the case if it rains or if the sprinkler is on. Hence, we have to consider the following situations:
s1 := {szn_spr_sum, rain, sprinkler, wet}
s2 := {¬szn_spr_sum, rain, sprinkler, wet}
s3 := {szn_spr_sum, ¬rain, sprinkler, wet}
s4 := {szn_spr_sum, rain, ¬sprinkler, wet}
s5 := {¬szn_spr_sum, ¬rain, sprinkler, wet}
s6 := {¬szn_spr_sum, rain, ¬sprinkler, wet}
Next, one easily observes that π(slippery|si, slippery, do(si)) = 1 for all 1 ≤ i ≤ 6 as all relevant error terms were already observed to be true. Hence, we investigate the 30 changes of situations π(slippery|si, slippery, do(sj)) for 1 ≤ i, j ≤ 6, i ≠ j. A calculation or the WhatIf-solver [5] reveals π(slippery|si, slippery, do(sj)) = 1 for every change of situations under consideration. Now we observe that the situations s1, ..., s6 coincide only on wet, yielding that {wet} is the set of parents of slippery in the class dependency graph Graph(P).
Our next goal is to reconstruct the program P from its class dependency graph G := Graph(P) and the counterfactual reasoning π. To do so we fix a proposition p ∈ P and compute the clauses defining p. Let us begin with the following notion.
Definition 3 (Clause Search Graph). The clause search graph Search_p of p is an undirected graph on the frames f of p with π(f) > 0 and with π(p|f) > 0. It is given by drawing an edge f1 − f2 if and only if π(p|f1, p, do(f2)) > π(p|f2). Finally, we label each edge f1 − f2 with f1 ∩ f2.
Remark 2. Since P is assumed to be in normal form, we find that there exists no clause RC ∈ P with causes(RC) ⊆ f for every frame f with π(f) = 0.
Example 11. Assume we know the class dependency graph Graph(P) of the program P from the introduction that is presented in Example 8. We further want to recover the clauses defining wet. Note that there are three frames f with π(wet|f) > 0:
f1 := {rain, sprinkler}
f2 := {¬rain, sprinkler}
f3 := {rain, ¬sprinkler}
What Do Counterfactuals Say About the World?
The WhatIf-solver [5] yields π(wet|fi, wet, do(fi)) = 1 > π(wet|fi) for 1 ≤ i ≤ 3 and we obtain

π(wet|f1, wet, do(f2)) = 0.625 > 0.4 = π(wet|f2)
π(wet|f1, wet, do(f3)) = 0.625 > 0.4 = π(wet|f3)
π(wet|f2, wet, do(f3)) = 0.4 = π(wet|f3).

This yields the following clause search graph Searchwet:

f2 − f1 − f3
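The construction of Definition 3 can be traced on the numbers of Example 11. The following is a minimal Python sketch, not the authors' implementation: the dictionary layout, the "-x" spelling for ¬x and the function name `clause_search_graph` are assumptions made for illustration, and the value π(wet|f1) = 0.64 is taken from the counterfactual backbone of the running example.

```python
from itertools import combinations

# Frames of `wet` as sets of literals; "-x" stands for the negated literal.
frames = {
    "f1": frozenset({"rain", "sprinkler"}),
    "f2": frozenset({"-rain", "sprinkler"}),
    "f3": frozenset({"rain", "-sprinkler"}),
}

# pi(wet | f) for each frame.
cond = {"f1": 0.64, "f2": 0.4, "f3": 0.4}

# Reframings pi(wet | fa, wet, do(fb)) as reported in Example 11.
reframe = {("f1", "f2"): 0.625, ("f1", "f3"): 0.625, ("f2", "f3"): 0.4}


def clause_search_graph(frames, cond, reframe):
    """Return {edge: label}: an edge fa - fb exists iff the reframing
    pi(p | fa, p, do(fb)) exceeds pi(p | fb); the label is fa ∩ fb."""
    edges = {}
    for a, b in combinations(sorted(frames), 2):
        if reframe[(a, b)] > cond[b]:
            edges[(a, b)] = frames[a] & frames[b]
    return edges


edges = clause_search_graph(frames, cond, reframe)
for (a, b), label in edges.items():
    print(a, "-", b, "label:", sorted(label))
```

Only the pairs f1 − f2 and f1 − f3 pass the strict inequality, so the sketch returns the two edges of the graph, labelled {sprinkler} and {rain} respectively.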
With the clause search graph at hand we now uncover the first clause RC0,p defining p.

Lemma 2 (Finding a Clause). Let causes0,p be a minimal label of an edge f1 − f2 in the clause search graph Searchp of a proposition p ∈ P. In this case, we find a unique clause RC0,p ∈ P with effect effect(RC0,p) = p, with causes causes0,p = causes(RC0,p) and with probability

\[
\pi(RC_{0,p}) = 1 - \frac{\bigl(1 - \pi(p \mid f_1)\bigr)\bigl(1 - \pi(p \mid f_2)\bigr)}{1 - \pi(p \mid f_2) - \pi(p \mid f_1)\bigl(1 - \pi(p \mid f_1, p, \mathrm{do}(f_2))\bigr)}. \tag{3}
\]
Proof. According to Proposition 1 and Definition 3, we find a clause RC0,p ∈ P such that causes(RC0,p) ⊆ causes0,p. If we assume that causes(RC0,p) ≠ causes0,p, this clause would induce an edge not present in Searchp. Since P is assumed to be in normal form, we further obtain that RC0,p ∈ P is the unique clause with causes(RC0,p) = causes0,p and with effect(RC0,p) = p. Finally, consider Eq. (2) of Lemma 1 to see that

\[
\pi(p \mid f_1, p, \mathrm{do}(f_2)) = 1 - \frac{1 - \pi(p \mid f_2)}{\pi(p \mid f_1)} \left(1 - \frac{1 - \pi(p \mid f_1)}{1 - \pi(RC_{0,p})}\right).
\]

From here, solving for π(RC0,p) yields the desired result.
Example 12. In Example 11, we find that f1 − f2 yields an edge of the clause search graph Searchwet with minimal label causes0,wet := f1 ∩ f2 = {sprinkler}. Further, Equation (3) yields that

\[
\pi(RC_{0,\mathrm{wet}}) := 1 - \frac{0.6 \cdot 0.36}{1 - 0.4 - 0.64\,(1 - 0.625)} = 0.4.
\]
Overall, we found the clause RC0,wet := (0.4 :: wet ← sprinkler) that indeed appears in the program P of the introduction.
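The arithmetic of Example 12 can be replayed in a few lines. This is a sketch only; the function name `clause_probability` and its argument names are hypothetical, and the formula is Equation (3) read so that π(p|f1) multiplies the reframing term in the denominator, which is the reading that matches the numbers 0.64 and 0.625 used in Example 12.

```python
def clause_probability(p_f1, p_f2, reframing):
    """Probability of the clause found on the edge f1 - f2.

    p_f1, p_f2 : pi(p | f1) and pi(p | f2)
    reframing  : pi(p | f1, p, do(f2))
    """
    numerator = (1 - p_f1) * (1 - p_f2)
    denominator = 1 - p_f2 - p_f1 * (1 - reframing)
    return 1 - numerator / denominator


# Values for wet with f1 = {rain, sprinkler}, f2 = {¬rain, sprinkler}.
prob = clause_probability(p_f1=0.64, p_f2=0.4, reframing=0.625)
print(round(prob, 6))
```

Up to floating-point rounding the result is 0.4, the probability of the clause 0.4 :: wet ← sprinkler.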
To proceed we observe that we did not need the full counterfactual reasoning π in order to reconstruct the clause RC0,p ∈ P. Indeed, we only need the data provided by a counterfactual backbone.

Definition 4 (Counterfactual Backbone). A counterfactual backbone consists of the probabilities π(p|f1/2) and the reframings π(p|f1, p, do(f2)) for all frames f1/2 of a proposition p ∈ P with π(f2) > 0 and with π(f1, p) > 0.

Example 13. The counterfactual backbone of the program P in the introduction has the following form.

π(szn_spr_sum) = 0.5, π(sprinkler|szn ...) = 0.7, π(sprinkler|¬szn ...) = 0, π(rain|szn ...) = 0.1, π(rain|¬szn ...) = 0.6, π(wet|rain, sprinkler) = 0.64, ..., π(wet|¬rain, ¬sprinkler) = 0, π(slippery|wet) = 0.3, π(slippery|¬wet) = 0, π(spr...|szn ..., spr..., do(¬szn ...)) = 0, π(r...|szn ..., r..., do(¬szn ...)) = 0.6, π(rain|¬szn ..., rain, do(szn ...)) = 0.1, π(wet|sprinkler, rain, wet, do(¬sprinkler, ¬rain)) = 0, ..., π(wet|¬sprinkler, rain, wet, do(sprinkler, rain)) = 1, π(slippery|wet, slippery, do(¬wet)) = 0

Next, to propagate, we use the following result to compute the counterfactual backbone of the program P0,p := P \ {RC0,p}.

Lemma 3 (Modularity of Counterfactual Backbones). Let RC0 ∈ P be a ProbLog clause with effect effect(RC0) := p ∈ P and fix frames f1/2 for p with π(f1/2) > 0 and with π(p|f1/2) > 0. Further, let us denote by π0 the counterfactual backbone of the program P0 := P \ {RC0}. If causes(RC0) ⊄ fi, we obtain

\[
\pi_0(p \mid f_i) = \pi(p \mid f_i). \tag{4}
\]

Otherwise, we obtain

\[
\pi_0(p \mid f_i) = \frac{\pi(p \mid f_i) - \pi(RC_0)}{1 - \pi(RC_0)}. \tag{5}
\]

If causes(RC0) ⊄ f1 and causes(RC0) ⊄ f2, we obtain

\[
\pi_0(p \mid f_1, p, \mathrm{do}(f_2)) = \pi(p \mid f_1, p, \mathrm{do}(f_2)). \tag{6}
\]

If causes(RC0) ⊆ f1 ∩ f2, we obtain

\[
\pi_0(p \mid f_1, p, \mathrm{do}(f_2)) = \pi(p \mid f_1, p, \mathrm{do}(f_2)) - \frac{\bigl(1 - \pi(p \mid f_1, p, \mathrm{do}(f_2))\bigr)\,\pi(RC_0)}{\pi(p \mid f_1) - \pi(RC_0)}. \tag{7}
\]

Further, assume that causes(RC0) ⊆ f2 while causes(RC0) ⊄ f1. We obtain

\[
\pi_0(p \mid f_1, p, \mathrm{do}(f_2)) = \frac{\pi(p \mid f_1, p, \mathrm{do}(f_2)) - \pi(RC_0)}{1 - \pi(RC_0)}. \tag{8}
\]

Finally, assume that causes(RC0) ⊄ f2 while causes(RC0) ⊆ f1. We obtain

\[
\pi_0(p \mid f_1, p, \mathrm{do}(f_2)) = \frac{\pi_0(p \mid f_2, p, \mathrm{do}(f_1))\,\pi_0(p \mid f_2)}{\pi_0(p \mid f_1)}. \tag{9}
\]
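To see Lemma 3 at work, one can remove the clause 0.4 :: wet ← sprinkler found in Example 12 and update the backbone entries for wet. The sketch below covers Eqs. (4), (5) and (7) only. It reads Eq. (5) and Eq. (7) as the cases in which RC0 is applicable in the frames concerned, which is how the proof treats them; the function names are hypothetical.

```python
def remove_from_conditional(p_f, p_rc, applicable):
    """pi_0(p | f) after deleting a clause with probability p_rc.

    Eq. (4) if the clause is not applicable in f, Eq. (5) otherwise.
    """
    if not applicable:
        return p_f
    return (p_f - p_rc) / (1 - p_rc)


def remove_from_reframing_both(reframing, p_f1, p_rc):
    """pi_0(p | f1, p, do(f2)) when the clause applies in f1 and in f2, Eq. (7)."""
    return reframing - (1 - reframing) * p_rc / (p_f1 - p_rc)


P_RC = 0.4  # probability of 0.4 :: wet <- sprinkler

# f1 = {rain, sprinkler}, f2 = {¬rain, sprinkler}, f3 = {rain, ¬sprinkler}
pi0_f1 = remove_from_conditional(0.64, P_RC, applicable=True)
pi0_f2 = remove_from_conditional(0.4, P_RC, applicable=True)
pi0_f3 = remove_from_conditional(0.4, P_RC, applicable=False)
pi0_f1_do_f2 = remove_from_reframing_both(0.625, p_f1=0.64, p_rc=P_RC)

print(pi0_f1, pi0_f2, pi0_f3, pi0_f1_do_f2)
```

Up to rounding, the remaining backbone has π0(wet|f1) = π0(wet|f3) = 0.4 and π0(wet|f2) = 0, and the reframing from f1 to f2 drops to 0. This is what one expects if the only clause left for wet is the one triggered by rain.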
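Proof. Equations (4) and (6) trivially hold since RC0 is neither applicable in f1 nor in f2. For Eq. (5) we calculate

\[
\begin{aligned}
\pi(p \mid f_i)
&= \pi\Biggl(\bigvee_{\substack{RC \in P \\ \mathrm{effect}(RC) = p \\ f_i \models \mathrm{causes}(RC)}} u(RC) \lor u(RC_0)\Biggr) \\
&= \pi\Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_i \models \mathrm{causes}(RC)}} u(RC)\Biggr) + \pi(RC_0) - \pi(RC_0)\,\pi\Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_i \models \mathrm{causes}(RC)}} u(RC)\Biggr) \\
&\overset{(1)}{=} \pi_0(p \mid f_i) + \pi(RC_0) - \pi(RC_0)\,\pi_0(p \mid f_i)
\end{aligned}
\]

and solve for π0(p|fi). For Eq. (7) we calculate

\[
\begin{aligned}
\pi(p \mid f_1, p, \mathrm{do}(f_2))
&= \pi\Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_2 \models \mathrm{causes}(RC)}} u(RC) \lor u(RC_0) \;\Bigg|\; \bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_1 \models \mathrm{causes}(RC)}} u(RC) \lor u(RC_0)\Biggr) \\
&\overset{\text{Def. of cond. prob.}}{=}
\frac{\pi\Biggl(u(RC_0) \lor \Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_2 \models \mathrm{causes}(RC)}} u(RC) \land \bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_1 \models \mathrm{causes}(RC)}} u(RC)\Biggr)\Biggr)}
{\pi\Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_1 \models \mathrm{causes}(RC)}} u(RC) \lor u(RC_0)\Biggr)}
= t_1 \times t_2
\end{aligned}
\]

where we find

\[
t_1 := \frac{\pi\Biggl(u(RC_0) \lor \Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_2 \models \mathrm{causes}(RC)}} u(RC) \land \bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_1 \models \mathrm{causes}(RC)}} u(RC)\Biggr)\Biggr)}{\pi_0(p \mid f_1)},
\qquad
t_2 := \frac{\pi_0(p \mid f_1)}{\pi\Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_1 \models \mathrm{causes}(RC)}} u(RC) \lor u(RC_0)\Biggr)}.
\]

Expanding the outer ∨ and repeatedly applying Eqs. (1) and (2) yields:

\[
t_1 = \frac{\pi(RC_0)}{\pi_0(p \mid f_1)} + \pi_0(p \mid f_1, p, \mathrm{do}(f_2))\,\bigl(1 - \pi(RC_0)\bigr),
\qquad
t_2 = \frac{\pi_0(p \mid f_1)}{\pi_0(p \mid f_1)\,\bigl(1 - \pi(RC_0)\bigr) + \pi(RC_0)}.
\]

Now, solving for π0(p|f1, p, do(f2)) and Eq. (1) yield Eq. (7). Further, for Eq. (8) we expand the ∨ before u(RC0) in

\[
\pi(p \mid f_1, p, \mathrm{do}(f_2)) = \pi\Biggl(\bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_2 \models \mathrm{causes}(RC)}} u(RC) \lor u(RC_0) \;\Bigg|\; \bigvee_{\substack{RC \in P_0 \\ \mathrm{effect}(RC) = p \\ f_1 \models \mathrm{causes}(RC)}} u(RC)\Biggr)
\]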
and repeatedly apply Eqs. (1) and (2). Finally, Eq. (9) follows from Eq. (8) with the identity

\[
\pi_0(p \mid f_1, p, \mathrm{do}(f_2)) \overset{(1)}{=} \frac{\pi(p \mid f_1, p, \mathrm{do}(f_2))\,\pi(p \mid f_1)}{\pi(p \mid f_2)}.
\]

Let π0,p denote the counterfactual backbone of the program P0,p that we obtain from Lemma 3. If we find that π0,p(p|f) = 0 for all frames f of p ∈ P, we know that RC0,p ∈ P was the only clause of P with effect(RC0,p) = p. Otherwise, we can apply Lemma 2 to the counterfactual backbone π0,p to get a most general clause RC1,p defining p in the program P0,p := P \ {RC0,p}. Overall, we repeatedly recover most general clauses defining p in the programs Pi,p := P \ {RC0,p, ..., RCi,p} until we find only loops in the clause search graph Searchp of Pn,p. If the latter is the case, we concentrate the remaining weight in the clauses of the form πn,p(p|f) :: p ← f for all frames f with πn,p(p|f) > 0. To summarize, we just proved the following result:

Theorem 1 (Main Result). Each acyclic proper ProbLog program P in normal form can be reconstructed from its counterfactual semantics.
Finally, we adopt an example credited to Lifshitz [2, Example 8.3] demonstrating that ProbLog programs with deterministic clauses are no longer uniquely determined by their counterfactual semantics. This stresses the necessity of the properness assumption in Theorem 1.

Example 14. Consider the following acyclic ProbLog programs in normal form:

P1:  0.5 :: p    1 :: q ← p    1 :: q ← ¬p
P2:  0.5 :: p    1 :: q
One easily checks that the programs P1/2 give rise to the same counterfactual semantics, i.e. the statement of Theorem 1 is violated for the programs P1/2 .
4 Conclusion
Our main result in this contribution reveals that counterfactuals are expressive enough to describe acyclic proper ProbLog programs in normal form. In particular, we demonstrate how these programs can be reconstructed from their respective counterfactual semantics or counterfactual backbones. Consequently, counterfactuals could be a good language for domain knowledge and program specifications in structure learning. However, learning ProbLog programs with the correct counterfactual semantics seems to be sophisticated as one needs to pin down a single program exactly. One direction for future work would be to investigate the counterfactual expressiveness of ProbLog, i.e. to characterize the counterfactual reasonings that can be represented using ProbLog programs. In this context, an adaptation of our result to non-proper programs would clearly be desirable. As we expect only limited counterfactual knowledge to be available in practice, future work could also put effort into reducing the counterfactual estimations needed to reconstruct a certain program. Since symmetries usually shrink the program space, separately studying the counterfactual semantics of non-ground programs should lead to substantial progress in achieving the second goal.
References

1. Bergadano, F., Gunetti, D.: Inductive Logic Programming: From Machine Learning to Software Engineering. MIT Press, Cambridge (1995). https://doi.org/10.7551/mitpress/3731.001.0001
2. Bochman, A.: A Logical Theory of Causality. The MIT Press, Cambridge (2021). https://doi.org/10.7551/mitpress/12387.001.0001
3. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: a probabilistic Prolog and its application in link discovery. In: 20th International Joint Conference on Artificial Intelligence, vol. 7, pp. 2462–2467. AAAI Press, Hyderabad, India (2007). https://doi.org/10.5555/1625275.1625673
4. Hoeck, N.V.: Cognitive neuroscience of human counterfactual reasoning. Front. Hum. Neurosci. 9 (2015). https://doi.org/10.3389/fnhum.2015.00420
5. Kiesel, R., Rückschloß, K., Weitkämper, F.: "What if?" In probabilistic logic programming. Theory Pract. Logic Program. 1–16 (2023). https://doi.org/10.1017/S1471068423000133
6. Pearl, J.: Causality, 2 edn. Cambridge University Press, Cambridge (2000). https://doi.org/10.1017/CBO9780511803161
7. Rückschloß, K., Weitkämper, F.: Exploiting the full power of Pearl's causality in probabilistic logic programming. In: Proceedings of the International Conference on Logic Programming 2022 Workshops. CEUR Workshop Proceedings, vol. 3193. CEUR-WS.org, Haifa, Israel (2022). http://ceur-ws.org/Vol-3193/paper1PLP.pdf
8. Sato, T.: A statistical learning method for logic programs with distribution semantics. In: Logic Programming: The 12th International Conference, pp. 715–729. The MIT Press, Tokyo, Japan (1995). https://doi.org/10.7551/mitpress/4298.003.0069
Few-Shot Learning of Diagnostic Rules for Neurodegenerative Diseases Using Inductive Logic Programming

Dany Varghese1(B), Roman Bauer1,2, and Alireza Tamaddoni-Nezhad1

1 Department of Computer Science, University of Surrey, Guildford, UK
{dany.varghese,r.bauer,a.tamaddoni-nezhad}@surrey.ac.uk
2 School of Computing, Newcastle University, Newcastle upon Tyne, UK
Abstract. Traditional machine learning methods rely heavily on large amounts of labelled data for effective generalisation, which poses a challenge in few-shot learning scenarios. In many real-world applications, acquiring large amounts of training data can be difficult or impossible. This paper presents an efficient and explainable method for few-shot learning from images using inductive logic programming (ILP). ILP utilises logical representations and reasoning to capture complex relationships and generalise from sparse data. We demonstrate the effectiveness of our proposed ILP-based approach through an experimental evaluation focused on detecting neurodegenerative diseases from fundus images. By extending our previous work on neurodegenerative disease detection, including Alzheimer's disease, Parkinson's disease, and vascular dementia, we achieve improved explainability in identifying these diseases using fundus images collected from the UK Biobank dataset. The logical representation and reasoning inherent in ILP enhance the interpretability of the detection process. The results highlight the efficacy of ILP in few-shot learning scenarios, showcasing its remarkable generalisation performance compared to a range of other machine learning algorithms. This research contributes to the field of few-shot learning using ILP and paves the way for addressing challenging real-world problems.
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 109–123, 2023. https://doi.org/10.1007/978-3-031-49299-0_8

Few-shot learning [2] is a challenging task in machine learning that aims to enable models to generalise and make accurate predictions with only a limited amount of labelled training data available. Traditional machine learning models typically rely on large amounts of labelled data for training to achieve high accuracy. However, when faced with scenarios where only a few training examples are available, these models often struggle to generalise effectively. In contrast, humans exhibit a remarkable capability for one-shot or few-shot learning, where they can quickly grasp new concepts and make accurate predictions with minimal exposure to data. The key reason for the disparity between machine learning models and human performance in few-shot learning lies in the inherent differences in their learning mechanisms. Machine learning models, especially those based on
deep neural networks, rely on a data-driven approach where patterns and representations are learned through optimisation processes. These models require substantial amounts of labelled data to capture the complexity of the underlying problem and generalise effectively. In contrast, humans possess innate cognitive abilities that enable them to reason, abstract, and leverage prior knowledge when encountering novel tasks or scarce data. Human learners can draw upon their vast prior knowledge and existing mental frameworks to make inferences and generalise from limited examples. They can recognise commonalities, abstract underlying concepts, and adapt previous knowledge to new situations. These cognitive abilities, coupled with an innate capacity for transfer learning, allow humans to excel at few-shot learning tasks. Additionally, humans possess a rich set of prior experiences, enabling them to leverage contextual cues, background knowledge, and intuition, often lacking in machine learning models. Inductive Logic Programming (ILP) [16] offers a unique perspective and approach to address the limitations of traditional machine learning in few-shot learning scenarios. By leveraging logical representations and reasoning, ILP can capture and exploit domain-specific knowledge and prior assumptions. This ability allows ILP models to learn from sparse data by generalising from a small number of examples [5,29]. ILP is a subfield of machine learning that combines the principles of logic programming and inductive reasoning to learn logical rules from examples. Unlike traditional machine learning approaches that focus on statistical patterns in data, ILP incorporates logical representations and reasoning to capture complex relationships and generalise knowledge. The logical nature of ILP enables it to represent complex relationships and dependencies explicitly. 
Using logical rules and constraints, ILP models can reason and make inferences beyond the observed examples, providing a strong foundation for few-shot learning. ILP also benefits from incorporating background knowledge, including prior domain expertise, into the learning process. This prior knowledge helps guide the learning process and facilitates better generalisation, even with limited training data. Furthermore, ILP’s capability to handle structured data, such as relational databases or ontologies, is advantageous in few-shot learning tasks that involve complex relationships and hierarchical structures. By representing data logically, ILP can effectively exploit the inherent structure and dependencies in the data, enabling more effective learning from a few examples. In a previous paper [26] we introduced an ILP approach called One-Shot Hypothesis Derivation (OSHD) and we used this for one-shot learning from retinal images to detect neurodegenerative diseases. Building upon the previous work, we propose a novel methodology that utilises a histogram-based binning method for improving interpretability and accuracy in detecting neurodegenerative diseases from retinal images. We also extend the previous study by using state-of-the-art ILP systems PyGol and Metagol as well as comparing with a range of other non-ILP machine learning methods. In the experiments, we focus on the challenging task of learning diagnostic rules for neurodegenerative diseases, including Alzheimer’s, Parkinson’s, and vascular dementia, from small
number of training examples (fundus images). With limited labelled data available for each disease, we investigated whether ILP could successfully learn the discriminative features and accurately classify the diseases based on the fundus images. Through these experiments, we aim to showcase the effectiveness of ILP in addressing few-shot learning challenges. By leveraging its logical representations, reasoning mechanisms, and the incorporation of prior knowledge, ILP holds promise in enhancing the capabilities of machine learning models for few-shot learning tasks.
2 Few-Shot Learning
Few-shot Learning (FSL) is a subfield of machine learning that aims to address the challenge of learning new concepts or tasks with limited labelled data. One of the most widely accepted definitions of FSL is the one provided by Wang et al. in 2020 [30], which defines it in terms of the experience, task, and performance of machine learning [1]. If the program's performance on some class of tasks T, as assessed by some performance measure P, improves with the addition of experience E, then we say that the program has learned from its experience. It is important to stress that E is very small in FSL.

To fully understand FSL, it is important to explain the idea of the N-way-K-shot problem, which is a way to describe different problems in FSL. In this problem setting, the support set is a small set of data that is used for training and then used as a reference for the testing step. Most of the time, the number of categories (N) and the number of samples per category (K) in the reference set are used to describe the N-way-K-shot problem. So, the whole task comprises only N × K samples. For example, N-way-1-shot is a type of one-shot learning in which the reference set has N categories, but each category only has one sample. FSL models can be grouped into the following categories:

1. Active Learning [8,19]
2. Transfer Learning [31,32]
3. Meta-Learning [21,32]
4. ILP Methods [5,26,27,29]
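The N-way-K-shot setting described above can be illustrated with a small sampling routine. This is a minimal sketch under stated assumptions: the toy dataset, its labels and image identifiers, and the function name `sample_support_set` are hypothetical and are not taken from the paper's experiments.

```python
import random


def sample_support_set(dataset, n_way, k_shot, seed=0):
    """Draw an N-way-K-shot support set: N classes with K examples each."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)
    return {label: rng.sample(dataset[label], k_shot) for label in classes}


# Hypothetical toy dataset: class label -> list of image identifiers.
toy_dataset = {
    "alzheimers": ["a1", "a2", "a3"],
    "parkinsons": ["p1", "p2", "p3"],
    "vascular_dementia": ["v1", "v2", "v3"],
    "healthy": ["h1", "h2", "h3"],
}

support = sample_support_set(toy_dataset, n_way=3, k_shot=1)
total = sum(len(examples) for examples in support.values())
print(support, total)
```

With `n_way=3` and `k_shot=1` the support set holds N × K = 3 samples, one per sampled class, which is the N-way-1-shot case mentioned in the text.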
While AI approaches have made significant progress, they still struggle to generalise quickly from limited samples. Successful AI applications often rely on learning from large datasets. In contrast, individuals can quickly learn new activities by using their past experiences and expertise. For computers to match human skills, they must solve the FSL problem. Similar to human learning, computer programs can learn from supervised instances and pre-trained notions like parts and relationships. Another significant scenario where FSL plays a vital role is when acquiring examples with supervised information becomes challenging or even impossible due to concerns regarding privacy, safety, or ethical considerations. Furthermore, FSL offers the advantage of reducing the data-gathering effort necessary for data-intensive applications. By leveraging only a few labelled examples, FSL techniques can effectively generalise and make accurate predictions,
even with limited data. This ability to learn from a small number of examples helps alleviate the burden of collecting and annotating massive amounts of data, making FSL an efficient and practical approach for data-intensive tasks.
3 Feature Extraction for Neurodegenerative Disease Detection from Retinal Images
The human retina, an extension of the central nervous system, provides valuable insights into various neurological conditions [13]. Retinal images capture intricate details, such as vessel abnormalities, optic nerve changes, and retinal layer thickness alterations, which are potential biomarkers for neurological disorders. By analysing retinal images, medical professionals can gain valuable insights into conditions like diabetic retinopathy, glaucoma, multiple sclerosis, and even neurodegenerative diseases like Alzheimer's and Parkinson's [3,15,24]. In the field of medical diagnostics, accurate and timely identification of neurological conditions is crucial for effective treatment and management. Retinal imaging has emerged as a valuable tool in this endeavour, offering a non-invasive and accessible means of examining the intricate structures within the eye. To further enhance diagnostic capabilities, the application of few-shot learning techniques to retinal images has gained traction, enabling efficient and accurate identification of neurological disorders even with limited labelled data. Together, retinal imaging and few-shot learning present a powerful approach to improving neurological diagnosis: few-shot learning algorithms can effectively recognise distinct patterns associated with different neurological conditions, even with limited labelled data.

3.1 Feature Extraction from Retinal Images
Retinal images have become a valuable source of information for various medical applications, including disease diagnosis, monitoring, and treatment. Extracting informative features from retinal images is a crucial step in leveraging the potential of these images for accurate and efficient analysis. In this study, we delve into the realm of feature extraction from retinal images, exploring the advantages and applications of both handcrafted and learned features. Learned Features. The use of deep learning and convolutional neural networks (CNNs) has led to a move towards learning features directly from data. CNN architectures use hierarchical feature extraction to learn representations, often known as deep features or embeddings. The abstract characteristics depict subtle patterns and variances in retinal pictures. CNN models trained on big datasets can automatically learn discriminative features suited for certain tasks. Learned features excel in retinal image processing tasks such as illness categorization, lesion identification, and picture segmentation. Adaptable networks can extract task-specific characteristics, revealing subtle patterns that humans may miss.
Fig. 1. Demonstration of processing steps for vessel segmentation and artery vs. vein classification
We mainly use learned feature techniques for optic-disc localisation and artery/vein classification using Haar-discrete wavelet transform [12] and a pretrained CNN model [7]. Optic disc localisation plays a crucial role in automated retinal image analysis, as it serves as a vital landmark for various diagnostic tasks. The Haar wavelet is a simple, orthogonal wavelet transform that captures variations in an image at different scales. It decomposes the image into low-frequency (approximation) and high-frequency (detail) components. On the other hand, discrete wavelet transform (DWT) extends the Haar wavelet concept to more complex wavelet functions, enabling more sophisticated analysis of image features. Artery/vein classification in retinal images plays a vital role in understanding the vascular structure and dynamics of the human eye. Figure 1 shows the processing steps for vessel segmentation and artery/vein classification. Accurate identification and differentiation of arteries and veins provide valuable insights into various ocular and systemic diseases. In recent years, deep learning techniques, particularly CNNs, have emerged as a powerful approach for automated artery/vein classification. CNN models have revolutionised artery/vein classification in retinal image analysis, providing a robust and efficient approach for the automated identification of vascular structures. Handcrafted Features. Handcrafted features are designed to capture individual retinal traits or patterns. The features are manually designed using domain expertise and expert insights. Early retinal image analysis has commonly utilised handcrafted characteristics, which have proven beneficial in several applications. Handcrafted traits include vessel width, curvature and tortuosity. The benefits of handcrafted features are their interpretability and explicit representation of domain-specific knowledge. Capturing complicated and subtle retinal image alterations is limited by these methods. 
The study used the handcrafted retinal vascular characteristics listed in Table 1; the retinal zones analysed are shown in Fig. 2. The following summary describes the calculations and measurements involved:

– Vascular Calibres: The calibres of the six largest arterioles and the six largest venules were calculated. These measurements represent the width of the vessels.
Table 1. Retinal Vascular Features (RVFs) with the retinal zone of interest

Features   Description                             Retinal Zone
CRAE       Central Retinal Arteriolar Equivalent   B
CRVE       Central Retinal Venular Equivalent      B
AVR        Arteriole-Venular ratio                 B
FDa        Fractal Dimension arteriole             C
FDv        Fractal Dimension venular               C
BSTDa      Zone B Standard Deviation arteriole     B
BSTDv      Zone B Standard Deviation venular       B
TORTa      Tortuosity arteriole                    C
TORTv      Tortuosity venular                      C
Fig. 2. Retinal zones considered in this study [6].
– Standard Deviation of Width in Zone B (BSTD): The standard deviation of the vessel width was calculated for both the arteriolar and venular networks within zone B. This measurement quantifies the variation in vessel width within the specified zone.
– Vascular Equivalent Calibre: Summary measures of vascular equivalent calibre were computed using an improved version of the Knudtson-Parr-Hubbard formula [10,11]. This formula provides estimates of the equivalent single-vessel parent calibre (width) for the six arterioles (CRAE) and six venules (CRVE).
– Arteriole-to-Venule Ratio (AVR): The AVR was calculated by dividing the CRAE (arteriolar equivalent calibre) by the CRVE (venular equivalent calibre). This ratio provides insight into the relative size differences between arterioles and venules.
– Fractal Dimension (FD): The fractal dimension of the retinal vascular network was determined using the box-counting method [14]. The fractal dimension describes the self-similarity or branching pattern of the vascular network across different scales. Higher values indicate a more complex branching pattern.
– Retinal Vascular Tortuosity: Vascular tortuosity refers to the curvature and bending of blood vessels. In this study, the retinal vascular tortuosity was quantified by calculating the integral of the curvature squared along the vessel path, normalized by the total path length [9]. The tortuosity values were averaged across the measured vessels. Smaller tortuosity values indicate straighter vessels.

By extracting and analysing these retinal vascular features, valuable information can be obtained regarding vessel calibres, variations, equivalent calibres, arteriole-to-venule ratio, fractal dimension, and tortuosity. These features provide insights into the structural characteristics of the retinal vascular network and can be utilised in various medical and research applications related to retinal vascular analysis and disease diagnosis.
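Two of the measurements above are simple enough to sketch directly. The following is an illustrative Python sketch, not the pipeline used in the study: it assumes a segmented vessel is given as a set of integer pixel coordinates, estimates the box-counting dimension by a least-squares fit of log N(s) against log(1/s), and computes the AVR as CRAE divided by CRVE. The function names, the box sizes and the calibre values are hypothetical.

```python
import math


def arteriole_venule_ratio(crae, crve):
    """AVR: arteriolar equivalent calibre divided by venular equivalent calibre."""
    return crae / crve


def box_counting_dimension(pixels, box_sizes):
    """Estimate the fractal dimension of a pixel set by box counting.

    For each box size s, count the boxes of side s that contain at least
    one pixel, then fit the slope of log(count) against log(1/s).
    """
    xs, ys = [], []
    for s in box_sizes:
        occupied = {(x // s, y // s) for x, y in pixels}
        xs.append(math.log(1 / s))
        ys.append(math.log(len(occupied)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Sanity checks on shapes with known dimension.
square = {(x, y) for x in range(32) for y in range(32)}
line = {(x, 0) for x in range(32)}
fd_square = box_counting_dimension(square, [1, 2, 4, 8, 16])
fd_line = box_counting_dimension(line, [1, 2, 4, 8, 16])
avr = arteriole_venule_ratio(crae=150.0, crve=200.0)  # illustrative calibres
print(fd_square, fd_line, avr)
```

A filled square gives a dimension of about 2 and a straight line about 1; a branching vessel tree would typically fall between the two, with more complex branching giving higher values.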
4 Histogram-Based Binning Method
Inductive Logic Programming (ILP) is a powerful framework that combines logic programming and machine learning techniques to learn hypotheses from examples. ILP traditionally operates on discrete and symbolic data, relying on logical representations and rules. It excels at capturing patterns and relationships in categorical or discrete domains, making it well-suited for symbolic reasoning tasks. However, the inherent nature of ILP poses obstacles when it comes to handling continuous data. There are certain challenges when we try to include numerical data in the context of ILP:

– Representation: ILP traditionally operates on discrete and symbolic data, which requires a conversion process to represent numerical data appropriately. Representing continuous values as discrete symbols may lead to loss of information and introduce discretisation errors.
– Expressiveness: Logic programming languages typically lack built-in support for numerical operations and comparisons. This limitation hampers the direct handling of numerical data and restricts the expressive power of ILP models.
– Scalability: Numerical data often introduces increased computational complexity due to continuous value ranges and arithmetic computations. This can significantly impact the scalability of ILP algorithms and hinder their efficiency.
– Sensitivity to Scaling: ILP algorithms can be sensitive to the scaling of numerical features. Differences in the magnitude or range of numerical values can significantly impact ILP's ability to extract meaningful patterns or relationships. Inconsistent scaling across features may lead to biased or misleading results.

We introduce a histogram-based binning method for numerical or continuous data to address the above-mentioned issues. Histograms are graphical representations that illustrate the distribution of continuous data.
They are highly valuable for exploratory analysis as they unveil insights about datasets that cannot be captured solely through summary statistics. Histograms visually depict the data’s shape, spread, and central tendencies. By organising the data into bins or intervals along the x-axis and representing the
frequency or count of observations in each bin on the y-axis, histograms enable us to discern patterns, identify outliers, and understand the overall distribution of the data. The advantages of histograms lie in their ability to showcase the underlying characteristics of sample data. Unlike summary statistics such as mean or standard deviation, histograms reveal the specific values and frequencies within each interval, allowing us to grasp the range and concentration of data points at different levels. This level of detail aids in understanding the skewness, kurtosis, multimodality, or presence of gaps in the data distribution, which may not be apparent from mere summary statistics. Histograms serve as a powerful exploratory tool, providing a comprehensive overview of the data and highlighting features such as clusters, peaks, or outliers that might influence subsequent analysis.

Our proposed binning method for ILP takes advantage of histograms’ inherent flexibility and interpretability, allowing for accurate representation of data distributions while preserving relevant statistical properties. The key principle of our binning method is to dynamically determine optimal bin widths based on the characteristics of the dataset. By employing advanced statistical techniques, such as kernel density estimation or adaptive binning algorithms, we ensure that the resulting histograms capture the underlying structure of the data with greater precision. This approach mitigates issues related to subjectivity and arbitrary bin-width choices while maintaining the original data’s integrity. We now define the notions used in the histogram-based binning method.

Definition 1. Number of Bins (k). The number of bins, denoted as k, represents the desired number of equally spaced bins into which the data range is divided. The number of bins determines the level of granularity in the histogram representation.

Definition 2. Width of Bin (w). The width of each bin, denoted as w, represents the size of the interval over which occurrences are counted. It determines the level of granularity in the histogram representation. Let R be the data range; then the bin width is calculated by

$$w = \frac{\max(R) - \min(R)}{k} \tag{1}$$

Definition 3. Bin Edges (B_k). The bin edges, denoted as $B_k = [b_0, b_1, \dots, b_k]$, represent the boundaries of the bins used in the histogram. The bin edges can be calculated as $B_k = \{\min(R) + i \times w : i \in \{0, 1, 2, \dots, k\}\}$.
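The three definitions can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `bin_edges` and `bin_index` are hypothetical, and the convention that the maximum value falls into the last bin is an assumption made here for illustration.

```python
def bin_edges(values, k):
    """Return the k + 1 equally spaced bin edges B_k over the range of `values`."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k  # Eq. (1): bin width
    return [lo + i * w for i in range(k + 1)]


def bin_index(x, edges):
    """Return the 0-based index of the bin containing x.

    Assumption: bins are half-open [b_i, b_{i+1}), except that the maximum
    value is placed in the last bin.
    """
    k = len(edges) - 1
    lo, hi = edges[0], edges[-1]
    if hi == lo:  # degenerate range: every value falls in bin 0
        return 0
    i = int((x - lo) / ((hi - lo) / k))
    return min(max(i, 0), k - 1)


values = [0.0, 2.5, 5.0, 7.5, 10.0]
edges = bin_edges(values, k=4)
indices = [bin_index(v, edges) for v in values]
```

With k = 4 over the range [0, 10], the width is 2.5 and each value is replaced by the index of its interval, which is the kind of discrete symbol an ILP system can then use in place of the raw number.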
5 Empirical Evaluation
In this section we evaluate the effectiveness of the ILP systems, and the binning method described in the previous section, in generating interpretable and accurate rules for detecting neurodegenerative diseases such as Alzheimer’s, Parkinson’s, and vascular dementia from retinal images. We compare different ILP
approaches with a range of statistical machine learning and neural network models. We also demonstrate that by leveraging the binning method, the ILP methods can capture meaningful patterns and relationships within the retinal images, enabling the development of more accurate and explainable diagnostic rules. The data, code and configuration files used in the experiments in this paper are available from https://github.com/hmlr-lab/FSL_Fundus_Images.

5.1 Materials
The data used for this study is extracted from the UK Biobank resources [23]. The UK Biobank is a large-scale project that recruited 500,000 individuals between the ages of 40 and 69 to undergo various tests and have their health monitored over their lifetimes. It is worth noting that only a subset of these participants, specifically 84,767 individuals, had their retinas imaged as part of the study. Retinal imaging was performed using the TOPCON 3D OCT 1000 Mk2 device, which combines optical coherence tomography (OCT) with fundus photography. The imaging procedure focused on capturing images of the macula, the central region of the retina. The resulting images have a 45-degree field of view and dimensions of 2,048 by 1,536 pixels. The information regarding the participants in this study was collected and organised in a large CSV (Comma Separated Values) file. Each row in the CSV file represents a participant, while each column represents a specific data point. The UK Biobank online system provides detailed explanations for the codes used in the column names and the associated data, ensuring transparency and clarity in the dataset. In terms of diagnoses, the dataset follows the International Classification of Diseases, Tenth Revision (ICD-10) coding system. Through a comprehensive analysis of the participant data file, we identified a specific subset of individuals who satisfied two conditions: (1) they had fundus images captured, and (2) they were diagnosed with one of three conditions: Alzheimer’s disease, Parkinson’s disease, or vascular dementia. Within this subset, we found 18 cases of Alzheimer’s, 133 cases of Parkinson’s, and 54 cases of vascular dementia. In addition to the fundus images from these individuals with neurodegenerative conditions, we included images from 528 participants who were confirmed to be healthy concerning these three conditions. 
It is important to note that only fundus images of the left eye were used in this study, ensuring consistency in the dataset and simplifying the analysis process. We extracted artery/vein information using learned features and then derived handcrafted features from this information. Later, the structured data in CSV form was converted into 100 different bins and encoded into logical rules.

5.2 Methods
This section outlines the methodology employed to conduct our study on few-shot learning. We utilised a dataset consisting of images from four distinct classes, each containing 18 images. The dataset was divided into training and test data, with a split ratio of 6:4. To perform the N-way-K-shot learning, we
Table 2. Machine learning algorithms used in this study

ILP models:
1) Meta Inverse Entailment (MIE) - PyGol [25, 28]
2) Meta-Interpretive Learning (MIL) - Metagol_NT [4, 18]
3) One-Shot Hypothesis Derivation (OSHD) - TopLog [26]
4) Inverse Entailment (IE) - Aleph [22]

Other learning models from Scikit-learn [20]:
1) Decision Tree (DT)
2) Naive Bayes (NB)
3) Linear Discriminant Analysis (LDA)
4) Support Vector Machine (SVM)
5) Logistic Regression (LR)
6) Random Forest (RF)
7) Perceptron (Per)
8) Multilayer Perceptron (MLP)
9) K Nearest Neighbors (KNN)
employed various learning models from different domains, including Inductive Logic Programming (ILP), statistical machine learning, and neural network models. Specifically, we utilised 13 different learning models to compare their performance in the context of our study. These include 4 ILP models (IE [17,22], MIL [4,18], OSHD [26] and MIE [25,28]) and 9 non-ILP models from Scikit-learn, as listed in Table 2.

Language Bias. Next, we describe the methodology used to analyse the background knowledge and generate mode declarations and metarules for the ILP systems Aleph, TopLog, and Metagol. These ILP systems rely heavily on user-defined mode declarations and metarules, which are crucial in guiding the learning process. It is important to note that the manual generation of mode declarations and metarules is a user-intensive and highly domain-specific task. The mode declarations and metarules used in the experiment are listed in Table 3. To begin, we carefully examined the background knowledge available for our study. This process involved a comprehensive review of domain-specific information, including the relationship between predicates, as well as the potential types of hypothesis structures that could be learned from the available data. First, we focused on identifying the relationship between predicates within the problem domain. We examined how different predicates could be combined to form meaningful rules and how these rules could be interconnected to represent the underlying knowledge in a logical manner. Simultaneously, we explored the potential hypothesis structures that could be learned from the available data. We considered the possible combinations and arrangements of predicates to form hypotheses that accurately represented the underlying patterns and relationships within the data. In our experiment, we also include the novel ILP system PyGol, and analyse its ability to learn language biases without relying on user-defined mode declarations. 
PyGol offers a promising approach by automating the process of learning language biases, reducing the need for extensive user interaction and manual input. By excluding user-defined mode declarations, PyGol aimed to automatically learn the language biases solely from the available data. This approach
Table 3. Language biases used in ILP models IE (Aleph), OSHD (TopLog) and MIL (Metagol_NT)

Mode Declarations:
modeh(1, diagnosis(A, alzheimers)(+image))
modeb(1, crae(+image, -group))
modeb(1, crve(+image, -group))
modeb(1, avr(+image, -group))
modeb(1, bstda(+image, -group))
modeb(1, bstdv(+image, -group))
modeb(1, fda(+image, -group))
modeb(1, fdv(+image, -group))
modeb(1, torta(+image, -group))
modeb(1, tortv(+image, -group))
modeb(*, lteq(+group, #float))
modeb(*, gteq(+group, #float))

Metarules:
P(A) :- Q(A,B), R(B,C)
P(A,B) :- Q(A,C), R(A)
P(A) :- Q(B,A), R(A,C)
P(A) :- Q(A,B), R(B,C)
P(A) :- Q(A,B), R(B)
P(A) :- Q(A,B), R(A,C)
enabled us to assess PyGol’s ability to capture and represent the inherent biases and patterns present in the dataset without any additional user intervention. In the experimental methodology, the N-way-K-shot algorithm described in Sect. 2 was utilised. The value of K, the number of positive training examples, was set to 2, 4, 6, 8, and 10. Concurrently, a fixed number of five negative examples was chosen from the other three classes. For example, if 2 positive instances were selected from the Alzheimer’s class, 5 negative instances were selected from the vascular dementia, Parkinson’s, and healthy data sets combined. In addition, each experimental episode included twenty iterations (N). We imposed a maximum hypothesis length of five literals, allowing for a maximum of four conditions in the body of each hypothesis. During the testing phase, an equal number of positive and negative examples was chosen so that the default accuracy is 50%. It is important to note that the same training and test instances were used for all assessed models.

5.3 Results and Discussions
In this section, we present the results of our experiments on Alzheimer’s, vascular dementia, and Parkinson’s diseases using various models. We compare the performance of 13 different models as the number of positive examples used for training ranges from 2 to 10. The results are visualised in three separate graphs (Fig. 3), one for each disease. The results show that the ILP models, specifically PyGol and OSHD, outperform the other models. As depicted in the accuracy analysis, the ILP models consistently outperform the alternative models as the number of positive examples increases from 2 to 10. Figure 4 shows example rules learned using PyGol.
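To make the protocol behind these accuracy curves concrete, the following sketch draws one training episode (K positives from the target class, five negatives from the remaining classes combined) and averages a score over twenty repetitions. It is an illustration only: the names `sample_episode`, `mean_accuracy` and `train_and_score` are hypothetical, the toy identifiers stand in for images, and treating the twenty iterations as independent random draws whose scores are averaged is an assumption rather than something the text states.

```python
import random


def sample_episode(examples_by_class, target, k, n_neg=5, rng=random):
    """Draw k positives from `target` and n_neg negatives from all other classes combined."""
    positives = rng.sample(examples_by_class[target], k)
    pool = [x for cls, xs in examples_by_class.items() if cls != target for x in xs]
    negatives = rng.sample(pool, n_neg)
    return positives, negatives


def mean_accuracy(examples_by_class, target, k, train_and_score, iterations=20, seed=0):
    """Average the score returned by `train_and_score` over repeated random episodes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        pos, neg = sample_episode(examples_by_class, target, k, rng=rng)
        scores.append(train_and_score(pos, neg))
    return sum(scores) / len(scores)


# Toy stand-in data: four classes with 18 identifiers each (hypothetical names).
toy = {
    name: [f"{name}_{i}" for i in range(18)]
    for name in ("alz", "park", "vasc", "healthy")
}
pos, neg = sample_episode(toy, "alz", k=4, rng=random.Random(1))

# A placeholder learner that always scores at the 50% default accuracy.
baseline = mean_accuracy(toy, "alz", k=2, train_and_score=lambda p, n: 0.5)
```

Any of the 13 learners could be plugged in through the `train_and_score` callback, which keeps the sampling identical across models.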
Fig. 3. Comparing the performance of various ILP (MIE, MIL, OSHD, IE) and non-ILP algorithms for learning diagnostic rules for Alzheimer’s, Parkinson’s, and vascular dementia
Among the ILP models evaluated, both PyGol and OSHD (TopLog) consistently demonstrate improvement in accuracy across all three diseases as the number of positive examples increases. However, Aleph did not exhibit a strong performance, specifically in the experiment related to vascular dementia detection. Notably, PyGol emerges as the frontrunner with the highest accuracy among all the models, showcasing its exceptional ability to learn from a limited number of positive examples effectively. These results highlight the remarkable effectiveness of ILP models, particularly PyGol, in addressing the challenges associated with learning from a small set of positive examples, thereby underscoring their potential for accurate disease detection. The MIL model did not demonstrate significant performance across the experiments. This could be attributed to a couple of reasons. Firstly, our dataset containing continuous values may have introduced noise, making it challenging for MIL to find generalised hypotheses. Secondly, it is possible that the metarules we utilised in the MIL model did not provide sufficient expressive power to generate effective hypotheses. The limitations of the metarules may have constrained the model’s ability to capture the complex patterns and relationships present in the data. Furthermore, it is evident from our results that both statistical and neural network models struggle to learn efficiently from a small number of examples.
Fig. 4. Sample diagnostic rules learned by PyGol for Alzheimer’s disease
As the number of positive examples increases, the performance of these models does improve to some extent, but they generally fall short compared to the ILP models, particularly PyGol and OSHD (TopLog). This limitation can be attributed to the inherent complexity and flexibility of statistical and neural network models, which typically require a larger amount of data to capture the underlying patterns and relationships effectively. The rules shown in Fig. 4 exhibit higher interpretability and accuracy than those obtained in our previous work [26], where the ranges of the numerical values were explicitly fixed. The binning mechanism, which discretises the continuous data, is crucial in improving the interpretability of these rules. By converting the numerical data into bins, the ILP models can capture the underlying patterns more effectively. Moreover, the binning approach contributes to the improved accuracy of the ILP models.
6 Conclusions
The experiments have provided valuable insights into the performance of various models for few-shot learning for neurodegenerative disease detection. The results demonstrate the effectiveness of ILP models, particularly PyGol and OSHD, in learning from a small number of positive examples. These ILP models consistently outperformed statistical and neural network models, showcasing their ability to address the challenges of few-shot learning. Additionally, the histogram-based binning approach proved to be a valuable technique for enhancing the interpretability and accuracy of ILP models. By discretising the continuous data, the binning mechanism enabled the ILP models to capture meaningful thresholds and ranges, leading to more interpretable rules. The binning approach also contributed to improved accuracy by effectively capturing important features and patterns in the data. The histogram-based binning mechanism offers a practical solution for enhancing the interpretability and accuracy of ILP models. In conclusion, our study demonstrates the efficacy of ILP models, particularly PyGol, in addressing the challenge of disease detection from limited image data. The utilisation of ILP models, coupled with the histogram-based binning mechanism, provides a powerful and promising approach for accurate and interpretable disease detection. Also, our study highlights the potential of PyGol in leveraging limited training data for accurate and interpretable disease detection. Applying
the histogram-based binning mechanism further enhances the performance of ILP models, paving the way for advancements in similar disease detection from small data using ILP.

Acknowledgments. The first author would like to acknowledge the Vice Chancellor’s PhD Scholarship Award at the University of Surrey. The third author would like to acknowledge the EPSRC Network Plus grant on Human-Like Computing (HLC) and the EPSRC grant on human-machine learning of ambiguities. This research has been conducted using the UK Biobank Resource (Application No 1969).
References

1. Machine Learning. McGraw Hill, New York (1997)
2. Chen, W.Y., Liu, Y.C., Kira, Z., Wang, Y.C., Huang, J.B.: A closer look at few-shot classification. In: International Conference on Learning Representations (2019)
3. Cheung, C.Y.l., Ikram, M.K., Chen, C., Wong, T.Y.: Imaging retina to study dementia and stroke. Prog. Retin. Eye Res. 57, 89–107 (2017)
4. Cropper, A., Muggleton, S.H.: Metagol system (2016). https://github.com/metagol/metagol
5. Dai, W.Z., Muggleton, S., Wen, J., Tamaddoni-Nezhad, A., Zhou, Z.H.: Logical vision: one-shot meta-interpretive learning from real images. In: ILP (2017)
6. Frost, S., Kanagasingam, Y., Sohrabi, H., Vignarajan, J., Bourgeat, P., et al.: Retinal vascular biomarkers for early detection and monitoring of Alzheimer’s disease. Transl. Psychiatry 3, e233 (2013)
7. Galdran, A., Meyer, M., Costa, P., Mendonça, Campilho, A.: Uncertainty-aware artery/vein classification on retinal images. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 556–560 (2019)
8. Garcia, V., Bruna, J.: Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043 (2017)
9. Hart, W.E., Goldbaum, M., Côté, B., Kube, P., Nelson, M.R.: Measurement and classification of retinal vascular tortuosity. Int. J. Med. Inform. 53(2), 239–252 (1999)
10. Hubbard, L.D., Brothers, R.J., King, W.N., et al.: Methods for evaluation of retinal microvascular abnormalities associated with hypertension/sclerosis in the atherosclerosis risk in communities study. Ophthalmology 106(12), 2269–2280 (1999)
11. Knudtson, M., Lee, K.E., Hubbard, L., Wong, T., et al.: Revised formulas for summarizing retinal vessel diameters. Curr. Eye Res. 27, 143–149 (2003)
12. Lalonde, M., Beaulieu, M., Gagnon, L.: Fast and robust optic disc detection using pyramidal decomposition and Hausdorff-based template matching. IEEE Trans. Med. Imaging 20, 1193–200 (2001)
13. London, A., Benhar, I., Schwartz, M.: The retina as a window to the brain - from eye research to CNS disorders. Nat. Rev. Neurol. 9 (2012)
14. Mainster, M.: The fractal properties of retinal vessels: Embryological and clinical implications. Eye 4, 235–241 (1990)
15. McGrory, S., Taylor, A.M., Kirin, et al.: Retinal microvascular network geometry and cognitive abilities in community-dwelling older people: the Lothian birth cohort 1936 study. Ophthalmology 101(7), 993–998 (2017)
16. Muggleton, S.: Inductive logic programming. ACM 5, 5–11 (1994)
17. Muggleton, S.: Inverse entailment and progol. New Gener. Comput. 13, 245–286 (1995). https://doi.org/10.1007/BF03037227
18. Muggleton, S., Lin, D., Tamaddoni, N.A.: Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. MLJ 100, 49–73 (2015)
19. Müller, T., Pérez-Torró, G., Basile, A., Franco-Salvador, M.: Active few-shot learning with FASL. arXiv preprint arXiv:2204.09347 (2022)
20. Pedregosa, F., Varoquaux, G., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
21. Ren, M., et al.: Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676 (2018)
22. Srinivasan, A.: A learning engine for proposing hypotheses (aleph) (2001). https://www.cs.ox.ac.uk/activities/programinduction/Aleph/aleph.html
23. Sudlow, C., et al.: UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med. (2015)
24. Tian, J., Smith, G., Guo, H., Liu, B., Pan, Z., et al.: Modular machine learning for Alzheimer’s disease classification from retinal vasculature. Sci. Rep. 11(1) (2021)
25. Varghese, D., Barroso-Bergada, D., Bohan, D.A., Tamaddoni-Nezhad, A.: Efficient abductive learning of microbial interactions using meta inverse entailment. In: Proceedings of the 31st International Conference on ILP (2022, in press)
26. Varghese, D., Bauer, R., Baxter-Beard, D., Muggleton, S., Tamaddoni-Nezhad, A.: Human-like rule learning from images using one-shot hypothesis derivation. In: Katzouris, N., Artikis, A. (eds.) ILP 2021. LNCS, vol. 13191, pp. 234–250. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-97454-1_17
27. Varghese, D., Patel, U., Krause, P., Tamaddoni-Nezhad, A.: Few-shot learning for plant disease classification using ILP. In: Garg, D., Narayana, V.A., Suganthan, P.N., Anguera, J., Koppula, V.K., Gupta, S.K. (eds.) IACC 2022. Communications in Computer and Information Science, vol. 1781, pp. 321–336. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-35641-4_26
28. Varghese, D., Tamaddoni-Nezhad, A.: PyGol. https://github.com/PyGol/
29. Varghese, D., Tamaddoni-Nezhad, A.: One-shot rule learning for challenging character recognition. In: Proceedings of the 14th International Rule Challenge, pp. 10–27, August 2020
30. Wang, Y., Yao, Q., Kwok, J.T., Ni, L.M.: Generalizing from a few examples: a survey on few-shot learning, 53(3) (2020)
31. Yu, Z., Chen, L., Cheng, Z., Luo, J.: TransMatch: a transfer-learning scheme for semi-supervised few-shot learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12856–12864 (2020)
32. Zhu, P., Zhu, Z., Wang, Y., Zhang, J., Zhao, S.: Multi-granularity episodic contrastive learning for few-shot learning. Pattern Recogn. 108820 (2022)
An Experimental Overview of Neural-Symbolic Systems

Arne Vermeulen^3, Robin Manhaeve^{1,2}(✉), and Giuseppe Marra^{1,2}

1 Department of Computer Science, KU Leuven, Leuven, Belgium
[email protected]
2 Leuven.AI, Leuven, Belgium
3 Leuven, Belgium
Abstract. Neural-symbolic AI is the field that seeks to integrate deep learning with symbolic, logic-based methods, as they have complementary strengths. Lately, more and more researchers have encountered the limitations of deep learning, which has led to a rise in the popularity of neural-symbolic AI, with a wide variety of systems being developed. However, many of these systems are either evaluated on different benchmarks, or introduce new benchmarks that other systems have not been tested on. As a result, it is unclear which systems are suited to which tasks, and whether the difference between systems is actually significant. In this paper, we give an overview and classification of the tasks used in state-of-the-art neural-symbolic systems. We show that most tasks fall in one of five categories, and that very few systems are compared on the same benchmarks. We also provide a methodological experimental comparison of a variety of systems on two popular tasks: learning with distant supervision and structured prediction. Our results show that systems based on (probabilistic) logic programming achieve superior performance on these tasks, and that the performance amongst these methods does not differ significantly. Finally, we also discuss how the properties of the (probabilistic) logic programming-based systems are desirable for most neural-symbolic tasks.

Keywords: Neural-symbolic AI · Probabilistic Logic Programming · Experimental Survey

1 Introduction
In the past decade, deep learning has been shown to be very effective in a wide variety of fields, ranging from image classification to natural language processing and beyond. Lately, more and more researchers have encountered the limitations of deep learning [2,11], which has led to a rise in the popularity of neural-symbolic AI. Neural-symbolic AI (NeSy) is the field that seeks to integrate deep learning with symbolic, logic-based methods, as they have complementary strengths.

A. Vermeulen and R. Manhaeve: shared first authors.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 124–138, 2023. https://doi.org/10.1007/978-3-031-49299-0_9
Deep learning is notoriously data hungry, and does not perform well in tasks that require reasoning. Symbolic AI, on the other hand, works well in the small data regime and can perform symbolic reasoning; however, it does not work on non-symbolic data and is not noise tolerant. The widespread adoption of NeSy has stimulated the development of a diverse range of systems. State-of-the-art NeSy systems can be broadly classified [20] into two distinct categories: (1) systems that employ logic as a constraint to regularize neural networks during the training process, and (2) logical frameworks that incorporate additional primitives to facilitate the integration of neural networks. Consequently, systems within the same category often employ remarkably similar techniques. Despite such similarities, the evaluation of NeSy systems presents two significant challenges. Firstly, many of these systems are evaluated using different benchmarks or introduce novel benchmarks that have not been utilized by other systems. Secondly, in the few cases where systems are evaluated on the same benchmark, they typically leverage different underlying neural models or employ distinct optimization procedures. This often makes it impossible to ascertain whether observed differences are attributable to the novel neural-symbolic technique or simply a result of more sophisticated neural models or meticulous parameter tuning. As a consequence, considerable uncertainty remains regarding which systems are well-suited for specific tasks and whether the observed differences between systems are statistically significant. In this paper, we fill this gap by providing an overview and a classification of the tasks employed in state-of-the-art neural-symbolic systems and by conducting a fair and controlled evaluation of state-of-the-art NeSy systems. 
In particular, we first show that the majority of tasks in neural-symbolic systems can be effectively categorized into five distinct groups: 1) distantly supervised classification, 2) structured prediction, 3) knowledge base completion, 4) learning to optimize, and 5) structure learning. This classification not only provides a systematic understanding of the diverse range of tasks but also highlights how many systems could in principle be evaluated on similar tasks instead of constantly introducing new ones in the same category. Furthermore, we provide an overview of which systems are evaluated on which benchmarks, showing that this mapping is very close to one-to-one, which makes comparisons between systems extremely hard. Finally, we present a methodological experimental comparison of numerous systems. Specifically, we focus on two popular tasks: learning with distant supervision and structured prediction. These comparative experiments enable us to elucidate the true differentiating factors between systems and ascertain whether observed differences are indeed a result of the novel neural-symbolic techniques or merely attributable to alternative factors such as more sophisticated neural models or refined parameter tuning. By addressing the challenges posed by evaluating systems with different neural models and different optimization procedures, our work contributes to the advancement of the neural-symbolic field, offering valuable insights into system performance and guiding researchers in selecting appropriate systems for specific tasks based on their comparative evaluations.
2 Background

We now give a categorization of NeSy methods into two separate approaches: regularization-based approaches and logic programming-based approaches. For a more in-depth classification of NeSy systems, we refer to works such as [3,20].

2.1 Regularization-Based Approaches
In this approach, the method uses logical constraints to regularize the neural network during training. A differentiable measure is calculated that indicates how strongly the predictions from the neural network violate the constraints. During training, this measure is optimized as a loss alongside the other losses. Although many methods follow this approach, they differ in the expressivity of the constraints, and the exact measure that is being calculated. The latter can range from a probabilistic interpretation, e.g. the Semantic Loss (SL) [28], to a fuzzy logic one, e.g. Semantic Based Regularization (SBR) [6] or Logic Tensor Networks (LTN) [1].

Example 1: Semantic Loss. Let $p_i$ denote the probability of a propositional variable $X_i$ being true, corresponding to a single output of a neural network with $n$ outputs and sigmoid activation. Intuitively, the semantic loss is the negative logarithm of the probability of generating a state that satisfies a logical constraint $\beta$ when sampling values according to $p$. Suppose you want to solve a multi-class classification task (example adapted from [28]), where each input example must be assigned to a single class. For a set of three variables, one can encode a mutual exclusivity constraint as

$$\beta = (X_1 \land \lnot X_2 \land \lnot X_3) \lor (\lnot X_1 \land X_2 \land \lnot X_3) \lor (\lnot X_1 \land \lnot X_2 \land X_3)$$

and build the corresponding semantic loss term:

$$L(\beta, p) = -\log\big[\, p_1 (1 - p_2)(1 - p_3) + (1 - p_1)\, p_2 (1 - p_3) + (1 - p_1)(1 - p_2)\, p_3 \,\big]$$

In a semi-supervised classification setting, this loss can be applied to the unlabeled examples, together with a standard supervised loss (e.g. cross entropy) on the labeled examples.
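The semantic loss of Example 1 can be checked numerically with a small sketch. This is plain Python rather than a differentiable implementation, and the function names `semantic_loss` and `exactly_one_loss` are hypothetical. The first function enumerates all truth assignments, which is only feasible for a handful of variables; the second evaluates the closed form given above for the exactly-one constraint.

```python
import math
from itertools import product


def semantic_loss(p, constraint):
    """-log of the probability mass of all assignments that satisfy `constraint`.

    p[i] is the probability that X_i is true; variables are treated as independent.
    """
    mass = 0.0
    for state in product([False, True], repeat=len(p)):
        if constraint(state):
            weight = 1.0
            for p_i, x_i in zip(p, state):
                weight *= p_i if x_i else (1.0 - p_i)
            mass += weight
    return -math.log(mass)


def exactly_one_loss(p):
    """Closed form for the mutual exclusivity constraint of Example 1."""
    total = 0.0
    for i in range(len(p)):
        term = p[i]
        for j, p_j in enumerate(p):
            if j != i:
                term *= 1.0 - p_j
        total += term
    return -math.log(total)


p = [0.7, 0.2, 0.1]
loss_enumerated = semantic_loss(p, lambda state: sum(state) == 1)
loss_closed_form = exactly_one_loss(p)
```

Both functions return the same value for the three-variable case, and the loss approaches zero as the network output approaches a one-hot vector.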
2.2 Logic Programming-Based Approaches
In this approach, the method starts from a logic programming framework, which serves as the basis for representing knowledge and reasoning with it. Often, a probabilistic or fuzzy extension of a logic programming framework is used, as it matches better with the paradigm of neural networks. It is then extended
with a mechanism that allows it to interface with neural networks. A widely used example of such a mechanism is the neural predicate [15]. The neural predicate allows (probabilistic) logic programming languages to parameterize the program with probabilities predicted by a neural network. Examples of such systems include DeepProbLog [15,16], DeepStochLog [27] and NeurASP [29].

Example 2: DeepProbLog. DeepProbLog [15] is a neural extension of the probabilistic logic programming language ProbLog. DeepProbLog introduces neural predicates to allow images or other sub-symbolic representations as terms of the program. Consider the following program:

nn(image_classifier, Img, Y, [0,1,2,...,9]) :: digit(Img,Y).
add(Img1,Img2,Z) :- digit(Img1,Y1), digit(Img2,Y2), Z is Y1+Y2.

In the first clause, a neural predicate takes an image as input and returns a probability distribution over the 10 possible digits. The second clause computes the value of the addition given the two digits.
Example 3: NeurASP. NeurASP follows the same path as DeepProbLog by extending Answer Set Programming with neural predicates. The problem of Example 2 can be modelled as:

img(i1). img(i2).
addition(A,B,N) :- digit(0,A,N1), digit(0,B,N2), N=N1+N2.
nn(digit(1,X), [0,1,2,3,4,5,6,7,8,9]) :- img(X).

An important difference between NeurASP and the previous system is that NeurASP, being based on ASP, performs logical inference by forward chaining and is therefore not query oriented. A further analysis of this aspect is given in Section 5.3.
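As a rough illustration of what the addition programs above compute under a probabilistic semantics, the sketch below combines two digit distributions into a distribution over their sum: the probability of a sum z is the total probability of all digit pairs that add up to z. This is plain Python, not the API of DeepProbLog or NeurASP; the function names are hypothetical and the two input distributions stand in for the outputs of a neural classifier.

```python
import math


def sum_distribution(p1, p2):
    """Distribution over Y1 + Y2 given independent digit distributions p1 and p2."""
    out = [0.0] * (len(p1) + len(p2) - 1)
    for y1, a in enumerate(p1):
        for y2, b in enumerate(p2):
            out[y1 + y2] += a * b  # every pair (y1, y2) is one way to obtain y1 + y2
    return out


def distant_supervision_loss(p1, p2, observed_sum):
    """Negative log-likelihood of the observed sum; no per-digit labels are used."""
    return -math.log(sum_distribution(p1, p2)[observed_sum])


uniform = [0.1] * 10
p_sum = sum_distribution(uniform, uniform)

# One-hot predictions for the digits 3 and 5 put all mass on the sum 8.
three = [1.0 if d == 3 else 0.0 for d in range(10)]
five = [1.0 if d == 5 else 0.0 for d in range(10)]
loss = distant_supervision_loss(three, five, 8)
```

Because the loss depends on the digit probabilities only through the sum, its gradient can flow back to the classifier even though individual digits are never labeled, which is the distant supervision setting discussed in Sect. 4.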
3 Related Work
Several works have surveyed the field of NeSy. In [20], the authors identify seven dimensions that are shared between NeSy and the related field of statistical relational artificial intelligence (StarAI). One of them is that of symbolic logical inference, either based on models or on proofs, which gives rise to the two classes of systems in Sect. 2. Other important dimensions are concerned with the continuous semantics of the logic, either probabilistic or fuzzy, and the logical framework of reference, either propositional or first-order. In [13], the authors discuss the differences between Graph Neural Networks and NeSy systems. In [3], the authors provide a taxonomy that can be used to analyze and classify a wide variety of NeSy systems. Another categorization of NeSy systems can be found in [5], where they use a neural network viewpoint by investigating in
which components (i.e. input, loss or structure) symbolic knowledge is injected. Although all these works provide different interesting views on the NeSy field from a conceptual viewpoint, they do not compare the methods experimentally. They have identified that some methods differ only very slightly, but do not quantify whether this results in an actual difference in performance.
4 Tasks
In this section, we discuss the most common tasks used in NeSy, and give an overview of which NeSy systems are evaluated on them. In Table 1, we provide an overview of the benchmarks used by state-of-the-art NeSy frameworks, categorized along the 5 types of tasks as discussed below.

Task 1: Distantly Supervised Classification. In this setting, we have supervised examples, but the supervision is not available at the level of the classifier that needs to be trained. A typical example of a distantly supervised classification task is the MNIST addition task [15], where pairs of handwritten digits are only labeled with their sum (further explained in Sect. 5.1). We consider this distantly supervised, since the individual digits need to be learned, but the supervision is only available at the level of pairs of digits.

Task 2: Structured Prediction. In this setting, the system needs to make a prediction over structured objects. This means that the entities in the data can no longer be considered separately, but the structure that connects them needs to be considered as well. One example of such a task is in the CiteSeer dataset [10]. Here, documents need to be classified in one of 6 classes. Each document is described as a bag of words. However, the separate documents are connected in a citation network, where an edge represents the fact that one document cites another. The structure of this network is essential to solve this task, and a prediction needs to be made for the network as a whole. In situations like these, the prediction task is often called a collective classification task because the performance of the model can improve by collectively predicting the classes of the publications that cite each other.

Task 3: Knowledge Base Completion. A knowledge base is a collection of entities, and the relations between the entities. However, the relations between the entities are often incomplete. In this setting, the system needs to generate the missing relations from the incomplete knowledge base.

Task 4: Learning to Optimize. Optimization problems are often very hard to solve exactly. However, in many cases, a good approximation to the exact solution is sufficient, and a lot easier to find. In this setting, the system is expected to learn how to generate an (approximately) optimal solution to the input problem. For these kinds of problems (e.g. the travelling salesman problem), there exist methods that can solve them exactly, but they can be computationally
An Experimental Overview of Neural-Symbolic Systems
129
intractable. By using a neural-symbolic approach, neural networks can be used to predict an (approximate) solution, while logic can incorporate the knowledge of the problem (i.e. what is a valid route). Task 5: Structure Learning. For most tasks, the logical structure of the problem is considered to be given. However, in some cases, this might be partially or completely missing. In this setting, the system is supposed to learn the structure as well as any (neural) parameters of the model. This structure usually consists of logical formulas or clauses.
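To make the distantly supervised setting of Task 1 concrete, the following minimal sketch shows how a sum label alone can define a training signal for two digit predictions. It assumes the two digit classifiers output independent probability distributions over 0 to 9 and scores the observed sum by its total probability, in the spirit of probabilistic logic approaches. The function names `sum_distribution` and `distant_loss` are hypothetical and not taken from any of the evaluated systems.

```python
import math


def sum_distribution(p_first, p_second):
    """Distribution over the sum of two independent digit predictions.

    P(sum = s) is the total probability of all digit pairs (a, b) with a + b = s.
    """
    p_sum = [0.0] * (len(p_first) + len(p_second) - 1)
    for a, pa in enumerate(p_first):
        for b, pb in enumerate(p_second):
            p_sum[a + b] += pa * pb
    return p_sum


def distant_loss(p_first, p_second, sum_label):
    """Negative log-likelihood of the observed sum; no digit labels are used."""
    return -math.log(sum_distribution(p_first, p_second)[sum_label])


# Illustrative inputs only: two completely uncertain classifiers.
uniform = [0.1] * 10
print(distant_loss(uniform, uniform, 8))
```

The loss depends on the digit predictions only through the probability of the sum, which is why the individual digit classes have to be discovered during training.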
5
Experimental Evaluation
In this paper, we focus on the tasks of the first two categories, as they are the most prevalent in the literature.
5.1
Tasks
Distant Supervision. For the task of distantly supervised classification, we use the MNIST addition dataset [15]. In this task, the input consists of pairs of handwritten digits that are labeled with the sum of the two digits, e.g. a pair of digit images labeled with 8. As the MNIST dataset has been criticized as being too simple, we also investigate the impact of using the Fashion-MNIST dataset instead. Finally, many real-world datasets contain mislabeled examples (i.e. label noise), and deep learning has already proven to be robust to this kind of noise [22]. So, following [16], we introduce different levels of label noise to the MNIST addition dataset and look at the impact on the performance of the different systems.
Structured Prediction. For the task of structured prediction, we consider the CiteSeer and Cora datasets [10]. They consist of scientific publications (documents) that need to be categorized into different research domains. Each document is represented as a bag of words, encoded as a binary vector that indicates whether each word of the vocabulary appears in the document. These publications are connected in a citation graph, where an edge represents that a document cites another document. Two documents likely belong to the same domain if they are connected by such a link. The publications were selected in such a way that every paper cites or is cited by at least one other publication.
5.2
Methodology
For each of the methods and each of the tasks, we first perform a grid search for the optimal hyperparameters. The hyperparameters, the ranges of values we search over, and their optimal values are given in Table 2. The optimal values for the aggregation parameters for the LTN are 2 and 1 for the universal and existential quantification respectively. For each of the neural-symbolic methods, one neural architecture is specified per task that will be
[Table 1, rotated in the original. Columns: Task, Experiment, Dataset, and one column per system: [9] [15] [27] [23] [24] [7] [29] [14] [8] [25] [26] [19] [21] [18] [28] [6] [4]. An X marks a system that was evaluated on a benchmark; the individual X marks are not reproduced here.]
Tasks: 1, 2, 3, 4, 5.
Experiments: digit classification from addition; pairs of consecutive numbers; results of grid of images; recognition of digits; classification of images; solving handwritten formulas; check if correct math operation; context-sensitive grammar; word algebra problems; learning how to solve sudoku; classification of bounding boxes; detection of the part-of relation; traffic sign recognition; predict genre of movie; document classification; webpages classification; predicting if country is in region; visual question answering; question answering; check if triple belongs to KB; knowledge base completion; learn preferences for users; learning the shortest path; grid path finding; molecule graph generation; sorting integers; create Datalog programs; ILP experiments.
Datasets: MNIST; FASHION; CIFAR-10; HWF; HASY; PASCAL-Part; GTSRB; Arnetminer; CiteSeer; Cora; WebKB; Nations; UMLS; Kinship; Countries; CLEVR; MedHop; WikiHop; WikiMovies; Freebase; WordNet; PrefLib; UW-CSE; ChEMBL.
Table 1. An overview of which neural-symbolic systems (from [20]) have been evaluated on which tasks.
A. Vermeulen et al.
An Experimental Overview of Neural-Symbolic Systems
used by all the methods. For the first task, the following neural architecture is used: Conv2D(6,5), MaxPool2D(2), ReLU, Conv2D(16,5), MaxPool2d(2,2), ReLU, Dropout, Linear(120), ReLU, Linear(84), ReLU, Linear(10), Softmax. For the second task, the following neural architecture is used: Dropout, Linear(840), ReLU, Linear(84), ReLU, Linear(6), Softmax. To determine the statistical significance of the final results, we use the paired t-test with significance level α = 0.05.
Table 2. The hyperparameters considered in the grid search and their optimal values (separated by a / if they are different for the two tasks).
Method                        Hyperparameter             Grid values
All                           Batch size                 2, 8, 32, 128
All                           Learning rate              1e-3, 1e-4
All                           Dropout                    0, 0.2
LTN                           Aggregation parameter p    1, 2
LTN (distant supervision)     Batch size                 8, 32, 128, 512
LTN (structured prediction)   Batch size                 32, 128, 512, 2048

Hyperparameter   NN          SL      LTN      DeepProbLog   DeepStochLog   NeurASP
Batch size       128         2       32/512   2             2/32           8/32
Learning rate    1e-3/1e-4   1e-4    1e-3     1e-3/1e-4     1e-4           1e-4/1e-3
Dropout          0.2/0       0/0.2   0/0.2    0.2/0         0              0
5.3
Results
Distant Supervision. The results for the MNIST addition dataset are shown in Table 3, where we report the average accuracy on the test set for both the MNIST addition and the Fashion-MNIST addition datasets. The accuracy on the test set is also shown in Fig. 1. It shows that there is not much variation in the maximal achieved accuracy on this task for the neural-symbolic methods, but there is a clear variation in the average accuracy. The paired t-test suggests that the differences between DeepProbLog, DeepStochLog and NeurASP are not statistically significant. They perform slightly better than Semantic Loss. Furthermore, this test indicates that Logic Tensor Networks and DeepProbLog with approximate inference perform similarly (p-value = 0.10). The differences between the neural network baseline and the neural-symbolic methods are all statistically significant. The spread of both Logic Tensor Networks and DeepProbLog with approximate inference (DPLA) in Fig. 1 shows they are very sensitive to variations in their initialization and can sometimes get stuck in a local optimum. These
optimization problems were already noted by [12,17] for Logic Tensor Networks and DPLA respectively. Since the perception component is more difficult for the Fashion-MNIST dataset, it is expected that the performance of all the methods is lower than on the MNIST addition dataset. However, when comparing the results, we come to the same conclusions as for the MNIST addition dataset. Finally, Fig. 3 shows how robust the different methods are to varying levels of label noise (0.1, 0.25 and 0.5). Both Logic Tensor Networks and DeepProbLog with approximate inference experience a drop in accuracy if noise is added to the dataset. Logic Tensor Networks struggle especially in the case where half of the labels are corrupted by noise. All the other neural-symbolic methods show clear robustness to this kind of noise and successfully inherit this strength of deep learning.
Table 3. Average accuracy and associated standard deviation on the test set for the MNIST addition and Fashion-MNIST addition datasets for 10 different seeds.
Method                      MNIST addition   Fashion-MNIST addition
NN baseline                 63.71 ± 7.85     44.22 ± 5.04
Semantic Loss               97.37 ± 0.44     74.77 ± 1.77
Logic Tensor Networks       90.11 ± 8.68     64.13 ± 13.68
DeepProbLog                 97.78 ± 0.11     78.60 ± 0.60
DeepProbLog (approximate)   80.10 ± 19.35    52.74 ± 6.88
DeepStochLog                97.73 ± 0.17     79.18 ± 0.83
NeurASP                     98.00 ± 0.43     79.31 ± 0.92
Fig. 1. Boxplot to visually represent the accuracy on the test set for the MNIST addition dataset for 10 different seeds. The average accuracy and standard deviation are reported in Table 3
Fig. 2. Boxplot to visually represent the accuracy on the test set for the Fashion MNIST addition dataset for 10 different seeds. The average accuracy and standard deviation are reported in Table 3
Fig. 3. Average accuracy on the test set for the MNIST addition dataset for neural-symbolic methods that are trained with a training set that contains noise (10 different seeds). Label noise rates 0, 0.1, 0.25 and 0.5 were tested.
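The label noise rates in Fig. 3 can be illustrated with a small sketch of how sum labels might be corrupted before training. It assumes, hypothetically, that each label is replaced with probability equal to the noise rate by a sum drawn uniformly from the 19 possible values 0 to 18. The exact protocol of [16] may differ, and `add_label_noise` is an illustrative name.

```python
import random


def add_label_noise(sum_labels, noise_rate, num_sums=19, seed=0):
    """Return a copy of the labels in which a fraction is replaced at random.

    Assumption: a corrupted label is drawn uniformly from range(num_sums),
    so it can coincide with the original label by chance.
    """
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    noisy = []
    for label in sum_labels:
        if rng.random() < noise_rate:
            noisy.append(rng.randrange(num_sums))
        else:
            noisy.append(label)
    return noisy


# Illustrative labels only, corrupted at the highest rate tested in Fig. 3.
print(add_label_noise([3, 8, 18, 0, 11], noise_rate=0.5))
```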
Structured Prediction. The results for the CiteSeer dataset are shown in Table 4 and Fig. 4. They indicate that the performance of NeurASP does not differ significantly from that of DeepStochLog (p-value = 0.51) or Semantic Loss (p-value = 0.12). However, it does differ from that of DeepProbLog (p-value = 0.04). This result can be explained by the fact that DeepProbLog was not able to perform inference on all the testing examples, as it used too much memory for the
Fig. 4. A boxplot of the accuracy on the test set for the CiteSeer dataset for 10 different seeds. The average accuracy and standard deviation are reported in Table 4
instances that were cited more than ten times. These examples were therefore considered as wrongly classified. 12 instances of the validation set (2.40%) and 25 instances of the test set (2.50%) were removed. With more available resources, it is possible that DeepProbLog classifies these instances correctly, which would improve the accuracy by 2.40%. In that case, it would likely perform similarly to NeurASP and DeepStochLog.
Table 4. Average accuracy and associated standard deviation on the test set for the CiteSeer and Cora datasets for 10 different seeds.
Method                      CiteSeer       Cora
NN baseline                 73.12 ± 3.40   34.07 ± 3.31
Semantic Loss               76.71 ± 1.22   82.49 ± 0.86
Logic Tensor Networks       73.99 ± 3.36   37.05 ± 7.29
DeepProbLog                 76.81 ± 0.74   84.05 ± 0.43
DeepProbLog (approximate)   72.11 ± 0.80   78.56 ± 0.78
DeepStochLog                78.26 ± 0.76   83.81 ± 0.71
NeurASP                     77.97 ± 1.41   82.97 ± 0.50
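The significance statements in this section rest on a paired t-test over the 10 seeds. As a rough sketch of what such a test computes, the code below derives the t statistic from per-seed accuracy differences and compares it with the two-sided critical value of Student's t distribution for 9 degrees of freedom at α = 0.05 (about 2.262). The function names are hypothetical, and the per-seed numbers in the usage example are made up for illustration; they are not results from the paper.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the paired differences, with n - 1 degrees of freedom."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if variance == 0.0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(variance / n)


# Two-sided critical value for alpha = 0.05 and 9 degrees of freedom (10 seeds).
T_CRITICAL_DF9 = 2.262


def differs_significantly(scores_a, scores_b):
    """Paired t-test decision; valid only for exactly 10 paired runs."""
    if len(scores_a) != 10:
        raise ValueError("the hard-coded critical value assumes 10 pairs")
    return abs(paired_t_statistic(scores_a, scores_b)) > T_CRITICAL_DF9


# Made-up per-seed accuracies for two hypothetical methods.
method_a = [78.1, 77.5, 79.0, 78.4, 77.9, 78.8, 78.2, 77.6, 78.9, 78.3]
method_b = [77.9, 77.8, 78.6, 78.5, 77.4, 78.9, 78.0, 77.9, 78.5, 78.4]
print(differs_significantly(method_a, method_b))
```

Pairing by seed matters here: the test looks at the per-seed differences, so variation shared by both methods across seeds does not inflate the variance.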
The results for the Cora dataset are shown in Table 4 and Fig. 5. The Cora dataset has a denser citation graph, so this should benefit the systems that are able to use this information. The paired t-tests indicate that the performance of DeepProbLog and DeepStochLog is similar (p-value = 0.44). The observation for this task is similar to the one for distantly supervised classification: adding reasoning during the inference process can result in improved performance. Moreover, the performance of NeurASP does not differ significantly from that of Semantic Loss
Fig. 5. A boxplot of the accuracy on the test set for the Cora dataset for 10 different seeds. The average accuracy and standard deviation are reported in Table 4
(p-value = 0.21). This is expected, since NeurASP is unable to ground the citation network fully, given that it is not query-directed. For NeurASP, we thus use a regularization-based approach instead for this task. Somewhat surprisingly, the difference between Logic Tensor Networks and the neural network baseline is not statistically significant (p-value = 0.29).
Methods that are able to make optimal use of the underlying citation network should be able to work well with a more limited amount of labeled training data. Because the citation network links publications to each other, the label of an unlabeled publication can be inferred if it is cited by labeled examples. To test this assumption, we test different ratios of label missingness. The results are shown in Fig. 6. The observation is that methods that use logic as a regularization technique for the training of a neural network perform sub-optimally on this task. Further, methods that perform reasoning during inference, i.e. DeepProbLog and DeepStochLog, are very robust when examples are removed from the training set: their performance barely changes.
5.4
Discussion
Fig. 6. Average accuracy on the test set for the CiteSeer dataset for neural-symbolic methods that are trained with a reduced training set (10 different seeds). 50%, 75%, 90% and 100% of the original dataset were tested.
From the results, we can conclude that the methods that are built on a logic programming framework, i.e. DeepProbLog, DeepStochLog and NeurASP, consistently have the best performance. For simpler tasks, like the MNIST addition, the difference with other neural-symbolic methods is not that significant. For more complex tasks, the difference becomes more significant. The power of such methods does come at a price, however. They are generally slower than regularization-based methods, and can consume a large amount of memory. In fact, for some methods, the task can become prohibitively expensive. This leads to DeepProbLog running out of memory for some of the documents in the structured prediction task, and NeurASP having to rely on a regularization-based approach. As discussed in other papers [17,20], it is an ongoing goal of the NeSy community to make these methods more scalable. Finally, it is also worth mentioning that the expressivity of the systems is an important factor for their usability. Take for example the difference between Semantic Loss and DeepProbLog. The former is propositional, while the latter is first order (i.e. allows variables in rules). This allows for the MNIST addition to be encoded in a small program of 2 lines in DeepProbLog, while Semantic Loss needs a larger program where the constraint is written out for each possible sum separately.
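To give a feeling for the size of the propositional encoding mentioned in the discussion, the sketch below enumerates, for every possible sum, the digit pairs that a constraint written out per sum has to list. This is only a count of the cases involved, not the encoding actually used with Semantic Loss, and `addition_constraints` is a hypothetical helper name.

```python
def addition_constraints(num_digits=10):
    """Map each possible sum to the list of digit pairs (a, b) with a + b = sum."""
    constraints = {}
    for a in range(num_digits):
        for b in range(num_digits):
            constraints.setdefault(a + b, []).append((a, b))
    return constraints


constraints = addition_constraints()
# One constraint per sum, each listing its admissible digit pairs.
for total in sorted(constraints):
    print(total, constraints[total])
```

A first-order rule with variables covers all of these cases at once, which is the expressivity gap the discussion points at.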
6
Conclusion
In this paper, we gave an overview and classification of the tasks used in state-of-the-art neural-symbolic systems. We showed that most tasks fall in one of five categories: 1) distantly supervised classification, 2) structured prediction, 3) knowledge base completion, 4) learning to optimize, and 5) structure learning. We also looked into which benchmarks are used by state-of-the-art NeSy systems, and investigated to what extent these benchmarks are used as comparisons across systems. From this, we concluded that very few systems are compared on the same benchmarks, making it difficult to get a good overview of the field of neural-symbolic AI. To help remedy this situation, we have provided a methodological experimental comparison of a variety of systems on two popular tasks: learning with distant supervision and structured prediction. Our results show that systems based on (probabilistic) logic programming achieve superior performance on these tasks, and that the performance amongst these methods does not differ significantly. However, although the logic programming-based methods are powerful and expressive, they do not always scale well, and for some tasks, the reasoning is prohibitively expensive.
Acknowledgements. This work was supported by the EU H2020 ICT48 project “TAILOR” under contract #952215 and the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.
References
1. Badreddine, S., d’Avila Garcez, A., Serafini, L., Spranger, M.: Logic tensor networks. Artif. Intell. 303, 103649 (2022)
2. Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T.H., de Vries, H., Courville, A.C.: Systematic generalization: what is required and can it be learned? In: Proceedings of the 7th International Conference on Learning Representations (2019)
3. van Bekkum, M., de Boer, M., van Harmelen, F., Meyer-Vitali, A., Teije, A.T.: Modular design patterns for hybrid learning and reasoning systems: a taxonomy, patterns and use cases. Appl. Intell. 51(9), 6528–6546 (2021)
4. Cohen, W., Yang, F., Mazaitis, K.R.: TensorLog: a probabilistic database implemented using deep-learning infrastructure. J. Artif. Intell. Res. 67, 285–325 (2020)
5. Dash, T., Chitlangia, S., Ahuja, A., Srinivasan, A.: How to tell deep neural networks what we know (2021)
6. Diligenti, M., Gori, M., Saccà, C.: Semantic-based regularization for learning and inference. Artif. Intell. 244, 143–165 (2017)
7. Donadello, I., Serafini, L., d’Avila Garcez, A.S.: Logic tensor networks for semantic image interpretation. In: Sierra, C. (ed.) Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence. IJCAI 2017, Melbourne, Australia, 19–25 August 2017, pp. 1596–1602. ijcai.org (2017)
8. Dong, H., Mao, J., Lin, T., Wang, C., Li, L., Zhou, D.: Neural logic machines. In: 7th International Conference on Learning Representations. ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. OpenReview.net (2019)
9. Evans, R., Grefenstette, E.: Learning explanatory rules from noisy data. J. Artif. Intell. Res. 61, 1–64 (2018)
10. Giles, C.L., Bollacker, K.D., Lawrence, S.: CiteSeer: Proceedings of the 1998 3rd ACM Conference on Digital Libraries. Proceedings of the ACM International Conference on Digital Libraries, pp. 89–98 (1998)
11. Huang, J., Chang, K.C.C.: Towards reasoning in large language models: a survey. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 1049–1065. Association for Computational Linguistics, Toronto, Canada, July 2023
12. van Krieken, E., Thanapalasingam, T., Tomczak, J.M., van Harmelen, F., Teije, A.T.: A-NeSI: a scalable approximate method for probabilistic neurosymbolic inference, December 2022
13. Lamb, L.C., d’Avila Garcez, A.S., Gori, M., Prates, M.O.R., Avelar, P.H.C., Vardi, M.Y.: Graph neural networks meet neural-symbolic computing: a survey and perspective. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. IJCAI 2020, pp. 4877–4884. ijcai.org (2020)
14. Li, Q., Huang, S., Hong, Y., Chen, Y., Wu, Y.N., Zhu, S.C.: Closed loop neural-symbolic learning via integrating neural perception, grammar parsing, and symbolic reasoning. In: Proceedings of the 37th International Conference on Machine Learning, pp. 5884–5894. PMLR, November 2020
15. Manhaeve, R., Dumancic, S., Kimmig, A., Demeester, T., De Raedt, L.: DeepProbLog: neural probabilistic logic programming. In: Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018)
16. Manhaeve, R., Dumančić, S., Kimmig, A., Demeester, T., De Raedt, L.: Neural probabilistic logic programming in DeepProbLog. Artif. Intell. 298, 103504 (2021)
17. Manhaeve, R., Marra, G., Raedt, L.D.: Approximate inference for neural probabilistic logic programming. In: Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning, vol. 18, no. 1, pp. 475–486 (2021)
18. Marra, G., Diligenti, M., Giannini, F., Gori, M., Maggini, M.: Relational neural machines. In: Proceedings of the 24th European Conference on Artificial Intelligence. Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 1340–1347. IOS Press (2020)
19. Marra, G., Kuzelka, O.: Neural Markov logic networks. In: de Campos, C.P., Maathuis, M.H., Quaeghebeur, E. (eds.) Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI 2021, Virtual Event, 27–30 July 2021. Proceedings of Machine Learning Research, vol. 161, pp. 908–917. AUAI Press (2021)
20. Raedt, L.D., Dumancic, S., Manhaeve, R., Marra, G.: From statistical relational to neuro-symbolic artificial intelligence. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. IJCAI 2020, pp. 4943–4950. ijcai.org (2020)
21. Rocktäschel, T., Riedel, S.: End-to-end differentiable proving. In: Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc. (2017)
22. Rolnick, D., Veit, A., Belongie, S., Shavit, N.: Deep learning is robust to massive label noise, February 2018. arXiv:1705.10694 [cs]
23. Si, X., Raghothaman, M., Heo, K., Naik, M.: Synthesizing datalog programs using numerical relaxation. In: Kraus, S. (ed.) Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. IJCAI 2019, Macao, China, 10–16 August 2019, pp. 6117–6124. ijcai.org (2019)
24. Sourek, G., Aschenbrenner, V., Zelezny, F., Schockaert, S., Kuzelka, O.: Lifted relational neural networks: efficient learning of latent relational structures. J. Artif. Intell. Res. 62, 69–100 (2018)
25. Tsamoura, E., Hospedales, T., Michael, L.: Neural-symbolic integration: a compositional perspective. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 6, pp. 5051–5060 (2021)
26. Weber, L., Minervini, P., Münchmeyer, J., Leser, U., Rocktäschel, T.: NLprolog: reasoning with weak unification for question answering in natural language. In: Proceedings of the 57th Conference of the Association for Computational Linguistics. ACL, pp. 6151–6161. Association for Computational Linguistics (2019)
27. Winters, T., Marra, G., Manhaeve, R., Raedt, L.D.: DeepStochLog: neural stochastic logic programming. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 9, pp. 10090–10100 (2022)
28. Xu, J., Zhang, Z., Friedman, T., Liang, Y., Broeck, G.: A semantic loss function for deep learning with symbolic knowledge. In: Proceedings of the 35th International Conference on Machine Learning. PMLR, July 2018
29. Yang, Z., Ishay, A., Lee, J.: NeurASP: embracing neural networks into answer set programming. In: 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), July 2020
Statistical Relational Structure Learning with Scaled Weight Parameters
Felix Weitkämper, Dmitriy Ravdin, and Ramona Fabry
Institut für Informatik, Ludwig-Maximilians-Universität München, Oettingenstr. 67, 80538 München, Germany
[email protected]
Abstract. Markov Logic Networks (MLNs) combine relational specifications with probabilistic learning and reasoning. Although they can be specified without reference to a particular domain, their performance across domains of different sizes is generally unfavorable. Domain-size Aware Markov Logic Networks (DA-MLNs) address this issue by scaling down weight parameters based on the domain size. This allows for faster learning by training models on a random sample of the original domain. DA-MLNs also enable transferring models between naturally occurring domains of different sizes. This study proposes a combination of functional gradient boosting and weight scaling for single-target structure learning on large domains. It also evaluates performance and runtime on two benchmark domains of contrasting sizes. The results demonstrate that training a DA-MLN from a sample can significantly reduce learning time for large domains with minor performance trade-offs, which decrease with the size of the original domain. Additionally, the study explores how scaling reacts to varying domain sizes in a synthetic social network domain. It is observed that DA-MLNs outperform unscaled MLNs when the number of connections between individuals grows with domain size, but perform worse when the number of connections remains constant. This justifies the use of unscaled MLNs when sampling isolated subcommunities in areas such as social sciences research.
Keywords: Single-target learning · Transfer learning · Markov logic networks · Relational logistic regression · Statistical relational learning
1
Introduction
Statistical relational artificial intelligence emerged as a way to specify probabilistic models independently of a given domain. By using first-order languages and variables, symmetries are encoded between elements of a domain and very succinct models can be given even on domains with many individuals. The abstract specification also implies that the same statistical relational models can be instantiated on different domains. However, it was realised that as the domain size varies, the marginal probabilities predicted by statistical relational models vary widely, and in particular, as the domain size increases they often tend to extremes (probability 0 or 1) [8].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 139–153, 2023. https://doi.org/10.1007/978-3-031-49299-0_10
While transferring a model between domains of different sizes is important in its own right, for instance when the sizes of sample domains vary naturally or when data is only available on certain parts of a domain of interest, a specific instance arises from random sampling on a large domain. Learning statistical relational models on large domains can be computationally expensive. This prompts the idea of sampling a subset of the original domain, training the model on that subset and then applying it on the original domain of interest. Unfortunately, the marginal probabilities of queries of interest are distorted by the change in domain size, potentially leading to poor performance on the original dataset.
In the context of Markov Logic Networks (MLN), a statistical relational formalism in which weighted first-order formulas are grounded to Markov networks, Mittal et al. [6] proposed a remedy to this problem: By scaling the weight parameters by the possible number of connections in the ground Markov network, the effect of increasing domain size can be counteracted. They call this Domain-size Aware Markov Logic Networks (DA-MLN). Mittal et al. [6] demonstrated in some examples that the marginal probabilities converge to a limit that depends adequately on the supplied parameters. Furthermore, they showed that in the context of parameter learning from a small training domain and application on a larger test domain, DA-MLNs outperform MLNs on a selection of manually crafted program structures. They also argued that for a limited class of programs and queries, the asymptotic limit of marginal queries varies meaningfully with domain size. Weitkämper [11] showed that in general, query probabilities in Markov logic networks do not always scale meaningfully with the weight parameters.
However, that paper goes on to demonstrate that for domain-size aware relational logistic regression networks (DA-RLRs), asymptotic limits of marginal probabilities do always depend meaningfully on the learned weights. Relational logistic regression networks [3] are directed analogues of MLNs that ground to Bayesian networks rather than to Markov networks. In the context of supervised single-target learning, MLNs and RLRs are equivalent [4], motivating specifically the use of parameter scaling in single-target learning of MLNs.
The most established approach for highly scalable structure learning of MLNs is functional gradient boosting [5], which dramatically reduces learning time from larger datasets by successively learning a family of a priori weaker models and combining them, rather than learning a single, computationally expensive strong model.
In this contribution, we propose a combination of functional gradient boosting, parameter scaling and sampling to provide statistical relational structure learning from large datasets. We also investigate the relationship between domain sampling, general transfer learning and parameter scaling in the context of single-target structure learning of MLNs more generally. We show that when the training domain is sufficiently large, learning from a small, uniformly selected random sample can lead to significantly improved learning times with only minor performance penalties. Furthermore, we see that when sampling isolated groups rather than uniformly selected individuals, unscaled MLNs perform well on the transfer task, and parameter scaling actually reduces the performance on the larger test set. This justifies learning ordinary unscaled MLNs on subcommunities obtained from snowball sampling, where an initial sample is propagated across its connected components.
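The snowball sampling mentioned at the end of the introduction can be pictured as a graph traversal. The sketch below is one plausible reading of "propagating an initial sample across its connected components": starting from a few seed individuals, it collects everyone reachable through the relation. The adjacency-dictionary representation and the name `snowball_sample` are assumptions made for illustration, not the authors' implementation.

```python
from collections import deque


def snowball_sample(adjacency, seeds):
    """Return all individuals reachable from the seeds via breadth-first search.

    `adjacency` maps each individual to an iterable of its neighbours and is
    assumed to list every undirected link in both directions.
    """
    sampled = set(seeds)
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in sampled:
                sampled.add(neighbour)
                queue.append(neighbour)
    return sampled


# Toy network with two isolated groups: {1, 2, 3} and {4, 5}.
friends = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
print(sorted(snowball_sample(friends, [1])))
```

Because whole components are kept, the number of connections per sampled individual is the same as in the full domain, unlike under uniform sampling of individuals.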
2
Preliminaries
2.1
Markov Logic Networks
As MLNs are specified using first-order logic, we first introduce some vocabulary. A (relational) signature is given by a set of relation symbols, each of a given natural number arity, and a set of constants. In a multi-sorted signature, each constant is annotated with a sort label, and each relation symbol is annotated with a list of sort labels of length equal to its arity. The language corresponding to a given signature additionally has infinitely many variables for each sort. A term is either a constant or a variable. An atom is either ⊤, ⊥ or a relation symbol together with a list of terms (whose sorts correspond to the list of sorts associated with the relation symbol). A literal is then given by either an atom (a positive literal) or a negated atom (a negative literal), where negation is indicated by ¬. A (quantifier-free) formula is defined recursively from literals using the binary connectives ∧, ∨ and →. A definite clause is a disjunction of literals, exactly one of which is positive.
A structure over a signature is given by a domain (a set of individual elements for each of the sorts of the signature) and an interpretation of each of its symbols, that is, an element of the correct sort for every constant and a set of lists of elements of the correct sorts for each relation symbol. A formula is grounded by substituting elements of the appropriate domains for its variables, and it is ground if it does not (any longer) contain variables. Therefore, any choice of elements from the domains matching the sorts of the variables in a formula is a possible grounding of that formula. A ground atom holds or is true if the substituted list of elements lies in the interpretation of the relation symbol. ⊤ holds in every interpretation and ⊥ holds in no interpretation. Whether a formula holds is then determined by giving ¬, ∧, ∨ and → their usual meanings as negation, conjunction, disjunction and implication respectively.
Definition 1. Let $L$ be a (potentially multi-sorted) relational signature. A Markov Logic Network $T$ over $L$ is given by a collection of pairs $\varphi_i : w_i$ (called weighted formulas), where each $\varphi_i$ is a quantifier-free $L$-formula and $w_i \in \mathbb{R}$. We call $w_i$ the weight of $\varphi_i$ in $T$.
Example 1. As a running example for the next subsections, we will consider a single-sorted signature with two unary relation symbols Q and R and the MLN
consisting of just one formula, $R(x) \wedge Q(y) : w$. Note that this is different to the MLN $\{R(x) \wedge Q(x) : w\}$, where the variables are the same.
MLNs were introduced by [9] and are widely studied in the field of statistical relational artificial intelligence. Their semantics defines a probability distribution over all $L$-structures on a given choice of domains for every sort of $L$.
Definition 2. Given a choice $D$ of domains for the sorts of $L$, an MLN $T$ over $L$ defines a probability distribution on the possible $L$-structures on the chosen domains as follows: let $X$ be an $L$-structure on the given domains. Then
$$P_{T,D}(\mathfrak{X} = X) = \frac{1}{Z} \exp\left( \sum_i w_i \, n_i(X) \right)$$
where $i$ varies over all the weighted formulas in $T$, $n_i(X)$ is the number of true groundings of $\varphi_i$ in $X$, $w_i$ is the weight of $\varphi_i$ and $Z$ is a normalisation constant to ensure that all probabilities sum to 1.
As the probabilities only depend on the sizes of the domains, we can also write $P_{T,n}$ for domains of size $n$ when the signature is single-sorted. We refer to $\sum_i w_i \, n_i(X)$ as the weight of $X$ and write it as $w_T(X)$. We will also use $|\varphi(x_1, \dots, x_n)|$ for the number of tuples $(a_1, \dots, a_n)$ of $D$ for which $\varphi(a_1, \dots, a_n)$ holds in the interpretation from $X$.
Example 2. In the MLN $T_1 := \{R(x) \wedge Q(x) : w\}$, the probability of any possible structure $X$ with domain $D$ is proportional to $\exp(w \cdot n(X))$, where $n(X)$ is given by $|R(x) \wedge Q(x)|$. In the MLN $T_2 := \{R(x) \wedge Q(y) : w\}$, however, this probability is proportional to $\exp(w \cdot n'(X))$, where $n'(X)$ is the number of pairs $(a, b)$ from $D \times D$ for which $R(a)$ and $Q(b)$ hold in the interpretation from $X$. In other words, $n'(X)$ is the product $|R(x)| \cdot |Q(y)|$. Note that in $T_2$, setting $R(a)$ to true for a single domain individual increases $n'$ by $|Q(y)|$ and thus multiplies the unnormalised probability by $\exp(w \cdot |Q(y)|)$. For a fixed interpretation of $Q$, the marginal probability of $R(a)$ for any domain individual $a$ can therefore be derived as
$$\frac{\exp(w \cdot |Q(y)|)}{\exp(w \cdot |Q(y)|) + 1}.$$
As the domain size increases, $|Q(y)|$ will typically increase with it. Therefore, the marginal probability of $R(a)$ will tend to 1 for any positive weight $w$, however small (and to 0 for any negative weight). This behavior is typical of MLNs whose clauses have literals containing different variables [8,11].
2.2
Domain-Size Aware Markov Logic Networks
To compensate for the effects of variable domain sizes, [6] introduce Domain-size Aware Markov Logic Networks (DA-MLN), in which weights are scaled by the number of connections they could have in a domain of given size. The original concept relies on a connection vector which stores the possible numbers of connections of every literal, and then considers the maximum of the entries for scaling. This is unsuitable for single-target learning, however, since we always
want to scale with respect to the target literal. This leads us to define a single-target version of DA-MLNs, which will turn out to be equivalent to the formalism of domain-size aware relational logistic regression for single-target learning. A Domain-size Aware Markov Logic Network (DA-MLN) has the same syntax as a regular MLN (see Definition 1), but is given an altered semantics to account for differing domain sizes.

Definition 3. Let P be a predicate symbol. Let ϕ be a formula in which P occurs exactly once, and for every variable x, let $D_x$ signify its domain. Let $V_{\varphi,P}$ be the set of free variables in ϕ not occurring in the scope of P. Then the connection number $C_{P,\varphi,D}$ is defined as

$$C_{P,\varphi,D} := \prod_{x \in V_{\varphi,P}} |D_x|.$$

The connection number is now used to scale down the weights by the sizes of the domains.

Definition 4. Given a choice of domains D for the sorts of L and a target predicate P, a DA-MLN T over L defines a probability distribution on the possible L-structures on the chosen domains as follows: let X be an L-structure on the given domains. Then

$$P_{T,D}(\mathfrak{X} = X) = \frac{1}{Z} \exp\Big(\sum_i \frac{w_i}{\max(1, C_{P,\varphi_i,D})}\, n_i(X)\Big)$$

where $n_i(X)$ is the number of true groundings of $\varphi_i$ in X, $w_i$ is the weight of $\varphi_i$, $C_{P,\varphi_i,D}$ is the connection number of $\varphi_i$ with respect to P, and Z is a normalisation constant to ensure that all probabilities sum to 1.

Example 3. For ϕ = P(x) ∧ Q(x), $C_{P,\varphi,D} = 1$, while for ϕ = P(x) ∧ Q(y), $C_{P,\varphi,D} = |D_y|$.

2.3 Domain-Size Aware Relational Logistic Regression
In the context of single-target learning, relational logistic regression [3] adapts the traditional idea of logistic regression to features defined by logical formulas.

Definition 5. Let L be a signature and P ∈ L be a target predicate. Let L' := L \ {P}. Then a relational logistic regression T for $P(\vec{x})$ is a collection of pairs $\varphi_i : w_i$ of L'-formulas $\varphi_i$ and real-valued weights $w_i$. Here $\vec{x}$ is a tuple of variables that matches the arity of the target predicate P. Given a tuple $\vec{a}$ of domain elements that matches the arity of P and an interpretation X of all predicates in L', the probability of $P(\vec{a})$ predicted by T is given as

$$\mathrm{sigmoid}\Big(\sum_i w_i\, n_i(X)\Big)$$
where i varies over all the weighted formulas in T, $n_i(X)$ is the number of true groundings of $\varphi_i$ in X, $w_i$ is the weight of $\varphi_i$ and

$$\mathrm{sigmoid}(x) = \frac{1}{1 + \exp(-x)}.$$
The equivalence between MLNs and relational logistic regression in the case of supervised single-target learning is well known [4]. In fact, an MLN T of formulas in which P occurs exactly once can be converted into a relational logistic regression that is correct with respect to predictions on the target P by replacing any L-formula ϕ with the formula $\varphi_P \wedge \neg\varphi_{\neg P}$, where $\varphi_P$ substitutes $P(\vec{x})$ with ⊤ and $\varphi_{\neg P}$ substitutes $P(\vec{x})$ with ⊥ [11]. Under this transformation, the single-target version of DA-MLN introduced above corresponds exactly to domain-size aware relational logistic regression (DA-RLR), which has been shown to give marginal probabilities which depend asymptotically on the weight parameters [11].

Definition 6. Let T be a relational logistic regression for $P(\vec{x})$. Let ϕ be an L'-formula, and for every variable x, let $D_x$ signify its domain. Let $V_P$ be the set of free variables in ϕ not among $\vec{x}$. Then the connection number $C_{\varphi,D}$ is defined as

$$C_{\varphi,D} := \prod_{x \in V_P} |D_x|.$$
Again, this connection number is used to scale down the weights by the sizes of the domains.

Definition 7. Given a tuple $\vec{a}$ of domain elements that matches the arity of P and an interpretation X of all predicates in L', the probability of $P(\vec{a})$ predicted by T is given as

$$\mathrm{sigmoid}\Big(\sum_i \frac{w_i}{C_{\varphi_i,D}}\, n_i(X)\Big)$$

where i varies over all the weighted formulas in T, $n_i(X)$ is the number of true groundings of $\varphi_i$ in X and $w_i$ is the weight of $\varphi_i$.

We illustrate this with the MLN of the running example.

Example 4. The MLN {R(x) ∧ Q(y) : w} is equivalent when predicting R from known Q to the relational logistic regression Q(y) : w, and its connection number is the same whether considered as a DA-MLN or as a DA-RLR.

This justifies the use of DA-MLNs for learning from random samples theoretically. Furthermore, the equivalence with DA-RLR provides a natural interpretation for the scaled weights: while the weights of an MLN induce a dependence on the number of tuples that satisfy a formula, the weights of a DA-MLN induce a dependence on the relative frequency of tuples satisfying a formula among all possible tuples.
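The contrast between counts and relative frequencies can be made concrete with a small numerical sketch of Example 4. The function names, the weight 0.5 and the 30% share of Q-individuals are illustrative assumptions, not values from the experiments; the `max(1, ...)` guard is borrowed from Definition 4 to avoid division by zero.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rlr_probability(w: float, n_true: int) -> float:
    """Unscaled RLR with the single formula Q(y) : w (Definition 5)."""
    return sigmoid(w * n_true)


def da_rlr_probability(w: float, n_true: int, connection_number: int) -> float:
    """DA-RLR: the weight is divided by the connection number (Definition 7).

    The max(1, ...) guard is an assumption taken over from Definition 4.
    """
    return sigmoid(w * n_true / max(1, connection_number))


if __name__ == "__main__":
    w = 0.5  # hypothetical weight
    for n in (10, 100, 1000):
        q_true = 3 * n // 10  # assume Q holds for 30% of the domain
        # For Q(y), the only free variable outside the target is y, so C = |D_y| = n.
        print(n, rlr_probability(w, q_true), da_rlr_probability(w, q_true, n))
```

Under these assumptions the unscaled prediction drifts towards 1 as the domain grows, while the scaled prediction depends only on the proportion of individuals satisfying Q.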
2.4 Functional Gradient Boosting
Functional gradient boosting is a state-of-the-art algorithm that simultaneously learns the structure and the parameters of an MLN [5]. Rather than learning a single highly accurate classifier, functional gradient boosting proceeds to learn a family of individually weaker classifiers which are successively trained on the prediction gap between the classifiers already learned and the given data. Our work is based on the BoostSRL [5] implementation of functional gradient boosting for MLNs, which learns definite clauses whose positive literal is the target literal.
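The boosting idea can be sketched in a few lines. The following is a propositional stand-in, not the BoostSRL implementation: BoostSRL learns relational clauses or regression trees, whereas this sketch assumes boolean feature vectors and one-split regression stumps as weak learners. Each round fits a stump to the pointwise gradients y − P(y = 1), that is, to the gap between the data and the current prediction. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_stump(X, residuals):
    """Least-squares regression stump on a single boolean feature."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j]]
        off = [r for x, r in zip(X, residuals) if not x[j]]
        mean_on = sum(on) / len(on) if on else 0.0
        mean_off = sum(off) / len(off) if off else 0.0
        err = sum((r - mean_on) ** 2 for r in on) + sum((r - mean_off) ** 2 for r in off)
        if best is None or err < best[0]:
            best = (err, j, mean_on, mean_off)
    _, j, mean_on, mean_off = best
    return j, mean_on, mean_off


def boost(X, y, n_rounds=10):
    """Learn a sum of stumps; psi is the current additive potential per example."""
    stumps = []
    psi = [0.0] * len(X)
    for _ in range(n_rounds):
        # Functional gradient of the log-likelihood: observed label minus prediction.
        grads = [yi - sigmoid(p) for yi, p in zip(y, psi)]
        j, a, b = fit_stump(X, grads)
        stumps.append((j, a, b))
        psi = [p + (a if x[j] else b) for p, x in zip(psi, X)]
    return stumps


def predict(stumps, x):
    return sigmoid(sum(a if x[j] else b for j, a, b in stumps))
```

The default of ten rounds mirrors the ten trees used in the experiments below; any other value works equally well for illustration.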
3 Learning from Random Samples
Although functional gradient boosting has been chosen for its favourable run time compared to other approaches to statistical relational learning, learning time still increases sharply with domain size, rendering learning infeasible on large networks. Therefore, we propose taking a random sample of the target domain as training data and then applying the representation learned there to the original domain. While the parameters of ordinary MLNs can be distorted as the domain size and expected connections of the target literal increase, we expect that DA-MLNs compensate for the increase in target literals and thereby significantly reduce the accuracy loss due to learning from a random sample without significantly increasing run time.

3.1 Methods
To learn the structure and parameters of an MLN with a single target, we employ functional gradient boosting [5], which we have adapted to learn DA-MLNs. Our code base is available from our GitHub repository (https://github.com/laiprorus/DAMLN-BoostSRL). We use clause-based learning and employ the same configuration of hyperparameters and other settings as Khot et al. [5].

Cross-Validation and Sampling. For cross-validation and sampling, we randomly split each dataset into two equal and disjoint parts. We use one part for training and the other part for testing. This experiment is then repeated with the roles of training and testing set reversed. We run 100 repeats of every experiment. For each such repeat a new cross-validation split is generated. In each repeat we perform multiple iterations in which different proportions of the training set are sampled randomly, for example from 100% down to 5% in steps of 5%. For sampling, we choose a sort and uniformly select the required proportion of its domain, discarding the remaining constants of that domain. We then discard those atoms that contain any discarded constants. If the learning process fails, which can happen particularly on very small training sets, a new sample is taken. Time taken for resampling is included in the reported learning time.
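The sampling step described above can be sketched as follows. This is a minimal illustration rather than the code from the repository: the representation of atoms as `(predicate, arguments)` pairs, the function name `sample_domain` and the rule of keeping at least one constant are assumptions made for the example.

```python
import random


def sample_domain(atoms, constants, proportion, rng):
    """Uniformly keep a proportion of one sort's constants and filter the atoms.

    atoms      -- list of (predicate, args) pairs, args being a tuple of constants
    constants  -- the constants of the sort chosen for sampling
    proportion -- fraction of that sort's domain to keep, e.g. 0.05
    rng        -- a random.Random instance, so that runs are reproducible
    """
    k = max(1, round(proportion * len(constants)))  # assumption: never keep zero
    kept = set(rng.sample(sorted(constants), k))
    discarded = set(constants) - kept
    # Drop every atom that mentions a discarded constant; constants of other
    # sorts (e.g. genres) are never in `discarded` and so never block an atom.
    kept_atoms = [(p, args) for p, args in atoms
                  if not any(a in discarded for a in args)]
    return kept, kept_atoms
```

A call such as `sample_domain(atoms, persons, 0.05, random.Random(0))` would then produce one 5% training sample; repeating it with fresh seeds gives the resamples mentioned above.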
We apply this set-up to two well-known benchmarks, the WebKB dataset [1] and the IMDB dataset [7]. All experiments are performed on a computer with 32 GB RAM and an Intel i9-11980HK processor with 16 threads, 2.6 GHz base frequency and 5.0 GHz boost frequency.

The IMDB Datasets. The complete IMDB dataset, retrieved from the CTU Prague Relational Learning Repository [7], has almost 900,000 person constants. We sampled a given number of constants from the person domain and then included all persons that were connected with those originally sampled. This allowed us to create more manageable datasets which still contain enough connections between persons to enable meaningful learning.

The first dataset contains 689 persons, 15,093 movies, 20 genre constants and 142 female(person), 433 male(person), 575 actor(person), 290 director(person), 1,571 genre(person, genre), 19,071 movie(movie, person) and 2,161 workedunder(person, person) atoms. BoostSRL is configured to learn 10 trees (iterations). For cross-validation and sampling we use person constants. In this dataset, only actors are described as male or female. The target predicate is female(person) with a relative frequency of 142/575 = 0.2470.

The second dataset was created with the same settings as the first one but with more persons sampled. The resulting dataset contains 1,850 persons, 22,041 movies, 20 genre constants and 506 female(person), 946 male(person), 1,452 actor(person), 747 director(person), 3,384 genre(person, genre), 28,522 movie(movie, person) and 15,668 workedunder(person, person) atoms. The relative frequency of female persons in the second set is 506/1452 = 0.3485. Unsampled learning on this dataset is impractical due to high learning time; to demonstrate the feasibility of using sampling in such situations we learned on samples from 1% to 20% with a step size of 1% while performing inference on the full dataset. These datasets will be referred to as IMDB-689 and IMDB-1850 respectively.
The WebKB Dataset. The WebKB dataset contains 3,793 page, 774 word, 7 category constants and 1,881 category(category, page), 253,406 has(page, word) and 9,316 linkTo(page, page) atoms. Since there are only 7 categories, we replace them with predicates for each category: course, department, faculty, person, research project, staff and student. BoostSRL is configured to learn 10 trees (iterations). For cross-validation and sampling we use page constants. The target predicate is faculty(page). There are 46 faculty atoms with a relative frequency of 46/3793 = 0.0121.

We report the areas under the Receiver Operating Characteristic (AUC ROC) and the Precision-Recall (AUC PR) curves as well as the Conditional Log-Likelihood (CLL) score.
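For readers unfamiliar with these scores, two of them can be sketched directly from predicted probabilities. This is an illustration under stated assumptions and not the evaluation code used in the experiments: CLL is taken here as the mean per-example log-probability of the true label (one common convention), AUC ROC is computed through its rank interpretation, and the clipping constant `eps` is a hypothetical safeguard against log(0).

```python
import math


def conditional_log_likelihood(probs, labels, eps=1e-12):
    """Mean log-probability assigned to the true label of each example."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += math.log(p if y else 1.0 - p)
    return total / len(labels)


def auc_roc(probs, labels):
    """Probability that a random positive is scored above a random negative.

    Ties count one half; requires at least one positive and one negative.
    """
    pos = [p for p, y in zip(probs, labels) if y]
    neg = [p for p, y in zip(probs, labels) if not y]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

The sketch also shows why the two kinds of score can diverge: AUC ROC depends only on the ordering of the predicted probabilities, whereas CLL is sensitive to their actual values.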
Fig. 1. Results of learning MLNs and DA-MLNs when sampling uniformly from the IMDB-689 dataset
3.2 Results
The IMDB Datasets. On the IMDB-689 dataset (Fig. 1), neither MLNs nor DA-MLNs experience a significant drop in median AUC ROC and AUC PR when learning from randomly sampled subsets of the domain, although DA-MLNs outperform MLNs when sampling less than 30% of the domain. However, the lower decile of the AUC ROC and AUC PR is considerably worse for MLNs than for DA-MLNs on subsets of 45% of elements or less. The median CLL values of DA-MLNs are almost constant regardless of proportion sampled. The upper and lower decile of the CLL values of the DA-MLNs are almost constant too, while the lower decile of the MLNs drops sharply at sample sizes of less than 40%. Similar results can be seen on IMDB-1850, with median AUC ROC, AUC PR and CLL remaining constant and with MLNs having considerably worse lower deciles on all metrics. The lower deciles of the CLL values for DA-MLNs are also essentially constant for all sample sizes, while the lower decile for MLNs drops sharply when sampling 15% or less of the population. The run time of the learning algorithm for IMDB-1850 increases with sample size between 2% and 20%, with higher run time at 1% due to failed learning attempts on a sample size that is too small (Fig. 2).

Fig. 2. Results of learning MLNs and DA-MLNs when sampling uniformly from the IMDB-1850 dataset

The WebKB Dataset. On the WebKB dataset (Fig. 3), DA-MLNs experience a drop in median AUC ROC values at sample sizes below 30%. For MLNs, this drop occurs already at sample sizes below 40%, and median and lower decile AUC ROC values are noticeably lower for MLNs than for DA-MLNs for sample sizes below this threshold. Median AUC PR values are reduced for sample sizes below 60% for both DA-MLNs and MLNs compared to training on the full dataset. The median AUC PR values are similar at sample sizes above 25% and are reduced sharply when sampling 25% or less. Median AUC PR values of MLNs are noticeably lower than those of DA-MLNs at samples below 80% and drop sharply at sample sizes less than 40%. Upper and lower deciles are similar for both MLNs and DA-MLNs. Median CLL values are almost constant for MLN and DA-MLN classifiers regardless of sample size. The lower decile of the CLL values of DA-MLNs is very close to the median at all sizes, while the lower decile of MLNs drops very sharply for sample sizes below 50%.

Fig. 3. Results of learning MLNs and DA-MLNs when sampling uniformly from the WebKB dataset
3.3 Discussion
The experiments have shown that a significant speed-up can be achieved by sampling from a domain before learning the structure and the parameters of a DA-MLN. For larger datasets such as IMDB-689 or IMDB-1850, there is no notable degradation in median prediction performance even when sampling as little as 2% of domain elements. On the other hand, performance does deteriorate when using smaller samples from a dataset such as WebKB which only has 46 positive examples in the full dataset. For very small samples, however, repeated resampling is necessary to learn a model. This increases the actual run time of the learning algorithm considerably for very small samples and suggests that very small sample sizes should be avoided for reasons other than prediction accuracy.

On the other hand, ordinary unscaled MLNs suffer accuracy losses already at fairly large sample sizes, and their performance varies more widely than that of DA-MLNs. Since we are reporting median rather than mean performance, the decile ranges are a key part of the performance profile. The difference in performance range is particularly striking for the CLL value, where the lower decile is considerably worse for sampled MLNs than for sampled DA-MLNs, which have almost no discernible variation in CLL value across sample sizes and trials. We believe that this is because an increase in domain size induces a continuous shift in the contributions of the individual formulas. When using those scores to induce a binary classifier, as is the case when assessing PR and ROC values, this can be partly compensated by shifting the classification threshold, while the CLL value is directly impacted by the distorted probability scores. However, the amount of distortion experienced by the different formulas varies sufficiently to change the relative importance of the formulas, and this explains the drop experienced in AUC values.
4 Domain-Size Aware Markov Logic Networks with Respect to Changing Domain Structures
In the experiments above, the domains of varying size were induced by randomly sampling elements from the larger domain. There are other natural instances of differently sized domains, however. Take a social network, for instance, that is split into different communities with little or no interaction across different communities. A random sample of elements would now be expected to contain roughly the same proportion of elements from each community. Alternatively, consider a subdomain composed of a subset of communities rather than individual elements: Some communities will be included completely, others not at all. This may occur when the sample is not taken randomly but forced by the lack of available data on some communities, for example, or when a subdomain has been obtained by snowball sampling [2]. In this case, the number of connections of an element in the small domain is the same as in the large domain; we therefore expect ordinary MLNs to outperform DA-MLNs on the transfer learning task, since the latter will overcompensate for the increase in domain size and scale down learned weights unnecessarily, distorting predictions.
Hypotheses

(1) For families of domains where the number of connections of any element grows with increasing domain size, DA-MLNs outperform MLNs on the transfer learning task.
(2) For families of domains where the number of connections of any element is constant regardless of domain size, MLNs outperform DA-MLNs on the transfer learning task.

4.1 Methods
To investigate this behavior, we used synthetic datasets of “Friends and Smokers” [10] to generate domains adapted to our needs. All groups are labelled randomly to be either a smoker group (with 40% chance) or a non-smoker group (with 60% chance). To define the connections between persons, we check whether two persons are in the same group. If so, they have an 80% chance to be friends; if they are not in the same group, they have a 0% or 10% chance to be friends, depending on the experiment. For labelling each person as smoker or non-smoker, the group label is checked. In a smoker group, any person has a 70% chance to be a smoker, while in a non-smoker group the chance is 10%.

We performed four experiments. In Experiments 4(1) and 4(2), there are no friendships between individuals of different groups. In Experiment 4(1), we fixed the number of groups to 5 and varied their size. In Experiment 4(2), we fixed the size of groups to 5 and varied their number. In Experiments 4(3) and 4(4), there is a probability of 10% that individuals of different groups are friends. In Experiment 4(3), we fixed the number of groups to 5 and varied their size. In Experiment 4(4), we fixed the size of groups to 5 and varied their number.

For our experiments, a new pair of domains is generated for each run. Due to high run times, we perform 10 repetitions for each domain size. We varied training domain sizes from 10 to 100 with a step size of 5 while testing on a domain of size 100. This is repeated for both probabilities of connections outside a group, 0% and 10%. For learning and testing we use single-target functional gradient boosting (target predicate: smokes) with the same configuration as in Sect. 3, learning 10 trees (iterations).

4.2 Results
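As a point of reference for the results, the generation procedure of Sect. 4.1 can be sketched as below. This is a minimal reconstruction from the description, not the generator used in the experiments: the function name, the person naming scheme and the choice to store friendship symmetrically are assumptions, while the probabilities are those stated in Sect. 4.1.

```python
import random


def generate_domain(n_groups, group_size, p_cross, rng):
    """Generate one synthetic 'Friends and Smokers' domain.

    p_cross is the friendship probability across groups (0.0 or 0.1 in Sect. 4.1).
    Returns (people, smokes, friends); friends holds ordered pairs.
    """
    group_of = {}
    smokes = set()
    for g in range(n_groups):
        smoker_group = rng.random() < 0.4           # 40% smoker groups
        for i in range(group_size):
            person = f"p{g}_{i}"                    # hypothetical naming scheme
            group_of[person] = g
            if rng.random() < (0.7 if smoker_group else 0.1):
                smokes.add(person)
    people = list(group_of)
    friends = set()
    for idx, a in enumerate(people):
        for b in people[idx + 1:]:
            p = 0.8 if group_of[a] == group_of[b] else p_cross
            if rng.random() < p:
                # Assumption: friendship is symmetric, so store both directions.
                friends.add((a, b))
                friends.add((b, a))
    return people, smokes, friends
```

Under this sketch, Experiment 4(1) corresponds to calls with `n_groups=5`, a varying `group_size` and `p_cross=0.0`, and Experiment 4(4) to `group_size=5`, a varying `n_groups` and `p_cross=0.1`.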
For a fixed number of groups varying in size, the median, upper and lower decile CLL score of DA-MLNs are almost constant regardless of the size of the training domain. The median, upper and lower decile CLL score of MLNs drop sharply at training domain sizes below 50 individuals, the drop in lower decile scores being the most pronounced (Subfigs. 4(1) and 4(3)). For a fixed group size and a varying number of groups, the situation is very different depending on whether or not connections between the groups are permitted. When they are not (Subfig. 4(2)), the median CLL values of MLNs are approximately constant as long as training domain sizes remain above 15. Median CLL values of DA-MLNs drop off at
domain sizes below 50 and are markedly below those of MLNs already at domain sizes 60 and below. The same picture emerges for upper and lower decile scores. When there is a 10% chance of friendship between individuals of different groups (Subfig. 4(4)), the same pattern as for fixed group number is observed. The median CLL score of DA-MLNs remains almost constant, while the median CLL value of MLNs drops at training domains of size 50 and below. Again, the sharpest drop is experienced by the lower decile values of the CLL scores of the MLNs, which decline sharply on training domains of less than 70 individuals.
Fig. 4. CLL curves of Experiments 4(1) to 4(4)
4.3 Discussion
The experiments confirm that when transferring a learnt model between domains of different sizes, it is important to consider the relationship between the domains when deciding whether to apply MLNs or DA-MLNs to the task. If the smaller domain consists of a smaller number of communities which are of roughly the same size as communities in the larger domain and there are no connections between the communities, then ordinary MLNs are more appropriate than DA-MLNs. On the other hand, even a small propensity for inter-community links suffices for DA-MLNs to outperform ordinary MLNs. A possible explanation can be seen in an order-of-magnitude analysis. As the domain size increases, the number of inter-community connections for any node increases with domain size, while the number of intra-community connections remains constant. Therefore,
asymptotically, the inter-community connections are the deciding factor, and are of the same order of magnitude as the connection number of the DA-MLNs. The AUC values derived from converting the output probability to a binary classifier (not shown) are very similar for MLNs and DA-MLNs. We refer to the discussion in Subsect. 3.3 for a possible explanation of this. Our study has implications for using MLNs in fields such as social science research, where results obtained from one or more subcommunities are generalised to the community-at-large. On the one hand, learning ordinary, unscaled MLNs on small domain sizes is vindicated in those cases where the sampled subcommunities are completely isolated from the remainder of the community. However, even small amounts of cross-subcommunity contact can suffice for a large degradation in performance when attempting to use the classifier on the full community, while using DA-MLNs avoids this degradation almost completely.
5 Conclusion
We empirically evaluate the use of DA-MLNs for transfer learning between domains of different sizes in the context of boosted single-target structure learning. In the case where the smaller domain is obtained by sampling uniformly from a large domain, our experiments on benchmark domains from real-life datasets suggest that on sufficiently large domains, performance remains nearly constant even for small sample sizes. This implies significant potential for speeding up learning on large domains, where learning time is otherwise prohibitive. The scaling factor incorporated in DA-MLNs assumes that the number of connections induced by a formula increases in proportion to the sizes of the domains involved in it. This assumption is justified for uniform random sampling of domain individuals. On domains of naturally varying size, we demonstrate using synthetically generated datasets that MLNs outperform DA-MLNs when this assumption is violated most strongly, that is, when the number of induced connections is constant regardless of domain sizes. We also show, however, that as soon as a small linear factor is introduced to the number of connections, DA-MLNs outperform MLNs.

Acknowledgements. The authors would like to thank Martin Josko for providing technical support and Kailin Sun for proof-reading the paper before submission. This work was partially completed while the first author was on a sabbatical enabled by LMUExcellent and funded by the Federal Ministry of Education and Research (BMBF) and the Free State of Bavaria under the Excellence Strategy of the Federal Government and the Länder.
References

1. Craven, M., et al.: Learning to extract symbolic knowledge from the world wide web. In: Mostow, J., Rich, C. (eds.) Proceedings of the Fifteenth National Conference on Artificial Intelligence and Tenth Innovative Applications of Artificial Intelligence Conference, AAAI 98, IAAI 98, 26–30 July 1998, Madison, Wisconsin, USA, pp. 509–516. AAAI Press/The MIT Press (1998)
2. Goodman, L.A.: Snowball sampling. Ann. Math. Stat. 32(1), 148–170 (1961)
3. Kazemi, S.M., Buchman, D., Kersting, K., Natarajan, S., Poole, D.: Relational logistic regression. In: Baral, C., Giacomo, G.D., Eiter, T. (eds.) Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference. KR 2014, Vienna, Austria, 20–24 July 2014. AAAI Press (2014)
4. Kazemi, S.M., Buchman, D., Kersting, K., Natarajan, S., Poole, D.L.: Relational logistic regression: the directed analog of Markov logic networks. In: StarAI@AAAI (2014)
5. Khot, T., Natarajan, S., Kersting, K., Shavlik, J.W.: Gradient-based boosting for statistical relational learning: the Markov logic network and missing data cases. Mach. Learn. 100(1), 75–100 (2015). https://doi.org/10.1007/s10994-015-5481-4
6. Mittal, H., Bhardwaj, A., Gogate, V., Singla, P.: Domain-size aware Markov logic networks. In: Chaudhuri, K., Sugiyama, M. (eds.) The 22nd International Conference on Artificial Intelligence and Statistics. AISTATS 2019, 16–18 April 2019, Naha, Okinawa, Japan. Proceedings of Machine Learning Research, vol. 89, pp. 3216–3224. PMLR (2019)
7. Motl, J., Schulte, O.: The CTU Prague relational learning repository. CoRR abs/1511.03086 (2015). https://arxiv.org/abs/1511.03086
8. Poole, D., Buchman, D., Kazemi, S.M., Kersting, K., Natarajan, S.: Population size extrapolation in relational probabilistic modelling. In: Straccia, U., Calì, A. (eds.) SUM 2014. LNCS (LNAI), vol. 8720, pp. 292–305. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11508-5_25
9. Richardson, M., Domingos, P.M.: Markov logic networks. Mach. Learn. 62(1–2), 107–136 (2006). https://doi.org/10.1007/s10994-006-5833-1
10. Singla, P., Domingos, P.M.: Discriminative training of Markov logic networks. In: Veloso, M.M., Kambhampati, S. (eds.) Proceedings, The Twentieth National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, 9–13 July 2005, Pittsburgh, Pennsylvania, USA, pp. 868–873. AAAI Press/The MIT Press (2005)
11. Weitkämper, F.: Scaling the weight parameters in Markov logic networks and relational logistic regression models. CoRR abs/2103.15140 (2021). https://arxiv.org/abs/2103.15140
A Review of Inductive Logic Programming Applications for Robotic Systems

Youssef Mahmoud Youssef and Martin E. Müller

Hochschule Bonn-Rhein-Sieg, Grantham-Allee 20, 53757 Bonn, Germany
{youssef.youssef, martin.mueller}@h-brs.de

Abstract. This study presents a review of applications of Inductive Logic Programming (ILP) for robotic systems. The aim of the paper is to demonstrate the different methods of applying ILP to a robotic system and to also highlight some of the limitations that already exist. ILP can aid in the development of explainable and trustworthy robotics systems.

Keywords: Inductive Logic Programming · Robotics · Review

1 Introduction
The purpose of this paper is two-fold. First, we would like to present how Inductive Logic Programming (ILP) can be used in the field of robotics. We focus on providing examples and use-cases of ILP within the field of robotics in order to give researchers an idea of how it can be applied and, hopefully, guide them to a solution if they are working on similar problems. Second, we aim to give the ILP community, whose work concentrates mainly on theory, an overview of real-world applications of ILP. This could also explain where the bottlenecks in application may lie and what other researchers have done to try to avoid or eliminate these bottlenecks.

The field of autonomous robots has many applications that require machine learning, such as perception, motion and knowledge adaptation. Traditional machine learning methods like neural networks or support vector machines process large data volumes to produce models. However, these models are often unexplainable, or “Black Box” methods, lacking transparency in the decision-making process. This means that the user has no real insight into the decision-making process. With the growing need for explainability, especially in the field of robotics, strong machine learning methods are favorable. As stated in the Defense Advanced Research Projects Agency’s (DARPA’s) Explainable Artificial Intelligence (XAI) research [16], strong machine learning methods are methods that are able to explain themselves. ILP is an example of a strong machine learning method and it can prove to be useful, as will be demonstrated in the reviews provided in this study.

This paper is structured as follows: we begin with an overview of robotic systems and briefly describe ILP. We then present the review and later discuss the potentials, challenges, and limitations of ILP.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 154–165, 2023.
https://doi.org/10.1007/978-3-031-49299-0_11
2 A Robotic System
Robots are physical systems with sensors, actuators, and a processing unit, which process input, reason, and produce suitable output through actuation or software. These robots vary in autonomy; they range from tele-operated robots, which are robotic systems operated by humans from afar, to fully autonomous robots that can function without human intervention. Examples of robots include satellites, unmanned vehicles, and space rovers. They mitigate risks in dangerous environments and perform tasks in challenging terrains. They also handle dull or repetitive tasks like vacuum cleaning or surveillance. Robots operate within a hierarchical structure. The lower levels handle hardware, interfaces, and software. The next level processes sensory input, generates beliefs, and plans actions. The highest level is responsible for overall goals, task planning, and Artificial Intelligence (AI) applications. See Fig. 1 for an abstract scheme of the robotic system, which is divided into the following categories: perception, knowledge generation & decision-making, and motion control.
Fig. 1. Abstract Robotic System Scheme
Robotic systems can be systematically classified into various categories based on several criteria. As previously stated, one common taxonomy considers their level of autonomy. Another taxonomy distinguishes robots by their physical environment, such as aerial, ground, or underwater robots. Moreover, robots can be
classified by their application domain, including industrial robots used in manufacturing, service robots for tasks like cleaning or healthcare, and research robots for scientific exploration. A robot with a high degree of autonomy will most likely be using AI and machine learning to reason about its sensory inputs and perform suitable actions. Machine learning can be applied in different areas within a robotic system. Seeing as robot design has become increasingly complex, machine learning methods have been used to aid robotic systems in task completion. These tasks include perception and classification, path planning, and sensory integration. A few studies are available that classify the use of machine learning in robotic systems [13,18,19].
3 Inductive Logic Programming
Advancements in computing capabilities have enabled machine learning algorithms to process vast amounts of data, leading to numerous applications, including in the field of robotics. However, this data-driven approach sacrifices explainability, as vital information can be lost and classifications may be based on abstracted representations. In robotics, with its inherent flavor of engineering science, performance and efficiency is valued more than adequate modelling on a knowledge level. Machine learning techniques sacrifice explainability for the ability to cope with complex data-driven scenarios [32]. To address this, we focus on ILP, a classical approach known for its inherent explainability and strong foundation in predicate logic. ILP has recently made significant advancements in tackling complex learning problems. ILP systems generate models of systems based on background knowledge as well as positive and negative examples using logic statements. The use of background knowledge distinctively makes ILP a unique method when learning about systems since it reasons about facts and statements the same way humans do. Having the ability to invent predicates when needed also gives ILP an advantage over current trending methods of machine learning. One of the fundamental theories behind ILP is generalization. A more formal representation of simple generalization can be shown as follows; given a substitution: θ = {v1 /t1 , ..., vn /tn } and formula F . F θ is formed by replacing every variable vi in F by ti . Then if we have two atoms, namely A and B, we can state that atom A subsumes atom B (A B) iff there exists a substitution θ such that Aθ = B. Similar to atoms, clauses can also be generalized. Clause C subsumes clause D, (C D), iff there exists a substitution θ such that Cθ ⊆ D. Generalization can also be defined through entailment such that C is more general than D iff C |= D. However, this is not true for the case where C and D are recursive clauses. 
Relative entailment is defined as follows: C is more general than D with respect to B iff B, C |= D. The general setting of ILP has three elements: background knowledge B, a hypothesis H, and examples E. Examples are split into positive examples E⁺ and negative examples E⁻. The problem is then stated as follows: given B and E, find H such that B, H |= E. The hypothesis and background knowledge should
A Review of Inductive Logic Programming Applications for Robotic Systems
imply all given positive examples E⁺ (completeness) and should not cover any of the given negative examples E⁻ (consistency). For the interested reader, we recommend the following literature on ILP: the theory of inverse resolution is presented in [27], the theory of saturation in [34], and predicate invention, one of the important pillars of ILP, in [25,27,29,39].
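The completeness and consistency conditions can be checked mechanically for a toy problem. The following sketch is only an illustration under stated assumptions: the background knowledge is a set of ground facts, a hypothesis is a single non-recursive definite clause given as a (head, body) pair, and variables are strings starting with an uppercase letter. The family relations and all function names are hypothetical and do not come from the reviewed studies.

```python
def is_var(term):
    """Assumption: variables start with an uppercase letter."""
    return term[:1].isupper()


def match(pattern, fact, theta):
    """Extend theta so that pattern-theta equals the ground fact, or return None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if theta.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return theta


def body_holds(body, background, theta):
    """Backtracking search: can every body literal be matched to a background fact?"""
    if not body:
        return True
    for fact in background:
        extended = match(body[0], fact, theta)
        if extended is not None and body_holds(body[1:], background, extended):
            return True
    return False


def covers(clause, background, example):
    """True if B together with the clause entails the ground example."""
    head, body = clause
    theta = match(head, example, {})
    return theta is not None and body_holds(body, background, theta)


def is_complete(clause, background, positives):
    return all(covers(clause, background, e) for e in positives)


def is_consistent(clause, background, negatives):
    return not any(covers(clause, background, e) for e in negatives)


B = [("parent", "ann", "bob"), ("parent", "bob", "cat"), ("parent", "bob", "dan")]
E_POS = [("grandparent", "ann", "cat"), ("grandparent", "ann", "dan")]
E_NEG = [("grandparent", "bob", "cat"), ("grandparent", "ann", "bob")]

# grandparent(X, Z) :- parent(X, Y), parent(Y, Z).
H_GOOD = (("grandparent", "X", "Z"),
          [("parent", "X", "Y"), ("parent", "Y", "Z")])
# grandparent(X, Z) :- parent(X, Y).   (too general)
H_TOO_GENERAL = (("grandparent", "X", "Z"), [("parent", "X", "Y")])
# grandparent(ann, cat).               (too specific)
H_TOO_SPECIFIC = (("grandparent", "ann", "cat"), [])
```

In this toy setting H_GOOD is both complete and consistent, H_TOO_GENERAL is complete but covers a negative example, and H_TOO_SPECIFIC is consistent but misses a positive example.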
4 Review
In this section we review the published research on applying ILP to robotics. After an extensive search, we selected a collection of studies that directly address a problem in robotics by applying ILP. The selected studies fall under two main categories of a robotic system: perception, and knowledge generation and decision-making, as shown in Fig. 1.
4.1 Perception
We begin with an overview of the application of ILP to robotic systems in the category of perception.
Enhancing Perception and Learning Physical Notions. In [1], a study is presented on the use of ILP to enable an autonomous robot to learn different meanings of physical notions associated with its actions. The study proposes two methods, the sifting method and the step transition method, to manage the iterative learning process of the robot. The sifting method creates negative examples for the ILP learner. It uses out-of-domain values of the parameters involved in the learning process to create a set of possible negative examples. An out-of-domain value is a value that is not a valid input for a particular learning step. For instance, if the learning step involves positive examples with parameters that are numbers between 1 and 10, then any number outside this range is an out-of-domain value. From these candidate examples, the robot autonomously selects those that allow it to efficiently learn better hypotheses. Both the sifting method and the step transition method are developed using the ILP learner Aleph. The proposed methods are tested experimentally on learning primitive concepts in a simple world using a mobile autonomous robot. The results demonstrate hierarchical learning of meaningful definitions for different physical notions. For instance, the robot learns the “movability of an object” by executing the ‘pushcmd’ command and observing its effects on a cube. Positive examples are generated from sensor measurements and negative examples by the sifting method, allowing Aleph to induce a hypothesis linking ‘pushcmd’ parameters to object movability. The induced hypothesis enables the robot to predict action effects and plan future actions. The study also explores ‘ego-motion’ (robot’s own
Y. M. Youssef and M. E. Müller
motion) and ‘pushing-an-object’ (the meaning and environmental impact of pushing an object).
The well-known ILP system Hyper [2] is applied to robotics in [22]. The goal is to provide the robot with a learning system that enables it to automatically discover new useful notions through exploration. The authors define an insight as a new piece of knowledge that enables simplification of the robot’s current theory. The experiments were performed using an autonomous robot that observes its environment, which consists of two movable and two non-movable boxes. The robot can perform experiments in this environment to collect data about its actions and the resulting observations. Hyper is used to learn two notions, namely movability and obstacle. The authors extend Hyper with the capability of interpreting negated conditions in hypotheses as negation as failure. They also add a new sound heuristic for efficiency reasons.
Learning from Sensor Data. In [10], models are induced from depth sensor data. Although the authors did not apply their methods to a robotic system directly and only used the sensor data, the study is relevant since depth sensors are commonly used in robotic systems, specifically for robot vision. The goal of the study is to learn border detection from the sensor data. The authors use the ProParGolemNRNot ILP system, an adaptation of the ILP system ProGolem [30]. To learn from the sensor, the authors propose a simulation of the data in order to test the ability of the learner to induce rules from it. The authors state that while the accuracy of the induced rules is high, limitations remain, since the simulation environment only contained objects of the same shape. They propose testing next in a real environment and with varying shapes. In [41], a novel approach to programming robots using Teleo-Reactive Programs (TRPs) is presented.
The study focuses on the robot’s learning of basic navigation TRPs, which involve actions like moving, turning, and stopping. The two-phase process involves transforming low-level sensor readings into high-level predicates for relational learning and learning relational TRPs for navigation in dynamic and unknown environments. The authors utilize a natural landmark identification process to extract meaningful information and employ the ILP system Aleph to learn and hierarchically organize the TRPs. This enables the robot to perform various tasks, such as navigation, object finding, and message/object delivery.
4.2 Knowledge Generation and Decision-Making
We now provide an overview of the application of ILP to robotic systems in the category of knowledge generation and decision-making.
Experience-Based Planning and Failure Handling. In [17], the authors propose an ILP-based method for robots to enhance performance and avoid failure situations during multi-object manipulation tasks. The approach utilizes
an experience-based probabilistic planning framework that integrates ILP for experimental learning. Failures detected during action execution contribute to the derivation of first-order logic hypotheses, which guide the planning process. The Progol ILP system is employed to derive probabilistic hypotheses, while a Partially Observable Markov Decision Process (POMDP) model is used to handle uncertainties and create an adaptive policy. The system includes an action execution monitoring unit that detects failures, encodes contextual information, and provides background knowledge to the ILP learner. In [36], the authors investigate how robots can learn from failures to ensure task execution robustness. They focus on a service robot with dual arms serving meals to humans, performing actions like object manipulation. The paper proposes a learning guided planning process that integrates continual planning, monitoring, reasoning, and learning. ILP is used as the learning method to frame hypotheses for failure situations. Hypotheses are expressed using first-order logic to capture contextual relations and action outcomes. ILP aids in building, updating, or abandoning hypotheses based on accumulated experience. The approach is evaluated on a Pioneer 3-AT robot, demonstrating sound hypotheses for failure cases and ensuring safety and robustness in future tasks. Relational Learning. In [40], the authors present a probabilistic rule learning approach named NoMPRoL [4] to update environment domain models using feedback from the running system. The approach is demonstrated on a mobile robot navigation task, where the robot must navigate to a goal location while avoiding obstacles. The paper focuses on the use of ILP to learn rules from the feedback data. The NoMPRoL approach is a technique for non-monotonic probabilistic rule learning based on a transformation of an ILP task into an equivalent abductive one. 
It exploits consistent observations by finding general rules that explain observations in terms of the conditions under which they occur. The updated models are then used to generate new behavior with a greater chance of success in the actual environment encountered. In [3], a study presents a relational approach to tool-use learning in robots. The robot learns to utilize objects as tools to solve specific tasks, such as retrieving a small box from a tube using a hooked stick. By observing a single demonstration from a teacher, the robot experiments with various tools in the environment. The approach employs explanation-based learning to identify important sub-goals achieved by the teacher and constructs an action model. An ILP algorithm, utilizing spatial and structural constraints, refines the action model through trial-and-error learning. The system incorporates an incremental learning method based on relative least generalizations and a version-space representation, enabling cautious generalization and example generation. The approach optimizes the search space by sampling near the most-specific hypothesis boundary in the version space. Search Improvement and Interpretable Reinforcement Learning. In [8], the authors investigate methods for improving the search process in ILP.
The authors explore provably correct techniques for speeding up search by automatically reducing the inductive meta-bias of the learner. They note that many approaches have been used to improve search within ILP, including probabilistic search techniques and the use of query packs, and argue that their approach differs in aiming for provable correctness. The study discusses the use of ILP in learning robot strategies, specifically in a 2-dimensional space where a robot can perform six dyadic actions. The authors suggest that the ability to encapsulate background knowledge, by transforming higher-order and first-order datalog statements into first-order definite clauses, indicates that it might be possible to minimize the meta-rules together with a given set of background clauses. Preliminary experiments seem to indicate that this is possible. In [42], a new framework is proposed for interpretable model-based hierarchical reinforcement learning using ILP. The framework consists of a high-level agent, a subtask solver, and a symbolic transition model. The symbolic transition model is learned using ILP, which allows for the creation of logic rules that describe state transitions in a human-readable format. This makes the learned behavior interpretable and easier for users to understand. The proposed framework is evaluated in a robot navigation environment. In this environment, a high-level agent is tasked with navigating a robot to collect a key and unlock a door to reach the goal. The robot’s motion is constrained by ‘visiting circles’, which are initially unknown to the robot. ILP is used to learn the rules describing the preconditions and effects of performing an action and achieving a subgoal. The results demonstrate the effect of the generalization capability of the symbolic transition model in the proposed method.
In particular, regular hierarchical reinforcement learning cannot solve some of the proposed experiments, whereas the ILP-based solution can. In [24], the authors propose a method for learning relational affordance models, which capture the relationships between objects and the actions that can be performed on them. They use a probabilistic logic programming language called ProbLog [9], which is a probabilistic ILP approach, to represent these models and learn from data. The paper also describes experiments with a humanoid robot, where the learned models are used to recognize actions and plan object manipulations. In their work, the authors use ProbLog to learn relations. For example, “Graspable” specifies that an object is graspable if it is within reach of the robot’s gripper and has a certain orientation that allows for a stable grasp, “Placeable” specifies that an object is placeable if it has a certain shape and size that allows it to fit in the designated location, and “Support” specifies that an object is supporting another object if it is in contact with it and is able to bear its weight. This work is expanded on in [23], where the authors incorporate parameterized actions, high-level goals, and a planning algorithm to fully model a tabletop task for a robot. A further extension that explores the use of learning relational affordance models can be found in [43],
where the study proposes a potential framework for spatial relationships between objects.
Learning Numerical Constraints and Context Specific Rules. In [37], the authors apply ILP to the problem of learning symbolic-level numerical constraints for robots. They propose a method for robots to learn these constraints from real-world observations, without relying on predefined numerical relations in the background theory. The proposed method is an extension of the learning system Aleph. The authors perform real-world experiments in a table-top blocks world environment, in which a robot puts blocks close to each other at varying distances. The outcomes of actions are labelled as failure or success. The learning system is then run to find hypotheses generalizing the failure cases. These hypotheses represent high-level rules that can generalize constraint representations related to failures in the table-top blocks world environment. The goal of the research is to develop a method for robots to learn complex high-level rules that can generalize constraint representations related to failure. In [33], a system is proposed that learns context-specific rules from an evolving environment, inspired by human visual learning. ILP is employed to learn rules in a qualitative spatial and temporal representation of the environment. The system aims to enable a robotic system to learn object manipulation through visual observation. ILP is extended with Qualitative Knowledge (QK) to describe the environment. Symbolic data obtained from visual observations is used as input examples for ILP learning. The QK classifier’s outcome is utilized by the ILP system to generate context-specific rules, along with background knowledge. These rules describe object manipulation in the environment and guide the action generator. Manual rule production is eliminated as the QK module generates qualitative spatial and spatio-temporal rules.
The study achieves an accuracy of 70% with context-specific rule generation, but further evaluation with noisy data is required. The Progol ILP system [26] is employed in this work.
Table 1 presents a summary of the reviewed studies, highlighting the robot-domain-specific use case, the problem addressed, and the ILP tool used to solve the problem. For an overview of current trends in ILP and of different ILP systems, we recommend [5,6]. We suggest that the interested reader pay attention to the more recent ILP systems, such as ILASP [21], Metagol [31], Apperception [11], and Popper [7].
4.3 Potentials of ILP in Robotics
ILP has significant potential in various aspects of robotics. However, the application of ILP in robotics is currently limited. This can be attributed to certain factors, which will be discussed later. In this section, we highlight key areas where ILP can contribute to robotics. One area where ILP can be effective is in planning and scheduling, where symbolic learning methods such as STRIPS planners, PDDL-based planners,
Table 1. Summary of reviewed studies, with the specific robot domain use case, the addressed problem, and the ILP system used.
Paper | Use Case | Addressed Problem | ILP System
Perception Belief Generation
[1] | Object manipulation | Learning concepts associated with actions | Aleph [38]
[22] | Object manipulation | Learning concepts associated with objects in the environment | Hyper [2]
[10] | Object detection | Learning concepts for detecting objects and their movement | ProGolem [30]
[41] | Navigation | Learning simple TRPs | Aleph
Knowledge Generation Decision Making
[17] | Object manipulation | Learning concepts from action execution failures | Progol [26]
[36] | Object manipulation | Learning concepts from action execution failures | Progol
[40] | Navigation | Learning concepts about obstacle avoidance | NoMPRoL [4]
[3] | Object manipulation | Learning concepts about tool-use | Golem [28]
[8] | Robot Strategies | Learning robot strategies | Metagol [31]
[42] | Navigation | Learning state transitions | δILP [12]
[37] | Object manipulation | Learning numerical constraints from real-world observations | Aleph
[33] | Object manipulation | Learning context specific rules from environment | Progol
[24] | Object manipulation | Learning to manipulate multiple objects in complex environments | ProbLog [9]
and HTN planners are widely used. Studies in the review section demonstrate the effectiveness of ILP in this domain [17,36]. The concept of a “robot engineer” described in [35] leverages ILP techniques for designing and manufacturing tools and robots using 3D printing. This has potential applications in industries like manufacturing, healthcare, and disaster response scenarios, enabling rapid production and deployment of customized tools and robots. ILP has potential in fault detection and diagnosis of robots, allowing model development by observing behavior and generalizing over observations, reducing reliance on expert knowledge. The interpretability and explainability of ILP contribute to the development of trustworthy and explainable robotic systems. Neural-symbolic learning and reasoning also has the potential to greatly benefit the field of robotics, particularly through the use of inductive logic programming. By combining neural networks with symbolic logic, robots can learn from their environment and make decisions based on logical rules and reasoning [14,15]. An example of an application of neural-symbolic reasoning in robotics is object recognition, where robots can use neural networks and symbolic logic to identify and manipulate different objects. Neural networks can be used to recognize objects in the environment, while symbolic reasoning can be used to reason about the properties of these objects. Logic-based rules and models generated by ILP enable humans to understand and reason about robot behavior, enhancing trust and collaboration. ILP can support collaboration and coordination in multi-robot systems, assisting in the development of distributed decision-making algorithms. Lessons from ILP applications in multi-agent systems can inspire advancements in this area [20]. Addressing ethical and social considerations in robotics is another area where ILP can make a difference. 
By incorporating ethical guidelines and fairness constraints into ILP-based learning and decision-making processes, robots can align their behavior with ethical frameworks.
Overall, ILP offers significant opportunities for enhancing symbolic representation learning, knowledge discovery, adaptive planning, multi-robot collaboration, and addressing ethical considerations in robotics.
4.4 Limitations and Challenges
The reviewed studies highlight several challenges and limitations in the application of ILP. Noise handling is a common challenge [1,10,22], which has been addressed in related studies [6,12,21]. Imperfect background knowledge is another limitation associated with noise [6], which remains an under-explored topic. The explosion of search space size with large data sets [8] poses a challenge, and a reinforcement-based ILP method is proposed as a solution [42], addressing limitations like the maximum number of predicates in clause bodies. Representing phenomena, particularly with complex data like images, requires preprocessing before logical encoding [12].
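The growth of the search space mentioned above can be illustrated with a simple count. The sketch below is not the analysis from [8] or [42]; it assumes, purely for illustration, that a clause body is an unordered selection of at most max_body_len literals from a fixed pool of num_literals candidate literals, and it ignores variable bindings, which would enlarge the space further. The function name is hypothetical.

```python
from math import comb


def num_clause_bodies(num_literals: int, max_body_len: int) -> int:
    """Count the bodies with 0..max_body_len literals drawn from the candidate pool."""
    return sum(comb(num_literals, k) for k in range(max_body_len + 1))


# Doubling the pool or relaxing the body-length bound grows the count quickly.
SMALL = num_clause_bodies(10, 3)   # 1 + 10 + 45 + 120
LARGER = num_clause_bodies(20, 3)  # 1 + 20 + 190 + 1140
LONGER = num_clause_bodies(20, 5)  # adds bodies with 4 and 5 literals
```

Under these assumptions, a bound on the number of predicates in a clause body is what keeps the count manageable, which is why such bounds appear as a limitation in practice.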
5 Conclusion
This study reviewed the application of ILP in robotics, emphasizing its potential as a learning tool and addressing associated challenges and limitations. ILP has been applied in various aspects of robotics, including tool usage, action rules, and object movability. With the growing demand for explainability in robotics, ILP holds promise for developing interpretable and transparent robotic systems.
Acknowledgement. This work is partially supported by a grant of the Graduate Institute and the Computer Science Department of the Hochschule Bonn-Rhein-Sieg. The authors thank the reviewers for their valuable input, which helped us improve this submission.
References
1. Akhtar, N., Füller, M., Kahl, B., Henne, T.: Towards iterative learning of autonomous robots using ILP. In: 2011 15th International Conference on Advanced Robotics (ICAR), pp. 409–414 (2011). https://doi.org/10.1109/ICAR.2011.6088625
2. Bratko, I.: Prolog Programming for Artificial Intelligence. Pearson Education, London (2001)
3. Brown, S., Sammut, C.: A relational approach to tool-use learning in robots. In: Riguzzi, F., Železný, F. (eds.) ILP 2012. LNCS (LNAI), vol. 7842, pp. 1–15. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38812-5_1
4. Corapi, D., Sykes, D., Inoue, K., Russo, A.: Probabilistic rule learning in nonmonotonic domains. In: Leite, J., Torroni, P., Ågotnes, T., Boella, G., van der Torre, L. (eds.) CLIMA 2011. LNCS (LNAI), vol. 6814, pp. 243–258. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22359-4_17
5. Cropper, A., Dumančić, S.: Inductive logic programming at 30: a new introduction. arXiv preprint: arXiv:2008.07912 (2020)
6. Cropper, A., Dumančić, S.: Inductive logic programming at 30: a new introduction. J. Artif. Intell. Res. 74, 765–850 (2022)
7. Cropper, A., Morel, R.: Learning programs by learning from failures (2020). https://doi.org/10.48550/ARXIV.2005.02259
8. Cropper, A., Muggleton, S.H.: Logical minimisation of meta-rules within meta-interpretive learning. In: Davis, J., Ramon, J. (eds.) ILP 2014. LNCS (LNAI), vol. 9046, pp. 62–75. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23708-4_5
9. De Raedt, L., Kimmig, A., Toivonen, H.: ProbLog: a probabilistic prolog and its application in link discovery. In: IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 2462–2467. IJCAI-INT JOINT CONF ARTIF INTELL (2007)
10. Drole, M., et al.: Learning from depth sensor data using inductive logic programming. In: 2015 XXV International Conference on Information, Communication and Automation Technologies (ICAT), pp. 1–6. IEEE (2015)
11. Evans, R., et al.: Making sense of raw input. Artif. Intell. 299, 103521 (2021)
12. Evans, R., Grefenstette, E.: Learning explanatory rules from noisy data. J. Artif. Intell. Res. 61, 1–64 (2018)
13. Fabisch, A., Petzoldt, C., Otto, M., Kirchner, F.: A survey of behavior learning applications in robotics-state of the art and perspectives. arXiv preprint: arXiv:1906.01868 (2019)
14. Garcez, A.D., Gori, M., Lamb, L.C., Serafini, L., Spranger, M., Tran, S.N.: Neural-symbolic computing: an effective methodology for principled integration of machine learning and reasoning. arXiv preprint: arXiv:1905.06088 (2019)
15. Garcez, A.D., et al.: Neural-symbolic learning and reasoning: a survey and interpretation. Neuro-Symbol. Artif. Intell.: State Art 342(1), 327 (2022)
16. Gunning, D.: Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web (2017)
17. Kapotoglu, M., Koc, C., Sariel, S.: Robots avoid potential failures through experience-based probabilistic planning. In: 2015 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO), vol. 2, pp. 111–120. IEEE (2015)
18. Károly, A.I., Galambos, P., Kuti, J., Rudas, I.J.: Deep learning in robotics: survey on model structures and training strategies. IEEE Trans. Syst., Man, Cybernet.: Syst. 51(1), 266–279 (2020)
19. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: a survey. Int. J. Robot. Res. 32(11), 1238–1274 (2013)
20. Kowalski, R., Sadri, F.: From logic programming towards multi-agent systems. Ann. Math. Artif. Intell. 25(3), 391–419 (1999)
21. Law, M., Russo, A., Broda, K.: Inductive learning of answer set programs from noisy examples. arXiv preprint: arXiv:1808.08441 (2018)
22. Leban, G., Žabkar, J., Bratko, I.: An experiment in robot discovery with ILP. In: Železný, F., Lavrač, N. (eds.) ILP 2008. LNCS (LNAI), vol. 5194, pp. 77–90. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-85928-4_10
23. Moldovan, B., Moreno, P., Nitti, D., Santos-Victor, J., De Raedt, L.: Relational affordances for multiple-object manipulation. Auton. Robot. 42, 19–44 (2018)
24. Moldovan, B., Moreno, P., Van Otterlo, M., Santos-Victor, J., De Raedt, L.: Learning relational affordance models for robots in multi-object manipulation tasks. In: 2012 IEEE International Conference on Robotics and Automation, pp. 4373–4378. IEEE (2012)
25. Muggleton, S.: Inductive logic programming. New Gener. Comput. 8(4), 295–318 (1991)
26. Muggleton, S.: Inverse entailment and Progol. New Gener. Comput. 13(3–4), 245–286 (1995)
27. Muggleton, S., Buntine, W.: Machine invention of first-order predicates by inverting resolution. In: Machine Learning Proceedings 1988, pp. 339–352. Elsevier (1988)
28. Muggleton, S., et al.: Efficient Induction of Logic Programs. Citeseer, San Diego (1990)
29. Muggleton, S., de Raedt, L.: Inductive logic programming: theory and methods. J. Logic Programm. 19–20, 629–679 (1994). https://doi.org/10.1016/0743-1066(94)90035-3
30. Muggleton, S., Santos, J., Tamaddoni-Nezhad, A.: ProGolem: a system based on relative minimal generalisation. In: De Raedt, L. (ed.) ILP 2009. LNCS (LNAI), vol. 5989, pp. 131–148. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13840-9_13
31. Muggleton, S.H., Lin, D., Tamaddoni-Nezhad, A.: Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. Mach. Learn. 100(1), 49–73 (2015)
32. Müller, M.E.: ALSACE memo. On the needs for specification and verification of collaborative and concurrent robots, agents and processes, p. 74
33. Ranasinghe, D., Karunananda, A.: Qualitative knowledge driven approach to inductive logic programming. In: First International Conference on Industrial and Information Systems, pp. 79–83. IEEE (2006)
34. Rouveirol, C., Puget, J.F.: Beyond inversion of resolution. In: Machine Learning Proceedings 1990, pp. 122–130. Elsevier (1990)
35. Sammut, C., Sheh, R., Haber, A., Wicaksono, H.: The robot engineer. In: ILP (late breaking papers), pp. 101–106 (2015)
36. Sariel, S., Yildiz, P., Karapinar, S., Altan, D., Kapotoglu, M.: Robust task execution through experience-based guidance for cognitive robots. In: 2015 International Conference on Advanced Robotics (ICAR), pp. 663–668. IEEE (2015)
37. Solak, G., Ak, A.C., Sariel, S.: Experience-based learning of symbolic numerical constraints. In: 2016 IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pp. 1264–1269. IEEE (2016)
38. Srinivasan, A.: The aleph manual (2001)
39. Stahl, I.: Predicate invention in ILP – an overview. In: Brazdil, P.B. (ed.) ECML 1993. LNCS, vol. 667, pp. 311–322. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-56602-3_144
40. Sykes, D., Corapi, D., Magee, J., Kramer, J., Russo, A., Inoue, K.: Learning revised models for planning in adaptive systems. In: 2013 35th International Conference on Software Engineering (ICSE), pp. 63–71. IEEE (2013)
41. Vargas, B., Morales, E.F.: Learning navigation teleo-reactive programs using behavioural cloning. In: 2009 IEEE International Conference on Mechatronics, pp. 1–6. IEEE (2009)
42. Xu, D., Fekri, F.: Interpretable model-based hierarchical reinforcement learning using inductive logic programming. arXiv preprint: arXiv:2106.11417 (2021)
43. Zuidberg Dos Martires, P., Kumar, N., Persson, A., Loutfi, A., De Raedt, L.: Symbolic learning and reasoning with noisy data for probabilistic anchoring. Front. Robot. AI 7, 100 (2020)
Meta-interpretive Learning from Fractal Images
Daniel Cyrus(✉), James Trewern, and Alireza Tamaddoni-Nezhad
Department of Computer Science, University of Surrey, Guildford, UK
{d.cyrus,jt00988,a.tamaddoni-nezhad}@surrey.ac.uk
Abstract. Fractals are geometric patterns with identical characteristics in each of their component parts. They are used to depict features which have recurring patterns at ever-smaller scales. This study offers a technique for learning from fractal images using Meta-Interpretive Learning (MIL). MIL has previously been employed for few-shot learning from geometrical shapes (e.g. regular polygons) and has exhibited significantly higher accuracy than Convolutional Neural Networks (CNN). Our objective is to illustrate the application of MIL in learning from fractal images. We first generate a dataset of images of simple fractal and non-fractal geometries, and then we implement a technique to learn recursive rules which describe fractal geometries. Our approach uses graphs extracted from images as background knowledge. Finally, we evaluate our approach against CNN-based approaches, such as Siamese Net, VGG19, ResNet50 and DenseNet169.
Keywords: Meta Interpretive Learning · Learning recursive rules · Few-shot learning · Computer Vision · Fractals
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. E. Bellodi et al. (Eds.): ILP 2023, LNAI 14363, pp. 166–174, 2023. https://doi.org/10.1007/978-3-031-49299-0_12
1 Introduction
The concept of fractals originated from a paper by Benoit Mandelbrot [16], in which he discussed the paradox of measuring coastlines. The length of the line segments used to represent a coastline greatly affects the measured length: the smaller the line segment, the greater the measured length, and vice versa. This idea of the infinite complexity of the boundary (the change in detail over the change in scale) introduced the concept of a fractal dimension, from which the name fractal was derived. One of the earliest examples of a pure mathematical fractal is the Mandelbrot set [17], calculated by taking a value on the complex plane and recursively applying the function:
$$Z_{new} = Z_{old}^{2} + C \tag{1}$$
In Eq. 1, $Z$ and $C$ are complex numbers, and the iteration starts from an initial value of zero. Repeated application of the formula produces a sequence of complex numbers. By converting a grid point into a complex value $C$, iterating from $Z_{old}$ while counting iterations, and checking whether $Z_{new}$ stays within a threshold range, typically −2 to 2, we can determine whether the point lies inside the fractal or outside it.
In nature, fractals are generally found in branching or spiral patterns (see Fig. 1). Examples of branching patterns include river beds, lightning, tree growth, and blood vessels. Examples of spiral patterns can be found in both the animal and plant kingdoms, as well as in fluid dynamics and galaxies.
Fig. 1. The presence of fractals in nature and their inherent recursive nature. The recursive iteration persists as each component divides into smaller components until the desired level of intricacy is reached.
As many patterns in nature can be described as fractals, any generalisable computer vision approach which can capture fractal geometries would be useful for describing these structures. The primary technique for detecting fractals is the fractal dimension (FD), which can account for irregular shapes within its range. Evaluating the fractal dimension is an area of ongoing research [8], though FD may not be suitable for all types of fractals. To fully capture the essence of a fractal, an illustrative subject is needed for its definition. Our primary focus in this paper lies in examining Sierpinski fractals through the extraction of graphs from images. Sierpinski fractals possess a well-defined structure characterised by self-similarity, and their iteration can be assessed through computational analysis. Examples of Sierpinski fractals are shown in Fig. 2. We show how to create logical facts about images which can then be utilised by meta-interpretive learning to generate rules describing the structure and internal relations of these images. This approach allows us to learn recursive rules describing fractals using only information extracted from pixel data.
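The membership test based on Eq. 1 can be sketched as a short escape-time loop. This is a generic illustration and not the authors' code. It assumes the common convention of testing the modulus |Z| against a bound of 2, which corresponds to the threshold range of −2 to 2 mentioned above, and it caps the number of iterations at an arbitrary max_iter. The function names are hypothetical.

```python
def escape_count(c: complex, max_iter: int = 50, bound: float = 2.0) -> int:
    """Iterate Z_new = Z_old**2 + C from Z = 0 and count the steps until |Z| exceeds bound.

    Returns max_iter if the sequence stays within the bound for all iterations.
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > bound:
            return n
        z = z * z + c
    return max_iter


def in_fractal(c: complex, max_iter: int = 50) -> bool:
    """Treat a point as inside the set if it never escapes within max_iter steps."""
    return escape_count(c, max_iter) == max_iter


def render(width: int = 60, height: int = 20) -> str:
    """Map each grid point to a complex value C and draw the set with ASCII characters."""
    rows = []
    for j in range(height):
        im = 1.2 - 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            re = -2.0 + 3.0 * i / (width - 1)
            row += "#" if in_fractal(complex(re, im)) else "."
        rows.append(row)
    return "\n".join(rows)
```

Calling print(render()) draws a coarse picture of the set. The iteration count returned by escape_count is what is usually mapped to colour in rendered images, and a larger max_iter gives a sharper boundary at a higher cost.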
2 Related Work

Fractals and Deep Learning. In the field of visual object recognition, recent studies have used Deep Learning to pre-train models on randomly generated fractal images. The authors of [22] introduced a fractal learning approach which uses a gradient descent algorithm to acquire the parameters that govern a fractal image. Larsson et al. [14] introduced a strategy for CNNs to classify fractals. While previous work based on Deep Learning predominantly relies on 'black-box' classifiers, our approach can generate rules describing fractals and their intricate patterns.

Fractal and Inversion. Determining the parameters associated with a given fractal image has long been an unresolved challenge. Vrscay and Roehrig [23] provided an inverse method to find an iterated function system (IFS) for fractal construction. In an earlier study [3], researchers explored the solution to the inverse problem of fractal trees, aiming to describe complex patterns using the principle of contraction mapping.
Fig. 2. (a) Quantum density of electrons on the surface of copper can be characterised as a Sierpinski fractal [5, 12] (b) Examples of Sierpinski fractals from our dataset
Other Methods. Louis and Guinea [15] proposed a technique to gauge fractal patterns using self-similarity together with the fractal dimension. Bölviken et al. [4] analysed the fractal properties of geochemical and geophysical data. A neuro-symbolic approach [1] leverages logic programming to represent iterated function systems (IFS). In that work, recurrent radial basis function networks are employed to approximate a fractal's graph. The authors introduced a technique for generating a logic program from an IFS, although it does not involve direct learning from images.
3 Methodology

Our methodology for learning from fractal images involves feature extraction from the images and the creation of a background knowledge file. Our learning algorithm then detects shapes along with their distinct characteristics, and our MIL technique produces a recursive rule. Employing MIL offers the benefit of producing a recursive rule with minimal complexity, so that the resulting rule succinctly describes the structure of fractals. We demonstrate our approach on the task of learning rules for Sierpinski triangle fractals.

3.1 Feature Extraction
Edge Discovery. The method for edge discovery was adapted from the logical vision approach [7]. Edges were discovered using a stochastic algorithm. First, a collection of random edge points in the image was found. Edge points were discovered by applying a Sobel filter to the image: with an adjustable constant acting as a threshold value, each pixel above this value was classed as an edge point. Edges were then conjectured between two randomly selected edge points from this set. If an edge was found, it was extended and added to a set of edges, and any consumed edge points were removed from the first set. If a conjectured edge failed, a line intersecting the line segment between the two selected points was created, and all edge points along this line were added to the set of edge points. This process was iterated until no edge points remained that were not consumed by a discovered edge.
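The edge-point step above (a Sobel filter followed by a threshold) can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the image is a list of rows of grey values, border pixels are skipped, and the name `sobel_edge_points` and the use of the gradient magnitude as the thresholded quantity are choices made here, not details taken from the paper. The stochastic conjecturing of edges between points is not covered.

```python
def sobel_edge_points(img, threshold):
    """Return the set of (x, y) pixels whose Sobel gradient magnitude exceeds threshold.

    `img` is a list of rows of grey values; border pixels are ignored.
    """
    kernel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient
    kernel_y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient
    height, width = len(img), len(img[0])
    points = set()
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    value = img[y + dy - 1][x + dx - 1]
                    gx += kernel_x[dy][dx] * value
                    gy += kernel_y[dy][dx] * value
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                points.add((x, y))
    return points


if __name__ == "__main__":
    # A 5x5 image with a vertical step between columns 1 and 2.
    image = [[0, 0, 255, 255, 255] for _ in range(5)]
    print(sorted(sobel_edge_points(image, threshold=100)))
```

Raising the threshold keeps only the strongest intensity changes, which reduces the number of candidate points the stochastic edge search has to consume.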
Network Creation. In this section we describe how the graph is constructed from the discovered edges. First, endpoints of edges that lay close to one another were grouped together to form nodes. Next, any edge that existed between these groups of endpoints was added to the network. As there can be only one edge between two nodes and the graph is bidirectional, this step needs no additional checks to avoid duplicate edges. Each node in the resulting graph was given a position property, calculated as the average position of its constituent endpoints.

Once this network was created, the next step was to split edges where they passed through nodes, by checking each edge for nodes lying on the line segment it describes. If more than one node lay on an edge, the nodes were ordered by their distance from one end node of the edge. The original edge was then deleted and two or more new edges were created which, joined together, formed the original edge.

Finally, redundant nodes were removed. A node was considered redundant if it had only two edges and the angle between those edges was below a given threshold value. Once found, such nodes were removed and the edges which passed through them were merged into one edge.

3.2 Dataset and Background Knowledge

The dataset (footnote 1) comprises images with a resolution of 1000 × 1000 pixels, categorised as either positive or negative examples of Sierpinski fractals. Each constituent shape of a fractal was given a random fill colour, and the rotation of the image was also selected randomly, along with the depth of the recursion, giving us fractals of different orders. These random factors allowed many different versions of the same fractal to be generated. Examples of Sierpinski fractals from our dataset are shown in Fig. 2(b).
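The random factors described for the dataset (rotation, recursion depth, and fill colour) could be sampled along the lines of the sketch below. It is not the generator used to build the dataset; the function names, the maximum depth, the circumscribed radius of 0.45 × image size, and the independent RGB colours are all assumptions made for illustration.

```python
import math
import random


def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def sierpinski_triangles(a, b, c, depth):
    """Return the filled triangles of a Sierpinski triangle with corners a, b, c."""
    if depth == 0:
        return [(a, b, c)]
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    # Recurse into the three corner triangles; the middle triangle stays empty.
    return (sierpinski_triangles(a, ab, ca, depth - 1)
            + sierpinski_triangles(ab, b, bc, depth - 1)
            + sierpinski_triangles(ca, bc, c, depth - 1))


def random_example(size=1000, max_depth=4, rng=None):
    """Sample a depth, a rotation and one fill colour per constituent triangle."""
    rng = rng or random.Random()
    depth = rng.randint(1, max_depth)
    angle = rng.uniform(0, 2 * math.pi)
    cx = cy = size / 2
    radius = 0.45 * size  # keeps every corner inside the image
    corners = [(cx + radius * math.cos(angle + k * 2 * math.pi / 3),
                cy + radius * math.sin(angle + k * 2 * math.pi / 3))
               for k in range(3)]
    triangles = sierpinski_triangles(*corners, depth)
    colours = [tuple(rng.randrange(256) for _ in range(3)) for _ in triangles]
    return depth, triangles, colours


if __name__ == "__main__":
    depth, triangles, colours = random_example(rng=random.Random(0))
    print(depth, len(triangles))
```

A fractal of depth d produced this way contains 3^d filled triangles, which is why the recursion depth directly controls the order of the fractal.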
During the feature extraction phase, we create background knowledge consisting of logical facts. Each image contains multiple nodes, and each node is characterised by its X and Y coordinates.

Listing 1: Sample background knowledge for a Sierpinski triangle

    img_nodes(img_45, [n45_0, n45_1, n45_2]).
    node_pos(n45_0, [916, 494]).
    node_pos(n45_1, [367, 103]).
    node_pos(n45_2, [288, 761]).
    edge(n45_0, n45_1). edge(n45_0, n45_2).
    edge(n45_1, n45_0). edge(n45_1, n45_2).
    edge(n45_2, n45_0). edge(n45_2, n45_1).
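Facts in the format of Listing 1 can be written out mechanically once the graph is available. The sketch below assumes the graph is held as a dictionary of node positions plus a list of undirected edges, and that the predicate names are `img_nodes`, `node_pos` and `edge` (underscores are assumed, since they do not survive in the printed listing); the function name `background_facts` is hypothetical.

```python
def background_facts(img_id, positions, edges):
    """Return Prolog facts for one image.

    positions: dict mapping node name -> (x, y)
    edges: iterable of (u, v) pairs, treated as undirected
    """
    nodes = sorted(positions)
    lines = [f"img_nodes({img_id}, [{', '.join(nodes)}])."]
    for node in nodes:
        x, y = positions[node]
        lines.append(f"node_pos({node}, [{x}, {y}]).")
    # Each undirected edge is stored in both directions, as in Listing 1.
    directed = set()
    for u, v in edges:
        directed.add((u, v))
        directed.add((v, u))
    for u, v in sorted(directed):
        lines.append(f"edge({u}, {v}).")
    return lines


if __name__ == "__main__":
    positions = {"n45_0": (916, 494), "n45_1": (367, 103), "n45_2": (288, 761)}
    edges = [("n45_0", "n45_1"), ("n45_0", "n45_2"), ("n45_1", "n45_2")]
    print("\n".join(background_facts("img_45", positions, edges)))
```

Storing both directions of each edge lets a learner follow an edge from either endpoint without needing an extra symmetry rule.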
Shape Analysis. To analyse the nodes extracted from the graph and measure their distance to positive samples such as triangles, rectangles, and trees, we use the Fréchet distance [9]. Our approach uses a convex hull [2] to identify the bounding region, followed by the Douglas-Peucker algorithm [20] to simplify the nodes and locate corners. Finally, we apply the Fréchet distance to detect shapes (see Fig. 3 for further details). As shown in Listing 2, shape A is detected as a triangle using the predicate shape, which calls the predicate shapeAnalysis to match A with a known shape based on the Fréchet distance.

1 The dataset and the source code used in this paper are available from https://github.com/hmlrlab/MIL-Fractals.

Fig. 3. (a) Feature extraction from a Sierpinski fractal: (b) nodes are first extracted from each edge, (c) a convex hull algorithm is employed to detect the bounding area, (d) the Douglas-Peucker algorithm removes extra nodes and detects corners, (e) the triangles connected to each node are scaled up to cover all inner nodes.

Listing 2: Shape detection predicates.

    shape(A, Shape) :- convexHull(A, Nodes), shapeAnalysis(Nodes, Shape).
    innerShapes(A, B) :- convex(A, Hull), ramerDouglasPeucker(Hull, Corners), innerShapes(A, Corners, B).
    singleShape(A) :- len(A, Len), Len =< 4.
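Two of the geometric steps named in the shape analysis, Douglas-Peucker simplification and the discrete Fréchet distance, can be sketched as follows. This is a generic textbook formulation rather than the code behind the predicates of Listing 2; the function names and the tolerance parameter `epsilon` are hypothetical, and the convex hull step is omitted.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline, keeping points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    dists = [_point_segment_distance(p, points[0], points[-1]) for p in points[1:-1]]
    i = max(range(len(dists)), key=dists.__getitem__) + 1  # index of farthest point
    if dists[i - 1] <= epsilon:
        return [points[0], points[-1]]
    left = douglas_peucker(points[:i + 1], epsilon)
    right = douglas_peucker(points[i:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point


def discrete_frechet(p, q):
    """Discrete Fréchet distance between polylines p and q (dynamic programming)."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


if __name__ == "__main__":
    outline = [(0, 0), (1, 0.01), (2, 0), (3, 0)]
    print(douglas_peucker(outline, 0.1))
    print(discrete_frechet([(0, 0), (1, 0), (2, 0)], [(0, 1), (1, 1), (2, 1)]))
```

To use the distance for shape matching, the simplified outline and each template shape would first need to be brought to a common scale and orientation; that normalisation is not shown here.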
3.3 Meta-interpretive Learning (MIL)

Meta-Interpretive Learning (MIL) is a form of Inductive Logic Programming (ILP) [18, 19]. MIL uses a Prolog meta-interpreter to formulate hypotheses from background knowledge and training examples. This subsection describes our approach using a specific MIL implementation called Metagol.

Metagol. Metagol [6] is an implementation of MIL which uses background knowledge along with meta-rules to learn logic programs from positive and negative examples.

Metarules. When describing meta-rules for use by Metagol there are three required arguments: Subs, Head, and Body. Optionally, the first argument can be a name for the meta-rule:

    metarule(const_cond, [P,Q,B], [P,A], [[Q,A,B]]).

Symbols included in the Subs list are substituted with a constant value such as an atom or predicate name; any symbols not included are treated as variables of the rule built using the meta-rule. The head and body describe the two parts of the clause created using the meta-rule, where the head is implied by the body (Table 1).
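The role of the Subs list can be made concrete with a small sketch that turns a meta-rule plus a binding for its Subs symbols into a clause. This only imitates the substitution step on strings; it is not how Metagol is invoked, and the function `instantiate` as well as the example binding (`small`, `len_at_most`, `4`) are hypothetical.

```python
def instantiate(subs, head, body, binding):
    """Build a clause string from a meta-rule and a binding for its Subs symbols.

    subs:    symbols to be replaced by constants, e.g. ["P", "Q", "B"]
    head:    atom as a list [predicate, arg, ...], e.g. ["P", "A"]
    body:    list of atoms in the same form, e.g. [["Q", "A", "B"]]
    binding: dict giving a constant for every symbol in subs
    """
    missing = [s for s in subs if s not in binding]
    if missing:
        raise ValueError(f"no binding for: {missing}")

    def term(symbol):
        # Symbols outside Subs remain variables of the resulting clause.
        return str(binding[symbol]) if symbol in subs else symbol

    def atom(parts):
        predicate, *args = parts
        return f"{term(predicate)}({', '.join(term(a) for a in args)})"

    return f"{atom(head)} :- {', '.join(atom(b) for b in body)}."


if __name__ == "__main__":
    # Mirrors metarule(const_cond, [P,Q,B], [P,A], [[Q,A,B]]).
    clause = instantiate(["P", "Q", "B"], ["P", "A"], [["Q", "A", "B"]],
                         {"P": "small", "Q": "len_at_most", "B": 4})
    print(clause)
```

In this example P, Q and B are replaced by constants while A, which is not in the Subs list, remains a variable of the generated clause.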
Table 1. Table of Metarules used

Name      Substitutions  Meta-Rule
property  P,Q,R          P(A) ← Q(A, B), R(B)
name      P,Q,R,B        P(A, B) ← Q(A, B), R(A)
name      P,Q,R,F        P(A) ← Q(A, B), F(B, R)
name      P,Q,R,B        P(A) ← Q(A, B), R(A)
Identity  P,Q            P(A) ← Q(A)
Integrated Background Knowledge (IBK). The commonly used standard dyadic metarules are often sufficient for learning simpler programs; when attempting to learn more complex programs, higher-order logic can sometimes be used. Integrated Background Knowledge (IBK) allows Metagol to learn higher-order logic programs, using predicates as arguments. This is especially useful when operating on lists, which applies to the task of learning from lists of nodes or lists of cycles. Useful predicates for IBK include the quantification predicates any and all. The predicate any implements existential quantification: it holds if the given predicate holds for at least one element of a list. It can be extended with a second argument, so that a relational predicate can be passed to this higher-order predicate:

$$\exists a \in \mathit{List}\,(P(a)), \qquad \exists a \in \mathit{List}\,(P(a, b)) \tag{2}$$

The predicate all implements universal quantification, holding only if the given predicate holds for all elements of a list. As with any, a second argument can be used to allow for relational predicates:

$$\forall a \in \mathit{List}\,(P(a)), \qquad \forall a \in \mathit{List}\,(P(a, b)) \tag{3}$$
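The behaviour of the two quantification predicates, including the optional second argument for relational predicates, can be mirrored in a few lines. The names `any_holds` and `all_hold` and the sentinel used to detect a missing second argument are illustrative choices, not part of the IBK definitions.

```python
_MISSING = object()  # sentinel: distinguishes "no second argument" from None


def any_holds(items, pred, b=_MISSING):
    """Existential quantification: pred holds for at least one element."""
    if b is _MISSING:
        return any(pred(a) for a in items)
    return any(pred(a, b) for a in items)


def all_hold(items, pred, b=_MISSING):
    """Universal quantification: pred holds for every element."""
    if b is _MISSING:
        return all(pred(a) for a in items)
    return all(pred(a, b) for a in items)


if __name__ == "__main__":
    sizes = [3, 3, 4]
    print(any_holds(sizes, lambda a: a > 3))           # unary predicate
    print(all_hold(sizes, lambda a, b: a <= b, 4))     # relational predicate
```

Note that in this sketch universal quantification over an empty list is vacuously true, while existential quantification over an empty list is false.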
3.4 Comparing with Neural Networks

To evaluate the performance of our method we compared our results against several neural networks: VGG, DenseNet, ResNet, and Siamese Net. VGG is a deep convolutional network that performed well in the ImageNet 2014 challenge [21]. ResNet extends traditional convolutional networks with residual connections that skip over layers, allowing for deeper networks with less gradient loss [10]. DenseNets improve upon residual networks by introducing densely connected residual blocks, which allow them to achieve high performance with fewer trainable parameters [11]. Deep convolutional neural networks can achieve high levels of accuracy when given large numbers of images to train on. Conversely, Siamese neural networks can be trained for one-shot image recognition [13].
4 Experiments

In this section we examine the following null hypotheses in order to evaluate our proposed MIL-based approach described in the previous section.
Fig. 4. Comparing the average accuracy of MIL and CNN in the task of learning from fractal images.
Null Hypothesis 1. Our MIL-based approach cannot generate rules to describe fractals.

Null Hypothesis 2. Our MIL-based approach cannot outperform neural networks in the task of learning from fractal images.

We start with a one-shot setting comprising one positive and one negative example. We then assess our models while incrementally increasing the number of training examples (i.e., 2, 4, 6, 8, 100, and 950). During each learning episode, we regenerate the training samples 5 times, while the test samples remain unchanged. The test dataset comprises 100 examples, evenly divided between 50 positive and 50 negative examples. Figure 4 compares the predictive accuracy of MIL vs. CNN algorithms. We measured the learning times on a MacBook laptop with 16 GB of RAM and an Apple M1 CPU with 10 cores. The average timings are shown in Table 2. We provide this table as a reference for the point where MIL accuracy approaches 100%, after 6 examples (3 positive and 3 negative), as shown in Fig. 4. The rule shown in Listing 3 was learned by MIL from three positive and three negative examples. This recursive rule is a compact and accurate description of a Sierpinski triangle. Null Hypothesis 1 is therefore rejected.

Listing 3: The final rule obtained by MIL after considering three positive and three negative training examples.

    fractal(A, triangle) :- shape(A, triangle), fractal_1(A).
    fractal_1(A) :- innerShapes(A, B), any(B, fractal_1).
    fractal_1(A) :- singleShape(A).

According to Fig. 4, Null Hypothesis 2 is also rejected, as MIL achieved 100% accuracy after only six training examples, whereas the best-performing CNN algorithm in this study (i.e. Siamese Net) achieved around 70% after around 950 training examples. MIL combines symbolic reasoning with learning, which allows it to extract meaningful patterns from fractal images. By leveraging meta-level reasoning, MIL can adapt its learning strategy to capture the intricate structures inherent in fractal data. Because interpretability and explainability are part of its learning process, MIL not only achieves high accuracy in modelling fractals but also provides insight into the underlying generative mechanism of the fractal.

Table 2. The average learning time for 6 training examples (3 positive and 3 negative)

Method         MIL    ResNet50  DenseNet169  Siamese Net  VGG19
Learning time  58 ms  570 ms    780 ms       1490 ms      1730 ms
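The recursive rule of Listing 3 can be read procedurally, as in the sketch below. It is a Python paraphrase of the three clauses, not an execution of the learned Prolog program, and it relies on a hypothetical encoding: each shape records its detected kind, the number of nodes in its outline, and the list of shapes found inside it.

```python
from dataclasses import dataclass, field


@dataclass
class Shape:
    kind: str                      # detected shape label, e.g. "triangle"
    nodes: int                     # number of nodes in the outline (assumed encoding)
    inner: list = field(default_factory=list)  # shapes found inside this one


def single_shape(a: Shape) -> bool:
    # Mirrors singleShape(A) :- len(A, Len), Len =< 4.
    return a.nodes <= 4


def fractal_1(a: Shape) -> bool:
    # First clause: some inner shape is itself fractal_1.
    # Second clause: the shape is a single, undivided shape.
    return any(fractal_1(b) for b in a.inner) or single_shape(a)


def fractal(a: Shape, kind: str) -> bool:
    # Mirrors fractal(A, triangle) :- shape(A, triangle), fractal_1(A).
    return a.kind == kind and fractal_1(a)


if __name__ == "__main__":
    leaf = Shape("triangle", 3)
    level1 = Shape("triangle", 6, [leaf, leaf, leaf])
    level2 = Shape("triangle", 15, [level1, level1, level1])
    print(fractal(level2, "triangle"))
```

The recursion bottoms out at shapes with at most four nodes, which is what lets the same three clauses describe fractals of any depth.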
5 Conclusions

We introduced a technique for acquiring knowledge about fractal geometries through graph extraction and MIL. By extracting edges and constructing graphs from them, we formed a representation of the image that could serve as background knowledge for MIL. Our findings demonstrate that even with a limited number of examples, our approach can effectively learn recursive logic programs that provide accurate descriptions of the Sierpinski triangle. When compared to various neural network architectures, our method surpassed them in performance while requiring fewer examples for learning.

As future work, we will expand upon the method laid out in this paper to learn rules for a wide range of both artificial and naturally occurring fractals, including branching fractals and the Sierpinski carpet, and natural structures such as river beds or veins in retinal imaging. By leveraging extended background knowledge based on concepts from graph theory and geometry, this representation has the potential to be used in broader applications of logical computer vision beyond classifying fractals.

Acknowledgments. The first and second authors would like to acknowledge the PhD scholarships from EPSRC and the University of Surrey. The third author would like to acknowledge the EPSRC Network Plus grant on Human-Like Computing (HLC) and the EPSRC grant on human-machine learning of ambiguities.
References

1. Bader, S., Hitzler, P.: Logic programs, iterated function systems, and recurrent radial basis function networks. J. Appl. Log. 2(3), 273–300 (2004)
2. Barber, C.B., Dobkin, D.P., Huhdanpaa, H.: The quickhull algorithm for convex hulls. ACM Trans. Math. Softw. (TOMS) 22(4), 469–483 (1996)
3. Barnsley, M.F., Ervin, V., Hardin, D., Lancaster, J.: Solution of an inverse problem for fractals and other sets. Proc. Natl. Acad. Sci. 83(7), 1975–1977 (1986)
4. Bölviken, B., Stokke, P., Feder, J., Jössang, T.: The fractal nature of geochemical landscapes. J. Geochem. Explor. 43(2), 91–109 (1992)
5. Conover, E.: Physicists wrangled electrons into a quantum fractal (2018). https://www.sciencenews.org/article/physicists-wrangled-electrons-quantum-fractal
6. Cropper, A., Muggleton, S.H.: Metagol system (2016). https://github.com/metagol/metagol
7. Dai, W.Z., Muggleton, S.H., Zhou, Z.H.: Logical vision: meta-interpretive learning for simple geometrical concepts. In: ILP (Late Breaking Papers), pp. 1–16 (2015)
8. Dubuc, B., Quiniou, J., Roques-Carmes, C., Tricot, C., Zucker, S.: Evaluating the fractal dimension of profiles. Phys. Rev. A 39(3), 1500 (1989)
9. Eiter, T., Mannila, H.: Computing discrete Fréchet distance (1994)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
11. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
12. Kempkes, S.N., et al.: Design and characterization of electrons in a fractal geometry. Nat. Phys. 15(2), 127–131 (2019)
13. Koch, G., et al.: Siamese neural networks for one-shot image recognition. In: ICML Deep Learning Workshop, vol. 2. Lille (2015)
14. Larsson, G., Maire, M., Shakhnarovich, G.: FractalNet: ultra-deep neural networks without residuals. arXiv preprint: arXiv:1605.07648 (2016)
15. Louis, E., Guinea, F.: The fractal nature of fracture. Europhys. Lett. 3(8), 871 (1987)
16. Mandelbrot, B.: How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science 156(3775), 636–638 (1967). https://doi.org/10.1126/science.156.3775.636
17. Mandelbrot, B.B.: Fractal aspects of the iteration of z → λz(1 − z) for complex λ and z. Ann. New York Acad. Sci. 357(1), 249–259 (1980). https://doi.org/10.1111/j.1749-6632.1980.tb29690.x
18. Muggleton, S.H., Lin, D., Pahlavi, N., Tamaddoni-Nezhad, A.: Meta-interpretive learning: application to grammatical inference. Mach. Learn. 94, 25–49 (2014)
19. Muggleton, S.H., Lin, D., Tamaddoni-Nezhad, A.: Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. Mach. Learn. 100(1), 49–73 (2015)
20. Saalfeld, A.: Topologically consistent line simplification with the Douglas-Peucker algorithm. Cartogr. Geogr. Inf. Sci. 26(1), 7–18 (1999)
21. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2015)
22. Tu, C.H., Chen, H.Y., Carlyn, D., Chao, W.L.: Learning fractals by gradient descent. arXiv preprint: arXiv:2303.12722 (2023)
23. Vrscay, E.R., Roehrig, C.J.: Iterated function systems and the inverse problem of fractal construction using moments. In: Kaltofen, E., Watt, S.M. (eds.) Computers and Mathematics, pp. 250–259. Springer, New York (1989). https://doi.org/10.1007/978-1-4613-9647-5_29